# Almost Perfect: A Discussion on Quasi-Experiment Techniques

Quasi-experiments are experiments that borrow the principles of randomized tests but are not equivalent to them, typically because the treatment is not assigned at random.

Any technique that can be used to estimate causal effects from observational data can also be used to extract the causal effect from a quasi-experiment. The purpose of using these causal inference techniques in quasi-experiments is to reduce the variance and bias of the estimated ATT (or ATE), similar to the effect these techniques have in randomized experiments.

However, one of the biggest problems with using causal inference techniques is that they inevitably rely on assumptions about the causal links between variables. While there are advancements in causal discovery, in practice one never considers all possible configurations between confounding, treatment, and target variables. Instead, we almost always create a Directed Acyclic Graph (DAG) that lays out the causal relationships in a way that satisfies the scientists behind it, their peers, and their clients.



"CUPED is just linear regression using a pre-experimental covariate."[2]
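
To make the quote concrete, here is a minimal sketch of CUPED on synthetic data, using only the Python standard library. The data, the `cuped_adjust` and `diff_in_means` helpers, and the alternating assignment (a stand-in for randomization) are all assumptions made for illustration. The coefficient `theta` is the OLS slope of the outcome on the pre-experiment covariate, which is where the "just linear regression" reading comes from.

```python
import random
from statistics import fmean, pvariance


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes y - theta * (x - mean(x)) and theta.

    theta = cov(x, y) / var(x), i.e. the OLS slope of y regressed on x.
    """
    x_bar, y_bar = fmean(x), fmean(y)
    cov_xy = fmean((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    theta = cov_xy / pvariance(x, x_bar)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)], theta


def diff_in_means(values, treated):
    """Difference between the treated and control group means."""
    treated_vals = [v for v, t in zip(values, treated) if t]
    control_vals = [v for v, t in zip(values, treated) if not t]
    return fmean(treated_vals) - fmean(control_vals)


# Synthetic data (assumption): x is the metric measured before the experiment,
# y is the metric during the experiment and depends strongly on x.
rng = random.Random(0)
n = 2000
true_effect = 0.5
x = [rng.gauss(10, 2) for _ in range(n)]
treated = [i % 2 == 0 for i in range(n)]  # stand-in for random assignment
y = [
    0.8 * xi + rng.gauss(0, 1) + (true_effect if t else 0.0)
    for xi, t in zip(x, treated)
]

y_cuped, theta = cuped_adjust(y, x)
raw_estimate = diff_in_means(y, treated)
cuped_estimate = diff_in_means(y_cuped, treated)

print(f"theta = {theta:.3f}")
print(f"variance of y: {pvariance(y):.3f}, variance of CUPED-adjusted y: {pvariance(y_cuped):.3f}")
print(f"raw estimate: {raw_estimate:.3f}, CUPED estimate: {cuped_estimate:.3f}")
```

Both estimates target the same effect; the adjusted outcome simply has less variance because the part of `y` explained by `x` has been removed.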

Below, we give a quick overview of the methods covered in this benchmark. For a more in-depth understanding of each method, we provide multiple resources where you can learn more about them.

TL;DR:
- the best technique is XXXXXXX
- but it is still worse than using an ensemble of (XXXXXXXXX) by XXXXXXX
- backtest with historical data to assess the accuracy of the model that estimates the ATT
- you can use previous randomized tests to calibrate the hyperparameters (and possibly even the parameters themselves) of your models

# Techniques Overview

## Matching + Difference-in-Differences (CausalPy)

### Propensity Score

### Mahalanobis Distance

## (Augmented) Synthetic Control (CausalPy & GeoLift)

## Meta-Learners (CausalML)
    
## Double ML (EconML)

## Uplift-Trees (CausalML)

## Do Method (DoWhy)

# Comparisons
## Methodology

## Datasets
- [Iowa Liquor Sales](https://www.kaggle.com/datasets/residentmario/iowa-liquor-sales)
- [Walmart Dataset](https://www.kaggle.com/datasets/yasserh/walmart-dataset)
- [Supermarket Sales](https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales)
- [Superstore Sales Dataset](https://www.kaggle.com/datasets/rohitsahoo/sales-forecasting)
- [Lifetime Value](https://www.kaggle.com/datasets/baetulo/lifetime-value)

## Example: Iowa Liquor Sales


# Hacks: Improving your models

## Backtest using historic data
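
One way to read this idea is as a placebo test: take a stretch of history in which no treatment happened, pretend a treatment started at some date, and run the estimator anyway. The true effect is zero, so the spread of the placebo estimates tells you how far off the estimator can be. The sketch below is an illustration under assumptions: the data is synthetic, the estimator is a plain 2x2 difference-in-differences, and the `did_estimate` and `placebo_backtest` helpers are hypothetical names, not part of any library used in this benchmark.

```python
import random
from statistics import fmean


def did_estimate(treated, control, split):
    """2x2 difference-in-differences on two aligned series split at index `split`."""
    pre_gap = fmean(treated[:split]) - fmean(control[:split])
    post_gap = fmean(treated[split:]) - fmean(control[split:])
    return post_gap - pre_gap


def placebo_backtest(treated, control, window, post_len):
    """Slide a window over untreated history; the last `post_len` points of each
    window play the role of a fake post-treatment period."""
    estimates = []
    for start in range(len(treated) - window + 1):
        t = treated[start:start + window]
        c = control[start:start + window]
        estimates.append(did_estimate(t, c, window - post_len))
    return estimates


# Synthetic history (assumption): both units share a trend, differ in level,
# and neither one ever receives a treatment.
rng = random.Random(1)
days = 200
trend = [0.05 * d for d in range(days)]
control = [10 + tr + rng.gauss(0, 1) for tr in trend]
treated = [12 + tr + rng.gauss(0, 1) for tr in trend]

placebo = placebo_backtest(treated, control, window=60, post_len=14)
bias = fmean(placebo)                         # should be close to zero
rmse = fmean(e * e for e in placebo) ** 0.5   # typical size of the estimation error

print(f"{len(placebo)} placebo windows, bias = {bias:.3f}, RMSE = {rmse:.3f}")
```

An effect estimated on the real treatment period is only convincing if it is clearly larger than the placebo RMSE. The windows overlap, so the placebo estimates are correlated and should not be treated as independent draws.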

## Calibrate using previous randomized tests
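
A possible reading of this idea: when past randomized tests exist for the same kind of intervention, their results can serve as a benchmark, and a hyperparameter of the quasi-experimental model can be chosen so that its estimates match that benchmark as closely as possible. The sketch below is purely illustrative. The data is synthetic, and the single `shrinkage` hyperparameter and the `calibration_error` helper are hypothetical stand-ins for whatever hyperparameters your model actually has.

```python
import random
from statistics import fmean


def calibration_error(shrinkage, observational, randomized):
    """Mean squared gap between shrunk observational estimates and randomized benchmarks."""
    return fmean(
        (shrinkage * o - r) ** 2 for o, r in zip(observational, randomized)
    )


# Synthetic benchmark (assumption): ATTs measured in 50 past randomized tests,
# and quasi-experimental estimates of the same effects that overshoot by about 25%.
rng = random.Random(3)
randomized = [rng.gauss(1.0, 0.5) for _ in range(50)]
observational = [1.25 * r + rng.gauss(0, 0.1) for r in randomized]

# Grid search over the hypothetical hyperparameter.
grid = [i / 100 for i in range(50, 151)]
best = min(grid, key=lambda s: calibration_error(s, observational, randomized))

print(f"best shrinkage = {best:.2f}")
print(f"error before calibration: {calibration_error(1.0, observational, randomized):.4f}")
print(f"error after calibration:  {calibration_error(best, observational, randomized):.4f}")
```

In practice the calibration set and the experiments you evaluate on should be kept separate, otherwise the calibrated error is optimistic.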

## Don't limit yourself to just one model
Similar to how the winning entry in a typical machine-learning contest is usually an ensemble of distinct methodologies (e.g. neural networks and tree-based models), we also reduce the error of the ATT estimate when combining multiple models. Below is a comparison between using either XXXXXXX or XXXXX and using both.
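
The intuition can be sketched with a small simulation. It assumes two hypothetical estimators, `model_a` and `model_b`, that are both unbiased and make independent errors; real models will usually have correlated errors and possibly some bias, so the gain shown here is an upper bound on what averaging can deliver under these assumptions, not a benchmark result.

```python
import random
from statistics import fmean


def rmse(estimates, truth):
    """Root mean squared error of a list of estimates against the true value."""
    return fmean((e - truth) ** 2 for e in estimates) ** 0.5


# Synthetic setup (assumption): each simulation is one experiment in which
# both models estimate the same true ATT with independent noise.
rng = random.Random(2)
true_att = 1.0
n_sims = 5000
model_a = [true_att + rng.gauss(0, 0.5) for _ in range(n_sims)]
model_b = [true_att + rng.gauss(0, 0.5) for _ in range(n_sims)]

# Simple ensemble: the unweighted average of the two estimates.
ensemble = [(a + b) / 2 for a, b in zip(model_a, model_b)]

print(f"RMSE model A:  {rmse(model_a, true_att):.3f}")
print(f"RMSE model B:  {rmse(model_b, true_att):.3f}")
print(f"RMSE ensemble: {rmse(ensemble, true_att):.3f}")
```

With independent errors of equal size, averaging two models divides the error variance by two. The more correlated the models' errors are, the smaller that gain becomes, which is why ensembles of distinct methodologies tend to help more than ensembles of similar ones.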

# References
1) [Causal Inference, The Mixtape](https://mixtape.scunning.com)
2) [Causality, Judea Pearl](https://www.amazon.co.uk/Causality-Judea-Pearl/dp/052189560X/ref=sr_1_1?crid=1KVB0KSO1OWMO&keywords=causality+judea&qid=1705423557&sprefix=causality+judea%2Caps%2C78&sr=8-1)
3) [Causal Inference in Statistics, Judea Pearl, Madelyn Glymour, Nicholas P. Jewell](https://www.amazon.co.uk/Causal-Inference-Statistics-Judea-Pearl/dp/1119186846/ref=sr_1_1?crid=1SP7ANTNKW60K&keywords=causal+inference+in+statistics&qid=1705423576&sprefix=causal+inference+in+%2Caps%2C81&sr=8-1)
4) [Variance reduction in experiments using covariate adjustment techniques](https://medium.com/glovo-engineering/variance-reduction-in-experiments-using-covariate-adjustment-techniques-717b1e450185)
5) [How Booking.com increases the power of online experiments with CUPED](https://booking.ai/how-booking-com-increases-the-power-of-online-experiments-with-cuped-995d186fff1d)
6) [CausalML](https://causalml.readthedocs.io/en/latest/index.html)
7) [EconML](https://econml.azurewebsites.net/index.html)
8) [CausalPy](https://causalpy.readthedocs.io/en/latest/)
9) [DoWhy](https://www.pywhy.org/dowhy/v0.11.1/#)

In [1]:
from src.data.load import DataLoader
from pathlib import Path

loader = DataLoader('data')
loader.download_data()


Getting iowa_licor_sales dataset
Dataset iowa_licor_sales already present

Getting wallmart_sales dataset
Dataset wallmart_sales already present

Getting supermarket_sales dataset
Dataset supermarket_sales already present

Getting superstore_sales dataset
Dataset superstore_sales already present

Getting lifetime_value dataset
Dataset lifetime_value already present



In [2]:
loader.load_dataset('supermarket_sales').head()

| Invoice ID | Branch | City | Customer type | Gender | Product line | Unit price | Quantity | Tax 5% | Total | Date | Time | Payment | cogs | gross margin percentage | gross income | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | str | str | f64 | i64 | f64 | f64 | str | str | str | f64 | f64 | f64 | f64 |
| "750-67-8428" | "A" | "Yangon" | "Member" | "Female" | "Health and bea… | 74.69 | 7 | 26.1415 | 548.9715 | "1/5/2019" | "13:08" | "Ewallet" | 522.83 | 4.761905 | 26.1415 | 9.1 |
| "226-31-3081" | "C" | "Naypyitaw" | "Normal" | "Female" | "Electronic acc… | 15.28 | 5 | 3.82 | 80.22 | "3/8/2019" | "10:29" | "Cash" | 76.4 | 4.761905 | 3.82 | 9.6 |
| "631-41-3108" | "A" | "Yangon" | "Normal" | "Male" | "Home and lifes… | 46.33 | 7 | 16.2155 | 340.5255 | "3/3/2019" | "13:23" | "Credit card" | 324.31 | 4.761905 | 16.2155 | 7.4 |
| "123-19-1176" | "A" | "Yangon" | "Member" | "Male" | "Health and bea… | 58.22 | 8 | 23.288 | 489.048 | "1/27/2019" | "20:33" | "Ewallet" | 465.76 | 4.761905 | 23.288 | 8.4 |
| "373-73-7910" | "A" | "Yangon" | "Normal" | "Male" | "Sports and tra… | 86.31 | 7 | 30.2085 | 634.3785 | "2/8/2019" | "10:37" | "Ewallet" | 604.17 | 4.761905 | 30.2085 | 5.3 |


In [85]:
import polars as pl
import numpy as np
from datetime import timedelta

class DataPreProcessing:

    def __init__(
            self,
            data: pl.DataFrame,
            id_col: str,
            date_col: str,
            metric_col: str,
            date_format: str='%Y-%m-%d'  # strftime-style format, as expected by polars
            ) -> None:
        
        # Required parameters
        self.data = data
        self.id_col = id_col
        self.date_col = date_col
        self.metric_col = metric_col

        # Optional parameters
        self.date_format = date_format

        # Constants
        self.date_start = None
        self.date_end = None

    def _cast_date_column(self, data: pl.DataFrame) -> pl.DataFrame:
        """
        Cast the date column from string to datetime using self.date_format
        """
        data = (
            data
            .with_columns(
                pl.col(self.date_col).str.to_datetime(self.date_format)
                )
        )
        return data

    def _get_date_time(self, data: pl.DataFrame) -> None:
        """
        Get start and end date of the dataset
        """
        self.date_start = data[self.date_col].min()
        self.date_end = data[self.date_col].max()

    def _group_data(self, data: pl.DataFrame) -> pl.DataFrame:
        """
        Group data based on the ID and date columns to remove duplicates (or to just regroup on a new granularity)
        """
        # group_by replaces the deprecated groupby in recent polars versions
        return (
            data
            .group_by([self.id_col, self.date_col])
            .agg(pl.col(self.metric_col).sum())
        )
    
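    def _normalize_data(self, data: pl.DataFrame, pre_treatment_share: float) -> pl.DataFrame:
        """
        Standardize the metric per ID using the mean and standard deviation of the pre-treatment period
        """
        # First date of the post-treatment period, given the share of the date range that counts as pre-treatment
        post_treatment_date_start = self.date_start + timedelta(
            days=np.round(pre_treatment_share * (self.date_end - self.date_start).days)
            )
        # Per-ID statistics computed on pre-treatment rows only, so post-treatment values do not leak in
        data_stats = (
            data
            .filter(pl.col(self.date_col) < post_treatment_date_start)
            .group_by([self.id_col])  # group_by replaces the deprecated groupby
            .agg(
                pl.col(self.metric_col).mean().alias('avg'),
                pl.col(self.metric_col).std().alias('std')
            )
        )
        # Standardize the metric and flag the rows that fall in the post-treatment period
        return (
            data
            .join(data_stats, on=[self.id_col], how='inner')
            .with_columns(
                ((pl.col(self.metric_col) - pl.col('avg')) / pl.col('std')).alias(self.metric_col)
            )
            .with_columns((pl.col(self.date_col) >= pl.lit(post_treatment_date_start)).alias('treatment_period'))
            .drop(['avg', 'std'])
        )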


    def _apply_default_names(self, data: pl.DataFrame) -> pl.DataFrame:
        return (
            data
            .rename({
                self.id_col: 'id',
                self.date_col: 'date',
                self.metric_col: 'value'
                })
        )

    def get_preprocessed_data(self) -> pl.DataFrame:
        """
        Apply all the steps to pre-process the data
        1) Cast the date column to datetime
        2) Get the start and end dates
        3) Group data in the desired granularity
        4) Normalize data based on pre-treatment period
        5) Rename the ID, date, and metric columns to their default names

        Observation:
        pre_treatment_share=1 is used here because this method is not meant to be used to analyze the effect of a treatment,
        so there is no problem of information leakage if we normalize again (using a fraction of the dataset) afterwards
        when we apply the treatment effect
        """
        # Prepare the data 
        self.data = self._cast_date_column(self.data)
        self._get_date_time(self.data)

        # # group and normalize data
        self.data = self._group_data(self.data)
        self.data = self._normalize_data(self.data, 1.0)
        self.data = self._apply_default_names(self.data)
        return self.data


In [86]:
# from src.data.experiment_setup import SetupExperiment
#             id_col :str,
#             date_col: str,
#             metric_col: str,
#             date_format: str='yyyy-MM-dd'

In [87]:
# from src.data.preprocessing import DataPreProcessing

pre = DataPreProcessing(
    loader.load_dataset('supermarket_sales').clone(),
    'City', 
    'Date', 
    'gross income',
    date_format="%m/%d/%Y"
    )

z = pre.get_preprocessed_data()
    

In [88]:
z

| id | date | value | treatment_period |
|---|---|---|---|
| str | datetime[μs] | f64 | bool |
| "Mandalay" | 2019-03-04 00:00:00 | -1.162996 | false |
| "Mandalay" | 2019-02-09 00:00:00 | -0.204308 | false |
| "Yangon" | 2019-01-29 00:00:00 | 0.448462 | false |
| "Naypyitaw" | 2019-02-06 00:00:00 | 0.191929 | false |
| "Naypyitaw" | 2019-02-04 00:00:00 | -1.445386 | false |
| "Yangon" | 2019-02-12 00:00:00 | -0.854509 | false |
| "Naypyitaw" | 2019-03-20 00:00:00 | 0.192409 | false |
| "Naypyitaw" | 2019-03-10 00:00:00 | 0.115292 | false |
| "Mandalay" | 2019-02-12 00:00:00 | 0.270134 | false |
| "Mandalay" | 2019-02-11 00:00:00 | 0.687207 | false |
