# When association is causation

In [1]:
import pandas as pd
import numpy as np
from scipy.special import expit
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib import style
import plotly.express as px

style.use("fivethirtyeight")

np.random.seed(123)
n = 100
tuition = np.random.normal(1000, 300, n).round()
tablet = np.random.binomial(1, expit((tuition - tuition.mean()) / tuition.std())).astype(bool)
enem_score = np.random.normal(200 - 50 * tablet + 0.7 * tuition, 200)
enem_score = (enem_score - enem_score.min()) / enem_score.max()
enem_score *= 1000

data = pd.DataFrame(dict(enem_score=enem_score, Tuition=tuition, Tablet=tablet))

In [2]:
fig = px.box(data, x="Tablet", y="enem_score", color="Tablet")
fig.show()

## Notes:

Individual Treatment Effect:
Y<sub>1i</sub> - Y<sub>0i</sub>

- However, we can never observe the same unit both treated and untreated at the same time, so the individual treatment effect cannot be computed directly.
- Instead, we can focus on the average treatment effect.

Average Treatment Effect:

ATE = E[Y<sub>1</sub> - Y<sub>0</sub>]




In [3]:
# 4 schools data collected with 2 treatment conditions
pd.DataFrame(dict(
    i= [1,2,3,4],
    Y0=[500,600,800,700],
    Y1=[450,600,600,750],
    T= [0,0,1,1],
    Y= [500,600,600,750],
    TE=[-50,0,-200,50],
))

| | i | Y0 | Y1 | T | Y | TE |
|---|---|---|---|---|---|---|
| 0 | 1 | 500 | 450 | 0 | 500 | -50 |
| 1 | 2 | 600 | 600 | 0 | 600 | 0 |
| 2 | 3 | 800 | 600 | 1 | 600 | -200 |
| 3 | 4 | 700 | 750 | 1 | 750 | 50 |
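
Because this table contains both potential outcomes for every school, the ATE can be computed directly from it. The following is a minimal sketch under that assumption; the names `schools` and `ate` are illustrative and do not come from the original notebook.

```python
import pandas as pd

# Hypothetical four-school table with BOTH potential outcomes known
# (only possible because the data is made up).
schools = pd.DataFrame(dict(
    i=[1, 2, 3, 4],
    Y0=[500, 600, 800, 700],
    Y1=[450, 600, 600, 750],
    T=[0, 0, 1, 1],
))

# Individual treatment effect for each school: Y1 - Y0.
schools["TE"] = schools["Y1"] - schools["Y0"]

# Average treatment effect: the mean of the individual effects.
ate = schools["TE"].mean()
print(ate)  # -50.0
```

The derived `TE` column matches the one typed by hand in the cell above, and averaging it gives an ATE of -50.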


In [4]:
# Real world scenario of 4 schools data
pd.DataFrame(dict(
    i= [1,2,3,4],
    Y0=[500,600,np.nan,np.nan],
    Y1=[np.nan,np.nan,600,750],
    T= [0,0,1,1],
    Y= [500,600,600,750],
    TE=[np.nan,np.nan,np.nan,np.nan],
))

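| | i | Y0 | Y1 | T | Y | TE |
|---|---|---|---|---|---|---|
| 0 | 1 | 500.0 | NaN | 0 | 500 | NaN |
| 1 | 2 | 600.0 | NaN | 0 | 600 | NaN |
| 2 | 3 | NaN | 600.0 | 1 | 600 | NaN |
| 3 | 4 | NaN | 750.0 | 1 | 750 | NaN |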


## Notes:

It is important not to confuse association (correlation) with causation. In the real-world data above (cell In [4]), where the counterfactuals are missing, the schools with tablets (the treated group) have higher observed scores than the schools without: the group means are 675 and 550. The table with both potential outcomes filled in (cell In [3]) shows that this is misleading: the individual treatment effects there average to -50, so the tablets lower scores on average.

Association is measured by E[Y|T = 1] - E[Y|T = 0]: the expected outcome of the treated group minus the expected outcome of the control group.
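
On the four observed schools, this difference in means can be computed from `T` and `Y` alone. The sketch below assumes the same toy data; `observed`, `group_means`, and `association` are illustrative names.

```python
import pandas as pd

# Only what we can see in the real world: treatment flag and observed outcome.
observed = pd.DataFrame(dict(
    i=[1, 2, 3, 4],
    T=[0, 0, 1, 1],
    Y=[500, 600, 600, 750],
))

# Mean observed outcome per treatment group, indexed by T.
group_means = observed.groupby("T")["Y"].mean()

# E[Y|T = 1] - E[Y|T = 0]
association = group_means.loc[1] - group_means.loc[0]
print(association)  # 125.0
```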

Writing the observed outcomes in terms of potential outcomes, then adding and subtracting the treated group's counterfactual E[Y<sub>0</sub>|T = 1], gives:

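E[Y|T = 1] - E[Y|T = 0] = E[Y<sub>1</sub> - Y<sub>0</sub>|T = 1] + (E[Y<sub>0</sub>|T = 1] - E[Y<sub>0</sub>|T = 0])

The first term, E[Y<sub>1</sub> - Y<sub>0</sub>|T = 1], is the Average Treatment Effect on the Treated (ATT). The term in parentheses is the <i>bias</i>.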
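
The intermediate steps can be written out as follows. This is a sketch of the standard argument: it assumes the observed outcome equals Y<sub>1</sub> for treated units and Y<sub>0</sub> for untreated units, and the only trick is adding and subtracting E[Y<sub>0</sub>|T = 1].

$$
\begin{aligned}
E[Y \mid T=1] - E[Y \mid T=0] &= E[Y_1 \mid T=1] - E[Y_0 \mid T=0] \\
&= E[Y_1 \mid T=1] - E[Y_0 \mid T=1] + E[Y_0 \mid T=1] - E[Y_0 \mid T=0] \\
&= \underbrace{E[Y_1 - Y_0 \mid T=1]}_{\text{ATT}} + \underbrace{E[Y_0 \mid T=1] - E[Y_0 \mid T=0]}_{\text{Bias}}
\end{aligned}
$$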

Association therefore equals the treatment effect on the treated plus an implicit bias term. The bias is how much the treated and control groups would differ in outcome if neither had received the treatment.
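
The decomposition can be checked numerically on the four-school table with full potential outcomes. This is a minimal sketch assuming that toy data; the names `treated`, `control`, `association`, `att`, and `bias` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data with both potential outcomes known.
schools = pd.DataFrame(dict(
    Y0=[500, 600, 800, 700],
    Y1=[450, 600, 600, 750],
    T=[0, 0, 1, 1],
))

# Observed outcome: Y1 for treated schools, Y0 for the rest.
schools["Y"] = np.where(schools["T"] == 1, schools["Y1"], schools["Y0"])

treated = schools[schools["T"] == 1]
control = schools[schools["T"] == 0]

# E[Y|T = 1] - E[Y|T = 0]
association = treated["Y"].mean() - control["Y"].mean()
# E[Y1 - Y0|T = 1]
att = (treated["Y1"] - treated["Y0"]).mean()
# E[Y0|T = 1] - E[Y0|T = 0]
bias = treated["Y0"].mean() - control["Y0"].mean()

print(association, att, bias)  # 125.0 -75.0 200.0
```

Here the treated schools would have scored 200 points higher than the control schools even without tablets, which is enough to turn a negative effect on the treated (-75) into a positive association (125).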

