## Theory
This notebook tests some basic concepts of causal relationships and statistical modelling, as laid out in Pearl & Mackenzie (2018). For example, controlling for a "collider" introduces bias where there was none. Subsequent notebooks will test how well different meta-learners work, and what the practical implications are of accounting for or ignoring causal relationships in predictive models.

### Approach
I will be using simulated datasets, so that the "true" causal relationships are known by construction. For simplicity, all models are linear regressions. To aid understanding, every example tries to recover the effect of the number of years of education on wages, under a different causal structure each time.

### Contents
1) Basic confounding - treatment and outcome share a common cause
2) Collider
3) Mediator
4) Back-door path with collider ("M-bias")

In [106]:
import numpy as np
import pandas as pd 
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

In [107]:
# global parameters

education_mean = 12
education_std = 3

wage_mean = 36000
wage_std = 3000

education_effect_wage = 1000

## 1. Basic confounding

education ----> wage

family_wealth ----> wage

family_wealth ----> education

In [108]:
def basic_confounding_dataset(
        sample_size: int = 1000
        , sample_education_mean:int = education_mean 
        , sample_wage_mean:int = wage_mean 
        , sample_education_effect_wage:int = education_effect_wage
        , sample_family_wealth_effect_wage:int = 5000
        , sample_family_wealth_effect_education:int = 1
        , **kwargs
):
    
    # create randomised data    
    _family_wealth = np.random.standard_normal(sample_size)
    _education = sample_education_mean + (sample_family_wealth_effect_education * _family_wealth) + (np.sqrt(3) * np.random.standard_normal(sample_size))
    _wage = (
        sample_wage_mean + (sample_education_effect_wage * (_education - sample_education_mean)) + (sample_family_wealth_effect_wage * _family_wealth) + (3000 * np.random.standard_normal(sample_size))
    )

    _df = pd.DataFrame({'family_wealth': _family_wealth, 'education': _education, 'wage': _wage})
    
    return _df



In [109]:
# create data
df1 = basic_confounding_dataset()
df1.head()

,family_wealth,education,wage
0,0.51225,11.656544,38514.684041
1,2.211136,16.459828,50874.591444
2,1.302607,11.75285,36676.787565
3,-0.168457,11.236027,29832.704236
4,0.232137,11.596205,41710.700173


In [110]:
# omitted variable bias
X = sm.add_constant(df1['education'])
y = df1['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.401
Model:                            OLS   Adj. R-squared:                  0.400
Method:                 Least Squares   F-statistic:                     667.8
Date:                Sat, 13 Jul 2024   Prob (F-statistic):          3.78e-113
Time:                        00:49:59   Log-Likelihood:                -9993.2
No. Observations:                1000   AIC:                         1.999e+04
Df Residuals:                     998   BIC:                         2.000e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       9476.6172   1041.531      9.099      0.0

In [111]:
# fully specified

X = sm.add_constant(df1[['education', 'family_wealth']])
y = df1['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.798
Model:                            OLS   Adj. R-squared:                  0.798
Method:                 Least Squares   F-statistic:                     1973.
Date:                Sat, 13 Jul 2024   Prob (F-statistic):               0.00
Time:                        00:49:59   Log-Likelihood:                -9448.8
No. Observations:                1000   AIC:                         1.890e+04
Df Residuals:                     997   BIC:                         1.892e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const          2.427e+04    690.638     35.149
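
As a rough check on the two regressions above, the size of the omitted-variable bias can be worked out directly from the simulation parameters: the naive slope should be close to the true effect plus `family_wealth effect on wage * Cov(education, family_wealth) / Var(education)`. The sketch below is an illustration rather than part of the original analysis. It assumes the default parameters of `basic_confounding_dataset`, swaps statsmodels for NumPy least squares, and uses a fixed seed and a larger sample; the helper name `ols` is hypothetical.

```python
import numpy as np

# Parameters copied from the defaults of basic_confounding_dataset
beta_education = 1000      # true effect of education on wage
gamma_wealth = 5000        # effect of family wealth on wage
delta_wealth = 1           # effect of family wealth on education
var_wealth = 1.0           # family_wealth is standard normal
var_education_noise = 3.0  # sqrt(3) * standard normal has variance 3

# Omitted-variable bias: naive slope = beta + gamma * Cov(E, F) / Var(E)
cov_education_wealth = delta_wealth * var_wealth
var_education = delta_wealth**2 * var_wealth + var_education_noise
expected_naive_slope = beta_education + gamma_wealth * cov_education_wealth / var_education
print(expected_naive_slope)  # 2250.0


def ols(y, *columns):
    """Least-squares coefficients [intercept, slope_1, ...] (illustrative helper)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


# Simulate the same structure with a fixed seed (an assumption, for repeatability)
rng = np.random.default_rng(0)
n = 200_000
family_wealth = rng.standard_normal(n)
education = 12 + delta_wealth * family_wealth + np.sqrt(3) * rng.standard_normal(n)
wage = (
    36000
    + beta_education * (education - 12)
    + gamma_wealth * family_wealth
    + 3000 * rng.standard_normal(n)
)

naive_slope = ols(wage, education)[1]                    # family_wealth omitted
adjusted_slope = ols(wage, education, family_wealth)[1]  # family_wealth controlled
print(naive_slope, adjusted_slope)
```

With these parameters the naive slope should sit near 2,250 and the adjusted slope near the true 1,000, which is consistent with the intercepts printed by the two regressions above.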

## 2. Collider

education -----> wage

education -----> liberal_support <----- wage

In [112]:
def collider_dataset(
        sample_size: int = 1000
        , sample_education_mean:int = education_mean 
        , sample_education_std:int = education_std
        , sample_wage_mean:int = wage_mean 
        , sample_education_effect_wage:int = education_effect_wage
        , sample_education_effect_liberal_support:int = 1
        , sample_wage_effect_liberal_support:int = 1
        , **kwargs
):
    
    # create randomised data    
    _education = sample_education_mean + (sample_education_std * np.random.standard_normal(sample_size))
    _wage = (
        sample_wage_mean + (sample_education_effect_wage * (_education - sample_education_mean)) + (3000 * np.random.standard_normal(sample_size))
    )

    _education_normalised = (_education - np.mean(_education))/ np.std(_education)
    _wage_normalised = (_wage - np.mean(_wage))/ np.std(_wage)

    _liberal_support = (
        (sample_education_effect_liberal_support * _education_normalised) + (sample_wage_effect_liberal_support * _wage_normalised) + np.random.standard_normal(sample_size)
    )

    _df = pd.DataFrame({'liberal_support': _liberal_support, 'education': _education, 'wage': _wage})
    
    return _df

In [113]:
df2 = collider_dataset(sample_wage_effect_liberal_support=2)
df2.head()

,liberal_support,education,wage
0,0.023807,14.923808,36514.210891
1,0.006732,12.163707,40292.167031
2,-3.429527,9.314391,30706.331075
3,-3.650045,9.169891,32573.544739
4,-0.319056,10.206847,34744.238001


In [114]:
# "correct" regression - ignore collider

X = sm.add_constant(df2['education'])
y = df2['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.486
Model:                            OLS   Adj. R-squared:                  0.485
Method:                 Least Squares   F-statistic:                     943.1
Date:                Sat, 13 Jul 2024   Prob (F-statistic):          2.43e-146
Time:                        00:50:00   Log-Likelihood:                -9442.1
No. Observations:                1000   AIC:                         1.889e+04
Df Residuals:                     998   BIC:                         1.890e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.437e+04    398.233     61.193      0.0

In [115]:
# incorrect - control for collider

X = sm.add_constant(df2[['education', 'liberal_support']])
y = df2['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.823
Model:                            OLS   Adj. R-squared:                  0.823
Method:                 Least Squares   F-statistic:                     2325.
Date:                Sat, 13 Jul 2024   Prob (F-statistic):               0.00
Time:                        00:50:00   Log-Likelihood:                -8907.7
No. Observations:                1000   AIC:                         1.782e+04
Df Residuals:                     997   BIC:                         1.784e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const            3.768e+04    383.934     
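
To see the collider effect in isolation, the comparison above can be repeated with a larger sample and only the education coefficient printed. This is a minimal sketch, not the notebook's own code: it assumes the same parameters as `df2` (a wage effect of 2 on `liberal_support`), uses NumPy least squares instead of statsmodels, and fixes the seed; the helper name `ols` is hypothetical.

```python
import numpy as np


def ols(y, *columns):
    """Least-squares coefficients [intercept, slope_1, ...] (illustrative helper)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


rng = np.random.default_rng(0)  # fixed seed is an assumption, for repeatability
n = 200_000

# education -> wage, with a true effect of 1000 per year
education = 12 + 3 * rng.standard_normal(n)
wage = 36000 + 1000 * (education - 12) + 3000 * rng.standard_normal(n)

# liberal_support is caused by both education and wage (the collider)
education_z = (education - education.mean()) / education.std()
wage_z = (wage - wage.mean()) / wage.std()
liberal_support = 1 * education_z + 2 * wage_z + rng.standard_normal(n)

slope_without_collider = ols(wage, education)[1]
slope_with_collider = ols(wage, education, liberal_support)[1]

print("education only:              ", slope_without_collider)
print("education + liberal_support: ", slope_with_collider)
```

The first coefficient should land close to the true 1,000. The second is pulled far away from it, even though the model with the collider has the higher R-squared, so goodness of fit is no guide to which specification is causally correct.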

## 3. Mediator

education ---> skills ---> wage

In [116]:
def mediator_dataset(
        sample_size: int = 1000
        , sample_education_mean:int = education_mean 
        , sample_education_std:int = education_std
        , sample_wage_mean:int = wage_mean 
        , sample_education_effect_wage: int = education_effect_wage
        , sample_skill_mean:int = 5
        , sample_education_effect_skill: float = 0.2
        , **kwargs
):
    
    # create randomised data    
    _education = sample_education_mean + (sample_education_std * np.random.standard_normal(sample_size))

    _skill = (
        sample_skill_mean + (sample_education_effect_skill * (_education - sample_education_mean)) + np.random.standard_normal(sample_size)
    )

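    # derive the skill -> wage effect so that the total effect of education on wage
    # (education -> skill -> wage) equals sample_education_effect_wage
    # note: _skill is not centred below, so the mean wage is shifted up by
    # _implied_skill_effect_on_wage * sample_skill_mean
    _implied_skill_effect_on_wage = sample_education_effect_wage / sample_education_effect_skill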

    _wage = (
        sample_wage_mean + (_implied_skill_effect_on_wage * _skill) + (3000 * np.random.standard_normal(sample_size))
    )


    _df = pd.DataFrame({'skill': _skill, 'education': _education, 'wage': _wage})
    
    return _df

In [117]:
df3 = mediator_dataset()
df3.head()

,skill,education,wage
0,5.261245,12.801065,63934.379796
1,5.123689,14.950331,65352.32857
2,5.341275,7.559743,63231.842154
3,3.544813,14.52454,59495.721238
4,3.070849,6.533408,48609.666761


In [118]:
# correct - do not control for the mediator

X = sm.add_constant(df3['education'])
y = df3['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.185
Method:                 Least Squares   F-statistic:                     228.5
Date:                Sat, 13 Jul 2024   Prob (F-statistic):           1.24e-46
Time:                        00:50:00   Log-Likelihood:                -10103.
No. Observations:                1000   AIC:                         2.021e+04
Df Residuals:                     998   BIC:                         2.022e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const           5e+04    775.745     64.451      0.0

In [119]:
# incorrect - control for mediator

X = sm.add_constant(df3[['education', 'skill']])
y = df3['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.785
Model:                            OLS   Adj. R-squared:                  0.785
Method:                 Least Squares   F-statistic:                     1824.
Date:                Sat, 13 Jul 2024   Prob (F-statistic):               0.00
Time:                        00:50:00   Log-Likelihood:                -9436.1
No. Observations:                1000   AIC:                         1.888e+04
Df Residuals:                     997   BIC:                         1.889e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       3.621e+04    476.633     75.968      0.0
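
The two regressions above estimate different quantities: without `skill` the education coefficient is the total effect, and with `skill` it is only the direct effect, which is zero in this simulation because education acts on wage purely through skill. The following sketch reproduces that contrast under the default parameters of `mediator_dataset`. It is an illustration only: NumPy least squares replaces statsmodels, the seed and sample size are assumptions, and the helper name `ols` is hypothetical.

```python
import numpy as np


def ols(y, *columns):
    """Least-squares coefficients [intercept, slope_1, ...] (illustrative helper)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


rng = np.random.default_rng(0)  # fixed seed is an assumption, for repeatability
n = 200_000

education_effect_skill = 0.2
education_effect_wage = 1000  # intended total effect of education on wage
skill_effect_wage = education_effect_wage / education_effect_skill

# education -> skill -> wage, with no direct education -> wage edge
education = 12 + 3 * rng.standard_normal(n)
skill = 5 + education_effect_skill * (education - 12) + rng.standard_normal(n)
wage = 36000 + skill_effect_wage * skill + 3000 * rng.standard_normal(n)

total_effect = ols(wage, education)[1]          # mediator left out
direct_effect = ols(wage, education, skill)[1]  # mediator controlled for

print("total effect (education only):     ", total_effect)
print("direct effect (education + skill): ", direct_effect)
```

The total effect should come out near 1,000 and the direct effect near zero. Whether controlling for the mediator is "incorrect" therefore depends on the question: it is wrong if the target is the overall effect of education, which is the target throughout this notebook.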

## 4. Back-door path / "M-bias"

education ----> wage

parents_education   ----> education

parents_education   ----> books_read_per_year

ambition            -----> wage

ambition            -----> books_read_per_year
                

In [120]:
def m_bias_dataset(
        sample_size: int = 1000
        , sample_education_mean:int = education_mean 
        , sample_education_std:int = education_std
        , sample_wage_mean:int = wage_mean 
        , sample_education_effect_wage: int = education_effect_wage
        , sample_parents_effect_education: float = 0.5
        , sample_ambition_effect_wage: int = 5000
        , sample_books_read_mean: int = 5
        , parents_education_effect_books_read: int = 1
        , ambition_effect_books_read: int = 1
        , **kwargs
):
    
    # create randomised data    
    _parents_education_mean = sample_education_mean - 3
    _parents_education = _parents_education_mean + ((sample_education_std - 1) * np.random.standard_normal(sample_size))
    _ambition = np.random.standard_normal(sample_size)

    _education = (
        sample_education_mean 
        + (sample_parents_effect_education * (_parents_education - _parents_education_mean)) 
        + (sample_education_std * np.random.standard_normal(sample_size))
    )
    
    _wage = (
        sample_wage_mean 
        + (sample_education_effect_wage * (_education - sample_education_mean)) 
        + (sample_ambition_effect_wage * _ambition) + (3000 * np.random.standard_normal(sample_size))
    )

    _books_read = (
        sample_books_read_mean 
        + (parents_education_effect_books_read * (_parents_education - _parents_education_mean)) 
        + (ambition_effect_books_read * _ambition)
        + np.random.standard_normal(sample_size)
    )
    

    _df = pd.DataFrame({
        'education': _education, 'wage': _wage
        , 'parents_education': _parents_education
        , 'ambition': _ambition
        , 'books_read': _books_read
        })
    
    return _df

In [121]:
df4 = m_bias_dataset(sample_education_effect_wage=1000)
df4.describe()

,education,wage,parents_education,ambition,books_read
count,1000.0,1000.0,1000.0,1000.0,1000.0
mean,11.931359,35863.035587,8.938516,-0.017434,4.932853
std,3.268588,6730.334186,2.048168,1.014585,2.533497
min,2.46875,14646.743518,2.751349,-4.432009,-2.694312
25%,9.813425,31260.666346,7.48317,-0.698586,3.151251
50%,11.968624,35765.67384,8.904083,-0.053483,4.883983
75%,14.217442,40506.556678,10.341572,0.669371,6.744025
max,22.410188,55917.483964,15.176841,3.32058,12.443327


In [122]:
# correct - no controls needed: the back-door path is already blocked by the collider (books read)

X = sm.add_constant(df4['education'])
y = df4['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.252
Model:                            OLS   Adj. R-squared:                  0.252
Method:                 Least Squares   F-statistic:                     337.0
Date:                Sat, 13 Jul 2024   Prob (F-statistic):           4.39e-65
Time:                        00:50:00   Log-Likelihood:                -10087.
No. Observations:                1000   AIC:                         2.018e+04
Df Residuals:                     998   BIC:                         2.019e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.352e+04    697.134     33.736      0.0

In [123]:
# correct - control for parents' education and ambition, but do not control for collider (books read)

X = sm.add_constant(df4[['education', 'parents_education', 'ambition']])
y = df4['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.817
Model:                            OLS   Adj. R-squared:                  0.817
Method:                 Least Squares   F-statistic:                     1483.
Date:                Sat, 13 Jul 2024   Prob (F-statistic):               0.00
Time:                        00:50:00   Log-Likelihood:                -9383.5
No. Observations:                1000   AIC:                         1.877e+04
Df Residuals:                     996   BIC:                         1.879e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              2.363e+04    451.76

In [124]:
# incorrect - control for collider (books read), thereby opening the backdoor

X = sm.add_constant(df4[['education','books_read']])
y = df4['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.366
Model:                            OLS   Adj. R-squared:                  0.365
Method:                 Least Squares   F-statistic:                     288.3
Date:                Sat, 13 Jul 2024   Prob (F-statistic):           1.55e-99
Time:                        00:50:00   Log-Likelihood:                -10005.
No. Observations:                1000   AIC:                         2.002e+04
Df Residuals:                     997   BIC:                         2.003e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.173e+04    655.851     33.133      0.0

In [125]:
# correct - controlling for the collider (books read) opens the back-door path, but also controlling for ambition closes it again

X = sm.add_constant(df4[['education','books_read', 'ambition']])
y = df4['wage']

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.817
Model:                            OLS   Adj. R-squared:                  0.816
Method:                 Least Squares   F-statistic:                     1480.
Date:                Sat, 13 Jul 2024   Prob (F-statistic):               0.00
Time:                        00:50:00   Log-Likelihood:                -9384.3
No. Observations:                1000   AIC:                         1.878e+04
Df Residuals:                     996   BIC:                         1.880e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.403e+04    355.922     67.523      0.0
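
The four specifications above can be compared side by side by pulling out only the education coefficient. The sketch below is a hedged illustration under the default parameters of `m_bias_dataset`, not a replacement for the regressions above: it uses NumPy least squares instead of statsmodels, a fixed seed, and a larger sample, and the names `ols` and `slopes` are hypothetical.

```python
import numpy as np


def ols(y, *columns):
    """Least-squares coefficients [intercept, slope_1, ...] (illustrative helper)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


rng = np.random.default_rng(0)  # fixed seed is an assumption, for repeatability
n = 200_000

# Two independent root causes
parents_education = 9 + 2 * rng.standard_normal(n)
ambition = rng.standard_normal(n)

# parents_education -> education -> wage <- ambition
education = 12 + 0.5 * (parents_education - 9) + 3 * rng.standard_normal(n)
wage = (
    36000
    + 1000 * (education - 12)
    + 5000 * ambition
    + 3000 * rng.standard_normal(n)
)

# books_read is the collider: parents_education -> books_read <- ambition
books_read = 5 + (parents_education - 9) + ambition + rng.standard_normal(n)

# Education coefficient under each specification
slopes = {
    "education only": ols(wage, education)[1],
    "+ parents_education, ambition": ols(wage, education, parents_education, ambition)[1],
    "+ books_read": ols(wage, education, books_read)[1],
    "+ books_read, ambition": ols(wage, education, books_read, ambition)[1],
}

for name, slope in slopes.items():
    print(f"{name:32s} {slope:10.1f}")
```

Only the specification that controls for `books_read` alone should drift away from the true 1,000 (downward, with these parameters). Adding `ambition` blocks the path that the collider opened, and the estimate returns to roughly 1,000.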