In [1]:
import warnings
warnings.filterwarnings('ignore')

# Where ML Fits into Causal Inference (review)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Mixtape-Sessions/Machine-Learning/blob/main/Labs/python/Causal%20via%20Prediction.ipynb)

The traditional go-to tool for causal inference is multiple regression:
$$
Y_i = \delta D_i + X_i'\beta+\varepsilon_i,
$$
where $D_i$ is the "treatment" or causal variable whose effects we are interested in, and $X_i$ is a vector of controls, conditional on which we are willing to assume $D_i$ is as good as randomly assigned.


> *example:* Suppose we are interested in the magnitude of racial discrimination in the labor market. One way to conceptualize this is the difference in earnings between two workers who are identical in productivity, but differ in their race, or, the "effect" of race. Then $D_i$ would be an indicator for, say, a Black worker. $Y_i$ would be earnings, and $X_i$ would be characteristics that capture determinants of productivity, including educational attainment, cognitive ability, and other background characteristics.

Where does machine learning fit into causal inference? It might be tempting to treat
this regression as a prediction exercise where we are predicting $Y_{i}$
given $D_{i}$ and $X_{i}$. Don't give in to this temptation. We are not
after a prediction for $Y_{i}$, we are after a coefficient on $D_{i}$.
Modern machine learning algorithms are finely tuned for producing
predictions, but along the way they compromise coefficients: regularization in lasso or ridge, for example, deliberately shrinks them toward zero. So how can we
deploy machine learning in the service of estimating the causal coefficient $\delta$?

To see where ML fits in, first remember that an equivalent way to estimate $\delta$ is the following three-step procedure:


1.   Regress $Y_{i}$ on $X_{i}$ and compute the residuals, $\tilde{Y}_{i}=Y_{i}-\hat{Y}_{i}^{OLS}$, where $\hat{Y}_{i}^{OLS}=X_{i}^{\prime}\left( X^{\prime }X\right) ^{-1}X^{\prime }Y$
2.   Regress $D_{i}$ on $X_{i}$ and compute the residuals, $\tilde{D}_{i}=D_{i}-\hat{D}_{i}^{OLS}$, where $\hat{D}_{i}^{OLS}=X_{i}^{\prime}\left( X^{\prime }X\right) ^{-1}X^{\prime }D$
3.   Regress $\tilde{Y}_{i}$ on $\tilde{D}_{i}$.
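The equivalence in the three steps above (the Frisch-Waugh-Lovell theorem) can be checked numerically. The sketch below uses simulated data, not the NLSY sample; the sample size, the coefficients, and the true $\delta = 2$ are all made-up values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data (illustrative only): an intercept plus two controls
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
D = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)
Y = 2.0 * D + X @ np.array([1.0, 0.3, 0.7]) + rng.normal(size=n)

# Full regression of Y on (D, X): the first coefficient is delta-hat
full_coef = np.linalg.lstsq(np.column_stack([D, X]), Y, rcond=None)[0]
delta_full = full_coef[0]

# Steps 1 and 2: residualize Y and D on X
Y_tilde = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]
D_tilde = D - X @ np.linalg.lstsq(X, D, rcond=None)[0]

# Step 3: regress residuals on residuals
# (no intercept needed: both residual vectors have mean zero)
delta_fwl = (D_tilde @ Y_tilde) / (D_tilde @ D_tilde)

print(f"full regression: {delta_full:.6f}")
print(f"three-step:      {delta_fwl:.6f}")
```

The two printed numbers should agree up to floating-point error, which is why swapping out only the prediction steps leaves the final regression targeting the same coefficient.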

Steps 1 and 2 are prediction exercises--ML's wheelhouse. When OLS isn't the right tool for the job, we can replace OLS in those steps with machine learning:

1.   Predict $Y_{i}$ based on $X_{i}$ using ML and compute the residuals, $\tilde{Y}_{i}=Y_{i}-\hat{Y}_{i}^{ML}$, where $\hat{Y}_{i}^{ML}$ is the prediction from an ML algorithm
2.   Predict $D_{i}$ based on $X_{i}$ using ML and compute the residuals, $\tilde{D}_{i}=D_{i}-\hat{D}_{i}^{ML}$, where $\hat{D}_{i}^{ML}$ is the prediction from an ML algorithm
3.   Regress $\tilde{Y}_{i}$ on $\tilde{D}_{i}$.
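One way to see why step 3 still recovers $\delta$ is a short derivation. It rests on an assumption not written out above: the controls enter through an unknown function $g$ rather than linearly, so that
$$
Y_i = \delta D_i + g(X_i) + \varepsilon_i, \qquad E[\varepsilon_i \mid D_i, X_i] = 0.
$$
Taking expectations conditional on $X_i$ gives $E[Y_i \mid X_i] = \delta E[D_i \mid X_i] + g(X_i)$, and subtracting this from the first equation removes $g$:
$$
Y_i - E[Y_i \mid X_i] = \delta \left( D_i - E[D_i \mid X_i] \right) + \varepsilon_i.
$$
The ML predictions $\hat{Y}_{i}^{ML}$ and $\hat{D}_{i}^{ML}$ stand in for the two conditional expectations, so regressing $\tilde{Y}_{i}$ on $\tilde{D}_{i}$ estimates $\delta$.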

This is the basis for the two major methods we'll look at today. The first is "Post-Double Selection Lasso" (Belloni, Chernozhukov, Hansen). The second is "Double-Debiased Machine Learning" (Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, Robins).

# Post Double Selection Lasso (PDS Lasso)

## Load useful packages: 
pandas, numpy, linear_model (from sklearn), and KFold (from sklearn.model_selection)

Try it yourself first

### Cheat

In [2]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold

## Read in data and have a look at it
We'll use the NLSY data from yesterday

Try it yourself

### Cheat

In [4]:
nlsy=pd.read_csv('https://github.com/Mixtape-Sessions/Machine-Learning/blob/main/Labs/data/nlsy97.csv?raw=true')
nlsy.head()

Unnamed: 0,lnw_2016,educ,black,hispanic,other,exp,afqt,mom_educ,dad_educ,yhea_100_1997,...,_XPexp_13,_XPexp_14,_XPexp_16,_XPexp_17,_XPexp_18,_XPexp_19,_XPexp_20,_XPexp_21,_XPexp_22,_XPexp_23
0,4.076898,16,0,0,0,11,7.0724,12,12,3,...,0,0,0,0,0,0,0,0,0,0
1,3.294138,9,0,0,0,19,4.7481,9,10,2,...,0,0,0,0,0,1,0,0,0,0
2,2.830896,9,0,1,0,22,1.1987,12,9,3,...,0,0,0,0,0,0,0,0,1,0
3,4.306459,16,0,0,0,13,8.9321,16,18,2,...,1,0,0,0,0,0,0,0,0,0
4,5.991465,16,0,1,0,15,2.2618,16,16,1,...,0,0,0,0,0,0,0,0,0,0


## Define outcome, regressor of interest
y = lnw_2016

d = black 

Try it yourself:

### Cheat

In [5]:
y=nlsy['lnw_2016']
d=nlsy[['black']]
# Double brackets [[...]] return a 2-D DataFrame instead of a 1-D Series.
# d is used as the right-hand-side variable, and sklearn's linear models expect the regressors as a 2-D array (possibly with several columns).
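The difference between single and double brackets is easy to see on a toy frame. The three rows below are made up and serve only to show the shapes.

```python
import pandas as pd

# Tiny made-up frame, only to show the shapes
df = pd.DataFrame({"black": [0, 1, 0], "lnw_2016": [4.1, 3.3, 2.8]})

as_series = df["black"]    # single brackets -> 1-D Series
as_frame = df[["black"]]   # double brackets -> 2-D DataFrame with one column

print(as_series.shape)  # (3,)
print(as_frame.shape)   # (3, 1)
```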

## Simple Regression with no Controls
Regress y on d and print out coefficient
Try it yourself

In [None]:
# instantiate and fit a linear regression object

# print out regression coefficient


### Cheat

In [6]:
lm=linear_model.LinearRegression().fit(d,y)
print("Simple regression race gap: {:.3f}".format(lm.coef_[0]))
# no weights are used this time
# this coefficient may be biased: it also captures the effects of other factors correlated with race
# "Take a Black and a white worker who match on the included covariates. The Black worker has, on average, a 30% lower wage."

Simple regression race gap: -0.382


### ...
Is this the effect we're looking for? 

Let's try a regression where we control for a few things: education (linearly), experience (linearly), and cognitive ability (afqt, linearly).

Try it yourself!

In [None]:
# define X, matrix of the d and the controls we want

# run regression

# print out coefficient


### Cheat

In [7]:
# define RHS, matrix of the d and the controls we want
RHS=nlsy[['black','educ','exp','afqt']]
# run regression
lm.fit(RHS,y)
# print out coefficient
print("Multiple regression-adjusted race gap: {:.3f}".format(lm.coef_[0]))



Multiple regression-adjusted race gap: -0.262



### ...
How does it compare to the simple regression? 

But who is to say the controls we included are sufficient? We have a whole host (hundreds!) of other potential controls, not to mention that the controls we did put in may not enter linearly. This is a job for ML!

To prep, let's define a matrix X with all of our potential controls:

In [13]:
X=nlsy.drop(columns=['lnw_2016','black']) # all remaining columns: 990+ candidate controls

## Post Double Selection Lasso

### Step 1: Lasso the outcome on X
Try it yourself. Don't forget to standardize the Xs, or choose the normalize=True option

#### Cheat

In [17]:
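lassoy = linear_model.LassoCV(max_iter=1000,normalize=True).fit(X, y)
# LassoCV is the suggested choice: it picks the penalty by cross-validation.
# Q: Does normalize=True leave the dummies unstandardized?
# A: No, it rescales every column, but rescaling a dummy does no harm.
# Note: the normalize argument was removed in scikit-learn 1.2, so this call needs an older version.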
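On scikit-learn versions without the `normalize` argument, one possible substitute is to standardize explicitly in a pipeline. The sketch below runs on simulated stand-ins (`X_demo`, `y_demo` are hypothetical names, not the NLSY data). Because `StandardScaler` divides by the standard deviation, which is not the same scaling that `normalize=True` applied, the chosen penalty and selected columns need not match the results shown in this lab.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for X and y: 20 columns on very different scales
X_demo = rng.normal(size=(200, 20)) * rng.uniform(0.1, 10, size=20)
y_demo = X_demo[:, 0] / X_demo[:, 0].std() + rng.normal(size=200)

# Standardize every column, then fit a cross-validated lasso
lasso_pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000))
lasso_pipe.fit(X_demo, y_demo)

# Coefficients live on the final pipeline step;
# nonzero entries mark the selected columns
coef = lasso_pipe[-1].coef_
selected = np.flatnonzero(coef != 0)
print(selected)
```

With a pipeline, the selection mask comes from `lasso_pipe[-1].coef_` rather than from `coef_` on the fitted object itself.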

### Step 2: Lasso the treatment on X
Try it yourself

#### Cheat

In [15]:
lassod = linear_model.LassoCV(max_iter=1000,normalize=True).fit(X, d)

### Step 3: Form the union of controls
Try it yourself

#### Cheat

In [18]:
Xunion=X.iloc[:,(lassod.coef_!=0) | (lassoy.coef_!=0)]
Xunion.head()
# 140 columns still remain after selection

Unnamed: 0,educ,hispanic,other,afqt,mom_educ,yhea_2200_1997,youth_bothbio_01_1997,p4_001_1997,p5_102_1997,cv_bio_mom_age_child1_1997,...,_BGhfp_adhr_16,_BGhfp_adpe_2,_BGhfp_adpe_9,_BGhfp_adpe_11,_BGhfp_aden_4,_BGhp5_101__4,_BGhcvc_govo2,_BGhcvc_govp5,_BGhcvc_govp6,_XPexp_17
0,16,0,0,7.0724,12,98,1,2,-4,31,...,0,1,0,0,1,0,0,0,0,0
1,9,0,0,4.7481,9,140,1,1,-4,25,...,0,1,0,0,1,0,0,0,0,0
2,9,1,0,1.1987,12,185,1,3,-4,30,...,0,0,0,0,0,0,0,0,0,0
3,16,0,0,8.9321,16,140,1,2,-4,25,...,0,1,0,0,0,0,0,0,0,0
4,16,1,0,2.2618,16,145,1,2,-4,24,...,0,0,0,0,0,0,0,0,0,0
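The union step keeps a control if either lasso selected it. This matters because a variable that predicts the treatment can cause omitted-variable bias even when the outcome lasso drops it. A toy illustration of the mask logic, using made-up coefficient vectors in place of `lassoy.coef_` and `lassod.coef_`:

```python
import numpy as np

# Made-up coefficient vectors standing in for lassoy.coef_ and lassod.coef_
coef_y = np.array([0.5, 0.0, 0.0, -0.2, 0.0])
coef_d = np.array([0.0, 0.0, 0.3, 0.1, 0.0])

# A control is kept if either lasso gave it a nonzero coefficient
keep = (coef_y != 0) | (coef_d != 0)
print(keep)        # [ True False  True  True False]
print(keep.sum())  # 3
```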


### Concatenate treatment with union of controls and regress y on that and print out estimate
Try yourself

#### Cheat

In [19]:
rhs=pd.concat([d,Xunion],axis=1)
fullreg=linear_model.LinearRegression().fit(rhs,y)
print("PDS regression earnings race gap: {:.3f}".format(fullreg.coef_[0]))

PDS regression earnings race gap: -0.241


In [20]:
fullreg.coef_

array([-2.40520978e-01,  4.98855055e-02,  8.17051403e-02, -4.56643011e-01,
        3.25440278e-02,  1.01995008e-02,  3.77727358e-04,  9.66040700e-03,
       -5.69312550e-02, -7.15002435e-02, -5.75509880e-04,  3.27448071e-02,
        5.05991440e-02, -6.68905959e-03, -7.93863999e-03,  1.61881837e-02,
        6.29131732e-05,  2.53627744e-02, -9.05469024e-02, -1.57353080e-01,
       -5.64430209e-02, -1.04324184e-01,  1.12901258e-01,  9.36311023e-02,
       -7.19750864e-02,  1.92422285e+00,  2.09626566e-02,  7.18486745e-01,
        5.97413830e-01, -1.00485453e-01,  1.52595946e-01, -5.56416928e-01,
        2.23738531e-01,  3.90683214e-01, -8.40008196e-01,  2.11646595e-01,
        1.82951909e-01,  1.83960780e-01, -7.91449296e-01, -2.20633235e-01,
       -4.56269685e-02,  1.78921558e-01,  3.02650864e-02, -9.68284460e-02,
       -8.34538159e-01, -4.71071125e-01,  5.07431467e-01,  1.13386100e-02,
       -1.91700003e-01,  5.29223419e-02, -1.55579023e-01,  5.58497821e-03,
       -2.25848104e-01,  

In [None]:
# PDS lasso vs. simple OLS
# No standard errors here, so -0.24 might not be significantly different from the -0.26 above
# Most labor economists use the simple approach, which is usually good enough
# Lasso: reassurance about robustness
# implementation

## Double-Debiased Machine Learning
For simplicity, we will first do it without sample splitting. Note that the two prediction steps don't have to use the same method.

### Step 1: Ridge outcome on Xs, get residuals
Try yourself

#### Cheat

In [21]:
ridgey = linear_model.RidgeCV(normalize=True).fit(X, y)
yresid=y-ridgey.predict(X)

### Step 2: Ridge treatment on Xs, get residuals
Try yourself

#### Cheat

In [22]:
ridged = linear_model.RidgeCV(normalize=True).fit(X, d)
dresid=d-ridged.predict(X)

### Step 3: Regress y resids on d resids and print out estimate
Try yourself

#### Cheat

In [23]:
dmlreg=linear_model.LinearRegression().fit(dresid,yresid)
print("DML regression earnings race gap: {:.3f}".format(dmlreg.coef_[0]))
# with the method above, each observation's residual comes from models fit on the full sample, that observation included

DML regression earnings race gap: -0.290


### The real thing: with sample splitting

In [24]:
# create our sample-splitting "object", which randomly assigns observations to folds
# (random splits are appropriate for cross-sectional data)
kf = KFold(n_splits=5,shuffle=True,random_state=42)

# initialize columns for residuals (as floats, so residuals can be stored without a dtype clash)
yresid = y*0.0
dresid = d*0.0

# Now loop through each fold
for train_index, test_index in kf.split(X):
  X_train, X_test = X.iloc[train_index,:], X.iloc[test_index,:]
  y_train, y_test = y.iloc[train_index], y.iloc[test_index]
  d_train, d_test = d.iloc[train_index,:], d.iloc[test_index,:]
  
  # Do DML thing
  # Ridge y on training folds:
  ridgey.fit(X_train, y_train)

  # but get residuals in test set
  yresid.iloc[test_index]=y_test-ridgey.predict(X_test)
  
  #Ridge d on training folds
  ridged.fit(X_train, d_train)

  #but get residuals in test set
  dresid.iloc[test_index,:]=d_test-ridged.predict(X_test)

 
# Regress resids
dmlreg=linear_model.LinearRegression().fit(dresid,yresid)

print("DML regression earnings race gap: {:.3f}".format(dmlreg.coef_[0]))

DML regression earnings race gap: -0.246
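The cross-fitting loop can also be wrapped in a small helper so the same folds are reused for the outcome and the treatment. This is a sketch on simulated data with a made-up true effect of -0.25. The function name `cross_fit_residuals` and the `_sim` variables are hypothetical, and plain `RidgeCV()` with default settings is an arbitrary choice of learner.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def cross_fit_residuals(model, X, target, n_splits=5, seed=42):
    """Out-of-fold residuals: fit on K-1 folds, predict the held-out fold."""
    resid = np.zeros(len(target), dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):
        fitted = clone(model).fit(X[train_idx], target[train_idx])
        resid[test_idx] = target[test_idx] - fitted.predict(X[test_idx])
    return resid

# Simulated data (illustrative only); the true coefficient on d is -0.25
rng = np.random.default_rng(0)
n, p = 1000, 10
X_sim = rng.normal(size=(n, p))
d_sim = X_sim[:, 0] + rng.normal(size=n)
y_sim = -0.25 * d_sim + X_sim[:, 0] + X_sim[:, 1] + rng.normal(size=n)

# Same seed in both calls, so both sets of residuals use identical folds
y_res = cross_fit_residuals(RidgeCV(), X_sim, y_sim)
d_res = cross_fit_residuals(RidgeCV(), X_sim, d_sim)

# Final step: regress outcome residuals on treatment residuals
delta_hat = (d_res @ y_res) / (d_res @ d_res)
print(f"{delta_hat:.3f}")
```

Any estimator with `fit` and `predict` methods could be passed in place of `RidgeCV()`, and the two calls need not use the same one.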


You want standard errors, do you?

In [25]:
import statsmodels.api as sm
rhs = sm.add_constant(dresid)
model = sm.OLS(yresid, rhs)
results = model.fit(cov_type='HC3')
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:               lnw_2016   R-squared:                       0.009
Model:                            OLS   Adj. R-squared:                  0.009
Method:                 Least Squares   F-statistic:                     10.32
Date:                Sat, 25 Feb 2023   Prob (F-statistic):            0.00135
Time:                        01:35:02   Log-Likelihood:                -1567.6
No. Observations:                1266   AIC:                             3139.
Df Residuals:                    1264   BIC:                             3149.
Df Model:                           1                                         
Covariance Type:                  HC3                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0008      0.024     -0.033      0.9
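For intuition about what `cov_type='HC3'` computes, here is a hand-rolled version of the HC3 sandwich formula for a regression on a constant and one regressor. It is a sketch on simulated, heteroskedastic residuals. The function name `ols_hc3` and the data are made up, and the result is not meant to reproduce the table above.

```python
import numpy as np

def ols_hc3(x, y):
    """OLS of y on a constant and x, with HC3 standard errors."""
    Z = np.column_stack([np.ones(len(x)), x])
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    beta = ZtZ_inv @ Z.T @ y
    e = y - Z @ beta
    # leverage of each observation: h_i = z_i' (Z'Z)^{-1} z_i
    h = np.einsum("ij,jk,ik->i", Z, ZtZ_inv, Z)
    # "meat" of the sandwich: squared residuals inflated by 1/(1-h_i)^2
    meat = Z.T @ (Z * (e**2 / (1 - h) ** 2)[:, None])
    cov = ZtZ_inv @ meat @ ZtZ_inv
    return beta, np.sqrt(np.diag(cov))

# Simulated residuals with heteroskedastic noise (illustrative only)
rng = np.random.default_rng(1)
d_res = rng.normal(size=800)
y_res = -0.25 * d_res + rng.normal(size=800) * (1 + 0.5 * np.abs(d_res))

beta, se = ols_hc3(d_res, y_res)
print(f"slope = {beta[1]:.3f}, HC3 s.e. = {se[1]:.3f}")
```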

## Now do DML using Random Forest!