Importing Packages and Basic Setup

In [141]:
#.venv/scripts/activate  ; no source. 
import numpy as np
import pandas as pd
import xgboost as xgb
from matplotlib.pyplot import hist
from sklearn import svm
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, log_loss
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

RANDOM_SEED = 23


Importing Datasets

In [142]:
# Importing maindata
file_path = "C://Users/miste/Documents/Causal_ML/"
x = pd.read_stata(file_path + "maindata.dta", convert_categoricals=False)

# Importing laws_csv, cleaning it
laws_csv = pd.read_csv(file_path + "When_Were_Laws.csv")
laws_csv = laws_csv[laws_csv["FIPS"].notna()]  # FIPS codes identify states; drop rows without one
laws_csv = laws_csv.drop("State_Name", axis=1)  # Dropping as useless
laws_csv = laws_csv.rename({'FIPS': 'stfips'}, axis=1) 

# Merging
merged = pd.merge(laws_csv, x, on='stfips', how='outer')


Cleaning Datasets, only interested in the 1997 states. 

In [143]:
basic_merged = merged.copy()  # To allow for re-running 

# Dropping states that were treated before 1997 (i.e. they always had programs).
# na=True means rows with a missing Year_Implemented are dropped too (same as before).
basic_merged = basic_merged[~basic_merged["Year_Implemented"].str.contains("always", na=True)]

# Making it so that "never-treated" states are treated at T = infinity
# (restricted to the one column so no other column is touched by accident)
basic_merged["Year_Implemented"] = basic_merged["Year_Implemented"].replace("never", "1000000")
basic_merged["Year_Implemented"] = basic_merged["Year_Implemented"].astype(int)  # converting to int

# Indicator for whether treatment has occurred in state i
basic_merged["year_indic"] = (basic_merged["year"] >= basic_merged["Year_Implemented"])

# Indicator for whether the individual was treated (i.e. under 19 and in a state that added a law)
basic_merged["treatment"] = basic_merged["under19"] * basic_merged["year_indic"]

# Generating list of confounders of interest, these are not necessarily optimal. 
list_of_confounders = [ "fownu18", "fpovcut", "povll", "faminctm1", "a_maritl"] 
list_of_confounders += ["a_hga",  "anykids", "year", "stfips", "disability", "elig"] 

# Dropping years  outside of [1995,2000] 
basic_merged = basic_merged[basic_merged["year"] <= 2000]
basic_merged = basic_merged[basic_merged["year"] >= 1995]

# Subsetting our dataset to only include the columns we want, and dropping all rows with empty entries. 
basic_merged = basic_merged[list_of_confounders + ["treatment", "pubonly", "insured", "privonly", "Year_Implemented"]]
basic_merged = basic_merged.dropna(axis = 0)

# Setting up matrices with confounders and either the treatment or no treatment
# In practice, we only use the former. But the latter is useful for predicting hat(g)
confounders_and_treat = basic_merged[list_of_confounders + ["treatment"]]
confounders_no_treat = basic_merged[list_of_confounders]

# Outcome of interest, this can be set to: [ "pubonly", "insured", "privonly"]
y_var = basic_merged["privonly"]

Fitting Q

In [144]:
# This is probably where my model goes wrong, as we're likely overfitting. Keep stratified k-fold
x_train, x_test, y_train, y_test = train_test_split(confounders_and_treat, y_var, test_size=0.2)

Q = HistGradientBoostingClassifier()
Q.fit(x_train, y_train)
score2 = Q.score(x_test,y_test)
print("Model Score: " + str(score2))   
                          
# Can fit g as well, normally get ~95-99%

Model Score: 0.8135968963606134


Estimating causal effect

Our target estimand is $E[E[Y_1-Y_0|A=1,X]-E[Y_1-Y_0|A=0,X]|A=1]$   

Where:   

$Y_1 = 1$ if you have insurance in 1997 and 0 otherwise  
$Y_0 = 1$ if you have insurance in 1996 and 0 otherwise  
$A = 1$ if you are between 14 and 19 and 0 if you are between 20 and 26  
$X$ are the covariates  

So normally the model would predict the difference between the years for each individual, but we are not able to do that (unless we try to match individuals to the most similar ones in other years based on their covariates).

If we assume that the individuals surveyed between the two years were mostly the same, then this equals:

$E[E[Y_1|A=1,X,1997]-E[Y_0|A=1,X,1996]-E[Y_1|A=0,X,1997]+E[Y_0|A=0,X,1996]|A=1]$ 

Where given 1997 means given that the data in question was collected in 1997.

Our high-level goal here is to use the 20 to 26 year-olds to get a baseline conditional change and then measure any causal effect we see beyond that. It is easy to get the overall change for the 20 to 26 year-olds (just take the average), but the question is whether it is still possible to get the change conditional on covariates (e.g. on being in a low-income household).  
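
For reference, the "just take the average" version (no conditioning on covariates) can be sketched as a plain difference of group means. The tiny frame below is made up, and the column names `under19`, `year`, and `insured` are only borrowed for illustration.

```python
import pandas as pd

# Toy data (made up): two age groups observed in 1996 and 1997.
demo = pd.DataFrame({
    "under19": [1, 1, 1, 1, 0, 0, 0, 0],
    "year":    [1996, 1996, 1997, 1997, 1996, 1996, 1997, 1997],
    "insured": [0, 1, 1, 1, 0, 1, 1, 0],
})

# Insurance rate per (age group, year), with years as columns.
rates = demo.groupby(["under19", "year"])["insured"].mean().unstack("year")
change = rates[1997] - rates[1996]

baseline_change = change.loc[0]        # change for the 20 to 26 year-olds
did = change.loc[1] - change.loc[0]    # change for the younger group beyond the baseline
print("Baseline change: " + str(baseline_change))
print("Difference-in-differences: " + str(did))
```

The open question in the text is how to get the same quantity conditional on $X$, which this unconditional version does not address.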

So I understand our options as the following:  
1. Try to match pairs of similar individuals between years to do the analysis as usual.  
2. Train a single model over all the data to predict if individuals between 14 and 19 are insured given their covariates, and then run it on the 1997 data and the 1996 data. However, if the X are not constant between the years, I think this might not train the model to predict the differences in coverage rates for the individuals from 20 to 26 the way it could if it were trained to predict $Y_1 - Y_0$. Imagine a scenario such as the following: unemployed Irish men (who, let's say, choose not to have insurance to save money) always get a job and become insured in 1997. However, unemployed Irish men in both 1996 and 1997 do not have insurance. A model predicting the difference would predict that unemployed Irish men in 1996 would have insurance in 1997, but a model trained on the data between years would predict that unemployed Irish men in both years would be uninsured. Is this a problem? This could be a problem for the matching approach as well, but it might be somewhat alleviated, since we might still be able to reasonably match individuals based on the covariates that didn't change.  
3. Train one model on 1996 data and another on 1997 data. If our estimator with this approach was unbiased, then $E[Y_1|A=1,X,1997]-E[Y_0|A=1,X,1996]$ would probably just recover the difference in insurance rates, but the $-E[Y_1|A=0,X,1997]+E[Y_0|A=0,X,1996]$ term would still make the overall estimate different from just calculating the difference in coverage between the two years. Additionally, it seems like this approach would be the same as the 2nd one if a flexible enough machine learning method was used.  
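
Option 1 could look roughly like the sketch below: within each age group, pair every 1997 row with its nearest 1996 neighbour in (standardized) covariate space, take the paired outcome differences, and subtract the older group's mean difference from the younger group's. This is a guess at one possible implementation, not a settled method. `matched_change`, `simulate`, and the synthetic data (true extra change of 0.3 for the younger group) are all hypothetical, and matching is done with replacement on Euclidean distance.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def matched_change(X_pre, y_pre, X_post, y_post):
    """Mean of (post outcome - outcome of the nearest pre-period neighbour)."""
    scaler = StandardScaler().fit(np.vstack([X_pre, X_post]))
    nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(X_pre))
    _, idx = nn.kneighbors(scaler.transform(X_post))
    return float(np.mean(y_post - y_pre[idx[:, 0]]))

# Synthetic data (made up): insurance rate depends on the first covariate.
rng = np.random.default_rng(23)

def simulate(n, base_rate):
    X = rng.normal(size=(n, 2))
    p = base_rate + 0.1 * (X[:, 0] > 0)
    return X, rng.binomial(1, p)

X_a1_96, y_a1_96 = simulate(2000, 0.3)  # A = 1, 1996
X_a1_97, y_a1_97 = simulate(2000, 0.6)  # A = 1, 1997 (rate up by 0.3)
X_a0_96, y_a0_96 = simulate(2000, 0.4)  # A = 0, 1996
X_a0_97, y_a0_97 = simulate(2000, 0.4)  # A = 0, 1997 (no change)

did_matched = (matched_change(X_a1_96, y_a1_96, X_a1_97, y_a1_97)
               - matched_change(X_a0_96, y_a0_96, X_a0_97, y_a0_97))
print("Matched difference-in-differences: " + str(did_matched))
```

The Irish-men scenario from option 2 would still bite here if the covariates used for matching are the ones that change between years.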

Another potential problem here: we might need to assume the number of people turning 14 is equal to the number of people turning 20, so that we are finding the change in a population that (even though some of its individuals' covariates are different) is at least the same size. But I am not sure about this.  

I would love to hear you all's thoughts - I am probably messing something simple up here.

In [145]:
#INSERT STUFF HERE! Not sure how to do it right. 