
# Week 8 - Recitation - S-learner and T-learner

Authors: Judith Abécassis and Élise Dumas, with some inspiration from Miguel Hernán (*Causal Inference: What If*)

In today's recitation, we will implement the S-learner and T-learner on an observational dataset, to assess the effect of smoking cessation on weight gain. We will use data from cigarette smokers aged 25-74 years who, as part of the NHEFS (National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study), had a medical baseline visit and a follow-up visit about 10 years later. Individuals were classified as treated if they reported having quit smoking before the follow-up visit, and as untreated otherwise. The weight gain is computed as the weight at the follow-up visit minus the weight at the baseline visit.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import scipy.stats as sps
import warnings

warnings.filterwarnings(action='once')
rg = np.random.default_rng(2907)

sns.set_context('poster')

## Exercise 1: dataset preparation

In [None]:
#Load dataset
resources = pd.read_csv("nhefs_short.csv")
resources.describe()

| | active | age | alcoholfreq | cholesterol | income | marital | pregnancies | sex | smokeyrs | wt71 | wt82_71 | qsmk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 |
| mean | 0.692654 | 43.163418 | 2.226387 | 222.391304 | 17.754123 | 2.587706 | 3.686657 | 1.0 | 22.377811 | 64.952024 | 2.491789 | 0.232384 |
| std | 0.643743 | 11.614891 | 1.20111 | 47.565742 | 2.688446 | 1.195348 | 2.205477 | 0.0 | 10.705643 | 14.454446 | 8.111797 | 0.422669 |
| min | 0.0 | 25.0 | 0.0 | 105.0 | 11.0 | 2.0 | 1.0 | 1.0 | 1.0 | 39.58 | -30.050074 | 0.0 |
| 25% | 0.0 | 34.0 | 2.0 | 188.5 | 16.0 | 2.0 | 2.0 | 1.0 | 14.0 | 55.34 | -1.867266 | 0.0 |
| 50% | 1.0 | 43.0 | 2.0 | 219.0 | 18.0 | 2.0 | 3.0 | 1.0 | 22.0 | 61.92 | 2.613123 | 0.0 |
| 75% | 1.0 | 52.0 | 3.0 | 253.0 | 19.0 | 2.0 | 5.0 | 1.0 | 30.0 | 71.67 | 6.689339 | 0.0 |
| max | 2.0 | 74.0 | 4.0 | 377.0 | 22.0 | 8.0 | 15.0 | 1.0 | 64.0 | 151.73 | 37.650512 | 1.0 |


Here is the list of available variables:

`active`: How active are you on a usual day? 0: very active, 1: moderately active, 2: inactive

`age`: Age at baseline

`alcoholfreq`: How often do you drink? 0: Almost every day, 1: 2-3 times/week, 2: 1-4 times/month, 3: < 12 times/year, 4: No alcohol last year, 5: Unknown

`cholesterol`: Serum cholesterol (mg/100 mL) at baseline visit

`income`: Total family income at baseline visit. 11: <1000, 12: 1000-1999, 13: 2000-2999, 14: 3000-3999, 15: 4000-4999, 16: 5000-5999, 17: 6000-6999, 18: 7000-9999, 19: 10000-14999, 20: 15000-19999, 21: 20000-24999, 22: 25000+

`marital`: Marital status in 1971. 1: Under 17, 2: Married, 3: Widowed, 4: Never married, 5: Divorced, 6: Separated, 8: Unknown

`sex`: 0: male, 1: female

`smokeyrs`: Years of smoking

`wt71`: Weight at baseline visit (in kg)

`wt82_71`: Difference in weight (follow-up minus baseline)

`qsmk`: Quit smoking between the two visits. 1: yes, 0: no

### 1. What is the treatment? The outcome? The covariates?

**Answer**

The treatment is smoking cessation (`qsmk`); the outcome is the difference in weight (`wt82_71`); the covariates are all the other variables (age, level of activity, frequency of alcohol consumption, cholesterol level, family income, marital status, number of pregnancies, sex, years of smoking, and weight at baseline).

### 2. Would you say that SUTVA holds? Strong ignorability? Conditional ignorability? Positivity?

**Answer**

SUTVA: it is reasonable to think that the "no interaction" part holds (except if units are related; if a member of your family decides to quit smoking and eats much more, you will probably eat much more as well and may gain weight). The consistency part (only one version of treatment) is harder to assume, since people may have quit at any time during the 10 years between the two medical visits (some people may have quit for ten years, others for only two months).

Strong ignorability: no, it is not reasonable to think that strong ignorability holds. The experiment is not randomized and several confounding variables may exist. Socio-economic status is a possible example: richer people are more prone to quit smoking and also less prone to gain weight.

Conditional ignorability: we may argue that the principal confounders are available in the dataset (age, sex, weight at baseline, income, etc.), so that conditional ignorability (with respect to all the covariates) may hold. We can still think of several missing confounders. One example is the number of pregnancies within the ten years, since pregnant women are more prone to quit smoking, but also to gain weight.

Positivity: we could not think of a reason why positivity would fail: everyone plausibly has a non-zero probability of quitting smoking and of not quitting. Can you think of one?
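
One common way to probe positivity empirically is to estimate propensity scores and compare their range in the treated and untreated groups. The sketch below is only an illustration on synthetic data (not the NHEFS data); the names `X`, `t`, and `ps` and the logistic propensity model are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))  # synthetic covariates
p_true = 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.3 * X[:, 1] - 1.0)))
t = rng.binomial(1, p_true)  # synthetic treatment assignment

# Estimated propensity score P(T=1 | X) for every unit
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]

# Compare the range of estimated scores in each treatment group
for group in (0, 1):
    scores = ps[t == group]
    print(f"group {group}: min={scores.min():.3f}, max={scores.max():.3f}")
```

Scores piling up near 0 or 1 in one group would suggest that some covariate profiles almost never (or almost always) receive the treatment, which is a warning sign for positivity.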

### 3.  Data preprocessing.  Remove rows with unknown alcoholfreq and unknown marital status. One-hot encode categorical variables

In [None]:
#Filter rows to discard unknown alcohol consumption and marital status
resources = resources[(resources.alcoholfreq != 5) & (resources.marital != 8)]

#One-hot encode categorical variables.
#You may use the pandas function pd.get_dummies
categ_var = ["active","alcoholfreq","income","marital","sex",]
df_preprocessed  = pd.concat([
    resources.drop(columns = categ_var), # dataset without the categorical features
    pd.get_dummies(resources[categ_var], columns=categ_var, drop_first=False) # categorical features one-hot encoded 
], axis=1)
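
To see what `pd.get_dummies` produces, here is a minimal sketch on a toy dataframe (the dataframe `toy` is made up for illustration and is not part of the recitation data). With `drop_first=False`, each category gets its own indicator column, so the indicators of one variable sum to 1 in every row.

```python
import pandas as pd

# Toy data: one numeric column and one categorical column coded as integers
toy = pd.DataFrame({"age": [30, 45, 52], "active": [0, 2, 1]})

# One indicator column per category of "active"; "age" is left untouched
encoded = pd.get_dummies(toy, columns=["active"], drop_first=False)
print(encoded.columns.tolist())
# ['age', 'active_0', 'active_1', 'active_2']
```

Because the indicators sum to 1, they are collinear with the intercept of a linear model. Predictions are unaffected, but individual coefficients of the dummies should not be interpreted; `drop_first=True` would remove this redundancy.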

## Exercise 2: compute difference in means

In [None]:
diff_mean = np.mean(df_preprocessed.wt82_71[df_preprocessed.qsmk == 1])-np.mean(df_preprocessed.wt82_71[df_preprocessed.qsmk == 0])
print(f"The difference in means is {diff_mean.round(2)}.")

The difference in means is 1.71.


What can you conclude?

**Answer**

On average, people who quit smoking gained 1.71 kg more than those who did not. However, the identifiability assumptions do not hold without adjustment (especially ignorability), so this association cannot be interpreted as a causal effect.
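
The gap between an association and a causal effect can be illustrated with a small simulation. This is a sketch on synthetic data, with a made-up binary confounder `z` and a true effect fixed at 2.0; none of these numbers come from the NHEFS data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 20_000
z = rng.binomial(1, 0.5, size=n)                 # binary confounder
t = rng.binomial(1, np.where(z == 1, 0.7, 0.2))  # z makes treatment more likely
y = 2.0 * t - 3.0 * z + rng.normal(size=n)       # true effect of t is 2.0

# Unadjusted comparison: biased because treated units have z=1 more often
naive = y[t == 1].mean() - y[t == 0].mean()

# Adjusting for z in a linear outcome model recovers the effect
adjusted = LinearRegression().fit(np.column_stack([t, z]), y).coef_[0]

print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")
```

Here the confounder lowers the outcome and raises the chance of treatment, so the naive difference underestimates the effect. The sign of the bias depends on the data-generating process and could go either way in practice.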

## Exercise 3: S-learner

In this exercise, we will implement the S-learner (also called the g-formula) to estimate the ATE and use the bootstrap to derive a 95% confidence interval around our estimate.
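
As a reminder, the S-learner estimate can be written as follows. The notation is introduced here for illustration and does not come from the original text: $\hat{\mu}(x, t)$ denotes a single outcome model fitted on the covariates $X$ and the treatment $T$ together, and $n$ is the number of units.

$$
\hat{\tau}_S = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{\mu}(X_i, 1) - \hat{\mu}(X_i, 0) \right]
$$

Each unit contributes the difference between its predicted outcome with the treatment set to 1 and with the treatment set to 0, whatever treatment it actually received.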

#### 1. Implement S-learner

In [None]:
#Fit a model for the outcome
explanatory_col = np.setdiff1d(df_preprocessed.columns, ["wt82_71","qsmk"]) #List of covariates.
#Fit a linear regression on covariates + treatment to infer outcome
reg = LinearRegression().fit(df_preprocessed[list(explanatory_col)+ ["qsmk"]], df_preprocessed["wt82_71"])

In [None]:
#Create two artificial columns in df_preprocessed: one with only ones (all_treated) and one with only zeros (all_control)
df_preprocessed = df_preprocessed.assign(all_treated = 1, all_control = 0)

In [None]:
#Infer the two potential outcomes for each unit using your outcome model
#The artificial column is renamed to qsmk so that feature names match those seen at fit time
X_control = df_preprocessed[list(explanatory_col)+ ["all_control"]].rename(columns={"all_control": "qsmk"})
X_treated = df_preprocessed[list(explanatory_col)+ ["all_treated"]].rename(columns={"all_treated": "qsmk"})
df_preprocessed = df_preprocessed.assign(
    y0 = reg.predict(X_control),
    y1 = reg.predict(X_treated)
)
#check that everything is fine
df_preprocessed.head()


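| | age | cholesterol | pregnancies | smokeyrs | wt71 | wt82_71 | qsmk | active_0 | active_1 | active_2 | ... | marital_2 | marital_3 | marital_4 | marital_5 | marital_6 | sex_1 | all_treated | all_control | y0 | y1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | 157 | 2 | 26 | 56.81 | 9.414486 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 5.578221 | 8.232636 |
| 1 | 43 | 212 | 1 | 21 | 99.0 | 4.41906 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0.31867 | 2.973085 |
| 2 | 56 | 205 | 1 | 39 | 63.05 | -4.082992 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | -0.508843 | 2.145572 |
| 3 | 29 | 166 | 2 | 9 | 58.74 | 0.227008 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 4.174878 | 6.829293 |
| 4 | 54 | 268 | 2 | 19 | 61.58 | 0.562155 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1.134094 | 3.78851 |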


In [None]:
#Compute an estimate of ATE using S-learner method
ate_s_learner = np.mean(df_preprocessed.y1 - df_preprocessed.y0)
print(f"An estimate of ATE computed by S-learner is {ate_s_learner.round(2)}")

An estimate of ATE computed by S-learner is 2.65
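
In the rows displayed by `head()`, `y1 - y0` is the same for every unit (for instance 8.232636 - 5.578221 = 2.654415 and 2.973085 - 0.31867 = 2.654415). This is expected: with a linear model that has no treatment-covariate interaction, the predicted difference is exactly the coefficient of the treatment. The sketch below checks this on synthetic data; the dataframe `toy` and its coefficients are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 500
toy = pd.DataFrame({"x": rng.normal(size=n), "t": rng.binomial(1, 0.3, size=n)})
toy["y"] = 1.0 + 0.5 * toy.x + 2.0 * toy.t + rng.normal(size=n)

# Single linear outcome model on covariate + treatment (S-learner)
model = LinearRegression().fit(toy[["x", "t"]], toy["y"])

# Predicted potential outcomes: treatment column set to 1, then to 0
y1 = model.predict(toy[["x", "t"]].assign(t=1))
y0 = model.predict(toy[["x", "t"]].assign(t=0))

ate_plugin = np.mean(y1 - y0)
coef_t = model.coef_[1]  # coefficient of the treatment column "t"
print(np.isclose(ate_plugin, coef_t))
```

With a more flexible outcome model (interactions, trees, and so on) the individual differences would vary across units, and averaging the predictions would no longer reduce to reading one coefficient.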


#### 2. Use bootstrap to compute a 95% confidence interval around your estimate of ATE

In [None]:
%%capture --no-display
#The cell magic above captures the cell output (including warnings); it must stay on the first line

N_boot = 1000 #Number of bootstrap repetitions
ate_s_learner_boot = np.empty(N_boot) #Numpy array to store bootstrapped ATE estimates
n = df_preprocessed.shape[0] #Number of rows in dataframe
features = list(explanatory_col) + ["qsmk"] #Columns used to fit the outcome model

for i in range(N_boot):

    #Simulate bootstrapped dataset (rows drawn with replacement)
    idx_boot = rg.choice(n, n, replace=True)
    sim_boot = df_preprocessed.iloc[idx_boot]

    #Compute ATE on bootstrapped dataset
    #Fit linear model (named reg_boot so that the full-sample model reg is not overwritten)
    reg_boot = LinearRegression().fit(sim_boot[features], sim_boot["wt82_71"])
    #Apply the model with qsmk set to 0, then to 1, to infer potential outcomes
    y0_boot = reg_boot.predict(sim_boot[features].assign(qsmk=0))
    y1_boot = reg_boot.predict(sim_boot[features].assign(qsmk=1))
    #Estimate ATE and add it to the array
    ate_s_learner_boot[i] = np.mean(y1_boot - y0_boot)

In [None]:
#Compute a 95% confidence interval around your ATE estimate
print([ate_s_learner - sps.norm.ppf(0.975) * np.std(ate_s_learner_boot),
       ate_s_learner + sps.norm.ppf(0.975) * np.std(ate_s_learner_boot)])

[1.1210870996224929, 4.1877434132997]
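
The interval above uses a normal approximation: estimate ± 1.96 × bootstrap standard deviation. A common alternative is the percentile interval, which reads the 2.5% and 97.5% quantiles of the bootstrap estimates directly. The sketch below compares the two on synthetic data, with the sample mean standing in for the ATE estimator; the `sample` array and its parameters are made up for illustration.

```python
import numpy as np
import scipy.stats as sps

rng = np.random.default_rng(3)
sample = rng.normal(loc=2.0, scale=8.0, size=600)  # synthetic data, not NHEFS
estimate = sample.mean()

# Bootstrap distribution of the estimator (here, the mean)
boot = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])

# Normal-approximation interval, as in the cell above
z = sps.norm.ppf(0.975)
se_boot = boot.std(ddof=1)
ci_normal = (estimate - z * se_boot, estimate + z * se_boot)

# Percentile interval: quantiles of the bootstrap estimates
ci_percentile = tuple(np.percentile(boot, [2.5, 97.5]))

print(ci_normal, ci_percentile)
```

When the bootstrap distribution is roughly symmetric and bell-shaped, the two intervals are close; a large discrepancy would suggest that the normal approximation is questionable.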


#### 3. What do you conclude?

**Answer**

Under conditional ignorability, quitting smoking significantly increases weight: the 95% confidence interval [1.12, 4.19] excludes 0. Comparing our ATE estimate (2.65) with the difference in means (1.71), we see that confounding bias led the unadjusted comparison to underestimate the causal effect.

## Exercise 4 : T-learner

In this exercise, we will implement the T-learner to estimate the ATE and use the bootstrap to derive a 95% confidence interval around our estimate.
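
As a reminder, the T-learner estimate can be written as follows. The notation is introduced here for illustration and does not come from the original text: $\hat{\mu}_1$ is an outcome model fitted only on treated units ($T_i = 1$), $\hat{\mu}_0$ is fitted only on untreated units ($T_i = 0$), and both are then applied to all $n$ units.

$$
\hat{\tau}_T = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) \right]
$$

Unlike the S-learner, the treatment is not a feature of either model: it only decides which units are used to fit each of them.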

#### 1. Implement T-learner

In [None]:
#Fit two models : one for the potential outcome under treatment and one for the potential outcome under control.
reg0 = LinearRegression().fit(df_preprocessed[df_preprocessed.qsmk==0][list(explanatory_col)],
                              df_preprocessed[df_preprocessed.qsmk==0]["wt82_71"])
reg1 = LinearRegression().fit(df_preprocessed[df_preprocessed.qsmk==1][list(explanatory_col)],
                              df_preprocessed[df_preprocessed.qsmk==1]["wt82_71"])

In [None]:
#Apply the two models to estimate both potential outcomes for all units
df_preprocessed = df_preprocessed.assign(
    y0_t = reg0.predict(df_preprocessed[list(explanatory_col)]),
    y1_t = reg1.predict(df_preprocessed[list(explanatory_col)])
)

In [None]:
#Compute ATE T-learner 
ate_t_learner = np.mean(df_preprocessed.y1_t - df_preprocessed.y0_t)
print(f"An estimate of ATE computed by T-learner is {ate_t_learner.round(2)}")

An estimate of ATE computed by T-learner is 2.67


#### 2. Use bootstrap to compute a 95% confidence interval around your estimate of ATE

In [None]:
%%capture --no-display
#The cell magic above captures the cell output (including warnings); it must stay on the first line

N_boot = 1000 #Number of bootstrap repetitions
ate_t_learner_boot = np.empty(N_boot) #Numpy array to store bootstrapped ATE estimates
n = df_preprocessed.shape[0] #Number of rows in dataframe

for i in range(N_boot):

    #Simulate bootstrapped dataset (rows drawn with replacement)
    idx_boot = rg.choice(n, n, replace=True)
    sim_boot = df_preprocessed.iloc[idx_boot]

    #Compute ATE on bootstrapped dataset by T-learner
    #Fit two models (named *_boot so that the full-sample models are not overwritten)
    reg0_boot = LinearRegression().fit(sim_boot[sim_boot.qsmk==0][list(explanatory_col)],
                                       sim_boot[sim_boot.qsmk==0]["wt82_71"])
    reg1_boot = LinearRegression().fit(sim_boot[sim_boot.qsmk==1][list(explanatory_col)],
                                       sim_boot[sim_boot.qsmk==1]["wt82_71"])
    #Apply the two models to infer potential outcomes for all units
    sim_boot = sim_boot.assign(
        y0_t = reg0_boot.predict(sim_boot[list(explanatory_col)]),
        y1_t = reg1_boot.predict(sim_boot[list(explanatory_col)])
    )

    #Estimate ATE and add it to the array
    ate_t_learner_boot[i] = np.mean(sim_boot.y1_t - sim_boot.y0_t)



In [None]:
#Compute a 95% confidence interval around your ATE estimate
print([ate_t_learner - sps.norm.ppf(0.975) * np.std(ate_t_learner_boot),
       ate_t_learner + sps.norm.ppf(0.975) * np.std(ate_t_learner_boot)])

[0.9071610580149696, 4.42367199487407]


#### 3. What do you conclude?

**Answer**

We get almost the same estimate as with the S-learner (2.67 *versus* 2.65), so we again conclude that quitting smoking significantly increases weight. The confidence interval is wider than for the S-learner ([0.91, 4.42] *versus* [1.12, 4.19]), which may be explained by the fact that the model is more complex (it has more degrees of freedom).
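
The "more degrees of freedom" remark can be made concrete: with linear regressions, a T-learner gives the same ATE estimate as an S-learner whose single model includes the treatment, the covariates, and all treatment-covariate interactions. The sketch below checks this on synthetic data; the helper `design` and the simulated coefficients are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 800
X = rng.normal(size=(n, 3))  # synthetic covariates
t = rng.binomial(1, 0.3, size=n)
y = X @ np.array([1.0, -0.5, 0.2]) + t * (2.0 + X[:, 0]) + rng.normal(size=n)

# T-learner: one linear model per treatment arm, both applied to all units
m0 = LinearRegression().fit(X[t == 0], y[t == 0])
m1 = LinearRegression().fit(X[t == 1], y[t == 1])
ate_t = np.mean(m1.predict(X) - m0.predict(X))

def design(X, t):
    """Hypothetical helper: covariates, treatment, and their interactions."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    return np.hstack([X, t, X * t])

# S-learner on the interacted design: same number of free parameters (8)
s = LinearRegression().fit(design(X, t), y)
ate_s_inter = np.mean(
    s.predict(design(X, np.ones(n))) - s.predict(design(X, np.zeros(n)))
)

print(np.isclose(ate_t, ate_s_inter))
```

The S-learner used in Exercise 3 has no interaction terms, so it estimates roughly half as many coefficients as the T-learner. This is one plausible reason for its narrower bootstrap interval.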