## Day 4: Application: Tree-Based Partially Linear Model with Cross-Fitting

Define $X=(V,W)$ and let
$$
Y = \alpha D + \gamma'V + g(W) + \varepsilon
$$
where $\mathbb{E}(\varepsilon|D,X)=0$. Here, $D$ is the treatment of interest and $V$ is a vector of control variables that we assume enter the model linearly and additively, while the remaining controls $W$ enter through an unknown function $g$. We estimate $\alpha$ using Robinson's two-step method, described below.
1. Estimate the conditional expectations $f_y(w) = \mathbb{E}(Y|W=w)$ and $f_{dv}(w)=\mathbb{E}((D,V)|W=w)$ using a random forest (RF), XGBoost (XGB), or XGBoost with row subsampling (XGBs).
2. Estimate $\alpha$ and $\gamma$ jointly by regressing $y_i-\hat{f}_y(w_i)$ on $(d_i,v_i)-\hat{f}_{dv}(w_i)$.
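
To see why Step 2 recovers $\alpha$ and $\gamma$, take the conditional expectation of the model given $W$ and subtract it from the model itself. This short derivation uses only the assumption $\mathbb{E}(\varepsilon|D,X)=0$ stated above, which implies $\mathbb{E}(\varepsilon|W)=0$ by the law of iterated expectations:

$$
\begin{aligned}
\mathbb{E}(Y|W) &= \alpha\,\mathbb{E}(D|W) + \gamma'\,\mathbb{E}(V|W) + g(W), \\
Y - \mathbb{E}(Y|W) &= \alpha\,\bigl(D - \mathbb{E}(D|W)\bigr) + \gamma'\bigl(V - \mathbb{E}(V|W)\bigr) + \varepsilon .
\end{aligned}
$$

The unknown function $g(W)$ cancels, so a linear regression of the residualized outcome on the residualized $(D,V)$ targets $(\alpha,\gamma)$ directly.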

Here $V\in\mathbb{R}^2$ consists of log household income and individual age. To reduce the overfitting bias that can arise when the same observations are used both to fit $\hat{f}$ and to form the residuals, we employ cross-fitting. We split the data into $K$ folds and, for each fold, estimate the conditional expectations in Step 1 using only the data from the other $K-1$ folds. We then use these estimates to compute the residuals ('prediction' errors) for the held-out fold. After collecting the residuals from all folds, we regress the residuals of $y_i$ on the residuals of $(d_i,v_i)$, as in Step 2, to obtain $\hat\alpha$ and $\hat\gamma$.
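
The cross-fitting procedure can be sketched in a few lines. This is a minimal illustration on synthetic data, not the code used to produce the results in this notebook. To keep it dependency-free, it assumes a scalar $W$ and swaps in a Nadaraya-Watson kernel smoother (`kernel_smoother`) in place of RF or XGB; any learner with the same call signature could be passed as `learner`. The function names, the bandwidth, and the data-generating process are all assumptions made for the example.

```python
import numpy as np

def kernel_smoother(w_train, t_train, w_test, h=0.3):
    """Stand-in learner (NOT RF/XGB): Nadaraya-Watson regression of each
    column of t_train on a scalar covariate w, evaluated at w_test."""
    u = (w_test[:, None] - w_train[None, :]) / h
    k = np.exp(-0.5 * u**2)                      # Gaussian kernel, shape (n_test, n_train)
    return (k @ t_train) / k.sum(axis=1, keepdims=True)

def cross_fit_plm(y, d, v, w, n_folds=5, learner=kernel_smoother, seed=0):
    """Cross-fitted Robinson estimator; returns (alpha_hat, gamma_hat)."""
    n = len(y)
    targets = np.column_stack([y, d, v])         # residualize Y, D and V on W
    resid = np.empty_like(targets, dtype=float)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False             # Step 1: fit on the other folds only
        pred = learner(w[train_mask], targets[train_mask], w[test_idx])
        resid[test_idx] = targets[test_idx] - pred
    # Step 2: regress residualized Y on residualized (D, V) using all folds
    coef, *_ = np.linalg.lstsq(resid[:, 1:], resid[:, 0], rcond=None)
    return coef[0], coef[1:]

# Synthetic data with known alpha = 0.5 and gamma = (0.2, -0.1)
rng = np.random.default_rng(42)
n_demo = 2000
w = rng.uniform(-2, 2, n_demo)
v = np.column_stack([np.sin(w) + rng.normal(size=n_demo),
                     w**2 / 2 + rng.normal(size=n_demo)])
d = (0.5 * w + rng.normal(size=n_demo) > 0).astype(float)
y = (0.5 * d + v @ np.array([0.2, -0.1]) + np.cos(2 * w)
     + rng.normal(scale=0.5, size=n_demo))

alpha_hat, gamma_hat = cross_fit_plm(y, d, v, w)
print(alpha_hat, gamma_hat)
```

Because each observation's residual comes from a model that never saw that observation, the final regression is less exposed to the first-stage learner's overfitting.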

In [1]:
# Import data
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from functions import *

import statsmodels.api as sm 
from IPython.display import display, HTML
import plotly
import tensorflow as tf
import random
import sqlite3
import os
seed = 42
X_used, Y = get_data(seed)
n = X_used.shape[0]

grid_size = 50
a_grid = np.linspace(np.min(X_used.age), np.max(X_used.age), grid_size)
w_grid = np.linspace(np.min(X_used.hh_inc), np.max(X_used.hh_inc), grid_size)
xv, wv = np.meshgrid(a_grid, w_grid, indexing='ij')

X_used_cnst = sm.add_constant(X_used)
model = sm.OLS(Y,X_used_cnst)
results = model.fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | happiness | R-squared: | 0.223 |
| Model: | OLS | Adj. R-squared: | 0.222 |
| Method: | Least Squares | F-statistic: | 170.2 |
| Date: | Thu, 22 Aug 2024 | Prob (F-statistic): | 0.0 |
| Time: | 09:51:51 | Log-Likelihood: | -21888.0 |
| No. Observations: | 13635 | AIC: | 43820.0 |
| Df Residuals: | 13611 | BIC: | 44000.0 |
| Df Model: | 23 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.1901 | 0.192 | 11.430 | 0.000 | 1.814 | 2.566 |
| hh_inc | 0.1770 | 0.026 | 6.915 | 0.000 | 0.127 | 0.227 |
| consumption_tot | 0.1801 | 0.035 | 5.180 | 0.000 | 0.112 | 0.248 |
| savings_tot | 0.0026 | 0.000 | 9.819 | 0.000 | 0.002 | 0.003 |
| hh_size | -0.0393 | 0.012 | -3.204 | 0.001 | -0.063 | -0.015 |
| zerotofive | 0.1897 | 0.036 | 5.244 | 0.000 | 0.119 | 0.261 |
| sixtotwenty | 0.0845 | 0.029 | 2.955 | 0.003 | 0.028 | 0.141 |
| grp | -0.0728 | 0.045 | -1.601 | 0.109 | -0.162 | 0.016 |
| home_own | 0.2198 | 0.023 | 9.512 | 0.000 | 0.175 | 0.265 |

| | | | |
|---|---|---|---|
| Omnibus: | 19.046 | Durbin-Watson: | 2.003 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 20.777 |
| Skew: | -0.054 | Prob(JB): | 3.08e-05 |
| Kurtosis: | 3.158 | Cond. No. | 1360.0 |


In [2]:
# Make Database
database_name = 'database_happiness_PL.db'
con = sqlite3.connect(os.path.join('Results', database_name))
cur = con.cursor()

res = cur.execute("""SELECT name FROM sqlite_master WHERE type='table'""")
table_names = res.fetchall()
if ('PL',) not in table_names:  # fetchall() returns a list of 1-tuples
    print("CREATE NEW DATABASE TABLE")
    cur.execute("""CREATE TABLE IF NOT EXISTS PL(
                Method TEXT NOT NULL,
                Model TEXT NOT NULL,
                Parameter_v TEXT NOT NULL,
                Parameter_y TEXT NOT NULL,
                Seed INTEGER NOT NULL,
                Treatment TEXT NOT NULL,
                Control_1 TEXT NOT NULL,
                Control_2 TEXT NOT NULL,
                Value REAL NOT NULL,
                Con_val_1 REAL NOT NULL,
                Con_val_2 REAL NOT NULL,
                PRIMARY KEY (Method, Model, Seed, Treatment, Control_1, Control_2))""")
    con.commit()
else:
    print("DATABASE TABLE ALREADY EXISTS")


DATABASE TABLE ALREADY EXISTS
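
The notebook only reads from the `PL` table; the code that populates it is not shown. As a rough sketch, assuming estimates are written one row at a time, an upsert along the following lines would be compatible with the schema above. The in-memory connection `con_demo`, the helper `save_result`, and the `'n/a'` placeholders for `Parameter_v` and `Parameter_y` are illustrative assumptions; the numeric values are the RF estimates for `zerotofive` reported in the result tables of this notebook.

```python
import sqlite3

# In-memory database so the sketch does not touch the real results file
con_demo = sqlite3.connect(":memory:")
con_demo.execute("""CREATE TABLE IF NOT EXISTS PL(
                    Method TEXT NOT NULL,
                    Model TEXT NOT NULL,
                    Parameter_v TEXT NOT NULL,
                    Parameter_y TEXT NOT NULL,
                    Seed INTEGER NOT NULL,
                    Treatment TEXT NOT NULL,
                    Control_1 TEXT NOT NULL,
                    Control_2 TEXT NOT NULL,
                    Value REAL NOT NULL,
                    Con_val_1 REAL NOT NULL,
                    Con_val_2 REAL NOT NULL,
                    PRIMARY KEY (Method, Model, Seed, Treatment, Control_1, Control_2))""")

def save_result(con, row):
    """Hypothetical helper: insert one estimate, replacing any row with the same key."""
    con.execute("INSERT OR REPLACE INTO PL VALUES (?,?,?,?,?,?,?,?,?,?,?)", row)
    con.commit()

# Parameter_v / Parameter_y are placeholders; the real hyperparameter strings are not shown
row = ("PL", "RF", "n/a", "n/a", 42, "zerotofive", "hh_inc", "age",
       0.1882, 0.1211, -0.0016)
save_result(con_demo, row)
save_result(con_demo, row)  # same primary key: the row is replaced, not duplicated

n_rows = con_demo.execute("SELECT COUNT(*) FROM PL").fetchone()[0]
stored_value = con_demo.execute("SELECT Value FROM PL").fetchone()[0]
con_demo.close()
```

Note that the primary key does not include `Parameter_v` or `Parameter_y`, so under this scheme a rerun with different hyperparameters would overwrite the earlier estimate for the same model, seed, and treatment.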


In [3]:
h = 1  # Bandwidth parameter
weights = np.zeros((n, grid_size))
for i in range(grid_size):
    u = np.abs(X_used['hh_inc'].values - w_grid[i])/h
    val = Gaussian(u)
    weights[:,i] = val/np.sum(val)

characteristic_names = ['zerotofive','sixtotwenty','grp','home_own','gender','isHHH',
                        'live_together','work','get_social_benefit',
                        'got_social_benefit','religion','marriage','health',
                        'exercise','smoke','alcohol']

ATE_names = ['zerotofive','sixtotwenty','grp','home_own','gender','isHHH',
                        'live_together','work','get_social_benefit',
                        'got_social_benefit','religion','marriage','health',
                        'exercise','smoke','alcohol']

plot_names = ['Baby - No Baby', 'Teenager - No Teenager', 'Grandparents - No Grandparents','Homeowner - Lease',
              'Female - Male', 'Head - Not Head', 'Live Together - Separate', 
              'Work - No Work','Receive Social Insurance - Has Never Received', 'Had Received Social Insurance - Has Never Received',
              'Religion - No Religion','Married - Not Married', 'Health',
              'Exercise', 'Smoke - No Smoke', 'Alcohol - No Alcohol']
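
The loop at the top of this cell builds, for each point on the income grid, a set of kernel weights over the observations that sum to one. The helper `Gaussian` comes from the `functions` module, which is not shown here. The sketch below assumes it is the standard normal density and shows an equivalent vectorized computation on toy data; `gaussian_kernel`, `kernel_weights`, and the demo arrays are hypothetical names introduced for illustration.

```python
import numpy as np

def gaussian_kernel(u):
    """Assumed form of `Gaussian`: the standard normal density."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kernel_weights(x, grid, h=1.0):
    """Return an (n, grid_size) array whose column i holds normalized
    kernel weights of every observation around grid[i]."""
    u = np.abs(x[:, None] - grid[None, :]) / h   # scaled distances, shape (n, grid_size)
    k = gaussian_kernel(u)
    return k / k.sum(axis=0, keepdims=True)      # each column sums to one

# Toy stand-in for X_used['hh_inc'].values and w_grid
x_demo = np.array([9.0, 10.0, 10.5, 12.0])
grid_demo = np.linspace(x_demo.min(), x_demo.max(), 3)
weights_demo = kernel_weights(x_demo, grid_demo, h=1.0)
```

A smaller bandwidth `h` concentrates each column's weight on the observations closest to the grid point, while a larger one spreads it more evenly. The normalizing constant of the kernel cancels in the ratio, so the weights do not depend on it.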


In [4]:
## Import results
Model_names = ["RF","XGB","XGBs"]
# Method, Model, Seed, Treatment, Control_1, Control_2

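# Parameterized query: let sqlite3 bind the values instead of formatting them into the string
query = "select * from PL where Method=? and Seed=?"
# query = "select * from PL where Method=?"  # all seeds; then pass params=('PL',)
PL_result = pd.read_sql(query, con, params=('PL', seed))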
PL_result_pivot = PL_result.pivot(index=['Model','Seed'], columns='Treatment', values='Value')
PL_result_pivot_1 = PL_result.pivot(index=['Model','Seed'], columns=['Treatment','Control_1'], values='Con_val_1')
PL_result_pivot_2 = PL_result.pivot(index=['Model','Seed'], columns=['Treatment','Control_2'], values='Con_val_2')

PL_result_pivot_T = PL_result_pivot.T
PL_result_pivot_T = PL_result_pivot_T.reindex(ATE_names)
PL_result_pivot_T.index = plot_names

PL_result_pivot_1_T = PL_result_pivot_1.T
PL_result_pivot_1_T = PL_result_pivot_1_T.reset_index()
PL_result_pivot_1_T.index = PL_result_pivot_1_T.Treatment
PL_result_pivot_1_T = PL_result_pivot_1_T.reindex(ATE_names)
PL_result_pivot_1_T.reset_index(drop=True, inplace=True)

PL_result_pivot_2_T = PL_result_pivot_2.T
PL_result_pivot_2_T = PL_result_pivot_2_T.reset_index()
PL_result_pivot_2_T.index = PL_result_pivot_2_T.Treatment
PL_result_pivot_2_T = PL_result_pivot_2_T.reindex(ATE_names)
PL_result_pivot_2_T.reset_index(drop=True, inplace=True)

The estimated values of $\alpha$ for each treatment are given by

In [5]:
with pd.option_context('display.precision', 4):
    display(PL_result_pivot_T)

| Model | RF | XGB | XGBs |
|---|---|---|---|
| Seed | 42 | 42 | 42 |
| Baby - No Baby | 0.1882 | 0.1609 | 0.1677 |
| Teenager - No Teenager | 0.0553 | 0.0639 | 0.0793 |
| Grandparents - No Grandparents | -0.1184 | -0.0684 | -0.1188 |
| Homeowner - Lease | 0.2041 | 0.211 | 0.2018 |
| Female - Male | 0.1426 | 0.1129 | 0.1124 |
| Head - Not Head | 0.0082 | -0.0377 | -0.0172 |
| Live Together - Separate | -0.1963 | -0.1917 | -0.14 |
| Work - No Work | 0.0255 | 0.0229 | 0.0255 |
| Receive Social Insurance - Has Never Received | 0.2238 | 0.2501 | 0.2157 |
| Had Received Social Insurance - Has Never Received | 0.1138 | 0.0792 | 0.1178 |


The estimated coefficients $\gamma_1$ on log household income are given by

In [6]:
with pd.option_context('display.precision', 4):
    display(PL_result_pivot_1_T)

| Model | Treatment | Control_1 | RF | XGB | XGBs |
|---|---|---|---|---|---|
| Seed | | | 42 | 42 | 42 |
| 0 | zerotofive | hh_inc | 0.1211 | 0.1435 | 0.1486 |
| 1 | sixtotwenty | hh_inc | 0.1353 | 0.1512 | 0.1478 |
| 2 | grp | hh_inc | 0.1326 | 0.1365 | 0.1487 |
| 3 | home_own | hh_inc | 0.1264 | 0.1328 | 0.1474 |
| 4 | gender | hh_inc | 0.1285 | 0.1526 | 0.1478 |
| 5 | isHHH | hh_inc | 0.1353 | 0.1534 | 0.1498 |
| 6 | live_together | hh_inc | 0.129 | 0.1371 | 0.1467 |
| 7 | work | hh_inc | 0.1324 | 0.1393 | 0.1499 |
| 8 | get_social_benefit | hh_inc | 0.1237 | 0.1306 | 0.1496 |
| 9 | got_social_benefit | hh_inc | 0.1275 | 0.1396 | 0.148 |


The estimated coefficients $\gamma_2$ on age are given by

In [7]:
with pd.option_context('display.precision', 4):
    display(PL_result_pivot_2_T)

| Model | Treatment | Control_2 | RF | XGB | XGBs |
|---|---|---|---|---|---|
| Seed | | | 42 | 42 | 42 |
| 0 | zerotofive | age | -0.0016 | -0.0031 | -0.0034 |
| 1 | sixtotwenty | age | -0.002 | -0.0024 | -0.0031 |
| 2 | grp | age | -0.0006 | -0.0027 | -0.0025 |
| 3 | home_own | age | -0.001 | -0.0016 | -0.0022 |
| 4 | gender | age | -0.0003 | -0.0013 | -0.0023 |
| 5 | isHHH | age | -0.001 | -0.0011 | -0.0024 |
| 6 | live_together | age | -0.0002 | -0.0017 | -0.0026 |
| 7 | work | age | -0.0005 | -0.0022 | -0.0027 |
| 8 | get_social_benefit | age | -0.0005 | -0.0018 | -0.0028 |
| 9 | got_social_benefit | age | -0.0003 | -0.0015 | -0.003 |


In [8]:
con.close()