<a href="https://colab.research.google.com/github/sofials2002/SOFIA/blob/master/CUPED.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regression Adjustment and CUPED

In this example, we use Jonathan Roth's DGP with heterogeneous effects. You are a data scientist at Udemy studying the effect of earning a professional development certificate $(D)$ on earnings $(Y)$. You randomly assign a sample of individuals to receive the certificate or not. Let $Z$ denote the number of online courses a person has taken in the past and $Y_{t-1}$ their earnings last year.

Suppose that taking online courses causes lower earnings $Y(0)$ in jobs that don't require any certificates, but higher earnings $Y(1)$ in jobs that do require certificates.


In [1]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

The simulated data looks like this:

In [2]:
# Sample size
n = 500

# Number of online courses
Z = rng.normal(20, 10, size=n)
Z = np.where(Z < 0, 0, Z)  # truncate Z to be non-negative

# Earnings before experiment
Ypre = rng.normal(60000, 3000, size=n)

# Potential earnings
Y0 = -500*Z + Ypre + rng.normal(5000, 1000, size=n)
Y1 = 500*Z + 1.01*Ypre + rng.normal(5000, 1000, size=n)

# Random treatment and observed earnings
D = rng.binomial(1, .2, size=n)  # only 20% get treated
Y = Y1 * D + Y0 * (1 - D)

# Available data
data = pd.DataFrame({'Y': Y, 'D': D, 'Z': Z, 'Ypre': Ypre}).round(0).astype(int)
data.head()

| | Y | D | Z | Ypre |
|---|---|---|---|---|
| 0 | 57509 | 0 | 23 | 64092 |
| 1 | 62156 | 0 | 10 | 62686 |
| 2 | 48675 | 0 | 28 | 57842 |
| 3 | 46424 | 0 | 29 | 55492 |
| 4 | 55865 | 0 | 0 | 51106 |
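
Because the data are simulated, both potential outcomes are known for every individual, so the quantity the regressions in the following sections are trying to estimate can be computed directly. The sketch below redraws the same DGP in a self-contained way and reports the average effect on log earnings, which is the scale used by those regressions. The helper name `simulate`, the large sample size, and the seed are assumptions for illustration, so the draws will not match the table above.

```python
import numpy as np

def simulate(n, rng):
    """Draw one sample from the DGP described above (hypothetical helper)."""
    Z = np.clip(rng.normal(20, 10, size=n), 0, None)             # online courses, non-negative
    Ypre = rng.normal(60000, 3000, size=n)                       # earnings before the experiment
    Y0 = -500 * Z + Ypre + rng.normal(5000, 1000, size=n)        # untreated potential earnings
    Y1 = 500 * Z + 1.01 * Ypre + rng.normal(5000, 1000, size=n)  # treated potential earnings
    D = rng.binomial(1, 0.2, size=n)                             # 20% treated at random
    return Z, Ypre, Y0, Y1, D

rng = np.random.default_rng(0)  # illustrative seed, not the notebook's
Z, Ypre, Y0, Y1, D = simulate(100_000, rng)

# Unit-level effects on log earnings; observable only because the data are simulated
tau_i = np.log(Y1) - np.log(Y0)
ate_log = tau_i.mean()

print("Oracle ATE on log earnings:", round(ate_log, 4))
print("Std. dev. of unit-level effects:", round(tau_i.std(), 4))
```

A nonzero spread of `tau_i` is what "heterogeneous effects" means in this DGP: the effect of the certificate differs across individuals, mainly through $Z$.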


Descriptive statistics:

In [6]:
#data.describe().round(0).astype(int)
data.describe()  # the mean of D is close to 0.2 because only about 20% of individuals are treated

| | Y | D | Z | Ypre |
|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 58852.782 | 0.194 | 19.886 | 59866.03 |
| std | 9700.152094 | 0.395825 | 9.517256 | 3055.138904 |
| min | 37850.0 | 0.0 | 0.0 | 49055.0 |
| 25% | 52452.5 | 0.0 | 13.0 | 57824.75 |
| 50% | 56796.5 | 0.0 | 20.0 | 60026.0 |
| 75% | 63446.25 | 0.0 | 26.0 | 61788.25 |
| max | 90422.0 | 1.0 | 49.0 | 69537.0 |


## Regression Adjustment

Classical 2-sample approach, no adjustment (CL)

In [8]:
CL = smf.ols("np.log(Y) ~ D", data=data).fit(cov_type='HC1')
print(CL.summary().tables[1])  # coefficient table with HC1 (heteroskedasticity-robust) standard errors

                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     10.9100      0.005   2027.965      0.000      10.899      10.921
D              0.3085      0.009     33.105      0.000       0.290       0.327
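
With a single binary regressor, the OLS coefficient on `D` is exactly the difference in mean log earnings between treated and untreated individuals, which is why this specification is the classical two-sample comparison. A minimal check on synthetic data drawn from the same DGP follows; it uses NumPy least squares instead of statsmodels, and the seed and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed
n = 500
Z = np.clip(rng.normal(20, 10, size=n), 0, None)
Ypre = rng.normal(60000, 3000, size=n)
Y0 = -500 * Z + Ypre + rng.normal(5000, 1000, size=n)
Y1 = 500 * Z + 1.01 * Ypre + rng.normal(5000, 1000, size=n)
D = rng.binomial(1, 0.2, size=n)
logY = np.log(np.where(D == 1, Y1, Y0))  # observed log earnings

# OLS of log(Y) on a constant and D
X = np.column_stack([np.ones(n), D])
beta, *_ = np.linalg.lstsq(X, logY, rcond=None)

# Difference in mean log earnings between the two arms
diff_means = logY[D == 1].mean() - logY[D == 0].mean()

print("OLS coefficient on D:", beta[1])
print("Difference in means: ", diff_means)
```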


Classical linear regression adjustment (CRA)

In [9]:
# Same as CL, but with the covariates Z and log(Ypre) added as controls
CRA = smf.ols("np.log(Y) ~ D + Z + np.log(Ypre)", data=data).fit(cov_type='HC1')
print(CRA.summary().tables[1])
# The goal of adjustment is a more precise estimate, i.e. a smaller standard error on D.
# CRA can have a larger standard error than CL; the third method (IRA, below) is asymptotically at least as precise as CL.

                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -0.7023      0.573     -1.225      0.221      -1.826       0.422
D                0.3213      0.012     26.618      0.000       0.298       0.345
Z               -0.0065      0.000    -14.583      0.000      -0.007      -0.006
np.log(Ypre)     1.0673      0.052     20.504      0.000       0.965       1.169


Interactive regression adjustment (IRA)

In [11]:
# Demean Z and log(Ypre) so that the coefficient on D is the average effect at the covariate means
data['Z_dm'] = data['Z'] - data['Z'].mean()
data['Ypre_dm'] = np.log(data['Ypre']) - np.log(data['Ypre']).mean()

# Interactive regression adjustment (IRA)
IRA = smf.ols("np.log(Y) ~ D + Z_dm + Z_dm*D + Ypre_dm + Ypre_dm*D", data=data).fit(cov_type='HC1')  # fully interacted specification
print(IRA.summary().tables[1])

                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     10.9075      0.001   1.03e+04      0.000      10.905      10.910
D              0.3171      0.002    185.078      0.000       0.314       0.320
Z_dm          -0.0092      0.000    -71.200      0.000      -0.009      -0.009
Z_dm:D         0.0159      0.000     86.739      0.000       0.016       0.016
Ypre_dm        1.0601      0.023     46.578      0.000       1.016       1.105
Ypre_dm:D     -0.3225      0.040     -7.964      0.000      -0.402      -0.243
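
Demeaning is what makes the coefficient on `D` interpretable as an average effect: in the interacted regression, that coefficient is the treated-minus-control gap at `Z_dm = 0` and `Ypre_dm = 0`, that is, at the sample means of the covariates. Equivalently, it equals the full-sample average of the difference between predictions from two separate regressions, one fit on each arm. The sketch below checks this equivalence on synthetic data from the same DGP, using NumPy least squares rather than statsmodels; the helper `ols` and the seed are assumptions for illustration.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients (hypothetical helper)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)  # illustrative seed
n = 500
Z = np.clip(rng.normal(20, 10, size=n), 0, None)
Ypre = rng.normal(60000, 3000, size=n)
Y0 = -500 * Z + Ypre + rng.normal(5000, 1000, size=n)
Y1 = 500 * Z + 1.01 * Ypre + rng.normal(5000, 1000, size=n)
D = rng.binomial(1, 0.2, size=n)
logY = np.log(np.where(D == 1, Y1, Y0))

# Demeaned covariates, mirroring Z_dm and Ypre_dm
W = np.column_stack([Z - Z.mean(), np.log(Ypre) - np.log(Ypre).mean()])
ones = np.ones(n)

# (a) Fully interacted regression: log(Y) ~ 1 + D + W + D*W
X_ira = np.column_stack([ones, D, W, D[:, None] * W])
tau_ira = ols(X_ira, logY)[1]

# (b) Separate regressions by arm, then average the predicted gap over everyone
Xw = np.column_stack([ones, W])
b1 = ols(Xw[D == 1], logY[D == 1])
b0 = ols(Xw[D == 0], logY[D == 0])
tau_imputed = (Xw @ b1 - Xw @ b0).mean()

print("Coefficient on D (interacted regression):", tau_ira)
print("Average imputed effect (two regressions):", tau_imputed)
```

Without demeaning, the coefficient on `D` would instead measure the gap at `Z = 0` and `log(Ypre) = 0`, a point far outside the data.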


Let's compare the standard errors of the coefficient on `D`:

In [12]:
print('CL se:', CL.bse['D'].round(5))
print('CRA se:', CRA.bse['D'].round(5))
print('IRA se:', IRA.bse['D'].round(5))

CL se: 0.00932
CRA se: 0.01207
IRA se: 0.00171


Observe that CRA delivers a less efficient estimate than CL here (a possibility pointed out by Freedman), whereas IRA delivers a more efficient one (as shown by Lin). For CRA to be reliably more efficient than CL, the linear model must correctly describe the conditional expectation function of $Y$ given $D$ and the covariates $X \equiv [Z, Y_{t-1}]$. That is not the case here: the effect of the certificate varies with $Z$, so the true conditional expectation includes interactions between $D$ and the covariates.
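
The standard errors above come from a single sample. One way to probe the efficiency ranking is a small Monte Carlo exercise: redraw the sample many times from the same DGP and compare the spread of the three estimators across replications. The sketch below does this with NumPy least squares; the number of replications, the seed, and the helper names are assumptions for illustration. Because the covariates are redrawn in every replication, the spreads need not coincide with the HC1 standard errors reported above.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients (hypothetical helper)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def one_experiment(rng, n=500, p=0.2):
    """Simulate one experiment; return the CL, CRA and IRA coefficients on D."""
    Z = np.clip(rng.normal(20, 10, size=n), 0, None)
    Ypre = rng.normal(60000, 3000, size=n)
    Y0 = -500 * Z + Ypre + rng.normal(5000, 1000, size=n)
    Y1 = 500 * Z + 1.01 * Ypre + rng.normal(5000, 1000, size=n)
    D = rng.binomial(1, p, size=n)
    logY = np.log(np.where(D == 1, Y1, Y0))

    W = np.column_stack([Z, np.log(Ypre)])
    W = W - W.mean(axis=0)  # demeaning only changes the IRA coefficient on D
    ones = np.ones(n)

    cl = ols(np.column_stack([ones, D]), logY)[1]                      # no adjustment
    cra = ols(np.column_stack([ones, D, W]), logY)[1]                  # additive adjustment
    ira = ols(np.column_stack([ones, D, W, D[:, None] * W]), logY)[1]  # interacted adjustment
    return cl, cra, ira

rng = np.random.default_rng(3)  # illustrative seed
draws = np.array([one_experiment(rng) for _ in range(500)])

# Spread of each estimator across replications
sd_cl, sd_cra, sd_ira = draws.std(axis=0, ddof=1)
print(f"CL  sd across replications: {sd_cl:.5f}")
print(f"CRA sd across replications: {sd_cra:.5f}")
print(f"IRA sd across replications: {sd_ira:.5f}")
```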

## CUPED: Controlled-Experiment using Pre-Experiment Data

This is a very popular technique in business settings to increase the power of RCTs.

For a recent perspective on CUPED, see
- [A New Look at CUPED in 2023](https://arxiv.org/pdf/2312.02935)
- [Powering Experiments with CUPED](https://towardsdatascience.com/powering-experiments-with-cuped-and-double-machine-learning-34dc2f3d3284)
- [Understanding CUPED](https://matteocourthoud.github.io/post/cuped/).

Steps to implement CUPED:
1. Regress $Y$ on $X \equiv [Z, Y_{t-1}]$ and obtain the residuals $\hat{Y}_{\text{cuped}} = Y - \hat{\beta}X$.
2. Regress $\hat{Y}_{\text{cuped}}$ on $D$; the coefficient on $D$ is the estimated treatment effect.

However, this implementation might not work well here because treatment effects are heterogeneous.
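
To see why, consider a heuristic sketch with a single covariate $X$ whose variance is the same in both arms. Write $p$ for the treated share and $\beta_1$, $\beta_0$ for the slopes of $Y(1)$ and $Y(0)$ on $X$; this notation is introduced here for illustration and is not part of the original notebook. The second CUPED step is a difference in means of $Y - bX$, so any estimator of this type can be written as

$$\hat{\tau}(b) = (\bar{Y}_1 - \bar{Y}_0) - b\,(\bar{X}_1 - \bar{X}_0),$$

with approximate variance

$$\operatorname{Var}\big(\hat{\tau}(b)\big) \approx \frac{\operatorname{Var}\big(Y(1) - bX\big)}{np} + \frac{\operatorname{Var}\big(Y(0) - bX\big)}{n(1-p)}.$$

Setting the derivative with respect to $b$ to zero gives the variance-minimizing coefficient

$$b^{*} = (1-p)\,\beta_1 + p\,\beta_0,$$

whereas the pooled regression of $Y$ on $X$ in step 1 recovers approximately

$$b_{\text{pooled}} = p\,\beta_1 + (1-p)\,\beta_0.$$

The two coincide when $\beta_1 = \beta_0$ (homogeneous effects) or when $p = 1/2$. With $p = 0.2$ and slopes on $Z$ of opposite sign, as in this DGP, the pooled coefficient is far from $b^{*}$, so the adjustment can add noise instead of removing it.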

In [None]:
# Step 1: residualize log(Y) on the demeaned covariates
data['Y_tilde'] = smf.ols("np.log(Y) ~ Z_dm + Ypre_dm", data=data).fit().resid
# Step 2: regress the residuals on D
cuped = smf.ols("Y_tilde ~ D", data=data).fit(cov_type='HC1')
cuped.summary().tables[1]

                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0619      0.002    -34.524      0.000      -0.065      -0.058
D              0.3189      0.012     26.337      0.000       0.295       0.343


In [None]:
print("CUPED se:", cuped.bse["D"].round(5))

CUPED se: 0.01211
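
The CUPED standard error is almost identical to the CRA one. A minimal sketch on synthetic data from the same DGP suggests why: both approaches adjust with a single pooled coefficient per covariate, and because `D` is randomized, leaving `D` out of the first-step regression changes those coefficients only slightly. The code uses NumPy least squares rather than statsmodels; the helper `ols` and the seed are assumptions for illustration.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients (hypothetical helper)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(4)  # illustrative seed
n = 500
Z = np.clip(rng.normal(20, 10, size=n), 0, None)
Ypre = rng.normal(60000, 3000, size=n)
Y0 = -500 * Z + Ypre + rng.normal(5000, 1000, size=n)
Y1 = 500 * Z + 1.01 * Ypre + rng.normal(5000, 1000, size=n)
D = rng.binomial(1, 0.2, size=n)
logY = np.log(np.where(D == 1, Y1, Y0))

X = np.column_stack([Z, np.log(Ypre)])
X = X - X.mean(axis=0)  # demeaned covariates
ones = np.ones(n)

# CUPED step 1: residualize log(Y) on the covariates, ignoring D
Xc = np.column_stack([ones, X])
b_cuped = ols(Xc, logY)
resid = logY - Xc @ b_cuped

# CUPED step 2: difference in mean residuals (equals OLS of residuals on D)
tau_cuped = resid[D == 1].mean() - resid[D == 0].mean()

# CRA: one regression of log(Y) on D and the covariates
b_cra = ols(np.column_stack([ones, D, X]), logY)
tau_cra = b_cra[1]

print("CUPED estimate:", tau_cuped, "| pooled slope on Z:", b_cuped[1])
print("CRA estimate:  ", tau_cra, "| pooled slope on Z:", b_cra[2])
```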

