# 2. Causal inference: Simulating potential outcomes

This lab is based on exercise 18.12 from the Gelman book (page 358).

**Exercise**:

In this exercise, we will simulate an intervention study with a
pre-determined average treatment effect. The goal is for you to understand the potential outcome
framework, and the properties of completely randomized experiments through simulation.

The setting for our hypothetical study is a class in which students take two quizzes. After quiz
1 but before quiz 2, the instructor randomly assigns half the class to attend an extra tutoring
session. The other half of the class does not receive any additional help. 

Consider the half of the class that receives tutoring as the treated group. The goal is to estimate the effect of the extra tutoring session on average test scores for the retake of quiz 1. Assume that the stable unit
treatment value assumption is satisfied.

## (a) 

Simulating all observed and potentially observed data (omniscient mode). For this section,
you are omniscient and thus know the potential outcomes for everyone. Simulate a dataset
consistent with the following assumptions.

+ The average treatment effect on all the students equals 5.
+ The population size, N, is 1000.
+ Scores on quiz 1 approximately follow a normal distribution with mean of 65 and standard deviation of 3.
+ The potential outcomes for quiz 2 should be linearly related to the pre-treatment quiz score. In particular they should take the form, 

$$y^0 = \beta_0 + \beta_1 x + 0 + \epsilon_0$$

$$y^1 = \beta_0 + \beta_1 x + \tau + \epsilon_1$$

where the intercept $\beta_0 = 10$ and the slope $\beta_1 = 1.1$. Draw the errors $\epsilon_0$ and $\epsilon_1$
independently from normal distributions with mean 0 and standard deviation 1.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

In [None]:
np.random.seed(10)

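TE = 5    # average treatment effect tau
N = 1000  # population size
x = np.random.normal(65, 3, size=N)  # quiz 1 (pre-treatment) scores

# errors with standard deviation 1, as stated in the assumptions above
noise0 = np.random.normal(0, 1, N)
noise1 = np.random.normal(0, 1, N)

# potential outcomes for quiz 2 without (y0) and with (y1) tutoring
y0 = 10 + 1.1 * x + 0 + noise0
y1 = 10 + 1.1 * x + TE + noise1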

## (b)

Calculating and interpreting average treatment effects (omniscient mode). Answer the
following questions based on the data-generating process or using your simulated data.

+ What is your interpretation of $\tau$?
+ Calculate the sample average treatment effect (SATE) for your simulated dataset.
+ Why is SATE different from $\tau$?
+ How would you interpret the intercept in the data-generating process for $y^0$ and $y^1$?
+ How would you interpret $\beta_1$?
+ Plot the response surface versus x. What does this plot reveal?

In [None]:
# SATE (omniscient mode)

te_sate_0 = np.mean(y1 - y0)
te_sate_0
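
One way to see why SATE differs from $\tau$: subtracting the two potential-outcome equations gives $y^1 - y^0 = \tau + (\epsilon_1 - \epsilon_0)$, so SATE equals $\tau$ plus the sample mean of the error differences. The sketch below checks this numerically on a fresh simulation. The helper name `simulate_potential_outcomes`, the `_d` suffixes, and the use of `np.random.default_rng` instead of the notebook's global seed are illustrative choices, not part of the original lab.

```python
import numpy as np

def simulate_potential_outcomes(n=1000, tau=5.0, seed=0):
    """Hypothetical helper: draw x, the errors, and both potential outcomes."""
    rng = np.random.default_rng(seed)   # local generator, leaves np.random state untouched
    x = rng.normal(65, 3, size=n)       # quiz 1 scores
    eps0 = rng.normal(0, 1, size=n)     # error term of y^0
    eps1 = rng.normal(0, 1, size=n)     # error term of y^1
    y0 = 10 + 1.1 * x + eps0
    y1 = 10 + 1.1 * x + tau + eps1
    return x, eps0, eps1, y0, y1

tau_demo = 5.0
x_d, eps0_d, eps1_d, y0_d, y1_d = simulate_potential_outcomes(tau=tau_demo)

sate_demo = np.mean(y1_d - y0_d)        # sample average treatment effect
noise_gap = np.mean(eps1_d - eps0_d)    # the part of SATE that is pure sampling noise

print("SATE:", sate_demo)
print("tau + mean(eps1 - eps0):", tau_demo + noise_gap)
# eps1 - eps0 has variance 2, so its mean over n units has sd sqrt(2 / n)
print("approx. sd of SATE - tau:", np.sqrt(2 / len(x_d)))
```

The first two printed numbers agree up to floating-point rounding, and the gap between SATE and $\tau$ shrinks at rate $\sqrt{2/N}$ as the population grows.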

In [None]:
plt.scatter(x, y0, s=1)
plt.xlabel('x')
plt.ylabel('y0')

In [None]:
plt.scatter(x, y1, s=1)
plt.xlabel('x')
plt.ylabel('y1')

In [None]:
plt.scatter(y0, y1, s=1)
plt.xlabel('y0')
plt.ylabel('y1')
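
The response surfaces $E[y^0 \mid x]$ and $E[y^1 \mid x]$ can also be summarized numerically. The sketch below, which re-simulates the data with its own local generator, fits a least-squares line to each potential outcome with `np.polyfit` and measures the vertical distance between the two lines at a few quiz 1 scores. The `_demo` variable names and the grid values are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)          # local generator; seed chosen arbitrarily
n_demo = 1000
x_demo = rng.normal(65, 3, size=n_demo)
y0_demo = 10 + 1.1 * x_demo + rng.normal(0, 1, size=n_demo)
y1_demo = 10 + 1.1 * x_demo + 5 + rng.normal(0, 1, size=n_demo)

# least-squares line for each potential outcome; polyfit returns [slope, intercept]
slope0, intercept0 = np.polyfit(x_demo, y0_demo, deg=1)
slope1, intercept1 = np.polyfit(x_demo, y1_demo, deg=1)

# vertical distance between the two fitted lines at a few quiz 1 scores
grid = np.array([60.0, 65.0, 70.0])
gap = np.polyval([slope1, intercept1], grid) - np.polyval([slope0, intercept0], grid)

print("slopes:", slope0, slope1)
print("gap between lines at x = 60, 65, 70:", gap)
```

Both slopes should land near 1.1 and the gap near 5 at every grid point: the two surfaces are parallel, so the treatment effect does not vary with $x$. The fitted lines can be drawn over the scatter plots with `plt.plot(grid, np.polyval([slope0, intercept0], grid))`.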

## (c)

Random assignment (researcher mode). For the remaining parts of this exercise, you are
a mere researcher! Return your goggles of omniscience and use only the observed data
available to the researcher; that is, you do not have access to the counterfactual outcomes for
each student.

Using the same simulated dataset generated above, randomly assign students to treatment
and control groups. Then, create the observed dataset, which will include pre-treatment
scores, treatment assignment, and observed $y$.

In [None]:
np.random.seed(10)

# independent Bernoulli(0.5) assignment: group sizes are only approximately equal
z = np.random.binomial(1, 0.5, size=len(y0))

# observed y: y1 for treated students (z == 1), y0 for controls
y = np.where(z == 1, y1, y0)

data = pd.DataFrame({'y': y, 'x': x, 'z': z})


In [None]:
data
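
The assignment cell above draws each `z` independently with probability 0.5, so the treated group is only approximately half the class. If exactly half of the students should be treated, as the setting describes, one option is to shuffle a fixed vector of labels. This is a minimal sketch with hypothetical names (`z_fixed`, `z_bernoulli`) and its own local generator; it is not the lab's original code.

```python
import numpy as np

rng = np.random.default_rng(2)   # seed chosen arbitrarily for the sketch
n_students = 1000

# start with exactly half ones and half zeros, then shuffle the labels
z_fixed = rng.permutation(np.repeat([1, 0], n_students // 2))

# Bernoulli(0.5) draws for comparison: the treated count varies from run to run
z_bernoulli = rng.binomial(1, 0.5, size=n_students)

print("treated (completely randomized):", z_fixed.sum())
print("treated (Bernoulli):", z_bernoulli.sum())
```

Either scheme makes assignment independent of the potential outcomes. The fixed-size version guarantees equal group sizes, which is convenient later when the same number of observations is removed from each group.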

## (d)

Difference in means (researcher mode).
+ Estimate SATE using a difference in means.
+ Is this estimate close to the true SATE? Divide the difference between SATE and estimated SATE by the standard deviation of the observed outcome $y$.
+ Why is the estimate of SATE different from SATE and $\tau$?

In [None]:
# sate (researcher mode)
te_sate = data[data['z'] == 1]['y'].mean() - data[data['z'] == 0]['y'].mean()
te_sate

In [None]:
(te_sate_0 - te_sate) / np.std(y)
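
A single randomization gives a single estimate, which misses the true SATE by chance. To see the behavior of the difference in means across many hypothetical randomizations of the same students, the potential outcomes can be held fixed while the assignment is redrawn. The sketch below assumes a completely randomized design with exactly half of the units treated; the number of repetitions and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)   # seed chosen arbitrarily for the sketch
n_units = 1000
x_sim = rng.normal(65, 3, size=n_units)
y0_sim = 10 + 1.1 * x_sim + rng.normal(0, 1, size=n_units)
y1_sim = 10 + 1.1 * x_sim + 5 + rng.normal(0, 1, size=n_units)
true_sate = np.mean(y1_sim - y0_sim)       # fixed across all re-randomizations

labels = np.repeat([1, 0], n_units // 2)   # exactly half treated
n_reps = 2000                              # number of hypothetical re-randomizations
estimates = np.empty(n_reps)
for r in range(n_reps):
    z_r = rng.permutation(labels)                    # new random assignment
    y_obs = np.where(z_r == 1, y1_sim, y0_sim)       # only one outcome is observed
    estimates[r] = y_obs[z_r == 1].mean() - y_obs[z_r == 0].mean()

print("true SATE:", true_sate)
print("mean of estimates:", estimates.mean())
print("sd of estimates:", estimates.std())
```

The mean of the estimates sits very close to the true SATE, which illustrates unbiasedness under randomization, while the standard deviation shows how far any one experiment can land from it.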

## (e)

Linear regression (researcher mode).
+ Now use linear regression to estimate SATE from the observed data created above, adding the covariate $x$ as an independent variable.
+ What is gained by estimating the average treatment effect using linear regression instead
of the mean difference estimate from above?
+ What assumptions do we need to make in order to believe this estimate? Given how you
generated the data, do you believe these assumptions have been satisfied?

In [None]:
model = smf.ols(formula="y ~ z + x", data=data).fit()
model.summary()
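
One thing regression adjustment can buy is precision: the pre-treatment score explains most of the variation in quiz 2, so the coefficient on `z` varies less across randomizations than the raw difference in means. The sketch below compares the two estimators over repeated randomizations. It uses `np.linalg.lstsq` as a lightweight stand-in for `smf.ols` to keep the loop fast, and the helper `ols_treatment_coef` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(4)   # seed chosen arbitrarily for the sketch
n_obs = 1000
x_reg = rng.normal(65, 3, size=n_obs)
y0_reg = 10 + 1.1 * x_reg + rng.normal(0, 1, size=n_obs)
y1_reg = 10 + 1.1 * x_reg + 5 + rng.normal(0, 1, size=n_obs)
true_sate_reg = np.mean(y1_reg - y0_reg)

def ols_treatment_coef(y, z, x):
    """Hypothetical helper: coefficient on z from least squares of y on [1, z, x]."""
    design = np.column_stack([np.ones_like(x), z, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

labels_reg = np.repeat([1, 0], n_obs // 2)   # exactly half treated
n_sims = 500
diff_means = np.empty(n_sims)
adjusted = np.empty(n_sims)
for r in range(n_sims):
    z_r = rng.permutation(labels_reg)
    y_obs = np.where(z_r == 1, y1_reg, y0_reg)
    diff_means[r] = y_obs[z_r == 1].mean() - y_obs[z_r == 0].mean()
    adjusted[r] = ols_treatment_coef(y_obs, z_r, x_reg)

print("sd of difference in means:", diff_means.std())
print("sd of regression estimate:", adjusted.std())
```

Both estimators center on the true SATE, but the regression-adjusted one has a noticeably smaller spread because the linear model for $x$ matches the data-generating process.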

## *(f) Task for you:

Deadline: 29.11.2022 12:00. Send your solution by e-mail to **aspestova@hse.ru** in **html** format.

Adding missing values. Now we will add some NAs to our data and see what happens to the TE estimate. We will add an equal number of missing values to the treatment and control groups, but with different properties. **Calculate SATE (with linear regression) and compare it with the true SATE and the estimated SATE** (from task (e)) for the following experiments with NAs:

1. Add 100 NAs at random to each of the treatment and control groups (select the observations to be replaced by NAs separately for the two groups).
2. Select 100 people from the treatment group whose quiz 1 scores (variable x) are *lower* than average and replace them with NAs; do the same for the control group, but select people whose quiz 1 scores are *higher* than average.
3. The same as in step 2, but reversed: select 100 people from the treatment group with quiz 1 scores *higher* than average, and 100 people from the control group with scores *lower* than average.
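
The mechanics of removing observations within a group can be sketched as follows for experiment 1 only. The sketch assumes that "replacing a person with NAs" means setting that person's observed outcome to `NaN`; the helper `blank_out`, the simplified outcome model with a constant effect, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)   # seed chosen arbitrarily for the sketch
n_total = 1000
z_miss = rng.permutation(np.repeat([1, 0], n_total // 2))
x_miss = rng.normal(65, 3, size=n_total)
# simplified observed outcome: constant effect of 5 for treated units
y_miss = 10 + 1.1 * x_miss + 5 * z_miss + rng.normal(0, 1, size=n_total)

def blank_out(y, candidates, n_missing, rng):
    """Hypothetical helper: copy y and set n_missing of the candidate rows to NaN."""
    y_out = y.astype(float).copy()
    drop = rng.choice(candidates, size=n_missing, replace=False)
    y_out[drop] = np.nan
    return y_out

# experiment 1: every row of a group is a candidate for removal
y_na = blank_out(y_miss, np.flatnonzero(z_miss == 1), 100, rng)
y_na = blank_out(y_na, np.flatnonzero(z_miss == 0), 100, rng)

observed = ~np.isnan(y_na)       # rows that remain available for estimation
print("missing in treatment:", np.isnan(y_na[z_miss == 1]).sum())
print("missing in control:", np.isnan(y_na[z_miss == 0]).sum())
print("rows left for the regression:", observed.sum())
```

For experiments 2 and 3 only the candidate set passed to the helper changes; the regression is then fitted on the rows where the outcome is still observed.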


