# PHS564 — Lecture 01 (Student)
## Counterfactuals and causal effects: definitions, estimands, and why individuals are not identifiable

**Goal:** express a causal question as an estimand using potential outcomes and understand why we cannot identify individual causal effects.

**Rules for this notebook**
- Do *not* change the overall structure.
- Only edit cells marked **TODO**.
- Keep your code explicit (no clever one-liners).


### Setup
This notebook uses only: `numpy`, `pandas`, `matplotlib`.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.set_printoptions(precision=3, suppress=True)


## Part A — The counterfactual table (toy example)

We start with a tiny toy population. This is a pedagogical table to make the counterfactual idea concrete.

Notation:
- `A` is treatment (0/1)
- `Y` is observed outcome
- `Y1` is the potential outcome if treated
- `Y0` is the potential outcome if untreated
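
As a reminder of how these four symbols fit together: under the usual consistency assumption for a binary treatment (an assumption stated here for orientation, not something the notebook text spells out), the observed outcome is whichever potential outcome matches the treatment actually received:

\[ Y = A \, Y^1 + (1 - A) \, Y^0 \]

So `Y` equals `Y1` for a treated person and `Y0` for an untreated person; the other potential outcome stays unobserved.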


In [2]:
# Toy population: 8 people with *both* potential outcomes (unobservable in real life)
df_full = pd.DataFrame({
    "id": np.arange(1, 9),
    "Y0": [0, 0, 0, 1, 0, 1, 0, 1],
    "Y1": [1, 0, 1, 1, 1, 1, 0, 1],
})
df_full


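|   | id | Y0 | Y1 |
|---|----|----|----|
| 0 | 1  | 0  | 1  |
| 1 | 2  | 0  | 0  |
| 2 | 3  | 0  | 1  |
| 3 | 4  | 1  | 1  |
| 4 | 5  | 0  | 1  |
| 5 | 6  | 1  | 1  |
| 6 | 7  | 0  | 0  |
| 7 | 8  | 1  | 1  |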


### TODO A1 — Define the **individual causal effect** and explain why it is not identifiable

In one sentence, define the individual causal effect for person *i* using potential outcomes. Then explain why it cannot be identified from observed data.


**Your answer (TODO):**

- Individual causal effect for person *i*: **TODO**
- Why it is not identifiable from observed data: **TODO**


### TODO A2 — Compute the *true* ATE in this toy population

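The average treatment effect (ATE) is \( E[Y^1 - Y^0] \), where \( Y^1 \) and \( Y^0 \) correspond to the columns `Y1` and `Y0`. Compute it directly from `df_full`.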
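
The calculation can be sketched on a different, hypothetical table. The frame `demo` and its values are invented for illustration and are not part of the assignment data. The idea is that the ATE is the mean of the row-wise differences between the two potential-outcome columns:

```python
import pandas as pd

# Hypothetical 4-person table (illustration only, not df_full)
demo = pd.DataFrame({
    "Y0": [0, 1, 0, 0],
    "Y1": [1, 1, 0, 1],
})

# Individual effects can be computed here only because both
# potential outcomes are known for every person
demo_effects = demo["Y1"] - demo["Y0"]

# The ATE is the average of the individual effects
demo_ate = demo_effects.mean()
print("demo ATE =", demo_ate)
```

In real data this calculation is impossible, because each person reveals only one of the two columns.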


In [None]:
# TODO: compute the true ATE from df_full
true_ate = None

print('TRUE ATE =', true_ate)


## Part B — Observed data is missing one counterfactual (the fundamental problem)

Simulate a randomized experiment:
- Assign treatment at random
- Reveal only the corresponding potential outcome

Compare:
- **true** ATE (Part A)
- **estimated** ATE (difference in means in the observed randomized data)


In [None]:
rng = np.random.default_rng(202601)

df_obs = df_full.copy()
df_obs["A"] = rng.integers(0, 2, size=len(df_obs))

# Observed outcome is the potential outcome under the received treatment
df_obs["Y"] = np.where(df_obs["A"] == 1, df_obs["Y1"], df_obs["Y0"])

# Record the counterfactual we do NOT observe (shown for teaching only; never available in real data)
df_obs["Y_missing"] = np.where(df_obs["A"] == 1, df_obs["Y0"], df_obs["Y1"])

df_obs[["id", "A", "Y", "Y_missing"]]


### TODO B1 — Estimate the ATE from the randomized observed data

Compute:
- mean(Y | A=1)
- mean(Y | A=0)
- difference in means
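
The three quantities above can be sketched on a small hypothetical dataset. The frame `demo_obs` and its values are made up for illustration; they are not the notebook's `df_obs`. This is one explicit way to do it, not the only one:

```python
import pandas as pd

# Hypothetical observed data: treatment indicator and observed outcome only
demo_obs = pd.DataFrame({
    "A": [1, 0, 1, 0, 1, 0],
    "Y": [1, 0, 1, 1, 0, 0],
})

# Split the rows by treatment arm
demo_treated = demo_obs[demo_obs["A"] == 1]
demo_untreated = demo_obs[demo_obs["A"] == 0]

# Mean observed outcome in each arm
demo_mean_a1 = demo_treated["Y"].mean()
demo_mean_a0 = demo_untreated["Y"].mean()

# Difference in means
demo_diff = demo_mean_a1 - demo_mean_a0
print("demo difference in means =", demo_diff)
```

Only `A` and `Y` are used, which is exactly what an analyst would have in a real study.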


In [None]:
# TODO: compute the estimated ATE in this randomized sample
mean_y_a1 = None
mean_y_a0 = None
est_ate = None

print("mean(Y|A=1) =", mean_y_a1)
print("mean(Y|A=0) =", mean_y_a0)
print("Estimated ATE =", est_ate)
print("True ATE      =", true_ate)


### TODO B2 — Visualize the sampling variability

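Repeat the randomization many times (e.g., 2000 trials) and plot the distribution of estimated ATEs.
Add a vertical line at the **true** ATE.
With only 8 people, independent coin-flip assignment occasionally puts everyone in the same arm, which leaves one group mean undefined; skip or otherwise handle those trials explicitly.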
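
Below is a minimal sketch of a repeated-randomization loop on a hypothetical 6-person population. The arrays `demo_y0` and `demo_y1`, the seed, and all `demo_*` names are assumptions for illustration. This sketch uses *complete* randomization (exactly 3 of 6 treated, via a shuffled assignment vector) rather than the independent coin flips used elsewhere in this notebook, so no arm is ever empty:

```python
import numpy as np

demo_rng = np.random.default_rng(0)  # hypothetical seed

# Hypothetical population with both potential outcomes known
demo_y0 = np.array([0, 0, 1, 0, 1, 1])
demo_y1 = np.array([1, 0, 1, 1, 1, 1])
demo_true_ate = (demo_y1 - demo_y0).mean()

# Fixed assignment vector: three treated, three untreated
demo_assignment = np.array([1, 1, 1, 0, 0, 0])

n_demo_trials = 2000
demo_estimates = []

for _ in range(n_demo_trials):
    # Shuffle who gets treated; the arm sizes stay fixed
    a = demo_rng.permutation(demo_assignment)

    # Reveal only the potential outcome under the received treatment
    y = np.where(a == 1, demo_y1, demo_y0)

    # Difference in means for this one randomization
    estimate = y[a == 1].mean() - y[a == 0].mean()
    demo_estimates.append(estimate)

demo_estimates = np.array(demo_estimates, dtype=float)
print("mean of estimates =", demo_estimates.mean())
print("demo true ATE     =", demo_true_ate)
```

Individual estimates vary from trial to trial, but their average should sit close to the true value, which is the point of plotting the histogram with a reference line.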


In [None]:
# TODO: simulate many randomizations and plot distribution of estimated ATE
n_trials = 2000
ates = []

# TODO: loop and append ATEs

ates = np.array(ates, dtype=float)

plt.figure()
plt.hist(ates, bins=30)
plt.axvline(true_ate, linewidth=2)
plt.title("Estimated ATE across repeated randomizations")
plt.xlabel("Estimated ATE")
plt.ylabel("Frequency")
plt.show()


## Part C — When randomization fails: confounding (preview)

We *break* randomization: people with higher baseline risk are more likely to receive treatment.
We create a baseline covariate `L` that predicts both `A` and `Y`.


In [None]:
df_conf = df_full.copy()
df_conf["L"] = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# Treatment assignment depends on L (confounding)
p_treat = np.where(df_conf["L"] == 1, 0.85, 0.15)
rng = np.random.default_rng(202602)
df_conf["A"] = (rng.random(len(df_conf)) < p_treat).astype(int)

df_conf["Y"] = np.where(df_conf["A"] == 1, df_conf["Y1"], df_conf["Y0"])
df_conf[["id", "L", "A", "Y"]]


### TODO C1 — Compute the crude association and explain why it is biased

Compute the crude difference in means in `df_conf` and compare to the true ATE.
Explain the bias using **exchangeability**.
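
To make the exchangeability idea concrete, here is a sketch on a hypothetical 6-person table. The frame `demo_conf` and its values are invented for illustration and are not the notebook's `df_conf`. Because both potential outcomes are known in a toy table, we can compare the mean of `Y0` between the treated and the untreated directly. If the groups were exchangeable, these means would be about equal, up to chance:

```python
import pandas as pd

# Hypothetical table: people with Y0 = 1 are treated more often
demo_conf = pd.DataFrame({
    "Y0": [0, 0, 0, 1, 1, 1],
    "Y1": [1, 0, 1, 1, 1, 1],
    "A":  [0, 0, 1, 1, 1, 0],
})

demo_is_treated = demo_conf["A"] == 1

# Mean untreated potential outcome in each arm (knowable only in a toy table)
demo_y0_treated = demo_conf.loc[demo_is_treated, "Y0"].mean()
demo_y0_untreated = demo_conf.loc[~demo_is_treated, "Y0"].mean()
demo_y0_gap = demo_y0_treated - demo_y0_untreated

# Crude comparison: treated people reveal Y1, untreated people reveal Y0
demo_crude = demo_conf.loc[demo_is_treated, "Y1"].mean() - demo_y0_untreated

# True ATE from the full counterfactual table
demo_true = (demo_conf["Y1"] - demo_conf["Y0"]).mean()

print("mean Y0 gap (treated - untreated) =", demo_y0_gap)
print("crude difference in means         =", demo_crude)
print("true ATE                          =", demo_true)
```

In this made-up table the treated group would have had worse outcomes even without treatment, so the crude difference mixes the treatment effect with that baseline gap.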


In [None]:
# TODO: crude estimate under confounding
crude = None
print('Crude diff in means =', crude)
print('True ATE            =', true_ate)


**Your explanation (TODO):**
- **TODO**


## Checklist (before submitting)
- [ ] You computed `true_ate`
- [ ] You computed the randomized `est_ate`
- [ ] You simulated sampling variability and made a histogram
- [ ] You computed the crude estimate under confounding and explained the bias
