# PHS564 — Lecture 03 (Student)
## Causal effects in observational studies (identifiability and assumptions)

### Learning goals
- Conceptualize observational studies as “imperfect randomized experiments”.
- State the three key identifiability conditions: exchangeability, positivity, and consistency (well-defined interventions).
- Recognize why conditional exchangeability cannot be tested from observed data alone.

### Required reading
- Hernán & Robins, Chapter 3. https://miguelhernan.org/whatifbook


In [None]:
# Colab bootstrap (run this first if you opened from a Colab badge)
# - Clones the repo into /content/PHS564 (if needed)
# - Installs requirements
# - Adds repo to sys.path

from __future__ import annotations

import os
import sys
import subprocess
from pathlib import Path


def _in_colab() -> bool:
    return "google.colab" in sys.modules


if _in_colab():
    REPO_URL = "https://github.com/vafaei-ar/PHS564.git"
    TARGET_DIR = Path("/content/PHS564")

    if not (TARGET_DIR / "requirements.txt").exists():
        print("Cloning course repo into Colab runtime...")
        subprocess.run(["git", "clone", "--depth", "1", REPO_URL, str(TARGET_DIR)], check=True)

    os.chdir(TARGET_DIR)

    print("Installing requirements...")
    subprocess.run([sys.executable, "-m", "pip", "-q", "install", "-r", "requirements.txt"], check=True)

    if str(TARGET_DIR) not in sys.path:
        sys.path.insert(0, str(TARGET_DIR))

    print("✓ Colab setup complete. Now run the rest of the notebook.")
else:
    print("Not running in Colab; skipping Colab bootstrap.")


### Setup

This notebook is designed to run **locally** or in **Google Colab**.

**Colab workflow (recommended):**
1) Clone the course repo (the Colab bootstrap cell at the top of this notebook does this for you; otherwise ask the instructor for the GitHub URL).
2) Install requirements.
3) Run the notebook top-to-bottom.

> If you opened this notebook directly from GitHub in Colab (without cloning),
> relative paths will not work. Clone first.


In [None]:
from __future__ import annotations

import sys
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reproducibility
RNG = np.random.default_rng(564)

# Locate repo root (works when running from lectures/Lxx.../student or /instructor)
THIS_DIR = Path.cwd()
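REPO_ROOT = THIS_DIR
# Walk up at most four levels looking for a repo marker file
for _ in range(4):
    if (REPO_ROOT / "requirements.txt").exists() or (REPO_ROOT / "README.md").exists():
        break
    REPO_ROOT = REPO_ROOT.parent
else:
    # No marker found: fall back to the working directory instead of an unchecked ancestor
    REPO_ROOT = THIS_DIR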

DATA_DIR = REPO_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
PROC_DIR = DATA_DIR / "processed"

print("Working directory:", THIS_DIR)
print("Repo root:", REPO_ROOT)
print("Processed data dir exists:", PROC_DIR.exists())


## Part A — Observational study as a conditionally randomized experiment
We simulate confounding via a measured covariate `L`. The goal is to estimate the ATE of `A` on `Y`.
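
The identification argument behind this part can be sketched in the counterfactual notation of the required reading. The symbol $Y^a$ (the potential outcome under treatment level $a$) is not defined elsewhere in this notebook and is assumed here. Treating the observational study as an experiment that is randomized within levels of `L` amounts to assuming:

$$
\begin{aligned}
&\text{Conditional exchangeability:} && Y^a \perp\!\!\!\perp A \mid L \quad \text{for } a \in \{0, 1\} \\
&\text{Positivity:} && \Pr(A = a \mid L = l) > 0 \quad \text{for all } l \text{ in the support of } L \\
&\text{Consistency:} && Y = Y^a \quad \text{whenever } A = a
\end{aligned}
$$

Under these three conditions, the counterfactual mean is a function of the observed data distribution, and the ATE follows as a contrast:

$$
E[Y^a] = E_L\big[\, E[Y \mid A = a, L] \,\big], \qquad \text{ATE} = E[Y^1] - E[Y^0].
$$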


In [None]:
# Statsmodels for regression (logit/ols); installed via requirements.txt
import statsmodels.api as sm
import statsmodels.formula.api as smf

n = 3000  # number of simulated individuals
L = RNG.normal(size=n)
# treatment depends on L (confounding)
pA = 1/(1+np.exp(-(-0.2 + 1.2*L)))
A = RNG.binomial(1, pA)
# outcome depends on A and L
pY = 1/(1+np.exp(-(-1.0 + 0.8*A + 1.0*L)))
Y = RNG.binomial(1, pY)
df = pd.DataFrame({"L":L, "A":A, "Y":Y})
df.head()

### TODO A1 — Naïve (confounded) estimate
Compute the crude (unadjusted) risk difference (RD): E[Y|A=1] - E[Y|A=0].
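
Because the data are simulated, the crude RD can be compared against the true marginal RD implied by the outcome model. The following is a minimal sketch, assuming the same outcome coefficients as the simulation cell in Part A and a large fresh draw of `L`. The seed, the draw size, and the names `expit`, `rng_truth`, `L_big`, and `true_rd` are illustrative and not part of the course code.

```python
import numpy as np


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))


# Fresh generator so the notebook's RNG stream is left untouched (seed is arbitrary)
rng_truth = np.random.default_rng(0)
L_big = rng_truth.normal(size=200_000)

# Average the outcome-model risks over the distribution of L,
# setting A to 1 for everyone and then to 0 for everyone
risk1_true = expit(-1.0 + 0.8 * 1 + 1.0 * L_big).mean()
risk0_true = expit(-1.0 + 0.8 * 0 + 1.0 * L_big).mean()
true_rd = risk1_true - risk0_true

print({"risk1_true": risk1_true, "risk0_true": risk0_true, "true_rd": true_rd})
```

In this simulation, higher `L` raises both the probability of treatment and the risk of `Y`, so the crude RD is expected to be larger than this benchmark.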


In [None]:
rd_crude = None  # TODO
rd_crude

## Part B — Standardization (g-formula) with a correctly specified model
We fit an outcome model E[Y|A,L] and standardize over the empirical distribution of L.
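
The logic of standardization is easiest to see without a model. The sketch below is not the exercise solution: it uses a hypothetical table of counts with a binary covariate (also called `L` here) and replaces the outcome model with the observed stratum-specific risks. All numbers and the `*_toy` names are made up for illustration.

```python
import pandas as pd

# Hypothetical counts (not course data): one row per (L, A) cell
cells = pd.DataFrame({
    "L": [0, 0, 1, 1],
    "A": [0, 1, 0, 1],
    "n": [30, 10, 10, 30],       # people in the cell
    "events": [6, 4, 6, 24],     # people in the cell with Y = 1
})
cells["risk"] = cells["events"] / cells["n"]  # E[Y | A = a, L = l]

# Crude RD: pool over L within each treatment group
by_a = cells.groupby("A")[["events", "n"]].sum()
crude_risk = by_a["events"] / by_a["n"]
rd_crude_toy = crude_risk.loc[1] - crude_risk.loc[0]

# Standardized RD: weight each stratum-specific risk by Pr(L = l)
pr_l = cells.groupby("L")["n"].sum() / cells["n"].sum()
risk = cells.pivot(index="L", columns="A", values="risk")
mu1_toy = (risk[1] * pr_l).sum()
mu0_toy = (risk[0] * pr_l).sum()
rd_std_toy = mu1_toy - mu0_toy

print({"rd_crude_toy": rd_crude_toy, "rd_std_toy": rd_std_toy})
```

With a continuous `L`, as in this notebook, the stratum-specific risks are replaced by model predictions and the sum over strata by an average over the observed rows.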


### TODO B1 — Fit a logistic outcome regression
Fill in the formula and fit the model.


In [None]:
# TODO: choose a reasonable model for Pr(Y=1 | A, L)
formula = "Y ~ A + L"  # you may extend, e.g. "Y ~ A + L + I(L**2)" for a quadratic term in L
model_y = smf.logit(formula=formula, data=df).fit(disp=False)
model_y.summary().tables[0]

### TODO B2 — Standardize: predict under A=1 and A=0 and average


In [None]:
df1 = df.copy(); df1["A"] = 1
df0 = df.copy(); df0["A"] = 0

# TODO: use model_y.predict to get predicted risk under A=1 and A=0
mu1 = None
mu0 = None
rd_std = None

{"mu1": mu1, "mu0": mu0, "rd_std": rd_std}

## Part C — IP weighting (propensity score)
We fit a treatment model Pr(A=1|L), compute stabilized weights, and estimate the RD in the resulting pseudo-population.
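
To see what the pseudo-population does, the weights can be worked out by hand on a small table. This is a sketch on hypothetical counts with a binary covariate (also called `L` here), using observed proportions in place of a fitted propensity model. The numbers and the names `cells`, `pseudo`, and `rd_ipw_toy` are illustrative only.

```python
import pandas as pd

# Hypothetical counts (not course data): one row per (L, A) cell
cells = pd.DataFrame({
    "L": [0, 0, 1, 1],
    "A": [0, 1, 0, 1],
    "n": [30, 10, 10, 30],       # people in the cell
    "events": [6, 4, 6, 24],     # people in the cell with Y = 1
})

# Pr(A = a | L = l) and Pr(A = a), each evaluated at the row's own (a, l)
pr_a_given_l = cells["n"] / cells.groupby("L")["n"].transform("sum")
pr_a = cells.groupby("A")["n"].transform("sum") / cells["n"].sum()

# Stabilized weight shared by everyone in the cell
cells["sw"] = pr_a / pr_a_given_l

# Pseudo-population: each person counts sw times
cells["w_n"] = cells["sw"] * cells["n"]
cells["w_events"] = cells["sw"] * cells["events"]
pseudo = cells.groupby("A")[["w_events", "w_n"]].sum()
risk_w = pseudo["w_events"] / pseudo["w_n"]
rd_ipw_toy = risk_w.loc[1] - risk_w.loc[0]

print(cells[["L", "A", "sw", "w_n"]])
print({"rd_ipw_toy": rd_ipw_toy})
```

In the pseudo-population every (L, A) cell has the same weighted size, so `A` no longer depends on `L`. With these saturated (nonparametric) weights, the weighted RD equals the RD obtained by standardizing the same counts over `L`.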


### TODO C1 — Fit propensity score model


In [None]:
# TODO: propensity model for A given L
ps_model = smf.logit(formula="A ~ L", data=df).fit(disp=False)
df["ps"] = ps_model.predict(df)

# Stabilized weights: numerator Pr(A) and denominator Pr(A|L)
pA_marg = df["A"].mean()
df["sw"] = np.where(df["A"]==1, pA_marg/df["ps"], (1-pA_marg)/(1-df["ps"]))
df[["ps","sw"]].describe(percentiles=[0.01,0.05,0.95,0.99])

### TODO C2 — Weighted RD
Compute the weighted mean of Y within each treatment group, then take the difference to get the RD.


In [None]:
def wmean(x, w):
    """Weighted mean of x using weights w."""
    return np.sum(w*x)/np.sum(w)

mu1_w = wmean(df.loc[df["A"]==1,"Y"].to_numpy(), df.loc[df["A"]==1,"sw"].to_numpy())
mu0_w = wmean(df.loc[df["A"]==0,"Y"].to_numpy(), df.loc[df["A"]==0,"sw"].to_numpy())
rd_ipw = mu1_w - mu0_w
{"mu1_w": mu1_w, "mu0_w": mu0_w, "rd_ipw": rd_ipw}

### Diagnostics
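
Alongside a histogram, a few numeric summaries of the weights are often inspected: the mean (expected to be close to 1 for stabilized weights), the extremes, and the Kish effective sample size. The helper below is a hypothetical sketch, not part of the course code, and the example weight vectors are made up.

```python
import numpy as np


def weight_diagnostics(w):
    """Summaries of a weight vector (hypothetical helper for illustration)."""
    w = np.asarray(w, dtype=float)
    return {
        "mean": w.mean(),
        "min": w.min(),
        "max": w.max(),
        # Kish effective sample size: (sum of w)^2 / (sum of w^2)
        "ess": w.sum() ** 2 / np.sum(w ** 2),
    }


# Illustrative weights only: a single extreme weight sharply lowers the effective sample size
print(weight_diagnostics(np.ones(100)))
print(weight_diagnostics(np.r_[np.ones(99), 50.0]))
```

In this notebook the helper could be applied as `weight_diagnostics(df["sw"])`. A mean far from 1 or a small effective sample size relative to `n` suggests a misspecified treatment model or a few individuals dominating the estimate.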


In [None]:
plt.figure()
plt.hist(df["sw"], bins=60)
plt.xlabel("stabilized weight")
plt.ylabel("count")
plt.title("Weight distribution")
plt.show()

## Reflection
1) Which assumptions are required for standardization and IPW to identify the ATE?
2) What is positivity, and how would a violation or near-violation of it show up in the weights?
