# PHS564 — Lecture 06 (Student)
## Confounding: definition, confounders, and adjustment strategies

### Learning goals
- Define confounding as lack of exchangeability due to common causes.
- Compare definitions of a “confounder” and explain why purely statistical criteria are insufficient.
- Connect confounding control to DAG-based adjustment sets.

### Required reading
- Hernán & Robins, *What If*, Chapter 7 (Confounding). https://miguelhernan.org/whatifbook


### Setup

This notebook is designed to run **locally** or in **Google Colab**.

**Colab workflow (recommended):**
1) Clone the course repo (ask the instructor for the GitHub URL).
2) Install requirements.
3) Run the notebook top-to-bottom.

> If you opened this notebook directly from GitHub in Colab (without cloning),
> relative paths will not work. Clone first.


In [None]:
from __future__ import annotations

import sys
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reproducibility
RNG = np.random.default_rng(564)

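# Locate repo root (works when running from lectures/Lxx.../student or /instructor).
# Walk up at most four levels looking for a marker file; if none is found,
# fall back to the working directory instead of an unrelated ancestor.
THIS_DIR = Path.cwd()
REPO_ROOT = THIS_DIR
for _ in range(4):
    if (REPO_ROOT / "requirements.txt").exists() or (REPO_ROOT / "README.md").exists():
        break
    REPO_ROOT = REPO_ROOT.parent
else:
    REPO_ROOT = THIS_DIR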

DATA_DIR = REPO_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
PROC_DIR = DATA_DIR / "processed"

print("Working directory:", THIS_DIR)
print("Repo root:", REPO_ROOT)
print("Processed data dir exists:", PROC_DIR.exists())


## Part A — Confounding in a simple data-generating process
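We simulate a data-generating process in which L is a common cause of treatment A and outcome Y, then compare crude and adjusted estimators of the risk difference (RD).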
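
Because the data are simulated, the causal target can be approximated directly by averaging both counterfactual risks over the distribution of L. The sketch below is an illustration, not part of the assignment: it reuses the coefficients from the data-generating cell in this part, while the helper name `expit`, the seed, and the Monte Carlo sample size are assumptions chosen for the example.

```python
import numpy as np

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))

# Separate generator so the notebook's RNG stream is left untouched (seed is arbitrary).
rng_truth = np.random.default_rng(0)

# Large Monte Carlo draw from the confounder distribution, L ~ N(0, 1).
L_mc = rng_truth.normal(size=1_000_000)

# Counterfactual risks with A set to 1 or 0 for everyone (same coefficients as the DGP).
risk_a1 = expit(-1.4 + 0.7 * 1 + 1.1 * L_mc).mean()
risk_a0 = expit(-1.4 + 0.7 * 0 + 1.1 * L_mc).mean()

rd_true = risk_a1 - risk_a0
print(f"Approximate true marginal RD: {rd_true:.3f}")
```

Adjusted estimates that land close to `rd_true` suggest that confounding by L was handled. The crude estimate is expected to sit above it, because L raises both the probability of treatment and the risk of the outcome.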


In [None]:
# Statsmodels for regression (logit/ols); installed via requirements.txt
import statsmodels.api as sm
import statsmodels.formula.api as smf
n = 5000
L = RNG.normal(size=n)
A = RNG.binomial(1, 1/(1+np.exp(-(-0.1 + 1.0*L))), size=n)
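# The conditional effect of A is constant on the logit scale (log-odds ratio 0.7);
# on the risk scale it varies with L. L also affects A, so it confounds the crude A-Y association.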
pY = 1/(1+np.exp(-(-1.4 + 0.7*A + 1.1*L)))
Y = RNG.binomial(1, pY)
df = pd.DataFrame({"L":L,"A":A,"Y":Y})

### TODO A1 — Crude RD
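
As a reminder of the pattern (not the solution for `df`), a crude risk difference is the difference in outcome means between the treated and untreated groups, ignoring L. The toy table and the names `toy`, `risk_treated`, and `risk_untreated` below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 4 treated (3 events) and 4 untreated (1 event).
toy = pd.DataFrame({
    "A": [1, 1, 1, 1, 0, 0, 0, 0],
    "Y": [1, 1, 1, 0, 1, 0, 0, 0],
})

risk_treated = toy.loc[toy["A"] == 1, "Y"].mean()    # 3/4
risk_untreated = toy.loc[toy["A"] == 0, "Y"].mean()  # 1/4
rd_toy = risk_treated - risk_untreated

print(rd_toy)  # 0.5
```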


In [None]:
rd_crude = None  # TODO
rd_crude

## Part B — Stratification / standardization
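We bin L into quintiles, estimate the RD within each stratum, and standardize to the distribution of L in the full sample. Because L is continuous, quintile strata may leave some residual confounding within strata.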


In [None]:
df["L_q"] = pd.qcut(df["L"], q=5, labels=False)

# Within-stratum RD
strata = []
for q, d in df.groupby("L_q"):
    r1 = d.loc[d["A"]==1,"Y"].mean()
    r0 = d.loc[d["A"]==0,"Y"].mean()
    strata.append({"L_q": q, "n": len(d), "rd": r1-r0, "p_stratum": len(d)/len(df)})
strata = pd.DataFrame(strata)
strata

### TODO B1 — Standardized RD
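Compute the standardized RD as the weighted sum of stratum-specific RDs, sum(p_stratum * rd), where p_stratum is the share of the full sample in each stratum.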
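
Standardization weights each stratum-specific RD by that stratum's share of the whole sample. A minimal sketch on a hypothetical two-stratum table follows; the numbers and the name `toy_strata` are made up for illustration, and only the column names follow the `strata` table built above.

```python
import pandas as pd

# Hypothetical strata: the shares must sum to 1.
toy_strata = pd.DataFrame({
    "p_stratum": [0.25, 0.75],
    "rd": [0.10, 0.20],
})

# Weighted sum: 0.25 * 0.10 + 0.75 * 0.20 = 0.175
rd_std_toy = (toy_strata["p_stratum"] * toy_strata["rd"]).sum()
print(round(rd_std_toy, 3))  # 0.175
```

The unweighted average of the two RDs would be 0.15; the standardized value differs because the larger stratum counts for more.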


In [None]:
rd_std = None  # TODO
rd_std

## Part C — IPW (propensity scores)
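You only need to adjust the propensity model specification; the weighting and estimation code is provided. The cell computes stabilized weights, P(A=a) / P(A=a | L), and compares the weighted outcome means of the treated and untreated.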
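
To see why inverse probability weighting removes confounding by L, the sketch below builds stabilized weights from the *true* propensity score of a simulated dataset and checks that L is balanced across treatment arms in the weighted pseudo-population. Using the known propensity instead of a fitted model is an assumption made to keep the example dependency-free; the seed, sample size, and names such as `ps_true` and `wmean_arm` are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for this illustration
n = 200_000

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))

# Same structure as the notebook's DGP: L causes both A and Y.
L = rng.normal(size=n)
ps_true = expit(-0.1 + 1.0 * L)  # P(A=1 | L), known here by construction
A = rng.binomial(1, ps_true)
Y = rng.binomial(1, expit(-1.4 + 0.7 * A + 1.1 * L))

# Stabilized weights: P(A=a) / P(A=a | L).
pA = A.mean()
sw = np.where(A == 1, pA / ps_true, (1 - pA) / (1 - ps_true))

def wmean_arm(x, w, arm):
    """Weighted mean of x among units with A == arm."""
    mask = A == arm
    return np.sum(w[mask] * x[mask]) / np.sum(w[mask])

ones = np.ones(n)
gap_crude = wmean_arm(L, ones, 1) - wmean_arm(L, ones, 0)  # imbalance in L before weighting
gap_weighted = wmean_arm(L, sw, 1) - wmean_arm(L, sw, 0)   # should be near 0

rd_crude_demo = wmean_arm(Y, ones, 1) - wmean_arm(Y, ones, 0)
rd_ipw_demo = wmean_arm(Y, sw, 1) - wmean_arm(Y, sw, 0)

print(f"L gap: crude {gap_crude:.3f}, weighted {gap_weighted:.3f}")
print(f"RD:    crude {rd_crude_demo:.3f}, IPW {rd_ipw_demo:.3f}")
```

With a fitted propensity model, as in the cell below, the same balance check indicates whether the chosen specification is adequate.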


In [None]:
# TODO: edit covariate features if needed
ps = smf.logit("A ~ L", data=df).fit(disp=False)
df["ps"] = ps.predict(df)
pA = df["A"].mean()
df["sw"] = np.where(df["A"]==1, pA/df["ps"], (1-pA)/(1-df["ps"]))

def wmean(x, w):
    """Weighted mean of x with weights w."""
    return np.sum(w * x) / np.sum(w)

mu1 = wmean(df.loc[df["A"]==1,"Y"].to_numpy(), df.loc[df["A"]==1,"sw"].to_numpy())
mu0 = wmean(df.loc[df["A"]==0,"Y"].to_numpy(), df.loc[df["A"]==0,"sw"].to_numpy())
rd_ipw = mu1 - mu0
# rd_crude and rd_std remain None until TODO A1 and TODO B1 are completed.
{"rd_crude": rd_crude, "rd_std": rd_std, "rd_ipw": rd_ipw}

### Diagnostics: check extreme weights
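
Two numeric summaries often accompany a weight histogram: the mean of the stabilized weights, which is expected to be close to 1, and Kish's effective sample size, (sum w)^2 / sum(w^2), which shrinks as the weights become more variable. The helper `weight_summary` is a hypothetical name and the example weights are made up; in this notebook one would pass `df["sw"]` instead.

```python
import numpy as np

def weight_summary(w):
    """Return mean, max, and Kish effective sample size for a weight vector."""
    w = np.asarray(w, dtype=float)
    ess = w.sum() ** 2 / np.sum(w ** 2)
    return {"mean": w.mean(), "max": w.max(), "ess": ess}

# Hypothetical weights: nine well-behaved units and one extreme unit (mean is 1).
w_demo = np.array([0.5] * 9 + [5.5])
summary = weight_summary(w_demo)
print(summary)
```

Here a single weight of 5.5 cuts the effective sample size from 10 to about 3.1 even though the mean weight is exactly 1, which is the kind of pattern the histogram and percentiles are meant to reveal.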


In [None]:
plt.figure()
plt.hist(df["sw"], bins=60)
plt.xlabel("stabilized weight")
plt.ylabel("count")
plt.title("Weights")
plt.show()

df["sw"].describe(percentiles=[0.01,0.05,0.95,0.99])

## Reflection
1) How do regression adjustment, standardization, and IPW relate?
2) What makes a variable a confounder here?
