# PHS564 — Lecture 09 (Student)
## Standardization + (parametric) g-formula (and “why model”) (MIMIC-IV Demo)

### Learning goals
- Derive and implement standardization:
  - \(\Pr(Y^a=1)=\sum_l \Pr(Y=1 \mid A=a, L=l)\,\Pr(L=l)\) (discrete \(L\))
  - the parametric g-formula for high-dimensional \(L\)
- Connect **estimand → identification → estimator → computation** (this is the “why model” message).
- Compare g-formula vs IPW and articulate tradeoffs (model dependence vs weight instability).
- Bootstrap the full procedure for uncertainty.
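
The standardization formula in the first goal can be derived in three steps. This is a sketch for discrete \(L\), assuming the usual identification conditions hold (conditional exchangeability \(Y^a \perp\!\!\!\perp A \mid L\), positivity, and consistency); these conditions are assumptions of the sketch and are not stated elsewhere in this notebook.

\[
\begin{aligned}
\Pr(Y^a=1) &= \sum_l \Pr(Y^a=1 \mid L=l)\,\Pr(L=l) && \text{(law of total probability)} \\
&= \sum_l \Pr(Y^a=1 \mid A=a, L=l)\,\Pr(L=l) && \text{(conditional exchangeability; positivity makes the conditioning well defined)} \\
&= \sum_l \Pr(Y=1 \mid A=a, L=l)\,\Pr(L=l) && \text{(consistency)}
\end{aligned}
\]

When \(L\) is high-dimensional, the stratum-specific probabilities can no longer be estimated one stratum at a time. The parametric g-formula replaces them with predictions from a fitted outcome model and averages over the observed covariate values: \(\frac{1}{n}\sum_{i=1}^{n}\hat{E}[Y \mid A=a, L=l_i]\).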

### Required reading
- Hernán & Robins, g-formula/standardization sections (aligned chapter in *What If*).


### Setup

This notebook is designed to run **locally** or in **Google Colab**.

**Colab workflow (recommended):**
1) Clone the course repo (ask the instructor for the GitHub URL).
2) Install requirements.
3) Run the notebook top-to-bottom.

> If you opened this notebook directly from GitHub in Colab (without cloning),
> relative paths will not work. Clone first.


In [None]:
from __future__ import annotations

import sys
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reproducibility
RNG = np.random.default_rng(564)

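# Locate repo root (works when running from lectures/Lxx.../student or /instructor).
# Check the working directory and up to three parents; fall back to the working
# directory if no marker file is found.
THIS_DIR = Path.cwd()
REPO_ROOT = THIS_DIR
for candidate in [THIS_DIR, *THIS_DIR.parents][:4]:
    if (candidate / "requirements.txt").exists() or (candidate / "README.md").exists():
        REPO_ROOT = candidate
        break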

DATA_DIR = REPO_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
PROC_DIR = DATA_DIR / "processed"

print("Working directory:", THIS_DIR)
print("Repo root:", REPO_ROOT)
print("Processed data dir exists:", PROC_DIR.exists())


### Download MIMIC-IV Demo data (if needed)

**Google Colab users:** Run the cell below to download the raw MIMIC-IV Demo data. You need it only if you want to explore the raw data or the instructor asks you to work with it.

**Local users:** You can skip this cell if you already have the data, or run it if you need to download it.

> **Note:** This downloads the raw MIMIC-IV Demo data to `data/raw/`. The notebook will use a pre-processed cohort extract from `data/processed/`, but having the raw data available can be useful for exploration.

In [None]:
# Download MIMIC-IV Demo data (Colab-friendly)
# This cell can be skipped if you already have the data or are running locally

# Add repo root to path to import download function
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

try:
    from data.download_data import download_mimic_demo
    
    # Check if data already exists
    mimic_dir = RAW_DIR / "mimic-iv-demo-2.2"
    if mimic_dir.exists() and any(mimic_dir.rglob("*.csv.gz")):
        print("✓ MIMIC-IV Demo data already exists. Skipping download.")
    else:
        print("Downloading MIMIC-IV Demo data...")
        print("(This may take a few minutes. The data is ~7 MB compressed.)")
        download_mimic_demo(out_dir=str(RAW_DIR), method="python")
except Exception as e:
    print(f"Note: Could not download data: {e}")
    print("You can skip this cell if you already have the processed cohort extract.")

## Data
Expected file: `data/processed/cohort_L09_gformula.parquet` (or `.csv`).


In [None]:
# Statsmodels for regression (logit/ols); installed via requirements.txt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Prefer the parquet extract; fall back to CSV if only that is present
parquet_path = PROC_DIR / "cohort_L09_gformula.parquet"
csv_path = PROC_DIR / "cohort_L09_gformula.csv"

if parquet_path.exists():
    df = pd.read_parquet(parquet_path)
elif csv_path.exists():
    df = pd.read_csv(csv_path)
else:
    raise FileNotFoundError(
        f"Missing cohort file for L09: expected {parquet_path.name} or {csv_path.name} "
        f"in {PROC_DIR}. Run `python data/download_data.py`."
    )

df.head()

### Choose variables


In [None]:
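A = "A"   # TODO: name of the treatment column in df
Y = "Y"   # TODO: name of the outcome column in df
L_list = ["age", "sex"]  # TODO: names of the confounder columns to adjust for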

analysis_vars = [A, Y] + L_list
d = df[analysis_vars].dropna().copy()
d.head()

## Part A — Outcome regression model
You only need to specify the model formula; the fitting code is provided.


In [None]:
# TODO: specify model for E[Y | A, L]
# If Y is binary, use logit; if continuous, use OLS.
# Start with: "Y ~ A + age + sex"
outcome_formula = Y + " ~ " + " + ".join([A] + L_list)

# Choose model type (auto-guess binary if only 0/1 values)
is_binary = set(d[Y].dropna().unique()).issubset({0,1})

if is_binary:
    y_model = smf.logit(outcome_formula, data=d).fit(disp=False)
else:
    y_model = smf.ols(outcome_formula, data=d).fit()

y_model.summary().tables[0]

## Part B — Standardization (parametric g-formula)
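Set A=1 for everyone and predict each person's outcome from the fitted model, do the same with A=0, then average each set of predictions over the observed distribution of L. The difference between the two averages is the standardized estimate of the average treatment effect (ATE).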
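
The predict-then-average recipe can be checked by hand when L is discrete. The sketch below uses a hypothetical ten-row toy cohort (made-up numbers, not MIMIC data) with a single binary confounder `L`; the helper names `standardized_risk` and `predict_then_average` are illustrative and not part of the course code. It computes the standardized risk once as the weighted sum over strata and once by assigning each person the risk of their own stratum and averaging, and the two should agree.

```python
import pandas as pd

# Hypothetical toy cohort (made-up numbers, not MIMIC data):
# binary confounder L, binary treatment A, binary outcome Y.
toy = pd.DataFrame({
    "L": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "A": [0, 0, 1, 1, 0, 0, 0, 1, 1, 1],
    "Y": [0, 1, 0, 0, 1, 1, 0, 1, 1, 0],
})


def standardized_risk(data: pd.DataFrame, a: int) -> float:
    """Sum over l of Pr(Y=1 | A=a, L=l) * Pr(L=l)."""
    risk_by_l = data.loc[data["A"] == a].groupby("L")["Y"].mean()
    p_l = data["L"].value_counts(normalize=True)
    # Index alignment on the values of L pairs each stratum risk with its weight.
    return float((risk_by_l * p_l).sum())


def predict_then_average(data: pd.DataFrame, a: int) -> float:
    """Give each person the risk of their own L stratum under A=a, then average."""
    risk_by_l = data.loc[data["A"] == a].groupby("L")["Y"].mean()
    return float(data["L"].map(risk_by_l).mean())


mu1 = standardized_risk(toy, a=1)
mu0 = standardized_risk(toy, a=0)
print({"mu1": mu1, "mu0": mu0, "ATE": mu1 - mu0})
print(predict_then_average(toy, 1), predict_then_average(toy, 0))
```

In this toy cohort every combination of `A` and `L` is observed. If a stratum had no one with `A=a` (a positivity violation), the stratum risk would be missing and the weighted sum would silently skip it, so a real analysis should check for empty cells first.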


In [None]:
# Counterfactual copies of the data: everyone treated, then everyone untreated
d1 = d.copy()
d1[A] = 1
d0 = d.copy()
d0[A] = 0

pred1 = y_model.predict(d1)
pred0 = y_model.predict(d0)

mu1 = float(np.mean(pred1))
mu0 = float(np.mean(pred0))
ate = mu1 - mu0
{"mu1": mu1, "mu0": mu0, "ATE": ate}
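
To make the comparison with IPW from the learning goals concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the seed, the sample size, the data-generating probabilities, and the helper names `standardized_mean` and `ipw_weights` are not part of the course code. With one binary confounder and fully nonparametric (saturated) estimates, standardization and IPW give the same number; they start to differ once parametric models for the outcome or the treatment are used.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only

# Simulated data (hypothetical parameters): L confounds the A-Y relationship.
n = 5_000
L = rng.binomial(1, 0.4, size=n)
A = rng.binomial(1, np.where(L == 1, 0.8, 0.2))
Y = rng.binomial(1, 0.1 + 0.2 * A + 0.3 * L)  # true risk difference is 0.2


def standardized_mean(a: int) -> float:
    """Standardization: stratum-specific outcome means weighted by Pr(L=l)."""
    return float(sum(Y[(L == l) & (A == a)].mean() * np.mean(L == l) for l in (0, 1)))


def ipw_weights(a: int) -> np.ndarray:
    """IPW: weight 1 / Pr(A=a | L) for people with A=a, and 0 for everyone else."""
    p = np.empty(n)
    for l in (0, 1):
        in_l = L == l
        p[in_l] = np.mean(A[in_l] == a)  # stratum-specific propensity estimate
    return (A == a) / p


ate_std = standardized_mean(1) - standardized_mean(0)
w1, w0 = ipw_weights(1), ipw_weights(0)
ate_ipw = float(np.mean(w1 * Y) - np.mean(w0 * Y))
naive = float(Y[A == 1].mean() - Y[A == 0].mean())

print(f"naive difference : {naive:.3f}")
print(f"standardization  : {ate_std:.3f}")
print(f"IPW              : {ate_ipw:.3f}")
print(f"largest weight   : {max(w1.max(), w0.max()):.1f}")
```

The largest weight is worth inspecting alongside the estimate: when some estimated propensities are close to 0 or 1, a few people carry very large weights and the IPW estimate becomes unstable. The g-formula avoids weights but relies on the outcome model being correctly specified.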

### TODO B1 — Bootstrap CI for ATE (optional)


In [None]:
def gformula_ate(data: pd.DataFrame) -> float:
    """Refit the outcome model on `data` and return the standardized ATE."""
    # Same specification as above; refitting inside the function lets the
    # bootstrap capture the uncertainty from model fitting.
    is_binary = set(data[Y].dropna().unique()).issubset({0, 1})
    if is_binary:
        m = smf.logit(outcome_formula, data=data).fit(disp=False)
    else:
        m = smf.ols(outcome_formula, data=data).fit()
    d1 = data.copy()
    d1[A] = 1
    d0 = data.copy()
    d0[A] = 0
    return float(m.predict(d1).mean() - m.predict(d0).mean())


def bootstrap_ci(data: pd.DataFrame, B: int = 300) -> tuple[float, float, float]:
    """Return (bootstrap mean, 2.5th percentile, 97.5th percentile) of the ATE."""
    vals = []
    n = len(data)
    for _ in range(B):
        # Resample rows with replacement using the seeded RNG so results are reproducible
        idx = RNG.integers(0, n, size=n)
        vals.append(gformula_ate(data.iloc[idx]))
    vals = np.array(vals)
    return float(vals.mean()), float(np.quantile(vals, 0.025)), float(np.quantile(vals, 0.975))


bootstrap_ci(d, B=200)

## Reflection
1) What assumptions does the parametric g-formula add beyond the identification assumptions?
2) How would misspecification of the outcome model show up in your estimates?
