# PHS564 — Lecture 08 (Student)
## IP weighting for confounding (propensity scores) + diagnostics (MIMIC-IV Demo)

### Learning goals
- Explain IP weighting as **creating a pseudo-population** where treatment is independent of measured confounders.
- Fit a propensity model \(e(L)=\Pr(A=1\mid L)\) and compute **unstabilized** and **stabilized** weights.
- Diagnose failure modes: **lack of overlap/positivity**, extreme weights, imbalance after weighting.
- Produce a defensible causal estimate + a minimal diagnostics panel.
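
For reference, the two weight definitions named above can be written as follows, using the notation of the learning goals (binary treatment \(A\), measured confounders \(L\), propensity score \(e(L)\)). This is a summary sketch rather than a derivation.

\[
W = \frac{A}{e(L)} + \frac{1-A}{1-e(L)},
\qquad
SW = \frac{A\,\Pr(A=1)}{e(L)} + \frac{(1-A)\,\Pr(A=0)}{1-e(L)}
\]

Both sets of weights require positivity, i.e. \(0 < e(L) < 1\) for every covariate pattern that occurs in the population. The stabilized weights \(SW\) have expectation 1, so a sample mean far from 1 is a common warning sign of a misspecified propensity model or poor overlap.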

### Required reading
- Hernán & Robins, *What If*, Chapter 12 (IPW basics + intuition): https://miguelhernan.org/whatifbook

**Rules for this notebook**
- Only edit cells marked **TODO**.
- Do not change the overall structure/cell order.


In [None]:
# Colab bootstrap (run this first if you opened from a Colab badge)
# - Clones the repo into /content/PHS564 (if needed)
# - Installs requirements
# - Adds repo to sys.path

from __future__ import annotations

import os
import sys
import subprocess
from pathlib import Path


def _in_colab() -> bool:
    return "google.colab" in sys.modules


if _in_colab():
    REPO_URL = "https://github.com/vafaei-ar/PHS564.git"
    TARGET_DIR = Path("/content/PHS564")

    if not (TARGET_DIR / "requirements.txt").exists():
        print("Cloning course repo into Colab runtime...")
        subprocess.run(["git", "clone", "--depth", "1", REPO_URL, str(TARGET_DIR)], check=True)

    os.chdir(TARGET_DIR)

    print("Installing requirements...")
    subprocess.run([sys.executable, "-m", "pip", "-q", "install", "-r", "requirements.txt"], check=True)

    if str(TARGET_DIR) not in sys.path:
        sys.path.insert(0, str(TARGET_DIR))

    print("✓ Colab setup complete. Now run the rest of the notebook.")
else:
    print("Not running in Colab; skipping Colab bootstrap.")


### Setup

This notebook is designed to run **locally** or in **Google Colab**.

**If you opened from the Colab badge (recommended):**
1) Run the first code cell titled **“Colab bootstrap”** (it clones the repo and installs the requirements).
2) Run the notebook top-to-bottom.

**If you are running locally:**
- Install dependencies from `requirements.txt` (see the repo `README.md`), then run top-to-bottom.


In [None]:
from __future__ import annotations

import sys
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reproducibility
RNG = np.random.default_rng(564)

# Locate repo root (works when running from lectures/Lxx.../student or /instructor)
THIS_DIR = Path.cwd()
REPO_ROOT = THIS_DIR
for _ in range(4):
    if (REPO_ROOT / "requirements.txt").exists() or (REPO_ROOT / "README.md").exists():
        break
    REPO_ROOT = REPO_ROOT.parent

# Make repo-level packages (e.g., `data`) importable when running locally
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

DATA_DIR = REPO_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
PROC_DIR = DATA_DIR / "processed"

print("Working directory:", THIS_DIR)
print("Repo root:", REPO_ROOT)
print("Processed data dir exists:", PROC_DIR.exists())


### Build the processed cohort extract (required)

This lecture expects an analysis-ready cohort file in `data/processed/`.

If it’s missing, run the next cell to:
1) download MIMIC-IV Demo into `data/raw/` (if needed)
2) build the processed cohort extracts into `data/processed/`

**Exposure mode (A):** by default, `A = 1` when `admission_type` contains “EMER”. To define exposure from vitals instead, set `EXPOSURE_MODE = "vitals"` in the code cell below.

In [None]:
# Build processed cohort extracts from raw MIMIC-IV Demo (safe to re-run)
# This will create cohort files under data/processed/.

EXPOSURE_MODE = "admission_type"  # or "vitals"
HR_THRESHOLD = 100.0  # used only if EXPOSURE_MODE == "vitals"

try:
    from data.download_data import download_mimic_demo
    from data.build_processed_extracts_demo import build_processed_extracts

    mimic_dir = RAW_DIR / "mimic-iv-demo-2.2"
    if not mimic_dir.exists():
        print("Downloading raw MIMIC-IV Demo (v2.2) to data/raw/ ...")
        download_mimic_demo(out_dir=RAW_DIR, version="2.2", method="python")
    else:
        print("✓ Raw MIMIC-IV Demo already present.")

    out_paths = build_processed_extracts(exposure_mode=EXPOSURE_MODE, hr_threshold=HR_THRESHOLD)
    print("✓ Built processed cohorts:")
    for k, v in out_paths.items():
        print(f"  {k}: {v}")
except Exception as e:
    print("Could not build processed cohort extracts in this environment.")
    print("Error:", e)
    print("If you already have the cohort file, place it in data/processed/ and re-run.")

### Optional: download raw MIMIC-IV Demo tables

Not required for the homework pipeline. Skip unless your instructor asks you to explore the raw Demo tables in `data/raw/`.

In [None]:
# Optional: download raw MIMIC-IV Demo tables to data/raw/
# Not required for the processed-cohort pipeline.

try:
    from data.download_data import download_mimic_demo

    download_mimic_demo(out_dir=RAW_DIR, version="2.2", method="python")
except Exception as e:
    print("Skipping raw MIMIC-IV Demo download.")
    print("Error:", e)

## Data
This lecture uses a **processed cohort extract** derived from MIMIC-IV Demo.

Expected file:
- `data/processed/cohort_L08_ps_ipw.parquet` (preferred) or `.csv`

If this file is missing:
- Run the “Build the processed cohort extract” cell above, or
- Run locally: `python data/build_processed_extracts_demo.py --exposure-mode admission_type`

### Column definitions (this cohort)

These are defined by the cohort builder (`data/build_processed_extracts_demo.py`):

- `A` (treatment):
  - `EXPOSURE_MODE = "admission_type"` (default): `A = 1` if `admission_type` contains `"EMER"` (e.g., "EW EMER.", "DIRECT EMER.")
  - `EXPOSURE_MODE = "vitals"` (optional): `A = 1` if mean heart rate in the **first 6 hours** of the ICU stay is `> HR_THRESHOLD`
- `Y` (outcome): `Y = 1` if the patient **died in-hospital** (`hospital_expire_flag`)
- `A_label`: human-readable description of the `A` definition
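
To make the default definitions concrete, here is a minimal sketch of how `A` and `Y` could be derived from an admissions table. It is not the cohort builder's actual code: the `toy` DataFrame is made up for illustration, and treating a missing `admission_type` as "not emergency" is an assumption.

```python
import pandas as pd

# Toy admissions table (hypothetical rows; labels mimic MIMIC-IV admission_type values)
toy = pd.DataFrame({
    "admission_type": ["EW EMER.", "DIRECT EMER.", "ELECTIVE", "URGENT", None],
    "hospital_expire_flag": [0, 1, 0, 0, 0],
})

# A = 1 when the label contains "EMER"; missing labels are coded 0 (assumption)
toy["A"] = toy["admission_type"].str.contains("EMER", na=False).astype(int)

# Y = 1 for in-hospital death
toy["Y"] = toy["hospital_expire_flag"].astype(int)

print(toy[["admission_type", "A", "Y"]])
```

Note that a substring match also picks up any other label containing "EMER", so it is worth tabulating `admission_type` against `A` in the real cohort to confirm the coding.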


In [None]:
# Statsmodels for regression (logit/ols); installed via requirements.txt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load cohort (parquet preferred, fallback to csv)
parquet_path = PROC_DIR / "cohort_L08_ps_ipw.parquet"
csv_path = PROC_DIR / "cohort_L08_ps_ipw.csv"

if parquet_path.exists():
    df = pd.read_parquet(parquet_path)
elif csv_path.exists():
    df = pd.read_csv(csv_path)
else:
    raise FileNotFoundError(
        "Cohort file not found. Expected one of:\n"
        f" - {parquet_path}\n - {csv_path}\n"
        "Build it via:\n"
        "  - Colab: run the build cell above\n"
        "  - Local: python data/build_processed_extracts_demo.py --exposure-mode admission_type\n"
    )

df.head()

### Choose variables
We will define:
- Treatment/exposure `A` (already in the cohort)
- Outcome `Y` (already in the cohort)
- Baseline covariates/confounders `L_list`

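For this teaching cohort, you can use `age`, `sex`, and ICU length of stay (`los`). Note that `los` accrues after admission, so it is not strictly a baseline covariate; treat it here as a stand-in for a measured confounder rather than a variable you would adjust for in a real analysis.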
To keep modeling simple, we will create `sex_male` as a 0/1 indicator.
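
Before modeling, it can help to quantify how unbalanced a candidate covariate is across treatment arms. The sketch below computes a standardized mean difference (SMD) with an optional weight argument, so the same helper can be reused after weighting. The function name `smd` and the synthetic data are illustrative assumptions, and the weights use a known propensity rather than a fitted one.

```python
import numpy as np


def smd(x, a, w=None):
    """Standardized mean difference of x between a == 1 and a == 0 (optionally weighted)."""
    x = np.asarray(x, dtype=float)
    a = np.asarray(a)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    t, c = a == 1, a == 0
    m1 = np.average(x[t], weights=w[t])
    m0 = np.average(x[c], weights=w[c])
    v1 = np.average((x[t] - m1) ** 2, weights=w[t])
    v0 = np.average((x[c] - m0) ** 2, weights=w[c])
    # Pooled standard deviation uses the average of the two arm variances
    return (m1 - m0) / np.sqrt((v1 + v0) / 2)


# Synthetic illustration: older patients are more likely to be treated
rng = np.random.default_rng(0)
age = rng.normal(65, 15, size=5000)
p_treat = 1 / (1 + np.exp(-(age - 65) / 15))
a = rng.binomial(1, p_treat)

smd_crude = smd(age, a)

# Inverse probability weights from the (here known) propensity
w = np.where(a == 1, 1 / p_treat, 1 / (1 - p_treat))
smd_weighted = smd(age, a, w)

print(f"crude SMD = {smd_crude:.2f}, weighted SMD = {smd_weighted:.2f}")
```

A common rule of thumb treats an absolute SMD above 0.1 as meaningful imbalance, although the threshold is a convention rather than a test.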

In [None]:
# Variables (match the built cohort)
A = "A"   # treatment column
Y = "Y"   # outcome (0/1): in-hospital death

# Create simple numeric covariates for modeling
if "sex_male" not in df.columns:
    df["sex_male"] = (df["sex"].astype(str).str.upper() == "M").astype(int)

# Baseline covariates/confounders
L_list = ["age", "sex_male", "los"]

# Quick sanity check: required columns absent from df (expect an empty list)
missing = [c for c in [A, Y, *L_list] if c not in df.columns]
missing

## Part A — Propensity score model


In [None]:
# Drop rows with missing values in analysis variables
analysis_vars = [A, Y] + L_list
d = df[analysis_vars].dropna().copy()

# TODO: specify the PS model. Keep it simple and explicit.
# Example: "A ~ age + sex + diabetes + ... "
ps_formula = A + " ~ " + " + ".join(L_list)
ps_model = smf.logit(ps_formula, data=d).fit(disp=False)

d["ps"] = ps_model.predict(d)

# Stabilized weights: Pr(A=a) / Pr(A=a | L)
pA = d[A].mean()  # marginal probability of treatment
d["sw"] = np.where(d[A] == 1, pA / d["ps"], (1 - pA) / (1 - d["ps"]))

d[["ps","sw"]].describe(percentiles=[0.01,0.05,0.95,0.99])

### TODO A1 — Weight diagnostics + truncation
Plot the weights. Choose a truncation rule (e.g., 1st/99th percentile) and create `sw_trunc`.
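
Alongside the histogram, a few numeric summaries make it easier to judge a truncation rule. The sketch below reports the mean, maximum, and Kish effective sample size of a weight vector before and after percentile truncation. The helper names `weight_summary` and `truncate` are hypothetical, and the weights are synthetic (lognormal with mean 1), not taken from the cohort.

```python
import numpy as np


def weight_summary(w):
    """Mean, max, and Kish effective sample size of a weight vector."""
    w = np.asarray(w, dtype=float)
    ess = w.sum() ** 2 / np.sum(w ** 2)
    return {"n": int(w.size), "mean": float(w.mean()), "max": float(w.max()), "ess": float(ess)}


def truncate(w, lower_q=0.01, upper_q=0.99):
    """Clip weights to the given lower and upper quantiles."""
    w = np.asarray(w, dtype=float)
    q_lo, q_hi = np.quantile(w, [lower_q, upper_q])
    return np.clip(w, q_lo, q_hi)


# Synthetic stabilized weights with a right tail (illustration only)
rng = np.random.default_rng(1)
sw_demo = rng.lognormal(mean=-0.125, sigma=0.5, size=2000)
sw_demo_trunc = truncate(sw_demo)

before = weight_summary(sw_demo)
after = weight_summary(sw_demo_trunc)
print("before:", before)
print("after: ", after)
```

Truncation trades variance for bias: the effective sample size rises as extreme weights are clipped, but the truncated weights no longer exactly balance the confounders, so it is worth reporting the estimate under more than one truncation rule.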


In [None]:
plt.figure()
plt.hist(d["sw"], bins=60)
plt.xlabel("stabilized weight")
plt.ylabel("count")
plt.title("Weights (before truncation)")
plt.show()

# TODO: truncation
lo, hi = d["sw"].quantile([0.01, 0.99]).to_list()
d["sw_trunc"] = d["sw"].clip(lower=lo, upper=hi)

plt.figure()
plt.hist(d["sw_trunc"], bins=60)
plt.xlabel("truncated weight")
plt.ylabel("count")
plt.title("Weights (after truncation)")
plt.show()

(lo, hi)

## Part B — IPW estimate of effect
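We compute the weighted mean of `Y` within each treatment arm. For binary `Y`, the difference between the two weighted means estimates the marginal risk difference, provided that exchangeability, positivity, and consistency hold and the propensity model is correctly specified.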
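
To see what the weighting buys, the sketch below simulates a small confounded setting with a known answer and compares the crude and IPW risk differences. Everything in it is an illustrative assumption: a single binary confounder, a known propensity score (so no model is fitted), and a true risk difference of 0.10 in both strata. It does not use the MIMIC cohort.

```python
import numpy as np


def weighted_mean(x, w):
    """Weighted mean of x with weights w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.sum(w * x) / np.sum(w)


rng = np.random.default_rng(0)
n = 200_000

# One binary confounder that raises both the chance of treatment and the risk of the outcome
l_sim = rng.binomial(1, 0.5, size=n)
ps_sim = np.where(l_sim == 1, 0.8, 0.2)        # true Pr(A=1 | L)
a_sim = rng.binomial(1, ps_sim)
risk = 0.10 + 0.30 * l_sim + 0.10 * a_sim      # true risk difference is 0.10
y_sim = rng.binomial(1, risk)

# Stabilized weights from the known propensity
p_a = a_sim.mean()
sw_sim = np.where(a_sim == 1, p_a / ps_sim, (1 - p_a) / (1 - ps_sim))

treated, control = a_sim == 1, a_sim == 0
crude = y_sim[treated].mean() - y_sim[control].mean()
ipw = (weighted_mean(y_sim[treated], sw_sim[treated])
       - weighted_mean(y_sim[control], sw_sim[control]))

print(f"crude RD = {crude:.3f}, IPW RD = {ipw:.3f}, truth = 0.100")
```

In this setup the crude contrast mixes the treatment effect with the effect of the confounder, while the weighted contrast should land near the true value. With real data the truth is unknown, so agreement between estimates is not evidence that confounding has been removed.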


In [None]:
def wmean(x, w):
    """Weighted mean of x with weights w."""
    x = np.asarray(x)
    w = np.asarray(w)
    return np.sum(w * x) / np.sum(w)

mu1 = wmean(d.loc[d[A]==1, Y], d.loc[d[A]==1, "sw_trunc"])
mu0 = wmean(d.loc[d[A]==0, Y], d.loc[d[A]==0, "sw_trunc"])
ate = mu1 - mu0
{"mu1": float(mu1), "mu0": float(mu0), "ATE (mu1-mu0)": float(ate)}

### TODO B1 — Sensitivity: compare to crude estimate


In [None]:
mu1_crude = d.loc[d[A]==1, Y].mean()
mu0_crude = d.loc[d[A]==0, Y].mean()
ate_crude = mu1_crude - mu0_crude
{"crude": float(ate_crude), "ipw": float(ate)}

## Reflection
1) What does a large difference between the crude and IPW estimates imply?
2) Which assumption is most vulnerable in EHR data: exchangeability, positivity, or consistency?
