# PHS564 — Lecture 07 (Student)
## Selection bias (structure and adjustment)

### Learning goals
- Recognize selection bias as conditioning on a collider (selection variable).
- Identify common epidemiologic selection mechanisms:
  - Berkson’s bias, incidence-prevalence bias, differential loss to follow-up.
- Use IP weighting (or standardization) to adjust for censoring/selection under assumptions.
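
The first learning goal can be illustrated with a small simulation in the spirit of Berkson's bias. This is a minimal sketch, not part of the course code: the prevalences, the selection probabilities, and the names `exposure`, `disease`, `selected`, and `risk_difference` are assumptions made up for illustration. Two independent binary variables both raise the probability of selection, and the risk difference is then computed in the full population and among the selected only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for this illustration only
n = 200_000

# Two independent binary variables (hypothetical prevalences)
exposure = rng.binomial(1, 0.3, size=n)
disease = rng.binomial(1, 0.2, size=n)

# Selection (e.g. hospitalization) is a common effect of both: a collider
p_select = 0.05 + 0.40 * exposure + 0.40 * disease
selected = rng.binomial(1, p_select).astype(bool)

def risk_difference(y, a):
    """Pr(y=1 | a=1) - Pr(y=1 | a=0) for 0/1 arrays."""
    return y[a == 1].mean() - y[a == 0].mean()

rd_full = risk_difference(disease, exposure)
rd_selected = risk_difference(disease[selected], exposure[selected])

print(f"RD, full population: {rd_full:+.3f}")
print(f"RD, selected only:   {rd_selected:+.3f}")
```

In the full population the two variables are unassociated by construction, so the first RD should be close to zero. Restricting to the selected induces a clearly non-zero association even though neither variable affects the other.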

### Required reading
- Hernán & Robins, Chapter 8. https://miguelhernan.org/whatifbook


### Setup

This notebook is designed to run **locally** or in **Google Colab**.

**Colab workflow (recommended):**
1) Clone the course repo (ask the instructor for the GitHub URL).
2) Install requirements.
3) Run the notebook top-to-bottom.

> If you opened this notebook directly from GitHub in Colab (without cloning),
> relative paths will not work. Clone first.


In [None]:
from __future__ import annotations

import sys
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reproducibility
RNG = np.random.default_rng(564)

# Locate repo root (works when running from lectures/Lxx.../student or /instructor)
THIS_DIR = Path.cwd()
REPO_ROOT = THIS_DIR
for _ in range(4):
    if (REPO_ROOT / "requirements.txt").exists() or (REPO_ROOT / "README.md").exists():
        break
    REPO_ROOT = REPO_ROOT.parent

DATA_DIR = REPO_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
PROC_DIR = DATA_DIR / "processed"

print("Working directory:", THIS_DIR)
print("Repo root:", REPO_ROOT)
print("Processed data dir exists:", PROC_DIR.exists())


## Part A — Selection bias via informative censoring
We simulate a censoring indicator C (C=1 means the outcome is missing) whose probability depends on prognosis L and on treatment A, so censoring is informative.
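
For reference, the data-generating model used in the simulation cell that follows can be written out explicitly. The coefficients are copied from that cell; the notation is only one possible way to write it. Treatment is randomized with Pr(A=1) = 0.5 and L is standard normal.

```latex
\begin{aligned}
\operatorname{logit}\Pr(Y=1 \mid A, L) &= -1.2 + 0.6\,A + 1.0\,L \\
\operatorname{logit}\Pr(C=1 \mid A, L) &= -1.5 + 1.2\,L - 0.5\,A
\end{aligned}
```

Because L raises both the outcome risk and the censoring probability, the uncensored sample is depleted of high-L individuals, and more so in the untreated arm, where the censoring probability is higher at every value of L.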


In [None]:
# Statsmodels for regression (logit/ols); installed via requirements.txt
import statsmodels.api as sm
import statsmodels.formula.api as smf

n = 5000  # sample size
L = RNG.normal(size=n)  # prognosis
A = RNG.binomial(1, 0.5, size=n)  # randomized for clarity

# Outcome: risk depends on treatment A and prognosis L (logistic model)
pY = 1 / (1 + np.exp(-(-1.2 + 0.6*A + 1.0*L)))
Y = RNG.binomial(1, pY)

# Censoring depends on L and A (informative); C=1 means censored/missing
pC = 1 / (1 + np.exp(-(-1.5 + 1.2*L - 0.5*A)))
C = RNG.binomial(1, pC)

df = pd.DataFrame({"L": L, "A": A, "Y": Y, "C": C})
# Y_obs is what the analyst actually sees: NaN whenever the outcome is censored
df["Y_obs"] = df["Y"].where(df["C"] == 0, np.nan)
df[["A", "L", "Y", "C", "Y_obs"]].head()

### TODO A1 — Complete-case (biased) RD
Compute the risk difference (RD), Pr(Y=1 | A=1) - Pr(Y=1 | A=0), using only uncensored observations (C=0).
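
As a hint on the mechanics only, here is a minimal sketch of a complete-case RD on a tiny made-up table. The DataFrame `toy` and its values are hypothetical and unrelated to the simulated `df`; the same pattern may or may not be the one you prefer for the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Y_obs is NaN where the outcome is censored
toy = pd.DataFrame({
    "A":     [1, 1, 1, 1, 0, 0, 0, 0],
    "Y_obs": [1, 1, 0, np.nan, 1, 0, 0, np.nan],
})

# Keep complete cases, then compare the observed risk between arms
complete = toy.dropna(subset=["Y_obs"])
risk_by_arm = complete.groupby("A")["Y_obs"].mean()
rd_toy = risk_by_arm.loc[1] - risk_by_arm.loc[0]

print(risk_by_arm)
print("toy complete-case RD:", rd_toy)
```

Here the observed risks are 2/3 among the treated and 1/3 among the untreated, so the toy RD is 1/3.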


In [None]:
cc = df[df["C"]==0].copy()
rd_cc = None  # TODO
rd_cc

## Part B — Inverse probability of censoring weights (IPCW)
We model the probability of remaining uncensored, Pr(C=0 | A, L), and weight each complete case by the inverse of that probability.
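
Written as formulas, the unstabilized and stabilized censoring weights look like this. The symbols $W^{C}$ and $SW^{C}$ are notation chosen for this note; the stabilized form matches the numerator and denominator used in the code for TODO B1.

```latex
W^{C} = \frac{1}{\Pr(C=0 \mid A, L)},
\qquad
SW^{C} = \frac{\Pr(C=0 \mid A)}{\Pr(C=0 \mid A, L)}
```

Both weights are applied only to uncensored individuals; censored individuals contribute no outcome and effectively receive weight zero.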


### TODO B1 — Fit censoring model
Fit a logistic regression for being observed (C=0).


In [None]:
df["obs"] = (df["C"]==0).astype(int)

# TODO: censoring/observation model. Start simple.
obs_model = smf.logit("obs ~ A + L", data=df).fit(disp=False)
df["p_obs"] = obs_model.predict(df)

# Stabilized weights: numerator Pr(obs|A) / denominator Pr(obs|A,L)
# (You may also use unconditional numerator Pr(obs) if preferred.)
p_obs_num = df.groupby("A")["obs"].transform("mean")
df["sw_c"] = p_obs_num / df["p_obs"]

# Use weights only for observed cases
df_obs = df[df["obs"]==1].copy()
df_obs["w"] = df_obs["sw_c"]
df_obs[["w"]].describe(percentiles=[0.01,0.05,0.95,0.99])

### TODO B2 — Weighted RD after IPCW


In [None]:
def wmean(x, w):
    return np.sum(w * x) / np.sum(w)

# Use Y_obs (identical to Y among observed rows) so only observable data enter
treated = df_obs[df_obs["A"] == 1]
untreated = df_obs[df_obs["A"] == 0]
mu1 = wmean(treated["Y_obs"].to_numpy(), treated["w"].to_numpy())
mu0 = wmean(untreated["Y_obs"].to_numpy(), untreated["w"].to_numpy())
rd_ipcw = mu1 - mu0
{"rd_cc": rd_cc, "rd_ipcw": rd_ipcw}

### Diagnostics
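
A numeric summary can complement the histogram. The following is a minimal sketch under stated assumptions: `summarize_weights` is a hypothetical helper, `toy_w` is made-up data, and the Kish effective sample size is an extra diagnostic added here for illustration, not something the lecture requires. With a well-specified censoring model, stabilized weights among the uncensored should average close to 1 within each arm, and very large maximum weights suggest near-violations of positivity.

```python
import numpy as np

def summarize_weights(w):
    """Basic summaries for a vector of (stabilized) weights."""
    w = np.asarray(w, dtype=float)
    return {
        "mean": w.mean(),
        "min": w.min(),
        "max": w.max(),
        # Kish effective sample size: (sum w)^2 / sum(w^2)
        "ess": w.sum() ** 2 / np.sum(w ** 2),
    }

# Hypothetical weights, for illustration only
toy_w = np.array([0.8, 0.9, 1.0, 1.1, 1.2])
summary = summarize_weights(toy_w)
print(summary)
```

In the notebook, the same helper could be applied to `df_obs["w"]`, overall or separately by treatment arm.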


In [None]:
plt.figure()
plt.hist(df_obs["w"], bins=60)
plt.xlabel("IPCW weight")
plt.ylabel("count")
plt.title("Censoring weights (observed only)")
plt.show()

## Reflection
1) Under what conditions does complete-case analysis give an unbiased estimate of the RD?
2) What assumptions are needed for IPCW to remove selection bias?
