# PHS564 — Lecture 10 (Student)
## Causal survival analysis: time-to-event outcomes, censoring, discrete-time hazards (MIMIC-IV Demo)

### Learning goals
- Distinguish **risk** vs **hazard** vs **survival**; know what effect measure you are estimating.
- Convert a cohort into **person-period** (discrete time) data and fit pooled logistic hazards.
- Handle **right censoring** with **IPCW**; understand when censoring is “informative”.
- Produce causal survival curves under a point treatment (and interpret assumptions).
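
To make the first goal concrete, here is a minimal numeric sketch of how the three quantities relate in discrete time. The hazard values are made up for illustration and do not come from the course data.

```python
import numpy as np

# Hypothetical discrete-time hazards for intervals t = 1, 2, 3:
# h(t) = Pr(event in interval t | event-free at the start of interval t)
hazard = np.array([0.10, 0.20, 0.30])

# Survival: probability of remaining event-free through interval t
survival = np.cumprod(1 - hazard)

# Risk (cumulative incidence): probability of the event by the end of interval t
risk = 1 - survival

print("hazard  :", hazard)
print("survival:", survival.round(3))
print("risk    :", risk.round(3))
```

The hazard is conditional on having survived to the start of the interval, while risk and survival are cumulative and refer to everyone at baseline. This is why a hazard contrast and a risk contrast can tell different stories about the same data.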

### Required reading
- Hernán & Robins, *Causal Inference: What If*: sections on censoring and survival (target trial chapters/sections as applicable).


In [None]:
# Colab bootstrap (run this first if you opened from a Colab badge)
# - Clones the repo into /content/PHS564 (if needed)
# - Installs requirements
# - Adds repo to sys.path

from __future__ import annotations

import os
import sys
import subprocess
from pathlib import Path


def _in_colab() -> bool:
    return "google.colab" in sys.modules


if _in_colab():
    REPO_URL = "https://github.com/vafaei-ar/PHS564.git"
    TARGET_DIR = Path("/content/PHS564")

    if not (TARGET_DIR / "requirements.txt").exists():
        print("Cloning course repo into Colab runtime...")
        subprocess.run(["git", "clone", "--depth", "1", REPO_URL, str(TARGET_DIR)], check=True)

    os.chdir(TARGET_DIR)

    print("Installing requirements...")
    subprocess.run([sys.executable, "-m", "pip", "-q", "install", "-r", "requirements.txt"], check=True)

    if str(TARGET_DIR) not in sys.path:
        sys.path.insert(0, str(TARGET_DIR))

    print("✓ Colab setup complete. Now run the rest of the notebook.")
else:
    print("Not running in Colab; skipping Colab bootstrap.")


### Setup

This notebook is designed to run **locally** or in **Google Colab**.

**Colab workflow (recommended):**
1) Run the Colab bootstrap cell above. It clones the course repo into `/content/PHS564` and installs the requirements.
2) Run the rest of the notebook top-to-bottom.

> If you opened this notebook directly from GitHub in Colab and skipped the bootstrap cell,
> relative paths will not work. Run the bootstrap cell first.


In [None]:
from __future__ import annotations

import sys
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reproducibility
RNG = np.random.default_rng(564)

# Locate repo root (works when running from lectures/Lxx.../student or /instructor)
THIS_DIR = Path.cwd()
REPO_ROOT = THIS_DIR
for _ in range(4):
    if (REPO_ROOT / "requirements.txt").exists() or (REPO_ROOT / "README.md").exists():
        break
    REPO_ROOT = REPO_ROOT.parent
else:
    # No marker file found within four levels; fall back to the working directory
    REPO_ROOT = THIS_DIR

DATA_DIR = REPO_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
PROC_DIR = DATA_DIR / "processed"

print("Working directory:", THIS_DIR)
print("Repo root:", REPO_ROOT)
print("Processed data dir exists:", PROC_DIR.exists())


### Download MIMIC-IV Demo data (Colab users: run this once)

If you are running this notebook in **Google Colab**, run the next cell to download the **raw** MIMIC-IV Demo files into `data/raw/`. **Local users** can skip it if the raw data is already present.

- This is optional for most of this notebook. The raw data is useful for exploration, or if the instructor asks you to work with it.
- The analysis below uses an **instructor-provided processed cohort extract** in `data/processed/`.

In [None]:
# Download raw MIMIC-IV Demo data (safe to re-run)
# NOTE: This downloads raw CSVs to data/raw/. It does NOT create the processed cohort extract.

# Add repo root to sys.path so the download helper can be imported
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

try:
    from data.download_data import download_mimic_demo

    mimic_dir = RAW_DIR / "mimic-iv-demo-2.2"
    if mimic_dir.exists() and any(mimic_dir.rglob("*.csv.gz")):
        print("✓ Raw MIMIC-IV Demo already present in data/raw/. Skipping download.")
    else:
        print("Downloading raw MIMIC-IV Demo (v2.2) to data/raw/...")
        print("(This may take a few minutes. The data is ~7 MB compressed.)")
        download_mimic_demo(out_dir=RAW_DIR, version="2.2", method="python")
except Exception as e:
    print("Could not download raw MIMIC-IV Demo in this environment.")
    print("Error:", e)
    print("If you already have the instructor-provided cohort extract in data/processed/, you can continue.")

## Data
This lecture uses an **instructor-provided processed cohort extract** derived from MIMIC-IV Demo.

Expected file: `data/processed/cohort_L10_survival.parquet` (or `.csv`).

If this file is missing, ask the instructor for the course data bundle.
(The download cell above only downloads raw MIMIC-IV Demo to `data/raw/`.)

Assumed columns (you may adapt to the actual extract):
- `A` treatment at baseline
- `T` follow-up time (in discrete intervals)
- `E` event indicator (1=event, 0=censored)
- baseline covariates (e.g., age, sex, severity)

We will fit a **discrete-time hazard** model via logistic regression on a person-period dataset.
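
Subjects with `E = 0` are right-censored, and the learning goals call for handling this with IPCW. The following is a rough sketch of the weight mechanics only. The toy data and the column `C_t` (1 if the subject is censored during interval `t`) are hypothetical, and the probability of remaining uncensored is estimated marginally by interval. In a real analysis the denominator would usually come from a model that conditions on covariates, because censoring that depends only on time is rarely plausible.

```python
import numpy as np
import pandas as pd

# Hypothetical person-period data; C_t = 1 if the subject is censored during interval t
toy = pd.DataFrame({
    "id":  [1, 1, 1, 2, 2, 3, 3, 3, 4],
    "t":   [1, 2, 3, 1, 2, 1, 2, 3, 1],
    "C_t": [0, 0, 0, 0, 1, 0, 0, 0, 1],
})

# Pr(uncensored in interval t | still in follow-up at t), estimated per interval
toy["p_uncens"] = 1 - toy.groupby("t")["C_t"].transform("mean")

# Cumulative probability of remaining uncensored through interval t, within subject
toy = toy.sort_values(["id", "t"])
toy["cum_uncens"] = toy.groupby("id")["p_uncens"].cumprod()

# IPC weight: inverse of that probability on uncensored rows, zero on censored rows
toy["ipcw"] = np.where(toy["C_t"] == 0, 1 / toy["cum_uncens"], 0.0)

print(toy)
```

Rows that remain under follow-up are weighted up to stand in for comparable subjects who were censored. These weights could then be passed to a weighted hazard model fit on the uncensored rows.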


In [None]:
# Statsmodels for regression (logit/ols); installed via requirements.txt
import statsmodels.api as sm
import statsmodels.formula.api as smf

parquet_path = PROC_DIR / "cohort_L10_survival.parquet"
csv_path = PROC_DIR / "cohort_L10_survival.csv"

if parquet_path.exists():
    df = pd.read_parquet(parquet_path)
elif csv_path.exists():
    df = pd.read_csv(csv_path)
else:
    raise FileNotFoundError(
        f"Missing cohort file for L10 in {PROC_DIR}. "
        "Ask the instructor for the course data bundle; "
        "data/download_data.py only downloads the raw MIMIC-IV Demo files."
    )

df.head()

### TODO A1 — Create person-period dataset
If your extract is already in long format (one row per person-period), skip this.
Otherwise, expand each subject into rows `t=1..T` and set `E_t = 1` only in the final row of subjects who had the event (`E = 1`). Censored subjects (`E = 0`) have `E_t = 0` in every row.
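
One possible way to do the expansion is sketched below on a tiny made-up table. The values are hypothetical, and the column names `id`, `A`, `T`, `E` simply follow the assumed layout above, so adapt them to the actual extract.

```python
import pandas as pd

# Hypothetical wide data: one row per subject
wide = pd.DataFrame({
    "id": [1, 2, 3],
    "A":  [1, 0, 1],
    "T":  [3, 2, 1],   # number of intervals of follow-up
    "E":  [1, 0, 1],   # 1 = event at T, 0 = censored at T
})

# Repeat each subject's row T times, then number the intervals 1..T within subject
long_toy = wide.loc[wide.index.repeat(wide["T"])].reset_index(drop=True)
long_toy["t"] = long_toy.groupby("id").cumcount() + 1

# E_t = 1 only in the last interval of subjects who had the event
long_toy["E_t"] = ((long_toy["t"] == long_toy["T"]) & (long_toy["E"] == 1)).astype(int)

print(long_toy[["id", "t", "A", "E_t"]])
```

Subject 2 is censored after two intervals and therefore contributes two rows with `E_t = 0`, which is how censored follow-up still informs the hazard estimates.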


In [None]:
# Expected: either already long with columns ['id','t','A','E_t', ...]
# or wide with ['id','T','E', ...]
# TODO: adapt based on df columns.

long = df.copy()  # placeholder
long.head()

## Part B — Discrete-time hazard model
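We model the discrete-time hazard h(t) = Pr(E_t=1 | E_{t-1}=0, A, L, t): the probability of the event in interval t among subjects still event-free at the start of that interval, given treatment A and baseline covariates L.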
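
One common way to write the pooled logistic model is shown below. It assumes a separate intercept $\alpha_t$ for each interval and no treatment-by-time interaction. Both are modeling choices for illustration, not requirements of the method.

$$
\operatorname{logit}\,\Pr(E_t = 1 \mid E_{t-1} = 0, A, L, t) = \alpha_t + \beta A + \gamma^\top L
$$

Under this specification, $\exp(\beta)$ is a conditional hazard odds ratio that is assumed constant over time. Adding a product term between $A$ and $t$ would relax that assumption.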


In [None]:
# TODO: set outcome and covariates
# Recommended structure:
#   hazard_formula = "E_t ~ A + age + sex + C(t)"  (with t as categorical or spline)
hazard_formula = "E_t ~ A + t"  # TODO
# Fit model
haz_model = smf.logit(hazard_formula, data=long).fit(disp=False)
haz_model.params.head()

### TODO B1 — Predict survival curves under A=1 and A=0 (g-formula)
Compute survival S(t) = Π_{k<=t} (1 - h(k)).
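
For the g-formula, the product is taken within each subject and the resulting survival curves are then averaged over the baseline population. The small sketch below uses made-up hazards for two hypothetical subjects to show that this differs from cumulating the average hazard whenever hazards vary across subjects.

```python
import numpy as np

# Hypothetical predicted hazards under a fixed treatment value:
# rows = subjects, columns = intervals t = 1, 2, 3
haz = np.array([
    [0.1, 0.1, 0.1],   # low-risk subject
    [0.5, 0.5, 0.5],   # high-risk subject
])

# Standardized survival: per-subject S_i(t), then average over subjects
surv_individual = np.cumprod(1 - haz, axis=1)
surv_standardized = surv_individual.mean(axis=0)

# For comparison: cumulating the average hazard gives a different curve
surv_from_mean_hazard = np.cumprod(1 - haz.mean(axis=0))

print("standardized     :", surv_standardized.round(3))
print("from mean hazard :", surv_from_mean_hazard.round(3))
```

The two curves agree at t = 1 and then diverge, because high-risk subjects leave the risk set faster and the average of products is not the product of averages.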


In [None]:
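def survival_curve(data_long: pd.DataFrame, A_value: int, t_col: str = "t", id_col: str = "id") -> pd.DataFrame:
    # data_long contains one row per person-period and baseline covariates.
    # Standardize over the baseline population: give every subject a row for every
    # interval (not only the intervals they were observed in), then set A for everyone.
    base = data_long.drop_duplicates(id_col).drop(columns=[t_col])
    times = pd.DataFrame({t_col: np.sort(data_long[t_col].unique())})
    d = base.merge(times, how="cross")
    d["A"] = A_value
    d["haz"] = haz_model.predict(d)
    # per-subject survival S_i(t) = prod_{k<=t} (1 - h_i(k)), then average over subjects
    d = d.sort_values([id_col, t_col])
    d["surv"] = (1 - d["haz"]).groupby(d[id_col]).cumprod()
    out = d.groupby(t_col)[["haz", "surv"]].mean().sort_index()
    return pd.DataFrame({"t": out.index, "haz": out["haz"].values, "surv": out["surv"].values})

# TODO: adapt t and id column names if needed
curve1 = survival_curve(long, 1)
curve0 = survival_curve(long, 0)
curve1.head(), curve0.head()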

In [None]:
plt.figure()
plt.step(curve0["t"], curve0["surv"], where="post", label="A=0")
plt.step(curve1["t"], curve1["surv"], where="post", label="A=1")
plt.xlabel("time interval")
plt.ylabel("Survival S(t)")
plt.legend()
plt.title("Estimated survival curves (discrete-time)")
plt.show()

## Reflection
1) Why is time-to-event analysis central to target trial emulation?
2) What is the difference between hazard and risk, and why does it matter?
