# Lecture 07: Selection Bias and IPCW

[!["Open In Colab"](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/<ORG>/<REPO>/blob/main/lectures/L07_Selection_Bias/L07_Selection_Bias_student.ipynb)

## Learning Objectives
1. Understand selection bias as **collider bias**.
2. Implement **Inverse Probability of Censoring Weighting (IPCW)**.
3. Diagnose issues with weights (extreme values).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from phs564_ci.datasets import load_data
from phs564_ci.estimators.ipw import ipcw_simple

# Load data with censoring
df = load_data("l07_selection_bias.csv")
df.head()

--- 
### 1. The Problem: Crude Analysis of Observed Cases
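In this dataset, many people are censored (`C=1`), and we only observe `Y` for those with `C=0`. Restricting the analysis to `C=0` conditions on censoring, which biases the crude estimate if censoring depends on both the treatment and causes of the outcome.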

In [None]:
# Proportion of people censored
censoring_rate = df['C'].mean()
print(f"Censoring Rate: {censoring_rate:.1%}")

# Crude RD in observed cases only
observed = df[df['C'] == 0]
crude_rd = observed[observed['A']==1]['Y'].mean() - observed[observed['A']==0]['Y'].mean()
print(f"Crude RD (Observed Only): {crude_rd:.3f}")

--- 
### 2. The Solution: IPCW
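We weight each uncensored subject by the inverse of their estimated probability of remaining uncensored given treatment `A` and covariate `L`, so that each one also stands in for similar subjects who were censored.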

In [None]:
# Calculate IPCW weights using our helper
# Formula: W = 1 / Pr(C=0 | A, L)
df['weights'] = ipcw_simple(df, 'C', ['A', 'L'])

print("Sample of Weights:")
print(df[['A', 'L', 'C', 'weights']].head())
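
The internals of `ipcw_simple` are not shown in this notebook. As a rough illustration of the formula $W = 1 / \Pr(C=0 \mid A, L)$, the sketch below fits a logistic regression for remaining uncensored and inverts the predicted probabilities. The function `ipcw_sketch`, the column `uncensored`, and the synthetic data `toy` are all hypothetical names introduced here; the real helper may differ.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def ipcw_sketch(data, censor_col, covariates):
    """Hypothetical stand-in for ipcw_simple: 1 / Pr(C=0 | covariates)."""
    work = data.copy()
    # Model the probability of remaining uncensored (C = 0).
    work["uncensored"] = 1 - work[censor_col]
    formula = "uncensored ~ " + " + ".join(covariates)
    model = smf.logit(formula, data=work).fit(disp=0)
    # Predicted Pr(C=0 | covariates) for every row, then invert.
    p_uncensored = model.predict(work)
    return 1.0 / p_uncensored


# Synthetic data (an assumption, not the lecture dataset):
# censoring depends on both treatment A and covariate L.
rng = np.random.default_rng(0)
n = 5000
L = rng.binomial(1, 0.4, n)
A = rng.binomial(1, 0.3 + 0.3 * L)
p_cens = 1 / (1 + np.exp(-(-1.5 + 0.8 * A + 1.0 * L)))
C = rng.binomial(1, p_cens)

toy = pd.DataFrame({"A": A, "L": L, "C": C})
toy["weights"] = ipcw_sketch(toy, "C", ["A", "L"])
print(toy.groupby(["A", "L"])["weights"].first())
```

Only the weights of uncensored rows are used downstream. If the censoring model is reasonable, their sum should be close to the size of the full sample.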

--- 
### 🖼️ Figure Generation: Weight Diagnostics (Slide 08)
Let's look at the distribution of our weights. A long right tail signals extreme weights, which let a few subjects dominate the estimate.

In [None]:
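import os
# Create the output directory first; savefig raises an error if it is missing
os.makedirs("figures/L07", exist_ok=True)
plt.figure(figsize=(10, 6))
# Only uncensored subjects contribute to the weighted analysis
sns.histplot(df[df['C']==0]['weights'], bins=30, kde=True)
plt.title("Distribution of IPCW Weights among Observed Subjects")
plt.xlabel("Weight")
plt.savefig("figures/L07/ipcw_weights.png")
plt.show()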
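
A histogram can be complemented with a few numbers. The sketch below is one possible set of diagnostics, not part of the course package: it reports the mean, maximum, 99th percentile, and the Kish effective sample size, and shows truncation at an upper percentile as one common way to tame extreme weights. The names `summarize_weights` and `truncate_weights`, and the choice of the 99th percentile, are assumptions made for illustration.

```python
import numpy as np


def summarize_weights(weights):
    """Return simple numeric diagnostics for a vector of weights."""
    w = np.asarray(weights, dtype=float)
    return {
        "mean": w.mean(),
        "max": w.max(),
        "p99": np.percentile(w, 99),
        # Kish effective sample size: equals len(w) when all weights are equal
        "ess": w.sum() ** 2 / (w ** 2).sum(),
    }


def truncate_weights(weights, upper_pct=99):
    """Cap weights at the given upper percentile (trades bias for variance)."""
    w = np.asarray(weights, dtype=float)
    cap = np.percentile(w, upper_pct)
    return np.minimum(w, cap)


# Tiny illustrative example with one extreme weight
example = np.array([1.0, 1.0, 1.0, 1.0, 10.0])
stats = summarize_weights(example)
truncated = truncate_weights(example)
print(stats)
print(truncated)
```

In this notebook the same functions could be called on `df.loc[df['C'] == 0, 'weights']`. An effective sample size far below the number of uncensored subjects suggests that a handful of weights dominate.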

--- 
### 3. Estimating the Adjusted Effect
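Now we calculate the weighted mean of $Y$ in each treatment arm among uncensored subjects, then take the difference to get the adjusted risk difference.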

In [None]:
def weighted_mean(data, treatment_val):
    """IPCW-weighted mean of Y among uncensored subjects with A == treatment_val."""
    subset = data[(data['A'] == treatment_val) & (data['C'] == 0)]
    return (subset['Y'] * subset['weights']).sum() / subset['weights'].sum()

risk_a1 = weighted_mean(df, 1)
risk_a0 = weighted_mean(df, 0)

print(f"IPCW Adjusted RD: {risk_a1 - risk_a0:.3f}")
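
To see why the crude and weighted estimates can disagree, it helps to simulate data where the truth is known. The sketch below uses made-up parameters (a true risk difference of 0.10, with censoring driven by both `A` and `L`) and estimates $\Pr(C=0 \mid A, L)$ nonparametrically from stratum proportions, which only works because both variables are binary here. None of these values come from the lecture dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200_000

# Assumed data-generating process (illustrative only)
L = rng.binomial(1, 0.5, n)                    # cause of both Y and C
A = rng.binomial(1, 0.5, n)                    # randomized treatment
Y = rng.binomial(1, 0.2 + 0.1 * A + 0.4 * L)   # true RD = 0.10
C = rng.binomial(1, 0.1 + 0.3 * A + 0.4 * L)   # censoring depends on A and L
sim = pd.DataFrame({"A": A, "L": L, "Y": Y, "C": C})

# Nonparametric Pr(C=0 | A, L) from stratum proportions, then invert
p_uncensored = 1 - sim.groupby(["A", "L"])["C"].transform("mean")
sim["w"] = 1.0 / p_uncensored

# In a real study, Y is only available for these rows
obs = sim[sim["C"] == 0]


def risk_difference(data, weight_col=None):
    """Difference in (optionally weighted) mean Y between A=1 and A=0."""
    risks = {}
    for a in (0, 1):
        arm = data[data["A"] == a]
        w = arm[weight_col] if weight_col else None
        risks[a] = np.average(arm["Y"], weights=w)
    return risks[1] - risks[0]


crude_rd_sim = risk_difference(obs)
ipcw_rd_sim = risk_difference(obs, "w")
print(f"True RD:  0.100")
print(f"Crude RD: {crude_rd_sim:.3f}")
print(f"IPCW RD:  {ipcw_rd_sim:.3f}")
```

Under this setup, restricting to `C=0` leaves the two arms with different distributions of `L`, so the crude comparison is distorted even though `A` was randomized. Weighting restores the balance of `L` within each arm.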

--- 
## 🛑 Activity 1: Identify S and its causes (Slide 10)

For your project question:
1. What defines the "selected" population ($S=1$)? (e.g., people who didn't die before 30 days, people who didn't move away).
2. What are the likely causes of $S$? (These variables MUST be in your weighting model).

### 4. Summary
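- Selection bias can arise when we restrict our analysis to those with observed data and selection depends on both the treatment and causes of the outcome (collider bias).
- IPCW reweights the uncensored subjects to create a pseudo-population that resembles the original one, provided the censoring model is correctly specified.
- Stabilized weights (SW), which replace the numerator 1 with $\Pr(C=0 \mid A)$, are less variable and can further improve precision.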
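
As a minimal sketch of stabilized weights, assuming the common form $SW = \Pr(C=0 \mid A) / \Pr(C=0 \mid A, L)$, the code below computes both versions on synthetic binary data and compares them. The data-generating values and the names `toy`, `w`, `sw`, and `weighted_rd` are illustrative assumptions, not part of the course package.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 50_000

# Synthetic data (illustrative only)
A = rng.binomial(1, 0.5, n)
L = rng.binomial(1, 0.5, n)
C = rng.binomial(1, 0.1 + 0.3 * A + 0.4 * L)
Y = rng.binomial(1, 0.2 + 0.1 * A + 0.4 * L)
toy = pd.DataFrame({"A": A, "L": L, "C": C, "Y": Y})

# Denominator: Pr(C=0 | A, L); numerator: Pr(C=0 | A)
p_denominator = 1 - toy.groupby(["A", "L"])["C"].transform("mean")
p_numerator = 1 - toy.groupby("A")["C"].transform("mean")
toy["w"] = 1.0 / p_denominator           # unstabilized
toy["sw"] = p_numerator / p_denominator  # stabilized

obs = toy[toy["C"] == 0]


def weighted_rd(data, weight_col):
    """Weighted risk difference between A=1 and A=0."""
    risk = {}
    for a in (0, 1):
        arm = data[data["A"] == a]
        risk[a] = np.average(arm["Y"], weights=arm[weight_col])
    return risk[1] - risk[0]


print(obs[["w", "sw"]].describe())
print(f"RD with w:  {weighted_rd(obs, 'w'):.3f}")
print(f"RD with sw: {weighted_rd(obs, 'sw'):.3f}")
```

The stabilized weights average 1 among uncensored subjects and have a smaller range. In this simple two-arm comparison the normalized weighted means are unchanged, because the numerator is constant within each arm; any precision gain shows up when the weights feed a non-saturated outcome model.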