# Lecture 03: Observational Studies and Identifiability

[!["Open In Colab"](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/<ORG>/<REPO>/blob/main/lectures/L03_Observational_Assumptions/L03_Observational_Assumptions_student.ipynb)

## Learning Objectives
1. Define the three core assumptions for identifiability: **Exchangeability, Positivity, and Consistency**.
2. Understand the **standardization** (g-formula) approach to identification.
3. Recognize positivity violations and unmeasured confounding (a failure of exchangeability).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from phs564_ci.datasets import load_data

# Load observational data
df = load_data("l03_observational.csv")
df.head()

--- 
## 🛑 Activity 1: Which assumption is fragile here? (Slide 10)

**Scenario:** A study on the effect of a new, very expensive drug on cancer survival using hospital registry data.

1. **Exchangeability:** Are the treated and untreated likely comparable? (Think about who can afford the drug).
2. **Positivity:** Does everyone have a chance to get the drug? (What if some hospitals don't carry it?)
3. **Consistency:** Is the treatment "taking the drug" well-defined? (Think about dose, timing, and duration.)

--- 
### 1. Positivity Check: Overlap
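Positivity requires that every level of the confounder `L` seen in the population includes both treated and untreated individuals, i.e. $P(A=a \mid L=l) > 0$ for each treatment level $a$. In practice, we check that the distributions of `L` overlap across treatment groups.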

In [None]:
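import os

# Make sure the output folder exists; savefig fails if it is missing
os.makedirs("figures/L03", exist_ok=True)

plt.figure(figsize=(10, 6))
# One density of L per treatment group; overlapping curves suggest positivity holds
sns.kdeplot(data=df[df['A']==1], x='L', label='Treated (A=1)', fill=True)
sns.kdeplot(data=df[df['A']==0], x='L', label='Untreated (A=0)', fill=True)
plt.title("Positivity Check: Distribution of L by Treatment")
plt.legend()
plt.savefig("figures/L03/overlap_good.png")
plt.show()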
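
As a numeric complement to a density plot, overlap can also be checked by counting treated and untreated subjects within bins of the confounder. The sketch below uses simulated data rather than the course dataset; the helper `overlap_table`, the frame `toy`, and the logistic assignment rule are hypothetical choices made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the course dataset (columns L and A assumed)
rng = np.random.default_rng(0)
n = 1000
L = rng.normal(0, 1, n)
A = rng.binomial(1, 1 / (1 + np.exp(-L)))  # treatment is more likely at higher L
toy = pd.DataFrame({"L": L, "A": A})

def overlap_table(data, confounder="L", treatment="A", q=4):
    """Count subjects per treatment level within quantile bins of the confounder."""
    bins = pd.qcut(data[confounder], q=q, labels=False)
    return pd.crosstab(bins, data[treatment])

counts = overlap_table(toy)
print(counts)

# A zero cell means that stratum has no treated (or no untreated) subjects
n_empty = int((counts == 0).to_numpy().sum())
print("Empty cells:", n_empty)
```

A cell that is small but nonzero is also worth a second look, since the stratum-specific estimate there rests on very few observations.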

--- 
### 🖼️ Figure Generation: Positivity Failure (Slide 14)
Let's see what a positivity violation looks like.

In [None]:
import os

# Create a toy dataset with (almost) NO overlap: untreated centered at 0, treated at 5
rng = np.random.default_rng(564)  # fixed seed so the figure is reproducible
df_bad = pd.DataFrame({
    'L': np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 1, 500)]),
    'A': np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])
})

os.makedirs("figures/L03", exist_ok=True)  # savefig fails if the folder is missing
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df_bad[df_bad['A']==1], x='L', label='Treated (A=1)', fill=True)
sns.kdeplot(data=df_bad[df_bad['A']==0], x='L', label='Untreated (A=0)', fill=True)
plt.title("Positivity VIOLATION: No Overlap in L")
plt.legend()
plt.savefig("figures/L03/overlap_bad.png")
plt.show()

--- 
## 🛑 Activity 2: List L for your project (Slide 19)

For your chosen causal question, list at least five pre-treatment covariates ($L$) that you would need to adjust for to satisfy **conditional exchangeability** ($Y^a \perp\!\!\!\perp A \mid L$).

--- 
### 2. Identification via Standardization (G-Formula)
We will estimate the Average Causal Effect (ACE) by standardizing to the distribution of `L` in the population.
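
For reference, the standardization formula for a discrete `L` can be written as follows. This is the usual g-formula for a single time-point treatment, stated under the three identifiability assumptions; the lowercase symbols $a$ and $l$ for particular treatment and confounder values are notation introduced here.

$$
E[Y^a] = \sum_{l} E[Y \mid A = a, L = l] \, P(L = l)
$$

$$
\text{ACE} = E[Y^{a=1}] - E[Y^{a=0}]
$$

Conditional exchangeability and consistency justify replacing $E[Y^a \mid L = l]$ with the observed $E[Y \mid A = a, L = l]$, and positivity guarantees that each of these conditional means is defined.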

In [None]:
# 1. Discretize L into quartiles (L is continuous; binning keeps the intuition simple)
# In a real study, we would use a model (Lecture 08).
df['L_bin'] = pd.qcut(df['L'], q=4, labels=False)

# 2. Calculate risk in each stratum of L_bin and A (rows: L_bin, columns: A)
stratified_risks = df.groupby(['L_bin', 'A'])['Y'].mean().unstack()

# An empty (L_bin, A) cell shows up as NaN, and .sum() below would silently skip it
if stratified_risks.isna().any().any():
    raise ValueError("Empty (L_bin, A) cell: positivity fails in the sample.")

# 3. Calculate weights (proportion of population in each L_bin stratum)
weights = df['L_bin'].value_counts(normalize=True).sort_index()

# 4. Standardize: weighted average of the stratum-specific risks
risk_a1 = (stratified_risks[1] * weights).sum()
risk_a0 = (stratified_risks[0] * weights).sum()

print(f"Standardized Risk if everyone treated (A=1): {risk_a1:.3f}")
print(f"Standardized Risk if no one treated (A=0): {risk_a0:.3f}")
print(f"Standardized RD (causal effect if the assumptions hold): {risk_a1 - risk_a0:.3f}")

# Compare with crude (associational) difference
crude_diff = df.loc[df['A']==1, 'Y'].mean() - df.loc[df['A']==0, 'Y'].mean()
print(f"Crude Difference (Association): {crude_diff:.3f}")
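
To make the arithmetic easy to check by hand, here is a minimal sketch on a small constructed dataset with a binary confounder. The cell counts, the helper names `make_cell` and `standardized_risks`, and the frame `toy` are all hypothetical and chosen only so the numbers come out round; this is not the course dataset.

```python
import numpy as np
import pandas as pd

def make_cell(l, a, n, n_events):
    """Build n rows for one (L, A) cell, n_events of which have Y = 1."""
    y = np.r_[np.ones(n_events, dtype=int), np.zeros(n - n_events, dtype=int)]
    return pd.DataFrame({"L": l, "A": a, "Y": y})

# Treatment is more common when L = 1, and L = 1 also raises risk (confounding)
toy = pd.concat([
    make_cell(0, 0, 80, 16),  # risk 0.20
    make_cell(0, 1, 20, 8),   # risk 0.40
    make_cell(1, 0, 20, 10),  # risk 0.50
    make_cell(1, 1, 80, 56),  # risk 0.70
], ignore_index=True)

def standardized_risks(data, stratum="L", treatment="A", outcome="Y"):
    """Weight stratum-specific risks by the population distribution of the stratum."""
    risks = data.groupby([stratum, treatment])[outcome].mean().unstack()
    weights = data[stratum].value_counts(normalize=True).sort_index()
    return risks.mul(weights, axis=0).sum()  # one standardized risk per treatment level

std = standardized_risks(toy)
std_rd = std[1] - std[0]
crude_rd = toy.loc[toy["A"] == 1, "Y"].mean() - toy.loc[toy["A"] == 0, "Y"].mean()

print(f"Standardized RD: {std_rd:.2f}")  # → Standardized RD: 0.20
print(f"Crude RD:        {crude_rd:.2f}")  # → Crude RD:        0.38
```

Within each level of `L` the risk difference is 0.20, and standardization recovers it (0.55 - 0.35). The crude difference of 0.38 is inflated because treated subjects are concentrated in the higher-risk stratum.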

### 3. Summary
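- Identification requires the "Big Three" assumptions: exchangeability, positivity, and consistency.
- Standardization estimates causal effects by averaging stratum-specific risks, weighted by the distribution of `L` in the original population.
- Overlap (positivity) is critical: a stratum of `L` with no treated or no untreated individuals provides no data for one of those risks.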
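
To see concretely why overlap matters for standardization, the sketch below builds a tiny hypothetical dataset in which one stratum has no untreated subjects. The data and variable names are invented for illustration. It shows that a pandas `.sum()` skips the missing stratum by default, so the code returns a number even though the standardized risk is not identified.

```python
import numpy as np
import pandas as pd

# Hypothetical data: stratum L = 1 contains no untreated (A = 0) subjects
toy = pd.DataFrame({
    "L": [0, 0, 0, 0, 1, 1, 1, 1],
    "A": [0, 0, 1, 1, 1, 1, 1, 1],
    "Y": [0, 1, 1, 1, 1, 1, 0, 1],
})

# Stratum-specific risks; the missing (L=1, A=0) cell becomes NaN after unstack
risks = toy.groupby(["L", "A"])["Y"].mean().unstack()
weights = toy["L"].value_counts(normalize=True).sort_index()

# Default behaviour: NaN is skipped, so only stratum L = 0 contributes
naive_a0 = (risks[0] * weights).sum()

# skipna=False propagates the NaN and exposes the problem
strict_a0 = (risks[0] * weights).sum(skipna=False)

print(risks)
print(f"Naive standardized risk under A=0:  {naive_a0:.2f}")
print(f"Strict standardized risk under A=0: {strict_a0}")
```

The naive value only reflects half of the population, so it should not be read as the risk had no one been treated. Checking for empty cells before standardizing avoids this silent failure.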