# Lecture 09: Standardization and the Parametric G-Formula

[!["Open In Colab"](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/<ORG>/<REPO>/blob/main/lectures/L09_Standardization/L09_Standardization_student.ipynb)

## Learning Objectives
1. Understand the step-by-step algorithm of the **parametric g-formula**.
2. Implement standardization using a logistic outcome model.
3. Use the **bootstrap** to estimate confidence intervals.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from phs564_ci.datasets import load_data
from phs564_ci.estimators.standardization import standardization

# Load dataset
df = load_data("l09_standardization.csv")
df.head()

--- 
## 🛑 Activity 1: Implement g-formula (toy) (Slide 12)

We will manually walk through the g-formula algorithm.
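
In symbols, the steps in this activity amount to the following. This is a sketch that assumes the main-effects logistic model `Y ~ A + L` fitted in Step 1; the coefficient names $\hat\beta_0, \hat\beta_1, \hat\beta_2$ are notation introduced here for illustration.

$$
\hat{P}(Y=1 \mid A=a, L=l) = \operatorname{expit}\left(\hat\beta_0 + \hat\beta_1 a + \hat\beta_2 l\right)
$$

$$
\hat{E}[Y^{a}] = \frac{1}{n}\sum_{i=1}^{n} \hat{P}(Y=1 \mid A=a, L=L_i),
\qquad
\widehat{RD} = \hat{E}[Y^{a=1}] - \hat{E}[Y^{a=0}]
$$

Each person keeps their own observed $L_i$; only $A$ is set to $a$. The average over $i$ is what standardizes the risk to the covariate distribution of the whole sample.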

### Step 1: Fit the Outcome Model

In [None]:
# Since Y is binary, we use a logistic model
model = smf.logit("Y ~ A + L", data=df).fit()
print(model.summary().tables[1])

### Step 2: Create Counterfactual Datasets

In [None]:
df_a1 = df.copy()
df_a1['A'] = 1

df_a0 = df.copy()
df_a0['A'] = 0

### Step 3: Predict Counterfactual Risks

In [None]:
pred_risk_a1 = model.predict(df_a1)
pred_risk_a0 = model.predict(df_a0)

### Step 4: Average and Compute ATE

In [None]:
mean_risk_a1 = pred_risk_a1.mean()
mean_risk_a0 = pred_risk_a0.mean()

print(f"Mean Risk if A=1: {mean_risk_a1:.3f}")
print(f"Mean Risk if A=0: {mean_risk_a0:.3f}")
print(f"Risk Difference (ATE): {mean_risk_a1 - mean_risk_a0:.3f}")
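
To see why averaging the predictions over every row gives a standardized risk, it may help to repeat the calculation without a regression model. The sketch below uses hypothetical toy data (not the course dataset) with a binary `L`, and replaces the logistic model with stratum-specific observed risks. It uses only the standard library.

```python
from collections import defaultdict

# Hypothetical toy data as (L, A, Y) triples; not taken from the course dataset.
rows = (
    [(0, 0, 1)] * 2 + [(0, 0, 0)] * 6    # L=0, A=0: risk 2/8
    + [(0, 1, 1)] * 2 + [(0, 1, 0)] * 2  # L=0, A=1: risk 2/4
    + [(1, 0, 1)] * 2 + [(1, 0, 0)] * 2  # L=1, A=0: risk 2/4
    + [(1, 1, 1)] * 3 + [(1, 1, 0)] * 1  # L=1, A=1: risk 3/4
)


def standardized_risk(rows, a):
    """Average the observed risk P(Y=1 | A=a, L=l) over every row's own L."""
    events = defaultdict(int)
    counts = defaultdict(int)
    for l, a_obs, y in rows:
        if a_obs == a:
            events[l] += y
            counts[l] += 1
    # Every row contributes the risk of its own L stratum under A=a.
    # This mirrors predicting on a copy of the data with A set to a.
    return sum(events[l] / counts[l] for l, _, _ in rows) / len(rows)


risk_a1 = standardized_risk(rows, 1)
risk_a0 = standardized_risk(rows, 0)
rd = risk_a1 - risk_a0

print(f"Standardized risk if A=1: {risk_a1:.2f}")  # 0.60
print(f"Standardized risk if A=0: {risk_a0:.2f}")  # 0.35
print(f"Standardized RD: {rd:.2f}")                # 0.25
```

In this toy data the crude risk difference is 5/8 - 4/12, roughly 0.29, because treated rows are more likely to have `L=1`. Standardizing to the full sample's distribution of `L` gives 0.25.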

--- 
### 🖼️ Figure Generation: Predicted Risks (Slide 13)

In [None]:
import os

# Create the output directory first; savefig raises an error if it is missing
os.makedirs("figures/L09", exist_ok=True)

plt.figure(figsize=(10, 6))
sns.kdeplot(pred_risk_a1, label='Counterfactual: All Treated (A=1)', fill=True)
sns.kdeplot(pred_risk_a0, label='Counterfactual: All Untreated (A=0)', fill=True)
plt.title("Distribution of Predicted Counterfactual Risks")
plt.xlabel("Predicted Risk (Probability of Y=1)")
plt.legend()
plt.savefig("figures/L09/gformula_risk.png")
plt.show()

--- 
### 2. Implementation with Helper Function
Our package provides a helper to automate this (including bootstrapping).

In [None]:
ate_est = standardization(df, 'Y', 'A', ['L'], n_bootstrap=100)
print(f"Standardized RD (from helper): {ate_est:.3f}")
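
The internals of the `standardization` helper are not shown here, so the following is only a minimal sketch of how a percentile bootstrap interval for a standardized risk difference could be computed. The names `risk_difference` and `bootstrap_ci` and the toy data are hypothetical. The estimator uses stratum-specific observed risks rather than a logistic model, so the block runs with only the standard library.

```python
import random

# Hypothetical toy data as (L, A, Y) triples; not the course dataset.
rows = (
    [(0, 0, 1)] * 2 + [(0, 0, 0)] * 6
    + [(0, 1, 1)] * 2 + [(0, 1, 0)] * 2
    + [(1, 0, 1)] * 2 + [(1, 0, 0)] * 2
    + [(1, 1, 1)] * 3 + [(1, 1, 0)] * 1
) * 5  # 100 rows in total


def risk_difference(rows):
    """Standardized risk difference for (L, A, Y) rows with a discrete L."""
    def standardized_risk(a):
        events, counts = {}, {}
        for l, a_obs, y in rows:
            if a_obs == a:
                events[l] = events.get(l, 0) + y
                counts[l] = counts.get(l, 0) + 1
        # Raises KeyError if some L stratum has nobody with A=a (positivity failure).
        return sum(events[l] / counts[l] for l, _, _ in rows) / len(rows)
    return standardized_risk(1) - standardized_risk(0)


def bootstrap_ci(rows, estimator, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample rows, re-estimate, take quantiles."""
    rng = random.Random(seed)
    estimates = []
    while len(estimates) < n_boot:
        sample = rng.choices(rows, k=len(rows))  # resample rows with replacement
        try:
            estimates.append(estimator(sample))
        except KeyError:
            continue  # simplification: redraw when a resample has an empty cell
    estimates.sort()
    # Approximate the alpha/2 and 1 - alpha/2 percentiles by order statistics.
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


point = risk_difference(rows)
lo, hi = bootstrap_ci(rows, risk_difference, n_boot=500, seed=0)
print(f"Standardized RD: {point:.3f}")
print(f"95% percentile bootstrap CI: ({lo:.3f}, {hi:.3f})")
```

The whole estimation procedure is repeated inside each resample. With a parametric g-formula, that means refitting the outcome model on every bootstrap sample, not reusing the original fit. A value such as `n_bootstrap=100` keeps a classroom demo fast, but more resamples are generally advisable for stable percentile limits.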

### 3. Summary
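- The g-formula simulates what would happen if the entire population were treated versus untreated, then contrasts the two average predicted risks.
- It relies on a correctly specified outcome model; if the model for Y given A and L is wrong, the standardized estimate can be biased.
- It handles multiple confounders and interactions by including them as terms in the outcome model.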