# Interactive Tutorial: Estimating Causal Effects

Welcome to this interactive tutorial on causal inference! In this notebook, we will walk through a practical example to understand how to estimate the causal effect of a treatment on an outcome while dealing with common pitfalls like **confounding** and **collider bias**.

We will replicate the analysis from `generate_report.py`, but with more detailed explanations at each step.

## 1. Setup

First, let's import the necessary Python libraries and functions from our project files. Make sure you have run `pip install numpy scipy scikit-learn pandas matplotlib statsmodels` in your environment.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from data.simulation import simulate_adr_data
from causal_estimation import (
    fit_propensity_score, 
    compute_ipw, 
    SemiMechanisticLogisticMSM, 
    EstimatingEquationSolver
)

print("Setup complete!")

## 2. Data Simulation

We'll use a simulated dataset that mimics a real-world scenario where we want to assess the causal effect of a drug (`A`) on an adverse drug reaction (`Y`).

The simulation includes:
- **Confounders:** `L` (disease severity), `age`, `sex`, `comorbidities`.
- **Collider:** `hospitalized` (a variable affected by both the drug and the adverse reaction).
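
To make the roles of these variables concrete, here is a minimal, self-contained sketch of a data-generating process with one confounder and one collider. It is not the project's `simulate_adr_data`: the function name `simulate_toy_data`, the variable `H`, and all coefficients are assumptions chosen purely for illustration.

```python
import numpy as np


def expit(x):
    """Inverse logit: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + np.exp(-x))


def simulate_toy_data(n=5000, seed=0):
    """Hypothetical toy simulation (not simulate_adr_data).

    L is a confounder (it affects both A and Y);
    H is a collider (it is affected by both A and Y).
    """
    rng = np.random.default_rng(seed)
    L = rng.normal(size=n)                                 # disease severity
    A = rng.binomial(1, expit(0.8 * L))                    # sicker patients are treated more often
    Y = rng.binomial(1, expit(-2.0 + 0.5 * A + 1.0 * L))   # 0.5 is the conditional log(OR) of A
    H = rng.binomial(1, expit(-1.0 + 1.5 * A + 2.0 * Y))   # hospitalization depends on A and Y
    return {"L": L, "A": A, "Y": Y, "H": H}


toy = simulate_toy_data(n=5000, seed=0)
print({name: values[:5] for name, values in toy.items()})
```

The arrows matter more than the numbers: `L` points into both `A` and `Y`, while both `A` and `Y` point into `H`.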

In [None]:
df = simulate_adr_data(n_patients=2000, seed=42)
Y = df['Y'].values
A = df['A'].values
theta_true_adr = df['theta_true_adr'].iloc[0]

print(f"The true causal log(OR) for the adverse reaction is: {theta_true_adr:.4f}")
df.head()

## 3. Scenario 1: Naive Estimation (Ignoring Confounders)

Let's start with the most basic approach: ignoring all other variables and comparing the outcome between the treated and untreated groups. This is often called a "naive" analysis.

As explained in `CONCEPTUAL_GUIDE.md`, this approach is likely to be biased by confounding.

In [None]:
naive_model = SemiMechanisticLogisticMSM(Y, A)
naive_solver = EstimatingEquationSolver(naive_model)
theta_naive, _ = naive_solver.solve([-1, 0.5])
ci_low_naive, ci_high_naive = naive_solver.confint(theta_naive)

print(f"Naive causal log(OR) estimate: {theta_naive[1]:.4f}")
print(f"95% CI: [{ci_low_naive[1]:.4f}, {ci_high_naive[1]:.4f}]")
print(f"True effect: {theta_true_adr:.4f}")

**Observation:** The naive estimate is well above the true effect. Disease severity confounds the comparison: sicker patients are more likely to get the drug and, regardless of treatment, more likely to have an adverse reaction. The crude comparison therefore attributes part of the effect of severity to the drug.
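
The mechanism can be reproduced in isolation. The sketch below is independent of the project code: it simulates a drug with no effect at all on the outcome and shows that the crude log(OR) is still clearly positive once severity drives both treatment and outcome. The helper `log_odds_ratio` and all coefficients are assumptions made for this illustration.

```python
import numpy as np


def log_odds_ratio(y, a):
    """Crude log odds ratio of y comparing a == 1 with a == 0 (2x2 table)."""
    n11 = np.sum((a == 1) & (y == 1))
    n10 = np.sum((a == 1) & (y == 0))
    n01 = np.sum((a == 0) & (y == 1))
    n00 = np.sum((a == 0) & (y == 0))
    return np.log(n11 * n00 / (n10 * n01))


rng = np.random.default_rng(0)
n = 20_000
severity = rng.normal(size=n)                        # confounder
p_treat = 1.0 / (1.0 + np.exp(-severity))            # sicker -> more likely treated
a = rng.binomial(1, p_treat)
p_outcome = 1.0 / (1.0 + np.exp(-(-1.0 + severity))) # depends on severity only, not on a
y = rng.binomial(1, p_outcome)

crude = log_odds_ratio(y, a)
print(f"True log(OR): 0.0000, crude log(OR): {crude:.4f}")
```

Because the outcome model does not contain `a`, any nonzero crude estimate here is due to confounding (plus sampling noise).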

## 4. Scenario 2: IPW with Collider Adjustment (Inducing Bias)

Now, let's try to adjust for other variables. A common mistake is to include any variable that is associated with the outcome. Here, we will include the **collider** (`hospitalized`) in our propensity score model.

As we learned, adjusting for a collider can *induce* bias: conditioning on a common effect of the drug and the outcome opens a non-causal path between them.

In [None]:
confounders_and_collider = df[['L', 'age', 'sex', 'comorbidities', 'hospitalized']].values
ps_model_collider = fit_propensity_score(confounders_and_collider, A)
ps_collider = ps_model_collider.predict_proba(confounders_and_collider)[:, 1]
weights_collider = compute_ipw(A, ps_collider)

ipw_model_collider = SemiMechanisticLogisticMSM(Y, A, weights=weights_collider)
ipw_solver_collider = EstimatingEquationSolver(ipw_model_collider)
theta_ipw_collider, _ = ipw_solver_collider.solve([-1, 0.5])
ci_low_collider, ci_high_collider = ipw_solver_collider.confint(theta_ipw_collider)

print(f"IPW estimate adjusting for collider: {theta_ipw_collider[1]:.4f}")
print(f"95% CI: [{ci_low_collider[1]:.4f}, {ci_high_collider[1]:.4f}]")
print(f"True effect: {theta_true_adr:.4f}")

**Observation:** This estimate is also biased, but this time it underestimates the true effect. Because `hospitalized` is caused by both the drug and the adverse reaction, adjusting for it induces a spurious association between the two.
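
Collider bias is easiest to see in a stripped-down setting. In the sketch below the drug and the outcome are generated independently, so the true log(OR) is zero, yet restricting to hospitalized patients produces a clearly negative association. Note the simplification: the notebook adds `hospitalized` to the propensity model, whereas this sketch stratifies on the collider, which is an assumption made to keep the example short. All names and coefficients are illustrative.

```python
import numpy as np


def log_odds_ratio(y, a):
    """Crude log odds ratio of y comparing a == 1 with a == 0 (2x2 table)."""
    n11 = np.sum((a == 1) & (y == 1))
    n10 = np.sum((a == 1) & (y == 0))
    n01 = np.sum((a == 0) & (y == 1))
    n00 = np.sum((a == 0) & (y == 0))
    return np.log(n11 * n00 / (n10 * n01))


rng = np.random.default_rng(1)
n = 50_000
a = rng.binomial(1, 0.5, size=n)                   # drug, assigned at random
y = rng.binomial(1, 0.3, size=n)                   # outcome, independent of the drug
p_hosp = 1.0 / (1.0 + np.exp(-(-2.0 + 2.0 * a + 2.0 * y)))
h = rng.binomial(1, p_hosp)                        # collider: caused by both a and y

overall = log_odds_ratio(y, a)
among_hospitalized = log_odds_ratio(y[h == 1], a[h == 1])
print(f"log(OR), all patients:       {overall:.4f}")
print(f"log(OR), hospitalized only:  {among_hospitalized:.4f}")
```

Among hospitalized patients, knowing that someone did not take the drug makes it more likely that the adverse reaction explains the hospitalization, which is what creates the negative association.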

## 5. Scenario 3: Correct IPW (Adjusting for Confounders Only)

Finally, let's do it the right way. We will use Inverse Probability Weighting (IPW) to adjust for the **confounders only**, and we will exclude the collider.
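
For reference, one common (unstabilized) form of the weights is shown below, where $X_i$ denotes the confounders and $e(X_i)$ the estimated propensity score. Whether `compute_ipw` uses exactly this form or a stabilized variant is not stated in this notebook, so treat the formula as an assumption.

$$
w_i = \frac{A_i}{e(X_i)} + \frac{1 - A_i}{1 - e(X_i)}, \qquad e(X_i) = \Pr(A_i = 1 \mid X_i)
$$

Treated patients are weighted by the inverse of their probability of being treated, and untreated patients by the inverse of their probability of being untreated. In the weighted sample, the confounders no longer predict treatment.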

In [None]:
confounders_only = df[['L', 'age', 'sex', 'comorbidities']].values
ps_model_correct = fit_propensity_score(confounders_only, A)
ps_correct = ps_model_correct.predict_proba(confounders_only)[:, 1]
weights_correct = compute_ipw(A, ps_correct)

ipw_model_correct = SemiMechanisticLogisticMSM(Y, A, weights=weights_correct)
ipw_solver_correct = EstimatingEquationSolver(ipw_model_correct)
theta_ipw_correct, _ = ipw_solver_correct.solve([-1, 0.5])
ci_low_correct, ci_high_correct = ipw_solver_correct.confint(theta_ipw_correct)

print(f"Correct IPW estimate: {theta_ipw_correct[1]:.4f}")
print(f"95% CI: [{ci_low_correct[1]:.4f}, {ci_high_correct[1]:.4f}]")
print(f"True effect: {theta_true_adr:.4f}")

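**Observation:** Success! This estimate is very close to the true causal effect. By adjusting for the confounders and leaving out the collider, we recovered the causal log(OR) up to sampling error. Strictly speaking, a single estimate cannot be called unbiased; the claim is that IPW is consistent when the propensity model is correctly specified, there is no unmeasured confounding, and every patient has a nonzero probability of receiving each treatment.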
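
The whole IPW pipeline can also be sketched with NumPy alone. The example below does not use `fit_propensity_score`, `compute_ipw`, or `SemiMechanisticLogisticMSM`; instead it assumes a hand-written Newton-Raphson logistic fit, unstabilized weights, and a weighted 2x2 table. The helper names `fit_logistic` and `weighted_log_odds_ratio` are hypothetical. The simulated drug has no effect, so the target log(OR) is zero.

```python
import numpy as np


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def fit_logistic(x, t, n_iter=25):
    """Logistic regression of t on x (with intercept) via Newton-Raphson."""
    design = np.column_stack([np.ones(len(t)), x])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = expit(design @ beta)
        gradient = design.T @ (t - p)
        hessian = (design * (p * (1 - p))[:, None]).T @ design
        beta += np.linalg.solve(hessian, gradient)
    return beta, design


def weighted_log_odds_ratio(y, a, w):
    """Log odds ratio from a 2x2 table whose cells are sums of weights."""
    n11 = np.sum(w[(a == 1) & (y == 1)])
    n10 = np.sum(w[(a == 1) & (y == 0)])
    n01 = np.sum(w[(a == 0) & (y == 1)])
    n00 = np.sum(w[(a == 0) & (y == 0)])
    return np.log(n11 * n00 / (n10 * n01))


rng = np.random.default_rng(0)
n = 20_000
severity = rng.normal(size=n)                      # confounder
a = rng.binomial(1, expit(severity))               # treatment depends on severity
y = rng.binomial(1, expit(-1.0 + severity))        # outcome depends on severity only

beta, design = fit_logistic(severity, a)           # propensity model: confounder only
ps = expit(design @ beta)
weights = np.where(a == 1, 1.0 / ps, 1.0 / (1.0 - ps))

crude = weighted_log_odds_ratio(y, a, np.ones(n))
ipw = weighted_log_odds_ratio(y, a, weights)
print(f"True log(OR): 0.0000, crude: {crude:.4f}, IPW: {ipw:.4f}")
```

The weighted 2x2 table is used here only because the model has a single binary treatment and no other terms; it is a stand-in for, not a description of, the estimating-equation solver used in the notebook.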

## Conclusion

This tutorial demonstrated that:
1.  Ignoring confounders leads to biased estimates.
2.  Adjusting for colliders introduces bias.
3.  Correctly specifying the causal model and using a method like IPW allows for accurate causal effect estimation.

The key takeaway is that **your assumptions about the causal structure of the data are more important than the specific statistical method you use.**