# Lecture 08: Why Model? Models, Estimators, and Tradeoffs

[!["Open In Colab"](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/<ORG>/<REPO>/blob/main/lectures/L08_Why_Model/L08_Why_Model_student.ipynb)

## Learning Objectives
1. Distinguish between **estimand, model, and estimator**.
2. Understand the **bias-variance tradeoff** in the context of causal inference.
3. See how **model misspecification** can lead to incorrect causal conclusions.
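
To make the first objective concrete, here is a minimal sketch on simulated data (not the course dataset). It assumes a randomized binary treatment and a constant effect of 2; the variable names and all numeric values are illustrative choices, and plain NumPy least squares stands in for `statsmodels` so the snippet runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated potential outcomes (in real data we never observe both)
L = rng.normal(size=n)
y0 = 1.0 + L + rng.normal(size=n)
y1 = y0 + 2.0  # assumed constant individual effect of 2

# ESTIMAND: the target quantity, defined without reference to any model
ate_true = np.mean(y1 - y0)

# Observed data under a randomized treatment A
A = rng.integers(0, 2, size=n)
Y = np.where(A == 1, y1, y0)

# ESTIMATOR 1: difference in means (no outcome model needed here)
ate_diff = Y[A == 1].mean() - Y[A == 0].mean()

# MODEL: the assumption E[Y | A, L] = b0 + b1*A + b2*L
# ESTIMATOR 2: the least-squares coefficient on A under that model
X = np.column_stack([np.ones(n), A, L])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
ate_ols = beta[1]

print(f"Estimand (true ATE):       {ate_true:.3f}")
print(f"Estimator 1 (diff. means): {ate_diff:.3f}")
print(f"Estimator 2 (OLS on A):    {ate_ols:.3f}")
```

Both estimators target the same estimand; they differ in the assumptions they rely on and in how variable they are from sample to sample.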

In [None]:
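from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from phs564_ci.datasets import load_data

# Create the output folder up front so the plt.savefig calls below do not fail
Path("figures/L08").mkdir(parents=True, exist_ok=True)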

# Load data where the relationship between L and Y is non-linear
df = load_data("l08_why_model.csv")
df.head()

--- 
### 🖼️ Figure Generation: Bias-Variance Tradeoff (Slide 04)
Let's visualize underfitting vs. overfitting by fitting polynomials of degree 1, 2, and 15 to noisy data generated from a quadratic curve.

In [None]:
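# Seeded generator so the figure is reproducible from run to run
rng = np.random.default_rng(42)

x = np.linspace(0, 10, 100)
y_true = 0.5 * x**2 - 2*x + 5                   # true quadratic signal
y_noisy = y_true + rng.normal(0, 3, x.size)     # add Gaussian noise (sd = 3)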

plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.scatter(x, y_noisy, alpha=0.3)
plt.plot(x, np.polyval(np.polyfit(x, y_noisy, 1), x), color='red')
plt.title("Underfit (Linear)")

plt.subplot(1, 3, 2)
plt.scatter(x, y_noisy, alpha=0.3)
plt.plot(x, np.polyval(np.polyfit(x, y_noisy, 2), x), color='green')
plt.title("Just Right (Quadratic)")

plt.subplot(1, 3, 3)
plt.scatter(x, y_noisy, alpha=0.3)
plt.plot(x, np.polyval(np.polyfit(x, y_noisy, 15), x), color='blue')
plt.title("Overfit (High Poly)")

plt.tight_layout()
plt.savefig("figures/L08/bias_variance.png")
plt.show()
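
The figure shows a single noisy dataset. To see the tradeoff numerically, one option is to redraw the noise many times and measure, for each polynomial degree, the squared bias and the variance of the fitted curve. The sketch below does this under the same data-generating curve as above; the number of simulations (200), the seed, and the helper name `bias2_and_variance` are illustrative assumptions. It uses `np.polynomial.Polynomial.fit`, which rescales `x` internally and is better conditioned than `np.polyfit` at degree 15.

```python
import numpy as np

x = np.linspace(0, 10, 100)
y_true = 0.5 * x**2 - 2 * x + 5
n_sims = 200  # number of simulated datasets (illustrative choice)


def bias2_and_variance(degree):
    """Average squared bias and variance of a degree-`degree` polynomial fit."""
    rng = np.random.default_rng(0)  # same noise draws for every degree
    fits = np.empty((n_sims, x.size))
    for s in range(n_sims):
        y_noisy = y_true + rng.normal(0, 3, x.size)
        poly = np.polynomial.Polynomial.fit(x, y_noisy, degree)
        fits[s] = poly(x)
    # Bias: how far the average fitted curve is from the truth
    bias2 = np.mean((fits.mean(axis=0) - y_true) ** 2)
    # Variance: how much the fitted curve moves between datasets
    variance = np.mean(fits.var(axis=0))
    return bias2, variance


results = {d: bias2_and_variance(d) for d in (1, 2, 15)}
for d, (b2, var) in results.items():
    print(f"degree {d:>2}: bias^2 = {b2:7.3f}, variance = {var:6.3f}")
```

The linear fit should show large squared bias and small variance, while the degree-15 fit should show negligible bias but the largest variance of the three.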

--- 
### 🖼️ Figure Generation: Misspecification Demo (Slide 05)
How does using the wrong model affect our estimate of the effect of $A$?

In [None]:
# True relationship: Y ~ A + L^2
# Model 1: Y ~ A + L (Incorrect)
model_linear = smf.ols("Y ~ A + L", data=df).fit()

# Model 2: Y ~ A + np.power(L, 2) (Correct)
model_correct = smf.ols("Y ~ A + np.power(L, 2)", data=df).fit()

print(f"Effect of A (Linear Model): {model_linear.params['A']:.3f}")
print(f"Effect of A (Correct Model): {model_correct.params['A']:.3f}")

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='L', y='Y', hue='A', alpha=0.3)
plt.title("True Data: Y is non-linear in L")
plt.savefig("figures/L08/misspecification_demo.png")
plt.show()
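
Because the true effect in `l08_why_model.csv` is not stated here, the following sketch repeats the comparison on simulated data where the answer is known. Everything in it is an assumption made for illustration: the true effect of $A$ is set to 2, $Y$ depends on $L^2$, and treatment probability also depends on $L^2$, so that $L$ confounds through its square. Plain NumPy least squares is used in place of `smf.ols` to keep the snippet dependency-light.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
true_effect = 2.0  # assumed for this simulation

# L confounds through its square: it drives both treatment and outcome
L = rng.normal(size=n)
p_treat = 1.0 / (1.0 + np.exp(-(L**2 - 1.0)))
A = rng.binomial(1, p_treat)
Y = true_effect * A + L**2 + rng.normal(size=n)


def coef_on_A(design):
    """Least-squares coefficient on A (second column of the design matrix)."""
    beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return beta[1]


ones = np.ones(n)
est_linear = coef_on_A(np.column_stack([ones, A, L]))      # Y ~ A + L
est_correct = coef_on_A(np.column_stack([ones, A, L**2]))  # Y ~ A + L^2

print(f"True effect of A:              {true_effect:.3f}")
print(f"Estimate, linear adjustment:   {est_linear:.3f}")
print(f"Estimate, correct adjustment:  {est_correct:.3f}")
```

In this setup the linear adjustment leaves the $L^2$ pathway uncontrolled, so its estimate should sit noticeably above 2, while the correctly specified model should recover a value close to 2.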

--- 
## 🛑 Activity 1: Propose a model + checks (Slide 09)

For your project research question:
1. **Outcome Type:** Is it binary (Yes/No), continuous (Blood Pressure), or time-to-event (Survival)?
2. **Model Choice:** Which regression model will you use (e.g., logistic, OLS, Cox)?
3. **Functional Form:** Do you expect any non-linear relationships with confounders (e.g., Age)?

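## Summary
- With many confounders, or continuous ones, stratifying on every combination leaves too few observations per stratum, so we need models to adjust for them simultaneously.
- Choosing the wrong model (misspecification) can bias the effect estimate even when every confounder is measured and included.
- Use diagnostics and flexible functional forms (like splines or polynomials) to reduce this risk.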
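
One simple diagnostic is to check whether residuals from the fitted model still show a pattern against a confounder. The sketch below is a minimal illustration on simulated data (not the course dataset): it fits a model that is linear in $L$ when the truth is quadratic, then averages the residuals within quintiles of $L$. The data-generating values, seed, and variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Simulated data: Y depends on L^2, but we will fit a model linear in L
L = rng.normal(size=n)
A = rng.binomial(1, 0.5, size=n)
Y = 2.0 * A + L**2 + rng.normal(size=n)

# Fit Y ~ A + L by least squares and compute residuals
X = np.column_stack([np.ones(n), A, L])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta

# Average residual within each quintile of L (bins indexed 0..4)
edges = np.quantile(L, [0.2, 0.4, 0.6, 0.8])
bin_index = np.digitize(L, edges)
mean_resid = np.array([resid[bin_index == b].mean() for b in range(5)])

for b, m in enumerate(mean_resid):
    print(f"L quintile {b + 1}: mean residual = {m:+.3f}")
```

A well-specified model should give bin means near zero. Here the means should be positive in the outer quintiles and negative in the middle one, a U-shape that points to a missing non-linear term in $L$.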