# 📘 Notebook 5: Assumptions and Diagnostics
We analyze the foundational assumptions of linear regression and use diagnostics to evaluate model validity.

**Goal:** Learn how to verify assumptions with real data and interpret diagnostic plots.

## 🧠 What Assumptions Does Linear Regression Make?
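1. Linearity (the expected value of y is a linear function of X)
2. Independence of errors
3. Homoscedasticity (constant variance of errors)
4. Normality of errors (matters mainly for confidence intervals and hypothesis tests, not for the coefficient estimates themselves)
5. No multicollinearity (features are not strongly linearly related to one another)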

## 📊 Step 1: Simulate a Realistic Dataset

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.linear_model import LinearRegression

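np.random.seed(0)  # fix the seed so the simulation is reproducible
n = 150
X = np.random.normal(0, 1, (n, 1))  # one feature, shape (n, 1) as scikit-learn expects
noise = np.random.normal(0, 1, n)   # independent N(0, 1) errors
y = 4 + 3 * X.flatten() + noise     # true intercept 4, true slope 3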

df = pd.DataFrame({'x': X.flatten(), 'y': y})
px.scatter(df, x='x', y='y', title='Simulated Linear Data')

## 🧪 Step 2: Fit the Model

In [None]:
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)
residuals = y - y_pred  # observed minus fitted values
df['residuals'] = residuals

## 📉 Residual Plot (Homoscedasticity Check)

In [None]:
px.scatter(x=y_pred, y=residuals, labels={'x': 'Predicted', 'y': 'Residuals'},
           title='Residuals vs Predicted')

### ✅ Interpretation:
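- Random scatter around zero with a roughly even spread = constant variance, so the assumption is likely satisfied
- Funnel shape (spread that grows or shrinks with the predicted value) = heteroscedasticity
- A curved trend in the residuals = the linearity assumption is violated, not just the variance assumption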

## 📈 Q-Q Plot (Normality of Residuals)

In [None]:
import scipy.stats as stats
import matplotlib.pyplot as plt

plt.figure(figsize=(6,6))
stats.probplot(residuals, dist='norm', plot=plt)
plt.title('Q-Q Plot')
plt.show()

### ✅ Interpretation:
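- Points close to the straight diagonal = residuals are approximately normally distributed
- Curved or tail deviations = non-normal errors (an S-shape suggests heavy or light tails, a consistent bow suggests skew)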
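
The Q-Q plot can be complemented by a numerical test. As a minimal sketch, assuming the Shapiro-Wilk test (`scipy.stats.shapiro`) is an acceptable choice here, the block below compares residuals from a model with normal errors against residuals from a hypothetical model with skewed errors. The helper `ols_residuals` and the exponential error distribution are illustrative, not taken from the notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 150
x = rng.normal(0, 1, n)

def ols_residuals(x, y):
    """Fit a straight line by least squares and return the residuals."""
    slope, intercept = np.polyfit(x, y, deg=1)  # highest degree first
    return y - (intercept + slope * x)

# Normal errors, as in the simulated dataset.
y_normal = 4 + 3 * x + rng.normal(0, 1, n)

# Hypothetical skewed errors: exponential, shifted to have mean zero.
y_skewed = 4 + 3 * x + (rng.exponential(1.0, n) - 1.0)

w_normal, p_normal = stats.shapiro(ols_residuals(x, y_normal))
w_skewed, p_skewed = stats.shapiro(ols_residuals(x, y_skewed))

print(f"Normal errors: W = {w_normal:.3f}, p = {p_normal:.3f}")
print(f"Skewed errors: W = {w_skewed:.3f}, p = {p_skewed:.3g}")
```

The null hypothesis is that the residuals are normal, so a small p-value signals a departure from normality. With large samples the test flags even minor deviations, so it is best read alongside the Q-Q plot rather than instead of it.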

## 📏 Multicollinearity (Variance Inflation Factor)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

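# The fitted model has a single feature, so build an illustrative two-feature design: x and x**2.
# add_constant prepends an intercept column; statsmodels does not add one itself.
X_vif = add_constant(np.hstack([X, X**2]))
vif_data = pd.DataFrame()
vif_data['feature'] = ['const', 'x', 'x^2']
vif_data['VIF'] = [variance_inflation_factor(X_vif, i) for i in range(X_vif.shape[1])]
vif_data  # the 'const' row is not a feature and can be ignored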

## ✅ Summary
- We diagnosed violations of model assumptions with a residual plot (homoscedasticity), a Q-Q plot (normality), and the VIF (multicollinearity)
- Each assumption has a visual or numerical test
- In real data, several assumptions are often violated at once, to varying degrees

➡️ Next: Feature Transformations and Polynomial Regression