[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wasim/Data-Science/blob/main/data-analyst-roadmap/05_statistics_for_data_analysis/09_regression_diagnostics.ipynb)

# Regression Diagnostics

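Regression diagnostics check whether the assumptions behind a fitted linear model hold, and therefore whether its coefficients, standard errors, and p-values can be trusted.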

## Key Assumptions of Linear Regression
1. **Linearity:** The relationship between the predictors (X) and the response (Y) is linear
2. **Independence:** Errors are independent of one another
3. **Homoscedasticity:** Errors have constant variance across all fitted values
4. **Normality:** Errors are normally distributed
5. **No Multicollinearity:** Predictors are not highly correlated with each other

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

sns.set_style('whitegrid')

## 1. Fit Model

In [None]:
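# Generate synthetic data: y = 1 + 2*x1 + 3*x2 + Gaussian noise (sd = 0.5)
np.random.seed(42)  # arbitrary seed, added so the results are reproducible
X = np.random.rand(100, 2)
y = 2 * X[:, 0] + 3 * X[:, 1] + 1 + np.random.normal(0, 0.5, 100)

# Add an intercept column, then fit ordinary least squares
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
residuals = model.resid
fitted = model.fittedvalues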

## 2. Check Linearity & Homoscedasticity
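Plot the residuals against the fitted values. A curved pattern suggests non-linearity; a spread that widens or narrows suggests heteroscedasticity.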

In [None]:
plt.figure(figsize=(8, 5))
plt.scatter(fitted, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.show()
print("Look for: random scatter around the 0 line (good). Funnel shape or curve (bad).")

## 3. Check Normality of Errors
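Compare the residuals with a normal distribution using a Q-Q plot and a histogram. On the Q-Q plot, points that follow the reference line indicate approximately normal residuals.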

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
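# line='s' scales the reference line to the residuals' mean and std;
# line='45' would only be appropriate for standardized residuals
sm.qqplot(residuals, line='s', ax=ax[0])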
sns.histplot(residuals, kde=True, ax=ax[1])
plt.show()

## 4. Check Multicollinearity (VIF)
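The Variance Inflation Factor (VIF) measures how much the variance of a coefficient is inflated because its predictor is correlated with the other predictors.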

In [None]:
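# X still contains the intercept column added by sm.add_constant (index 0).
# Keep it in the design matrix, but report VIF only for the real predictors.
vif = pd.DataFrame()
vif["Variable"] = [f"X{i}" for i in range(1, X.shape[1])]
vif["VIF"] = [
    variance_inflation_factor(X, i)
    for i in range(1, X.shape[1])
]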

print(vif)
print("\nRule of Thumb: VIF above 5 (or, less strictly, above 10) indicates high multicollinearity.")

## Practice Exercise
Run the diagnostics above on the California Housing dataset.

In [None]:
# Load sklearn california housing dataset
# Fit OLS model
# Check assumptions
# Your code here