# Linear Regression 04 — Diagnostics, Assumptions & Failures  
**Deccan AI School (Premium Bootcamp)** — Working Professionals (IT/Software)

**Goal:** Learn production-ready regression:
- Assumptions (what must be approximately true)
- Diagnostic plots (residuals, QQ plot style intuition)
- Outliers and leverage
- Multicollinearity and VIF
- Heteroscedasticity
- Data leakage & drift (engineering traps)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8, 5)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## 1) Assumptions (talk like a senior data scientist)

Linear regression assumes (approximately):
1. Linearity: relationship is linear in parameters.
2. Independence: observations are independent.
3. Homoscedasticity: constant error variance across predictions.
4. Normal-ish residuals: mainly for confidence intervals, not prediction accuracy.
5. No strong multicollinearity: features not heavily redundant.

Reality:
- Assumptions are never perfectly true.
- Your job is to detect when violation is severe enough to matter.
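
Assumption 4 can be checked numerically with the idea behind a QQ plot: sort the standardized residuals and compare them with standard normal quantiles. The sketch below is a minimal illustration on made-up residuals; `qq_points`, `normal_res`, and `heavy_res` are hypothetical names, not part of this notebook's pipeline.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical, observed) quantile pairs for a normal QQ comparison."""
    r = np.sort(np.asarray(residuals, dtype=float))
    z = (r - r.mean()) / r.std(ddof=1)            # standardized, sorted residuals
    n = len(z)
    probs = (np.arange(1, n + 1) - 0.5) / n       # plotting positions strictly inside (0, 1)
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theo, z

rng_demo = np.random.default_rng(0)               # hypothetical demo data
normal_res = rng_demo.normal(0, 1, 500)
heavy_res = rng_demo.standard_t(df=2, size=500)   # heavy-tailed residuals

for name, res in [("normal", normal_res), ("heavy-tailed", heavy_res)]:
    theo, obs = qq_points(res)
    # A correlation close to 1 means the points hug the straight QQ line.
    print(name, round(np.corrcoef(theo, obs)[0, 1], 3))
```

Plotting `theo` against `obs` with `plt.scatter` gives the usual QQ picture; points bending away from the diagonal at the ends suggest heavy tails.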

## 2) Build a dataset with issues (so we can diagnose)

We intentionally create:
- multicollinearity: x2 ~ 2*x1
- heteroscedasticity: noise scale increases with |x1|
- some outliers

This makes diagnostics meaningful.

In [None]:
rng = np.random.default_rng(10)
n = 400

x1 = rng.normal(0, 1, n)
x2 = 2.0*x1 + rng.normal(0, 0.1, n)        # highly correlated with x1 (collinearity)
x3 = rng.normal(0, 1, n)

# Heteroscedastic noise: grows with |x1|
noise = rng.normal(0, 0.3 + 1.2*np.abs(x1), n)

y = 3 + 2.5*x1 - 1.5*x3 + noise

# Add a few outliers
out_idx = rng.choice(n, 6, replace=False)
y[out_idx] += rng.normal(15, 5, size=len(out_idx))

df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "y": y})
df.head()

In [None]:
X = df[["x1","x2","x3"]]
y = df["y"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

pred = lr.predict(X_test)

rmse = mean_squared_error(y_test, pred) ** 0.5
r2 = r2_score(y_test, pred)
rmse, r2

## 3) Residual plot: the first diagnostic you should always do

If you see:
- funnel shape (spread widens in one or both directions) → heteroscedasticity (variance changes with prediction)
- curve shape → missing non-linearity
- clusters → missing categorical features or segment behavior

In [None]:
residuals = y_test - pred

plt.scatter(pred, residuals, s=15)
plt.axhline(0)
plt.title("Residuals vs Predictions")
plt.xlabel("Predicted")
plt.ylabel("Residual (actual - predicted)")
plt.grid(True)
plt.show()
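
A residual plot can be backed up with a number: split the predictions into bins and compare the residual spread per bin. The sketch below assumes a one-feature demo dataset whose noise grows with |x|, similar in spirit to the data built earlier; all `_demo` names are hypothetical.

```python
import numpy as np

rng_demo = np.random.default_rng(1)
x_demo = rng_demo.normal(0, 1, 2000)
# Noise scale grows with |x|, so the variance is not constant.
y_demo = 3 + 2.5 * x_demo + rng_demo.normal(0, 0.3 + 1.2 * np.abs(x_demo))

# Fit a straight line and compute residuals.
slope, intercept = np.polyfit(x_demo, y_demo, 1)
pred_demo = intercept + slope * x_demo
res_demo = y_demo - pred_demo

# Split predictions into 5 equal-count bins and measure residual spread in each.
edges = np.quantile(pred_demo, np.linspace(0, 1, 6))
bin_id = np.digitize(pred_demo, edges[1:-1])      # bin labels 0..4
spread = np.array([res_demo[bin_id == b].std() for b in range(5)])
print(np.round(spread, 2))
```

Roughly equal values across bins are consistent with constant variance. In this demo the middle bin should be the tightest and the outer bins the widest, because the noise depends on |x|.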

## 4) Outlier intuition (bootcamp)

Outliers can:
- rotate your line (change slope)
- inflate error
- destabilize coefficients

In IT:
- Some projects are “special cases” (vendor delay, scope creep, compliance).
  Fitting them as if they were normal cases can distort the model for every other project.

We’ll highlight the test points with the largest absolute residuals.

In [None]:
abs_res = np.abs(residuals.to_numpy())   # plain array, so argsort returns positions
top = np.argsort(abs_res)[-10:]          # positions of the 10 largest |residuals|

plt.scatter(pred, residuals, s=15)
plt.scatter(pred[top], residuals.iloc[top], s=60, marker="x")
plt.axhline(0)
plt.title("Residuals with largest-error points highlighted")
plt.xlabel("Predicted")
plt.ylabel("Residual")
plt.grid(True)
plt.show()
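
Large residuals flag points that are unusual in y. Points that are unusual in feature space have high leverage, and those are the ones most able to rotate the line. The sketch below computes leverage as the diagonal of the hat matrix on made-up data; `leverage`, `X_demo`, and the `2 * p / n` cutoff are illustrative assumptions, and the cutoff is a common rule of thumb rather than a hard rule.

```python
import numpy as np

def leverage(X):
    """Diagonal of the hat matrix H = A (A^T A)^{-1} A^T, with A = [1, X]."""
    A = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    Q, _ = np.linalg.qr(A)               # reduced QR, so H = Q Q^T
    return np.sum(Q**2, axis=1)

rng_demo = np.random.default_rng(2)
X_demo = rng_demo.normal(0, 1, size=(200, 2))
X_demo[0] = [8.0, -8.0]                  # one row far from the rest of the data

h = leverage(X_demo)
p = X_demo.shape[1] + 1                  # number of fitted parameters, including intercept
threshold = 2 * p / len(X_demo)          # rule-of-thumb cutoff (assumption)
print(np.where(h > threshold)[0], round(h.sum(), 6))
```

The leverages always sum to the number of fitted parameters, which is a quick sanity check. A point with both high leverage and a large residual deserves a manual look before it is kept or dropped.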

## 5) Multicollinearity: when features fight each other

If two features contain the same information:
- coefficients become unstable (large standard errors)
- small data changes produce big coefficient changes
- individual coefficients become hard to interpret, even when predictions stay accurate

A quick early hint:
- correlation matrix

In [None]:
corr = df[["x1","x2","x3"]].corr()
corr

In [None]:
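plt.imshow(corr, aspect="auto", vmin=-1, vmax=1)   # pin the color scale to the full correlation range
plt.colorbar(label="correlation")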
plt.xticks(range(3), ["x1","x2","x3"])
plt.yticks(range(3), ["x1","x2","x3"])
plt.title("Correlation Heatmap (values shown in table above)")
plt.grid(False)
plt.show()
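
The instability described in this section can be made visible by refitting on bootstrap resamples and watching how much each coefficient moves. This is a minimal sketch on made-up data that mimics a near-duplicate feature pair; the names `a`, `b`, `c` and the choice of 200 resamples are assumptions for illustration.

```python
import numpy as np

rng_demo = np.random.default_rng(3)
n_demo = 400
a = rng_demo.normal(0, 1, n_demo)
b = 2.0 * a + rng_demo.normal(0, 0.1, n_demo)    # near-duplicate of a
c = rng_demo.normal(0, 1, n_demo)                # independent feature
y_demo = 3 + 2.5 * a - 1.5 * c + rng_demo.normal(0, 1, n_demo)

A = np.column_stack([np.ones(n_demo), a, b, c])  # intercept + three features

coefs = []
for _ in range(200):
    idx = rng_demo.integers(0, n_demo, n_demo)   # bootstrap resample of row positions
    beta, *_ = np.linalg.lstsq(A[idx], y_demo[idx], rcond=None)
    coefs.append(beta)
coef_sd = np.std(coefs, axis=0)

# Order: intercept, a, b, c. The collinear pair should vary far more than c.
print(np.round(coef_sd, 3))
```

The coefficients of `a` and `b` swing widely from resample to resample because the data cannot tell the two apart, while the coefficient of `c` stays comparatively stable.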

## 6) VIF (Variance Inflation Factor)

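VIF quantifies multicollinearity. For each feature j, regress it on all the other features and compute VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R² of that regression.

Rule-of-thumb:
- VIF ~ 1 → no collinearity
- VIF > 5 → concerning
- VIF > 10 → serious problem

We compute VIF manually with scikit-learn, without additional libraries.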

In [None]:
from sklearn.linear_model import LinearRegression

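def vif(df_features):
    """Return one VIF per column: regress each feature on the others, then 1 / (1 - R^2)."""
    X = df_features.values
    vifs = []
    for i in range(X.shape[1]):
        y_i = X[:, i]                          # feature i becomes the target
        X_others = np.delete(X, i, axis=1)     # all remaining features are the predictors
        model = LinearRegression().fit(X_others, y_i)
        r2 = model.score(X_others, y_i)        # R^2 of feature i explained by the others
        vifs.append(1 / (1 - r2))              # grows without bound as r2 approaches 1
    return pd.DataFrame({"feature": df_features.columns, "VIF": vifs})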

vif(df[["x1","x2","x3"]])
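
As a cross-check on a manual VIF, the same numbers can be read off the diagonal of the inverse correlation matrix. The sketch below compares both routes on made-up data; `vif_regression` and the `_demo`-style arrays are hypothetical helpers written only for this comparison.

```python
import numpy as np

rng_demo = np.random.default_rng(4)
a = rng_demo.normal(0, 1, 400)
b = 2.0 * a + rng_demo.normal(0, 0.1, 400)   # near-duplicate of a
c = rng_demo.normal(0, 1, 400)
F = np.column_stack([a, b, c])

# Route 1: VIF_j is the j-th diagonal entry of the inverse correlation matrix.
R = np.corrcoef(F, rowvar=False)
vif_inv = np.diag(np.linalg.inv(R))

# Route 2: the regression definition, 1 / (1 - R_j^2).
def vif_regression(F, j):
    target = F[:, j]
    others = np.column_stack([np.ones(len(F)), np.delete(F, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, target, rcond=None)
    ss_res = np.sum((target - others @ beta) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return ss_tot / ss_res                    # algebraically equal to 1 / (1 - R^2)

vif_reg = np.array([vif_regression(F, j) for j in range(F.shape[1])])
print(np.round(vif_inv, 1), np.round(vif_reg, 1))
```

If the two routes disagree noticeably, the correlation matrix is probably close to singular and the features are almost exact linear combinations of each other.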

## 7) What to do when collinearity is high?

Practical options:
1. Remove one of the correlated features.
2. Combine them into one feature (PCA, average, domain logic).
3. Use regularization (Ridge helps stabilize).
4. Collect more varied data (often best but hardest).

We’ll cover Ridge/Lasso as an extension after the linear regression track.
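
As a small preview of option 3, the sketch below fits a hand-rolled closed-form ridge on made-up collinear data and compares it with ordinary least squares (`alpha = 0`). `fit_ridge` and the value `alpha = 10.0` are assumptions for illustration; in practice `sklearn.linear_model.Ridge` would be the usual tool.

```python
import numpy as np

def fit_ridge(F, y, alpha):
    """Closed-form ridge on centered data; the intercept is not penalized."""
    F_mean, y_mean = F.mean(axis=0), y.mean()
    Fc, yc = F - F_mean, y - y_mean
    k = F.shape[1]
    coef = np.linalg.solve(Fc.T @ Fc + alpha * np.eye(k), Fc.T @ yc)
    return y_mean - F_mean @ coef, coef       # (intercept, coefficients)

rng_demo = np.random.default_rng(5)
a = rng_demo.normal(0, 1, 400)
b = 2.0 * a + rng_demo.normal(0, 0.1, 400)    # near-duplicate of a
y_demo = 3 + 2.5 * a + rng_demo.normal(0, 1, 400)
F = np.column_stack([a, b])

for alpha in (0.0, 10.0):                     # alpha = 0 is ordinary least squares
    _, coef = fit_ridge(F, y_demo, alpha)
    print(alpha, np.round(coef, 2))
```

Ridge pulls the two coefficients toward smaller, more similar values, while the combined effect `coef[0] + 2 * coef[1]` stays close to the true slope of 2.5. The penalty mostly removes the arbitrary split between the redundant features.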

## 8) Heteroscedasticity (variance changes)

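If error variance changes with the predictions:
- coefficient estimates stay unbiased, but standard errors (and so confidence intervals) become unreliable
- you might need:
  - transform target (log, when the target is strictly positive)
  - weighted least squares
  - robust regression
  - segment models

Here, we intentionally built it. Because the noise grows with |x1|, the residual spread should be narrowest near the middle of the prediction range and widen toward both ends (a two-sided “funnel”), although x3 and the small test set blur the pattern.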
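
Of the remedies listed above, weighted least squares is the easiest to sketch: scale each row by the inverse of its noise level so that noisy rows count less. The example below assumes the noise scale `sigma` is known, which is true only because the data is simulated; with real data it would have to be estimated or modeled. All names are hypothetical.

```python
import numpy as np

rng_demo = np.random.default_rng(6)
x_demo = rng_demo.normal(0, 1, 2000)
sigma = 0.3 + 1.2 * np.abs(x_demo)            # noise scale, known only because we simulate
y_demo = 3 + 2.5 * x_demo + rng_demo.normal(0, sigma)

A = np.column_stack([np.ones_like(x_demo), x_demo])

# Ordinary least squares: every row counts equally.
beta_ols, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# Weighted least squares: multiply each row by 1/sigma, then solve as usual.
w = 1.0 / sigma
beta_wls, *_ = np.linalg.lstsq(A * w[:, None], y_demo * w, rcond=None)

print(np.round(beta_ols, 3), np.round(beta_wls, 3))
```

Both fits should land near the true intercept 3 and slope 2.5. The difference is precision: the weighted fit leans on the low-noise rows, so its estimates vary less across repeated samples.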

## 9) Leakage & Drift (engineering traps)

### Leakage
When a feature contains information that is only available after the outcome is known:
- Example: "actual_close_date" used to predict "delivery_days"

Your model looks perfect in training and offline testing, then fails in production.
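
A leaky feature often shows up as a suspiciously high held-out score. The sketch below simulates that on made-up data: `team_size` stands in for a legitimate feature and `days_to_close` for a hypothetical leaky one derived from the target; neither comes from a real dataset.

```python
import numpy as np

rng_demo = np.random.default_rng(7)
n_demo = 1000
team_size = rng_demo.normal(0, 1, n_demo)     # hypothetical legitimate feature (standardized)
delivery_days = 30 + 5 * team_size + rng_demo.normal(0, 10, n_demo)
# Hypothetical leaky feature: recorded only after delivery, so it encodes the target.
days_to_close = delivery_days + rng_demo.normal(0, 0.5, n_demo)

def holdout_r2(features, y, n_train=800):
    """Fit least squares on the first n_train rows, return R^2 on the remaining rows."""
    A = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(A[:n_train], y[:n_train], rcond=None)
    y_test = y[n_train:]
    resid = y_test - A[n_train:] @ beta
    return 1 - np.sum(resid**2) / np.sum((y_test - y_test.mean()) ** 2)

r2_clean = holdout_r2(team_size, delivery_days)
r2_leaky = holdout_r2(np.column_stack([team_size, days_to_close]), delivery_days)
print(round(r2_clean, 3), round(r2_leaky, 3))
```

The held-out split does not protect against this kind of leak, because the leaky column is present in both halves. Only asking whether the column exists at prediction time catches it.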

### Drift
The input distribution changes after the model is deployed:
- Example: New tech stack introduced, team size policies change.

**Senior engineer habit:** Always ask:
- “Can this feature exist at prediction time?”
- “Will this feature distribution remain stable?”
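
The second question can be turned into a routine check by comparing a feature's training distribution with recent production values. One common choice is the population stability index (PSI), sketched below on made-up samples. The function `psi`, the 10 quantile bins, and the `eps` floor are assumptions for illustration, not a standard library API.

```python
import numpy as np

def psi(train_values, live_values, n_bins=10, eps=1e-6):
    """Population stability index between two samples of one feature."""
    # Bin edges come from training quantiles; the two outer bins are open-ended.
    edges = np.quantile(train_values, np.linspace(0, 1, n_bins + 1))[1:-1]
    p = np.bincount(np.digitize(train_values, edges), minlength=n_bins) / len(train_values)
    q = np.bincount(np.digitize(live_values, edges), minlength=n_bins) / len(live_values)
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)   # avoid log(0) for empty bins
    return float(np.sum((q - p) * np.log(q / p)))

rng_demo = np.random.default_rng(8)
train_feature = rng_demo.normal(0, 1, 5000)
live_same = rng_demo.normal(0, 1, 5000)        # same distribution as training
live_shifted = rng_demo.normal(1.0, 1, 5000)   # mean shifted by one standard deviation

print(round(psi(train_feature, live_same), 3), round(psi(train_feature, live_shifted), 3))
```

A value near zero means the two samples fill the bins in about the same proportions; larger values mean a bigger shift. What counts as too large is a judgment call that depends on how sensitive the model is to that feature.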

## 10) Mini Assignment (production mindset)

1. Remove x2 (highly collinear with x1)
2. Retrain and compare:
   - RMSE
   - coefficient stability (print coefficients)
3. Write 5 lines:
   - Why removing redundant features can improve interpretability.

Deliverable: 1 code cell + 1 markdown cell.