# Linear Regression 03 — Multiple & Polynomial Regression  
**Deccan AI School (Premium Bootcamp)** — Working Professionals (IT/Software)

**Goal:** Move from toy “one-feature” regression to realistic modeling:
- Multiple Linear Regression (many features)
- Feature engineering (interaction terms)
- Polynomial Regression (non-linear patterns captured by a linear model in a transformed feature space)
- Overfitting intuition + train/test evaluation

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8, 5)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

## 1) A realistic IT dataset (synthetic, but business-shaped)

We simulate "project delivery time" in days.

Features:
- story_points (work size)
- team_size
- tech_complexity (1-10)
- dependencies_count

Target:
- delivery_days

This is a very common scenario in IT planning and project management.

In [None]:
rng = np.random.default_rng(7)
n = 500

story_points = rng.integers(10, 200, size=n)
team_size = rng.integers(2, 15, size=n)
tech_complexity = rng.integers(1, 11, size=n)  # 1..10
dependencies = rng.integers(0, 12, size=n)

# Irreducible noise (standard deviation of 7 days)
noise = rng.normal(0, 7, size=n)

# True underlying function (unknown to the model)
delivery_days = (
    0.35 * story_points
    - 2.8 * team_size
    + 4.5 * tech_complexity
    + 1.8 * dependencies
    + 0.02 * (story_points * tech_complexity)  # interaction effect
    + noise
)

df = pd.DataFrame({
    "story_points": story_points,
    "team_size": team_size,
    "tech_complexity": tech_complexity,
    "dependencies": dependencies,
    "delivery_days": delivery_days
})

df.head()

## 2) Train/Test split (professional habit)

Always split before experimenting. If you tune features or models against the test set, its score stops being an honest estimate of performance on new data.

In production, you may also use:
- a validation set
- cross-validation

Here a single train/test split is enough to build intuition.
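
For reference, the cross-validation option mentioned above could look roughly like the sketch below. It is a minimal illustration on made-up data: `X_demo`, `y_demo`, and the coefficients used to generate them are assumptions for this example only, not part of the IT dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical stand-in data: 200 rows, 3 features, linear signal plus noise (std = 1).
rng_cv = np.random.default_rng(0)
X_demo = rng_cv.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng_cv.normal(0, 1.0, size=200)

# 5-fold CV: each fold is held out once while the model trains on the other four.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
)

# scikit-learn reports negated MSE (higher is better), so flip the sign before the square root.
rmse_per_fold = np.sqrt(-neg_mse)
print("RMSE per fold:", np.round(rmse_per_fold, 2))
print(f"Mean RMSE = {rmse_per_fold.mean():.2f} (+/- {rmse_per_fold.std():.2f})")
```

The spread across folds gives a rough sense of how much a single train/test score could move with a different split.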

In [None]:
X = df.drop(columns=["delivery_days"])
y = df["delivery_days"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

pred_train = lr.predict(X_train)
pred_test = lr.predict(X_test)

def report(y_true, y_pred, name=""):
    """Print MSE, RMSE, MAE, and R^2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    rmse = mse**0.5
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{name} MSE={mse:.2f} RMSE={rmse:.2f} MAE={mae:.2f} R2={r2:.3f}")

report(y_train, pred_train, "Train")
report(y_test, pred_test, "Test ")

## 3) Interpret coefficients like a business analyst

Coefficient meaning:
- Holding all other features constant, a 1-unit increase in a feature changes the predicted target by `coef` units.

In this IT dataset:
- The `team_size` coefficient should be negative (bigger team → faster delivery), but be careful:
  - In real life, an oversized team can slow delivery because of communication overhead.
  - Linear regression captures only the linear part unless we add features.
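
The "holding other features constant" reading can be checked numerically. The sketch below uses hypothetical toy data (`X_toy`, `y_toy`, and the coefficients 3.0 and -2.0 are assumptions for illustration): bump one feature by exactly 1 and the prediction moves by exactly that feature's coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with two features; values are illustrative only.
rng_coef = np.random.default_rng(0)
X_toy = rng_coef.uniform(0, 10, size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng_coef.normal(0, 0.5, size=100)

toy_model = LinearRegression().fit(X_toy, y_toy)

# Take one row, then bump only the first feature by exactly 1 unit.
base_row = np.array([[5.0, 5.0]])
bumped_row = base_row + np.array([[1.0, 0.0]])

delta = toy_model.predict(bumped_row)[0] - toy_model.predict(base_row)[0]
print(f"coef[0]              = {toy_model.coef_[0]:.4f}")
print(f"change in prediction = {delta:.4f}")
```

The two printed numbers agree because a linear model's prediction is a weighted sum of the features, so the starting row does not matter.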

In [None]:
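# Coefficients are in "days per one unit of that feature", so their sizes
# reflect each feature's units and are not directly comparable as importance.
coef_df = pd.DataFrame({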
    "feature": X.columns,
    "coef": lr.coef_
}).sort_values("coef", ascending=False)

coef_df

## 4) Feature engineering: Interaction terms

We secretly built the data with:
\[
0.02 \cdot story\_points \cdot tech\_complexity
\]

A basic linear regression without this interaction will miss some signal.

We can add an engineered feature:
- `sp_x_complexity = story_points * tech_complexity`

In [None]:
df["sp_x_complexity"] = df["story_points"] * df["tech_complexity"]

X2 = df.drop(columns=["delivery_days"])
y2 = df["delivery_days"]

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=42)

lr2 = LinearRegression()
lr2.fit(X2_train, y2_train)

report(y2_train, lr2.predict(X2_train), "Train (with interaction)")
report(y2_test, lr2.predict(X2_test), "Test  (with interaction)")

## 5) Residual visualization (quick preview)

Residual = y - y_hat.

Good sign:
- residuals are randomly scattered around 0

Bad sign:
- pattern/curve/funnel
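
A numeric companion to the visual check, as a hedged sketch on hypothetical data (`a`, `b`, `target`, and their coefficients are assumptions for this example): when a model omits a real interaction, its residuals still track that omitted term.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: the target depends on a, b, and their product a*b.
rng_res = np.random.default_rng(0)
a = rng_res.uniform(0, 10, size=500)
b = rng_res.uniform(0, 10, size=500)
target = 2.0 * a + 3.0 * b + 0.5 * a * b + rng_res.normal(0, 1.0, size=500)

# Fit WITHOUT the interaction term, mirroring a baseline multiple LR.
X_ab = np.column_stack([a, b])
baseline = LinearRegression().fit(X_ab, target)
resid = target - baseline.predict(X_ab)

# With an intercept, OLS training residuals average to zero by construction,
# so a zero mean alone does not prove the model is well specified.
print(f"mean residual = {resid.mean():.2e}")

# Leftover structure: residuals correlate with the centered product the model never saw.
omitted = (a - a.mean()) * (b - b.mean())
corr = np.corrcoef(resid, omitted)[0, 1]
print(f"corr(residual, omitted interaction) = {corr:.2f}")
```

A strong correlation like this is the numeric version of seeing a pattern in the residual plot.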

In [None]:
# Residuals of the baseline model `lr` (fitted without the interaction feature)
residuals = y_test - pred_test

plt.scatter(pred_test, residuals, s=15)
plt.axhline(0)
plt.title("Residuals vs Predictions (baseline multiple LR)")
plt.xlabel("Predicted")
plt.ylabel("Residual (actual - predicted)")
plt.grid(True)
plt.show()

## 6) Polynomial regression (why it exists)

Sometimes the relationship is not linear.
Examples:
- Performance improves with CPU allocation up to a point, then saturates.
- Marketing spend increases revenue, but marginal gains shrink as the market saturates.

Polynomial regression:
- transforms features into powers: x, x^2, x^3, ...
- still uses linear regression under the hood
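
To make the transformation concrete, here is a minimal sketch of what `PolynomialFeatures` produces for a single feature. The two sample values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x_small = np.array([[2.0], [3.0]])  # two samples, one feature

# degree=3 without the bias column yields the columns x, x^2, x^3.
poly3 = PolynomialFeatures(degree=3, include_bias=False)
x_expanded = poly3.fit_transform(x_small)

print(x_expanded)  # rows are [2, 4, 8] and [3, 9, 27]
```

The linear regression then fits one coefficient per column, which is why the model stays linear in its parameters even though the fitted curve is not a straight line.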

We’ll do a clean 1D example so students can SEE it.

In [None]:
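# Note: this cell reuses the names X, y, X_train, X_test, y_train, y_test,
# overwriting the IT-dataset versions from section 2 (X2_* / y2_* are untouched).
rng = np.random.default_rng(1)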
x = np.linspace(-3, 3, 120)
y = 1.2 + 0.5*x - 1.4*(x**2) + 0.3*(x**3) + rng.normal(0, 1.2, size=len(x))

X = x.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare degrees
degrees = [1, 2, 3, 8]
results = []

for d in degrees:
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=d, include_bias=False)),
        ("lr", LinearRegression())
    ])
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results.append((d, r2_score(y_test, pred), mean_squared_error(y_test, pred)**0.5))

pd.DataFrame(results, columns=["degree", "R2_test", "RMSE_test"])

In [None]:
# Plot fits
x_plot = np.linspace(-3, 3, 200).reshape(-1, 1)

plt.scatter(X_train, y_train, s=15, label="train")
plt.scatter(X_test, y_test, s=15, label="test")

for d in [1, 2, 3, 8]:
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=d, include_bias=False)),
        ("lr", LinearRegression())
    ])
    model.fit(X_train, y_train)
    plt.plot(x_plot, model.predict(x_plot), label=f"degree={d}")

plt.title("Polynomial Regression: Underfit vs Good Fit vs Overfit")
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True)
plt.legend()
plt.show()

## 7) Overfitting intuition (bootcamp talk)

- Degree 1: underfit (too simple)
- Degree 2: captures the curvature but cannot represent the cubic term
- Degree 3: matches the true generating process
- Degree 8: flexible enough to start chasing noise

Key concept:
> More complexity reduces training error (or at least never increases it for these nested least-squares fits), but may increase test error.

This is the bias-variance tradeoff.
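
The claim can be checked by printing train and test RMSE side by side. The sketch below rebuilds the same cubic process as the 1D example under separate names (`x_bv`, `y_bv`, `Xtr`, `Xte` are names chosen for this example). Training RMSE can only fall or stay flat as the degree grows, because each higher-degree model contains the lower-degree ones; test RMSE carries no such guarantee.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same cubic generating process as the 1D example, rebuilt with separate names.
rng_bv = np.random.default_rng(1)
x_bv = np.linspace(-3, 3, 120)
y_bv = 1.2 + 0.5 * x_bv - 1.4 * x_bv**2 + 0.3 * x_bv**3 + rng_bv.normal(0, 1.2, size=len(x_bv))
Xtr, Xte, ytr, yte = train_test_split(
    x_bv.reshape(-1, 1), y_bv, test_size=0.2, random_state=42
)

train_rmse, test_rmse = {}, {}
for d in [1, 2, 3, 8]:
    m = Pipeline([
        ("poly", PolynomialFeatures(degree=d, include_bias=False)),
        ("lr", LinearRegression()),
    ]).fit(Xtr, ytr)
    # Score the same fitted model on the data it saw and the data it did not.
    train_rmse[d] = mean_squared_error(ytr, m.predict(Xtr)) ** 0.5
    test_rmse[d] = mean_squared_error(yte, m.predict(Xte)) ** 0.5
    print(f"degree={d}  train RMSE={train_rmse[d]:.3f}  test RMSE={test_rmse[d]:.3f}")
```

A widening gap between the two columns at higher degrees is the practical signal of overfitting.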

## 8) Professional pattern: scaling + polynomial in a pipeline

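If you apply polynomial expansion to raw features, values can explode: a `story_points` value of 150 becomes 22,500 when squared.
Scaling first keeps the expanded features on comparable magnitudes, which improves numerical stability. For plain least squares the predictions are essentially unchanged; scaling matters more once regularization is added.

We'll show a pipeline pattern suited to real work.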
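
To see the magnitude problem directly, the sketch below compares degree-2 features built from raw versus standardized inputs. The two hypothetical columns only mimic the ranges of `story_points` and `tech_complexity`; the condition number is used here as a rough indicator of numerical difficulty.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical columns mimicking story_points (10..199) and tech_complexity (1..10).
rng_sc = np.random.default_rng(0)
raw = np.column_stack([
    rng_sc.integers(10, 200, size=300),
    rng_sc.integers(1, 11, size=300),
]).astype(float)

poly2 = PolynomialFeatures(degree=2, include_bias=False)
raw_poly = poly2.fit_transform(raw)                                      # expand raw values
scaled_poly = poly2.fit_transform(StandardScaler().fit_transform(raw))  # scale, then expand

print(f"largest |value|   raw: {np.abs(raw_poly).max():.0f}   scaled: {np.abs(scaled_poly).max():.2f}")
print(f"condition number  raw: {np.linalg.cond(raw_poly):.1e}   scaled: {np.linalg.cond(scaled_poly):.1e}")
```

Putting the scaler before the polynomial step inside one `Pipeline` also ensures the scaling statistics are learned from the training data only.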

In [None]:
model = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("lr", LinearRegression())
])

model.fit(X2_train, y2_train)
report(y2_test, model.predict(X2_test), "Polynomial(deg=2) on IT dataset")

## 9) Mini Projects (students can do with synthetic or real data)

### Project A: Software Sprint Delivery Predictor
- Predict delivery days using story points, team size, complexity, dependencies.
- Add interaction features.
- Compare:
  - Multiple LR baseline
  - Polynomial with degree 2

### Project B: Cloud Cost Estimator
- Predict monthly cost using:
  - compute_hours
  - storage_gb
  - network_gb
  - region factor
- Interpret coefficients.

Deliverable:
- a notebook report: data → model → metrics → interpretation → recommendations

## 10) Mini Assignment (must do)

1. Add a new feature: `team_size_squared = team_size ** 2` (in Python, `^` is bitwise XOR, not a power)
2. Train model and compare test metrics.
3. Explain in 3 lines:
   - why such a feature could represent coordination overhead (Brooks’ Law).

This is an excellent interview talking point.
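
A hedged hint for step 3: suppose the fitted model contains both `team_size` and `team_size_squared`, with coefficients written here as \(\beta_1\) and \(\beta_2\) (symbols introduced only for this sketch). Holding the other features constant, the predicted delivery time stops improving where the slope with respect to team size is zero:

\[
\frac{\partial\, \widehat{delivery\_days}}{\partial\, team\_size}
= \beta_1 + 2\,\beta_2 \cdot team\_size = 0
\quad\Rightarrow\quad
team\_size^{*} = -\frac{\beta_1}{2\,\beta_2}
\]

With \(\beta_1 < 0\) and \(\beta_2 > 0\), this is a minimum: adding people beyond \(team\_size^{*}\) increases predicted delivery time. The synthetic generating function in this notebook has no squared team term, so expect the fitted \(\beta_2\) to be close to zero here.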