# Linear Regression 05 — Projects & Capstone (Industry Style)  
**Deccan AI School (Premium Bootcamp)** — Working Professionals (IT/Software)

**Goal:** Turn linear regression into an employable skill:
- End-to-end mini projects
- Business framing + KPIs
- Model explanation + recommendations
- Clean reporting format students can reuse

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8, 5)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

## Project Template (students should follow for every project)

### 1) Problem Statement (business language)
### 2) Input features available at prediction time
### 3) Target definition + units
### 4) Data cleaning decisions
### 5) Baseline model (Linear Regression)
### 6) Evaluation metrics (RMSE/MAE/R²)
### 7) Interpretation & recommendations
### 8) Risks (leakage, drift, outliers)
### 9) Next steps (feature engineering / better models)
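
Steps 5 and 6 of the template can be wrapped in a small reusable helper. The sketch below is one possible shape, not a prescribed one: `evaluate_regression` is a hypothetical helper name, the two-feature dataset is made up purely for illustration, and scikit-learn's `DummyRegressor` (which always predicts the training mean) is added as a naive reference point so the linear model's metrics have something to be compared against.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_regression(model, X_test, y_test):
    """Return RMSE, MAE and R² for an already fitted model on held-out data."""
    pred = model.predict(X_test)
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
        "r2": float(r2_score(y_test, pred)),
    }


# Illustrative data only (assumption): two features with a known linear signal.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 1.0, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

naive = DummyRegressor(strategy="mean").fit(X_tr, y_tr)  # predicts the training mean
linear = LinearRegression().fit(X_tr, y_tr)

naive_scores = evaluate_regression(naive, X_te, y_te)
linear_scores = evaluate_regression(linear, X_te, y_te)
print("naive :", naive_scores)
print("linear:", linear_scores)
```

Reporting the naive row next to the model row makes the KPI discussion easier: the business question is how much error the model removes, not just what its RMSE is.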

## Project 1 — Cloud Cost Estimator (IT/Engineering)

### Problem
Predict monthly cloud bill (₹) for a workload.

### Why it matters
- FinOps and cost governance
- Budget forecasting
- Prevent bill shocks

### Features (examples)
- compute_hours
- storage_gb
- network_egress_gb
- region_factor (price multiplier: 1.0 = cheapest region, 1.2 = most expensive)

In [None]:
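rng = np.random.default_rng(3)  # fixed seed so the synthetic data is reproducible
n = 600

# Simulated monthly usage per workload
compute_hours = rng.uniform(50, 2000, n)
storage_gb = rng.uniform(50, 5000, n)
network_gb = rng.uniform(10, 3000, n)
# Regional price multiplier; cheaper regions are sampled more often
region_factor = rng.choice([1.0, 1.1, 1.2], size=n, p=[0.5, 0.3, 0.2])

noise = rng.normal(0, 8000, n)  # unexplained month-to-month variation (₹)

# Usage charges are scaled by the region multiplier; ₹15,000 is a fixed base charge
monthly_cost = (
    18 * compute_hours +
    2.5 * storage_gb +
    6.5 * network_gb
) * region_factor + 15000 + noise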

df = pd.DataFrame({
    "compute_hours": compute_hours,
    "storage_gb": storage_gb,
    "network_gb": network_gb,
    "region_factor": region_factor,
    "monthly_cost_inr": monthly_cost
})

df.head()

In [None]:
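# Features are every column except the target
X = df.drop(columns=["monthly_cost_inr"])
y = df["monthly_cost_inr"]

# Hold out 20% of the rows to estimate error on unseen workloads
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
rmse = mse**0.5                          # same units as the target (₹)
mae = mean_absolute_error(y_test, pred)  # average absolute miss (₹)
r2 = r2_score(y_test, pred)              # share of variance explained

rmse, mae, r2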

### Interpretation
Now we interpret coefficients like a FinOps/Cloud Architect:

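- If the compute_hours coefficient is ≈ 18:
  > “Every additional compute hour adds about ₹18 to the monthly bill, holding the other features constant.”

Because the simulated bill multiplies usage by region_factor, the fitted coefficient is an average over regions and will likely land slightly above 18.

A per-unit rupee figure like this gives leadership a concrete number to work with in cost discussions.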
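
A short derivation shows roughly what value the fitted compute_hours coefficient should land near on the simulated data in this project. It assumes the four features are generated independently of one another, which holds for the simulation here but not necessarily for real billing data. The symbols c, s, g, r and ε (compute hours, storage GB, network GB, region factor, noise) are shorthand introduced only for this note.

```latex
\begin{align*}
\text{cost} &= (18\,c + 2.5\,s + 6.5\,g)\,r + 15000 + \varepsilon \\[4pt]
\beta_c &= \frac{\operatorname{Cov}(c,\ \text{cost})}{\operatorname{Var}(c)}
         = \frac{18\,\mathbb{E}[r]\,\operatorname{Var}(c)}{\operatorname{Var}(c)}
         = 18\,\mathbb{E}[r] \\[4pt]
\mathbb{E}[r] &= 0.5(1.0) + 0.3(1.1) + 0.2(1.2) = 1.07 \\[4pt]
\beta_c &\approx 18 \times 1.07 = 19.26
\end{align*}
```

So on this dataset a coefficient around ₹19 per compute hour is the expected outcome, up to sampling noise: it is the ₹18 base rate averaged over the regional price multipliers.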

In [None]:
coef_df = pd.DataFrame({"feature": X.columns, "coef": model.coef_}).sort_values("coef", ascending=False)
coef_df

In [None]:
# Residual plot
residuals = y_test - pred
plt.scatter(pred, residuals, s=12)
plt.axhline(0)
plt.title("Project 1: Residuals vs Predictions")
plt.xlabel("Predicted Cost")
plt.ylabel("Residual")
plt.grid(True)
plt.show()

### Recommendations (example output students should write)
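- Identify top cost drivers using coefficients compared on a common scale (for example, coefficient × typical feature spread), since raw coefficients are in different units.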
- Suggest cost optimizations:
  - reduce egress
  - use storage lifecycle policies
  - schedule compute
- Mention model limitations:
  - ignores discounts/commitment plans
  - ignores burst patterns

## Project 2 — Sprint Delivery Days Predictor (IT/Product)

### Problem
Predict delivery days from:
- story_points
- team_size
- complexity
- dependencies

We reuse the same idea from Notebook 03, but now we structure it as a project.

In [None]:
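rng = np.random.default_rng(11)  # fixed seed for reproducibility
n = 500

# Simulated sprint attributes
story_points = rng.integers(10, 200, size=n)
team_size = rng.integers(2, 15, size=n)
complexity = rng.integers(1, 11, size=n)    # integer score from 1 to 10
dependencies = rng.integers(0, 12, size=n)
noise = rng.normal(0, 7, size=n)

# The ground truth includes a story_points × complexity interaction that the
# plain linear model below is not given, so some structure stays in the residuals.
# This formula can also yield negative days for small stories on large teams.
delivery_days = (
    0.32*story_points - 2.4*team_size + 4.8*complexity + 1.5*dependencies
    + 0.02*(story_points*complexity) + noise
)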

df2 = pd.DataFrame({
    "story_points": story_points,
    "team_size": team_size,
    "complexity": complexity,
    "dependencies": dependencies,
    "delivery_days": delivery_days
})

X = df2.drop(columns=["delivery_days"])
y = df2["delivery_days"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

m = LinearRegression()
m.fit(X_train, y_train)
pred = m.predict(X_test)

print("RMSE:", mean_squared_error(y_test, pred)**0.5)
print("R2:", r2_score(y_test, pred))

pd.DataFrame({"feature": X.columns, "coef": m.coef_}).sort_values("coef", ascending=False)
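
The simulated delivery time contains a story_points × complexity term, but the model above only sees the four raw columns. A minimal sketch of adding that interaction as an explicit feature follows. It regenerates the same data so it is self-contained; `sp_x_complexity` and `fit_and_score` are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Rebuild the Project 2 synthetic data (same seed, same order of random draws).
rng = np.random.default_rng(11)
n = 500
df2 = pd.DataFrame({
    "story_points": rng.integers(10, 200, size=n),
    "team_size": rng.integers(2, 15, size=n),
    "complexity": rng.integers(1, 11, size=n),
    "dependencies": rng.integers(0, 12, size=n),
})
df2["delivery_days"] = (
    0.32 * df2["story_points"] - 2.4 * df2["team_size"]
    + 4.8 * df2["complexity"] + 1.5 * df2["dependencies"]
    + 0.02 * df2["story_points"] * df2["complexity"]
    + rng.normal(0, 7, size=n)
)

# Engineered interaction feature (hypothetical column name).
df2["sp_x_complexity"] = df2["story_points"] * df2["complexity"]

base_cols = ["story_points", "team_size", "complexity", "dependencies"]
inter_cols = base_cols + ["sp_x_complexity"]
train_df, test_df = train_test_split(df2, test_size=0.2, random_state=42)


def fit_and_score(cols):
    """Fit on the training rows and return the model plus train/test RMSE."""
    model = LinearRegression().fit(train_df[cols], train_df["delivery_days"])
    rmse = {}
    for name, part in [("train", train_df), ("test", test_df)]:
        pred = model.predict(part[cols])
        rmse[name] = float(np.sqrt(mean_squared_error(part["delivery_days"], pred)))
    return model, rmse


base_model, base_rmse = fit_and_score(base_cols)
inter_model, inter_rmse = fit_and_score(inter_cols)
interaction_coef = float(inter_model.coef_[-1])  # coefficient of sp_x_complexity

print("base RMSE       :", base_rmse)
print("interaction RMSE:", inter_rmse)
print("interaction coef:", round(interaction_coef, 4))
```

Training RMSE cannot get worse when a feature is added, so the test RMSE is the number to report; on a single 20% split the improvement may be small or even absent, which is worth stating in the write-up. The fitted interaction coefficient should sit near the 0.02 used to generate the data.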

## Capstone Challenge (for Deccan AI School)

**Build a “Project Planning Assistant” regression model**

Requirements:
1. Choose one project (cloud cost or sprint delivery).
2. Write a full report section in markdown:
   - problem statement
   - KPI
   - model result
   - coefficient interpretation
   - 3 actionable business recommendations
3. Add 2 risk notes:
   - leakage
   - drift
4. Optional: add interaction features and justify why.
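
For the drift risk note, one lightweight check is to compare feature means in new data against the training data, measured in training standard deviations. The sketch below covers drift only, not leakage. `mean_shift_report` is a hypothetical helper, the 0.5 threshold is an arbitrary illustrative cutoff, and the "new" data is simulated with a deliberate jump in network egress.

```python
import numpy as np
import pandas as pd


def mean_shift_report(train_df, new_df, threshold=0.5):
    """Shift of each feature's mean in new data, in training standard deviations."""
    new_df = new_df[train_df.columns]  # align column order with the training data
    shift = (new_df.mean() - train_df.mean()) / train_df.std()
    return pd.DataFrame({"shift_in_std": shift, "flag": shift.abs() > threshold})


rng = np.random.default_rng(0)

# Simulated training-period features (ranges borrowed from Project 1).
train_features = pd.DataFrame({
    "compute_hours": rng.uniform(50, 2000, 600),
    "network_gb": rng.uniform(10, 3000, 600),
})

# Hypothetical later period: compute usage unchanged, egress much higher.
new_features = pd.DataFrame({
    "compute_hours": rng.uniform(50, 2000, 200),
    "network_gb": rng.uniform(1500, 4500, 200),
})

report = mean_shift_report(train_features, new_features)
print(report.round(2))
```

A flagged feature does not prove the model is wrong, but it is a reasonable trigger to re-check error on recent data and consider retraining.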

**Submission format:**
- `.ipynb` + exported PDF
- one summary slide with the results

## Interview-ready resume bullets (students can copy)

- Built an end-to-end regression model to forecast monthly cloud costs from compute, storage, network, and region metrics; reported test RMSE, MAE, and R², and turned the interpretable coefficients into actionable FinOps recommendations.
- Developed a sprint delivery predictor using multiple linear regression on story points, team size, complexity, and dependencies; evaluated an interaction term to capture complexity effects and documented leakage and drift risks.