# Linear Regression 01 — Foundations & Mathematical Intuition  
**Deccan AI School (Premium Bootcamp)** — Working Professionals (IT/Software)  
**Goal:** Understand *why* linear regression works, not just how to call a library.

---

## How to use this notebook
- Read the markdown carefully (it is written as if your instructor is speaking).
- Run code cells top to bottom.
- Pause at "Stop & Think" prompts — these are interview-style checkpoints.

---

## What you will learn
1. Regression vs Classification (in practical terms)
2. What a "line of best fit" really means (geometry + intuition)
3. The hypothesis function and why we use MSE
4. Residuals: the core object you should think about
5. A gentle bridge to statistics: mean/variance/covariance/correlation
6. What the model can and cannot do in real systems

## 1) Regression in the real world (Manager mindset)

If your manager asks:
- “How much will it cost if we add 2 engineers?”
- “How will revenue change if we increase marketing spend?”
- “What’s the expected delivery time if we change the route length?”
… they are implicitly asking for **a function that maps inputs → a number**.

That is **regression**.

### Why Linear Regression is still used in 2026
Even in the era of deep learning:
- It’s a strong baseline (fast + explainable).
- It gives interpretable coefficients (feature impact).
- It is cheap to train and deploy.
- It’s often “good enough” and safer to ship.

> In many orgs: you start with Linear Regression, then you prove why you need something more complex.

## 2) Regression vs Classification (Software analogy)

Think like a backend engineer:

- **Classification** is like choosing a label / route:
  - `/approve` vs `/reject`
  - `spam` vs `not spam`

- **Regression** is like returning a numeric response:
  - `ETA = 24 minutes`
  - `price = ₹ 1,25,000`
  - `CPU utilization = 73%`

So regression outputs a **continuous** value.

---

## 3) The simplest regression model

We assume a relationship:

\[
\hat{y} = \theta_0 + \theta_1 x
\]

Where:
- \(\theta_0\) = intercept (the predicted value when \(x = 0\))
- \(\theta_1\) = slope (change in \(\hat{y}\) per one-unit increase in \(x\))
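
As a quick worked example, the sketch below evaluates the hypothesis for a few inputs. The parameter values (3.0 and 0.8) and the function name `hypothesis` are illustrative assumptions, not fitted values or names used elsewhere in this notebook.

```python
# Illustrative (assumed) parameters, not fitted to any data
theta0 = 3.0   # intercept: predicted y when x = 0
theta1 = 0.8   # slope: change in predicted y per one-unit increase in x

def hypothesis(x, theta0, theta1):
    """Simple linear hypothesis: y_hat = theta0 + theta1 * x."""
    return theta0 + theta1 * x

for years in (0, 1, 5):
    print(years, "->", hypothesis(years, theta0, theta1))

# Moving x up by one unit changes the prediction by exactly theta1
print("Step from 5 to 6:", hypothesis(6, theta0, theta1) - hypothesis(5, theta0, theta1))
```

Notice that the intercept only matters as a starting level; every one-unit step in x adds the same amount, which is what "linear" means here.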

In [None]:
# Imports used across the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Default figure size for all plots (the grid is switched on per plot)
plt.rcParams["figure.figsize"] = (8, 5)

## 4) A toy dataset (IT Salary example)

We’ll start with a tiny dataset.
- `x` = years of experience
- `y` = salary in LPA (lakhs of rupees per annum; rough, simplified)

This is just to build intuition first.

In [None]:
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([3.5, 4.2, 5.1, 6.0, 6.8, 7.6, 8.1, 9.0, 9.7, 10.4, 11.2], dtype=float)

df = pd.DataFrame({"years_exp": x, "salary_lpa": y})
df.head()

## 5) Visual intuition: “What are we trying to do?”

We want a line that is “as close as possible” to the points.

But close in what sense?

We need a definition of "best".

In [None]:
plt.scatter(x, y)
plt.xlabel("Years of Experience")
plt.ylabel("Salary (LPA)")
plt.title("Data: Experience vs Salary")
plt.grid(True)
plt.show()

## 6) Residuals: the most important concept

For any point:
- prediction = \(\hat{y}\)
- actual = \(y\)

Residual:
\[
r = y - \hat{y}
\]

A good model has residuals that are:
- small on average
- not systematically patterned (we'll study this later)

**Stop & Think (Interview):**  
If the residuals trace a curve when plotted against x, what does it indicate?
- Hint: the model is too simple (underfitting); a straight line is missing a non-linear effect.
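
To make residuals concrete, here is a minimal sketch that computes them for a hand-picked line on the salary data from this notebook. The line (intercept 3.0, slope 0.8) is an assumption chosen for illustration, not the best fit.

```python
import numpy as np

# Toy data copied from the salary example in this notebook
x = np.arange(11, dtype=float)
y = np.array([3.5, 4.2, 5.1, 6.0, 6.8, 7.6, 8.1, 9.0, 9.7, 10.4, 11.2])

# Hand-picked (assumed) line, not the best fit
theta0, theta1 = 3.0, 0.8
y_hat = theta0 + theta1 * x

residuals = y - y_hat  # r = y - y_hat, one value per data point
print("Residuals:", np.round(residuals, 2))
print("Mean residual:", residuals.mean())

# Check whether every residual has the same sign
print("All positive?", bool(np.all(residuals > 0)))
```

When every residual has the same sign, the line sits entirely on one side of the data. That is a systematic pattern: shifting the line would shrink every error at once.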

## 7) Why Mean Squared Error (MSE)?

Cost function:
\[
J(\theta_0,\theta_1) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
\]

Why square?
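- Negative and positive errors shouldn’t cancel out.
- Big errors are punished more: doubling an error quadruples its penalty.
- For a linear model, squared error is smooth and convex in \(\theta_0, \theta_1\), so there are no spurious local minima to get stuck in (easy to optimize).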

**Practical note:** MSE is not the only choice, but it is the standard baseline.
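
The reasons for squaring can be checked numerically. The sketch below uses small made-up error vectors (hypothetical values, chosen only to illustrate the points) and compares the mean raw error, the mean absolute error (MAE), and the MSE.

```python
import numpy as np

def mean_abs(errors):
    """Mean absolute error of a residual vector."""
    return np.mean(np.abs(errors))

def mean_sq(errors):
    """Mean squared error of a residual vector."""
    return np.mean(errors ** 2)

# Hypothetical residuals whose signs cancel
cancelling = np.array([-2.0, 2.0, -1.0, 1.0])
mean_raw = cancelling.mean()
print("Mean raw error:", mean_raw)            # looks perfect, but is not
print("MSE:", mean_sq(cancelling))            # squaring removes the sign

# Same total absolute error: spread evenly vs concentrated in one point
even = np.array([2.0, 2.0])
spiky = np.array([0.0, 4.0])
print("even  MAE:", mean_abs(even), "MSE:", mean_sq(even))
print("spiky MAE:", mean_abs(spiky), "MSE:", mean_sq(spiky))
```

MAE cannot tell the two cases apart, while MSE penalizes the concentrated error more. Whether that is desirable depends on how costly large misses are in your system.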

## 8) Try a random line and compute the MSE (manual intuition)

We’ll pick a slope and intercept and see how bad it is.

This is how you develop intuition: *try → measure → improve*.

In [None]:
def predict_line(x, theta0, theta1):
    return theta0 + theta1 * x

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

theta0, theta1 = 3.0, 0.8
y_hat = predict_line(x, theta0, theta1)
print("MSE:", mse(y, y_hat))

plt.scatter(x, y, label="Actual")
plt.plot(x, y_hat, label=f"Pred: theta0={theta0}, theta1={theta1}")
plt.xlabel("Years of Experience")
plt.ylabel("Salary (LPA)")
plt.title("A random line + its fit")
plt.grid(True)
plt.legend()
plt.show()

## 9) Geometry: “Best fit” is minimizing total squared vertical distance

Important: standard linear regression minimizes vertical distances (errors in y). This
assumes x is measured without error, or that its error is negligible.

Some fields (e.g., physics) also model errors in x, which leads to total least squares.
For business prediction, standard least squares is typical.
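
To see the difference between the two notions of distance, the sketch below compares the vertical distance from a point to a line with the perpendicular (shortest) distance. The line and the point are hypothetical values chosen for illustration.

```python
import numpy as np

# Assumed line and a single hypothetical point, for illustration only
theta0, theta1 = 3.0, 0.8
px, py = 5.0, 9.0

# Vertical distance: the quantity that standard least squares squares and sums
vertical = abs(py - (theta0 + theta1 * px))

# Perpendicular distance from the point to the line y = theta0 + theta1 * x
perpendicular = vertical / np.sqrt(1.0 + theta1 ** 2)

print("Vertical distance:", vertical)
print("Perpendicular distance:", perpendicular)
```

The perpendicular distance is never larger than the vertical one, and the two coincide only for a flat line (slope 0).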

## 10) Statistics bridge (light, but meaningful)

Two key ideas:

### Variance  
How much a variable spreads out around its mean.

### Covariance  
How much two variables change together.
- Positive covariance → when x increases, y tends to increase
- Negative covariance → when x increases, y tends to decrease

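Correlation is covariance normalized by both standard deviations, so it always lies between -1 and 1:
\[
\rho = \frac{\operatorname{cov}(x,y)}{\sigma_x \sigma_y}
\]

Correlation tells you the sign of the slope in advance: in simple linear regression \(\theta_1 = \rho\,\sigma_y / \sigma_x\), and standard deviations are never negative.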

In [None]:
# Compute correlation for intuition
corr = np.corrcoef(x, y)[0, 1]
corr
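
The `np.corrcoef` value can be rebuilt by hand from the definitions of variance and covariance. The sketch below does that on the same salary data and also computes the simple-regression slope as cov(x, y) / var(x). It uses the population versions (divide by n), which is an assumption; the ratios come out the same with n - 1.

```python
import numpy as np

# Toy data copied from the salary example in this notebook
x = np.arange(11, dtype=float)
y = np.array([3.5, 4.2, 5.1, 6.0, 6.8, 7.6, 8.1, 9.0, 9.7, 10.4, 11.2])

dx, dy = x - x.mean(), y - y.mean()   # deviations from the mean
var_x = np.mean(dx ** 2)              # population variance of x
var_y = np.mean(dy ** 2)              # population variance of y
cov_xy = np.mean(dx * dy)             # population covariance of x and y

rho = cov_xy / np.sqrt(var_x * var_y)     # correlation = normalized covariance
slope = cov_xy / var_x                    # simple-regression slope
intercept = y.mean() - slope * x.mean()   # the fitted line passes through the means

print("rho by hand:", rho, "| np.corrcoef:", np.corrcoef(x, y)[0, 1])
print("slope:", slope, "| intercept:", intercept)
```

The slope and the correlation share the same numerator, which is why they always share the same sign.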

## 11) Closed-form solution (preview)

In later notebooks we’ll derive it fully, but here’s the key fact:

There is a direct formula for the best parameters:
\[
\theta = (X^TX)^{-1}X^Ty
\]

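This formula is the **normal equation**, and the parameters it returns are the **Ordinary Least Squares (OLS)** solution. Here \(X\) is the design matrix (a column of ones for the intercept, then one column per feature), and the formula requires \(X^TX\) to be invertible.

We’ll implement it next, then later derive it properly.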

In [None]:
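# Build the design matrix: a column of ones (intercept) next to x
X = np.c_[np.ones_like(x), x]  # shape (n, 2)
# Direct translation of the normal equation (fine for this tiny problem)
theta = np.linalg.inv(X.T @ X) @ (X.T @ y)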

theta0_best, theta1_best = theta
theta0_best, theta1_best
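
The explicit inverse mirrors the formula, but in practice a linear solver or a least-squares routine is usually preferred because it avoids forming the inverse. The sketch below is a cross-check rather than a replacement for the cell above: it computes the same parameters three ways on the same data.

```python
import numpy as np

# Toy data copied from the salary example in this notebook
x = np.arange(11, dtype=float)
y = np.array([3.5, 4.2, 5.1, 6.0, 6.8, 7.6, 8.1, 9.0, 9.7, 10.4, 11.2])
X = np.c_[np.ones_like(x), x]  # design matrix: column of ones, then x

# 1) Explicit inverse, a literal reading of the normal equation
theta_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# 2) Solve (X^T X) theta = X^T y without forming the inverse
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3) Least squares directly on X (lstsq also returns residuals, rank, singular values)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("inv:  ", theta_inv)
print("solve:", theta_solve)
print("lstsq:", theta_lstsq)
print("All agree:", np.allclose(theta_inv, theta_solve) and np.allclose(theta_inv, theta_lstsq))
```

On a small, well-behaved problem like this one the three results match; the difference shows up when features are nearly collinear and \(X^TX\) is close to singular.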

In [None]:
y_hat_best = X @ theta

print("Best-fit MSE:", mse(y, y_hat_best))

plt.scatter(x, y, label="Actual")
plt.plot(x, y_hat_best, label="Best-fit line (OLS)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary (LPA)")
plt.title("OLS Best Fit")
plt.grid(True)
plt.legend()
plt.show()

## 12) Interpret coefficients like a professional

If \(\theta_1 = 0.76\), you should say:

> “On average, each additional year of experience is associated with ~0.76 LPA higher salary, **assuming other factors remain constant**.”

That last part (“other factors remain constant”) becomes crucial in **multiple regression**, where each coefficient is read with the other features held fixed.

## 13) Common misconceptions (Bootcamp warnings)

1. **High correlation does NOT mean causation**
   - Salary and experience correlate, but unmeasured variables may influence both.

2. **Linear regression does NOT guarantee linear reality**
   - The model fits the best straight-line approximation, even when the true relationship is curved.

3. **Good training fit is not enough**
   - You need generalization (performance on data the model has not seen).

4. **Outliers can hijack the line**
   - Because errors are squared, one extreme point can noticeably tilt the slope.

We’ll build diagnostics later to detect these.
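
The outlier warning is easy to demonstrate. In the sketch below, the last salary value is replaced with 30.0 (a hypothetical data-entry error, invented for this illustration), and the line is refitted with `np.polyfit`.

```python
import numpy as np

# Toy data copied from the salary example in this notebook
x = np.arange(11, dtype=float)
y = np.array([3.5, 4.2, 5.1, 6.0, 6.8, 7.6, 8.1, 9.0, 9.7, 10.4, 11.2])

# Corrupt a single point (hypothetical data-entry error)
y_outlier = y.copy()
y_outlier[-1] = 30.0

# np.polyfit with deg=1 returns [slope, intercept]
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)
slope_out, intercept_out = np.polyfit(x, y_outlier, deg=1)

print("Clean fit:   slope =", round(slope_clean, 3), "intercept =", round(intercept_clean, 3))
print("With outlier: slope =", round(slope_out, 3), "intercept =", round(intercept_out, 3))
```

One corrupted point out of eleven is enough to roughly double the slope here, partly because it sits at the edge of the x range where it has the most leverage.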

## 14) Mini Assignment (Do it now)

1. Create a new dataset where salary increases faster after 6 years (non-linear).
2. Fit a straight line and plot residuals.
3. Write 3 lines explaining why linear regression struggles there.

**Deliverable:** One plot + one markdown explanation.