# Module 2 — Linear, Polynomial & Regularized Regression

**Sections:**  
2.1 Example (Business & Science) • 2.2 OLS (baseline) • 2.3 Polynomial Regression •  
2.4 Regularization (Ridge/Lasso/ElasticNet) • 2.5 Error Metrics • 2.6 Model Evaluation (CV & Learning Curves)

> **Learning goals:**  
• Understand linear, polynomial, and regularized regression and when to use them  
• Implement end-to-end pipelines (clean → model → evaluate)  
• Interpret coefficients, residuals, and key metrics (RMSE/MAE/R²/MAPE)  
• Use cross-validation and learning curves for robust evaluation


In [None]:
# === Import dataset ===
# This cell loads the dataset for this module. It is intentionally messier and more realistic than the data used in Module 1.
from datasets_module2 import make_housing_realistic  # <----- change make_housing_realistic to make_auto_mpg_realistic to load the Auto MPG dataset.

# Quick test
# We pass n and seed explicitly (even though defaults exist)
# to make the code self-documenting and reproducible. You can change the size (n) of the samples generated 
# as well as the seed if you would like to see other results
df = make_housing_realistic(n=900, seed=1955)
df.head()

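| | sqft | bedrooms | bathrooms | age_years | lot_size | dist_to_center_km | price |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 0 | 2148.0 | 4 | 3 | 13 | 0.537 | 3.66 | 454013.0 |
| 1 | 1908.0 | 2 | 1 | 23 | NaN | 3.65 | 379302.0 |
| 2 | 2368.0 | 2 | 2 | 77 | 0.513 | 12.37 | 395900.0 |
| 3 | 2916.0 | 4 | 4 | 6 | 0.694 | 12.66 | 565279.0 |
| 4 | 1208.0 | 1 | 1 | 8 | 0.281 | 4.96 | 281811.0 |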


## 2.1 — Example: Business & Science

**Business (Housing pricing):** Predict *listing price* from `sqft`, `bedrooms`, `bathrooms`, `age_years`, `lot_size`, `dist_to_center_km`.  
**Science (Fuel efficiency):** Predict *MPG* from `horsepower`, `displacement`, `weight`, `acceleration`, `model_year`, `origin`.

> Linear → baseline & interpretability; Polynomial → curvature; Regularization → stability with correlated predictors.


### 2.1A — Business Example: Housing Pricing

A real-estate analyst wants to estimate home prices using key features such as 
square footage, number of bedrooms, lot size, and distance to the city center.

Before we build full models (coming in §2.2–2.6), here is an *example* that shows the 
core idea of regression: using data to predict a continuous outcome.


In [None]:
# Mini-demo: single-feature regression (sqft → price)

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny synthetic sample (to illustrate the idea)
sqft = np.array([800, 1200, 1500, 2000, 2500]).reshape(-1, 1)
price = np.array([150_000, 200_000, 240_000, 300_000, 360_000])

model = LinearRegression()
model.fit(sqft, price)
print("Coefficient:", model.coef_[0])
print("Intercept:", model.intercept_)

# Predict price for a new home
print("Predicted price for 1800 sqft:", model.predict([[1800]])[0])



### Business Regression Example: Predicting House Price from Square Footage

This mini-demo shows how a **single-feature linear regression model** can capture a 
straightforward business relationship: larger homes tend to cost more.

We fit a linear model using:
- **Input (X):** home size in square feet  
- **Output (y):** home sale price  

The model learned the following:

- **Coefficient:** 123.60  
- **Intercept:** 52,247.19  

This gives us the regression equation:

$\text{price} = 52{,}247.19 + 123.60 \times \text{sqft}$

#### Interpretation

- The coefficient of **123.6** means each additional square foot adds approximately  
  **\$123.60** to the expected sale price.  
- The intercept shows the model’s baseline estimate when sqft = 0  
  (not meaningful on its own, but necessary for the line).

When predicting the price of an **1800 sqft** home:

- **Predicted price:** \$274,719  
- This falls between the prices for 1500 sqft and 2000 sqft, exactly as expected.

#### Why this example matters

This simple business example illustrates:

1. **How linear regression discovers relationships**  
2. **How coefficients translate into real-world business insights**  
3. **How we can use the model for forecasting** (e.g., pricing tools, cost estimators)

This sets the stage for more complex regression models in the next sections.


**How to read this:**
- The model learns a simple linear relationship between `sqft` and `price`.
- The coefficient tells us the approximate increase in price for each additional square foot.
- This is the same principle we use with real datasets, just scaled up.
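
The coefficient and intercept quoted above can be checked by hand. For a single feature, least squares has a closed form: the slope is the sum of `(x - x̄)(y - ȳ)` divided by the sum of `(x - x̄)²`, and the intercept is `ȳ - slope · x̄`. A minimal sketch using the same five toy points (the variable names here are illustrative):

```python
import numpy as np

# Same toy data as the mini-demo, kept 1-D for the hand calculation
sqft_1d = np.array([800, 1200, 1500, 2000, 2500], dtype=float)
price_1d = np.array([150_000, 200_000, 240_000, 300_000, 360_000], dtype=float)

# Closed-form simple linear regression
x_bar, y_bar = sqft_1d.mean(), price_1d.mean()
slope = np.sum((sqft_1d - x_bar) * (price_1d - y_bar)) / np.sum((sqft_1d - x_bar) ** 2)
intercept = y_bar - slope * x_bar
pred_1800 = intercept + slope * 1800

print(f"slope={slope:.2f}, intercept={intercept:,.2f}, pred(1800)={pred_1800:,.0f}")
```

The printed values should agree with what `LinearRegression` reported, since scikit-learn solves the same least-squares problem.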


### 2.1B — Science Example: Fuel Efficiency (MPG)

Engineers model **miles per gallon (MPG)** using features like horsepower, weight,
engine displacement, and acceleration.

Here's a micro-demo using just horsepower (HP) to predict MPG.

In [None]:
# Mini-demo: one-feature regression (horsepower → mpg)
import numpy as np
from sklearn.linear_model import LinearRegression

hp  = np.array([70, 90, 110, 130, 160]).reshape(-1, 1)
mpg = np.array([38, 34, 30, 25, 22])

model = LinearRegression()
model.fit(hp, mpg)
print("Coefficient:", model.coef_[0])
print("Intercept:", model.intercept_)

print("Predicted MPG for 100 HP:", model.predict([[100]])[0])


**Interpretation:**
- The negative coefficient confirms the expected trend: *more horsepower → lower MPG*.
- Real models include more features, but the concept remains exactly the same.


### Science Regression Example: Predicting Fuel Efficiency from Horsepower

This example shows how linear regression can capture a simple scientific relationship:
as **engine horsepower increases**, **fuel efficiency (MPG)** typically decreases.

We fit a linear regression using:
- **Input (X):** horsepower  
- **Output (y):** miles per gallon (MPG)

The model learned the following:

- **Coefficient:** −0.184  
- **Intercept:** 50.41  

This gives the regression equation:

$\text{MPG} \approx 50.41 - 0.184 \times \text{horsepower}$

#### Interpretation

- The **negative coefficient** means that every increase of 1 horsepower reduces
  fuel efficiency by roughly **0.18 MPG**.
- The intercept represents the model’s estimate when horsepower = 0  
  (not meaningful physically, but necessary for the regression line).

#### Prediction Example

For a **100 HP** engine, the model predicts:

- **MPG ≈ 32.0**

This fits smoothly between the observed data points and illustrates the model’s
ability to **interpolate** within the range of the dataset.
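
Interpolation is the safe case; the same fitted line can extrapolate poorly. A short sketch that refits the five toy points and predicts at 300 HP, a value chosen here purely for illustration because it lies far outside the 70–160 HP range of the data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

hp = np.array([70, 90, 110, 130, 160]).reshape(-1, 1)
mpg = np.array([38, 34, 30, 25, 22])

line = LinearRegression().fit(hp, mpg)

inside = line.predict([[100]])[0]   # within the observed range
outside = line.predict([[300]])[0]  # far beyond the observed range

print(f"100 HP -> {inside:.1f} MPG, 300 HP -> {outside:.1f} MPG")
```

The 300 HP prediction comes out below zero, which is physically impossible. A straight line fitted on a narrow range says nothing reliable about values well outside it.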

#### Why this example matters

This demonstration highlights how linear regression can model:

- simple empirical trends (e.g., a negative, roughly linear relationship)  
- negative or positive correlations  
- continuous physical measurements  

By pairing this science example with the earlier business example, you can see
how regression applies across domains — from pricing to physical systems.


## 2.2 — Baseline: Ordinary Least Squares (OLS)

We generate a realistic synthetic housing dataset, **clean** it, **split** into train/test, **fit** OLS, inspect coefficients, and evaluate metrics.


In [None]:
# --- Generate & inspect ---
df = make_housing_realistic(n=900, seed=1955)
df.head()

### First Look at the Housing Dataset

This synthetic housing dataset contains seven columns: six features used to predict home prices, plus the target:

- **sqft** — interior area  
- **bedrooms / bathrooms** — basic home characteristics  
- **age_years** — age of the home  
- **lot_size** — land area (in acres)  
- **dist_to_center_km** — distance from the city center  
- **price** — target variable  

From the first few rows, we see:

- Realistic variation in home sizes, prices, and distances  
- Some missing values (e.g., `lot_size` is NaN in row 1)  
- Some unrealistic or noisy values elsewhere in the data (e.g., negative or extremely large `sqft`), which show up in the summary statistics rather than in these five rows  
- A broad range of prices from affordable homes to luxury properties

This “messiness” is intentional: it helps demonstrate how to diagnose and clean real datasets before modeling.


**Sanity-check:** expected columns, rough ranges, missing/invalid values to clean.

In [None]:
df.info()
df.describe()

### Sanity Check — Identifying Data Quality Issues

The `.info()` and `.describe()` summaries highlight several issues that must be addressed before fitting an OLS model:

#### 1. **Missing Values**
- `sqft` has **821 non-null** rows (missing 79)  
- `lot_size` has **848 non-null** rows (missing 52)  
- `price` has **878 non-null** rows (missing 22)

Missing values must be handled, since linear regression cannot train with NaNs.

#### 2. **Impossible or implausible values**
- `sqft` minimum is **−50** → impossible (should be positive)  
- `sqft` maximum is **12,000** → extremely large outlier  
- `dist_to_center_km` minimum is **0** and max is **999** → unrealistic distances  
- `age_years` ranges from **1 to 110** → includes oddly old properties  
- `lot_size` ranges from **0.123 to 0.804** (reasonable)

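These values indicate we may need:
- **imputation** for missing data and for impossible values once they are marked as missing  
- **winsorization or clipping** for extreme outliers such as `sqft` = 12,000 or a distance of 999 km  
- **conversion to numeric types** where needed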

#### 3. **Scale differences across features**
- Some features are measured in **thousands** (e.g., sqft)  
- Others are measured in **fractions** (e.g., lot_size)  
- Some span huge ranges (distance)

This motivates the need for **feature scaling** later in the module.

Overall, this summary step reveals exactly why data cleaning is essential before running OLS.
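
These checks can also be counted programmatically rather than read off `.describe()`. A minimal sketch on a tiny hypothetical frame (the values are made up to mimic the issues listed above; the real `df` is not used):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one NaN, one impossible size, one zero distance
toy = pd.DataFrame({
    "sqft": [1500.0, -50.0, np.nan, 12000.0],
    "dist_to_center_km": [3.2, 0.0, 999.0, 7.5],
    "price": [300_000.0, 250_000.0, np.nan, 900_000.0],
})

missing = toy.isna().sum()                               # NaNs per column
nonpositive_sqft = int((toy["sqft"] <= 0).sum())         # impossible sizes
nonpositive_dist = int((toy["dist_to_center_km"] <= 0).sum())

print(missing)
print("non-positive sqft:", nonpositive_sqft, "| non-positive distance:", nonpositive_dist)
```

Comparisons against `NaN` evaluate to `False`, so missing entries are not double-counted as invalid.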


### Clean and Prepare

In [None]:
# --- Clean & prepare ---
from sklearn.impute import SimpleImputer
import numpy as np

df = df.dropna(subset=['price']).copy()              # drop rows with missing target
df.loc[df['sqft']<=0,'sqft'] = np.nan                # mark invalids to impute
df.loc[df['dist_to_center_km']<=0,'dist_to_center_km'] = np.nan

num_cols = ['sqft','bedrooms','bathrooms','age_years','lot_size','dist_to_center_km']
imp = SimpleImputer(strategy='median')
df[num_cols] = imp.fit_transform(df[num_cols])

df.head()

**Why median?** Robust to skew/outliers; ensures complete numeric inputs.
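
To see why the median is the safer fill value, compare it with the mean on a column that contains one extreme entry. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column: four ordinary sizes, one outlier, one missing value (last row)
col = np.array([[1000.0], [1200.0], [1400.0], [1600.0], [12000.0], [np.nan]])

mean_fill = SimpleImputer(strategy="mean").fit_transform(col)[-1, 0]
median_fill = SimpleImputer(strategy="median").fit_transform(col)[-1, 0]

print(f"mean fill: {mean_fill:.0f}, median fill: {median_fill:.0f}")
```

The single outlier drags the mean fill far above the typical values, while the median fill stays in the middle of the ordinary ones.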

### Cleaned Dataset — Ready for OLS

After applying the cleaning cell above:

- Rows with a missing `price` were dropped  
- Impossible values (zero or negative `sqft` and `dist_to_center_km`) were marked as missing  
- Missing values in all six feature columns were filled with the column median  
- All features now contain numeric, non-missing values suitable for regression

Note that this cell does **not** clip extreme values: large entries such as the `sqft` of 12,000 or the 999 km distance seen in the summary are not removed by it and can still influence the fit.

The cleaned preview now shows:

- Fully populated rows (no NaN values for features used in modeling)  
- No zero or negative values for `sqft` or `dist_to_center_km`  
- Numeric consistency (all values in proper format)

At this point, the dataset is prepared for:

- training/testing splits  
- fitting an OLS model  
- comparing predictions to true prices  
- analyzing residuals  

This completes the setup for the baseline Ordinary Least Squares model.
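
If extreme values such as `sqft` = 12,000 need taming before modeling, an IQR-based clipping rule is one common option. A hedged sketch on a toy Series; the helper name `clip_iqr` and the 1.5 multiplier are assumptions for illustration and are not part of this notebook's code:

```python
import pandas as pd

def clip_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

toy_sqft = pd.Series([900.0, 1200.0, 1500.0, 1800.0, 2100.0, 12000.0])
clipped = clip_iqr(toy_sqft)
print(clipped.tolist())
```

Clipping keeps the row but caps the value at the fence, which suits cases where the other columns of that row are still informative.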


### Split & Fit OLS

In [None]:
# --- Split & fit OLS ---
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df.drop('price', axis=1) 
y = df['price']
Xtr,Xte,ytr,yte = train_test_split(X,y,test_size=0.2,random_state=1955)

ols = LinearRegression()
ols.fit(Xtr,ytr)
yhat_ols = ols.predict(Xte)


coef = pd.Series(ols.coef_, index=Xtr.columns).sort_values()
coef

**Interpreting coefficients:** Positive → increases price (ceteris paribus).  
Negative → decreases price (e.g., `age_years`, `dist_to_center_km`).

### Interpreting the OLS Coefficients - Longer Explanation

These coefficients show how each feature contributes to the predicted house price in a
**linear** model. Each value represents the **expected change in price** for a one-unit
increase in that feature, *holding all other variables constant*.

| Feature           | Coefficient | Interpretation |
|------------------|------------:|----------------|
| **lot_size**     | 493,063     | Largest raw coefficient, but one unit is a full acre, more than the entire observed range (0.123–0.804). |
| **bedrooms**     | 26,309      | Each additional bedroom increases the expected price by about \$26k. |
| **bathrooms**    | 11,636      | Each additional bathroom adds roughly \$11.6k. |
| **sqft**         | 23.72       | Each extra square foot adds about \$24 to the home price. |
| **dist_to_center_km** | −63.70  | Homes farther from the city center tend to be slightly cheaper. |
| **age_years**    | −453.11     | Older homes are worth less, at about −\$453 per year of age. |
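
The "holding all other variables constant" reading can be verified numerically: raising one feature by one unit changes a linear model's prediction by exactly that feature's coefficient. A sketch on synthetic data (the feature names and generating coefficients are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "sqft": rng.uniform(800, 3000, 200),
    "age_years": rng.uniform(0, 80, 200),
})
y_toy = 50_000 + 100 * X_toy["sqft"] - 500 * X_toy["age_years"] + rng.normal(0, 5_000, 200)

lin = LinearRegression().fit(X_toy, y_toy)

base = X_toy.iloc[[0]].copy()
bumped = base.copy()
bumped["age_years"] += 1            # one more year, sqft unchanged

delta = lin.predict(bumped)[0] - lin.predict(base)[0]
coef_age = lin.coef_[list(X_toy.columns).index("age_years")]
print(f"prediction change: {delta:.2f} | age_years coefficient: {coef_age:.2f}")
```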

---

### Key Observations

#### **1. Lot size has the largest raw coefficient**
A coefficient over **\$493k** is per acre. Because `lot_size` only spans about 0.12 to 0.80 acres here, a realistic change of 0.1 acre corresponds to roughly \$49k. Raw coefficients depend on each feature's units, so compare them across features only after accounting for scale.

#### **2. More bedrooms and bathrooms increase price predictably**
These discrete count features contribute meaningfully:
- Bedrooms add more value than bathrooms  
- Both reflect home “capacity” and desirability  

#### **3. Interior square footage adds value steadily**
The ~\$24 per square foot is a marginal effect with bedrooms, bathrooms, and lot size held fixed, so it should not be read as a market price per square foot.

#### **4. Distance from city center decreases price**
While the effect is small compared to lot size or interior features, homes farther from
central locations tend to be less expensive.

#### **5. Older homes are valued lower**
The model captures a depreciation pattern, which is common in housing markets.

---

### Key Takeaway
OLS provides a **simple, interpretable baseline** for understanding how each housing feature
affects price. It reveals:

- Which features have the strongest influence  
- Whether the effect is positive or negative  
- How the model sees relationships before introducing nonlinearity (Polynomial Regression)  
- A baseline to compare against more flexible models in Sections 2.3–2.6  

### Actual vs Predicted

In [None]:
# --- Compare actual vs predicted (first 10 rows) ---
display(
    pd.DataFrame({
        "Actual (y)": yte.values[:10],
        "Predicted (ŷ)": yhat_ols[:10],
        "Residual (y - ŷ)": (yte.values[:10] - yhat_ols[:10])
    })
)

#### Understanding the Predictions Table

The table above shows the first 10 predictions made by the OLS model.  
For each home we list:

- **Actual (y)** – the true price  
- **Predicted (ŷ)** – the model’s estimate  
- **Residual (y – ŷ)** – the error  

##### What we observe:
- Most predictions are reasonably close to the actual values (within \$10k–\$60k).
- This is normal for a realistic housing dataset with prices in the \$250k–\$450k range.
- A few residuals are much larger (e.g., –217,530). These typically occur when:
  - the underlying home has extreme values (very large lot, very far distance, etc.),
  - noise was intentionally injected into the dataset,
  - or OLS is trying to fit a relationship that is partly nonlinear.

##### Why this is expected:
Our synthetic housing data includes:
- **Heteroscedastic noise** (larger houses have higher variance)
- **Nonlinear terms** (sqft / sqrt(distance)) that OLS can’t fully capture
- **Intentional messiness** (missing sqft, invalid distances, extreme values)

Therefore, the predictions table tells the same story as the residual plot and metrics in the next cell:
> OLS captures most of the signal (R² ≈ 0.86),  
> but misses some nonlinear structure and is sensitive to extreme values.

This sets up the motivation for polynomial regression (§2.3) and regularization (§2.4).


### Residuals & metrics

In [None]:
# Residuals & metrics
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

resid = yte - yhat_ols
plt.scatter(yhat_ols, resid, s=16, alpha=0.7)
plt.axhline(0, color='k')
plt.xlabel('Predicted price')
plt.ylabel('Residual (y - ŷ)')
plt.title('Residuals vs Predictions — OLS')
plt.show()


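def mape(y_true, y_pred, eps=1e-9):
    """Mean absolute percentage error, skipping targets with |y| <= eps."""
    yt = np.asarray(y_true, dtype=float)
    yp = np.asarray(y_pred, dtype=float)
    mask = np.abs(yt) > eps          # avoid dividing by (near-)zero targets
    if not mask.any():
        return np.nan
    return np.mean(np.abs((yt[mask] - yp[mask]) / yt[mask])) * 100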

rmse = np.sqrt(mean_squared_error(yte, yhat_ols))
mae  = mean_absolute_error(yte, yhat_ols)
r2   = r2_score(yte, yhat_ols)
mp   = mape(yte, yhat_ols)

print(f'OLS → RMSE: {rmse:,.0f} | MAE: {mae:,.0f} | R²: {r2:.3f} | MAPE: {mp:.2f}%')

A flat, evenly spread band around zero is good. Curvature or funneling suggests trying polynomial terms or transformations (e.g., log-price).

#### Interpreting the Residuals vs Predictions Plot

This plot shows the **residuals** for each home in the test set:

- **x-axis:** Predicted price (ŷ)  
- **y-axis:** Residual (y − ŷ), the error for each prediction  
- The horizontal black line at 0 represents *perfect* predictions.

##### What to look for
- **Most points cluster around 0**, meaning the model usually predicts close to the true price.
- There is **no strong curve or clear pattern**, suggesting that a linear model captures the main trend reasonably well.
- The vertical spread of points grows slightly at higher predicted prices.  
  This reflects *heteroscedasticity* — in our synthetic dataset, more expensive homes naturally have more variability.
- A few points sit far above or below the line.  
  These represent **outliers** or homes with:
  - unusual combinations of features,
  - very large noise,
  - or intentionally injected “messy” values.

##### Why this matters
The plot helps us visually check whether:
- the model is **systematically biased** (it isn’t),
- we are missing curvature (minor hints, addressed in §2.3),
- certain observations exert **large influence** (a few do).

This plot confirms what the metrics tell us:  
> OLS captures the overall price pattern well (R² ≈ 0.86),  
> but still leaves room for improvement through polynomial features or regularization.
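
When the residual spread grows with the predicted price, fitting on log-price is one transformation worth trying. A sketch using scikit-learn's `TransformedTargetRegressor` on synthetic data with multiplicative noise (the data-generating process is an assumption for illustration, not the module's dataset):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
sqft_toy = rng.uniform(800, 3000, size=(300, 1))
# Multiplicative noise: the spread of price grows with its level
price_toy = 100 * sqft_toy[:, 0] * np.exp(rng.normal(0, 0.2, 300))

# Fit on log(price); predictions are mapped back through exp automatically
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
).fit(sqft_toy, price_toy)

pred = log_model.predict(sqft_toy)
print("smallest prediction:", round(float(pred.min())))
```

Because predictions pass back through `exp`, they are always positive, which is a useful side effect for a price target.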


## 2.3 — Polynomial Regression (Nonlinear Effects)

Compare OLS vs Polynomial (deg 2/3). Also visualize **price vs `sqft`** with other features fixed at medians.


In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
import pandas as pd

poly2 = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
poly3 = make_pipeline(PolynomialFeatures(degree=3, include_bias=False), LinearRegression())

poly2.fit(Xtr, ytr)
pred_p2 = poly2.predict(Xte)
poly3.fit(Xtr, ytr)
pred_p3 = poly3.predict(Xte)

def summarize(name, y_true, y_pred):
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    return {'Model':name,
            'RMSE': np.sqrt(mean_squared_error(y_true,y_pred)),
            'MAE' : mean_absolute_error(y_true,y_pred),
            'R2'  : r2_score(y_true,y_pred),
            'MAPE%': mape(y_true,y_pred)}


pd.DataFrame([summarize('OLS',yte,yhat_ols),
              summarize('Poly deg=2',yte,pred_p2),
              summarize('Poly deg=3',yte,pred_p3)]).sort_values('RMSE')

#### Understanding the Polynomial Regression Results

The table compares three models:
- **OLS (degree 1)** → Straight-line model  
- **Polynomial degree 2** → Slight curvature  
- **Polynomial degree 3** → More flexible curvature  

| Model | RMSE | MAE | R² | MAPE% |
|-------|------|------|------|---------|
| OLS | 41,336 | 30,669 | 0.862 | 9.22% |
| Poly deg=2 | 31,452 | 22,608 | 0.920 | 6.56% |
| Poly deg=3 | **30,744** | **21,434** | **0.924** | **6.43%** |

##### Interpretation
- **Both polynomial models outperform OLS** on all metrics.  
  This means adding curvature helps the model capture nonlinear relationships in the data.
- **Degree 3 performs best**, but only slightly better than degree 2.  
  This is typical:  
  - degree 2 captures most of the useful nonlinearity,  
  - degree 3 adds a bit more flexibility but risks overfitting if pushed too far.
- **R² increases from 0.86 → 0.92–0.924**, showing that polynomial models explain substantially more variance.
- **MAPE drops from 9.2% → ~6.4%**, meaning predictions are more accurate on a percentage basis.
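
Part of the gain comes from interaction terms, not only from squared or cubed features, and the number of generated columns grows quickly with the degree. A sketch counting the columns `PolynomialFeatures` produces for six inputs (the random matrix is only a placeholder with the right shape):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_dummy = np.random.default_rng(0).normal(size=(10, 6))  # 6 input features

counts = {}
for degree in (1, 2, 3):
    pf = PolynomialFeatures(degree=degree, include_bias=False).fit(X_dummy)
    counts[degree] = pf.n_output_features_

print(counts)  # {1: 6, 2: 27, 3: 83}
```

Going from 6 to 83 columns on a few hundred training rows is what makes overfitting a real concern at higher degrees.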

##### Key takeaways
Polynomial regression is powerful when the relationship between features and target is **smooth but nonlinear**.  
Here, house prices grow with square footage at a rate that changes slightly depending on home size — a pattern the polynomial models successfully capture.

However:
- Higher degrees increase risk of **overfitting**,  
- Regularization (Section 2.4) can help control this when many polynomial features are added.

This sets the stage for why we often combine **PolynomialFeatures + Ridge/Lasso** in real-world models.


### Price vs sqft (others fixed)

In [None]:
# price vs sqft (others fixed)
feat = 'sqft'
fixed = Xtr.median(numeric_only=True).to_dict()
x_grid = np.linspace(df[feat].quantile(0.05), df[feat].quantile(0.95), 200)
rows=[]
for x in x_grid:
    r = fixed.copy()
    r[feat]=x
    rows.append(r)
X_line = pd.DataFrame(rows)[Xtr.columns]

y_ols = ols.predict(X_line)
y_p2  = poly2.predict(X_line)
y_p3  = poly3.predict(X_line)

plt.scatter(df[feat], y, s=12, alpha=0.25)
plt.plot(x_grid, y_ols, label='OLS')
plt.plot(x_grid, y_p2,  label='Poly deg=2')
plt.plot(x_grid, y_p3,  label='Poly deg=3')
plt.xlabel('sqft')
plt.ylabel('price')
plt.title('price vs sqft (others fixed)')
plt.legend()
plt.show()

Higher degree adds flexibility but risks overfitting. Use the hold-out set to judge generalization.

#### Interpreting the Price vs Square Footage Plot

This plot shows how the model thinks **price changes as `sqft` increases**, while all other features are held constant at their median values.

##### What the lines tell us
- **OLS (blue line)** is a straight line.  
  It can only model a fixed “price increase per extra sqft,” even though the real relationship is slightly curved.
- **Polynomial degree 2 (orange line)** captures a gentle upward curve.  
  It reflects that price tends to rise faster for mid-sized homes and then level off slightly.
- **Polynomial degree 3 (green line)** fits the curvature even more flexibly.  
  It hugs the data more closely, especially for smaller and larger homes.

##### What the points tell us
The faint blue dots are actual houses from the dataset:
- They show a generally upward trend: more square feet → higher price.  
- But the spread widens for larger homes because the dataset contains **heteroscedastic noise** (higher-priced homes vary more).
- Some outliers on the far right (e.g., `sqft ≈ 12,000`) come from the intentional “messiness” added to the dataset.

##### Why this matters
This visualization demonstrates that:
- A straight line (OLS) **underfits** when the true relationship is curved.
- **Polynomial models better capture real-world price dynamics**, especially when effects slow down or accelerate at different ranges.
- Higher-degree polynomials fit the data more closely, but must be balanced with **regularization** (Section 2.4) to avoid overfitting.

This plot provides an intuitive visual bridge from:
**“Linear regression can’t capture curvature” → “Polynomial features fix that” → “Now we need regularization.”**



## 2.4 — Regularized Regression: Ridge, Lasso, ElasticNet

Regularization shrinks coefficients to reduce variance and improve generalization with correlated predictors.


In [None]:
import pandas as pd
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0, random_state=1955))
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1, random_state=1955, max_iter=10000))
enet  = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=1955, max_iter=10000))

pd.DataFrame([
    summarize('Ridge', yte, ridge.fit(Xtr,ytr).predict(Xte)),
    summarize('Lasso', yte, lasso.fit(Xtr,ytr).predict(Xte)),
    summarize('ElasticNet', yte, enet.fit(Xtr,ytr).predict(Xte)),
]).sort_values('RMSE')

**Quick Interpretation**  
• **Ridge (L2)**: shrinks all coefficients; good for multicollinearity & stability.  
• **Lasso (L1)**: can set some coefficients to 0 (feature selection).  
• **ElasticNet**: blends L1+L2; balances sparsity & stability.

#### Understanding the Regularization Results

Regularization techniques (Ridge, Lasso, ElasticNet) are designed to improve linear models by
penalizing large coefficients. This helps when:
- features are highly correlated,
- the model is overfitting,
- or there are many noisy or irrelevant predictors.

In this dataset, the performance of all three methods is **very similar**:

| Model | RMSE | MAE | R² | MAPE% |
|--------|---------|---------|--------|-----------|
| **ElasticNet** | 41,085 | 30,649 | 0.864 | 9.26% |
| **Ridge** | 41,324 | 30,662 | 0.863 | 9.22% |
| **Lasso** | 41,336 | 30,669 | 0.862 | 9.22% |

##### Why are the results so close?
Three reasons:

1. **The dataset is fairly well-behaved.**  
   Even though we added some noise and messy values, the core features are informative and not strongly collinear.  
   Linear regression already performs well (R² ≈ 0.86), leaving little room for improvement.

2. **The penalties are weak at this scale.**  
   With prices in the hundreds of thousands, `alpha=1.0` (Ridge) and `alpha=0.1` (Lasso) barely change the fit, which is why the Lasso row matches the OLS metrics from §2.3 to the displayed precision.

3. **These pipelines use only the original features.**  
   Regularization shines most when:
   - the feature space is large (e.g., degree-5 polynomial expansion),
   - or there are many correlated predictors.  
   With only a handful of original features, OLS is already stable.

##### What each method is doing behind the scenes
- **Ridge (L2):**  
  Shrinks all coefficients toward zero. Keeps all variables but reduces their magnitude.  
  Great for multicollinearity and reducing variance.

- **Lasso (L1):**  
  Pushes some coefficients *exactly* to zero, effectively performing **feature selection**.  
  Useful when many features are irrelevant or noisy.

- **ElasticNet (L1 + L2):**  
  Blends the strengths of Ridge and Lasso.  
  Good default when you expect both correlation and sparsity.
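
For reference, the three penalties can be written out. The forms below follow the objectives documented for scikit-learn's `Ridge`, `Lasso`, and `ElasticNet`, with $w$ the coefficient vector, $n$ the number of training rows, $\alpha$ the `alpha` argument, and $\rho$ the `l1_ratio` argument:

$$\text{Ridge:}\quad \min_w \; \|y - Xw\|_2^2 + \alpha \|w\|_2^2$$

$$\text{Lasso:}\quad \min_w \; \frac{1}{2n}\|y - Xw\|_2^2 + \alpha \|w\|_1$$

$$\text{ElasticNet:}\quad \min_w \; \frac{1}{2n}\|y - Xw\|_2^2 + \alpha\rho\,\|w\|_1 + \frac{\alpha(1-\rho)}{2}\|w\|_2^2$$

Because the Lasso and ElasticNet losses are divided by $2n$ while the Ridge loss is not, the same numeric `alpha` does not imply the same penalty strength across the three estimators.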

##### Teaching takeaway
Even when regularization does *not* improve RMSE or R², it is still valuable because it:
- **stabilizes** coefficient estimates,
- **improves interpretability** (especially Lasso),
- can reduce sensitivity to **noisy or redundant features**,
- becomes essential when using **polynomial features** (Section 2.3) or **high-dimensional data**.

This prepares us for the next step:  
combining **PolynomialFeatures + Ridge/Lasso** to handle nonlinear relationships **without overfitting**.
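
One way that combination might look is sketched below on synthetic data. The degree, the `alpha` value, and the data-generating function are illustrative assumptions, not tuned values from this module:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)
X_syn = rng.uniform(-2, 2, size=(300, 3))
# Target with one squared term and one interaction term, plus noise
y_syn = X_syn[:, 0] ** 2 + X_syn[:, 1] * X_syn[:, 2] + rng.normal(0, 0.3, 300)

X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

# Expand -> scale the expanded features -> fit with an L2 penalty
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)
poly_ridge.fit(X_a, y_a)
holdout_r2 = poly_ridge.score(X_b, y_b)
print(f"hold-out R²: {holdout_r2:.3f}")
```

Scaling after the expansion puts every polynomial term on a comparable scale, so the penalty does not weigh high-degree terms by their raw magnitude.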


#### Comparing Coefficients Across Models

Even when RMSE/R² look similar, regularization often affects **how** the model uses each feature.

Below we compare the learned coefficients for:

- **OLS** (no regularization)
- **Ridge** (L2 shrinkage)
- **Lasso** (L1 sparsity)
- **ElasticNet** (blend of L1 and L2)

Regularization does **not** always improve accuracy, but it often:
- stabilizes coefficients,
- reduces extreme swings,
- selects features (Lasso),
- and improves interpretability.


In [None]:
# --- Compare coefficients across models ---

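# OLS is refit on standardized features so that all four columns are on the
# same scale (the regularized pipelines standardize their inputs).
ols_std = make_pipeline(StandardScaler(), LinearRegression()).fit(Xtr, ytr)

coef_df = pd.DataFrame({
    "Feature": Xtr.columns,
    "OLS": ols_std.named_steps['linearregression'].coef_,
    "Ridge": ridge.fit(Xtr, ytr).named_steps['ridge'].coef_,
    "Lasso": lasso.fit(Xtr, ytr).named_steps['lasso'].coef_,
    "ElasticNet": enet.fit(Xtr, ytr).named_steps['elasticnet'].coef_,
})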

coef_df

#### Visualizing Regularization Effects on Coefficients

This chart helps us see how each method **shrinks or zeroes out** coefficients.

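- **OLS** applies no penalty, so its coefficients are typically the largest in magnitude.  
- **Ridge** shrinks them smoothly.  
- **Lasso** can push some to exactly *zero* once `alpha` is large enough (with the small `alpha` used here, expect little or no zeroing).  
- **ElasticNet** produces a middle ground.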

Visual inspection helps you understand *why* regularization is used.


In [None]:
# --- Visual coefficient comparison chart ---

import matplotlib.pyplot as plt
import numpy as np

# Reuse the coefficient table built in the previous cell instead of rebuilding it
models_coef = coef_df.set_index("Feature")

models_coef.plot(kind="barh", figsize=(10,6))
plt.title("Coefficient Comparison: OLS vs Ridge vs Lasso vs ElasticNet")
plt.xlabel("Coefficient Value")
plt.axvline(0, color='k', linewidth=0.8)
plt.tight_layout()
plt.show()


#### Intuition: How Regularization *Feels*

Regularization can be understood with simple physical analogies:

##### **Ridge Regression (L2) — “Rubber Band Around the Coefficients”**
- Imagine all coefficients are tied to zero with a **soft rubber band**.
- They get pulled inward, but none are forced to zero.
- Great for stabilizing correlated or noisy features.

##### **Lasso Regression (L1) — “Scissors Cutting Weak Features Away”**
- Lasso uses a **sharp edge** instead of a rubber band.
- It cuts small coefficients all the way to **zero** — performing *feature selection*.
- Useful when many predictors are irrelevant.

##### **ElasticNet (L1 + L2) — “Rubber Band + Scissors”**
- Soft shrinkage + ability to remove unimportant features.
- Best when:
  - features are correlated **and**
  - some features should be removed.

##### Why this matters
OLS can overreact to noise or outliers.  
Regularization keeps coefficient behavior **stable**, **predictable**, and often **more interpretable**.
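
The difference between the rubber band and the scissors shows up directly in the fitted coefficients. A sketch on synthetic data where only the first two of five features matter (the data and the `alpha` values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(7)
X_syn = rng.normal(size=(200, 5))
# Only features 0 and 1 drive the target; features 2-4 are pure noise
y_syn = 3 * X_syn[:, 0] + 2 * X_syn[:, 1] + rng.normal(0, 0.5, 200)

ridge_demo = Ridge(alpha=1.0).fit(X_syn, y_syn)
lasso_demo = Lasso(alpha=1.0).fit(X_syn, y_syn)

print("Ridge:", np.round(ridge_demo.coef_, 3))
print("Lasso:", np.round(lasso_demo.coef_, 3))
print("Lasso zeros:", int((lasso_demo.coef_ == 0).sum()))
```

Ridge leaves small but non-zero weights on the noise features, while Lasso removes them and also shrinks the two real effects, which is the price paid for sparsity.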


## 2.5 — Error Metrics (RMSE, MAE, R², MAPE)

#### Quick explanation of the error metrics
- **RMSE** (↓) square-root of average squared error; penalizes large errors; same units as target.  
- **MAE** (↓) average absolute error; less sensitive to outliers than RMSE.  
- **R²** (↑) fraction of variance explained.  
- **MAPE** (↓) percent error (mind small `y` values).
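
Each metric is only a line or two of NumPy. A worked sketch on four made-up prices (in thousands of dollars), with `_toy` suffixes so the names do not clash with the notebook's variables:

```python
import numpy as np

y_true_toy = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_toy = np.array([110.0, 190.0, 330.0, 360.0])
err = y_true_toy - y_pred_toy

rmse_toy = np.sqrt(np.mean(err ** 2))
mae_toy = np.mean(np.abs(err))
r2_toy = 1 - np.sum(err ** 2) / np.sum((y_true_toy - y_true_toy.mean()) ** 2)
mape_toy = np.mean(np.abs(err / y_true_toy)) * 100

print(f"RMSE={rmse_toy:.2f}  MAE={mae_toy:.2f}  R²={r2_toy:.3f}  MAPE={mape_toy:.2f}%")
# RMSE=25.98  MAE=22.50  R²=0.946  MAPE=8.75%
```

RMSE comes out larger than MAE because the 40-unit and 30-unit misses are squared before averaging, so they carry more weight.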


In [None]:
# Already computed for OLS; reuse summarize() for others if needed
# (Nothing to run here; this section summarizes metric usage & interpretation.)

#### When should you use each metric?

| Metric | Best for… | Avoid when… |
|--------|-----------|--------------|
| **RMSE** | catching large errors; comparing models during tuning | outliers dominate |
| **MAE** | robust evaluation; easy interpretation (“avg error in dollars”) | you need to penalize big misses more |
| **R²** | explaining variance; communicating model fit to stakeholders | comparing models on different datasets |
| **MAPE** | percentage-based evaluation; cross-market comparison | target values are near zero (division blows up) |

> **Rule of thumb:**  
> Use **RMSE** for tuning, **MAE** for communication, **R²** for overall fit, and **MAPE** only when targets are not close to zero.


#### Choosing the right metric

- Use **RMSE** when large errors matter (e.g., overpricing a luxury home).  
- Use **MAE** when you want fairness and robustness across all homes.  
- Use **R²** when communicating results to non-technical audiences.  
- Use **MAPE** when comparing predictions across neighborhoods or markets.


# Student Activity if needed

#### Mini Exercise: Comparing Regression Error Metrics

Use the predictions you generated in Sections 2.2–2.4 (OLS, Polynomial, or Regularized).
Answer the questions below *using the numbers you see in your notebook*.

---

##### 1. RMSE vs MAE
Look at the **RMSE** and **MAE** values for the OLS model.

- Which one is larger?
- What does this tell you about the presence of **large individual errors**?
- If you had to report a single number to a real-estate agent, which would you choose and why?

---

##### 2. R² Interpretation
Check the **R²** value for the OLS model (and polynomial models if you ran them).

- What does the R² value mean *in your own words*?
- Does a higher R² always mean a better model? Why or why not?

---

##### 3. MAPE and Real-World Meaning
Look at the **MAPE%**.

- If MAPE = 8%, what does that mean for predicting the price of a \$400,000 home?
- When might MAPE be misleading?

---

##### 4. Metric Tradeoffs
Imagine you are advising two different stakeholders:

**A. A homeowner:**  
They want to know: *“How far off might the estimate be?”*

**B. A data scientist:**  
They want a metric that highlights **large errors** more strongly.

- Which metric would you show the homeowner (RMSE, MAE, R², or MAPE)? Why?  
- Which metric would you show the data scientist? Why?

---

##### 5. Model Comparison (Optional)
Compare the OLS model vs the Polynomial degree-2 model (or Ridge vs OLS).

- Which model has the lower RMSE?  
- Which has the lower MAE?  
- Do the models rank the same across all metrics?  
- What does this tell you about choosing only *one* metric?

---

##### Reflection (1–2 sentences)
Write a short reflection:

> “Which metric do *you* personally find most intuitive, and in what situation might you pick a different one?”

---

### Goal of this exercise:
By completing these questions, you should feel confident explaining  
**what each regression metric means**, **why it matters**, and **when to use it**.


## 2.6 — Model Evaluation: Cross-Validation & Learning Curves

Use **CV** for robust performance estimates and **learning curves** to diagnose data sufficiency & model capacity.


In [None]:
import numpy as np  # needed for np.linspace below
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, learning_curve
from sklearn.linear_model import LinearRegression

pipe = make_pipeline(StandardScaler(with_mean=False), LinearRegression())

cv_rmse = cross_val_score(pipe, X, y, cv=5, scoring='neg_root_mean_squared_error', n_jobs=-1)
cv_r2   = cross_val_score(pipe, X, y, cv=5, scoring='r2', n_jobs=-1)
print(f'CV RMSE: {-cv_rmse.mean():.0f} ± {cv_rmse.std():.0f}')
print(f'CV R²  : {cv_r2.mean():.3f} ± {cv_r2.std():.3f}')

# Learning curve
sizes, train_scores, val_scores = learning_curve(
    estimator=pipe, X=X, y=y, cv=5,
    train_sizes=np.linspace(0.1,1.0,6),
    scoring='neg_root_mean_squared_error',
    n_jobs=-1
)
train_rmse = -train_scores.mean(axis=1)
val_rmse   = -val_scores.mean(axis=1)


plt.plot(sizes, train_rmse, marker='o', label='Train RMSE')
plt.plot(sizes, val_rmse, marker='s', label='CV RMSE')
plt.xlabel('Training set size')
plt.ylabel('RMSE')
plt.title('Learning Curve (OLS)')
plt.legend()
plt.show()

**Reading the learning curve:**  
- **Underfitting**: train ≈ val, and both errors are high → add features/complexity.  
- **Overfitting**: train ≪ val → regularize, add data, or limit complexity.  
- **Healthy**: train & val low and close.

#### Understanding the Cross-Validation Results

**CV RMSE: 77,884 ± 68,279**  
**CV R² : 0.045 ± 1.526**

These numbers reflect how OLS performs across multiple train/test splits using 5-fold cross-validation.

##### What the values mean:

- **High average RMSE (~78k)**  
  The model makes large errors on some folds. This is much worse than the test-set RMSE from Section 2.2 (~41k).

- **Very large standard deviation (±68k)**  
  Performance varies dramatically depending on which rows land in the held-out fold.  
  This is a sign of **instability**: with squared-error metrics, a handful of extreme rows can dominate one fold’s score.

- **Very low mean R² (~0.045)**  
  Averaged across folds, the model appears to explain almost none of the variance, because folds with strongly negative R² drag the mean down.

- **Huge spread in R² (±1.526)**  
  Some folds show positive R², others strongly negative R² (worse than guessing the mean).  
  This typically means the model is:

  - too simple for the underlying nonlinear pattern (**underfitting**), **and**
  - very sensitive to what data happens to be in each fold (**high fold-to-fold variance**).

##### Why this happens
- Our synthetic dataset contains:
  - curvature  
  - interaction terms  
  - heteroscedastic noise  
  - intentionally injected “messiness”

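- **OLS (straight-line model)** cannot capture these patterns well.

- Each 5-fold model still trains on about 80% of the rows, so training-set size alone is unlikely to explain the bad folds. A more plausible cause is that some held-out folds contain rows (for example, the injected messy values) that a linear fit predicts very poorly, which pushes that fold’s score below what predicting the mean would achieve.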
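
One way to investigate a spread like this is to print the per-fold scores instead of only their mean and standard deviation, and to compare unshuffled with shuffled folds. The sketch below uses synthetic data with a few injected extreme targets; `X_demo` and `y_demo` are assumptions standing in for the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): a clean linear signal plus unit-variance noise.
n = 300
X_demo = rng.normal(size=(n, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=1.0, size=n)

# Inject five extreme targets at the end, so they all fall into the last unshuffled fold.
y_demo[-5:] += 60.0

fold_scores = {}
for name, cv in [
    ("unshuffled", KFold(n_splits=5)),
    ("shuffled", KFold(n_splits=5, shuffle=True, random_state=0)),
]:
    scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")
    fold_scores[name] = scores
    print(f"{name:>10}: per-fold R2 = {np.round(scores, 2)} (std = {scores.std():.2f})")
```

If one fold stands out while the others look reasonable, the instability comes from a few rows rather than from the model failing everywhere. Passing a shuffled `KFold` as `cv` is a cheap way to check whether row order is part of the problem.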

---

#### Understanding the Learning Curve

The learning curve plot shows **Train RMSE** vs **CV RMSE** as we increase the training set size.

##### Key observations:

1. **Train RMSE increases**  
   - With few samples, the model fits the small dataset too closely (low train error).  
   - With more samples, the model generalizes better but training error rises.

2. **CV RMSE decreases** (but not enough)  
   - Larger training sets help the model stabilize (CV RMSE drops from ~90k toward ~75k).  
   - But it flattens out at a high error level → more data alone is not fixing the problem.

3. **The gap between Train RMSE and CV RMSE stays wide**  
   - A train/CV gap by itself signals **variance** (sensitivity to which rows are held out), consistent with the unstable CV scores above.  
   - The signature of **high bias** (underfitting) is the high plateau: CV error stops improving while it is still large.

4. **Plateau in CV RMSE**  
   - CV RMSE slowly levels off around ~75–80k, suggesting OLS has reached its limit.  
   - Adding more data won’t fix a mismatch in model complexity.
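
To check whether a plateau really reflects a capacity limit, one option is to run the same learning curve for a more flexible model and compare where the two curves end. The sketch below does this on synthetic data with curvature and an interaction; the data, the `_demo` names, and the choice of `Ridge(alpha=1.0)` are assumptions for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for the housing data (assumption): curvature plus an interaction.
n = 600
X_demo = rng.uniform(-2, 2, size=(n, 2))
y_demo = (
    1.5 * X_demo[:, 0]
    + 2.0 * X_demo[:, 0] ** 2
    + X_demo[:, 0] * X_demo[:, 1]
    + rng.normal(scale=0.5, size=n)
)

models_demo = {
    "OLS": make_pipeline(StandardScaler(), LinearRegression()),
    "Poly(2) + Ridge": make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        StandardScaler(),
        Ridge(alpha=1.0),
    ),
}

final_val_rmse = {}
for name, est in models_demo.items():
    sizes_demo, train_scores_demo, val_scores_demo = learning_curve(
        est, X_demo, y_demo, cv=5,
        train_sizes=np.linspace(0.1, 1.0, 5),
        scoring="neg_root_mean_squared_error",
    )
    val_rmse_demo = -val_scores_demo.mean(axis=1)  # flip sign: scorer returns negative RMSE
    final_val_rmse[name] = val_rmse_demo[-1]
    print(f"{name:>16}: CV RMSE by training size = {np.round(val_rmse_demo, 2)}")
```

If the flexible model's CV RMSE ends clearly below the OLS plateau on the same data, the limit was model capacity rather than data quantity.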

---

#### Key takeaways

The cross-validation scores and learning curve together show that:

- **OLS is too simple** for this dataset  
- The true relationship between features and price has **nonlinearities** that OLS cannot capture  
- Polynomial regression or regularization will significantly improve generalization (as we saw in Sections 2.3–2.4)

This is exactly *why* we expand to:
- **PolynomialFeatures + Ridge/Lasso**, and  
- **Regularized models** that stabilize the expanded feature space.

The learning curve visually reinforces the message:
> The model’s performance is limited not by data quantity, but by model capacity.


## Optional Visualizations & Diagnostics

These visual tools help students understand **model behavior**, **error structure**, 
and **coefficient influence**:

- **Predicted vs Actual Plot**  
- **Residual Distribution Histogram**  
- **Partial Dependence Plot (price vs sqft)**  
- **RidgeCV Coefficient Bar Chart (standardized)**  
- **QQ-Plot for residual normality check**


### Predicted vs Actual (Test Set)

In [None]:

# --- Predicted vs Actual ---
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Use OLS predictions computed earlier (yhat_ols)
plt.figure(figsize=(6,5))
plt.scatter(yte, yhat_ols, alpha=0.6)
plt.plot([yte.min(), yte.max()], [yte.min(), yte.max()], 'k--', label='Ideal (y = ŷ)')
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Predicted vs Actual — OLS")
plt.legend()
plt.show()



**Interpretation:**  
- Points close to the dashed line mean good predictions.  
- Systematic curvature or widening gaps indicate underfitting or missing nonlinear effects.  


### Predicted vs Actual Prices — OLS

This scatter plot compares the model’s predictions (ŷ) to the actual housing prices (y).  
The diagonal dashed line represents **perfect predictions**: every point on the line
would correspond to a model that predicts the target exactly.

#### Interpretation

- Most points lie **reasonably close to the diagonal**, meaning OLS captures the general
  trend between home features and price.  
- There is a visible **spread** around the line, indicating prediction errors that increase
  for higher-priced homes.
- A few points deviate significantly, which suggests:
  - variation in real housing data that a simple linear model cannot fully capture  
  - the presence of outliers or nonlinear relationships  
  - limitations of OLS when predicting more expensive or unusual homes  

#### What this tells us

- **OLS captures the broad pattern**: bigger homes, newer homes, larger lots → higher price.  
- But the scatter indicates **systematic nonlinearities** OLS cannot handle:
  - diminishing returns (e.g., sqft does not increase value linearly)  
  - interaction effects (e.g., size × location)  
  - nonlinear price behavior for luxury or remote homes  

This plot motivates the need for more flexible models such as **Polynomial Regression**,  
which will better capture curvature in the relationship and reduce systematic error.


### Residual Distribution (Histogram)

In [None]:

# --- Residual Histogram ---
plt.figure(figsize=(6,4))
plt.hist(resid, bins=30, alpha=0.7)
plt.xlabel("Residual (y - ŷ)")
plt.ylabel("Frequency")
plt.title("Distribution of Residuals — OLS")
plt.show()



**Goal:** Assess whether residuals are centered around zero and roughly symmetric.  
Skewed or heavy-tailed residuals suggest the usefulness of nonlinear terms or robust models.  


### Distribution of Residuals — OLS

This histogram shows the distribution of residuals:


$\text{residual} = y - \hat{y}$


Residuals represent the difference between the actual home price and the price predicted
by the OLS model.

---

### Interpretation

#### **1. Residuals are not centered perfectly around zero**
While many residuals cluster near zero, the distribution is **asymmetric**, showing:
- a **long tail to the left** (large negative residuals)  
- some **extreme prediction errors**  
- evidence that OLS systematically **underestimates** some homes and **overestimates** others

This suggests **nonlinear relationships** between features and price that OLS cannot capture.

---

#### **2. A few very large errors**
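Large negative residuals (e.g., −200,000) mean the prediction was far **above** the actual price. They indicate:
- outliers in the data  
- homes that sold for much less than their features suggest  
- or nonlinear effects not handled by a purely linear model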

These extreme values pull the distribution to the left and widen the range of errors.

---

#### **3. Residuals have clear structure**
If the linear model were well specified and its errors Gaussian:
- residuals would be **normally distributed**  
- centered around zero  
- with symmetrical spread

Instead, the distribution is skewed, signaling that:
- **model assumptions are violated**,  
- the relationship between features and price is **nonlinear**,  
- OLS cannot fully explain variance in price.
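
The visual impression of skew and heavy tails can be backed up with a few summary numbers. The sketch below is a hedged illustration on hypothetical residuals (`resid_demo` is made up to mimic a long left tail); in the notebook you would pass your own `resid` instead.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical residuals (assumption): mostly symmetric noise plus a few large negative errors.
resid_demo = np.concatenate([
    rng.normal(0, 30_000, size=195),
    rng.normal(-200_000, 20_000, size=5),
])

skew_demo = stats.skew(resid_demo)      # < 0 means a longer left tail
kurt_demo = stats.kurtosis(resid_demo)  # excess kurtosis; > 0 means heavier tails than normal

print(f"mean            : {resid_demo.mean():,.0f}")
print(f"median          : {np.median(resid_demo):,.0f}")
print(f"skewness        : {skew_demo:.2f}")
print(f"excess kurtosis : {kurt_demo:.2f}")
```

A mean well below the median, negative skewness, and positive excess kurtosis together describe the same left-tailed shape the histogram shows.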

---

### Why this matters

This residual pattern tells us:

- OLS captures the **general trend** of home pricing  
- But fails to model **curvature**, **interactions**, and **heteroscedasticity**  
- More flexible models (Polynomial Regression, Regularization, Ensembles) will likely perform better

This visualization motivates Sections 2.3–2.6, where we explore **nonlinear** and **regularized** regression methods that can better handle the complexities of housing data.


### Partial Dependence Plot — price vs sqft

In [None]:

# --- Partial dependence: price vs sqft ---
feat = 'sqft'

# Fix all other features at their median
fixed = Xtr.median(numeric_only=True).to_dict()

x_grid = np.linspace(df[feat].quantile(0.05),
                     df[feat].quantile(0.95),
                     200)

rows=[]
for x in x_grid:
    r = fixed.copy()
    r[feat] = x
    rows.append(r)

X_line = pd.DataFrame(rows)[Xtr.columns]
y_line = ols.predict(X_line)

plt.figure(figsize=(7,5))
plt.scatter(df[feat], y, s=12, alpha=0.25, label='Actual data')
plt.plot(x_grid, y_line, 'r', label='OLS prediction (others fixed)')
plt.xlabel("sqft")
plt.ylabel("price")
plt.title("Partial Dependence: price vs sqft")
plt.legend()
plt.show()



**Interpretation:**  
Shows how the model believes `sqft` affects price when all other features are held constant.  
A linear trend is expected for OLS; curvature would appear if polynomial models were used.


### Partial Dependence: How OLS Predicts Price from Square Footage

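A partial dependence plot (PDP) shows how the model’s predictions change with respect to
one feature (here, **square footage, `sqft`**). The version above is a simplified one: it holds
**all other features at a fixed (median) value**, whereas a standard PDP averages predictions
over the observed values of the other features. For a linear model without interaction terms,
both approaches give a line with the same slope.

This allows us to isolate the model’s understanding of the relationship between
sqft and price. Because the scatter shows homes whose other features vary, the line and the
point cloud are not strictly like-for-like.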
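
To make the difference between "others fixed at the median" and an averaged partial dependence concrete, the sketch below computes both by hand for a linear model. The two-feature dataset and all `_demo` names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

# Synthetic stand-in (assumption): bedrooms is correlated with sqft, price is linear in both.
n = 400
sqft = rng.uniform(800, 3500, n)
bedrooms = np.clip(np.round(sqft / 700 + rng.normal(0, 0.7, n)), 1, 6)
price = 100 * sqft + 15_000 * bedrooms + rng.normal(0, 20_000, n)
X_demo = pd.DataFrame({"sqft": sqft, "bedrooms": bedrooms})

lin_demo = LinearRegression().fit(X_demo, price)
grid_demo = np.linspace(1000, 3000, 5)

# (a) Other features frozen at their median, as in the cell above.
fixed = X_demo.median().to_dict()
rows = pd.DataFrame([{**fixed, "sqft": g} for g in grid_demo])[X_demo.columns]
curve_median = lin_demo.predict(rows)

# (b) Averaged partial dependence: set sqft = g for every row, predict, then average.
curve_pdp = np.array([lin_demo.predict(X_demo.assign(sqft=g)).mean() for g in grid_demo])

slope_median = np.diff(curve_median) / np.diff(grid_demo)
slope_pdp = np.diff(curve_pdp) / np.diff(grid_demo)
print("sqft coefficient   :", round(lin_demo.coef_[0], 2))
print("slope (median)     :", np.round(slope_median, 2))
print("slope (averaged PD):", np.round(slope_pdp, 2))
```

For a purely additive linear model the two curves differ only by a constant offset, so both slopes equal the `sqft` coefficient. With polynomial or interaction terms the two curves can differ in shape.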

---

### What the Plot Shows

#### 1. **OLS learns a strictly linear relationship**
The red line is perfectly straight because Ordinary Least Squares can only model a
**linear** relationship between sqft and price.

- As sqft increases, predicted price increases at a constant rate.
- The slope of this line corresponds to the OLS coefficient for sqft  
  (about **\$23–\$24 per additional square foot**).

#### 2. **Actual data shows curvature**
The cloud of blue points reveals a subtle but important pattern:

- Prices rise **faster** for mid-sized homes  
- Prices rise **more slowly** for very large homes  
- The point cloud is **curved**, not straight

This mismatch shows that sqft does **not** have a perfectly linear relationship with price in reality.

#### 3. **OLS systematically underestimates larger homes**
At high sqft values:
- The OLS line sits **below** the cloud → **underprediction**

At moderate sqft values:
- The line sits **above** some of the cloud → **overprediction**

This pattern indicates that OLS is missing important **nonlinear** effects.

---

### Why This Matters

This plot clearly shows why we need more flexible models:

- **Polynomial Regression** (Section 2.3) can capture curves  
- **Regularization** (Section 2.4) helps stabilize these richer models  
- **Cross-validation** (Section 2.6) is needed to ensure we do not overfit when adding complexity  

OLS is a strong starting point—transparent, interpretable, and easy to compute—
but this PDP highlights the limitations of linearity in real housing data.

---

### Key Takeaway

The partial dependence plot reveals:

- OLS models **only a straight-line trend**  
- Real price–sqft relationships are **curved**  
- Linear models underfit important nonlinear structure  

This visual provides a perfect bridge into **Section 2.3: Polynomial Regression**, where we allow the model to learn curvature and improve predictions.


### RidgeCV Coefficient Bar Chart (Standardized)

In [None]:

# --- RidgeCV for coefficient interpretability ---
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

ridgecv = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-4,4,25), cv=5))
ridgecv.fit(Xtr, ytr)

# Standardized coefficients come from the RidgeCV step
ridge_model = ridgecv.named_steps['ridgecv']
coefs = ridge_model.coef_
features = Xtr.columns

coef_df = pd.DataFrame({'feature': features, 'coef': coefs})
coef_df = coef_df.sort_values('coef')

plt.figure(figsize=(7,6))
plt.barh(coef_df['feature'], coef_df['coef'])
plt.xlabel("Standardized Coefficient")
plt.title("RidgeCV Coefficients (Standardized Features)")
plt.tight_layout()
plt.show()



**Interpretation:**  
- Coefficients are on a **standardized scale**, so magnitudes are directly comparable across features and give a rough guide to importance.  
- Positive values ↑ price; negative values ↓ price.  
- Ridge helps stabilize correlated predictors (e.g., sqft, bedrooms, lot_size).


### Ridge Regression (RidgeCV) — Standardized Coefficients

This visualization shows the coefficients learned by a **RidgeCV** regression model
after standardizing all input features. Standardization ensures that all variables are on
the same scale, allowing the model to apply regularization fairly across features.

Because Ridge shrinks coefficients toward zero (but, unlike Lasso, not exactly to zero), this plot helps reveal:

- which features the model considers most important  
- how strongly each feature influences the target  
- how regularization reshapes the coefficient landscape compared to plain OLS  

---

### Key Insights from the Plot

#### **1. Lot size remains the strongest predictor**
Even after standardization and regularization, **lot_size** has the largest coefficient.
This confirms that lot size is a major driver of home value.

#### **2. bedrooms, sqft, and bathrooms follow closely**
These features retain large positive coefficients, reinforcing their importance in pricing:

- **bedrooms** and **bathrooms** capture functional home size  
- **sqft** reflects interior space  
- These variables remain robust predictors even with shrinkage applied

#### **3. dist_to_center_km and age_years have much smaller coefficients**
These weaker predictors end up with much smaller coefficients:

- **dist_to_center_km** has a small negative coefficient  
- **age_years** shows a modest negative effect  

This suggests these features contribute less predictive power relative to the major ones.

#### **4. Regularization stabilizes the model**
OLS coefficients can sometimes become large or unstable when features correlate  
(e.g., bedrooms, bathrooms, and sqft).  
Ridge reduces this instability by discouraging overly large weights.
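
The stabilizing effect can be seen on a toy example with two nearly duplicate predictors. This is a minimal sketch under assumed synthetic data; the names and `alpha=10.0` are illustrative choices, not values from this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)

# Synthetic stand-in (assumption): x1 and x2 both measure the same underlying "size".
n = 200
size = rng.normal(size=n)
x1 = size + rng.normal(scale=0.05, size=n)
x2 = size + rng.normal(scale=0.05, size=n)
X_demo = StandardScaler().fit_transform(np.column_stack([x1, x2]))
y_demo = 3.0 * size + rng.normal(scale=1.0, size=n)

ols_coef_demo = LinearRegression().fit(X_demo, y_demo).coef_
ridge_coef_demo = Ridge(alpha=10.0).fit(X_demo, y_demo).coef_

print("OLS coefficients  :", np.round(ols_coef_demo, 2))
print("Ridge coefficients:", np.round(ridge_coef_demo, 2))
```

OLS is free to split the shared effect between the two columns almost arbitrarily, while the ridge penalty pulls the two coefficients toward each other and toward zero.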

---

### Why This Matters

By comparing RidgeCV coefficients to the OLS coefficients:

- You see which relationships are **structurally strong** (lot_size, bedrooms, sqft)  
- You see which relationships are more **fragile** or sensitive to noise  
- Regularization provides a **more robust, less overfitted** set of coefficients  
- Standardized coefficients make magnitudes meaningfully comparable  

This visualization bridges directly into **Section 2.4 (Regularization)**, where you
learn how Ridge, Lasso, and ElasticNet work and why regularization is often essential
in real-world regression tasks.


### QQ-Plot of Residuals (Normality Check) — Optional

In [None]:

# --- QQ-plot of residuals ---
import scipy.stats as stats

plt.figure(figsize=(6,5))
stats.probplot(resid, dist="norm", plot=plt)
plt.title("QQ-Plot: Residuals vs Normal Distribution")
plt.show()



**Interpretation:**  
- If residuals follow a straight diagonal line → errors are roughly normal.  
- Curved patterns → heavy tails, skewness, or model misspecification.
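
How closely the points follow the line can also be summarized with a number: called without `plot=`, `scipy.stats.probplot` returns the correlation `r` of the fitted line. The sketch below compares a normal sample with a left-skewed one; both samples are synthetic assumptions, not the notebook's residuals.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Synthetic residuals (assumption): one normal sample, one with a long left tail.
normal_resid = rng.normal(0, 1, size=500)
skewed_resid = -rng.exponential(1.0, size=500)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r)).
_, (_, _, r_normal) = stats.probplot(normal_resid, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_resid, dist="norm")

print(f"QQ correlation, normal sample: {r_normal:.4f}")
print(f"QQ correlation, skewed sample: {r_skewed:.4f}")
```

Values of `r` very close to 1 mean the points hug the line; a noticeably lower value matches the curved pattern seen for skewed or heavy-tailed residuals.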


### QQ-Plot: Residuals vs Normal Distribution

A QQ-plot compares the distribution of the residuals (errors) to a theoretical normal
distribution. If OLS assumptions hold and errors are normally distributed, the points
should lie close to the red diagonal line.

---

### Interpretation

#### **1. Residuals follow the normal line in the middle**
Most blue points fall reasonably close to the red line around the center of the
distribution.  
This suggests:

- the bulk of residuals are approximately normal  
- OLS captures the central trend well  

However...

#### **2. Systematic deviation in the lower tail**
In the lower-left region (large negative residuals), the points fall **well below** the
red line:

- since residual = y − ŷ, these are homes where the model **overestimates** price dramatically  
- indicates **outliers** or **nonlinear behavior**  
- may also reflect heteroscedasticity (unequal variance), although a QQ-plot cannot confirm that on its own  

These heavy-tailed errors violate OLS’s normality assumption.

#### **3. Deviations in the upper tail**
At the top-right, the largest positive residuals also deviate from the line:

- indicates occasional **underestimation** (actual price above the prediction)  
- strengthens the evidence of **non-normality** in residuals  

---

### What this tells us about OLS

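Classical OLS inference (confidence intervals, p-values) relies on the assumption that residuals are:

1. **normally distributed**  
2. **homoscedastic** (constant variance)  
3. **centered around 0**  

This QQ-plot shows a violation of assumption **1** and is consistent with a violation of **2** (a residuals-vs-fitted plot is the more direct check), due to: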

- nonlinear relationships  
- outliers  
- features interacting in ways OLS cannot capture  

These issues reduce OLS’s accuracy and motivate the need for **Polynomial Regression**  
and **Regularized models**, which can better handle:

- curvature  
- interactions  
- feature collinearity  

Neither is robust to outliers by itself, since both still minimize squared error.
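
A QQ-plot speaks mainly to normality; constant variance is easier to check by relating the size of the residuals to the fitted values. The sketch below does this with a rank correlation on synthetic data whose noise grows with the predictor. The data and `_demo` names are assumptions, and the Spearman check is one simple option rather than a formal test such as Breusch-Pagan.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)

# Synthetic stand-in (assumption): noise standard deviation grows with x.
n = 500
x_demo = rng.uniform(1, 10, size=(n, 1))
y_demo = 5.0 * x_demo[:, 0] + rng.normal(scale=0.5 * x_demo[:, 0])

fit_demo = LinearRegression().fit(x_demo, y_demo)
fitted_demo = fit_demo.predict(x_demo)
resid_demo = y_demo - fitted_demo  # same sign convention as above: y - y_hat

# If variance is constant, |residual| should not trend with the fitted value.
rho_demo, p_demo = stats.spearmanr(fitted_demo, np.abs(resid_demo))
print(f"Spearman rho between fitted values and |residual|: {rho_demo:.2f} (p = {p_demo:.3g})")
```

A clearly positive correlation means errors grow with the predicted value, which for housing data would show up as wider scatter for expensive homes.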

---

### Key Takeaway

While OLS provides an interpretable baseline, the QQ-plot reveals:

- non-normal residuals  
- asymmetric tails  
- underestimation of high-value homes  
- limited ability to model nonlinear price behavior  

This reinforces that OLS is only a **starting point**, and more flexible techniques  
(2.3–2.6) are needed for higher accuracy on complex datasets like housing prices.
