# Regression Metrics and Error Analysis

---

## Learning Objectives

By the end of this notebook, you will be able to:

- Compute and interpret MAE, MSE, RMSE, and R2
- Choose the right metric for your problem (e.g., MAE vs RMSE)
- Create and interpret residual plots and error distributions
- Perform error analysis: identify where a model fails
- Compare models against a baseline (DummyRegressor)
- Build a model comparison table

## Prerequisites

- Completed Notebooks 01-03 (Linear Regression, Assumptions, Regularization)
- Familiarity with sklearn model fitting and prediction

## Table of Contents

1. [Metric Definitions and Formulas](#1-metric-definitions-and-formulas)
2. [When to Use Which Metric](#2-when-to-use-which-metric)
3. [Computing All Metrics](#3-computing-all-metrics)
4. [Residual Analysis](#4-residual-analysis)
5. [Error Analysis: Where Does the Model Fail?](#5-error-analysis-where-does-the-model-fail)
6. [Baseline Comparison with DummyRegressor](#6-baseline-comparison-with-dummyregressor)
7. [Model Comparison Table](#7-model-comparison-table)
8. [Common Mistakes](#8-common-mistakes)
9. [Exercise](#9-exercise)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score
)
from sklearn.datasets import fetch_california_housing

np.random.seed(42)
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

---

## 1. Metric Definitions and Formulas

### Mean Absolute Error (MAE)

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

- Average absolute difference between actual and predicted values
- Same units as the target variable
- More robust to outliers than MSE/RMSE (the penalty grows only linearly with the error)

### Mean Squared Error (MSE)

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

- Average squared difference
- Units are squared (harder to interpret directly)
- Penalizes large errors disproportionately

### Root Mean Squared Error (RMSE)

$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

- Same units as the target variable
- More sensitive to outliers than MAE
- One of the most commonly reported regression metrics
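
The three error metrics above can be checked by hand on a tiny example. The sketch below uses four made-up values (they are not from this notebook's data) and the illustrative names `y_true_toy`, `y_hat_toy`, and `errors_toy`. It also shows how much of the total the single largest error contributes under each penalty.

```python
import numpy as np

# Toy values chosen for easy hand-checking (assumed, not from the notebook's data)
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_hat_toy = np.array([2.0, 5.0, 8.0, 12.0])

errors_toy = y_true_toy - y_hat_toy        # [ 1.,  0., -1., -3.]
mae_toy = np.mean(np.abs(errors_toy))      # (1 + 0 + 1 + 3) / 4 = 1.25
mse_toy = np.mean(errors_toy ** 2)         # (1 + 0 + 1 + 9) / 4 = 2.75
rmse_toy = np.sqrt(mse_toy)                # sqrt(2.75), about 1.658

# Share of the total penalty contributed by the largest error (-3)
share_abs = np.abs(errors_toy).max() / np.abs(errors_toy).sum()   # 3 / 5  = 0.60
share_sq = (errors_toy ** 2).max() / (errors_toy ** 2).sum()      # 9 / 11, about 0.82

print(f"MAE = {mae_toy:.3f}, MSE = {mse_toy:.3f}, RMSE = {rmse_toy:.3f}")
print(f"Largest error's share: {share_abs:.2f} (absolute) vs {share_sq:.2f} (squared)")
```

The largest error accounts for 60% of the absolute penalty but about 82% of the squared penalty, which is what "penalizes large errors disproportionately" means in practice.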

### R-squared (R2 / Coefficient of Determination)

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

- Proportion of variance explained by the model
- Range: $(-\infty, 1]$; 1 = perfect, 0 = predicting the mean, negative = worse than mean
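- Unitless, which makes it easier to compare across targets with different scales (with care: it also depends on the variance of the target in each dataset)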
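
To make the range of R2 concrete, here is a small sketch on made-up values (the arrays and the `r2_*` names are illustrative assumptions). It evaluates a perfect prediction, a constant prediction at the mean, a partially correct prediction, and a prediction that is worse than the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy target (assumed for illustration): mean = 6, SS_tot = 9 + 1 + 1 + 9 = 20
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])

r2_perfect = r2_score(y_true_toy, y_true_toy)                       # SS_res = 0  -> 1.0
r2_mean = r2_score(y_true_toy, np.full_like(y_true_toy, 6.0))       # SS_res = 20 -> 0.0
r2_partial = r2_score(y_true_toy, np.array([2.0, 5.0, 8.0, 12.0]))  # SS_res = 11 -> 0.45
r2_reversed = r2_score(y_true_toy, y_true_toy[::-1])                # SS_res = 80 -> -3.0

print(f"perfect: {r2_perfect:.2f}, mean: {r2_mean:.2f}, "
      f"partial: {r2_partial:.2f}, reversed: {r2_reversed:.2f}")
```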

---

## 2. When to Use Which Metric

| Metric | Best For | Sensitive to Outliers? | Units |
|---|---|---|---|
| **MAE** | When all errors should be weighted equally | Less than MSE/RMSE (linear penalty) | Same as target |
| **MSE** | When large errors are especially bad | Yes (quadratic penalty) | Squared units |
| **RMSE** | General purpose, interpretable | Yes | Same as target |
| **R2** | Comparing models, explaining variance | Moderate | Unitless |

**Rule of thumb:**
- Use **MAE** when outliers should not dominate your evaluation (e.g., median house price)
- Use **RMSE** when large errors are costly (e.g., energy demand forecasting)
- Use **R2** for quick model comparison, but never in isolation
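
One way to see why MAE and RMSE react differently to outliers: among all constant predictions, MAE is minimized by the median of the targets and MSE (hence RMSE) by the mean. The sketch below checks this numerically with a brute-force search over candidate constants. The data and the names `y_demo` and `candidates` are made up for illustration.

```python
import numpy as np

# Made-up target values with one extreme observation
y_demo = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Candidate constant predictions from 0 to 100 in steps of 0.01
candidates = np.linspace(0.0, 100.0, 10001)

# Error of each candidate against every target value (rows: candidates)
diffs = y_demo[None, :] - candidates[:, None]
mae_per_candidate = np.abs(diffs).mean(axis=1)
mse_per_candidate = (diffs ** 2).mean(axis=1)

best_for_mae = candidates[np.argmin(mae_per_candidate)]
best_for_mse = candidates[np.argmin(mse_per_candidate)]

print(f"Best constant under MAE: {best_for_mae:.2f} (median = {np.median(y_demo):.2f})")
print(f"Best constant under MSE: {best_for_mse:.2f} (mean   = {y_demo.mean():.2f})")
```

The single value of 100 pulls the MSE-optimal constant far from the bulk of the data, while the MAE-optimal constant stays at the median.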

In [None]:
# Demonstrate MAE vs RMSE sensitivity to outliers
np.random.seed(42)

# Normal errors
errors_normal = np.random.randn(100) * 2
# Errors with outliers
errors_outlier = errors_normal.copy()
errors_outlier[:5] = np.array([20, -25, 18, -22, 30])  # 5 extreme outliers

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for ax, errors, title in zip(
    axes,
    [errors_normal, errors_outlier],
    ["Normal Errors", "Errors with Outliers"]
):
    ax.hist(errors, bins=30, edgecolor="k", alpha=0.7, color="steelblue")
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    ax.axvline(mae, color="green", linewidth=2, linestyle="--", label=f"MAE = {mae:.2f}")
    ax.axvline(rmse, color="red", linewidth=2, linestyle="--", label=f"RMSE = {rmse:.2f}")
    ax.set_title(title)
    ax.set_xlabel("Error")
    ax.legend()

plt.tight_layout()
plt.show()

print("With outliers, RMSE increases much more than MAE.")
print("If RMSE is much larger than MAE, a few large errors are likely dominating.")

---

## 3. Computing All Metrics

In [None]:
# Generate a dataset and fit a model
np.random.seed(42)
n_samples = 300
n_features = 8

X = np.random.randn(n_samples, n_features)
true_w = np.array([3.0, -2.0, 1.5, 0.0, -1.0, 0.5, 0.0, 2.0])
y = X @ true_w + 5.0 + np.random.randn(n_samples) * 1.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compute metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Regression Metrics (Test Set):")
print(f"  MAE:  {mae:.4f}")
print(f"  MSE:  {mse:.4f}")
print(f"  RMSE: {rmse:.4f}")
print(f"  R2:   {r2:.4f}")

In [None]:
# Manual computation to verify understanding
residuals = y_test - y_pred

mae_manual = np.mean(np.abs(residuals))
mse_manual = np.mean(residuals ** 2)
rmse_manual = np.sqrt(mse_manual)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("Manual verification:")
print(f"  MAE:  {mae_manual:.4f} (sklearn: {mae:.4f})")
print(f"  MSE:  {mse_manual:.4f} (sklearn: {mse:.4f})")
print(f"  RMSE: {rmse_manual:.4f} (sklearn: {rmse:.4f})")
print(f"  R2:   {r2_manual:.4f} (sklearn: {r2:.4f})")

---

## 4. Residual Analysis

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Residuals vs Fitted
axes[0, 0].scatter(y_pred, residuals, alpha=0.6, edgecolors="k", linewidths=0.5)
axes[0, 0].axhline(y=0, color="r", linestyle="--", linewidth=1.5)
axes[0, 0].set_xlabel("Fitted values")
axes[0, 0].set_ylabel("Residuals")
axes[0, 0].set_title("Residuals vs Fitted Values")

# 2. Actual vs Predicted
axes[0, 1].scatter(y_test, y_pred, alpha=0.6, edgecolors="k", linewidths=0.5)
mn = min(y_test.min(), y_pred.min())
mx = max(y_test.max(), y_pred.max())
axes[0, 1].plot([mn, mx], [mn, mx], "r--", linewidth=2, label="Perfect prediction")
axes[0, 1].set_xlabel("Actual")
axes[0, 1].set_ylabel("Predicted")
axes[0, 1].set_title("Actual vs Predicted")
axes[0, 1].legend()

# 3. Residual distribution
axes[1, 0].hist(residuals, bins=25, edgecolor="k", alpha=0.7, color="steelblue", density=True)
from scipy import stats
x_range = np.linspace(residuals.min(), residuals.max(), 100)
axes[1, 0].plot(x_range, stats.norm.pdf(x_range, residuals.mean(), residuals.std()),
                "r-", linewidth=2, label="Normal fit")
axes[1, 0].set_xlabel("Residual")
axes[1, 0].set_ylabel("Density")
axes[1, 0].set_title("Distribution of Residuals")
axes[1, 0].legend()

# 4. Q-Q plot
stats.probplot(residuals, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title("Q-Q Plot of Residuals")
axes[1, 1].get_lines()[0].set_markerfacecolor("steelblue")
axes[1, 1].get_lines()[0].set_alpha(0.6)

plt.tight_layout()
plt.show()

print("What to look for:")
print("  - Residuals vs Fitted: random scatter = good; pattern = problem")
print("  - Actual vs Predicted: points near diagonal = accurate predictions")
print("  - Residual distribution: bell-shaped = good")
print("  - Q-Q plot: points on the line = normally distributed residuals")
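
The plots above come from data that really is linear, so they show no pattern. As a contrast, the sketch below fits a straight line to synthetic quadratic data (all names and values here are assumptions for illustration) and measures how strongly the residuals track the omitted `x**2` term. In a residuals-vs-fitted plot this would appear as a U-shape.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data with a quadratic term (assumed for illustration)
x_demo = rng.uniform(-3, 3, size=200)
y_quad = 1.0 + 2.0 * x_demo + 1.5 * x_demo ** 2 + rng.normal(0, 0.5, size=200)

X_linear = x_demo.reshape(-1, 1)                    # misspecified: x only
X_quadratic = np.column_stack([x_demo, x_demo ** 2])  # correctly specified: x and x^2

res_linear = y_quad - LinearRegression().fit(X_linear, y_quad).predict(X_linear)
res_quadratic = y_quad - LinearRegression().fit(X_quadratic, y_quad).predict(X_quadratic)

# Correlation between residuals and the omitted term
corr_linear = np.corrcoef(x_demo ** 2, res_linear)[0, 1]
corr_quadratic = np.corrcoef(x_demo ** 2, res_quadratic)[0, 1]

print(f"Residuals vs x^2, linear fit:    corr = {corr_linear:.3f}")
print(f"Residuals vs x^2, quadratic fit: corr = {corr_quadratic:.3f}")
```

A strong correlation for the linear fit signals leftover structure the model has not captured. Once `x**2` is included, the in-sample residuals are uncorrelated with it by construction of least squares.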

---

## 5. Error Analysis: Where Does the Model Fail?

Beyond aggregate metrics, it is important to understand **when and where** the model makes large errors.

In [None]:
# Analyze errors by feature range
# Pick the most important feature (feature 0 with true weight 3.0)
feature_idx = 0
feature_values = X_test[:, feature_idx]
abs_errors = np.abs(residuals)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Absolute error vs feature value
axes[0].scatter(feature_values, abs_errors, alpha=0.6, edgecolors="k", linewidths=0.5)
axes[0].set_xlabel(f"Feature {feature_idx} value")
axes[0].set_ylabel("Absolute error")
axes[0].set_title(f"Absolute Error vs Feature {feature_idx}")

# Error by target magnitude (binned)
df_errors = pd.DataFrame({
    "y_actual": y_test,
    "y_pred": y_pred,
    "abs_error": abs_errors
})
df_errors["y_bin"] = pd.cut(df_errors["y_actual"], bins=5)
error_by_bin = df_errors.groupby("y_bin", observed=True)["abs_error"].agg(["mean", "count"])

error_by_bin["mean"].plot(kind="bar", ax=axes[1], color="steelblue",
                           edgecolor="k", alpha=0.7)
axes[1].set_xlabel("Target value range")
axes[1].set_ylabel("Mean absolute error")
axes[1].set_title("Mean Absolute Error by Target Range")
axes[1].tick_params(axis='x', rotation=30)

plt.tight_layout()
plt.show()

print("Error analysis summary:")
print(error_by_bin.to_string())

In [None]:
# Identify worst predictions
df_worst = df_errors.nlargest(10, "abs_error")
print("Top 10 worst predictions:")
print(df_worst[["y_actual", "y_pred", "abs_error"]].to_string(index=False, float_format="%.3f"))
print(f"\nMedian absolute error: {df_errors['abs_error'].median():.3f}")
print(f"90th percentile error: {df_errors['abs_error'].quantile(0.9):.3f}")
print(f"Max absolute error:    {df_errors['abs_error'].max():.3f}")

---

## 6. Baseline Comparison with DummyRegressor

**Always compare your model against a simple baseline.** A model that cannot beat predicting the mean is not useful.

- `DummyRegressor(strategy="mean")`: predicts the training set mean for every sample
- `DummyRegressor(strategy="median")`: predicts the training set median for every sample

If your model's R2 is close to 0, it is barely better than guessing the mean.
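
A quick sanity check of what the mean baseline does, sketched on made-up data (the names `X_toy`, `y_toy`, and `baseline` are illustrative assumptions): every prediction equals the training mean, and when scored on the same data it was fit on, R2 is 0 up to floating-point error.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Made-up data; DummyRegressor ignores the features entirely
X_toy = rng.normal(size=(50, 2))
y_toy = rng.normal(loc=10.0, scale=2.0, size=50)

baseline = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
baseline_preds = baseline.predict(X_toy)

all_equal_mean = np.allclose(baseline_preds, y_toy.mean())
r2_baseline = r2_score(y_toy, baseline_preds)

print(f"All predictions equal the training mean: {all_equal_mean}")
print(f"R2 on the data it was fit on: {r2_baseline:.6f}")
```

On a separate test set the baseline's R2 is usually slightly below 0, because the training mean differs a little from the test mean.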

In [None]:
# Baseline: DummyRegressor
dummy_mean = DummyRegressor(strategy="mean")
dummy_mean.fit(X_train, y_train)
y_pred_dummy = dummy_mean.predict(X_test)

dummy_median = DummyRegressor(strategy="median")
dummy_median.fit(X_train, y_train)
y_pred_dummy_med = dummy_median.predict(X_test)

print("Baseline Comparison:")
# Name column is 25 wide so the longest label ("DummyRegressor (median)") stays aligned
print(f"{'Model':<25} {'MAE':<10} {'RMSE':<10} {'R2':<10}")
print("-" * 55)

for name, preds in [
    ("DummyRegressor (mean)", y_pred_dummy),
    ("DummyRegressor (median)", y_pred_dummy_med),
    ("LinearRegression", y_pred)
]:
    mae_val = mean_absolute_error(y_test, preds)
    rmse_val = np.sqrt(mean_squared_error(y_test, preds))
    r2_val = r2_score(y_test, preds)
    print(f"{name:<25} {mae_val:<10.4f} {rmse_val:<10.4f} {r2_val:<10.4f}")

print("\nA useful model must significantly beat the DummyRegressor.")

---

## 7. Model Comparison Table

In [None]:
# Scale features for regularized models
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Define models to compare
models_to_compare = {
    "Dummy (mean)": DummyRegressor(strategy="mean"),
    "LinearRegression": LinearRegression(),
    "Ridge (alpha=1.0)": Ridge(alpha=1.0),
    "Ridge (alpha=10.0)": Ridge(alpha=10.0),
    "Lasso (alpha=0.01)": Lasso(alpha=0.01, max_iter=10000, random_state=42),
    "Lasso (alpha=0.1)": Lasso(alpha=0.1, max_iter=10000, random_state=42),
}

comparison_rows = []
for name, mdl in models_to_compare.items():
    mdl.fit(X_train_s, y_train)
    preds = mdl.predict(X_test_s)

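    # Cross-validation on the training set. The scaler is refit inside each fold
    # via a Pipeline so the validation folds do not influence the scaling statistics.
    cv_pipe = Pipeline([("scaler", StandardScaler()), ("model", mdl)])
    cv_scores = cross_val_score(cv_pipe, X_train, y_train, cv=5,
                                scoring="neg_mean_squared_error")
    cv_rmse = np.sqrt(-cv_scores).mean()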

    comparison_rows.append({
        "Model": name,
        "MAE": mean_absolute_error(y_test, preds),
        "RMSE": np.sqrt(mean_squared_error(y_test, preds)),
        "R2": r2_score(y_test, preds),
        "CV RMSE": cv_rmse,
        "Non-zero coefs": np.sum(mdl.coef_ != 0) if hasattr(mdl, "coef_") else "-"
    })

comparison_df = pd.DataFrame(comparison_rows).set_index("Model")
print("Model Comparison Table:")
print(comparison_df.to_string())

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Filter to only numeric columns for plotting
plot_df = comparison_df[["RMSE", "R2"]].copy()
plot_df["RMSE"] = plot_df["RMSE"].astype(float)
plot_df["R2"] = plot_df["R2"].astype(float)

# RMSE comparison
plot_df["RMSE"].plot(kind="barh", ax=axes[0], color="steelblue",
                       edgecolor="k", alpha=0.7)
axes[0].set_xlabel("RMSE (lower is better)")
axes[0].set_title("RMSE Comparison")

# R2 comparison
plot_df["R2"].plot(kind="barh", ax=axes[1], color="coral",
                     edgecolor="k", alpha=0.7)
axes[1].set_xlabel("R2 (higher is better)")
axes[1].set_title("R2 Comparison")

plt.tight_layout()
plt.show()

---

## 8. Common Mistakes

| Mistake | Why It's a Problem | Fix |
|---|---|---|
| Using R2 alone | R2 can be high even with biased predictions or violated assumptions | Always report MAE/RMSE and check residual plots |
| Not checking residual patterns | A good R2 does not guarantee the model is appropriate | Always create residual vs fitted and Q-Q plots |
| Ignoring the baseline | You cannot tell if a model is useful without a reference | Compare against `DummyRegressor` |
| Comparing MAE and RMSE across different datasets | They depend on the scale of the target | Compare models on the **same** test set |
| Reporting only training metrics | Overfitting inflates training scores | Always report test set or cross-validation metrics |
| Not investigating worst predictions | Aggregate metrics hide important failure modes | Analyze errors by feature range and target magnitude |
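
The scale-dependence row in the table can be illustrated with a short sketch. The target and predictions below are synthetic, and the names `y_thousands` and `pred_thousands` are assumptions for illustration. Expressing the same target in different units rescales MAE and RMSE but leaves R2 unchanged.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Synthetic target in thousands, with noisy hypothetical predictions
y_thousands = rng.normal(loc=200.0, scale=50.0, size=500)
pred_thousands = y_thousands + rng.normal(0.0, 10.0, size=500)

# The same quantities expressed in single units
y_units = y_thousands * 1000
pred_units = pred_thousands * 1000

mae_thousands = mean_absolute_error(y_thousands, pred_thousands)
mae_units = mean_absolute_error(y_units, pred_units)
rmse_thousands = np.sqrt(mean_squared_error(y_thousands, pred_thousands))
rmse_units = np.sqrt(mean_squared_error(y_units, pred_units))
r2_thousands = r2_score(y_thousands, pred_thousands)
r2_units = r2_score(y_units, pred_units)

print(f"MAE:  {mae_thousands:.2f} (thousands) vs {mae_units:.2f} (units)")
print(f"RMSE: {rmse_thousands:.2f} (thousands) vs {rmse_units:.2f} (units)")
print(f"R2:   {r2_thousands:.4f} vs {r2_units:.4f}")
```

This is why MAE and RMSE values are only comparable between models evaluated on the same target and the same test set.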

---

## 9. Exercise

**Task:** Load the California Housing dataset, fit multiple models, and build a complete evaluation.

Steps:
1. Load the data and split 80/20 with `random_state=42`
2. Fit: `DummyRegressor`, `LinearRegression`, `Ridge(alpha=1.0)`, `Lasso(alpha=0.01)`
3. Compute MAE, RMSE, and R2 for each model on the test set
4. Create a residual plot for the best model
5. Identify which target ranges have the highest errors

In [None]:
# --- Exercise Solution ---

# Step 1: Load and split
housing = fetch_california_housing()
X_h = housing.data
y_h = housing.target

X_h_train, X_h_test, y_h_train, y_h_test = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42
)

# Scale features
scaler_h = StandardScaler()
X_h_train_s = scaler_h.fit_transform(X_h_train)
X_h_test_s = scaler_h.transform(X_h_test)

# Step 2: Fit models
exercise_models = {
    "Dummy (mean)": DummyRegressor(strategy="mean"),
    "LinearRegression": LinearRegression(),
    "Ridge (alpha=1.0)": Ridge(alpha=1.0),
    "Lasso (alpha=0.01)": Lasso(alpha=0.01, max_iter=10000, random_state=42),
}

# Step 3: Evaluate
ex_rows = []
ex_preds = {}

for name, mdl in exercise_models.items():
    mdl.fit(X_h_train_s, y_h_train)
    preds = mdl.predict(X_h_test_s)
    ex_preds[name] = preds

    ex_rows.append({
        "Model": name,
        "MAE": mean_absolute_error(y_h_test, preds),
        "RMSE": np.sqrt(mean_squared_error(y_h_test, preds)),
        "R2": r2_score(y_h_test, preds)
    })

ex_df = pd.DataFrame(ex_rows).set_index("Model")
print("California Housing — Model Comparison:")
print(ex_df.round(4).to_string())

In [None]:
# Step 4: Residual plot for best model (LinearRegression or Ridge)
best_name = ex_df["R2"].idxmax()
best_preds = ex_preds[best_name]
best_residuals = y_h_test - best_preds

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].scatter(best_preds, best_residuals, alpha=0.3, s=10, edgecolors="k", linewidths=0.2)
axes[0].axhline(y=0, color="r", linestyle="--", linewidth=1.5)
axes[0].set_xlabel("Fitted values")
axes[0].set_ylabel("Residuals")
axes[0].set_title(f"Residuals vs Fitted ({best_name})")

axes[1].scatter(y_h_test, best_preds, alpha=0.3, s=10, edgecolors="k", linewidths=0.2)
mn = min(y_h_test.min(), best_preds.min())
mx = max(y_h_test.max(), best_preds.max())
axes[1].plot([mn, mx], [mn, mx], "r--", linewidth=2, label="Perfect")
axes[1].set_xlabel("Actual")
axes[1].set_ylabel("Predicted")
axes[1].set_title(f"Actual vs Predicted ({best_name})")
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# Step 5: Error by target range
df_h_errors = pd.DataFrame({
    "y_actual": y_h_test,
    "abs_error": np.abs(best_residuals)
})
df_h_errors["target_bin"] = pd.cut(df_h_errors["y_actual"], bins=5)
error_by_range = df_h_errors.groupby("target_bin", observed=True)["abs_error"].agg(["mean", "median", "count"])

print(f"Error Analysis by Target Range ({best_name}):")
print(error_by_range.round(4).to_string())

fig, ax = plt.subplots(figsize=(10, 5))
error_by_range["mean"].plot(kind="bar", ax=ax, color="steelblue",
                             edgecolor="k", alpha=0.7)
ax.set_xlabel("Target value range (median house value, in units of $100,000)")
ax.set_ylabel("Mean Absolute Error")
ax.set_title("Where Does the Model Fail Most?")
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()

print("\nThe model tends to struggle most with high-value homes (targets above roughly 3-4, i.e. $300k-$400k),")
print("which makes sense: there are fewer training examples at the extremes.")