# Loss Functions, Cost Functions, and the Bias-Variance Tradeoff

---

## Learning Objectives

By the end of this notebook you will be able to:

- Distinguish between a **loss function** (single sample) and a **cost function** (average over the dataset)
- Write and interpret common loss functions: MSE, MAE, Binary Cross-Entropy, Hinge Loss
- Explain the **Bias-Variance Tradeoff** and its relationship to underfitting and overfitting
- Use polynomial regression to visually demonstrate the tradeoff
- Read and interpret **learning curves** to diagnose model problems

## Prerequisites

- Basic Python and NumPy
- Familiarity with linear regression concepts (ML200)
- Understanding of train/test splits (ML100)

## Table of Contents

1. [Loss Function vs Cost Function](#1)
2. [Common Loss Functions](#2)
3. [The Bias-Variance Tradeoff](#3)
4. [Demonstrating Bias-Variance with Polynomial Regression](#4)
5. [Learning Curves](#5)
6. [Common Mistakes](#6)
7. [Exercise](#7)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import learning_curve, train_test_split
from sklearn.metrics import mean_squared_error

np.random.seed(42)
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

<a id='1'></a>
## 1. Loss Function vs Cost Function

These two terms are often used interchangeably, but they have a precise distinction:

| Term | Scope | Definition |
|------|-------|------------|
| **Loss function** $L$ | Single sample | Measures the error for **one** data point |
| **Cost function** $J$ | Entire dataset | The **average** (or sum) of the loss over all samples |

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} L\bigl(y^{(i)},\; \hat{y}^{(i)}\bigr)$$

- **Loss** tells you how wrong the model is on a single prediction.
- **Cost** aggregates that information so you can optimize over the whole dataset.
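
The distinction can be made concrete with a few numbers. The sketch below uses squared error as the example loss; the target and prediction values are made up purely for illustration.

```python
import numpy as np

# Hypothetical targets and predictions for four samples (illustrative values only).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Loss: one value per sample (squared error plays the role of L here).
per_sample_loss = (y_true - y_pred) ** 2

# Cost: a single scalar, the mean of the per-sample losses, i.e. J(theta).
cost = per_sample_loss.mean()

print("Per-sample losses:", per_sample_loss)
print("Cost J:", cost)
```

The loss is a vector with one entry per sample, while the cost is the single number an optimizer would try to reduce.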

<a id='2'></a>
## 2. Common Loss Functions

### 2.1 Mean Squared Error (MSE) -- Regression

$$L = (y - \hat{y})^2 \qquad\qquad J = \frac{1}{n}\sum_{i=1}^{n}(y^{(i)} - \hat{y}^{(i)})^2$$

- Penalizes large errors heavily (quadratic)
- Differentiable everywhere -- convenient for gradient-based optimization

### 2.2 Mean Absolute Error (MAE) -- Regression

$$L = |y - \hat{y}| \qquad\qquad J = \frac{1}{n}\sum_{i=1}^{n}|y^{(i)} - \hat{y}^{(i)}|$$

- Less sensitive to outliers than MSE
- Not differentiable at zero error -- gradient-based optimizers fall back on a subgradient there
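
To see the difference in outlier sensitivity numerically, the sketch below compares both costs on two sets of residuals. The residual values are hypothetical, and `mse` and `mae` are small helpers defined only for this example.

```python
import numpy as np

# Hypothetical residuals (y - y_hat); the second set replaces one value with an outlier.
residuals_clean = np.array([0.5, -0.5, 1.0, -1.0])
residuals_outlier = np.array([0.5, -0.5, 1.0, -10.0])

def mse(r):
    """Mean squared error of a residual vector."""
    return np.mean(r ** 2)

def mae(r):
    """Mean absolute error of a residual vector."""
    return np.mean(np.abs(r))

for name, r in [("clean", residuals_clean), ("with outlier", residuals_outlier)]:
    print(f"{name:>12}: MSE = {mse(r):.3f}, MAE = {mae(r):.3f}")

# How much each cost grows when the outlier is introduced.
mse_ratio = mse(residuals_outlier) / mse(residuals_clean)
mae_ratio = mae(residuals_outlier) / mae(residuals_clean)
print(f"MSE grows by a factor of {mse_ratio:.1f}, MAE by a factor of {mae_ratio:.1f}")
```

A single large residual dominates the MSE because it is squared, whereas it enters the MAE only linearly.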

### 2.3 Binary Cross-Entropy -- Classification

$$L = -\bigl[y\log(\hat{p}) + (1-y)\log(1-\hat{p})\bigr]$$

- $y \in \{0, 1\}$ is the true label; $\hat{p}$ is the predicted probability that $y = 1$
- Used in logistic regression and neural networks
- Heavily penalizes confident wrong predictions
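
A minimal sketch of the formula in NumPy follows. The function name `binary_cross_entropy` and the clipping constant `eps` are assumptions made for this example; clipping is one common way to avoid evaluating `log(0)`.

```python
import numpy as np

def binary_cross_entropy(y, p_hat, eps=1e-12):
    """Per-sample binary cross-entropy; eps is an assumed clipping constant."""
    p = np.clip(p_hat, eps, 1 - eps)  # keep probabilities away from exactly 0 and 1
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical labels and predicted probabilities of class 1.
y = np.array([1, 1, 1, 0])
p_hat = np.array([0.9, 0.5, 0.01, 0.01])

losses = binary_cross_entropy(y, p_hat)
for yi, pi, li in zip(y, p_hat, losses):
    print(f"y = {yi}, p_hat = {pi:.2f} -> loss = {li:.3f}")
```

The third sample (true label 1, predicted probability 0.01) is the confident wrong prediction and receives by far the largest loss.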

### 2.4 Hinge Loss -- Classification (SVM)

$$L = \max(0,\; 1 - y\hat{y})$$

- Used by Support Vector Machines
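- $y \in \{-1, +1\}$ and $\hat{y}$ is the raw decision score, not the predicted label
- Zero loss when $y\hat{y} \ge 1$, i.e. the prediction is correct with a margin of at least 1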
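
The sketch below evaluates the hinge loss on a few hypothetical samples. It treats the second argument as a raw decision score (for example, the signed distance-like output of a linear classifier), which is an interpretation assumed for this example; `hinge_loss` is a helper defined here.

```python
import numpy as np

def hinge_loss(y, score):
    """Per-sample hinge loss; y is in {-1, +1}, score is the raw decision value."""
    return np.maximum(0.0, 1.0 - y * score)

# Hypothetical labels and decision scores.
y = np.array([1, 1, 1, -1])
score = np.array([2.0, 0.3, -1.0, -1.5])

losses = hinge_loss(y, score)
for yi, si, li in zip(y, score, losses):
    print(f"y = {yi:+d}, score = {si:+.1f} -> loss = {li:.1f}")
```

The second sample is classified correctly (the score has the right sign) but still incurs a loss, because its margin `y * score` is below 1.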

In [None]:
# --- Visualize MSE vs MAE loss for a single sample ---
errors = np.linspace(-4, 4, 200)

mse_loss = errors ** 2
mae_loss = np.abs(errors)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].plot(errors, mse_loss, color='steelblue', linewidth=2)
axes[0].set_title(r'MSE Loss: $L = (y - \hat{y})^2$')  # raw string: '\h' is not a valid escape
axes[0].set_xlabel(r'Error $(y - \hat{y})$')
axes[0].set_ylabel('Loss')
axes[0].grid(True, alpha=0.3)

axes[1].plot(errors, mae_loss, color='darkorange', linewidth=2)
axes[1].set_title(r'MAE Loss: $L = |y - \hat{y}|$')  # raw string: '\h' is not a valid escape
axes[1].set_xlabel(r'Error $(y - \hat{y})$')
axes[1].set_ylabel('Loss')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
print("Key takeaway: MSE penalizes large errors much more than MAE.")

In [None]:
# --- Visualize Binary Cross-Entropy ---
p_hat = np.linspace(0.001, 0.999, 200)

loss_y1 = -np.log(p_hat)        # true label y=1
loss_y0 = -np.log(1 - p_hat)    # true label y=0

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].plot(p_hat, loss_y1, color='crimson', linewidth=2)
axes[0].set_title('Cross-Entropy when $y = 1$')
axes[0].set_xlabel(r'Predicted probability $\hat{p}$')  # raw string: '\h' is not a valid escape
axes[0].set_ylabel('Loss')
axes[0].grid(True, alpha=0.3)

axes[1].plot(p_hat, loss_y0, color='teal', linewidth=2)
axes[1].set_title('Cross-Entropy when $y = 0$')
axes[1].set_xlabel(r'Predicted probability $\hat{p}$')  # raw string: '\h' is not a valid escape
axes[1].set_ylabel('Loss')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
print("Key takeaway: loss explodes when the model is confident AND wrong.")

<a id='3'></a>
## 3. The Bias-Variance Tradeoff

For squared-error loss, a model's expected prediction error at a point can be decomposed as:

$$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$

| Component | Meaning | Symptom |
|-----------|---------|----------|
| **Bias** | How far the average prediction is from the true value | **Underfitting** -- model is too simple |
| **Variance** | How much predictions change across different training sets | **Overfitting** -- model is too complex |
| **Irreducible Noise** | Random noise inherent in the data | Cannot be reduced by any model |

**The tradeoff:**
- Increasing model complexity decreases bias but increases variance.
- Decreasing model complexity increases bias but decreases variance.
- The sweet spot is where bias$^2$ + variance is minimized; the noise term is constant, so this is also where total error is lowest.
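
Bias and variance can also be estimated by simulation: draw many training sets from the same process, refit the model each time, and compare the predictions at fixed evaluation points. The sketch below is one way to do this, assuming a `sin(2*pi*x)` target with Gaussian noise. The sample size, number of repeats, degrees, and evaluation grid are all choices made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def true_f(x):
    """Assumed noise-free target function."""
    return np.sin(2 * np.pi * x)

# Assumed simulation settings (illustrative only).
n_train, n_repeats, noise_sd = 30, 200, 0.25
x_eval = np.linspace(0.05, 0.95, 50)  # fixed points where predictions are compared

results = {}
for degree in (1, 3, 9):
    preds = np.empty((n_repeats, x_eval.size))
    for r in range(n_repeats):
        # Draw a fresh training set from the same data-generating process.
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, noise_sd, n_train)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x.reshape(-1, 1), y)
        preds[r] = model.predict(x_eval.reshape(-1, 1))

    mean_pred = preds.mean(axis=0)                         # average prediction per x
    bias_sq = np.mean((mean_pred - true_f(x_eval)) ** 2)   # squared bias, averaged over x_eval
    variance = np.mean(preds.var(axis=0))                  # spread across training sets
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

These are Monte Carlo estimates, so the exact numbers depend on the seed and settings. The bias estimate for the most flexible model is the noisiest, since it inherits the spread of its own predictions.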

In [None]:
# --- Conceptual plot: model complexity vs error ---
complexity = np.linspace(0.5, 10, 200)

bias_sq = 8 / complexity ** 1.5
variance = 0.05 * complexity ** 2
noise = np.full_like(complexity, 0.5)
total_error = bias_sq + variance + noise

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(complexity, bias_sq, '--', label='Bias$^2$', color='steelblue', linewidth=2)
ax.plot(complexity, variance, '--', label='Variance', color='darkorange', linewidth=2)
ax.plot(complexity, noise, ':', label='Irreducible Noise', color='gray', linewidth=2)
ax.plot(complexity, total_error, '-', label='Total Error', color='crimson', linewidth=2.5)

# Mark the optimal point
opt_idx = np.argmin(total_error)
ax.axvline(complexity[opt_idx], color='green', linestyle='-.', alpha=0.6, label='Optimal Complexity')

ax.set_xlabel('Model Complexity')
ax.set_ylabel('Error')
ax.set_title('Bias-Variance Tradeoff')
ax.legend(fontsize=11)
ax.set_ylim(0, 8)
ax.grid(True, alpha=0.3)

# Annotate regions
ax.annotate('Underfitting\n(High Bias)', xy=(1.5, 5), fontsize=13, color='steelblue', fontweight='bold')
ax.annotate('Overfitting\n(High Variance)', xy=(7.5, 5), fontsize=13, color='darkorange', fontweight='bold')

plt.tight_layout()
plt.show()

<a id='4'></a>
## 4. Demonstrating Bias-Variance with Polynomial Regression

We will fit polynomials of degree 1 (linear), 3 (moderate), and 15 (very flexible) to noisy sinusoidal data.

In [None]:
# --- Generate noisy sinusoidal data ---
np.random.seed(42)
n_samples = 30

X = np.sort(np.random.uniform(0, 1, n_samples))
y_true = np.sin(2 * np.pi * X)
y = y_true + np.random.normal(0, 0.25, n_samples)

X_plot = np.linspace(0, 1, 300)
y_plot_true = np.sin(2 * np.pi * X_plot)

print(f"Generated {n_samples} noisy samples from sin(2*pi*x).")

In [None]:
# --- Fit polynomials of degree 1, 3, and 15 ---
degrees = [1, 3, 15]
labels = ['Degree 1 (High Bias)', 'Degree 3 (Good Fit)', 'Degree 15 (High Variance)']
colors = ['steelblue', 'green', 'crimson']

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, degree, label, color in zip(axes, degrees, labels, colors):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X.reshape(-1, 1), y)
    y_pred = model.predict(X_plot.reshape(-1, 1))

    ax.scatter(X, y, color='black', s=20, zorder=5, label='Training data')
    ax.plot(X_plot, y_plot_true, 'k--', alpha=0.4, label='True function')
    ax.plot(X_plot, y_pred, color=color, linewidth=2, label=f'Poly degree {degree}')
    ax.set_title(label, fontsize=13)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_ylim(-2, 2)
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Degree 1:  Too simple -- misses the curve (HIGH BIAS).")
print("Degree 3:  Captures the shape without chasing noise (GOOD BALANCE).")
print("Degree 15: Chases the noise -- wild swings on unseen data (HIGH VARIANCE).")

In [None]:
# --- Train/test error vs polynomial degree ---
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(
    X.reshape(-1, 1), y, test_size=0.3, random_state=42
)

max_degree = 15
train_errors = []
test_errors = []

for d in range(1, max_degree + 1):
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test)))

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(range(1, max_degree + 1), train_errors, 'o-', label='Training Error', color='steelblue')
ax.plot(range(1, max_degree + 1), test_errors, 's-', label='Test Error', color='crimson')
ax.set_xlabel('Polynomial Degree')
ax.set_ylabel('MSE')
ax.set_title('Model Complexity vs Training/Test Error')
ax.set_ylim(0, min(2, max(test_errors) * 1.1))
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("The classic U-shaped test curve:")
print("  - Left side: high bias (underfitting)")
print("  - Right side: high variance (overfitting)")
print("  - Minimum: the sweet spot")
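
With only 30 samples, a single train/test split leaves very few test points, so the curve above can be noisy. One possible alternative is to average the error over several folds. The sketch below regenerates data from the same kind of process with its own random generator (so the samples differ from those above) and uses shuffled 5-fold cross-validation. The fold count and the degree range are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Regenerate noisy sinusoidal data so the block runs on its own.
rng = np.random.default_rng(42)
X_cv = rng.uniform(0, 1, 30).reshape(-1, 1)
y_cv = np.sin(2 * np.pi * X_cv.ravel()) + rng.normal(0, 0.25, 30)

# Shuffled folds with a fixed seed, reused for every degree.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for d in range(1, 11):
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[d] = -scores.mean()  # flip the sign: sklearn reports negated MSE

best_degree = min(cv_mse, key=cv_mse.get)
print("Cross-validated MSE by degree:")
for d, m in cv_mse.items():
    print(f"  degree {d:2d}: {m:.4f}")
print("Degree with lowest CV error:", best_degree)
```

Reusing the same `KFold` object for every degree means all models are compared on identical splits.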

<a id='5'></a>
## 5. Learning Curves

A **learning curve** plots training and validation scores as a function of the **number of training samples**.

### How to read them:

| Pattern | Diagnosis | Action |
|---------|-----------|--------|
| Both scores low, close together | **High bias** (underfitting) | Increase model complexity, add features |
| Training score high, validation low, large gap | **High variance** (overfitting) | More data, regularization, simpler model |
| Both scores high, close together | **Good fit** | You are done |

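Let's use `sklearn.model_selection.learning_curve` to generate these plots. The helper in the next cell plots **error** (MSE) rather than score, so the patterns invert: "both scores low" in the table shows up as both error curves high.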

In [None]:
# --- Generate more data for learning curve analysis ---
np.random.seed(42)
n = 200
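# Keep X in random order: with cv=5 the folds are contiguous blocks of rows, so
# sorted X would turn every validation fold into an extrapolation region.
X_lc = np.random.uniform(0, 1, n).reshape(-1, 1)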
y_lc = np.sin(2 * np.pi * X_lc.ravel()) + np.random.normal(0, 0.25, n)

def plot_learning_curve(estimator, title, X, y, ax):
    """Plot a learning curve using sklearn."""
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5,
        scoring='neg_mean_squared_error',
        shuffle=True,  # random_state only takes effect when shuffle=True
        random_state=42
    )
    train_mean = -train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = -val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)

    ax.plot(train_sizes, train_mean, 'o-', color='steelblue', label='Training Error')
    ax.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
                    alpha=0.15, color='steelblue')
    ax.plot(train_sizes, val_mean, 's-', color='crimson', label='Validation Error')
    ax.fill_between(train_sizes, val_mean - val_std, val_mean + val_std,
                    alpha=0.15, color='crimson')
    ax.set_xlabel('Training Set Size')
    ax.set_ylabel('MSE')
    ax.set_title(title)
    ax.legend()
    ax.grid(True, alpha=0.3)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# High bias model
model_bias = make_pipeline(PolynomialFeatures(1), LinearRegression())
plot_learning_curve(model_bias, 'Degree 1 -- High Bias', X_lc, y_lc, axes[0])

# Good fit model
model_good = make_pipeline(PolynomialFeatures(3), LinearRegression())
plot_learning_curve(model_good, 'Degree 3 -- Good Fit', X_lc, y_lc, axes[1])

# High variance model
model_var = make_pipeline(PolynomialFeatures(15), LinearRegression())
plot_learning_curve(model_var, 'Degree 15 -- High Variance', X_lc, y_lc, axes[2])

plt.tight_layout()
plt.show()

print("Degree 1:  Both curves converge high  --> HIGH BIAS (more data won't help).")
print("Degree 3:  Both curves converge low    --> GOOD FIT.")
print("Degree 15: Large gap between curves     --> HIGH VARIANCE (more data may help).")
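
The visual reading can be complemented with numbers. The sketch below reports the training error, validation error, and their gap at the largest training size for each degree. `final_errors` is a hypothetical helper written for this example, and the data are regenerated with a separate random generator so the block runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Regenerate noisy sinusoidal data (unsorted) for a self-contained example.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, 200).reshape(-1, 1)
y_demo = np.sin(2 * np.pi * X_demo.ravel()) + rng.normal(0, 0.25, 200)

def final_errors(estimator, X, y):
    """Return (train MSE, validation MSE, gap) at the largest training size."""
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    _, train_scores, val_scores = learning_curve(
        estimator, X, y,
        train_sizes=[1.0],  # use the full training portion of each fold
        cv=cv,
        scoring="neg_mean_squared_error",
    )
    train_mse = -train_scores.mean()  # scores are negated MSE
    val_mse = -val_scores.mean()
    return train_mse, val_mse, val_mse - train_mse

summary = {}
for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    summary[degree] = final_errors(model, X_demo, y_demo)
    train_mse, val_mse, gap = summary[degree]
    print(f"degree {degree:2d}: train = {train_mse:.3f}, val = {val_mse:.3f}, gap = {gap:.3f}")
```

High training and validation error together point to bias. A validation error well above the training error points to variance. No thresholds are hard-coded here, because what counts as a large gap depends on the noise level of the problem.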

<a id='6'></a>
## 6. Common Mistakes

1. **Confusing bias and variance**
   - High **bias** = underfitting (model too simple, poor on training AND test)
   - High **variance** = overfitting (model too complex, great on training, poor on test)

2. **Not looking at learning curves**
   - A single train/test split score does not tell you whether to get more data, add features, or simplify the model
   - Always plot learning curves before deciding your next step

3. **Confusing loss and cost**
   - Loss = one sample, cost = average over all samples
   - Optimization minimizes the **cost** function, not individual losses

4. **Using the wrong loss for the task**
   - MSE/MAE for regression, cross-entropy for classification
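   - MSE on a sigmoid output can work, but for logistic regression the cost becomes non-convex and gradients vanish for confident wrong predictions, so training tends to be slow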

5. **Assuming more data always helps**
   - If your model has high bias, more data will **not** help
   - Learning curves make this obvious: if both curves have already converged, adding data is unlikely to lower the error
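
Item 4's point about MSE for classification can be illustrated by comparing gradients. The sketch below assumes a model whose output is a sigmoid of a logit `z` and looks at one confidently wrong prediction; the specific logit value is chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# A confidently wrong prediction: true label 1, strongly negative logit.
y, z = 1.0, -6.0
p = sigmoid(z)

# Gradients of each per-sample loss with respect to the logit z.
grad_bce = p - y                           # d/dz of -[y log p + (1 - y) log(1 - p)]
grad_mse = 2.0 * (p - y) * p * (1.0 - p)   # d/dz of (y - p)^2, via the chain rule

print(f"p = {p:.5f}")
print(f"BCE gradient wrt logit: {grad_bce:.5f}")
print(f"MSE gradient wrt logit: {grad_mse:.5f}")
```

The MSE gradient carries an extra factor `p * (1 - p)`, which is close to zero when the sigmoid saturates. The model therefore receives only a weak correction exactly when it is most wrong, while the cross-entropy gradient stays close to -1.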

<a id='7'></a>
## 7. Exercise

**Task:** Use the California Housing dataset to explore the bias-variance tradeoff.

1. Load the dataset with `sklearn.datasets.fetch_california_housing()`
2. Use only the first feature (`MedInc`) for simplicity
3. Fit polynomial regression with degrees 1, 3, 5, 10
4. Plot the **learning curve** for each degree (use the `plot_learning_curve` helper above)
5. Answer:
   - Which degree shows high bias?
   - Which degree shows high variance?
   - Which degree gives the best balance?

In [None]:
# YOUR CODE HERE
# from sklearn.datasets import fetch_california_housing
# data = fetch_california_housing()
# X_ex = data.data[:, [0]]  # MedInc only
# y_ex = data.target
#
# ... fit models and plot learning curves ...