In [1]:
from BiasVar import BiasVarianceVisualizer

visualizer = BiasVarianceVisualizer()

# Sources of Error and HPO

In this chapter we'll see that even a well-trained model can still exhibit generalisation errors:
-  We'll understand the main sources of these errors.
-  We'll discuss hyper-parameter optimization.

## MNIST-1D Dataset Overview

For this chapter, we'll use the **MNIST-1D dataset** — a simplified 1D version of the classic MNIST handwritten digit dataset.

<div style="display: flex; justify-content: space-between; gap: 20px;">

<div style="flex: 1;">

### Dataset Characteristics

| **Property** | **Value** | **Description** |
|--------------|-----------|-----------------|
| **Classes** | 10 | Digits 0–9 (i.e., $y \in \{0, 1, 2, \ldots, 9\}$) |
| **Training samples** | 4,000 | Total examples ($I = 4000$) |
| **Class distribution** | ~400 per class | Uniformly distributed (balanced dataset) |
| **Input dimensions** | 40 | Each $\mathbf{x}_i \in \mathbb{R}^{40}$ is a **synthetically generated** 1D signal (not derived from 2D images) |
| **Data generation** | Template-based | Created from scratch using:<br>1. Hand-crafted 1D template curves for each digit<br>2. Random transformations (shift, scale, pad)<br>3. Additive noise |

</div>

<div style="flex: 1;">

### Training Configuration

| **Hyperparameter** | **Value** | **Explanation** |
|--------------------|-----------|-----------------|
| **Optimizer** | SGD | Stochastic Gradient Descent |
| **Batch size** | 100 | Samples per gradient update |
| **Learning rate** | 0.1 | Step size for parameter updates |
| **Total steps** | 6,000 | Number of gradient updates |
| **Epochs** | 150 | Full passes through data<br>($6000 \times 100 / 4000 = 150$) |
| **Loss function** | Cross-Entropy | Multiclass classification loss |

</div>

</div>

---

<div align="center">

### Model Architecture

We use a **fully connected neural network** with the following structure:

<img src="../Lessons/images/chap7/MNIST_net.png" width="900" />
</div>

**Total parameters:** $(40 \times 100) + 100 + (100 \times 100) + 100 + (100 \times 10) + 10 = 15{,}210$
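
The parameter count and the number of epochs can be checked with a few lines of Python. This is only a sketch: the layer widths (40, 100, 100, 10) are read off the formula above, and `count_parameters` is an illustrative helper, not part of the notebook's own code.

```python
# Layer widths implied by the parameter count in the text: 40 -> 100 -> 100 -> 10.
layer_sizes = [40, 100, 100, 10]

def count_parameters(sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each fully connected layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

total_params = count_parameters(layer_sizes)

# Epochs follow from the training configuration table:
# (gradient steps * batch size) / number of training samples.
total_steps, batch_size, n_train = 6000, 100, 4000
epochs = total_steps * batch_size // n_train

print(f"parameters: {total_params}, epochs: {epochs}")
```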

### Observations 

After 4000 steps, the training data are classified perfectly.
The training loss keeps decreasing, eventually approaching 0.

**Our Testing Data**
We generated another 1,000 examples using the same process.
The test error decreases as training proceeds, but only down to about 40%.
This is an improvement over random guessing (a 90% error rate with 10 balanced classes), but it is far worse than the performance on the training data.

---


## Sources of Error

To visualize why models fail to generalize, we analyze a **1D least squares regression problem** where the data generation process is fully known.

**1. The Ground Truth (Data Generation)**
We generate training and test data by sampling inputs $x \in [0,1]$, passing them through a quasi-sinusoidal function, and adding Gaussian noise with fixed variance $\sigma^2$:
$$y_{\text{true}} = A\sin(\phi_0 + \phi_1 x) + \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, \sigma^2)$$


<div align="center">
<img src="../Lessons/images/chap7/Modelfunc.png" width="400" />
<img src="../Lessons/images/chap7/sinusModel.png" width="538" />
</div>
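
The data generation process can be sketched in NumPy as follows. The specific values ($A = 1$, $\phi_0 = 0$, $\phi_1 = 2\pi$, $\sigma = 0.3$), the uniform sampling of $x$, and the function names are assumptions made for illustration; they are not taken from `BiasVarianceVisualizer`.

```python
import numpy as np

def true_function(x, A=1.0, phi0=0.0, phi1=2 * np.pi):
    """Noise-free target A * sin(phi0 + phi1 * x); parameter values are assumed."""
    return A * np.sin(phi0 + phi1 * x)

def generate_data(n, sigma=0.3, rng=None):
    """Sample n inputs uniformly from [0, 1] and add N(0, sigma^2) noise to the target."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(0.0, 1.0, size=n)
    y = true_function(x) + rng.normal(0.0, sigma, size=n)
    return x, y

# Training and test sets come from the same process, only the sample differs.
rng = np.random.default_rng(0)
x_train, y_train = generate_data(15, rng=rng)
x_test, y_test = generate_data(1000, rng=rng)
```

Because the noise is added after the function is evaluated, the same input $x$ can yield different outputs $y$ in different samples.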


**2. The Model (Approximation)**
We fit this data using a simplified shallow neural network with $D$ hidden units.
* **Structure:** It forms a piecewise linear function with "joints" evenly spaced at intervals of $1/D$.
* **Optimization:** This specific architecture allows for a **closed-form solution**, guaranteeing we find the global minimum of the least squares loss (the sum of squared errors, which has the same minimizer as the Mean Squared Error):
    $$L[\phi] = \sum_{i=1}^N(f[x_i, \phi] - y_i)^2$$

By eliminating optimization uncertainty (since we find the global minimum), we can isolate and analyze the theoretical sources of error strictly arising from the model's capacity and the data itself.
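
A minimal sketch of such a model, assuming it can be written as a constant plus $D$ ReLU units with joints fixed at $0, 1/D, \ldots, (D-1)/D$. Under that assumption the output is linear in the remaining parameters, so `np.linalg.lstsq` returns the global minimum of the loss directly. The parameterisation and the names `features`, `fit`, and `predict` are illustrative and may differ from the model used in the figures.

```python
import numpy as np

def features(x, D):
    """Design matrix: a constant column plus ReLU(x - d/D) for d = 0, ..., D-1."""
    joints = np.arange(D) / D
    hidden = np.maximum(0.0, x[:, None] - joints[None, :])
    return np.hstack([np.ones((x.size, 1)), hidden])

def fit(x, y, D):
    """Global minimiser of the sum of squared errors via linear least squares."""
    params, *_ = np.linalg.lstsq(features(x, D), y, rcond=None)
    return params

def predict(x, params):
    """Evaluate the piecewise linear model; D is recovered from the parameter count."""
    return features(x, params.size - 1) @ params

# Assumed toy data: noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=50)

losses = {}
for D in (1, 3, 10):
    params = fit(x, y, D)
    losses[D] = float(np.sum((predict(x, params) - y) ** 2))
    print(f"D={D:2d}  training loss={losses[D]:.3f}")
```

No learning rate, initialisation, or stopping criterion appears anywhere in the fit, which is what lets us attribute the remaining error to the model and the data alone.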

<div align="center">

### Three Sources of Generalization Error

</div>

| | **Noise (Irreducible Error)** | **Bias (Model Rigidity)** | **Variance (Model Sensitivity)** |
|---|---|---|---|
| **Definition** | Random error inherent in the data generation process | Systematic error from insufficient model complexity | Error from model's sensitivity to specific training data |
| **Cause** | Random noise in data:<br>• Stochastic processes<br>• Mislabeling<br>• Unobserved variables | Model architecture cannot capture true function's complexity | Limited training data prevents distinguishing signal from noise |
| **Example** | Multiple valid outputs $y$ for same input $x$ due to $\epsilon \sim \mathcal{N}(0, \sigma^2)$ | Piecewise linear model with 3 segments cannot fit smooth sine curve | Model fits training noise, produces different results with different data samples |
| **Can be reduced by...** | ❌ **Cannot be reduced**<br>(intrinsic to problem) | ✅ Increasing model capacity<br>(more hidden units, deeper networks) | ✅ More training data<br>✅ Regularization<br>✅ Ensemble methods |
| **Training performance** | ✅ Can achieve 0 training error<br>(by memorizing noise) | ❌ Cannot achieve 0 training error<br>(structural limitation) | ✅ Can achieve 0 training error<br>(overfitting) |
| **Test performance** | ❌ Always contributes to test error | ❌ Contributes when model too simple<br>(**underfitting**) | ❌ Contributes when model too flexible<br>(**overfitting**) |
| **Mathematical representation** | $\mathbb{E}[(y - \mathbb{E}[y \mid x])^2]$ | $\mathbb{E}[(\mathbb{E}[\hat{f}(x)] - f^*(x))^2]$ | $\mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]$ |
| **Also known as** | Bayes error, aleatoric uncertainty | Approximation error, underfitting | Estimation error, overfitting |
| **Images** | <img src="../images/chap7/sinusNoise.png" width="450" /> | <img src="../images/chap7/sinusBias.png" width="350" /> | <img src="../images/chap7/sinusVar.png" width="350" /> |



**Key insight:** Total test error = Noise + Bias² + Variance

The derivation of this decomposition can be found in the proof directory.

- **Noise** is unavoidable
- **Bias** decreases as model complexity increases
- **Variance** increases as model complexity increases

This creates the **bias-variance tradeoff** — the central challenge in model selection.
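
The tradeoff can be estimated numerically by refitting the model on many independent training sets and comparing the predictions. The sketch below assumes the target $\sin(2\pi x)$, noise $\sigma = 0.3$, 30 training points per dataset, and a ReLU parameterisation with joints at multiples of $1/D$; all of these are illustrative choices, not values taken from the figures.

```python
import numpy as np

SIGMA = 0.3  # assumed noise standard deviation

def target(x):
    """Assumed noise-free function (A = 1, phi_0 = 0, phi_1 = 2*pi)."""
    return np.sin(2 * np.pi * x)

def features(x, D):
    """Constant column plus ReLU(x - d/D) for d = 0, ..., D-1 (assumed parameterisation)."""
    joints = np.arange(D) / D
    return np.hstack([np.ones((x.size, 1)), np.maximum(0.0, x[:, None] - joints)])

def estimate_bias_variance(D, n_train=30, n_datasets=300, seed=0):
    """Monte Carlo estimate of bias^2 and variance, averaged over a grid on [0, 1]."""
    rng = np.random.default_rng(seed)
    x_eval = np.linspace(0.0, 1.0, 200)
    phi_eval = features(x_eval, D)
    preds = np.empty((n_datasets, x_eval.size))
    for k in range(n_datasets):
        # Each iteration draws a fresh training set and refits in closed form.
        x = rng.uniform(0.0, 1.0, size=n_train)
        y = target(x) + rng.normal(0.0, SIGMA, size=n_train)
        params, *_ = np.linalg.lstsq(features(x, D), y, rcond=None)
        preds[k] = phi_eval @ params
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - target(x_eval)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

results = {D: estimate_bias_variance(D) for D in (1, 3, 8)}
for D, (bias_sq, variance) in results.items():
    total = SIGMA**2 + bias_sq + variance
    print(f"D={D}: noise={SIGMA**2:.3f}  bias^2={bias_sq:.3f}  "
          f"variance={variance:.3f}  total={total:.3f}")
```

The noise term is the same for every $D$ because it depends only on the data generation process, while the other two terms move in opposite directions as the model grows.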

## Reducing Error 

### Reducing Variance 

Variance emerges when a model is overly sensitive to the specific training data it sees. This happens due to:

1. **Limited data**: Insufficient samples prevent the model from learning the true underlying pattern
2. **Noisy data**: Random fluctuations in training examples are mistaken for signal
3. **Sparse coverage**: Data concentrated in specific regions of the input space, leaving other regions poorly characterized

**Solutions:**
- **More training data**: Averaging over more examples helps distinguish true signal from noise
- **Better data coverage**: Ensuring samples are well-distributed across the input space
- **Regularization**: Constraining model complexity (e.g., L2 penalty, dropout)
- **Ensemble methods**: Averaging predictions from multiple models trained on different data subsets
- **Early stopping**: Preventing the model from fitting training noise
- **Data augmentation**: Artificially expanding the training set with transformed examples

When a model has high variance, it will fit the training data extremely well but perform poorly on unseen test data — this is **overfitting**.
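
Two of the remedies above, more training data and an L2 penalty, can be compared in a short experiment. This is a sketch under assumed settings (target $\sin(2\pi x)$, $\sigma = 0.3$, $D = 5$, penalty weight 0.1); `fit_ridge` and `prediction_variance` are illustrative helpers, and for simplicity the penalty is applied to every parameter, including the constant term.

```python
import numpy as np

SIGMA = 0.3  # assumed noise standard deviation

def target(x):
    """Assumed noise-free function."""
    return np.sin(2 * np.pi * x)

def features(x, D):
    """Constant column plus ReLU(x - d/D) for d = 0, ..., D-1 (assumed parameterisation)."""
    joints = np.arange(D) / D
    return np.hstack([np.ones((x.size, 1)), np.maximum(0.0, x[:, None] - joints)])

def fit_ridge(x, y, D, lam):
    """Minimise ||phi @ p - y||^2 + lam * ||p||^2 by solving an augmented least squares system."""
    phi = features(x, D)
    n_params = phi.shape[1]
    phi_aug = np.vstack([phi, np.sqrt(lam) * np.eye(n_params)])
    y_aug = np.concatenate([y, np.zeros(n_params)])
    params, *_ = np.linalg.lstsq(phi_aug, y_aug, rcond=None)
    return params

def prediction_variance(n_train, lam, D=5, n_datasets=300, seed=0):
    """Variance of the fitted function across independent training sets, averaged over [0, 1]."""
    rng = np.random.default_rng(seed)
    x_eval = np.linspace(0.0, 1.0, 200)
    phi_eval = features(x_eval, D)
    preds = np.empty((n_datasets, x_eval.size))
    for k in range(n_datasets):
        x = rng.uniform(0.0, 1.0, size=n_train)
        y = target(x) + rng.normal(0.0, SIGMA, size=n_train)
        preds[k] = phi_eval @ fit_ridge(x, y, D, lam)
    return float(np.mean(preds.var(axis=0)))

var_small = prediction_variance(n_train=20, lam=0.0)   # little data, no penalty
var_large = prediction_variance(n_train=400, lam=0.0)  # 20x more data
var_ridge = prediction_variance(n_train=20, lam=0.1)   # little data, L2 penalty
print(f"n=20: {var_small:.4f}   n=400: {var_large:.4f}   n=20 + L2: {var_ridge:.4f}")
```

Regularization lowers the variance without new data, but it also constrains the model, so it generally adds some bias in exchange.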

### Reducing Bias

Bias emerges when a model lacks the expressiveness to capture the true underlying pattern in the data. This happens due to:

1. **Insufficient model capacity**: The architecture has too few parameters or layers to represent complex functions
2. **Wrong model family**: Using a model class fundamentally unsuited to the problem (e.g., linear models for non-linear data)
3. **Overly restrictive assumptions**: Imposing constraints that exclude the true data-generating function

**Solutions:**
- **Increase model capacity**: Add more hidden units, layers, or parameters
- **Use more expressive architectures**: Switch to models with greater representational power (e.g., from linear to neural networks)
- **Feature engineering**: Create more informative input representations
- **Remove unnecessary constraints**: Reduce regularization if it's too strong
- **Ensemble of diverse models**: Combine predictions from different model families

**Theoretical foundation**: The Universal Approximation Theorem guarantees that sufficiently wide neural networks can approximate any continuous function on a compact domain arbitrarily well.

When a model has high bias, it will perform poorly on both training and test data — this is **underfitting**.
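
To see bias in isolation, we can fit a noise-free target on a dense grid, so that neither noise nor limited data plays a role and the remaining error is purely the model's approximation error. The sketch assumes the target $\sin(2\pi x)$ and a ReLU parameterisation with joints at multiples of $1/D$; both are illustrative choices.

```python
import numpy as np

def target(x):
    """Assumed noise-free function."""
    return np.sin(2 * np.pi * x)

def features(x, D):
    """Constant column plus ReLU(x - d/D) for d = 0, ..., D-1 (assumed parameterisation)."""
    joints = np.arange(D) / D
    return np.hstack([np.ones((x.size, 1)), np.maximum(0.0, x[:, None] - joints)])

# Dense, noise-free data: what is left after fitting is the approximation error.
x = np.linspace(0.0, 1.0, 1000)
y = target(x)

approx_error = {}
for D in (1, 2, 4, 8, 16):
    phi = features(x, D)
    params, *_ = np.linalg.lstsq(phi, y, rcond=None)
    approx_error[D] = float(np.mean((phi @ params - y) ** 2))
    print(f"D={D:2d}  mean squared approximation error={approx_error[D]:.5f}")
```

Each doubling of $D$ keeps the previous joints and adds new ones, so the error cannot increase from one setting to the next.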

In [2]:
visualizer.interactive_plot()
