# Regularization & Hyperparameter Optimization

In the previous section we discussed the three main sources of generalisation error. In this section we discuss *regularization* techniques.

These are methods that reduce the gap between training and test performance. We cover three families:

1. Explicit Regularization
2. Implicit Regularization
3. Heuristic Methods

Finally, we discuss in detail the hyperparameters involved in a deep network and methods for finding good values for them.

## Explicit Regularization

Consider fitting a model $f[x, \phi]$ with parameters $\phi$ using a training set $\{x_i, y_i\}$ of input/output pairs. We seek:

$$\begin{align} \hat{\phi} 
&= \mathbf{argmin}_{\phi}[L[\phi]] \\
&= \mathbf{argmin}_{\phi}[\sum_{i=1}^N l_i[x_i, y_i]]
\end{align}$$

In the previous section we saw that providing more data helps the model generalise better.

Here we assume that more data is not available, so instead we apply **constraints** to the model by adding a penalty term to the loss:

$$\mathbf{argmin}_{\phi}\left(\sum_{i=1}^N l_i[x_i, y_i] + \lambda \cdot g[\phi]\right)$$

$g[\phi]: \quad $ A scalar function that returns larger values for less-preferred parameter settings, adding a penalty to the loss. <br>
$0 \le \lambda: \quad$ Controls how much this penalty contributes to the overall loss function.

- So why does this work? 
- What's happening by adding this regularization term?

### Probabilistic view


Recall the Maximum Likelihood Criterion:

$$\hat{\phi} = \mathbf{argmax}_{\phi}\left[\prod_{i=1}^N Pr(y_i | x_i, \phi) \right]$$

The regularization term can be interpreted as a prior $Pr(\phi)$ that represents the knowledge we have about the parameters **before** we observe the data. This gives the maximum a posteriori (MAP) criterion:

$$\hat{\phi} = \mathbf{argmax}_{\phi}\left[Pr(\phi) \prod_{i=1}^N Pr(y_i | x_i, \phi) \right]$$

Taking the negative log turns this into a minimization (the negative log-likelihood, NLL, plus a term from the prior):

$$\begin{align} 
\hat{\phi} 
&= \mathbf{argmin}_{\phi}\left[-\log\left(Pr(\phi) \prod_{i=1}^N Pr(y_i | x_i, \phi)\right)\right] \\
&= \mathbf{argmin}_{\phi} \left[- \sum_{i=1}^N \log Pr(y_i | x_i, \phi) - \log Pr(\phi) \right] \\
&= \mathbf{argmin}_{\phi}\left(\sum_{i=1}^N l_i[x_i, y_i] + \lambda \cdot g[\phi]\right)
\end{align}$$

where $l_i[x_i, y_i] = -\log Pr(y_i | x_i, \phi)$ and $\lambda \cdot g[\phi] = -\log Pr(\phi)$. The prior appears once, not once per data point.



### L2 Regularization (Weight Decay)

**Definition:**
$$\text{L2}[\phi] = \|\boldsymbol{\phi}\|^2 = \sum_{j} \phi_j^2$$

**Key Properties:**
- Typically applied only to **weights**, not bias terms → hence called **weight decay**
- Encourages smaller weight magnitudes → produces **smoother** output functions
- Hyperparameter $\lambda$ controls the strength of regularization

**Why Does This Help?**

| Scenario | Effect of L2 Regularization |
|----------|---------------------------|
| **Overfitting** | Forces the model to balance fitting training data against keeping weights small. The network can't memorize noise because large weights are penalized. |
| **Over-parameterization** | When the model has excess capacity (especially in regions with sparse/no training data), L2 favors smooth interpolation between nearby training points rather than erratic predictions. |

**Intuition:**

Think of L2 regularization as applying "friction" to the weights:
- Without regularization: Weights can grow arbitrarily large to perfectly fit every training point (including noise)
- With L2 regularization: Large weights are expensive, so the model uses smaller weights and creates smoother, more generalizable functions

**Example:**
```
Without L2: w = [10.2, -8.5, 12.1, -9.8] → fits training data perfectly but wiggly
With L2:    w = [2.1, -1.8, 2.3, -2.0]  → fits training data well, smoother predictions
```

**Probabilistic Interpretation:**

L2 regularization corresponds to placing a **Gaussian prior** on the parameters:
$$\text{Pr}(\phi) = \mathcal{N}(0, \sigma^2)$$

This encodes the belief that parameters should be small and centered around zero before seeing any data.
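
As a short worked step, assuming an independent zero-mean Gaussian prior with the same variance $\sigma^2$ on every parameter $\phi_j$ (an assumption made here for illustration), the negative log prior reduces to the L2 penalty up to an additive constant:

$$-\log \text{Pr}(\phi) = -\sum_j \log \mathcal{N}(\phi_j; 0, \sigma^2) = \frac{1}{2\sigma^2}\sum_j \phi_j^2 + \text{const}$$

Under this assumption $\lambda = \frac{1}{2\sigma^2}$, so a narrower prior (smaller $\sigma^2$) corresponds to stronger regularization.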

**Practical Tips:**
- Start with $\lambda \in [0.001, 0.1]$ and tune via validation set
- Larger $\lambda$ → stronger regularization → simpler model
- Too large $\lambda$ → underfitting (model too constrained)
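
The name "weight decay" can be made concrete with a minimal sketch. It assumes plain gradient descent, a hypothetical step size `alpha`, and a data-loss gradient supplied by the caller; none of these names come from the text above. The penalty $\lambda \sum_j \phi_j^2$ contributes $2\lambda\phi_j$ to the gradient, so each step first shrinks every weight by the factor $(1 - 2\alpha\lambda)$.

```python
def gradient_step_with_l2(weights, data_grads, alpha, lam):
    """One gradient step on: data loss + lam * sum(w^2).

    The penalty adds 2 * lam * w to the gradient, so the update becomes
    w <- (1 - 2 * alpha * lam) * w - alpha * data_grad.
    """
    decay = 1.0 - 2.0 * alpha * lam
    return [decay * w - alpha * g for w, g in zip(weights, data_grads)]


weights = [10.0, -8.0, 12.0]
zero_grads = [0.0, 0.0, 0.0]  # pretend the data loss is flat at this point

# With no data gradient, only the decay acts: each step multiplies by 0.9.
for _ in range(3):
    weights = gradient_step_with_l2(weights, zero_grads, alpha=0.1, lam=0.5)
print(weights)
```

With a flat data loss the weights shrink geometrically toward zero, which is the "friction" described above. In practice the data gradient pushes back, and the balance between the two sets the final weight scale.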

---

### L1 Regularization (Lasso)

**Definition:**
$$\text{L1}[\phi] = \|\boldsymbol{\phi}\|_1 = \sum_{j} |\phi_j|$$

**Key Properties:**
- Penalizes the **absolute values** of weights (not squared)
- Encourages **sparsity** → drives many weights exactly to zero
- Acts as an automatic **feature selection** mechanism
- Hyperparameter $\lambda$ controls the strength of regularization


**Why Does This Help?**

| Scenario | Effect of L1 Regularization |
|----------|---------------------------|
| **Feature Selection** | Automatically identifies and eliminates irrelevant features by setting their weights to exactly zero. Useful when you suspect only a subset of features matter. |
| **High-Dimensional Data** | When you have many input features but limited data, L1 creates simpler, more interpretable models by keeping only the most important features. |
| **Overfitting** | Reduces model complexity by forcing the network to use fewer parameters, preventing it from memorizing noise. |

**Intuition:**

Think of L1 regularization as a "harsh judge": it keeps a few weights that carry most of the signal and sets the rest to exactly zero, a "winner takes all" outcome.

**Example:**
```
Without L1: w = [2.3, -1.8, 0.4, -0.9, 1.2, 0.3, -0.5]  → uses all features
With L1:    w = [4.1,  0.0, 0.0,  0.0, 2.8, 0.0,  0.0]  → only 2 features survive
```

**Probabilistic Interpretation:**

L1 regularization corresponds to placing a **Laplace (double exponential) prior** on the parameters:
$$\text{Pr}(\phi) = \frac{1}{2b}\exp\left(-\frac{|\phi|}{b}\right)$$

This prior has a sharp peak at zero, explaining why L1 drives weights to exactly zero.
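
As a short worked step, assuming an independent Laplace prior with the same scale $b$ on every parameter $\phi_j$ (an assumption made here for illustration), the negative log prior reduces to the L1 penalty up to an additive constant:

$$-\log \text{Pr}(\phi) = -\sum_j \log\left[\frac{1}{2b}\exp\left(-\frac{|\phi_j|}{b}\right)\right] = \frac{1}{b}\sum_j |\phi_j| + \text{const}$$

Under this assumption $\lambda = \frac{1}{b}$, so a more sharply peaked prior (smaller $b$) corresponds to stronger regularization.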

---



<div align="center">

**L1 vs L2 Comparison:**

| Property | L1 (Lasso) | L2 (Ridge) |
|----------|-----------|-----------|
| **Formula** | $\sum_j \|\phi_j\|$ | $\sum_j \phi_j^2$ |
| **Effect on weights** | Sparse (many zeros) | Shrinks all weights toward zero (rarely exactly zero) |
| **Feature selection** | ✅ Yes (automatic) | ❌ No (keeps all features) |
| **Solution uniqueness** | Not always unique | Unique for convex losses (e.g., linear regression) |
| **Best for** | High-dimensional data, interpretability | Correlated features, smooth functions |
| **Computational cost** | Higher (non-differentiable at 0) | Lower (smooth everywhere) |

</div>

**Example: L1 vs L2 Regularization**

Given $\mathbf{x} = [2, 1, 1, 1]$ and three weight vectors:
- $\mathbf{w}_1 = [0.5, 0, 0, 0]$
- $\mathbf{w}_2 = [0.125, 0.25, 0.25, 0.25]$
- $\mathbf{w}_3 = [0, 0.5, 0.5, 0]$

All produce the same output: $\mathbf{w} \cdot \mathbf{x} = 1$

<div align="center">

**Regularization Values:**

| Weight Vector | L1 Penalty $\sum_j \|w_j\|$ | L2 Penalty $\sum_j w_j^2$ |
|--------------|---------|---------|
| $\mathbf{w}_1$ | 0.5 | 0.25 |
| $\mathbf{w}_2$ | 0.875 | 0.203 |
| $\mathbf{w}_3$ | 1.0 | 0.5 |

</div>

**Conclusion:**
- **L1 prefers $\mathbf{w}_1$**: Sparse solution (3 zeros) with smallest L1 norm
- **L2 prefers $\mathbf{w}_2$**: Spreads weights evenly with smallest L2 norm

**Why?**
- L1 penalizes the sum of absolute values → favors fewer non-zero weights
- L2 penalizes the sum of squares → favors many small weights over few large ones
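
The numbers in this example can be reproduced with a few lines of plain Python. This is a minimal sketch: the helper names `dot`, `l1_penalty`, and `l2_penalty` are chosen here for illustration, and the L2 value is the sum of squares, matching the definition used earlier.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def l1_penalty(w):
    """Sum of absolute values of the weights."""
    return sum(abs(wi) for wi in w)


def l2_penalty(w):
    """Sum of squared weights."""
    return sum(wi * wi for wi in w)


x = [2, 1, 1, 1]
candidates = {
    "w1": [0.5, 0, 0, 0],
    "w2": [0.125, 0.25, 0.25, 0.25],
    "w3": [0, 0.5, 0.5, 0],
}

for name, w in candidates.items():
    print(name, dot(w, x), l1_penalty(w), round(l2_penalty(w), 3))

# Pick the candidate each penalty prefers (smallest penalty value).
best_l1 = min(candidates, key=lambda k: l1_penalty(candidates[k]))
best_l2 = min(candidates, key=lambda k: l2_penalty(candidates[k]))
print(best_l1, best_l2)
```

All three candidates give the same output, so the data loss cannot distinguish them; only the penalty term breaks the tie.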

<div align="center">

**Geometric Interpretation:**


| L1 Constraint Region | L2 Constraint Region |
|:-------------------:|:-------------------:|
| Diamond shape with corners on axes | Circular shape |
| Optimal solution likely hits corner (→ sparsity) | Optimal solution hits smooth boundary |

<br>

<img src="../images/chap7/L2L1.png" width="450"/>

</div>



---




**Practical Tips:**
- Start with $\lambda \in [0.0001, 0.01]$ for L1 (typically smaller than L2)
- Use L1 when you suspect many features are irrelevant
- Use L2 when all features might contribute (even slightly)
- Consider **Elastic Net** (L1 + L2 combined) for the best of both worlds:
  $$g[\phi] = \lambda_1 \sum_j |\phi_j| + \lambda_2 \sum_j \phi_j^2$$




## Implicit Regularization




**Discovery:** Research from 2017-2019 revealed that SGD doesn't converge to arbitrary minima. Instead, it exhibits an implicit bias toward solutions with lower "complexity" (smaller norms, flatter minima), acting as a regularizer even without explicit penalty terms.

---

### Continuous vs. Discrete Gradient Descent

**Discrete Update Rule:**
$$\phi_1 = \phi_0 + \alpha \cdot g[\phi_0]$$

where $g[\phi_0] = -\frac{\partial L}{\partial \phi}\bigg|_{\phi_0}$ is the negative gradient and $\alpha$ is the step size.

**Continuous Limit:**
As $\alpha \to 0$, gradient descent follows the differential equation:
$$\frac{d\phi}{dt} = g[\phi]$$

**Key Insight:** For typical (finite) step sizes $\alpha$, the discrete and continuous versions follow different trajectories and can converge to **different solutions**.

---

### Backward Error Analysis

**Goal:** Find a correction term $g_1[\phi]$ such that the modified continuous dynamics:

$$\frac{d\phi}{dt} \approx g[\phi] + \alpha g_1[\phi] + O(\alpha^2)$$

produces the same trajectory as the discrete update.

**Derivation:**

Consider a Taylor expansion of the modified continuous solution around $\phi_0$:

$$\phi[\alpha] \approx \phi + \alpha\frac{d\phi}{dt} + \frac{\alpha^2}{2}\frac{d^2\phi}{dt^2}\bigg|_{\phi=\phi_0}$$

Substituting the modified dynamics:

$$\phi[\alpha] \approx \phi + \alpha(g[\phi] + \alpha g_1[\phi]) + \frac{\alpha^2}{2}\left(\frac{\partial g[\phi]}{\partial \phi}\frac{d\phi}{dt} + \alpha\frac{\partial g_1[\phi]}{\partial \phi}\frac{d\phi}{dt}\right)\bigg|_{\phi=\phi_0}$$

Using $\frac{d\phi}{dt} \approx g[\phi]$ to leading order and dropping terms of higher order than $\alpha^2$:

$$\phi[\alpha] \approx \phi + \alpha g[\phi] + \alpha^2\left(g_1[\phi] + \frac{1}{2}\frac{\partial g[\phi]}{\partial \phi}g[\phi]\right)\bigg|_{\phi=\phi_0}$$

**Matching Terms:** The first two terms match the discrete update $\phi_0 + \alpha g[\phi_0]$. To ensure equivalence, the $O(\alpha^2)$ term must vanish:

$$g_1[\phi] = -\frac{1}{2}\frac{\partial g[\phi]}{\partial \phi}g[\phi]$$

---

### Equivalent Regularized Loss

Since $g[\phi] = -\frac{\partial L}{\partial \phi}$, the modified dynamics become:

$$\frac{d\phi}{dt} \approx -\frac{\partial L}{\partial \phi} - \frac{\alpha}{2}\frac{\partial^2 L}{\partial \phi^2}\frac{\partial L}{\partial \phi}$$

This is equivalent to performing continuous gradient descent on the **regularized loss**:

$$\bar{L}[\phi] = L[\phi] + \frac{\alpha}{4}\left\|\frac{\partial L}{\partial \phi}\right\|^2$$

**Proof:** Taking the gradient of $\bar{L}[\phi]$:

$$\frac{\partial \bar{L}}{\partial \phi} = \frac{\partial L}{\partial \phi} + \frac{\alpha}{2}\frac{\partial^2 L}{\partial \phi^2}\frac{\partial L}{\partial \phi}$$

which matches the modified dynamics above.
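
A quick numerical check can make this result tangible. The sketch below assumes a one-dimensional quadratic loss $L[\phi] = \frac{1}{2}a\phi^2$, chosen here only because both gradient flows then have closed-form solutions; the names `gd_step` and `flow` are illustrative. It compares one discrete step of size $\alpha$ with running each continuous flow for time $\alpha$.

```python
import math


def gd_step(phi0, a, alpha):
    """One discrete gradient descent step on L = 0.5 * a * phi**2."""
    return phi0 - alpha * a * phi0


def flow(phi0, rate, t):
    """Exact solution of d(phi)/dt = -rate * phi after time t."""
    return phi0 * math.exp(-rate * t)


phi0, a, alpha = 1.0, 1.0, 0.1

discrete = gd_step(phi0, a, alpha)

# Gradient flow on the original loss: d(phi)/dt = -a * phi.
plain = flow(phi0, a, alpha)

# Gradient flow on L + (alpha / 4) * (dL/dphi)**2, whose gradient is
# (a + 0.5 * alpha * a**2) * phi for this quadratic loss.
modified = flow(phi0, a + 0.5 * alpha * a**2, alpha)

print("discrete step :", discrete)
print("plain flow    :", plain, "error", abs(discrete - plain))
print("modified flow :", modified, "error", abs(discrete - modified))
```

For this toy loss the modified flow tracks the discrete step more closely than the plain flow, which is what the backward error analysis predicts. The match is still approximate because terms beyond $\alpha^2$ were dropped.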

---

### Interpretation

**Implicit Regularization Term:**
$$\text{Penalty} = \frac{\alpha}{4}\left\|\nabla L[\phi]\right\|^2$$

**Key Consequences:**
- **Larger learning rate** $\alpha$ → **stronger implicit regularization**
- Penalizes regions with **steep gradients** → favors **flat minima**
- Flat minima tend to generalize better (less sensitive to parameter perturbations)

**Formal Statement:**

> Discrete gradient descent with step size $\alpha$ is approximately equivalent (to first order in $\alpha$) to continuous gradient flow on a regularized objective that includes a penalty proportional to the squared gradient norm.

This explains why SGD (with finite learning rate and stochasticity) acts as an implicit regularizer, preferring solutions in flatter regions of the loss landscape.

---

### Implicit Regularization in SGD

**Extension to Stochastic Gradient Descent:**

For SGD, where each epoch is split into $B$ batches, the effective regularized loss becomes:

$$L_{\text{SGD}}[\phi] = L[\phi] + \frac{\alpha}{4}\left\|\frac{\partial L}{\partial \phi}\right\|^2 + \frac{\alpha}{4B}\sum_{b=1}^{B}\left\|\frac{\partial L_b}{\partial \phi} - \frac{\partial L}{\partial \phi}\right\|^2$$

where $L_b$ is the mean loss on the $b$-th of the $B$ batches and $L$ is the mean loss over the full dataset.

**The Extra Term:**

$$\text{Batch variance penalty} = \frac{\alpha}{4B}\sum_{b=1}^{B}\left\|\nabla L_b - \nabla L\right\|^2$$

This is proportional to the **variance of batch gradients**: it measures how much different batches disagree about the gradient.

**Key Consequences:**

| Property | Effect |
|----------|--------|
| **Favors consensus** | Prefers solutions where all batches agree on gradient direction |
| **Smaller batches** | Stronger variance penalty (batch gradients are noisier, so they deviate more from the full gradient; with a single full batch the term is zero) |
| **Better generalization** | Solutions that fit *all* data consistently, not just some data extremely well |

**Why SGD generalizes better:**

Beyond just exploration via randomness, SGD's implicit regularization encourages solutions where the model performs uniformly well across all data subsets, rather than overfitting to specific subsets.

**Practical insight:** Smaller batch sizes → stronger implicit regularization → often better test performance.
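
The dependence of the batch-variance term on batch size can be checked on a toy problem. The sketch below assumes a one-parameter model $w \cdot x$ with squared error, a small hand-written dataset, and contiguous batches; all of these are illustrative choices, not part of the text above. It computes the mean squared deviation of batch gradients from the full gradient (the sum above divided by the number of batches).

```python
def batch_gradient_variance(xs, ys, w, batch_size):
    """Mean squared deviation of batch gradients from the full gradient.

    Uses the per-sample loss (w * x - y)**2, whose gradient with respect
    to w is 2 * (w * x - y) * x. Batches are contiguous slices.
    """
    grads = [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]
    full = sum(grads) / len(grads)
    batches = [grads[i:i + batch_size] for i in range(0, len(grads), batch_size)]
    batch_means = [sum(b) / len(b) for b in batches]
    return sum((g - full) ** 2 for g in batch_means) / len(batch_means)


# Toy data, roughly y = 2x with some noise (values made up for illustration).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.3, 3.8, 6.1, 8.4, 9.7, 12.2, 13.9, 16.3]
w = 1.5

variances = {size: batch_gradient_variance(xs, ys, w, size) for size in (1, 2, 4, 8)}
for size, value in variances.items():
    print(f"batch size {size}: variance term {value:.3f}")
```

On this dataset the term shrinks as the batch size grows and vanishes for a single full batch, since the one batch gradient then equals the full gradient.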

---

## Heuristic Methods 

### Early Stopping

**Definition** 

Stop the training procedure before it has fully converged.

**Effect** 

Early stopping reduces overfitting if the model has already captured the general shape of the underlying function but has not yet had time to incorporate the unwanted noise.

There are two possible explanations: 

1. Since the weights are initialised to small values, they don't have enough time to become large, so early stopping has a similar effect to L2 regularisation.
2. Early stopping reduces the model's effective complexity, so on the bias/variance trade-off curve we move away from the critical region and performance improves.
   
**How to determine when to Stop**

1. **Split data:** Training set + Validation set (held-out data not used for training)
2. **Track both losses:** Monitor training loss AND validation loss after each epoch
3. **Stop when validation loss stops improving**
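
The stopping rule above is commonly implemented with a "patience" counter. The following is a minimal sketch that works on a precomputed list of validation losses; the function name `early_stopping_epoch` and the example losses are made up for illustration, and a real training loop would compute one loss per epoch and save the best weights as it goes.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs fail to improve on
    the best validation loss seen so far. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training used every epoch.
    return len(val_losses) - 1, best_epoch


val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

The model to keep is the one from `best_epoch`, not the one from `stop_epoch`, since the later epochs are exactly the ones where validation loss had stopped improving.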

---

### Ensembling

**Definition:**
Train multiple models and combine their predictions to improve generalization.

**Combination Methods:**

| Task Type | Combination Strategy |
|-----------|---------------------|
| **Regression** | Mean or median of outputs |
| **Classification** | Mean of pre-softmax activations, or mode of predicted classes |


**Three Main Approaches:**

1. **Different Initializations**
   - Train same architecture with different random weight initializations
   - Each model captures different aspects of the data
   - Averaging reduces variance in uncertain regions

2. **Bootstrap Aggregating (Bagging)**
   - Resample training data with replacement to create multiple datasets
   - Train separate model on each dataset
   - Reduces overfitting by introducing diversity

3. **Model Diversity**
   - Use different hyperparameters (learning rate, architecture, etc.)
   - Train different model families (CNNs, Transformers, etc.)
   - Combines complementary strengths

**Why it Works:** Individual models make different errors → averaging cancels out mistakes → better generalization.
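
The combination strategies in the table can be sketched with the standard library. The function names and the example predictions below are hypothetical; each inner list stands for the output of one trained model.

```python
import statistics


def ensemble_regression(predictions, method="mean"):
    """Combine scalar regression outputs from several models."""
    if method == "mean":
        return statistics.mean(predictions)
    if method == "median":
        return statistics.median(predictions)
    raise ValueError(f"unknown method: {method}")


def ensemble_by_mean_logits(logits_per_model):
    """Average pre-softmax activations across models, then take the argmax."""
    n_classes = len(logits_per_model[0])
    mean_logits = [
        statistics.mean(model[c] for model in logits_per_model)
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=lambda c: mean_logits[c])


def ensemble_by_vote(predicted_classes):
    """Return the most common predicted class (the mode)."""
    return statistics.mode(predicted_classes)


# Three hypothetical models; one regression output is an outlier.
print(ensemble_regression([2.0, 2.5, 9.0], "mean"))
print(ensemble_regression([2.0, 2.5, 9.0], "median"))

logits = [[2.0, 1.0, 0.0], [0.5, 1.5, 0.0], [0.0, 2.5, 0.5]]
print(ensemble_by_mean_logits(logits))
print(ensemble_by_vote([0, 1, 1]))
```

The median is less affected by the single outlying regression output than the mean, which is one reason to prefer it when individual models occasionally fail badly.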

---

### Dropout

**Definition:**
Randomly set a fraction of hidden units to zero during training at each iteration.

**Hyperparameter:** Dropout rate $p \in [0, 1]$ (probability of dropping a unit)


**How it Works:**

1. **During Training:**
   - For each training iteration, drop (set to 0) each unit independently with probability $p$, so about $p \cdot n$ units are dropped
   - Forward/backward pass uses only the remaining active units, about $(1-p) \cdot n$ on average
   - Different random subset dropped at each iteration

2. **During Testing:**
   - Use **all** units (no dropout)
   - Scale outputs by $(1-p)$ to account for more active units

**Three Main Application Methods:**

| Method | Description | Formula |
|--------|-------------|---------|
| **Standard Dropout** | Apply to hidden layers | $h_{\text{train}} = h \odot m$ where $m \sim \text{Bernoulli}(1-p)$ |
| **Input Dropout** | Apply to input layer (use small $p \approx 0.2$) | $x_{\text{train}} = x \odot m$ |
| **DropConnect** | Drop weights instead of activations | $h = f((W \odot M)\,x)$ where $M$ is a binary mask on the weights |


**Why it Works:**

- **Prevents co-adaptation:** Forces network to learn redundant representations (can't rely on any single unit)
- **Implicit ensemble:** Training different "sub-networks" at each iteration
- **Encourages smaller weights:** Must spread information across many paths

**Typical values:** $p = 0.5$ for hidden layers, $p = 0.2$ for input layer
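
The train/test asymmetry described above can be sketched in plain Python. This is a minimal illustration of standard (non-inverted) dropout on a list of activations; the function name `dropout_forward` and the explicit `rng` argument are choices made here, and deep learning libraries often use the inverted variant, which rescales during training instead of at test time.

```python
import random


def dropout_forward(h, p, training, rng):
    """Standard (non-inverted) dropout on a list of activations.

    Training: each unit is kept with probability 1 - p, otherwise set to 0.
    Testing: all units are kept and scaled by 1 - p.
    """
    if training:
        return [x if rng.random() >= p else 0.0 for x in h]
    return [(1.0 - p) * x for x in h]


rng = random.Random(0)  # fixed seed so the run is repeatable
h = [1.0, 2.0, 3.0, 4.0]

train_out = dropout_forward(h, p=0.5, training=True, rng=rng)
test_out = dropout_forward(h, p=0.5, training=False, rng=rng)

print("train:", train_out)  # some entries zeroed, the rest unchanged
print("test: ", test_out)   # every entry scaled by 1 - p
```

Scaling by $(1-p)$ at test time makes each unit's test output equal to its expected output during training, so the next layer sees inputs of a similar magnitude in both modes.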

---

### Transfer Learning

**Definition:**
Use a model pre-trained on a large dataset and adapt it to a new, related task with limited data.

**Key Idea:** Features learned on one task (e.g., ImageNet) are often useful for other tasks (e.g., medical imaging).

**Two Main Approaches:**

1. **Feature Extraction (Frozen Layers)**
   - Keep pre-trained weights fixed
   - Only train new final layers for your task
   - Fast, works well when data is very limited

2. **Fine-Tuning**
   - Initialize with pre-trained weights
   - Train entire network (or later layers) on new task
   - Use small learning rate to avoid destroying pre-trained features

**Typical Strategy:**

```
Pre-trained model → Remove final layer → Add new task-specific layer → Train
```

**Why it Works:**
- Early layers learn general features (edges, textures)
- Later layers learn task-specific features
- Pre-training provides better initialization than random weights

**Practical Tips:**
- More data → fine-tune more layers
- Less data → freeze more layers
- Typically use a lower learning rate than when training from scratch

---

### Self-Supervised Learning

**Definition:**
Train a model on a pretext task using unlabeled data, then transfer learned representations to downstream tasks.

**Key Idea:** Create "free" supervision from the data itself without human labels.


**Common Pretext Tasks:**

| Task | Description | Example |
|------|-------------|---------|
| **Contrastive Learning** | Learn representations by pulling similar samples together and pushing dissimilar ones apart | SimCLR, MoCo |
| **Masked Prediction** | Predict masked portions of input | BERT (text), MAE (images) |
| **Rotation Prediction** | Predict rotation angle applied to image | Rotate image 0°, 90°, 180°, 270° |
| **Jigsaw Puzzles** | Predict correct arrangement of shuffled patches | Rearrange 9 image patches |
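
To illustrate how a pretext task creates labels for free, here is a minimal sketch of the rotation-prediction setup. Images are represented as nested lists of pixel values, and the names `rotate90` and `make_rotation_examples` are invented for this example. The label is simply the number of 90° clockwise turns applied, so no human annotation is needed.

```python
def rotate90(image):
    """Rotate a 2D list (rows of pixels) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_rotation_examples(image):
    """Return [(rotated_image, label)], where label k means 90 * k degrees."""
    examples = []
    current = [list(row) for row in image]
    for label in range(4):
        examples.append((current, label))
        current = rotate90(current)
    return examples


image = [[1, 2],
         [3, 4]]

examples = make_rotation_examples(image)
for rotated, label in examples:
    print(label, rotated)
```

A classifier trained to predict `label` from `rotated` has to pick up on object orientation and shape, and those features can then be reused for a downstream supervised task.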


**Why it Works:**
- Forces model to learn meaningful features from data structure
- Leverages vast amounts of unlabeled data
- Pre-trained representations transfer well to supervised tasks

**Typical Workflow:**

```
Unlabeled data → Self-supervised pretraining → Fine-tune on labeled task
```

**Advantages:**
- Reduces need for expensive labeled data
- Often outperforms training from scratch
- Scales with unlabeled data, which is far cheaper to collect than the labeled data that supervised learning requires

---

## Hyperparameter Optimization (HPO)

### What is Hyperparameter Optimization?

**Hyperparameters** are configuration settings that control the learning process but are **not learned from data**. Unlike model parameters (weights and biases), which are optimized during training, hyperparameters must be set before training begins.

**Goal of HPO:** Find the hyperparameter configuration that yields the best model performance on unseen data (validation set).

---

### Types of Hyperparameters

#### 1. Optimization Hyperparameters
Control how the model learns from data:

| Hyperparameter | Description | Typical Range |
|----------------|-------------|---------------|
| **Learning rate** $\alpha$ | Step size for gradient updates | $[10^{-5}, 10^{-1}]$ |
| **Batch size** | Number of samples per gradient update | $[16, 512]$ |
| **Number of epochs** | Total passes through training data | $[10, 500]$ |
| **Optimizer type** | SGD, Adam, RMSprop, etc. | Categorical |
| **Momentum** $\beta$ | (for SGD with momentum) | $[0.9, 0.99]$ |
| **Weight decay** $\lambda$ | L2 regularization strength | $[10^{-5}, 10^{-2}]$ |

#### 2. Architecture Hyperparameters
Define the model structure:

| Hyperparameter | Description | Typical Range |
|----------------|-------------|---------------|
| **Number of layers** | Depth of network | $[2, 100+]$ |
| **Hidden units per layer** | Width of network | $[32, 1024+]$ |
| **Activation function** | ReLU, tanh, sigmoid, etc. | Categorical |
| **Dropout rate** $p$ | Fraction of units to drop | $[0.0, 0.5]$ |

#### 3. Regularization Hyperparameters
Control overfitting:

| Hyperparameter | Description | Typical Range |
|----------------|-------------|---------------|
| **L1 regularization** $\lambda_1$ | Lasso penalty strength | $[10^{-6}, 10^{-2}]$ |
| **L2 regularization** $\lambda_2$ | Ridge penalty strength | $[10^{-5}, 10^{-2}]$ |
| **Dropout rate** | Probability of dropping units | $[0.1, 0.5]$ |
| **Early stopping patience** | Epochs to wait before stopping | $[5, 50]$ |

#### 4. Data Preprocessing Hyperparameters

| Hyperparameter | Description |
|----------------|-------------|
| **Data augmentation** | Rotation, flip, crop parameters |
| **Normalization** | Mean/std for standardization |
| **Train/val/test split** | Ratio for data splitting |

---

### How to Identify Your Hyperparameters

**Step 1: List all configurable settings**
- Review your model architecture, optimizer, loss function, and training loop
- Any value you manually set (not learned) is a hyperparameter

**Step 2: Categorize by impact**
- **High impact:** Learning rate, batch size, architecture size
- **Medium impact:** Regularization strength, optimizer choice
- **Low impact:** Momentum, scheduler parameters

**Step 3: Start with high-impact parameters**
- Focus optimization efforts on the most influential hyperparameters
- Fix low-impact parameters to reasonable defaults

---

## Hyperparameter Search Methods

| Method | Process | Pros | Cons | When to Use |
|--------|---------|------|------|-------------|
| **Manual Search** | Try different values based on intuition and experience | ✅ Can leverage domain expertise<br>✅ Good for understanding model behavior | ❌ Time-consuming<br>❌ Not reproducible<br>❌ Biased by human intuition | Initial exploration, small models |
| **Grid Search** | Define a discrete set of values for each hyperparameter and exhaustively evaluate all combinations<br><br>**Example:**<br>• Learning rates: [0.001, 0.01, 0.1]<br>• Batch sizes: [32, 64, 128]<br>• Hidden sizes: [64, 128, 256]<br>• Total: 3 × 3 × 3 = 27 experiments | ✅ Simple to implement<br>✅ Reproducible<br>✅ Guaranteed to find the best configuration on the grid (not necessarily the best overall) | ❌ Exponential growth: $n_1 \times n_2 \times \cdots \times n_k$ combinations<br>❌ Wastes computation on unimportant dimensions<br>❌ Quickly becomes impractical for $>3$ hyperparameters | Few hyperparameters ($\leq 3$), small search spaces |
| **Random Search** | Sample hyperparameter combinations randomly from defined distributions<br><br>**Key Insight (Bergstra & Bengio, 2012):**<br>Random search is more efficient than grid search because it explores more unique values along each dimension | ✅ More efficient than grid search<br>✅ Can search larger spaces with same budget<br>✅ Parallelizable | ❌ No guarantee of finding optimal configuration<br>❌ Doesn't learn from previous trials | $>3$ hyperparameters, limited computational budget |
| **Log-Scale Sampling** | For hyperparameters spanning multiple orders of magnitude<br><br>**Wrong approach:**<br>`lr = uniform(0.00001, 0.1)`  *# About 90% of samples land in [0.01, 0.1]*<br><br>**Correct approach:**<br>`log_lr = uniform(-5, -1)`<br>`lr = 10 ** log_lr`  *# Evenly distributed across orders of magnitude* | ✅ **Use for:**<br>• Learning rate<br>• Weight decay (regularization strength)<br>• Any parameter spanning multiple orders of magnitude | ❌ **Don't use for:**<br>• Number of layers (discrete, small range)<br>• Batch size (power of 2, small range)<br>• Dropout rate (bounded [0,1]) | Parameters with exponential scale (learning rate, regularization) |
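
The grid and random strategies in the table can be sketched with the standard library. The search space, the helper names `sample_log_uniform` and `sample_config`, and the budget of 27 trials are assumptions chosen to mirror the grid example; the training and evaluation of each configuration is deliberately left out.

```python
import itertools
import random

# Grid search: every combination of the listed values (3 * 3 * 3 = 27 trials).
grid = list(itertools.product([0.001, 0.01, 0.1], [32, 64, 128], [64, 128, 256]))
print(len(grid), "grid configurations")


def sample_log_uniform(rng, low_exp, high_exp):
    """Sample 10**u with u drawn uniformly from [low_exp, high_exp]."""
    return 10.0 ** rng.uniform(low_exp, high_exp)


def sample_config(rng):
    """Draw one random configuration from a hypothetical search space."""
    return {
        "lr": sample_log_uniform(rng, -5, -1),
        "batch_size": rng.choice([16, 32, 64, 128, 256, 512]),
        "hidden_size": rng.choice([64, 128, 256]),
    }


rng = random.Random(0)

# Random search with the same budget as the grid above.
configs = [sample_config(rng) for _ in range(27)]
print(len({c["lr"] for c in configs}), "distinct learning rates tried")

# Fraction of samples below 1e-3 under each sampling scheme.
n = 10_000
log_frac = sum(sample_log_uniform(rng, -5, -1) < 1e-3 for _ in range(n)) / n
lin_frac = sum(rng.uniform(1e-5, 1e-1) < 1e-3 for _ in range(n)) / n
print("log-scale:", log_frac, "linear:", lin_frac)
```

With the same budget, random search tries a different learning rate in almost every trial, whereas the grid only ever tries three. The last two lines show that linear sampling rarely visits small learning rates, while log-scale sampling spends about half its samples below $10^{-3}$.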




---

## Practical HPO Strategy

### Step 1: Coarse Search
- Use **random search**
- Wide ranges: learning rate $[10^{-5}, 10^{-1}]$, batch size $[16, 512]$
- Budget: 20-50 trials with early stopping

### Step 2: Refinement
- Narrow ranges around best region found
- Budget: 50-100 trials

### Step 3: Validation
- Train best 3-5 configurations with different seeds
- Report mean ± std on validation set
- Evaluate final model on **held-out test set only once**
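
Reporting mean ± std across seeds needs only a few lines. The scores below are hypothetical validation accuracies invented for illustration, and `summarize_seeds` is a name chosen here; `statistics.stdev` computes the sample standard deviation.

```python
import statistics


def summarize_seeds(scores):
    """Return (mean, sample standard deviation) of scores across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical validation accuracies: two configurations, five seeds each.
results = {
    "config_a": [0.91, 0.90, 0.92, 0.91, 0.90],
    "config_b": [0.93, 0.85, 0.94, 0.86, 0.92],
}

summary = {name: summarize_seeds(scores) for name, scores in results.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} ± {std:.3f}")
```

In this made-up case `config_b` has the best single run but a lower mean and a much larger spread, which a single-seed comparison would hide.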

---

## Best Practices

1. **Always use a validation set**
   - Never tune on test data (leads to overfitting)
   - Use cross-validation if data is limited

2. **Use log-scale for learning rate**
   - Sample from $10^{\text{uniform}(-5, -1)}$ not uniform(0.00001, 0.1)

3. **Start simple**
   - Begin with single-layer models, few trials
   - Add complexity only when necessary

4. **Monitor training dynamics**
   - Plot learning curves for different configurations
   - Watch for underfitting vs. overfitting

5. **Budget computational resources**
   - Set maximum trials and wall-clock time limits
   - Use early stopping aggressively during search

6. **Document everything**
   - Track all hyperparameters, seeds, and results
   - Use tools like MLflow, Weights & Biases, or TensorBoard

---

## Common Pitfalls

| Mistake | Consequence | Solution |
|---------|-------------|----------|
| **Reporting final results on the validation set** | Optimistic estimates, because tuning overfits the validation data | Use separate test set, evaluate only once |
| **Not using log-scale for LR** | Wasting trials on narrow range | Use exponential/log distributions |
| **Too small validation set** | Noisy performance estimates | Use at least 10-20% of training data |
| **Grid search for many parameters** | Exponential computational cost | Use random search |
| **Single seed evaluation** | High variance in results | Average over 3-5 seeds |
| **Ignoring training time** | Wasting resources on slow configs | Set time/epoch budgets |


**Key Takeaway:** Start with random search using log-scale sampling for learning rate and regularization parameters. <br> Use validation set for all tuning, and only evaluate on test set once with your final configuration.

---