# Module 16: Hyperparameter Tuning

**Finding the Best Configuration**

---

## Objectives

By the end of this notebook, you will:
- Understand the key hyperparameters and their typical ranges
- Know the difference between grid search and random search
- Understand manual tuning strategies

---

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from itertools import product

---

# Part 1: Key Hyperparameters

---

| Hyperparameter | Typical Range | Notes |
|----------------|---------------|-------|
| Learning rate | 0.0001 - 0.1 | Usually the most important; tune first |
| Batch size | 16 - 512 | Powers of 2 |
| Hidden units | 32 - 1024 | Per layer |
| Num layers | 1 - 10 | For MLPs |
| Dropout | 0.0 - 0.5 | Regularization |
| Weight decay | 0 - 0.01 | L2 regularization |
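
As a rough illustration, the ranges in the table can be turned into concrete candidate lists. The sketch below assumes one learning rate per decade and batch sizes at every power of 2 in the range; the names `learning_rates` and `batch_sizes` are illustrative and not part of the original notebook.

```python
# Learning rates: one candidate per decade from 1e-4 to 1e-1 (log-spaced).
learning_rates = [10.0 ** exponent for exponent in range(-4, 0)]

# Batch sizes: powers of 2 from 16 (2**4) to 512 (2**9).
batch_sizes = [2 ** k for k in range(4, 10)]

print("learning rates:", learning_rates)
print("batch sizes:", batch_sizes)
```

Spacing learning rates by decade rather than evenly keeps the small values from being crowded out by the large ones.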

---

# Part 2: Search Strategies

---

## 2.1 Grid Search

In [2]:
# Grid search example
param_grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'hidden_size': [64, 128, 256],
    'dropout': [0.0, 0.2, 0.5]
}

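# Generate all combinations (Cartesian product of the value lists).
# Each tuple follows the key order: (learning_rate, hidden_size, dropout).
keys = list(param_grid.keys())
values = param_grid.values()
combinations = list(product(*values))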

print(f"Total combinations: {len(combinations)}")
print(f"First 5: {combinations[:5]}")

Total combinations: 27
First 5: [(0.001, 64, 0.0), (0.001, 64, 0.2), (0.001, 64, 0.5), (0.001, 128, 0.0), (0.001, 128, 0.2)]
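
The combinations above still need to be evaluated to find the best one. The following is a minimal sketch of that loop, assuming a hypothetical `evaluate` function that stands in for "train a model and return its validation loss". The formula inside `evaluate` is made up for illustration; a real run would train and validate a model for each configuration.

```python
from itertools import product

param_grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'hidden_size': [64, 128, 256],
    'dropout': [0.0, 0.2, 0.5],
}


def evaluate(config):
    """Hypothetical stand-in for training a model and returning validation loss.

    The formula is invented: it is lowest at
    learning_rate=0.01, hidden_size=128, dropout=0.2.
    """
    return (
        abs(config['learning_rate'] - 0.01)
        + abs(config['hidden_size'] - 128) / 1000
        + abs(config['dropout'] - 0.2)
    )


keys = list(param_grid.keys())
best_config, best_score = None, float('inf')

for values in product(*param_grid.values()):
    # Pair each value with its name, e.g. {'learning_rate': 0.001, ...}
    config = dict(zip(keys, values))
    score = evaluate(config)
    if score < best_score:  # lower loss is better
        best_config, best_score = config, score

print(f"Best config: {best_config}")
```

Converting each tuple into a dictionary with `dict(zip(keys, values))` keeps the parameter names attached to their values, which is easier to log and to pass to a training function than a bare tuple.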


## 2.2 Random Search

In [3]:
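# Random search example
# No random seed is set, so the sampled values differ on every run.
def sample_hyperparameters():
    return {
        # Log-uniform: sample the exponent, so each decade is equally likely
        'learning_rate': 10 ** np.random.uniform(-4, -1),
        # Cast to a plain Python int (np.random.choice returns a NumPy integer)
        'hidden_size': int(np.random.choice([64, 128, 256, 512])),
        'dropout': np.random.uniform(0, 0.5),
        # Log-uniform between 1e-5 and 1e-2
        'weight_decay': 10 ** np.random.uniform(-5, -2)
    }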

print("Random samples:")
for i in range(3):
    params = sample_hyperparameters()
    print(f"  {i+1}: lr={params['learning_rate']:.6f}, hidden={params['hidden_size']}")

Random samples:
  1: lr=0.000132, hidden=128
  2: lr=0.022167, hidden=512
  3: lr=0.008236, hidden=128
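
A full random search draws a fixed number of samples, scores each one, and keeps the best. The sketch below assumes a trial budget of 20 and a hypothetical `evaluate` function with an invented formula. To stay dependency-free and reproducible it uses Python's built-in `random.Random` with a fixed seed instead of `np.random`; the sampling ranges match the cell above.

```python
import math
import random


def sample_config(rng):
    """Draw one configuration; learning rate and weight decay are log-uniform."""
    return {
        'learning_rate': 10 ** rng.uniform(-4, -1),
        'hidden_size': rng.choice([64, 128, 256, 512]),
        'dropout': rng.uniform(0, 0.5),
        'weight_decay': 10 ** rng.uniform(-5, -2),
    }


def evaluate(config):
    """Hypothetical stand-in for validation loss (invented formula).

    Lowest when learning_rate is near 0.01 and dropout is near 0.2.
    """
    return (
        abs(math.log10(config['learning_rate']) + 2)
        + abs(config['dropout'] - 0.2)
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
n_trials = 20           # assumed budget; choose based on available compute

trials = []
for _ in range(n_trials):
    config = sample_config(rng)
    trials.append((evaluate(config), config))

# Select the trial with the lowest score
best_score, best_config = min(trials, key=lambda trial: trial[0])
print(f"Best score: {best_score:.4f}")
print(f"Best config: {best_config}")
```

Unlike grid search, the cost here is set directly by `n_trials` and does not grow as more hyperparameters are added to the search space.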


---

# Part 3: Practical Tips

---

## Priority Order

1. **Learning rate**: Tune this first; 0.001 is a reasonable starting value
2. **Batch size**: 32 or 64 is usually a good default
3. **Architecture**: Number of layers and units per layer
4. **Regularization**: Dropout and weight decay
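
One way to follow this priority order by hand is to tune one hyperparameter at a time, keeping the others fixed at their current best values. The sketch below assumes a hypothetical `evaluate` function with an invented formula and illustrative candidate lists. Real hyperparameters interact, so treat this one-at-a-time pass as a heuristic, not a guarantee of finding the best combination.

```python
import math


def evaluate(config):
    """Hypothetical stand-in for validation loss (invented formula)."""
    return (
        abs(math.log10(config['learning_rate']) + 2)
        + abs(config['batch_size'] - 64) / 100
        + abs(config['hidden_size'] - 128) / 1000
        + abs(config['dropout'] - 0.2)
    )


# Starting defaults, taken from the tips above where they are given
config = {
    'learning_rate': 0.001,
    'batch_size': 32,
    'hidden_size': 64,
    'dropout': 0.0,
}

# Hyperparameters in priority order, each with illustrative candidates
search_order = [
    ('learning_rate', [0.0001, 0.001, 0.01, 0.1]),
    ('batch_size', [32, 64]),
    ('hidden_size', [64, 128, 256]),
    ('dropout', [0.0, 0.2, 0.5]),
]

for name, candidates in search_order:
    # Try each candidate with all other settings held fixed
    best_value = min(candidates, key=lambda v: evaluate({**config, name: v}))
    config[name] = best_value
    print(f"After tuning {name}: {config}")
```

This needs only 4 + 2 + 3 + 3 = 12 evaluations, compared with 4 × 2 × 3 × 3 = 72 for a full grid over the same candidates.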

## Learning Rate Tips

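- If the loss does not decrease, the learning rate may be too low (progress is too slow) or too high (updates overshoot); try values an order of magnitude in each direction
- If the loss oscillates wildly, the learning rate is probably too high
- Start with 0.001 for Adam and 0.01 for SGD
- Use a learning rate scheduler (for example, `torch.optim.lr_scheduler.StepLR` or `ReduceLROnPlateau`) to lower the rate as training progresses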
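
The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on the one-dimensional function f(w) = w², which is an assumption chosen for illustration and not a neural network. On this function a rate that is too low barely reduces the loss, a moderate rate converges quickly, and a rate that is too high makes the parameter overshoot and flip sign on every step while the loss grows.

```python
def run_gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent.

    Returns the loss at the start and after every step.
    """
    w = w0
    losses = [w ** 2]
    for _ in range(steps):
        grad = 2 * w       # derivative of w**2
        w = w - lr * grad  # gradient descent update
        losses.append(w ** 2)
    return losses


for lr in [0.001, 0.1, 1.1]:
    losses = run_gradient_descent(lr)
    print(f"lr={lr}: start loss={losses[0]:.4f}, final loss={losses[-1]:.4f}")
```

Each update multiplies `w` by `(1 - 2 * lr)`, so the loss shrinks only when that factor has magnitude below 1. Real loss surfaces are not this simple, but the same three regimes show up in practice.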

---

## Next Module: [17 - Practice Problems](../17_practice/17_practice.ipynb)