# Day 8: Hyperparameter Tuning and Experimentation

**Time:** 3-4 hours

**Mathematical Prerequisites:**
- Optimization theory (gradient descent, convergence)
- Statistics (sampling, distributions)
- Probability (Bayesian inference for advanced methods)
- Understanding of search spaces and complexity

---

## Objectives

Hyperparameters can make or break your model. Today we explore systematic approaches to hyperparameter optimization:
1. Learning rate finder (critical first step)
2. Grid search vs random search
3. Bayesian optimization (advanced)
4. Experiment tracking and management
5. Visualization of hyperparameter effects
6. Budget allocation strategies

**Goal:** Build a systematic framework for finding optimal hyperparameters

---

## Part 1: Theory - The Hyperparameter Optimization Problem

### 1.1 Hyperparameters vs Parameters

**Parameters ($\theta$):** Learned from data via gradient descent
- Weights, biases
- Updated during training

**Hyperparameters ($\lambda$):** Set before training
- Learning rate, batch size, architecture choices
- Not directly optimized via gradient descent
- Require expensive outer loop optimization

### 1.2 The Optimization Problem

**Inner loop (training):**
$$
\theta^*(\lambda) = \arg\min_{\theta} L_{\text{train}}(\theta; \lambda)
$$

**Outer loop (hyperparameter tuning):**
$$
\lambda^* = \arg\min_{\lambda} L_{\text{val}}(\theta^*(\lambda))
$$

**Challenges:**
- Each evaluation of $\theta^*(\lambda)$ requires a full training run
- High-dimensional search space
- No gradient information for $\lambda$
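
To make the two loops concrete, here is a minimal sketch on a toy problem rather than a neural network. It assumes ridge regression on synthetic data, where $\lambda$ is the penalty strength and the inner loop has a closed-form solution; the data, the function names `inner_loop` and `val_loss`, and the candidate list are all illustrative choices, not part of this course's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical, for illustration only)
X = rng.normal(size=(200, 20))
true_w = rng.normal(size=20)
y = X @ true_w + rng.normal(scale=2.0, size=200)
X_train, y_train = X[:100], y[:100]
X_val, y_val = X[100:], y[100:]

def inner_loop(lam):
    """theta*(lambda): minimize ||X theta - y||^2 + lam * ||theta||^2 on the training set."""
    d = X_train.shape[1]
    return np.linalg.solve(X_train.T @ X_train + lam * np.eye(d), X_train.T @ y_train)

def val_loss(theta):
    """L_val(theta): mean squared error on the held-out split."""
    return float(np.mean((X_val @ theta - y_val) ** 2))

# Outer loop: no gradient with respect to lambda is used, so each candidate
# is evaluated by running the inner loop to completion.
candidates = [10.0 ** e for e in range(-3, 4)]
scores = {lam: val_loss(inner_loop(lam)) for lam in candidates}
best_lam = min(scores, key=scores.get)
print(f"best lambda = {best_lam:g}, val MSE = {scores[best_lam]:.3f}")
```

Here each inner-loop call is a cheap linear solve. With a neural network the same outer loop costs one full training run per candidate, which is what makes the search expensive.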

### 1.3 Common Hyperparameters and Their Scales

| Hyperparameter | Type | Search Scale | Typical Range |
|----------------|------|--------------|---------------|
| Learning rate | Continuous | Log | [1e-5, 1e-1] |
| Batch size | Discrete | Linear/Log | [16, 512] |
| Weight decay | Continuous | Log | [1e-6, 1e-2] |
| Dropout rate | Continuous | Linear | [0.0, 0.5] |
| Number of layers | Discrete | Linear | [2, 10] |
| Hidden units | Discrete | Log | [32, 1024] |

**Key Insight:** Learning rate, weight decay, and other scaling parameters should be searched on **log scale** because their effect is multiplicative.
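
The difference between the two scales is easy to check numerically. The sketch below draws learning rates from the table's range [1e-5, 1e-1] both ways and measures how many samples land below 1e-3; the sample count and the 1e-3 threshold are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high, n = 1e-5, 1e-1, 10_000

# Linear scale: uniform in the value itself
linear = rng.uniform(low, high, size=n)

# Log scale: uniform in the exponent, then mapped back
log_uniform = 10 ** rng.uniform(np.log10(low), np.log10(high), size=n)

frac_linear_small = float(np.mean(linear < 1e-3))
frac_log_small = float(np.mean(log_uniform < 1e-3))

print(f"Linear sampling:      {frac_linear_small:.1%} of samples below 1e-3")
print(f"Log-uniform sampling: {frac_log_small:.1%} of samples below 1e-3")
```

Linear sampling spends roughly 99% of its trials in the top two decades, so the range [1e-5, 1e-3] is barely explored. Log-uniform sampling gives each decade the same share of trials.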

### 1.4 Search Strategies Comparison

**Grid Search:**
- Try all combinations on a grid
- Complexity: $O(k^d)$ where $k$ = grid points per dimension, $d$ = dimensions
- Good: Exhaustive, reproducible
- Bad: Exponential in dimensions, wastes computation

**Random Search:**
- Sample hyperparameters randomly
- Complexity: $O(n)$ where $n$ = number of trials
- Good: More efficient than grid for high dimensions
- Bad: May miss optimal region

**Bayesian Optimization:**
- Build probabilistic model of objective function
- Use acquisition function to select next point
- Good: Sample-efficient, intelligent exploration
- Bad: Overhead for simple problems
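
As a quick check on the $O(k^d)$ cost of grid search, this sketch counts the training runs a full grid needs as the number of hyperparameters grows. The helper `grid_size` is a hypothetical name, and $k = 5$ is an arbitrary choice.

```python
from itertools import product

def grid_size(points_per_dim: int, num_dims: int) -> int:
    """Number of training runs a full grid needs: k^d."""
    return points_per_dim ** num_dims

for d in (2, 4, 6, 8):
    print(f"k=5, d={d}: {grid_size(5, d):>7} runs")

# Cross-check the formula against an explicit enumeration for a small case
small_grid = list(product(range(5), repeat=3))
print(f"Enumerated 5x5x5 grid: {len(small_grid)} runs")
```

Random search has no such coupling: its cost is the number of trials you choose, whatever the dimensionality.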

**Important Result (Bergstra & Bengio, 2012):**
Random search is more efficient than grid search when only a few hyperparameters truly matter.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from tqdm import tqdm
import json
import time
from pathlib import Path
from datetime import datetime
from itertools import product

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Subset

# For Bayesian optimization
try:
    from skopt import BayesSearchCV
    from skopt.space import Real, Integer, Categorical
    SKOPT_AVAILABLE = True
except ImportError:
    # Catch only a missing package; a bare except would also hide unrelated errors
    SKOPT_AVAILABLE = False
    print("scikit-optimize not available. Bayesian optimization will be skipped.")

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

np.random.seed(42)
torch.manual_seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Create experiments directory
experiments_dir = Path('./experiments')
experiments_dir.mkdir(exist_ok=True)

## Part 2: Dataset and Model Setup

We'll use CIFAR-10 for hyperparameter tuning experiments.

In [None]:
# Load CIFAR-10
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])

# Full datasets
train_dataset_full = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=transform_train
)
test_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=transform_test
)

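# Create train/val split. The validation subset reads from a second copy of the
# training data that uses transform_test, so validation images are not augmented.
val_dataset_full = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=False, transform=transform_test
)
train_size = int(0.8 * len(train_dataset_full))
val_size = len(train_dataset_full) - train_size
indices = torch.randperm(len(train_dataset_full)).tolist()
train_dataset = Subset(train_dataset_full, indices[:train_size])
val_dataset = Subset(val_dataset_full, indices[train_size:])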

print(f"Train samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Test samples: {len(test_dataset)}")

### Define Simple CNN for Fast Experimentation

In [None]:
class SimpleCNN(nn.Module):
    """Simple CNN for hyperparameter tuning experiments."""
    
    def __init__(self, hidden_size=64, dropout_rate=0.5, num_layers=2):
        super(SimpleCNN, self).__init__()
        
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.dropout = nn.Dropout(dropout_rate)
        
        # Flexible FC layers
        self.fc_layers = nn.ModuleList()
        input_size = 64 * 8 * 8
        
        for _ in range(num_layers - 1):
            self.fc_layers.append(nn.Linear(input_size, hidden_size))
            input_size = hidden_size
        
        self.fc_final = nn.Linear(input_size, 10)
        
    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = self.dropout(x)
        
        for fc in self.fc_layers:
            x = torch.relu(fc(x))
            x = self.dropout(x)
        
        x = self.fc_final(x)
        return x

## Part 3: Learning Rate Finder

### 3.1 Theory: Learning Rate Range Test

**Algorithm (Smith, 2017):**
1. Start with very small learning rate (e.g., 1e-7)
2. Exponentially increase LR after each batch
3. Record loss at each LR
4. Stop when loss explodes

**Optimal LR Selection:**
- **Method 1:** LR where the loss decreases fastest (most negative slope of the loss curve)
- **Method 2:** One order of magnitude below the LR at which the loss is lowest
- **Method 3:** Middle of the steepest-decline region

**Mathematical Formulation:**
At iteration $i$:
$$
\text{LR}_i = \text{LR}_{\min} \cdot \left(\frac{\text{LR}_{\max}}{\text{LR}_{\min}}\right)^{i/n}
$$
where $n$ is total iterations.
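
The schedule can be computed directly from the formula. The sketch below assumes $i$ runs from 0 to $n$ inclusive so that the last value equals $\text{LR}_{\max}$; the function name `lr_schedule` is illustrative. It also checks that consecutive learning rates differ by a constant factor, which is why an implementation can multiply the LR by a fixed value after each batch.

```python
import numpy as np

def lr_schedule(lr_min: float, lr_max: float, n: int) -> np.ndarray:
    """LR_i = lr_min * (lr_max / lr_min) ** (i / n) for i = 0, ..., n."""
    i = np.arange(n + 1)
    return lr_min * (lr_max / lr_min) ** (i / n)

lrs = lr_schedule(1e-7, 10.0, 100)

# Consecutive learning rates differ by the same multiplicative factor
ratios = lrs[1:] / lrs[:-1]
print(f"first LR = {lrs[0]:.1e}, last LR = {lrs[-1]:.1e}")
print(f"per-step multiplier = {ratios[0]:.4f}")
```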

In [None]:
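import copy

class LRFinder:
    """Learning Rate Range Test."""
    
    def __init__(self, model, optimizer, criterion, device):
        self.model = model
        self.optimizer = optimizer
        self.criterion = criterion
        self.device = device
        
        # Save deep copies of the initial state. state_dict() returns references
        # to the live tensors, so without a copy the "restore" at the end of
        # range_test would reload the already-modified weights.
        self.model_state = copy.deepcopy(model.state_dict())
        self.optimizer_state = copy.deepcopy(optimizer.state_dict())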
    
    def range_test(self, train_loader, start_lr=1e-7, end_lr=10, num_iter=100, 
                   smooth_f=0.05, diverge_th=5):
        """
        Perform learning rate range test.
        
        Args:
            train_loader: Training data loader
            start_lr: Starting learning rate
            end_lr: Ending learning rate
            num_iter: Number of iterations
            smooth_f: Smoothing factor for loss
            diverge_th: Threshold for divergence detection
        """
        # Setup
        self.model.train()
        lrs = []
        losses = []
        best_loss = float('inf')
        
        # Calculate LR multiplier
        lr_mult = (end_lr / start_lr) ** (1 / num_iter)
        lr = start_lr
        
        # Update learning rate
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr
        
        # Iterate
        iterator = iter(train_loader)
        smoothed_loss = 0
        
        for iteration in tqdm(range(num_iter), desc='LR Finder'):
            # Get batch
            try:
                inputs, labels = next(iterator)
            except StopIteration:
                iterator = iter(train_loader)
                inputs, labels = next(iterator)
            
            inputs, labels = inputs.to(self.device), labels.to(self.device)
            
            # Forward pass
            self.optimizer.zero_grad()
            outputs = self.model(inputs)
            loss = self.criterion(outputs, labels)
            
            # Smooth loss
            if iteration == 0:
                smoothed_loss = loss.item()
            else:
                smoothed_loss = smooth_f * loss.item() + (1 - smooth_f) * smoothed_loss
            
            # Record
            lrs.append(lr)
            losses.append(smoothed_loss)
            
            # Check for divergence
            if smoothed_loss > diverge_th * best_loss or torch.isnan(loss):
                print(f"\nStopping early at iteration {iteration}. Loss diverged.")
                break
            
            if smoothed_loss < best_loss:
                best_loss = smoothed_loss
            
            # Backward pass
            loss.backward()
            self.optimizer.step()
            
            # Update learning rate
            lr *= lr_mult
            for param_group in self.optimizer.param_groups:
                param_group['lr'] = lr
        
        # Restore initial state
        self.model.load_state_dict(self.model_state)
        self.optimizer.load_state_dict(self.optimizer_state)
        
        return lrs, losses
    
    def plot(self, lrs, losses, skip_start=10, skip_end=5):
        """Plot learning rate vs loss."""
        # Trim the noisy ends only if at least two points remain, since
        # np.gradient below needs two or more values (matters after an early stop)
        if len(lrs) - skip_start - skip_end < 2:
            skip_start, skip_end = 0, 0
        
        end = len(lrs) - skip_end
        lrs = lrs[skip_start:end]
        losses = losses[skip_start:end]
        
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
        
        # Log scale plot
        ax1.plot(lrs, losses, linewidth=2)
        ax1.set_xscale('log')
        ax1.set_xlabel('Learning Rate (log scale)', fontsize=12)
        ax1.set_ylabel('Loss', fontsize=12)
        ax1.set_title('Learning Rate Finder', fontsize=14, fontweight='bold')
        ax1.grid(True, alpha=0.3)
        
        # Find steepest gradient
        gradients = np.gradient(losses)
        min_gradient_idx = np.argmin(gradients)
        suggested_lr = lrs[min_gradient_idx]
        
        ax1.axvline(x=suggested_lr, color='red', linestyle='--', linewidth=2,
                   label=f'Suggested LR: {suggested_lr:.2e}')
        ax1.legend(fontsize=11)
        
        # Gradient plot
        ax2.plot(lrs, gradients, linewidth=2, color='green')
        ax2.set_xscale('log')
        ax2.set_xlabel('Learning Rate (log scale)', fontsize=12)
        ax2.set_ylabel('Loss Gradient', fontsize=12)
        ax2.set_title('Loss Gradient vs Learning Rate', fontsize=14, fontweight='bold')
        ax2.axvline(x=suggested_lr, color='red', linestyle='--', linewidth=2,
                   label=f'Min gradient at LR: {suggested_lr:.2e}')
        ax2.legend(fontsize=11)
        ax2.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        print(f"\nSuggested learning rate: {suggested_lr:.2e}")
        print(f"Consider using: {suggested_lr/10:.2e} to {suggested_lr:.2e}")
        
        return suggested_lr

### Run Learning Rate Finder

In [None]:
# Create data loader
lr_finder_loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=2)

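# Create model and optimizer
# Note: the suggested LR is specific to the optimizer used in the range test
# (SGD with momentum here). train_and_evaluate below uses Adam, which typically
# needs a smaller LR, so rerun the test with Adam before reusing the value there.
model = SimpleCNN(hidden_size=64, dropout_rate=0.5, num_layers=2).to(device)
optimizer = optim.SGD(model.parameters(), lr=1e-7, momentum=0.9)
criterion = nn.CrossEntropyLoss()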

# Run LR finder
lr_finder = LRFinder(model, optimizer, criterion, device)
lrs, losses = lr_finder.range_test(lr_finder_loader, start_lr=1e-6, end_lr=1, num_iter=100)
suggested_lr = lr_finder.plot(lrs, losses)

## Part 4: Grid Search

### 4.1 Define Search Space

In [None]:
# Grid search hyperparameters
grid_search_space = {
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [32, 64, 128],
    'weight_decay': [0.0, 1e-4, 1e-3],
}

# Calculate total combinations
total_combinations = np.prod([len(v) for v in grid_search_space.values()])
print(f"Grid Search: {total_combinations} total combinations")
print(f"Search space: {grid_search_space}")

### 4.2 Training Function with Experiment Tracking

In [None]:
def train_and_evaluate(hyperparams, train_loader, val_loader, num_epochs=5, verbose=False):
    """
    Train model with given hyperparameters and return validation performance.
    
    Returns:
        dict with results including val_acc, train_time, etc.
    """
    # Unpack hyperparameters
    lr = hyperparams.get('learning_rate', 0.01)
    weight_decay = hyperparams.get('weight_decay', 0.0)
    dropout_rate = hyperparams.get('dropout_rate', 0.5)
    hidden_size = hyperparams.get('hidden_size', 64)
    num_layers = hyperparams.get('num_layers', 2)
    
    # Create model
    model = SimpleCNN(
        hidden_size=hidden_size,
        dropout_rate=dropout_rate,
        num_layers=num_layers
    ).to(device)
    
    # Setup training
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    
    # Track metrics
    train_losses = []
    val_accs = []
    
    start_time = time.time()
    
    # Training loop
    for epoch in range(num_epochs):
        # Train
        model.train()
        running_loss = 0.0
        
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            running_loss += loss.item()
        
        epoch_loss = running_loss / len(train_loader)
        train_losses.append(epoch_loss)
        
        # Validate
        model.eval()
        correct = 0
        total = 0
        
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                _, predicted = outputs.max(1)
                total += labels.size(0)
                correct += predicted.eq(labels).sum().item()
        
        val_acc = correct / total
        val_accs.append(val_acc)
        
        if verbose:
            print(f"Epoch {epoch+1}/{num_epochs}: Train Loss={epoch_loss:.4f}, Val Acc={val_acc:.4f}")
    
    train_time = time.time() - start_time
    
    # Return results
    results = {
        'hyperparams': hyperparams,
        'val_acc': max(val_accs),
        'final_val_acc': val_accs[-1],
        'train_losses': train_losses,
        'val_accs': val_accs,
        'train_time': train_time,
        'num_epochs': num_epochs
    }
    
    return results

### 4.3 Run Grid Search

In [None]:
def grid_search(search_space, train_dataset, val_dataset, num_epochs=5):
    """
    Perform grid search over hyperparameter space.
    """
    results = []
    
    # Generate all combinations
    keys = list(search_space.keys())
    values = list(search_space.values())
    combinations = list(product(*values))
    
    print(f"\nRunning Grid Search: {len(combinations)} combinations\n")
    print("="*80)
    
    for idx, combo in enumerate(combinations):
        hyperparams = dict(zip(keys, combo))
        
        print(f"\nExperiment {idx+1}/{len(combinations)}")
        print(f"Hyperparameters: {hyperparams}")
        
        # Create data loaders with specified batch size
        batch_size = hyperparams.get('batch_size', 64)
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)
        val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=2)
        
        # Train and evaluate
        result = train_and_evaluate(hyperparams, train_loader, val_loader, num_epochs=num_epochs)
        results.append(result)
        
        print(f"Validation Accuracy: {result['val_acc']:.4f}")
        print(f"Training Time: {result['train_time']:.2f}s")
        
    print("\n" + "="*80)
    print("Grid Search Complete!")
    
    return results

# Run grid search (reduced space for demo)
print("Starting Grid Search...")
grid_results = grid_search(grid_search_space, train_dataset, val_dataset, num_epochs=3)

### 4.4 Analyze Grid Search Results

In [None]:
# Create DataFrame
grid_df = pd.DataFrame([{
    **r['hyperparams'],
    'val_acc': r['val_acc'],
    'train_time': r['train_time']
} for r in grid_results])

# Sort by validation accuracy
grid_df_sorted = grid_df.sort_values('val_acc', ascending=False)

print("\nTop 5 Configurations:")
print("="*100)
print(grid_df_sorted.head().to_string(index=False))
print("="*100)

# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Learning rate effect
lr_groups = grid_df.groupby('learning_rate')['val_acc'].agg(['mean', 'std'])
axes[0].bar(range(len(lr_groups)), lr_groups['mean'], yerr=lr_groups['std'],
           capsize=5, alpha=0.7, edgecolor='black')
axes[0].set_xticks(range(len(lr_groups)))
axes[0].set_xticklabels([f"{lr:.3f}" for lr in lr_groups.index])
axes[0].set_xlabel('Learning Rate', fontsize=12)
axes[0].set_ylabel('Mean Validation Accuracy', fontsize=12)
axes[0].set_title('Effect of Learning Rate', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')

# Batch size effect
bs_groups = grid_df.groupby('batch_size')['val_acc'].agg(['mean', 'std'])
axes[1].bar(range(len(bs_groups)), bs_groups['mean'], yerr=bs_groups['std'],
           capsize=5, alpha=0.7, edgecolor='black', color='orange')
axes[1].set_xticks(range(len(bs_groups)))
axes[1].set_xticklabels([f"{bs}" for bs in bs_groups.index])
axes[1].set_xlabel('Batch Size', fontsize=12)
axes[1].set_ylabel('Mean Validation Accuracy', fontsize=12)
axes[1].set_title('Effect of Batch Size', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

# Weight decay effect
wd_groups = grid_df.groupby('weight_decay')['val_acc'].agg(['mean', 'std'])
axes[2].bar(range(len(wd_groups)), wd_groups['mean'], yerr=wd_groups['std'],
           capsize=5, alpha=0.7, edgecolor='black', color='green')
axes[2].set_xticks(range(len(wd_groups)))
axes[2].set_xticklabels([f"{wd:.1e}" for wd in wd_groups.index])
axes[2].set_xlabel('Weight Decay', fontsize=12)
axes[2].set_ylabel('Mean Validation Accuracy', fontsize=12)
axes[2].set_title('Effect of Weight Decay', fontsize=14, fontweight='bold')
axes[2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Best configuration
best_idx = grid_df_sorted.index[0]
best_config = grid_results[best_idx]
print(f"\nBest Configuration:")
print(f"Hyperparameters: {best_config['hyperparams']}")
print(f"Validation Accuracy: {best_config['val_acc']:.4f}")

## Part 5: Random Search

### 5.1 Theory: Why Random Search?

**Key Result (Bergstra & Bengio, 2012):**

If only a few hyperparameters truly matter, random search explores those dimensions more thoroughly than grid search for the same budget.

**Example:** 9 experiments, 2 hyperparameters
- Grid: 3×3 grid → Only 3 unique values per dimension
- Random: 9 random samples → 9 unique values per dimension
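
The counting argument above can be verified in a few lines. This sketch places 9 trials in the unit square both ways and counts the distinct values tried along each axis; the unit-square search space is an assumption made for simplicity.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# 3x3 grid: 9 trials, but each axis only ever sees 3 distinct values
grid_axis = np.linspace(0.0, 1.0, 3)
grid_points = np.array(list(product(grid_axis, grid_axis)))

# 9 random trials: each axis sees 9 distinct values
random_points = rng.uniform(0.0, 1.0, size=(9, 2))

grid_unique = [len(np.unique(grid_points[:, j])) for j in range(2)]
random_unique = [len(np.unique(random_points[:, j])) for j in range(2)]

print(f"Grid:   distinct values per dimension = {grid_unique}")
print(f"Random: distinct values per dimension = {random_unique}")
```

If the second hyperparameter turns out not to matter, the grid has effectively tested 3 settings of the first one, while random search has tested 9.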

**Search Space Definition:**
- Continuous: Sample from distribution (uniform, log-uniform)
- Discrete: Sample from set
- Categorical: Sample from choices

In [None]:
def sample_random_hyperparams(search_space, n_samples=20):
    """
    Sample hyperparameters randomly from search space.
    
    search_space format:
    {
        'param_name': ('type', min, max) or ('choice', [options])
    }
    """
    samples = []
    
    for _ in range(n_samples):
        sample = {}
        for param, spec in search_space.items():
            param_type = spec[0]
            
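            if param_type == 'log_uniform':
                # Sample on log scale (for LR, weight decay, etc.)
                log_min, log_max = np.log10(spec[1]), np.log10(spec[2])
                sample[param] = float(10 ** np.random.uniform(log_min, log_max))
            
            elif param_type == 'uniform':
                # Sample uniformly
                sample[param] = float(np.random.uniform(spec[1], spec[2]))
            
            elif param_type == 'int_uniform':
                # Sample integer uniformly (both ends inclusive); cast to a plain
                # int so DataLoader and json accept the value
                sample[param] = int(np.random.randint(spec[1], spec[2] + 1))
            
            elif param_type == 'choice':
                # Pick by index so the option keeps its native Python type.
                # np.random.choice would return np.int64, which DataLoader's
                # batch_size check and json.dump both reject.
                options = spec[1]
                sample[param] = options[np.random.randint(len(options))]
            
            else:
                raise ValueError(f"Unknown parameter type: {param_type}")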
        
        samples.append(sample)
    
    return samples

# Define random search space
random_search_space = {
    'learning_rate': ('log_uniform', 1e-4, 1e-1),
    'batch_size': ('choice', [32, 64, 128, 256]),
    'weight_decay': ('log_uniform', 1e-6, 1e-2),
    'dropout_rate': ('uniform', 0.0, 0.7),
    'hidden_size': ('choice', [32, 64, 128, 256]),
}

# Sample hyperparameters
n_random_samples = 20
random_samples = sample_random_hyperparams(random_search_space, n_samples=n_random_samples)

print(f"Random Search: {n_random_samples} samples")
print("\nFirst 5 samples:")
for i, sample in enumerate(random_samples[:5]):
    print(f"{i+1}. {sample}")

### 5.2 Run Random Search

In [None]:
def random_search(hyperparams_list, train_dataset, val_dataset, num_epochs=5):
    """
    Perform random search.
    """
    results = []
    
    print(f"\nRunning Random Search: {len(hyperparams_list)} samples\n")
    print("="*80)
    
    for idx, hyperparams in enumerate(hyperparams_list):
        print(f"\nExperiment {idx+1}/{len(hyperparams_list)}")
        print(f"Hyperparameters: {hyperparams}")
        
        # Create data loaders
        batch_size = hyperparams.get('batch_size', 64)
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)
        val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=2)
        
        # Train and evaluate
        try:
            result = train_and_evaluate(hyperparams, train_loader, val_loader, num_epochs=num_epochs)
            results.append(result)
            
            print(f"Validation Accuracy: {result['val_acc']:.4f}")
            print(f"Training Time: {result['train_time']:.2f}s")
        except Exception as e:
            print(f"Error: {e}")
            continue
    
    print("\n" + "="*80)
    print("Random Search Complete!")
    
    return results

# Run random search
print("Starting Random Search...")
random_results = random_search(random_samples, train_dataset, val_dataset, num_epochs=3)

### 5.3 Compare Grid Search vs Random Search

In [None]:
# Create DataFrames
random_df = pd.DataFrame([{
    **r['hyperparams'],
    'val_acc': r['val_acc'],
    'train_time': r['train_time'],
    'method': 'Random Search'
} for r in random_results])

grid_df['method'] = 'Grid Search'

# Combine
combined_df = pd.concat([grid_df, random_df], ignore_index=True)

# Compare best results
print("\nGrid Search vs Random Search:")
print("="*80)
print(f"Grid Search - Best Acc: {grid_df['val_acc'].max():.4f}, "
      f"Mean Acc: {grid_df['val_acc'].mean():.4f}, "
      f"Total Time: {grid_df['train_time'].sum():.2f}s")
print(f"Random Search - Best Acc: {random_df['val_acc'].max():.4f}, "
      f"Mean Acc: {random_df['val_acc'].mean():.4f}, "
      f"Total Time: {random_df['train_time'].sum():.2f}s")
print("="*80)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Distribution of accuracies
grid_accs = grid_df['val_acc'].values
random_accs = random_df['val_acc'].values

axes[0].hist(grid_accs, bins=10, alpha=0.7, label='Grid Search', edgecolor='black')
axes[0].hist(random_accs, bins=10, alpha=0.7, label='Random Search', edgecolor='black')
axes[0].axvline(grid_accs.max(), color='blue', linestyle='--', linewidth=2, 
               label=f'Grid Best: {grid_accs.max():.4f}')
axes[0].axvline(random_accs.max(), color='orange', linestyle='--', linewidth=2,
               label=f'Random Best: {random_accs.max():.4f}')
axes[0].set_xlabel('Validation Accuracy', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Distribution of Validation Accuracies', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')

# Cumulative best over time
grid_cummax = pd.Series(grid_accs).cummax()
random_cummax = pd.Series(random_accs).cummax()

axes[1].plot(range(1, len(grid_cummax)+1), grid_cummax, 'o-', 
            linewidth=2, markersize=6, label='Grid Search')
axes[1].plot(range(1, len(random_cummax)+1), random_cummax, 's-',
            linewidth=2, markersize=6, label='Random Search')
axes[1].set_xlabel('Number of Trials', fontsize=12)
axes[1].set_ylabel('Best Validation Accuracy So Far', fontsize=12)
axes[1].set_title('Cumulative Best Performance', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Show best from each
print("\nBest Configuration from Grid Search:")
best_grid = grid_df.loc[grid_df['val_acc'].idxmax()]
print(best_grid.to_string())

print("\nBest Configuration from Random Search:")
best_random = random_df.loc[random_df['val_acc'].idxmax()]
print(best_random.to_string())

## Part 6: Visualization of Hyperparameter Landscape

### 6.1 Pairwise Hyperparameter Interactions

In [None]:
# Use random search results for visualization (more diverse)
fig, axes = plt.subplots(2, 2, figsize=(16, 14))

# Learning rate vs Batch size
scatter = axes[0, 0].scatter(random_df['learning_rate'], random_df['batch_size'],
                            c=random_df['val_acc'], s=100, cmap='viridis',
                            edgecolors='black', alpha=0.7)
axes[0, 0].set_xlabel('Learning Rate (log scale)', fontsize=12)
axes[0, 0].set_ylabel('Batch Size', fontsize=12)
axes[0, 0].set_xscale('log')
axes[0, 0].set_title('Learning Rate vs Batch Size', fontsize=14, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)
plt.colorbar(scatter, ax=axes[0, 0], label='Val Accuracy')

# Learning rate vs Weight decay
scatter = axes[0, 1].scatter(random_df['learning_rate'], random_df['weight_decay'],
                            c=random_df['val_acc'], s=100, cmap='viridis',
                            edgecolors='black', alpha=0.7)
axes[0, 1].set_xlabel('Learning Rate (log scale)', fontsize=12)
axes[0, 1].set_ylabel('Weight Decay (log scale)', fontsize=12)
axes[0, 1].set_xscale('log')
axes[0, 1].set_yscale('log')
axes[0, 1].set_title('Learning Rate vs Weight Decay', fontsize=14, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)
plt.colorbar(scatter, ax=axes[0, 1], label='Val Accuracy')

# Dropout rate vs Hidden size
scatter = axes[1, 0].scatter(random_df['dropout_rate'], random_df['hidden_size'],
                            c=random_df['val_acc'], s=100, cmap='viridis',
                            edgecolors='black', alpha=0.7)
axes[1, 0].set_xlabel('Dropout Rate', fontsize=12)
axes[1, 0].set_ylabel('Hidden Size', fontsize=12)
axes[1, 0].set_title('Dropout Rate vs Hidden Size', fontsize=14, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)
plt.colorbar(scatter, ax=axes[1, 0], label='Val Accuracy')

# Parallel coordinates plot
from pandas.plotting import parallel_coordinates

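# Select columns for the parallel coordinates plot
plot_df = random_df[['learning_rate', 'batch_size', 'weight_decay', 
                     'dropout_rate', 'hidden_size', 'val_acc']].copy()

# Add performance category
plot_df['performance'] = pd.cut(plot_df['val_acc'], bins=3, 
                                labels=['Low', 'Medium', 'High'])

# Put log-scale hyperparameters on a log axis before normalizing
for col in ['learning_rate', 'weight_decay']:
    plot_df[col] = np.log10(plot_df[col])

# Min-max normalize each hyperparameter to [0, 1] (guard against a constant column)
for col in ['learning_rate', 'batch_size', 'weight_decay', 'dropout_rate', 'hidden_size']:
    col_range = plot_df[col].max() - plot_df[col].min()
    plot_df[col] = (plot_df[col] - plot_df[col].min()) / (col_range if col_range > 0 else 1.0)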

parallel_coordinates(plot_df, 'performance', ax=axes[1, 1], 
                    colormap='viridis', alpha=0.5)
axes[1, 1].set_title('Parallel Coordinates Plot', fontsize=14, fontweight='bold')
axes[1, 1].set_ylabel('Normalized Value', fontsize=12)
axes[1, 1].grid(True, alpha=0.3, axis='y')
axes[1, 1].legend(loc='upper right')

plt.tight_layout()
plt.show()

## Part 7: Experiment Tracking and Management

### 7.1 Create Experiment Logger

In [None]:
class ExperimentLogger:
    """Logger for tracking hyperparameter tuning experiments."""
    
    def __init__(self, experiment_name):
        self.experiment_name = experiment_name
        self.log_dir = experiments_dir / experiment_name
        self.log_dir.mkdir(parents=True, exist_ok=True)
        
        self.log_file = self.log_dir / 'experiments.json'
        self.experiments = self._load_experiments()
    
    def _load_experiments(self):
        """Load existing experiments."""
        if self.log_file.exists():
            with open(self.log_file, 'r') as f:
                return json.load(f)
        return []
    
    def log_experiment(self, hyperparams, metrics, metadata=None):
        """Log a single experiment."""
        experiment = {
            'id': len(self.experiments) + 1,
            'timestamp': datetime.now().isoformat(),
            'hyperparams': hyperparams,
            'metrics': metrics,
            'metadata': metadata or {}
        }
        
        self.experiments.append(experiment)
        self._save_experiments()
        
        return experiment['id']
    
    def _save_experiments(self):
        """Save experiments to file."""
        with open(self.log_file, 'w') as f:
            json.dump(self.experiments, f, indent=2)
    
    def get_best_experiment(self, metric='val_acc', maximize=True):
        """Get best experiment by metric."""
        if not self.experiments:
            return None
        
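        # Experiments missing the metric should never win, whichever direction
        # we optimize in, so rank them worst instead of defaulting to 0
        missing = float('-inf') if maximize else float('inf')
        select = max if maximize else min
        
        return select(
            self.experiments,
            key=lambda x: x['metrics'].get(metric, missing)
        )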
    
    def summary(self):
        """Print summary of experiments."""
        if not self.experiments:
            print("No experiments logged.")
            return
        
        df = pd.DataFrame([{
            'id': exp['id'],
            **exp['hyperparams'],
            **exp['metrics']
        } for exp in self.experiments])
        
        print(f"\nExperiment Summary: {self.experiment_name}")
        print("="*100)
        print(f"Total experiments: {len(self.experiments)}")
        print(f"Best validation accuracy: {df['val_acc'].max():.4f}")
        print(f"Mean validation accuracy: {df['val_acc'].mean():.4f}")
        print(f"Std validation accuracy: {df['val_acc'].std():.4f}")
        print("="*100)
        
        return df

# Create logger
logger = ExperimentLogger('cifar10_tuning')

# Log all random search experiments
for result in random_results:
    logger.log_experiment(
        hyperparams=result['hyperparams'],
        metrics={
            'val_acc': result['val_acc'],
            'train_time': result['train_time']
        },
        metadata={'method': 'random_search'}
    )

# Print summary
summary_df = logger.summary()
print("\nTop 5 Experiments:")
print(summary_df.nlargest(5, 'val_acc').to_string(index=False))

# Get best experiment
best_exp = logger.get_best_experiment()
print(f"\nBest Experiment ID: {best_exp['id']}")
print(f"Hyperparameters: {best_exp['hyperparams']}")
print(f"Validation Accuracy: {best_exp['metrics']['val_acc']:.4f}")

## Part 8: Best Practices and Strategies

### 8.1 Coarse-to-Fine Search

**Strategy:**
1. **Coarse search:** Wide range, few iterations
2. **Identify promising region**
3. **Fine search:** Narrow range, more iterations
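
A minimal sketch of the two stages, assuming a single hyperparameter (learning rate) and a toy objective in place of a real training run. The function `toy_val_loss`, its optimum at 3e-3, the trial counts, and the half-decade refinement window are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
LOW, HIGH = 1e-5, 1e-1

def toy_val_loss(lr):
    """Hypothetical stand-in for a training run: lowest loss near lr = 3e-3."""
    return (np.log10(lr) - np.log10(3e-3)) ** 2

def random_search_lr(low, high, n_trials):
    """Log-uniform random search; returns (best_lr, best_loss)."""
    lrs = 10 ** rng.uniform(np.log10(low), np.log10(high), size=n_trials)
    losses = toy_val_loss(lrs)
    best = int(np.argmin(losses))
    return float(lrs[best]), float(losses[best])

# Stage 1: coarse search over the full range
coarse_lr, coarse_loss = random_search_lr(LOW, HIGH, n_trials=10)

# Stage 2: fine search within half a decade of the coarse winner,
# clipped to the original bounds
half_decade = 10 ** 0.5
fine_low = max(LOW, coarse_lr / half_decade)
fine_high = min(HIGH, coarse_lr * half_decade)
fine_lr, fine_loss = random_search_lr(fine_low, fine_high, n_trials=10)

# Keep whichever stage produced the better result
best_lr, best_loss = min([(coarse_lr, coarse_loss), (fine_lr, fine_loss)],
                         key=lambda t: t[1])
print(f"coarse best: lr={coarse_lr:.2e}, loss={coarse_loss:.4f}")
print(f"final best:  lr={best_lr:.2e}, loss={best_loss:.4f}")
```

In a real search the coarse stage would also use fewer epochs per trial, since it only needs to rank regions, not produce a final model.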

### 8.2 Budget Allocation

**Successive Halving:**
1. Train N configurations for k epochs
2. Keep top N/2, train for 2k epochs
3. Repeat until 1 configuration remains
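
The steps above can be sketched as follows. This is a simplified illustration, not a production scheduler: `toy_val_acc` is a hypothetical stand-in for "train this configuration for this many epochs and return validation accuracy", and each round is scored from scratch instead of resuming from a checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_val_acc(config, epochs):
    """Hypothetical training run: accuracy peaks for lr near 10^-2.5 and grows with epochs."""
    quality = 1.0 - abs(np.log10(config["lr"]) + 2.5) / 4.0
    return quality * (1.0 - np.exp(-epochs / 4.0))

def successive_halving(configs, min_epochs=1):
    """Halve the candidate pool and double the epoch budget each round."""
    survivors, epochs, history = list(configs), min_epochs, []
    while len(survivors) > 1:
        history.append((len(survivors), epochs))
        ranked = sorted(survivors, key=lambda c: toy_val_acc(c, epochs), reverse=True)
        survivors = ranked[: max(1, len(survivors) // 2)]
        epochs *= 2
    return survivors[0], history

configs = [{"lr": float(10 ** rng.uniform(-5, -1))} for _ in range(16)]
winner, history = successive_halving(configs)

for n_configs, n_epochs in history:
    print(f"{n_configs:>2} configs trained for {n_epochs} epochs")
print(f"winner: lr = {winner['lr']:.2e}")
```

In this toy objective the ranking of configurations is the same at every epoch count, so halving never discards the eventual winner. Real learning curves can cross, which is the main risk of aggressive early elimination.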

**Early Stopping:**
- Monitor validation loss
- Stop if no improvement for n epochs
- Saves time on poor configurations
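
A patience-based stopper along these lines might look like the sketch below. The class name `EarlyStopping`, its parameters, and the example loss values are illustrative assumptions and not an API from PyTorch.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta      # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: improvement stalls after epoch 3
val_losses = [1.00, 0.80, 0.70, 0.71, 0.72, 0.73, 0.60]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best loss {stopper.best_loss:.2f}")
```

The example also shows the cost of a short patience: training stops at epoch 6 and never sees the better loss at epoch 7.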

### 8.3 Hyperparameter Importance

**Generally most important (in order):**
1. **Learning rate** - Most critical
2. **Architecture** (depth, width)
3. **Batch size**
4. **Regularization** (dropout, weight decay)
5. **Optimizer parameters** (momentum, betas)

**Guideline:** Start with LR finder, then tune architecture, then regularization.

### 8.4 Common Pitfalls

1. **Not using log scale for LR, weight decay**
   - Solution: Always use log scale for multiplicative parameters

2. **Overfitting to validation set**
   - Solution: Use separate test set, limit tuning iterations

3. **Ignoring computational budget**
   - Solution: Use early stopping, successive halving

4. **Not tracking experiments**
   - Solution: Always log hyperparameters and metrics

5. **Optimizing too many hyperparameters**
   - Solution: Focus on most important ones first

## Part 9: Advanced - Bayesian Optimization (Optional)

### 9.1 Theory

**Idea:** Build probabilistic model of $f(\lambda)$ (validation performance)

**Algorithm:**
1. Train surrogate model (Gaussian Process) on observed $(\lambda, f(\lambda))$ pairs
2. Use acquisition function to select next $\lambda$ to try
3. Evaluate $f(\lambda)$, update surrogate
4. Repeat

**Acquisition Functions:**
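- **Expected Improvement (EI):** $EI(\lambda) = \mathbb{E}[\max(f(\lambda) - f(\lambda^+), 0)]$, where $\lambda^+$ is the best configuration observed so far and $f$ is a score to maximize (e.g., validation accuracy)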
- **Upper Confidence Bound (UCB):** $UCB(\lambda) = \mu(\lambda) + \kappa \sigma(\lambda)$

**Trade-off:** Exploration (high uncertainty) vs Exploitation (high mean)
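
When the surrogate's prediction at a point is Gaussian with mean $\mu$ and standard deviation $\sigma$, both acquisition functions have simple closed forms. The sketch below evaluates them for two hypothetical candidates, assuming $f$ is maximized and `f_best` is the best score observed so far. The function names and the candidate values are made up for illustration, and no surrogate model is fitted here.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for maximization under a Gaussian prediction N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    return (mu - f_best) * normal_cdf(z) + sigma * normal_pdf(z)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB = mu + kappa * sigma; larger kappa favors exploration."""
    return mu + kappa * sigma

f_best = 0.80  # best validation accuracy observed so far (made-up value)
candidates = {
    "A (near the incumbent, low uncertainty)": (0.80, 0.01),
    "B (lower mean, high uncertainty)": (0.78, 0.05),
}
for name, (mu, sigma) in candidates.items():
    ei = expected_improvement(mu, sigma, f_best)
    ucb = upper_confidence_bound(mu, sigma)
    print(f"{name}: EI = {ei:.4f}, UCB = {ucb:.2f}")
```

Candidate B has the lower predicted mean but scores higher on both criteria, because its larger uncertainty leaves more room for improvement. This is the exploration side of the trade-off.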

**Note:** This section is optional as it requires additional dependencies.

---

## Summary

Congratulations! You've mastered systematic hyperparameter optimization. You now understand:

✅ The hyperparameter optimization problem (inner vs outer loop)  
✅ Learning rate finder (critical first step)  
✅ Grid search (exhaustive but expensive)  
✅ Random search (more efficient for high dimensions)  
✅ Hyperparameter landscape visualization  
✅ Experiment tracking and management  
✅ Best practices (coarse-to-fine, budget allocation)  
✅ Bayesian optimization principles  

**Key Insights:**
1. **Always start with LR finder** - Most critical hyperparameter
2. **Random search > Grid search** for high-dimensional spaces
3. **Use log scale** for learning rate, weight decay, etc.
4. **Track everything** - Reproducibility is crucial
5. **Budget wisely** - Use early stopping, successive halving

**Typical Workflow:**
1. Run LR finder → Get LR range
2. Coarse random search → Identify promising region
3. Fine random/grid search → Refine best configuration
4. (Optional) Bayesian optimization → Final tuning

**Time spent:** ~3-4 hours

**Next:** Day 9 - Advanced CNN Architectures (ResNet, VGG from papers)