# Day 04: Advanced Optimization and Gradient Dynamics

**Goal:** Master modern optimization algorithms and understand gradient flow in deep networks.

**Prerequisites:** You understand gradient descent, the chain rule, and basic optimization theory. This notebook focuses on practical deep learning challenges.

**Time estimate:** 3-4 hours

## Overview

We'll explore:
1. **Mathematical foundations** of modern optimizers
2. **From-scratch implementations**: SGD+Momentum, Nesterov, RMSprop, Adam
3. **Convergence analysis** and learning rate sensitivity
4. **Gradient flow** in deep networks
5. **Numerical stability** considerations

## Mathematical Foundations

### 1. Stochastic Gradient Descent (SGD)

Basic update rule:
$$\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)$$

**Issues:**
- High variance in gradient estimates
- Poor conditioning (different scales in different dimensions)
- Oscillation in ravines

---

### 2. SGD with Momentum (Polyak, 1964)

Introduces velocity to smooth updates:

$$v_{t+1} = \beta v_t + \nabla L(\theta_t)$$
$$\theta_{t+1} = \theta_t - \alpha v_{t+1}$$

**Interpretation:**
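- Exponentially weighted sum of past gradients (an unnormalized moving average, since there is no $(1-\beta)$ factor on the gradient)
- $\beta \in [0, 1)$ controls momentum (typically 0.9)
- Accelerates in consistent directions
- Dampens oscillations

**Mathematical insight:** With $\beta = 0.9$, the weights $\beta^k$ sum to $\frac{1}{1-\beta} = 10$, so the velocity behaves like a sum over roughly the last 10 gradients. Under a constant gradient, the effective step size is therefore about $\frac{\alpha}{1-\beta} = 10\alpha$.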
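
The $\frac{1}{1-\beta}$ factor can be checked numerically. The following is a minimal sketch using plain Python floats rather than tensors; the constant gradient `grad = 1.0` is a hypothetical value chosen only to make the geometric series visible.

```python
# Minimal sketch: heavy-ball velocity under a constant gradient (plain floats, no tensors).
beta = 0.9
grad = 1.0          # hypothetical constant gradient
v = 0.0
history = []
for _ in range(200):
    v = beta * v + grad   # v_{t+1} = beta * v_t + grad
    history.append(v)

steady_state = grad / (1 - beta)   # limit of the geometric series
print(f"v after 10 steps:  {history[9]:.3f}")
print(f"v after 200 steps: {history[-1]:.3f}")
print(f"limit g/(1-beta):  {steady_state:.3f}")
```

The velocity approaches `grad / (1 - beta)`, which is why a momentum run with learning rate $\alpha$ can move about as fast as plain SGD with a learning rate ten times larger when gradients point in a consistent direction.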

---

### 3. Nesterov Accelerated Gradient (NAG)

"Look-ahead" gradient:

$$v_{t+1} = \beta v_t + \nabla L(\theta_t - \alpha \beta v_t)$$
$$\theta_{t+1} = \theta_t - \alpha v_{t+1}$$

**Key difference:** Evaluate gradient at the "look-ahead" position $\theta_t - \alpha \beta v_t$

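**Advantage:** Better convergence rate for smooth convex problems with exact gradients: $O(1/t^2)$, versus $O(1/t)$ for plain gradient descent (heavy-ball momentum does not improve on this rate in general).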
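
In practice the look-ahead gradient is rarely computed by moving the parameters twice. A common reformulation stores the look-ahead point $\phi_t = \theta_t - \alpha \beta v_t$ as the parameters, which gives $\phi_{t+1} = \phi_t - \alpha(g_t + \beta v_{t+1})$ with $g_t = \nabla L(\phi_t)$. The sketch below checks that the two forms agree on a toy 1D quadratic; the curvature `a`, the step size, and the function names are assumptions made for illustration.

```python
def grad(x, a=4.0):
    # Gradient of the toy loss L(x) = 0.5 * a * x**2 (a is a hypothetical curvature)
    return a * x

alpha, beta = 0.05, 0.9

# Form 1: textbook look-ahead. Stores theta, evaluates the gradient at theta - alpha*beta*v.
theta, v1 = 1.0, 0.0
# Form 2: stores the look-ahead point phi directly and uses the plain gradient at phi.
phi, v2 = 1.0, 0.0   # phi_0 = theta_0 because v_0 = 0

max_gap = 0.0
for _ in range(50):
    v1 = beta * v1 + grad(theta - alpha * beta * v1)
    theta = theta - alpha * v1

    g = grad(phi)
    v2 = beta * v2 + g
    phi = phi - alpha * (g + beta * v2)

    # phi should equal the next look-ahead point theta - alpha*beta*v
    max_gap = max(max_gap, abs(phi - (theta - alpha * beta * v1)))

print(f"largest mismatch over 50 steps: {max_gap:.2e}")
```

The mismatch stays at floating-point noise, so an optimizer can apply the second form using only the gradient that `backward()` already produced at the current parameters.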

---

### 4. RMSprop (Hinton, 2012)

Adaptive learning rates per parameter:

$$v_{t+1} = \beta v_t + (1-\beta)(\nabla L(\theta_t))^2$$
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{v_{t+1}} + \epsilon} \nabla L(\theta_t)$$

**Interpretation:**
- Divide learning rate by (running average of) gradient magnitude
- Large gradients → smaller effective learning rate
- Small gradients → larger effective learning rate
- Connection to **diagonal preconditioning**: $D^{-1/2}$ where $D = \text{diag}(v_t)$
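
To make the per-parameter scaling concrete, here is a small sketch with two coordinates whose gradients differ by four orders of magnitude. It uses plain floats, constant gradients, and $\epsilon$ added outside the square root (as the implementation in this notebook does); the gradient values are hypothetical.

```python
import math

alpha, beta, eps = 0.01, 0.9, 1e-8
grads = [100.0, 0.01]      # hypothetical: two coordinates with very different gradient scales
v = [0.0, 0.0]

for _ in range(200):       # constant gradients, long enough for v to settle near grad**2
    v = [beta * vi + (1 - beta) * g * g for vi, g in zip(v, grads)]

rms_steps = [alpha * g / (math.sqrt(vi) + eps) for vi, g in zip(v, grads)]
sgd_steps = [alpha * g for g in grads]
print("plain SGD steps:", sgd_steps)
print("RMSprop steps:  ", rms_steps)
```

Plain SGD takes steps that differ by the same factor as the gradients, while the RMSprop steps are both close to `alpha`: dividing by the running root-mean-square removes the gradient scale from each coordinate.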

---

### 5. Adam (Kingma & Ba, 2015)

Combines momentum + RMSprop:

$$m_{t+1} = \beta_1 m_t + (1-\beta_1)\nabla L(\theta_t) \quad \text{(first moment)}$$
$$v_{t+1} = \beta_2 v_t + (1-\beta_2)(\nabla L(\theta_t))^2 \quad \text{(second moment)}$$

**Bias correction** (important for early iterations):
$$\hat{m}_{t+1} = \frac{m_{t+1}}{1-\beta_1^{t+1}}, \quad \hat{v}_{t+1} = \frac{v_{t+1}}{1-\beta_2^{t+1}}$$

**Update:**
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_{t+1}} + \epsilon} \hat{m}_{t+1}$$

**Typical hyperparameters:** $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$

**Why it works:**
- Adapts learning rates per parameter
- Momentum smooths noisy gradients, and the per-parameter scaling helps with sparse gradients
- Bias correction prevents initial underestimation
- Effective step size roughly bounded by $\alpha$
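
The effect of bias correction is easiest to see on the very first step, where $m_0 = v_0 = 0$. The sketch below evaluates the first Adam update for a scalar gradient with and without the correction; `adam_first_step` is a hypothetical helper written for this illustration, not part of any library.

```python
import math

def adam_first_step(g, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the very first Adam update for a scalar gradient g (m_0 = v_0 = 0)."""
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** 1)     # bias correction at t = 1
    v_hat = v / (1 - beta2 ** 1)
    corrected = alpha * m_hat / (math.sqrt(v_hat) + eps)
    uncorrected = alpha * m / (math.sqrt(v) + eps)
    return corrected, uncorrected

for g in (1e-3, 1.0, 1e3):
    corrected, uncorrected = adam_first_step(g)
    print(f"g={g:g}: corrected={corrected:.6f}, uncorrected={uncorrected:.6f}")
```

With the correction, the first step has magnitude close to `alpha` regardless of the gradient scale. Without it, both moments are underestimated, but the second moment more so (because $\beta_2 > \beta_1$), which inflates the first steps by roughly a factor of three with the default hyperparameters.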

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import numpy as np
from collections import defaultdict
import copy

# Set seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 1. Implementing Optimizers from Scratch

We'll implement all major optimizers to understand their mechanics.

In [None]:
class SGDMomentum:
    """
    SGD with Momentum (Polyak)
    
    Update rules:
        v_t = beta * v_{t-1} + grad
        theta_t = theta_{t-1} - lr * v_t
    """
    def __init__(self, params, lr=0.01, momentum=0.9):
        self.params = list(params)
        self.lr = lr
        self.momentum = momentum
        self.velocity = [torch.zeros_like(p.data) for p in self.params]
    
    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()
    
    def step(self):
        with torch.no_grad():
            for i, p in enumerate(self.params):
                if p.grad is None:
                    continue
                
                # v_t = beta * v_{t-1} + grad
                self.velocity[i].mul_(self.momentum).add_(p.grad)
                
                # theta_t = theta_{t-1} - lr * v_t
                p.data.add_(self.velocity[i], alpha=-self.lr)


class NesterovMomentum:
    """
    Nesterov Accelerated Gradient
    
    Look-ahead gradient evaluation
    """
    def __init__(self, params, lr=0.01, momentum=0.9):
        self.params = list(params)
        self.lr = lr
        self.momentum = momentum
        self.velocity = [torch.zeros_like(p.data) for p in self.params]
    
    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()
    
    def step(self):
        with torch.no_grad():
            for i, p in enumerate(self.params):
                if p.grad is None:
                    continue
                
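                # Nesterov momentum: v_t = beta * v_{t-1} + grad
                self.velocity[i].mul_(self.momentum).add_(p.grad)
                
                # Look-ahead update: theta_t = theta_{t-1} - lr * (grad + beta * v_t).
                # The stored parameters play the role of the look-ahead point, so the
                # gradient from backward() is already the look-ahead gradient.
                p.data.add_(p.grad + self.momentum * self.velocity[i], alpha=-self.lr)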


class RMSpropOptimizer:
    """
    RMSprop: Root Mean Square Propagation
    
    Adaptive learning rates based on recent gradient magnitudes
    """
    def __init__(self, params, lr=0.01, beta=0.9, eps=1e-8):
        self.params = list(params)
        self.lr = lr
        self.beta = beta
        self.eps = eps
        self.v = [torch.zeros_like(p.data) for p in self.params]
    
    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()
    
    def step(self):
        with torch.no_grad():
            for i, p in enumerate(self.params):
                if p.grad is None:
                    continue
                
                # v_t = beta * v_{t-1} + (1-beta) * grad^2
                self.v[i].mul_(self.beta).addcmul_(p.grad, p.grad, value=1-self.beta)
                
                # theta_t = theta_{t-1} - lr * grad / (sqrt(v_t) + eps)
                p.data.addcdiv_(p.grad, self.v[i].sqrt().add_(self.eps), value=-self.lr)


class AdamOptimizer:
    """
    Adam: Adaptive Moment Estimation
    
    Combines momentum (first moment) and RMSprop (second moment)
    with bias correction
    """
    def __init__(self, params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.params = list(params)
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        
        # First moment (momentum)
        self.m = [torch.zeros_like(p.data) for p in self.params]
        # Second moment (RMSprop)
        self.v = [torch.zeros_like(p.data) for p in self.params]
        
        self.t = 0  # Timestep for bias correction
    
    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()
    
    def step(self):
        self.t += 1
        
        with torch.no_grad():
            for i, p in enumerate(self.params):
                if p.grad is None:
                    continue
                
                # First moment: m_t = beta1 * m_{t-1} + (1-beta1) * grad
                self.m[i].mul_(self.beta1).add_(p.grad, alpha=1-self.beta1)
                
                # Second moment: v_t = beta2 * v_{t-1} + (1-beta2) * grad^2
                self.v[i].mul_(self.beta2).addcmul_(p.grad, p.grad, value=1-self.beta2)
                
                # Bias correction
                m_hat = self.m[i] / (1 - self.beta1 ** self.t)
                v_hat = self.v[i] / (1 - self.beta2 ** self.t)
                
                # Update: theta_t = theta_{t-1} - lr * m_hat / (sqrt(v_hat) + eps)
                p.data.addcdiv_(m_hat, v_hat.sqrt().add_(self.eps), value=-self.lr)


print("Custom optimizers implemented:")
print("  - SGDMomentum")
print("  - NesterovMomentum")
print("  - RMSpropOptimizer")
print("  - AdamOptimizer")

## 2. Setup: Model and Data

In [None]:
# Load MNIST
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('./data', train=False, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")


# Simple MLP for experiments
class SimpleMLP(nn.Module):
    def __init__(self, input_size=784, hidden_size=128, num_classes=10):
        super(SimpleMLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, num_classes)
    
    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

print("\nModel architecture:")
model = SimpleMLP().to(device)
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

## 3. Training Function with Metrics

In [None]:
def train_with_optimizer(model, optimizer, num_epochs=5, verbose=True):
    """
    Train model and track detailed metrics
    
    Returns:
        history: dict with losses, accuracies, gradient norms
    """
    criterion = nn.CrossEntropyLoss()
    
    history = {
        'train_loss': [],
        'train_acc': [],
        'test_acc': [],
        'grad_norms': [],
    }
    
    for epoch in range(num_epochs):
        # Training
        model.train()
        train_loss = 0
        correct = 0
        total = 0
        epoch_grad_norms = []
        
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            
            # Compute gradient norm
            total_norm = 0
            for p in model.parameters():
                if p.grad is not None:
                    param_norm = p.grad.data.norm(2)
                    total_norm += param_norm.item() ** 2
            total_norm = total_norm ** 0.5
            epoch_grad_norms.append(total_norm)
            
            optimizer.step()
            
            train_loss += loss.item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
            total += target.size(0)
        
        train_loss /= len(train_loader)
        train_acc = 100. * correct / total
        
        # Testing
        model.eval()
        test_correct = 0
        test_total = 0
        
        with torch.no_grad():
            for data, target in test_loader:
                data, target = data.to(device), target.to(device)
                output = model(data)
                pred = output.argmax(dim=1, keepdim=True)
                test_correct += pred.eq(target.view_as(pred)).sum().item()
                test_total += target.size(0)
        
        test_acc = 100. * test_correct / test_total
        
        history['train_loss'].append(train_loss)
        history['train_acc'].append(train_acc)
        history['test_acc'].append(test_acc)
        history['grad_norms'].append(np.mean(epoch_grad_norms))
        
        if verbose:
            print(f'Epoch {epoch+1}/{num_epochs}: '
                  f'Loss={train_loss:.4f}, '
                  f'Train Acc={train_acc:.2f}%, '
                  f'Test Acc={test_acc:.2f}%, '
                  f'Grad Norm={history["grad_norms"][-1]:.4f}')
    
    return history

## 4. Optimizer Comparison Experiments

In [None]:
# Define optimizers to compare
lr = 0.01
num_epochs = 10

optimizers_config = [
    ('SGD (no momentum)', lambda p: torch.optim.SGD(p, lr=lr)),
    ('SGD + Momentum', lambda p: SGDMomentum(p, lr=lr, momentum=0.9)),
    ('Nesterov', lambda p: NesterovMomentum(p, lr=lr, momentum=0.9)),
    ('RMSprop', lambda p: RMSpropOptimizer(p, lr=lr)),
    ('Adam', lambda p: AdamOptimizer(p, lr=lr)),
]

results = {}

for name, opt_fn in optimizers_config:
    print(f"\n{'='*60}")
    print(f"Training with {name}")
    print(f"{'='*60}")
    
    # Fresh model for each optimizer
    model = SimpleMLP().to(device)
    optimizer = opt_fn(model.parameters())
    
    history = train_with_optimizer(model, optimizer, num_epochs=num_epochs)
    results[name] = history
    
    print(f"Final test accuracy: {history['test_acc'][-1]:.2f}%")

In [None]:
# Plot comprehensive comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

epochs = range(1, num_epochs + 1)
colors = ['b', 'g', 'r', 'c', 'm']
markers = ['o', 's', '^', 'D', 'v']

# Training loss
for (name, history), color, marker in zip(results.items(), colors, markers):
    axes[0, 0].plot(epochs, history['train_loss'], 
                    label=name, color=color, marker=marker, linewidth=2)
axes[0, 0].set_xlabel('Epoch', fontsize=12)
axes[0, 0].set_ylabel('Training Loss', fontsize=12)
axes[0, 0].set_title('Training Loss Comparison', fontsize=14, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Test accuracy
for (name, history), color, marker in zip(results.items(), colors, markers):
    axes[0, 1].plot(epochs, history['test_acc'],
                    label=name, color=color, marker=marker, linewidth=2)
axes[0, 1].set_xlabel('Epoch', fontsize=12)
axes[0, 1].set_ylabel('Test Accuracy (%)', fontsize=12)
axes[0, 1].set_title('Test Accuracy Comparison', fontsize=14, fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Gradient norms
for (name, history), color, marker in zip(results.items(), colors, markers):
    axes[1, 0].plot(epochs, history['grad_norms'],
                    label=name, color=color, marker=marker, linewidth=2)
axes[1, 0].set_xlabel('Epoch', fontsize=12)
axes[1, 0].set_ylabel('Gradient Norm', fontsize=12)
axes[1, 0].set_title('Gradient Norm Evolution', fontsize=14, fontweight='bold')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].set_yscale('log')

# Final accuracy bar chart
names = list(results.keys())
final_accs = [results[name]['test_acc'][-1] for name in names]
bars = axes[1, 1].bar(range(len(names)), final_accs, color=colors)
axes[1, 1].set_xticks(range(len(names)))
axes[1, 1].set_xticklabels(names, rotation=45, ha='right')
axes[1, 1].set_ylabel('Final Test Accuracy (%)', fontsize=12)
axes[1, 1].set_title('Final Performance Comparison', fontsize=14, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, acc in zip(bars, final_accs):
    height = bar.get_height()
    axes[1, 1].text(bar.get_x() + bar.get_width()/2., height,
                    f'{acc:.1f}%', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

# Print summary table
print("\n" + "="*70)
print("SUMMARY: Final Test Accuracies")
print("="*70)
for name in names:
    final_acc = results[name]['test_acc'][-1]
    print(f"{name:20s}: {final_acc:6.2f}%")
print("="*70)

## 5. Learning Rate Sensitivity Analysis

Different optimizers have different sensitivity to learning rate.
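
For intuition about why the learning rate matters so much, consider plain gradient descent on a 1D quadratic $L(x) = \frac{1}{2} c x^2$: each step multiplies $x$ by $(1 - \alpha c)$, so the iteration converges only when $\alpha < 2/c$. The sketch below illustrates this threshold; the curvature value and the `run_gd` helper are assumptions made for this example.

```python
def run_gd(step_size, curvature=10.0, x0=1.0, steps=50):
    """Plain gradient descent on L(x) = 0.5 * curvature * x**2; returns |x| at the end."""
    x = x0
    for _ in range(steps):
        x -= step_size * curvature * x      # gradient is curvature * x
    return abs(x)

curvature = 10.0    # hypothetical curvature; the stability threshold is 2 / curvature = 0.2
for step_size in (0.01, 0.1, 0.19, 0.21):
    print(f"step_size={step_size}: |x| after 50 steps = {run_gd(step_size, curvature):.3e}")
```

Step sizes just below the threshold still converge (with oscillation), while anything above it diverges. Adaptive methods rescale each coordinate by its gradient magnitude, which is one reason their usable learning-rate range tends to be wider.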

In [None]:
# Test different learning rates
learning_rates = [0.0001, 0.001, 0.01, 0.1]
lr_results = {}

# Test Adam with different LRs
print("Testing Adam with different learning rates...\n")
for lr in learning_rates:
    print(f"Learning rate: {lr}")
    model = SimpleMLP().to(device)
    optimizer = AdamOptimizer(model.parameters(), lr=lr)
    history = train_with_optimizer(model, optimizer, num_epochs=5, verbose=False)
    lr_results[lr] = history
    print(f"  Final test accuracy: {history['test_acc'][-1]:.2f}%\n")

# Plot
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for lr in learning_rates:
    plt.plot(lr_results[lr]['train_loss'], label=f'LR={lr}', marker='o')
plt.xlabel('Epoch')
plt.ylabel('Training Loss')
plt.title('Adam: Learning Rate Sensitivity (Loss)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')

plt.subplot(1, 2, 2)
for lr in learning_rates:
    plt.plot(lr_results[lr]['test_acc'], label=f'LR={lr}', marker='s')
plt.xlabel('Epoch')
plt.ylabel('Test Accuracy (%)')
plt.title('Adam: Learning Rate Sensitivity (Accuracy)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nObservation: Adam tolerates a fairly wide range of learning rates,")
print("but LR=0.001 or 0.01 typically works best; compare with the curves above.")

## 6. Gradient Flow Analysis

Study how gradients flow through layers in a deeper network.

In [None]:
class DeepMLP(nn.Module):
    """Deeper network to study gradient flow"""
    def __init__(self, input_size=784, hidden_sizes=(256, 128, 64, 32), num_classes=10):
        super(DeepMLP, self).__init__()
        
        layers = []
        prev_size = input_size
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            prev_size = hidden_size
        layers.append(nn.Linear(prev_size, num_classes))
        
        self.network = nn.Sequential(*layers)
        self.layer_names = [f'Layer{i}' for i in range(len(hidden_sizes) + 1)]
    
    def forward(self, x):
        x = x.view(x.size(0), -1)
        return self.network(x)

model_deep = DeepMLP().to(device)
print(model_deep)
print(f"\nTotal parameters: {sum(p.numel() for p in model_deep.parameters()):,}")

In [None]:
def analyze_gradient_flow(model, data_loader, num_batches=10):
    """
    Analyze gradient magnitudes per layer
    """
    criterion = nn.CrossEntropyLoss()
    
    # Get linear layers
    linear_layers = [module for module in model.modules() if isinstance(module, nn.Linear)]
    layer_names = [f'Layer {i+1}' for i in range(len(linear_layers))]
    
    grad_norms = {name: [] for name in layer_names}
    
    model.eval()
    for batch_idx, (data, target) in enumerate(data_loader):
        if batch_idx >= num_batches:
            break
        
        data, target = data.to(device), target.to(device)
        
        # Forward and backward
        output = model(data)
        loss = criterion(output, target)
        
        model.zero_grad()
        loss.backward()
        
        # Record gradient norms
        for name, layer in zip(layer_names, linear_layers):
            if layer.weight.grad is not None:
                grad_norm = layer.weight.grad.norm().item()
                grad_norms[name].append(grad_norm)
    
    # Average over batches
    avg_grad_norms = {name: np.mean(norms) for name, norms in grad_norms.items()}
    return avg_grad_norms


# Analyze gradient flow in the deep MLP (no batch norm)
print("Analyzing gradient flow in deep network...")
grad_analysis = analyze_gradient_flow(model_deep, train_loader, num_batches=20)

# Plot
plt.figure(figsize=(10, 6))
layers = list(grad_analysis.keys())
norms = list(grad_analysis.values())

plt.bar(layers, norms, color='steelblue', alpha=0.7)
plt.xlabel('Layer', fontsize=12)
plt.ylabel('Average Gradient Norm', fontsize=12)
plt.title('Gradient Flow Through Layers', fontsize=14, fontweight='bold')
plt.xticks(rotation=45)
plt.yscale('log')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nGradient Norms by Layer:")
for layer, norm in grad_analysis.items():
    print(f"  {layer}: {norm:.6f}")

# Check for large differences in gradient magnitude across layers
min_norm = min(norms)
max_norm = max(norms)
ratio = max_norm / max(min_norm, 1e-12)  # guard against division by zero

print(f"\nGradient magnitude ratio (max/min): {ratio:.2f}")
if ratio > 100:
    print("⚠️ Warning: Large gradient magnitude differences detected!")
    print("   Consider: batch normalization, gradient clipping, or better initialization")
else:
    print("✓ Gradient flow appears stable")

## 7. Advanced Topics: Loss Landscape Visualization

Visualize the loss landscape around the trained parameters (a 2D slice along two random directions).
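
The idea of a 2D slice can be sketched without a network: pick a center point and two unit directions, then evaluate the loss on a grid of $(\alpha, \beta)$ offsets. In the sketch below, `toy_loss` is a hypothetical stand-in for the network loss, and all names are illustrative rather than taken from this notebook's code.

```python
import random

random.seed(0)

def toy_loss(params):
    # Hypothetical stand-in for the network loss: an anisotropic quadratic bowl
    return sum((i + 1) * p * p for i, p in enumerate(params))

def unit_direction(dim):
    # Random Gaussian direction rescaled to unit length
    d = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in d) ** 0.5
    return [x / norm for x in d]

dim = 5
center = [0.0] * dim                  # minimizer of toy_loss
d1, d2 = unit_direction(dim), unit_direction(dim)

coords = [-0.5 + 0.25 * k for k in range(5)]     # grid from -0.5 to 0.5
grid = [[toy_loss([c + a * x + b * y for c, x, y in zip(center, d1, d2)])
         for b in coords] for a in coords]

for row in grid:
    print(" ".join(f"{val:6.3f}" for val in row))
```

Each printed row fixes one offset along the first direction and sweeps the second. The smallest value sits at the grid center because the center was chosen as the minimizer; for a trained network the center is only the point training reached, so lower values can appear elsewhere on the slice.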

In [None]:
def compute_loss_landscape_2d(model, data_loader, center_params, 
                              direction1, direction2, alpha_range=(-1, 1), 
                              beta_range=(-1, 1), resolution=20):
    """
    Compute 2D slice of loss landscape
    
    L(alpha, beta) = Loss(center_params + alpha*direction1 + beta*direction2)
    """
    criterion = nn.CrossEntropyLoss()
    
    alphas = np.linspace(alpha_range[0], alpha_range[1], resolution)
    betas = np.linspace(beta_range[0], beta_range[1], resolution)
    
    losses = np.zeros((resolution, resolution))
    
    # Sample batches for efficiency
    sample_data = []
    sample_targets = []
    for batch_idx, (data, target) in enumerate(data_loader):
        if batch_idx >= 5:  # Use 5 batches
            break
        sample_data.append(data)
        sample_targets.append(target)
    sample_data = torch.cat(sample_data).to(device)
    sample_targets = torch.cat(sample_targets).to(device)
    
    model.eval()
    with torch.no_grad():
        for i, alpha in enumerate(alphas):
            for j, beta in enumerate(betas):
                # Perturb parameters
                for p, p_center, d1, d2 in zip(model.parameters(), 
                                                center_params, direction1, direction2):
                    p.data = p_center + alpha * d1 + beta * d2
                
                # Compute loss
                output = model(sample_data)
                loss = criterion(output, sample_targets)
                losses[i, j] = loss.item()
        
        # Restore original parameters
        for p, p_center in zip(model.parameters(), center_params):
            p.data = p_center.clone()
    
    return alphas, betas, losses


# Train a small model first
print("Training model for loss landscape visualization...")
model_viz = SimpleMLP().to(device)
optimizer = AdamOptimizer(model_viz.parameters(), lr=0.01)
history = train_with_optimizer(model_viz, optimizer, num_epochs=5, verbose=False)
print(f"Model trained. Test accuracy: {history['test_acc'][-1]:.2f}%")

# Get trained parameters as center
center_params = [p.data.clone() for p in model_viz.parameters()]

# Generate random directions (normalized)
direction1 = [torch.randn_like(p) for p in model_viz.parameters()]
direction2 = [torch.randn_like(p) for p in model_viz.parameters()]

# Normalize directions
norm1 = sum(d.norm()**2 for d in direction1).sqrt()
norm2 = sum(d.norm()**2 for d in direction2).sqrt()
direction1 = [d / norm1 for d in direction1]
direction2 = [d / norm2 for d in direction2]

print("\nComputing loss landscape (this may take a minute)...")
alphas, betas, losses = compute_loss_landscape_2d(
    model_viz, train_loader, center_params, direction1, direction2,
    alpha_range=(-0.5, 0.5), beta_range=(-0.5, 0.5), resolution=15
)

# Plot
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.contourf(alphas, betas, losses.T, levels=20, cmap='viridis')
plt.colorbar(label='Loss')
plt.xlabel('Direction 1 (α)')
plt.ylabel('Direction 2 (β)')
plt.title('Loss Landscape (Contour)', fontweight='bold')
plt.plot(0, 0, 'r*', markersize=15, label='Trained solution')
plt.legend()

# Create the right-hand axes once, directly with a 3D projection
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (only needed on older matplotlib)
ax = plt.subplot(1, 2, 2, projection='3d')
A, B = np.meshgrid(alphas, betas)
ax.plot_surface(A, B, losses.T, cmap='viridis', alpha=0.8)
ax.set_xlabel('Direction 1 (α)')
ax.set_ylabel('Direction 2 (β)')
ax.set_zlabel('Loss')
ax.set_title('Loss Landscape (3D)', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nLoss landscape visualized!")
print(f"Min loss: {losses.min():.4f}")
print(f"Max loss: {losses.max():.4f}")

## 8. Key Takeaways

### Optimizer Characteristics:

1. **SGD (vanilla)**
   - ✅ Simple, well-understood
   - ❌ Slow convergence
   - ❌ Sensitive to learning rate
   - ❌ Poor in ravines

2. **SGD + Momentum**
   - ✅ Faster convergence
   - ✅ Dampens oscillations
   - ✅ Accelerates in consistent directions
   - ⚠️ Can overshoot

3. **Nesterov Momentum**
   - ✅ Better theoretical convergence rate (smooth convex problems)
   - ✅ "Look-ahead" gradient reduces overshooting
   - ⚠️ Slightly more complex

4. **RMSprop**
   - ✅ Adaptive per-parameter learning rates
   - ✅ Good for non-stationary objectives
   - ✅ Handles different scales well
   - ⚠️ No momentum

5. **Adam** (Most popular in practice)
   - ✅ Combines momentum + adaptive LR
   - ✅ Works well out-of-the-box
   - ✅ Robust to hyperparameter choices
   - ⚠️ Can converge to sharp minima
   - ⚠️ May need tuning for final accuracy

### Practical Guidelines:

1. **Default choice**: Adam with lr=0.001
2. **For final tuning**: Try SGD + momentum with decaying LR
3. **Learning rate**: Most important hyperparameter
4. **Gradient clipping**: Use when training RNNs or very deep networks
5. **Monitor gradients**: Watch for vanishing/exploding

### Mathematical Insights:

- **Momentum** ≈ exponentially weighted sum (unnormalized moving average) of gradients
- **Adam** ≈ diagonal preconditioning with momentum
- **Adaptive methods** adjust per-parameter learning rates
- **Batch normalization** smooths loss landscape → easier optimization

### When to Use What:

- **Quick experiments**: Adam
- **Final training**: SGD + momentum with LR schedule
- **RNNs**: Adam or RMSprop
- **Large batches**: Use higher learning rate with warmup
- **Small datasets**: Be careful with adaptive methods (can overfit)

## 9. Exercises

1. **Implement AdamW**: Add weight decay directly to update (not in gradient)
2. **Learning rate warmup**: Linearly increase LR for first N steps
3. **Gradient clipping**: Implement both norm-based and value-based clipping
4. **Second-order methods**: Implement L-BFGS and compare with first-order
5. **Hessian analysis**: Compute top eigenvalues of Hessian at optimum
6. **Convergence proof**: Prove momentum convergence for quadratic loss
7. **Adaptive clipping**: Implement gradient clipping that adapts to gradient history
8. **Loss landscape**: Create animated trajectory of optimizer through landscape