# Optimization Landscape Visualized

This notebook visualizes how different optimizers navigate loss surfaces. We'll implement SGD, Momentum, RMSprop, and Adam from scratch and animate their paths on 2D functions. Every gradient here is computed exactly, so "SGD" behaves as plain (full-batch) gradient descent.

**Goal:** Build intuition for optimizer behavior on different loss landscapes.

**Prerequisites:** [optimization.md](../math-foundations/optimization.md), [optimizers.md](../training/optimizers.md)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation
from IPython.display import HTML

np.random.seed(42)
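# 'seaborn-v0_8-whitegrid' exists from Matplotlib 3.6; older releases call it 'seaborn-whitegrid'
try:
    plt.style.use('seaborn-v0_8-whitegrid')
except OSError:
    plt.style.use('seaborn-whitegrid')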

## 1. Optimizers from Scratch

Let's implement each optimizer with clear, minimal code.

In [None]:
class SGD:
    """
    Vanilla Stochastic Gradient Descent.
    
    Update: θ = θ - lr * gradient
    """
    def __init__(self, lr=0.01):
        self.lr = lr
    
    def step(self, params, grads):
        return params - self.lr * grads


class Momentum:
    """
    SGD with Momentum.
    
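    Accumulates velocity from past gradients. The gradient is added without a
    (1 - β) factor, so under a steady gradient the step size approaches lr / (1 - β).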
    
    v = β * v + gradient
    θ = θ - lr * v
    """
    def __init__(self, lr=0.01, beta=0.9):
        self.lr = lr
        self.beta = beta
        self.v = None
    
    def step(self, params, grads):
        if self.v is None:
            self.v = np.zeros_like(params)
        
        self.v = self.beta * self.v + grads
        return params - self.lr * self.v


class RMSprop:
    """
    RMSprop: Adapts learning rate per parameter.
    
    Divides by running average of squared gradients.
    
    s = β * s + (1-β) * gradient²
    θ = θ - lr * gradient / (√s + ε)
    """
    def __init__(self, lr=0.01, beta=0.9, eps=1e-8):
        self.lr = lr
        self.beta = beta
        self.eps = eps
        self.s = None
    
    def step(self, params, grads):
        if self.s is None:
            self.s = np.zeros_like(params)
        
        self.s = self.beta * self.s + (1 - self.beta) * grads**2
        return params - self.lr * grads / (np.sqrt(self.s) + self.eps)


class Adam:
    """
    Adam: Combines momentum and adaptive learning rates.
    
    m = β1 * m + (1-β1) * gradient       (momentum)
    v = β2 * v + (1-β2) * gradient²      (RMSprop)
    
    # Bias correction (important early in training)
    m_hat = m / (1 - β1^t)
    v_hat = v / (1 - β2^t)
    
    θ = θ - lr * m_hat / (√v_hat + ε)
    """
    def __init__(self, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = None
        self.v = None
        self.t = 0
    
    def step(self, params, grads):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        
        self.t += 1
        
        # Update biased first and second moment estimates
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        self.v = self.beta2 * self.v + (1 - self.beta2) * grads**2
        
        # Bias correction
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)
        
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
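
One consequence of the bias correction is easy to check numerically: on the very first step `m_hat` equals the gradient and `v_hat` equals its square, so the update has a magnitude close to `lr` whatever the gradient scale. The sketch below re-derives this with plain NumPy. `adam_first_step` is a hypothetical helper written for this illustration, not part of the classes above.

```python
import numpy as np

def adam_first_step(grad, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's first update for a given gradient (hypothetical helper)."""
    grad = np.asarray(grad, dtype=float)
    m = (1 - beta1) * grad        # first moment after one step, starting from 0
    v = (1 - beta2) * grad**2     # second moment after one step, starting from 0
    m_hat = m / (1 - beta1)       # bias correction with t = 1 recovers grad
    v_hat = v / (1 - beta2)       # bias correction with t = 1 recovers grad**2
    return lr * m_hat / (np.sqrt(v_hat) + eps)

# Gradients spanning six orders of magnitude
first_steps = adam_first_step([1e-3, 1.0, 1e3])
print(first_steps)  # every entry is close to lr
```

Without the correction the first step would instead be scaled by (1 - β1) / √(1 - β2), which is roughly 3.2 with the default betas.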

## 2. Test Functions

Classic optimization test functions with different characteristics.

In [None]:
# Test functions and their gradients

def quadratic(x):
    """Simple convex quadratic bowl. Minimum at (0, 0)."""
    return x[0]**2 + x[1]**2

def quadratic_grad(x):
    return np.array([2*x[0], 2*x[1]])


def elongated_quadratic(x):
    """Elongated quadratic - different curvature in each direction."""
    return x[0]**2 + 10*x[1]**2

def elongated_quadratic_grad(x):
    return np.array([2*x[0], 20*x[1]])


def rosenbrock(x):
    """
    Rosenbrock function: classic difficult optimization problem.
    Minimum at (1, 1). Has a narrow curved valley.
    """
    return (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    dx = -2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2)
    dy = 200*(x[1] - x[0]**2)
    return np.array([dx, dy])


def saddle_point(x):
    """Saddle point at origin. Goes up in x, down in y."""
    return x[0]**2 - x[1]**2

def saddle_point_grad(x):
    return np.array([2*x[0], -2*x[1]])


def beale(x):
    """
    Beale function: multiple local minima.
    Global minimum at (3, 0.5).
    """
    term1 = (1.5 - x[0] + x[0]*x[1])**2
    term2 = (2.25 - x[0] + x[0]*x[1]**2)**2
    term3 = (2.625 - x[0] + x[0]*x[1]**3)**2
    return term1 + term2 + term3

def beale_grad(x):
    # Numerical gradient for simplicity
    eps = 1e-5
    grad = np.zeros(2)
    for i in range(2):
        x_plus = x.copy()
        x_plus[i] += eps
        x_minus = x.copy()
        x_minus[i] -= eps
        grad[i] = (beale(x_plus) - beale(x_minus)) / (2*eps)
    return grad
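
Because the optimizers consume hand-derived gradients, a finite-difference check is a cheap way to catch a sign or factor error. The sketch below applies the same central-difference scheme as `beale_grad` to the Rosenbrock gradient. `numerical_grad` is a helper name introduced here, and the function definitions are repeated so the cell runs on its own.

```python
import numpy as np

def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    dx = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
    dy = 200 * (x[1] - x[0]**2)
    return np.array([dx, dy])

def numerical_grad(func, x, eps=1e-5):
    """Central-difference gradient (helper introduced for this check)."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (func(x + step) - func(x - step)) / (2 * eps)
    return grad

point = np.array([-1.0, 2.0])   # the Rosenbrock start used later in the notebook
analytic = rosenbrock_grad(point)
numeric = numerical_grad(rosenbrock, point)
max_abs_error = np.max(np.abs(analytic - numeric))
print(analytic, numeric, max_abs_error)
```

The same check works for any of the other function/gradient pairs by swapping in their names.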

In [None]:
# Visualize the test functions

def plot_surface(func, xlim, ylim, title, ax=None, levels=50):
    """Plot contour of a 2D function."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 6))
    
    x = np.linspace(xlim[0], xlim[1], 200)
    y = np.linspace(ylim[0], ylim[1], 200)
    X, Y = np.meshgrid(x, y)
    # The test functions only index x[0] and x[1], so they evaluate on the whole grid at once
    Z = func(np.array([X, Y]))
    
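    # Use log scale for better visualization of valleys; shift functions that
    # go negative (the saddle) so the argument of log stays positive
    Z_plot = np.log(Z - min(Z.min(), 0.0) + 1)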
    
    contour = ax.contour(X, Y, Z_plot, levels=levels, cmap='viridis', alpha=0.7)
    ax.contourf(X, Y, Z_plot, levels=levels, cmap='viridis', alpha=0.3)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_title(title)
    return ax


# Plot all test functions
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

plot_surface(quadratic, (-3, 3), (-3, 3), 'Quadratic (Easy)', axes[0, 0])
axes[0, 0].plot(0, 0, 'r*', markersize=15, label='Minimum')
axes[0, 0].legend()

plot_surface(elongated_quadratic, (-3, 3), (-3, 3), 'Elongated Quadratic', axes[0, 1])
axes[0, 1].plot(0, 0, 'r*', markersize=15, label='Minimum')
axes[0, 1].legend()

plot_surface(rosenbrock, (-2, 2), (-1, 3), 'Rosenbrock (Hard)', axes[1, 0], levels=30)
axes[1, 0].plot(1, 1, 'r*', markersize=15, label='Minimum')
axes[1, 0].legend()

plot_surface(saddle_point, (-3, 3), (-3, 3), 'Saddle Point', axes[1, 1])
axes[1, 1].plot(0, 0, 'ro', markersize=10, label='Saddle')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

## 3. Optimizer Trajectories

Run each optimizer on the test functions and visualize their paths.

In [None]:
def optimize(optimizer, func, grad_func, start, n_steps=100):
    """
    Run optimizer for n_steps.
    Returns trajectory of positions.
    """
    x = start.copy()
    trajectory = [x.copy()]
    
    for _ in range(n_steps):
        grad = grad_func(x)
        x = optimizer.step(x, grad)
        trajectory.append(x.copy())
    
    return np.array(trajectory)


def plot_trajectories(func, grad_func, xlim, ylim, title, start, n_steps=100,
                      optimizers=None, lr_scale=1.0):
    """
    Plot optimizer trajectories on a contour plot.
    """
    if optimizers is None:
        optimizers = {
            'SGD': SGD(lr=0.1 * lr_scale),
            'Momentum': Momentum(lr=0.1 * lr_scale, beta=0.9),
            'RMSprop': RMSprop(lr=0.1 * lr_scale),
            'Adam': Adam(lr=0.1 * lr_scale),
        }
    
    fig, ax = plt.subplots(figsize=(10, 8))
    plot_surface(func, xlim, ylim, title, ax)
    
    colors = {'SGD': 'red', 'Momentum': 'blue', 'RMSprop': 'green', 'Adam': 'orange'}
    
    for name, opt in optimizers.items():
        traj = optimize(opt, func, grad_func, start.copy(), n_steps)
        ax.plot(traj[:, 0], traj[:, 1], '-', color=colors[name], 
                linewidth=2, label=name, alpha=0.8)
        ax.plot(traj[0, 0], traj[0, 1], 'o', color=colors[name], markersize=8)
        ax.plot(traj[-1, 0], traj[-1, 1], 's', color=colors[name], markersize=8)
    
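    # Keep the view on the contour region so diverging paths do not rescale the axes
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
    ax.legend(loc='upper right')
    plt.show()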
    
    return optimizers

In [None]:
# Quadratic - all optimizers should work well
start = np.array([2.5, 2.5])
plot_trajectories(quadratic, quadratic_grad, (-3, 3), (-3, 3), 
                  'Quadratic Function', start, n_steps=50);

In [None]:
# Elongated quadratic - one learning rate has to serve two very different curvatures
start = np.array([2.5, 2.5])
plot_trajectories(elongated_quadratic, elongated_quadratic_grad, (-3, 3), (-3, 3),
                  'Elongated Quadratic', start, n_steps=100, lr_scale=0.3);

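**Observation:** On the elongated quadratic the curvature in the steep direction (y) is 10× that in x, so a single learning rate has to compromise. Plain SGD oscillates in y once lr exceeds 0.05 and diverges above 0.1; at the lr = 0.03 used here it settles y quickly but crawls along x. Momentum speeds up the shallow direction at the cost of some oscillation, while the adaptive methods (RMSprop, Adam) rescale each coordinate so both directions make comparable progress.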
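
The trade-off can be made concrete with a little arithmetic. For a quadratic, one gradient descent step multiplies each coordinate by (1 - lr × curvature). The sketch below evaluates that factor for the elongated quadratic. `contraction_factors` is a helper name introduced for this illustration.

```python
import numpy as np

# Second derivatives of f(x, y) = x**2 + 10*y**2 along x and y
curvatures = np.array([2.0, 20.0])

def contraction_factors(lr, curvatures):
    """Per-coordinate multiplier applied by one gradient descent step."""
    return 1.0 - lr * curvatures

# lr used in the elongated-quadratic run (0.1 * lr_scale with lr_scale=0.3)
print(contraction_factors(0.03, curvatures))

# Stability limit and best fixed learning rate for this quadratic
lr_max = 2.0 / curvatures.max()
lr_best = 2.0 / (curvatures.min() + curvatures.max())
rate_best = np.max(np.abs(contraction_factors(lr_best, curvatures)))
print(lr_max, lr_best, rate_best)
```

At lr = 0.03 the y coordinate shrinks by a factor of 0.4 per step while x shrinks by only 0.94. Even the best fixed rate, 2 / (2 + 20) ≈ 0.091, contracts both directions by just 9/11 ≈ 0.82 per step, which is the gap that per-coordinate scaling closes.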

In [None]:
# Rosenbrock - the classic hard problem
start = np.array([-1.0, 2.0])
plot_trajectories(rosenbrock, rosenbrock_grad, (-2, 2), (-1, 3),
                  'Rosenbrock Function', start, n_steps=500, lr_scale=0.01);

In [None]:
# Saddle point - every optimizer escapes along y because the start has y != 0;
# gradient descent would only stall if y started at exactly 0
start = np.array([0.1, 0.1])  # Start near saddle
plot_trajectories(saddle_point, saddle_point_grad, (-3, 3), (-3, 3),
                  'Saddle Point', start, n_steps=50);
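
How quickly plain gradient descent leaves the saddle depends only on how far the start is from the x axis, since each step multiplies y by (1 + 2·lr). The sketch below counts the steps needed to leave the plotted window. `steps_to_escape` and the `radius` threshold are assumptions made for this illustration.

```python
def steps_to_escape(y0, lr=0.1, radius=3.0, max_steps=1000):
    """Count gradient descent steps until |y| exceeds radius on f = x**2 - y**2.

    Returns None if the iterate has not left after max_steps.
    """
    y = float(y0)
    for step in range(1, max_steps + 1):
        y = y - lr * (-2.0 * y)   # df/dy = -2*y, so y grows by a factor (1 + 2*lr)
        if abs(y) > radius:
            return step
    return None

print(steps_to_escape(0.1))    # the start used above
print(steps_to_escape(1e-6))   # much closer to the x axis: escape takes longer
print(steps_to_escape(0.0))    # exactly on the x axis: the y gradient is zero
```

In real training, gradient noise plays the role of the small nonzero y, which is one reason exact saddles rarely trap stochastic methods.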

## 4. Learning Rate Effects

Let's see what happens with different learning rates.

In [None]:
def compare_learning_rates(func, grad_func, xlim, ylim, start, lr_values):
    """Compare SGD with different learning rates."""
    fig, axes = plt.subplots(1, len(lr_values), figsize=(5*len(lr_values), 4))
    
    for ax, lr in zip(axes, lr_values):
        plot_surface(func, xlim, ylim, f'SGD, lr={lr}', ax, levels=30)
        
        opt = SGD(lr=lr)
        traj = optimize(opt, func, grad_func, start.copy(), n_steps=50)
        
        ax.plot(traj[:, 0], traj[:, 1], 'r-', linewidth=2, alpha=0.8)
        ax.plot(traj[0, 0], traj[0, 1], 'ro', markersize=10)
        ax.plot(traj[-1, 0], traj[-1, 1], 'rs', markersize=10)
        
        # Show the final loss in the panel title
        final_val = func(traj[-1])
        ax.set_title(f'SGD, lr={lr}\nFinal loss: {final_val:.4f}')
    
    plt.tight_layout()
    plt.show()

# Compare learning rates on quadratic
start = np.array([2.5, 2.5])
compare_learning_rates(quadratic, quadratic_grad, (-3, 3), (-3, 3), 
                       start, [0.01, 0.1, 0.5, 0.9])

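**Key insight:** On this quadratic each SGD step multiplies the position by (1 − 2·lr).
- **Too small LR (0.01):** Slow convergence; 50 steps are not enough to reach the minimum
- **Good LR (0.1-0.5):** Reaches the minimum efficiently (lr = 0.5 lands on it in a single step)
- **Too large LR (0.9):** Overshoots and oscillates across the minimum; it still converges here, but any lr above 1 diverges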
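
These regimes can be checked without any plotting. The sketch below runs gradient descent on one coordinate of the quadratic for the learning rates above, plus lr = 1.1 as an extra value (not in the original comparison) that lies past the stability limit. `gd_final_distance` is a helper name introduced here.

```python
def gd_final_distance(lr, x0=2.5, n_steps=50):
    """|x| after n_steps of gradient descent on f(x) = x**2 (gradient 2*x)."""
    x = x0
    for _ in range(n_steps):
        x = x - lr * 2.0 * x
    return abs(x)

for lr in [0.01, 0.1, 0.5, 0.9, 1.1]:
    factor = 1.0 - 2.0 * lr   # multiplier applied to x at every step
    print(f"lr={lr}: factor={factor:+.2f}, |x| after 50 steps={gd_final_distance(lr):.3g}")
```

Note that lr = 0.1 and lr = 0.9 end at the same distance from the minimum: their factors are +0.8 and −0.8, so the larger rate only adds the sign flips seen as zig-zags in the plot.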

In [None]:
# Same for elongated quadratic - shows the problem more clearly
start = np.array([2.5, 2.5])
compare_learning_rates(elongated_quadratic, elongated_quadratic_grad, 
                       (-3, 3), (-3, 3), start, [0.01, 0.05, 0.1, 0.2])

## 5. Momentum Visualization

Momentum helps overcome oscillations by accumulating velocity.

In [None]:
def compare_momentum(func, grad_func, xlim, ylim, start, beta_values, lr=0.05):
    """Compare different momentum coefficients."""
    fig, ax = plt.subplots(figsize=(10, 8))
    plot_surface(func, xlim, ylim, 'Effect of Momentum', ax)
    
    colors = plt.cm.coolwarm(np.linspace(0, 1, len(beta_values)))
    
    for beta, color in zip(beta_values, colors):
        opt = Momentum(lr=lr, beta=beta)
        traj = optimize(opt, func, grad_func, start.copy(), n_steps=100)
        
        ax.plot(traj[:, 0], traj[:, 1], '-', color=color, 
                linewidth=2, label=f'β={beta}', alpha=0.8)
        ax.plot(traj[-1, 0], traj[-1, 1], 's', color=color, markersize=8)
    
    ax.plot(start[0], start[1], 'ko', markersize=10, label='Start')
    ax.legend()
    plt.show()

start = np.array([2.5, 2.5])
compare_momentum(elongated_quadratic, elongated_quadratic_grad,
                 (-3, 3), (-3, 3), start, [0.0, 0.5, 0.9, 0.99])

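**Observation:** Higher momentum (β) carries more velocity: it speeds up progress along the shallow direction, but as β approaches 1 the path overshoots the minimum and takes longer to settle. β=0 reduces to plain SGD, and β=0.9 is a common default.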
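
Part of what changes with β in this notebook's formulation is the effective step size: the update `v = β * v + gradient` has no (1 − β) damping, so under a constant gradient the velocity approaches gradient / (1 − β). The sketch below confirms this numerically. `steady_state_velocity` is a helper name introduced for this illustration.

```python
def steady_state_velocity(beta, grad=1.0, n_steps=2000):
    """Velocity reached by v = beta * v + grad under a constant gradient."""
    v = 0.0
    for _ in range(n_steps):
        v = beta * v + grad
    return v

lr = 0.05   # default lr of compare_momentum
for beta in [0.0, 0.5, 0.9, 0.99]:
    v = steady_state_velocity(beta)
    print(f"beta={beta}: velocity={v:.2f}, "
          f"effective step={lr * v:.3f}, lr/(1-beta)={lr / (1 - beta):.3f}")
```

When comparing β values at a fixed lr, as above, the curves therefore differ in both smoothing and step size. Dividing lr by the same 1 / (1 − β) factor is one way to separate the two effects.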

## 6. Animated Optimization

Watch the optimizers in action!

In [None]:
def create_animation(func, grad_func, xlim, ylim, start, n_steps=100, lr_scale=1.0):
    """Create animated comparison of optimizers."""
    
    # Get trajectories
    optimizers = {
        'SGD': SGD(lr=0.1 * lr_scale),
        'Momentum': Momentum(lr=0.1 * lr_scale, beta=0.9),
        'RMSprop': RMSprop(lr=0.1 * lr_scale),
        'Adam': Adam(lr=0.1 * lr_scale),
    }
    
    trajectories = {}
    for name, opt in optimizers.items():
        trajectories[name] = optimize(opt, func, grad_func, start.copy(), n_steps)
    
    # Create figure
    fig, ax = plt.subplots(figsize=(10, 8))
    
    # Plot contour
    x = np.linspace(xlim[0], xlim[1], 100)
    y = np.linspace(ylim[0], ylim[1], 100)
    X, Y = np.meshgrid(x, y)
    # The test functions only index x[0] and x[1], so they evaluate on the whole grid at once
    Z = func(np.array([X, Y]))
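    # Shift functions that go negative (the saddle) so the argument of log stays positive
    Z_plot = np.log(Z - min(Z.min(), 0.0) + 1)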
    ax.contourf(X, Y, Z_plot, levels=30, cmap='viridis', alpha=0.5)
    ax.contour(X, Y, Z_plot, levels=30, cmap='viridis', alpha=0.7)
    
    # Initialize lines and points
    colors = {'SGD': 'red', 'Momentum': 'blue', 'RMSprop': 'green', 'Adam': 'orange'}
    lines = {}
    points = {}
    
    for name in optimizers:
        lines[name], = ax.plot([], [], '-', color=colors[name], 
                               linewidth=2, label=name, alpha=0.8)
        points[name], = ax.plot([], [], 'o', color=colors[name], markersize=10)
    
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
    ax.legend(loc='upper right')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    
    # Animation function
    def animate(frame):
        for name, traj in trajectories.items():
            lines[name].set_data(traj[:frame+1, 0], traj[:frame+1, 1])
            points[name].set_data([traj[frame, 0]], [traj[frame, 1]])
        ax.set_title(f'Step {frame}')
        return list(lines.values()) + list(points.values())
    
    # Each trajectory holds n_steps + 1 points (start included), so render all of them
    anim = animation.FuncAnimation(fig, animate, frames=n_steps + 1,
                                   interval=50, blit=True)
    plt.close()
    return anim

In [None]:
# Animated optimization on elongated quadratic
start = np.array([2.5, 2.5])
anim = create_animation(elongated_quadratic, elongated_quadratic_grad,
                        (-3, 3), (-3, 3), start, n_steps=100, lr_scale=0.3)
HTML(anim.to_jshtml())

## 7. Convergence Comparison

Plot loss vs iterations for quantitative comparison.

In [None]:
def compare_convergence(func, grad_func, start, n_steps=100, lr_scale=1.0):
    """Plot loss curves for different optimizers."""
    
    optimizers = {
        'SGD': SGD(lr=0.1 * lr_scale),
        'Momentum': Momentum(lr=0.1 * lr_scale, beta=0.9),
        'RMSprop': RMSprop(lr=0.1 * lr_scale),
        'Adam': Adam(lr=0.1 * lr_scale),
    }
    
    colors = {'SGD': 'red', 'Momentum': 'blue', 'RMSprop': 'green', 'Adam': 'orange'}
    
    fig, ax = plt.subplots(figsize=(10, 6))
    
    for name, opt in optimizers.items():
        traj = optimize(opt, func, grad_func, start.copy(), n_steps)
        losses = [func(x) for x in traj]
        ax.plot(losses, label=name, color=colors[name], linewidth=2)
    
    ax.set_xlabel('Iteration')
    ax.set_ylabel('Loss')
    ax.set_title('Convergence Comparison')
    ax.legend()
    ax.set_yscale('log')
    ax.grid(True)
    plt.show()

# Quadratic
print("Quadratic function:")
start = np.array([2.5, 2.5])
compare_convergence(quadratic, quadratic_grad, start, n_steps=50)

# Elongated quadratic
print("\nElongated quadratic:")
compare_convergence(elongated_quadratic, elongated_quadratic_grad, start, 
                    n_steps=100, lr_scale=0.3)

# Rosenbrock
print("\nRosenbrock:")
start = np.array([-1.0, 2.0])
compare_convergence(rosenbrock, rosenbrock_grad, start, n_steps=500, lr_scale=0.01)
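
To put a number on what the curves show, one option is to record the first iteration at which the loss drops below a tolerance. The sketch below does that for a stand-in loss curve (gradient descent with lr = 0.1 on the simple quadratic). `first_step_below` and the tolerances are assumptions made for this illustration.

```python
import numpy as np

def first_step_below(losses, tol):
    """Index of the first loss value below tol, or None if it is never reached."""
    below = np.flatnonzero(np.asarray(losses) < tol)
    return int(below[0]) if below.size else None

# Stand-in loss curve: gradient descent with lr=0.1 on f(x, y) = x**2 + y**2
x = np.array([2.5, 2.5])
losses = [float(x @ x)]
for _ in range(50):
    x = x - 0.1 * 2.0 * x
    losses.append(float(x @ x))

print(first_step_below(losses, 1e-3))
print(first_step_below(losses, 1e-30))   # not reached within 50 steps
```

Applied to the `losses` list built inside `compare_convergence`, the same helper would give one number per optimizer to go alongside the plot.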

## 8. Learning Rate Schedules

Learning rate schedules can improve convergence.

In [None]:
def constant_lr(step, initial_lr):
    """Constant learning rate."""
    return initial_lr

def step_decay_lr(step, initial_lr, decay_rate=0.5, decay_steps=25):
    """Step decay: reduce by decay_rate every decay_steps."""
    return initial_lr * (decay_rate ** (step // decay_steps))

def exponential_decay_lr(step, initial_lr, decay_rate=0.99):
    """Exponential decay: lr = initial_lr * decay_rate^step."""
    return initial_lr * (decay_rate ** step)

def cosine_annealing_lr(step, initial_lr, total_steps):
    """Cosine annealing: smooth decay to 0."""
    return initial_lr * 0.5 * (1 + np.cos(np.pi * step / total_steps))

def warmup_cosine_lr(step, initial_lr, warmup_steps, total_steps):
    """Linear warmup from 0 (so step 0 uses lr = 0), then cosine decay."""
    if step < warmup_steps:
        return initial_lr * step / warmup_steps
    else:
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return initial_lr * 0.5 * (1 + np.cos(np.pi * progress))

In [None]:
# Visualize learning rate schedules
total_steps = 100
initial_lr = 0.1
steps = np.arange(total_steps)

schedules = {
    'Constant': [constant_lr(s, initial_lr) for s in steps],
    'Step Decay': [step_decay_lr(s, initial_lr) for s in steps],
    'Exponential': [exponential_decay_lr(s, initial_lr) for s in steps],
    'Cosine': [cosine_annealing_lr(s, initial_lr, total_steps) for s in steps],
    'Warmup+Cosine': [warmup_cosine_lr(s, initial_lr, 10, total_steps) for s in steps],
}

fig, ax = plt.subplots(figsize=(10, 6))
for name, lrs in schedules.items():
    ax.plot(steps, lrs, label=name, linewidth=2)

ax.set_xlabel('Step')
ax.set_ylabel('Learning Rate')
ax.set_title('Learning Rate Schedules')
ax.legend()
ax.grid(True)
plt.show()

In [None]:
# Compare schedules on a problem
def optimize_with_schedule(schedule_fn, func, grad_func, start, n_steps, initial_lr, **schedule_kwargs):
    """Run SGD with a learning rate schedule."""
    x = start.copy()
    trajectory = [x.copy()]
    
    for step in range(n_steps):
        lr = schedule_fn(step, initial_lr, **schedule_kwargs)
        grad = grad_func(x)
        x = x - lr * grad
        trajectory.append(x.copy())
    
    return np.array(trajectory)

# Compare on elongated quadratic
start = np.array([2.5, 2.5])
n_steps = 100
initial_lr = 0.1

fig, ax = plt.subplots(figsize=(10, 6))

schedule_configs = {
    'Constant': {'schedule_fn': constant_lr, 'kwargs': {}},
    'Cosine': {'schedule_fn': cosine_annealing_lr, 'kwargs': {'total_steps': n_steps}},
    'Warmup+Cosine': {'schedule_fn': warmup_cosine_lr, 'kwargs': {'warmup_steps': 10, 'total_steps': n_steps}},
}

for name, config in schedule_configs.items():
    traj = optimize_with_schedule(
        config['schedule_fn'], 
        elongated_quadratic, 
        elongated_quadratic_grad,
        start.copy(), 
        n_steps, 
        initial_lr,
        **config['kwargs']
    )
    losses = [elongated_quadratic(x) for x in traj]
    ax.plot(losses, label=name, linewidth=2)

ax.set_xlabel('Iteration')
ax.set_ylabel('Loss')
ax.set_title('Effect of Learning Rate Schedules (SGD on Elongated Quadratic)')
ax.legend()
ax.set_yscale('log')
ax.grid(True)
plt.show()
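
`optimize_with_schedule` hard-codes the SGD update. Since the optimizer classes in this notebook keep `lr` as a plain attribute, a schedule can also drive a stateful optimizer by overwriting that attribute before each step. The sketch below shows one way to do it. `optimize_scheduled` is a hypothetical helper, lr = 0.01 is an arbitrary choice, and the `Momentum` class and cosine schedule are repeated so the cell runs on its own.

```python
import numpy as np

class Momentum:
    """Same update rule as the Momentum class defined earlier."""
    def __init__(self, lr=0.01, beta=0.9):
        self.lr, self.beta, self.v = lr, beta, None

    def step(self, params, grads):
        if self.v is None:
            self.v = np.zeros_like(params)
        self.v = self.beta * self.v + grads
        return params - self.lr * self.v

def cosine_annealing_lr(step, initial_lr, total_steps):
    return initial_lr * 0.5 * (1 + np.cos(np.pi * step / total_steps))

def optimize_scheduled(optimizer, grad_func, start, n_steps, initial_lr):
    """Hypothetical helper: set optimizer.lr from the schedule before every step."""
    x = start.copy()
    trajectory = [x.copy()]
    for step in range(n_steps):
        optimizer.lr = cosine_annealing_lr(step, initial_lr, n_steps)
        x = optimizer.step(x, grad_func(x))
        trajectory.append(x.copy())
    return np.array(trajectory)

def elongated_quadratic_grad(x):
    return np.array([2 * x[0], 20 * x[1]])

traj = optimize_scheduled(Momentum(beta=0.9), elongated_quadratic_grad,
                          np.array([2.5, 2.5]), n_steps=100, initial_lr=0.01)
final_loss = traj[-1, 0]**2 + 10 * traj[-1, 1]**2
print(traj.shape, final_loss)
```

The optimizer's internal state (here the velocity) carries over between steps, so only the step size changes with the schedule.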

## 9. Summary

| Optimizer | Key Idea | Best For | Hyperparameters |
|-----------|----------|----------|------------------|
| **SGD** | θ -= lr * grad | Simple problems | lr |
| **Momentum** | Accumulate velocity | Oscillating gradients | lr, β |
| **RMSprop** | Adaptive per-param LR | Different curvatures | lr, β, ε |
| **Adam** | Momentum + RMSprop | General purpose | lr, β1, β2, ε |

**Key takeaways:**

1. **Learning rate** is usually the most important hyperparameter to tune
2. **Momentum** helps overcome oscillations in narrow valleys
3. **Adaptive methods** (Adam, RMSprop) handle different curvatures automatically
4. **Adam** is a good default for most problems
5. **Learning rate schedules** can improve final convergence

**Next:** [05-rnn-from-scratch.ipynb](05-rnn-from-scratch.ipynb) builds a character-level RNN.

## 10. Exercises

1. **Implement AdaGrad:** Similar to RMSprop but accumulates squared gradients without decay.

2. **Nesterov momentum:** Implement the "lookahead" variant of momentum.

3. **Find optimal learning rate:** For a given problem, implement a learning rate finder (train for a few steps at increasing LR, plot loss).

4. **Weight decay:** Add L2 regularization to Adam. How does it affect trajectories?

In [None]:
# Exercise 1 starter: Implement AdaGrad
class AdaGrad:
    """
    AdaGrad: Accumulates all past squared gradients.
    
    s = s + gradient²
    θ = θ - lr * gradient / (√s + ε)
    """
    def __init__(self, lr=0.1, eps=1e-8):
        self.lr = lr
        self.eps = eps
        self.s = None
    
    def step(self, params, grads):
        # Your implementation here
        pass