# 🤖 Adaptive Optimizers: Adam and Beyond

> *"The Adam optimizer is the Swiss Army knife of deep learning. When in doubt, use Adam."*

We've seen how SGD with Momentum can accelerate convergence. However, it uses the same learning rate for all parameters. What if some parameters need larger updates and others need smaller ones? This is the idea behind **adaptive optimizers**.

This notebook introduces the most popular and effective adaptive optimizers, culminating in **Adam**, the de facto standard for training deep neural networks.

## 🎯 What You'll Master

- **The Need for Adaptive Learning Rates**: Why a single learning rate isn't always optimal.
- **RMSprop**: An optimizer that adapts the learning rate based on the magnitude of recent gradients.
- **Adam (Adaptive Moment Estimation)**: The powerhouse optimizer that combines the ideas of Momentum and RMSprop.
- **Visual Comparison**: Seeing how Adam often outperforms other optimizers in practice.

## 📚 Import Essential Libraries

In [None]:
# Core libraries
import numpy as np
import matplotlib.pyplot as plt

# Plotting style
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline
plt.rcParams['figure.figsize'] = (14, 9)
plt.rcParams['font.size'] = 12

print("🤖 Libraries loaded for adaptive optimization!")

---

# 💡 Chapter 1: The Idea of Adaptive Learning Rates

Consider a loss surface with a long, gentle slope in one direction (e.g., parameter `w`) and a steep, narrow valley in another (e.g., parameter `b`).

- For `w`, we want to take **large steps** to make progress along the gentle slope.
- For `b`, we want to take **small steps** to avoid oscillating wildly across the steep valley.

A single learning rate `η` can't satisfy both needs. **Adaptive optimizers** solve this by maintaining a per-parameter learning rate that adjusts automatically during training.
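
To make the problem concrete, here is a minimal sketch of plain gradient descent on a toy surface. The quadratic `L(w, b) = 0.5 * (w**2 + 100 * b**2)` and the helper `gradient_descent` are assumptions chosen for illustration (gentle along `w`, steep along `b`); they are not part of the notebook's later experiment.

```python
import numpy as np

def gradient_descent(lr, n_steps=50, start=(1.0, 1.0)):
    """Plain gradient descent on L(w, b) = 0.5 * (w**2 + 100 * b**2)."""
    theta = np.array(start, dtype=float)
    curvature = np.array([1.0, 100.0])  # gentle along w, steep along b
    for _ in range(n_steps):
        grad = curvature * theta        # gradient of the quadratic
        theta -= lr * grad              # same learning rate for both parameters
    return theta

# Try a small, a borderline, and a slightly too large learning rate.
for lr in (0.001, 0.019, 0.021):
    w, b = gradient_descent(lr)
    print(f"lr={lr}: w={w:.3f}, b={b:.3e}")
```

With the smallest rate, `b` settles but `w` has barely moved after 50 steps. With the largest rate, `b` diverges because each step overshoots the steep valley. No single value is comfortable for both directions.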

### RMSprop (Root Mean Square Propagation)

RMSprop adapts the learning rate for each parameter by dividing it by the square root of a running average of that parameter's recent squared gradients (a root mean square, hence the name).

**Intuition**: 
- If recent gradients for a parameter have been **large** (steep slope), the learning rate for that parameter is **decreased** to prevent overshooting.
- If recent gradients have been **small** (gentle slope), the learning rate is effectively **increased** to speed up progress.
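
A quick numeric check of this intuition, as a sketch under one simplifying assumption: each parameter's gradient magnitude has stayed roughly constant, so the running average of squared gradients has settled near `grad**2`. The gradient values below are made up for illustration.

```python
import numpy as np

lr, eps = 0.01, 1e-8
grads = np.array([100.0, 0.01])   # steep parameter, gentle parameter

# Assumption: the running average S has settled near grad**2,
# so sqrt(S) is close to |grad|.
S = grads**2
effective_lr = lr / (np.sqrt(S) + eps)
step = effective_lr * grads

print("effective lr:", effective_lr)
print("step size:   ", step)
```

The steep parameter gets a much smaller effective learning rate than the gentle one, and the resulting step sizes come out close to `lr` for both.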

### The RMSprop Update Rule

1. **Update the squared gradient cache `S`**:
   $$ S_{new} = \gamma S_{old} + (1 - \gamma) (\nabla L)^2 $$

2. **Update the parameters `θ`**:
   $$ \theta_{new} = \theta_{old} - \frac{\eta}{\sqrt{S_{new}} + \epsilon} \nabla L(\theta) $$

Where:
- **`γ` (gamma)** is the decay rate for the cache (e.g., 0.9).
- `(∇L)²` is the element-wise square of the gradient.
- **`ε` (epsilon)** is a small number (e.g., 1e-8) to prevent division by zero.
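
The two steps above can be sketched as a single vectorized NumPy function. The name `rmsprop_step`, its default values, and the toy quadratic used to exercise it are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def rmsprop_step(theta, grad, S, lr=0.01, gamma=0.9, eps=1e-8):
    """One RMSprop update; returns the new parameters and the new cache."""
    S = gamma * S + (1 - gamma) * grad**2            # step 1: update the cache
    theta = theta - lr / (np.sqrt(S) + eps) * grad   # step 2: scaled update
    return theta, S

# Hypothetical test problem: L(w, b) = 0.5 * (w**2 + 100 * b**2)
curvature = np.array([1.0, 100.0])
theta = np.array([1.0, 1.0])
S = np.zeros_like(theta)   # the cache starts at zero

for _ in range(200):
    grad = curvature * theta
    theta, S = rmsprop_step(theta, grad, S)

print("theta after 200 steps:", theta)
```

Because each gradient component is divided by its own running scale, the steep `b` direction and the gentle `w` direction take steps of similar size, even though their gradients differ by a factor of 100.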

---

# 👑 Chapter 2: Adam - The King of Optimizers

**Adam (Adaptive Moment Estimation)** combines the best of both worlds:

1. **Momentum**: It keeps a moving average of past gradients (the *first moment*), which helps accelerate convergence.
2. **RMSprop**: It keeps a moving average of past squared gradients (the *second moment*), which provides per-parameter adaptive learning rates.

Adam is generally considered robust and effective, and it works well across a wide range of problems, which makes it the default choice for many deep learning applications.

### The Adam Update Rule

1. **Update biased first moment estimate (like Momentum)**:
   $$ m_{new} = \beta_1 m_{old} + (1 - \beta_1) \nabla L $$

2. **Update biased second moment estimate (like RMSprop)**:
   $$ v_{new} = \beta_2 v_{old} + (1 - \beta_2) (\nabla L)^2 $$

3. **Compute bias-corrected estimates**:
   $$ \hat{m} = \frac{m_{new}}{1 - \beta_1^t} $$
   $$ \hat{v} = \frac{v_{new}}{1 - \beta_2^t} $$
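   (where `t` is the timestep, counted from 1. Because `m` and `v` start at zero, the raw averages are biased toward zero during the first steps, and dividing by `1 - β^t` undoes that bias.)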

4. **Update parameters `θ`**:
   $$ \theta_{new} = \theta_{old} - \frac{\eta}{\sqrt{\hat{v}} + \epsilon} \hat{m} $$
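
The four steps above can be sketched as one vectorized NumPy function. The name `adam_step` and the example gradient are assumptions for illustration; the default hyperparameters are commonly used values, not settings taken from this notebook.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `t` is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # step 1: biased first moment
    v = beta2 * v + (1 - beta2) * grad**2     # step 2: biased second moment
    m_hat = m / (1 - beta1**t)                # step 3: bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr / (np.sqrt(v_hat) + eps) * m_hat   # step 4: update
    return theta, m, v

theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
grad = np.array([0.5, -200.0])   # made-up gradient with very different scales

new_theta, m, v = adam_step(theta, grad, m, v, t=1)
print("raw m:", m)                        # shrunk toward zero by the zero start
print("step taken:", theta - new_theta)   # close to lr * sign(grad)
```

At `t = 1` the raw first moment is only `(1 - β1)` times the gradient, and dividing by `1 - β1` recovers the gradient itself. The same holds for the second moment, so the first step has magnitude close to `lr` in every coordinate, whatever the gradient scale.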

In [None]:
def loss_function(w, b):
    """ A challenging non-convex function (Himmelblau's). """
    return (w**2 + b - 11)**2 + (w + b**2 - 7)**2

def gradient(w, b):
    """ Gradient for the loss function. """
    grad_w = 4*w*(w**2 + b - 11) + 2*(w + b**2 - 7)
    grad_b = 2*(w**2 + b - 11) + 4*b*(w + b**2 - 7)
    return grad_w, grad_b

def run_optimizers():
    """
    Run SGD, Momentum, RMSprop, and Adam to compare their paths.
    """
    start_w, start_b = -0.5, 4.0
    n_steps = 100
    
    optimizers = {
        'SGD': {'path': [], 'lr': 0.01},
        'Momentum': {'path': [], 'lr': 0.01, 'beta': 0.9, 'v_w': 0, 'v_b': 0},
        'RMSprop': {'path': [], 'lr': 0.1, 'gamma': 0.9, 's_w': 0, 's_b': 0, 'eps': 1e-8},
        'Adam': {'path': [], 'lr': 0.1, 'beta1': 0.9, 'beta2': 0.999, 'm_w': 0, 'm_b': 0, 'v_w': 0, 'v_b': 0, 'eps': 1e-8}
    }
    
    for name, params in optimizers.items():
        w, b = start_w, start_b
        params['path'].append((w, b))
        for t in range(1, n_steps + 1):
            grad_w, grad_b = gradient(w, b)
            
            if name == 'SGD':
                w -= params['lr'] * grad_w
                b -= params['lr'] * grad_b
            elif name == 'Momentum':
                params['v_w'] = params['beta'] * params['v_w'] + (1 - params['beta']) * grad_w
                params['v_b'] = params['beta'] * params['v_b'] + (1 - params['beta']) * grad_b
                w -= params['lr'] * params['v_w']
                b -= params['lr'] * params['v_b']
            elif name == 'RMSprop':
                params['s_w'] = params['gamma'] * params['s_w'] + (1 - params['gamma']) * grad_w**2
                params['s_b'] = params['gamma'] * params['s_b'] + (1 - params['gamma']) * grad_b**2
                w -= (params['lr'] / (np.sqrt(params['s_w']) + params['eps'])) * grad_w
                b -= (params['lr'] / (np.sqrt(params['s_b']) + params['eps'])) * grad_b
            elif name == 'Adam':
                params['m_w'] = params['beta1'] * params['m_w'] + (1 - params['beta1']) * grad_w
                params['m_b'] = params['beta1'] * params['m_b'] + (1 - params['beta1']) * grad_b
                params['v_w'] = params['beta2'] * params['v_w'] + (1 - params['beta2']) * grad_w**2
                params['v_b'] = params['beta2'] * params['v_b'] + (1 - params['beta2']) * grad_b**2
                m_hat_w = params['m_w'] / (1 - params['beta1']**t)
                m_hat_b = params['m_b'] / (1 - params['beta1']**t)
                v_hat_w = params['v_w'] / (1 - params['beta2']**t)
                v_hat_b = params['v_b'] / (1 - params['beta2']**t)
                w -= (params['lr'] / (np.sqrt(v_hat_w) + params['eps'])) * m_hat_w
                b -= (params['lr'] / (np.sqrt(v_hat_b) + params['eps'])) * m_hat_b
            
            params['path'].append((w, b))
        params['path'] = np.array(params['path'])
        
    return optimizers

def plot_optimizer_comparison(optimizers):
    """Plot each optimizer's path on top of the loss contours."""
    # Grid for plotting
    w_grid = np.linspace(-5, 5, 200)
    b_grid = np.linspace(-5, 5, 200)
    W, B = np.meshgrid(w_grid, b_grid)
    L = loss_function(W, B)
    
    plt.figure(figsize=(14, 12))
    plt.contour(W, B, L, levels=np.logspace(0, 5, 35), cmap='viridis', alpha=0.6)
    
    colors = {'SGD': 'red', 'Momentum': 'orange', 'RMSprop': 'blue', 'Adam': 'black'}
    for name, params in optimizers.items():
        path = params['path']
        plt.plot(path[:, 0], path[:, 1], '-o', markersize=3, color=colors[name], label=name, alpha=0.8)
    
    # Mark the minima
    minima = np.array([[3, 2], [-2.805, 3.131], [-3.779, -3.283], [3.584, -1.848]])
    plt.plot(minima[:, 0], minima[:, 1], 'y*', markersize=20, label='Local Minima', linestyle='none')
    
    plt.title('Comparison of Optimizer Paths on a Non-Convex Surface', fontsize=16, weight='bold')
    plt.xlabel('Parameter w')
    plt.ylabel('Parameter b')
    plt.legend()
    plt.axis('equal')
    plt.xlim(-5, 5)
    plt.ylim(-5, 5)
    plt.show()

optimizers = run_optimizers()
plot_optimizer_comparison(optimizers)

### Analysis of the Optimizer Paths

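- **SGD (Red)**: Every step is the raw gradient scaled by one fixed learning rate, so the path leaps where the surface is steep and crawls where it is flat.
- **Momentum (Orange)**: The moving average of gradients smooths the path and carries velocity through flatter regions, though it can still overshoot and oscillate. With the `(1 - β)` scaling used here, its first few steps are smaller than SGD's at the same learning rate.
- **RMSprop (Blue)**: Moves aggressively at the start. The cache starts at zero and is not bias-corrected, so the first step is about `lr / sqrt(1 - γ)` ≈ 0.32 per parameter regardless of gradient size. After that, the per-parameter scaling keeps step sizes comparable across directions with different curvature.
- **Adam (Black)**: Combines the adaptive steps of RMSprop with the path-smoothing, accelerating properties of Momentum. Bias correction keeps its early steps close to the learning rate (0.1 here), so its path tends to be smooth and direct toward a nearby minimum.

Note that RMSprop and Adam run with `lr = 0.1` while SGD and Momentum use `lr = 0.01`, so the plot illustrates typical behaviour rather than a controlled benchmark. It still shows the properties that make Adam a popular default for the complex, high-dimensional problems found in deep learning: step sizes that adapt per parameter and a smoothed update direction.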
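
Reading paths off a contour plot is subjective, so it can help to back the comparison with numbers. The sketch below is one possible way to do that: `summarize_path` and `MINIMA` are hypothetical names introduced here, the loss is the same Himmelblau function used above, and the two-point `demo_path` exists only to show the call.

```python
import numpy as np

def loss_function(w, b):
    """Himmelblau's function, as defined earlier in this notebook."""
    return (w**2 + b - 11)**2 + (w + b**2 - 7)**2

# Approximate locations of the four minima, as marked on the plot.
MINIMA = np.array([[3, 2], [-2.805, 3.131], [-3.779, -3.283], [3.584, -1.848]])

def summarize_path(path):
    """Return (final loss, distance from the final point to the nearest minimum)."""
    path = np.asarray(path, dtype=float)
    w, b = path[-1]
    final_loss = loss_function(w, b)
    nearest = np.min(np.linalg.norm(MINIMA - path[-1], axis=1))
    return final_loss, nearest

# Hypothetical two-point path, used only to show the call.
demo_path = [(-0.5, 4.0), (3.0, 2.0)]
final_loss, nearest = summarize_path(demo_path)
print(f"final loss = {final_loss:.4f}, distance to nearest minimum = {nearest:.4f}")
```

Applied to the experiment above, calling `summarize_path(params['path'])` for each entry of `optimizers` would give a final loss and a distance per optimizer to compare alongside the plot.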

---

# 🎯 Key Takeaways

## 🧠 The Big Idea
- **Adaptive Learning Rates**: The core innovation is to maintain a per-parameter learning rate, allowing the optimizer to adapt to the specific geometry of the loss surface for each parameter.

## 🛠️ The Building Blocks
- **RMSprop**: Adapts learning rates by dividing by a moving average of squared gradients. This helps balance progress on flat vs. steep directions.
- **Momentum**: Accelerates progress by replacing the raw gradient with a moving average of past gradients, which builds velocity in consistent directions and smooths the path.

## 👑 Adam: The Best of Both Worlds
- **Adam** combines the adaptive learning rate mechanism of RMSprop with the velocity-building mechanism of Momentum.
- It is robust, efficient, and generally requires less manual tuning of the learning rate than other optimizers, making it an excellent default choice.

---

# 🚀 What's Next?

This concludes our journey through the fundamentals of Optimization! We have seen how to navigate complex loss surfaces to find good parameters for our models, building up from the basic idea of Gradient Descent to the sophisticated, widely used Adam optimizer.

The final section of this course, **Applications in Machine Learning**, will tie everything together. We'll see how Linear Algebra, Calculus, Probability, and Optimization all converge to build and train real machine learning models.

**Ready to see it all come together? Let's build some models! 🤖**