# 4. Adam Optimizer From Scratch

Adam (Adaptive Moment Estimation) is one of the most widely used optimizers in deep learning.
It combines two ideas: the running average of gradients from **Momentum** and the per-parameter step scaling from **RMSProp**.
Let's build it from scratch and compare it with plain SGD!

In [None]:
import torch
import matplotlib.pyplot as plt
import numpy as np

## 1. The Concepts: SGD vs. Adam

### SGD (Stochastic Gradient Descent)
Updates parameters by subtracting the gradient multiplied by a learning rate.
$w_{t+1} = w_t - \eta \cdot \nabla L(w_t)$

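**Problem**: Every parameter shares the same learning rate, so in narrow ravines SGD oscillates across the steep direction while crawling along the flat one. It can also stall in local minima.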
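To make the update rule concrete, here is a minimal sketch on the one-dimensional loss $L(w) = w^2$, whose gradient is $2w$. The helper name `sgd_step` and the values `w = 1.0` and `lr = 0.1` are illustrative assumptions, not part of the notebook's own code.

```python
def sgd_step(w, grad, lr):
    """One plain gradient-descent update: w <- w - lr * grad."""
    return w - lr * grad

w = 1.0    # assumed starting point
lr = 0.1   # assumed learning rate
history = [w]
for _ in range(3):
    grad = 2 * w              # dL/dw for L(w) = w^2
    w = sgd_step(w, grad, lr)
    history.append(w)

print(history)
```

With these values each step multiplies $w$ by $1 - 2\eta = 0.8$, so the iterates shrink geometrically toward the minimum at 0.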

### Adam
Adam keeps track of two things for each parameter:
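1. **Momentum ($m$)**: An exponential moving average of gradients (like a heavy ball rolling downhill).
2. **Second moment ($v$)**: An exponential moving average of squared gradients. It is often loosely called the variance, but it is uncentered; it scales each parameter's step size by the recent magnitude of its gradient.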

Update rule, with $g_t = \nabla L(w_t)$ and $m_0 = v_0 = 0$:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ (First Moment)
$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$ (Second Moment)

Bias correction (both averages start at zero, so early estimates are biased toward zero):
$\hat{m}_t = m_t / (1 - \beta_1^t)$
$\hat{v}_t = v_t / (1 - \beta_2^t)$
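A small sketch of what the correction does, assuming a constant gradient `g = 1.0` purely for illustration. In that case $m_t = 1 - \beta_1^t$, so the raw average badly underestimates the gradient at first, while the corrected $\hat{m}_t$ recovers it.

```python
beta1 = 0.9
g = 1.0    # assumed constant gradient, for illustration only
m = 0.0    # first moment starts at zero, as in Adam

rows = []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g    # biased moving average
    m_hat = m / (1 - beta1 ** t)       # bias-corrected estimate
    rows.append((t, m, m_hat))
    print(f"t={t}  m={m:.4f}  m_hat={m_hat:.4f}")
```

The raw `m` starts near 0.1 and only slowly climbs toward 1, whereas `m_hat` stays at 1 (up to floating-point rounding) from the first step.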

Parameter update:
$w_{t+1} = w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
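One consequence worth checking by hand: on the very first step, $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the update reduces to $\eta \, g_1 / (|g_1| + \epsilon) \approx \eta \cdot \mathrm{sign}(g_1)$. The sketch below, using a hypothetical scalar helper `adam_first_step`, suggests that the first step has roughly the same size whether the gradient is tiny or huge.

```python
import math

def adam_first_step(g, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Amount subtracted from w on the first Adam step (m = v = 0 beforehand)."""
    m = (1 - beta1) * g
    v = (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** 1)   # t = 1
    v_hat = v / (1 - beta2 ** 1)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# Gradient values chosen arbitrarily to span several orders of magnitude.
steps = {g: adam_first_step(g) for g in (0.01, 1.0, 100.0, -100.0)}
for g, step in steps.items():
    print(f"g={g:>7}  step={step:.6f}")
```

Every step has magnitude close to `lr`, and only the sign follows the gradient. Plain SGD would instead take steps proportional to `g`.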

## 2. Implementing Optimizers

Let's define a base class and implement SGD and Adam.

In [None]:
class Optimizer:
    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr
        
    def step(self):
        raise NotImplementedError
        
    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()

class SGD(Optimizer):
    def step(self):
        with torch.no_grad():
            for p in self.params:
                if p.grad is None: continue
                p -= self.lr * p.grad

class Adam(Optimizer):
    def __init__(self, params, lr=0.001, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, lr)
        self.betas = betas
        self.eps = eps
        self.m = [torch.zeros_like(p) for p in self.params]
        self.v = [torch.zeros_like(p) for p in self.params]
        self.t = 0
        
    def step(self):
        self.t += 1
        beta1, beta2 = self.betas
        
        with torch.no_grad():
            for i, p in enumerate(self.params):
                if p.grad is None: continue
                grad = p.grad
                
                # Update biased first moment estimate
                self.m[i] = beta1 * self.m[i] + (1 - beta1) * grad
                
                # Update biased second raw moment estimate
                self.v[i] = beta2 * self.v[i] + (1 - beta2) * (grad ** 2)
                
                # Compute bias-corrected first moment estimate
                m_hat = self.m[i] / (1 - beta1 ** self.t)
                
                # Compute bias-corrected second raw moment estimate
                v_hat = self.v[i] / (1 - beta2 ** self.t)
                
                # Update parameters
                p -= self.lr * m_hat / (torch.sqrt(v_hat) + self.eps)

## 3. The Test: Rosenbrock Function (The Banana Valley)

We'll test on a tricky function where SGD struggles: $f(x, y) = (1-x)^2 + 100(y-x^2)^2$.
It has a global minimum at $(1, 1)$ inside a long, narrow, parabolic valley.
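The gradient can be written out by hand: $\partial f / \partial x = -2(1 - x) - 400x(y - x^2)$ and $\partial f / \partial y = 200(y - x^2)$. The sketch below uses a hypothetical helper `rosenbrock_grad` (the notebook itself relies on autograd) to check that the gradient vanishes at $(1, 1)$ and to show how steep the function is far from the valley, for example at $(-1.5, -1)$.

```python
def rosenbrock(x, y):
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rosenbrock_grad(x, y):
    """Analytic gradient (df/dx, df/dy) of the Rosenbrock function."""
    dx = -2 * (1 - x) - 400 * x * (y - x ** 2)
    dy = 200 * (y - x ** 2)
    return dx, dy

# At the global minimum both the value and the gradient are zero.
print(rosenbrock(1.0, 1.0), rosenbrock_grad(1.0, 1.0))

# Far from the valley the gradient is large.
print(rosenbrock(-1.5, -1.0), rosenbrock_grad(-1.5, -1.0))
```

At $(-1.5, -1)$ the function value is 1062.5 and the gradient is $(-1955, -650)$, several orders of magnitude larger than the gradients along the valley floor. A single fixed learning rate has to cope with both regimes.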

In [None]:
def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

def train_optimizer(optimizer_class, lr, steps=2000):
    # Start at (-1.5, -1) - far from (1, 1)
    x = torch.tensor([-1.5], requires_grad=True)
    y = torch.tensor([-1.0], requires_grad=True)
    
    optimizer = optimizer_class([x, y], lr=lr)
    path = []
    
    for _ in range(steps):
        path.append((x.item(), y.item()))
        
        loss = rosenbrock(x, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Record the final position reached after the last update
    path.append((x.item(), y.item()))
    return np.array(path)

# Run experiments
print("Training SGD...")
path_sgd = train_optimizer(SGD, lr=0.001)

print("Training Adam...")
path_adam = train_optimizer(Adam, lr=0.1) # Adam can handle larger LR

print("Done!")

## 4. Visualization: The Race

Let's see who gets to the minimum (1, 1) faster!

In [None]:
# Create a grid for contour plot
x_grid = np.linspace(-2, 2, 100)
y_grid = np.linspace(-1, 3, 100)
X, Y = np.meshgrid(x_grid, y_grid)
Z = (1 - X)**2 + 100 * (Y - X**2)**2

plt.figure(figsize=(12, 8))
plt.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='jet', alpha=0.5)
plt.plot(1, 1, 'r*', markersize=15, label='Global Minimum (1, 1)')

# Plot paths
plt.plot(path_sgd[:, 0], path_sgd[:, 1], 'b-', label='SGD', linewidth=2)
plt.plot(path_adam[:, 0], path_adam[:, 1], 'g-', label='Adam', linewidth=2)

plt.legend()
plt.title("SGD vs Adam on Rosenbrock Function")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
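The plot gives a qualitative picture. To put numbers on it, one option is a small helper like the sketch below. `summarize_path` is a hypothetical name, and it assumes a path is a sequence of `(x, y)` pairs such as the arrays returned by `train_optimizer`. The three-point `demo_path` is made up and only shows the output format.

```python
import math

def summarize_path(path, target=(1.0, 1.0)):
    """Return the final point, its distance to target, and its Rosenbrock loss."""
    x, y = path[-1]
    distance = math.hypot(x - target[0], y - target[1])
    loss = (1 - x) ** 2 + 100 * (y - x ** 2) ** 2
    return {"final": (x, y), "distance": distance, "loss": loss}

# Made-up path, used here only to demonstrate the helper.
demo_path = [(-1.5, -1.0), (0.0, 0.0), (1.0, 1.0)]
summary = summarize_path(demo_path)
print(summary)
```

In the notebook, calling `summarize_path(path_sgd)` and `summarize_path(path_adam)` would give the final distance to $(1, 1)$ and the final loss for each optimizer.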

## 5. Conclusion

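**SGD** uses one fixed learning rate for every parameter. Once it reaches the valley floor, where gradients are small, it crawls along the curved valley. It's slow!

**Adam** adapts its learning rate for each parameter:
- It builds momentum, so consistent gradient directions accumulate into faster progress.
- It divides each update by the running root-mean-square of that parameter's gradients, so steep and flat directions get comparably sized steps.
- In this experiment it gets close to the minimum in far fewer steps. Note that the two runs use different learning rates (0.1 for Adam, 0.001 for SGD).

This robustness to gradient scale is a big part of why Adam is a common default choice for deep learning tasks.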