# ⚡ Stochastic Gradient Descent (SGD) and its Variants

> *"The great advantage of stochastic gradient descent is its simplicity and effectiveness. It's the workhorse of deep learning."*

In the previous notebook, we learned about Gradient Descent (GD). A major drawback of this "vanilla" GD is that it requires computing the gradient of the loss function over the **entire training dataset** for every single update. For modern datasets with millions of samples, this is incredibly slow and computationally expensive.

Enter **Stochastic Gradient Descent (SGD)**, a more efficient and widely used alternative.

## 🎯 What You'll Master

- **Batch vs. Mini-Batch vs. Stochastic GD**: Understanding the different flavors of gradient descent.
- **The Power of Noise**: How SGD's noisy updates can help escape local minima.
- **Convergence and Trade-offs**: Comparing the optimization paths of different GD variants.
- **Momentum**: A simple but powerful technique to accelerate SGD.

## 📚 Import Essential Libraries

In [None]:
# Core libraries
import numpy as np
import matplotlib.pyplot as plt

# Plotting style
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

# Create a synthetic dataset for linear regression
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]  # Add x0 = 1 to each instance for the bias term

print("⚡ Libraries and synthetic data loaded!")

---

# ⚖️ Chapter 1: The Gradient Descent Family

The difference between the variants lies in **how much data we use** to compute the gradient at each step.

1. **Batch Gradient Descent (GD)**
   - **Method**: Uses the **entire** training set for each gradient calculation and parameter update.
   - **Pros**: Smooth, direct convergence path.
   - **Cons**: Very slow and memory-intensive for large datasets.

2. **Stochastic Gradient Descent (SGD)**
   - **Method**: Uses **one single, randomly chosen** training instance for each gradient calculation and update.
   - **Pros**: Very fast per update, low memory usage.
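   - **Cons**: Very noisy (stochastic) path. The loss can fluctuate wildly. With a fixed learning rate it never truly settles at the minimum; gradually reducing the learning rate (a learning schedule) lets it settle.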

3. **Mini-Batch Gradient Descent**
   - **Method**: A compromise. Uses a **small, random batch** of instances (e.g., 32, 64) for each update.
   - **Pros**: Often the best of both worlds: less noisy than SGD, cheaper per update than Batch GD, and well suited to vectorized hardware such as GPUs. This is the **standard approach** in deep learning.
   - **Cons**: Adds another hyperparameter (the batch size).
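
The trade-off between the three variants can be made concrete by measuring how far a sampled-batch gradient lands from the full-batch gradient. The sketch below is only an illustration: it redraws a dataset in the same way as the setup cell (with its own random generator, so the values are not identical), and the helper names `mse_gradient` and `gradient_spread` are made up for this example.

```python
import numpy as np

# Synthetic linear-regression data drawn like the notebook's setup cell,
# but from a local generator so the global random state is left untouched.
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]  # add the bias column x0 = 1

def mse_gradient(theta, X_batch, y_batch):
    """Gradient of the mean squared error over the given rows."""
    m = len(X_batch)
    return 2 / m * X_batch.T @ (X_batch @ theta - y_batch)

theta = np.zeros((2, 1))
full_grad = mse_gradient(theta, X_b, y)  # Batch GD: all 100 rows

def gradient_spread(batch_size, n_trials=2000):
    """Mean distance between a randomly sampled batch gradient and the full gradient."""
    errors = []
    for _ in range(n_trials):
        idx = rng.choice(100, size=batch_size, replace=False)
        g = mse_gradient(theta, X_b[idx], y[idx])
        errors.append(np.linalg.norm(g - full_grad))
    return float(np.mean(errors))

# batch size 1 = SGD, 20 = mini-batch, 100 = full batch
spread = {m: gradient_spread(m) for m in (1, 20, 100)}
for m, s in spread.items():
    print(f"batch size {m:>3}: mean deviation from full gradient = {s:.3f}")
```

The deviation shrinks as the batch grows and vanishes (up to floating-point rounding) when the batch covers the whole dataset, which is the noise-versus-cost trade-off described in the list above.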

In [None]:
def plot_sgd_paths():
    """
    Implement and visualize the paths of Batch, Stochastic, and Mini-Batch GD.
    """
    n_epochs = 50
    t0, t1 = 5, 50  # Learning schedule hyperparameters for SGD/Mini-batch

    def learning_schedule(t):
        return t0 / (t + t1)

    # --- Batch GD ---
    theta_batch = np.random.randn(2, 1)
    path_batch = [theta_batch]
    learning_rate_batch = 0.1
    for epoch in range(n_epochs):
        gradients = 2/100 * X_b.T.dot(X_b.dot(theta_batch) - y)
        theta_batch = theta_batch - learning_rate_batch * gradients
        path_batch.append(theta_batch)
    path_batch = np.array(path_batch)

    # --- Stochastic GD ---
    theta_sgd = np.random.randn(2, 1)
    path_sgd = [theta_sgd]
    for epoch in range(n_epochs):
        for i in range(100):
            random_index = np.random.randint(100)
            xi = X_b[random_index:random_index+1]
            yi = y[random_index:random_index+1]
            gradients = 2 * xi.T.dot(xi.dot(theta_sgd) - yi)
            eta = learning_schedule(epoch * 100 + i)
            theta_sgd = theta_sgd - eta * gradients
            path_sgd.append(theta_sgd)
    path_sgd = np.array(path_sgd)

    # --- Mini-Batch GD ---
    theta_mini = np.random.randn(2, 1)
    path_mini = [theta_mini]
    batch_size = 20
    for epoch in range(n_epochs):
        shuffled_indices = np.random.permutation(100)
        X_b_shuffled = X_b[shuffled_indices]
        y_shuffled = y[shuffled_indices]
        for i in range(0, 100, batch_size):
            xi = X_b_shuffled[i:i+batch_size]
            yi = y_shuffled[i:i+batch_size]
            gradients = 2/batch_size * xi.T.dot(xi.dot(theta_mini) - yi)
            eta = learning_schedule(epoch * (100/batch_size) + i/batch_size)
            theta_mini = theta_mini - eta * gradients
            path_mini.append(theta_mini)
    path_mini = np.array(path_mini)
    
    # Plotting
    plt.figure(figsize=(12, 8))
    plt.plot(path_batch[:, 0], path_batch[:, 1], "b-o", linewidth=3, label="Batch GD")
    plt.plot(path_sgd[:, 0], path_sgd[:, 1], "r-s", linewidth=1, label="Stochastic GD", markersize=1, alpha=0.6)
    plt.plot(path_mini[:, 0], path_mini[:, 1], "g-^", linewidth=2, label="Mini-Batch GD", markersize=3)
    
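    # Parameters used to generate the data (intercept 4, slope 3) and the actual
    # minimum of the MSE on this noisy sample (the least-squares solution)
    true_theta = np.array([[4], [3]])
    theta_best = np.linalg.lstsq(X_b, y, rcond=None)[0]
    plt.plot(true_theta[0], true_theta[1], 'y*', markersize=20, label='Generating Parameters')
    plt.plot(theta_best[0], theta_best[1], 'k*', markersize=20, label='Least-Squares Minimum')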
    
    plt.legend()
    plt.xlabel(r"$\theta_0$ (Bias)")
    plt.ylabel(r"$\theta_1$ (Weight)")
    plt.title("Comparison of Gradient Descent Optimization Paths", fontsize=16, weight='bold')
    plt.axis([2.5, 4.5, 2, 4])
    plt.show()

plot_sgd_paths()

### Analysis of the Paths

- **Batch GD (Blue)**: Takes a smooth path toward the minimum. It's predictable, but every step requires a pass over the full dataset.
- **Stochastic GD (Red)**: Bounces around erratically. It gets to the vicinity of the minimum quickly and then keeps jittering around it. The decaying learning schedule shrinks these jumps over time, whereas a fixed learning rate would keep it bouncing indefinitely. The individual steps are very cheap.
- **Mini-Batch GD (Green)**: A happy medium. It's less erratic than SGD, and each update costs only a fraction of a full Batch GD pass (20 of the 100 samples here). This makes it the usual choice in practice.
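
How much the stochastic paths calm down depends on the learning schedule used in `plot_sgd_paths`. As a small sketch, the snippet below evaluates that same schedule (`t0 = 5`, `t1 = 50`) at the first and last update of each stochastic variant; the variable names for the step counts are illustrative only.

```python
t0, t1 = 5, 50  # same schedule hyperparameters as in plot_sgd_paths

def learning_schedule(t):
    """Step size after t updates: starts at t0 / t1 and decays roughly like 1 / t."""
    return t0 / (t + t1)

n_epochs = 50
sgd_last_step = n_epochs * 100 - 1           # 100 single-sample updates per epoch
mini_last_step = n_epochs * (100 // 20) - 1  # 5 updates per epoch with batch_size = 20

eta_start = learning_schedule(0)
eta_sgd_end = learning_schedule(sgd_last_step)
eta_mini_end = learning_schedule(mini_last_step)

print(f"initial step size:              {eta_start:.4f}")
print(f"SGD step size at the end:       {eta_sgd_end:.5f}")
print(f"mini-batch step size at the end: {eta_mini_end:.5f}")
```

Because SGD performs many more updates, its step size ends up far smaller than the mini-batch one, so both paths tighten around the minimum at different rates.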

---

# 🎢 Chapter 2: The Power of Momentum

A problem with standard SGD is that it can be slow when navigating long, narrow valleys in the loss surface. It tends to oscillate back and forth across the narrow (steep) axis while making slow progress along the long (shallow) axis.

**Momentum** is a technique that helps accelerate SGD in the relevant direction and can dampen oscillations. It adds a 'memory' of previous gradients to the current update.

### The Momentum Update Rule

1. **Update the velocity vector `v`**: 
   $$ v_{new} = \beta v_{old} + \eta \nabla L(\theta) $$

2. **Update the parameters `θ`**:
   $$ \theta_{new} = \theta_{old} - v_{new} $$

Where:
- **`β` (beta)** is the momentum term (e.g., 0.9). It controls how much of the past velocity is carried over.
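- **`η` (eta)** is the learning rate.
- The velocity `v` is an exponentially decaying sum of past gradients (each scaled by `η`), so it behaves like a rolling average that weights recent gradients most heavily.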

**Intuition**: Imagine a ball rolling down a hill. It accumulates momentum, moving faster in the downhill direction and smoothing out its path over small bumps.
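
The two-line update rule can be sketched as a small function. This is a minimal illustration rather than a reference implementation: `momentum_step` and the constant gradient of `1.0` are made up for the example, chosen to show how the velocity builds up when the gradient keeps pointing the same way.

```python
def momentum_step(theta, v, grad, learning_rate=0.1, beta=0.9):
    """One momentum update: v <- beta * v + lr * grad, then theta <- theta - v."""
    v_new = beta * v + learning_rate * grad
    theta_new = theta - v_new
    return theta_new, v_new

# Feed a constant gradient to see how the velocity accumulates.
theta, v = 0.0, 0.0
constant_grad = 1.0
velocities = []
for _ in range(200):
    theta, v = momentum_step(theta, v, constant_grad)
    velocities.append(v)

# Geometric-series limit of the velocity: lr * grad / (1 - beta)
terminal_velocity = 0.1 * constant_grad / (1 - 0.9)

print(f"first step size:             {velocities[0]:.3f}")
print(f"step size after 200 updates: {velocities[-1]:.3f}")
print(f"geometric-series limit:      {terminal_velocity:.3f}")
```

With `β = 0.9` the step size under a steady gradient grows toward `η / (1 - β)` times the gradient, ten times the plain gradient step, which is the "ball picking up speed" in the intuition above.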

In [None]:
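def non_convex_loss(w, b):
    """ A quadratic bowl with a long, narrow valley: shallow along w, steep along b.
    Note: this function is convex; the name is kept for the calls below. """
    return 0.1 * w**2 + 5 * b**2

def non_convex_gradient(w, b):
    """ Gradient of the valley function, returned as (dL/dw, dL/db). """
    return 0.2 * w, 10 * b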

def compare_momentum():
    """
    Compare SGD with and without momentum on a challenging loss surface.
    """
    # Parameters
    learning_rate = 0.1
    n_steps = 40
    start_w, start_b = -9, 1.5
    
    # --- SGD without Momentum ---
    path_sgd = [(start_w, start_b)]
    w, b = start_w, start_b
    for _ in range(n_steps):
        grad_w, grad_b = non_convex_gradient(w, b)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        path_sgd.append((w, b))
    path_sgd = np.array(path_sgd)

    # --- SGD with Momentum ---
    path_momentum = [(start_w, start_b)]
    w, b = start_w, start_b
    beta = 0.9
    v_w, v_b = 0, 0
    for _ in range(n_steps):
        grad_w, grad_b = non_convex_gradient(w, b)
        v_w = beta * v_w + learning_rate * grad_w
        v_b = beta * v_b + learning_rate * grad_b
        w -= v_w
        b -= v_b
        path_momentum.append((w, b))
    path_momentum = np.array(path_momentum)
    
    # Plotting
    w_grid = np.linspace(-10, 10, 100)
    b_grid = np.linspace(-2, 2, 100)
    W, B = np.meshgrid(w_grid, b_grid)
    L = non_convex_loss(W, B)
    
    plt.figure(figsize=(14, 8))
    plt.contour(W, B, L, levels=20, cmap='viridis')
    plt.plot(path_sgd[:, 0], path_sgd[:, 1], 'r-o', label='SGD without Momentum')
    plt.plot(path_momentum[:, 0], path_momentum[:, 1], 'b-o', label='SGD with Momentum (β=0.9)')
    plt.plot(0, 0, 'y*', markersize=20, label='Minimum')
    plt.legend()
    plt.title('The Effect of Momentum', fontsize=16, weight='bold')
    plt.xlabel('Parameter w')
    plt.ylabel('Parameter b')
    plt.show()

compare_momentum()

### Analysis of Momentum

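- **SGD without Momentum (Red)**: With `learning_rate = 0.1` and a `b`-gradient of `10 * b`, the first update gives `b - 0.1 * 10 * b = 0`, so the path drops onto the valley floor in a single step instead of oscillating. From there each step shrinks `w` by only 2% (`w` is multiplied by `0.98`), so after 40 steps it is still at about `w ≈ -4`, a little over halfway to the minimum. The zig-zag pattern usually associated with narrow valleys appears at larger step sizes: between 0.1 and 0.2 the sign of `b` flips on every step, and above 0.2 the updates diverge.
- **SGD with Momentum (Blue)**: The velocity accumulates along the consistent `w` direction, so the path reaches `w ≈ 0` in roughly 15 steps, far sooner than the red path. With these particular settings the accumulated velocity also carries it past the valley floor in `b` (and past the minimum in `w`), so it swings back and forth with an amplitude that shrinks gradually. The damping effect of momentum applies when successive gradients point in opposite directions: their contributions to `v` partially cancel, while gradients that keep the same sign add up.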

The faster progress along shallow directions is a large part of why optimizers with momentum (and more advanced techniques like Adam) are common defaults in modern deep learning.
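
The two trajectories plotted above can also be inspected numerically. The sketch below reruns both update rules with the same settings as `compare_momentum` (learning rate 0.1, `β = 0.9`, start at `(-9, 1.5)`, 40 steps); the helper names `valley_gradient` and `run` are hypothetical and exist only for this check.

```python
def valley_gradient(w, b):
    """Gradient of 0.1 * w**2 + 5 * b**2, the loss used in compare_momentum."""
    return 0.2 * w, 10 * b

def run(beta, learning_rate=0.1, n_steps=40, start=(-9.0, 1.5)):
    """Return the visited (w, b) points; beta = 0 reduces to plain gradient descent."""
    w, b = start
    v_w = v_b = 0.0
    path = [(w, b)]
    for _ in range(n_steps):
        grad_w, grad_b = valley_gradient(w, b)
        v_w = beta * v_w + learning_rate * grad_w
        v_b = beta * v_b + learning_rate * grad_b
        w -= v_w
        b -= v_b
        path.append((w, b))
    return path

plain = run(beta=0.0)
with_momentum = run(beta=0.9)

# First step at which the momentum path has reached or passed w = 0.
first_crossing = next((i for i, (w, _) in enumerate(with_momentum) if w >= 0), None)
# Largest excursion in b after the first step of the momentum path.
max_b_swing = max(abs(b) for _, b in with_momentum[1:])

print(f"plain:    b after 1 step   = {plain[1][1]:.3f}")
print(f"plain:    w after 40 steps = {plain[-1][0]:.2f}")
print(f"momentum: first step with w >= 0 = {first_crossing}")
print(f"momentum: largest |b| after step 1 = {max_b_swing:.3f}")
```

Changing `learning_rate` or `beta` in the call to `run` is a quick way to see how sensitive both paths are to these two hyperparameters.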

---

# 🎯 Key Takeaways

## 👪 The Gradient Descent Family
- **Batch GD**: Accurate but slow. Uses all data.
- **Stochastic GD (SGD)**: Fast but noisy. Uses one data point.
- **Mini-Batch GD**: The practical choice. Uses a small batch of data, balancing speed and stability.

## 🎲 The Benefit of Noise
- The randomness of SGD and Mini-Batch GD can be a feature, not a bug.
- The noisy updates can help the algorithm **jump out of poor local minima** and find a better overall solution, which is especially important for the complex, non-convex loss surfaces in deep learning.

## 🎢 Accelerating with Momentum
- **Momentum** helps SGD to converge faster, especially on surfaces with long, narrow valleys.
- It accumulates an exponentially decaying sum of past gradients: steps grow along directions where the gradient keeps its sign, while gradients that alternate in sign partially cancel.

---

# 🚀 What's Next?

While Momentum is a great improvement, the world of optimization has evolved even further. The next notebook will introduce **Adaptive Optimizers** like AdaGrad, RMSprop, and the king of them all, **Adam**.

- **Adaptive Learning Rates**: How can we give each parameter its own learning rate?
- **RMSprop and AdaGrad**: The building blocks of modern optimizers.
- **Adam**: The combination of Momentum and adaptive learning rates that has become a common default optimizer for many deep learning tasks.

**Ready to adapt? Let's explore the state-of-the-art in optimization! 🤖**