# Gradient Descent and Optimization Intuition

---

## Learning Objectives

By the end of this notebook you will be able to:

- Explain **optimization** as the process of finding parameters that minimize a cost function
- Describe the **gradient descent** algorithm and its intuition (walking downhill blindfolded)
- Write and interpret the parameter update rule
- Explain the effect of the **learning rate** on convergence
- Distinguish between **batch**, **stochastic**, and **mini-batch** gradient descent
- Implement gradient descent from scratch for simple linear regression
- Visualize the optimization path on a contour/surface plot

## Prerequisites

- Understanding of loss and cost functions (Notebook 01)
- Basic linear regression concepts (ML200)
- Basic calculus intuition (what a derivative/gradient means)

## Table of Contents

1. [What Is Optimization?](#1)
2. [Gradient Descent Intuition](#2)
3. [The Update Rule](#3)
4. [Learning Rate Effects](#4)
5. [Batch vs Stochastic vs Mini-Batch](#5)
6. [Convex vs Non-Convex](#6)
7. [Gradient Descent from Scratch](#7)
8. [Visualizing the Optimization Path](#8)
9. [Common Mistakes](#9)
10. [Exercise](#10)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

<a id='1'></a>
## 1. What Is Optimization?

In machine learning, **optimization** means:

> Find the parameters $\theta$ that **minimize** the cost function $J(\theta)$.

$$\theta^* = \arg\min_{\theta} J(\theta)$$

For linear regression with MSE:

$$J(w, b) = \frac{1}{n}\sum_{i=1}^n (y^{(i)} - (wx^{(i)} + b))^2$$

We need to find the values of $w$ and $b$ that make $J$ as small as possible.
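
To make the objective concrete, the sketch below evaluates $J(w, b)$ for a few candidate parameter pairs on a small hand-made dataset. The data and the helper name `mse_cost` are illustrative assumptions and are not used by the later cells.

```python
import numpy as np

# Toy data generated exactly by y = 2x + 1 (illustrative assumption)
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 2.0 * x_toy + 1.0

def mse_cost(w, b, x, y):
    """Mean squared error J(w, b) for the line y_hat = w*x + b."""
    return np.mean((y - (w * x + b)) ** 2)

# Three candidate (w, b) pairs; the last one matches the generating line
candidates = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
costs = {wb: mse_cost(wb[0], wb[1], x_toy, y_toy) for wb in candidates}

for (w_c, b_c), c in costs.items():
    print(f"w={w_c:.1f}, b={b_c:.1f} -> J = {c:.3f}")
```

The pair that reproduces the generating line has zero cost; optimization is the search for that pair when it is not known in advance.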

<a id='2'></a>
## 2. Gradient Descent Intuition

Imagine you are standing on a hilly surface **blindfolded** and want to reach the lowest point:

1. **Feel the slope** beneath your feet (compute the gradient)
2. **Take a step downhill** in the steepest direction (update parameters)
3. **Repeat** until the ground feels flat (gradient near zero)

The **gradient** $\nabla J$ tells you:
- The **direction** of steepest ascent
- The **magnitude** of the slope

We move in the **opposite** direction (steepest descent) to minimize the cost.
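
A quick numerical check of this intuition, using an assumed bowl-shaped function $J(w, b) = w^2 + 3b^2$ chosen only for illustration: a small step against the gradient lowers the cost, while the same step along the gradient raises it.

```python
import numpy as np

def bowl(theta):
    """Illustrative cost J(w, b) = w^2 + 3*b^2 with its minimum at (0, 0)."""
    w, b = theta
    return w ** 2 + 3 * b ** 2

def bowl_grad(theta):
    """Analytic gradient of the bowl: [2w, 6b]."""
    w, b = theta
    return np.array([2 * w, 6 * b])

theta0 = np.array([2.0, 1.0])
g = bowl_grad(theta0)   # direction of steepest ascent at theta0
step = 0.05             # small step size (hand-picked)

cost_here = bowl(theta0)
cost_downhill = bowl(theta0 - step * g)  # move against the gradient
cost_uphill = bowl(theta0 + step * g)    # move along the gradient

print(f"gradient = {g}, magnitude = {np.linalg.norm(g):.3f}")
print(f"J here = {cost_here:.2f}, downhill = {cost_downhill:.2f}, uphill = {cost_uphill:.2f}")
```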

<a id='3'></a>
## 3. The Update Rule

The gradient descent update rule for a parameter $w$:

$$w := w - \alpha \frac{\partial J}{\partial w}$$

Where:
- $\alpha$ = **learning rate** (step size)
- $\frac{\partial J}{\partial w}$ = partial derivative of the cost with respect to $w$

For simple linear regression $\hat{y} = wx + b$:

$$\frac{\partial J}{\partial w} = \frac{-2}{n}\sum_{i=1}^{n} x^{(i)}(y^{(i)} - \hat{y}^{(i)})$$

$$\frac{\partial J}{\partial b} = \frac{-2}{n}\sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})$$
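
One way to build trust in these formulas is a gradient check: compare them with central finite differences of the cost, then apply a single update and confirm the cost drops. The sketch below does this on synthetic data; the data, the evaluation point, and the names `cost_fn` and `analytic_grads` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_chk = rng.normal(size=50)
y_chk = 1.5 * x_chk - 0.5 + 0.1 * rng.normal(size=50)

def cost_fn(w, b):
    """MSE cost J(w, b) on the synthetic data."""
    return np.mean((y_chk - (w * x_chk + b)) ** 2)

def analytic_grads(w, b):
    """Partial derivatives of J from the formulas above."""
    resid = y_chk - (w * x_chk + b)
    n_chk = len(x_chk)
    dw = (-2 / n_chk) * np.sum(x_chk * resid)
    db = (-2 / n_chk) * np.sum(resid)
    return dw, db

# Central finite differences at an arbitrary point (w0, b0)
w0, b0, eps = 0.3, -0.2, 1e-6
dw_num = (cost_fn(w0 + eps, b0) - cost_fn(w0 - eps, b0)) / (2 * eps)
db_num = (cost_fn(w0, b0 + eps) - cost_fn(w0, b0 - eps)) / (2 * eps)
dw_ana, db_ana = analytic_grads(w0, b0)
print(f"dJ/dw: analytic {dw_ana:.6f} vs numeric {dw_num:.6f}")
print(f"dJ/db: analytic {db_ana:.6f} vs numeric {db_num:.6f}")

# One application of the update rule with a hand-picked learning rate
alpha = 0.1
w1 = w0 - alpha * dw_ana
b1 = b0 - alpha * db_ana
print(f"cost before: {cost_fn(w0, b0):.4f}, after one update: {cost_fn(w1, b1):.4f}")
```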

<a id='4'></a>
## 4. Learning Rate Effects

| Learning Rate | Behavior |
|---------------|----------|
| Too small | Converges, but extremely slowly |
| Just right | Smooth convergence to the minimum |
| Too large | Overshoots and oscillates; may **diverge** |
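
For a quadratic cost such as $J(w) = (w - 3)^2$, these regimes can be worked out exactly. Substituting the gradient $2(w - 3)$ into the update rule gives $w_{\text{new}} - 3 = (1 - 2\alpha)(w - 3)$, so every step multiplies the distance to the minimum by $1 - 2\alpha$. The sketch below tabulates that factor and checks it against a direct simulation. The helper names are assumptions for illustration, and the thresholds hold for this specific cost only.

```python
def contraction_factor(alpha):
    """Per-step multiplier of the error (w - 3) for J(w) = (w - 3)^2."""
    return 1 - 2 * alpha

def describe(alpha):
    """Classify gradient descent on this cost for a learning rate alpha > 0."""
    r = contraction_factor(alpha)
    if abs(r) >= 1:
        return "does not converge"
    if r < 0:
        return "converges, overshooting each step"
    return "converges without overshooting"

def error_after(alpha, n_steps=30, w_start=-1.0):
    """Distance |w - 3| after applying the update rule n_steps times."""
    w = w_start
    for _ in range(n_steps):
        w = w - alpha * 2 * (w - 3)
    return abs(w - 3)

for alpha in [0.01, 0.3, 0.95, 1.1]:
    print(f"alpha={alpha}: factor={contraction_factor(alpha):+.2f}, "
          f"{describe(alpha)}, |w - 3| after 30 steps = {error_after(alpha):.3g}")
```

Note that $\alpha = 0.95$ still converges on this cost, only slowly and with a sign flip at every step; divergence starts once $|1 - 2\alpha| > 1$, that is, for $\alpha > 1$.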

Let's visualize this with a simple 1D quadratic cost function $J(w) = (w - 3)^2$.

In [None]:
# --- Demonstrate learning rate effect on a simple 1D cost function ---
def cost_1d(w):
    return (w - 3) ** 2

def grad_1d(w):
    return 2 * (w - 3)

learning_rates = [0.01, 0.3, 0.95]
titles = ['Too Small ($\\alpha=0.01$)', 'Just Right ($\\alpha=0.3$)', 'Too Large ($\\alpha=0.95$)']
colors = ['steelblue', 'green', 'crimson']
n_steps = 30

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
w_range = np.linspace(-2, 8, 200)

for ax, lr, title, color in zip(axes, learning_rates, titles, colors):
    ax.plot(w_range, cost_1d(w_range), 'k-', linewidth=1.5, alpha=0.5)

    w = -1.0  # starting point
    path_w = [w]
    path_j = [cost_1d(w)]

    for _ in range(n_steps):
        w = w - lr * grad_1d(w)
        path_w.append(w)
        path_j.append(cost_1d(w))

    ax.plot(path_w, path_j, 'o-', color=color, markersize=5, linewidth=1.5, label='GD path')
    ax.set_title(title, fontsize=13)
    ax.set_xlabel('w')
    ax.set_ylabel('J(w)')
    ax.set_ylim(-1, 30)
    ax.set_xlim(-3, 9)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Too small:  Gets there eventually but wastes computation.")
print("Just right: Converges smoothly in a few steps.")
print("Too large:  Overshoots and oscillates around the minimum (diverges for this cost if alpha > 1.0).")

<a id='5'></a>
## 5. Batch vs Stochastic vs Mini-Batch

| Variant | Gradient computed on | Pros | Cons |
|---------|---------------------|------|------|
| **Batch GD** | All $n$ samples | Stable, smooth convergence | Slow for large datasets |
| **Stochastic GD (SGD)** | 1 sample at a time | Fast updates, can escape local minima | Noisy, oscillates |
| **Mini-batch GD** | A batch of $m$ samples | Balance of speed and stability | Requires batch size tuning |

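- **Batch GD** is what we implement below. Note that scikit-learn's `LinearRegression` solves the least-squares problem directly and `LogisticRegression` defaults to the L-BFGS solver; its gradient-descent-style estimators are `SGDRegressor` and `SGDClassifier`.
- **SGD** and **mini-batch** GD are the standard choice in deep learning, where computing the gradient over the full dataset for every update is too expensive.
- Tree-based models do not fit parameters this way: Random Forests grow trees by greedy recursive splitting, and XGBoost uses gradients of the loss to fit each new tree rather than to update a fixed parameter vector.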
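
The three variants in the table differ only in how many samples feed each gradient computation. A minimal sketch, assuming synthetic data and a hypothetical helper `minibatch_gd` (not defined elsewhere in this notebook): setting `batch_size` to the dataset size gives batch GD, and setting it to 1 gives stochastic GD.

```python
import numpy as np

def minibatch_gd(x, y, lr=0.05, n_epochs=50, batch_size=16, seed=0):
    """Gradient descent for y_hat = w*x + b using batches of `batch_size` samples.

    batch_size = len(x) -> batch GD; batch_size = 1 -> stochastic GD.
    """
    rng = np.random.default_rng(seed)
    n_samples = len(x)
    w, b = 0.0, 0.0
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)  # reshuffle the data every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            resid = y[idx] - (w * x[idx] + b)
            # Same gradient formulas as before, averaged over the batch only
            dw = (-2 / len(idx)) * np.sum(x[idx] * resid)
            db = (-2 / len(idx)) * np.sum(resid)
            w -= lr * dw
            b -= lr * db
    return w, b

# Synthetic data with true slope 3 and intercept 2 (illustrative assumption)
rng = np.random.default_rng(1)
x_demo = rng.normal(size=200)
y_demo = 3.0 * x_demo + 2.0 + 0.1 * rng.normal(size=200)

results = {}
for name, bs in [("batch", 200), ("mini-batch", 16), ("stochastic", 1)]:
    results[name] = minibatch_gd(x_demo, y_demo, batch_size=bs)
    w_fit, b_fit = results[name]
    print(f"{name:>10}: w = {w_fit:.3f}, b = {b_fit:.3f}")
```

With the same number of epochs, smaller batches perform many more (noisier) updates, which is the speed-versus-stability trade-off summarized in the table.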

<a id='6'></a>
## 6. Convex vs Non-Convex

- **Convex** cost function: shaped like a bowl, so every local minimum is also a **global minimum**
  - Linear regression (MSE) is convex
  - Logistic regression (cross-entropy) is convex
  - With a suitable learning rate, gradient descent converges to the global minimum

- **Non-convex** cost function: can have **multiple local minima**
  - Neural network cost functions are non-convex
  - Gradient descent may get stuck in a local minimum, depending on where it starts

For this course, all models we study (linear regression, logistic regression) have convex cost functions, so gradient descent will always converge to the optimal solution (given an appropriate learning rate).
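
On a non-convex cost, the result of gradient descent depends on the starting point. The sketch below runs plain gradient descent on the illustrative function $J(w) = w^4 - 8w^2 + 5$, which has minima at $w = -2$ and $w = 2$ and a local maximum at $w = 0$. The learning rate and step count are hand-picked assumptions.

```python
def nonconvex_cost(w):
    """Illustrative non-convex cost with minima at w = -2 and w = 2."""
    return w ** 4 - 8 * w ** 2 + 5

def nonconvex_grad(w):
    """Derivative of the cost: 4w^3 - 16w."""
    return 4 * w ** 3 - 16 * w

def run_gd(w_start, lr=0.01, n_steps=200):
    """Apply the update rule n_steps times and return the final w."""
    w = w_start
    for _ in range(n_steps):
        w = w - lr * nonconvex_grad(w)
    return w

for w_start in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    w_end = run_gd(w_start)
    print(f"start {w_start:+.1f} -> end {w_end:+.4f}, J = {nonconvex_cost(w_end):.2f}")
```

Starts on the left of zero end in the left minimum, and starts on the right end in the right one. A start at exactly $w = 0$ never moves, because the gradient there is zero. In this particular function the two minima are equally deep; in general one of them may be worse than the other.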

In [None]:
# --- Convex vs Non-Convex visualization ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Convex
x_conv = np.linspace(-4, 4, 200)
y_conv = x_conv ** 2
axes[0].plot(x_conv, y_conv, 'steelblue', linewidth=2.5)
axes[0].scatter([0], [0], color='green', s=100, zorder=5, label='Global minimum')
axes[0].set_title('Convex: One Global Minimum', fontsize=13)
axes[0].set_xlabel('w')
axes[0].set_ylabel('J(w)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Non-convex
x_nc = np.linspace(-4, 4, 200)
y_nc = x_nc ** 4 - 8 * x_nc ** 2 + 5
axes[1].plot(x_nc, y_nc, 'crimson', linewidth=2.5)
# Mark the two minima at w = -2 and w = 2 (equally deep for this function)
local_min_x = np.array([-2.0, 2.0])
local_min_y = local_min_x ** 4 - 8 * local_min_x ** 2 + 5
axes[1].scatter(local_min_x, local_min_y, color='green', s=100, zorder=5, label='Minima')
axes[1].scatter([0], [5], color='orange', s=100, zorder=5, marker='^', label='Local maximum')
axes[1].set_title('Non-Convex: Multiple Minima', fontsize=13)
axes[1].set_xlabel('w')
axes[1].set_ylabel('J(w)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

<a id='7'></a>
## 7. Gradient Descent from Scratch

We will implement batch gradient descent for simple linear regression (1 feature):

$$\hat{y} = wx + b$$
$$J(w, b) = \frac{1}{n}\sum_{i=1}^n (y^{(i)} - wx^{(i)} - b)^2$$

In [None]:
# --- Generate simple linear data ---
np.random.seed(42)
n = 100
X_raw = 2 * np.random.rand(n)
y_data = 4 + 3 * X_raw + np.random.randn(n) * 0.5

# Standardize X for better gradient descent behavior
X_mean, X_std = X_raw.mean(), X_raw.std()
X_data = (X_raw - X_mean) / X_std

plt.scatter(X_data, y_data, alpha=0.6, s=20)
plt.xlabel('x (standardized)')
plt.ylabel('y')
plt.title('Simple Linear Regression Data')
plt.grid(True, alpha=0.3)
plt.show()
print("True relationship: y = 4 + 3*x  (before standardization)")
print(f"Samples: {n}")

In [None]:
# --- Gradient descent implementation ---
def gradient_descent(X, y, lr=0.1, n_iters=100):
    """
    Batch gradient descent for simple linear regression.

    Parameters
    ----------
    X : array, shape (n,)  -- feature values
    y : array, shape (n,)  -- target values
    lr : float             -- learning rate
    n_iters : int          -- number of iterations

    Returns
    -------
    w, b : final parameters
    history : dict with 'cost', 'w', 'b' lists
    """
    n = len(X)
    w = 0.0   # initialize weight
    b = 0.0   # initialize bias

    history = {'cost': [], 'w': [], 'b': []}

    for i in range(n_iters):
        # Forward pass: predictions
        y_hat = w * X + b

        # Compute cost (MSE)
        cost = np.mean((y - y_hat) ** 2)
        history['cost'].append(cost)
        history['w'].append(w)
        history['b'].append(b)

        # Compute gradients
        dw = (-2 / n) * np.sum(X * (y - y_hat))
        db = (-2 / n) * np.sum(y - y_hat)

        # Update parameters
        w = w - lr * dw
        b = b - lr * db

    return w, b, history

# Run gradient descent
w_final, b_final, hist = gradient_descent(X_data, y_data, lr=0.1, n_iters=200)

print(f"Learned parameters:  w = {w_final:.4f},  b = {b_final:.4f}")
print(f"Final cost (MSE):    {hist['cost'][-1]:.4f}")
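
Because the model was fit on a standardized feature, the learned `w` and `b` are not directly comparable to the true slope 3 and intercept 4. A minimal sketch of two sanity checks, assuming the same data recipe and seed as above: compare gradient descent with a closed-form least-squares fit from `np.polyfit`, then map the parameters back to the raw scale. The variable names here are chosen for illustration so they do not clash with the notebook's own.

```python
import numpy as np

# Recreate the data with the same seed and recipe as the cell above
np.random.seed(42)
x_raw = 2 * np.random.rand(100)
y_obs = 4 + 3 * x_raw + np.random.randn(100) * 0.5
x_mu, x_sigma = x_raw.mean(), x_raw.std()
x_std = (x_raw - x_mu) / x_sigma

# Plain batch gradient descent on the standardized feature
w_gd, b_gd = 0.0, 0.0
for _ in range(200):
    resid = y_obs - (w_gd * x_std + b_gd)
    w_gd -= 0.1 * (-2 / len(x_std)) * np.sum(x_std * resid)
    b_gd -= 0.1 * (-2 / len(x_std)) * np.sum(resid)

# Closed-form least-squares line on the same feature: returns [slope, intercept]
w_ls, b_ls = np.polyfit(x_std, y_obs, deg=1)

# Undo the standardization:
# y = w*(x - mu)/sigma + b  =>  slope = w/sigma, intercept = b - w*mu/sigma
slope_raw = w_gd / x_sigma
intercept_raw = b_gd - w_gd * x_mu / x_sigma

print(f"GD:          w = {w_gd:.4f}, b = {b_gd:.4f}")
print(f"Closed form: w = {w_ls:.4f}, b = {b_ls:.4f}")
print(f"Raw scale:   slope = {slope_raw:.4f}, intercept = {intercept_raw:.4f}")
```

The raw-scale values should land near 3 and 4, with the remaining gap due to noise in the data rather than to the optimizer.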

In [None]:
# --- Plot: Cost vs iterations (convergence) ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Cost curve
axes[0].plot(hist['cost'], color='steelblue', linewidth=2)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Cost (MSE)')
axes[0].set_title('Cost vs Iterations -- Convergence')
axes[0].grid(True, alpha=0.3)

# Fit line
axes[1].scatter(X_data, y_data, alpha=0.5, s=20, label='Data')
x_line = np.linspace(X_data.min(), X_data.max(), 100)
axes[1].plot(x_line, w_final * x_line + b_final, 'r-', linewidth=2, label=f'Fit: y={w_final:.2f}x+{b_final:.2f}')
axes[1].set_xlabel('x (standardized)')
axes[1].set_ylabel('y')
axes[1].set_title('Fitted Line after Gradient Descent')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# --- Compare different learning rates ---
learning_rates_cmp = [0.001, 0.01, 0.1, 0.5]
colors_cmp = ['gray', 'steelblue', 'green', 'crimson']

fig, ax = plt.subplots(figsize=(10, 6))

for lr, color in zip(learning_rates_cmp, colors_cmp):
    _, _, h = gradient_descent(X_data, y_data, lr=lr, n_iters=200)
    ax.plot(h['cost'], color=color, linewidth=2, label=f'$\\alpha={lr}$')

ax.set_xlabel('Iteration')
ax.set_ylabel('Cost (MSE)')
ax.set_title('Learning Rate Comparison')
ax.set_ylim(0, 30)
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("alpha=0.001: Very slow convergence (needs more iterations).")
print("alpha=0.01:  Moderate convergence.")
print("alpha=0.1:   Fast, smooth convergence.")
print("alpha=0.5:   Reaches the minimum in a single step here (x is standardized, so the per-step error factor 1 - 2*alpha is 0).")

<a id='8'></a>
## 8. Visualizing the Optimization Path

We can visualize gradient descent as a path on the **cost surface** (3D) or **contour plot** (2D).

In [None]:
# --- Build the cost surface for visualization ---
w_range = np.linspace(-2, 4, 100)
b_range = np.linspace(-1, 8, 100)  # extended so the grid covers the GD starting point b = 0
W, B = np.meshgrid(w_range, b_range)

# Compute cost for each (w, b) pair
Cost_surface = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        y_hat = W[i, j] * X_data + B[i, j]
        Cost_surface[i, j] = np.mean((y_data - y_hat) ** 2)

print(f"Cost surface computed over a {W.shape[0]}x{W.shape[1]} grid.")

In [None]:
# --- 3D Surface Plot ---
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

ax.plot_surface(W, B, Cost_surface, cmap='viridis', alpha=0.7, edgecolor='none')
ax.plot(hist['w'], hist['b'], hist['cost'], 'r.-', markersize=4, linewidth=1.5, label='GD path')

ax.set_xlabel('w')
ax.set_ylabel('b')
ax.set_zlabel('Cost J(w,b)')
ax.set_title('Cost Surface with Gradient Descent Path')
ax.view_init(elev=30, azim=220)
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
# --- 2D Contour Plot with GD path ---
fig, ax = plt.subplots(figsize=(10, 8))

contour = ax.contour(W, B, Cost_surface, levels=30, cmap='viridis')
ax.clabel(contour, inline=True, fontsize=8)

# Plot the GD path
ax.plot(hist['w'], hist['b'], 'ro-', markersize=3, linewidth=1.5, label='GD path')
ax.plot(hist['w'][0], hist['b'][0], 'rs', markersize=12, label='Start')
ax.plot(hist['w'][-1], hist['b'][-1], 'r*', markersize=15, label='End')

ax.set_xlabel('w', fontsize=13)
ax.set_ylabel('b', fontsize=13)
ax.set_title('Contour Plot with Gradient Descent Path', fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()

print("The path starts at (w=0, b=0) and converges to the minimum of the bowl-shaped cost surface.")

<a id='9'></a>
## 9. Common Mistakes

1. **Wrong learning rate**
   - Too small: wastes compute, may appear to "not learn"
   - Too large: cost oscillates or diverges (values can overflow to `inf` or `NaN`)
   - Always plot cost vs iterations to check convergence

2. **Not scaling features before gradient descent**
   - Features on different scales create elongated contours
   - GD zigzags and converges very slowly
   - **Always standardize** (zero mean, unit variance) before running GD

3. **Expecting gradient descent for tree-based models**
   - Decision Trees and Random Forests are built by greedy recursive splitting, not GD
   - Gradient Boosting uses gradients of the loss to fit each new tree, but it does not iteratively update a fixed parameter vector
   - GD applies to models with a differentiable cost in continuous parameters: linear regression, logistic regression, neural networks

4. **Confusing batch size terminology**
   - "Batch" GD uses ALL data per update
   - "Mini-batch" uses a subset (e.g., 32, 64 samples)
   - "Stochastic" uses 1 sample per update

5. **Not running enough iterations**
   - If the cost curve has not flattened, the model has not converged
   - Always check the cost-vs-iterations plot
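
Mistake 2 is easy to reproduce. The sketch below counts how many updates batch GD needs before the gradient norm drops below a tolerance, once on a raw feature in the range 0 to 100 and once on its standardized version. The data, the tolerance, the iteration cap, and both learning rates are hand-picked assumptions; the raw feature is given a much smaller learning rate because larger values make the updates diverge at that scale.

```python
import numpy as np

def iterations_until_flat(x, y, lr, tol=1e-3, max_iters=5000):
    """Run batch GD and return the number of updates needed for the
    gradient norm to fall below `tol` (or `max_iters` if it never does)."""
    w, b = 0.0, 0.0
    n_samples = len(x)
    for it in range(max_iters):
        resid = y - (w * x + b)
        dw = (-2 / n_samples) * np.sum(x * resid)
        db = (-2 / n_samples) * np.sum(resid)
        if np.hypot(dw, db) < tol:
            return it
        w -= lr * dw
        b -= lr * db
    return max_iters

rng = np.random.default_rng(0)
x_big = rng.uniform(0, 100, size=200)  # feature on a large, uncentered scale
y_big = 0.5 * x_big + 10 + rng.normal(size=200)
x_scaled = (x_big - x_big.mean()) / x_big.std()

iters_raw = iterations_until_flat(x_big, y_big, lr=1e-4)
iters_scaled = iterations_until_flat(x_scaled, y_big, lr=0.1)

print(f"raw feature:          {iters_raw} iterations (cap is 5000)")
print(f"standardized feature: {iters_scaled} iterations")
```

The same loop also illustrates mistake 5: stopping on a gradient-norm tolerance is one simple way to avoid guessing the iteration count.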

<a id='10'></a>
## 10. Exercise

**Task:** Extend gradient descent to **multiple features**.

1. Generate 2-feature data: `X = np.random.randn(200, 2)`, `y = 3*X[:,0] + 5*X[:,1] + 2 + noise`
2. Modify the `gradient_descent` function to handle a weight vector `w` of shape `(2,)` and a scalar bias `b`
   - Hint: predictions become `y_hat = X @ w + b`
   - Gradient for `w`: `dw = (-2/n) * X.T @ (y - y_hat)`
3. Run GD with `lr=0.1` for 500 iterations
4. Print the learned `w` and `b` -- do they match the true values?
5. Plot cost vs iterations to verify convergence

In [None]:
# YOUR CODE HERE
# np.random.seed(42)
# X_ex = np.random.randn(200, 2)
# y_ex = 3 * X_ex[:, 0] + 5 * X_ex[:, 1] + 2 + np.random.randn(200) * 0.5
#
# def gradient_descent_multi(X, y, lr=0.1, n_iters=500):
#     ...
#     return w, b, history