<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Gradient%20Descent/Gradient%20Descent%20Code%20Walk%20Through.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gradient Descent: Code Walk Through

This notebook walks through the **computational steps** of the Gradient Descent optimization algorithm from scratch.

## What We'll Cover:
1. **Visualize the data** - understand the dataset
2. **Initialize weights** - start with random values
3. **Iterative optimization** - gradually improve weights using gradients
4. **Compute loss** - measure prediction error (SSE)
5. **Compute gradients** - find direction to update weights
6. **Update weights** - move in direction that reduces loss
7. **Visualize convergence** - watch loss decrease over iterations

We'll show **both manual calculation** (to understand the logic) and **vectorized matrix operations** (for efficiency).

### Key Concept:
- Gradient descent finds the **best fit line** through **iterative optimization**
- Unlike the closed-form (normal equation) solution for linear regression, it reaches the answer over **multiple iterations**
- Updates weights in the direction that **reduces the loss function**
- Essential for: neural networks, large datasets, complex models

## Step 1: Import Libraries

We need:
- **NumPy** for numerical operations and matrix calculations
- **Matplotlib** for visualization

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Step 2: Create Training Data

We use the **same data as Linear Regression** to compare approaches:
- **180 training points** generated from **x values from -9.5 up to, but not including, 8.5** (step size 0.1)
- **Continuous target values** that follow a linear relationship with added Gaussian noise
- Our goal: find the line that best fits this data

The data follows the relationship: **y = x + 1 + noise** where noise ~ N(0, σ²) with standard deviation σ = 2

In [None]:
# Sample data (matching Linear Regression Code Walk Through)
# Generate X values from -9.5 up to (but not including) 8.5 with step size 0.1
X_train = np.arange(-9.5, 8.5, 0.1)

# Generate y values: y = x + 1 + noise
# where noise follows a normal distribution with mean 0 and std 2
np.random.seed(42)  # For reproducibility
data_points = X_train
y_train = data_points + 1 + np.random.normal(0, 2, len(data_points))

# Reshape X_train to be a column vector for matrix operations
X_train = X_train.reshape(-1, 1)

# Test point
X_test = np.array([5])

print("Training data shape:", X_train.shape)  # (180, 1) = 180 points, 1 feature
print("Target values shape:", y_train.shape)   # (180,) = 180 target values
print("\nFirst few training points:")
print(X_train[:3].ravel())
print("\nCorresponding target values:")
print(y_train[:3])
print(f"\nTarget value range: [{y_train.min():.3f}, {y_train.max():.3f}]")
print(f"\nTest point: x = {X_test[0]}")

## Step 3: Visualize the Data

Let's plot our training data to see the relationship between x and y.

We can see the points roughly follow a **linear trend**, so a straight-line model is a reasonable fit, and gradient descent can find its weights.

In [None]:
# Scatter plot of training data
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train,
           c='lightblue',
           label='Training data')
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Training Data: Looking for Linear Relationship', fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

print(f"We have {len(X_train)} training points")
print(f"Goal: Find the line y = w₁x + w₀ that best fits this data")
print(f"      using iterative gradient descent optimization!")

## Step 4: Add Column of Ones (Bias Term)

Just like in Linear Regression, we need to add a column of 1s to include an **intercept term**.

**Why?**
- Our model is: **y = w₁x + w₀**
- We can rewrite this as: **y = w₀(1) + w₁x**
- In matrix form: **y = [1, x] × [w₀, w₁]ᵀ**

This matrix is called **Φ** (Phi) or the **design matrix**.
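
To make the matrix form concrete, here is a minimal sketch on three made-up inputs. The names `x_demo`, `Phi_demo`, and `w_demo` and the weight values are assumptions chosen for illustration, not part of the notebook's dataset. It checks that multiplying the design matrix by the weight vector gives the same predictions as evaluating w₁x + w₀ point by point.

```python
import numpy as np

# Hypothetical toy inputs and weights, chosen only for illustration
x_demo = np.array([-1.0, 0.0, 2.0])
w0, w1 = 1.0, 3.0  # intercept and slope

# Design matrix: one row [1, x_i] per sample
Phi_demo = np.c_[np.ones(len(x_demo)), x_demo]
w_demo = np.array([w0, w1])

# Matrix form and scalar form of the same model
y_matrix = Phi_demo @ w_demo
y_scalar = w1 * x_demo + w0

print(Phi_demo)
print(y_matrix)                         # values: -2, 1, 7
print(np.allclose(y_matrix, y_scalar))  # True
```

The column of ones is what lets the intercept w₀ be handled by the same matrix product as the slope.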

In [None]:
# Add column of ones using np.c_[]
Phi = np.c_[np.ones(len(X_train)), X_train]

print("Design matrix Φ shape:", Phi.shape)  # (180, 2)
print("\nFirst few rows of Φ:")
print(Phi[:5])
print("\nEach row is now: [1, x]")

## Step 5: Initialize Weights Randomly

Unlike the normal equation, which computes the optimal weights in a single calculation, gradient descent **starts with random weights** and improves them iteratively.

**Key Difference:**
- **Linear Regression (Normal Equation):** w = (ΦᵀΦ)⁻¹Φᵀy → Optimal solution from one direct calculation
- **Gradient Descent:** Start random → Iteratively improve → Converge to optimal

We'll initialize weights with small random values.
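
Because the SSE loss of a linear model is convex, the starting point should only affect how long convergence takes, not where it ends up (given a stable learning rate). The sketch below illustrates this on a hypothetical noise-free toy problem; the function `gradient_descent`, the data, and the values of `alpha` and `steps` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy problem (not the notebook's dataset): y = 2x + 1, no noise
x_demo = np.linspace(-1.0, 1.0, 21)
Phi_demo = np.c_[np.ones(len(x_demo)), x_demo]
y_demo = 2.0 * x_demo + 1.0

def gradient_descent(w_start, alpha=0.01, steps=500):
    """Batch gradient descent on the SSE loss from a given starting point."""
    w = np.array(w_start, dtype=float)
    for _ in range(steps):
        errors = y_demo - Phi_demo @ w
        w = w - alpha * (-2.0 * (Phi_demo.T @ errors))
    return w

# Two very different starting points
w_from_zero = gradient_descent([0.0, 0.0])
w_from_far = gradient_descent([10.0, -10.0])

print(w_from_zero)  # close to [1, 2]
print(w_from_far)   # close to [1, 2]
```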

In [None]:
# Initialize weights randomly
np.random.seed(42)
weights = np.random.randn(2) * 0.01  # Small random values

print("Initial weights [w₀, w₁]:")
print(weights)
print(f"\nInitial model: y = {weights[1]:.6f}x + {weights[0]:.6f}")
print("\nThese are random! Gradient descent will improve them.")

## Step 6: Define Hyperparameters

Gradient descent requires two key hyperparameters:

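1. **Learning Rate (α):** How big a step to take in each iteration
   - Too small → Slow convergence (many iterations needed)
   - Too large → Overshooting, divergence (loss increases!)
   - Typical values: 0.001 to 0.1 for averaged losses on scaled features; an un-averaged SSE over many points, as used here, needs a much smaller α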

2. **Number of Iterations (Epochs):** How many times to update weights
   - Too few → Hasn't converged yet
   - Too many → Wasted computation
   - Monitor loss to know when to stop
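
The effect of the learning rate can be seen on a small example. The sketch below uses a hypothetical toy problem (`x_demo`, `y_demo`, and the helper `run_gd` are assumptions for illustration). For a quadratic loss like SSE, steps are stable only when α is below 2 divided by the largest eigenvalue of the Hessian 2ΦᵀΦ, so the sketch computes that limit and tries one rate far below it, one comfortably inside it, and one just above it.

```python
import numpy as np

# Hypothetical toy problem (not the notebook's dataset): y = 2x + 1, no noise
x_demo = np.linspace(-1.0, 1.0, 21)
Phi_demo = np.c_[np.ones(len(x_demo)), x_demo]
y_demo = 2.0 * x_demo + 1.0

def run_gd(alpha, steps=50):
    """Run batch gradient descent on SSE and return the final loss."""
    w = np.zeros(2)
    for _ in range(steps):
        errors = y_demo - Phi_demo @ w
        w = w - alpha * (-2.0 * (Phi_demo.T @ errors))
    return np.sum((y_demo - Phi_demo @ w) ** 2)

# Stability limit for a quadratic loss: alpha < 2 / lambda_max,
# where lambda_max is the largest eigenvalue of the Hessian 2 * Phi^T Phi
lambda_max = np.linalg.eigvalsh(2.0 * Phi_demo.T @ Phi_demo).max()
alpha_limit = 2.0 / lambda_max

initial_loss = np.sum(y_demo ** 2)  # loss at w = [0, 0]
print(f"initial SSE = {initial_loss:.4f}, alpha limit = {alpha_limit:.5f}")

for alpha in (0.01 * alpha_limit, 0.5 * alpha_limit, 1.1 * alpha_limit):
    print(f"alpha = {alpha:.5f} -> SSE after 50 steps = {run_gd(alpha):.4e}")
```

The smallest rate leaves most of the initial loss in place, the middle one drives the loss to nearly zero, and the rate above the limit ends with a loss larger than where it started.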

In [None]:
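# Hyperparameters
# The SSE gradient sums over all 180 points, so it is large for this data.
# Steps are stable only for α below roughly 2e-4; α = 0.01 makes the loss blow up.
learning_rate = 0.0001  # α (alpha)
num_iterations = 300    # Number of gradient descent steps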

print(f"Learning rate (α): {learning_rate}")
print(f"Number of iterations: {num_iterations}")

# Store loss history for visualization
loss_history = []
weight_history = []

## Step 7: The Gradient Descent Algorithm

### The Core Loop:

```
For each iteration:
    1. Compute predictions: ŷ = Φw
    2. Compute loss (SSE): L = Σ(y - ŷ)²
    3. Compute gradients: ∇w = -2Φᵀ(y - ŷ)
    4. Update weights: w = w - α∇w
```

### Mathematical Derivation:

**Loss Function (Sum of Squared Errors):**
$$L(w) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \|y - \Phi w\|^2$$

**Gradient (derivative of loss with respect to weights):**
$$\nabla_w L = -2 \Phi^T (y - \Phi w)$$

**Weight Update Rule:**
$$w_{\text{new}} = w_{\text{old}} - \alpha \nabla_w L$$
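
One way to gain confidence in the gradient formula is to compare it against a numerical estimate. The sketch below does this with central finite differences on a small random problem; the data, the helper `sse`, and the step `h` are assumptions made for the check, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small problem used only to test the gradient formula
Phi_demo = np.c_[np.ones(5), rng.normal(size=5)]
y_demo = rng.normal(size=5)
w_demo = rng.normal(size=2)

def sse(w):
    """Sum of squared errors L(w) = ||y - Phi w||^2."""
    return np.sum((y_demo - Phi_demo @ w) ** 2)

# Analytic gradient: -2 Phi^T (y - Phi w)
grad_analytic = -2.0 * (Phi_demo.T @ (y_demo - Phi_demo @ w_demo))

# Central finite differences: (L(w + h e_j) - L(w - h e_j)) / (2h)
h = 1e-6
grad_numeric = np.zeros_like(w_demo)
for j in range(len(w_demo)):
    step = np.zeros_like(w_demo)
    step[j] = h
    grad_numeric[j] = (sse(w_demo + step) - sse(w_demo - step)) / (2 * h)

print("analytic:", grad_analytic)
print("numeric: ", grad_numeric)
print("match:", np.allclose(grad_analytic, grad_numeric, atol=1e-5))
```

If the sign or the factor of 2 in the formula were wrong, the two vectors would disagree.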

### Step 7a: One Iteration (Breaking It Down)

Let's manually go through **one iteration** to understand each step.

In [None]:
print("="*60)
print("ITERATION 1: Step-by-Step Breakdown")
print("="*60)

# Current weights (from initialization)
print("\n1. Current weights:")
print(f"   w = {weights}")
print(f"   Model: y = {weights[1]:.6f}x + {weights[0]:.6f}")

In [None]:
print("\n2. Compute predictions: ŷ = Φw")
print("   (Matrix-vector multiplication)")

predictions = Phi @ weights

print(f"   Φ shape: {Phi.shape}")
print(f"   w shape: {weights.shape}")
print(f"   ŷ shape: {predictions.shape}")
print(f"\n   First few predictions: {predictions[:3]}")
print(f"   First few actual:      {y_train[:3]}")

In [None]:
print("\n3. Compute errors (residuals): e = y - ŷ")

errors = y_train - predictions

print(f"   Errors shape: {errors.shape}")
print(f"   First few errors: {errors[:3]}")
print(f"   Mean error: {errors.mean():.6f}")

In [None]:
print("\n4. Compute loss (SSE): L = Σ(y - ŷ)²")

N = len(y_train)
loss = np.sum(errors ** 2)

# Alternative calculation: using @ operator
loss_alternative = errors @ errors

print(f"   N = {N}")
print(f"   Loss (SSE) = {loss:.6f}")
print(f"   Alternative calculation: {loss_alternative:.6f}")

In [None]:
print("\n5. Compute gradients: ∇w = -2Φᵀ(y - ŷ)")
print("   (Direction to move weights to reduce loss)")

gradients = -2 * (Phi.T @ errors)

print(f"   Φᵀ shape: {Phi.T.shape}")
print(f"   errors shape: {errors.shape}")
print(f"   ∇w shape: {gradients.shape}")
print(f"\n   Gradients: {gradients}")
print(f"   ∇w₀ (intercept gradient) = {gradients[0]:.6f}")
print(f"   ∇w₁ (slope gradient) = {gradients[1]:.6f}")
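
The matrix product Φᵀ(y - ŷ) is a compact way of summing each sample's contribution to the gradient. As a minimal sketch on three made-up points (the `_demo` names and values are assumptions for illustration), the loop below accumulates the gradient by hand and compares it with the vectorized expression.

```python
import numpy as np

# Hypothetical toy data (three samples) and weights, for illustration only
x_demo = np.array([0.0, 1.0, 2.0])
y_demo = np.array([1.0, 3.0, 2.0])
w_demo = np.array([0.5, 0.5])  # [w0, w1]

# Manual version: add up each sample's contribution to the gradient
grad_manual = np.zeros(2)
for x_i, y_i in zip(x_demo, y_demo):
    error_i = y_i - (w_demo[0] + w_demo[1] * x_i)
    grad_manual[0] += -2.0 * error_i        # dL/dw0
    grad_manual[1] += -2.0 * error_i * x_i  # dL/dw1

# Vectorized version: the same sums expressed as one matrix product
Phi_demo = np.c_[np.ones(len(x_demo)), x_demo]
grad_vectorized = -2.0 * (Phi_demo.T @ (y_demo - Phi_demo @ w_demo))

print(grad_manual)      # both components equal -6
print(grad_vectorized)  # same values as the manual loop
```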

In [None]:
print("\n6. Update weights: w_new = w_old - α∇w")
print(f"   Learning rate α = {learning_rate}")

weights_old = weights.copy()
weights_new = weights_old - learning_rate * gradients

print(f"\n   Old weights: {weights_old}")
print(f"   Update (α∇w): {learning_rate * gradients}")
print(f"   New weights: {weights_new}")
print(f"\n   Old model: y = {weights_old[1]:.6f}x + {weights_old[0]:.6f}")
print(f"   New model: y = {weights_new[1]:.6f}x + {weights_new[0]:.6f}")

# Update weights
weights = weights_new

### Step 7b: Full Gradient Descent Loop

Now let's run the complete algorithm for all iterations and track the loss.

In [None]:
# Reset weights to initial random values
np.random.seed(42)
weights = np.random.randn(2) * 0.01

print("Starting Gradient Descent...")
print(f"Initial weights: {weights}")
print(f"Initial model: y = {weights[1]:.6f}x + {weights[0]:.6f}")
print("\n" + "="*60)

# Gradient Descent Loop
loss_history = []
weight_history = []

for iteration in range(num_iterations):
    # 1. Compute predictions
    predictions = Phi @ weights
    
    # 2. Compute errors
    errors = y_train - predictions
    
    # 3. Compute loss (SSE)
    loss = np.sum(errors ** 2)
    
    # 4. Compute gradients
    gradients = -2 * (Phi.T @ errors)
    
    # 5. Update weights
    weights = weights - learning_rate * gradients
    
    # Store history
    loss_history.append(loss)
    weight_history.append(weights.copy())
    
    # Print progress every 10 iterations
    if (iteration + 1) % 10 == 0 or iteration == 0:
        print(f"Iteration {iteration+1:3d}: Loss = {loss:.6f}, w₀ = {weights[0]:7.4f}, w₁ = {weights[1]:7.4f}")

print("="*60)
print("\nGradient Descent Complete!")
print(f"\nFinal weights: {weights}")
print(f"Final model: y = {weights[1]:.6f}x + {weights[0]:.6f}")
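# loss_history[-1] was recorded before the last update, so recompute at the final weights
print(f"Final loss (SSE): {np.sum((y_train - Phi @ weights) ** 2):.6f}")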

## Step 8: Visualize Convergence

Let's plot how the loss decreases over iterations. This shows that gradient descent is **working** - the model is learning!

In [None]:
# Plot loss over iterations
plt.figure(figsize=(10, 6))
plt.plot(range(1, num_iterations + 1), loss_history, 'b-', linewidth=2)
plt.xlabel('Iteration', fontsize=14)
plt.ylabel('Loss (SSE)', fontsize=14)
plt.title('Gradient Descent Convergence', fontsize=16)
plt.grid(True, alpha=0.3)
plt.show()

print(f"Initial loss: {loss_history[0]:.6f}")
print(f"Final loss:   {loss_history[-1]:.6f}")
print(f"Loss reduction: {loss_history[0] - loss_history[-1]:.6f} ({(1 - loss_history[-1]/loss_history[0])*100:.2f}% improvement)")

## Step 9: Visualize Weight Evolution

Let's see how the weights changed over iterations.

In [None]:
# Extract weight histories
weight_history = np.array(weight_history)
w0_history = weight_history[:, 0]
w1_history = weight_history[:, 1]

# Plot
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot w0 (intercept)
axes[0].plot(range(1, num_iterations + 1), w0_history, 'r-', linewidth=2)
axes[0].axhline(y=weights[0], color='k', linestyle='--', alpha=0.5, label=f'Final: {weights[0]:.4f}')
axes[0].set_xlabel('Iteration', fontsize=12)
axes[0].set_ylabel('w₀ (Intercept)', fontsize=12)
axes[0].set_title('Intercept Evolution', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot w1 (slope)
axes[1].plot(range(1, num_iterations + 1), w1_history, 'g-', linewidth=2)
axes[1].axhline(y=weights[1], color='k', linestyle='--', alpha=0.5, label=f'Final: {weights[1]:.4f}')
axes[1].set_xlabel('Iteration', fontsize=12)
axes[1].set_ylabel('w₁ (Slope)', fontsize=12)
axes[1].set_title('Slope Evolution', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

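# weight_history stores the weights after each update, so index 0 is after the first step
print(f"w₀ moved from {w0_history[0]:.6f} (after the first update) to {w0_history[-1]:.6f}")
print(f"w₁ moved from {w1_history[0]:.6f} (after the first update) to {w1_history[-1]:.6f}")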

## Step 10: Visualize the Final Fit

Let's see how well our gradient descent model fits the data!

In [None]:
# Create x values for plotting the line
x_line = np.linspace(X_train.min(), X_train.max(), 100)

# Compute y values using learned weights: y = w₁x + w₀
y_line = weights[1] * x_line + weights[0]

# Plot
plt.figure(figsize=(10, 6))

# Training data
plt.scatter(X_train, y_train,
           c='lightblue', alpha=0.6,
           edgecolors='black', linewidths=0.5,
           label='Training data', zorder=3)

# Best fit line from gradient descent
plt.plot(x_line, y_line,
        'r-', linewidth=3,
        label=f'Gradient Descent: y = {weights[1]:.3f}x + {weights[0]:.3f}')

plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Gradient Descent: Final Best Fit Line', fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

print(f"Model learned by Gradient Descent: y = {weights[1]:.3f}x + {weights[0]:.3f}")

## Step 11: Make Predictions on Test Data

Now let's use our trained model to predict a new value.

**Example:** What is the predicted y value when x = 5?

In [None]:
# Test point (defined earlier: x = 5)
X_test_reshaped = X_test.reshape(-1, 1)

# Add bias term
X_test_with_bias = np.c_[np.ones(len(X_test_reshaped)), X_test_reshaped]

# Compute prediction
prediction = X_test_with_bias @ weights

print(f"For x = {X_test[0]:.1f}:")
print(f"Predicted y = {prediction[0]:.3f}")
print(f"Calculation: y = {weights[1]:.3f} × {X_test[0]:.1f} + {weights[0]:.3f} = {prediction[0]:.3f}")

In [None]:
# Visualize the prediction
plt.figure(figsize=(10, 6))

# Training data
plt.scatter(X_train, y_train,
           c='lightblue',
           label='Training data', zorder=3)

# Best fit line
plt.plot(x_line, y_line,
        'k-', linewidth=2, alpha=0.8,
        label=f'Best fit line: y = {weights[1]:.3f}x + {weights[0]:.3f}')

# Test point and prediction
plt.scatter(X_test_reshaped, prediction,
           c='red', s=200, marker='*',
           edgecolors='black', linewidths=2,
           label=f'Prediction: x={X_test_reshaped[0,0]:.1f}, ŷ={prediction[0]:.3f}',
           zorder=4)

# Draw dashed lines to show prediction
plt.plot([X_train.min(), X_test_reshaped[0,0]], [prediction[0], prediction[0]], 'r--', alpha=0.5, linewidth=1)
plt.plot([X_test_reshaped[0,0], X_test_reshaped[0,0]], [y_train.min(), prediction[0]], 'r--', alpha=0.5, linewidth=1)

plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Gradient Descent: Making a Prediction', fontsize=16)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

print(f"\nFor test point x = {X_test_reshaped[0,0]:.1f}:")
print(f"Predicted value ŷ = {prediction[0]:.3f}")

## Step 12: Compare with Closed-Form Solution (Normal Equation)

Let's verify our gradient descent solution by comparing with Linear Regression's closed-form solution.

They should give **very similar** (if not identical) results!
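
The normal equation can be evaluated in several equivalent ways. An explicit inverse mirrors the formula most directly, while solving the linear system or calling a least-squares routine is usually preferred numerically. A minimal sketch on hypothetical toy data (the `_demo` names and the noise level are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 3x - 2 plus small Gaussian noise
x_demo = np.linspace(0.0, 1.0, 50)
y_demo = 3.0 * x_demo - 2.0 + rng.normal(0.0, 0.1, size=x_demo.shape)
Phi_demo = np.c_[np.ones(len(x_demo)), x_demo]

# Normal equation with an explicit inverse
w_inv = np.linalg.inv(Phi_demo.T @ Phi_demo) @ Phi_demo.T @ y_demo

# Same normal equation, solved as a linear system (no explicit inverse)
w_solve = np.linalg.solve(Phi_demo.T @ Phi_demo, Phi_demo.T @ y_demo)

# Least-squares solver applied directly to the design matrix
w_lstsq, *_ = np.linalg.lstsq(Phi_demo, y_demo, rcond=None)

print("inv:  ", w_inv)
print("solve:", w_solve)
print("lstsq:", w_lstsq)
```

All three should agree to within floating-point error on a well-conditioned problem like this one.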

In [None]:
# Compute weights using normal equation: w = (Φ^T Φ)^{-1} Φ^T y
weights_closed_form = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ y_train

print("="*70)
print("COMPARISON: Gradient Descent vs Closed-Form Solution")
print("="*70)

print(f"\nGradient Descent weights: {weights}")
print(f"Closed-Form weights:      {weights_closed_form}")

print(f"\nGradient Descent: y = {weights[1]:.6f}x + {weights[0]:.6f}")
print(f"Closed-Form:      y = {weights_closed_form[1]:.6f}x + {weights_closed_form[0]:.6f}")

print(f"\nDifference in w₀ (intercept): {abs(weights[0] - weights_closed_form[0]):.6f}")
print(f"\nDifference in w₁ (slope):     {abs(weights[1] - weights_closed_form[1]):.6f}")

# Compute loss for closed-form solution
predictions_closed_form = Phi @ weights_closed_form
loss_closed_form = np.sum((y_train - predictions_closed_form) ** 2)

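# Evaluate the gradient descent loss at the final weights for a like-for-like comparison
print(f"\nGradient Descent loss: {np.sum((y_train - Phi @ weights) ** 2):.6f}")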
print(f"Closed-Form loss:      {loss_closed_form:.6f}")

print("\n" + "="*70)
if abs(weights[0] - weights_closed_form[0]) < 0.01 and abs(weights[1] - weights_closed_form[1]) < 0.01:
    print("✓ Results match! Gradient descent converged to the optimal solution.")
else:
    print("⚠ Results differ. Try more iterations or a different learning rate.")
print("="*70)

## Step 13: Visualize Both Solutions Together

Let's plot both lines to see if they overlap (they should!).

In [None]:
# Compute y values for both models
y_line_gd = weights[1] * x_line + weights[0]
y_line_cf = weights_closed_form[1] * x_line + weights_closed_form[0]

# Plot
plt.figure(figsize=(10, 6))

# Training data
plt.scatter(X_train, y_train,
           c='lightblue', alpha=0.6,
           edgecolors='black', linewidths=0.5,
           label='Training data', zorder=3)

# Gradient descent line
plt.plot(x_line, y_line_gd,
        'r-', linewidth=3, alpha=0.7,
        label=f'Gradient Descent: y = {weights[1]:.3f}x + {weights[0]:.3f}')

# Closed-form line (should overlap with GD)
plt.plot(x_line, y_line_cf,
        'g--', linewidth=3, alpha=0.7,
        label=f'Closed-Form: y = {weights_closed_form[1]:.3f}x + {weights_closed_form[0]:.3f}')

plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Gradient Descent vs Closed-Form Solution', fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

print("The two lines should overlap almost perfectly!")
print("This confirms gradient descent found the optimal solution.")

## Summary

We've walked through all the computational steps of Gradient Descent:

1. ✅ **Visualized data** - saw training points showing linear trend
2. ✅ **Added bias term** - transformed data by adding column of 1s
3. ✅ **Initialized weights randomly** - started with small random values
4. ✅ **Iterative optimization** - repeated for multiple iterations:
   - Computed predictions: ŷ = Φw
   - Computed loss (SSE): L = Σ(y - ŷ)²
   - Computed gradients: ∇w = -2Φᵀ(y - ŷ)
   - Updated weights: w = w - α∇w
5. ✅ **Visualized convergence** - watched loss decrease over time
6. ✅ **Made predictions** - computed ŷ for new test points
7. ✅ **Compared with closed-form** - verified we reached optimal solution

### Key Gradient Descent Concepts:

| Concept | Description |
|---------|-------------|
| **Learning Rate (α)** | Step size for weight updates (often 0.001 to 0.1, but the stable range depends on data scale and on whether the loss is summed or averaged) |
| **Iterations/Epochs** | Number of times to update weights |
| **Loss Function** | SSE = Σ(y - ŷ)² measures prediction error |
| **Gradient (∇w)** | Direction to move weights to reduce loss |
| **Update Rule** | w_new = w_old - α∇w |
| **Convergence** | Loss stops decreasing significantly |
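
The convergence criterion in the table can be turned into a stopping rule instead of a fixed iteration count. The sketch below is one possible version on a hypothetical toy problem: it stops once the loss improves by less than a tolerance. The data and the values of `alpha`, `tolerance`, and `max_iterations` are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical toy problem: y = 2x + 1 with no noise
x_demo = np.linspace(-1.0, 1.0, 21)
Phi_demo = np.c_[np.ones(len(x_demo)), x_demo]
y_demo = 2.0 * x_demo + 1.0

alpha = 0.01           # assumed learning rate, stable for this toy problem
tolerance = 1e-10      # assumed threshold on the loss improvement
max_iterations = 10_000

w = np.zeros(2)
previous_loss = np.inf
for iteration in range(1, max_iterations + 1):
    errors = y_demo - Phi_demo @ w
    loss = np.sum(errors ** 2)
    # Stop once the loss improves by less than the tolerance
    if previous_loss - loss < tolerance:
        break
    w = w - alpha * (-2.0 * (Phi_demo.T @ errors))
    previous_loss = loss

print(f"stopped after {iteration} iterations, w = {w}, SSE = {loss:.3e}")
```

Other criteria, such as a threshold on the gradient norm, would work along the same lines.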

### Key NumPy Operations Used:

- **`@`** - matrix/vector multiplication (Φw, Φᵀe)
- **`.T`** - transpose matrix (Φ → Φᵀ)
- **`np.sum()`** - sum elements for loss calculation
- **Broadcasting** - operations on arrays of different shapes
- **`.copy()`** - copy arrays to preserve history

### Gradient Descent vs Normal Equation:

| Aspect | Gradient Descent | Normal Equation |
|--------|-----------------|----------------|
| **Computation** | Iterative (multiple steps) | Direct (one calculation) |
| **Speed** | Slower for small data | Fast for small data |
| **Large Data** | Scales well (N > 100,000) | Costly when the number of features d is large (forming ΦᵀΦ is O(Nd²), solving is O(d³)) |
| **Hyperparameters** | Needs α and iterations | None needed |
| **Convergence** | Approximate (can stop early) | Exact optimal solution |
| **Use Case** | Neural networks, large data | Small-medium regression |
| **Variants** | Batch, Stochastic, Mini-batch | Only one version |
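
The variants in the last row differ only in how many samples feed each update: all of them (batch), one (stochastic), or a small subset (mini-batch). As a rough sketch of the mini-batch case on hypothetical toy data (the data, `alpha`, `batch_size`, and `num_epochs` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 2x + 1 plus small Gaussian noise
x_demo = rng.uniform(-1.0, 1.0, size=200)
y_demo = 2.0 * x_demo + 1.0 + rng.normal(0.0, 0.1, size=200)
Phi_demo = np.c_[np.ones(len(x_demo)), x_demo]

alpha = 0.01     # assumed learning rate
batch_size = 20  # assumed batch size; 1 would give stochastic gradient descent
num_epochs = 50

w = np.zeros(2)
for epoch in range(num_epochs):
    order = rng.permutation(len(y_demo))  # reshuffle the samples every epoch
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        errors = y_demo[batch] - Phi_demo[batch] @ w
        # SSE gradient computed on the current batch only
        w = w - alpha * (-2.0 * (Phi_demo[batch].T @ errors))

print(w)  # close to [1, 2], up to noise
```

Setting `batch_size` to the full dataset size recovers the batch gradient descent used in this notebook.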

### When to Use Gradient Descent:

✅ **Use Gradient Descent when:**
- Training neural networks (no closed-form solution exists)
- Dataset is very large (N > 100,000 samples)
- Online learning (updating model as new data arrives)
- Need stochastic/mini-batch variants for efficiency
- Model is non-linear in its parameters (gradient descent still applies; the normal equation does not)

❌ **Use Normal Equation when:**
- Simple linear regression with small-medium data
- Want exact solution without tuning hyperparameters
- Number of features is small (d < 10,000)
- Don't need iterative updates