<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Gradient%20Descent/Gradient%20Descent%20Hands-On%20Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gradient Descent Hands-On Lab

In this lab, you will implement Gradient Descent optimization from scratch, understand the mathematics behind it, and apply it to real data. Along the way, you'll answer conceptual questions and create visualizations to deepen your understanding.

**Learning Objectives:**
- Understand the mathematics of Gradient Descent optimization
- Implement a custom Gradient Descent class from scratch
- Visualize convergence and loss curves
- Understand the impact of learning rate on convergence
- Apply feature scaling and understand why it's critical for gradient descent
- Compare gradient descent with closed-form solutions
- Understand batch, stochastic, and mini-batch variants
- Analyze model performance and convergence behavior

## Overview of Gradient Descent

Gradient Descent is an **iterative optimization algorithm** used to find the minimum of a function. In machine learning, we use it to **minimize the loss function** and find optimal model parameters.

**Key Idea:**
- Start with **random weights**
- Iteratively **update weights** in the direction that reduces loss
- Take steps proportional to the **negative gradient** of the loss function
- Continue until **convergence** (loss stops decreasing)

**The Update Rule:**
$$w_{\text{new}} = w_{\text{old}} - \alpha \nabla_w L$$

Where:
- **w** are the model weights (parameters)
- **α** is the learning rate (step size)
- **∇w L** is the gradient of the loss with respect to weights
- **L** is the loss function (e.g., Sum of Squared Errors)

**For Linear Regression with SSE Loss:**
- Loss: $L(w) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
- Gradient: $\nabla_w L = -2 \Phi^T (y - \Phi w)$
- Update: $w = w - \alpha \nabla_w L$
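- Expanded form: $w = w - \alpha(-2\Phi^T(y - \Phi w)) = w + 2\alpha \Phi^T (y - \Phi w)$
- Note: the pseudocode and class later in this lab use the mean squared error (SSE divided by $N$) instead, which scales the gradient by $1/N$ but leaves the minimizing weights unchanged.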
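
A quick way to build trust in the gradient formula above is to compare it against finite differences. The sketch below does this on a small random problem; the data, sizes, and variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic problem (values are arbitrary, for illustration only)
N, d = 20, 3
X = rng.normal(size=(N, d))
Phi = np.c_[np.ones(N), X]          # design matrix with a bias column
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

def sse(w):
    """Sum of squared errors L(w) = sum_i (y_i - phi_i . w)^2."""
    r = y - Phi @ w
    return r @ r

# Analytic gradient from the formula above
grad_analytic = -2 * Phi.T @ (y - Phi @ w)

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    grad_numeric[j] = (sse(w + step) - sse(w - step)) / (2 * eps)

print("max abs difference:", np.max(np.abs(grad_analytic - grad_numeric)))
```

If the two gradients disagree by more than rounding error, the analytic formula (or its implementation) has a sign or scaling mistake.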

**Advantages:**
- Works when no closed-form solution exists (e.g., neural networks)
- Scales well to large datasets
- Can be adapted to stochastic/mini-batch variants for efficiency
- Foundation for deep learning optimization

**Disadvantages:**
- Requires tuning hyperparameters (learning rate, iterations)
- Can be slow to converge
- May get stuck in local minima (for non-convex functions)
- Sensitive to feature scaling

> **Question**: Gradient Descent finds optimal weights by:
>
> A. Computing the exact optimal solution directly using matrix inversion like the normal equation
>
> B. Iteratively updating weights in the direction that reduces the loss function using gradients
>
> C. Randomly trying different weight combinations until finding the best performing set
>
> D. Using K-nearest neighbors to estimate optimal weight values from training data

<details><summary>Click to reveal answer</summary>

**Correct Answer: B**

**Explanation:**
- **A is FALSE**: This describes the normal equation for Linear Regression: w = (ΦᵀΦ)⁻¹Φᵀy. The normal equation computes the exact optimal solution in one step, without any iteration. Gradient descent, in contrast, is an iterative method that gradually approaches the optimal solution.
- **B is TRUE**: Gradient descent computes the gradient ∇w L (the direction of steepest ascent of the loss function) and updates weights in the opposite direction (steepest descent): w_new = w_old - α∇w L. By repeatedly taking small steps downhill, it converges to a local minimum (global for convex functions like SSE).
- **C is FALSE**: Gradient descent is not random! It uses calculus (derivatives/gradients) to determine the exact direction to move weights. Random search would be extremely inefficient and wouldn't scale to high-dimensional problems.
- **D is FALSE**: KNN is an instance-based learning algorithm for prediction, not an optimization method for finding weights. Gradient descent does not look up neighbors; it uses the training data only through the gradient of the loss function.

**Key Insight**: Gradient descent is a **deterministic, iterative, gradient-based optimization** method. It uses calculus to find the direction that most quickly reduces loss, then takes small steps in that direction.

</details>

## Gradient Descent vs Normal Equation (Closed-Form Solution)

For Linear Regression, we have **two ways** to find optimal weights:

### 1. Normal Equation (Closed-Form)
$$w = (\Phi^T \Phi)^{-1} \Phi^T y$$

**Pros:**
- ✅ Exact optimal solution in one calculation
- ✅ No hyperparameters to tune
- ✅ No iterations needed

**Cons:**
- ❌ Requires matrix inversion: O(d³) complexity (slow for many features)
- ❌ Doesn't scale to very large datasets (memory intensive)
- ❌ Only works for problems with closed-form solutions
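
As a minimal sketch of the closed-form route, the normal equation can be solved in a few lines of NumPy. The synthetic data and the names `Phi_ne`, `w_solve`, and `w_lstsq` are assumptions for illustration. Note that the code solves the linear system instead of forming the inverse explicitly, which is the usual numerical practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 1 + 2x + noise (coefficients chosen for illustration)
N = 200
x = rng.uniform(-3, 3, size=N)
y_ne = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=N)

Phi_ne = np.c_[np.ones(N), x]   # design matrix [1, x]

# Normal equation, solved as a linear system rather than via an explicit inverse
w_solve = np.linalg.solve(Phi_ne.T @ Phi_ne, Phi_ne.T @ y_ne)

# Least-squares solver, more robust when Phi^T Phi is ill-conditioned
w_lstsq, *_ = np.linalg.lstsq(Phi_ne, y_ne, rcond=None)

print("solve:", w_solve)
print("lstsq:", w_lstsq)
```

Both calls should recover an intercept near 1 and a slope near 2, with no learning rate or iteration count to tune.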

### 2. Gradient Descent (Iterative)
$$w = w - \alpha \nabla_w L$$

**Pros:**
- ✅ Scales well to large datasets (especially mini-batch/stochastic variants)
- ✅ Works for any differentiable loss function
- ✅ Foundation for neural networks and deep learning
- ✅ Can stop early if convergence is good enough

**Cons:**
- ❌ Requires tuning learning rate and iterations
- ❌ Slower convergence (multiple iterations)
- ❌ Very sensitive to feature scaling

### When to Use Each:

| Scenario | Best Choice |
|----------|-------------|
| Small dataset (N < 10,000), few features (d < 1,000) | Normal Equation |
| Large dataset (N > 100,000) | Gradient Descent (Mini-batch) |
| Many features (d > 10,000) | Gradient Descent |
| Neural networks, non-linear models | Gradient Descent (only option) |
| Need exact optimal solution | Normal Equation |
| Online learning (streaming data) | Stochastic Gradient Descent |

## The Learning Rate: Critical Hyperparameter

The **learning rate (α)** controls how big of a step we take in each iteration.

### Impact of Different Learning Rates:

**α too small (e.g., 0.0001):**
- ✅ Stable convergence (doesn't overshoot)
- ❌ Very slow (needs many iterations)
- ❌ May get stuck in plateaus

**α well-chosen (often 0.01-0.1 when features are standardized):**
- ✅ Fast convergence
- ✅ Reaches minimum efficiently
- ✅ Smooth loss curve

**α too large (e.g., 1.0+ for standardized features, much smaller for unscaled ones):**
- ❌ Overshoots minimum
- ❌ Loss oscillates or increases
- ❌ May diverge (loss → ∞)

We'll visualize these effects later in the lab!
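
The three regimes are easy to see on a toy problem before touching real data. The sketch below minimizes $f(w) = w^2$, whose gradient is $2w$, so each update multiplies $w$ by $(1 - 2\alpha)$. The function name `run_gd` and the specific learning rates are illustrative assumptions; the thresholds for this toy function do not carry over to other problems.

```python
def run_gd(alpha, w0=1.0, n_steps=20):
    """Minimize f(w) = w**2 (gradient 2w) with a fixed learning rate."""
    w = w0
    for _ in range(n_steps):
        w = w - alpha * 2 * w   # same as w *= (1 - 2 * alpha)
    return w

for alpha in (0.01, 0.4, 1.1):
    print(f"alpha={alpha}: w after 20 steps = {run_gd(alpha):.4g}")
```

With α = 0.01 the iterate shrinks slowly, with α = 0.4 it reaches the minimum at 0 almost immediately, and with α = 1.1 the factor $|1 - 2\alpha|$ exceeds 1, so $|w|$ grows at every step.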

> **Question**: You're training a model with gradient descent and observe that the loss is increasing rather than decreasing over iterations. What is the MOST likely cause?
>
> A. The model is underfitting because it's too simple to capture the data patterns
>
> B. The learning rate is too large, causing the optimizer to overshoot the minimum
>
> C. The features need to be standardized using z-score normalization
>
> D. The number of iterations is too small and convergence hasn't been reached yet

<details><summary>Click to reveal answer</summary>

**Correct Answer: B**

**Explanation:**
- **A is FALSE**: Underfitting means the model can't fit the training data well, which would result in HIGH but STABLE loss. The loss would remain consistently high across iterations, not increase. If loss is increasing, the optimization process itself is failing, not the model capacity.
- **B is TRUE**: When the learning rate α is too large, the weight update w_new = w_old - α∇w can overshoot the minimum. Instead of moving toward the optimal point, it jumps past it to a worse position with higher loss. In extreme cases, this causes divergence where loss → ∞. The classic symptom of too-large learning rate is loss increasing or wildly oscillating.
- **C is FALSE**: Feature scaling IS very important for gradient descent, but missing standardization most often shows up as SLOW or zigzagging convergence. Unscaled features make the loss surface elongated, which narrows the range of learning rates that work. If the loss does grow on unscaled data, the direct cause is still a step size that is too large for that surface, so B remains the most likely answer.
- **D is FALSE**: Too few iterations would mean you haven't reached the minimum yet, so loss would still be HIGH but DECREASING. If loss is increasing, stopping earlier wouldn't help - the problem is the update direction or step size is wrong.

**Key Insight**: Increasing loss during training almost always indicates α is too large. Solution: reduce learning rate by 10× (e.g., 0.1 → 0.01).

</details>

## Feature Scaling: Critical for Gradient Descent!

Feature scaling is optional for Linear Regression's normal equation (it mainly improves numerical conditioning), but it's **ESSENTIAL** for gradient descent.

**Why is scaling so important for gradient descent?**

1. **Convergence Speed:** Unscaled features create elongated loss surfaces
   - Gradient descent zigzags instead of going straight to minimum
   - Can be 100× slower or more!
   
2. **Learning Rate Sensitivity:** Different features need different learning rates
   - Small-scale features (0-1) might need α = 0.1
   - Large-scale features (0-10000) might need α = 0.00001
   - With one global α, impossible to optimize all features well
   
3. **Numerical Stability:** Large feature values can cause gradient explosion
   - Gradients become huge → weights explode → overflow errors
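
The "elongated loss surface" can be quantified with the condition number of the MSE Hessian, $(2/N)\Phi^T\Phi$: the ratio of its largest to smallest eigenvalue. The sketch below compares two hypothetical features on very different scales before and after standardization. The feature ranges and the helper name `hessian_condition_number` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales
N = 500
small = rng.uniform(0, 1, size=N)         # e.g., a ratio in [0, 1]
large = rng.uniform(0, 10_000, size=N)    # e.g., a price in [0, 10000]
X_raw = np.c_[small, large]

def hessian_condition_number(X):
    """Condition number of (2/N) Phi^T Phi, the Hessian of the MSE loss."""
    Phi = np.c_[np.ones(len(X)), X]
    H = (2 / len(X)) * Phi.T @ Phi
    eigvals = np.linalg.eigvalsh(H)       # ascending order; H is symmetric
    return eigvals[-1] / eigvals[0]

# Z-score standardization of each column
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

cond_raw = hessian_condition_number(X_raw)
cond_std = hessian_condition_number(X_std)
print(f"condition number, raw features:          {cond_raw:.3g}")
print(f"condition number, standardized features: {cond_std:.3g}")
```

A condition number close to 1 means the loss surface is nearly round, so a single learning rate suits every direction. A very large one means the stable learning rate is dictated by the steepest direction while the flattest direction barely moves.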

**Solution: Z-Score Standardization**
$$z = \frac{x - \mu}{\sigma}$$

This transforms all features to:
- Mean = 0
- Standard deviation = 1
- Similar scales → uniform convergence

**Critical Rule:** Fit scaler on training data ONLY!
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)  # Learn μ and σ from training data
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)    # Use same μ and σ
X_test_scaled = scaler.transform(X_test)  # Use same μ and σ
```
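
To make the rule concrete, here is the same fit-then-transform pattern written out by hand in NumPy. It is a sketch on made-up data: the array names `X_tr` and `X_va` and the 80/20 split are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature matrix, split into training and validation parts
X_all = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
X_tr, X_va = X_all[:80], X_all[80:]

# "fit": learn the statistics from the training split only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)      # population std (ddof=0), as StandardScaler uses

# "transform": apply the same mu and sigma to every split
X_tr_scaled = (X_tr - mu) / sigma
X_va_scaled = (X_va - mu) / sigma

print("train mean/std:", X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
print("val   mean/std:", X_va_scaled.mean(axis=0), X_va_scaled.std(axis=0))
```

The training split ends up with mean 0 and standard deviation 1 by construction. The validation split is only close to those values, which is expected: its own statistics were never used.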

## Pseudocode for Gradient Descent

### Formal Pseudocode

```
============================================
Inputs
============================================
X       ← training features (N × d matrix)
y       ← training targets (N × 1 vector)
α       ← learning rate (e.g., 0.01)
max_iter ← maximum iterations (e.g., 1000)
tol     ← convergence tolerance (e.g., 1e-6)

============================================
----- fit -----
============================================
1. Add bias column: Φ ← [1, X]  # (N × (d+1))
2. Initialize weights randomly: w ← random small values
3. For iteration = 1 to max_iter:
     a. Compute predictions: ŷ ← Φw
     b. Compute errors: e ← y - ŷ
     c. Compute loss: L ← (1/N) Σ e²
     d. Compute gradients: ∇w ← -(2/N) Φᵀe
     e. Update weights: w ← w - α∇w
     f. If |L_new - L_old| < tol: STOP (converged)
4. Store final weights w

============================================
----- predict -----
============================================
For each query point in X_query:
1. Add bias: Φ_query ← [1, X_query]
2. Compute prediction: ŷ ← Φ_query · w
3. Return ŷ
```
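
The pseudocode above translates almost line for line into NumPy. The following is a minimal sketch rather than the lab's reference solution: the function names `gd_fit` and `gd_predict` and the demo values are assumptions made for illustration, and the convergence test runs just before the update instead of just after it.

```python
import numpy as np

def gd_fit(X, y, alpha=0.01, max_iter=1000, tol=1e-6, seed=42):
    """Batch gradient descent for linear regression with MSE loss."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(y)
    Phi = np.c_[np.ones(N), X]                      # step 1: add bias column
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=Phi.shape[1])   # step 2: small random weights
    history = []
    for _ in range(max_iter):                       # step 3
        errors = y - Phi @ w                        # 3a-3b: predictions and errors
        loss = np.mean(errors ** 2)                 # 3c: MSE
        history.append(loss)
        if len(history) > 1 and abs(history[-2] - loss) < tol:
            break                                   # 3f: converged
        grad = -(2 / N) * Phi.T @ errors            # 3d: gradient
        w = w - alpha * grad                        # 3e: update
    return w, history

def gd_predict(X, w):
    """Add the bias column and apply the learned weights."""
    X = np.asarray(X, dtype=float)
    return np.c_[np.ones(len(X)), X] @ w

# Demo on noiseless data y = 2x (learning rate and tolerance chosen for this toy case)
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = 2 * X_demo.ravel()
w_demo, hist_demo = gd_fit(X_demo, y_demo, alpha=0.05, max_iter=5000, tol=1e-12)
print("weights:", w_demo)
print("iterations:", len(hist_demo))
```

On this toy data the weights should approach an intercept of 0 and a slope of 2.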

### Key Observations
- **Iterative process:** Weights improve gradually over multiple iterations
- **Convergence check:** Stop when loss stops decreasing significantly
- **Prediction:** Same as Linear Regression (just matrix multiplication)
- **Memory efficient:** The fitted model stores only the weights, not the training data

## Implementing a Custom Gradient Descent Class

Below is a scaffold of the `MyGradientDescentRegressor` class. Fill in the TODO sections to complete the implementation:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MyGradientDescentRegressor(BaseEstimator, RegressorMixin):
    """
    Custom Gradient Descent implementation for Linear Regression.

    Parameters:
    -----------
    learning_rate : float, default=0.01
        Learning rate (α) for gradient descent updates
    max_iter : int, default=1000
        Maximum number of iterations
    tol : float, default=1e-6
        Tolerance for convergence (stop if loss change < tol)
    random_state : int, default=42
        Random seed for weight initialization

    Attributes:
    -----------
    weights_ : array of shape (n_features + 1,)
        Learned weights including bias term
    loss_history_ : list
        Loss value at each iteration
    n_iter_ : int
        Actual number of iterations performed
    """

    def __init__(self, learning_rate=0.01, max_iter=1000, tol=1e-6, random_state=42):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.tol = tol
        self.random_state = random_state

    def fit(self, X, y):
        """
        Fit the model using gradient descent.

        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            Training data
        y : array-like of shape (n_samples,)
            Target values

        Returns:
        --------
        self
        """
        # TODO: Add column of ones for bias term
        # Hint: Use np.c_[np.ones(len(X)), X] to create design matrix Phi
        Phi = None  # Replace with your code

        # TODO: Initialize weights randomly with small values
        # Hint: np.random.seed(self.random_state)
        #       self.weights_ = np.random.randn(Phi.shape[1]) * 0.01
        np.random.seed(self.random_state)
        self.weights_ = None  # Replace with your code

        # Initialize loss history
        self.loss_history_ = []
        N = len(y)

        # Gradient Descent Loop
        for iteration in range(self.max_iter):
            # TODO: Compute predictions
            # Hint: predictions = Phi @ self.weights_
            predictions = None  # Replace with your code

            # TODO: Compute errors (residuals)
            # Hint: errors = y - predictions
            errors = None  # Replace with your code

            # TODO: Compute loss (MSE)
            # Hint: loss = (1/N) * np.sum(errors**2)
            loss = None  # Replace with your code

            # Store loss
            self.loss_history_.append(loss)

            # Check convergence
            if iteration > 0 and abs(self.loss_history_[-2] - self.loss_history_[-1]) < self.tol:
                self.n_iter_ = iteration + 1
                break

            # TODO: Compute gradients
            # Hint: gradients = -(2/N) * (Phi.T @ errors)
            gradients = None  # Replace with your code

            # TODO: Update weights
            # Hint: self.weights_ = self.weights_ - self.learning_rate * gradients
            # Replace this line with your code
            pass
        else:
            self.n_iter_ = self.max_iter

        return self

    def predict(self, X):
        """
        Predict using the learned model.

        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            Samples to predict

        Returns:
        --------
        y_pred : array of shape (n_samples,)
            Predicted values
        """
        # TODO: Add column of ones for bias term
        Phi = None  # Replace with your code

        # TODO: Compute predictions
        y_pred = None  # Replace with your code

        return y_pred
```

### Test Your Implementation

Once you have filled in the implementation, let's test our custom gradient descent regressor on a simple dataset.

```python
# Create simple test data
np.random.seed(42)
X_simple = np.array([[1], [2], [3], [4], [5]])
y_simple = np.array([2, 4, 6, 8, 10])  # Perfect linear relationship: y = 2x

# Fit model
model = MyGradientDescentRegressor(learning_rate=0.01, max_iter=1000)
model.fit(X_simple, y_simple)

# Make predictions
predictions = model.predict(X_simple)

print("Learned weights (w0=intercept, w1=slope):", model.weights_)
print("Expected: [0, 2] or very close to it")
print("\nPredictions:", predictions)
print("Actual:     ", y_simple)
print("\nFinal MSE:", model.loss_history_[-1])
print("Expected: very close to 0")
print(f"\nConverged in {model.n_iter_} iterations")
```

## Visualizing Convergence

Let's plot the loss over iterations to see how the model learned.

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(range(1, len(model.loss_history_) + 1), model.loss_history_, 'b-', linewidth=2)
plt.xlabel('Iteration', fontsize=14)
plt.ylabel('Loss (MSE)', fontsize=14)
plt.title('Gradient Descent Convergence', fontsize=16)
plt.grid(True, alpha=0.3)
plt.show()

print(f"Initial loss: {model.loss_history_[0]:.6f}")
print(f"Final loss:   {model.loss_history_[-1]:.6f}")
print(f"Improvement:  {(1 - model.loss_history_[-1]/model.loss_history_[0])*100:.2f}%")
```

> **Question**: In a well-tuned gradient descent setup, what should the loss curve look like?
>
> A. Steadily increasing as the model learns more complex patterns
>
> B. Randomly fluctuating up and down throughout training
>
> C. Steadily decreasing and eventually flattening as it approaches the minimum
>
> D. Oscillating with larger and larger amplitude as training progresses

<details><summary>Click to reveal answer</summary>

**Correct Answer: C**

**Explanation:**
- **A is FALSE**: Loss should DECREASE during training, not increase. Increasing loss means the model is getting worse at fitting the data, which indicates a problem (e.g., learning rate too high, gradient explosion, or wrong gradient calculation). The goal of optimization is to minimize loss.
- **B is FALSE**: Random fluctuation suggests instability in the optimization process. For batch gradient descent (using all training data), loss should decrease monotonically. Some fluctuation is normal for stochastic/mini-batch gradient descent (using subsets of data), but the overall trend should still be downward.
- **C is TRUE**: A healthy loss curve shows: (1) Rapid decrease initially (gradient is large far from minimum), (2) Gradual slowdown (gradient becomes smaller near minimum), (3) Flattening/plateau (convergence - gradient ≈ 0 at minimum). This is the signature of successful optimization.
- **D is FALSE**: Increasing oscillation amplitude indicates divergence, typically caused by a learning rate that's too large. The optimizer overshoots the minimum by increasing amounts, causing loss to bounce around wildly and potentially explode to infinity.

**Key Insight**: Loss should decrease monotonically (batch GD) or with downward trend (mini-batch/SGD) and flatten at convergence. Any other pattern indicates a problem with hyperparameters or implementation.

</details>

## A Dataset for Visualization

Let's work with the same synthetic dataset from the Linear Regression lab to directly compare approaches.

```python
# Generate the same data as in Linear Regression Code Walk Through
np.random.seed(42)
X_train = np.arange(-9.5, 8.5, 0.1).reshape(-1, 1)
y_train = X_train.ravel() + 1 + np.random.normal(0, 2, len(X_train))

print(f"Training data: {len(X_train)} points")
print(f"X range: [{X_train.min():.1f}, {X_train.max():.1f}]")
print(f"y range: [{y_train.min():.1f}, {y_train.max():.1f}]")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, c='lightblue', alpha=0.6, edgecolors='black', linewidths=0.5)
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Training Data: Linear Relationship with Noise', fontsize=16)
plt.grid(True, alpha=0.3)
plt.show()
```

## Training and Visualizing the Model

```python
# TODO: Fit your MyGradientDescentRegressor on the training data
# Hint: model = MyGradientDescentRegressor(learning_rate=0.01, max_iter=100)
#       model.fit(X_train, y_train)

model = None  # Replace with your code

print(f"Learned weights: {model.weights_}")
print(f"Model equation: y = {model.weights_[1]:.3f}x + {model.weights_[0]:.3f}")
print(f"Stopped after {model.n_iter_} iterations")
```

```python
# Visualize the fit
x_line = np.linspace(X_train.min(), X_train.max(), 100).reshape(-1, 1)
y_line = model.predict(x_line)

plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, c='lightblue', alpha=0.6, edgecolors='black', linewidths=0.5, label='Training data')
plt.plot(x_line, y_line, 'r-', linewidth=2, label=f'GD fit: y={model.weights_[1]:.2f}x+{model.weights_[0]:.2f}')
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Gradient Descent: Best Fit Line', fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()
```

## Visualizing Convergence Behavior

```python
# Plot loss curve
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(model.loss_history_) + 1), model.loss_history_, 'b-', linewidth=2)
plt.xlabel('Iteration', fontsize=14)
plt.ylabel('Loss (MSE)', fontsize=14)
plt.title('Training Loss Over Iterations', fontsize=16)
plt.grid(True, alpha=0.3)
plt.show()

print(f"Initial loss: {model.loss_history_[0]:.6f}")
print(f"Final loss:   {model.loss_history_[-1]:.6f}")
```

## Impact of Learning Rate

Let's experiment with different learning rates to see how they affect convergence.

```python
# Try different learning rates
learning_rates = [0.001, 0.01, 0.1, 0.5]
colors = ['blue', 'green', 'red', 'orange']

plt.figure(figsize=(12, 6))

for lr, color in zip(learning_rates, colors):
    model_lr = MyGradientDescentRegressor(learning_rate=lr, max_iter=100)
    model_lr.fit(X_train, y_train)

    plt.plot(range(1, len(model_lr.loss_history_) + 1), model_lr.loss_history_,
            linewidth=2, color=color, label=f'α = {lr} ({model_lr.n_iter_} iter)')

plt.xlabel('Iteration', fontsize=14)
plt.ylabel('Loss (MSE)', fontsize=14)
plt.title('Impact of Learning Rate on Convergence', fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.yscale('log')  # Log scale to see all curves
plt.show()

print("Observations (for this unscaled dataset):")
print("- α = 0.001: Very slow convergence (needs more iterations)")
print("- α = 0.01:  Steady convergence")
print("- α = 0.1:   Diverges: above the stability limit for this data (roughly 0.04)")
print("- α = 0.5:   Diverges even faster (far too large)")
```
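
One way to predict which learning rates will be stable, without running the experiment, is to look at the curvature of the loss. For MSE linear regression the Hessian is $(2/N)\Phi^T\Phi$, and fixed-step batch gradient descent diverges once α exceeds $2/\lambda_{\max}$, where $\lambda_{\max}$ is the Hessian's largest eigenvalue. The sketch below applies that standard result to the same feature values as the lab's dataset; the variable names are illustrative.

```python
import numpy as np

# Same feature values as the lab's synthetic dataset
X_check = np.arange(-9.5, 8.5, 0.1).reshape(-1, 1)
Phi_check = np.c_[np.ones(len(X_check)), X_check]

# Hessian of the MSE loss for linear regression: H = (2/N) Phi^T Phi
H = (2 / len(X_check)) * Phi_check.T @ Phi_check
lam_max = np.linalg.eigvalsh(H)[-1]     # largest eigenvalue (eigvalsh sorts ascending)
alpha_max = 2 / lam_max                 # fixed-step GD diverges above this value

print(f"largest Hessian eigenvalue: {lam_max:.2f}")
print(f"stability limit for alpha:  {alpha_max:.4f}")
for lr in (0.001, 0.01, 0.1, 0.5):
    print(f"alpha = {lr}: {'stable' if lr < alpha_max else 'diverges'}")
```

Because the limit depends on the scale of the features, standardizing `X_train` first would raise it considerably. This is another way of seeing why scaling and learning-rate tuning go together.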

> **Question**: You train gradient descent with α=0.001 for 100 iterations and the loss is still decreasing steadily. What should you do?
>
> A. Decrease the learning rate to α=0.0001 for more stable convergence
>
> B. Stop training immediately since 100 iterations should be sufficient for any model
>
> C. Increase max_iter to allow more iterations, or increase α for faster convergence
>
> D. Switch to the normal equation since gradient descent isn't working properly

<details><summary>Click to reveal answer</summary>

**Correct Answer: C**

**Explanation:**
- **A is FALSE**: The learning rate α=0.001 is already quite small. Decreasing it further to 0.0001 would make convergence even SLOWER, requiring even MORE iterations. Since loss is steadily decreasing (not oscillating), there's no stability problem - the issue is just slow convergence.
- **B is FALSE**: There's no universal "sufficient" number of iterations. The required iterations depend on: learning rate, data size, feature scales, initialization, and convergence tolerance. If loss is still decreasing steadily after 100 iterations, the model hasn't converged yet and needs more training.
- **C is TRUE**: Steadily decreasing loss means gradient descent is working correctly but needs more time. Two solutions: (1) Increase max_iter (e.g., 1000 or 10000) to allow more iterations with the current learning rate, or (2) Increase α (e.g., to 0.01 or 0.1) to take bigger steps and converge faster. The second option is usually more efficient.
- **D is FALSE**: "Loss still decreasing steadily" means gradient descent IS working properly - it just hasn't finished yet. This is not a failure case. The normal equation would give the same final result but doesn't provide insight into convergence behavior. For this small problem, either approach works.

**Key Insight**: Steadily decreasing loss = working correctly but not converged. Solution: increase iterations or increase learning rate (carefully). Oscillating/increasing loss = problem with learning rate.

</details>

## Understanding Different Gradient Descent Variants

There are three main variants of gradient descent:

### 1. Batch Gradient Descent (What We Implemented)
- Uses **ALL training data** in each iteration
- Computes gradient using entire dataset: ∇w = -(2/N) Φᵀ(y - Φw)
- **Pros:** Stable, smooth convergence, exact gradient
- **Cons:** Slow for large datasets (must process all N samples per iteration)

### 2. Stochastic Gradient Descent (SGD)
- Uses **ONE random sample** in each iteration
- Computes gradient using single example: ∇w ≈ -2(yᵢ - ŷᵢ)φᵢ
- **Pros:** Very fast iterations, can escape local minima, online learning
- **Cons:** Noisy updates, erratic convergence, usually needs a decaying learning rate to settle at the minimum

### 3. Mini-Batch Gradient Descent (Most Popular!)
- Uses **SMALL BATCH** (e.g., 32, 64, 128 samples) in each iteration
- Computes gradient using batch: ∇w ≈ -(2/B) Σᵢ(yᵢ - ŷᵢ)φᵢ
- **Pros:** Balance of speed and stability, GPU-friendly, works with large data
- **Cons:** Adds batch size as hyperparameter

### Comparison:

| Aspect | Batch GD | Stochastic GD | Mini-Batch GD |
|--------|----------|---------------|---------------|
| **Samples per iter** | All N | 1 | B (e.g., 32) |
| **Gradient accuracy** | Exact | Noisy | Approximate |
| **Convergence** | Smooth | Erratic | Smooth-ish |
| **Speed per iter** | Slow | Very fast | Fast |
| **Memory usage** | High | Low | Medium |
| **Use case** | Small data | Online learning | Large data, deep learning |

**Note:** Scikit-learn's `SGDRegressor` uses Stochastic GD, so its fitted weights will typically differ slightly from those of a batch implementation like ours!
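
All three variants can share one implementation, because the batch size is the only thing that changes. The sketch below is an illustrative assumption rather than part of the lab's class: `minibatch_gd` is a hypothetical function, and the data, learning rate, and epoch count are chosen only to make the demo converge.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, alpha=0.05, n_epochs=50, seed=0):
    """Linear regression by mini-batch gradient descent on the MSE loss.

    batch_size = len(y) gives batch GD; batch_size = 1 gives SGD.
    """
    rng = np.random.default_rng(seed)
    N = len(y)
    Phi = np.c_[np.ones(N), X]
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(N)                   # reshuffle every epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            errors = y[idx] - Phi[idx] @ w
            grad = -(2 / len(idx)) * Phi[idx].T @ errors
            w = w - alpha * grad
    return w

# Synthetic, already-standardized feature; true weights are [1, 3]
rng = np.random.default_rng(1)
X_mb = rng.normal(size=(1000, 1))
y_mb = 1.0 + 3.0 * X_mb.ravel() + rng.normal(scale=0.1, size=1000)

results_mb = {B: minibatch_gd(X_mb, y_mb, batch_size=B) for B in (1000, 32, 1)}
for B, w_B in results_mb.items():
    print(f"batch size {B:>4}: weights = {w_B}")
```

All three runs should land near the true weights. The smaller the batch, the more updates happen per epoch and the noisier each individual update is.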

> **Question**: You have a dataset with 10 million training examples and want to train a linear regression model. Which optimization approach is MOST practical?
>
> A. Normal equation (closed-form solution) for instant exact results
>
> B. Batch gradient descent using all 10 million samples per iteration
>
> C. Mini-batch gradient descent with batches of 128-256 samples
>
> D. Stochastic gradient descent using exactly 1 random sample per iteration

<details><summary>Click to reveal answer</summary>

**Correct Answer: C**

**Explanation:**
- **A is FALSE**: The normal equation requires computing (ΦᵀΦ)⁻¹Φᵀy, where Φ is 10M × d. This involves: (1) Matrix multiplication: O(Nd²) ≈ billions of operations, (2) Matrix inversion: O(d³), (3) Storing 10M × d matrix in memory (could be gigabytes). For N=10M, this is extremely slow and memory-intensive. Normal equation doesn't scale to large datasets.
- **B is FALSE**: Batch GD must process all 10 million samples in EACH iteration. Even if each iteration takes 10 seconds, and you need 100 iterations, that's 1000 seconds (about 17 minutes) just for computation - extremely slow, since every single weight update costs a full pass over the data. Also requires loading all 10M samples into memory simultaneously.
- **C is TRUE**: Mini-batch GD with batch size B=128 means: (1) Only 128 samples in memory at once (feasible), (2) Fast iterations (~milliseconds each), (3) Can process data in chunks from disk, (4) Parallelizes well on GPUs, (5) Gets good gradient estimates with much less computation than full batch. With B=128 and N=10M, each epoch is ~78,000 mini-batches, but each is very fast. This is the standard approach for large-scale ML.
- **D is FALSE**: While SGD (B=1) has very fast iterations, it's TOO noisy for stable convergence. Each update relies on a single sample, so the gradient estimate has high variance and cannot take advantage of vectorized hardware. Mini-batch (B=128-256) provides a much better balance: more stable than SGD, much faster than batch GD.

**Key Insight**: For large datasets (N > 100,000), mini-batch gradient descent is the practical choice. It balances computational efficiency, memory usage, and convergence stability. This is why deep learning uses mini-batch GD almost exclusively.

</details>
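
The arithmetic behind this answer is easy to check. In the sketch below, `N` comes from the question, while `d = 100` features is an assumed value used only to get a feel for the memory footprint.

```python
import math

N = 10_000_000          # training examples, from the question
d = 100                 # number of features: an assumed value, not given in the question

for B in (1, 128, 256, N):
    updates_per_epoch = math.ceil(N / B)
    print(f"batch size {B:>10,}: {updates_per_epoch:>10,} updates per epoch")

# Memory needed to hold the design matrix as float64 (8 bytes per value)
bytes_full = N * (d + 1) * 8
bytes_batch = 128 * (d + 1) * 8
print(f"full design matrix: {bytes_full / 1e9:.2f} GB")
print(f"one batch of 128:   {bytes_batch / 1e3:.1f} kB")
```

Under these assumptions, one mini-batch is tiny next to the full matrix, which is what makes streaming data from disk practical.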

## Summary and Best Practices

### Key Takeaways

1. **Gradient Descent is an iterative optimization algorithm**
   - Updates weights in direction that reduces loss: w = w - α∇w L
   - Converges to optimal solution through many small steps
   - Foundation for training neural networks and deep learning

2. **Learning rate (α) is critical**
   - Too small → slow convergence (many iterations needed)
   - Too large → divergence (loss increases or oscillates)
   - Typical range: 0.001 to 0.1
   - Monitor loss curve to diagnose issues

3. **Feature scaling is ESSENTIAL for gradient descent**
   - Unscaled features cause slow/unstable convergence
   - Different features need different step sizes → impossible with one α
   - Always use StandardScaler or MinMaxScaler
   - Fit scaler on training data ONLY!

4. **Convergence monitoring**
   - Plot loss over iterations
   - Healthy curve: decreasing and flattening
   - Stop when loss change < tolerance
   - Early stopping prevents wasted computation

5. **Gradient descent vs Normal equation**
   - Normal equation: Fast for small data, exact solution, no tuning
   - Gradient descent: Scales to large data, works for any model, needs tuning
   - Both reach the same solution for linear regression (gradient descent only approximately, to within the convergence tolerance)

### When to Use Gradient Descent

✅ **Use Gradient Descent when:**
- Training neural networks (only option available)
- Dataset is very large (N > 100,000)
- Online learning (data arrives in streams)
- Need mini-batch or stochastic variants
- Working with distributed systems (can parallelize mini-batches)

❌ **Use Normal Equation when:**
- Small-medium dataset (N < 10,000)
- Few features (d < 1,000)
- Want exact solution without tuning
- Simple linear regression

### Gradient Descent Variants Summary

| Variant | Samples/Iter | Best For |
|---------|--------------|----------|
| **Batch GD** | All N | Small datasets, smooth convergence |
| **Stochastic GD** | 1 | Online learning, escaping local minima |
| **Mini-Batch GD** | 32-256 | Large datasets, deep learning (MOST COMMON) |

### Best Practices Checklist

- ✅ Always standardize features using StandardScaler
- ✅ Fit scaler on training data only (avoid data leakage)
- ✅ Start with learning rate α = 0.01, adjust based on loss curve
- ✅ Monitor loss over iterations to diagnose convergence
- ✅ Use early stopping to prevent wasted computation
- ✅ Initialize weights with small random values
- ✅ For large data, use mini-batch variant (B=32-256)
- ✅ Compare with closed-form solution when possible (validation)
- ✅ Use learning rate decay for better convergence (advanced)
- ✅ Visualize loss curves to understand convergence behavior

### Debugging Gradient Descent

| Problem | Likely Cause | Solution |
|---------|--------------|----------|
| Loss increasing | α too large | Decrease learning rate by 10× |
| Loss oscillating | α too large | Decrease learning rate |
| Very slow convergence | α too small OR unscaled features | Increase α or standardize features |
| Loss stuck at high value | Poor initialization OR bad α | Try different random seed or α |
| NaN/Inf values | Gradient explosion | Standardize features, decrease α |
| Doesn't converge after 10k iter | Unscaled features | Standardize features! |
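
The first rows of the table can be turned into a small helper that labels a loss history automatically. This is a rough heuristic sketch: `diagnose_loss` is a hypothetical function, not part of any library, and its rules are deliberately simple (for example, it assumes batch gradient descent, where any upward step is suspicious).

```python
import numpy as np

def diagnose_loss(loss_history, tol=1e-6):
    """Rough, heuristic label for a batch-GD loss curve (hypothetical helper)."""
    losses = np.asarray(loss_history, dtype=float)
    if len(losses) < 2:
        return "not enough data"
    if not np.all(np.isfinite(losses)):
        return "NaN/Inf: standardize features and decrease alpha"
    diffs = np.diff(losses)                 # change in loss between iterations
    if losses[-1] > losses[0]:
        return "increasing: decrease alpha"
    if np.any(diffs > 0):
        return "oscillating: decrease alpha"
    if abs(diffs[-1]) < tol:
        return "converged"
    return "still decreasing: run longer or increase alpha"

print(diagnose_loss([1.0, 2.0, 4.0, 8.0]))
print(diagnose_loss([4.0, 1.0, 2.0, 0.5, 0.8, 0.3]))
print(diagnose_loss([4.0, 2.0, 1.0, 0.5]))
print(diagnose_loss([4.0, 1.0, 0.5, 0.5]))
```

It could be called on `model.loss_history_` after fitting. For mini-batch or stochastic variants the "oscillating" rule would need loosening, since some upward steps are normal there.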

> **Final Question**: You're training a neural network and observe that the training loss decreases smoothly for 50 epochs, then suddenly starts increasing. What is the MOST likely explanation?
>
> A. The model has successfully converged and entered the optimal region
>
> B. The learning rate should be increased to speed up convergence
>
> C. The learning rate might be too high for later epochs and needs decay/reduction
>
> D. The features weren't properly standardized before training started

<details><summary>Click to reveal answer</summary>

**Correct Answer: C**

**Explanation:**
- **A is FALSE**: If the model had converged optimally, the loss would plateau (stay constant), not increase. Increasing loss means the model is moving AWAY from the optimal solution, which is the opposite of convergence. This indicates a problem with the optimization, not success.
- **B is FALSE**: If loss is already increasing, making the learning rate LARGER will make the problem worse! The model is overshooting, so bigger steps would cause even more overshooting and potentially divergence to infinity.
- **C is TRUE**: This is a classic pattern: initially, when weights are far from optimal, a larger learning rate (e.g., α=0.1) works well for fast progress. But as the model gets close to the minimum (after ~50 epochs), those large steps start overshooting. The loss surface becomes very narrow near the minimum, so the same learning rate that worked early now causes instability. Solution: learning rate decay/scheduling (reduce α over time, e.g., α = 0.1 → 0.01 → 0.001).
- **D is FALSE**: If features weren't standardized, you'd typically see problems from the VERY FIRST epoch - extremely slow convergence, erratic behavior, or immediate divergence. The fact that loss decreased smoothly for 50 epochs suggests that scaling is not the main problem. This issue started LATER, indicating a learning rate problem specific to later training stages.

**Key Insight**: Loss decreasing then increasing suggests the learning rate is too large for the current optimization stage. Use learning rate schedules/decay: start high for fast progress, reduce over time for stable convergence. This is standard practice in deep learning.

</details>
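
The decay idea from the answer above can be sketched as a simple step schedule. The function name `step_decay`, the factor of 10, and the 20-epoch interval are assumptions chosen to reproduce the 0.1 → 0.01 → 0.001 sequence; real schedules vary widely.

```python
def step_decay(alpha0, epoch, drop=0.1, epochs_per_drop=20):
    """Learning rate at a given epoch: multiplied by `drop` every `epochs_per_drop` epochs."""
    return alpha0 * drop ** (epoch // epochs_per_drop)

for epoch in (0, 19, 20, 39, 40):
    print(f"epoch {epoch:>2}: alpha = {step_decay(0.1, epoch):.4g}")
```

Inside a training loop, the fixed `learning_rate` would be replaced by `step_decay(alpha0, epoch)` at the start of each epoch.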