<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Gradient%20Descent/Gradient%20Descent%20Hands-On%20Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gradient Descent Hands-On Lab

In this lab, you will implement Gradient Descent optimization from scratch, understand the mathematics behind it, and apply it to real data. Along the way, you'll answer conceptual questions and create visualizations to deepen your understanding.

**Learning Objectives:**
- Understand the mathematics of Gradient Descent optimization
- Implement a custom Gradient Descent class from scratch
- Visualize convergence and loss curves
- Understand the impact of learning rate on convergence
- Apply feature scaling and understand why it's critical for gradient descent
- Compare gradient descent with closed-form solutions
- Understand batch, stochastic, and mini-batch variants
- Analyze model performance and convergence behavior

## Overview of Gradient Descent

Gradient Descent is an **iterative optimization algorithm** used to find the minimum of a function. In machine learning, we use it to **minimize the loss function** and find optimal model parameters.

**Key Idea:**
- Start with **random weights**
- Iteratively **update weights** in the direction that reduces loss
- Take steps proportional to the **negative gradient** of the loss function
- Continue until **convergence** (loss stops decreasing)

**The Update Rule:**
$$w_{\text{new}} = w_{\text{old}} - \alpha \nabla_w L$$

Where:
- **w** are the model weights (parameters)
- **α** is the learning rate (step size)
- **∇w L** is the gradient of the loss with respect to weights
- **L** is the loss function (e.g., Sum of Squared Errors)

**For Linear Regression with SSE Loss:**
- Loss: $L(w) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
- Gradient: $\nabla_w L = -2 \Phi^T (y - \Phi w)$
- Update: $w = w - \alpha \nabla_w L$
- Expanded form: $w = w - \alpha(-2\Phi^T(y - \Phi w)) = w + 2\alpha \Phi^T (y - \Phi w)$
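
The gradient formula above can be sanity-checked numerically. The sketch below is an illustration only, not part of the lab scaffold: the helper names (`sse_loss`, `sse_gradient`, `numerical_gradient`) and the toy data are assumptions made up for this example. It compares the analytical SSE gradient with a finite-difference estimate and then takes one update step.

```python
import numpy as np

def sse_loss(w, Phi, y):
    """Sum of squared errors: L(w) = sum_i (y_i - Phi_i . w)^2."""
    residual = y - Phi @ w
    return float(residual @ residual)

def sse_gradient(w, Phi, y):
    """Analytical gradient: -2 * Phi^T (y - Phi w)."""
    return -2.0 * Phi.T @ (y - Phi @ w)

def numerical_gradient(f, w, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Toy data (made up for illustration): y = 1 + 2x
Phi = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([1.0, 3.0, 5.0, 7.0])
w = np.array([0.5, -0.5])

analytic = sse_gradient(w, Phi, y)
numeric = numerical_gradient(lambda v: sse_loss(v, Phi, y), w)
print("analytic:", analytic)
print("numeric: ", numeric)

# One update step with a small learning rate lowers the loss
alpha = 0.01
w_next = w - alpha * analytic
print("loss before:", sse_loss(w, Phi, y), "after:", sse_loss(w_next, Phi, y))
```

If the two gradients agree to several decimal places, the analytical formula has very likely been implemented correctly.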

**Advantages:**
- Works when no closed-form solution exists (e.g., neural networks)
- Scales well to large datasets
- Can be adapted to stochastic/mini-batch variants for efficiency
- Foundation for deep learning optimization

**Disadvantages:**
- Requires tuning hyperparameters (learning rate, iterations)
- Can be slow to converge
- May get stuck in local minima (for non-convex functions)
- Sensitive to feature scaling

> **Question**: Gradient Descent finds optimal weights by:
>
> A. Computing the exact optimal solution directly in one step using matrix inversion operations and ordinary least squares without any iterative optimization
>
> B. Iteratively updating weights in the direction opposite to the gradient, which is the direction that most quickly reduces the loss function
>
> C. Evaluating multiple random weight configurations through grid search and selecting the specific combination that achieves the lowest validation error score
>
> D. Approximating the closed-form solution through successive linearizations of the loss surface using second-order Taylor series expansion and Hessian matrices

<details><summary>Click to reveal answer</summary>

**Correct Answer: B**

**Key Insight:** Gradient descent is a first-order iterative optimization algorithm that uses calculus to find the direction of steepest descent (negative gradient) and repeatedly updates weights by taking small steps in that direction: w_new = w_old - α∇w L.

**Explanation:**
- **A is FALSE**: This describes the normal equation for Linear Regression: w = (ΦᵀΦ)⁻¹Φᵀy. The normal equation computes the exact optimal solution in one step through direct matrix operations, without any iteration. Gradient descent, in contrast, is an iterative method that gradually approaches the optimal solution through many small updates.
- **B is TRUE**: Gradient descent computes the gradient ∇w L (the direction of steepest ascent) and updates weights in the opposite direction (steepest descent): w_new = w_old - α∇w L. By repeatedly taking steps in the direction that reduces loss most quickly, it converges to a minimum. The learning rate α controls step size.
- **C is FALSE**: While this describes a valid optimization approach (grid search or random search), it's not gradient descent. Gradient descent uses calculus-based gradients to determine the exact direction to move, not trial-and-error evaluation of different configurations. Grid search would be extremely inefficient for high-dimensional problems (e.g., millions of parameters in neural networks).
- **D is FALSE**: This might describe methods like Newton's method or successive quadratic approximations, which use second-order information (Hessian matrix containing second derivatives). Gradient descent uses only first-order gradients and doesn't approximate closed-form solutions - it directly minimizes the loss iteratively without needing the Hessian.

</details>

## Gradient Descent vs Normal Equation (Closed-Form Solution)

For Linear Regression, we have **two ways** to find optimal weights:

### 1. Normal Equation (Closed-Form)
$$w = (\Phi^T \Phi)^{-1} \Phi^T y$$

**Pros:**
- ✅ Exact optimal solution in one calculation
- ✅ No hyperparameters to tune
- ✅ No iterations needed

**Cons:**
- ❌ Requires matrix inversion: O(d³) complexity (slow for many features)
- ❌ Doesn't scale to very large datasets (memory intensive)
- ❌ Only works for problems with closed-form solutions
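
As a minimal sketch of the closed-form route, the normal equation can be evaluated with a linear solve rather than an explicit inverse. The function name `normal_equation` and the toy data are assumptions for illustration only.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (Phi^T Phi) w = Phi^T y without forming an explicit inverse."""
    Phi = np.column_stack([np.ones(len(X)), X])  # prepend bias column
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Noise-free toy data (illustrative): y = 3 + 2x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 + 2.0 * X.ravel()

w = normal_equation(X, y)
print(w)  # intercept first, then slope
```

Using `np.linalg.solve` instead of `np.linalg.inv` is a common numerical choice: it gives the same weights in exact arithmetic and tends to be more stable in floating point.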

### 2. Gradient Descent (Iterative)
$$w = w - \alpha \nabla_w L$$

**Pros:**
- ✅ Scales well to large datasets (especially mini-batch/stochastic variants)
- ✅ Works for any differentiable loss function
- ✅ Foundation for neural networks and deep learning
- ✅ Can stop early if convergence is good enough

**Cons:**
- ❌ Requires tuning learning rate and iterations
- ❌ Slower convergence (multiple iterations)
- ❌ Very sensitive to feature scaling

### When to Use Each:

| Scenario | Best Choice |
|----------|-------------|
| Small dataset (N < 10,000), few features (d < 1,000) | Normal Equation |
| Large dataset (N > 100,000) | Gradient Descent (Mini-batch) |
| Many features (d > 10,000) | Gradient Descent |
| Neural networks, non-linear models | Gradient Descent (only option) |
| Need exact optimal solution | Normal Equation |
| Online learning (streaming data) | Stochastic Gradient Descent |

## The Learning Rate: Critical Hyperparameter

The **learning rate (α)** controls how big a step we take in each iteration.

### Impact of Different Learning Rates:

**α too small (e.g., 0.0001):**
- ✅ Stable convergence (doesn't overshoot)
- ❌ Very slow (needs many iterations)
- ❌ May get stuck in plateaus

**α optimal (e.g., 0.01-0.1):**
- ✅ Fast convergence
- ✅ Reaches minimum efficiently
- ✅ Smooth loss curve

**α too large (e.g., 1.0+):**
- ❌ Overshoots minimum
- ❌ Loss oscillates or increases
- ❌ May diverge (loss → ∞)

We'll visualize these effects later in the lab!
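
The three regimes can be reproduced on a one-dimensional toy problem before touching real data. The sketch below assumes the function f(w) = w², chosen only for illustration; its gradient is 2w, so each update multiplies w by (1 - 2α). The exact thresholds between "slow", "fast" and "diverging" depend on the curvature of the loss, so the numbers here apply to this toy function only.

```python
def run_gd(alpha, w0=1.0, steps=20):
    """Gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2 * w  # equivalent to w *= (1 - 2*alpha)
    return w

for alpha in [0.001, 0.1, 1.0, 1.1]:
    print(f"alpha={alpha}: w after 20 steps = {run_gd(alpha):.4g}")
```

On this function, α = 0.001 barely moves away from the starting point, α = 0.1 gets close to the minimum at 0, α = 1.0 flips between +1 and -1 without making progress, and α = 1.1 grows in magnitude at every step.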

> **Question**: You're training a model with gradient descent and observe that the loss is increasing rather than decreasing over iterations. What is the MOST likely cause?
>
> A. The model architecture is too simple and lacks sufficient capacity to capture the underlying complex patterns in the training data effectively
>
> B. The learning rate is excessively large, causing the optimizer to overshoot the minimum and jump to positions with higher loss
>
> C. The features require standardization because different feature scales are causing destabilizing gradient magnitudes that prevent convergence throughout the entire training process
>
> D. The convergence tolerance threshold is set too loose, allowing the algorithm to stop prematurely at suboptimal solutions before reaching the minimum

<details><summary>Click to reveal answer</summary>

**Correct Answer: B**

**Key Insight:** Increasing loss during training is the hallmark of a learning rate that's too large. The gradient descent steps overshoot the minimum, landing at positions with higher loss. Solution: reduce α by 10× (e.g., 0.1 → 0.01).

**Explanation:**
- **A is FALSE**: Model underfitting means the model can't fit the training data well, resulting in HIGH but STABLE loss that plateaus at a suboptimal value. The loss would remain consistently high across iterations, not increase over time. If loss is increasing, the optimization process itself is failing, not the model's representational capacity. Underfitting would show as: iteration 1 loss=100, iteration 50 loss=98, iteration 100 loss=98 (stuck at high value).
- **B is TRUE**: When the learning rate α is too large, the weight update w_new = w_old - α∇w overshoots the minimum. Instead of moving toward the optimal point, it jumps past it to a worse position with higher loss. In extreme cases, this causes divergence where loss → ∞. The classic symptom of excessive learning rate is monotonically increasing loss or wild oscillations. Example: iteration 1 loss=10, iteration 2 loss=25, iteration 3 loss=100, iteration 4 loss=∞.
- **C is FALSE**: While feature scaling IS very important for gradient descent stability, unscaled features typically cause SLOW and erratic convergence with oscillating loss, not monotonically increasing loss. Unscaled features create elongated loss surfaces that require careful learning rate tuning, but the loss would still trend downward overall, just very slowly and unstably. The symptom would be zigzagging loss curves, not consistent increases.
- **D is FALSE**: Convergence tolerance controls when training stops (when loss change falls below threshold). If tolerance is too loose (e.g., 1e-2 instead of 1e-6), training might stop early, but this would result in HIGH loss at stopping time, not INCREASING loss. Increasing loss indicates the optimizer is actively making things worse, not stopping too early. Early stopping from loose tolerance would show: decreasing loss that stops at iteration 10 instead of iteration 100.

</details>

## Feature Scaling: Critical for Gradient Descent!

While feature scaling is recommended for Linear Regression's normal equation, it's **ESSENTIAL** for gradient descent.

**Why is scaling so important for gradient descent?**

1. **Convergence Speed:** Unscaled features create elongated loss surfaces
   - Gradient descent zigzags instead of going straight to minimum
   - Can be 100× slower or more!

2. **Learning Rate Sensitivity:** Different features need different learning rates
   - Small-scale features (0-1) might need α = 0.1
   - Large-scale features (0-10000) might need α = 0.00001
   - With one global α, impossible to optimize all features well

3. **Numerical Stability:** Large feature values can cause gradient explosion
   - Gradients become huge → weights explode → overflow errors

**Solution: Z-Score Standardization**
$$z = \frac{x - \mu}{\sigma}$$

This transforms all features to:
- Mean = 0
- Standard deviation = 1
- Similar scales → uniform convergence

**Critical Rule:** Fit scaler on training data ONLY!
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)  # Learn μ and σ from training data only
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)    # Use same μ and σ
X_test_scaled = scaler.transform(X_test)  # Use same μ and σ
```
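
One way to see why scaling matters is to look at the condition number of ΦᵀΦ, which measures how elongated the loss surface is. The sketch below is an illustration under assumptions: the two synthetic features, their ranges, and the helper `condition_number` are invented for this example, and the z-score is computed by hand with NumPy rather than with a library scaler.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical features on very different scales
X_train = np.column_stack([
    rng.uniform(0, 1, 200),        # e.g. a ratio in [0, 1]
    rng.uniform(0, 10_000, 200),   # e.g. a price-like quantity
])
X_val = np.column_stack([rng.uniform(0, 1, 50), rng.uniform(0, 10_000, 50)])

# Fit on training data only, then reuse the same statistics elsewhere
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_scaled = (X_train - mu) / sigma
X_val_scaled = (X_val - mu) / sigma

def condition_number(X):
    """Ratio of largest to smallest eigenvalue of Phi^T Phi."""
    Phi = np.column_stack([np.ones(len(X)), X])
    return np.linalg.cond(Phi.T @ Phi)

print(f"raw:    {condition_number(X_train):.3g}")
print(f"scaled: {condition_number(X_train_scaled):.3g}")
```

A condition number close to 1 corresponds to a nearly round bowl, where a single learning rate works well in every direction. A very large one corresponds to the narrow valley that makes gradient descent zigzag.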

## Pseudocode for Gradient Descent

### Formal Pseudocode

```
============================================
Inputs
============================================
X       ← training features (N × d matrix)
y       ← training targets (N × 1 vector)
α       ← learning rate (e.g., 0.01)
max_iter ← maximum iterations (e.g., 1000)
tol     ← convergence tolerance (e.g., 1e-6)

============================================
----- fit -----
============================================
1. Add bias column: Φ ← [1, X]  # (N × (d+1))
2. Initialize weights randomly: w ← random small values
3. For iteration = 1 to max_iter:
     a. Compute predictions: ŷ ← Φw
     b. Compute errors: e ← y - ŷ
     c. Compute loss: L ← (1/N) Σ e²
     d. Compute gradients: ∇w ← -(2/N) Φᵀe
     e. Update weights: w ← w - α∇w
     f. If |L_new - L_old| < tol: STOP (converged)
4. Store final weights w

============================================
----- predict -----
============================================
For each query point in X_query:
1. Add bias: Φ_query ← [1, X_query]
2. Compute prediction: ŷ ← Φ_query · w
3. Return ≈∑
```

### Key Observations
- **Iterative process:** Weights improve gradually over multiple iterations
- **Convergence check:** Stop when loss stops decreasing significantly
- **Prediction:** Same as Linear Regression (just matrix multiplication)
- **Memory efficient:** Only stores weights (not all training data)

## Why Gradient Descent?

Gradient descent is one of several optimization methods available. Understanding WHY we use it (and when NOT to) is crucial for choosing the right algorithm.

### Comparison with Alternative Optimizers

| Method | Time Complexity | Memory | Best Use Case | Limitations |
|--------|----------------|---------|---------------|-------------|
| **Closed-Form (Normal Equation)** | O(Nd² + d³) | O(d²) | Small d (<1000), convex problems | Doesn't scale; only works for linear models |
| **Gradient Descent** | O(iterations × N × d) | O(d) | Large-scale, any differentiable loss | Requires tuning α; iterative |
| **Newton's Method** | O(d³) per iteration | O(d²) | Small d, need fast convergence | Very expensive; requires Hessian |
| **Stochastic GD** | O(iterations × d) | O(d) | Very large N, online learning | Noisy; requires careful tuning |
| **Coordinate Descent** | O(iterations × d) | O(d) | Sparse problems, LASSO | Not for all loss functions |

**Key Variables:**
- **N** = number of training samples
- **d** = number of features (dimensions)
- **iterations** = typically 100-10,000 depending on convergence
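
The batch, mini-batch and stochastic rows of the table differ only in how many samples feed each gradient estimate. The sketch below is a hedged illustration, separate from the lab's class scaffold: the function `minibatch_gd`, its default hyperparameters, and the synthetic data are all assumptions chosen for this example.

```python
import numpy as np

def minibatch_gd(Phi, y, alpha=0.1, batch_size=16, epochs=100, seed=0):
    """Mini-batch gradient descent on MSE.

    batch_size=len(y) gives batch GD, batch_size=1 gives stochastic GD.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        order = rng.permutation(n)          # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = y[idx] - Phi[idx] @ w
            grad = -(2.0 / len(idx)) * Phi[idx].T @ error
            w = w - alpha * grad
    return w

# Synthetic data (illustrative): y = 1 + 2x plus small noise
data_rng = np.random.default_rng(42)
x = data_rng.uniform(-1, 1, 256)
Phi = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + data_rng.normal(0, 0.1, x.size)

for bs in (len(y), 16, 1):
    print(f"batch_size={bs:>3}: w = {minibatch_gd(Phi, y, batch_size=bs)}")
```

All three settings should land near the generating weights (intercept 1, slope 2). The smaller the batch, the more the final weights jitter around the optimum, which is the noise the table refers to.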

### When to Use Gradient Descent

✅ **Choose Gradient Descent when:**
- **Large-scale problems:** Millions of parameters (e.g., neural networks with d > 1,000,000)
- **No closed-form solution exists:** Most non-linear models (neural nets, logistic regression)
- **Memory-constrained:** Can't store d × d matrices (common in big data)
- **Online/streaming data:** Data arrives continuously, need to update model incrementally
- **Non-convex optimization:** Need stochastic variants to escape local minima
- **Deep learning:** The ONLY practical option for training neural networks

❌ **Don't Use Gradient Descent when:**
- **Small problems with closed-form solutions:** Linear regression with N < 10,000 and d < 1,000
  - Normal equation is faster and gives exact solution
  - No hyperparameter tuning needed
- **Need exact solution in one step:** Critical applications where iterative approximation isn't acceptable
- **Second-order methods are feasible:** d < 100 and you can afford O(d³) computation
  - Newton's method converges much faster (quadratic vs linear convergence)
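
For a quadratic loss such as SSE, the speed advantage of Newton's method is extreme: a single Newton step from any starting point lands on the least-squares solution, because the Hessian 2ΦᵀΦ is constant. The sketch below illustrates this on synthetic data; the data and the starting point are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 50)
Phi = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, x.size)   # synthetic data

w0 = np.array([10.0, -10.0])                 # deliberately poor start
grad = -2.0 * Phi.T @ (y - Phi @ w0)         # gradient of SSE at w0
hessian = 2.0 * Phi.T @ Phi                  # constant for a quadratic loss

# Newton update: w <- w - H^{-1} grad, computed with a linear solve
w_newton = w0 - np.linalg.solve(hessian, grad)

w_exact, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("Newton, 1 step:", w_newton)
print("Least squares: ", w_exact)
```

The price is the d × d linear solve at every step, which is why this approach is only attractive when d is small.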

### Real-World Examples

**Gradient Descent is ESSENTIAL for:**
- Training neural networks (millions of parameters)
- Logistic regression (no closed form)
- Linear Support Vector Machines trained at scale (e.g., with SGD on the hinge loss)
- Deep learning models (CNNs, RNNs, Transformers)
- Matrix factorization (e.g., recommender systems)

**Normal Equation is BETTER for:**
- Simple linear regression on tabular data (N < 10k, d < 1k)
- Ridge regression (closed form: w = (XᵀX + λI)⁻¹Xᵀy)
- Prototyping and quick experimentation
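
The ridge closed form mentioned above can be sketched in a few lines. This is an illustration under assumptions: the function name `ridge_closed_form`, the synthetic data, and the choice to omit a bias column (so that every weight is penalized) are not taken from the lab.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """w = (X^T X + lam * I)^{-1} X^T y, via a linear solve.

    Assumes X is already centered/scaled and carries no bias column,
    so every weight is penalized.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

for lam in (0.0, 10.0, 1000.0):
    w = ridge_closed_form(X, y, lam)
    print(f"lambda={lam:>6}: w = {w}, ||w|| = {np.linalg.norm(w):.3f}")
```

With λ = 0 this reduces to ordinary least squares. Larger λ values shrink the weight vector toward zero.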

**The Bottom Line:**
Gradient descent is the **foundation of modern machine learning** because it:
1. Scales to billions of parameters
2. Works for ANY differentiable loss function
3. Enables training of complex non-linear models
4. Can be parallelized across GPUs/distributed systems

While it requires careful tuning and is slower than closed-form solutions, its **flexibility and scalability** make it indispensable for real-world ML.

## Implementing a Custom Gradient Descent Class

Below is a scaffold of the `MyGradientDescentRegressor` class. Fill in the TODO sections to complete the implementation:

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MyGradientDescentRegressor(BaseEstimator, RegressorMixin):
    """
    Custom Gradient Descent implementation for Linear Regression.
    
    Parameters:
    -----------
    learning_rate : float, default=0.01
        Learning rate (α) for gradient descent updates
    max_iter : int, default=1000
        Maximum number of iterations
    tol : float, default=1e-6
        Tolerance for convergence (stop if loss change < tol)
    random_state : int, default=42
        Random seed for weight initialization
    
    Attributes:
    -----------
    weights_ : array of shape (n_features + 1,)
        Learned weights including bias term
    loss_history_ : list
        Loss value at each iteration
    n_iter_ : int
        Actual number of iterations performed
    """
    
    def __init__(self, learning_rate=0.01, max_iter=1000, tol=1e-6, random_state=42):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.tol = tol
        self.random_state = random_state
    
    def fit(self, X, y):
        """
        Fit the model using gradient descent.
        
        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            Training data
        y : array-like of shape (n_samples,)
            Target values
        
        Returns:
        --------
        self
        """
        # TODO: Create design matrix Phi by adding column of ones for bias term
        Phi = None
        
        # TODO: Initialize weights randomly with small values (use self.random_state)
        np.random.seed(self.random_state)
        self.weights_ = None
        
        # Initialize loss history
        self.loss_history_ = []
        N = len(y)
        
        # Gradient Descent Loop
        for iteration in range(self.max_iter):
            # TODO: Compute predictions using current weights
            predictions = None
            
            # TODO: Compute errors (residuals)
            errors = None
            
            # TODO: Compute loss (Mean Squared Error)
            loss = None
            
            # Store loss
            self.loss_history_.append(loss)
            
            # Check convergence
            if iteration > 0 and abs(self.loss_history_[-2] - self.loss_history_[-1]) < self.tol:
                self.n_iter_ = iteration + 1
                break
            
            # TODO: Compute gradients using the formula: -(2/N) * Φᵀ(y - ŷ)
            gradients = None
            
            # TODO: Update weights using gradient descent update rule
            pass
        else:
            self.n_iter_ = self.max_iter
        
        return self
    
    def predict(self, X):
        """
        Predict using the learned model.
        
        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            Samples to predict
        
        Returns:
        --------
        y_pred : array of shape (n_samples,)
            Predicted values
        """
        # TODO: Create design matrix (same as in fit)
        Phi = None
        
        # TODO: Compute predictions
        y_pred = None
        
        return y_pred

### Test Your Implementation

Once you have filled in the implementation, let's test our custom gradient descent regressor on a simple dataset.

In [None]:
# Create simple test data
np.random.seed(42)
X_simple = np.array([[1], [2], [3], [4], [5]])
y_simple = np.array([2, 4, 6, 8, 10])  # Perfect linear relationship: y = 2x

# Fit model
model = MyGradientDescentRegressor(learning_rate=0.01, max_iter=1000)
model.fit(X_simple, y_simple)

# Make predictions
predictions = model.predict(X_simple)

print("Learned weights (w0=intercept, w1=slope):", model.weights_)
print("Expected: [0, 2] or very close to it")
print("\nPredictions:", predictions)
print("Actual:     ", y_simple)
print("\nFinal MSE:", model.loss_history_[-1])
print("Expected: very close to 0")
print(f"\nConverged in {model.n_iter_} iterations")

## Visualizing Convergence

Let's plot the loss over iterations to see how the model learned.

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(range(1, len(model.loss_history_) + 1), model.loss_history_, 'b-', linewidth=2)
plt.xlabel('Iteration', fontsize=14)
plt.ylabel('Loss (MSE)', fontsize=14)
plt.title('Gradient Descent Convergence', fontsize=16)
plt.grid(True, alpha=0.3)
plt.show()

print(f"Initial loss: {model.loss_history_[0]:.6f}")
print(f"Final loss:   {model.loss_history_[-1]:.6f}")
print(f"Improvement:  {(1 - model.loss_history_[-1]/model.loss_history_[0])*100:.2f}%")

> **Question**: In a well-tuned gradient descent setup, what should the loss curve look like?
>
> A. Monotonically decreasing at a constant linear rate throughout training until reaching exactly zero loss at the point of convergence completion
>
> B. Decreasing rapidly at first when gradients are large, then gradually slowing and flattening as it approaches the minimum near convergence
>
> C. Fluctuating randomly around a central value with gradually decreasing variance over iterations as the model stabilizes toward the optimal solution
>
> D. Decreasing in distinct discrete steps with plateaus between iterations where no measurable progress occurs before sudden drops to lower loss

<details><summary>Click to reveal answer</summary>

**Correct Answer: B**

**Key Insight:** A healthy loss curve shows fast initial decrease (large gradients far from optimum), gradual slowdown (gradients shrink near minimum), and flattening at convergence (gradient ≈ 0). This "fast then slow" pattern is the signature of successful first-order optimization.

**Explanation:**
- **A is FALSE**: Loss rarely decreases at a constant rate or reaches exactly zero. The rate of decrease depends on the gradient magnitude, which changes as you approach the minimum (gradients get smaller → slower progress). For noisy data, loss plateaus at a positive value (residual error), not zero. A constant-rate decrease would indicate the learning rate isn't being adjusted for the changing gradient landscape. Example: iteration 1 loss=100, iteration 10 loss=90, iteration 20 loss=80 (constant -1 per iteration) - this is unrealistic for real gradient descent.
- **B is TRUE**: A healthy loss curve shows: (1) Rapid decrease initially when gradients are large and weights are far from optimal (e.g., iteration 1-10: loss drops from 100 to 20), (2) Gradual slowdown as gradients become smaller near the minimum (iteration 10-50: loss drops from 20 to 5), (3) Flattening/plateau at convergence when gradient ≈ 0 (iteration 50+: loss stays near 4.8). This "fast then slow" pattern is the signature of successful first-order optimization approaching a local minimum.
- **C is FALSE**: Random fluctuation is characteristic of stochastic gradient descent (SGD) using single samples or small mini-batches, not well-tuned batch gradient descent. For batch GD using all training data, the gradient is deterministic and loss should decrease monotonically. While SGD's fluctuation can help escape shallow local minima, it's not the expected behavior for standard batch GD. SGD would show: iteration 10 loss=50, iteration 11 loss=55, iteration 12 loss=48 (bouncing around).
- **D is FALSE**: Distinct steps with plateaus suggest the learning rate is poorly tuned or there are numerical precision issues. Smooth gradient descent should show continuous progress, not discrete jumps. Step-like behavior might indicate: batch updates (normal for mini-batch GD), learning rate schedules with sudden drops, or gradient clipping thresholds being hit. For well-tuned batch GD, the curve should be smooth, not step-wise.

</details>

## A Dataset for Visualization

Let's work with the same synthetic dataset from the Linear Regression lab to directly compare approaches.

In [None]:
# Generate the same data as in Linear Regression Code Walk Through
np.random.seed(42)
X_train = np.arange(-9.5, 8.5, 0.1).reshape(-1, 1)
y_train = X_train.ravel() + 1 + np.random.normal(0, 2, len(X_train))

print(f"Training data: {len(X_train)} points")
print(f"X range: [{X_train.min():.1f}, {X_train.max():.1f}]")
print(f"y range: [{y_train.min():.1f}, {y_train.max():.1f}]")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, c='lightblue', alpha=0.6, edgecolors='black', linewidths=0.5)
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Training Data: Linear Relationship with Noise', fontsize=16)
plt.grid(True, alpha=0.3)
plt.show()

## Training and Visualizing the Model

In [None]:
# TODO: Fit your MyGradientDescentRegressor on the training data
model = None

print(f"Learned weights: {model.weights_}")
print(f"Model equation: y = {model.weights_[1]:.3f}x + {model.weights_[0]:.3f}")
print(f"Converged in {model.n_iter_} iterations")

In [None]:
# Visualize the fit
x_line = np.linspace(X_train.min(), X_train.max(), 100).reshape(-1, 1)
y_line = model.predict(x_line)

plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, c='lightblue', alpha=0.6, edgecolors='black', linewidths=0.5, label='Training data')
plt.plot(x_line, y_line, 'r-', linewidth=2, label=f'GD fit: y={model.weights_[1]:.2f}x+{model.weights_[0]:.2f}')
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Gradient Descent: Best Fit Line', fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

## Visualizing Convergence Behavior

In [None]:
# Plot loss curve
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(model.loss_history_) + 1), model.loss_history_, 'b-', linewidth=2)
plt.xlabel('Iteration', fontsize=14)
plt.ylabel('Loss (MSE)', fontsize=14)
plt.title('Training Loss Over Iterations', fontsize=16)
plt.grid(True, alpha=0.3)
plt.show()

print(f"Initial loss: {model.loss_history_[0]:.6f}")
print(f"Final loss:   {model.loss_history_[-1]:.6f}")

## Impact of Learning Rate

Let's experiment with different learning rates to see how they affect convergence.

## Diagnosing Convergence from Loss Curves

Learning to read loss curves is a **critical debugging skill** for machine learning practitioners. Here's how to diagnose what's happening with your gradient descent optimization:

### Pattern 1: Healthy Convergence ✅

**What it looks like:**
- Loss decreases rapidly at first (steep decline)
- Gradually slows down (curve flattens)
- Stabilizes to a flat minimum
- Smooth, monotonic decrease (no jumps or spikes)

**Example behavior:**
```
Iteration 1:    Loss = 100.00
Iteration 10:   Loss = 25.00
Iteration 50:   Loss = 5.20
Iteration 100:  Loss = 4.02
Iteration 200:  Loss = 4.00
Iteration 201:  Loss = 4.00  ← Converged!
```

**What to do:**
✅ **Training successful!** Model has converged to optimal solution.
- Can safely stop training
- Current learning rate is well-tuned
- Optionally, try a slightly larger learning rate next time to see whether it converges in fewer iterations

---

### Pattern 2: Too Slow Convergence ⚠️

**What it looks like:**
- Loss decreasing, but very gradually
- Linear decrease that doesn't flatten
- Still decreasing at max_iter
- Never reaches plateau

**Example behavior:**
```
Iteration 1:     Loss = 100.00
Iteration 100:   Loss = 95.00
Iteration 500:   Loss = 90.00
Iteration 1000:  Loss = 85.00  ← Still decreasing!
```

**Diagnosis:**
- Learning rate **too small** for this problem
- Need many more iterations to converge
- Wasting computation time

**What to do:**
1. **Increase α** by 2-10×: Try α = 0.01 → 0.05 or 0.1
2. **Increase max_iter**: Allow 5000-10000 iterations
3. **Check feature scaling**: Unscaled features can cause this!
4. **Monitor:** Plot loss curve to verify improvement

---

### Pattern 3: Oscillating (Unstable) ⚠️

**What it looks like:**
- Loss bounces up and down
- Zigzag pattern around some value
- Never stabilizes
- Average trend might be downward, but very noisy

**Example behavior:**
```
Iteration 1:   Loss = 100.00
Iteration 10:  Loss = 15.00
Iteration 11:  Loss = 25.00  ← Jumped up!
Iteration 12:  Loss = 10.00  ← Dropped again
Iteration 20:  Loss = 18.00
Iteration 30:  Loss = 12.00  ← Bouncing around
```

**Diagnosis:**
- Learning rate **too large**
- Overshooting minimum on both sides
- Steps are bigger than the valley width

**What to do:**
1. **Decrease α** by 2-10×: Try α = 0.5 → 0.1 or 0.05
2. **Check feature scaling**: Unscaled features exacerbate this!
3. **Use learning rate decay**: Start high, reduce over time
4. **Try smaller steps**: α = 0.01 is often a safe starting point
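
Learning rate decay can be sketched with an inverse-time schedule, one of several common choices. The function names, the schedule, and the toy objective f(w) = w² below are assumptions for illustration. On this function a constant α = 1.0 makes the update w ← -w, so the iterate oscillates forever, while the decayed schedule settles.

```python
def inverse_time_decay(alpha0, decay, iteration):
    """alpha_t = alpha0 / (1 + decay * t): large early steps, smaller later."""
    return alpha0 / (1.0 + decay * iteration)

def gd_with_decay(alpha0, decay, w0=1.0, steps=50):
    """Gradient descent on f(w) = w**2 (gradient 2*w) with a decaying step."""
    w = w0
    for t in range(steps):
        w = w - inverse_time_decay(alpha0, decay, t) * 2 * w
    return w

# alpha0 = 1.0 alone bounces between +1 and -1 forever on this function
print("constant alpha=1.0:", gd_with_decay(1.0, decay=0.0))
print("decayed alpha:     ", gd_with_decay(1.0, decay=0.1))
```

Setting `decay=0.0` recovers a constant learning rate, which makes it easy to compare the two behaviours with the same code.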

---

### Pattern 4: Diverging (Exploding) ❌

**What it looks like:**
- Loss **increases** over iterations
- May reach infinity or NaN
- Gets progressively worse
- Model is getting further from optimal solution

**Example behavior:**
```
Iteration 1:   Loss = 100.00
Iteration 10:  Loss = 500.00
Iteration 20:  Loss = 2500.00
Iteration 30:  Loss = inf  ← Exploded!
```

**Diagnosis:**
- Learning rate **way too large**
- Each step jumps far past the minimum
- Gradient explosion (weights become huge)

**What to do:**
1. **Decrease α dramatically**: Try α = 0.5 → 0.01 or even 0.001
2. **Standardize features IMMEDIATELY**: Most common cause!
3. **Check for NaN/Inf in data**: Data quality issues
4. **Reinitialize weights**: Try different random seed

---

### Pattern 5: Stuck at High Loss (Plateau) ⚠️

**What it looks like:**
- Loss plateaus early at suboptimal value
- Flattens out but loss is still high
- No further progress after certain iteration
- Converged, but to wrong solution

**Example behavior:**
```
Iteration 1:    Loss = 100.00
Iteration 10:   Loss = 50.00
Iteration 50:   Loss = 45.00
Iteration 100:  Loss = 45.00  ← Stuck!
Iteration 500:  Loss = 45.00  ← Still stuck!
```

**Possible causes:**
- **Poor initialization:** Weights started in bad region
- **Local minimum:** (Does not occur for linear regression, whose squared-error loss is convex, but happens in neural nets)
- **Learning rate too small:** Can't escape saddle points
- **Feature scaling issues:** Some features dominate

**What to do:**
1. **Re-initialize weights:** Try different `random_state`
2. **Increase α slightly:** Help escape plateaus
3. **Check feature scaling:** Ensure all features are standardized
4. **Try momentum:** Advanced technique to escape saddle points
5. **Verify data quality:** Check for outliers or data issues
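
Momentum is mentioned above only in passing, so here is a minimal heavy-ball sketch. It is an illustration under assumptions: the elongated quadratic bowl, the hyperparameters `alpha=0.01` and `beta=0.9`, and the function names are made up for this example. It demonstrates faster progress along a flat direction of a convex bowl, not an actual saddle point.

```python
import numpy as np

CURVATURE = np.array([1.0, 100.0])   # elongated bowl: f(w) = 0.5 * sum(c * w**2)

def gradient(w):
    return CURVATURE * w

def plain_gd(w0, alpha=0.01, steps=100):
    w = w0.copy()
    for _ in range(steps):
        w = w - alpha * gradient(w)
    return w

def momentum_gd(w0, alpha=0.01, beta=0.9, steps=100):
    """Heavy-ball momentum: the velocity accumulates past gradients."""
    w = w0.copy()
    velocity = np.zeros_like(w)
    for _ in range(steps):
        velocity = beta * velocity - alpha * gradient(w)
        w = w + velocity
    return w

w0 = np.array([1.0, 1.0])
print("plain GD distance to minimum:   ", np.linalg.norm(plain_gd(w0)))
print("momentum GD distance to minimum:", np.linalg.norm(momentum_gd(w0)))
```

With the same learning rate and step budget, the momentum run ends much closer to the minimum at the origin because the accumulated velocity keeps moving along the low-curvature direction where plain gradient descent crawls.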

---

### Pattern 6: Step-Like Decrease (Unusual)

**What it looks like:**
- Loss decreases in distinct steps
- Plateaus between jumps
- Not smooth

**Possible causes:**
- Using mini-batch GD (this is normal for mini-batch!)
- Learning rate schedule with discrete drops
- Numerical precision issues

**What to do:**
- If using mini-batch: This is expected behavior ✅
- If using batch GD: Check learning rate schedule
- Generally not a problem unless loss isn't decreasing overall

---

### Quick Diagnosis Flowchart

```
Is loss DECREASING overall?
├─ YES: Is it SMOOTH and FLATTENING?
│   ├─ YES → ✅ Healthy convergence
│   └─ NO: Is it OSCILLATING/BOUNCING?
│       ├─ YES → ⚠️ Learning rate too large
│       └─ NO: Still decreasing at max_iter?
│           └─ YES → ⚠️ Learning rate too small
└─ NO: Is loss INCREASING?
    ├─ YES → ❌ Learning rate way too large
    └─ NO: Is loss STUCK at high value?
        └─ YES → ⚠️ Poor initialization or local minimum
```

---

### Practical Tips

1. **Always plot loss curves** - Don't rely on final loss value alone
2. **Use log scale** for y-axis when comparing multiple learning rates
3. **Monitor first 10-20 iterations** - Catches divergence early
4. **Expected shape:** Sharp drop → gradual slowdown → flat plateau
5. **Save checkpoints:** Keep best weights seen so far (useful for oscillating loss)

**Remember:** The loss curve tells you the entire story of your optimization. Learn to read it!

In [None]:
# Try different learning rates
learning_rates = [0.001, 0.01, 0.1, 0.5]
colors = ['blue', 'green', 'red', 'orange']

plt.figure(figsize=(12, 6))

for lr, color in zip(learning_rates, colors):
    model_lr = MyGradientDescentRegressor(learning_rate=lr, max_iter=100)
    model_lr.fit(X_train, y_train)
    
    plt.plot(range(1, len(model_lr.loss_history_) + 1), model_lr.loss_history_,
            linewidth=2, color=color, label=f'α = {lr} ({model_lr.n_iter_} iter)')

plt.xlabel('Iteration', fontsize=14)
plt.ylabel('Loss (MSE)', fontsize=14)
plt.title('Impact of Learning Rate on Convergence', fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.yscale('log')  # Log scale to see all curves
plt.show()

print("Observations:")
print("- α = 0.001: Very slow convergence (needs more iterations)")
print("- α = 0.01:  Good convergence speed")
print("- α = 0.1:   Fast convergence")
print("- α = 0.5:   May oscillate or diverge (too large)")

> **Question**: You train gradient descent with α=0.001 for 100 iterations and the loss is still decreasing steadily. What should you do?
>
> A. Reduce the learning rate to α=0.0001 or even smaller to ensure more stable and reliable convergence without oscillations or divergence
>
> B. Increase max_iter to allow more iterations for convergence, or increase α to larger values to converge faster with fewer iterations
>
> C. Stop training immediately since 100 iterations provides sufficient convergence for most practical models and additional iterations provide diminishing returns for accuracy
>
> D. Switch to the normal equation approach which guarantees finding the globally optimal solution faster without requiring iterative optimization or hyperparameter tuning

<details><summary>Click to reveal answer</summary>

**Correct Answer: B**

**Key Insight:** Steadily decreasing loss means gradient descent is working correctly but hasn't converged yet. Solutions: (1) increase max_iter to allow more time, or (2) increase α to take bigger steps and converge faster. Never decrease α when loss is decreasing smoothly - that would make convergence even slower.

**Explanation:**
- **A is FALSE**: The learning rate α=0.001 is already quite small. Decreasing it further to 0.0001 would make convergence even SLOWER, requiring even MORE iterations to reach the minimum. Since loss is steadily decreasing (not oscillating or diverging), there's no stability problem - the optimization is working correctly, just slowly. Reducing α would turn 100 iterations into potentially 1000+ iterations needed. This is moving in the wrong direction!
- **B is TRUE**: Steadily decreasing loss means gradient descent is working correctly but needs more time to converge. Two solutions: (1) Increase max_iter (e.g., to 1000 or 10000) to allow more iterations with the current learning rate, or (2) Increase α (e.g., to 0.01 or 0.1) to take bigger steps and converge faster. The second option is usually cheaper computationally. Example: with α=0.01 instead of 0.001, you might converge in 50 iterations instead of 500.
- **C is FALSE**: There's no universal "sufficient" number of iterations. Required iterations depend on: learning rate, data size, feature scales, initialization, and convergence tolerance. If loss is still decreasing steadily after 100 iterations, the model hasn't converged yet. Stopping now would leave you with a suboptimal solution unnecessarily. Example: stopping at iteration 100 with loss=10 when you could reach loss=4 with more iterations wastes the training effort so far.
- **D is FALSE**: "Loss still decreasing steadily" means gradient descent IS working properly - it just hasn't finished yet. This is not a failure case requiring a different algorithm. The normal equation would give the same final solution but doesn't provide insight into convergence behavior. For small problems, either approach works; switching algorithms mid-optimization is unnecessary and would waste the 100 iterations already completed. Plus, iterative methods such as gradient descent are the only option for non-linear models where no closed form exists.

</details>

## Best Practices for Gradient Descent

This section synthesizes professional-level guidance for applying gradient descent effectively in real-world machine learning projects.

---

### 1. Learning Rate Selection

The learning rate α is the **most critical hyperparameter** in gradient descent.

**Starting Point:**
- For **standardized features**: Start with α = 0.01
- For **unscaled features**: Start with α = 0.0001 (or better, standardize first!)
- Typical working range: 0.001 to 0.1

**Tuning Strategy:**
```python
# Try multiple learning rates
learning_rates = [0.001, 0.01, 0.1, 0.5, 1.0]
for lr in learning_rates:
    model = MyGradientDescentRegressor(learning_rate=lr)
    model.fit(X_train, y_train)
    # Plot loss curves and compare
```

**Advanced Techniques:**
- **Learning rate schedules:** Decay α over time (e.g., α_t = α_0 / (1 + decay × t))
- **Adaptive methods:** Adam, RMSprop, AdaGrad (auto-adjust α per parameter)
- **Warm restarts:** Periodically reset α to escape plateaus
- **Line search:** Find optimal α per iteration (expensive but effective)
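
As a small illustration of the inverse-time decay formula in the first bullet, here is a minimal sketch. The function name `inverse_time_decay` and the values `alpha0 = 0.1` and `decay = 0.01` are assumptions chosen only for the example.

```python
def inverse_time_decay(alpha0, decay, t):
    """Learning rate at iteration t: alpha_t = alpha0 / (1 + decay * t)."""
    return alpha0 / (1.0 + decay * t)

# Illustrative values (assumptions): alpha0 = 0.1, decay = 0.01
for t in (0, 100, 900):
    print(f"t={t:4d}  alpha_t={inverse_time_decay(0.1, 0.01, t):.4f}")
```

With these values the learning rate halves after 100 iterations and falls to roughly a tenth of its starting value after 900, so early steps are large and later steps are cautious.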

**Rule of Thumb:**
- If loss oscillates → α too large → decrease by 2-10×
- If loss decreases slowly → α too small → increase by 2-10×
- Monitor loss curve for first 10-20 iterations to diagnose quickly

---

### 2. Feature Scaling

Feature scaling is **MANDATORY** for gradient descent (optional for normal equation).

**Why it's critical:**
- Unscaled features create elongated loss surfaces → zigzag convergence
- Different features need different learning rates → impossible with one global α
- Large feature values cause gradient explosion → NaN/Inf errors
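
One way to quantify the "elongated loss surface" is the condition number of XᵀX, which is proportional to the Hessian of the squared-error loss. The sketch below uses synthetic data on the same two scales as Experiment 1. The variable names and the helper `hessian_condition_number` are assumptions, and the bias column is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical features on very different scales (0-10 and 0-10000)
X = np.column_stack([rng.uniform(0, 10, 200), rng.uniform(0, 10_000, 200)])

# Z-score standardization done by hand: subtract the mean, divide by the std
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

def hessian_condition_number(X):
    """Condition number of X^T X (ratio of largest to smallest curvature)."""
    return np.linalg.cond(X.T @ X)

print(f"unscaled: {hessian_condition_number(X):.1f}")
print(f"scaled:   {hessian_condition_number(X_scaled):.1f}")
```

A condition number near 1 means the loss surface is close to circular, so a single α suits every direction. A very large value means the valley is narrow and steep in one direction and nearly flat in another.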

**Best Practice: Z-Score Standardization**
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                    # Learn μ and σ from training data ONLY
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)  # Apply same μ and σ
X_test_scaled = scaler.transform(X_test)
```

**CRITICAL:**
- ✅ **Fit scaler on training data ONLY** (avoid data leakage!)
- ✅ **Transform all sets** (train, val, test) using the same scaler
- ✅ **Scale before training** gradient descent
- ❌ **Never fit scaler on val/test data**

**Alternative Scaling Methods:**
- **MinMaxScaler:** Scales to [0, 1] - good for bounded data
- **RobustScaler:** Uses median/IQR - robust to outliers
- **Normalizer:** Scales each sample to unit norm - for text/sparse data

---

### 3. Weight Initialization

Poor initialization can lead to slow convergence or getting stuck in bad regions.

**Best Practice: Small Random Values**
```python
np.random.seed(42)
weights = np.random.randn(n_features) * 0.01  # Small values near 0
```

**Initialization Strategies:**
- **Random small values:** Standard for linear models
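- **Zeros:** ❌ In neural networks this creates a symmetry problem (all neurons in a layer learn the same thing); for linear regression the loss is convex, so zeros are harmless
- **Xavier/Glorot:** For deep networks: `w ~ N(0, σ²)` with `σ = sqrt(2/(n_in + n_out))`
- **He initialization:** For ReLU networks: `w ~ N(0, σ²)` with `σ = sqrt(2/n_in)`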

**Why small values?**
- Large initial weights → large initial gradients → potential divergence
- Too small → very slow initial learning
- ~0.01 is good starting point

---

### 4. Convergence Criteria

Decide when to stop training to avoid wasted computation.

**Method 1: Maximum Iterations**
```python
max_iter = 1000  # Conservative default
# For large/complex problems, try 5000-10000
```

**Method 2: Tolerance on Loss Change**
```python
if abs(loss[t] - loss[t-1]) < tol:  # e.g., tol=1e-6
    break  # Converged!
```

**Method 3: Gradient Norm**
```python
if np.linalg.norm(gradient) < tol:  # e.g., tol=1e-6
    break  # At critical point
```

**Method 4: Early Stopping (for ML models with validation set)**
```python
if epochs_without_improvement >= 10:  # val_loss hasn't improved for 10 epochs
    break  # Prevent overfitting
```
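
The early-stopping rule in Method 4 can be sketched as a small helper that scans a list of validation losses. The function name `early_stopping_epoch` and the `min_delta` parameter are assumptions for illustration. In a real training loop the same counter would be updated once per epoch.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation loss has failed to beat its best
    value by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None
```

In practice you would also keep a copy of the weights from the best epoch and restore them after stopping.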

**Recommendation:**
- Use **both** max_iter AND tolerance for robustness
- Set reasonable max_iter (don't run forever if not converging)
- Monitor loss curve to verify convergence
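
The recommendation to combine `max_iter` with a tolerance can be sketched as follows. This is a minimal batch gradient descent on mean squared error, not the lab's `MyGradientDescentRegressor`. The function name, the `seed` parameter, and the choice of MSE (rather than SSE) are assumptions made for the example.

```python
import numpy as np

def gradient_descent(Phi, y, learning_rate=0.01, max_iter=1000, tol=1e-6, seed=42):
    """Batch GD on MSE that stops on max_iter, loss change, or gradient norm."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = Phi.shape
    w = rng.normal(scale=0.01, size=n_features)  # small random initialization
    loss_history = []
    converged = False
    for n_iter in range(1, max_iter + 1):
        residual = y - Phi @ w
        loss_history.append(np.mean(residual ** 2))
        gradient = -2.0 / n_samples * (Phi.T @ residual)

        # Method 2: the loss barely changed since the previous iteration
        small_change = (len(loss_history) > 1
                        and abs(loss_history[-2] - loss_history[-1]) < tol)
        # Method 3: the gradient is (almost) zero
        small_gradient = np.linalg.norm(gradient) < tol
        if small_change or small_gradient:
            converged = True
            break
        w = w - learning_rate * gradient
    return w, loss_history, n_iter, converged
```

Returning the `converged` flag makes it easy to tell a run that met the tolerance from one that simply ran out of iterations.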

---

### 5. Monitoring and Logging

Always track optimization progress for debugging and analysis.

**Essential Metrics to Log:**
```python
history = {
    'loss': [],           # Loss at each iteration
    'gradients': [],      # Gradient norms (for debugging)
    'learning_rate': [],  # If using decay/schedules
    'iteration': []
}
```

**Visualization:**
```python
# Loss curve (most important!)
plt.plot(history['loss'])
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.yscale('log')  # Log scale helpful for comparing rates

# Gradient norms (check for explosion/vanishing)
plt.plot(history['gradients'])
```

**Red Flags to Watch:**
- ❌ Loss increasing → α too large
- ❌ Loss = NaN/Inf → Gradient explosion (scale features!)
- ❌ Loss plateaus early at high value → Poor initialization or stuck
- ✅ Loss decreasing and flattening → Healthy convergence

---

### 6. When Gradient Descent Fails

Understand limitations and failure modes to troubleshoot effectively.

**Problem 1: Non-Convex Landscapes**
- Neural networks have many local minima
- **Solution:** Use stochastic/mini-batch GD (noise helps escape), momentum, or adaptive methods

**Problem 2: Saddle Points**
- Points where gradient = 0 but not a minimum
- Common in high-dimensional spaces
- **Solution:** Momentum methods (accumulate velocity to push through)

**Problem 3: Ill-Conditioned Problems**
- Loss surface has very different curvatures in different directions
- One learning rate can't work for all directions
- **Solution:** Feature scaling, adaptive methods (Adam), or second-order methods

**Problem 4: Vanishing/Exploding Gradients**
- Gradients become too small (learning stops) or too large (divergence)
- Common in deep networks
- **Solution:** Proper initialization, gradient clipping, batch normalization
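
Gradient clipping, one of the remedies listed for exploding gradients, can be sketched in a few lines. The helper name `clip_gradient` and the default `max_norm=1.0` are assumptions for illustration.

```python
import numpy as np

def clip_gradient(gradient, max_norm=1.0):
    """Rescale the gradient so that its L2 norm is at most max_norm."""
    norm = np.linalg.norm(gradient)
    if norm > max_norm:
        return gradient * (max_norm / norm)
    return gradient

# A gradient of norm 5 is scaled down to norm 1; its direction is unchanged
clipped = clip_gradient(np.array([3.0, 4.0]), max_norm=1.0)
```

Clipping bounds the step length without changing the step direction, so it limits the damage from an occasional huge gradient but does not fix a learning rate that is too large overall.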

**Problem 5: Poor Feature Scaling**
- **Most common cause** of gradient descent failure!
- **Solution:** ALWAYS standardize features first!

---

### 7. Advanced Techniques

Once you master basic gradient descent, explore these enhancements:

#### Momentum
Accumulates gradient history to smooth updates and accelerate convergence.
```python
velocity = 0.9 * velocity + learning_rate * gradient
weights = weights - velocity
```
**Benefits:** Faster convergence, dampens oscillations, escapes plateaus
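
To see the effect on a toy problem, the sketch below minimizes an ill-conditioned quadratic with and without momentum, using the same update as above. The function `run_gd`, the curvatures `[1, 100]`, and the hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

def run_gd(curvatures, learning_rate, beta, n_steps=100):
    """Minimize 0.5 * sum(c_i * w_i**2); beta=0 gives plain gradient descent."""
    c = np.asarray(curvatures, dtype=float)
    w = np.ones_like(c)
    velocity = np.zeros_like(c)
    for _ in range(n_steps):
        gradient = c * w
        velocity = beta * velocity + learning_rate * gradient
        w = w - velocity
    return 0.5 * np.sum(c * w ** 2)  # final loss

plain_loss = run_gd([1.0, 100.0], learning_rate=0.01, beta=0.0)
momentum_loss = run_gd([1.0, 100.0], learning_rate=0.01, beta=0.9)
print(f"plain GD: {plain_loss:.5f}   with momentum: {momentum_loss:.5f}")
```

With the learning rate capped by the steep direction, plain gradient descent crawls along the flat direction. The accumulated velocity lets momentum cover that direction much faster, so it ends at a lower loss after the same number of steps.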

#### Adaptive Learning Rates
Methods that adjust learning rate automatically per parameter.

| Method | Key Idea | Use Case |
|--------|----------|----------|
| **AdaGrad** | Larger updates for infrequent features | Sparse data, NLP |
| **RMSprop** | Exponential moving average of squared gradients | RNNs, non-stationary problems |
| **Adam** | Momentum + RMSprop | Default for deep learning (most popular!) |

**Adam is the industry standard** for deep learning - combines best of both worlds.
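
A minimal sketch of a single Adam update, written from the standard published update rule with its usual default hyperparameters. The function name `adam_step` and the toy objective are assumptions. In practice you would use a library optimizer rather than this hand-rolled version.

```python
import numpy as np

def adam_step(w, gradient, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * gradient        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * gradient ** 2   # second moment (RMSprop-style)
    m_hat = m / (1 - beta1 ** t)                  # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem (assumption): minimize f(w) = sum(w**2), whose gradient is 2w
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, learning_rate=0.01)
```

Dividing by `sqrt(v_hat)` is what gives each parameter its own effective step size, and the `m` term supplies the momentum.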

#### Learning Rate Schedules
Reduce α over time for better convergence.
- **Step decay:** Reduce α by 10× every N epochs
- **Exponential decay:** α_t = α_0 * e^(-kt)
- **Cosine annealing:** α_t = α_min + 0.5(α_max - α_min)(1 + cos(πt/T))
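
The three schedules translate directly into code. The function names and default values below (`drop=0.1`, `epochs_per_drop=10`, `k=0.01`) are illustrative assumptions.

```python
import math

def step_decay(alpha0, epoch, drop=0.1, epochs_per_drop=10):
    """Multiply alpha by `drop` once every `epochs_per_drop` epochs."""
    return alpha0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(alpha0, t, k=0.01):
    """alpha_t = alpha0 * exp(-k * t)."""
    return alpha0 * math.exp(-k * t)

def cosine_annealing(alpha_min, alpha_max, t, T):
    """alpha_t = alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + cos(pi * t / T))."""
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + math.cos(math.pi * t / T))
```

Cosine annealing starts at `alpha_max` when `t = 0` and reaches `alpha_min` when `t = T`, passing through the midpoint of the two at `t = T / 2`.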

#### Line Search
Find optimal step size per iteration (expensive but effective).
```python
# Instead of a fixed α, search for the best α at each iteration (pseudocode):
# α_optimal = argmin_α Loss(w - α * gradient)
```
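
For linear regression with SSE loss the line search has a closed form. Writing the gradient as g = -2Φᵀ(y - Φw), the loss along the ray w - αg is a quadratic in α, minimized at α* = gᵀg / (2‖Φg‖²). The sketch below applies that step. The function name `exact_line_search_step` is an assumption, and for general losses a backtracking search would be used instead.

```python
import numpy as np

def exact_line_search_step(Phi, y, w):
    """One steepest-descent step on SSE loss using the optimal step size."""
    gradient = -2.0 * Phi.T @ (y - Phi @ w)
    Phi_g = Phi @ gradient
    denom = 2.0 * (Phi_g @ Phi_g)
    if denom == 0.0:  # zero gradient: already at a critical point
        return w, 0.0
    alpha = (gradient @ gradient) / denom
    return w - alpha * gradient, alpha
```

Each call returns the updated weights together with the step size it chose, so no learning rate has to be tuned by hand.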

---

### 8. Gradient Descent Checklist

Before training with gradient descent, verify:

- ✅ **Features are standardized** (StandardScaler fitted on training data)
- ✅ **Learning rate is reasonable** (start with 0.01)
- ✅ **Weights initialized** with small random values
- ✅ **Max iterations set** (1000-10000 depending on problem)
- ✅ **Convergence tolerance set** (e.g., 1e-6)
- ✅ **Loss history tracked** for visualization
- ✅ **Random seed set** for reproducibility

During training, monitor:

- ✅ **Loss curve** - Should decrease and flatten
- ✅ **First 10 iterations** - Catch divergence early
- ✅ **Convergence message** - Verify it actually converged
- ✅ **Final loss value** - Compare with expected range

After training, validate:

- ✅ **Compare with closed-form solution** (for linear regression)
- ✅ **Check learned weights** - Do they make sense?
- ✅ **Verify predictions** - Test on held-out data
- ✅ **Inspect residuals** - Look for patterns (suggests model issues)
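
The closed-form comparison can be sketched on synthetic data as follows. The data, the true weights `[1, 2, -3]`, and the hyperparameters are assumptions for illustration. `np.linalg.lstsq` is used for the closed-form solution because it is numerically safer than forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=(200, 2))                   # features already on a unit scale
Phi = np.column_stack([np.ones(len(x)), x])     # prepend the bias column
y = Phi @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=len(x))

# Closed-form least-squares solution
w_closed, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Plain batch gradient descent on MSE from a small random start
w_gd = rng.normal(scale=0.01, size=Phi.shape[1])
for _ in range(2000):
    gradient = -2.0 / len(y) * (Phi.T @ (y - Phi @ w_gd))
    w_gd -= 0.05 * gradient

max_diff = np.abs(w_gd - w_closed).max()
print(f"max |w_gd - w_closed| = {max_diff:.2e}")
```

If the difference is not close to zero, the usual suspects are an error in the gradient, too few iterations, or a learning rate that is too small.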

---

### 9. Comparing Gradient Descent Variants

| Variant | Samples/Iteration | Memory | Speed/Iter | Convergence | Best For |
|---------|------------------|---------|------------|-------------|----------|
| **Batch GD** | All N | O(N×d) | Slow | Smooth, stable | Small data (N<10k) |
| **Stochastic GD** | 1 | O(d) | Very fast | Noisy, erratic | Online learning |
| **Mini-Batch GD** | B (32-256) | O(B×d) | Fast | Smooth-ish | Large data, deep learning |

**Industry Practice:**
- **Small datasets (N < 10,000):** Use batch GD or normal equation
- **Large datasets (N > 100,000):** Use mini-batch GD with B=32-256
- **Online learning (streaming data):** Use stochastic GD with learning rate decay
- **Deep learning:** Mini-batch GD with Adam optimizer (almost universal)
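
The three variants differ only in how many samples feed each gradient, which a single function can show. The sketch below is a minimal version for linear regression on MSE. The function name `minibatch_gd` and its defaults are assumptions, and real pipelines would stream batches from disk rather than index an in-memory array.

```python
import numpy as np

def minibatch_gd(Phi, y, learning_rate=0.01, batch_size=32, n_epochs=50, seed=0):
    """Mini-batch GD on MSE; batch_size=1 is SGD, batch_size=N is batch GD."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = Phi.shape
    w = rng.normal(scale=0.01, size=n_features)
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)      # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            residual = y[idx] - Phi[idx] @ w
            gradient = -2.0 / len(idx) * (Phi[idx].T @ residual)
            w -= learning_rate * gradient
    return w
```

Setting `batch_size=len(y)` recovers batch GD with one update per epoch, while `batch_size=1` gives stochastic GD with N noisy updates per epoch.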

---

### 10. Summary: Gradient Descent in Practice

**Key Takeaways:**

1. **Gradient Descent = Foundation of Modern ML**
   - Only practical option for neural networks
   - Scales to billions of parameters
   - Works for any differentiable loss function

2. **Feature Scaling is Non-Negotiable**
   - **#1 cause** of gradient descent failure
   - Always use StandardScaler (fit on training data only!)
   - Enables faster, more stable convergence

3. **Learning Rate is Critical**
   - Start with α = 0.01 for scaled features
   - Monitor loss curve to diagnose issues
   - Too large → divergence, too small → slow convergence

4. **Monitor Convergence**
   - Always plot loss curves
   - Check first 10-20 iterations for early warning signs
   - Use both max_iter and tolerance for robustness

5. **Advanced Methods Help**
   - Adam optimizer for deep learning
   - Momentum for faster convergence
   - Learning rate schedules for fine-tuning

**When to Use Gradient Descent:**
- ✅ Neural networks (the only practical option)
- ✅ Large datasets (N > 100,000)
- ✅ No closed-form solution exists
- ✅ Online/streaming data

**When NOT to Use:**
- ❌ Small linear regression (use normal equation)
- ❌ Need exact solution in one step
- ❌ Can afford second-order methods

**The Bottom Line:**
Gradient descent is the workhorse of modern machine learning. Master its fundamentals (learning rate, feature scaling, convergence monitoring) and you'll be equipped to train everything from simple linear models to state-of-the-art deep neural networks.

In [None]:
# TODO: Experiment 3 - sklearn Validation
# Compare your implementation with sklearn's SGDRegressor

from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Use the scaled data from Experiment 1
# If you haven't run Experiment 1, uncomment and run this:
# from sklearn.preprocessing import StandardScaler
# np.random.seed(42)
# X_temp = np.c_[np.random.uniform(0, 10, 200), np.random.uniform(0, 10000, 200)]
# y_temp = 2*X_temp[:, 0] + 0.001*X_temp[:, 1] + np.random.normal(0, 5, 200)
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X_temp)
# y_mixed = y_temp

# TODO: Train YOUR custom implementation
# Use the best learning rate from Experiment 2 (or 0.01 if you skipped Experiment 2)
# Set max_iter=1000, tol=1e-6
my_model = None

print("="*80)
print("EXPERIMENT 3 RESULTS: Your Implementation vs sklearn")
print("="*80)

print("\n**Your Implementation:**")
print(f"  Learned weights: {my_model.weights_}")
print(f"  Iterations: {my_model.n_iter_}")
print(f"  Final loss: {my_model.loss_history_[-1]:.6f}")

# TODO: Train sklearn's SGDRegressor
# Parameters to match your implementation:
#   - learning_rate='constant'  # Use constant learning rate like yours
#   - eta0=0.01                 # Initial learning rate (same as yours)
#   - max_iter=1000            # Maximum iterations
#   - tol=1e-6                 # Convergence tolerance
#   - penalty=None             # No regularization (to match your implementation)
#   - random_state=42          # For reproducibility
# Hint: sklearn_model = SGDRegressor(params...)
sklearn_model = None

# sklearn stores intercept and coef separately, combine them like your implementation
sklearn_weights = np.concatenate([[sklearn_model.intercept_[0]], sklearn_model.coef_])

print("\n**sklearn's SGDRegressor:**")
print(f"  Learned weights: {sklearn_weights}")
print(f"  Iterations: {sklearn_model.n_iter_}")

# TODO: Compute sklearn's loss on training data
# Use mean_squared_error
sklearn_predictions = None
sklearn_loss = None

print(f"  Final loss: {sklearn_loss:.6f}")

# Compare weights
weight_diff = np.abs(my_model.weights_ - sklearn_weights)
print("\n**Comparison:**")
print(f"  Weight difference (absolute): {weight_diff}")
print(f"  Max weight difference: {weight_diff.max():.6f}")
print(f"  Loss difference: {abs(my_model.loss_history_[-1] - sklearn_loss):.6f}")

# Determine if implementation is correct
if weight_diff.max() < 0.1:
    print("\n✅ **Validation PASSED!** Your implementation matches sklearn closely!")
    print("   Your gradient descent implementation is correct! 🎉")
elif weight_diff.max() < 1.0:
    print("\n⚠️ **Validation ACCEPTABLE.** Small differences due to implementation details.")
    print("   This is normal - sklearn uses additional optimizations.")
else:
    print("\n❌ **Validation FAILED.** Significant difference detected.")
    print("   Check your gradient computation and weight update logic.")

# Visualize loss curves comparison
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(my_model.loss_history_, linewidth=2, label='Your Implementation', color='blue')
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Loss (MSE)', fontsize=12)
plt.title('Your Implementation - Loss Curve', fontsize=14)
plt.grid(True, alpha=0.3)
plt.legend()

plt.subplot(1, 2, 2)
# Plot final predictions comparison
y_pred_my = my_model.predict(X_scaled)
y_pred_sklearn = sklearn_model.predict(X_scaled)

plt.scatter(y_mixed, y_pred_my, alpha=0.5, label='Your Model', color='blue')
plt.scatter(y_mixed, y_pred_sklearn, alpha=0.5, label='sklearn', color='red', marker='x')
plt.plot([y_mixed.min(), y_mixed.max()], [y_mixed.min(), y_mixed.max()], 
         'k--', linewidth=2, label='Perfect Prediction')
plt.xlabel('True Values', fontsize=12)
plt.ylabel('Predicted Values', fontsize=12)
plt.title('Predictions Comparison', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 **Key Insight:** Your custom gradient descent implementation produces results")
print("   very similar to sklearn's professional implementation, validating your understanding!")

### Experiment 3: Validation with sklearn

**Goal:** Compare your custom implementation with sklearn's professional SGDRegressor to verify correctness.

**What you'll observe:**
- Similar final weights (proves your implementation is correct!)
- Similar loss values
- Possibly different iterations (sklearn uses advanced optimizations)

In [None]:
# TODO: Experiment 2 - Learning Rate Tuning
# Test multiple learning rates to find the optimal value

# Use the scaled data from Experiment 1
# If you haven't run Experiment 1, uncomment and run this:
# from sklearn.preprocessing import StandardScaler
# np.random.seed(42)
# X_temp = np.c_[np.random.uniform(0, 10, 200), np.random.uniform(0, 10000, 200)]
# y_temp = 2*X_temp[:, 0] + 0.001*X_temp[:, 1] + np.random.normal(0, 5, 200)
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X_temp)
# y_mixed = y_temp

# TODO: Define a list of learning rates to test
# Try: [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0]
learning_rates_to_test = None

# Dictionary to store results
results = {}

# TODO: Train a model for each learning rate
# For each α in learning_rates_to_test:
#   1. Create a MyGradientDescentRegressor with that learning rate
#   2. Fit it on X_scaled and y_mixed
#   3. Store the model, iterations, and final loss in results dict
# Hint: results[lr] = {'model': model, 'n_iter': model.n_iter_, 'loss': model.loss_history_[-1]}

for lr in None:  # Replace None with your learning_rates list
    pass  # TODO: Implement the training loop here

# Display results in a table
print("="*80)
print("EXPERIMENT 2 RESULTS: Learning Rate Tuning")
print("="*80)
print(f"\n{'Learning Rate':<15} {'Iterations':<12} {'Final Loss':<15} {'Status':<20}")
print("-" * 80)

for lr in learning_rates_to_test:
    n_iter = results[lr]['n_iter']
    final_loss = results[lr]['loss']
    
    # Determine status
    if np.isnan(final_loss) or np.isinf(final_loss):
        status = "❌ Diverged (NaN/Inf)"
    elif n_iter >= 1000:
        status = "⚠️ Slow (max iter)"
    elif final_loss > 100:
        status = "❌ Too large (high loss)"
    else:
        status = "✅ Converged"
    
    print(f"{lr:<15.4f} {n_iter:<12} {final_loss:<15.6f} {status:<20}")

# TODO: Find the best learning rate (lowest final loss among converged models)
# Hint: Filter out NaN/Inf, then find min loss
best_lr = None
print(f"\n🎯 **Best Learning Rate:** α = {best_lr}")
print(f"   Converged in {results[best_lr]['n_iter']} iterations")
print(f"   Final loss: {results[best_lr]['loss']:.6f}")

# Visualize all loss curves
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
for lr in learning_rates_to_test:
    if not (np.isnan(results[lr]['loss']) or np.isinf(results[lr]['loss'])):
        plt.plot(results[lr]['model'].loss_history_, 
                linewidth=2, label=f'α={lr}', alpha=0.8)

plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Loss (MSE)', fontsize=12)
plt.title('Learning Rate Comparison - All Iterations', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')

plt.subplot(1, 2, 2)
# Plot first 50 iterations for detail
for lr in learning_rates_to_test:
    if not (np.isnan(results[lr]['loss']) or np.isinf(results[lr]['loss'])):
        history = results[lr]['model'].loss_history_
        plt.plot(range(min(50, len(history))), history[:50], 
                linewidth=2, label=f'α={lr}', alpha=0.8)

plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Loss (MSE)', fontsize=12)
plt.title('First 50 Iterations (Detail)', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')

plt.tight_layout()
plt.show()

print("\n💡 **Key Insights:**")
print("   - Too small α (0.0001): Very slow, needs many iterations")
print("   - Optimal α (0.01-0.1): Fast, smooth convergence")
print("   - Too large α (>0.5): Oscillates or diverges")
print("   - Always tune α based on loss curve behavior!")

### Experiment 2: Learning Rate Tuning

**Goal:** Systematically test different learning rates to find the optimal α and understand the learning rate-convergence relationship.

**What you'll observe:**
- Too small α → very slow convergence
- Optimal α → fast, smooth convergence
- Too large α → oscillation or divergence

In [None]:
# TODO: Experiment 1 - Feature Scaling Impact
# Create a dataset with features at different scales to demonstrate why scaling matters

from sklearn.preprocessing import StandardScaler

# Create synthetic data with two features at vastly different scales
np.random.seed(42)
n_samples = 200

# Feature 1: small scale (0-10)
# Feature 2: large scale (0-10000)
# TODO: Create X_unscaled with shape (n_samples, 2)
#   - First column: random values between 0 and 10
#   - Second column: random values between 0 and 10000
# Hint: Use np.random.uniform(low, high, size)
X_unscaled = None

# TODO: Create target variable y_mixed that depends on both features
# Formula: y = 2*feature1 + 0.001*feature2 + noise
# Hint: noise = np.random.normal(0, 5, n_samples)
y_mixed = None

print("Dataset created:")
print(f"Feature 1 range: [{X_unscaled[:, 0].min():.2f}, {X_unscaled[:, 0].max():.2f}]")
print(f"Feature 2 range: [{X_unscaled[:, 1].min():.2f}, {X_unscaled[:, 1].max():.2f}]")
print(f"Feature scale ratio: {X_unscaled[:, 1].max() / X_unscaled[:, 0].max():.0f}:1")

# TODO: Train model on UNSCALED data
# Use learning_rate=0.01, max_iter=1000
model_unscaled = None

# TODO: Standardize the features using StandardScaler
# Remember: fit on the data, then transform
scaler = StandardScaler()
# Fit and transform
X_scaled = None

print(f"\nAfter scaling:")
print(f"Feature 1 mean: {X_scaled[:, 0].mean():.4f}, std: {X_scaled[:, 0].std():.4f}")
print(f"Feature 2 mean: {X_scaled[:, 1].mean():.4f}, std: {X_scaled[:, 1].std():.4f}")

# TODO: Train model on SCALED data
# Use the same learning_rate=0.01, max_iter=1000
model_scaled = None

# Compare results
print("\n" + "="*70)
print("EXPERIMENT 1 RESULTS: Feature Scaling Impact")
print("="*70)
print(f"\n**Unscaled Features:**")
print(f"  Iterations to converge: {model_unscaled.n_iter_}")
print(f"  Final loss: {model_unscaled.loss_history_[-1]:.6f}")
print(f"  Learned weights: {model_unscaled.weights_}")

print(f"\n**Scaled Features:**")
print(f"  Iterations to converge: {model_scaled.n_iter_}")
print(f"  Final loss: {model_scaled.loss_history_[-1]:.6f}")
print(f"  Learned weights: {model_scaled.weights_}")

print(f"\n**Speedup:** {model_unscaled.n_iter_ / model_scaled.n_iter_:.1f}× faster with scaling!")

# Visualize convergence comparison
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(model_unscaled.loss_history_, linewidth=2, label='Unscaled', color='red')
plt.plot(model_scaled.loss_history_, linewidth=2, label='Scaled', color='green')
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Loss (MSE)', fontsize=12)
plt.title('Convergence Comparison', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')

plt.subplot(1, 2, 2)
# Plot first 100 iterations for detail
max_iter_plot = min(100, len(model_unscaled.loss_history_))
plt.plot(range(max_iter_plot), model_unscaled.loss_history_[:max_iter_plot], 
         linewidth=2, label='Unscaled', color='red')
plt.plot(range(max_iter_plot), model_scaled.loss_history_[:max_iter_plot], 
         linewidth=2, label='Scaled', color='green')
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Loss (MSE)', fontsize=12)
plt.title('First 100 Iterations (Detail)', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 **Key Insight:** Feature scaling is ESSENTIAL for gradient descent!")
print("   Without scaling, convergence is dramatically slower (or may not converge at all).")

## Hands-On Experiments

Now that you understand the theory, let's run three experiments to see these concepts in action! These experiments will help you develop intuition for how gradient descent behaves in different scenarios.

### Experiment 1: Feature Scaling Impact

**Goal:** Compare convergence with and without feature scaling to see why standardization is ESSENTIAL for gradient descent.

**What you'll observe:**
- Unscaled features → very slow, zigzag convergence
- Scaled features → fast, direct convergence
- Dramatic difference in iterations needed

> **Question**: You have a dataset with 10 million training examples and want to train a linear regression model. Which optimization approach is MOST practical?
>
> A. Normal equation using closed-form solution computing the exact optimal weights in one step through matrix inversion for guaranteed global optimum
>
> B. Batch gradient descent processing all 10 million samples in each iteration to compute exact gradients for stable and reliable convergence
>
> C. Mini-batch gradient descent with batches of 128-256 samples for computational efficiency, memory feasibility, and good gradient estimation with stability
>
> D. Stochastic gradient descent using exactly one randomly selected sample per iteration for maximum computational speed and fastest possible weight updates

<details><summary>Click to reveal answer</summary>

**Correct Answer: C**

**Key Insight:** For large datasets (N > 100,000), mini-batch gradient descent is the practical choice. It balances computational efficiency (fast iterations), memory usage (only B samples in memory), and convergence stability (better gradient estimates than pure SGD). This is why deep learning uses mini-batch GD almost exclusively.

**Explanation:**
- **A is FALSE**: The normal equation requires computing (ΦᵀΦ)⁻¹Φᵀy, where Φ is 10M × d. This involves: (1) Matrix multiplication O(Nd²) ≈ billions of operations, (2) Matrix inversion O(d³), (3) Storing a 10M × d matrix in memory (potentially gigabytes). For N=10M, this is extremely slow and memory-intensive. Even on modern hardware, storing and manipulating a matrix with 10 million rows is impractical. The normal equation doesn't scale to large datasets.
- **B is FALSE**: Batch GD must process all 10 million samples in EACH iteration to compute one gradient. Even if each iteration takes 10 seconds and you need 100 iterations, that's 1000 seconds (about 17 minutes) of computation. Additionally, loading all 10M samples into memory simultaneously is impractical (it could require gigabytes of RAM). Full batch GD doesn't scale to large datasets - it's too slow per iteration and too memory-intensive.
- **C is TRUE**: Mini-batch GD with batch size B=128 means: (1) Only 128 samples in memory at once (feasible even on laptops), (2) Fast iterations (~milliseconds each on GPU), (3) Can stream data from disk in chunks, (4) Good gradient estimates with much less computation than full batch. With B=128 and N=10M, each epoch is ~78,000 mini-batches, but each is very fast. This is standard for large-scale ML and deep learning.
- **D is FALSE**: While SGD (B=1) has very fast iterations, it's TOO noisy for stable convergence on 10M samples. The gradient from a single sample is a poor estimate of the true gradient direction, requiring many epochs (potentially hundreds) and careful learning rate decay to converge. Mini-batch (B=128-256) provides better balance: more stable than pure SGD (less noise), much faster than batch GD. Example: SGD might need 500 epochs vs 50 epochs for mini-batch to reach same loss.

**Numerical Comparison:**
```
Batch GD (N=10M):
  - Time per iteration: 10 seconds
  - Iterations needed: 100
  - Total time: 1000 seconds
  - Memory: All 10M samples (~10GB)

Mini-batch GD (B=256):
  - Time per iteration: 0.0001 seconds
  - Iterations per epoch: ~39,000
  - Epochs needed: 50
  - Total iterations: 1,950,000
  - Total time: 195 seconds (~5× faster!)
  - Memory: 256 samples at a time (~256KB)
```

</details>

## Summary and Best Practices

### Key Takeaways

1. **Gradient Descent is an iterative optimization algorithm**
   - Updates weights in direction that reduces loss: w = w - α∇w L
   - Converges to optimal solution through many small steps
   - Foundation for training neural networks and deep learning

2. **Learning rate (α) is critical**
   - Too small → slow convergence (many iterations needed)
   - Too large → divergence (loss increases or oscillates)
   - Typical range: 0.001 to 0.1
   - Monitor loss curve to diagnose issues

3. **Feature scaling is ESSENTIAL for gradient descent**
   - Unscaled features cause slow/unstable convergence
   - Different features need different step sizes → impossible with one α
   - Always use StandardScaler or MinMaxScaler
   - Fit scaler on training data ONLY!

4. **Convergence monitoring**
   - Plot loss over iterations
   - Healthy curve: decreasing and flattening
   - Stop when loss change < tolerance
   - Early stopping prevents wasted computation

5. **Gradient descent vs Normal equation**
   - Normal equation: Fast for small data, exact solution, no tuning
   - Gradient descent: Scales to large data, works for any model, needs tuning
   - Both converge to same solution for linear regression

### When to Use Gradient Descent

✅ **Use Gradient Descent when:**
- Training neural networks (no closed-form solution exists)
- Dataset is very large (N > 100,000)
- Online learning (data arrives in streams)
- Need mini-batch or stochastic variants
- Working with distributed systems (can parallelize mini-batches)

❌ **Use Normal Equation when:**
- Small-medium dataset (N < 10,000)
- Few features (d < 1,000)
- Want exact solution without tuning
- Simple linear regression

### Gradient Descent Variants Summary

| Variant | Samples/Iter | Best For |
|---------|--------------|----------|
| **Batch GD** | All N | Small datasets, smooth convergence |
| **Stochastic GD** | 1 | Online learning, escaping local minima |
| **Mini-Batch GD** | 32-256 | Large datasets, deep learning (MOST COMMON) |
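
The three variants differ only in how many samples feed each update, so a single loop with a `batch_size` parameter can express all of them. The sketch below is one possible implementation, not the lab's reference class: the function name `minibatch_gd`, the noise-free synthetic data, and the choice to average the gradient over the batch are assumptions made for illustration.

```python
import numpy as np

def minibatch_gd(Phi, y, lr=0.05, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent for linear regression.

    batch_size=1 corresponds to stochastic GD and batch_size=len(y) to batch GD.
    The gradient is averaged over the batch (an assumption made here so the
    step size does not scale with the batch size).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = y[idx] - Phi[idx] @ w
            grad = -2 * Phi[idx].T @ residual / len(idx)
            w = w - lr * grad
    return w

# Noise-free synthetic data (illustrative only), so the exact weights are known
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
Phi = np.column_stack([np.ones(200), X])
w_true = np.array([1.0, 2.0, -3.0])
y = Phi @ w_true

print(minibatch_gd(Phi, y))  # should be close to w_true
```

Other batch sizes will generally need a different learning rate or number of epochs to reach the same accuracy.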

### Best Practices Checklist

- ✅ Always standardize features using StandardScaler
- ✅ Fit scaler on training data only (avoid data leakage)
- ✅ Start with learning rate α = 0.01, adjust based on loss curve
- ✅ Monitor loss over iterations to diagnose convergence
- ✅ Use early stopping to prevent wasted computation
- ✅ Initialize weights with small random values
- ✅ For large data, use mini-batch variant (B=32-256)
- ✅ Compare with closed-form solution when possible (validation)
- ✅ Use learning rate decay for better convergence (advanced)
- ✅ Visualize loss curves to understand convergence behavior
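
The "fit on training data only" rule is easiest to see when the scaling is written out by hand. The sketch below uses plain NumPy on made-up features (the values and the 80/20 split are assumptions for illustration); scikit-learn's `StandardScaler` follows the same pattern, with `fit` on the training split and `transform` on both splits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw features on very different scales (e.g. floor area, room count)
X = np.column_stack([rng.normal(150, 40, size=100), rng.integers(1, 6, size=100)])
X_train, X_test = X[:80], X[80:]

# "Fit": compute the statistics from the training split only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# "Transform": apply the training statistics to both splits
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_train_scaled.std(axis=0))   # approximately 1 for each feature
```

The test split will not have exactly zero mean and unit variance after scaling, and that is expected: its statistics were never seen during fitting.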

### Debugging Gradient Descent

| Problem | Likely Cause | Solution |
|---------|--------------|----------|
| Loss increasing | α too large | Decrease learning rate by 10× |
| Loss oscillating | α too large | Decrease learning rate |
| Very slow convergence | α too small OR unscaled features | Increase α or standardize features |
| Loss stuck at high value | Poor initialization OR bad α | Try different random seed or α |
| NaN/Inf values | Gradient explosion | Standardize features, decrease α |
| Doesn't converge after 10k iter | Unscaled features | Standardize features! |
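
The first three rows of the table can be reproduced on a toy problem. The sketch below runs gradient descent on the one-dimensional loss L(w) = w², whose gradient is 2w; the function name `run_gd` and the three learning rates are assumptions picked to show each regime.

```python
def run_gd(lr, w0=1.0, steps=20):
    """Gradient descent on the toy loss L(w) = w**2; returns the loss history."""
    w = w0
    losses = [w ** 2]
    for _ in range(steps):
        w = w - lr * 2 * w  # the gradient of w**2 is 2w
        losses.append(w ** 2)
    return losses

for lr in (0.001, 0.1, 1.1):
    losses = run_gd(lr)
    print(f"lr={lr}: first loss={losses[0]:.3g}, final loss={losses[-1]:.3g}")
```

Each update multiplies w by (1 - 2·lr), so lr=0.001 barely moves the loss, lr=0.1 shrinks it steadily, and lr=1.1 flips the sign of w on every step while its magnitude grows, which is the increasing-loss pattern from the table.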

> **Final Question**: You're training a neural network and observe that the training loss decreases smoothly for 50 epochs, then suddenly starts increasing. What is the MOST likely explanation?
>
> A. The model has successfully converged to the optimal minimum region and is now oscillating around it with small fluctuations
>
> B. The learning rate should be increased to larger values to accelerate convergence and escape local plateaus or saddle points
>
> C. The learning rate is too high for later epochs when approaching minimum and needs decay or reduction to prevent overshooting
>
> D. Feature standardization was skipped during preprocessing, causing numerical instability that manifests only in later training stages after many weight updates

<details><summary>Click to reveal answer</summary>

**Correct Answer: C**

**Key Insight:** Loss decreasing then increasing indicates the learning rate is too large for the current optimization stage. Early in training (far from minimum), large steps work well. Near the minimum (after ~50 epochs), the same large steps overshoot. Solution: use learning rate decay/scheduling to reduce α over time.

**Explanation:**
- **A is FALSE**: If the model had converged optimally, the loss would plateau (stay relatively constant with small fluctuations around the minimum), not suddenly increase. Increasing loss means the model is moving AWAY from the optimal solution, which is the opposite of convergence. Oscillation around a minimum would show small fluctuations (±0.01), not systematic increase (e.g., loss going from 0.5 to 0.8 to 1.2). A converged model would show: epoch 50 loss=0.50, epoch 51 loss=0.51, epoch 52 loss=0.49 (tiny variations).
- **B is FALSE**: If loss is already increasing after epoch 50, making the learning rate LARGER will make the problem dramatically worse! The model is already overshooting the minimum (taking steps too large for the narrow loss valley), so bigger steps would cause even more overshooting and potentially divergence to infinity. This would accelerate the problem, not solve it. Example: if α=0.1 causes overshooting, increasing to α=0.5 would cause massive divergence.
- **C is TRUE**: This is a classic pattern in neural network training: initially, when weights are far from optimal (random initialization), a larger learning rate (e.g., α=0.1) works well for rapid progress. But as the model approaches the minimum (after ~50 epochs), those same large steps start overshooting because the loss surface becomes very narrow near the minimum. The learning rate that worked early now causes instability. Solution: learning rate decay/scheduling (reduce α over time, e.g., α_epoch50 = α_initial × 0.1).
- **D is FALSE**: If features weren't standardized, you'd see problems from the VERY FIRST epoch - extremely slow convergence, erratic oscillations, or immediate divergence (loss increasing from epoch 1). The fact that loss decreased smoothly for 50 epochs proves features were properly scaled and the optimization was working correctly. This issue starting specifically at epoch 50 indicates a learning rate problem related to later training stages (approaching minimum), not a data preprocessing issue that would appear immediately.

**Real-world solution: Learning Rate Schedules**
```python
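import math

# Common schedules to prevent this problem (all values below are illustrative):
initial_lr, min_lr, max_lr = 0.1, 0.001, 0.1
decay_rate, total_epochs = 0.05, 100

learning_rate = initial_lr
for epoch in range(total_epochs):
    # 1. Step Decay: Reduce α by 10× every 50 epochs
    if epoch > 0 and epoch % 50 == 0:
        learning_rate *= 0.1

    # 2. Exponential Decay: Gradual reduction
    exp_lr = initial_lr * math.exp(-decay_rate * epoch)

    # 3. Cosine Annealing: Smooth curve from max_lr down to min_lr
    cos_lr = min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * epoch / total_epochs))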
```

This is standard practice in deep learning - start high for fast progress, reduce over time for stable convergence.

</details>