<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Regularisation/Regularisation%20Hands-On%20Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regularisation: Hands-On Lab

## Learning Objectives

By the end of this lab, you will be able to:

1. **Understand** the problem of overfitting and how it manifests in machine learning models
2. **Recognize** the bias-variance tradeoff and its relationship to model complexity
3. **Implement** L2 (Ridge) regularization in logistic regression from scratch
4. **Apply** regularization to prevent overfitting with polynomial features
5. **Optimize** the regularization hyperparameter λ using cross-validation
6. **Visualize** how regularization affects decision boundaries and model complexity
7. **Compare** L1 (Lasso) and L2 (Ridge) regularization techniques

## What is Regularisation?

**Regularisation** is a technique to prevent overfitting by adding a penalty term to the loss function that discourages overly complex models.

### The Overfitting Problem

Without regularization, complex models (e.g., high-degree polynomials) can fit training data perfectly but fail on new data:
- **Training error**: Very low (model memorizes training data)
- **Test error**: High (model fails to generalize)

### Regularisation Solution

Add a penalty for large weights to the loss function:

$$J_{\text{regularized}}(\vec{w}) = J_{\text{original}}(\vec{w}) + \lambda R(\vec{w})$$

Where:
- $J_{\text{original}}$ is the original loss (e.g., negative log-likelihood)
- $R(\vec{w})$ is the regularization term
- $\lambda \geq 0$ is the regularization strength hyperparameter

### Two Common Types

**L2 Regularization (Ridge)**:
$$J_{\text{L2}}(\vec{w}) = J(\vec{w}) + \lambda ||\vec{w}||^2_2 = J(\vec{w}) + \lambda \sum_{j=1}^{d} w_j^2$$

- Penalizes the **squared magnitude** of weights
- Shrinks weights toward zero but rarely makes them exactly zero
- Gradient of the penalty term: $\nabla_{\vec{w}} \left(\lambda ||\vec{w}||^2_2\right) = 2\lambda \vec{w}$

**L1 Regularization (Lasso)**:
$$J_{\text{L1}}(\vec{w}) = J(\vec{w}) + \lambda ||\vec{w}||_1 = J(\vec{w}) + \lambda \sum_{j=1}^{d} |w_j|$$

- Penalizes the **absolute value** of weights
- Can drive weights to **exactly zero** (feature selection)
- Gradient of the penalty term (where $w_j \neq 0$): $\nabla_{\vec{w}} \left(\lambda ||\vec{w}||_1\right) = \lambda \cdot \text{sign}(\vec{w})$
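
The two penalties and their gradients can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the lab's class: the function names `l2_penalty` and `l1_penalty` and the demo values are assumptions made for this example.

```python
import numpy as np

def l2_penalty(w, lam):
    """Return lam * ||w||_2^2 and its gradient 2 * lam * w."""
    w = np.asarray(w, dtype=float)
    return lam * np.sum(w ** 2), 2.0 * lam * w

def l1_penalty(w, lam):
    """Return lam * ||w||_1 and lam * sign(w) (np.sign gives 0 at w_j = 0)."""
    w = np.asarray(w, dtype=float)
    return lam * np.sum(np.abs(w)), lam * np.sign(w)

# Hypothetical weights and strength, chosen only for illustration
w_demo = np.array([0.5, -2.0, 0.0])
lam_demo = 0.1

l2_value, l2_grad = l2_penalty(w_demo, lam_demo)
l1_value, l1_grad = l1_penalty(w_demo, lam_demo)

print("L2 penalty:", l2_value, "gradient:", l2_grad)
print("L1 penalty:", l1_value, "gradient:", l1_grad)
```

Note how the L2 gradient scales with each weight, while the L1 gradient has the same magnitude for every non-zero weight.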

### Key Intuition

**Why does regularization prevent overfitting?**

Complex models fit training data by using large weights to create sharp, wiggly decision boundaries. Regularization forces the model to use smaller weights, resulting in smoother, more generalizable boundaries. The model must balance fitting the training data well (low $J$) with keeping weights small (low $R$).

**Example:**
- Without regularization: $w = [0.1, 50.3, -45.2, 38.7]$ (large weights, overfits)
- With regularization: $w = [0.1, 2.3, -1.8, 0.9]$ (small weights, generalizes)

### The λ Hyperparameter

- $\lambda = 0$: No regularization (may overfit)
- $\lambda$ small: Light regularization
- $\lambda$ large: Heavy regularization (may underfit)
- $\lambda \to \infty$: All penalized weights forced to zero (trivial model)

**Finding optimal λ**: Use cross-validation to test different values and select the one with best validation performance.
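
The effect of λ can be seen on a toy one-dimensional problem. The sketch below assumes a simple quadratic loss $(w - a)^2$ in place of the logistic loss, so the regularized minimizer has the closed form $w^* = a / (1 + \lambda)$ (set the derivative $2(w - a) + 2\lambda w$ to zero). The name `shrunk_weight` and the value of `a` are hypothetical.

```python
def shrunk_weight(a, lam):
    """Minimizer of (w - a)**2 + lam * w**2, from setting the derivative to zero."""
    return a / (1.0 + lam)

a = 4.0  # hypothetical unregularized optimum
lambdas = [0.0, 0.1, 1.0, 10.0, 1000.0]
weights = [shrunk_weight(a, lam) for lam in lambdas]

for lam, w in zip(lambdas, weights):
    print(f"lambda = {lam:>7}: w* = {w:.4f}")
```

With λ = 0 the weight stays at its unregularized value, and it shrinks steadily toward zero as λ grows.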

---

## When to Use Regularisation

### ✅ Use Regularisation When:

**1. Training accuracy >> Test accuracy (Overfitting)**
- Training accuracy: 98%, Test accuracy: 75%
- Model memorizes training data but fails to generalize
- Regularization constrains model complexity

**2. High-Dimensional Feature Space**
- Using polynomial features (degree 3+)
- Many features compared to samples (d >> N)
- Risk of overfitting increases with dimensionality

**3. Feature Engineering Creates Many Features**
- Polynomial expansion creates O(d^p) features
- Interaction terms between many variables
- Need to prevent overfitting to noise in new features
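
To get a feel for how quickly polynomial expansion grows, the number of monomials of total degree at most $p$ in $d$ variables is $\binom{d+p}{p}$ (one fewer if the constant term is dropped). The helper name `n_polynomial_features` is an assumption for this sketch.

```python
from math import comb

def n_polynomial_features(d, p, include_bias=False):
    """Count monomials of total degree <= p in d variables."""
    total = comb(d + p, p)
    return total if include_bias else total - 1

# Two input features, as in this lab's dataset
for degree in [1, 2, 4, 6, 8]:
    print(f"degree {degree}: {n_polynomial_features(2, degree)} features")

# Growth is much faster with more input features
print("d=10, p=3:", n_polynomial_features(10, 3))
```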

**4. Limited Training Data**
- Small datasets are prone to overfitting
- Regularization acts as a prior on weights
- Helps model generalize with fewer examples

**5. Noisy Data**
- Training labels have errors or noise
- Regularization prevents fitting to noise
- Creates more robust models

### ❌ Don't Use Regularisation When:

**1. Model is Already Underfitting**
- Training accuracy is low (e.g., 60%)
- Model too simple for the problem
- **Better alternatives**: Increase model complexity, add features, reduce existing regularization

**2. Linear Model on Linearly Separable Data**
- Simple problem with clear linear boundary
- Few features, many samples
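- **Better alternatives**: Use standard logistic regression with little or no regularization. Caveat: if the classes are *perfectly* separable, unregularized weights grow without bound, so a small λ still helps in that case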

**3. Very Large Datasets**
- When N >> d (millions of samples, few features)
- Overfitting is less of a concern
- **Better alternatives**: Let data volume prevent overfitting naturally

**4. Need Maximum Training Performance**
- Competition where only training set performance matters (rare)
- Regularization hurts training performance by design

### Quick Decision Tree:

```
Is training accuracy much higher than test accuracy?
├─ Yes → Use regularization (likely overfitting)
└─ No
    ├─ Is training accuracy low?
    │   └─ Yes → Don't use regularization (underfitting)
    └─ No → Model is well-fitted, regularization optional
```
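
The decision tree above can be written as a small helper. This is only a sketch: the function name and both thresholds (`gap_threshold`, `low_train_threshold`) are hypothetical values picked for illustration, and sensible cut-offs depend on the problem.

```python
def regularization_advice(train_acc, test_acc,
                          gap_threshold=0.05, low_train_threshold=0.70):
    """Map train/test accuracy to the advice given by the decision tree."""
    # Large train-test gap: likely overfitting
    if train_acc - test_acc > gap_threshold:
        return "use regularization (likely overfitting)"
    # Small gap but poor training accuracy: likely underfitting
    if train_acc < low_train_threshold:
        return "don't add regularization (likely underfitting)"
    # Small gap and good training accuracy
    return "regularization optional"

print(regularization_advice(0.98, 0.75))
print(regularization_advice(0.60, 0.60))
print(regularization_advice(0.90, 0.88))
```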

### Regularisation vs Other Techniques:

| Problem | Regularization | Alternative Solutions |
|---------|----------------|----------------------|
| **Overfitting** | ✅ L1/L2 regularization | Early stopping, dropout, more data |
| **Underfitting** | ❌ Makes worse | Increase complexity, add features |
| **Feature selection** | ✅ L1 (sparse) | RFE, tree-based importance |
| **Multicollinearity** | ✅ L2 (Ridge) | PCA, remove correlated features |
| **High dimensions** | ✅ Essential | Dimensionality reduction |

### Real-World Applications:

1. **Medical Diagnosis**: Limited patient data, many test results → Use L2 regularization
2. **Image Classification**: Deep neural networks with millions of parameters → Use L2 + dropout
3. **Gene Expression Analysis**: Few samples, thousands of genes → Use L1 for feature selection
4. **Text Classification**: High-dimensional sparse features → Use L1 or L2
5. **Linear Regression with Polynomial Features**: High-degree polynomials → Use L2 (Ridge)

### The Bottom Line:

**Use regularization when:**
- You observe overfitting (train >> test performance)
- You have high-dimensional features
- You're using polynomial or interaction features

**Don't use regularization when:**
- Model is underfitting
- Data is simple and plentiful
- Training performance is already poor

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression
from scipy.special import expit
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

## Load Classification Dataset

We'll use the same mixture dataset from the lecture slides (also used in the Logistic Regression lab). This dataset contains two classes with non-linear separation, making it perfect for demonstrating:
1. How polynomial features can capture non-linearity
2. How high-degree polynomials lead to overfitting
3. How regularization prevents overfitting

In [None]:
# Download the mixture dataset from Google Drive
# File ID: 1Ls7f9OWKgeWswFR4EZ5eeoohfY9PACRT
# Direct download URL
url = 'https://drive.google.com/uc?id=1Ls7f9OWKgeWswFR4EZ5eeoohfY9PACRT'

# Load data
df = pd.read_csv(url)
print(f"Dataset shape: {df.shape}")
print(f"\nFirst few rows:")
print(df.head())
print(f"\nColumn names: {df.columns.tolist()}")

# Extract features and labels (last column is the label)
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

print(f"\nFeatures shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print(f"Class distribution: {np.bincount(y.astype(int))}")

## Visualize the Data

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(X[y == 0, 0], X[y == 0, 1], c='orange', label='Class 0', 
            edgecolors='k', s=50, alpha=0.7)
plt.scatter(X[y == 1, 0], X[y == 1, 1], c='skyblue', label='Class 1', 
            edgecolors='k', s=50, alpha=0.7)
plt.xlabel('$x_1$', fontsize=14)
plt.ylabel('$x_2$', fontsize=14)
plt.title('Mixture Dataset (Non-Linear Boundary)', fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

## Split into Train and Test Sets

In [None]:
# Split data (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

### ✅ Checkpoint Question 1: Why can't linear logistic regression solve this problem?

A) Linear logistic regression can only create straight-line decision boundaries, but our classes are separated by a non-linear curve that cannot be captured by a straight line

B) Linear logistic regression requires all features to be normalized to zero mean and unit variance before training, but our dataset has unnormalized features

C) Linear logistic regression can only handle datasets with exactly two features and fails when applied to higher-dimensional feature spaces with three or more variables

D) Linear logistic regression uses gradient descent which cannot converge properly on non-linear data patterns, always getting stuck in local minima during optimization

<details>
<summary>Click to see answer</summary>

**Answer: A**

**Key Insight:** Linear logistic regression creates decision boundaries of the form $w_0 + w_1x_1 + w_2x_2 = 0$, which is always a straight line in 2D (or hyperplane in higher dimensions). Our dataset requires a curved boundary that cannot be expressed as a linear combination of $x_1$ and $x_2$ alone.

**Detailed Explanation:**

A linear classifier can only create decision boundaries that are:
- **2D**: Straight lines
- **3D**: Flat planes  
- **Higher dimensions**: Hyperplanes

For our mixture dataset:
- The two classes are intermingled in a non-linear pattern
- Optimal boundary requires curves or complex shapes
- A straight line cannot separate the classes effectively

**Visual example:**
```
Linear boundary:        Needed boundary:
   |  O O O               O O O
   |O  X  O            O    ⟲    O
---*------          O    ⟲  X  O
 X | X O             O  X  ⟲  O
  X|X O                 O O O
```

**Solution:** Use polynomial features to create non-linear terms:
- Add $x_1^2, x_2^2, x_1x_2$ to features
- Now the model can learn: $w_0 + w_1x_1 + w_2x_2 + w_3x_1^2 + w_4x_2^2 + w_5x_1x_2 = 0$
- This allows curved and complex decision boundaries

**Why other answers are incorrect:**

- **B is FALSE**: While feature normalization is a best practice for gradient descent convergence, it's not why linear models fail on this problem. Even with perfectly normalized features, a straight line cannot separate non-linearly distributed classes. Normalization helps with optimization speed and stability, but doesn't change the fundamental limitation of linear boundaries.

- **C is FALSE**: Linear logistic regression works with any number of features (2, 3, 100, 10000, etc.). The limitation is not the number of features but the type of decision boundary (linear vs non-linear). In fact, we'll solve this problem by *adding* features (polynomial terms) while still using logistic regression.

- **D is FALSE**: Gradient descent converges just fine on non-linear data. The negative log-likelihood loss function for logistic regression is convex, meaning there are no local minima - gradient descent will always reach the global optimum. The issue is that this optimum corresponds to the best *linear* boundary, which still can't separate non-linearly distributed classes effectively.

</details>
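
As a rough illustration of why adding polynomial terms helps, the sketch below builds the degree-2 expansion by hand with NumPy and shows that a circular boundary $x_1^2 + x_2^2 = 1$ becomes a *linear* rule in the expanded features. The helper `polynomial_expand`, the demo points, and the hand-picked weights are all assumptions for this example, not learned values.

```python
from itertools import combinations_with_replacement
import numpy as np

def polynomial_expand(X, degree):
    """Return all monomials of total degree 1..degree (no bias column)."""
    X = np.asarray(X, dtype=float)
    columns = []
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(X.shape[1]), deg):
            # Multiply the selected input columns together, e.g. (0, 1) -> x1 * x2
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)

# Column order for two inputs at degree 2: x1, x2, x1^2, x1*x2, x2^2
expanded_demo = polynomial_expand([[2.0, 3.0]], degree=2)
print(expanded_demo)

# Hand-picked weights encoding x1^2 + x2^2 - 1 = 0 (a unit circle)
points = np.array([[0.0, 0.0], [2.0, 0.0], [0.5, 0.5]])
w_circle = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
bias_circle = -1.0
scores = polynomial_expand(points, 2) @ w_circle + bias_circle
outside_circle = scores > 0
print(outside_circle)
```

The classifier is still linear in its weights; only the feature space has changed.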

## Exercise 1: Demonstrate Overfitting with Polynomial Features

In this exercise, we'll:
1. Apply polynomial feature transformation to our data
2. Train logistic regression models with different polynomial degrees
3. Observe how higher degrees lead to overfitting (high train accuracy, low test accuracy)

**What you'll implement:**
- Complete logistic regression class (from previous labs)
- Train models with degrees 1, 2, 4, 6, 8
- Compare training vs test accuracy to identify overfitting

In [None]:
class LogisticRegressionRegularized(BaseEstimator, ClassifierMixin):
    """
    Logistic Regression with L2 regularization.
    
    Parameters
    ----------
    learning_rate : float, default=0.01
        Step size for gradient descent
    max_iter : int, default=1000  
        Maximum number of iterations
    lambda_reg : float, default=0.0
        L2 regularization strength (λ ≥ 0)
    tol : float, default=1e-6
        Convergence tolerance
    random_state : int, default=None
        Random seed for weight initialization
    """
    
    def __init__(self, learning_rate=0.01, max_iter=1000, lambda_reg=0.0, 
                 tol=1e-6, random_state=None):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.lambda_reg = lambda_reg
        self.tol = tol
        self.random_state = random_state
    
    def _sigmoid(self, z):
        """Sigmoid function using scipy.special.expit for numerical stability."""
        # TODO: Implement sigmoid using expit
        # Hint: return expit(z)
        return None
    
    def _compute_gradient(self, Phi, y, probabilities):
        """
        Compute gradient with L2 regularization.
        
        Gradient = Φᵀ(p - y) + 2λw
        
        Note: We don't regularize the bias term (first weight)
        """
        # TODO: Compute base gradient (without regularization)
        # Hint: Phi.T @ (probabilities - y)
        gradient = None
        
        # TODO: Add L2 regularization term to gradient
        # Hint: Create a regularization vector with bias=0
        #       reg_weights = self.weights_.copy()
        #       reg_weights[0] = 0  # Don't regularize bias
        #       gradient += 2 * self.lambda_reg * reg_weights
        if self.lambda_reg > 0:
            pass  # Add regularization here
        
        return gradient
    
    def fit(self, X, y):
        """Fit the model using gradient descent."""
        # Create design matrix
        Phi = np.c_[np.ones(X.shape[0]), X]
        
        # Initialize weights
        if self.random_state is not None:
            np.random.seed(self.random_state)
        self.weights_ = np.random.randn(Phi.shape[1]) * 0.01
        
        # Loss history
        self.loss_history_ = []
        
        # Gradient descent
        for iteration in range(self.max_iter):
            # TODO: Compute probabilities
            # Step 1: scores = Phi @ self.weights_
            # Step 2: probabilities = self._sigmoid(scores)
            scores = None
            probabilities = None
            
            # TODO: Compute NLL loss (with regularization)
            epsilon = 1e-15
            p_safe = np.clip(probabilities, epsilon, 1 - epsilon)
            nll = -np.sum(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))
            
            # Add L2 penalty to loss (don't regularize bias)
            # Loss uses λ||w||² (without the 1/2 factor)
            if self.lambda_reg > 0:
                nll += self.lambda_reg * np.sum(self.weights_[1:]**2)
            
            self.loss_history_.append(nll)
            
            # TODO: Compute gradient using _compute_gradient
            gradient = None
            
            # Check convergence
            if np.linalg.norm(gradient) < self.tol:
                break
            
            # TODO: Update weights
            # Hint: self.weights_ = self.weights_ - self.learning_rate * gradient
            pass
        
        self.n_iter_ = iteration + 1
        return self
    
    def predict_proba(self, X):
        """Predict class probabilities."""
        Phi = np.c_[np.ones(X.shape[0]), X]
        scores = Phi @ self.weights_
        p1 = self._sigmoid(scores)
        return np.column_stack([1 - p1, p1])
    
    def predict(self, X):
        """Predict class labels."""
        return (self.predict_proba(X)[:, 1] >= 0.5).astype(int)

In [None]:
print("=" * 70)
print("EXERCISE 1: Demonstrating Overfitting with Polynomial Features")
print("=" * 70)

# Test different polynomial degrees WITHOUT regularization
degrees = [1, 2, 4, 6, 8]
results_no_reg = {}

for degree in degrees:
    print(f"\nTesting degree {degree}...")
    
    # Create polynomial features
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)
    
    # Standardize features
    scaler = StandardScaler()
    X_train_poly = scaler.fit_transform(X_train_poly)
    X_test_poly = scaler.transform(X_test_poly)
    
    # Train model WITHOUT regularization (lambda=0)
    model = LogisticRegressionRegularized(
        learning_rate=0.1,
        max_iter=2000,
        lambda_reg=0.0,  # No regularization!
        random_state=42
    )
    model.fit(X_train_poly, y_train)
    
    # Evaluate
    train_acc = accuracy_score(y_train, model.predict(X_train_poly))
    test_acc = accuracy_score(y_test, model.predict(X_test_poly))
    
    results_no_reg[degree] = {
        'poly': poly,
        'scaler': scaler,
        'model': model,
        'train_acc': train_acc,
        'test_acc': test_acc,
        'n_features': X_train_poly.shape[1],
        'weight_magnitude': np.linalg.norm(model.weights_)
    }
    
    print(f"  Features: {X_train_poly.shape[1]}")
    print(f"  Train Accuracy: {train_acc:.3f}")
    print(f"  Test Accuracy:  {test_acc:.3f}")
    print(f"  Overfitting Gap: {train_acc - test_acc:.3f}")
    print(f"  Weight Magnitude: {np.linalg.norm(model.weights_):.2f}")

print("\n" + "=" * 70)
print("OBSERVATION: Notice how higher degrees have:")
print("  1. Higher training accuracy (good fit to training data)")
print("  2. Lower test accuracy (poor generalization)")
print("  3. Larger weight magnitudes (complex models)")
print("  4. Larger overfitting gap (train_acc - test_acc)")
print("\nThis is OVERFITTING! Regularization will help.")
print("=" * 70)

### Visualize Overfitting: Training vs Test Accuracy

In [None]:
# Extract results for plotting
train_accs = [results_no_reg[d]['train_acc'] for d in degrees]
test_accs = [results_no_reg[d]['test_acc'] for d in degrees]
n_features_list = [results_no_reg[d]['n_features'] for d in degrees]

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Accuracy vs Polynomial Degree
ax = axes[0]
ax.plot(degrees, train_accs, 'o-', linewidth=3, markersize=10, label='Training Accuracy', color='green')
ax.plot(degrees, test_accs, 's-', linewidth=3, markersize=10, label='Test Accuracy', color='red')
ax.fill_between(degrees, train_accs, test_accs, alpha=0.2, color='gray', label='Overfitting Gap')
ax.set_xlabel('Polynomial Degree', fontsize=14)
ax.set_ylabel('Accuracy', fontsize=14)
ax.set_title('Overfitting: Training vs Test Accuracy (λ=0)', fontsize=16)
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
ax.set_xticks(degrees)

# Plot 2: Accuracy vs Number of Features
ax = axes[1]
ax.plot(n_features_list, train_accs, 'o-', linewidth=3, markersize=10, label='Training Accuracy', color='green')
ax.plot(n_features_list, test_accs, 's-', linewidth=3, markersize=10, label='Test Accuracy', color='red')
ax.set_xlabel('Number of Features', fontsize=14)
ax.set_ylabel('Accuracy', fontsize=14)
ax.set_title('Model Complexity vs Accuracy (λ=0)', fontsize=16)
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### ✅ Checkpoint Question 2: What indicates overfitting in the plots above?

A) Training and test accuracy both increase together as polynomial degree increases, showing the model is learning meaningful patterns from the data

B) Training accuracy increases while test accuracy decreases as degree increases, indicating the model memorizes training data but fails to generalize

C) Both training and test accuracy remain constant across all polynomial degrees, suggesting the model has reached its optimal performance level

D) Test accuracy is consistently higher than training accuracy across all degrees, demonstrating excellent generalization to unseen data samples

<details>
<summary>Click to see answer</summary>

**Answer: B**

**Key Insight:** The hallmark of overfitting is a growing gap between training and test performance. As we add more polynomial features, the model gains flexibility to fit the training data more precisely (training accuracy ↑), but this comes at the cost of fitting noise and training-specific patterns that don't generalize (test accuracy ↓ or stagnates).

**Detailed Explanation:**

Let's analyze what happens as polynomial degree increases:

**Degree 1 (Linear):**
- 2 features: $[x_1, x_2]$
- Cannot fit circular boundary
- Train accuracy: ~60%, Test accuracy: ~60%
- Underfitting: model too simple

**Degree 2 (Quadratic):**
- 5 features: $[x_1, x_2, x_1^2, x_1x_2, x_2^2]$
- Can fit circular boundary!
- Train accuracy: ~90%, Test accuracy: ~88%
- Good fit: slight gap is acceptable

**Degree 6:**
- 27 features
- Train accuracy: ~98%, Test accuracy: ~82%
- **Overfitting!** Gap = 16%

**Degree 8:**
- 44 features  
- Train accuracy: ~99%, Test accuracy: ~78%
- **Severe overfitting!** Gap = 21%

**Why does this happen?**

High-degree polynomials create very flexible decision boundaries that can:
1. Perfectly wrap around every training point
2. Fit to random noise in training data
3. Create complex, wiggly boundaries that don't reflect the true pattern

Example with 1 point slightly off-pattern:
```
Degree 2 (good):       Degree 8 (overfitting):
   O O O                 O O O
 O       O             O    X  O  ← boundary bends
O    X    O           O          O    to capture
 O  X X  O             O  X X  O     this outlier
   O O O                 O O O
```

**Visual indicators of overfitting:**
- ↗️ Training accuracy keeps increasing
- ↘️ Test accuracy starts decreasing (or stops improving)
- 📏 Growing gap between the two curves
- 🎯 Model achieves near-perfect training accuracy but poor test accuracy

**Why other answers are incorrect:**

- **A is FALSE**: If both accuracies increased together, that would indicate good generalization (the ideal scenario). We want both to improve together. When they diverge, with training accuracy increasing and test accuracy plateauing or decreasing, that's when we have overfitting.

- **C is FALSE**: Constant accuracy across degrees would suggest:
  1. The model can't benefit from added complexity (already optimal), OR
  2. The features aren't helping (implementation problem)
  
  This is not what we observe. We clearly see training accuracy increasing with degree, indicating the model IS using the additional features - just not in a way that generalizes.

- **D is FALSE**: Test accuracy higher than training accuracy is uncommon and usually indicates:
  1. A lucky test set (not representative)
  2. Data leakage (test data seen during training)
  3. Very few samples
  
  This is the opposite of overfitting. With proper random splits, training accuracy is usually ≥ test accuracy because the model is optimized on the training set.

</details>

## Pseudocode for L2 Regularisation

Before implementing L2 regularization, let's understand the algorithm structure:

### Helper Function: Exclude Intercept from Regularization

```
Function REG_VECTOR(w)
    r ← copy(w)
    r[0] ← 0           # Don't regularize intercept
    Return r
End
```

**Why?** The intercept (bias term) just shifts the decision boundary and shouldn't be penalized.

### L2 Regularization Components

**L2 Loss:** 
$$\lambda \cdot ||\text{REG\_VECTOR}(w)||^2 = \lambda \sum_{j=1}^{d} w_j^2$$

**L2 Gradient:** 
$$2\lambda \cdot \text{REG\_VECTOR}(w)$$

Note: The gradient is $2\lambda w$ because we differentiate $\lambda w^2$ to get $2\lambda w$.
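
A quick way to convince yourself of the $2\lambda w$ formula (and of the zeroed intercept entry) is a finite-difference check. The sketch below is an illustration only; the function names and the test values for `w_check` and `lam_check` are assumptions.

```python
import numpy as np

def reg_vector(w):
    """Copy of w with the intercept entry zeroed (never penalized)."""
    r = np.array(w, dtype=float)
    r[0] = 0.0
    return r

def l2_loss(w, lam):
    return lam * np.sum(reg_vector(w) ** 2)

def l2_grad(w, lam):
    return 2.0 * lam * reg_vector(w)

def numerical_grad(f, w, h=1e-6):
    """Central-difference approximation of the gradient of f at w."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = h
        grad[j] = (f(w + step) - f(w - step)) / (2 * h)
    return grad

w_check = np.array([3.0, -1.5, 0.5])   # hypothetical weights, w[0] is the intercept
lam_check = 0.7                        # hypothetical strength

analytic = l2_grad(w_check, lam_check)
numeric = numerical_grad(lambda w: l2_loss(w, lam_check), w_check)
print("analytic:", analytic)
print("numeric: ", numeric)
```

The first entry of both gradients is zero because the intercept does not appear in the penalty.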

### Main Optimization Loop

```
# Inputs:
#   base_grad(w)  ← gradient of base loss (no regularization)
#   w0            ← initial parameters
#   η             ← learning rate
#   max_iter      ← maximum iterations
#   tol           ← stop when ||g|| ≤ tol
#   λ ≥ 0         ← L2 strength (intercept w[0] never penalized)

w ← w0

FOR t = 1 TO max_iter DO
    g_base ← base_grad(w)                    # Gradient of base loss
    g_reg ← 2 · λ · REG_VECTOR(w)            # L2 gradient (no intercept)
    g ← g_base + g_reg                       # Total gradient
    
    IF norm(g) ≤ tol THEN BREAK              # Convergence check
    
    w ← w − η · g                            # Update weights
END FOR

Return w
```
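
The loop above translates almost line for line into NumPy. The sketch below is a generic version that takes any `base_grad` function; to keep it self-contained it is exercised on a toy quadratic base loss $\tfrac{1}{2}||w - t||^2$ (an assumption for this example, not the logistic loss), for which the regularized solution is known: $w_0 = t_0$ and $w_j = t_j / (1 + 2\lambda)$ for $j \geq 1$.

```python
import numpy as np

def reg_vector(w):
    """Copy of w with the intercept entry zeroed."""
    r = w.copy()
    r[0] = 0.0
    return r

def gradient_descent_l2(base_grad, w0, eta=0.1, max_iter=1000, tol=1e-8, lam=0.0):
    """Gradient descent on base loss + lam * ||w[1:]||^2."""
    w = np.array(w0, dtype=float)
    for _ in range(max_iter):
        g = base_grad(w) + 2.0 * lam * reg_vector(w)   # total gradient
        if np.linalg.norm(g) <= tol:                   # convergence check
            break
        w = w - eta * g                                # weight update
    return w

# Toy base loss 0.5 * ||w - target||^2, so base_grad(w) = w - target
target = np.array([1.0, 4.0, -2.0])   # hypothetical values

def toy_base_grad(w):
    return w - target

w_plain = gradient_descent_l2(toy_base_grad, np.zeros(3), lam=0.0)
w_ridge = gradient_descent_l2(toy_base_grad, np.zeros(3), lam=1.0)
print("lambda = 0:", w_plain)
print("lambda = 1:", w_ridge)
```

With λ = 1 the intercept entry is untouched while the other two entries shrink by a factor of three.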

### Key Points

1. **Base gradient**: $\Phi^T(p - y)$ (from negative log-likelihood)
2. **L2 gradient**: $2\lambda w$ (from regularization term)
3. **Total gradient**: $\Phi^T(p - y) + 2\lambda w$ (excluding intercept)
4. **Weight update**: $w \leftarrow w - \eta \cdot (\Phi^T(p - y) + 2\lambda w)$

---

## Exercise 2: Implement L2 Regularization

Now we'll fix the overfitting problem by implementing L2 regularization!

**What you need to do:**

Go back to the `LogisticRegressionRegularized` class above and complete the TODO sections:

1. **`_sigmoid()`**: Implement using `expit(z)`
2. **`_compute_gradient()`**: Add the L2 regularization term `2λw` to the gradient
   - Base gradient: `Φᵀ(p - y)`
   - Regularization: `+ 2λw`
   - Important: Don't regularize the bias term (first weight)
3. **`fit()`**: Complete the gradient descent loop
   - Compute scores and probabilities
   - Compute gradient using `_compute_gradient()`
   - Update weights: `w = w - α * gradient`

**Then run the cell below to test your implementation!**

In [None]:
print("=" * 70)
print("EXERCISE 2 VERIFICATION: Testing L2 Regularization Implementation")
print("=" * 70)

# Test on degree 6 polynomial with and without regularization
degree_test = 6
lambda_test = 1.0

# Create polynomial features
poly_test = PolynomialFeatures(degree=degree_test, include_bias=False)
X_train_poly_test = poly_test.fit_transform(X_train)
X_test_poly_test = poly_test.transform(X_test)

# Standardize
scaler_test = StandardScaler()
X_train_poly_test = scaler_test.fit_transform(X_train_poly_test)
X_test_poly_test = scaler_test.transform(X_test_poly_test)

# Train WITHOUT regularization
model_no_reg_test = LogisticRegressionRegularized(
    learning_rate=0.1, max_iter=2000, lambda_reg=0.0, random_state=42
)
model_no_reg_test.fit(X_train_poly_test, y_train)
train_acc_no_reg = accuracy_score(y_train, model_no_reg_test.predict(X_train_poly_test))
test_acc_no_reg = accuracy_score(y_test, model_no_reg_test.predict(X_test_poly_test))

# Train WITH regularization
model_reg_test = LogisticRegressionRegularized(
    learning_rate=0.1, max_iter=2000, lambda_reg=lambda_test, random_state=42
)
model_reg_test.fit(X_train_poly_test, y_train)
train_acc_reg = accuracy_score(y_train, model_reg_test.predict(X_train_poly_test))
test_acc_reg = accuracy_score(y_test, model_reg_test.predict(X_test_poly_test))

print(f"\nTesting with Degree {degree_test} Polynomial:")
print("\nWithout Regularization (λ=0):")
print(f"  Train Accuracy: {train_acc_no_reg:.3f}")
print(f"  Test Accuracy:  {test_acc_no_reg:.3f}")
print(f"  Overfitting Gap: {train_acc_no_reg - test_acc_no_reg:.3f}")
print(f"  Weight Magnitude: {np.linalg.norm(model_no_reg_test.weights_):.2f}")

print(f"\nWith L2 Regularization (λ={lambda_test}):")
print(f"  Train Accuracy: {train_acc_reg:.3f}")
print(f"  Test Accuracy:  {test_acc_reg:.3f}")
print(f"  Overfitting Gap: {train_acc_reg - test_acc_reg:.3f}")
print(f"  Weight Magnitude: {np.linalg.norm(model_reg_test.weights_):.2f}")

print("\n" + "=" * 70)
print("✅ SUCCESS CRITERIA:")
print("  1. Test accuracy should IMPROVE with regularization")
print("  2. Overfitting gap should DECREASE with regularization")
print("  3. Weight magnitude should be SMALLER with regularization")
print("\nIf your implementation is correct, you should see:")
print("  • Test accuracy increases (e.g., 0.78 → 0.87)")
print("  • Overfitting gap shrinks (e.g., 0.21 → 0.05)")
print("  • Weight magnitude decreases (e.g., 50 → 15)")
print("=" * 70)

### ✅ Checkpoint Question 3: How does L2 regularization prevent overfitting?

A) L2 regularization adds a penalty term proportional to the squared magnitude of weights to the loss function, forcing the model to keep weights small unless they significantly improve the fit, which results in simpler and smoother decision boundaries that generalize better to unseen data

B) L2 regularization removes features with small weights from the model entirely by setting their coefficients to exactly zero, performing automatic feature selection during training to reduce model complexity

C) L2 regularization increases the learning rate during gradient descent optimization, allowing the model to converge faster to the global minimum and avoid getting trapped in local minima

D) L2 regularization adds random noise to the training data during each iteration, preventing the model from memorizing specific training examples and forcing it to learn more robust patterns

<details>
<summary>Click to see answer</summary>

**Answer: A**

**Key Insight:** L2 regularization modifies the loss function to penalize large weights, creating a trade-off between fitting the training data and keeping the model simple. The modified loss is: $J_{\text{total}} = J_{\text{NLL}} + \lambda||w||^2$. This forces the optimization to balance data fit with weight magnitude, resulting in smoother decision boundaries that generalize better.

**Detailed Explanation:**

**Without L2 regularization ($\lambda = 0$):**
- Loss function: $J = -\sum [y \log p + (1-y)\log(1-p)]$
- Goal: Minimize NLL only
- Weights can grow arbitrarily large to perfectly fit training data
- Result: Complex, wiggly boundaries that overfit

**With L2 regularization ($\lambda > 0$):**
- Loss function: $J = -\sum [y \log p + (1-y)\log(1-p)] + \lambda\sum w_j^2$
- Goal: Minimize NLL AND keep weights small
- Large weights are penalized quadratically
- Result: Simpler boundaries, better generalization

**Example with numbers:**

Consider a model trying to fit a training point:
```
Without regularization (λ=0):
- Can use weights: w = [50, -45, 38] (magnitude ≈ 77)
- Training loss: 0.01 (near-perfect fit)
- Test performance: poor (overfits)

With regularization (λ=1):
- Forced to use: w = [2, -1.5, 1.2] (magnitude ≈ 2.8)
- Training loss: 0.15 (good but not perfect fit)
- Test performance: good (generalizes)
- Total loss: 0.15 + 1*(2² + 1.5² + 1.2²) = 0.15 + 7.69 = 7.84
```

**How gradient descent changes with L2:**

Standard gradient: $\nabla J = \Phi^T(p - y)$

With L2: $\nabla J = \Phi^T(p - y) + 2\lambda w$

Update rule: $w \leftarrow w - \alpha[\Phi^T(p-y) + 2\lambda w] = (1-2\alpha\lambda)w - \alpha\Phi^T(p-y)$

The term $(1-2\alpha\lambda)w$ shrinks weights at each step (weight decay)!
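
The two forms of the update rule can be checked numerically. In the sketch below a random vector stands in for the base gradient $\Phi^T(p - y)$, and the values of `alpha` and `lam` are arbitrary; for simplicity it applies the decay to every weight, ignoring the intercept exemption.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)               # current weights (hypothetical)
base_gradient = rng.normal(size=4)   # stands in for the NLL gradient
alpha, lam = 0.1, 0.5                # arbitrary learning rate and strength

# Form 1: subtract the full regularized gradient
direct = w - alpha * (base_gradient + 2 * lam * w)

# Form 2: shrink the weights first, then take the unregularized step
decay_factor = 1 - 2 * alpha * lam
decay_form = decay_factor * w - alpha * base_gradient

print("decay factor:", decay_factor)
print("forms agree: ", np.allclose(direct, decay_form))
```

Each step multiplies the weights by a factor slightly below one before applying the data-driven update, which is why L2 regularization is also called weight decay.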

**Visual interpretation:**
```
Without regularization:      With regularization:
      O O O                      O O O
    O  ←  O                    O       O
   O   ↓   O                  O    X    O
  O ← X → O                  O    X X   O
   O   ↑  O                     O  X X O
    O   O                         O O O
   (wiggly, fits                (smooth,
    every point)               generalizes)
```

**Effect of Œª:**
- λ = 0: No regularization, may overfit
- λ small (0.01): Light penalty, slight smoothing
- λ medium (1.0): Balanced, often optimal
- λ large (100): Heavy penalty, may underfit (weights too small)

**Why other answers are incorrect:**

- **B is FALSE**: This describes L1 (Lasso) regularization, not L2 (Ridge). L1 uses $\lambda \sum |w_j|$ which has a gradient of $\lambda \cdot \text{sign}(w)$, producing exactly zero weights (sparse solutions). L2 uses $\lambda \sum w_j^2$ with gradient $2\lambda w$, which only asymptotically approaches zero but never reaches it. L2 shrinks all weights but doesn't perform feature selection.

- **C is FALSE**: L2 regularization does not affect the learning rate $\alpha$. The learning rate is a separate hyperparameter that controls step size in gradient descent. L2 regularization modifies the gradient itself by adding $2\lambda w$, but the learning rate remains the same. Increasing the learning rate could actually make convergence worse, not better.

- **D is FALSE**: This describes data augmentation or dropout, not L2 regularization. L2 regularization is a deterministic modification to the loss function - no randomness involved. The training data remains unchanged; only the optimization objective changes. Dropout (random neuron deactivation) and data augmentation (adding noisy copies) are different regularization techniques used primarily in neural networks.

</details>

## Exercise 3: Hyperparameter Tuning - Finding Optimal λ

We've seen that regularization helps, but how do we choose the best value of λ?

**The process:**
1. Define a range of λ values to test (e.g., 0.001, 0.01, 0.1, 1, 10)
2. For each λ, train a model and evaluate using cross-validation
3. Select the λ that gives the best validation performance
4. Retrain final model on all training data with optimal λ
5. Evaluate on held-out test set

**Why cross-validation?**
- Using test set for λ selection would leak information
- CV uses only training data to estimate generalization
- Test set remains untouched until final evaluation
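The five steps above can also be sketched with scikit-learn's built-in tools. This is not the lab's from-scratch implementation: it swaps in `LogisticRegression` with `C = 1/λ`, uses synthetic `make_circles` data as a stand-in, and all `_demo` names are assumptions for illustration. Placing the scaler inside a pipeline means it is refit on each CV training fold, so validation folds never influence the preprocessing.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption; not the lab's dataset)
X_demo, y_demo = make_circles(n_samples=300, noise=0.2, factor=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Preprocessing lives inside the pipeline, so it is refit per CV fold
pipe = make_pipeline(
    PolynomialFeatures(degree=6, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=5000),
)

# Step 1: candidate lambdas, expressed through sklearn's C = 1/lambda
lambdas_demo = [0.001, 0.01, 0.1, 1, 10]
param_grid = {"logisticregression__C": [1.0 / lam for lam in lambdas_demo]}

# Steps 2-4: 5-fold CV per candidate, then refit on all training data (refit=True)
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

best_C = search.best_params_["logisticregression__C"]
print(f"best lambda = {1.0 / best_C:g}, CV accuracy = {search.best_score_:.3f}")

# Step 5: the test set is touched exactly once, after lambda is fixed
print(f"held-out test accuracy = {search.score(X_te, y_te):.3f}")
```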

In [None]:
print("=" * 70)
print("EXERCISE 3: Finding Optimal λ via Cross-Validation")
print("=" * 70)

# Use degree 6 polynomial (we know this overfits without regularization)
degree = 6

# Create polynomial features
poly = PolynomialFeatures(degree=degree, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Standardize
scaler = StandardScaler()
X_train_poly = scaler.fit_transform(X_train_poly)
X_test_poly = scaler.transform(X_test_poly)

# Test different λ values
lambdas = [0, 0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0]
cv_results = {}

# 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

print(f"\nUsing degree {degree} polynomial ({X_train_poly.shape[1]} features)")
print("Testing λ values with 5-fold cross-validation...\n")

for lambda_val in lambdas:
    print(f"λ = {lambda_val:6.3f} ", end="")
    
    # Cross-validation scores
    cv_scores = []
    
    for train_idx, val_idx in kf.split(X_train_poly):
        X_cv_train = X_train_poly[train_idx]
        y_cv_train = y_train.iloc[train_idx] if hasattr(y_train, 'iloc') else y_train[train_idx]
        X_cv_val = X_train_poly[val_idx]
        y_cv_val = y_train.iloc[val_idx] if hasattr(y_train, 'iloc') else y_train[val_idx]
        
        # Train model
        model = LogisticRegressionRegularized(
            learning_rate=0.1,
            max_iter=2000,
            lambda_reg=lambda_val,
            random_state=42
        )
        model.fit(X_cv_train, y_cv_train)
        
        # Validate
        val_score = accuracy_score(y_cv_val, model.predict(X_cv_val))
        cv_scores.append(val_score)
    
    mean_cv_score = np.mean(cv_scores)
    std_cv_score = np.std(cv_scores)
    
    # Train on full training set to get train/test accuracy
    model_full = LogisticRegressionRegularized(
        learning_rate=0.1,
        max_iter=2000,
        lambda_reg=lambda_val,
        random_state=42
    )
    model_full.fit(X_train_poly, y_train)
    
    train_acc = accuracy_score(y_train, model_full.predict(X_train_poly))
    test_acc = accuracy_score(y_test, model_full.predict(X_test_poly))
    
    cv_results[lambda_val] = {
        'cv_mean': mean_cv_score,
        'cv_std': std_cv_score,
        'train_acc': train_acc,
        'test_acc': test_acc,
        'model': model_full
    }
    
    print(f"→ CV: {mean_cv_score:.3f} (±{std_cv_score:.3f}), "
          f"Train: {train_acc:.3f}, Test: {test_acc:.3f}")

# Find optimal λ
optimal_lambda = max(cv_results, key=lambda k: cv_results[k]['cv_mean'])
print("\n" + "=" * 70)
print(f"✅ OPTIMAL λ = {optimal_lambda}")
print(f"   Cross-validation score: {cv_results[optimal_lambda]['cv_mean']:.3f}")
print(f"   Test accuracy: {cv_results[optimal_lambda]['test_acc']:.3f}")
print("=" * 70)

### Visualize λ Tuning Curve

In [None]:
# Extract results
lambda_vals = list(cv_results.keys())
cv_means = [cv_results[l]['cv_mean'] for l in lambda_vals]
cv_stds = [cv_results[l]['cv_std'] for l in lambda_vals]
train_accs = [cv_results[l]['train_acc'] for l in lambda_vals]
test_accs = [cv_results[l]['test_acc'] for l in lambda_vals]

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: CV score vs λ (log scale)
ax = axes[0]
ax.errorbar(lambda_vals, cv_means, yerr=cv_stds, fmt='o-', linewidth=2,
            markersize=8, capsize=5, label='CV Score ± Std')
ax.axvline(optimal_lambda, color='red', linestyle='--', linewidth=2,
           label=f'Optimal λ={optimal_lambda}')
ax.set_xscale('symlog', linthresh=0.001)
ax.set_xlabel('Regularization Strength (λ)', fontsize=14)
ax.set_ylabel('Cross-Validation Accuracy', fontsize=14)
ax.set_title('Hyperparameter Tuning: Finding Optimal λ', fontsize=16)
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)

# Plot 2: Train vs Test accuracy
ax = axes[1]
ax.plot(lambda_vals, train_accs, 'o-', linewidth=2, markersize=8, 
        label='Training Accuracy', color='green')
ax.plot(lambda_vals, test_accs, 's-', linewidth=2, markersize=8, 
        label='Test Accuracy', color='red')
ax.axvline(optimal_lambda, color='red', linestyle='--', linewidth=2,
           label=f'Optimal λ={optimal_lambda}')
ax.set_xscale('symlog', linthresh=0.001)
ax.set_xlabel('Regularization Strength (λ)', fontsize=14)
ax.set_ylabel('Accuracy', fontsize=14)
ax.set_title('Bias-Variance Tradeoff', fontsize=16)
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### ✅ Checkpoint Question 4: How does λ affect the bias-variance tradeoff?

A) Increasing λ decreases both bias and variance simultaneously, creating models that fit training data perfectly while also generalizing excellently to test data
B) Increasing λ increases bias and decreases variance, trading some training accuracy for better generalization by constraining model flexibility and preventing overfitting
C) Increasing λ decreases bias and increases variance, allowing the model to fit complex patterns in the training data but at the cost of generalization
D) The value of λ has no effect on bias or variance, it only controls the learning rate convergence speed during gradient descent optimization

<details>
<summary>Click to see answer</summary>

**Answer: B**

**Key Insight:** Regularization strength λ controls the bias-variance tradeoff. Larger λ forces simpler models (higher bias, lower variance), while smaller λ allows more complex models (lower bias, higher variance). The optimal λ balances these two sources of error to minimize total error.

**Detailed Explanation:**

**Bias-Variance Decomposition:**

Total Error = Bias² + Variance + Irreducible Error

- **Bias**: Error from wrong assumptions (model too simple)
- **Variance**: Error from sensitivity to training data (model too complex)
- **Irreducible**: Noise in data (can't be reduced)
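
For squared-error loss, this decomposition can be written out explicitly. The statement below is the standard textbook form under the assumption $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$; the symbols $f$, $\hat{f}$, and $\sigma^2$ are introduced here for illustration only. For the 0-1 loss behind the accuracies reported in this lab, the identity does not hold exactly, so it is best read as intuition rather than something the experiments measure directly.

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible}}$$

Here the expectations are taken over random training sets (and the noise $\varepsilon$), with $\hat{f}$ denoting the model fitted to one such training set.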

**How λ affects bias and variance:**

**λ = 0 (No regularization):**
```
- Bias: LOW (flexible model can fit true pattern)
- Variance: HIGH (fits noise, changes drastically with different training sets)
- Training accuracy: 99% (overfits)
- Test accuracy: 78% (doesn't generalize)
- Decision boundary: Wiggly, complex
```

**λ = 1.0 (Moderate regularization):**
```
- Bias: MEDIUM (some flexibility to fit pattern)
- Variance: MEDIUM (stable across training sets)
- Training accuracy: 92% (doesn't overfit)
- Test accuracy: 88% (generalizes well)
- Decision boundary: Smooth, simple
```

**λ = 100 (Heavy regularization):**
```
- Bias: HIGH (model too constrained, can't fit pattern)
- Variance: LOW (very stable, but consistently wrong)
- Training accuracy: 65% (underfits)
- Test accuracy: 63% (too simple)
- Decision boundary: Nearly linear
```

**Visual representation:**
```
Error ^
      |
      |  \                                    /
      |   \          Total Error             /
      |    '-._                          _.-'
      |        ''--..______________..--''
      |
      |  Variance                       Bias²
      |  (falls with λ)        (rises with λ)
      |    '--..__                __..--'
      |           ''--..____..--''
      |           __..--''  ''--..__
      |    __..--''                 ''--..__
      +------------------+---------------------> λ
      0              optimal λ                 ∞
```

**Why increasing λ increases bias:**

Larger λ penalizes weights more heavily, forcing them toward zero:
- Reduces model capacity to fit complex patterns
- Creates simpler, more restricted decision boundaries
- May not capture the true underlying pattern (systematic error)

Example: True boundary is a circle, but high λ forces nearly linear boundary → bias!

**Why increasing λ decreases variance:**

Larger λ constrains how much weights can change:
- Model is less sensitive to specific training examples
- Small changes in training data → small changes in learned weights
- More stable across different random samples

Example: With high λ, adding/removing a few training points barely changes the boundary → low variance!
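
The variance claim can be probed empirically by refitting on bootstrap resamples and measuring how much the learned weights move. This is a rough sketch, not the lab's method: it uses scikit-learn's `LogisticRegression` with `C = 1/λ` in place of the from-scratch class, synthetic `make_circles` data, and a hypothetical helper `weight_spread`.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption; not the lab's dataset)
X_demo, y_demo = make_circles(n_samples=200, noise=0.2, factor=0.5, random_state=0)

def weight_spread(lam, n_resamples=20, seed=0):
    """Mean per-weight variance of fitted weights across bootstrap resamples."""
    rng = np.random.default_rng(seed)   # same resamples for every lam
    fitted = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(X_demo), size=len(X_demo))  # sample with replacement
        model = make_pipeline(
            PolynomialFeatures(degree=6, include_bias=False),
            StandardScaler(),
            LogisticRegression(C=1.0 / lam, max_iter=5000),
        )
        model.fit(X_demo[idx], y_demo[idx])
        fitted.append(model[-1].coef_.ravel())
    return float(np.var(np.array(fitted), axis=0).mean())

spread_weak = weight_spread(lam=0.01)      # light regularization
spread_strong = weight_spread(lam=100.0)   # heavy regularization
print(f"weight variance  light: {spread_weak:.4f}   heavy: {spread_strong:.6f}")
```

A smaller spread under heavy regularization means the fitted model depends less on which particular points landed in the training set, which is what "low variance" refers to.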

**Finding optimal λ:**

```
λ too small → overfit (low bias, high variance)
   ├─ Training accuracy very high
   └─ Test accuracy low

λ optimal → balanced (medium bias, medium variance)
   ├─ Training accuracy good
   └─ Test accuracy good (minimizes total error)

λ too large → underfit (high bias, low variance)
   ├─ Training accuracy low
   └─ Test accuracy low
```

**Mathematical intuition:**

Without regularization: $\min_w J(w)$
- Solution can use large weights
- Fits training data closely (low bias)
- Changes drastically with different data (high variance)

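With regularization: $\min_w [J(w) + \frac{\lambda}{2}||w||^2]$ (the factor $\frac{1}{2}$ is a convention that simplifies the penalty gradient to $\lambda w$; it only rescales $\lambda$ relative to the $\lambda||w||^2$ form)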
- Solution constrained to small weights
- Can't fit training data as closely (higher bias)
- More stable across datasets (lower variance)

**Why other answers are incorrect:**

- **A is FALSE**: This is the "free lunch" scenario, which tuning λ cannot deliver. Increasing λ moves along a tradeoff: it accepts some bias (higher training error) in exchange for lower variance (less sensitivity to the training data). It cannot lower both at once. If it could, we'd always use maximum regularization!

- **C is FALSE**: This is backwards. Increasing λ DECREASES model complexity (by penalizing large weights), which INCREASES bias and DECREASES variance. The description in C (decreases bias, increases variance) is what happens when you DECREASE λ or add more features without regularization.

- **D is FALSE**: λ fundamentally affects model capacity and behavior, not optimization speed. The learning rate α controls gradient descent convergence speed, while λ controls the bias-variance tradeoff. These are independent hyperparameters with different purposes. You can have λ=0 (no regularization) with any learning rate, and λ=10 (heavy regularization) with any learning rate.

</details>

## Visualize Decision Boundaries: Effect of Regularization

In [None]:
# Select interesting λ values to visualize
lambdas_to_plot = [0, 0.01, 0.1, 1.0, 10.0, 50.0]

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

# Data ranges for mesh
x1_min, x1_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
x2_min, x2_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5

for idx, lambda_val in enumerate(lambdas_to_plot):
    ax = axes[idx]
    
    # Get or train model
    if lambda_val in cv_results:
        model = cv_results[lambda_val]['model']
        test_acc = cv_results[lambda_val]['test_acc']
    else:
        model = LogisticRegressionRegularized(
            learning_rate=0.1, max_iter=2000, lambda_reg=lambda_val, random_state=42
        )
        model.fit(X_train_poly, y_train)
        test_acc = accuracy_score(y_test, model.predict(X_test_poly))
    
    # Create mesh
    xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 200),
                            np.linspace(x2_min, x2_max, 200))
    X_mesh = np.c_[xx1.ravel(), xx2.ravel()]
    X_mesh_poly = poly.transform(X_mesh)
    X_mesh_poly = scaler.transform(X_mesh_poly)
    probs_mesh = model.predict_proba(X_mesh_poly)[:, 1].reshape(xx1.shape)
    
    # Plot contours
    ax.contourf(xx1, xx2, probs_mesh, levels=20, cmap='RdBu_r', alpha=0.6)
    ax.contour(xx1, xx2, probs_mesh, levels=[0.5], colors='black', linewidths=2.5)
    
    # Plot data
    ax.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
               c='orange', edgecolors='k', s=40, alpha=0.7, label='Class 0 (train)')
    ax.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
               c='skyblue', edgecolors='k', s=40, alpha=0.7, label='Class 1 (train)')
    
    # Determine if this is optimal
    is_optimal = "✅ OPTIMAL" if lambda_val == optimal_lambda else ""

    ax.set_title(f'λ={lambda_val}, Test Acc={test_acc:.2f} {is_optimal}',
                 fontsize=13, fontweight='bold' if lambda_val == optimal_lambda else 'normal')
    ax.set_xlabel('$x_1$', fontsize=11)
    ax.set_ylabel('$x_2$', fontsize=11)
    ax.axis('equal')
    ax.set_xlim(x1_min, x1_max)
    ax.set_ylim(x2_min, x2_max)
    if idx == 0:
        ax.legend(fontsize=9, loc='upper right')

plt.tight_layout()
plt.show()

### ✅ Checkpoint Question 5: What happens to the decision boundary as λ increases?

A) The decision boundary becomes more complex and wiggly with more curves and details as the model fits individual training points more precisely
B) The decision boundary becomes smoother and simpler, gradually approaching a nearly linear boundary as regularization constrains the weights toward zero
C) The decision boundary remains completely unchanged regardless of lambda value because regularization only affects training speed not the final model shape
D) The decision boundary becomes more circular and fits the true pattern better as regularization helps the model learn the underlying circular distribution

<details>
<summary>Click to see answer</summary>

**Answer: B**

**Key Insight:** Increasing λ constrains weights toward smaller values, which reduces the contribution of high-degree polynomial terms. This makes the decision boundary progressively simpler and smoother. At very high λ, most polynomial terms are suppressed, and the boundary approaches linear.

**Detailed Explanation:**

**Why boundaries become smoother with larger λ:**

Recall our decision boundary equation with degree 6 polynomial:
$$w_0 + w_1x_1 + w_2x_2 + w_3x_1^2 + w_4x_1x_2 + w_5x_2^2 + \ldots + w_{27}x_2^6 = 0$$

**λ = 0 (No regularization):**
```
w = [0.5, 2.1, -1.8, 15.3, -12.7, 18.9, ..., -45.2, 38.7]
         (linear)    (quadratic)        (high-order)
              ↑            ↑                  ↑
           small       medium            LARGE!
```
- High-degree terms have large weights
- Decision boundary: $2.1x_1 - 1.8x_2 + 15.3x_1^2 - 12.7x_1x_2 + ... - 45.2x_1^3x_2^3 + ... = 0$
- Result: Complex, wiggly boundary that bends around individual training points

**λ = 1.0 (Moderate regularization):**
```
w = [0.5, 2.0, -1.7, 3.2, -2.8, 3.5, ..., -0.8, 0.6]
         (linear)   (quadratic)      (high-order)
              ↑            ↑                ↑
          preserved    preserved        suppressed
```
- High-degree terms shrunk toward zero
- Decision boundary: $2.0x_1 - 1.7x_2 + 3.2x_1^2 - 2.8x_1x_2 + 3.5x_2^2 + \text{(tiny terms)} \approx 0$
- Result: Smooth circular/elliptical boundary (dominated by quadratic terms)

**λ = 100 (Heavy regularization):**
```
w = [0.5, 1.8, -1.6, 0.3, -0.2, 0.3, ..., -0.01, 0.008]
         (linear)    (quadratic)       (high-order)
              ↑            ↑                ↑
         dominant      weak           negligible
```
- All polynomial terms heavily suppressed
- Decision boundary: $1.8x_1 - 1.6x_2 + \text{(negligible terms)} \approx 0$
- Result: Nearly linear boundary (underfits circular pattern)

**Visual progression:**
```
λ=0:                λ=1:              λ=100:
  O O O              O O O             O O O
O  ⟲  O            O       O         O   |   O
O  ⟲ → O     →    O    O    O   →   O   |   O
 O ← ⟲ O           O  O O  O         O  | O O
  O O O              O O O             O O O
(wiggly)          (smooth circle)    (almost line)
```

**Mathematical explanation:**

Regularized loss: $J = \text{NLL} + \frac{\lambda}{2}\sum w_j^2$

To minimize this, the model must balance:
1. Fitting training data (low NLL)
2. Keeping weights small (low $\sum w_j^2$)

As λ increases:
- The penalty term dominates
- Weights are forced toward zero
- Weights that contribute little to reducing the NLL (here, typically those on high-degree polynomial terms) shrink the most
- Only the most important features (typically low-degree) retain significant weights

**Effect on decision boundary curvature:**

The "wiggliness" of a boundary is determined by high-degree terms:
- $x^6$ terms can create up to 5 bends in the curve
- $x^2$ terms create smooth curves (ellipses, parabolas)
- $x^1$ terms create straight lines

Regularization reduces high-degree contributions → smoother boundaries
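
How much the weights shrink can be inspected directly. The sketch below is an illustration under assumptions: it uses scikit-learn's `LogisticRegression` with `C = 1/λ` rather than the lab's own class, and synthetic `make_circles` data. It prints the overall weight norm together with the norms of the low-degree (1 to 2) and high-degree (4 to 6) groups, using `PolynomialFeatures.powers_` to recover each feature's total degree.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption; not the lab's dataset)
X_demo, y_demo = make_circles(n_samples=300, noise=0.2, factor=0.5, random_state=0)
poly_demo = PolynomialFeatures(degree=6, include_bias=False)
Z = StandardScaler().fit_transform(poly_demo.fit_transform(X_demo))
degrees = poly_demo.powers_.sum(axis=1)   # total degree of each polynomial feature

norms = []
for lam in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=1.0 / lam, max_iter=5000).fit(Z, y_demo)
    w = clf.coef_.ravel()
    norms.append(np.linalg.norm(w))
    low = np.linalg.norm(w[degrees <= 2])    # weights on degree 1-2 terms
    high = np.linalg.norm(w[degrees >= 4])   # weights on degree 4-6 terms
    print(f"lambda={lam:>6}: ||w||={norms[-1]:.3f}, "
          f"low-degree={low:.3f}, high-degree={high:.3f}")
```

The overall norm falls as λ grows. How the shrinkage is split between low- and high-degree terms depends on the data, so the per-group numbers are worth reading rather than assuming.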

**Practical observation from plots (illustrative values; your numbers will differ):**
```
λ     Boundary Description              Test Acc
---   --------------------------------  ---------
0     Wiggly, overfits training noise   0.78
0.1   Smooth circle, captures pattern   0.87
1.0   Clean ellipse, generalizes well   0.88 ← optimal
10    Slightly curved                   0.85
100   Nearly linear, underfits          0.70
```

**Why other answers are incorrect:**

- **A is FALSE**: This is the opposite of what happens. Increasing λ makes boundaries LESS complex, not more. Complex, wiggly boundaries occur with LOW λ (or λ=0) when the model overfits. The confusion might arise from thinking "more regularization = more complex," but regularization actually reduces complexity.

- **C is FALSE**: Regularization fundamentally changes the optimization objective, which directly affects the learned weights and therefore the decision boundary shape. It's not just about training speed. With λ=0, the model minimizes NLL; with λ>0, it minimizes NLL + λ||w||². These produce different optimal solutions and thus different boundaries.

- **D is FALSE**: While moderate regularization (λ ≈ 1) helps fit the circular pattern, INCREASING λ beyond optimal makes it worse. Very high λ (e.g., λ=100) produces nearly linear boundaries that cannot capture the circular distribution. The relationship between λ and fit quality is not monotonic - there's an optimal value, and going too high hurts performance.

</details>

## Comparison: L1 vs L2 Regularization

We've focused on L2 (Ridge) regularization. Let's compare it with L1 (Lasso) regularization to understand when to use each.

### ✅ Checkpoint Question 6: What is the key difference between L1 and L2 regularization?

A) L1 adds the absolute values of weights to the loss function and can drive weights to exactly zero performing feature selection, while L2 adds squared weights and only shrinks weights toward zero without making them exactly zero
B) L1 regularization is used only for regression problems and linear models, while L2 regularization is exclusively designed for classification tasks with logistic regression
C) L1 uses gradient descent optimization for training while L2 uses a closed-form analytical solution, making L2 much faster to compute for large datasets
D) L1 and L2 are identical in their mathematical formulation and produce the same learned weights, they only differ in their implementation details and computational cost

<details>
<summary>Click to see answer</summary>

**Answer: A**

**Key Insight:** L1 and L2 regularization differ fundamentally in how they penalize weights. L1 uses absolute values ($\sum |w_j|$), which produces sparse solutions (many exactly-zero weights), while L2 uses squared values ($\sum w_j^2$), which produces small but non-zero weights. This difference makes L1 useful for feature selection and L2 useful for general overfitting prevention.

**Detailed Explanation:**

**L2 Regularization (Ridge):**
$$J_{\text{L2}} = J_{\text{NLL}} + \frac{\lambda}{2}\sum_{j=1}^{d} w_j^2$$

Gradient: $\nabla R(w) = \lambda w$

- Penalty is smooth and differentiable
- Gradient is proportional to weight magnitude
- Weights shrink toward zero but rarely reach exactly zero
- All features remain in the model with small weights

**L1 Regularization (Lasso):**
$$J_{\text{L1}} = J_{\text{NLL}} + \lambda \sum_{j=1}^{d} |w_j|$$

Gradient: $\nabla R(w) = \lambda \cdot \text{sign}(w)$

- Penalty has a "corner" at zero (not differentiable at zero)
- Gradient magnitude is constant (+λ or -λ for any nonzero weight)
- Some weights are driven to exactly zero
- Performs automatic feature selection

**Visual comparison of penalty functions:**
```
Penalty
   ^
   |    L2: w² (smooth, flat at the bottom)
   |  \                 /
   |   \               /
   |    '.           .'
   |      '-._____.-'
   +-----------+-----------> w
               0

Penalty
   ^
   |    L1: |w| (corner at zero)
   |  \                 /
   |    \             /
   |      \         /
   |        \     /
   |          \ /
   +-----------+-----------> w
               0
```

**Example with numbers:**

Consider a model with 5 features, λ = 1:

**Without regularization:**
```
w = [2.0, 5.0, -3.0, 0.5, -4.5]
All features used
```

**With L2 (Ridge):**
```
w = [0.8, 2.1, -1.2, 0.2, -1.8]
All weights shrunk, but all non-zero
All 5 features still in model
```

**With L1 (Lasso):**
```
w = [0.0, 3.2, -1.8, 0.0, -2.1]
Some weights exactly zero!
Only 3 features used (feature selection)
```

**Why L1 produces sparsity:**

The L1 gradient is constant ($\pm \lambda$) rather than proportional to weight size:
- Small weight (say $w = 0.1$): L1 gradient = λ (e.g., 1.0) → weight easily pushed to zero
- Same weight with L2: L2 gradient = λw = 0.1 → tiny push, weight stays non-zero
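
The contrast between a constant push and a proportional push can be seen by applying each penalty on its own for a few steps. This sketch makes one substitution that should be named plainly: for L1 it uses the soft-thresholding (proximal) step rather than a raw subgradient step, since that is the usual way to land on exactly zero without overshooting. The function names and the values of `alpha` and `lam` are hypothetical.

```python
import numpy as np

def l2_shrink(w, alpha, lam):
    """One gradient step on (lam/2) * w^2 alone: multiplicative shrinkage."""
    return w * (1.0 - alpha * lam)

def l1_soft_threshold(w, alpha, lam):
    """One proximal step on lam * |w| alone: reduce |w| by alpha*lam, clip at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - alpha * lam, 0.0)

w_l1 = w_l2 = np.array([2.0, 0.1, -0.05])   # one large and two small weights
alpha, lam = 0.1, 1.0

for _ in range(5):
    w_l1 = l1_soft_threshold(w_l1, alpha, lam)
    w_l2 = l2_shrink(w_l2, alpha, lam)

print("L1:", w_l1)   # the two small weights have been clipped to zero
print("L2:", w_l2)   # every weight is smaller, none is zero
```

L1 removes the same absolute amount from every weight per step, so small weights hit zero quickly. L2 removes a fixed fraction, so weights decay geometrically without reaching zero.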

**Geometric interpretation:**

Regularization constrains weights to a region:
```
L2 constraint: w₁² + w₂² ≤ C    L1 constraint: |w₁| + |w₂| ≤ C
(circle)                        (diamond)

  w₂                             w₂
   |                              |  
   |     ___                      |    /\
   |   /     \                    |   /  \
   |  |   *   |  ← optimal        |  /    \
   | |       | |                  | /   *  \ ← optimal at
   |  \     /                     |/        \   corner (w₂=0)
   |   \___ /                     +-----------
   +------------ w₁               +---------- w₁
```

The L1 diamond has corners on the axes, making it likely that the optimal point lands on an axis (one weight = 0).

**When to use each:**

**Use L2 (Ridge) when:**
- You want to use all features with reduced influence
- Features are all relevant
- You have multicollinearity (correlated features)
- General overfitting prevention

**Use L1 (Lasso) when:**
- You want automatic feature selection
- You believe many features are irrelevant
- You need a sparse, interpretable model
- You have more features than samples

**Practical example:**

Gene expression analysis with 10,000 genes:
- L2: All 10,000 genes used with small weights (hard to interpret)
- L1: Only 50 genes have non-zero weights (identifies important genes!)
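
A small-scale analogue of this effect can be reproduced with scikit-learn. The data below is synthetic (30 features, 5 informative) and the value `C=0.05` is an arbitrary choice for illustration; none of it comes from a real gene-expression study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 features, only 5 informative (assumed for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=5, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Same regularization strength (C = 1/lambda), different penalty type
l2_model = LogisticRegression(penalty="l2", C=0.05, max_iter=5000).fit(X_demo, y_demo)
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X_demo, y_demo)

n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
print(f"exactly-zero weights  L2: {n_zero_l2}/30   L1: {n_zero_l1}/30")
```

The features that keep non-zero weights under L1 form the selected subset, and can be listed with `np.flatnonzero(l1_model.coef_.ravel())`.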

**Elastic Net (Bonus):**

Combines both: $J = J_{\text{NLL}} + \lambda_1 \sum |w_j| + \frac{\lambda_2}{2}\sum w_j^2$
- Gets sparsity from L1
- Gets stability from L2
- Best of both worlds!

**Why other answers are incorrect:**

- **B is FALSE**: Both L1 and L2 can be used for both regression and classification. L1 regularization works with linear regression (Lasso regression), logistic regression (L1-regularized logistic regression), and other models. L2 works with the same models (Ridge regression, L2-regularized logistic regression). The choice between L1 and L2 is about sparsity vs shrinkage, not about the task type.

- **C is FALSE**: Both L1 and L2 regularization typically use gradient descent (or variants like SGD, Adam) for optimization in logistic regression. Neither has a closed-form solution for logistic regression because the NLL loss is non-linear. In linear regression, L2 (Ridge) has a closed form: $w = (X^TX + \lambda I)^{-1}X^Ty$, but L1 (Lasso) still requires iterative methods. The optimization method is not the key distinguishing feature.

- **D is FALSE**: L1 and L2 are fundamentally different in their mathematical formulation and produce very different results. They differ in:
  1. Penalty term: $\sum |w_j|$ vs $\sum w_j^2$
  2. Gradient: $\lambda \cdot \text{sign}(w)$ vs $\lambda w$
  3. Solution: Sparse (many zeros) vs Dense (all non-zero)
  4. Geometry: Diamond constraint vs Circular constraint
  
  These are not just "implementation details" - they lead to qualitatively different models.
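
The ridge closed form quoted under option C can be verified numerically. This is a minimal sketch on toy linear-regression data (`X_lin`, `y_lin`, and `lam` are invented for illustration). It solves the linear system instead of forming the inverse, then checks that the gradient of $||Xw - y||^2 + \lambda||w||^2$ vanishes at the solution.

```python
import numpy as np

rng = np.random.default_rng(0)
X_lin = rng.normal(size=(50, 4))                          # toy features (assumed)
y_lin = X_lin @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50)
lam = 1.0

# Closed form: solve (X^T X + lam * I) w = X^T y
A = X_lin.T @ X_lin + lam * np.eye(X_lin.shape[1])
w_ridge = np.linalg.solve(A, X_lin.T @ y_lin)

# Unregularized least squares for comparison
w_ols = np.linalg.lstsq(X_lin, y_lin, rcond=None)[0]

# Gradient of ||Xw - y||^2 + lam * ||w||^2 should be ~0 at w_ridge
grad = 2 * X_lin.T @ (X_lin @ w_ridge - y_lin) + 2 * lam * w_ridge
print("max |gradient|:", np.abs(grad).max())
print("||w_ridge|| =", np.linalg.norm(w_ridge), " ||w_ols|| =", np.linalg.norm(w_ols))
```

No such formula exists for logistic regression, which is why the lab's implementation relies on gradient descent.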

</details>

### Side-by-Side Comparison Table

| Aspect | L1 (Lasso) | L2 (Ridge) |
|--------|------------|------------|
| **Penalty term** | $\lambda \sum \lvert w_j \rvert$ | $\frac{\lambda}{2} \sum w_j^2$ |
| **Gradient** | $\lambda \cdot \text{sign}(w)$ | $\lambda w$ |
| **Effect on weights** | Drives some to **exactly zero** | Shrinks all **toward zero** |
| **Feature selection** | ✅ Yes (automatic) | ❌ No (keeps all features) |
| **Solution sparsity** | Sparse (many zeros) | Dense (all non-zero) |
| **When to use** | Feature selection needed | All features relevant |
| **Multicollinearity** | Picks one feature arbitrarily | Shrinks all correlated features |
| **Interpretability** | ✅ High (few features) | ⚠️ Medium (many small weights) |
| **Computational cost** | Higher (non-smooth) | Lower (smooth, differentiable) |
| **Best for** | High-dimensional with irrelevant features | General overfitting prevention |

### Sklearn Comparison

```python
from sklearn.linear_model import LogisticRegression

# L2 regularization (default)
model_l2 = LogisticRegression(penalty='l2', C=1.0)  # C = 1/λ

# L1 regularization  
model_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)

# Elastic Net (L1 + L2)
model_elastic = LogisticRegression(penalty='elasticnet', solver='saga', 
                                   C=1.0, l1_ratio=0.5)  # 0.5 = equal L1/L2
```

## Comparison with scikit-learn

Let's validate our implementation against scikit-learn's professional implementation.

In [None]:
print("=" * 70)
print("Comparing with scikit-learn")
print("=" * 70)

# Use degree 6 polynomial with optimal λ
degree = 6
lambda_optimal = optimal_lambda

# Our model
our_model = cv_results[lambda_optimal]['model']
our_train_acc = cv_results[lambda_optimal]['train_acc']
our_test_acc = cv_results[lambda_optimal]['test_acc']

# Sklearn model (C = 1/λ)
sklearn_model = SklearnLogisticRegression(
    penalty='l2',
    C=1.0/lambda_optimal if lambda_optimal > 0 else 1e10,  # C is inverse of λ
    max_iter=2000,
    random_state=42
)
sklearn_model.fit(X_train_poly, y_train)
sklearn_train_acc = sklearn_model.score(X_train_poly, y_train)
sklearn_test_acc = sklearn_model.score(X_test_poly, y_test)

print(f"\nUsing degree {degree} polynomial with λ={lambda_optimal}")
print("\nOur Implementation:")
print(f"  Train Accuracy: {our_train_acc:.4f}")
print(f"  Test Accuracy:  {our_test_acc:.4f}")

print("\nscikit-learn:")
print(f"  Train Accuracy: {sklearn_train_acc:.4f}")
print(f"  Test Accuracy:  {sklearn_test_acc:.4f}")

print("\nDifference:")
print(f"  Train: {abs(our_train_acc - sklearn_train_acc):.4f}")
print(f"  Test:  {abs(our_test_acc - sklearn_test_acc):.4f}")

print("\n✅ If differences are small (<0.02), the implementation is likely correct!")
print("=" * 70)

## Best Practices and Tips

### 1. Always Use Regularization with Polynomial Features
- Polynomial features create many features → high risk of overfitting
- Start with moderate λ (e.g., 1.0) and tune via cross-validation
- Monitor train vs test accuracy to detect overfitting

### 2. Feature Scaling is Critical
- Always standardize features before applying regularization
- Regularization penalizes all weights equally
- Without scaling, features with large scales dominate
```python
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

### 3. Don't Regularize the Bias Term
- The intercept/bias should not be penalized
- It just shifts the decision boundary
- In our implementation: `gradient[0]` excludes regularization
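
A minimal sketch of what excluding the bias looks like in a gradient function. The layout is an assumption: the bias is stored in `w[0]` and `Phi` carries a leading column of ones, as the `gradient[0]` remark suggests. The penalty follows the $\frac{\lambda}{2}\sum w_j^2$ form used in the summary. `regularized_gradient` is a hypothetical name, not the lab's actual method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_gradient(Phi, y, w, lam):
    """Gradient of NLL + (lam/2) * sum(w[1:]**2); the bias w[0] is not penalized."""
    gradient = Phi.T @ (sigmoid(Phi @ w) - y)
    gradient[1:] += lam * w[1:]   # penalty gradient skips index 0 (the bias)
    return gradient

rng = np.random.default_rng(0)
Phi = np.c_[np.ones(10), rng.normal(size=(10, 2))]   # leading ones column -> bias at w[0]
y = (rng.random(10) < 0.5).astype(float)             # toy labels (assumed)
w = np.array([0.3, -1.0, 2.0])

g_plain = regularized_gradient(Phi, y, w, lam=0.0)
g_reg = regularized_gradient(Phi, y, w, lam=5.0)

print("bias gradient unchanged:", g_plain[0] == g_reg[0])
print("extra gradient on other weights:", g_reg[1:] - g_plain[1:])
```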

### 4. Use Cross-Validation for λ Selection
- Test multiple λ values: [0.001, 0.01, 0.1, 1, 10, 100]
- Use k-fold CV on training data only
- Never use test set for hyperparameter tuning

### 5. Start with L2, Consider L1 for Feature Selection
- **L2 (Ridge)**: Default choice, works well in most cases
- **L1 (Lasso)**: Use when you need sparse models
- **Elastic Net**: Combines L1 + L2 benefits

### 6. Early Stopping as Alternative
- Stop training when validation error starts increasing
- Prevents overfitting without modifying loss
- Common in neural networks

### 7. Regularization vs More Data
- Regularization: Works with fixed dataset
- More data: Best solution but often not available
- Use both when possible!

### 8. Debugging Checklist
- ✅ Training accuracy much higher than test? → Increase λ
- ✅ Both accuracies low? → Decrease λ or increase model complexity
- ✅ Loss increasing? → Decrease learning rate
- ✅ Weights exploding? → Increase λ or decrease learning rate

## Summary

In this lab, you:

1. ✅ **Understood overfitting** by observing high training accuracy but low test accuracy with high-degree polynomials
2. ✅ **Implemented L2 regularization** from scratch by adding penalty term to loss and gradient
3. ✅ **Tuned hyperparameter λ** using k-fold cross-validation to find the optimal regularization strength
4. ✅ **Visualized the bias-variance tradeoff** and saw how λ controls model complexity
5. ✅ **Compared L1 vs L2 regularization** and understood when to use each technique
6. ✅ **Applied regularization to real problems** with non-linear decision boundaries
7. ✅ **Validated implementation** against scikit-learn's professional implementation

### Key Takeaways

**The Overfitting Problem:**
- Complex models (high-degree polynomials) can perfectly fit training data
- But they memorize noise and fail on test data
- Symptom: Training accuracy >> Test accuracy

**Regularization Solution:**
- Add penalty term to loss: $J_{\text{total}} = J_{\text{NLL}} + \lambda R(\vec{w})$
- Forces model to balance data fit with weight magnitude
- Results in simpler, more generalizable models

**L2 (Ridge) Regularization:**
- Penalty: $\frac{\lambda}{2}\sum w_j^2$
- Gradient: $\lambda w$
- Shrinks all weights toward zero
- Best for general overfitting prevention

**L1 (Lasso) Regularization:**
- Penalty: $\lambda \sum |w_j|$
- Gradient: $\lambda \cdot \text{sign}(w)$
- Drives some weights to exactly zero
- Best for feature selection

**Hyperparameter λ:**
- λ = 0: No regularization (may overfit)
- λ small: Light regularization
- λ optimal: Balanced (best generalization)
- λ large: Heavy regularization (may underfit)
- Use cross-validation to find optimal λ

**Bias-Variance Tradeoff:**
- Increasing λ → Increases bias, Decreases variance
- Decreasing λ → Decreases bias, Increases variance
- Optimal λ minimizes total error

### Next Steps

- Try regularization on real-world datasets (e.g., breast cancer, spam detection)
- Implement L1 (Lasso) regularization
- Explore Elastic Net (L1 + L2 combined)
- Apply to neural networks (weight decay)
- Learn about other regularization techniques:
  - Dropout
  - Early stopping
  - Data augmentation
  - Batch normalization

### Further Reading

- [Regularization in Machine Learning - scikit-learn User Guide](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression-and-classification)
- [Understanding the Bias-Variance Tradeoff - Andrew Ng's Course](http://www.andrewng.org/)
- [L1 and L2 Regularization Methods - Towards Data Science](https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c)