# 1.1 Introduction to Regularization in Machine Learning

### **Table of Contents**

<div style="overflow-x: auto;">

- [Introduction](#scrollTo=intro)
- [1. The Overfitting Problem](#scrollTo=section1)
  - [1.1 What is Overfitting?](#scrollTo=section1_1)
  - [1.2 Detecting Overfitting](#scrollTo=section1_2)
- [2. Regularization: The Solution](#scrollTo=section2)
  - [2.1 The Intuition Behind Regularization](#scrollTo=section2_1)
  - [2.2 The Bias-Variance Trade-off](#scrollTo=section2_2)
- [3. Types of Regularization](#scrollTo=section3)
  - [3.1 L2 Regularization (Ridge)](#scrollTo=section3_1)
  - [3.2 L1 Regularization (Lasso)](#scrollTo=section3_2)
  - [3.3 ElasticNet: Combining L1 and L2](#scrollTo=section3_3)
- [4. Regularization in the Context of Logistic Regression](#scrollTo=section4)
- [5. Summary](#scrollTo=section5)

</div>

## Introduction

In Course 2, we built logistic regression models to predict student departure. While these models performed well, we used `penalty=None`, meaning we did not apply any regularization. In this module, we explore **regularization**: a technique that can improve performance on unseen data by reducing overfitting and, with some penalties, by performing automatic feature selection.

This notebook introduces the concepts behind regularization. In subsequent notebooks, we will implement regularized logistic regression models and compare their performance to our baseline.

### Learning Objectives

By the end of this notebook, you will be able to:

1. Explain what overfitting is and why it's problematic
2. Understand the bias-variance trade-off
3. Describe how regularization addresses overfitting
4. Differentiate between L1, L2, and ElasticNet regularization

## 1. The Overfitting Problem

### 1.1 What is Overfitting?

**Overfitting** occurs when a model learns the training data too well—including its noise and random fluctuations—rather than learning the underlying patterns. An overfitted model performs excellently on training data but poorly on new, unseen data.

Think of it like a student who memorizes every answer on practice tests but can't generalize to new exam questions. The student has "overfit" to the practice material.

In [None]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Generate synthetic data to illustrate overfitting
np.random.seed(42)
X = np.linspace(0, 10, 20)
y_true = 2 * X + 1  # True relationship
y_noisy = y_true + np.random.normal(0, 3, len(X))  # Add noise

# Create three fits: underfitting, good fit, overfitting
X_smooth = np.linspace(0, 10, 100)

# Underfitting: constant (degree 0)
y_underfit = np.ones_like(X_smooth) * np.mean(y_noisy)

# Good fit: linear (degree 1)
coeffs_good = np.polyfit(X, y_noisy, 1)
y_goodfit = np.polyval(coeffs_good, X_smooth)

# Overfitting: high-degree polynomial (NumPy may warn that this fit is poorly conditioned)
coeffs_overfit = np.polyfit(X, y_noisy, 15)
y_overfit = np.polyval(coeffs_overfit, X_smooth)

# Create subplot figure
fig = make_subplots(rows=1, cols=3, subplot_titles=('Underfitting', 'Good Fit', 'Overfitting'))

# Add data points to all subplots
for col in [1, 2, 3]:
    fig.add_trace(go.Scatter(x=X, y=y_noisy, mode='markers', name='Data', 
                             marker=dict(color='blue', size=8), showlegend=(col==1)), row=1, col=col)

# Add fits
fig.add_trace(go.Scatter(x=X_smooth, y=y_underfit, mode='lines', name='Model', 
                         line=dict(color='red', width=2), showlegend=False), row=1, col=1)
fig.add_trace(go.Scatter(x=X_smooth, y=y_goodfit, mode='lines', name='Model', 
                         line=dict(color='green', width=2), showlegend=False), row=1, col=2)
fig.add_trace(go.Scatter(x=X_smooth, y=y_overfit, mode='lines', name='Model', 
                         line=dict(color='red', width=2), showlegend=False), row=1, col=3)

fig.update_layout(height=400, title_text="The Fitting Spectrum: From Underfitting to Overfitting")
fig.show()

**Interpretation:**

- **Underfitting (Left)**: The model is too simple. It doesn't capture the trend in the data.
- **Good Fit (Center)**: The model captures the underlying pattern without being distracted by noise.
- **Overfitting (Right)**: The model follows every data point, including the noise. It won't generalize well.

### 1.2 Detecting Overfitting

The classic sign of overfitting is a significant gap between training performance and validation/test performance:

| Scenario | Training Performance | Test Performance | Diagnosis |
|:---------|:--------------------|:-----------------|:----------|
| Good fit | High | High | Model generalizes well |
| Overfitting | Very High | Low | Model memorized training data |
| Underfitting | Low | Low | Model is too simple |

In Course 2, we used **cross-validation** to detect this. If a model performs much better on training folds than validation folds, it's likely overfitting.
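The comparison described above can be sketched with NumPy alone. This is a minimal illustration on synthetic linear data, not the Course 2 pipeline: the seed, the sample sizes, the polynomial degrees, and the helper names `make_data` and `mse` are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n points from the noisy linear relationship y = 2x + 1."""
    x = rng.uniform(0, 10, n)
    y = 2 * x + 1 + rng.normal(0, 3, n)
    return x, y

def mse(y, y_hat):
    """Mean squared error between observed and predicted values."""
    return float(np.mean((y - y_hat) ** 2))

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)  # held-out data the model never sees

results = {}
for degree in (0, 1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = mse(y_train, np.polyval(coeffs, x_train))
    test_err = mse(y_test, np.polyval(coeffs, x_test))
    results[degree] = (train_err, test_err)
    print(f"degree {degree}: train MSE = {train_err:.2f}, test MSE = {test_err:.2f}")
```

Because this sketch reports error rather than accuracy, lower is better. Training error can only fall as the degree grows, so the diagnostic signal is the gap: a training error much lower than the held-out error corresponds to the "Overfitting" row of the table.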

## 2. Regularization: The Solution

### 2.1 The Intuition Behind Regularization

**Regularization** adds a penalty term to the loss function that discourages the model from having large coefficient values. The key insight is that complex, overfitted models tend to have large coefficients—they're working hard to fit every data point.

By penalizing large coefficients, we encourage simpler models that generalize better.

**Without regularization (Course 2):**
$$\text{Loss} = \text{Prediction Error}$$

**With regularization:**
$$\text{Loss} = \text{Prediction Error} + \lambda \times \text{Penalty on Coefficients}$$

Where $\lambda$ (lambda) controls the strength of regularization:
- $\lambda = 0$: No regularization (same as Course 2)
- Small $\lambda$: Weak regularization
- Large $\lambda$: Strong regularization (simpler model)
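To make the role of $\lambda$ concrete, here is a small arithmetic sketch. The two "models" and their error values are invented for illustration, and the squared-coefficient penalty is just one possible choice of penalty (the specific penalties are introduced in Section 3).

```python
import numpy as np

def regularized_loss(prediction_error, coefficients, lam):
    """Prediction error plus lam times a squared-coefficient penalty."""
    penalty = float(np.sum(np.square(coefficients)))
    return prediction_error + lam * penalty

# Two hypothetical fitted models (numbers invented for illustration).
simple_model = {"error": 1.0, "coefs": np.array([0.5, -0.5])}
complex_model = {"error": 0.8, "coefs": np.array([3.0, -4.0])}

for lam in (0.0, 0.1):
    loss_simple = regularized_loss(simple_model["error"], simple_model["coefs"], lam)
    loss_complex = regularized_loss(complex_model["error"], complex_model["coefs"], lam)
    winner = "simple" if loss_simple < loss_complex else "complex"
    print(f"lambda={lam}: simple={loss_simple:.2f}, complex={loss_complex:.2f} -> {winner} model preferred")
```

With $\lambda = 0$ the model with the lower prediction error wins regardless of its coefficient sizes. With $\lambda = 0.1$ the large coefficients of the second model cost more than its small advantage in prediction error, so the simpler model is preferred.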

### 2.2 The Bias-Variance Trade-off

Regularization is intimately connected to the **bias-variance trade-off**, a fundamental concept in machine learning.

**Total Prediction Error = Bias² + Variance + Irreducible Noise**

- **Bias**: Error from overly simplistic assumptions. High bias = underfitting.
- **Variance**: Error from sensitivity to fluctuations in training data. High variance = overfitting.

Regularization typically *increases bias* slightly while *decreasing variance*. When the reduction in variance outweighs the added bias, total prediction error falls.
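Bias and variance can also be estimated empirically by refitting the same model on many simulated training sets. The sketch below assumes a known linear truth, Gaussian noise, and polynomial models of degree 0, 1, and 9 as stand-ins for "too simple", "about right", and "too flexible". All of these settings are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y_true = 2 * x + 1          # assumed true relationship
n_repeats = 500             # number of simulated training sets

summary = {}
for degree in (0, 1, 9):
    preds = np.empty((n_repeats, x.size))
    for i in range(n_repeats):
        y_noisy = y_true + rng.normal(0, 3, x.size)   # fresh noise each time
        preds[i] = np.polyval(np.polyfit(x, y_noisy, degree), x)
    # Bias^2: squared gap between the average prediction and the truth.
    bias_sq = np.mean((preds.mean(axis=0) - y_true) ** 2)
    # Variance: how much predictions move from one training set to the next.
    variance = np.mean(preds.var(axis=0))
    summary[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.2f}, variance = {variance:.2f}")
```

In this setup the constant model should show the largest bias, while the degree-9 model should show the largest variance, which mirrors the underfitting and overfitting ends of the spectrum.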

In [None]:
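# Visualize the bias-variance trade-off with stylized curves
# (these shapes are illustrative; they are not measured from a fitted model)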
complexity = np.linspace(0.1, 3, 100)
bias_squared = 1 / complexity
variance = 0.3 * complexity ** 2
total_error = bias_squared + variance + 0.1  # 0.1 is irreducible noise

fig = go.Figure()

fig.add_trace(go.Scatter(x=complexity, y=bias_squared, mode='lines', name='Bias²', 
                         line=dict(color='blue', width=2)))
fig.add_trace(go.Scatter(x=complexity, y=variance, mode='lines', name='Variance', 
                         line=dict(color='orange', width=2)))
fig.add_trace(go.Scatter(x=complexity, y=total_error, mode='lines', name='Total Error', 
                         line=dict(color='red', width=3)))

# Add vertical line at optimal point
optimal_idx = np.argmin(total_error)
fig.add_vline(x=complexity[optimal_idx], line_dash="dash", line_color="green", 
              annotation_text="Optimal Complexity")

fig.update_layout(
    title='The Bias-Variance Trade-off',
    xaxis_title='Model Complexity',
    yaxis_title='Error',
    height=400,
    showlegend=True
)

fig.show()

**Key Insight**: Regularization helps us stay near the optimal complexity by preventing the model from becoming too complex.

## 3. Types of Regularization

### 3.1 L2 Regularization (Ridge)

**L2 regularization** adds a penalty equal to the *sum of squared coefficients*:

$$\text{Loss}_{L2} = \text{Prediction Error} + \lambda \sum_{j=1}^{p} \beta_j^2$$

**Characteristics:**
- Shrinks all coefficients toward zero, but rarely to exactly zero
- Keeps all features in the model
- Works well when many features contribute to prediction
- Stabilizes estimates under multicollinearity by shrinking the coefficients of correlated features toward each other

**Analogy**: Ridge regression is like putting all coefficients on a diet—everyone loses weight, but no one disappears entirely.
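The shrinking behavior can be seen in a minimal NumPy sketch. It assumes a linear model with squared-error loss and no intercept, because that case has a closed-form solution; the logistic regression models in this course are fit iteratively instead. The data, `true_beta`, and the helper `ridge_fit` are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 50, 5
X = rng.normal(size=(n_samples, n_features))
true_beta = np.array([3.0, -2.0, 1.0, 0.5, 0.0])   # invented for illustration
y = X @ true_beta + rng.normal(0, 1.0, n_samples)

def ridge_fit(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2 via the normal equations."""
    identity = np.eye(X.shape[1])
    return np.linalg.solve(X.T @ X + lam * identity, X.T @ y)

norms = []
for lam in (0.0, 1.0, 10.0, 100.0):
    beta = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(beta))
    print(f"lambda={lam:>5}: coefficients = {np.round(beta, 3)}")
```

As $\lambda$ grows, the overall size of the coefficient vector decreases, but the individual coefficients stay nonzero: every feature remains in the model.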

### 3.2 L1 Regularization (Lasso)

**L1 regularization** adds a penalty equal to the *sum of absolute values of coefficients*:

$$\text{Loss}_{L1} = \text{Prediction Error} + \lambda \sum_{j=1}^{p} |\beta_j|$$

**Characteristics:**
- Can shrink coefficients to exactly zero, effectively removing features
- Performs **automatic feature selection**
- Produces sparse models (fewer active features)
- Useful when you suspect only a few features matter

**Analogy**: Lasso is like a reality TV show elimination—weak coefficients get voted off the island.
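Why L1 produces exact zeros while L2 does not is easiest to see in a simplified setting. The sketch below assumes squared-error loss with orthonormal features, where both penalized solutions can be written directly in terms of the unpenalized coefficients: the L1 solution subtracts a fixed amount (a threshold proportional to $\lambda$) from each coefficient's magnitude, while the L2 solution rescales every coefficient by the same factor. The coefficient values and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Lasso-style update: move each value toward zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_shrink(z, lam):
    """Ridge-style update: scale every value by the same factor 1 / (1 + lam)."""
    return z / (1.0 + lam)

ols_coefs = np.array([3.0, -0.4, 0.1, -2.0])   # hypothetical unpenalized coefficients

lasso_coefs = soft_threshold(ols_coefs, t=0.5)
ridge_coefs = ridge_shrink(ols_coefs, lam=0.5)

print("lasso:", lasso_coefs)
print("ridge:", np.round(ridge_coefs, 3))
print("coefficients set to zero by lasso:", int(np.sum(lasso_coefs == 0)))
```

The two small coefficients fall below the threshold and are set to exactly zero by the L1-style update, whereas the L2-style update makes them smaller without removing them.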

### 3.3 ElasticNet: Combining L1 and L2

**ElasticNet** combines both L1 and L2 penalties:

$$\text{Loss}_{ElasticNet} = \text{Prediction Error} + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2$$

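Equivalently, the penalty can be written with a single strength $\lambda$ and a mixing parameter $\alpha$, where $\lambda_1 = \lambda\alpha$ and $\lambda_2 = \lambda(1 - \alpha)$. In scikit-learn's `LogisticRegression`, this mixing parameter is called `l1_ratio`.
- $\alpha = 1$: Pure L1 (Lasso)
- $\alpha = 0$: Pure L2 (Ridge)
- $0 < \alpha < 1$: Mixture of both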

**Characteristics:**
- Combines the sparsity of L1 with the stability of L2 under correlated features
- Can select groups of correlated features (unlike pure Lasso)
- Can outperform pure L1 or pure L2 when the mix is tuned, at the cost of one more hyperparameter
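The mixing idea can be sketched as a single penalty function with an overall strength `lam` and a mixing weight `alpha`. The function name and the coefficient values are hypothetical, and some libraries scale the L2 term by an extra factor of 1/2, so treat the exact form as an assumption.

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    """lam * (alpha * L1 penalty + (1 - alpha) * L2 penalty)."""
    l1 = np.sum(np.abs(beta))
    l2 = np.sum(beta ** 2)
    return float(lam * (alpha * l1 + (1.0 - alpha) * l2))

beta = np.array([1.0, -2.0, 0.5])   # hypothetical coefficients

print(elastic_net_penalty(beta, lam=1.0, alpha=1.0))   # 3.5   (pure L1)
print(elastic_net_penalty(beta, lam=1.0, alpha=0.0))   # 5.25  (pure L2)
print(elastic_net_penalty(beta, lam=1.0, alpha=0.5))   # 4.375 (even mix)
```

Setting `alpha` to 1 or 0 recovers the L1 and L2 penalties from the previous two sections, and values in between interpolate linearly between them.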

### Comparison Table

| Property | L2 (Ridge) | L1 (Lasso) | ElasticNet |
|:---------|:-----------|:-----------|:-----------|
| Penalty Term | Sum of squared coefficients | Sum of absolute coefficients | Both |
| Feature Selection | No | Yes (zeros out coefficients) | Yes |
| Handles Multicollinearity | Well | May arbitrarily select one | Well |
| Sparse Model | No | Yes | Depends on mixing |
| Best When | Many small effects | Few large effects | Groups of correlated features |

In [None]:
# Visualize the geometry of L1 vs L2 regularization
theta = np.linspace(0, 2*np.pi, 100)

# L2 (circle)
x_l2 = np.cos(theta)
y_l2 = np.sin(theta)

# L1 (diamond)
t = np.linspace(0, 1, 25)
x_l1 = np.concatenate([1-t, -t, -1+t, t])
y_l1 = np.concatenate([t, 1-t, -t, -1+t])

fig = make_subplots(rows=1, cols=2, subplot_titles=('L2 (Ridge) Constraint', 'L1 (Lasso) Constraint'))

fig.add_trace(go.Scatter(x=x_l2, y=y_l2, mode='lines', fill='toself', 
                         fillcolor='rgba(0,100,200,0.3)', line=dict(color='blue', width=2),
                         name='L2 Region'), row=1, col=1)

fig.add_trace(go.Scatter(x=x_l1, y=y_l1, mode='lines', fill='toself', 
                         fillcolor='rgba(200,100,0,0.3)', line=dict(color='orange', width=2),
                         name='L1 Region'), row=1, col=2)

# Add axes
for col in [1, 2]:
    fig.add_hline(y=0, line_dash="dash", line_color="gray", row=1, col=col)
    fig.add_vline(x=0, line_dash="dash", line_color="gray", row=1, col=col)

fig.update_layout(height=400, title_text="Geometry of Regularization Constraints",
                  showlegend=False)
fig.update_xaxes(title_text="β₁", range=[-1.5, 1.5])
fig.update_yaxes(title_text="β₂", range=[-1.5, 1.5])

fig.show()

**Geometric Interpretation:**

The regularization constraint defines a region where the coefficients must live, and the fitted solution is the point in that region with the lowest prediction error. The diamond shape of L1 has corners on the axes, where one coefficient is exactly zero, and the solution often lands on one of those corners; this is why L1 naturally produces zero coefficients. The circular L2 constraint has no corners, so solutions rarely hit exactly zero.

## 4. Regularization in the Context of Logistic Regression

In logistic regression, we minimize the negative log-likelihood (or equivalently, maximize log-likelihood). With regularization, this becomes:

$$\text{Loss} = -\log L(\vec{\beta}) + \lambda \cdot \text{Penalty}(\vec{\beta})$$

Where $\log L(\vec{\beta})$ is the log-likelihood from our logistic model.

**In scikit-learn's LogisticRegression:**

```python
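from sklearn.linear_model import LogisticRegression

# No regularization (Course 2); penalty=None requires scikit-learn 1.2 or later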
LogisticRegression(penalty=None)

# L2 regularization (Ridge) - default in scikit-learn
LogisticRegression(penalty='l2', C=1.0)

# L1 regularization (Lasso)
LogisticRegression(penalty='l1', C=1.0, solver='saga')

# ElasticNet
LogisticRegression(penalty='elasticnet', C=1.0, solver='saga', l1_ratio=0.5)
```

**Note on C parameter**: In scikit-learn, `C` is the *inverse* of regularization strength (roughly $C = 1/\lambda$ in the notation above):
- Large C = weak regularization (closer to unregularized model)
- Small C = strong regularization (simpler model)
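The effect of `C` can be checked on a small synthetic problem. This is a sketch under stated assumptions: the data comes from `make_classification` rather than the student departure dataset, the three values of `C` are arbitrary, and the default L2 penalty is used throughout.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the student departure dataset.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X = StandardScaler().fit_transform(X)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Default penalty is L2; only the strength C changes between fits.
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:>6}: ||coefficients|| = {coef_norms[C]:.3f}")
```

The smallest `C` should give the smallest coefficient norm, since it corresponds to the strongest penalty. Features are standardized first because the penalty treats all coefficients alike, so features on very different scales would be penalized unevenly.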

### Why Regularization Matters for Student Departure Prediction

In our student departure prediction problem:

1. **Feature Selection**: L1 regularization can help identify which features (GPA, DFW rate, demographics) are most predictive
2. **Handling Correlations**: L2 and ElasticNet tend to keep correlated features (e.g., GPA_1 and GPA_2) in the model together and share weight between them, rather than arbitrarily dropping one
3. **Improved Generalization**: Regularization often improves performance on new cohorts of students
4. **Interpretability**: Sparser models from L1 are easier to explain to stakeholders

## 5. Summary

In this notebook, we covered:

1. **Overfitting**: When models learn noise instead of signal, they fail to generalize

2. **Bias-Variance Trade-off**: Total error = Bias² + Variance + Irreducible Noise; regularization trades some bias for reduced variance

3. **L2 (Ridge) Regularization**: Shrinks coefficients; keeps all features

4. **L1 (Lasso) Regularization**: Can zero out coefficients; performs feature selection

5. **ElasticNet**: Combines L1 and L2; can produce sparse models while still handling correlated features

### Key Takeaways

| Concept | Remember |
|:--------|:---------|
| Regularization | Adds a penalty that discourages large coefficients |
| L2 (Ridge) | Shrinks but keeps all features |
| L1 (Lasso) | Feature selection via zeroing coefficients |
| ElasticNet | Combines both penalties |
| C parameter | Inverse of regularization strength (small C = more regularization) |

### Next Steps

In the next notebook, we will implement regularized logistic regression models on our student departure dataset and compare their performance to the unregularized baseline from Course 2.

**Proceed to:** `1.2 Build a Regularized Logistic Regression Model`