# 012: Ridge, Lasso & ElasticNet Regression

Regularization adds a **penalty term** to the loss function to constrain model complexity and prevent overfitting.

### 📊 Regularization Concept

```mermaid
graph TD
    A[Standard Loss Function] --> B[Add Penalty Term]
    B --> C{Regularization Type}
    C -->|L2| D[Ridge: Shrink All Coefficients]
    C -->|L1| E[Lasso: Zero Out Features]
    C -->|L1 + L2| F[ElasticNet: Combine Both]
    style D fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style E fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style F fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff
```

### Loss Functions with Penalties

**Ordinary Least Squares (No Regularization):**
$$\mathcal{L}_{OLS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

**Ridge Regression (L2 Penalty):**
$$\mathcal{L}_{Ridge} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} \beta_j^2$$

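**Lasso Regression (L1 Penalty):**
$$\mathcal{L}_{Lasso} = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} |\beta_j|$$

**ElasticNet (L1 + L2):**
$$\mathcal{L}_{ElasticNet} = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \rho \sum_{j=1}^{p} |\beta_j| + \frac{\alpha(1-\rho)}{2} \sum_{j=1}^{p} \beta_j^2$$

The $\frac{1}{2n}$ factor follows the scikit-learn convention for Lasso and ElasticNet, so their $\alpha$ values are not on the same scale as Ridge's.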

Where:
- $\alpha$ = regularization strength (higher → more penalty)
- $\rho$ = L1 ratio for ElasticNet (0=Ridge, 1=Lasso)
- $\beta_j$ = model coefficients
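
As a quick numeric check of the penalty terms, the sketch below evaluates them for a small, made-up coefficient vector. The values of `beta`, `alpha`, and `rho` are illustrative assumptions and are not taken from the dataset used later.

```python
import numpy as np

beta = np.array([3.0, -4.0, 0.0])   # hypothetical coefficients
alpha, rho = 0.1, 0.5               # illustrative strength and L1 ratio

# Ridge term: alpha * sum of squared coefficients
l2_penalty = alpha * np.sum(beta ** 2)

# Lasso term: alpha * sum of absolute coefficients
l1_penalty = alpha * np.sum(np.abs(beta))

# ElasticNet term: weighted mix of the two
enet_penalty = (alpha * rho * np.sum(np.abs(beta))
                + 0.5 * alpha * (1 - rho) * np.sum(beta ** 2))

print(f'L2: {l2_penalty:.3f}, L1: {l1_penalty:.3f}, ElasticNet: {enet_penalty:.3f}')
```

The zero entry contributes nothing to any penalty, while the large entries dominate the L2 term because they are squared.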

### 🎯 Regularization Workflow

```mermaid
graph TD
    A[High-Dimensional Data] --> B{Problem Type?}
    B -->|Multicollinearity| C[Use Ridge]
    B -->|Feature Selection Needed| D[Use Lasso]
    B -->|Both Issues| E[Use ElasticNet]
    C --> F[Tune Alpha via CV]
    D --> F
    E --> F
    F --> G[Train Final Model]
    G --> H[Evaluate on Test Set]
    style E fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style H fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff
```

### When to Use Each Method?

**Ridge (L2):**
- ✅ Many correlated features
- ✅ Want to keep all features (just shrink them)
- ✅ Multicollinearity issues
- ✅ Numerical stability important

**Lasso (L1):**
- ✅ Need automatic feature selection
- ✅ Sparse models preferred (interpretability)
- ✅ Many irrelevant features
- ✅ Storage/computation constraints

**ElasticNet:**
- ✅ Grouped correlated features (keeps groups)
- ✅ More features than samples (p > n)
- ✅ Need both regularization and selection
- ✅ Best of both worlds

### 🏭 Real-World Applications

**Post-Silicon Validation:**
- High-dimensional STDF parameter reduction (1000+ test parameters)
- Correlated test elimination (voltage tests highly correlated)
- Sparse yield modeling (only key tests matter)
- Robust parameter prediction with noise

**General AI/ML:**
- Genomics (millions of genes, few samples)
- Text classification (large vocabulary, sparse features)
- Financial modeling (correlated economic indicators)
- Image processing (high-dimensional pixel data)

---

## 2. Setup and Data Preparation

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, validation_curve
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print('✅ Libraries imported successfully')
print(f'NumPy version: {np.__version__}')
print(f'Pandas version: {pd.__version__}')

### 📝 What's Happening in This Code?

**Purpose:** Import regularization models and tools for hyperparameter tuning

**Key Points:**
- **Ridge, Lasso, ElasticNet**: Three regularization methods with different penalty types
- **RidgeCV, LassoCV, ElasticNetCV**: Built-in cross-validation for alpha tuning (efficient)
- **validation_curve**: Analyze model performance vs alpha values for visualization
- **StandardScaler**: Essential - regularization is scale-sensitive

**Why This Matters:**
- CV versions automatically find best alpha without manual loops
- StandardScaler ensures fair penalization (features on same scale)
- Unified API allows easy comparison between methods

### 2.1 Generate High-Dimensional Dataset with Multicollinearity

### 📝 What's Happening in This Code?

**Purpose:** Create synthetic data that demonstrates regularization benefits

**Key Points:**
- **High dimensionality**: 50 features (some relevant, many irrelevant)
- **Multicollinearity**: Intentionally correlate features to mimic STDF data (voltage tests correlated)
- **Sparse ground truth**: Only 10 features enter the target directly; the other 40 are noisy linear combinations of them (redundant, not independent noise)
- **Controlled experiment**: Known which features matter for validating Lasso selection

**Why This Approach:**
- Mimics real STDF data where parametric tests are highly correlated
- OLS coefficients become unstable under this collinearity and can overfit; regularization is designed for this case
- Can validate feature selection against ground truth

In [None]:
def generate_high_dimensional_data(n_samples=200, n_features=50, n_informative=10, noise=5.0):
    """
    Generate high-dimensional data with multicollinearity
    Simulates STDF-like scenario with many correlated test parameters
    """
    # Generate base informative features
    X_informative = np.random.randn(n_samples, n_informative)
    
    # True coefficients (sparse - only informative features matter)
    true_coef = np.zeros(n_features)
    true_coef[:n_informative] = np.random.randn(n_informative) * 10
    
    # Generate target from informative features
    y = X_informative @ true_coef[:n_informative] + np.random.randn(n_samples) * noise
    
    # Create correlated redundant features (multicollinearity)
    X_redundant = np.zeros((n_samples, n_features - n_informative))
    for i in range(n_features - n_informative):
        # Each redundant feature is linear combination of informative ones
        weights = np.random.randn(n_informative) * 0.5
        X_redundant[:, i] = X_informative @ weights + np.random.randn(n_samples) * 0.1
    
    # Combine features
    X = np.hstack([X_informative, X_redundant])
    
    # Shuffle columns to hide which are informative
    shuffle_idx = np.random.permutation(n_features)
    X = X[:, shuffle_idx]
    true_coef = true_coef[shuffle_idx]
    
    return X, y, true_coef

# Generate dataset
X, y, true_coef = generate_high_dimensional_data(n_samples=200, n_features=50, 
                                                  n_informative=10, noise=5.0)

# Create DataFrame
feature_names = [f'Feature_{i+1}' for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)
df['Target'] = y

print('✅ High-dimensional dataset generated')
print(f'Samples: {X.shape[0]}, Features: {X.shape[1]}')
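n_informative_true = int(np.count_nonzero(true_coef))
print(f'Informative features: {n_informative_true} (hidden in {X.shape[1]} total)')

# Compute the absolute correlation matrix once and reuse it
corr = df.iloc[:, :-1].corr().abs().values
max_corr = corr[np.triu_indices_from(corr, k=1)].max()
print(f'\nFeature correlation matrix shape: {corr.shape}')
print(f'Max absolute correlation: {max_corr:.3f}')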

### 2.2 Train-Test Split

### 📝 What's Happening in This Code?

**Purpose:** Split data and standardize features for regularization

**Key Points:**
- **Split first, then scale**: Prevents data leakage (test set never seen during scaling)
- **StandardScaler critical**: Ridge/Lasso penalize by coefficient magnitude - features must be same scale
- **Fit on train only**: Scaler learns statistics from training data, applies to test
- **80-20 split**: Standard ratio balancing training data and test reliability

**Why This Matters:**
- Without scaling, features with large ranges get under-penalized
- Data leakage (scaling on full data) inflates performance metrics artificially
- Proper workflow ensures production-ready code

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features (critical for regularization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f'✅ Data split and scaled')
print(f'Training samples: {X_train_scaled.shape[0]}')
print(f'Test samples: {X_test_scaled.shape[0]}')
print(f'Features: {X_train_scaled.shape[1]}')
print(f'\nFeature means after scaling (should be ~0): {X_train_scaled.mean(axis=0)[:5]}')
print(f'Feature stds after scaling (should be ~1): {X_train_scaled.std(axis=0)[:5]}')
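
To see why scaling matters, here is a minimal sketch on separate synthetic data (the arrays and the `alpha=10.0` value are assumptions chosen for illustration). Changing the units of one feature leaves OLS predictions unchanged but changes Ridge predictions, because the penalty acts on coefficient size.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Same information, different units for feature 0 (e.g. mV expressed as V)
X_rescaled = X.copy()
X_rescaled[:, 0] *= 0.001

ols_a = LinearRegression().fit(X, y).predict(X)
ols_b = LinearRegression().fit(X_rescaled, y).predict(X_rescaled)
ridge_a = Ridge(alpha=10.0).fit(X, y).predict(X)
ridge_b = Ridge(alpha=10.0).fit(X_rescaled, y).predict(X_rescaled)

ols_gap = np.max(np.abs(ols_a - ols_b))        # OLS is invariant to rescaling
ridge_gap = np.max(np.abs(ridge_a - ridge_b))  # Ridge is not

print(f'Max OLS prediction change:   {ols_gap:.2e}')
print(f'Max Ridge prediction change: {ridge_gap:.2e}')
```

The rescaled feature needs a coefficient 1000 times larger to have the same effect, so the L2 penalty suppresses it almost entirely. Standardizing first removes this dependence on units.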

---

## 3. Mathematical Foundation

### 3.1 Ridge Regression (L2 Regularization)

**Objective function:**
$$\min_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \alpha \sum_{j=1}^{p} \beta_j^2 \right\}$$

**Closed-form solution:**
$$\boldsymbol{\beta}_{ridge} = (\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$$

**Properties:**
- Shrinks all coefficients toward zero (but never exactly zero)
- Works well with correlated features
- Always has a unique solution for $\alpha > 0$ (even when $\mathbf{X}^T \mathbf{X}$ is singular)
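
The uniqueness property can be illustrated with a small sketch, assuming a toy design matrix in which one column is an exact copy of another (an extreme case of multicollinearity). The data and `alpha = 1.0` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
X = np.column_stack([x1, x2, x1])   # third column duplicates the first
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=50)

# X has 3 columns but rank 2, so X^T X (same rank) is singular
rank = np.linalg.matrix_rank(X)

# Adding alpha * I makes the system solvable with a unique solution
alpha = 1.0
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

print(f'Rank of X: {rank} (columns: {X.shape[1]})')
print(f'Ridge coefficients: {beta_ridge}')
```

Ridge splits the weight equally between the two identical columns, which is the shrinkage behavior described above for correlated features.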

### 3.2 Lasso Regression (L1 Regularization)

**Objective function:**
$$\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2n} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \alpha \sum_{j=1}^{p} |\beta_j| \right\}$$

**Properties:**
- Forces some coefficients to **exactly zero** (feature selection)
- No closed-form solution (solved via coordinate descent)
- Tends to pick one feature from correlated groups
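
Coordinate descent for the Lasso objective above can be sketched in a few lines. This is a simplified illustration (no intercept, fixed number of sweeps, hypothetical helper names `soft_threshold` and `lasso_cd`), not scikit-learn's actual implementation; the final line compares it against `Lasso` on the same toy data.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=200):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + alpha * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution removed
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho_j = X[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho_j, alpha) / col_scale[j]
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true_beta = np.array([3.0, 0.0, -2.0, 0.0, 0.0, 0.0])
y = X @ true_beta + rng.normal(scale=0.5, size=200)

beta_cd = lasso_cd(X, y, alpha=0.5)
beta_sk = Lasso(alpha=0.5, fit_intercept=False, tol=1e-10,
                max_iter=10000).fit(X, y).coef_

print('Coordinate descent:', np.round(beta_cd, 4))
print('scikit-learn Lasso:', np.round(beta_sk, 4))
```

The soft-thresholding step is what produces exact zeros: whenever a feature's correlation with the partial residual is below `alpha` in absolute value, its coefficient is set to 0 rather than merely shrunk.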

### 3.3 ElasticNet (L1 + L2)

**Objective function:**
$$\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2n} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \alpha \rho \sum_{j=1}^{p} |\beta_j| + \frac{\alpha(1-\rho)}{2} \sum_{j=1}^{p} \beta_j^2 \right\}$$

**Properties:**
- Combines benefits: feature selection + handling correlated features
- Keeps grouped correlated features together
- $\rho = 0$: Pure Ridge, $\rho = 1$: Pure Lasso
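
The grouping behavior can be seen in an extreme toy case, assuming two features that are exact copies of each other. The data and the `alpha=0.5` setting are illustrative assumptions; real correlated features behave less cleanly.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
z = rng.normal(size=200)
X = np.column_stack([z, z])   # two identical features
y = 3.0 * z + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, tol=1e-8, max_iter=10000).fit(X, y)

lasso_split = abs(lasso.coef_[0] - lasso.coef_[1])
enet_split = abs(enet.coef_[0] - enet.coef_[1])

print('Lasso coefficients:     ', np.round(lasso.coef_, 3))
print('ElasticNet coefficients:', np.round(enet.coef_, 3))
```

With identical columns the L1 penalty alone cannot distinguish between ways of dividing the weight, so the solver concentrates it on one feature. The added L2 term makes the equal split the unique minimum, which is why ElasticNet keeps the group together.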

### 3.4 Geometric Interpretation

```mermaid
graph LR
    A[L2 Penalty<br/>Smooth Circle<br/>Coefficients Shrink Smoothly] --> B[Optimal Point]
    C[L1 Penalty<br/>Diamond Shape<br/>Hits Axes → Zeros] --> B
    style A fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style C fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
```

---

## 4. Implementation from Scratch (Ridge)

Implement Ridge regression to understand the math.

### 📝 What's Happening in This Code?

**Purpose:** Build Ridge regression from scratch using closed-form solution

**Key Points:**
- **Normal equation + penalty**: $(\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$
- **Identity matrix**: $\alpha \mathbf{I}$ adds to diagonal, ensures invertibility
- **No intercept penalty**: Only coefficients penalized (intercept term excluded)
- **Numerical stability**: Ridge makes inversion stable even with multicollinearity

**Why This Matters:**
- Shows regularization as simple modification to OLS
- Explains why Ridge always has solution (even singular matrices)
- Understanding math helps debug production issues

In [None]:
class RidgeRegressionScratch:
    """
    Ridge Regression from scratch using closed-form solution
    """
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.coefficients = None
        self.intercept = None
    
    def fit(self, X, y):
        """
        Fit Ridge regression: β = (X^T X + αI)^(-1) X^T y
        """
        # Add intercept column
        n_samples = X.shape[0]
        X_with_intercept = np.column_stack([np.ones(n_samples), X])
        
        # Create penalty matrix (don't penalize intercept)
        n_features = X_with_intercept.shape[1]
        penalty_matrix = np.eye(n_features) * self.alpha
        penalty_matrix[0, 0] = 0  # No penalty on intercept
        
        # Ridge solution
        XtX = X_with_intercept.T @ X_with_intercept
        Xty = X_with_intercept.T @ y
        
        # Solve the linear system directly; more stable than forming the inverse
        coefficients_all = np.linalg.solve(XtX + penalty_matrix, Xty)
        
        self.intercept = coefficients_all[0]
        self.coefficients = coefficients_all[1:]
        
        return self
    
    def predict(self, X):
        return X @ self.coefficients + self.intercept
    
    def score(self, X, y):
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)

# Train from-scratch Ridge
model_scratch = RidgeRegressionScratch(alpha=1.0)
model_scratch.fit(X_train_scaled, y_train)

# Evaluate
train_r2 = model_scratch.score(X_train_scaled, y_train)
test_r2 = model_scratch.score(X_test_scaled, y_test)
y_pred_scratch = model_scratch.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred_scratch))

print('✅ From-Scratch Ridge Regression (α=1.0)')
print(f'Training R²: {train_r2:.4f}')
print(f'Test R²: {test_r2:.4f}')
print(f'Test RMSE: {rmse:.4f}')
print(f'Non-zero coefficients: {np.sum(np.abs(model_scratch.coefficients) > 0.001)}/{len(model_scratch.coefficients)}')
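
A useful sanity check is to compare the closed-form solution against scikit-learn's `Ridge`. The sketch below is self-contained and uses a hypothetical helper `ridge_closed_form` on separate synthetic data, so it does not depend on the class or variables defined above.

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_closed_form(X, y, alpha):
    """Closed-form Ridge with an unpenalized intercept (illustrative helper)."""
    Xa = np.column_stack([np.ones(len(X)), X])
    penalty = alpha * np.eye(Xa.shape[1])
    penalty[0, 0] = 0.0   # do not penalize the intercept
    coef_all = np.linalg.solve(Xa.T @ Xa + penalty, Xa.T @ y)
    return coef_all[0], coef_all[1:]

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 8))
y = X @ rng.normal(size=8) + 3.0 + rng.normal(scale=0.5, size=120)

intercept, coef = ridge_closed_form(X, y, alpha=1.0)
sk_model = Ridge(alpha=1.0).fit(X, y)

max_coef_diff = np.max(np.abs(coef - sk_model.coef_))
intercept_diff = abs(intercept - sk_model.intercept_)
print(f'Max coefficient difference: {max_coef_diff:.2e}')
print(f'Intercept difference:       {intercept_diff:.2e}')
```

Both differences should be at the level of floating-point round-off, since `Ridge` minimizes the same sum-of-squares objective with an unpenalized intercept.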

---

## 5. Production Implementation with Scikit-learn

### 5.1 Compare OLS vs Ridge vs Lasso vs ElasticNet

### 📝 What's Happening in This Code?

**Purpose:** Systematically compare all four regression methods

**Key Points:**
- **OLS baseline**: No regularization - overfits with high-dimensional data
- **Ridge**: Shrinks coefficients but keeps all features
- **Lasso**: Zeros out irrelevant features (automatic selection)
- **ElasticNet**: Balanced approach with both penalties

**Why This Matters:**
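- OLS is the most prone to overfitting (look for training R² well above test R²)
- Lasso should keep roughly the 10 informative features, although correlated redundant features can take their place
- Ridge should typically match or beat OLS on test R²
- ElasticNet is often competitive in high-dimensional, correlated settings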

In [None]:
# Define models with reasonable alpha
models = {
    'OLS': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5)
}

results = []

for name, model in models.items():
    # Train
    model.fit(X_train_scaled, y_train)
    
    # Evaluate
    train_r2 = model.score(X_train_scaled, y_train)
    test_r2 = model.score(X_test_scaled, y_test)
    y_pred = model.predict(X_test_scaled)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    
    # Count non-zero coefficients
    if hasattr(model, 'coef_'):
        n_nonzero = np.sum(np.abs(model.coef_) > 0.001)
    else:
        n_nonzero = X.shape[1]
    
    results.append({
        'Model': name,
        'Train_R2': train_r2,
        'Test_R2': test_r2,
        'Test_RMSE': rmse,
        'Test_MAE': mae,
        'Non_Zero_Features': n_nonzero,
        'Overfitting_Gap': train_r2 - test_r2
    })

results_df = pd.DataFrame(results)

print('📊 Model Comparison: OLS vs Ridge vs Lasso vs ElasticNet\n')
print(results_df.to_string(index=False))

# Identify best model
best_model = results_df.loc[results_df['Test_R2'].idxmax(), 'Model']
print(f'\n🎯 Best model: {best_model} (highest Test R²)')

### 5.2 Hyperparameter Tuning with Cross-Validation

### 📝 What's Happening in This Code?

**Purpose:** Find optimal alpha (regularization strength) using cross-validation

**Key Points:**
- **Alpha range**: Test from 0.001 (weak) to 100 (strong regularization)
- **RidgeCV/LassoCV**: Built-in CV - efficient, no manual loops needed
- **Logarithmic scale**: Alpha effects are multiplicative (0.01, 0.1, 1, 10, 100)
- **Best alpha**: Chosen automatically based on CV performance

**Why This Matters:**
- Wrong alpha → underfitting (too high) or overfitting (too low)
- CV ensures alpha generalizes to unseen data
- Automates hyperparameter search - production-ready

In [None]:
# Alpha values to test (logarithmic scale)
alphas = np.logspace(-3, 2, 50)  # 0.001 to 100

# Ridge with CV
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train_scaled, y_train)

# Lasso with CV
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10000)
lasso_cv.fit(X_train_scaled, y_train)

# ElasticNet with CV
enet_cv = ElasticNetCV(alphas=alphas, l1_ratio=0.5, cv=5, max_iter=10000)
enet_cv.fit(X_train_scaled, y_train)

print('🔍 Optimal Alpha Values (via 5-Fold CV)\n')
print(f'Ridge optimal α: {ridge_cv.alpha_:.4f}')
print(f'Lasso optimal α: {lasso_cv.alpha_:.4f}')
print(f'ElasticNet optimal α: {enet_cv.alpha_:.4f}')

# Evaluate tuned models
print('\n📊 Performance with Optimized Alpha:\n')
for name, model in [('Ridge', ridge_cv), ('Lasso', lasso_cv), ('ElasticNet', enet_cv)]:
    test_r2 = model.score(X_test_scaled, y_test)
    y_pred = model.predict(X_test_scaled)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    n_nonzero = np.sum(np.abs(model.coef_) > 0.001)
    
    print(f'{name:12} → Test R²: {test_r2:.4f}, RMSE: {rmse:.4f}, Features: {n_nonzero}/50')
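
The `validation_curve` helper imported in the setup cell gives the per-alpha train and validation scores that the `*CV` estimators summarize internally. A minimal sketch, assuming synthetic data from `make_regression` (whose features are already on a common scale) and a short illustrative alpha grid:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

X_demo, y_demo = make_regression(n_samples=150, n_features=30, n_informative=5,
                                 noise=10.0, random_state=0)
alpha_grid = np.logspace(-3, 2, 10)

# Rows correspond to alpha values, columns to CV folds
train_scores, val_scores = validation_curve(
    Ridge(), X_demo, y_demo,
    param_name='alpha', param_range=alpha_grid, cv=5, scoring='r2')

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
best_alpha = alpha_grid[np.argmax(val_mean)]

for a, tr, va in zip(alpha_grid, train_mean, val_mean):
    print(f'alpha={a:8.3f}  train R²={tr:.4f}  validation R²={va:.4f}')
print(f'Best alpha by mean validation R²: {best_alpha:.3f}')
```

Training R² can only fall as alpha grows, while validation R² usually peaks somewhere in between; plotting both against alpha makes the underfitting and overfitting regions visible.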

### 5.3 Visualize Coefficient Paths

### 📝 What's Happening in This Code?

**Purpose:** Visualize how coefficients change with regularization strength

**Key Points:**
- **Coefficient paths**: Each line represents one feature's coefficient vs alpha
- **Ridge behavior**: All coefficients shrink smoothly toward zero (never exactly zero)
- **Lasso behavior**: Coefficients hit zero at different alphas (sequential feature elimination)
- **Feature selection**: The Lasso path shows which features are eliminated first

**Why This Matters:**
- Visual proof of L1 vs L2 difference
- Identifies robust features (survive high alpha)
- Helps communicate regularization to stakeholders

In [None]:
# Compute coefficient paths for Ridge and Lasso
alphas_path = np.logspace(-2, 2, 100)

# Ridge path
ridge_coefs = []
for alpha in alphas_path:
    model = Ridge(alpha=alpha)
    model.fit(X_train_scaled, y_train)
    ridge_coefs.append(model.coef_)
ridge_coefs = np.array(ridge_coefs)

# Lasso path
lasso_coefs = []
for alpha in alphas_path:
    model = Lasso(alpha=alpha, max_iter=10000)
    model.fit(X_train_scaled, y_train)
    lasso_coefs.append(model.coef_)
lasso_coefs = np.array(lasso_coefs)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Ridge coefficient paths
for i in range(ridge_coefs.shape[1]):
    axes[0].plot(alphas_path, ridge_coefs[:, i], alpha=0.6, linewidth=1.5)
axes[0].axvline(ridge_cv.alpha_, color='red', linestyle='--', linewidth=2, label=f'Optimal α={ridge_cv.alpha_:.3f}')
axes[0].set_xscale('log')
axes[0].set_xlabel('Alpha (log scale)', fontsize=12)
axes[0].set_ylabel('Coefficient Value', fontsize=12)
axes[0].set_title('Ridge Coefficient Paths (L2)\nAll coefficients shrink smoothly', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Lasso coefficient paths
for i in range(lasso_coefs.shape[1]):
    axes[1].plot(alphas_path, lasso_coefs[:, i], alpha=0.6, linewidth=1.5)
axes[1].axvline(lasso_cv.alpha_, color='red', linestyle='--', linewidth=2, label=f'Optimal α={lasso_cv.alpha_:.3f}')
axes[1].set_xscale('log')
axes[1].set_xlabel('Alpha (log scale)', fontsize=12)
axes[1].set_ylabel('Coefficient Value', fontsize=12)
axes[1].set_title('Lasso Coefficient Paths (L1)\nCoefficients hit zero (feature selection)', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print('📈 Coefficient Path Interpretation:')
print('   → Ridge: Gradual shrinkage, all features retained')
print('   → Lasso: Sequential zeroing, automatic feature selection')
print(f'   → At optimal α, Lasso keeps {np.sum(np.abs(lasso_cv.coef_) > 0.001)} features')
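
As an alternative to refitting in a loop, scikit-learn provides `lasso_path`, which computes the whole path with warm starts. The sketch below assumes synthetic data and centers it manually, because `lasso_path` does not fit an intercept.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X_demo, y_demo = make_regression(n_samples=150, n_features=20, n_informative=5,
                                 noise=5.0, random_state=0)

# lasso_path has no intercept term, so center the data first
X_c = X_demo - X_demo.mean(axis=0)
y_c = y_demo - y_demo.mean()

# alphas are returned from largest to smallest;
# coefs has shape (n_features, n_alphas)
alphas_auto, coefs, _ = lasso_path(X_c, y_c)

# Number of active (non-zero) features at each alpha
n_active = np.sum(np.abs(coefs) > 1e-8, axis=0)
print(f'Largest alpha  {alphas_auto[0]:.3f}: {n_active[0]} active features')
print(f'Smallest alpha {alphas_auto[-1]:.3f}: {n_active[-1]} active features')
```

The automatically chosen grid starts at the smallest alpha for which every coefficient is zero, so the path shows features entering the model one by one as the penalty is relaxed.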

---

## 6. Feature Selection Analysis (Lasso)

### 📝 What's Happening in This Code?

**Purpose:** Analyze which features Lasso selected as important

**Key Points:**
- **Non-zero coefficients**: Features Lasso kept (implicitly important)
- **Zero coefficients**: Features eliminated (redundant or irrelevant)
- **Magnitude ranking**: Larger |coefficient| → more important
- **Ground truth comparison**: Validate against known informative features

**Why This Matters:**
- Reduces model complexity (fewer features → faster inference)
- Improves interpretability (focus on key drivers)
- Lowers storage/memory requirements in production
- Validates feature engineering decisions

In [None]:
# Extract Lasso coefficients
lasso_coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': lasso_cv.coef_,
    'Abs_Coefficient': np.abs(lasso_cv.coef_)
}).sort_values('Abs_Coefficient', ascending=False)

# Identify selected features
selected_features = lasso_coef_df[lasso_coef_df['Abs_Coefficient'] > 0.001]
zeroed_features = lasso_coef_df[lasso_coef_df['Abs_Coefficient'] <= 0.001]

print(f'📊 Lasso Feature Selection Results (α={lasso_cv.alpha_:.4f})\n')
print(f'Features selected: {len(selected_features)}/50')
print(f'Features eliminated: {len(zeroed_features)}/50')
print(f'\nTop 15 Selected Features:\n')
print(selected_features.head(15).to_string(index=False))

# Visualize selected features (top 15 by absolute coefficient)
top_features = selected_features.head(15)
plt.figure(figsize=(12, 8))
colors = ['green' if c >= 0 else 'red' for c in top_features['Coefficient']]
plt.barh(top_features['Feature'], top_features['Coefficient'],
         color=colors, edgecolor='black', alpha=0.7)
plt.gca().invert_yaxis()  # largest |coefficient| at the top
plt.xlabel('Lasso Coefficient', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title(f'Top 15 Features Selected by Lasso (α={lasso_cv.alpha_:.4f})', fontsize=13, fontweight='bold')
plt.axvline(0, color='black', linewidth=0.8)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()
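
The ground-truth comparison listed in the key points can be sketched as a precision/recall check on the selected support. The example assumes a simplified synthetic setup (independent features, five known non-zero coefficients) rather than the correlated dataset generated earlier, and the `1e-3` cutoff mirrors the threshold used above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 30
X_demo = rng.normal(size=(n, p))
true_coef_demo = np.zeros(p)
true_coef_demo[:5] = [5.0, -4.0, 3.0, -3.0, 4.0]   # known informative features
y_demo = X_demo @ true_coef_demo + rng.normal(scale=1.0, size=n)

lasso_demo = LassoCV(cv=5, max_iter=10000).fit(X_demo, y_demo)

selected = np.abs(lasso_demo.coef_) > 1e-3
truth = true_coef_demo != 0

true_positives = np.sum(selected & truth)
recall = true_positives / truth.sum()               # informative features found
precision = true_positives / max(selected.sum(), 1) # selected features that are real

print(f'Selected {selected.sum()} features')
print(f'Recall: {recall:.2f}, Precision: {precision:.2f}')
```

Recall tells you whether any truly informative feature was dropped; precision tells you how many extra features slipped in. CV-tuned Lasso often favors prediction accuracy and may keep a few spurious features, so precision below 1 is not unusual.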

---

## 7. Model Diagnostics

### 📝 What's Happening in This Code?

**Purpose:** Comprehensive diagnostic plots for best regularized model

**Key Points:**
- **Predicted vs Actual**: Points near diagonal indicate good predictions
- **Residual plot**: Random scatter confirms no systematic errors
- **Residual distribution**: Normal distribution validates assumptions
- **Q-Q plot**: Diagonal line confirms residual normality

**Why This Matters:**
- Patterns in residuals indicate model misspecification
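- Strongly non-normal residuals make normal-theory confidence intervals unreliable
- Regularization deliberately trades a little bias for lower variance; these plots check that the bias does not show up as systematic error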

In [None]:
# Diagnose the tuned ElasticNet (swap in ridge_cv or lasso_cv if one scored better in Section 5.2)
best_model_obj = enet_cv
y_test_pred = best_model_obj.predict(X_test_scaled)
residuals = y_test - y_test_pred

# Diagnostic plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Predicted vs Actual
axes[0, 0].scatter(y_test, y_test_pred, alpha=0.6, edgecolor='k')
axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
                'r--', lw=2, label='Perfect Prediction')
axes[0, 0].set_xlabel('Actual Target', fontsize=12)
axes[0, 0].set_ylabel('Predicted Target', fontsize=12)
axes[0, 0].set_title('Predicted vs Actual (Test Set)', fontsize=13, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Residual Plot
axes[0, 1].scatter(y_test_pred, residuals, alpha=0.6, edgecolor='k')
axes[0, 1].axhline(0, color='r', linestyle='--', lw=2)
axes[0, 1].set_xlabel('Predicted Target', fontsize=12)
axes[0, 1].set_ylabel('Residuals', fontsize=12)
axes[0, 1].set_title('Residual Plot', fontsize=13, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# 3. Residual Distribution
axes[1, 0].hist(residuals, bins=20, edgecolor='black', alpha=0.7)
axes[1, 0].axvline(residuals.mean(), color='r', linestyle='--', lw=2, 
                   label=f'Mean: {residuals.mean():.2f}')
axes[1, 0].set_xlabel('Residuals', fontsize=12)
axes[1, 0].set_ylabel('Frequency', fontsize=12)
axes[1, 0].set_title('Residual Distribution', fontsize=13, fontweight='bold')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3, axis='y')

# 4. Q-Q Plot
from scipy import stats
stats.probplot(residuals, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot (Normality Check)', fontsize=13, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print('🔍 Diagnostic Summary:')
print(f'   → Mean residual: {residuals.mean():.4f} (should be close to 0)')
print(f'   → Residual std: {residuals.std():.4f}')
print(f'   → R² on test: {r2_score(y_test, y_test_pred):.4f}')
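
The visual Q-Q check can be complemented with a numeric test. A minimal sketch using SciPy's Shapiro-Wilk test on stand-in residuals (the `residuals_demo` array is simulated here as an assumption; in the notebook you would pass the actual `residuals`):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals_demo = rng.normal(loc=0.0, scale=5.0, size=40)   # stand-in residuals

# Null hypothesis: the sample comes from a normal distribution
stat, p_value = stats.shapiro(residuals_demo)
looks_normal = p_value > 0.05

print(f'Shapiro-Wilk statistic: {stat:.4f}, p-value: {p_value:.4f}')
print('No evidence against normality' if looks_normal else 'Normality is questionable')
```

A large p-value does not prove normality, especially with few test samples, so treat this as one input alongside the plots.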

---

## 8. Real-World Projects

### 🔬 Post-Silicon Validation Projects

#### **Project 1: High-Dimensional STDF Parameter Reduction**

**Objective:** Reduce 1000+ parametric test parameters to ~50 key predictors for yield modeling.

**Business Value:**
- Reduce test time by 40% (skip redundant tests)
- Lower data storage costs
- Faster model inference
- Maintain prediction accuracy >95%

**Dataset Features:**
- 1000+ parametric tests (voltage, current, frequency, timing)
- High correlation (VDD tests, timing paths)
- Mix of analog and digital parameters
- Spatial wafer data (die_x, die_y)

**Implementation Guide:**
1. Start with ElasticNet (handles correlated groups)
2. Use `ElasticNetCV` with `l1_ratio=0.7` (favor selection)
3. Test alphas: `np.logspace(-4, 1, 50)`
4. Validate on multiple wafer lots
5. Keep features with |coef| > threshold

**Expected Outcomes:** 1000+ → ~50 parameters, R² > 0.90

---

#### **Project 2: Correlated Test Elimination**

**Objective:** Identify and remove redundant tests that provide no additional yield information.

**Business Value:**
- Cut ATE test cost per device
- Increase throughput
- Simplify test programs
- Reduce engineering debug time

**Implementation Guide:**
1. Calculate correlation matrix of all tests
2. Apply Lasso to zero out redundant tests
3. Cross-validate on different product lots
4. Verify eliminated tests truly redundant (domain expert review)
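
Step 1 can be sketched with NumPy alone. The data below is simulated, with one column built as a near-copy of another, and the `0.95` cutoff is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 4))
# Column 4 is a near-duplicate of column 0 (simulated redundant test)
tests = np.column_stack([base, base[:, 0] + 0.05 * rng.normal(size=300)])

# Absolute correlation between every pair of columns
corr = np.abs(np.corrcoef(tests, rowvar=False))

# Scan the upper triangle so each pair is reported once
rows, cols = np.triu_indices_from(corr, k=1)
redundant_pairs = [(int(i), int(j)) for i, j in zip(rows, cols)
                   if corr[i, j] > 0.95]

print('Highly correlated test pairs:', redundant_pairs)
```

Pairs flagged here are candidates only; the Lasso fit and the domain expert review in the later steps decide which test of each pair can be dropped.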

**Expected Outcomes:** 30% test reduction, <2% yield prediction degradation

---

#### **Project 3: Sparse Yield Modeling**

**Objective:** Build yield prediction model using minimal test set for early-stage screening.

**Business Value:**
- Enable fast screening before expensive tests
- Support multi-site parallel testing
- Reduce cost of test development
- Identify critical yield limiters

**Implementation Guide:**
1. Use Lasso for aggressive feature selection
2. Start with alpha=1.0, decrease until R²>0.85
3. Include physics-based features (power, speed)
4. Validate on out-of-lot data

**Expected Outcomes:** 10-15 critical tests identified, 90%+ yield prediction accuracy

---

#### **Project 4: Robust Power Parameter Prediction**

**Objective:** Predict device power consumption from fast pre-tests, handling measurement noise.

**Business Value:**
- Skip expensive power measurements
- Real-time binning decisions
- Reduce test time by 25%
- Robust to measurement variance

**Implementation Guide:**
1. Use Ridge for stable coefficients under noisy, correlated inputs (squared loss is still sensitive to outliers, so screen or clip them first)
2. Include voltage, frequency, temperature features
3. Add polynomial terms for V² (physics-based)
4. Cross-validate across temperature corners
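
A minimal sketch of the physics-based feature idea, assuming simulated data in which power follows a $V^2 f$ relationship plus a small temperature term. The functional form, ranges, and noise level are assumptions for illustration only; `holdout_r2` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
voltage = rng.uniform(0.8, 1.2, size=n)
freq = rng.uniform(1.0, 3.0, size=n)
temp = rng.uniform(25.0, 85.0, size=n)
# Assumed ground truth: dynamic power ~ V^2 * f, plus a small temperature effect
power = voltage ** 2 * freq + 0.002 * temp + rng.normal(scale=0.02, size=n)

X_raw = np.column_stack([voltage, freq, temp])
X_phys = np.column_stack([X_raw, voltage ** 2 * freq])   # add engineered term

def holdout_r2(X):
    """Fit scaled Ridge on 80% of the rows and return R² on the other 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, power, test_size=0.2,
                                              random_state=0)
    model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
    return model.fit(X_tr, y_tr).score(X_te, y_te)

r2_raw = holdout_r2(X_raw)
r2_phys = holdout_r2(X_phys)
print(f'Test R² with raw features:      {r2_raw:.4f}')
print(f'Test R² with V²·f feature added: {r2_phys:.4f}')
```

The engineered term lets a linear model capture the interaction that the raw features miss, and Ridge keeps the fit stable even though the new column is strongly correlated with voltage and frequency.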

**Expected Outcomes:** Power prediction within 5%, robust to 10% measurement noise

---

### 📊 General AI/ML Projects

#### **Project 5: Genomic Biomarker Discovery**

**Objective:** Identify disease-predictive genes from 10,000+ gene expression levels.

**Business Value:**
- Drug target discovery
- Personalized medicine
- Reduce diagnostic test costs

**Implementation:** Lasso for gene selection, ElasticNet for gene networks

---

#### **Project 6: Text Classification with Large Vocabulary**

**Objective:** Build spam filter with 50,000+ word features, select key spam indicators.

**Business Value:**
- Fast inference (fewer features)
- Interpretable rules
- Lower memory footprint

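**Implementation:** An L1 penalty (for classification, L1-regularized logistic regression rather than Lasso regression) zeros out uninformative words and suits sparse text features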

---

#### **Project 7: Financial Risk Modeling**

**Objective:** Predict credit default from 200+ correlated economic indicators.

**Business Value:**
- Accurate risk assessment
- Regulatory compliance
- Automated lending decisions

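**Implementation:** An L2 (Ridge-style) penalty handles multicollinearity (GDP, inflation, and interest rates are correlated); for a binary default label, apply it through logistic regression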

---

#### **Project 8: Image Compression and Denoising**

**Objective:** Learn sparse representation of images for compression.

**Business Value:**
- Reduce storage costs
- Faster transmission
- Remove noise while preserving edges

**Implementation:** Lasso on pixel/wavelet features for sparse coding

---

## 9. Key Takeaways

### ✅ When to Use Each Method

**Ridge (L2):**
- Many correlated features
- Want smooth shrinkage (keep all features)
- Multicollinearity problems
- Numerical stability critical

**Lasso (L1):**
- Need feature selection
- Sparse models preferred
- Interpretability important
- Many irrelevant features

**ElasticNet:**
- High-dimensional data (p >> n)
- Grouped correlated features
- Best of both worlds
- When unsure, start here

### ⚠️ Limitations

**All Methods:**
- Still assumes linear relationship (use with polynomial features for non-linearity)
- Require feature scaling
- Alpha tuning needed (computationally expensive)

**Lasso Specific:**
- Picks arbitrarily from correlated features
- Can be unstable (small data changes → different features)
- Selects at most n features when p > n (n = number of samples)

### 🎯 Best Practices

1. **Always scale features** (StandardScaler before regularization)
2. **Use CV for alpha** (RidgeCV, LassoCV, ElasticNetCV)
3. **Start with ElasticNet** (covers Ridge and Lasso as special cases)
4. **Validate feature selection** (check if Lasso choices make domain sense)
5. **Monitor overfitting** (train vs test R² gap)
6. **Test on unseen data** (different batches/lots for production readiness)
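
Several of these practices can be combined in one scikit-learn pipeline. The sketch below is one possible arrangement on synthetic data, not a prescribed workflow; the `l1_ratio` grid is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=200, n_features=40, n_informative=8,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0)

# Practices 1-3: scale, tune alpha by CV, start from ElasticNet
l1_grid = [0.2, 0.5, 0.8, 1.0]
pipe = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=l1_grid, cv=5, max_iter=10000),
)
pipe.fit(X_tr, y_tr)   # scaler statistics come from the training split only

enet = pipe.named_steps['elasticnetcv']
train_r2 = pipe.score(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)

# Practice 5: monitor the train/test gap
print(f'Chosen alpha: {enet.alpha_:.4f}, l1_ratio: {enet.l1_ratio_}')
print(f'Train R²: {train_r2:.4f}, Test R²: {test_r2:.4f}, '
      f'Gap: {train_r2 - test_r2:.4f}')
```

Note that the internal folds of `ElasticNetCV` see data scaled with statistics from the whole training split. This is usually a minor effect; wrapping the pipeline in `GridSearchCV` instead would refit the scaler inside each fold.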

### 📚 Next Steps

After mastering regularization:
1. **`013_Logistic_Regression.ipynb`** - Classification with regularization
2. **`016_Decision_Trees.ipynb`** - Non-linear alternatives
3. **Polynomial + Regularization** - Combine for powerful models

### 🔑 Core Concepts Mastered

✅ L1, L2, and combined penalties  
✅ Bias-variance tradeoff through regularization  
✅ Automatic feature selection with Lasso  
✅ Handling multicollinearity with Ridge  
✅ Hyperparameter tuning via cross-validation  
✅ Production-ready pipelines with CV  

---

**Congratulations!** You can now handle high-dimensional data, multicollinearity, and perform automatic feature selection. These are essential skills for real-world ML where data is often messy and high-dimensional.