## Necessary Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

---
# Part 1: Regression Task (California Housing)

In this section, we will:
1. Load and split the California Housing dataset
2. Build a baseline linear regression model
3. Perform hyperparameter tuning for Ridge and Lasso
4. Compare L1 vs L2 regularization

## Task 1: Load and Split Dataset

### About the Dataset
The California Housing dataset contains information from the 1990 California census. It includes:
- **8 features**: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
- **Target**: Median house value for California districts (in $100,000s)
- **Samples**: 20,640 observations

We will split the data into:
- **Training set**: 80% (for model training)
- **Test set**: 20% (for model evaluation)

In [None]:
# Load the California Housing dataset (imports come from the first cell)

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
print(f"Number of features: {X_train.shape[1]}")
print(f"Target variable range: [{y.min():.2f}, {y.max():.2f}]")

### Exploring the Dataset

In [None]:
# Create a DataFrame for better visualization
california_data = fetch_california_housing()
df_california = pd.DataFrame(X_train, columns=california_data.feature_names)
df_california['Target'] = y_train

print("Dataset Info:")
print(df_california.head())
print("\nBasic Statistics:")
print(df_california.describe())

### Feature Scaling

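**Important**: Ridge and Lasso are sensitive to feature scale, so standardize the features first:
- Regularization penalizes coefficient magnitude, and a coefficient's magnitude depends on the units of its feature
- Without scaling, features measured on larger scales get smaller coefficients and are effectively penalized less
- Standardizing puts all coefficients on a comparable footing and usually helps the solvers converge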
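
The points above can be illustrated with a minimal sketch on synthetic data. The names `X_raw`, `y_demo`, and `effect_raw` are hypothetical and not part of the worksheet. Two features carry the same signal, but one is expressed in units 1000 times larger, so Ridge penalizes it far less unless the features are standardized.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): both features matter equally
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y_demo = 2.0 * x1 + 2.0 * x2 + rng.normal(scale=0.1, size=n)

# Same information, but the second feature uses units 1000x larger
X_raw = np.column_stack([x1, 1000.0 * x2])

# Ridge on raw features: express each coefficient as the effect of a
# one-standard-deviation change so the two features can be compared
ridge_raw = Ridge(alpha=100.0).fit(X_raw, y_demo)
effect_raw = ridge_raw.coef_ * X_raw.std(axis=0)

# Ridge on standardized features: coefficients are already per standard deviation
X_std = StandardScaler().fit_transform(X_raw)
ridge_std = Ridge(alpha=100.0).fit(X_std, y_demo)
effect_std = ridge_std.coef_

print("Per-std effect, raw features:         ", effect_raw.round(3))
print("Per-std effect, standardized features:", effect_std.round(3))
```

On raw features the large-scale feature is barely shrunk while the other one is, even though both contribute equally to the target. After standardization the penalty treats them alike.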

In [None]:
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature scaling completed!")
print(f"Mean of scaled features: {X_train_scaled.mean(axis=0).round(10)}")
print(f"Std of scaled features: {X_train_scaled.std(axis=0).round(2)}")

---
## Task 2: Step 1 - Baseline Model (No Regularization)

### Linear Regression Without Regularization

We start with a basic Linear Regression model to establish a baseline. This model minimizes:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

**Key Points:**
- No regularization term
- May overfit if features are correlated or dataset is noisy
- Serves as a baseline for comparison
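
As a quick check of the formula above, here is a minimal sketch that computes the MSE by hand and compares it with `mean_squared_error`. The arrays `y_true_demo` and `y_pred_demo` are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions (illustrative only)
y_true_demo = np.array([3.0, 1.5, 2.0, 4.0])
y_pred_demo = np.array([2.5, 1.5, 2.5, 3.0])

# MSE = mean of squared residuals: (0.25 + 0 + 0.25 + 1.0) / 4
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

print(mse_manual)   # 0.375
print(mse_sklearn)  # 0.375
```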

In [None]:
# Build baseline Linear Regression model
baseline_model = LinearRegression()
baseline_model.fit(X_train_scaled, y_train)

# Make predictions
y_train_pred_baseline = baseline_model.predict(X_train_scaled)
y_test_pred_baseline = baseline_model.predict(X_test_scaled)

# Calculate MSE
train_mse_baseline = mean_squared_error(y_train, y_train_pred_baseline)
test_mse_baseline = mean_squared_error(y_test, y_test_pred_baseline)

print("=" * 60)
print("BASELINE MODEL (No Regularization)")
print("=" * 60)
print(f"Training MSE: {train_mse_baseline:.4f}")
print(f"Test MSE: {test_mse_baseline:.4f}")
print(f"\nR² Score (Train): {baseline_model.score(X_train_scaled, y_train):.4f}")
print(f"R² Score (Test): {baseline_model.score(X_test_scaled, y_test):.4f}")

### Observing Coefficients

In [None]:
# Display coefficients
feature_names = california_data.feature_names
coefficients_baseline = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': baseline_model.coef_
}).sort_values('Coefficient', ascending=False)

print("\nBaseline Model Coefficients:")
print(coefficients_baseline)

# Visualize coefficients
plt.figure(figsize=(10, 6))
plt.barh(coefficients_baseline['Feature'], coefficients_baseline['Coefficient'])
plt.xlabel('Coefficient Value', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.title('Baseline Model - Feature Coefficients', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='red', linestyle='--', linewidth=1)
plt.tight_layout()
plt.show()

---
## Task 2: Step 2 - Hyperparameter Tuning

### Ridge Regression (L2 Regularization)

Ridge regression adds an L2 penalty term:

$$\text{Loss} = \text{MSE} + \alpha \sum_{j=1}^{p} w_j^2$$

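where $\alpha$ is the regularization strength. scikit-learn's `Ridge` uses the sum of squared errors rather than the mean in the first term, so its `alpha` corresponds to $\alpha$ here multiplied by the number of samples $n$.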

**Characteristics:**
- Shrinks all coefficients towards zero
- Does not set coefficients exactly to zero
- Good when all features are potentially relevant
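
The shrinkage behaviour can be seen from the closed-form solution. The sketch below uses synthetic data and no intercept (both assumptions made for illustration; `X_demo`, `y_demo`, and `w_closed` are hypothetical names). It solves the objective that scikit-learn's `Ridge` minimizes, the sum of squared errors plus `alpha` times the squared L2 norm of the weights, and compares the result with the library's fit.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

alpha = 10.0

# Closed-form minimizer of ||y - Xw||^2 + alpha * ||w||^2 (no intercept):
# w = (X^T X + alpha * I)^(-1) X^T y
w_closed = np.linalg.solve(X_demo.T @ X_demo + alpha * np.eye(3), X_demo.T @ y_demo)

# scikit-learn fit of the same objective
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_

# Ordinary least squares for comparison (alpha = 0)
w_ols = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

print("Closed form:", w_closed.round(4))
print("sklearn:    ", w_sklearn.round(4))
print("OLS:        ", w_ols.round(4))
```

The Ridge weights have a smaller norm than the OLS weights, but none of them is exactly zero.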

In [None]:
# Define parameter grid for Ridge
param_grid_ridge = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

# Initialize Ridge model
ridge_model = Ridge()

# GridSearchCV for hyperparameter tuning
grid_search_ridge = GridSearchCV(
    estimator=ridge_model,
    param_grid=param_grid_ridge,
    cv=5,  # 5-fold cross-validation
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print("Training Ridge Regression with GridSearchCV...")
grid_search_ridge.fit(X_train_scaled, y_train)

print("\n" + "=" * 60)
print("RIDGE REGRESSION - Hyperparameter Tuning Results")
print("=" * 60)
print(f"Best alpha: {grid_search_ridge.best_params_['alpha']}")
print(f"Best CV MSE: {-grid_search_ridge.best_score_:.4f}")

### Lasso Regression (L1 Regularization)

Lasso regression adds an L1 penalty term:

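$$\text{Loss} = \text{MSE} + \alpha \sum_{j=1}^{p} |w_j|$$

scikit-learn's `Lasso` scales the squared-error term by $\frac{1}{2n}$, which differs from the scaling used by `Ridge`, so an `alpha` value for Lasso is not directly comparable to the same value for Ridge.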

**Characteristics:**
- Can set some coefficients exactly to zero
- Performs feature selection automatically
- Good when only a subset of features is relevant
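
Why L1 produces exact zeros can be sketched in the special case of orthonormal feature columns, a simplifying assumption that does not hold for the housing data. In that case, with a suitably scaled penalty, the Lasso solution is the soft-thresholded least-squares coefficient, while Ridge only rescales it. The helper names `soft_threshold` and `ridge_shrink` are hypothetical.

```python
import numpy as np

def soft_threshold(w, threshold):
    """L1 case: move each coefficient toward zero by `threshold`, clipping at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

def ridge_shrink(w, alpha):
    """L2 case: rescale every coefficient by the same factor 1 / (1 + alpha)."""
    return w / (1.0 + alpha)

# Made-up least-squares coefficients (illustrative only)
w_ols_demo = np.array([3.0, -0.4, 0.05, -1.2])

w_l1 = soft_threshold(w_ols_demo, 0.5)
w_l2 = ridge_shrink(w_ols_demo, 0.5)

print("L1 (soft threshold):", w_l1)
print("L2 (rescaling):     ", w_l2.round(4))
print("Non-zero after L1:", np.count_nonzero(w_l1), "| after L2:", np.count_nonzero(w_l2))
```

Coefficients smaller in magnitude than the threshold are removed entirely by the L1 rule, whereas the L2 rule keeps every coefficient non-zero.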

In [None]:
# Define parameter grid for Lasso
param_grid_lasso = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100]
}

# Initialize Lasso model
lasso_model = Lasso(max_iter=10000)

# GridSearchCV for hyperparameter tuning
grid_search_lasso = GridSearchCV(
    estimator=lasso_model,
    param_grid=param_grid_lasso,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print("Training Lasso Regression with GridSearchCV...")
grid_search_lasso.fit(X_train_scaled, y_train)

print("\n" + "=" * 60)
print("LASSO REGRESSION - Hyperparameter Tuning Results")
print("=" * 60)
print(f"Best alpha: {grid_search_lasso.best_params_['alpha']}")
print(f"Best CV MSE: {-grid_search_lasso.best_score_:.4f}")

### Evaluating Best Models on Test Set

In [None]:
# Get best models
best_ridge = grid_search_ridge.best_estimator_
best_lasso = grid_search_lasso.best_estimator_

# Predictions
y_train_pred_ridge = best_ridge.predict(X_train_scaled)
y_test_pred_ridge = best_ridge.predict(X_test_scaled)

y_train_pred_lasso = best_lasso.predict(X_train_scaled)
y_test_pred_lasso = best_lasso.predict(X_test_scaled)

# Calculate MSE
train_mse_ridge = mean_squared_error(y_train, y_train_pred_ridge)
test_mse_ridge = mean_squared_error(y_test, y_test_pred_ridge)

train_mse_lasso = mean_squared_error(y_train, y_train_pred_lasso)
test_mse_lasso = mean_squared_error(y_test, y_test_pred_lasso)

print("\n" + "=" * 60)
print("MODEL PERFORMANCE COMPARISON")
print("=" * 60)
print(f"\n{'Model':<20} {'Train MSE':<15} {'Test MSE':<15}")
print("-" * 50)
print(f"{'Baseline':<20} {train_mse_baseline:<15.4f} {test_mse_baseline:<15.4f}")
print(f"{'Ridge':<20} {train_mse_ridge:<15.4f} {test_mse_ridge:<15.4f}")
print(f"{'Lasso':<20} {train_mse_lasso:<15.4f} {test_mse_lasso:<15.4f}")

---
## Task 2: Step 3 - Regularization Experiments (L1 vs L2)

### Comparing Coefficients

In [None]:
# Create comparison DataFrame
coef_comparison = pd.DataFrame({
    'Feature': feature_names,
    'Baseline': baseline_model.coef_,
    'Ridge': best_ridge.coef_,
    'Lasso': best_lasso.coef_
})

print("\n" + "=" * 60)
print("COEFFICIENT COMPARISON")
print("=" * 60)
print(coef_comparison)

# Count zero coefficients in Lasso
zero_coef_lasso = np.sum(np.abs(best_lasso.coef_) < 1e-10)
print(f"\nNumber of features with zero coefficient in Lasso: {zero_coef_lasso}/{len(feature_names)}")
print(f"Lasso kept {len(feature_names) - zero_coef_lasso} features with non-zero coefficients.")

### Visualizing Coefficient Comparison

In [None]:
# Plot coefficient comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

models = ['Baseline', 'Ridge', 'Lasso']
for ax, model in zip(axes, models):
    coef_data = coef_comparison.sort_values(model, ascending=False)
    ax.barh(coef_data['Feature'], coef_data[model])
    ax.set_xlabel('Coefficient Value', fontsize=11)
    ax.set_ylabel('Features', fontsize=11)
    ax.set_title(f'{model} Coefficients', fontsize=13, fontweight='bold')
    ax.axvline(x=0, color='red', linestyle='--', linewidth=1)

plt.tight_layout()
plt.show()

### Effect of Alpha on Model Performance

In [None]:
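# Sweep alpha to visualize its effect (illustration only: alpha itself is
# selected by cross-validation above, not by test-set error)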
alphas = np.logspace(-3, 3, 50)
train_errors_ridge = []
test_errors_ridge = []
train_errors_lasso = []
test_errors_lasso = []

for alpha in alphas:
    # Ridge
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    train_errors_ridge.append(mean_squared_error(y_train, ridge.predict(X_train_scaled)))
    test_errors_ridge.append(mean_squared_error(y_test, ridge.predict(X_test_scaled)))
    
    # Lasso
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train_scaled, y_train)
    train_errors_lasso.append(mean_squared_error(y_train, lasso.predict(X_train_scaled)))
    test_errors_lasso.append(mean_squared_error(y_test, lasso.predict(X_test_scaled)))

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Ridge
ax1.plot(alphas, train_errors_ridge, label='Train MSE', linewidth=2)
ax1.plot(alphas, test_errors_ridge, label='Test MSE', linewidth=2)
ax1.set_xscale('log')
ax1.set_xlabel('Alpha (Regularization Strength)', fontsize=12)
ax1.set_ylabel('Mean Squared Error', fontsize=12)
ax1.set_title('Ridge Regression: MSE vs Alpha', fontsize=14, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# Lasso
ax2.plot(alphas, train_errors_lasso, label='Train MSE', linewidth=2)
ax2.plot(alphas, test_errors_lasso, label='Test MSE', linewidth=2)
ax2.set_xscale('log')
ax2.set_xlabel('Alpha (Regularization Strength)', fontsize=12)
ax2.set_ylabel('Mean Squared Error', fontsize=12)
ax2.set_title('Lasso Regression: MSE vs Alpha', fontsize=14, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Discussion: Bias-Variance Tradeoff

**Key Observations:**

1. **L1 (Lasso) Regularization:**
   - Produces sparse models by setting some coefficients to exactly zero
   - Performs automatic feature selection
   - Useful when we believe only a subset of features is relevant
   - Can lead to better interpretability

2. **L2 (Ridge) Regularization:**
   - Shrinks all coefficients towards zero but rarely sets them exactly to zero
   - Handles multicollinearity well
   - Useful when all features are potentially relevant
   - More stable than Lasso when features are correlated

3. **Effect on Bias-Variance:**
   - **Low alpha (weak regularization)**: Lower bias, higher variance → potential overfitting
   - **High alpha (strong regularization)**: Higher bias, lower variance → potential underfitting
   - **Optimal alpha**: Balances bias and variance for best generalization

4. **Cross-Validation:**
   - Estimates out-of-sample error using only the training data
   - Lets us choose the regularization strength without touching the test set
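
Conceptually, the cross-validated search over alpha works as in the sketch below. It uses synthetic data from `make_regression` and a hand-written K-fold loop, which is an assumption made for illustration rather than a description of how `GridSearchCV` is implemented internally. The names `cv_mse` and `candidate_alphas` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic data (illustrative only)
X_demo, y_demo = make_regression(n_samples=120, n_features=10, noise=15.0, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
candidate_alphas = [0.01, 1.0, 100.0]

cv_mse = {}
for alpha in candidate_alphas:
    fold_errors = []
    for train_idx, val_idx in kfold.split(X_demo):
        # Fit on four folds, evaluate on the held-out fold
        model = Ridge(alpha=alpha).fit(X_demo[train_idx], y_demo[train_idx])
        residuals = y_demo[val_idx] - model.predict(X_demo[val_idx])
        fold_errors.append(np.mean(residuals ** 2))
    # Average validation MSE across the five folds
    cv_mse[alpha] = np.mean(fold_errors)

# The alpha with the lowest average validation MSE is selected
best_alpha = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print("Selected alpha:", best_alpha)
```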

---
# Part 2: Classification Task (Breast Cancer)

In this section, we will:
1. Load and split the Breast Cancer dataset
2. Build a baseline logistic regression model
3. Perform hyperparameter tuning for L1 and L2 regularization
4. Compare L1 vs L2 regularization for classification

## Task 1: Load and Split Dataset

### About the Dataset
The Breast Cancer Wisconsin (Diagnostic) dataset contains features computed from digitized images of fine needle aspirates (FNA) of breast masses:
- **30 features**: Mean, standard error, and worst values for 10 characteristics
- **Target**: Binary classification (0 = malignant, 1 = benign)
- **Samples**: 569 observations

We will split the data into:
- **Training set**: 80%
- **Test set**: 20%

In [None]:
# Load the Breast Cancer dataset (imports come from the first cell)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
print(f"Number of features: {X_train.shape[1]}")
print(f"\nClass distribution in training set:")
print(f"Class 0 (Malignant): {np.sum(y_train == 0)} ({np.sum(y_train == 0)/len(y_train)*100:.1f}%)")
print(f"Class 1 (Benign): {np.sum(y_train == 1)} ({np.sum(y_train == 1)/len(y_train)*100:.1f}%)")

### Exploring the Dataset

In [None]:
# Create DataFrame
cancer_data = load_breast_cancer()
df_cancer = pd.DataFrame(X_train, columns=cancer_data.feature_names)
df_cancer['Target'] = y_train

print("Dataset Info:")
print(df_cancer.head())
print("\nBasic Statistics:")
print(df_cancer.describe())

### Feature Scaling

In [None]:
# Standardize features
scaler_cancer = StandardScaler()
X_train_scaled_cancer = scaler_cancer.fit_transform(X_train)
X_test_scaled_cancer = scaler_cancer.transform(X_test)

print("Feature scaling completed!")
print(f"Mean of first scaled feature: {X_train_scaled_cancer.mean(axis=0).round(10)[0]}")
print(f"Std of first scaled feature: {X_train_scaled_cancer.std(axis=0).round(2)[0]}")

---
## Task 2: Step 1 - Baseline Model (No Regularization)

### Logistic Regression Without Regularization

Logistic regression predicts the probability of a binary outcome:

$$P(y=1|x) = \frac{1}{1 + e^{-(w^Tx + b)}}$$

The model minimizes the log loss (binary cross-entropy).
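
The sigmoid and the log loss can be written out directly. This is a minimal sketch with made-up scores and labels (`z_demo`, `y_demo` are hypothetical names), checked against `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

def sigmoid(z):
    """Map a linear score w^T x + b to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up linear scores and true labels (illustrative only)
z_demo = np.array([-2.0, 0.0, 3.0])
y_demo = np.array([0, 1, 1])

# Predicted probability of class 1 for each sample
p_demo = sigmoid(z_demo)

# Binary cross-entropy: -mean(y * log(p) + (1 - y) * log(1 - p))
manual_log_loss = -np.mean(y_demo * np.log(p_demo) + (1 - y_demo) * np.log(1 - p_demo))
sklearn_log_loss = log_loss(y_demo, p_demo)

print("Probabilities:", p_demo.round(4))
print("Manual log loss: ", round(manual_log_loss, 6))
print("sklearn log loss:", round(sklearn_log_loss, 6))
```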

In [None]:
# Build baseline Logistic Regression model
# Note: penalty=None disables regularization (scikit-learn >= 1.2; older versions used penalty='none')
baseline_logreg = LogisticRegression(penalty=None, max_iter=10000, random_state=42)
baseline_logreg.fit(X_train_scaled_cancer, y_train)

# Make predictions
y_train_pred_baseline = baseline_logreg.predict(X_train_scaled_cancer)
y_test_pred_baseline = baseline_logreg.predict(X_test_scaled_cancer)

# Calculate accuracy
train_acc_baseline = accuracy_score(y_train, y_train_pred_baseline)
test_acc_baseline = accuracy_score(y_test, y_test_pred_baseline)

print("=" * 60)
print("BASELINE MODEL (No Regularization)")
print("=" * 60)
print(f"Training Accuracy: {train_acc_baseline:.4f}")
print(f"Test Accuracy: {test_acc_baseline:.4f}")
print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_test_pred_baseline, target_names=['Malignant', 'Benign']))

### Observing Coefficients

In [None]:
# Display coefficients
feature_names_cancer = cancer_data.feature_names
coefficients_baseline_cancer = pd.DataFrame({
    'Feature': feature_names_cancer,
    'Coefficient': baseline_logreg.coef_[0]
}).sort_values('Coefficient', key=abs, ascending=False)

print("\nBaseline Model - Top 10 Coefficients (by absolute value):")
print(coefficients_baseline_cancer.head(10))

# Visualize top 15 coefficients
top_15 = coefficients_baseline_cancer.head(15)
plt.figure(figsize=(10, 8))
plt.barh(top_15['Feature'], top_15['Coefficient'])
plt.xlabel('Coefficient Value', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.title('Baseline Model - Top 15 Feature Coefficients', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='red', linestyle='--', linewidth=1)
plt.tight_layout()
plt.show()

### Confusion Matrix

In [None]:
# Confusion Matrix
cm_baseline = confusion_matrix(y_test, y_test_pred_baseline)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_baseline, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Malignant', 'Benign'],
            yticklabels=['Malignant', 'Benign'],
            cbar_kws={'label': 'Count'})
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.title('Confusion Matrix - Baseline Model', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

---
## Task 2: Step 2 - Hyperparameter Tuning

### Understanding the C Parameter

In sklearn's LogisticRegression:
- **C** is the inverse of regularization strength
- **Smaller C** = stronger regularization
- **Larger C** = weaker regularization

Loss function with regularization:
$$\text{Loss} = \text{Log Loss} + \frac{1}{C} \cdot \text{Penalty}$$
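
The inverse relationship between `C` and regularization strength can be checked with a small sketch. It uses synthetic data from `make_classification` and the default L2 penalty, both assumptions made for illustration. `coef_norms` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit the same model with increasing C and record the size of the weight vector
coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X_demo, y_demo)
    coef_norms[C] = np.linalg.norm(model.coef_)

for C, norm in coef_norms.items():
    print(f"C = {C:<6} -> coefficient L2 norm = {norm:.4f}")
```

Smaller `C` yields a smaller coefficient norm, which is the stronger-regularization end of the range.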

In [None]:
# Define parameter grid
param_grid_logreg = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'penalty': ['l1', 'l2']
}

# Initialize Logistic Regression model
logreg_model = LogisticRegression(solver='saga', max_iter=10000, random_state=42)

# GridSearchCV for hyperparameter tuning
grid_search_logreg = GridSearchCV(
    estimator=logreg_model,
    param_grid=param_grid_logreg,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

print("Training Logistic Regression with GridSearchCV...")
grid_search_logreg.fit(X_train_scaled_cancer, y_train)

print("\n" + "=" * 60)
print("LOGISTIC REGRESSION - Hyperparameter Tuning Results")
print("=" * 60)
print(f"Best parameters: {grid_search_logreg.best_params_}")
print(f"Best CV accuracy: {grid_search_logreg.best_score_:.4f}")

### Separate Tuning for L1 and L2

In [None]:
# L1 Regularization (Lasso-like)
param_grid_l1 = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
logreg_l1 = LogisticRegression(penalty='l1', solver='saga', max_iter=10000, random_state=42)
grid_search_l1 = GridSearchCV(logreg_l1, param_grid_l1, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_l1.fit(X_train_scaled_cancer, y_train)

print("L1 Regularization:")
print(f"Best C: {grid_search_l1.best_params_['C']}")
print(f"Best CV accuracy: {grid_search_l1.best_score_:.4f}")

# L2 Regularization (Ridge-like)
param_grid_l2 = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
logreg_l2 = LogisticRegression(penalty='l2', solver='saga', max_iter=10000, random_state=42)
grid_search_l2 = GridSearchCV(logreg_l2, param_grid_l2, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_l2.fit(X_train_scaled_cancer, y_train)

print("\nL2 Regularization:")
print(f"Best C: {grid_search_l2.best_params_['C']}")
print(f"Best CV accuracy: {grid_search_l2.best_score_:.4f}")

### Evaluating Best Models on Test Set

In [None]:
# Get best models
best_logreg_l1 = grid_search_l1.best_estimator_
best_logreg_l2 = grid_search_l2.best_estimator_

# Predictions
y_train_pred_l1 = best_logreg_l1.predict(X_train_scaled_cancer)
y_test_pred_l1 = best_logreg_l1.predict(X_test_scaled_cancer)

y_train_pred_l2 = best_logreg_l2.predict(X_train_scaled_cancer)
y_test_pred_l2 = best_logreg_l2.predict(X_test_scaled_cancer)

# Calculate accuracy
train_acc_l1 = accuracy_score(y_train, y_train_pred_l1)
test_acc_l1 = accuracy_score(y_test, y_test_pred_l1)

train_acc_l2 = accuracy_score(y_train, y_train_pred_l2)
test_acc_l2 = accuracy_score(y_test, y_test_pred_l2)

print("\n" + "=" * 60)
print("MODEL PERFORMANCE COMPARISON")
print("=" * 60)
print(f"\n{'Model':<25} {'Train Accuracy':<20} {'Test Accuracy':<20}")
print("-" * 65)
print(f"{'Baseline (No Reg)':<25} {train_acc_baseline:<20.4f} {test_acc_baseline:<20.4f}")
print(f"{'L1 Regularization':<25} {train_acc_l1:<20.4f} {test_acc_l1:<20.4f}")
print(f"{'L2 Regularization':<25} {train_acc_l2:<20.4f} {test_acc_l2:<20.4f}")

---
## Task 2: Step 3 - Regularization Experiments (L1 vs L2)

### Comparing Coefficients

In [None]:
# Create comparison DataFrame
coef_comparison_cancer = pd.DataFrame({
    'Feature': feature_names_cancer,
    'Baseline': baseline_logreg.coef_[0],
    'L1': best_logreg_l1.coef_[0],
    'L2': best_logreg_l2.coef_[0]
})

# Sort by absolute value of L1 coefficient
coef_comparison_cancer['L1_abs'] = np.abs(coef_comparison_cancer['L1'])
coef_comparison_cancer = coef_comparison_cancer.sort_values('L1_abs', ascending=False)
coef_comparison_cancer = coef_comparison_cancer.drop('L1_abs', axis=1)

print("\n" + "=" * 80)
print("COEFFICIENT COMPARISON - Top 15 Features")
print("=" * 80)
print(coef_comparison_cancer.head(15))

# Count zero coefficients in L1
zero_coef_l1 = np.sum(np.abs(best_logreg_l1.coef_[0]) < 1e-10)
print(f"\nNumber of features with zero coefficient in L1: {zero_coef_l1}/{len(feature_names_cancer)}")
print(f"L1 regularization kept {len(feature_names_cancer) - zero_coef_l1} features with non-zero coefficients.")

### Visualizing Coefficient Comparison

In [None]:
# Plot coefficient comparison for top 15 features
top_15_cancer = coef_comparison_cancer.head(15)

fig, axes = plt.subplots(1, 3, figsize=(20, 8))
models_cancer = ['Baseline', 'L1', 'L2']

for ax, model in zip(axes, models_cancer):
    ax.barh(top_15_cancer['Feature'], top_15_cancer[model])
    ax.set_xlabel('Coefficient Value', fontsize=11)
    ax.set_ylabel('Features', fontsize=11)
    ax.set_title(f'{model} Coefficients (Top 15)', fontsize=13, fontweight='bold')
    ax.axvline(x=0, color='red', linestyle='--', linewidth=1)

plt.tight_layout()
plt.show()

### Effect of C on Model Performance

In [None]:
# Test different C values
C_values = np.logspace(-3, 3, 50)
train_acc_l1_list = []
test_acc_l1_list = []
train_acc_l2_list = []
test_acc_l2_list = []
n_features_l1 = []

for C in C_values:
    # L1
    logreg_l1_temp = LogisticRegression(penalty='l1', C=C, solver='saga', max_iter=10000, random_state=42)
    logreg_l1_temp.fit(X_train_scaled_cancer, y_train)
    train_acc_l1_list.append(accuracy_score(y_train, logreg_l1_temp.predict(X_train_scaled_cancer)))
    test_acc_l1_list.append(accuracy_score(y_test, logreg_l1_temp.predict(X_test_scaled_cancer)))
    n_features_l1.append(np.sum(np.abs(logreg_l1_temp.coef_[0]) > 1e-10))
    
    # L2
    logreg_l2_temp = LogisticRegression(penalty='l2', C=C, solver='saga', max_iter=10000, random_state=42)
    logreg_l2_temp.fit(X_train_scaled_cancer, y_train)
    train_acc_l2_list.append(accuracy_score(y_train, logreg_l2_temp.predict(X_train_scaled_cancer)))
    test_acc_l2_list.append(accuracy_score(y_test, logreg_l2_temp.predict(X_test_scaled_cancer)))

# Plot
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

# L1
axes[0].plot(C_values, train_acc_l1_list, label='Train Accuracy', linewidth=2)
axes[0].plot(C_values, test_acc_l1_list, label='Test Accuracy', linewidth=2)
axes[0].set_xscale('log')
axes[0].set_xlabel('C (Inverse Regularization Strength)', fontsize=12)
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('L1 Regularization: Accuracy vs C', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# L2
axes[1].plot(C_values, train_acc_l2_list, label='Train Accuracy', linewidth=2)
axes[1].plot(C_values, test_acc_l2_list, label='Test Accuracy', linewidth=2)
axes[1].set_xscale('log')
axes[1].set_xlabel('C (Inverse Regularization Strength)', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('L2 Regularization: Accuracy vs C', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

# Number of features selected by L1
axes[2].plot(C_values, n_features_l1, linewidth=2, color='green')
axes[2].set_xscale('log')
axes[2].set_xlabel('C (Inverse Regularization Strength)', fontsize=12)
axes[2].set_ylabel('Number of Non-Zero Features', fontsize=12)
axes[2].set_title('L1: Feature Selection vs C', fontsize=14, fontweight='bold')
axes[2].grid(True, alpha=0.3)
axes[2].axhline(y=len(feature_names_cancer), color='red', linestyle='--', label='Total Features')
axes[2].legend(fontsize=11)

plt.tight_layout()
plt.show()

### Confusion Matrices Comparison

In [None]:
# Confusion Matrices
cm_baseline = confusion_matrix(y_test, y_test_pred_baseline)
cm_l1 = confusion_matrix(y_test, y_test_pred_l1)
cm_l2 = confusion_matrix(y_test, y_test_pred_l2)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
cms = [cm_baseline, cm_l1, cm_l2]
titles = ['Baseline (No Reg)', 'L1 Regularization', 'L2 Regularization']

for ax, cm, title in zip(axes, cms, titles):
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
                xticklabels=['Malignant', 'Benign'],
                yticklabels=['Malignant', 'Benign'],
                cbar_kws={'label': 'Count'})
    ax.set_xlabel('Predicted Label', fontsize=11)
    ax.set_ylabel('True Label', fontsize=11)
    ax.set_title(f'Confusion Matrix - {title}', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

### Discussion: Bias-Variance Tradeoff in Classification

**Key Observations:**

1. **L1 (Lasso-like) Regularization:**
   - Sets some coefficients to exactly zero
   - Automatically performs feature selection
   - More interpretable models with fewer features
   - Can improve generalization by reducing model complexity

2. **L2 (Ridge-like) Regularization:**
   - Shrinks all coefficients but rarely to zero
   - Keeps all features but with reduced impact
   - More stable when features are correlated
   - Generally provides smooth coefficient distributions

3. **Effect on Bias-Variance:**
   - **Large C (weak regularization)**: Model can overfit → high variance, low bias
   - **Small C (strong regularization)**: Model may underfit → low variance, high bias
   - **Optimal C**: Balances complexity and fit for best test performance

4. **Practical Insights:**
   - Use L1 when you want feature selection and interpretability
   - Use L2 when all features are potentially important
   - Cross-validation is essential for finding optimal regularization
   - Both methods help prevent overfitting and improve generalization

---
## Summary and Key Takeaways

### Part 1: Regression (California Housing)
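- Ridge and Lasso were compared with the unregularized baseline on train and test MSE
- Lasso can zero out coefficients; the zero-coefficient count shows whether it did so at the selected alpha
- The alpha for each model was selected by 5-fold cross-validation
- Regularization reduces overfitting only when the baseline overfits, so compare train and test MSE before drawing conclusions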

### Part 2: Classification (Breast Cancer)
- L1 and L2 models were compared with the unregularized baseline on train and test accuracy
- L1 can produce sparser models by setting some coefficients to zero
- L2 keeps all features while shrinking coefficient magnitudes
- Cross-validation selected C for each penalty

### General Principles:
1. **Always standardize features** before applying regularization
2. **Use cross-validation** to find optimal hyperparameters
3. **L1 for feature selection**, L2 for stability
4. **Monitor train vs test performance** to detect over/underfitting
5. **Regularization is a powerful tool** for improving generalization
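
Principles 1 and 2 can be combined by putting the scaler and the model in one pipeline, so that the scaler is refit on the training folds of every cross-validation split. The sketch below is one possible way to do this on synthetic data. It is not what the worksheet cells above do, and the names `pipeline` and `search` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# make_pipeline names each step after its lowercased class name,
# so the Ridge alpha is addressed as "ridge__alpha"
pipeline = make_pipeline(StandardScaler(), Ridge())

search = GridSearchCV(
    pipeline,
    param_grid={"ridge__alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ridge__alpha"]
print("Selected alpha:", best_alpha)
print(f"Best CV MSE: {-search.best_score_:.4f}")
```

With this arrangement the validation fold of each split never influences the scaling statistics.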

---

## Conclusion

This worksheet demonstrated the practical application of regularization techniques in both regression and classification tasks. We learned how to:
- Apply Ridge (L2) and Lasso (L1) regularization
- Tune hyperparameters using GridSearchCV
- Understand and visualize the bias-variance tradeoff
- Compare model performance across different regularization strategies

These techniques are fundamental in machine learning and essential for building robust, generalizable models.