# Part 1: Regularization in Machine Learning

Regularization reduces overfitting in machine learning models by penalizing model complexity. It adds a penalty term, typically a function of the coefficient magnitudes, to the loss function, so that the optimizer has to trade training error against complexity.

## Types of Regularization (Self-Study)

The two most common types of regularization are L1 (Lasso) regularization and L2 (Ridge) regularization.

### L1 Regularization - Lasso

L1 regularization adds the sum of the absolute values of the coefficients to the loss function. It can lead to sparse models where some feature coefficients are zeroed out, effectively performing feature selection.

$$ \text{Loss Function} = \text{MSE} + \alpha \sum_{i=1}^{n} |w_i| $$

### L2 Regularization - Ridge

L2 regularization adds the sum of the squares of the coefficients to the loss function. It doesn't zero out coefficients but ensures they are small, leading to a less complex model.

$$ \text{Loss Function} = \text{MSE} + \alpha \sum_{i=1}^{n} w_i^2 $$
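
The contrast between the two penalties can be checked empirically. The following is a minimal sketch, assuming a toy dataset in which only 5 of 20 features carry signal; the dataset settings and `alpha` values are illustrative choices, not part of the lesson. Note that scikit-learn scales the squared-error term differently in `Lasso` and `Ridge`, so the same `alpha` is not directly comparable between the two estimators.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Assumed toy data: 20 features, only 5 of which influence the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print(f"Lasso zeroed coefficients: {n_zero_lasso} of {lasso.coef_.size}")
print(f"Ridge zeroed coefficients: {n_zero_ridge} of {ridge.coef_.size}")
```

Lasso should report several coefficients that are exactly zero, while Ridge keeps all of them non-zero (only smaller).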

### Elastic Net

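Elastic Net combines the L1 and L2 penalties. It is useful when several features are correlated: Lasso alone tends to keep one feature from a correlated group and drop the others, while the added L2 term encourages correlated features to share the weight.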

$$ \text{Loss Function} = \text{MSE} + \alpha \rho \sum_{i=1}^{n} |w_i| + \alpha (1-\rho) \sum_{i=1}^{n} w_i^2 $$
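
As a small worked example of the penalty term in this formula, the sketch below evaluates it for a hypothetical coefficient vector `w`. The function name `elastic_net_penalty` is introduced here for illustration only. It follows the formula as written in this section; scikit-learn's `ElasticNet` uses the same structure but places an extra factor of 1/2 on the L2 term.

```python
import numpy as np

def elastic_net_penalty(w, alpha, rho):
    """Penalty term of the Elastic Net loss: alpha*rho*L1 + alpha*(1-rho)*L2."""
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return alpha * rho * l1 + alpha * (1.0 - rho) * l2

w = np.array([3.0, -4.0, 0.0])  # hypothetical coefficient vector

pure_l1 = elastic_net_penalty(w, alpha=0.1, rho=1.0)  # 0.1 * (3 + 4)
pure_l2 = elastic_net_penalty(w, alpha=0.1, rho=0.0)  # 0.1 * (9 + 16)
mixed = elastic_net_penalty(w, alpha=0.1, rho=0.5)    # 0.05 * 7 + 0.05 * 25

print(pure_l1, pure_l2, mixed)
```

Setting `rho=1.0` reproduces the Lasso penalty and `rho=0.0` reproduces the Ridge penalty, so both earlier formulas are special cases of this one.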

### Choosing the Regularization Parameter

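The regularization parameter ($\alpha$) controls the strength of the penalty: larger values shrink the coefficients more, and $\alpha = 0$ recovers the unregularized model. The L1 ratio ($\rho$) controls the mix of the two penalties: $\rho = 1$ gives a pure L1 (Lasso) penalty and $\rho = 0$ gives a pure L2 (Ridge) penalty.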
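
The effect of the penalty strength can be sketched by fitting the same model with increasing values of `alpha` and measuring the size of the coefficient vector. The dataset and the three `alpha` values below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Assumed toy data with some noise so that shrinkage is visible.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

alphas = [0.1, 10.0, 1000.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    # Euclidean norm of the coefficient vector as a simple size measure.
    coef_norms.append(float(np.linalg.norm(model.coef_)))

for alpha, norm in zip(alphas, coef_norms):
    print(f"alpha={alpha:>7}: ||w||_2 = {norm:.2f}")
```

The norm should fall as `alpha` grows, which is the shrinkage the penalty is designed to produce.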

**Benefits of Regularization**

- Reduces overfitting by penalizing large coefficients.
- Can lead to simpler models that generalize better.
- In the case of L1 regularization, can perform feature selection by zeroing out coefficients.

## Hands-On (Practical)

In [1]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc
from sklearn.datasets import make_regression, make_classification

In [2]:
# Generate a synthetic regression dataset with 1000 samples and 20 features
# and split it into 80% training and 20% testing sets.

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train a Linear Regression model on the training data.

linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

In [3]:
# Generate predictions on the test set using the trained linear model
# and calculate the Mean Squared Error (MSE) to evaluate performance.

linear_preds = linear_reg.predict(X_test)
linear_mse = mean_squared_error(y_test, linear_preds)

print(f'Linear MSE: {linear_mse}')

Linear MSE: 5.575382127163064e-26


In [4]:
# Initialize a Ridge regression model with regularization strength (alpha) of 1.0
# and fit it to the training data.

ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X_train, y_train)

In [5]:
# Generate predictions on the test set using the trained Ridge model
# and calculate the Mean Squared Error (MSE).

ridge_preds = ridge_reg.predict(X_test)
ridge_mse = mean_squared_error(y_test, ridge_preds)

print(f'Ridge MSE: {ridge_mse}')

Ridge MSE: 0.08267563475452225


In [6]:
# Initialize a Lasso regression model with alpha=0.1, train it on the
# training set, make predictions on the test set, and compute the MSE.

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X_train, y_train)

lasso_preds = lasso_reg.predict(X_test)
lasso_mse = mean_squared_error(y_test, lasso_preds)

print(f'Lasso MSE: {lasso_mse}')

Lasso MSE: 0.09972349965638024

The unregularized model scores best here because `make_regression` produces noise-free targets by default (`noise=0.0`): with no noise to overfit, the Ridge and Lasso penalties only add a small bias.


The `LogisticRegression` model in `sklearn` has regularization built in. The default penalty is L2 (Ridge).
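
In `LogisticRegression`, the penalty strength is set through `C`, which is the inverse of the regularization strength: smaller values of `C` mean a stronger penalty. The sketch below illustrates this with two assumed values of `C`; the variable names `strong` and `weak` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=0)

# Smaller C means a stronger L2 penalty, so coefficients shrink toward zero.
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)

strong_norm = float(np.linalg.norm(strong.coef_))
weak_norm = float(np.linalg.norm(weak.coef_))

print(f"C=0.01  -> ||w||_2 = {strong_norm:.3f}")
print(f"C=100.0 -> ||w||_2 = {weak_norm:.3f}")
```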

In [7]:
# Generate a binary classification dataset with 1000 samples and 20 features
# and split it into 80% training and 20% testing sets.

X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train a Logistic Regression model with L2 regularization (C=1.0) on the training data.

log_ridge_reg = LogisticRegression(C=1.0)
log_ridge_reg.fit(X_train, y_train)

In [8]:
# Generate predictions on the test set and evaluate the model's performance
# using a confusion matrix, accuracy, precision, recall, and F1 score.

y_pred = log_ridge_reg.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [9]:
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

Confusion Matrix:
[[ 90   3]
 [  4 103]]
Accuracy: 0.96
Precision: 0.97
Recall: 0.96
F1 Score: 0.97


In [10]:
# Train a Logistic Regression model with L1 regularization (Lasso) to encourage feature sparsity.
# The 'saga' solver is used because it supports L1 penalties.

log_lasso_reg = LogisticRegression(C=1.0, penalty='l1', solver='saga', max_iter=10000)
log_lasso_reg.fit(X_train, y_train)

In [11]:
# Generate predictions on the test set and evaluate the model's performance
# using a confusion matrix, accuracy, precision, recall, and F1 score.

y_pred = log_lasso_reg.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [12]:
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

Confusion Matrix:
[[ 90   3]
 [  4 103]]
Accuracy: 0.96
Precision: 0.97
Recall: 0.96
F1 Score: 0.97


In [13]:
# Train a Logistic Regression model with Elastic Net regularization, which combines L1 and L2 penalties.

log_enet_reg = LogisticRegression(penalty='elasticnet', solver='saga', C=1, l1_ratio=0.5, max_iter=10000)
log_enet_reg.fit(X_train, y_train)

In [14]:
# Generate predictions on the test set and evaluate the model's performance
# using a confusion matrix, accuracy, precision, recall, and F1 score.

y_pred = log_enet_reg.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [15]:
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

Confusion Matrix:
[[ 90   3]
 [  4 103]]
Accuracy: 0.96
Precision: 0.97
Recall: 0.96
F1 Score: 0.97


## Hyperparameter Tuning (Self-Study)

The regularization parameter ($\alpha$) and L1 ratio ($\rho$) are examples of **hyperparameters**. In machine learning, models have parameters that are learned from the data and hyperparameters that are set by the practitioner. Hyperparameter tuning is the process of finding the optimal combination of hyperparameters that yields the best performance. **Cross-validation** and **grid search** are two techniques used for this purpose.

### Cross-Validation

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. Cross-validation involves partitioning the data into subsets, training the model on some subsets, and validating it on the remaining subsets. This process is repeated multiple times, and the results are averaged to estimate the model's performance. 

The goal is to estimate how well a model's predictions will generalize to an independent dataset, which is the same purpose the train-test split has served so far.
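
The procedure described above can be sketched with scikit-learn's `cross_val_score`. The dataset, the choice of `Ridge(alpha=1.0)`, and the use of 5 folds are assumptions made for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Assumed toy data; noise is added so the scores are not trivially perfect.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# 5-fold CV: train on 4 folds, validate on the remaining one, repeat 5 times.
# scikit-learn maximizes scores, so MSE is reported as a negative number.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5,
                         scoring="neg_mean_squared_error")
fold_mse = -scores

print("MSE per fold:", fold_mse.round(2))
print("Mean MSE:", fold_mse.mean().round(2))
```

The mean over the folds is the single number normally reported as the cross-validated estimate.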

The train-test split is a simple method that divides the dataset into two parts: one for training the model and one for testing it. However, it has the following disadvantages when used for hyperparameter tuning:

- The evaluation may depend on how the data is split.
- The test score can vary significantly based on which data points end up in the training set and which in the test set.
- Less effective use of data, especially in cases where the amount of training data is limited.
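
The dependence on the split can be seen by repeating the same train-test experiment with different random seeds. This is a minimal sketch on an assumed small, noisy dataset; the exact numbers will depend on the data and the scikit-learn version.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumed small, noisy dataset to make the effect easy to see.
X, y = make_regression(n_samples=100, n_features=20, noise=20.0, random_state=0)

split_mse = []
for seed in range(5):
    # Only the seed of the split changes; model and data stay the same.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = Ridge(alpha=1.0).fit(X_train, y_train)
    split_mse.append(mean_squared_error(y_test, model.predict(X_test)))

for seed, mse in enumerate(split_mse):
    print(f"random_state={seed}: test MSE = {mse:.1f}")
print(f"Spread (max - min): {max(split_mse) - min(split_mse):.1f}")
```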

Cross-validation has the following advantages for hyperparameter tuning:

- More reliable estimate of out-of-sample performance due to multiple rounds of splitting.
- Makes better use of data, as each data point is in the validation set exactly once and in the training set K-1 times.

**Types of Cross-Validation**

![kfoldcv](../assets/kfold-cv.png)

- **K-Fold Cross-Validation**: The dataset is divided into K folds of approximately equal size. Each fold serves as the validation set once and as part of the training set K-1 times. The performance metric is averaged across all K trials.
- **Leave-One-Out Cross-Validation**: A special case of K-Fold Cross-Validation where K equals the number of data points in the dataset.
- **Stratified K-Fold Cross-Validation**: Similar to K-Fold, but each fold preserves the percentage of samples for each class.
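
The three schemes listed above correspond to the `KFold`, `LeaveOneOut`, and `StratifiedKFold` splitters in scikit-learn. The sketch below checks their defining properties on a tiny hypothetical dataset of 12 samples with an 8-to-4 class imbalance.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold

X = np.arange(24).reshape(12, 2)   # 12 hypothetical samples
y = np.array([0] * 8 + [1] * 4)    # imbalanced labels: 8 vs 4

# K-Fold: every sample lands in a validation fold exactly once.
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
val_counts = np.zeros(len(X), dtype=int)
for train_idx, val_idx in kfold.split(X):
    val_counts[val_idx] += 1

# Leave-One-Out: one split per sample.
n_loo_splits = LeaveOneOut().get_n_splits(X)

# Stratified K-Fold: each validation fold keeps the 2:1 class ratio.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
val_class_counts = [np.bincount(y[val_idx], minlength=2).tolist()
                    for _, val_idx in skf.split(X, y)]

print("Times each sample was validated:", val_counts)
print("Leave-one-out splits:", n_loo_splits)
print("Class counts per stratified fold:", val_class_counts)
```

Stratification matters most for imbalanced classification data, where a plain K-Fold split could leave a fold with few or no samples of the minority class.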

### Grid Search
![grid-random-search](../assets/grid-random-search.png)

Grid search is a brute-force method for finding good hyperparameters. A grid of candidate hyperparameter values is defined, and the model is evaluated for every combination in the grid using cross-validation.

**Process**

1. Define the hyperparameter grid.
2. Use cross-validation to evaluate each combination of hyperparameters.
3. Select the combination that yields the best performance.
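
The three steps above can be sketched with scikit-learn's `GridSearchCV`, here tuning the `alpha` and `l1_ratio` hyperparameters of an `ElasticNet` model. The dataset and the grid values are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Step 1: define the grid (3 x 3 = 9 combinations; values are illustrative).
param_grid = {
    "alpha": [0.01, 0.1, 1.0],
    "l1_ratio": [0.2, 0.5, 0.8],
}

# Step 2: evaluate every combination with 5-fold cross-validation.
grid_search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(X, y)

# Step 3: pick the combination with the best mean validation score.
print("Best parameters:", grid_search.best_params_)
print("Best CV MSE:", -grid_search.best_score_)
```

The cost grows multiplicatively with each added hyperparameter: 9 combinations with 5 folds already means 45 model fits.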

### Random Search

Random search is an alternative to grid search. Instead of trying out every possible combination, it samples a random subset of parameter combinations.

**Advantages**

- Can be faster than grid search when dealing with a large hyperparameter space.
- Can sometimes find a better combination of hyperparameters by exploring a wider range of values.

**Process**

1. Define a search space as a bounded domain of hyperparameter values.
2. Randomly sample combinations of hyperparameters from this domain.
3. Perform cross-validation for each combination.
4. Select the combination that yields the best validation score.
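
The four steps above can be sketched with scikit-learn's `RandomizedSearchCV`. The search ranges, the use of `scipy.stats` distributions, and the budget of 10 samples are assumptions chosen for illustration.

```python
from scipy.stats import loguniform, uniform
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Step 1: bounded search space (ranges are illustrative assumptions).
param_distributions = {
    "alpha": loguniform(1e-3, 1e1),   # alpha in [0.001, 10], log scale
    "l1_ratio": uniform(0.1, 0.8),    # l1_ratio in [0.1, 0.9]
}

# Steps 2-3: sample 10 combinations and cross-validate each one.
random_search = RandomizedSearchCV(
    ElasticNet(max_iter=10000), param_distributions, n_iter=10, cv=5,
    scoring="neg_mean_squared_error", random_state=0)
random_search.fit(X, y)

# Step 4: keep the combination with the best mean validation score.
print("Best parameters:", random_search.best_params_)
print("Best CV MSE:", -random_search.best_score_)
```

Unlike a grid, the number of evaluations is set directly by `n_iter`, so the budget stays fixed no matter how many hyperparameters are searched.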

