# Module 1: Introduction to Scikit-Learn

## Section 2: Supervised Learning Algorithms

### Part 3: Ridge Regression

In this part, we will explore Ridge regression, a popular linear regression technique used to mitigate the issue of multicollinearity and overfitting in linear regression models.

### 3.1 Understanding Ridge regression

Ridge regression is a linear regression technique that extends ordinary least squares regression by adding a penalty term to the objective function. This penalty term, also known as the L2 regularization term, encourages models with smaller coefficients. Ridge helps prevent overfitting and is particularly useful when dealing with multicollinearity, a situation where two or more features are highly correlated, because shrinking the coefficients stabilizes the model and improves its generalization performance.

In Ridge regression, the objective is to minimize the following cost function:

$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\alpha\sum_{j=1}^{n}\theta_j^2$

Where:

- $m$ is the number of training examples.
- $h_\theta(x^{(i)})$ is the predicted output for the $i$-th training example using the model parameters $\theta$.
- $y^{(i)}$ is the actual output for the $i$-th training example.
- $n$ is the number of features (excluding the bias term).
- $\theta_j$ represents the value of the coefficient of the $j$-th feature.
- $\alpha$ is the regularization parameter or the strength of the penalty term.
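
The cost function above can be evaluated directly with NumPy. The following is a minimal sketch, not a library routine: the function name `ridge_cost`, the separate `bias` argument, and the toy data are assumptions made for illustration.

```python
import numpy as np

def ridge_cost(theta, bias, X, y, alpha):
    """Evaluate J(theta) = (1/2m) * sum((h - y)^2) + alpha * sum(theta_j^2)."""
    m = X.shape[0]
    predictions = X @ theta + bias            # h_theta(x) for every training example
    residuals = predictions - y
    data_term = (residuals ** 2).sum() / (2 * m)
    penalty = alpha * (theta ** 2).sum()      # the bias term is not penalized
    return data_term + penalty

# Tiny hand-checkable example: m = 3 examples, n = 2 features (made-up values).
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([5.0, 4.0, 7.0])
theta = np.array([1.0, 2.0])
bias = 0.0

cost_no_penalty = ridge_cost(theta, bias, X, y, alpha=0.0)
cost_with_penalty = ridge_cost(theta, bias, X, y, alpha=0.1)
print("Cost without penalty:", cost_no_penalty)
print("Cost with alpha=0.1: ", cost_with_penalty)
```

Here the residuals are 0, -2 and -2, so the data term is 8 / 6, and the penalty with $\alpha = 0.1$ adds 0.1 * (1 + 4) = 0.5 on top of it.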

The regularization term is the sum of the squared coefficients multiplied by the regularization parameter $\alpha$. The larger the value of $\alpha$, the greater the penalty and the more the coefficients are pushed towards zero, although in general they do not reach exactly zero. The model becomes less sensitive to variations in the independent variables; in the single-feature case, the fitted line becomes flatter.
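
The shrinking effect of $\alpha$ can be observed by fitting the same data with increasing penalty strengths. This is a small sketch on synthetic data; the dataset produced by `make_regression` and the chosen alpha values are assumptions for illustration. Note that Scikit-Learn's `Ridge` minimizes the sum of squared residuals plus `alpha` times the squared L2 norm of the coefficients, without the $\frac{1}{2m}$ factor, so its `alpha` is not numerically identical to the $\alpha$ in the formula above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression problem (illustrative values only).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

alphas = [0.01, 1.0, 100.0, 10000.0]
norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    # The L2 norm of the coefficient vector summarizes how large the weights are.
    norms.append(float(np.linalg.norm(model.coef_)))
    print(f"alpha={alpha:>8}: ||coef|| = {norms[-1]:.3f}")
```

The printed norm should fall as alpha grows, while staying above zero.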

The goal of Ridge regression is to find the best-fit line that minimizes the difference between the predicted and actual values while reducing overfitting, by adding a penalty term that shrinks the weights. For a single feature, this can be summarized as:

$\text{Ridge Regression} = \min(\text{sum of squared residuals} + \alpha \cdot \text{slope}^2)$

Advantages of Ridge Regression
1. Regularization: Ridge regression introduces regularization, which helps prevent overfitting by controlling the complexity of the model and stabilizing the model's coefficients.
2. Multicollinearity Handling: Ridge regression can handle multicollinearity, a situation where two or more features are highly correlated, by reducing the impact of highly correlated features on the model's predictions.
3. Stable and Robust: Ridge regression is stable and robust to changes in the data, making it more reliable for generalization to new data.
4. No Feature Selection: Ridge regression keeps every feature in the model rather than discarding some as Lasso regression does, making it suitable for situations where all features are expected to contribute to the prediction.

Disadvantages of Ridge Regression
1. No Sparsity: Ridge regression only shrinks coefficients towards zero and does not set them to exactly zero, so it does not perform explicit feature selection. With many features, the resulting model can be harder to interpret.
2. Hyperparameter Tuning: The choice of the regularization parameter $\alpha$ can significantly affect the model's performance. Selecting the right value for $\alpha$ often requires cross-validation.
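
The stabilizing effect under multicollinearity mentioned in the advantages above can be illustrated with two nearly identical features. This is a sketch under stated assumptions: the helper `fit_once`, the synthetic data, and `alpha=1.0` are illustrative choices, not a recommended setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

def fit_once(seed):
    """Fit OLS and Ridge on one random dataset with two near-duplicate features."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=50)
    x2 = x1 + rng.normal(scale=0.01, size=50)   # almost a copy of x1
    X = np.column_stack([x1, x2])
    y = 3.0 * x1 + rng.normal(scale=1.0, size=50)
    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=1.0).fit(X, y)
    return ols.coef_, ridge.coef_

# Repeat the experiment on 30 independent datasets and compare coefficient spread.
ols_coefs, ridge_coefs = zip(*(fit_once(seed) for seed in range(30)))
ols_spread = np.std(ols_coefs, axis=0)
ridge_spread = np.std(ridge_coefs, axis=0)
print("OLS coefficient std:  ", ols_spread)
print("Ridge coefficient std:", ridge_spread)
```

Across repetitions, the ordinary least squares coefficients should swing widely because the two features are almost interchangeable, while the Ridge coefficients should stay close to each other from run to run.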

### 3.2 Lasso vs. Ridge Regression

Both methods are used to prevent overfitting and improve the performance and interpretability of linear models when dealing with high-dimensional datasets or multicollinearity between features.
Here's a comparison between Lasso and Ridge regression:

1. Regularization term:

    Lasso regression adds an L1 regularization term to the traditional least squares cost function. It can drive some coefficients to exactly zero, effectively performing feature selection. Ridge regression, on the other hand, adds an L2 regularization term to the cost function which tends to shrink the coefficients towards zero, but rarely to exactly zero. Ridge regression can reduce the impact of multicollinearity and provide more stable coefficient estimates.

2. Feature selection:

    Lasso's L1 regularization leads to sparse models by driving some coefficients to exactly zero. This property makes Lasso regression effective for feature selection. Ridge regression does not perform feature selection as aggressively as Lasso. While it reduces the impact of less relevant features by shrinking their coefficients towards zero, it rarely sets coefficients to exactly zero. This means Ridge regression often retains all features in the model, which can be an advantage when all features might have some predictive power.

3. Multicollinearity:

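    Both Lasso and Ridge can handle multicollinearity and feature redundancy by reducing the coefficients of correlated variables, but they do so differently: Ridge tends to spread the weight across a group of correlated features, while Lasso tends to keep one of them and drive the others to zero.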

4. Model performance:

    Lasso regression tends to perform well when there is a true subset of features that are relevant to the target variable. It is particularly effective for feature selection tasks and can lead to sparse models with high interpretability. Ridge regression is often useful when there is a need to stabilize coefficient estimates and prevent overfitting in cases where feature selection is not the primary concern. It works well when all features are somewhat relevant, and it can provide more reliable predictions when multicollinearity is present.

In summary, Lasso and Ridge regression offer different regularization approaches, and the choice between the two depends on the specific characteristics of the dataset and the objectives of the modeling task. Lasso is more suitable for feature selection, while Ridge can be advantageous when handling multicollinearity and providing more stable coefficient estimates. In practice, it is common to try both methods and choose the one that gives the best performance based on cross-validation and evaluation metrics.
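
The difference in sparsity between the two penalties can be checked by counting zero coefficients. The sketch below assumes a synthetic dataset in which only 5 of 20 features are informative; the dataset and `alpha=1.0` for both models are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, of which only 5 actually influence the target.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero in each model.
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print("Lasso coefficients set to zero:", lasso_zeros)
print("Ridge coefficients set to zero:", ridge_zeros)
```

Lasso is expected to zero out several of the uninformative features, while Ridge keeps small but non-zero weights on all of them.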

### 3.3 Training and Evaluation

In Scikit-Learn, Ridge regression can be easily implemented using the Ridge class from the linear_model module. The regularization parameter $\alpha$ is set through the alpha hyperparameter. By varying the value of $\alpha$, you can control the amount of regularization and the impact of the penalty on the model.

To train a Ridge regression model, we need a labeled dataset with the target variable and the corresponding feature values. The model learns by minimizing the regularized objective function, which includes the sum of squared residuals from the ordinary least squares regression and the L2 regularization term.

Once trained, we can use the Ridge regression model to make predictions for new, unseen data points. The model predicts the target values based on the learned coefficients and the feature values.
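
As a small illustration of how a prediction follows from the learned coefficients, the sketch below reproduces `predict` by hand from the fitted `coef_` and `intercept_` attributes. The synthetic data and the new sample `X_new` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Small synthetic dataset (illustrative values only).
X, y = make_regression(n_samples=50, n_features=3, noise=5.0, random_state=0)
model = Ridge(alpha=1.0).fit(X, y)

# A new, unseen data point with three feature values.
X_new = np.array([[0.5, -1.0, 2.0]])

# A linear model predicts with: features times coefficients, plus intercept.
manual_prediction = X_new @ model.coef_ + model.intercept_
library_prediction = model.predict(X_new)
print("Manual: ", manual_prediction)
print("predict:", library_prediction)
```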

Scikit-Learn provides the Ridge class for performing Ridge regression. Here's an example of how to use it:

In [None]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

np.random.seed(42)  # Set a fixed seed for reproducibility
# Load the Breast Cancer dataset (binary 0/1 target, treated here as a numeric value)
data = load_breast_cancer()
X, y = data.data, data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and evaluate linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
linear_predictions = linear_model.predict(X_test)
linear_mae = mean_absolute_error(y_test, linear_predictions)
linear_r2 = r2_score(y_test, linear_predictions)
print("Linear Model Mean Absolute Error (MAE):", linear_mae)
print("Linear Model R-squared (R2) Score:", linear_r2)

# Fit and evaluate Lasso regression model
lasso_model = Lasso(max_iter=1000, alpha=1.0)  # Setting alpha to 1.0 to add regularization (L1 penalty)
lasso_model.fit(X_train, y_train)
lasso_predictions = lasso_model.predict(X_test)
lasso_mae = mean_absolute_error(y_test, lasso_predictions)
lasso_r2 = r2_score(y_test, lasso_predictions)
print("\nLasso Model Mean Absolute Error (MAE):", lasso_mae)
print("Lasso Model R-squared (R2) Score:", lasso_r2)

# Fit and evaluate Ridge regression model
ridge_model = Ridge(max_iter=1000, alpha=1.0)  # Setting alpha to 1.0 to add regularization (L2 penalty)
ridge_model.fit(X_train, y_train)
ridge_predictions = ridge_model.predict(X_test)
ridge_mae = mean_absolute_error(y_test, ridge_predictions)
ridge_r2 = r2_score(y_test, ridge_predictions)
print("\nRidge Model Mean Absolute Error (MAE):", ridge_mae)
print("Ridge Model R-squared (R2) Score:", ridge_r2)

In this example, we evaluate three linear models (Linear Regression, Lasso Regression, and Ridge Regression) on the Breast Cancer dataset, treating its binary (0/1) target as a numeric value. The dataset is split into training and testing sets, and each model's performance is measured using Mean Absolute Error (MAE) and R-squared (R2) Score. Because many features in this dataset are strongly correlated, the Ridge regression model (with L2 regularization) can be expected to perform better than both linear regression and Lasso regression in this scenario: it handles multicollinearity and gives more stable predictions without doing feature selection.

In general, when comparing the performance of linear models, we want to minimize the MAE and maximize the R-squared score. The lower the MAE, the better the model's predictive accuracy, and the higher the R-squared score, the better the model fits the data.

Based on the results, the Ridge model (Model 3) has the lowest MAE (0.1889) and the highest R-squared score (0.7472), making it the best-performing of the three. The linear regression model (Model 1) also performs reasonably well, with a low MAE and a relatively high R-squared score, while the Lasso model (Model 2) is the least accurate of the three.

Additionally, hyperparameter tuning, such as fine-tuning the alpha parameter for Lasso regression and Ridge regression, could potentially improve the performance.
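
To make the two metrics used above concrete, the sketch below computes MAE and R-squared by hand on a few made-up values and compares them with Scikit-Learn's functions. The arrays `y_true` and `y_pred` are illustrative, not results from the models above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up targets and predictions for illustration.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 6.0])

# MAE: the average absolute difference between prediction and truth.
mae_manual = np.mean(np.abs(y_true - y_pred))

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MAE:", mae_manual, "| sklearn:", mean_absolute_error(y_true, y_pred))
print("R2: ", r2_manual, "| sklearn:", r2_score(y_true, y_pred))
```

For these values the absolute errors are 0.5, 0, 1 and 1, so the MAE is 2.5 / 4 = 0.625.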

### 3.4 Hyperparameter Tuning

Ridge regression has a hyperparameter called alpha that controls the strength of the regularization. Higher values of alpha result in stronger regularization and smaller coefficients. The choice of alpha is a trade-off between bias and variance: too small a value leaves the model close to ordinary least squares, while too large a value underfits. Search strategies such as grid search or randomized search, combined with cross-validation, can be used to find a good value of alpha.
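
Before using a ready-made search utility, it can help to see what a cross-validated search over alpha amounts to. The following is a minimal sketch using `cross_val_score` in an explicit loop; the synthetic dataset and the six candidate alphas are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative values only).
X, y = make_regression(n_samples=150, n_features=10, noise=15.0, random_state=0)

alphas = np.logspace(-2, 3, 6)   # 0.01, 0.1, 1, 10, 100, 1000
mean_scores = []
for alpha in alphas:
    # 5-fold cross-validation: average R2 over the five held-out folds.
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5, scoring="r2")
    mean_scores.append(scores.mean())
    print(f"alpha={alpha:>8.2f}: mean CV R2 = {scores.mean():.4f}")

# Keep the alpha with the highest average validation score.
best_alpha = alphas[int(np.argmax(mean_scores))]
print("Best alpha:", best_alpha)
```

A grid search automates this loop and additionally refits the model on the full training set with the selected value.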

In [None]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error, r2_score

np.random.seed(42)  # Set a fixed seed for reproducibility
# Load the Breast Cancer dataset (binary 0/1 target, treated here as a numeric value)
data = load_breast_cancer()
X, y = data.data, data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and evaluate linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
linear_predictions = linear_model.predict(X_test)
linear_mae = mean_absolute_error(y_test, linear_predictions)
linear_r2 = r2_score(y_test, linear_predictions)
print("Linear Model Mean Absolute Error (MAE):", linear_mae)
print("Linear Model R-squared (R2) Score:", linear_r2)

# Perform cross-validated hyperparameter tuning for Lasso regression
lasso_params = {'alpha': np.logspace(-3, 5, 100)}  # Define a range of alpha values
lasso_model = Lasso(max_iter=100000)
lasso_cv = GridSearchCV(lasso_model, lasso_params, cv=5)
lasso_cv.fit(X_train, y_train)
best_lasso_alpha = lasso_cv.best_params_['alpha']
print("\nBest Lasso alpha found:", best_lasso_alpha)
# Fit and evaluate Lasso regression model
lasso_model = Lasso(max_iter=100000, alpha=best_lasso_alpha)  # Setting alpha to best alpha found to add regularization (L1 penalty)
lasso_model.fit(X_train, y_train)
lasso_predictions = lasso_model.predict(X_test)
lasso_mae = mean_absolute_error(y_test, lasso_predictions)
lasso_r2 = r2_score(y_test, lasso_predictions)
print("Lasso Model Mean Absolute Error (MAE):", lasso_mae)
print("Lasso Model R-squared (R2) Score:", lasso_r2)

# Perform cross-validated hyperparameter tuning for Ridge regression
ridge_params = {'alpha': np.logspace(-3, 5, 100)}  # Define a range of alpha values
ridge_model = Ridge(max_iter=100000)
ridge_cv = GridSearchCV(ridge_model, ridge_params, cv=5)
ridge_cv.fit(X_train, y_train)
best_ridge_alpha = ridge_cv.best_params_['alpha']
print("\nBest Ridge alpha found:", best_ridge_alpha)
# Fit and evaluate Ridge regression model
ridge_model = Ridge(max_iter=100000, alpha=best_ridge_alpha)  # Setting alpha to best alpha found to add regularization (L2 penalty)
ridge_model.fit(X_train, y_train)
ridge_predictions = ridge_model.predict(X_test)
ridge_mae = mean_absolute_error(y_test, ridge_predictions)
ridge_r2 = r2_score(y_test, ridge_predictions)
print("Ridge Model Mean Absolute Error (MAE):", ridge_mae)
print("Ridge Model R-squared (R2) Score:", ridge_r2)

The provided code performs linear regression and hyperparameter tuning for Lasso and Ridge regression models on the Breast Cancer dataset. The dataset is split into training and testing sets. Firstly, a linear regression model is fit and evaluated, providing the Mean Absolute Error (MAE) and R-squared (R2) Score for the linear model's performance.

Next, hyperparameter tuning is performed using GridSearchCV for both Lasso and Ridge regression models. A range of alpha values is defined for each regularization term, and the best alpha value is determined through cross-validation on the training data. The Lasso and Ridge models are then fitted using the best alpha values obtained from the hyperparameter tuning.

The performance of the tuned Lasso and Ridge regression models is evaluated, and their respective MAE and R-squared scores are provided.

The linear regression model provides an MAE of 0.1969 and an R-squared score of 0.7271. After tuning, both the Lasso and Ridge regression models outperform the basic linear regression model. The Lasso regression achieves an MAE of 0.1907 and an R-squared score of 0.7443, while the Ridge regression achieves an MAE of 0.1907 and an R-squared score of 0.7570.

By performing hyperparameter tuning, the Lasso and Ridge regression models are better able to generalize to unseen data and improve their predictive accuracy compared to the initial linear regression model. These tuned models with regularization provide a more robust approach to modeling the data and reducing overfitting.

Overall, the code demonstrates how to apply linear regression and tune the regularization strength for Lasso and Ridge regression to improve their performance on the Breast Cancer dataset. By tuning the hyperparameters, the models achieve better generalization and predictive accuracy.
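
For Ridge specifically, Scikit-Learn also offers the `RidgeCV` class, which bundles the alpha search and the final fit into one estimator. The sketch below is one possible alternative to the grid search above, not a replacement for it; the variable name `ridge_cv_model` and the reuse of the same alpha range and split are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Same dataset and split as in the example above.
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# RidgeCV tries every candidate alpha with 5-fold cross-validation,
# then refits on the full training set with the best one.
alphas = np.logspace(-3, 5, 100)
ridge_cv_model = RidgeCV(alphas=alphas, cv=5).fit(X_train, y_train)

predictions = ridge_cv_model.predict(X_test)
test_mae = mean_absolute_error(y_test, predictions)
test_r2 = r2_score(y_test, predictions)
print("Selected alpha:", ridge_cv_model.alpha_)
print("Test MAE:", test_mae)
print("Test R2: ", test_r2)
```

The selected value is available as the `alpha_` attribute after fitting.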

### 3.5 Summary

Ridge regression is a useful technique for linear regression tasks, especially when dealing with multicollinearity and overfitting. It adds a regularization term to the objective function, promoting models with smaller coefficients. Lasso regression adds an L1 regularization term to the traditional least squares cost function and can drive some coefficients to exactly zero, effectively performing feature selection. Ridge regression, on the other hand, adds an L2 regularization term, which tends to shrink the coefficients towards zero but rarely to exactly zero. Ridge regression can therefore reduce the impact of multicollinearity and provide more stable coefficient estimates while retaining all features in the model, which can be an advantage when all features might have some predictive power.

Ridge regression is a powerful tool for regularization and handling multicollinearity. It can stabilize the model and improve generalization performance in situations where multicollinearity is present. However, it's essential to carefully tune the regularization parameter and consider the trade-off between bias and variance when using Ridge regression.