# Linear Regression with Cross-Validation and Regularization

In this demo, we will explore two important concepts in machine learning: **Cross-Validation** and **Regularization**. We will use the `scikit-learn` package to perform these tasks.

### Cross-Validation:
Cross-validation is a technique for evaluating machine learning models by splitting the dataset into training and test sets multiple times to ensure that the model's performance generalizes well to unseen data. One common approach is **k-fold cross-validation**, where the data is split into `k` subsets, and the model is trained on `k-1` subsets while the remaining subset is used for validation. This process is repeated `k` times, and the average performance is taken.

### Regularization:
Regularization techniques add a penalty to the cost function of a machine learning model to prevent overfitting. We focus on two types of regularization:
- **Ridge Regression (L2 Regularization)**: Adds a penalty proportional to the square of the coefficients.
  
 $
  J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2
$

- **Lasso Regression (L1 Regularization)**: Adds a penalty proportional to the absolute value of the coefficients.

 $
  J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|
  $

The parameter $\lambda$ (also known as `alpha` in `scikit-learn`) controls the strength of regularization: higher values increase the penalty and thus shrink the model coefficients.


In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score, train_test_split

# Generate synthetic data for the demo
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1).ravel()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Cross-validation with a simple Linear Regression (no regularization)
lin_reg = LinearRegression()

# Perform 5-fold cross-validation and compute the average R^2 score
scores = cross_val_score(lin_reg, X_train, y_train, cv=5, scoring='r2')
print(f"Linear Regression (no regularization) 5-fold cross-validation R^2 scores: {scores}")
print(f"Average R^2: {scores.mean()}")

# Ridge Regression (L2 Regularization)
ridge_reg = Ridge(alpha=1.0)  # alpha is the regularization strength

# Perform cross-validation for Ridge
ridge_scores = cross_val_score(ridge_reg, X_train, y_train, cv=5, scoring='r2')
print(f"\nRidge Regression (L2) 5-fold cross-validation R^2 scores: {ridge_scores}")
print(f"Average R^2: {ridge_scores.mean()}")

# Lasso Regression (L1 Regularization)
lasso_reg = Lasso(alpha=0.1)  # alpha is the regularization strength

# Perform cross-validation for Lasso
lasso_scores = cross_val_score(lasso_reg, X_train, y_train, cv=5, scoring='r2')
print(f"\nLasso Regression (L1) 5-fold cross-validation R^2 scores: {lasso_scores}")
print(f"Average R^2: {lasso_scores.mean()}")

# Train and compare models on the test set
lin_reg


Linear Regression (no regularization) 5-fold cross-validation R^2 scores: [1. 1. 1. 1. 1.]
Average R^2: 1.0

Ridge Regression (L2) 5-fold cross-validation R^2 scores: [0.99840327 0.9979706  0.99798774 0.99762316 0.99803145]
Average R^2: 0.9980032443503974

Lasso Regression (L1) 5-fold cross-validation R^2 scores: [0.99212177 0.98990394 0.98996858 0.9882734  0.99021121]
Average R^2: 0.9900957787734809
