# Hyperparameter tuning and regularization

## Ridge regression
Ridge Regression is a machine learning technique that enhances linear regression by adding an L2 penalty to the loss function, shrinking coefficients to prevent overfitting, especially with many correlated features (multicollinearity). It introduces bias to reduce variance, creating more stable, generalized models that perform better on new data by penalizing large coefficient values, unlike standard regression that can become unstable. 

![ridge](https://www.appliedaicourse.com/blog/wp-content/uploads/2024/10/ridge-regression-in-machine-learning.webp)

### The problem of overfitting
Overfitting is a condition where a model performs very well on the **train data set** but poorly on the test data set or on **new points**. This is not a desirable situation: the model has fitted the noise in the training data, so it is highly biased towards that data set and is unlikely to generalize.
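
The gap between training and test error can be made concrete with a small sketch. It assumes synthetic data (a noisy straight line) and compares a degree-1 fit with a degree-9 polynomial fit; the data, the degrees, and the helper `train_test_mse` are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a linear trend plus noise.
X = rng.uniform(-1, 1, size=(30, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=30)

# First half for training, second half held out for testing.
X_train, y_train = X[:15], y[:15]
X_test, y_test = X[15:], y[15:]


def train_test_mse(degree):
    """Fit a polynomial model of the given degree; return (train MSE, test MSE)."""
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    return train_mse, test_mse


simple_train, simple_test = train_test_mse(1)
flexible_train, flexible_test = train_test_mse(9)

print(f"degree 1: train MSE={simple_train:.4f}, test MSE={simple_test:.4f}")
print(f"degree 9: train MSE={flexible_train:.4f}, test MSE={flexible_test:.4f}")
```

The more flexible model always fits the training points at least as well as the simple one. An overfitted model typically shows a much larger gap between its training and test error.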

**Ridge** penalises large coefficients to reduce overfitting.

#### **How it works**
It adds a penalty term, **λ** times the sum of the squared slopes (**slope²**), to the cost function (MSE is taken as the example here). **λ** is a *hyperparameter*, i.e., it is chosen by the programmer rather than learned from the data.

$$ C(\theta_0, \theta_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} (\text{slope}_j)^2 $$
$$ n = \text{number of data points} $$
$$ p = \text{number of features} $$
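
As a worked example of the cost function above, the following sketch evaluates the MSE term and the L2 penalty by hand with NumPy. The function name `ridge_cost` and the toy numbers are assumptions made for illustration.

```python
import numpy as np


def ridge_cost(X, y, intercept, slopes, lam):
    """MSE plus lam times the sum of squared slopes (the intercept is not penalised)."""
    y_hat = intercept + X @ slopes
    mse = np.mean((y - y_hat) ** 2)
    penalty = lam * np.sum(slopes ** 2)
    return mse + penalty


# Toy data (hypothetical): y = 2x exactly, so a slope of 2 gives zero MSE.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

cost_perfect_fit = ridge_cost(X, y, intercept=0.0, slopes=np.array([2.0]), lam=0.5)
cost_no_penalty = ridge_cost(X, y, intercept=0.0, slopes=np.array([2.0]), lam=0.0)

print(cost_perfect_fit)  # MSE is 0, penalty is 0.5 * 2**2 = 2.0
print(cost_no_penalty)   # 0.0
```

Even a perfect fit pays a cost once λ is greater than 0, which is what pushes the optimiser towards smaller slopes.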

Note: *λ* and the *slope* are inversely related. As λ increases, the slope values shrink towards 0, but they generally never become exactly 0.
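
A minimal sketch of this shrinkage, assuming synthetic data and scikit-learn's `Ridge`. There the penalty strength is called `alpha` and plays the role of λ; scikit-learn's objective uses the sum of squared errors rather than the MSE, so the numeric scale of `alpha` differs from the formula above.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): three features, known slopes.
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    # The overall size of the coefficient vector shrinks as alpha grows.
    coef_norms.append(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>8}: coef={np.round(model.coef_, 4)}")
```

Even at the largest `alpha`, the coefficients are small but not exactly zero.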

![gradient](https://miro.medium.com/v2/resize:fit:1400/1*m1vR4evYBV7NHZUJLvpH5A.png)

## Lasso regression
Lasso regression is a method that performs both variable selection and regularization by adding a penalty term to the standard linear regression (Ordinary Least Squares) objective function. It is specifically known for its use of the L1 norm as a penalty, which enables the model to produce "sparse" solutions—meaning some coefficients are reduced to exactly zero.


### The need for feature selection
In the context of Lasso regression, feature selection is the process of identifying a small subset of the most relevant variables while discarding the rest by shrinking their coefficients to exactly zero.

#### **How it works**
It adds a penalty term, **λ** times the sum of the absolute slopes (**|slope|**), to the cost function (MSE is taken as the example here). **λ** is a *hyperparameter*, i.e., it is chosen by the programmer rather than learned from the data.

$$ C(\theta_0, \theta_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\text{slope}_j| $$
$$ n = \text{number of data points} $$
$$ p = \text{number of features} $$

Here too, *λ* and the *slope* are inversely related, but unlike Ridge, a slope can become exactly 0.
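
The sparsity can be seen in a short sketch, assuming synthetic data in which only the first two of five features influence the target. scikit-learn's `Lasso` names the penalty strength `alpha` (the counterpart of λ) and scales the squared-error term differently from the formula above, so the values are not directly comparable.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): only features 0 and 1 matter.
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

coefs = {}
zero_counts = {}
for alpha in [0.01, 0.5, 10.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    coefs[alpha] = model.coef_
    # Count the coefficients that were driven to exactly zero.
    zero_counts[alpha] = int(np.sum(model.coef_ == 0))
    print(f"alpha={alpha:>5}: coef={np.round(model.coef_, 3)}, zeros={zero_counts[alpha]}")
```

With a moderate `alpha`, the irrelevant features drop out while the two informative ones stay in the model. With a very large `alpha`, every coefficient is zeroed.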

## ElasticNet
Elastic Net is a hybrid regularization technique that combines both Ridge (L2) and Lasso (L1) penalties into a single cost function. It was created to overcome the individual limitations of each method.

$$ C(\theta_0, \theta_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda_1 \sum_{j=1}^{p} (\text{slope}_j)^2 + \lambda_2 \sum_{j=1}^{p} |\text{slope}_j| $$
$$ n = \text{number of data points} $$
$$ p = \text{number of features} $$

It is used to leverage both penalties: the Ridge term addresses ***overfitting*** and the Lasso term performs ***feature selection***.
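
A minimal sketch with scikit-learn's `ElasticNet`, assuming synthetic data. Note that scikit-learn does not take two separate λ values: it uses a single strength `alpha` and a mixing parameter `l1_ratio` (1.0 is pure Lasso, values near 0 approach Ridge), so the values below are illustrative rather than a direct translation of λ₁ and λ₂.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): only features 0 and 1 matter.
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)

print("OLS coefficients:        ", np.round(ols.coef_, 3))
print("ElasticNet coefficients: ", np.round(enet.coef_, 3))
print("ElasticNet exact zeros:  ", int(np.sum(enet.coef_ == 0)))
```

The informative coefficients are shrunk relative to ordinary least squares (the Ridge-like effect), and the L1 part can set some of the remaining coefficients to exactly zero.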

In [1]:
import pandas as pd

df = pd.read_csv('../../data/height-weight.csv')
df.head()

| Unnamed: 0 | Weight | Height |
|---|---|---|
| 0 | 45 | 120 |
| 1 | 58 | 135 |
| 2 | 48 | 123 |
| 3 | 60 | 145 |
| 4 | 70 | 160 |


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Feature matrix (2-D) and target vector (1-D)
X = df[['Weight']]
y = df['Height']

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train, X_test = scaler.fit_transform(X_train), scaler.transform(X_test)

In [1]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()

In [2]:
# from sklearn.model_selection import RandomizedSearchCV
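
One way `RandomizedSearchCV` could be used to tune the regularization strength is sketched below. It is not taken from the original notebook: it assumes synthetic data, a `Ridge` model, and a log-uniform search range for `alpha`, all chosen for illustration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration).
X = rng.normal(size=(120, 4))
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=120)

# Sample 20 candidate alphas on a log scale and score each with 5-fold CV.
search = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": loguniform(1e-3, 1e2)},
    n_iter=20,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X, y)

print("best alpha:", search.best_params_["alpha"])
print("best CV MSE:", -search.best_score_)
```

Sampling `alpha` on a log scale is a common choice because useful values can span several orders of magnitude. The same pattern should apply to `Lasso` or `ElasticNet` by swapping the estimator and the parameter distributions.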