# Regularized linear models

A good way to reduce overfitting is to regularize the model (i.e., to constrain it).

The fewer degrees of freedom it has, the harder it will be for it to overfit the data.

For example, a simple way to regularize a polynomial model is to reduce its polynomial degree.

For a linear model, regularization is typically achieved by constraining the weights of the model.

**Ridge Regression, Lasso Regression, and Elastic Net** implement three different ways to constrain the weights.

**It is important to scale the data (e.g., using a StandardScaler) before performing Ridge Regression, as it is sensitive to the scale of the input features. This is true of most regularized models.**
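One way to make sure scaling always happens before the regularized model is to chain both steps in a pipeline. The sketch below is only an illustration: the synthetic data and the names `X_demo`, `y_demo`, and `scaled_ridge` are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration): two features on very different scales.
rng = np.random.default_rng(42)
X_demo = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y_demo = 3.0 * X_demo[:, 0] + 0.002 * X_demo[:, 1] + rng.normal(0, 0.1, 200)

# StandardScaler standardizes each feature, then Ridge is fit on the scaled features.
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scaled_ridge.fit(X_demo, y_demo)

# Weights learned on the standardized features (one per input feature).
scaled_coefs = scaled_ridge.named_steps["ridge"].coef_
print(scaled_coefs)
```

Because the scaler lives inside the pipeline, calling `scaled_ridge.predict(...)` applies the same scaling to new data automatically.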


# Ridge Regression (Tikhonov regularization)

- It is a regularized version of Linear Regression.
- A regularization term equal to $\frac{\alpha}{2}\sum_{i=1}^{n}\theta_i^2$ is added to the cost function.
- This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible.

**Note:** the regularization term should only be added to the cost function during training. Once the model is trained, we want to evaluate the model’s performance using the unregularized performance measure.
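As a small sketch of this point, the model below is trained with a Ridge penalty but evaluated with the plain MSE on held-out data. The synthetic data and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(300, 2))
y_demo = X_demo @ np.array([1.5, -0.5]) + rng.normal(scale=0.2, size=300)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

# Training uses the regularized cost function...
ridge_demo = Ridge(alpha=1.0).fit(X_train, y_train)

# ...but evaluation uses the unregularized MSE (no penalty term is added).
test_mse = mean_squared_error(y_test, ridge_demo.predict(X_test))
print(test_mse)
```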

**The hyperparameter α controls how much you want to regularize the model.**

If α = 0 then Ridge Regression is just Linear Regression. 

If α is very large, then all weights end up very close to zero and the result is a flat line going through the data’s mean.
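The two extremes can be checked numerically. This is a sketch on assumed synthetic data; note that Scikit-Learn's `Ridge` penalizes the sum of squared errors rather than the MSE, so its `alpha` is on a different scale than the α in this section, but the limiting behavior is the same.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

lin = LinearRegression().fit(X_demo, y_demo)
tiny = Ridge(alpha=1e-10).fit(X_demo, y_demo)  # alpha close to 0
huge = Ridge(alpha=1e8).fit(X_demo, y_demo)    # very large alpha

print("Linear Regression:", lin.coef_)
print("Ridge, tiny alpha:", tiny.coef_)   # practically identical to Linear Regression
print("Ridge, huge alpha:", huge.coef_)   # all weights very close to zero
print("Intercept vs. mean of y:", huge.intercept_, y_demo.mean())
```

With the weights squashed to nearly zero, the only thing left is the unregularized bias term, which ends up at the mean of the targets.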


$$J(\theta) = MSE(\theta) + \alpha \frac{1}{2} \sum_{i=1}^{n}\theta_i^2$$

**The bias term θ0 is not regularized (the sum starts at i = 1, not 0).**

If we define $w$ as the vector of feature weights ($\theta_1$ to $\theta_n$), then the regularization term is simply equal to $\frac{\alpha}{2}(|| w ||_2)^2$, where $|| w ||_2$ represents the $\ell_2$ norm of the weight vector.

**For Gradient Descent, just add αw to the MSE gradient vector.**
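A minimal NumPy sketch of that update, assuming batch Gradient Descent on the cost function above. The data, learning rate `eta`, and iteration count are illustrative assumptions. The result is compared with the closed-form minimizer of the same cost, $(X^\top X + \frac{\alpha m}{2} A)^{-1} X^\top y$, where $A$ is the identity matrix with a 0 in the bias position.

```python
import numpy as np

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(1)
m = 200
X_demo = rng.normal(size=(m, 2))
y_demo = 1.0 + X_demo @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=m)

X_b = np.c_[np.ones(m), X_demo]  # add x0 = 1 so theta[0] is the bias term
alpha = 0.5                      # regularization strength
eta = 0.1                        # learning rate (assumed value)
theta = np.zeros(3)

for _ in range(2000):
    mse_grad = (2 / m) * X_b.T @ (X_b @ theta - y_demo)
    # Gradient of (alpha / 2) * sum(w^2) is alpha * w; the bias gets no penalty.
    penalty_grad = alpha * np.r_[0.0, theta[1:]]
    theta -= eta * (mse_grad + penalty_grad)

# Closed-form minimizer of the same cost, used here only as a check.
A = np.diag([0.0, 1.0, 1.0])
theta_closed = np.linalg.solve(X_b.T @ X_b + (alpha * m / 2) * A, X_b.T @ y_demo)

print(theta)
print(theta_closed)
```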

## Scikit-Learn implementation

**Using closed-form solution**

In [None]:
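from sklearn.linear_model import Ridge

# X and y are the training features and targets defined earlier in the notebook
ridge_reg = Ridge(alpha=1, solver="cholesky")  # closed-form solution via Cholesky factorization
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])  # assumes X has a single feature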

**Using Stochastic Gradient Descent**

In [None]:
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(penalty="l2")  # l2 penalty gives Ridge-style regularization
sgd_reg.fit(X, y.ravel())  # .ravel() flattens y into the 1D array that fit() expects
sgd_reg.predict([[1.5]])

# Lasso Regression

Lasso stands for Least Absolute Shrinkage and Selection Operator.

- It is similar to Ridge Regression.
- It adds the $\ell_1$ norm of the weight vector as a regularization term to the cost function.

$$J(\theta) = MSE(\theta) + \alpha \sum_{i=1}^{n}|\theta_i|$$

An important characteristic of Lasso Regression is that **it tends to completely eliminate the weights of the least important features** (i.e., set them to zero).
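This behavior can be observed on a small synthetic problem in which only two of five features influence the target. The data and the `alpha` values are assumptions for illustration; Scikit-Learn scales the penalties of `Lasso` and `Ridge` differently, so the two `alpha` values are not directly comparable.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumed for illustration): only the first two features matter.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 5))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Lasso sets the weights of the three irrelevant features to exactly zero,
# while Ridge only shrinks them toward zero.
print("Lasso weights:", lasso_demo.coef_)
print("Ridge weights:", ridge_demo.coef_)
```

In effect, Lasso performs feature selection and outputs a sparse model.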

**Using Scikit-Learn's Lasso class**

In [None]:
from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])

**Using SGDRegressor**

In [None]:
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(penalty="l1")  # l1 penalty gives Lasso-style regularization
sgd_reg.fit(X, y.ravel())
sgd_reg.predict([[1.5]])

# Elastic Net

It is a middle ground between Ridge Regression and Lasso Regression. 

The regularization term is a simple mix of both Ridge and Lasso’s regularization terms, and the mix ratio r controls the balance between them.

- When r = 0, Elastic Net is equivalent to Ridge Regression.
- When r = 1, it is equivalent to Lasso Regression.

**Elastic Net cost function**

$$J(\theta) = MSE(\theta) + r\alpha \sum_{i=1}^{n}|\theta_i| + \frac{1-r}{2}\alpha \sum_{i=1}^{n}\theta_i^2$$
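To make the role of r concrete, the cost function can be written directly in NumPy. The helper `elastic_net_cost` and the toy data are hypothetical, introduced only for this sketch. The toy model fits the data exactly, so the MSE is 0 and only the penalty terms remain.

```python
import numpy as np

def elastic_net_cost(theta, X_b, y, alpha, r):
    """MSE plus the Elastic Net penalty; the bias theta[0] is not regularized."""
    w = theta[1:]
    mse = np.mean((X_b @ theta - y) ** 2)
    l1 = np.sum(np.abs(w))   # Lasso part
    l2_sq = np.sum(w ** 2)   # Ridge part
    return mse + r * alpha * l1 + (1 - r) / 2 * alpha * l2_sq

# Toy data (assumed): y = 1 + 3x exactly, with a leading column of ones for the bias.
X_b = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 4.0, 7.0])
theta = np.array([1.0, 3.0])

ridge_cost = elastic_net_cost(theta, X_b, y, alpha=0.1, r=0.0)  # 0.1 / 2 * 3**2 = 0.45
lasso_cost = elastic_net_cost(theta, X_b, y, alpha=0.1, r=1.0)  # 0.1 * |3| = 0.3
mixed_cost = elastic_net_cost(theta, X_b, y, alpha=0.1, r=0.5)  # halfway: 0.375
print(ridge_cost, lasso_cost, mixed_cost)
```

Setting r = 0 leaves only the Ridge term and r = 1 leaves only the Lasso term, matching the two special cases listed above.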


In [None]:
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5) # l1_ratio corresponds to the mix ratio r
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])

# Conclusion

It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. 

Ridge is a good default, but if you suspect that only a few features are actually useful, you should prefer Lasso or Elastic Net since they tend to reduce the useless features’ weights down to zero as we have discussed. 

In general, Elastic Net is preferred over Lasso since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.