# Regularization

## Cost Function
Let's assume we have two functions: $\Theta_0 +\Theta_1x + \Theta_2x^2$ (which fits the data well) and $\Theta_0 +\Theta_1x + \Theta_2x^2 + \Theta_3x^3 + \Theta_4x^4$ (which overfits the data). How can we penalize the second function to avoid overfitting?

Recall that the cost function is: $J = \frac{1}{2m}\sum_{i=1}^m(h_{\Theta}(x^{(i)}) - y^{(i)})^2$

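Ex: We add the terms $1000 \cdot \Theta_3^2$ and $1000 \cdot \Theta_4^2$ to the cost function. Minimizing the new cost then forces $\Theta_3$ and $\Theta_4$ to be close to zero, so the fourth-degree function behaves almost like the quadratic one.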
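As a rough illustration of this penalty idea, the sketch below evaluates the cost with and without the extra terms for a quadratic and a fourth-degree parameter vector. The toy data, the helper names `cost` and `penalized_cost`, and the squared form of the penalty are assumptions made for the example.

```python
import numpy as np

def cost(theta, X, y):
    """Unregularized cost J = 1/(2m) * sum((h(x) - y)^2) for a linear hypothesis."""
    m = len(y)
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * m)

def penalized_cost(theta, X, y, weight=1000.0):
    """Cost plus an ad hoc penalty that punishes only Theta_3 and Theta_4."""
    return cost(theta, X, y) + weight * theta[3] ** 2 + weight * theta[4] ** 2

# Toy data (made up): points that lie exactly on a quadratic
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
X = np.vander(x, 5, increasing=True)  # columns: 1, x, x^2, x^3, x^4
y = 1.0 + 2.0 * x + 0.5 * x ** 2

quadratic = np.array([1.0, 2.0, 0.5, 0.0, 0.0])  # Theta_3 = Theta_4 = 0
wiggly = np.array([1.0, 2.0, 0.5, 0.1, 0.1])     # small higher-order terms

for name, theta in [("quadratic", quadratic), ("wiggly", wiggly)]:
    print(name, cost(theta, X, y), penalized_cost(theta, X, y))
```

Even modest values of $\Theta_3$ and $\Theta_4$ add a large amount to the penalized cost, so an optimizer minimizing it is pushed toward the quadratic.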

## Regularized Cost Function
$J = \frac{1}{2m} \left[\sum_{i=1}^m(h_{\Theta}(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^n\Theta_j^2\right]$

The regularization term $\lambda\sum_{j=1}^n\Theta_j^2$ is added in order to keep the parameters small. The sum starts at $j=1$, so $\Theta_0$ is not penalized.
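A minimal NumPy sketch of this regularized cost, assuming a linear hypothesis $h_{\Theta}(x) = \Theta^Tx$ and a design matrix whose first column is all ones. The function name `regularized_cost` and the toy data are assumptions for illustration.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J = 1/(2m) * [sum of squared errors + lam * sum_{j>=1} theta_j^2]."""
    m = len(y)
    residuals = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)  # skip theta[0]: the bias is not penalized
    return float(residuals @ residuals + penalty) / (2 * m)

# Toy data (made up): y = 1 + 2x exactly, so theta = [1, 2] has zero error
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = np.array([1.0, 2.0])

cost_without_penalty = regularized_cost(theta, X, y, lam=0.0)
cost_with_penalty = regularized_cost(theta, X, y, lam=4.0)
print(cost_without_penalty, cost_with_penalty)
```

With a perfect fit, the only contribution to the cost comes from the penalty, which grows with both $\lambda$ and the size of $\Theta_1$.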

### Regularization parameter
The regularization parameter $\lambda$ controls the trade-off between two goals: fitting the training data well and keeping the parameters small (keeping the hypothesis simple).

It is important to note that if $\lambda$ is too large (say $10^{10}$), all penalized parameters are pushed toward zero, the hypothesis becomes almost the constant $\Theta_0$, and the model underfits.
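The effect of a very large $\lambda$ can be checked numerically. The sketch below minimizes the regularized cost in closed form (a normal equation with a penalty matrix, which is a technique swapped in here for brevity and not part of these notes). The function name `fit_regularized` and the synthetic data are assumptions.

```python
import numpy as np

def fit_regularized(X, y, lam):
    """Closed-form minimizer of the regularized cost; Theta_0 is not penalized."""
    n = X.shape[1]
    D = np.eye(n)
    D[0, 0] = 0.0  # leave the bias term out of the penalty
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

# Synthetic data (made up): a noisy line with slope 1.5
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=50)
X = np.column_stack([np.ones_like(x), x])

theta_no_reg = fit_regularized(X, y, lam=0.0)
theta_huge_reg = fit_regularized(X, y, lam=1e10)
print("lambda = 0:    ", theta_no_reg)
print("lambda = 1e10: ", theta_huge_reg)
```

With $\lambda = 10^{10}$ the slope collapses to practically zero and the model predicts roughly the mean of `y` everywhere, which is the underfitting described above.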

## Regularized Linear Regression

$\Theta_j = \Theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^m(h_{\Theta}(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\Theta_j\right]$

Note: the update for $\Theta_0$ keeps its unregularized form, because we don't want to penalize $\Theta_0$.

The above can also be written as:
$\Theta_j = \Theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^m(h_{\Theta}(x^{(i)})-y^{(i)})x_j^{(i)}$

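Note that, since $\alpha$, $\lambda$ and $m$ are all positive, the factor $(1 - \alpha\frac{\lambda}{m})$ is always less than one. Each update therefore first shrinks $\Theta_j$ slightly and then applies the usual unregularized gradient step.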
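The two ways of writing the update can be compared numerically. This is a small sketch assuming a linear hypothesis and a design matrix with a leading column of ones. The function names, the toy data and the chosen values of `alpha` and `lam` are assumptions.

```python
import numpy as np

def step_additive(theta, X, y, alpha, lam):
    """Update written as: theta_j - alpha * (gradient_j + lam/m * theta_j)."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    reg = (lam / m) * theta
    reg[0] = 0.0  # Theta_0 is not regularized
    return theta - alpha * (grad + reg)

def step_shrink(theta, X, y, alpha, lam):
    """Update written as: theta_j * (1 - alpha*lam/m) - alpha * gradient_j."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    shrink = np.full_like(theta, 1.0 - alpha * lam / m)
    shrink[0] = 1.0  # Theta_0 is not shrunk
    return theta * shrink - alpha * grad

# Toy data (made up)
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.1, 2.9, 5.2, 6.8])
theta = np.array([0.5, 0.5])
alpha, lam = 0.1, 2.0

new_additive = step_additive(theta, X, y, alpha, lam)
new_shrink = step_shrink(theta, X, y, alpha, lam)
shrink_factor = 1.0 - alpha * lam / len(y)
print(new_additive, new_shrink, shrink_factor)
```

Both functions return the same parameters, and the shrink factor is below one for any positive `alpha`, `lam` and `m`.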

<hr style="border:2px solid gray"> </hr>

# Ridge Regression
Add the term $\lambda\sum_{j=1}^n\Theta_j^2$ to the cost function, forcing the algorithm to **keep the model weights as small as possible**.

This term should only be added to the cost function during **training**.

## L2 Regularization
Src: https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261

L1 and L2 regularization are named after the L1 and L2 norm of a vector $w$.

L2 norm: $\left\| w \right\|_2 = (|w_1|^2 + |w_2|^2 + ... + |w_n|^2)^{\frac{1}{2}}$

Therefore, Ridge Regression uses a loss function with **squared L2 norm of the weights** (notice the absence of the square root).
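The difference between the L2 norm and its square is easy to see on a small vector. A quick sketch, where the example vector `w` is made up:

```python
import numpy as np

w = np.array([3.0, -4.0])

l2_norm = np.linalg.norm(w)      # square root of the sum of squares
squared_l2_norm = np.sum(w ** 2) # what the Ridge penalty sums, no square root

print(l2_norm, squared_l2_norm)
```

The Ridge penalty is $\lambda$ times the second quantity.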

In [2]:
import numpy as np
from sklearn.linear_model import Ridge
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)
ridge_reg = Ridge(alpha=1, solver="cholesky")
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])

array([[4.57274367]])

### Stochastic Gradient Descent

The `penalty` hyperparameter sets the type of regularization term to use. Specifying `"l2"` indicates that you want SGD to add a regularization term to the cost function equal to half the square of the $l2$ norm of the weight vector: this is simply Ridge Regression.

In [5]:
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(penalty="l2")
sgd_reg.fit(X, y.ravel())
sgd_reg.predict([[1.5]])

array([4.55364118])

# Lasso Regression

Add the term $\lambda\sum_{j=1}^n|\Theta_j|$ to the cost function.

## L1 Regularization
Lasso Regression uses the $l1$ norm of the weight vector instead of half the square of the $l2$ norm.

L1 norm: $\left\| w \right\|_1 = |w_1| + |w_2| + ... + |w_n|$
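For comparison with the L2 case, the L1 norm of a small vector can be computed as follows. The vector `w` is made up for the example.

```python
import numpy as np

w = np.array([3.0, -4.0])

l1_norm = np.linalg.norm(w, ord=1)  # sum of absolute values
l1_by_hand = np.sum(np.abs(w))

print(l1_norm, l1_by_hand)
```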

## Difference from Ridge
Lasso tends to shrink the coefficients of the less important features all the way to zero, so it may **remove some features** altogether. This makes it useful for **feature selection** when we have a huge number of features.

Src: https://www.youtube.com/watch?v=Xm2C_gTAl8c
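The feature-removal behaviour can be observed by fitting both models on data where only some features matter. This is a sketch on synthetic data. The data, the seed and the `alpha` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
# Only the first two features influence the target; the other three are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Exact zeros (Lasso):", int(np.sum(lasso.coef_ == 0.0)))
print("Exact zeros (Ridge):", int(np.sum(ridge.coef_ == 0.0)))
```

Lasso sets the coefficients of the noise features to exactly zero, while Ridge only makes them small.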

In [6]:
from sklearn.linear_model import Lasso
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])

array([5.09245718])

In [7]:
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(penalty="l1")
sgd_reg.fit(X, y.ravel())
sgd_reg.predict([[1.5]])

array([5.10832161])

<hr style="border:2px solid gray"> </hr>

# Early Stopping
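Stop training as soon as the validation error reaches a minimum. The code below does not actually interrupt training: it runs all epochs and keeps a copy of the model with the lowest validation error seen so far.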

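In [None]:
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# X_train_poly_scaled, y_train, X_val_poly_scaled and y_val are assumed to be prepared beforehand.
# max_iter=1 with warm_start=True makes every fit() call run one more epoch;
# tol=None turns off the built-in stopping criterion.
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005)
minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train)  # continues where it left off
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        # deepcopy keeps the learned weights; clone() would copy only the hyperparameters
        best_model = deepcopy(sgd_reg)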
