# Regularisation

The subset selection methods described in the previous section involve using least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model containing all $D$ predictors using a technique that constrains or regularises the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.

## Ridge regression

Recall from {doc}`Linear regression <../02-linear-reg/overview>` that the least squares fitting procedure estimates $\beta_0, \beta_1, \ldots, \beta_D$ by minimizing the residual sum of squares:

$$
\textrm{RSS} = \sum_{i=1}^N \left( y_i - \beta_0 - \sum_{j=1}^D x_{ij} \beta_j \right)^2
$$

Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. In particular, the ridge regression coefficient estimates $\beta^R$ are the values that minimise

$$
\sum_{i=1}^N \left( y_i - \beta_0 - \sum_{j=1}^D x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^D \beta_j^2,
$$

where $\lambda \geq 0$ is a tuning parameter. The term $\lambda \sum_{j=1}^D \beta_j^2$ is called a _regularisation term_ or _shrinkage penalty_, because it regularises/shrinks the coefficient estimates towards zero. The tuning parameter $\lambda$ controls the amount of shrinkage: for large values of $\lambda$, the coefficients are very strongly shrunk towards zero, whereas for small values of $\lambda$, the coefficients are barely shrunk at all. In the limit as $\lambda \rightarrow 0$, ridge regression recovers the least squares estimates.

Note that the shrinkage penalty is applied to $\beta_1, \ldots, \beta_D$ but not to the intercept $\beta_0$, because the penalty sums only the squares of $\beta_j$ for $j \geq 1$. This is a desirable property: the intercept only sets the baseline level of the response, so we usually do not want to regularise it.
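The ridge objective above can be checked numerically. The following is a minimal sketch on synthetic data (the data, the value of `lam`, and all variable names are assumptions for illustration). It computes the ridge estimates from the closed form $(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}$ on centred data, which leaves the intercept unpenalised, and compares them with scikit-learn's `Ridge`.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (an assumption for illustration, not the Credit data)
rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=N)
lam = 10.0

# Centre X and y so that the unpenalised intercept drops out of the problem
X_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - X_mean, y - y_mean

# Closed-form ridge estimate: (Xc^T Xc + lambda * I)^{-1} Xc^T yc
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(D), Xc.T @ yc)
beta0_ridge = y_mean - X_mean @ beta_ridge  # intercept recovered afterwards

# Least squares estimate (lambda = 0) for comparison
beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)

# scikit-learn's Ridge minimises the same penalised RSS with alpha = lambda
sk_ridge = Ridge(alpha=lam).fit(X, y)

print("least squares:", beta_ols)
print("ridge (closed form):", beta_ridge)
print("ridge (scikit-learn):", sk_ridge.coef_)
```

The ridge coefficient vector has a smaller norm than the least squares one, while the intercept is simply whatever value restores the mean of the response.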


Watch the 9-minute video below for a visual explanation of Ridge regression:

```{admonition} Video
<iframe width="700" height="394" src="https://www.youtube.com/embed/Q81RR3yKn30?start=57&end=619" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Explaining Ridge regression, by StatQuest](https://www.youtube.com/embed/Q81RR3yKn30?start=57&end=619)
```

### Ridge regression on `Credit` data

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


%matplotlib inline

In [None]:
credit_url = "https://github.com/pykale/transparentML/raw/main/data/Credit.csv"

credit_df = pd.read_csv(credit_url)
credit_df["Student2"] = credit_df.Student.map({"No": 0, "Yes": 1})
credit_df["Own2"] = credit_df.Own.map({"No": 0, "Yes": 1})
credit_df["Married2"] = credit_df.Married.map({"No": 0, "Yes": 1})
credit_df["South"] = credit_df.Region.map(
    {"South": 1, "North": 0, "West": 0, "East": 0}
)
credit_df["West"] = credit_df.Region.map({"West": 1, "North": 0, "South": 0, "East": 0})
credit_df["East"] = credit_df.Region.map({"East": 1, "North": 0, "South": 0, "West": 0})
# credit_df["Region2"] = credit_df.Region.astype("category")
credit_df.head(3)

In [None]:
X = credit_df.drop(["Own", "Student", "Married", "Region", "Balance"], axis=1).values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
y = credit_df.Balance.values

In [None]:
lambdas = np.logspace(-2, 4, 20)
coef_ridge = []

for lambda_ in lambdas:
    ridge = Ridge(alpha=lambda_)
    ridge.fit(X_scaled, y)
    coef_ridge.append(ridge.coef_)

coef_ridge = np.array(coef_ridge)

In [None]:
plt.plot(lambdas, coef_ridge[:, 0], c="black", ls="-", label="Income")
plt.plot(lambdas, coef_ridge[:, 1], c="tab:red", ls="--", label="Limit")
plt.plot(lambdas, coef_ridge[:, 2], c="tab:blue", ls=":", label="Rating")
plt.plot(lambdas, coef_ridge[:, 6], c="orange", ls=":", label="Student")

plt.legend()
plt.xscale("log")
plt.ylim(-300, 500)
plt.xlabel(r"$\lambda$")
plt.ylabel("Standardised Coefficients")
plt.show()

### Why does ridge regression improve over least squares?

In [None]:
# Hold out 10% of the data to estimate the test error
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.1, random_state=123, shuffle=True
)

mses = []

lambdas = np.logspace(-2, 2, 20)

for lambda_ in lambdas:
    ridge = Ridge(alpha=lambda_)
    ridge.fit(X_train, y_train)
    mses.append(mean_squared_error(y_test, ridge.predict(X_test)))


fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(lambdas, mses, c="m", ls="-", label="$MSE(x)$")
ax.vlines(1, np.min(mses), np.max(mses), colors="k", ls="--", label=r"Optimal $\lambda$")

# add annotation
ax.annotate("Overfit (high variance, low bias)", xy=(0.01, 17000), fontsize=14)
ax.annotate(
    "",
    xy=(0.05, 14000),
    xytext=(0.05, 17000),
    arrowprops=dict(arrowstyle="->", relpos=(0, 0)),
)
ax.annotate(
    "Underfit (high bias, low variance)",
    xy=(70, 20000),
    xytext=(2, 23000),
    arrowprops=dict(arrowstyle="->"),
    fontsize=14,
)
ax.annotate("Optimal", xy=(6, 15000), fontsize=14)
ax.annotate(
    "",
    xy=(10, np.min(mses)),
    xytext=(10, 15000),
    arrowprops=dict(arrowstyle="->", relpos=(0, 0)),
)

# plt.legend()
plt.xscale("log")
plt.yscale("log")
# plt.ylim(-300, 500)
plt.xlabel(r"$\lambda$")
plt.ylabel("Mean Squared Error")
plt.savefig("bias-variance.png", dpi=300)
plt.show()

Ridge regression’s advantage over least squares is that it helps prevent overfitting, which is rooted in the bias-variance trade-off. As illustrated in the figure above, as $\lambda$ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. 
For very small $\lambda$, the fit is close to least squares: the variance is high but the bias is low. As $\lambda$ increases, the shrinkage of the ridge coefficient estimates at first leads to a substantial reduction in the variance of the predictions, at the expense of only a slight increase in bias. 
Recall that the test mean squared error (MSE), plotted in magenta, is closely related to the variance plus the squared bias. 
For values of $\lambda$ up to about 10, the MSE drops as $\lambda$ increases. Beyond this point, the MSE increases considerably.

To better understand this trade-off, consider the expected test MSE at a point $\mathbf{x}$, where the response is generated as $y = f(\mathbf{x}) + \epsilon$ with true function $f$ and error term $\epsilon$. It can be decomposed into three components:

\begin{align}
\begin{aligned}
\text{MSE} = & \mathbb{E}\left[\left(y -\hat{f}(\mathbf{x})\right)^2 \right] \\
           = & \left(\mathbb{E}\left[\hat{f}(\mathbf{x}) \right] - f(\mathbf{x})\right)^2 
           + \mathbb{E}\left[\left(\hat{f}(\mathbf{x}) - \mathbb{E}\left[ \hat{f}(\mathbf{x})\right]\right)^2\right] + \text{Var}(\epsilon) \\
           = & \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}
\end{aligned}
\end{align}

where $\mathbb{E}[\cdot]$ is the expectation operator (taken over training sets and noise) and $\text{Var}(\cdot)$ is the variance operator. In practice, the test MSE is estimated by the average squared error over $N$ test observations, $\frac{1}{N} \sum_{i=1}^N \left(y_i - \hat{f}(x_i)\right)^2$. The first term on the right-hand side is the squared bias, the second term is the variance of the prediction, and the third term is the irreducible error, which is the variance of the error term $\epsilon$ and is independent of the model. The bias-variance trade-off arises because the squared bias and the variance typically move in opposite directions as the flexibility of the model changes: a more flexible model (smaller $\lambda$) has higher variance and lower bias, and vice versa. The MSE is minimised at the value of $\lambda$ where the sum of the squared bias and the variance is smallest, which is not necessarily where the two terms are equal.
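The decomposition can be estimated by simulation. The sketch below is an illustration under stated assumptions: synthetic Gaussian data, a known linear $f$ with no intercept, unit noise variance, and a hypothetical helper `bias_variance`. It refits ridge regression on many training responses that share the same design matrix, then averages the squared bias and the variance of the predictions over a set of test points.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, n_test, n_repeats = 30, 20, 200, 500

beta_true = rng.normal(size=D)         # true coefficients (assumed known here)
X_train = rng.normal(size=(N, D))      # fixed training design
X_test = rng.normal(size=(n_test, D))  # fixed test points
f_test = X_test @ beta_true            # noise-free values f(x) at the test points

# Each row of Y is one simulated training response: same design, fresh noise
Y = X_train @ beta_true + rng.normal(size=(n_repeats, N))


def bias_variance(lam):
    """Average squared bias and variance of ridge predictions over the test points."""
    A = X_train.T @ X_train + lam * np.eye(D)
    betas = np.linalg.solve(A, X_train.T @ Y.T)  # shape (D, n_repeats)
    preds = X_test @ betas                       # shape (n_test, n_repeats)
    bias_sq = np.mean((preds.mean(axis=1) - f_test) ** 2)
    variance = np.mean(preds.var(axis=1))
    return bias_sq, variance


lambdas = [0.01, 1.0, 10.0, 100.0]
results = {lam: bias_variance(lam) for lam in lambdas}
for lam, (b2, var) in results.items():
    print(f"lambda={lam:>6}: bias^2={b2:.3f}, variance={var:.3f}")
```

Under these assumptions the variance should shrink and the squared bias should grow as $\lambda$ increases, so their sum is smallest at some intermediate value.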


In general, in situations where the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance. This means that a small change in the training data can cause a large change in the least squares coefficient estimates. In particular, when the number of variables $D$ is almost as large as the number of observations $N$, the least squares estimates will be extremely variable. And if $D > N$, then the least squares problem does not even have a unique solution, whereas ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance. Hence, ridge regression works best in situations where the least squares estimates have high variance.
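The $D > N$ case can be illustrated directly. In this minimal sketch (synthetic data, no intercept, and all names are assumptions), $\mathbf{X}^\top \mathbf{X}$ is singular, so two different coefficient vectors give identical least squares fits, whereas adding $\lambda \mathbf{I}$ makes the ridge system uniquely solvable.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 10  # more predictors than observations
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

gram = X.T @ X
rank_gram = np.linalg.matrix_rank(gram)  # at most N, so the D x D matrix is singular

# Two different least squares solutions with the same fitted values
beta_min_norm = np.linalg.pinv(X) @ y       # minimum-norm solution
null_vec = np.linalg.svd(X)[2][-1]          # a direction with X @ null_vec ~ 0
beta_other = beta_min_norm + 3.0 * null_vec

# Ridge: gram + lambda * I is invertible for any lambda > 0
lam = 1.0
beta_ridge = np.linalg.solve(gram + lam * np.eye(D), X.T @ y)

print("rank of X^T X:", rank_gram, "out of", D)
print("same fit:", np.allclose(X @ beta_min_norm, X @ beta_other))
print("ridge estimate:", beta_ridge)
```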

Ridge regression also has substantial computational advantages over best subset selection, which requires searching through $2^D$ models. As we discussed previously, even for moderate values of $D$, such a search can be computationally infeasible. In contrast, for any fixed value of $\lambda$, ridge regression only fits a single model, and the model-fitting procedure can be performed quite quickly. In fact, one can show that the computations required to fit a ridge regression model, simultaneously for all values of $\lambda$, are almost identical to those for fitting a model using least squares.
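One standard way to see why all values of $\lambda$ cost little more than a single fit is the singular value decomposition $\mathbf{X} = \mathbf{U} \, \text{diag}(d) \, \mathbf{V}^\top$, for which the ridge estimate is $\mathbf{V} \, \text{diag}\left(d_j / (d_j^2 + \lambda)\right) \mathbf{U}^\top \mathbf{y}$. The sketch below uses synthetic centred data and a hypothetical helper `ridge_from_svd`. It is an illustration of the idea, not necessarily what any particular library does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 40, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + rng.normal(size=N)
Xc, yc = X - X.mean(axis=0), y - y.mean()  # centre to drop the intercept

# One thin SVD, computed once and reused for every lambda
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
Uty = U.T @ yc


def ridge_from_svd(lam):
    """Ridge estimate beta(lambda) = V diag(d / (d^2 + lambda)) U^T y."""
    return Vt.T @ (d / (d**2 + lam) * Uty)


lambdas = np.logspace(-2, 4, 20)
path = np.array([ridge_from_svd(lam) for lam in lambdas])  # shape (20, D)
print(path[[0, -1]])  # estimates at the smallest and largest lambda
```

After the decomposition, each additional $\lambda$ only requires a rescaling and a small matrix-vector product.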

## Lasso

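Ridge regression does have one obvious disadvantage. Unlike best subset, forward stepwise, and backward stepwise selection, which will generally select models that involve just a subset of the variables, ridge regression will include all $D$ predictors in the final model. The penalty $\lambda \sum_{j=1}^D \beta_j^2$ will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless $\lambda = \infty$). This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number of variables $D$ is quite large.

The lasso is an alternative to ridge regression that overcomes this disadvantage. It replaces the squared penalty $\beta_j^2$ with the absolute value $|\beta_j|$, which forces some of the coefficient estimates to be exactly zero when $\lambda$ is sufficiently large. The lasso coefficient estimates $\beta^L$ are the values that minimise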

$$
\sum_{i=1}^N \left( y_i - \beta_0 - \sum_{j=1}^D x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^D |\beta_j| = \textrm{RSS} + \lambda \sum_{j=1}^D |\beta_j|.
$$
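The effect of the absolute-value penalty can be seen on a small synthetic example (the data and the value of `lam` are assumptions for illustration). Note that scikit-learn's `Lasso` minimises $\frac{1}{2N}\textrm{RSS} + \alpha \sum_j |\beta_j|$, so the $\lambda$ in the objective above corresponds to $\alpha = \lambda / (2N)$, whereas `Ridge` uses $\alpha = \lambda$ directly.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data in which only the first two predictors matter
rng = np.random.default_rng(0)
N, D = 100, 10
X = rng.normal(size=(N, D))
beta_true = np.zeros(D)
beta_true[:2] = [3.0, -2.0]
y = X @ beta_true + rng.normal(size=N)

lam = 100.0
lasso = Lasso(alpha=lam / (2 * N)).fit(X, y)  # rescale lambda to sklearn's alpha
ridge = Ridge(alpha=lam).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("lasso coefficients:", np.round(lasso.coef_, 3))
print("ridge coefficients:", np.round(ridge.coef_, 3))
print("exact zeros (lasso, ridge):", n_zero_lasso, n_zero_ridge)
```

The ridge coefficients are all shrunk but remain non-zero, while the lasso removes some predictors from the model entirely.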

Watch the 8-minute video below for a visual explanation of Lasso:

```{admonition} Video
<iframe width="700" height="394" src="https://www.youtube.com/embed/NGf0voTMlcs?start=15" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Explaining Lasso, by StatQuest](https://www.youtube.com/embed/NGf0voTMlcs?start=15)
```



In [None]:
lambdas = np.logspace(1, 3, 20)

coef_lasso = []
for lambda_ in lambdas:
    lasso = Lasso(alpha=lambda_)
    lasso.fit(X_scaled, y)
    coef_lasso.append(lasso.coef_)

coef_lasso = np.array(coef_lasso)

In [None]:
plt.plot(lambdas, coef_lasso[:, 0], c="black", ls="-", label="Income")
plt.plot(lambdas, coef_lasso[:, 1], c="tab:red", ls="--", label="Limit")
plt.plot(lambdas, coef_lasso[:, 2], c="tab:blue", ls=":", label="Rating")
plt.plot(lambdas, coef_lasso[:, 6], c="orange", ls=":", label="Student")

plt.legend()
plt.xscale("log")
plt.ylim(-300, 500)
plt.xlabel(r"$\lambda$")
plt.ylabel("Standardised Coefficients")
plt.show()

### Another formulation of ridge regression and lasso

One can show that the lasso and ridge regression coefficient estimates solve the problems

$$
\min_{\boldsymbol{\beta} \in \mathbb{R}^D} \left\{ \sum_{i=1}^N \left( y_i - \beta_0 - \sum_{j=1}^D x_{ij} \beta_j \right)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^D |\beta_j| \leq t,
$$

and

$$
\min_{\boldsymbol{\beta} \in \mathbb{R}^D} \left\{ \sum_{i=1}^N \left( y_i - \beta_0 - \sum_{j=1}^D x_{ij} \beta_j \right)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^D \beta_j^2 \leq t,
$$
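respectively. In other words, for every value of $\lambda$, there is some $t$ such that the penalised problem and the corresponding constrained problem above give the same lasso (or ridge regression) coefficient estimates. A larger $\lambda$ corresponds to a smaller budget $t$.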
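The correspondence between $\lambda$ and $t$ can be made concrete by fitting the penalised problems and reading off the budget that each solution uses, $t = \sum_j |\beta_j|$ for the lasso and $t = \sum_j \beta_j^2$ for ridge regression. This is a sketch on synthetic data (all names and values are assumptions), using the rescaling $\alpha = \lambda / (2N)$ that scikit-learn's `Lasso` objective requires.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
N = 100
X = rng.normal(size=(N, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.0]) + rng.normal(size=N)

lambdas = np.logspace(-1, 3, 9)

# Implied budget t for each lambda: the size of the penalised solution
t_lasso = [
    np.sum(np.abs(Lasso(alpha=lam / (2 * N)).fit(X, y).coef_)) for lam in lambdas
]
t_ridge = [np.sum(Ridge(alpha=lam).fit(X, y).coef_ ** 2) for lam in lambdas]

for lam, tl, tr in zip(lambdas, t_lasso, t_ridge):
    print(f"lambda={lam:9.2f}  t_lasso={tl:6.3f}  t_ridge={tr:7.3f}")
```

Solving the constrained problem with one of these budgets $t$ would return the same coefficients as the penalised problem with the matching $\lambda$.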


### The Variable Selection Property of the Lasso

```{figure} https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Regularization.jpg/800px-Regularization.jpg?20190518214104
---
height: 300px
name: l2-l1
---
Contours of the error and constraint functions for the lasso
(left) and ridge regression (right). source: https://commons.wikimedia.org/wiki/File:Regularization.jpg
```


Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero? The figure above illustrates the situation. The least squares solution is marked as $\hat{\beta}$, while the blue diamond and circle represent the lasso and ridge regression constraints, respectively. If $t$ is sufficiently large, then the constraint regions will contain $\hat{\beta}$, and so the ridge regression and lasso estimates will be the same as the least squares estimates. (Such a large value of $t$ corresponds to $\lambda = 0$ in the penalised formulations.) However, in the figure above the least squares estimates lie outside of the diamond and the circle, and so the least squares estimates are not the same as the lasso and ridge regression estimates.

Each of the ellipses centred around $\hat{\beta}$ represents a contour: this means that all of the points on a particular ellipse have the same RSS value. As the ellipses expand away from the least squares coefficient estimates, the RSS increases. The constrained formulations above indicate that the lasso and ridge regression coefficient estimates are given by the first point at which an ellipse contacts the constraint region. Since ridge regression has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will generally all be non-zero. However, the lasso constraint has corners at each of the axes, and so the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients will equal zero. In higher dimensions, many of the coefficient estimates may equal zero simultaneously. In the figure above, the intersection occurs at $\beta_1 = 0$, and so the resulting model will only include $\beta_2$.
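The same contrast shows up algebraically in a deliberately simplified setting, assumed here for illustration: a single coefficient $\beta$ fitted to a single value $z$. Minimising $(z - \beta)^2 + \lambda \beta^2$ gives proportional shrinkage, $\beta = z / (1 + \lambda)$, while minimising $(z - \beta)^2 + \lambda |\beta|$ gives soft-thresholding, $\beta = \text{sign}(z) \max(|z| - \lambda/2, 0)$. The function names below are hypothetical.

```python
import numpy as np


def ridge_1d(z, lam):
    """Minimiser of (z - b)^2 + lam * b^2: shrinks proportionally, never to zero."""
    return z / (1.0 + lam)


def lasso_1d(z, lam):
    """Minimiser of (z - b)^2 + lam * |b|: soft-thresholding at lam / 2."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)


z = np.array([-3.0, -0.4, 0.2, 1.0, 2.5])
lam = 1.0

print("ridge:", ridge_1d(z, lam))
print("lasso:", lasso_1d(z, lam))
```

Values of $z$ with $|z| \leq \lambda / 2$ are mapped exactly to zero by the lasso rule, whereas the ridge rule only scales them down.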

## Comparing the Lasso and Ridge Regression

It is clear that the lasso has a major advantage over ridge regression, in that it produces simpler and more interpretable models that involve only a subset of the predictors. However, which method leads to better prediction accuracy? Consider first simulated data in which all 45 predictors are related to the response, that is, none of the true coefficients $\beta_1, \ldots, \beta_{45}$ equals zero. On such data the lasso behaves qualitatively like ridge regression: as $\lambda$ increases, the variance decreases and the bias increases. The minimum MSE of ridge regression, however, is slightly smaller than that of the lasso.

This is not surprising, because the lasso implicitly assumes that a number of the coefficients truly equal zero, which is not the case in this setting. Now consider a similar situation in which the response is a function of only 2 out of the 45 predictors. Here the lasso tends to outperform ridge regression in terms of bias, variance, and MSE. 

These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other. In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size. However, the number of predictors that are related to the response is never known a priori for real data sets. A technique such as cross-validation can be used in order to determine which approach is better on a particular data set.
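A cross-validated comparison might look like the following sketch. It uses synthetic sparse data and an arbitrary grid of penalties (both are assumptions for illustration). `RidgeCV` and `LassoCV` choose the penalty on each training fold, and an outer 5-fold `cross_val_score` estimates the test MSE of each approach.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic data where only 2 of 20 predictors are related to the response
rng = np.random.default_rng(0)
N, D = 100, 20
X = rng.normal(size=(N, D))
beta_true = np.zeros(D)
beta_true[:2] = [3.0, -2.0]
y = X @ beta_true + rng.normal(size=N)

alphas = np.logspace(-3, 2, 30)  # candidate penalties (sklearn's alpha scale)
models = {
    "ridge": RidgeCV(alphas=alphas),
    "lasso": LassoCV(alphas=alphas, cv=5),
}

cv_mse = {}
for name, model in models.items():
    # Outer cross-validation: the penalty is re-selected inside each training fold
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()

print(cv_mse)
```

Whichever method has the lower cross-validated MSE is the better choice for that data set. With a different data-generating process the ranking may change.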

As with ridge regression, when the least squares estimates have excessively high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions. Unlike ridge regression, the lasso performs variable selection, and hence results in models that are easier to interpret. 

There are very efficient algorithms for fitting both ridge and lasso models; in both cases the entire coefficient paths can be computed with about the same amount of work as a single least squares fit. We will explore this further in the lab at the end of this chapter.

## Exercises


