### Linear Model Selection and Regularization

In the regression setting, the standard linear model

$$
Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \tag{6.1}
$$

is commonly used to describe the relationship between a response $Y$ and a set of variables $X_1, X_2, \ldots, X_p$. We have seen in Chapter 3 that one typically fits this model using least squares.

In the chapters that follow, we consider some approaches for extending the linear model framework. In Chapter 7 we generalize (6.1) in order to accommodate non-linear, but still additive, relationships, while in Chapters 8 and 10 we consider even more general non-linear models. However, the linear model has distinct advantages in terms of inference and, on real-world problems, is often surprisingly competitive in relation to non-linear methods. Hence, before moving to the non-linear world, we discuss in this chapter some ways in which the simple linear model can be improved, by replacing plain least squares fitting with some alternative fitting procedures.

Why might we want to use another fitting procedure instead of least squares? As we will see, alternative fitting procedures can yield better prediction accuracy and model interpretability.

• Prediction Accuracy: Provided that the true relationship between the response and the predictors is approximately linear, the least squares estimates will have low bias. If $n \gg p$ (that is, if $n$, the number of observations, is much larger than $p$, the number of variables), then the least squares estimates tend to also have low variance, and hence will perform well on test observations. However, if $n$ is not much larger than $p$, then there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on future observations not used in model training. And if $p > n$, then there is no longer a unique least squares coefficient estimate: there are infinitely many solutions. Each of these least squares solutions gives zero error on the training data, but typically very poor test set performance due to extremely high variance. By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias. This can lead to substantial improvements in the accuracy with which we can predict the response for observations not used in model training.

• Model Interpretability: It is often the case that some or many of the variables used in a multiple regression model are in fact not associated with the response. Including such irrelevant variables leads to unnecessary complexity in the resulting model. By removing these variables (that is, by setting the corresponding coefficient estimates to zero) we can obtain a model that is more easily interpreted. Now least squares is extremely unlikely to yield any coefficient estimates that are exactly zero. In this chapter, we see some approaches for automatically performing feature selection or variable selection, that is, for excluding irrelevant variables from a multiple regression model.

There are many alternatives, both classical and modern, to using least
squares to fit (6.1). In this chapter, we discuss three important classes of
methods.

• Subset Selection. This approach involves identifying a subset of the p
predictors that we believe to be related to the response. We then fit
a model using least squares on the reduced set of variables.

• Shrinkage. This approach involves fitting a model involving all $p$ predictors. However, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. Hence, shrinkage methods can also perform variable selection.

• Dimension Reduction. This approach involves projecting the $p$ predictors into an $M$-dimensional subspace, where $M < p$. This is achieved by computing $M$ different linear combinations, or projections, of the variables. Then these $M$ projections are used as predictors to fit a linear regression model by least squares.

#### Subset Selection

##### Best Subset Selection

To perform *best subset selection*, we fit a separate least squares regression for each possible combination of the $p$ predictors. That is, we fit all $p$ models that contain exactly one predictor, all $ \binom{p}{2} = p(p-1)/2 $ models that contain exactly two predictors, and so forth. We then look at all of the resulting models, with the goal of identifying the one that is best.

The problem of selecting the best model from among the $ 2^p $ possibilities considered by best subset selection is then usually broken up into two stages, as described in Algorithm 6.1.

##### **Algorithm 6.1** *Best subset selection*

1. Let $ M_0 $ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.

2. For $ k = 1, 2, \ldots, p $:

   (a) Fit all $ \binom{p}{k} $ models that contain exactly k predictors.
   
   (b) Pick the best among these models, and call it $ M_k $. Here “best” is defined as having the smallest RSS, or equivalently the largest $ R^2 $.

3. Select a single best model from among $ M_0, \ldots, M_p $ using the prediction error on a validation set, $ C_p $ (AIC), BIC, or adjusted $ R^2 $.
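Steps 1 and 2 of Algorithm 6.1 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes NumPy is available, the function names `subset_rss` and `best_subset` are our own, and the data are synthetic. Step 3 (choosing among $ M_0, \ldots, M_p $) is deliberately left out, since it requires one of the criteria discussed later.

```python
from itertools import combinations

import numpy as np


def subset_rss(X, y, cols):
    """Training RSS of a least squares fit on an intercept plus the columns in `cols`."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


def best_subset(X, y):
    """Steps 1 and 2 of Algorithm 6.1: map each size k to (columns of M_k, training RSS)."""
    p = X.shape[1]
    best = {0: ((), subset_rss(X, y, ()))}  # M_0: intercept only, predicts the sample mean
    for k in range(1, p + 1):
        candidates = ((cols, subset_rss(X, y, cols)) for cols in combinations(range(p), k))
        best[k] = min(candidates, key=lambda pair: pair[1])  # smallest RSS among C(p, k) fits
    return best


# Synthetic data (hypothetical): only columns 0 and 2 influence the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=100)

models = best_subset(X, y)
for k, (cols, rss) in models.items():
    print(k, cols, round(rss, 1))
```

The printed training RSS can only go down as $k$ grows, which is exactly why it cannot be used on its own to pick the final model size.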

In Algorithm 6.1, Step 2 identifies the best model (on the training data) for each subset size, in order to reduce the problem from one of $2^p$ possible models to one of $p + 1$ possible models. In Figure 6.1, these models form the lower frontier depicted in red.

Now in order to select a single best model, we must simply choose among these $p + 1$ options. This task must be performed with care, because the RSS of these $p + 1$ models decreases monotonically, and the $R^2$ increases monotonically, as the number of features included in the models increases. Therefore, if we use these statistics to select the best model, then we will always end up with a model involving all of the variables. The problem is that a low RSS or a high $R^2$ indicates a model with a low training error, whereas we wish to choose a model that has a low test error. (As shown in Chapter 2 in Figures 2.9–2.11, training error tends to be quite a bit smaller than test error, and a low training error by no means guarantees a low test error.) Therefore, in Step 3, we use the error on a validation set, $C_p$, BIC, or adjusted $R^2$ in order to select among $M_0, M_1, \ldots, M_p$. If cross-validation is used to select the best model, then Step 2 is repeated on each training fold, and the validation errors are averaged to select the best value of $k$. Then the model $M_k$ fit on the full training set is delivered for the chosen $k$. These approaches are discussed in Section 6.1.3.

An application of best subset selection is shown in Figure 6.1. Each plotted point corresponds to a least squares regression model fit using a different subset of the 10 predictors in the Credit data set, discussed in Chapter 3. Here the variable `region` is a three-level qualitative variable, and so is represented by two dummy variables, which are selected separately in this case. Hence, there are a total of 11 possible variables which can be included in the model. We have plotted the RSS and $R^2$ statistics for each model, as a function of the number of variables. The red curves connect the best models for each model size, according to RSS or $R^2$. The figure shows that, as expected, these quantities improve as the number of variables increases; however, from the three-variable model on, there is little improvement in RSS and $R^2$ as a result of including additional predictors.

Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression. In the case of logistic regression, instead of ordering models by RSS in Step 2 of Algorithm 6.1, we instead use the *deviance*, a measure that plays the role of RSS for a broader class of models. The deviance is negative two times the maximized log-likelihood; the smaller the deviance, the better the fit.

While best subset selection is a simple and conceptually appealing approach, it suffers from computational limitations. The number of possible models that must be considered grows rapidly as $p$ increases. In general, there are $2^p$ models that involve subsets of $p$ predictors. So if $p = 10$, then there are approximately 1,000 possible models to be considered, and if $p = 20$, then there are over one million possibilities! Consequently, best subset selection becomes computationally infeasible for values of $p$ greater than around 40, even with extremely fast modern computers. There are computational shortcuts (so-called branch-and-bound techniques) for eliminating some choices, but these have their limitations as $p$ gets large. They also only work for least squares linear regression. We present computationally efficient alternatives to best subset selection next.


##### **Algorithm 6.2** *Forward stepwise selection*

1. Let $ M_0 $ denote the null model, which contains no predictors.

2. For $ k = 0, \ldots, p - 1 $:

   - (a) Consider all $ p - k $ models that augment the predictors in $ M_k $ with one additional predictor.
   
   - (b) Choose the best among these $ p - k $ models, and call it $ M_{k+1} $. Here best is defined as having the smallest RSS or highest $ R^2 $.

3. Select a single best model from among $ M_0, \ldots, M_p $ using the prediction error on a validation set, $ C_p $ (AIC), BIC, or adjusted $ R^2 $, or using cross-validation.
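Steps 1 and 2 of Algorithm 6.2 can be sketched as follows. This is an illustrative sketch under stated assumptions: NumPy is available, the helper names `fit_rss` and `forward_stepwise` are hypothetical, and the data are synthetic. It returns the nested sequence $ M_0, \ldots, M_p $; choosing among them (Step 3) is a separate task.

```python
import numpy as np


def fit_rss(X, y, cols):
    """Training RSS of a least squares fit on an intercept plus the columns in `cols`."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


def forward_stepwise(X, y):
    """Return [(columns of M_0, RSS), ..., (columns of M_p, RSS)] built greedily."""
    p = X.shape[1]
    selected = []
    path = [((), fit_rss(X, y, []))]  # M_0: the null model
    for _ in range(p):
        remaining = [j for j in range(p) if j not in selected]
        # Step 2(b): among the p - k one-variable augmentations, keep the smallest RSS.
        best_j = min(remaining, key=lambda j: fit_rss(X, y, selected + [j]))
        selected.append(best_j)
        path.append((tuple(selected), fit_rss(X, y, selected)))
    return path


# Synthetic data (hypothetical): only columns 0 and 2 influence the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=100)

path = forward_stepwise(X, y)
for cols, rss in path:
    print(cols, round(rss, 1))
```

Because each model extends the previous one, the variables appear in the order in which they were added, and a variable can never leave once it has entered.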

##### Stepwise Selection

For computational reasons, best subset selection cannot be applied with
very large p. Best subset selection may also suffer from statistical problems
when p is large. The larger the search space, the higher the chance of finding
models that look good on the training data, even though they might not
have any predictive power on future data. Thus an enormous search space
can lead to overfitting and high variance of the coefficient estimates.
For both of these reasons, stepwise methods, which explore a far more
restricted set of models, are attractive alternatives to best subset selection.

###### Forward Stepwise Selection

*Forward stepwise selection* is a computationally efficient alternative to best subset selection. While the best subset selection procedure considers all $2^p$ possible models containing subsets of the $p$ predictors, forward stepwise considers a much smaller set of models. Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model. More formally, the forward stepwise selection procedure is given in Algorithm 6.2.



Unlike best subset selection, which involved fitting $ 2^p $ models, forward stepwise selection involves fitting one null model, along with $ p - k $ models in the $ k $-th iteration, for $ k = 0, \ldots, p - 1 $. This amounts to a total of $ 1 + \sum_{k=0}^{p-1} (p - k) = 1 + p(p + 1)/2 $ models. This is a substantial difference: when $ p = 20 $, best subset selection requires fitting 1,048,576 models, whereas forward stepwise selection requires fitting only 211 models.
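The two model counts are easy to check numerically. The short sketch below uses only the Python standard library; the function names are our own.

```python
from math import comb


def n_models_best_subset(p):
    """Sum of C(p, k) over k = 0..p, which equals 2**p."""
    return sum(comb(p, k) for k in range(p + 1))


def n_models_forward_stepwise(p):
    """One null model plus (p - k) fits at iteration k = 0..p-1."""
    return 1 + sum(p - k for k in range(p))


for p in (10, 20, 40):
    print(p, n_models_best_subset(p), n_models_forward_stepwise(p))
```

For $ p = 20 $ this gives 1,048,576 fits for best subset selection against 211 for forward stepwise selection.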

In Step 2(b) of Algorithm 6.2, we must identify the best model from among those $ p - k $ that augment $ M_k $ with one additional predictor. We can do this by simply choosing the model with the lowest RSS or the highest $ R^2 $. However, in Step 3, we must identify the best model among a set of models with different numbers of variables. This is more challenging, and is discussed in Section 6.1.3.

Forward stepwise selection's computational advantage over best subset selection is clear. Though forward stepwise tends to do well in practice, it is fundamentally a greedy algorithm, and it is not guaranteed to find the best possible model out of all $ 2^p $ models containing subsets of the $ p $ predictors. For instance, suppose that in a given data set with $ p = 3 $ predictors, the best possible one-variable model contains $ X_1 $, and the best possible two-variable model instead contains $ X_2 $ and $ X_3 $. Then forward stepwise selection will fail to select the best possible two-variable model, because $ M_1 $ will contain $ X_1 $, so $ M_2 $ must also contain $ X_1 $ together with one additional variable.

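The table below, which shows the first four selected models for best subset and forward stepwise selection on the Credit data set, illustrates this phenomenon. Both methods choose `rating` for the best one-variable model and then add `income` and `student` for the two- and three-variable models. However, best subset selection replaces `rating` with `cards` in the four-variable model, while forward stepwise selection must keep `rating` in its four-variable model.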

Forward stepwise selection can be applied even in the high-dimensional setting where $ n < p $. However, in this case it is possible to construct only the submodels $ M_0, \ldots, M_{n-1} $, since each submodel is fit using least squares, which will not yield a unique solution if $ p \geq n $.



| # Variables | Best subset                      | Forward stepwise                  |
|-------------|----------------------------------|-----------------------------------|
| One         | `rating`                           | `rating`                            |
| Two         | `rating`, `income`                   | `rating`, `income`                    |
| Three       | `rating`, `income`, `student`              | `rating`, `income`, `student`               |
| Four        | `cards`, `income`, `student`, `limit` | `rating`, `income`, `student`, `limit` |

The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical but the fourth models differ.

###### Backward Stepwise Selection

Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection. However, unlike forward stepwise selection, it begins with the full least squares model containing all $p$ predictors, and then iteratively removes the least useful predictor, one-at-a-time. Details are given in Algorithm 6.3.


Like forward stepwise selection, the backward selection approach searches through only $1 + p(p+1)/2$ models, and so can be applied in settings where $p$ is too large to apply best subset selection. Also like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the $p$ predictors.

Backward selection requires that the number of samples $n$ is larger than the number of variables $p$ (so that the full model can be fit). In contrast, forward stepwise can be used even when $n < p$, and so is the only viable subset method when $p$ is very large.
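The backward procedure described above can be sketched in the same style as the forward one. This is a minimal illustration, assuming NumPy; the names `fit_rss` and `backward_stepwise` are hypothetical, the data are synthetic, and "least useful" is taken to mean the predictor whose removal increases the training RSS the least.

```python
import numpy as np


def fit_rss(X, y, cols):
    """Training RSS of a least squares fit on an intercept plus the columns in `cols`."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


def backward_stepwise(X, y):
    """Return [(columns of M_p, RSS), ..., (columns of M_0, RSS)], largest model first."""
    n, p = X.shape
    if n <= p:
        raise ValueError("backward selection needs n > p so the full model can be fit")
    selected = list(range(p))
    path = [(tuple(selected), fit_rss(X, y, selected))]  # M_p: the full model
    while selected:
        # Drop the predictor whose removal leaves the smallest RSS.
        drop = min(selected, key=lambda j: fit_rss(X, y, [c for c in selected if c != j]))
        selected.remove(drop)
        path.append((tuple(selected), fit_rss(X, y, selected)))
    return path


# Synthetic data (hypothetical): only columns 0 and 2 influence the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=100)

path = backward_stepwise(X, y)
for cols, rss in path:
    print(cols, round(rss, 1))
```

The explicit `n <= p` check mirrors the requirement stated above: without more observations than variables, the starting model has no unique least squares fit.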

###### Hybrid Approaches

The best subset, forward stepwise, and backward stepwise selection approaches generally give similar but not identical models. As another alternative, hybrid versions of forward and backward stepwise selection are available, in which variables are added to the model sequentially, in analogy to forward selection. However, after adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit. Such an approach attempts to more closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.

##### Choosing the Optimal Model

Best subset selection, forward selection, and backward selection result in the creation of a set of models, each of which contains a subset of the $p$ predictors. To apply these methods, we need a way to determine which of these models is best. As we discussed in Section 6.1.1, the model containing all of the predictors will always have the smallest RSS and the largest $R^2$, since these quantities are related to the training error. Instead, we wish to choose a model with a low test error. As is evident here, and as we show in Chapter 2, the training error can be a poor estimate of the test error. Therefore, RSS and $R^2$ are not suitable for selecting the best model among a collection of models with different numbers of predictors.

In order to select the best model with respect to test error, we need to estimate this test error. There are two common approaches:

1. We can indirectly estimate test error by making an adjustment to the
training error to account for the bias due to overfitting.


2. We can directly estimate the test error, using either a validation set
approach or a cross-validation approach, as discussed in Chapter 5.

We consider both of these approaches below.

###### $C_p$, AIC, BIC, and Adjusted $R^2$

We show in Chapter 2 that the training set MSE is generally an underestimate of the test MSE. (Recall that MSE = RSS/$n$.) This is because when we fit a model to the training data using least squares, we specifically estimate the regression coefficients such that the training RSS (but not the test RSS) is as small as possible. In particular, the training error will decrease as more variables are included in the model, but the test error may not. Therefore, training set RSS and training set $ R^2 $ cannot be used to select from among a set of models with different numbers of variables.

However, a number of techniques for adjusting the training error for the model size are available. These approaches can be used to select among a set of models with different numbers of variables. We now consider four such approaches:

- **$ C_p $**,
- **Akaike information criterion (AIC)**,
- **Bayesian information criterion (BIC)**, and
- **Adjusted $ R^2 $**.

Figure 6.2 displays $ C_p $, BIC, and adjusted $ R^2 $ for the best model of each size produced by best subset selection on the Credit data set.

For a fitted least squares model containing $ d $ predictors, the $ C_p $ estimate of test MSE is computed using the equation:

$$
C_p = \frac{1}{n} (RSS + 2d\hat{\sigma}^2) \tag{6.2}
$$

where $ \hat{\sigma}^2 $ is an estimate of the variance of the error $ \epsilon $ associated with each response measurement in (6.1). Typically $ \hat{\sigma}^2 $ is estimated using the full model containing all predictors. Essentially, the $ C_p $ statistic adds a penalty of $ 2d\hat{\sigma}^2 $ to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error. Clearly, the penalty increases as the number of predictors in the model increases; this is intended to adjust for the corresponding decrease in training RSS. Though it is beyond the scope of this book, one can show that if $ \hat{\sigma}^2 $ is an unbiased estimate of $ \sigma^2 $ in (6.2), then $ C_p $ is an unbiased estimate of test MSE. As a consequence, the $ C_p $ statistic tends to take on a small value for models with a low test error, so when determining which of a set of models is best, we choose the model with the lowest $ C_p $ value. In Figure 6.2, $ C_p $ selects the six-variable model containing the predictors `income`, `limit`, `rating`, `cards`, `age`, and `student`.

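The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of the model (6.1) with Gaussian errors, maximum likelihood and least squares are the same thing, and AIC is given by
$$
AIC = \frac{1}{n} (RSS + 2d\hat{\sigma}^2)
$$
where, for simplicity, we have omitted irrelevant constants. Hence for least squares models, $ C_p $ and AIC are proportional to each other, and so only $ C_p $ is displayed in Figure 6.2.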

BIC is derived from a Bayesian point of view, but ends up looking similar to $ C_p $ and AIC as well. For the least squares model with $ d $ predictors, the BIC is, up to irrelevant constants, given by
$$
BIC = \frac{1}{n} (RSS + \log(n)d\hat{\sigma}^2)
$$
Like $ C_p $, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value. Notice that BIC replaces the $ 2d\hat{\sigma}^2 $ used by $ C_p $ with a $ \log(n)d\hat{\sigma}^2 $ term, where $ n $ is the number of observations. Since $ \log n > 2 $ for any $ n > 7 $, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than $ C_p $. In Figure 6.2, we see that this is indeed the case for the Credit data set; BIC chooses a model that contains only the four predictors `income`, `limit`, `cards`, and `student`. In this case the curves are very flat and so there does not appear to be much difference in accuracy between the four-variable and six-variable models.

The adjusted $ R^2 $ statistic is another popular approach for selecting among a set of models that contain different numbers of variables. Recall from Chapter 3 that the usual $ R^2 $ is defined as $ 1 - RSS/TSS $, where $ TSS = \sum (y_i - \bar{y})^2 $ is the total sum of squares for the response. Since RSS always decreases as more variables are added to the model, the $ R^2 $ always increases as more variables are added. For a least squares model with $ d $ variables, the adjusted $ R^2 $ statistic is calculated as

$$
\text{Adjusted } R^2 = 1 - \frac{RSS/(n - d - 1)}{TSS/(n - 1)}
$$

Unlike $ C_p $, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted $ R^2 $ indicates a model with a small test error. Maximizing the adjusted $ R^2 $ is equivalent to minimizing $ RSS/(n - d - 1) $. While RSS always decreases as the number of variables in the model increases, $ RSS/(n - d - 1) $ may increase or decrease, due to the presence of $ d $ in the denominator. The intuition is that once all of the correct variables have been included in the model, adding further noise variables leads to only a very small decrease in RSS, which is outweighed by the increase in $ d $.

$ C_p $, AIC, and BIC all have rigorous theoretical justifications, which rely on asymptotic arguments (scenarios where the sample size $ n $ is very large). Despite its popularity, and even though it is quite intuitive, the adjusted $ R^2 $ is not as well motivated in statistical theory as AIC, BIC, and $ C_p $.
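To make the formulas concrete, the sketch below evaluates $ C_p $, AIC, BIC, and adjusted $ R^2 $ for a small family of models. It uses only the standard library. All numbers (the RSS values, $ n $, TSS, and $ \hat{\sigma}^2 $) are made up for illustration and do not come from the Credit data; the function names are our own.

```python
from math import log


def cp(rss, n, d, sigma2):
    """C_p = (RSS + 2 d sigma^2) / n."""
    return (rss + 2 * d * sigma2) / n


def aic(rss, n, d, sigma2):
    """AIC for least squares, with irrelevant constants omitted (same form as C_p)."""
    return (rss + 2 * d * sigma2) / n


def bic(rss, n, d, sigma2):
    """BIC = (RSS + log(n) d sigma^2) / n."""
    return (rss + log(n) * d * sigma2) / n


def adjusted_r2(rss, tss, n, d):
    """Adjusted R^2 = 1 - [RSS / (n - d - 1)] / [TSS / (n - 1)]."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))


# Hypothetical inputs: training RSS of the best d-variable model, for d = 1..4.
n, tss, sigma2 = 100, 1000.0, 4.0
rss_by_d = {1: 600.0, 2: 420.0, 3: 405.0, 4: 395.0}

best_cp = min(rss_by_d, key=lambda d: cp(rss_by_d[d], n, d, sigma2))
best_bic = min(rss_by_d, key=lambda d: bic(rss_by_d[d], n, d, sigma2))
best_adj = max(rss_by_d, key=lambda d: adjusted_r2(rss_by_d[d], tss, n, d))
print(best_cp, best_bic, best_adj)
```

With these illustrative numbers, $ C_p $ and adjusted $ R^2 $ both prefer the four-variable model, while the heavier $ \log(n) $ penalty leads BIC to the two-variable model.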

###### Validation and Cross-Validation


As an alternative to the approaches just discussed, we can directly estimate the test error using the validation set and cross-validation methods discussed in Chapter 5. We can compute the validation set error or the cross-validation error for each model under consideration, and then select the model for which the resulting estimated test error is smallest. This procedure has an advantage relative to AIC, BIC, $C_p$, and adjusted $R^2$, in that it provides a direct estimate of the test error, and makes fewer assumptions about the true underlying model. It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance $\sigma^2$. Note that when cross-validation is used, the sequence of models $M_k$ in Algorithms 6.1–6.3 is determined separately for each training fold, and the validation errors are averaged over all folds for each model size $k$. This means, for example with best-subset regression, that $M_k$, the best subset of size $k$, can differ across the folds. Once the best size $k$ is chosen, we find the best model of that size on the full data set.

In the past, performing cross-validation was computationally prohibitive for many problems with large $p$ and/or large $n$, and so AIC, BIC, $C_p$, and adjusted $R^2$ were more attractive approaches for choosing among a set of models. However, nowadays with fast computers, the computations required to perform cross-validation are hardly ever an issue. Thus, cross-validation is a very attractive approach for selecting from among a number of models under consideration.

Figure 6.3 displays, as a function of $d$, the BIC, validation set errors, and cross-validation errors on the Credit data, for the best $d$-variable model. The validation errors were calculated by randomly selecting three-quarters of the observations as the training set, and the remainder as the validation set. The cross-validation errors were computed using $k = 10$ folds. In this case, the validation and cross-validation methods both result in a six-variable model. However, all three approaches suggest that the four-, five-, and six-variable models are roughly equivalent in terms of their test errors.

In fact, the estimated test error curves displayed in the center and right-hand panels of Figure 6.3 are quite flat. While a three-variable model clearly has lower estimated test error than a two-variable model, the estimated test errors of the 3- to 11-variable models are quite similar. Furthermore, if we repeated the validation set approach using a different split of the data into a training set and a validation set, or if we repeated cross-validation using a different set of cross-validation folds, then the precise model with the lowest estimated test error would surely change. In this setting, we can select a model using the one-standard-error rule. We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve. The rationale here is that if a set of models appear to be more or less equally good, then we might as well choose the simplest model, that is, the model with the smallest number of predictors. In this case, applying the one-standard-error rule to the validation set or cross-validation approach leads to selection of the three-variable model.
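The one-standard-error rule can be sketched as follows, using only the standard library. The per-fold errors are invented for illustration (they are not the Credit results), and the function name `one_standard_error_rule` is our own. The standard error of each model size is taken here to be the sample standard deviation of the fold errors divided by the square root of the number of folds, which is one common convention and an assumption of this sketch.

```python
from math import sqrt
from statistics import mean, stdev


def one_standard_error_rule(fold_errors):
    """fold_errors maps model size d to a list of per-fold validation MSEs.

    Returns (size with the lowest mean error, size chosen by the one-SE rule).
    """
    summary = {
        d: (mean(errs), stdev(errs) / sqrt(len(errs)))  # (mean error, standard error)
        for d, errs in fold_errors.items()
    }
    d_min = min(summary, key=lambda d: summary[d][0])
    threshold = summary[d_min][0] + summary[d_min][1]
    # Smallest model whose mean error is within one SE of the minimum.
    d_chosen = min(d for d in summary if summary[d][0] <= threshold)
    return d_min, d_chosen


# Hypothetical 5-fold validation errors for models with 1..6 variables.
fold_errors = {
    1: [30.0, 34.0, 32.0, 31.0, 33.0],
    2: [20.0, 22.0, 21.0, 23.0, 19.0],
    3: [10.5, 11.5, 9.5, 12.0, 10.0],
    4: [10.0, 11.0, 9.0, 12.0, 10.5],
    5: [10.2, 11.2, 9.1, 12.1, 10.4],
    6: [9.9, 11.0, 9.0, 11.9, 10.2],
}

print(one_standard_error_rule(fold_errors))
```

In this made-up example the six-variable model has the lowest average error, but the three-variable model is within one standard error of it and is therefore the one selected.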


#### Shrinkage Methods

The subset selection methods described in Section 6.1 involve using least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model containing all $ p $ predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.

##### Ridge Regression

Recall from Chapter 3 that the least squares fitting procedure estimates $ \beta_0, \beta_1, \ldots, \beta_p $ using the values that minimize

$$
RSS = \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2
$$

Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. In particular, the ridge regression coefficient estimates $ \hat{\beta}^R $ are the values that minimize

$$
RSS + \lambda \sum_{j=1}^p \beta_j^2 \tag{6.5}
$$

where $ \lambda \geq 0 $ is a tuning parameter, to be determined separately. Equation (6.5) trades off two different criteria. As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. However, the second term, $ \lambda \sum_{j=1}^p \beta_j^2 $, called a shrinkage penalty, is small when $ \beta_1, \ldots, \beta_p $ are close to zero, and so it has the effect of shrinking the estimates of $ \beta_j $ towards zero. The tuning parameter $ \lambda $ serves to control the relative impact of these two terms on the regression coefficient estimates. When $ \lambda = 0 $, the penalty term has no effect, and ridge regression will produce the least squares estimates. However, as $ \lambda \to \infty $, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero. Unlike least squares, which generates only one set of coefficient estimates, ridge regression will produce a different set of coefficient estimates, $ \hat{\beta}^R_\lambda $, for each value of $ \lambda $. Selecting a good value for $ \lambda $ is critical; we defer this discussion to Section 6.2.3, where we use cross-validation.

Note that in (6.5), the shrinkage penalty is applied to $ \beta_1, \ldots, \beta_p $, but not to the intercept $ \beta_0 $. We want to shrink the estimated association of each variable with the response; however, we do not want to shrink the intercept, which is simply a measure of the mean value of the response when $ x_{i1} = x_{i2} = \cdots = x_{ip} = 0 $. If we assume that the variables (that is, the columns of the data matrix $ \mathbf{X} $) have been centered to have mean zero before ridge regression is performed, then the estimated intercept will take the form $ \hat{\beta}_0 = \bar{y} = \sum_{i=1}^n y_i / n $.
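The ridge criterion can be minimized in closed form. The sketch below relies on the standard result (not derived in the text above) that, for centered predictors, the minimizer of the penalized RSS is $ (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T y $. It assumes NumPy; the function name `ridge` and the data are illustrative only.

```python
import numpy as np


def ridge(X, y, lam):
    """Minimize RSS + lam * sum(beta_j^2), leaving the intercept unpenalized.

    Returns (intercept, beta) on the original, uncentered scale of X.
    """
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                      # center the columns, as discussed in the text
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - y.mean()))
    intercept = y.mean() - x_mean @ beta  # equals the mean of y when X is already centered
    return intercept, beta


# Synthetic data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=50)

# Ordinary least squares for comparison.
ols_coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(y)), X]), y, rcond=None)

lambdas = [0.0, 1.0, 10.0, 100.0, 1e6]
norms = [float(np.linalg.norm(ridge(X, y, lam)[1])) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:g}  l2 norm of coefficients={norm:.4f}")
```

At $ \lambda = 0 $ the result coincides with least squares, and the norm of the coefficient vector shrinks steadily towards zero as $ \lambda $ grows.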




###### An Application to the Credit Data

In Figure 6.4, the ridge regression coefficient estimates for the Credit data set are displayed. In the left-hand panel, each curve corresponds to the ridge regression coefficient estimate for one of the ten variables, plotted as a function of $\lambda$. For example, the black solid line represents the ridge regression estimate for the `income` coefficient, as $\lambda$ is varied. At the extreme left-hand side of the plot, $\lambda$ is essentially zero, and so the corresponding ridge coefficient estimates are the same as the usual least squares estimates. But as $\lambda$ increases, the ridge coefficient estimates shrink towards zero. When $\lambda$ is extremely large, then all of the ridge coefficient estimates are basically zero; this corresponds to the null model that contains no predictors. In this plot, the `income`, `limit`, `rating`, and `student` variables are displayed in distinct colors, since these variables tend to have by far the largest coefficient estimates. While the ridge coefficient estimates tend to decrease in aggregate as $\lambda$ increases, individual coefficients, such as `rating` and `income`, may occasionally increase as $\lambda$ increases.

The right-hand panel of Figure 6.4 displays the same ridge coefficient estimates as the left-hand panel, but instead of displaying $\lambda$ on the x-axis, we now display $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$, where $\hat{\beta}$ denotes the vector of least squares coefficient estimates. The notation $\|\beta\|_2$ denotes the $\ell_2$ norm (pronounced “ell 2”) of a vector, and is defined as $\|\beta\|_2 = \sqrt{\sum_{j=1}^{p} \beta_j^2}$. It measures the distance of $\beta$ from zero. As $\lambda$ increases, the $\ell_2$ norm of $\hat{\beta}^R_\lambda$ will always decrease, and so will $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$. The latter quantity ranges from 1 (when $\lambda = 0$, in which case the ridge regression coefficient estimate is the same as the least squares estimate, and so their $\ell_2$ norms are the same) to 0 (when $\lambda = \infty$, in which case the ridge regression coefficient estimate is a vector of zeros, with $\ell_2$ norm equal to zero). Therefore, we can think of the x-axis in the right-hand panel of Figure 6.4 as the amount that the ridge regression coefficient estimates have been shrunken towards zero; a small value indicates that they have been shrunken very close to zero.

The standard least squares coefficient estimates discussed in Chapter 3 are scale equivariant: multiplying $X_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimates by a factor of $1/c$. In other words, regardless of how the $j$th predictor is scaled, $X_j \hat{\beta}_j$ will remain the same. In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant. For instance, consider the `income` variable, which is measured in dollars. One could reasonably have measured income in thousands of dollars, which would result in a reduction in the observed values of `income` by a factor of 1,000. Now due to the sum of squared coefficients term in the ridge regression formulation (6.5), such a change in scale will not simply cause the ridge regression coefficient estimate for `income` to change by a factor of 1,000. In other words, $X_j \hat{\beta}^R_{j,\lambda}$ will depend not only on the value of $\lambda$, but also on the scaling of the $j$th predictor. In fact, the value of $X_j \hat{\beta}^R_{j,\lambda}$ may even depend on the scaling of the other predictors! Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula

$$
\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}} \tag{6.6}
$$

so that they are all on the same scale. In (6.6), the denominator is the estimated standard deviation of the jth predictor. Consequently, all of the standardized predictors will have a standard deviation of one. As a result, the final fit will not depend on the scale on which the predictors are measured. In Figure 6.4, the y-axis displays the standardized ridge regression coefficient estimates—that is, the coefficient estimates that result from performing ridge regression using standardized predictors.
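The effect of scaling can be checked directly. The following sketch assumes NumPy and uses synthetic data; `standardize` and `ridge_fitted` are hypothetical helper names, and the ridge fit uses the standard closed-form solution for centered predictors. One predictor is multiplied by 1,000 to mimic a change of units, and the ridge fitted values are compared with and without standardization.

```python
import numpy as np


def standardize(X):
    """Divide each column by its standard deviation computed with a 1/n factor."""
    return X / X.std(axis=0)  # NumPy's default ddof=0 matches the 1/n convention


def ridge_fitted(X, y, lam):
    """Fitted values of ridge regression with an unpenalized intercept."""
    Xc = X - X.mean(axis=0)
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y.mean()))
    return y.mean() + Xc @ beta


# Synthetic data (hypothetical).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=50)

X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0  # the same predictor expressed in different units

lam = 10.0
raw_gap = np.max(np.abs(ridge_fitted(X, y, lam) - ridge_fitted(X_rescaled, y, lam)))
std_gap = np.max(np.abs(ridge_fitted(standardize(X), y, lam)
                        - ridge_fitted(standardize(X_rescaled), y, lam)))
print(f"largest change in fitted values, raw predictors:          {raw_gap:.4f}")
print(f"largest change in fitted values, standardized predictors: {std_gap:.2e}")
```

Without standardization the fit changes when the units change; after standardization the two fits agree up to floating-point rounding.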

###### Why Does Ridge Regression Improve Over Least Squares?

Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As $\lambda$ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. This is illustrated in the left-hand panel of Figure 6.5, using a simulated data set containing $p = 45$ predictors and $n = 50$ observations. The green curve in the left-hand panel of Figure 6.5 displays the variance of the ridge regression predictions as a function of $\lambda$. At the least squares coefficient estimates, which correspond to ridge regression with $\lambda = 0$, the variance is high but there is no bias. But as $\lambda$ increases, the shrinkage of the ridge coefficient estimates leads to a substantial reduction in the variance of the predictions, at the expense of a slight increase in bias. Recall that the least mean squared error (MSE), plotted in bias-variance bias, is closely related to the variance plus the squared bias. For values of $\lambda$ that are not too small, the variance generally remains very low, as shown in the figure, plotted in black. However, as $\lambda$ increases from 0 to 10. Beyond this point, the decrease in variance is no longer sufficient to offset the increased bias, and the MSE can begin to be significantly underestimated, resulting in a large increase in the bias. 

The minimum MSE is achieved at an intermediate value of $\lambda$, where the reduction in variance and the increase in bias are balanced. At either extreme the MSE is considerably higher: the least squares fit, with $\lambda = 0$, suffers from high variance, while a very large value of $\lambda$ gives an inflexible fit with high bias.
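
The trade-off can be reproduced in a small numerical experiment. The sketch below is not the simulation behind Figure 6.5; it assumes a fixed training design, Gaussian predictors, a hypothetical true coefficient vector, noise standard deviation `sigma = 1`, and no intercept, and it uses the closed-form bias and variance of the ridge predictions under those assumptions rather than repeated sampling.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, n_test, sigma = 50, 45, 200, 1.0     # same n and p as in the text; the rest is assumed
X = rng.normal(size=(n, p))                # fixed training design
beta_true = rng.normal(size=p)             # hypothetical true coefficients
X_test = rng.normal(size=(n_test, p))      # points at which predictions are evaluated
lambdas = np.array([1e-3, 1e-2, 0.1, 1.0, 10.0, 100.0, 1000.0])

def ridge_bias_variance(lam):
    """Average squared bias and variance of ridge predictions at the test points."""
    A = np.linalg.inv(X.T @ X + lam * np.eye(p))
    H = X_test @ A @ X.T                   # maps training responses y to test predictions
    # With y = X beta + eps and Var(eps) = sigma^2 I, each prediction h'y has variance sigma^2 ||h||^2.
    variance = sigma ** 2 * np.mean(np.sum(H ** 2, axis=1))
    # The expected prediction is h'X beta; the target is x_test' beta.
    bias = H @ (X @ beta_true) - X_test @ beta_true
    return np.mean(bias ** 2), variance

results = np.array([ridge_bias_variance(lam) for lam in lambdas])
sq_bias, variance = results[:, 0], results[:, 1]
mse = sq_bias + variance                   # excludes the irreducible error sigma^2

for lam, b2, v, m in zip(lambdas, sq_bias, variance, mse):
    print(f"lambda={lam:8.3f}  bias^2={b2:8.3f}  variance={v:8.3f}  sum={m:8.3f}")
```

Under these assumptions the variance falls steadily as $\lambda$ grows, while the squared bias is negligible near $\lambda = 0$ and dominates for very large $\lambda$.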

In general, in situations where the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance. This means that a small change in the training data can cause a large change in the least squares coefficient estimates. In particular, when the number of variables $p$ is almost as large as the number of observations $n$, as in the example in Figure 6.5, the least squares estimates will be extremely variable.

When $p > n$, the least squares estimates do not even have a unique solution, whereas ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance. Hence, ridge regression works best in situations
where the least squares estimates have high variance.
Ridge regression also has substantial computational advantages over best
subset selection, which requires searching through $2^p$ models. As we
discussed previously, even for moderate values of $p$, such a search can be
computationally infeasible. In contrast, for any fixed value of $\lambda$, ridge
regression only fits a single model, and the model-fitting procedure can be
performed quite quickly. In fact, one can show that the computations
required to solve (6.5), simultaneously for all values of $\lambda$, are almost identical
to those for fitting a model using least squares.
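
One way to see why the whole ridge path is cheap is to compute it from a single singular value decomposition. This is a sketch of that idea, assuming NumPy and centered data; it is one possible route and not necessarily the computation the text has in mind, and the function name `ridge_path_svd` is hypothetical.

```python
import numpy as np

def ridge_path_svd(X, y, lambdas):
    """Ridge slopes for every value in `lambdas`, using one SVD of the centered predictors.

    Writing Xc = U diag(d) V^T, the ridge solution is
        beta(lam) = V diag(d / (d^2 + lam)) U^T yc,
    so changing lam only rescales the components of U^T yc.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)   # the only expensive step
    Uty = U.T @ yc
    lambdas = np.asarray(lambdas, dtype=float)
    shrink = d[None, :] / (d[None, :] ** 2 + lambdas[:, None])   # one row per lambda
    return (shrink * Uty[None, :]) @ Vt                 # shape (len(lambdas), p)

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 6))
y = X @ rng.normal(size=6) + rng.normal(size=80)
lambdas = [0.01, 1.0, 100.0, 10000.0]
path = ridge_path_svd(X, y, lambdas)

# Direct solve for one lambda, for comparison.
Xc, yc = X - X.mean(axis=0), y - y.mean()
direct = np.linalg.solve(Xc.T @ Xc + 1.0 * np.eye(6), Xc.T @ yc)
print(np.allclose(path[1], direct))
```

After the decomposition, each additional value of $\lambda$ costs only a rescaling and one matrix-vector product.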

#### The Lasso

Ridge regression does have one obvious disadvantage. Unlike best subset,
forward stepwise, and backward stepwise selection, which will generally
select models that involve just a subset of the variables, ridge regression
will include all $p$ predictors in the final model. The penalty $\lambda \sum_{j=1}^{p} \beta_j^2$ in (6.5)
will shrink all of the coefficients towards zero, but it will not set any of them
exactly to zero (unless $\lambda = \infty$). This may not be a problem for prediction
accuracy, but it can create a challenge in model interpretation in settings in
which the number of variables $p$ is quite large. For example, in the Credit
data set, it appears that the most important variables are income, limit,
rating, and student. So we might wish to build a model including just
these predictors. However, ridge regression will always generate a model
involving all ten predictors. Increasing the value of $\lambda$ will tend to reduce
the magnitudes of the coefficients, but will not result in exclusion of any of
the variables.

The lasso is a relatively recent alternative to ridge regression that overcomes
this disadvantage. The lasso coefficients, $\hat{\beta}^L_\lambda$, minimize the quantity
$$
\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} | \beta_j | = \text{RSS} + \lambda \sum_{j=1}^{p} | \beta_j |. \tag{6.7}
$$
Comparing (6.7) to (6.5), we see that the lasso and ridge regression have
similar formulations. The only difference is that the $\beta_j^2$ term in the ridge
regression penalty (6.5) has been replaced by $|\beta_j|$ in the lasso penalty (6.7).
In statistical parlance, the lasso uses an $L_1$ (pronounced “ell 1”) penalty
instead of an $L_2$ penalty. The $L_1$ norm of a coefficient vector is given by
$$
||\beta||_1 = \sum_{j=1}^{p} | \beta_j |.
$$

As with ridge regression, the lasso shrinks the coefficient estimates
towards zero. However, in the case of the lasso, the $L_1$ penalty has the effect
of forcing some of the coefficient estimates to be exactly equal to zero when
the tuning parameter is sufficiently large. Hence, much like best subset
selection, the lasso performs variable selection. As a result, models generated
from the lasso are generally much easier to interpret than those produced
by ridge regression. We say that the lasso yields sparse models, that is,
models that involve only a subset of the variables. As in ridge regression,
selecting a good value of $\lambda$ for the lasso is critical; we defer this discussion
to Section 6.2.3, where we use cross-validation.
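
To make the contrast with ridge regression concrete, the sketch below minimizes the lasso criterion RSS $+ \lambda \sum_j |\beta_j|$ by cyclic coordinate descent with soft-thresholding. The text does not specify a fitting algorithm, so coordinate descent is an assumption made here for illustration; the function names, the simulated data, and the choice $\lambda = 200$ are likewise hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t, returning exactly zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500):
    """Minimize RSS + lam * sum(|beta_j|) by cyclic coordinate descent on centered data."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    p = Xc.shape[1]
    col_ss = np.sum(Xc ** 2, axis=0)       # x_j' x_j for each predictor
    beta = np.zeros(p)
    resid = yc.copy()                      # residual for the current beta
    for _ in range(n_sweeps):
        for j in range(p):
            resid += Xc[:, j] * beta[j]    # partial residual with predictor j removed
            rho = Xc[:, j] @ resid
            # One-dimensional lasso solution; the threshold is lam / 2 because
            # the RSS term in the criterion carries no factor of 1/2.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_ss[j]
            resid -= Xc[:, j] * beta[j]
    return beta

def ridge(X, y, lam):
    """Ridge slopes minimizing RSS + lam * sum(beta_j^2) on centered data."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]           # only three predictors matter
y = X @ beta_true + rng.normal(size=n)

b_lasso = lasso_cd(X, y, lam=200.0)
b_ridge = ridge(X, y, lam=200.0)
# Smallest lam at which every coordinate is thresholded to zero.
lam_max = 2.0 * np.max(np.abs((X - X.mean(axis=0)).T @ (y - y.mean())))
b_null = lasso_cd(X, y, lam=lam_max)
b_ols = lasso_cd(X, y, lam=0.0)

print("lasso coefficients set exactly to zero:", int(np.sum(b_lasso == 0)))
print("ridge coefficients set exactly to zero:", int(np.sum(b_ridge == 0)))
```

With the same value of $\lambda$, the ridge fit shrinks every coefficient but keeps all of them non-zero, whereas the soft-thresholding step lets the lasso set some coefficients exactly to zero.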

As an example, consider the coefficient plots in Figure 6.6, which are
generated from applying the lasso to the Credit data set. When $\lambda = 0$, then
the lasso simply gives the least squares fit, and when $\lambda$ becomes sufficiently
large, the lasso gives the null model in which all coefficient estimates equal
zero. However, in between these two extremes, the ridge regression and
lasso models are quite different from each other. Moving from left to right
in the right-hand panel of Figure 6.6, we observe that at first the lasso
results in a model that contains only the rating predictor. Then student and
limit enter the model almost simultaneously, shortly followed by income.
Eventually, the remaining variables enter the model. Hence, depending on
the value of $\lambda$, the lasso can produce a model involving any number of
variables. In contrast, ridge regression will always include all of the variables in
the model, although the magnitude of the coefficient estimates will depend
on $\lambda$.




Another Formulation for Ridge Regression and the Lasso

One can show that the lasso and ridge regression coefficient estimates solve
the problems:

$$
\begin{align*}
\underset{\beta}{\text{minimize}} & \quad \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \\
\text{subject to} & \quad \sum_{j=1}^{p} | \beta_j | \leq s \tag{6.8}
\end{align*}
$$

and

$$
\begin{align*}
\underset{\beta}{\text{minimize}} & \quad \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \\
\text{subject to} & \quad \sum_{j=1}^{p} \beta_j^2 \leq s \tag{6.9}
\end{align*}
$$

respectively. In other words, for every value of $\lambda$, there is some $s$ such that
Equations (6.7) and (6.8) will give the same lasso coefficient estimates.
Similarly, for every value of $\lambda$ there is a corresponding $s$ such that
Equations (6.5) and (6.9) will give the same ridge regression coefficient estimates.
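
The correspondence between $\lambda$ and $s$ can be illustrated numerically for ridge regression. In the sketch below, which assumes NumPy, centered simulated data, and a hypothetical value $\lambda = 25$, the budget is set to $s = \sum_j (\hat{\beta}^R_{j,\lambda})^2$ and the residual sum of squares of the ridge solution is compared against many randomly drawn coefficient vectors that satisfy the constraint in (6.9). This is a spot check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 60, 5, 25.0                    # lam is an arbitrary illustrative value
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                        # centering removes the intercept
y = X @ rng.normal(size=p) + rng.normal(size=n)
y -= y.mean()

def rss(b):
    """Residual sum of squares for coefficient vector b."""
    return np.sum((y - X @ b) ** 2)

# Penalized form (6.5): closed-form ridge solution for this lam.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Budget s matched to this lam for the constrained form (6.9).
s = np.sum(beta_ridge ** 2)

# Draw candidate vectors uniformly from the ball {b : sum(b_j^2) <= s}.
n_candidates = 5000
directions = rng.normal(size=(n_candidates, p))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = np.sqrt(s) * rng.uniform(size=(n_candidates, 1)) ** (1.0 / p)
candidates = directions * radii

best_candidate_rss = min(rss(b) for b in candidates)
print("RSS at ridge solution:       ", rss(beta_ridge))
print("best RSS among feasible draws:", best_candidate_rss)
```

No feasible draw attains a smaller RSS than the ridge solution, as expected: any $b$ with $\sum_j b_j^2 \leq s$ satisfies $\text{RSS}(b) + \lambda \sum_j b_j^2 \geq \text{RSS}(\hat{\beta}^R_\lambda) + \lambda s$, and hence $\text{RSS}(b) \geq \text{RSS}(\hat{\beta}^R_\lambda)$. When $s$ is at least as large as the squared norm of the least squares estimate, the constraint is inactive and the matching value is $\lambda = 0$.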