### Resampling Methods

Resampling methods are an indispensable tool in modern statistics. They
involve repeatedly drawing samples from a training set and refitting a model
of interest on each sample in order to obtain additional information about
the fitted model. For example, in order to estimate the variability of a linear
regression fit, we can repeatedly draw different samples from the training
data, fit a linear regression to each new sample, and then examine the
extent to which the resulting fits differ. Such an approach may allow us to
obtain information that would not be available from fitting the model only
once using the original training sample.

Cross-validation can be used to estimate the test error associated with a given statistical learning method in order to evaluate
its performance, or to select the appropriate level of flexibility. 

The process of evaluating a model’s performance is known as model assessment, whereas
the process of selecting the proper level of flexibility for a model is known as
model selection.

The bootstrap is used in several contexts, most commonly
to provide a measure of accuracy of a parameter estimate or of a given statistical learning method.

#### Cross-Validation

The test error is the average error that results from using
a statistical learning method to predict the response on a new observation—
that is, a measurement that was not used in training the method. Given
a data set, the use of a particular statistical learning method is warranted
if it results in a low test error. The test error can be easily calculated if a
designated test set is available. Unfortunately, this is usually not the case.
In contrast, the training error can be easily calculated by applying the
statistical learning method to the observations used in its training. But as
we saw in Chapter 2, the training error rate often is quite different from the
test error rate, and in particular the former can dramatically underestimate
the latter.

In the absence of a very large designated test set that can be used to
directly estimate the test error rate, a number of techniques can be used
to estimate this quantity using the available training data. Some methods
make a mathematical adjustment to the training error rate in order to
estimate the test error rate. Such approaches are discussed.
In this section, we instead consider a class of methods that estimate the
test error rate by holding out a subset of the training observations from the
fitting process, and then applying the statistical learning method to those
held out observations.

#####  The Validation Set Approach

Suppose that we would like to estimate the test error associated with
fitting a particular statistical learning method on a set of observations. The
validation set approach, displayed in Figure 5.1, is a very simple strategy
for this task. It involves randomly dividing the available set of
observations into two parts, a training set and a validation set or hold-out set. The
model is fit on the training set, and the fitted model is used to predict the
responses for the observations in the validation set. The resulting validation
set error rate (typically assessed using MSE in the case of a quantitative
response) provides an estimate of the test error rate.
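The procedure just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis behind Figure 5.1: the data are synthetic, the model is a simple linear regression fit by least squares, and the helper names `fit_line`, `mse`, and `validation_set_mse` are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def mse(xs, ys, coef):
    """Mean squared prediction error of the fitted line on (xs, ys)."""
    b0, b1 = coef
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def validation_set_mse(xs, ys, seed=0):
    """Randomly split the data in half, fit on one half, score on the other."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, valid = idx[:half], idx[half:]
    coef = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return mse([xs[i] for i in valid], [ys[i] for i in valid], coef)

# Synthetic data (an assumption for illustration, not the Auto data set).
data_rng = random.Random(1)
x = [data_rng.uniform(0, 10) for _ in range(200)]
y = [2.0 + 0.5 * xi + data_rng.gauss(0, 1) for xi in x]

print(validation_set_mse(x, y, seed=0))
```

Because the noise in this synthetic example has variance 1, the printed validation set MSE should land in the neighborhood of 1. In practice one would more likely rely on a numerical library than on hand-written least squares.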

We illustrate the validation set approach on the Auto data set. Recall from
Chapter 3 that there appears to be a non-linear relationship between mpg
and horsepower, and that a model that predicts mpg using horsepower and
horsepower$^2$ gives better results than a model that uses only a linear term.
It is natural to wonder whether a cubic or higher-order fit might provide
even better results. We answer this question in Chapter 3 by looking at
the p-values associated with a cubic term and higher-order polynomial
terms in a linear regression. But we could also answer this question using
the validation method. We randomly split the 392 observations into two
sets, a training set containing 196 of the data points, and a validation set
containing the remaining 196 observations. The validation set error rates
that result from fitting various regression models on the training sample
and evaluating their performance on the validation sample, using MSE
as a measure of validation set error, are shown in the left-hand panel of
Figure 5.2. The validation set MSE for the quadratic fit is considerably
smaller than for the linear fit. However, the validation set MSE for the cubic
fit is actually slightly larger than for the quadratic fit. This implies that
including a cubic term in the regression does not lead to better prediction
than simply using a quadratic term.

Recall that in order to create the left-hand panel of Figure 5.2, we
randomly divided the data set into two parts, a training set and a validation
set. If we repeat the process of randomly splitting the sample set into two
parts, we will get a somewhat different estimate for the test MSE. As an
illustration, the right-hand panel of Figure 5.2 displays ten different
validation set MSE curves from the Auto data set, produced using ten different
random splits of the observations into training and validation sets. All ten
curves indicate that the model with a quadratic term has a dramatically
smaller validation set MSE than the model with only a linear term.
Furthermore, all ten curves indicate that there is not much benefit in including
cubic or higher-order polynomial terms in the model. But it is worth noting
that each of the ten curves results in a different test MSE estimate for each
of the ten regression models considered. And there is no consensus among
the curves as to which model results in the smallest validation set MSE.
Based on the variability among these curves, all that we can conclude with
any confidence is that the linear fit is not adequate for this data.
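The dependence of the estimate on the particular split is easy to reproduce on a small scale. The sketch below assumes synthetic data and a simple linear model (it does not reproduce Figure 5.2), and the names `fit_line` and `validation_mse` are hypothetical. It repeats the half/half split ten times with different seeds and reports the spread of the resulting estimates.

```python
import random

def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def validation_mse(xs, ys, seed):
    """One random half/half split: fit on the training half, score on the rest."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, valid = idx[:half], idx[half:]
    b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in valid) / len(valid)

# Synthetic data (an assumption for illustration only).
data_rng = random.Random(1)
x = [data_rng.uniform(0, 10) for _ in range(60)]
y = [2.0 + 0.5 * xi + data_rng.gauss(0, 1) for xi in x]

# Ten different random splits of the same data give ten different estimates.
estimates = [validation_mse(x, y, seed) for seed in range(10)]
print(f"min = {min(estimates):.3f}, max = {max(estimates):.3f}")
```

The gap between the smallest and largest estimate comes entirely from which observations happen to fall in each half, since the data and the model are held fixed.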

The validation set approach is conceptually simple and is easy to
implement. But it has two potential drawbacks:

1. As is shown in the right-hand panel of Figure 5.2, the validation
   estimate of the test error rate can be highly variable, depending on
   precisely which observations are included in the training set and which
   observations are included in the validation set.
2. In the validation approach, only a subset of the observations (those
   that are included in the training set rather than in the validation
   set) are used to fit the model. Since statistical methods tend to
   perform worse when trained on fewer observations, this suggests that the
   validation set error rate may tend to overestimate the test error rate
   for the model fit on the entire data set.

In the coming subsections, we will present cross-validation, a refinement of
the validation set approach that addresses these two issues.


##### Leave-One-Out Cross-Validation

Leave-one-out cross-validation (LOOCV) is closely related to the validation
set approach of Section 5.1.1, but it attempts to address that method’s
drawbacks.

Like the validation set approach, LOOCV involves splitting the set of
observations into two parts. However, instead of creating two subsets of
comparable size, a single observation $(x_1, y_1)$ is used for the validation
set, and the remaining observations $\{(x_2, y_2), \ldots, (x_n, y_n)\}$ make up the
training set. The statistical learning method is fit on the $n - 1$ training
observations, and a prediction $\hat{y}_1$ is made for the excluded observation,
using its value $x_1$. Since $(x_1, y_1)$ was not used in the fitting process,
$\mathrm{MSE}_1 = (y_1 - \hat{y}_1)^2$ provides an approximately unbiased estimate for the test error.
But even though $\mathrm{MSE}_1$ is unbiased for the test error, it is a poor estimate
because it is highly variable, since it is based upon a single observation
$(x_1, y_1)$.

We can repeat the procedure by selecting $(x_2, y_2)$ for the validation
data, training the statistical learning procedure on the $n - 1$ observations
$\{(x_1, y_1), (x_3, y_3), \ldots, (x_n, y_n)\}$, and computing
$\mathrm{MSE}_2 = (y_2 - \hat{y}_2)^2$. Repeating this approach $n$ times produces
$n$ squared errors, $\mathrm{MSE}_1, \ldots, \mathrm{MSE}_n$.
The LOOCV estimate for the test MSE is the average of these $n$ test error
estimates:

$$
\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MSE}_i. \tag{5.1}
$$
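Equation (5.1) translates directly into a loop that leaves out one observation at a time. The following is a minimal sketch assuming a simple linear regression model and synthetic data; the function names `fit_line` and `loocv` are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def loocv(xs, ys):
    """Equation (5.1): the average of the n held-out squared errors MSE_i."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        # Training set: every observation except the i-th.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        b0, b1 = fit_line(train_x, train_y)
        total += (ys[i] - (b0 + b1 * xs[i])) ** 2  # MSE_i
    return total / n

# Synthetic data (an assumption for illustration only).
data_rng = random.Random(1)
x = [data_rng.uniform(0, 10) for _ in range(50)]
y = [2.0 + 0.5 * xi + data_rng.gauss(0, 1) for xi in x]

print(loocv(x, y))
```

Note that `loocv` involves no random number generator: the only randomness in the example is in generating the synthetic data, so calling the function twice on the same data returns the same value.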

LOOCV has a couple of major advantages over the validation set
approach. First, it has far less bias. In LOOCV, we repeatedly fit the
statistical learning method using training sets that contain $n - 1$
observations, almost as many as are in the entire data set. This is in contrast to
the validation set approach, in which the training set is typically around
half the size of the original data set. Consequently, the LOOCV approach
tends not to overestimate the test error rate as much as the validation
set approach does. Second, in contrast to the validation approach, which
will yield different results when applied repeatedly due to randomness in
the training/validation set splits, performing LOOCV multiple times will
always yield the same results: there is no randomness in the
training/validation set splits.

LOOCV has the potential to be expensive to implement, since the model
has to be fit $n$ times. This can be very time consuming if $n$ is large, and if
each individual model is slow to fit. With least squares linear or polynomial
regression, an amazing shortcut makes the cost of LOOCV the same as that
of a single model fit! The following formula holds:

$$
\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2, \tag{5.2}
$$

where $\hat{y}_i$ is the $i$th fitted value from the original least squares fit, and $h_i$ is
the leverage defined in (3.37) on page 105. This is like the ordinary MSE,
except the $i$th residual is divided by $1 - h_i$. The leverage lies between $1/n$
and 1, and reflects the amount that an observation influences its own fit.
Hence the residuals for high-leverage points are inflated in this formula by
exactly the right amount for this equality to hold.
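The equality in (5.2) can be checked numerically. The sketch below assumes simple linear regression with an intercept, for which the leverage has the closed form $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. It compares the shortcut against refitting the model $n$ times on synthetic data; the function names are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def loocv_brute_force(xs, ys):
    """Refit the model n times, leaving out one observation each time."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return total / n

def loocv_shortcut(xs, ys):
    """Formula (5.2): a single fit, with each residual divided by 1 - h_i."""
    n = len(xs)
    b0, b1 = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    total = 0.0
    for xi, yi in zip(xs, ys):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx  # leverage, simple regression only
        total += ((yi - (b0 + b1 * xi)) / (1.0 - h)) ** 2
    return total / n

# Synthetic data (an assumption for illustration only).
data_rng = random.Random(1)
x = [data_rng.uniform(0, 10) for _ in range(50)]
y = [2.0 + 0.5 * xi + data_rng.gauss(0, 1) for xi in x]

print(loocv_brute_force(x, y), loocv_shortcut(x, y))
```

The two printed values should agree up to floating-point rounding, while the shortcut requires only one model fit instead of $n$.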

LOOCV is a very general method, and can be used with any kind of
predictive modeling. For example, we could use it with logistic regression
or linear discriminant analysis, or any of the methods discussed in later
chapters. The magic formula (5.2) does not hold in general, in which case
the model has to be refit $n$ times.

#####  k-Fold Cross-Validation

An alternative to LOOCV is $k$-fold CV. This approach involves randomly
dividing the set of observations into $k$ groups, or folds, of approximately
equal size. The first fold is treated as a validation set, and the method
is fit on the remaining $k - 1$ folds. The mean squared error, $\mathrm{MSE}_1$, is
then computed on the observations in the held-out fold. This procedure is
repeated $k$ times; each time, a different group of observations is treated
as a validation set. This process results in $k$ estimates of the test error,
$\mathrm{MSE}_1, \mathrm{MSE}_2, \ldots, \mathrm{MSE}_k$. The $k$-fold CV estimate is computed by averaging
these values,

$$
\mathrm{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_i. \tag{5.3}
$$

Figure 5.5 illustrates the $k$-fold CV approach.
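A minimal sketch of (5.3) follows, again assuming a simple linear model and synthetic data; the function names `fit_line` and `k_fold_cv` are hypothetical. The folds are formed by shuffling the indices and dealing them out in turn, which is one of several reasonable ways to obtain groups of approximately equal size.

```python
import random

def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def k_fold_cv(xs, ys, k, seed=0):
    """Equation (5.3): average of the k held-out fold MSEs (assumes k <= n)."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[j::k] for j in range(k)]  # k folds of nearly equal size
    fold_mses = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        fold_mse = sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in fold) / len(fold)
        fold_mses.append(fold_mse)  # MSE_i for this fold
    return sum(fold_mses) / k

# Synthetic data (an assumption for illustration only).
data_rng = random.Random(1)
x = [data_rng.uniform(0, 10) for _ in range(100)]
y = [2.0 + 0.5 * xi + data_rng.gauss(0, 1) for xi in x]

print(k_fold_cv(x, y, k=5), k_fold_cv(x, y, k=10))
```

Calling `k_fold_cv(x, y, k=len(x))` places each observation in its own fold, so every model is fit on all but one observation.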

It is not hard to see that LOOCV is a special case of $k$-fold CV in which $k$
is set to equal $n$. In practice, one typically performs $k$-fold CV using $k = 5$
or $k = 10$. What is the advantage of using $k = 5$ or $k = 10$ rather than
$k = n$? The most obvious advantage is computational. LOOCV requires
fitting the statistical learning method $n$ times. This has the potential to be
computationally expensive (except for linear models fit by least squares,
in which case formula (5.2) can be used). But cross-validation is a very
general approach that can be applied to almost any statistical learning
method. Some statistical learning methods have computationally intensive
fitting procedures, and so performing LOOCV may pose computational
problems, especially if $n$ is extremely large. In contrast, performing 10-fold
CV requires fitting the learning procedure only ten times, which may be
much more feasible. As we see in Section 5.1.4, there also can be other
non-computational advantages to performing 5-fold or 10-fold CV, which
involve the bias-variance trade-off.

The right-hand panel of Figure 5.4 displays nine different 10-fold CV
estimates for the Auto data set, each resulting from a different random split
of the observations into ten folds. As we can see from the figure, there is
some variability in the CV estimates as a result of the variability in how
the observations are divided into ten folds. But this variability is typically
much lower than the variability in the test error estimates that results from
the validation set approach (right-hand panel of Figure 5.2).

When we examine real data, we do not know the true test MSE, and
so it is difficult to determine the accuracy of the cross-validation estimate.
However, if we examine simulated data, then we can compute the true
test MSE, and can thereby evaluate the accuracy of our cross-validation
results. In Figure 5.6, we plot the cross-validation estimates and true test
error rates that result from applying smoothing splines to the simulated
data sets illustrated in Figures 2.9–2.11 of Chapter 2. The true test MSE
is displayed in blue. The black dashed and orange solid lines respectively
show the LOOCV and 10-fold CV estimates. In all three plots,
the two cross-validation estimates are very similar.

When we perform cross-validation, our goal might be to determine how
well a given statistical learning procedure can be expected to perform on
independent data; in this case, the actual estimate of the test MSE is
of interest. But at other times we are interested only in the location of
the minimum point in the estimated test MSE curve. This is because we
might be performing cross-validation on a number of statistical learning
methods, or on a single method using different levels of flexibility, in order
to identify the method that results in the lowest test error. For this purpose,
the location of the minimum point in the estimated test MSE curve is
important, but the actual value of the estimated test MSE is not. We find
in Figure 5.6 that despite the fact that they sometimes underestimate the
true test MSE, all of the CV curves come close to identifying the correct
level of flexibility—that is, the flexibility level corresponding to the smallest
test MSE.
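Using cross-validation to locate the minimum of the estimated test MSE curve can be sketched as follows. This is an illustration under stated assumptions rather than the smoothing-spline analysis of Figure 5.6: the data are synthetic with a quadratic signal, flexibility is indexed by polynomial degree, the fit uses hand-written normal equations, and all function names are hypothetical.

```python
import random

def solve(a, b):
    """Solve the linear system a * coef = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - s) / m[r][r]
    return coef

def fit_poly(xs, ys, degree):
    """Least squares polynomial fit via the normal equations."""
    p = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(p)] for i in range(p)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(p)]
    return solve(a, b)

def predict(coef, x):
    return sum(c * x ** i for i, c in enumerate(coef))

def k_fold_cv(xs, ys, degree, k=10, seed=0):
    """k-fold CV estimate of test MSE for a polynomial of the given degree."""
    rng = random.Random(seed)  # same seed, hence same folds, for every degree
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    total = 0.0
    for j in range(k):
        fold = idx[j::k]
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        coef = fit_poly([xs[i] for i in train], [ys[i] for i in train], degree)
        total += sum((ys[i] - predict(coef, xs[i])) ** 2 for i in fold) / len(fold)
    return total / k

# Synthetic data with a quadratic signal (an assumption for illustration only).
data_rng = random.Random(2)
x = [data_rng.uniform(-2, 2) for _ in range(120)]
y = [1.0 + xi - 2.0 * xi ** 2 + data_rng.gauss(0, 0.5) for xi in x]

cv_curve = {d: k_fold_cv(x, y, d) for d in range(1, 6)}
best_degree = min(cv_curve, key=cv_curve.get)
print(cv_curve, best_degree)
```

Only the position of the minimum of `cv_curve` is used to choose the degree. The same folds are reused for every degree so that differences between the estimates reflect the models rather than the splits.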

##### Bias-Variance Trade-Off for k-Fold Cross-Validation

$k$-fold CV with $k < n$ has a computational
advantage over LOOCV. But putting computational issues aside,
a less obvious but potentially more important advantage of $k$-fold CV is
that it often gives more accurate estimates of the test error rate than does
LOOCV. This has to do with a bias-variance trade-off.

It was mentioned in Section 5.1.1 that the validation set approach can
lead to overestimates of the test error rate, since in this approach the
training set used to fit the statistical learning method contains only half
the observations of the entire data set.