### Resampling Methods

Resampling methods are an indispensable tool in modern statistics. They
involve repeatedly drawing samples from a training set and refitting a model
of interest on each sample in order to obtain additional information about
the fitted model. For example, in order to estimate the variability of a linear
regression fit, we can repeatedly draw different samples from the training
data, fit a linear regression to each new sample, and then examine the
extent to which the resulting fits differ. Such an approach may allow us to
obtain information that would not be available from fitting the model only
once using the original training sample.

Cross-validation can be used to estimate the test error associated with a given statistical learning method in order to evaluate
its performance, or to select the appropriate level of flexibility. 

The process of evaluating a model’s performance is known as model assessment, whereas
the process of selecting the proper level of flexibility for a model is known as
model selection.

The bootstrap is used in several contexts, most commonly
to provide a measure of accuracy of a parameter estimate or of a given statistical learning method.

#### Cross-Validation

The test error is the average error that results from using
a statistical learning method to predict the response on a new observation—
that is, a measurement that was not used in training the method. Given
a data set, the use of a particular statistical learning method is warranted
if it results in a low test error. The test error can be easily calculated if a
designated test set is available. Unfortunately, this is usually not the case.
In contrast, the training error can be easily calculated by applying the
statistical learning method to the observations used in its training. But as
we saw in Chapter 2, the training error rate often is quite different from the
test error rate, and in particular the former can dramatically underestimate
the latter.

In the absence of a very large designated test set that can be used to
directly estimate the test error rate, a number of techniques can be used
to estimate this quantity using the available training data. Some methods
make a mathematical adjustment to the training error rate in order to
estimate the test error rate. Such approaches are discussed elsewhere.
In this section, we instead consider a class of methods that estimate the
test error rate by holding out a subset of the training observations from the
fitting process, and then applying the statistical learning method to those
held out observations.

#####  The Validation Set Approach

Suppose that we would like to estimate the test error associated with fitting
a particular statistical learning method on a set of observations. The
validation set approach, displayed in Figure 5.1, is a very simple strategy
for this task. It involves randomly dividing the available set of observations
into two parts, a training set and a validation set or hold-out set. The
model is fit on the training set, and the fitted model is used to predict the
responses for the observations in the validation set. The resulting validation
set error rate (typically assessed using MSE in the case of a quantitative
response) provides an estimate of the test error rate.

We illustrate the validation set approach on the Auto data set. Recall from
Chapter 3 that there appears to be a non-linear relationship between mpg
and horsepower, and that a model that predicts mpg using horsepower and
horsepower² gives better results than a model that uses only a linear term.
It is natural to wonder whether a cubic or higher-order fit might provide
even better results. We answer this question in Chapter 3 by looking at
the p-values associated with a cubic term and higher-order polynomial
terms in a linear regression. But we could also answer this question using
the validation method. We randomly split the 392 observations into two
sets, a training set containing 196 of the data points, and a validation set
containing the remaining 196 observations. The validation set error rates
that result from fitting various regression models on the training sample
and evaluating their performance on the validation sample, using MSE
as a measure of validation set error, are shown in the left-hand panel of
Figure 5.2. The validation set MSE for the quadratic fit is considerably
smaller than for the linear fit. However, the validation set MSE for the cubic
fit is actually slightly larger than for the quadratic fit. This implies that
including a cubic term in the regression does not lead to better prediction
than simply using a quadratic term.

Recall that in order to create the left-hand panel of Figure 5.2, we randomly
divided the data set into two parts, a training set and a validation
set. If we repeat the process of randomly splitting the sample set into two
parts, we will get a somewhat different estimate for the test MSE. As an
illustration, the right-hand panel of Figure 5.2 displays ten different validation
set MSE curves from the Auto data set, produced using ten different
random splits of the observations into training and validation sets. All ten
curves indicate that the model with a quadratic term has a dramatically
smaller validation set MSE than the model with only a linear term. Furthermore,
all ten curves indicate that there is not much benefit in including
cubic or higher-order polynomial terms in the model. But it is worth noting
that each of the ten curves results in a different test MSE estimate for each
of the ten regression models considered. And there is no consensus among
the curves as to which model results in the smallest validation set MSE.
Based on the variability among these curves, all that we can conclude with
any confidence is that the linear fit is not adequate for this data.

The validation set approach is conceptually simple and is easy to implement.
But it has two potential drawbacks:

1. As is shown in the right-hand panel of Figure 5.2, the validation estimate
   of the test error rate can be highly variable, depending on precisely
   which observations are included in the training set and which
   observations are included in the validation set.
2. In the validation approach, only a subset of the observations (those
   that are included in the training set rather than in the validation
   set) are used to fit the model. Since statistical methods tend to perform
   worse when trained on fewer observations, this suggests that the
   validation set error rate may tend to overestimate the test error rate
   for the model fit on the entire data set.

In the coming subsections, we will present cross-validation, a refinement of
the validation set approach that addresses these two issues.
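
The split-fit-score procedure and its sensitivity to the random split can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data are synthetic (a quadratic signal plus noise, not the Auto data), and the helper name `validation_mse` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): quadratic signal plus noise.
n = 200
x = rng.uniform(-2.0, 2.0, size=n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=1.0, size=n)

def validation_mse(x, y, train, valid, degree):
    """Fit a polynomial on the training indices; return MSE on the validation indices."""
    coefs = np.polyfit(x[train], y[train], deg=degree)
    pred = np.polyval(coefs, x[valid])
    return float(np.mean((y[valid] - pred) ** 2))

degrees = (1, 2, 3)
n_splits = 10
mse = np.empty((n_splits, len(degrees)))  # rows: random splits, columns: degrees
for s in range(n_splits):
    perm = rng.permutation(n)
    train, valid = perm[: n // 2], perm[n // 2 :]  # half for training, half held out
    mse[s] = [validation_mse(x, y, train, valid, d) for d in degrees]

print(mse.round(2))
print("range of the linear-fit estimate across splits:",
      round(mse[:, 0].max() - mse[:, 0].min(), 2))
```

Each row uses a different random split, so reading down a column shows how much the estimate for one model moves with the split alone.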


##### Leave-One-Out Cross-Validation

Leave-one-out cross-validation (LOOCV) is closely related to the validation
set approach of Section 5.1.1, but it attempts to address that method’s
drawbacks.

Like the validation set approach, LOOCV involves splitting the set of
observations into two parts. However, instead of creating two subsets of
comparable size, a single observation $(x_1, y_1)$ is used for the validation
set, and the remaining observations $\{(x_2, y_2), \ldots, (x_n, y_n)\}$ make up the
training set. The statistical learning method is fit on the $n - 1$ training
observations, and a prediction $\hat{y}_1$ is made for the excluded observation,
using its value $x_1$. Since $(x_1, y_1)$ was not used in the fitting process,
$\text{MSE}_1 = (y_1 - \hat{y}_1)^2$ provides an approximately unbiased estimate for the test error.
But even though $\text{MSE}_1$ is unbiased for the test error, it is a poor estimate
because it is highly variable, since it is based upon a single observation
$(x_1, y_1)$.

We can repeat the procedure by selecting $(x_2, y_2)$ for the validation
data, training the statistical learning procedure on the $n - 1$ observations
$\{(x_1, y_1), (x_3, y_3), \ldots, (x_n, y_n)\}$, and computing
$\text{MSE}_2 = (y_2 - \hat{y}_2)^2$. Repeating this approach $n$ times produces $n$
squared errors, $\text{MSE}_1, \ldots, \text{MSE}_n$.
The LOOCV estimate for the test MSE is the average of these $n$ test error
estimates:

$$
\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \text{MSE}_i. \tag{5.1}
$$
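
A direct, brute-force reading of (5.1) is sketched below: hold out each observation in turn, refit, and average the squared errors. The data are synthetic and the function name `loocv_mse` is a hypothetical helper, not part of any library.

```python
import numpy as np

def loocv_mse(x, y, degree):
    """LOOCV estimate of test MSE for a polynomial fit, refitting n times."""
    n = len(x)
    sq_errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                      # all observations except the i-th
        coefs = np.polyfit(x[keep], y[keep], deg=degree)
        sq_errors[i] = (y[i] - np.polyval(coefs, x[i])) ** 2   # MSE_i
    return float(sq_errors.mean())                    # average of MSE_1, ..., MSE_n

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=60)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(size=60)

cv_linear = loocv_mse(x, y, degree=1)
cv_quadratic = loocv_mse(x, y, degree=2)
print(round(cv_linear, 3), round(cv_quadratic, 3))
```

Because no random split is involved, calling `loocv_mse` again on the same data returns the same number.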

LOOCV has a couple of major advantages over the validation set approach.
First, it has far less bias. In LOOCV, we repeatedly fit the statistical
learning method using training sets that contain $n - 1$ observations,
almost as many as are in the entire data set. This is in contrast to
the validation set approach, in which the training set is typically around
half the size of the original data set. Consequently, the LOOCV approach
tends not to overestimate the test error rate as much as the validation
set approach does. Second, in contrast to the validation approach which
will yield different results when applied repeatedly due to randomness in
the training/validation set splits, performing LOOCV multiple times will
always yield the same results: there is no randomness in the training/validation
set splits.

LOOCV has the potential to be expensive to implement, since the model
has to be fit $n$ times. This can be very time consuming if $n$ is large, and if
each individual model is slow to fit. With least squares linear or polynomial
regression, an amazing shortcut makes the cost of LOOCV the same as that
of a single model fit! The following formula holds:

$$
\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2, \tag{5.2}
$$

where $\hat{y}_i$ is the $i$th fitted value from the original least squares fit, and $h_i$ is
the leverage defined in (3.37) on page 105. This is like the ordinary MSE,
except the $i$th residual is divided by $1 - h_i$. The leverage lies between $1/n$
and 1, and reflects the amount that an observation influences its own fit.
Hence the residuals for high-leverage points are inflated in this formula by
exactly the right amount for this equality to hold.
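
The identity in (5.2) can be checked numerically. The sketch below computes the leverages from a QR factorization of the design matrix (one of several equivalent ways to obtain the diagonal of the hat matrix) and compares the shortcut against refitting $n$ times. The data are synthetic and all function names are hypothetical helpers.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = Q Q^T, assuming X has full column rank."""
    Q, _ = np.linalg.qr(X)            # reduced QR: Q is n x p
    return np.sum(Q**2, axis=1)

def loocv_shortcut(X, y):
    """LOOCV MSE for least squares from a single fit, as in (5.2)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(np.mean((resid / (1.0 - leverages(X))) ** 2))

def loocv_brute_force(X, y):
    """LOOCV MSE by refitting the model n times, as in (5.1)."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errs[i] = (y[i] - X[i] @ beta) ** 2
    return float(errs.mean())

# Synthetic data (an assumption for illustration); quadratic design with intercept.
rng = np.random.default_rng(2)
x = rng.uniform(-2.0, 2.0, size=50)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(size=50)
X = np.column_stack([np.ones_like(x), x, x**2])

print(loocv_shortcut(X, y), loocv_brute_force(X, y))  # agree up to rounding error
```

The two printed values should agree to within floating-point error, while the shortcut needs only one least squares fit.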
LOOCV is a very general method, and can be used with any kind of
predictive modeling. For example we could use it with logistic regression
or linear discriminant analysis, or any of the methods discussed in later
chapters. The magic formula (5.2) does not hold in general, in which case
the model has to be refit n times.

#####  k-Fold Cross-Validation

An alternative to LOOCV is k-fold CV. This approach involves randomly
dividing the set of observations into $k$ groups, or folds, of approximately
equal size. The first fold is treated as a validation set, and the method
is fit on the remaining $k - 1$ folds. The mean squared error, $\text{MSE}_1$, is
then computed on the observations in the held-out fold. This procedure is
repeated $k$ times; each time, a different group of observations is treated
as a validation set. This process results in $k$ estimates of the test error,
$\text{MSE}_1, \text{MSE}_2, \ldots, \text{MSE}_k$. The k-fold CV estimate is computed by averaging
these values,

$$
\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \text{MSE}_i. \tag{5.3}
$$

Figure 5.5 illustrates the k-fold CV approach.
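
A minimal sketch of (5.3) follows, again on synthetic data and with a hypothetical helper name (`kfold_mse`). Folds are formed by shuffling the indices and cutting them into $k$ nearly equal pieces.

```python
import numpy as np

def kfold_mse(x, y, degree, k, rng):
    """k-fold CV estimate of test MSE for a polynomial fit of the given degree."""
    n = len(x)
    folds = np.array_split(rng.permutation(n), k)     # k folds of roughly equal size
    fold_mse = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)  # the other k - 1 folds
        coefs = np.polyfit(x[train], y[train], deg=degree)
        pred = np.polyval(coefs, x[held_out])
        fold_mse.append(np.mean((y[held_out] - pred) ** 2))   # MSE_i
    return float(np.mean(fold_mse))                   # average over the k folds

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(3)
x = rng.uniform(-2.0, 2.0, size=100)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(size=100)

cv10_linear = kfold_mse(x, y, degree=1, k=10, rng=rng)
cv10_quadratic = kfold_mse(x, y, degree=2, k=10, rng=rng)
cv_n = kfold_mse(x, y, degree=2, k=len(x), rng=rng)   # k = n: one observation per fold
print(round(cv10_linear, 3), round(cv10_quadratic, 3), round(cv_n, 3))
```

With `k=len(x)` every fold holds a single observation, so the shuffle no longer affects the result beyond rounding.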

It is not hard to see that LOOCV is a special case of k-fold CV in which $k$
is set to equal $n$. In practice, one typically performs k-fold CV using $k = 5$
or $k = 10$. What is the advantage of using $k = 5$ or $k = 10$ rather than
$k = n$? The most obvious advantage is computational. LOOCV requires
fitting the statistical learning method $n$ times. This has the potential to be
computationally expensive (except for linear models fit by least squares,
in which case formula (5.2) can be used). But cross-validation is a very
general approach that can be applied to almost any statistical learning
method. Some statistical learning methods have computationally intensive
fitting procedures, and so performing LOOCV may pose computational
problems, especially if $n$ is extremely large. In contrast, performing 10-fold
CV requires fitting the learning procedure only ten times, which may be
much more feasible. As we see in Section 5.1.4, there also can be other
non-computational advantages to performing 5-fold or 10-fold CV, which
involve the bias-variance trade-off.

The right-hand panel of Figure 5.4 displays nine different 10-fold CV
estimates for the Auto data set, each resulting from a different random split
of the observations into ten folds. As we can see from the figure, there is
some variability in the CV estimates as a result of the variability in how
the observations are divided into ten folds. But this variability is typically
much lower than the variability in the test error estimates that results from
the validation set approach (right-hand panel of Figure 5.2).
When we examine real data, we do not know the true test MSE, and
so it is difficult to determine the accuracy of the cross-validation estimate.
However, if we examine simulated data, then we can compute the true
test MSE, and can thereby evaluate the accuracy of our cross-validation
results. In Figure 5.6, we plot the cross-validation estimates and true test
error rates that result from applying smoothing splines to the simulated
data sets illustrated in Figures 2.9–2.11 of Chapter 2. The true test MSE
is displayed in blue. The black dashed and orange solid lines respectively
show the estimated LOOCV and 10-fold CV estimates. In all three plots,
the two cross-validation estimates are very similar.

When we perform cross-validation, our goal might be to determine how
well a given statistical learning procedure can be expected to perform on
independent data; in this case, the actual estimate of the test MSE is
of interest. But at other times we are interested only in the location of
the minimum point in the estimated test MSE curve. This is because we
might be performing cross-validation on a number of statistical learning
methods, or on a single method using different levels of flexibility, in order
to identify the method that results in the lowest test error. For this purpose,
the location of the minimum point in the estimated test MSE curve is
important, but the actual value of the estimated test MSE is not. We find
in Figure 5.6 that despite the fact that they sometimes underestimate the
true test MSE, all of the CV curves come close to identifying the correct
level of flexibility—that is, the flexibility level corresponding to the smallest
test MSE.
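
When only the location of the minimum matters, the CV curve is computed across flexibility levels and its argmin is taken. The sketch below does this for polynomial degree on synthetic data; `kfold_curve` is a hypothetical helper, and the same folds are reused for every degree so that the curve is compared on equal footing.

```python
import numpy as np

def kfold_curve(x, y, degrees, k, rng):
    """k-fold CV MSE for each polynomial degree, using one shared set of folds."""
    n = len(x)
    folds = np.array_split(rng.permutation(n), k)
    curve = []
    for d in degrees:
        fold_mse = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(n), held_out)
            coefs = np.polyfit(x[train], y[train], deg=d)
            pred = np.polyval(coefs, x[held_out])
            fold_mse.append(np.mean((y[held_out] - pred) ** 2))
        curve.append(np.mean(fold_mse))
    return np.array(curve)

# Synthetic data (an assumption for illustration): the true signal is quadratic.
rng = np.random.default_rng(7)
x = rng.uniform(-2.0, 2.0, size=150)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(size=150)

degrees = np.arange(1, 7)
cv_curve = kfold_curve(x, y, degrees, k=10, rng=rng)
best_degree = int(degrees[np.argmin(cv_curve)])       # location of the minimum
print(best_degree, cv_curve.round(3))
```

The selected degree depends on the data and the folds; the point of the sketch is that the choice uses only where the curve is lowest, not how low it is.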

##### Bias-Variance Trade-Off for k-Fold Cross-Validation

k-fold CV with $k < n$ has a computational
advantage over LOOCV. But putting computational issues aside,
a less obvious but potentially more important advantage of k-fold CV is
that it often gives more accurate estimates of the test error rate than does
LOOCV. This has to do with a bias-variance trade-off.

It was mentioned in Section 5.1.1 that the validation set approach can
lead to overestimates of the test error rate, since in this approach the
training set used to fit the statistical learning method contains only half
the observations of the entire data set. Using this logic, it is not hard to
see that LOOCV will give approximately unbiased estimates of the test error,
since each training set contains $n - 1$ observations, which is almost as many
as the number of observations in the full data set. And performing k-fold
CV for, say, $k = 5$ or $k = 10$ will lead to an intermediate level of bias,
since each training set contains approximately $(k - 1)n/k$ observations,
fewer than in the LOOCV approach, but substantially more than in the
validation set approach. Therefore, from the perspective of bias reduction,
it is clear that LOOCV is to be preferred to k-fold CV.

However, we know that bias is not the only source for concern in an estimating
procedure; we must also consider the procedure’s variance. It turns
out that LOOCV has higher variance than does k-fold CV with $k < n$. Why
is this the case? When we perform LOOCV, we are in effect averaging the
outputs of $n$ fitted models, each of which is trained on an almost identical
set of observations; therefore, these outputs are highly (positively) correlated
with each other. In contrast, when we perform k-fold CV with $k < n$,
we are averaging the outputs of $k$ fitted models that are somewhat less
correlated with each other, since the overlap between the training sets in
each model is smaller. Since the mean of many highly correlated quantities
has higher variance than does the mean of many quantities that are not
as highly correlated, the test error estimate resulting from LOOCV tends
to have higher variance than does the test error estimate resulting from
k-fold CV.
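
The claim about means of correlated quantities can be made concrete with a simplified calculation. Assume (this idealization is not in the text) that the quantities being averaged, $Z_1, \ldots, Z_n$, share a common variance $\sigma^2$ and a common pairwise correlation $\rho$. Then:

```latex
\begin{aligned}
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} Z_i\right)
  &= \frac{1}{n^2}\left[\sum_{i=1}^{n}\operatorname{Var}(Z_i)
     + \sum_{i \neq j}\operatorname{Cov}(Z_i, Z_j)\right] \\
  &= \frac{1}{n^2}\left[n\sigma^2 + n(n-1)\rho\sigma^2\right] \\
  &= \rho\sigma^2 + \frac{(1-\rho)\,\sigma^2}{n}.
\end{aligned}
```

With $\rho = 0$ this reduces to the familiar $\sigma^2/n$, but as $\rho$ approaches 1 the variance of the mean approaches $\sigma^2$ no matter how many terms are averaged. This is only a heuristic for cross-validation, where the per-fold errors need not have equal variances or equal correlations.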
To summarize, there is a bias-variance trade-off associated with the
choice of $k$ in k-fold cross-validation. Typically, given these considerations,
one performs k-fold cross-validation using $k = 5$ or $k = 10$, as these values
have been shown empirically to yield test error rate estimates that suffer
neither from excessively high bias nor from very high variance.


#####  Cross-Validation on Classification Problems

Cross-validation can also be a very useful
approach in the classification setting when $Y$ is qualitative. In this setting,
cross-validation works just as described earlier in this chapter, except that
rather than using MSE to quantify test error, we instead use the number
of misclassified observations. For instance, in the classification setting, the
LOOCV error rate takes the form

$$
\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \text{Err}_i, \tag{5.4}
$$

where $\text{Err}_i = I(y_i \neq \hat{y}_i)$. The k-fold CV error rate and validation set error
rates are defined analogously.
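
As a rough illustration of (5.4), the sketch below computes the LOOCV misclassification rate for a small hand-written K-nearest-neighbors classifier. The classifier, the synthetic two-class data, and the helper names are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Majority vote among the k nearest training points (labels are 0/1, k odd)."""
    dist = np.sum((X_train - x_new) ** 2, axis=1)
    nearest = np.argsort(dist)[:k]
    return int(y_train[nearest].mean() > 0.5)

def loocv_error_rate(X, y, k):
    """LOOCV error rate: the average of Err_i = I(y_i != yhat_i)."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        errors[i] = knn_predict(X[keep], y[keep], X[i], k) != y[i]
    return float(errors.mean())

# Synthetic two-class data with a curved boundary (an assumption for illustration).
rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=n) > 1.0).astype(int)

err = loocv_error_rate(X, y, k=5)
print(err)
```

The only change from the regression case is the per-observation loss: an indicator of misclassification replaces the squared error.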
As an example, we fit various logistic regression models on the two-dimensional
classification data displayed in Figure 2.13. In the top-left
panel of Figure 5.7, the black solid line shows the estimated decision boundary
resulting from fitting a standard logistic regression model to this data
set. Since this is simulated data, we can compute the true test error rate,
which takes a value of 0.201 and so is substantially larger than the Bayes
error rate of 0.133. Clearly logistic regression does not have enough flexibility
to model the Bayes decision boundary in this setting. We can easily
extend logistic regression to obtain a non-linear decision boundary by using
polynomial functions of the predictors, as we did in the regression setting in
Section 3.3.2. For example, we can fit a quadratic logistic regression model,
given by

$$
\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 X_2^2. \tag{5.5}
$$

The top-right panel of Figure 5.7 displays the resulting decision boundary,
which is now curved. However, the test error rate has improved only slightly,
to 0.197. A much larger improvement is apparent in the bottom-left panel
of Figure 5.7, in which we have fit a logistic regression model involving
cubic polynomials of the predictors. Now the test error rate has decreased
to 0.160. Going to a quartic polynomial (bottom-right) slightly increases
the test error.

In practice, for real data, the Bayes decision boundary and the test error
rates are unknown. So how might we decide between the four logistic
regression models displayed in Figure 5.7? We can use cross-validation in
order to make this decision. The left-hand panel of Figure 5.8 displays in
black the 10-fold CV error rates that result from fitting ten logistic regression
models to the data, using polynomial functions of the predictors up
to tenth order. The true test errors are shown in brown, and the training
errors are shown in blue. As we have seen previously, the training error
tends to decrease as the flexibility of the fit increases. (The figure indicates
that though the training error rate doesn’t quite decrease monotonically,
it tends to decrease on the whole as the model complexity increases.) In
contrast, the test error displays a characteristic U-shape. The 10-fold CV
error rate provides a pretty good approximation to the test error rate.
While it somewhat underestimates the error rate, it reaches a minimum
when fourth-order polynomials are used, which is very close to the minimum
of the test curve, which occurs when third-order polynomials are
used. In fact, using fourth-order polynomials would likely lead to good test
set performance, as the true test error rate is approximately the same for
third, fourth, fifth, and sixth-order polynomials.

The right-hand panel of Figure 5.8 displays the same three curves using
the KNN approach for classification, as a function of the value of K
(which in this context indicates the number of neighbors used in the KNN
classifier, rather than the number of CV folds used). Again the training
error rate declines as the method becomes more flexible, and so we see that
the training error rate cannot be used to select the optimal value for K.
Though the cross-validation error curve slightly underestimates the test
error rate, it takes on a minimum very close to the best value for K.

####  The Bootstrap

The bootstrap is a widely applicable and extremely powerful statistical tool
that can be used to quantify the uncertainty associated with a given estimator
or statistical learning method. As a simple example, the bootstrap
can be used to estimate the standard errors of the coefficients from a linear
regression fit. In the specific case of linear regression, this is not particularly
useful, since we saw in Chapter 3 that standard statistical software such as
R outputs such standard errors automatically. However, the power of the
bootstrap lies in the fact that it can be easily applied to a wide range of
statistical learning methods, including some for which a measure of variability
is otherwise difficult to obtain and is not automatically output by
statistical software.

In this section we illustrate the bootstrap on a toy example in which we
wish to determine the best investment allocation under a simple model.
In Section 5.3 we explore the use of the bootstrap to assess the variability
associated with the regression coefficients in a linear model fit.

Suppose that we wish to invest a fixed sum of money in two financial
assets that yield returns of $X$ and $Y$, respectively, where $X$ and $Y$ are
random quantities. We will invest a fraction $\alpha$ of our money in $X$, and will
invest the remaining $1 - \alpha$ in $Y$. Since there is variability associated with
the returns on these two assets, we wish to choose $\alpha$ to minimize the total
risk, or variance, of our investment. In other words, we want to minimize
$\text{Var}(\alpha X + (1 - \alpha) Y)$. One can show that the value that minimizes the risk
is given by

$$
\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}, \tag{5.6}
$$

where $\sigma_X^2 = \text{Var}(X)$, $\sigma_Y^2 = \text{Var}(Y)$, and $\sigma_{XY} = \text{Cov}(X, Y)$.

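In reality, the quantities $\sigma_X^2$, $\sigma_Y^2$, and $\sigma_{XY}$ are unknown. We can compute
estimates for these quantities, $\hat{\sigma}_X^2$, $\hat{\sigma}_Y^2$, and $\hat{\sigma}_{XY}$, using a data set that
contains past measurements for $X$ and $Y$. We can then estimate the value
of $\alpha$ that minimizes the variance of our investment using

$$
\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}. \tag{5.7}
$$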
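
The plug-in estimate (5.7) is a one-liner once the sample variances and covariance are available. In the sketch below, `alpha_hat` is a hypothetical helper, and the returns are drawn from a bivariate normal distribution, which is an assumption since the text does not say how the pairs were simulated. The covariance values are the ones used for the simulations in this section.

```python
import numpy as np

def alpha_hat(x, y):
    """Plug-in estimate of the variance-minimizing allocation, as in (5.7)."""
    cov = np.cov(x, y)                     # 2 x 2 sample covariance matrix
    var_x, var_y, cov_xy = cov[0, 0], cov[1, 1], cov[0, 1]
    return float((var_y - cov_xy) / (var_x + var_y - 2.0 * cov_xy))

# Population parameters from the text: Var(X) = 1, Var(Y) = 1.25, Cov(X, Y) = 0.5.
true_cov = np.array([[1.0, 0.5],
                     [0.5, 1.25]])
alpha_true = (1.25 - 0.5) / (1.0 + 1.25 - 2.0 * 0.5)   # formula (5.6)

# Assumption: returns are bivariate normal with mean zero.
rng = np.random.default_rng(5)
returns = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=100)
x, y = returns[:, 0], returns[:, 1]

print(alpha_true, alpha_hat(x, y))
```

Swapping the roles of the two assets gives the complementary fraction, so `alpha_hat(x, y) + alpha_hat(y, x)` equals 1 up to rounding.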

Figure 5.9 illustrates this approach for estimating $\alpha$ on a simulated data
set. In each panel, we simulated 100 pairs of returns for the investments
$X$ and $Y$. We used these returns to estimate $\sigma_X^2$, $\sigma_Y^2$, and $\sigma_{XY}$, which we
then substituted into (5.7) in order to obtain estimates for $\alpha$. The value of
$\hat{\alpha}$ resulting from each simulated data set ranges from 0.532 to 0.657.

It is natural to wish to quantify the accuracy of our estimate of $\alpha$. To
estimate the standard deviation of $\hat{\alpha}$, we repeated the process of simulating
100 paired observations of $X$ and $Y$, and estimating $\alpha$ using (5.7),
1,000 times. We thereby obtained 1,000 estimates for $\alpha$, which we can call
$\hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_{1,000}$. The left-hand panel of Figure 5.10 displays a histogram
of the resulting estimates. For these simulations the parameters were set to
$\sigma_X^2 = 1$, $\sigma_Y^2 = 1.25$, and $\sigma_{XY} = 0.5$, and so we know that the true value of
$\alpha$ is 0.6. We indicated this value using a solid vertical line on the histogram.


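The mean over all 1,000 estimates for $\alpha$ is

$$
\bar{\alpha} = \frac{1}{1000} \sum_{r=1}^{1000} \hat{\alpha}_r = 0.5996,
$$

very close to $\alpha = 0.6$, and the standard deviation of the estimates is

$$
\sqrt{\frac{1}{1000 - 1} \sum_{r=1}^{1000} \left(\hat{\alpha}_r - \bar{\alpha}\right)^2} = 0.083.
$$

This gives us a very good idea of the accuracy of $\hat{\alpha}$: $\text{SE}(\hat{\alpha}) \approx 0.083$. So
roughly speaking, for a random sample from the population, we would
expect $\hat{\alpha}$ to differ from $\alpha$ by approximately 0.08, on average.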
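
The idealized experiment described above (drawing many fresh data sets from the population) can be imitated as follows. The bivariate normal population is an assumption, so the numbers printed need not match 0.5996 and 0.083 exactly; `alpha_hat` is a hypothetical helper implementing (5.7).

```python
import numpy as np

def alpha_hat(data):
    """Plug-in estimate (5.7) from an (n, 2) array whose columns are X and Y."""
    cov = np.cov(data, rowvar=False)
    return float((cov[1, 1] - cov[0, 1]) / (cov[0, 0] + cov[1, 1] - 2.0 * cov[0, 1]))

# Population parameters from the text; normality is an assumption.
true_cov = np.array([[1.0, 0.5],
                     [0.5, 1.25]])
rng = np.random.default_rng(6)

# 1,000 independent data sets of 100 paired observations each.
estimates = np.array([
    alpha_hat(rng.multivariate_normal([0.0, 0.0], true_cov, size=100))
    for _ in range(1000)
])

mean_alpha = float(estimates.mean())
se_alpha = float(estimates.std(ddof=1))    # divides by 1000 - 1, as in the text
print(round(mean_alpha, 4), round(se_alpha, 3))
```

This is only possible here because the population is known; with real data there is just one sample, which is the gap the bootstrap fills.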

In practice, however, the procedure for estimating $\text{SE}(\hat{\alpha})$ outlined above
cannot be applied, because for real data we cannot generate new samples
from the original population. However, the bootstrap approach allows us
to use a computer to emulate the process of obtaining new sample sets,
so that we can estimate the variability of $\hat{\alpha}$ without generating additional
samples. Rather than repeatedly obtaining independent data sets from the
population, we instead obtain distinct data sets by repeatedly sampling
observations from the original data set.

This approach is illustrated in Figure 5.11 on a simple data set, which
we call $Z$, that contains only $n = 3$ observations. We randomly select $n$
observations from the data set in order to produce a bootstrap data set,
$Z^{*1}$. The sampling is performed with replacement, which means that the
same observation can occur more than once in the bootstrap data set. In
this example, $Z^{*1}$ contains the third observation twice, the first observation
once, and no instances of the second observation. Note that if an observation
is contained in $Z^{*1}$, then both its $X$ and $Y$ values are included. We can use
$Z^{*1}$ to produce a new bootstrap estimate for $\alpha$, which we call $\hat{\alpha}^{*1}$. This
procedure is repeated $B$ times for some large value of $B$, in order to produce
$B$ different bootstrap data sets, $Z^{*1}, Z^{*2}, \ldots, Z^{*B}$, and $B$ corresponding $\alpha$
estimates, $\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, \ldots, \hat{\alpha}^{*B}$. We can compute the standard error of these
bootstrap estimates using the formula

$$
\text{SE}_B(\hat{\alpha}) = \sqrt{\frac{1}{B - 1} \sum_{r=1}^{B} \left( \hat{\alpha}^{*r} - \frac{1}{B} \sum_{r'=1}^{B} \hat{\alpha}^{*r'} \right)^2}. \tag{5.8}
$$

This serves as an estimate of the standard error of $\hat{\alpha}$ estimated from the
original data set.
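
A minimal sketch of the bootstrap standard error (5.8) is given below. It resamples rows of a single data set with replacement, so each sampled observation keeps both its $X$ and $Y$ values. The function names are hypothetical, and the single "observed" data set is simulated from a bivariate normal population as an assumption.

```python
import numpy as np

def alpha_hat(data):
    """Plug-in estimate (5.7) from an (n, 2) array whose columns are X and Y."""
    cov = np.cov(data, rowvar=False)
    return float((cov[1, 1] - cov[0, 1]) / (cov[0, 0] + cov[1, 1] - 2.0 * cov[0, 1]))

def bootstrap_se(data, statistic, B, rng):
    """Bootstrap standard error of `statistic`, following (5.8)."""
    n = len(data)
    estimates = np.empty(B)
    for r in range(B):
        idx = rng.integers(0, n, size=n)     # n row indices drawn with replacement
        estimates[r] = statistic(data[idx])  # estimate on the bootstrap data set
    return float(estimates.std(ddof=1))      # sample standard deviation, divisor B - 1

# One observed data set of 100 pairs (simulated here; normality is an assumption).
rng = np.random.default_rng(8)
true_cov = np.array([[1.0, 0.5],
                     [0.5, 1.25]])
data = rng.multivariate_normal([0.0, 0.0], true_cov, size=100)

se_boot = bootstrap_se(data, alpha_hat, B=1000, rng=rng)
print(round(alpha_hat(data), 3), round(se_boot, 3))
```

Because `bootstrap_se` takes the statistic as an argument, the same function can be reused for other estimators by passing a different callable.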

The bootstrap approach is illustrated in the center panel of Figure 5.10,
which displays a histogram of 1,000 bootstrap estimates of $\alpha$, each computed
using a distinct bootstrap data set. This panel was constructed on
the basis of a single data set, and hence could be created using real data.
Note that the histogram looks very similar to the left-hand panel, which
displays the idealized histogram of the estimates of $\alpha$ obtained by generating
1,000 simulated data sets from the true population. In particular the
bootstrap estimate $\text{SE}(\hat{\alpha})$ from (5.8) is 0.087, very close to the estimate of
0.083 obtained using 1,000 simulated data sets. The right-hand panel displays
the information in the center and left panels in a different way, via
boxplots of the estimates for $\alpha$ obtained by generating 1,000 simulated data
sets from the true population and using the bootstrap approach. Again, the
boxplots have similar spreads, indicating that the bootstrap approach can
be used to effectively estimate the variability associated with $\hat{\alpha}$.


####  Lab: Cross-Validation and the Bootstrap

In [1]:
import numpy as np
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                            summarize,
                            poly)
from sklearn.model_selection import train_test_split

In [2]:
from functools import partial
from sklearn.model_selection import \
    (cross_validate,
     KFold,
     ShuffleSplit)
from sklearn.base import clone
from ISLP.models import sklearn_sm


##### The Validation Set Approach

In [3]:
Auto = load_data('Auto')
Auto_train, Auto_valid = train_test_split(Auto,
                                          test_size=196,
                                          random_state=0)

In [4]:
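# Build the design matrix for horsepower from the training half only
hp_mm = MS(['horsepower'])
X_train = hp_mm.fit_transform(Auto_train)
y_train = Auto_train['mpg']
# Least squares fit of mpg on horsepower
model = sm.OLS(y_train, X_train)
results = model.fit()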

In [5]:
# Apply the already-fitted transform to the validation half, then compute the validation MSE
X_valid = hp_mm.transform(Auto_valid)
y_valid = Auto_valid['mpg']
valid_pred = results.predict(X_valid)
np.mean((y_valid - valid_pred)**2)

23.616617069669882

In [7]:
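def evalMSE(terms,
            response,
            train,
            test):
    """Fit OLS of `response` on `terms` using `train`; return the MSE on `test`."""
    mm = MS(terms)
    X_train = mm.fit_transform(train)
    y_train = train[response]

    # Reuse the transform fitted on the training data for the test data
    X_test = mm.transform(test)
    y_test = test[response]

    results = sm.OLS(y_train, X_train).fit()
    test_pred = results.predict(X_test)
    return np.mean((y_test - test_pred)**2)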

In [8]:
MSE = np.zeros(3)
for idx, degree in enumerate(range(1, 4)):
    MSE[idx] = evalMSE([poly('horsepower', degree)],
                       'mpg',
                       Auto_train,
                       Auto_valid)
MSE

array([23.61661707, 18.76303135, 18.79694163])

In [9]:
Auto_train, Auto_valid = train_test_split(Auto,
                                          test_size=196,
                                          random_state=3)
MSE = np.zeros(3)
for idx, degree in enumerate(range(1, 4)):
    MSE[idx] = evalMSE([poly('horsepower', degree)],
                       'mpg',
                       Auto_train,
                       Auto_valid)
MSE

array([20.75540796, 16.94510676, 16.97437833])

##### Cross-Validation