# Resampling Methods
Topics covered in this chapter of the book:

* 5.1 Cross-Validation
  * 5.1.1 The Validation Set Approach
  * 5.1.2 Leave-One-Out Cross-Validation
  * 5.1.3 k-Fold Cross-Validation
  * 5.1.4 Bias-Variance Trade-Off for k-Fold Cross-Validation
  * 5.1.5 Cross-Validation on Classification Problems
* 5.2 The Bootstrap

**The following is a summary of the concepts, along with data and Python code.**


_______

**Resampling methods** involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.
Two of the most commonly used resampling methods are *cross-validation* and the *bootstrap*.
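
To make the idea of "draw a sample, refit, repeat" concrete, here is a minimal sketch on synthetic data. The data-generating model (`y = 3x + noise`), the number of resamples, and all variable names are assumptions chosen for illustration; they are not taken from the book. The sketch draws samples with replacement from one training set, refits a straight line on each, and looks at how much the fitted slope varies.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training set (assumed for illustration): y = 3x + noise
n = 100
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(scale=1.0, size=n)

slopes = []
for _ in range(500):
    # Draw n row indices with replacement from the training set
    idx = rng.integers(0, n, size=n)
    # Refit a straight line on the resampled rows; polyfit returns [slope, intercept]
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    slopes.append(slope)

slopes = np.array(slopes)
# The spread of the refitted slopes is the "additional information" about the fit
print("mean slope:", slopes.mean())
print("std of slope across resamples:", slopes.std())
```

The spread of `slopes` gives a rough sense of how sensitive the fitted model is to the particular training sample, which a single fit cannot show.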

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import scipy
import pandas as pd 
import math
import random

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.regressionplots import *
from sklearn import datasets, linear_model

### Cross-Validation (CV)

CV is used to estimate the test error rate of a given model, either to evaluate its performance (model assessment) or to select the appropriate level of flexibility (model selection).

There are multiple ways to do this; we will take them one at a time.

1. **The Validation Set Approach**

Here we randomly divide the available set of samples into two parts: a training set and a validation or hold-out set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation-set error provides an estimate of the test error. This is typically assessed using MSE in the case of a quantitative response and misclassification rate in the case of a qualitative (discrete) response.

This approach has several drawbacks:

* The validation estimate of the test error can be highly variable, depending on which observations happen to fall in the training set and which in the validation set.
* Only a subset of the observations (those in the training set rather than the validation set) is used to fit the model.
* Because statistical methods tend to perform worse when trained on fewer observations, the validation set error may overestimate the test error for the model fit on the entire data set.
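
The validation set approach and its first drawback can be sketched on synthetic data. Everything here is an assumption for illustration: the quadratic data-generating model, the helper name `validation_mse`, and the choice of ten random splits. The sketch fits a polynomial on a random half of the data, computes the MSE on the other half, and repeats with different splits to show how much the estimate moves.

```python
import numpy as np

# Synthetic data (assumed for illustration): a noisy quadratic relationship
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3, 3, size=n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=n)

def validation_mse(x, y, degree, seed):
    """Fit a polynomial on a random half of the data; return MSE on the other half."""
    split_rng = np.random.default_rng(seed)
    idx = split_rng.permutation(len(x))        # shuffle the row indices
    half = len(x) // 2
    train_idx, val_idx = idx[:half], idx[half:]
    coefs = np.polyfit(x[train_idx], y[train_idx], deg=degree)  # fit on training half
    preds = np.polyval(coefs, x[val_idx])      # predict on validation half
    return float(np.mean((y[val_idx] - preds) ** 2))

# Repeat the split with different seeds: same data, same model, different estimates
mse_by_split = [validation_mse(x, y, degree=2, seed=s) for s in range(10)]
print("validation MSE per split:", np.round(mse_by_split, 3))
print("min / max:", round(min(mse_by_split), 3), "/", round(max(mse_by_split), 3))
```

Each entry of `mse_by_split` is a legitimate validation-set estimate of the same test error; the gap between the smallest and largest values is the variability described in the first drawback.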

In [4]:
## Let's try this approach on the automobile data
Auto = pd.read_csv('../data/Auto.csv', header=0, na_values='?')
Auto = Auto.dropna().reset_index(drop=True)  # drop observations with NA values and reindex from 0
Auto

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
387,27.0,4,140.0,86.0,2790,15.6,82,1,ford mustang gl
388,44.0,4,97.0,52.0,2130,24.6,82,2,vw pickup
389,32.0,4,135.0,84.0,2295,11.6,82,1,dodge rampage
390,28.0,4,120.0,79.0,2625,18.6,82,1,ford ranger


In [12]:
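np.random.seed(1)
# Randomly pick half of the row indices (without replacement) for the training set
train = np.random.choice(Auto.shape[0], int(Auto.shape[0]/2), replace=False)
# Boolean mask: True for training rows, False for validation rows
select = np.isin(np.arange(Auto.shape[0]), train)  # np.isin supersedes the deprecated np.in1d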
train

array([ 81, 165, 351, 119, 379, 236,  78,  92,  80, 333, 278, 307, 283,
       218, 366,   4, 385, 324,   6, 167, 146, 132, 120, 228,   5, 290,
       214, 197, 162, 338, 260, 232,  67, 383, 224, 185, 161, 250, 377,
        18, 229,  62, 122, 125, 106, 160, 102, 371, 189,  93,  65, 251,
       311, 389, 329, 172, 304,  17, 306, 246, 381, 180, 343, 247, 192,
       174, 207,  11, 291,  41, 318, 289, 213, 315,  23, 293,  13,  90,
        61, 334, 258, 139, 310, 349,  95, 358, 327, 294, 127, 191,  82,
       361, 222,  27,  89, 305,  73, 274, 257, 287, 107, 204,  98, 233,
       117, 277, 360, 244,  39, 159, 301, 355, 345,  59,  12, 303, 163,
        91, 332, 391, 388,  29, 273, 292,  85,  58, 354, 188, 171, 348,
       369, 298,  88, 131, 124, 230,  14, 271, 123, 138, 111,  51, 112,
         9, 175,  16, 173,   0, 105, 179, 201,  70,  38, 150, 359, 375,
       372, 145,  42, 227, 223, 208, 186, 386, 285, 272, 147, 326, 100,
        34, 110, 234, 135, 368, 154,  19, 248, 158, 267,  44, 33

In [7]:
Auto.shape[0]/2

196.0