# Resampling Methods

This notebook follows chapter 5 of ISLR, which covers resampling methods for assessing how well a model performs. It includes the following topics:
* Validation Set
* Leave-One-Out Cross Validation (LOOCV)
* K-fold Cross Validation
* Bias-Variance Tradeoff for K-fold Cross Validation
* Cross-Validation in Classification Problems
* Bootstrap
* Applied Cross Validation on ISLR Auto Dataset

## Resampling

Resampling means repeatedly drawing samples from a training data set and refitting a model on each sample to see how the fitted models differ. This gives us more insight into a model's performance than fitting it once. The two most commonly used resampling methods are:

1. Cross Validation (CV): Used to evaluate the test error of the model or to select the level of flexibility of a model.
    * Evaluating model performance is known as ***model assessment***
    * Selecting the proper level of flexibility is known as ***model selection***
2. Bootstrap: Used to measure the accuracy of a parameter estimate or of a learning method

## Training Error and Test Error

***Test Error***: The average error obtained from using a model to make a prediction on new data.

***Training Error:*** The average error obtained from using a model to make predictions against the training set of data.

We always evaluate model performance using the test error.
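
To illustrate why the training error is not a reliable guide, here is a minimal sketch on simulated data (the noisy sine curve, the 50/50 split, and the helper `mse` are assumptions made for this example, not part of ISLR). It fits polynomials of increasing degree and records both errors; the training MSE can only go down as flexibility grows, while the test MSE need not.

```r
set.seed(1)

# Simulated data (an assumption for illustration): a noisy sine curve
n <- 200
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

# Hold out half of the observations as new, unseen data
train_idx <- sample(n, n / 2)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Hypothetical helper: mean squared error of a fitted model on a data set
mse <- function(fit, newdata) mean((newdata$y - predict(fit, newdata))^2)

# Fit polynomials of increasing flexibility and record both errors
degrees <- 1:10
errors <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))
print(round(errors, 3))
```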

## Validation Set

One way to estimate the test error is to split the data into a training dataset and a test (or hold-out) dataset. Our modeling approach follows these steps:

1. Divide the data into a training set and a test set (see the note below the list)
2. Fit model on training dataset
3. Use fitted model to predict the responses for the data points in the test set
4. Evaluate model performance using a measure of error like MSE

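Note: if comparing multiple models, it is best to divide the data into training, test, and validation sets: compare the candidate models on the test set, then evaluate the final chosen model on the validation set. (Many texts use these two names the other way round, with the validation set used to compare models and the test set reserved for the final evaluation.)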

<img src="Training_Test.PNG">

## Applying validation set approach to ISLR `Default` dataset

In [None]:
library(ISLR)
data(Default)

In [3]:
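# Share of rows assigned to each set
spec <- c(train = .8, test = .2)

# Cut the row indices into consecutive blocks sized by `spec`, then shuffle
# the block labels so rows are assigned at random. round() guards against
# floating-point drift in the cumulative sums.
g <- sample(cut(
  seq(nrow(Default)),
  round(nrow(Default) * cumsum(c(0, spec))),
  labels = names(spec)
))

split_data_set <- split(Default, g)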
attributes(split_data_set)

In [5]:
sapply(split_data_set, nrow)/nrow(Default)

Now that we have split the dataset into a training and a test set we can proceed to train a machine learning model on the training set and evaluate model performance on the test set.

Alternatively, we could split our data into training, test, and validation sets if we are evaluating more than one model. We train our prospective models on the training dataset, compare model performance on the test dataset, and finally evaluate the chosen model's "true" predictive capability on the validation set.

In [6]:
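# Share of rows assigned to each set
spec <- c(train = .7, test = .2, validation = .1)

# round() matters here: in floating point, .7 + .2 + .1 can fall slightly
# below 1, which would leave the last row outside every interval (NA).
g <- sample(cut(
  seq(nrow(Default)),
  round(nrow(Default) * cumsum(c(0, spec))),
  labels = names(spec)
))

split_data_set <- split(Default, g)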
attributes(split_data_set)

In [7]:
sapply(split_data_set, nrow)/nrow(Default)
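
The three-way workflow described above can be sketched end to end. This is a minimal illustration on simulated data, not the `Default` analysis: the quadratic signal, the two candidate models, and the helper `mse` are all assumptions made for the example. The set names follow this notebook's usage (compare on test, final check on validation).

```r
set.seed(1)

# Simulated data (assumed for illustration): quadratic signal plus noise
n <- 500
dat <- data.frame(x = runif(n, -2, 2))
dat$y <- 1 + dat$x^2 + rnorm(n, sd = 0.5)

# Random 70/20/10 split, using the set names from this notebook
labels <- rep(c("train", "test", "validation"), times = c(350, 100, 50))
parts <- split(dat, sample(labels))

# Hypothetical helper: mean squared error of a fitted model on a data set
mse <- function(fit, newdata) mean((newdata$y - predict(fit, newdata))^2)

# Step 1: fit each candidate model on the training set
candidates <- list(
  linear    = lm(y ~ x, data = parts$train),
  quadratic = lm(y ~ poly(x, 2), data = parts$train)
)

# Step 2: compare the candidates on the test set
test_mse <- sapply(candidates, mse, newdata = parts$test)
best <- names(which.min(test_mse))

# Step 3: report the chosen model's error on the untouched validation set
final_mse <- mse(candidates[[best]], parts$validation)
print(test_mse)
print(c(chosen = best, validation_mse = round(final_mse, 3)))
```

The validation set plays no part in choosing between the candidates, which is what makes the final number an honest estimate for the chosen model.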

## Drawbacks of Validation Set Approach

While easy to understand and implement, the validation set approach suffers from a few issues:

* The validation estimate of the test error can be highly variable, since it depends on which observations happen to land in the training set and which in the held-out set
* Since only a subset of the observations is used to fit the model, and models tend to perform worse when trained on fewer observations, the held-out error tends to overestimate the test error of a model fit on the entire dataset

***Cross Validation*** addresses both of these issues.
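
The first drawback is easy to see in a small experiment. The sketch below uses simulated data and a hypothetical helper `validation_mse` (both assumptions for illustration): it repeats the random 50/50 split ten times on the same data and prints the resulting test MSE estimates, which differ from split to split.

```r
set.seed(42)

# Simulated data (assumed for illustration): linear signal plus noise
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

# One run of the validation set approach: random 50/50 split, return test MSE
validation_mse <- function(data) {
  train_idx <- sample(nrow(data), nrow(data) / 2)
  fit <- lm(y ~ x, data = data[train_idx, ])
  held_out <- data[-train_idx, ]
  mean((held_out$y - predict(fit, held_out))^2)
}

# Repeat with 10 different random splits of the same data
estimates <- replicate(10, validation_mse(dat))
print(round(estimates, 3))
print(range(estimates))
```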

## Leave-One-Out Cross Validation (LOOCV)

LOOCV also uses a training and a test dataset, but is structured as follows:
* Only one observation $(x_{1} \, , \, y_{1})$ is used for the test set
* All other $(n-1)$ observations $\left \{ (x_{2} \, , \, y_{2}) \, , \, \ldots \, , \, (x_{n} \, , \, y_{n}) \right \}$ are used to train the model
* A prediction $\hat{y}_{1}$ is made on the test datapoint $x_{1}$
* Calculate the error term, for example $MSE_{1}=(y_{1} - \hat{y}_{1})^{2}$
    * This error term is an approximately unbiased estimate for the test error
    * However, this test error is highly variable as it is made up of only one data point!

To reduce the variance associated with this method, we repeat the whole procedure $n$ times, holding out each data point in turn, to come up with $MSE_{1} \, , \, \ldots \, , \, MSE_{n}$

The LOOCV estimate for the test MSE is the average of these $n$ MSEs:

$$
CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}MSE_{i}
$$

<img src="LOOCV.PNG">
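
The LOOCV procedure above can be written out by hand in a few lines. This is a minimal sketch on simulated data (the data and variable names are assumptions for illustration). It also checks the result against the shortcut that ISLR gives for least squares fits (equation 5.2), which needs only one model fit plus the leverage values $h_{i}$.

```r
set.seed(1)

# Simulated data (assumed for illustration)
n <- 50
dat <- data.frame(x = rnorm(n))
dat$y <- 3 - 1.5 * dat$x + rnorm(n)

# LOOCV by brute force: hold out observation i, fit on the other n - 1
mse_i <- sapply(seq_len(n), function(i) {
  fit <- lm(y ~ x, data = dat[-i, ])
  (dat$y[i] - predict(fit, dat[i, , drop = FALSE]))^2
})
cv_n <- mean(mse_i)

# Shortcut for least squares (ISLR eq. 5.2): one fit plus leverages h_i
full_fit <- lm(y ~ x, data = dat)
cv_n_shortcut <- mean((residuals(full_fit) / (1 - hatvalues(full_fit)))^2)

print(c(brute_force = cv_n, shortcut = cv_n_shortcut))
```

The two numbers agree up to floating-point error. The shortcut applies to linear models fit by least squares, not to models in general.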

## Applying LOOCV approach to ISLR `Default` dataset

In [13]:
if (!require(boot)) install.packages("boot")
library(boot)



help(cv.glm)

In [18]:
head(Default,3)

| default | student | balance | income |
|---|---|---|---|
| No | No | 729.5265 | 44361.63 |
| No | Yes | 817.1804 | 12106.13 |
| No | No | 1073.5492 | 31767.14 |


In [21]:
Default$dft <- ifelse(Default$default == "Yes", 1, 0)

In [22]:
head(Default,3)

| default | student | balance | income | dft |
|---|---|---|---|---|
| No | No | 729.5265 | 44361.63 | 0 |
| No | Yes | 817.1804 | 12106.13 | 0 |
| No | No | 1073.5492 | 31767.14 | 0 |


reference: https://gerardnico.com/lang/r/cross_validation#leave-one-out

Let's train a model with a quantitative response variable to demonstrate LOOCV. Later we'll apply the exact same logic to a classification problem.

In [29]:
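# With no `family` argument, glm() fits an ordinary linear regression
glm.fit <- glm(balance ~ dft + income, data = Default)

# LOOCV: cv.glm() defaults to K = n, i.e. one fit per observation
loocv <- cv.glm(Default, glm.fit)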

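We access the `delta` component of the `loocv` object returned by `cv.glm` to get the estimate $CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}MSE_{i}$
* The first element is the raw cross-validation estimate of the prediction error (here, the MSE)
* The second element is a bias-corrected version of that estimate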

loocv$delta

## Pros and Cons of Using LOOCV

* Pros:
    * LOOCV has far less bias than the training/test approach since each model is trained on $(n-1)$ observations, almost the entire dataset
    * After repeating the process for all $n$ data points we end up with a measure of model performance that has used all data points to train/test the model
    * LOOCV always gives the same result when repeated, since there is no randomness in how the data is split, unlike the training/test approach
* Cons:
    * Computationally expensive since we must do $n$ model runs
    * Often infeasible for very large datasets or for very computationally intense model runs

## K-fold CV

Another variant of cross validation is ***k-fold CV***:
* Divide the data into $k$ groups, or folds, of about the same size
* Train the model on $(k-1)$ folds and test it on the held-out fold
* Measure the model error on the held-out fold using a measure like MSE
* Repeat the procedure $k$ times, holding out a different fold each time, yielding $k$ estimates of the test error $MSE_{1} \, , \, \ldots \, , \, MSE_{k}$
* The k-fold CV estimate for the test MSE is the average of these $k$ MSEs:

$$
CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k}MSE_{i}
$$

***LOOCV is a special case of K-fold CV where $k=n$***

<img src="k-fold.PNG">

K-fold CV is often a more practical choice than LOOCV because it requires only $k$ model fits instead of $n$. This is advantageous with large datasets or computationally intensive models.
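
The k-fold procedure can be sketched by hand as follows. The data are simulated and `kfold_cv` is a hypothetical helper written for this example, not a function from any package. Setting $k=n$ puts every observation in its own fold, which reproduces LOOCV.

```r
set.seed(1)

# Simulated data (assumed for illustration)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 3 - 1.5 * dat$x + rnorm(n)

# kfold_cv is a hypothetical helper, not part of any package
kfold_cv <- function(data, k) {
  # Randomly assign each row to one of k folds of near-equal size
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_mse <- sapply(seq_len(k), function(j) {
    fit <- lm(y ~ x, data = data[fold != j, ])
    held_out <- data[fold == j, ]
    mean((held_out$y - predict(fit, held_out))^2)
  })
  # Average the k fold-level MSEs
  mean(fold_mse)
}

cv_5  <- kfold_cv(dat, k = 5)
cv_10 <- kfold_cv(dat, k = 10)
cv_n  <- kfold_cv(dat, k = nrow(dat))  # k = n reproduces LOOCV
print(c(k5 = cv_5, k10 = cv_10, loocv = cv_n))
```

Unlike LOOCV, the results for $k=5$ and $k=10$ depend on the random fold assignment, so they change if the seed changes.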

## Applying k-fold CV approach to ISLR `Default` dataset

help(cv.glm)

In [30]:
k_fold <- cv.glm(Default, glm.fit, K = 3)

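We access the `delta` component of the `k_fold` object returned by `cv.glm` to get the estimate $CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k}MSE_{i}$
* The first element is the raw cross-validation estimate of the prediction error (here, the MSE)
* The second element is an adjusted estimate that compensates for the bias introduced by not using LOOCV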

In [31]:
k_fold$delta

## Pros and Cons of Using k-fold CV

* Pros:
    * Model is only fit $k$ times compared to $n$ times with LOOCV
    * After repeating the process for all $k$ folds we end up with a measure of model performance that has used all data points to train/test the model
* Cons:
    * Can still be computationally expensive since we must do $k$ model runs

## Bias-Variance Tradeoff with k-fold CV

* Since the model is fit only $k$ times rather than $n$ times (where $k<n$), k-fold CV is computationally cheaper than LOOCV
* Less obviously, ***k-fold CV often gives more accurate test error rate estimates than LOOCV***
    * This is due to the bias-variance tradeoff
    * LOOCV yields approximately unbiased estimates of the test error since each training set has $(n-1)$ observations, which is almost the whole dataset
    * The test/training approach, by contrast, tends to overestimate the true test error rate because it trains on only part of the data
    
Doing k-fold CV with $k=5$ or $k=10$ will lead to some level of bias since each test set contains $\frac{n}{k}$ observations and each training set contains $n-\frac{n}{k}=n\left(1-\frac{1}{k}\right)=n\frac{\left(k-1\right)}{k}$ observations.
* Training set size in k-fold CV is smaller than in LOOCV, but generally larger than in the test/training approach
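
To make the arithmetic concrete, the short sketch below tabulates the test and training set sizes for a few values of $k$, taking $n=10000$ (the number of rows in `Default`) as an example. The data frame `sizes` and its column names are made up for this illustration.

```r
n <- 10000           # number of rows in the Default data set, as an example
k <- c(2, 5, 10, n)  # k = n corresponds to LOOCV

sizes <- data.frame(
  k = k,
  test_size   = n / k,            # observations in each held-out fold
  train_size  = n * (k - 1) / k,  # observations used to fit each model
  train_share = (k - 1) / k       # fraction of the data used for fitting
)
print(sizes)
```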

From a bias reduction standpoint, ***LOOCV is preferred to k-fold CV***.

However, ***the LOOCV test error estimate has higher variance than the k-fold CV estimate with $k<n$***:
* The $n$ fitted models in LOOCV are trained on almost identical data, so their outputs are highly correlated with one another
* The mean of many highly correlated quantities has higher variance than the mean of quantities that are less correlated, as in k-fold CV

LOOCV is in effect averaging the outputs of $n$ fitted models, each of which is trained on an almost identical set of data.

In contrast, when we perform k-fold CV with $k<n$, we are averaging the outputs of $k$ fitted models that are somewhat less correlated with each other, since the overlap between the training sets is smaller.
* In practice, $k=5$ or $k=10$ is a good compromise: the resulting test error estimates suffer from neither excessively high bias nor very high variance

## Cross-validation in Classification Problems

We can use the same resampling methods discussed above in classification problems. The only thing that changes is the error term: by default we use the misclassification rate (other classification metrics, like the F1 score, can be substituted):

$$
CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}Err_{i}
$$

where $Err_{i} = I\,(y_{i} \neq \hat{y}_{i})$, i.e., $Err_{i} = 1$ if $y_{i} \neq \hat{y}_{i}$ and $0$ otherwise.

This formulation of the error term holds for the training/test approach, LOOCV, and k-fold CV.
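
As a minimal sketch of cross-validation for a classifier, the block below runs 10-fold CV for a logistic regression on simulated data and averages the misclassification rate across folds. The data, the coefficients used to generate them, and the 0.5 probability threshold are all assumptions made for this example.

```r
set.seed(1)

# Simulated binary response (assumed for illustration)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 2 * dat$x1 - dat$x2))

k <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit logistic regression on the other folds, then
# compute the share of held-out observations that are misclassified
fold_err <- sapply(seq_len(k), function(j) {
  fit <- glm(y ~ x1 + x2, family = binomial, data = dat[fold != j, ])
  p_hat <- predict(fit, dat[fold == j, ], type = "response")
  y_hat <- as.integer(p_hat > 0.5)  # classify as 1 when P(y = 1) > 0.5
  mean(y_hat != dat$y[fold == j])
})

cv_error <- mean(fold_err)
print(cv_error)
```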

<img src="CV Classification.PNG">

The charts above further demonstrate the need for cross-validation. As the model becomes more complex, it fits the training data better and better, so the training error keeps falling. This apparent improvement can be very misleading, as evidenced by the "U" shape of the test error curve in this case.

## Bootstrap

Bootstrap is a widely applicable and powerful tool to ***quantify the uncertainty*** associated with a given estimator or machine learning model.
* For example, bootstrap can be used to estimate the standard errors of the coefficients from a linear regression fit
* Bootstrap is powerful because we can find measures of variability that are not automatically output by statistical software
* See ISLR pages 195-197 for more details on the lab below
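
The linear regression example can be sketched in base R without any extra packages. This is not the ISLR lab itself, only a minimal illustration on simulated data (the data, `B = 1000`, and the variable names are assumptions). Each bootstrap sample draws $n$ rows with replacement, the model is refit, and the standard deviation of the refitted coefficients serves as the bootstrap standard error.

```r
set.seed(1)

# Simulated data (assumed for illustration)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

# Refit the model on B bootstrap samples (rows drawn with replacement)
B <- 1000
boot_coefs <- t(replicate(B, {
  idx <- sample(n, replace = TRUE)
  coef(lm(y ~ x, data = dat[idx, ]))
}))

# Bootstrap standard error = standard deviation across the B refits
boot_se <- apply(boot_coefs, 2, sd)

# Formula-based standard errors from lm, for comparison
lm_se <- summary(lm(y ~ x, data = dat))$coefficients[, "Std. Error"]

print(rbind(bootstrap = boot_se, formula = lm_se))
```

Here the model assumptions hold by construction, so the two sets of standard errors should be close. The bootstrap becomes more useful when no convenient formula exists.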

# TODO
## ISLR Bootstrap Lab