# Cross-Validation Techniques

Choosing a model, along with its optimal set of features and parameter specifications, is a complex process that largely determines the success of an algorithm. We estimate this success using our chosen model metrics. However, for those metrics to be trustworthy, we need confidence that the model's performance will generalize to unseen data. This is where cross-validation comes in: using these techniques, we divide our data into training and test datasets to get a good estimate of the model's generalization error and to choose an appropriate bias-variance trade-off.

### The holdout method  

In the holdout method, we partition the data into a training set and a test set, where the former is used for training the model and the latter for evaluating it. However, we will usually have many different versions of a given model to evaluate, with each evaluation informing the decisions made in the next version. If we use (and reuse) the test set when tuning our model, then the test data is effectively being used for training. We are then likely to overfit not only to the training set but to the test set as well, rendering our model metrics inaccurate.

To prevent this, we further split the training set into a training set and a validation set. We then use the validation set for model selection and touch the test set only once, after we have chosen our model, to get an accurate measure of its generalization error. The advantage of having a test set that the model has not seen during the training and model selection steps is that we obtain a less biased estimate of its ability to generalize to new data.

<img src="extras/06_02.png" width="500" height="500" />

A disadvantage of the holdout method is that the performance estimate is sensitive to how we partition the training set into the training and validation subsets; the estimate will vary for different samples of the data. Additionally, the samples held out for validation are not used to fit the model, so any useful information they contain is unavailable during training.
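
As a rough illustration of the three-way holdout split described above, the sketch below uses synthetic data from `make_classification` as a stand-in for a real dataset. The split sizes, the candidate values of `C`, and all variable names are assumptions chosen for the example, not part of the original material.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; any feature matrix X and labels y would do)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# First carve off the test set, which is used only once at the very end
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Then split the remainder into a training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0, stratify=y_trainval)

# Model selection: compare candidate settings on the validation set only
val_scores = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_scores[C] = model.score(X_val, y_val)
best_C = max(val_scores, key=val_scores.get)

# Final estimate: refit on training + validation data, then score the test set once
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_trainval, y_trainval)
test_accuracy = final_model.score(X_test, y_test)

print('validation accuracy per C:', val_scores)
print('test accuracy: %.3f' % test_accuracy)
```

Changing `random_state` in the second split and rerunning is a quick way to see how much the validation scores move with the partition.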

### K-fold cross-validation 

In k-fold cross-validation, we randomly split the training dataset into k folds without replacement, where k-1 folds are used for the model training and one fold is used for testing. This procedure is repeated k times so that we obtain k models and performance estimates. 

We then calculate the average performance of the models based on the different, independent folds to obtain a performance estimate that is less sensitive to the subpartitioning of the training data compared to the holdout method. Typically, we use k-fold cross-validation for model tuning, that is, finding the optimal hyperparameter values that yield a satisfying generalization performance. Once we have found satisfactory hyperparameter values, we can retrain the model on the complete training set and obtain a final performance estimate using the independent test set.
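
The loop described above can be written out by hand with scikit-learn's `KFold`. This is a minimal sketch on synthetic data; the dataset, the choice of logistic regression, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption for this sketch)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = []
test_counts = np.zeros(len(X), dtype=int)  # how often each sample lands in a test fold

for train_idx, test_idx in kfold.split(X):
    # Fit a fresh model on the k-1 training folds
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Evaluate on the single held-out fold
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
    test_counts[test_idx] += 1

# Average the k per-fold estimates into one performance estimate
mean_score = np.mean(fold_scores)
print('fold accuracies:', np.round(fold_scores, 3))
print('mean accuracy: %.3f +/- %.3f' % (mean_score, np.std(fold_scores)))
print('every sample tested exactly once:', bool((test_counts == 1).all()))
```

The `test_counts` array is only a sanity check that the folds partition the data, so that every sample is held out once.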

Since k-fold cross-validation is a resampling technique without replacement, each sample point is used for testing exactly once (and for training in the remaining k-1 iterations), which yields a lower-variance estimate of the model performance than the holdout method. The following figure summarizes the concept behind k-fold cross-validation with k=10. The training dataset is divided into 10 folds, and during each of the 10 iterations, 9 folds are used for training and 1 fold is used as the test set for the model evaluation. The estimated performances Ei (for example, classification accuracy or error) for each fold are then used to calculate the estimated average performance E of the model:

<img src="extras/06_03.png" width="600" height="600" />

The best value of k depends on the size of our dataset. A common default is k = 10, but for smaller datasets we can increase it, and for larger datasets we can decrease it. A higher k means that each model is trained on a larger share of the training data, which lowers the bias of the performance estimate; however, it also tends to raise the variance of the estimate, because the training sets overlap heavily and the resulting models become very similar to one another. Additionally, the higher the k, the higher the computational cost of fitting and evaluating our model, which is why we want to use a lower k for large datasets.

The extreme case of k is known as the _leave-one-out method_, where the number of folds equals the number of samples in the training set (k = n), so that only one sample is used for testing during each iteration. This approach is preferred when our dataset is extremely small.
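
A small sketch of the leave-one-out case, using scikit-learn's `LeaveOneOut` splitter on a deliberately tiny synthetic dataset (the data and model choice are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# A very small synthetic dataset (an assumption; leave-one-out targets this regime)
X, y = make_classification(n_samples=40, n_features=5, random_state=0)

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X)  # one split per sample, so k = n

# Each score is the accuracy on a single held-out sample, i.e. either 0.0 or 1.0
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

print('number of fits:', n_splits)
print('leave-one-out accuracy: %.3f' % np.mean(scores))
```

Because the model is refit once per sample, the cost grows linearly with the dataset size.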

In _stratified k-fold cross-validation_, the class proportions are preserved in each fold to ensure that each fold is representative of the class proportions in the training set. In the case of unequal class proportions, this method has been shown to deliver better bias and variance estimates.
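
To make the effect of stratification concrete, the sketch below counts the positive samples that end up in each test fold with `StratifiedKFold` and with plain `KFold`. The toy labels (90 negatives, 10 positives) are an assumption chosen so that the proportions are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 negatives and 10 positives (an assumption for this sketch)
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # the features are irrelevant for the split itself

# Stratified folds preserve the 10% positive rate in every test fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
strat_pos = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]

# Plain folds ignore the labels, so the positive count per fold can drift
kf = KFold(n_splits=5, shuffle=True, random_state=0)
plain_pos = [int(y[test_idx].sum()) for _, test_idx in kf.split(X)]

print('positives per test fold (stratified):', strat_pos)
print('positives per test fold (plain):     ', plain_pos)
```

With 10 positives and 5 folds, the stratified splitter places 2 positives in every test fold, whereas the plain splitter gives whatever the shuffle happens to produce.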

### Nested k-fold cross-validation 

If we are choosing the optimal set of features and parameter specifications for a given algorithm, then k-fold cross-validation is usually sufficient. However, if we additionally want to choose between multiple algorithms, then we need nested k-fold cross-validation. This method gives a less biased estimate of each algorithm's performance, so that the comparison between algorithms is fair.

In nested cross-validation, we have an outer k-fold cross-validation loop to split the data into training and test folds, and an inner loop is used to select the model using k-fold cross-validation on the training fold. After model selection, the test fold is then used to evaluate the model performance. The following figure explains the concept of nested cross-validation with five outer and two inner folds, which can be useful for large data sets where computational performance is important; this particular type of nested cross-validation is also known as 5x2 cross-validation:

<img src="extras/06_07.png" width="500" height="500" />

Nested cross-validation avoids the optimistically biased performance estimates that result from using the same cross-validation loop both to set the hyperparameters of the model (e.g., the regularization parameter C and the kernel parameters of an SVM) and to estimate its performance.

### Important caveats

When doing k-fold cross-validation, we must be careful not to select our features before doing the train-test splits. Instead, feature selection needs to happen within each of the k iterations, using only that iteration's training folds. This prevents _knowledge leaking_ from the test set to the training set, which would otherwise make the model evaluation metric overly optimistic. If the different iterations recommend different features, then we can use a voting method to pick the top-performing features and use those to fit the final model.
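
One way to keep feature selection inside each iteration is to put it in a scikit-learn `Pipeline`, so that `cross_val_score` refits the selector on the training folds only. The sketch below assumes synthetic data and uses `SelectKBest` with `f_classif` as a stand-in for whatever selection method is actually in use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data with many uninformative features (an assumption for this sketch)
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

# The selector is a pipeline step, so it is fit on the training folds of each
# iteration and never sees the corresponding test fold
pipe = Pipeline([
    ('select', SelectKBest(score_func=f_classif, k=5)),
    ('clf', LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=10)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
```

Calling `SelectKBest(...).fit_transform(X, y)` on the full data before `cross_val_score` would be the leaky variant that this paragraph warns against.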

In the case of imbalanced data (in the class of the target), we might want to consider oversampling the minority class or undersampling the majority class during cross-validation, but only after the fold split and only on the training folds. The test fold then keeps the original class distribution and shares no duplicated samples with the training data, which results in more accurate measures of the model metrics (sensitivity, specificity, AUC).
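
The sketch below shows one possible way to oversample after the fold split, using `sklearn.utils.resample` to duplicate minority-class samples in the training folds only. The synthetic data, the choice of random oversampling, and the variable names are assumptions for illustration; dedicated libraries offer more sophisticated resamplers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data, roughly 90% class 0 and 10% class 1 (an assumption)
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
recalls = []
train_class_counts = []

for train_idx, test_idx in skf.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    minority = y_tr == 1
    n_major = int((~minority).sum())

    # Oversample the minority class with replacement, within the training folds only
    X_min_up, y_min_up = resample(X_tr[minority], y_tr[minority], replace=True,
                                  n_samples=n_major, random_state=0)
    X_bal = np.vstack([X_tr[~minority], X_min_up])
    y_bal = np.concatenate([y_tr[~minority], y_min_up])
    train_class_counts.append(tuple(np.bincount(y_bal)))

    # The test fold is left untouched, so it keeps the original class distribution
    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    recalls.append(recall_score(y[test_idx], model.predict(X[test_idx])))

print('balanced training class counts per fold:', train_class_counts)
print('sensitivity (recall): %.3f +/- %.3f' % (np.mean(recalls), np.std(recalls)))
```

Oversampling before the split would place copies of the same minority sample in both the training and test folds, which is the leak this ordering avoids.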

## Coding Example

In our example, we will use the Titanic dataset.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
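# Load the Titanic training data
data = pd.read_csv('Data/titanic_train.csv')

# Features: one-hot encoded sex (first level dropped) and ticket fare
X = pd.concat([pd.get_dummies(data['Sex'], drop_first=True), data['Fare']], axis=1)
y = data['Survived']

# Hold out 30% of the data as the independent test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)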

#### K-fold cross-validation

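We can use sklearn's `cross_val_score` to perform our k-fold cross-validation and return the score for each fold, which we can then average. When the estimator is a classifier and `cv` is an integer, `cross_val_score` uses stratified k-fold splits.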

In [3]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=1000, random_state=0)

scores = cross_val_score(estimator=lr,
                         X=X_train,
                         y=y_train,
                         cv=10,
                         n_jobs=1)
print('CV accuracy scores: %s' % scores)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

CV accuracy scores: [ 0.71428571  0.74603175  0.84126984  0.76190476  0.87301587  0.79365079
  0.79032258  0.81967213  0.83606557  0.75409836]
CV accuracy: 0.793 +/- 0.047


Additionally, we can use k-fold cross-validation to find the optimal hyperparameters with `GridSearchCV`.

In [6]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_lr = Pipeline([('scl', StandardScaler()),
            ('clf', lr)])

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

param_grid = [{'clf__C': param_range, 
               }]

gs = GridSearchCV(estimator=pipe_lr, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  cv=10,
                  n_jobs=-1)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

0.792937399679
{'clf__C': 0.0001}


#### Nested k-fold cross-validation

Let's use nested k-fold cross-validation to compare the logistic regression model to a support vector machine.

In [5]:
gs = GridSearchCV(estimator=pipe_lr,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=2)

# Note: cv=2 in the inner GridSearchCV above and cv=5 in the outer
# cross_val_score below produce the 5 x 2 nested CV shown in the figure.

scores = cross_val_score(gs, X_train, y_train, scoring='accuracy', cv=5)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

CV accuracy: 0.793 +/- 0.035


In [8]:
from sklearn.svm import SVC

param_range = [0.0001, 1000.0]

param_grid = [{'C': param_range, 
               'kernel': ['rbf']}]

gs = GridSearchCV(estimator=SVC(),
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=2,
                  n_jobs=-1)

scores = cross_val_score(gs, X_train, y_train, scoring='accuracy', cv=5)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

CV accuracy: 0.769 +/- 0.058


## References

- http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation
- http://www.alfredo.motta.name/cross-validation-done-wrong/
- http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
- Raschka, Sebastian. Python Machine Learning. Packt Publishing, 2015, Birmingham, UK.
- http://stats.stackexchange.com/questions/27750/feature-selection-and-cross-validation