# **DSFM Exercise**: Cross-validation - Resampling to Evaluate Models (SOLUTION)

Creator: [Data Science for Managers - EPFL Program](https://www.dsfm.ch)  
Source:  [https://github.com/dsfm-org/code-bank.git](https://github.com/dsfm-org/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

Cross-validation is a resampling procedure for evaluating statistical models on a limited dataset. Compared with a single train/test split, it often gives a more reliable and less optimistic estimate of how well a model performs on unseen data.

k-fold cross-validation splits the dataset into *k* groups, which are used as training or testing datasets in an alternating manner. The general procedure is as follows:

1. Shuffle the dataset randomly.
2. Split the dataset into *k* groups.
3. For each unique group:
    1. Take that group as test data.
    2. Take the remaining groups as training data.
    3. Fit a model on the training data and evaluate it on the test data.
    4. Retain the evaluation score; discard the model.
4. Summarize model performance using the sample of evaluation scores collected.
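
As a rough sketch, the steps above can be written out by hand with NumPy. The helper name `manual_kfold_scores`, the toy data, the straight-line model fitted with `np.polyfit`, and the use of mean squared error as the evaluation score are all assumptions made for illustration; they are not part of the exercise.

```python
import numpy as np


def manual_kfold_scores(x, y, k=3, seed=0):
    """Hypothetical helper: return one test-set mean squared error per fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))       # Step 1: shuffle the dataset
    folds = np.array_split(indices, k)      # Step 2: split into k groups
    scores = []
    for i in range(k):                      # Step 3: each group is the test set once
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit a straight line on the training data only
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        predictions = slope * x[test_idx] + intercept
        # Retain the score; the fitted line is discarded on the next iteration
        scores.append(np.mean((y[test_idx] - predictions) ** 2))
    return np.array(scores)


# Made-up data: a noisy straight line
noise = np.random.default_rng(1).normal(scale=0.5, size=12)
x = np.arange(12, dtype=float)
y = 2.0 * x + 1.0 + noise

scores = manual_kfold_scores(x, y, k=3)
print('Score per fold:', scores)
print('Mean score:', scores.mean())         # Step 4: summarize the scores
```

Each observation lands in exactly one test fold, so every data point contributes to the final estimate once.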

<img src="kfold.png" width="800" height="800" align="center"/>


Image source: EPFL TIS Lab

When *k* = 5, for example, the procedure is called five-fold cross-validation (see the image above). The choice of *k* matters: as a starting point, values between 3 and 10 (inclusive) tend to work well. Whatever the value, each resulting train/test split should be representative of the entire dataset.
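
To get a feel for how *k* changes the splits, the short sketch below counts the test observations per fold for a few values of *k*. The ten-element array and the particular values of *k* are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up observations; only their count matters here
observations = np.arange(10)

fold_sizes = {}
for k in (3, 5, 10):
    # Collect the number of test observations in each of the k folds
    test_sizes = [len(test) for _, test in KFold(n_splits=k).split(observations)]
    fold_sizes[k] = test_sizes
    print(f'k={k}: test fold sizes {test_sizes}')
```

A larger *k* leaves more data for training in each round but makes each test fold smaller; with *k* = 10 on ten observations, every test fold holds a single observation.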

## **Part 0**: Setup

### Import packages

In [None]:
# We use the array data structure from NumPy
import numpy as np

# sklearn has k-fold cross-validation already implemented
from sklearn.model_selection import KFold

# **MAIN EXERCISE**

## **Part 1**: Make up some toy data

First, we come up with some toy observations that make it easier to follow the k-fold cross-validation procedure.

**Q 1**: Assign an array of the following observations to a variable named `data`:

`[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]`

In [None]:
# Toy data
data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

## **Part 2**: Set up cross-validation schema

Next, we set up the k-fold cross-validation schema.

**Q 1**: Use the `KFold` class from `sklearn.model_selection` with *k* = 3 (set via the `n_splits` parameter).

In [None]:
# Set up cross-validation schema
kfold = KFold(n_splits=3)

## **Part 3**: Apply cross-validation

Finally, show the train/test splits for each of the folds.

**Q 1**: Print the train/test splits for all *k* folds. Why are the observations split the way they are?

In [None]:
print('Data: ', data)
print()

# Show train/test splits
for train, test in kfold.split(data):
    print(f'Train: {data[train]}, Test: {data[test]}')
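
One likely explanation for the pattern in the splits: by default `KFold` does not shuffle (`shuffle=False`), so the test folds are consecutive blocks of the data in its original order. The sketch below contrasts this with `shuffle=True`; the `random_state` value of 42 is an arbitrary choice made only for reproducibility.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# Default behaviour: no shuffling, so test folds are consecutive blocks
ordered_tests = [data[test] for _, test in KFold(n_splits=3).split(data)]

# shuffle=True permutes the observations before they are split into folds
shuffled_kfold = KFold(n_splits=3, shuffle=True, random_state=42)
shuffled_tests = [data[test] for _, test in shuffled_kfold.split(data)]

for ordered, shuffled in zip(ordered_tests, shuffled_tests):
    print(f'Ordered test fold: {ordered}, Shuffled test fold: {shuffled}')
```

In both cases every observation appears in exactly one test fold; shuffling only changes which observations are grouped together.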

## **Further reading**

- [A gentle introduction to k-fold cross-validation](https://machinelearningmastery.com/k-fold-cross-validation/) on machinelearningmastery.com