# Cross Validation

We start with the fundamental assumption of machine learning:

> The data that you train your model on must come from the same distribution as the data you hope to apply the model to.

And there is no way around this. One million observations of the wrong thing are not going to help you build a model for the thing you want. This is just the Machine Learning version of the idea that a sample must be representative. Assuming you have the right data, though, how do you calculate a metric that gives you an idea of how the model will perform on unseen data?

To a certain extent, we have already been wrestling with this problem.
We have a *sample* of data from a process, but we don't know for sure all the values the process can or will ever generate (the "population").
And while we need to assume that, in general, the process is stable for us to even think about modeling it, the problem of inference remains.
We have so far looked at the problem of inference as, "given that the data is potentially consistent with a large number of models, which model(s) should I find the most credible?"
This is the Bayesian perspective.

The Machine Learning perspective is, "how will the model most likely perform on previously unseen data?"
What do we mean by unseen data?
Do we mean that the value $7.3$ never appeared in the training data? No.
We generally mean that the model is not evaluated on the same data that was used to train it.
As we noted previously, the data still has to come from the same distribution.
Where can we get this unseen data?
We've already noted that getting more data (additional samples) can be difficult, costly, or impossible.

One very simple way would be to split the data we have into a train(ing) set and a test set.
We could train our model (say, linear regression) on the train set and evaluate it (calculate MSE) on the test set.
We then get an estimate of MSE on unseen data for our model.
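
As a minimal sketch of that single split, the following uses NumPy on synthetic data. The data-generating process (`y = 2x + 1` plus noise), the 80/20 split, and all variable names are assumptions made for illustration, not anything prescribed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 2x + 1 + noise
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=n)

# Shuffle the row indices, then hold out 20% as the test set
indices = rng.permutation(n)
n_test = n // 5
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Fit a simple linear regression using the train set only
slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)

# Evaluate MSE on the test set, which the model never saw during fitting
y_hat = slope * x[test_idx] + intercept
test_mse = float(np.mean((y[test_idx] - y_hat) ** 2))
print(f"test MSE: {test_mse:.3f}")
```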

Unfortunately, that only gets us a single estimate and we'd like to get a sense of the overall variation in performance.
We could switch the roles of the train and test sets, training on the test set and evaluating on the train set, to get another estimate of MSE, but two isn't quite enough either.
Enter *cross validation*.

Borrowing ideas from bootstrapping, there's another option. You could divide your data into ten sections or *folds* and then loop over each fold, using it as the test set and the others as the training set. So, for the first iteration, you have Fold 1 as the test set and Folds 2-10 (combined) as the training set. The second iteration sees Fold 2 as the test set and Folds 1, 3-10 (combined) as the training set. And so on.

The iterations look like this:

|  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |  9  |  10 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **Test** | Train | Train | Train | Train | Train | Train | Train | Train | Train |

|  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |  9  |  10 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Train| **Test** | Train | Train | Train | Train | Train | Train | Train | Train |

|  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |  9  |  10 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Train| Train | **Test** | Train | Train | Train | Train | Train | Train | Train |

And finally:

|  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |  9  |  10 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Train| Train | Train | Train | Train | Train | Train | Train | Train | **Test** |

When you're done, you end up with 10 estimates of MSE. With ten estimates of MSE we're back in familiar statistical inference territory: we can estimate the mean MSE (which sounds funny if you say "mean Mean Squared Error") and its variance.
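
The fold loop can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `k_fold_mse`, the synthetic data, and the use of `np.polyfit` as the "model" are all made up for the example.

```python
import numpy as np

def k_fold_mse(x, y, k=10, seed=0):
    """Return one test-set MSE per fold for a simple linear regression."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))    # shuffle before making folds
    folds = np.array_split(indices, k)   # k nearly equal chunks of indices
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        residuals = y[test_idx] - (slope * x[test_idx] + intercept)
        scores.append(float(np.mean(residuals ** 2)))
    return np.array(scores)

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=300)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=300)

mse = k_fold_mse(x, y, k=10)
print(f"mean MSE: {mse.mean():.3f}, variance: {mse.var(ddof=1):.3f}")
```

Every observation lands in the test set exactly once, so the ten estimates together use all of the data for evaluation.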

There are other variants of cross-validation such as 3 fold and 5 fold. The *main* consideration on the number of folds is this: is each fold representative of the distribution you're trying to model? This often depends on:

1. How many observations do you have? 300 or 300 million?
2. How many variables do you have? The more variables you have, the "smaller" your data is from a machine learning perspective.
3. How many values can each categorical variable take? Each possible value of a categorical variable is a partitioning of your data. If you have a lot of categorical variables which can each take on many values, your 300 million observations may actually *not* be enough data.

There has been quite a bit of research on cross validation in general and how well it works. From my own vantage point of seeing students apply 10 fold cross validation to the same problem, I would suggest doing successive rounds of cross validation, creating new folds each time. You should do as many rounds as needed to reach at least 30 estimates of your metric.

For 10 fold cross validation, that's three (3) rounds. This is nice because, with 30 observations, you can do Bayesian bootstrap inference on your results and estimate the posterior distribution of the mean of your evaluation metric.
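
Here is a rough sketch of three rounds followed by bootstrap inference on the 30 estimates. It makes two assumptions worth flagging. First, the data and the helper `k_fold_mse` are hypothetical. Second, "Bayesian bootstrap" is read here as Rubin's version, which draws Dirichlet(1, ..., 1) weights over the observations. If what is meant is ordinary resampling with replacement, interpreted in a Bayesian way, swap the weighting step for `rng.choice(scores, size=len(scores))` and take the mean of each resample.

```python
import numpy as np

def k_fold_mse(x, y, k, rng):
    """One round of k-fold CV; each call shuffles, so folds differ per round."""
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        residuals = y[test_idx] - (slope * x[test_idx] + intercept)
        scores.append(np.mean(residuals ** 2))
    return scores

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=300)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=300)

# Three rounds of 10-fold CV with fresh folds each round: 3 * 10 = 30 estimates
scores = np.array([s for _ in range(3) for s in k_fold_mse(x, y, 10, rng)])

# Bayesian bootstrap of the mean: each draw reweights the 30 estimates
weights = rng.dirichlet(np.ones(len(scores)), size=2000)
posterior_means = weights @ scores

low, high = np.percentile(posterior_means, [2.5, 97.5])
print(f"posterior mean of MSE: {posterior_means.mean():.3f}")
print(f"95% interval: ({low:.3f}, {high:.3f})")
```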

Why not just do bootstrap sampling directly by taking a random 10% sample as the test set, hundreds of times?
I often do this because it's easier than setting up folds.
However, the literature gives this approach mixed reviews.
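
For comparison, the random-sample approach might look like the sketch below. Because each test set is drawn without replacement and independently of the others, this is sometimes called repeated random sub-sampling or Monte Carlo cross validation rather than bootstrapping in the strict sense. The data, the 200 repetitions, and the variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(3)
n = 300
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=n)

n_test = n // 10   # a random 10% of the rows is held out each time
scores = []
for _ in range(200):
    idx = rng.permutation(n)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    residuals = y[test_idx] - (slope * x[test_idx] + intercept)
    scores.append(float(np.mean(residuals ** 2)))

print(f"mean MSE over {len(scores)} random splits: {np.mean(scores):.3f}")
```

Unlike folds, these test sets overlap from one repetition to the next, and nothing guarantees that every observation is tested the same number of times.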

One final word of advice: *randomize*. Never use your data just as you found it. Make sure you randomize your data each time before establishing your folds (in other words, *shuffle* your data each time before you designate folds).

## Implementation

It's not difficult to implement N-Fold Cross Validation on your own.
You can select the indices of your dataframe (or data), randomize them, "chunk" them into N batches, and iterate.
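
A minimal sketch of that recipe, using only the standard library, is shown here. The function name `n_fold_indices` and its parameters are hypothetical. The index lists it yields could be used to select rows from a dataframe or array.

```python
import random

def n_fold_indices(n_obs, n_folds, seed=None):
    """Yield (train, test) index lists for N-fold cross validation."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)   # randomize the row order
    # "Chunk" the shuffled indices: fold i takes every n_folds-th index from i
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # Training indices are all the folds except the current test fold
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(n_fold_indices(n_obs=25, n_folds=5, seed=1))
for train, test in splits:
    print(len(train), len(test))
```
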
However, Scikit Learn does implement [N-Fold Cross Validation](https://scikit-learn.org/stable/modules/cross_validation.html).
You will have to work with Scikit Learn's classes directly instead of the `models` module.
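
As a rough sketch of what working with those classes can look like, the following uses `KFold`, `LinearRegression`, and `mean_squared_error` from Scikit Learn. The synthetic data and the choice of model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 1.0, size=300)

# shuffle=True randomizes the rows before the folds are designated
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kfold.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    scores.append(mean_squared_error(y[test_idx], predictions))

print(f"mean MSE: {np.mean(scores):.3f}, variance: {np.var(scores, ddof=1):.3f}")
```

Note that `KFold` does not shuffle by default, so `shuffle=True` is needed to follow the advice about randomizing. For several rounds with new folds each time, `RepeatedKFold` in the same module takes `n_splits` and `n_repeats`.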