<center><h1>Cross Validation</h1></center>

## 1. Introduction

We learned about train/test validation, a simple technique for testing a machine learning model's accuracy on new data that the model wasn't trained on. In this lesson, we'll focus on more robust techniques.

To start, we'll focus on the **holdout validation** technique, which involves:

- splitting the full dataset into 2 partitions:
    - a training set
    - a test set
- training the model on the training set,
- using the trained model to predict labels on the test set,
- computing an error metric to understand the model's effectiveness,
- switching the training and test sets and repeating,
- averaging the errors.

In holdout validation, we usually use a 50/50 split instead of the 75/25 split from train/test validation. This way, we remove the number of observations as a potential source of variation in our model performance.

<img src="figs/holdout_validation.png" width="600" height="400" />


Let's start by splitting the data set into 2 nearly equivalent halves.

When splitting the data set, make a copy of each slice with `.copy()` to avoid unexpected results later on. If you run the code locally in Jupyter Notebook or JupyterLab without `.copy()`, you may see a `SettingWithCopyWarning`. It won't stop your code from running, but it tells you that an assignment may be landing on a copy of a slice from a dataframe rather than on the dataframe itself. To avoid the warning, call `.copy()` whenever you slice a dataframe and plan to modify the result.
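The shuffle-then-split pattern with `.copy()` can be sketched as follows. This is a minimal illustration on a small made-up dataframe (`toy` is a hypothetical stand-in, not `dc_listings`), and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dc_listings: 7 rows, two columns.
toy = pd.DataFrame({
    "accommodates": [1, 2, 2, 3, 4, 4, 6],
    "price": [50.0, 80.0, 75.0, 110.0, 150.0, 140.0, 220.0],
})

# Shuffle the row order with a seeded permutation of the row positions.
np.random.seed(1)
shuffled = toy.iloc[np.random.permutation(len(toy))]

# Slice into two nearly equal halves. .copy() makes each half an
# independent dataframe, so later assignments act on the half itself.
half = (len(shuffled) + 1) // 2
split_one = shuffled.iloc[:half].copy()
split_two = shuffled.iloc[half:].copy()

# Adding a column to a copy leaves the original dataframe untouched.
split_one["fold"] = 1.0
print(len(split_one), len(split_two))
print("fold" in toy.columns)
```

With an odd number of rows, the first half receives the extra row, which matches the 1862/1861 split used in this lesson.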

### Exercise

- Use the `numpy.random.permutation()` function to shuffle the ordering of the rows in `dc_listings`.
- Select the first 1862 rows and assign to `split_one`.
- Select the remaining 1861 rows and assign to `split_two`.


In [None]:
import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

In [None]:
dc_listings = pd.read_csv("dc_airbnb.csv")

In [None]:
dc_listings['price'] = dc_listings['price'].str.replace(r'[\$,]', '', regex=True).astype('float')
print(dc_listings['price'].head())

In [None]:
print(len(dc_listings) / 2, '\n')
print(dc_listings.head())

In [None]:
dc_listings = dc_listings.iloc[np.random.default_rng(2021).permutation(len(dc_listings))]
print(dc_listings.head())

## 2. Holdout Validation
Now that we've split our data set into 2 dataframes, `split_one` and `split_two`, let's:

- train a k-nearest neighbors model on the first half,
- test this model on the second half,
- train a k-nearest neighbors model on the second half,
- test this model on the first half.
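The four steps above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (not the Airbnb data set), `holdout_rmse` is a hypothetical helper name, and the mapping of `train_one`/`test_one`/`train_two`/`test_two` onto the two halves is inferred from the steps rather than given in the lesson.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption, not the Airbnb listings).
rng = np.random.default_rng(0)
accommodates = rng.integers(1, 9, size=200)
price = 40.0 * accommodates + rng.normal(0, 25, size=200)
toy = pd.DataFrame({"accommodates": accommodates, "price": price})

split_one = toy.iloc[:100].copy()
split_two = toy.iloc[100:].copy()

# Assumed mapping: each half plays both roles, once as train, once as test.
train_one, test_one = split_one, split_two
train_two, test_two = split_two, split_one


def holdout_rmse(train_df, test_df):
    """Fit a KNN regressor on train_df and return the RMSE on test_df."""
    knn = KNeighborsRegressor()  # defaults: n_neighbors=5, algorithm='auto'
    knn.fit(train_df[["accommodates"]], train_df["price"])
    predictions = knn.predict(test_df[["accommodates"]])
    return float(np.sqrt(mean_squared_error(test_df["price"], predictions)))


iteration_one_rmse = holdout_rmse(train_one, test_one)
iteration_two_rmse = holdout_rmse(train_two, test_two)
avg_rmse = np.mean([iteration_one_rmse, iteration_two_rmse])
print(iteration_one_rmse, iteration_two_rmse, avg_rmse)
```

Taking the square root of the MSE with `np.sqrt` avoids depending on any RMSE-specific option of `mean_squared_error`, which has changed between scikit-learn versions.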

### Exercise

- Train a k-nearest neighbors model using the default algorithm (`auto`) and the default number of neighbors (`5`) that:
    - Uses the `accommodates` column from `train_one` for training and
    - Tests it on `test_one`.
- Assign the resulting RMSE value to `iteration_one_rmse`.
- Train a k-nearest neighbors model using the default algorithm (`auto`) and the default number of neighbors (`5`) that:
    - Uses the `accommodates` column from `train_two` for training and
    - Tests it on `test_two`.
- Assign the resulting RMSE value to `iteration_two_rmse`.
- Use `numpy.mean()` to calculate the average of the 2 RMSE values and assign to `avg_rmse`.

## 3. K-Fold Cross Validation

If we average the two RMSE values from the last step, we get an RMSE value of approximately 140.32. Holdout validation is actually a specific example of a larger class of validation techniques called k-fold cross-validation. While holdout validation is better than train/test validation because the model isn't repeatedly biased towards a specific subset of the data, each of the two models is trained on only half of the available data. K-fold cross validation, on the other hand, takes advantage of a larger proportion of the data during training while still rotating through different subsets of the data to avoid the issues of train/test validation.

Here's the algorithm for k-fold cross validation:

- splitting the full dataset into `k` equal-length partitions,
- selecting `k-1` partitions as the training set and the remaining partition as the test set,
- training the model on the training set,
- using the trained model to predict labels on the test fold,
- computing the test fold's error metric,
- repeating the selection, training, and evaluation steps `k-1` more times, until each partition has been used as the test set for an iteration,
- calculating the mean of the `k` error values.

Holdout validation is essentially a version of k-fold cross validation when `k` is equal to `2`. Generally, `5` or `10` folds are used for k-fold cross-validation. Here's a diagram describing each iteration of 5-fold cross validation:

<img src="figs/kfold_cross_validation.png" width="800" height="600" />

As you increase the number of folds, the number of observations in each fold decreases and the variance of the fold-by-fold errors increases. Let's start by manually partitioning the data set into 5 folds. Instead of splitting into 5 dataframes, let's add a column that specifies which fold each row belongs to. This way, we can easily select our training set and testing set.
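One way to build such a fold column is sketched below on a small made-up dataframe (`toy` is hypothetical and has 23 rows rather than 3723). It uses `np.array_split`, which gives any extra rows to the earliest folds, so on the real data its boundaries could differ by a row from the ones listed in the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with 23 rows (dc_listings has 3723).
toy = pd.DataFrame({"accommodates": np.arange(23) % 6 + 1})

n_folds = 5
# Divide the row positions 0..22 into 5 nearly equal, contiguous chunks.
chunks = np.array_split(np.arange(len(toy)), n_folds)

# Build a float array of fold labels, one label per row position.
fold_labels = np.empty(len(toy), dtype=float)
for fold_number, positions in enumerate(chunks, start=1):
    fold_labels[positions] = fold_number
toy["fold"] = fold_labels

# Confirm the folds are roughly the same size and that no row was missed.
print(toy["fold"].value_counts().sort_index())
print(toy["fold"].isnull().sum())
```

Because the labels are assigned by position, the dataframe should already be shuffled before this step, otherwise each fold reflects the original row order.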

### Exercise

- Add a new column to `dc_listings` named `fold` that contains the fold number each row belongs to:
    - Fold `1` should have rows from index `0` up to `745`, not including `745`.
    - Fold `2` should have rows from index `745` up to `1490`, not including `1490`.
    - Fold `3` should have rows from index `1490` up to `2234`, not including `2234`.
    - Fold `4` should have rows from index `2234` up to `2978`, not including `2978`.
    - Fold `5` should have rows from index `2978` up to `3723`, not including `3723`.
- Make sure `fold`'s type is a float type.
- Display the unique value counts for the `fold` column to confirm that each fold has roughly the same number of elements.
- Display the number of missing values in the `fold` column to confirm we didn't miss any rows.

## 4. First iteration

Let's start by performing the first iteration of k-fold cross validation on a simple, univariate model.

### Exercise

- Train a k-nearest neighbors model using the `accommodates` column as the sole feature from folds `2` to `5` as the training set.
- Use the model to make predictions on the test set (`accommodates` column from fold `1`) and assign the predicted labels to `labels`.
- Calculate the RMSE value by comparing the `price` column with the predicted labels.
- Assign the RMSE value to `iteration_one_rmse`.

## 5. Function for training models

From the first iteration, we achieved an RMSE value of roughly **128**. Let's calculate the RMSE values for the remaining iterations. To make the iteration process easier, let's wrap the code we wrote in the previous section in a function.


### Exercise

- Write a function named `train_and_validate` that takes in a dataframe as the first parameter (`df`) and a list of fold values (`1` to `5` in our case) as the second parameter (`folds`). This function should:
    - Train `n` models (where `n` is number of folds) and perform k-fold cross validation (using `n` folds). Use the default `k` value for the `KNeighborsRegressor` class.
    - Return a list of RMSE values, where the first element is the RMSE for when fold `1` was the test set, the second element is the RMSE for when fold `2` was the test set, and so on.
- Use the `train_and_validate` function to return the list of RMSE values for the `dc_listings` dataframe and assign to `rmses`.
- Calculate the mean of these values and assign to `avg_rmse`.
- Display both `rmses` and `avg_rmse`.
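A function along these lines might look like the sketch below. It assumes the dataframe already has `accommodates`, `price`, and `fold` columns, and it is demonstrated on synthetic data rather than `dc_listings`, so the printed values are not the lesson's results.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor


def train_and_validate(df, folds):
    """Return one RMSE per fold, using that fold as the test set."""
    fold_rmses = []
    for fold in folds:
        # Every row outside the current fold is used for training.
        train = df[df["fold"] != fold]
        test = df[df["fold"] == fold]
        model = KNeighborsRegressor()  # default n_neighbors=5
        model.fit(train[["accommodates"]], train["price"])
        labels = model.predict(test[["accommodates"]])
        mse = mean_squared_error(test["price"], labels)
        fold_rmses.append(float(np.sqrt(mse)))
    return fold_rmses


# Synthetic stand-in data with a precomputed fold column (an assumption).
rng = np.random.default_rng(0)
n_rows = 250
accommodates = rng.integers(1, 9, size=n_rows)
toy = pd.DataFrame({
    "accommodates": accommodates,
    "price": 40.0 * accommodates + rng.normal(0, 25, size=n_rows),
    "fold": np.repeat([1.0, 2.0, 3.0, 4.0, 5.0], n_rows // 5),
})

rmses = train_and_validate(toy, [1, 2, 3, 4, 5])
avg_rmse = np.mean(rmses)
print(rmses)
print(avg_rmse)
```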

## 6. Performing K-Fold Cross Validation Using Scikit-Learn

While the average RMSE value was approximately 135, the RMSE values ranged from 102 to 159+. This large amount of variability between the RMSE values means that we're using either a poor model or a poor evaluation criterion (or a bit of both!). By implementing your own k-fold cross-validation function, you hopefully acquired a good understanding of the inner workings of the technique. The function we wrote, however, has many limitations. If we want to change the number of folds, for example, we need to make the function more general so that it also handles randomizing the order of the rows in the dataframe and splitting them into folds.

In machine learning, we're interested in building a good model and accurately understanding how well it will perform. To build a better k-nearest neighbors model, we can change the features it uses or tweak the number of neighbors (a hyperparameter). To accurately understand a model's performance, we can perform k-fold cross validation and select the proper number of folds. We've learned how scikit-learn makes it easy for us to quickly experiment with these different knobs when it comes to building a better model. Let's now dive into how we can use scikit-learn to handle cross-validation as well.

First, we create an instance of the [KFold class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold) from `sklearn.model_selection`:

```python
from sklearn.model_selection import KFold

# General form: KFold(n_splits, shuffle=False, random_state=None)
kf = KFold(n_splits=5, shuffle=False, random_state=None)
```

where:

- `n_splits` is the number of folds you want to use,
- `shuffle` is used to toggle shuffling of the ordering of the observations in the dataset,
- `random_state` is used to specify the random seed value if `shuffle` is set to `True`.
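To see what a `KFold` instance does once it is given data, the short sketch below calls its `split()` method on a made-up array of 10 rows (the array `X` is purely illustrative). Each iteration yields the row positions of the training set and of the test set for one fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# The splitter is configured without any reference to the data.
kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Made-up feature array with 10 rows, used only for illustration.
X = np.arange(10).reshape(-1, 1)

test_positions = []
for train_index, test_index in kf.split(X):
    # train_index and test_index are arrays of row positions.
    print(len(train_index), len(test_index))
    test_positions.append(test_index)

# Every row appears in exactly one test fold.
all_test = np.sort(np.concatenate(test_positions))
print(all_test)
```

With 10 rows and 5 folds, each iteration yields 8 training positions and 2 test positions.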

Notice that none of the `KFold` parameters depend on the data set. That's because a `KFold` instance only describes how to split the data. It generates the actual train and test indices when it's used together with the data, for example by the [cross_val_score()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function, also from `sklearn.model_selection`. Together, the `KFold` class and the `cross_val_score()` function allow us to compactly train and test using k-fold cross validation.

Here are the relevant parameters for the `cross_val_score` function:

```python
from sklearn.model_selection import cross_val_score

# General form (relevant parameters only):
# cross_val_score(estimator, X, y, scoring=None, cv=None)
```

where:

- `estimator` is a sklearn model that implements the `fit` method (e.g. instance of KNeighborsRegressor),
- `X` is a list or 2D array containing the features you want to train on,
- `y` is a list containing the values you want to predict (target column),
- `scoring` is a string describing the scoring criterion (the accepted values are listed in the scikit-learn documentation),
- `cv` describes how to split the data into folds. Here are some examples of accepted values:
    - an instance of the `KFold` class,
    - an integer representing the number of folds.

Depending on the scoring criterion you specify, a single score is returned for each fold. Here's the general workflow for performing k-fold cross-validation using the classes we just described:

- instantiate the scikit-learn model class you want to fit,
- instantiate the `KFold` class and use the parameters to specify the k-fold cross-validation attributes you want,
- use the `cross_val_score()` function to return the scoring metric you're interested in.
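The workflow above can be sketched as follows. The data here is synthetic (an assumption, standing in for `dc_listings`), so the scores it prints are not the lesson's results.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption, not the Airbnb listings).
rng = np.random.default_rng(0)
accommodates = rng.integers(1, 9, size=250)
toy = pd.DataFrame({
    "accommodates": accommodates,
    "price": 40.0 * accommodates + rng.normal(0, 25, size=250),
})

# 1. Instantiate the model.
knn = KNeighborsRegressor()

# 2. Instantiate the splitter with the desired fold settings.
kf = KFold(n_splits=5, shuffle=True, random_state=1)

# 3. Score the model on each fold. This scorer returns negated MSE values,
#    because scikit-learn treats higher scores as better.
mses = cross_val_score(
    knn,
    toy[["accommodates"]],
    toy["price"],
    scoring="neg_mean_squared_error",
    cv=kf,
)

# Undo the sign flip, then convert each MSE to an RMSE.
rmses = np.sqrt(np.absolute(mses))
avg_rmse = np.mean(rmses)
print(rmses)
print(avg_rmse)
```

Passing the feature as `toy[["accommodates"]]` (double brackets) keeps it two-dimensional, which is the shape scikit-learn estimators expect for `X`.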


### Exercise

- Create a new instance of the `KFold` class with the following properties:
    - `5` folds,
    - shuffle set to `True`,
    - random seed set to `1` (so the results can be checked against the same seed),
    - assigned to the variable `kf`.
- Create a new instance of the `KNeighborsRegressor` class and assign to `knn`.
- Use the `cross_val_score()` function to perform k-fold cross-validation:
    - using the KNeighborsRegressor instance `knn`,
    - using the `accommodates` column for training,
    - using the `price` column as the target column,
    - using the string `neg_mean_squared_error` as the value of the `scoring` parameter,
    - using `kf` as the value of the `cv` parameter
    - returning an array of MSE values (one value for each fold).
- Assign the resulting list of MSE values to `mses`. Then, take the absolute value followed by the square root of each MSE value. Then, calculate the average of the resulting RMSE values and assign to `avg_rmse`.


## 7. Exploring Different K Values
Choosing the right `k` value when performing k-fold cross validation is more of an art and less of a science. As we discussed earlier in the lesson, a `k` value of `2` is really just holdout validation. On the other end, setting `k` equal to `n` (the number of observations in the data set) is known as **leave-one-out cross validation**, or LOOCV for short. Through lots of trial and error, data scientists have converged on `10` as the standard k value.

In the following code block, we display the results of varying `k` from `3` to `23`. For each `k` value, we calculate and display the average RMSE value across all of the folds and the standard deviation of the RMSE values. Across the many different `k` values, it seems like the average RMSE value is around `129`. You'll notice that the standard deviation of the RMSE increases from approximately `8` to over `40` as we increase the number of folds.
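A sketch of such a sweep is shown below. It runs on synthetic stand-in data, so it will not reproduce the figures quoted above, and the choice of odd fold counts from `3` to `23` is an assumption about which values were tried.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption, not the Airbnb listings).
rng = np.random.default_rng(0)
accommodates = rng.integers(1, 9, size=300)
toy = pd.DataFrame({
    "accommodates": accommodates,
    "price": 40.0 * accommodates + rng.normal(0, 25, size=300),
})

# Assumed selection of fold counts: 3, 5, 7, ..., 23.
fold_counts = range(3, 24, 2)

results = []
for n_folds in fold_counts:
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=1)
    mses = cross_val_score(
        KNeighborsRegressor(),
        toy[["accommodates"]],
        toy["price"],
        scoring="neg_mean_squared_error",
        cv=kf,
    )
    rmses = np.sqrt(np.absolute(mses))
    avg_rmse = float(np.mean(rmses))
    std_rmse = float(np.std(rmses))
    results.append((n_folds, avg_rmse, std_rmse))
    print(f"{n_folds} folds: avg RMSE = {avg_rmse:.2f}, std RMSE = {std_rmse:.2f}")
```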



## 8. Bias-Variance Tradeoff

So far, we've been working under the assumption that a lower RMSE always means that a model is more accurate. This isn't the complete picture, unfortunately. A model has two sources of error, **bias** and **variance**.

Bias describes error that results from bad assumptions about the learning algorithm. For example, assuming that only one feature, like a car's weight, relates to a car's fuel efficiency will lead you to fit a simple, univariate regression model that will result in high bias. The error rate will be high since a car's fuel efficiency is affected by many other factors besides just its weight.

Variance describes error that occurs because of the variability of a model's predicted values. If we were given a dataset with 1000 features on each car and used every single feature to train an incredibly complicated multivariate regression model, we would have low bias but high variance. In an ideal world, we want low bias and low variance, but in reality, there's always a tradeoff.

The standard deviation of the RMSE values can be a proxy for a model's **variance** while the average RMSE is a proxy for a model's **bias**. Bias and variance are the 2 observable sources of error in a model that we can indirectly control.


<img src="figs/bias_variance.png" width="600" height="400" />

While k-nearest neighbors can make predictions, it isn't a mathematical model. A mathematical model is usually an equation that can exist without the original data, which isn't true with k-nearest neighbors. In the next two courses, we'll learn about a mathematical model called linear regression. We'll explore the bias-variance tradeoff in greater depth in these next 2 courses because of its importance when working with mathematical models in particular.