### Cross Validation

In an earlier mission, we learned about train/test validation, a simple technique for testing a machine learning model's accuracy on new data that the model wasn't trained on. In this mission, we'll focus on more robust techniques.

To start, we'll focus on the holdout validation technique, which involves:

- splitting the full dataset into 2 partitions:
  - a training set
  - a test set
- training the model on the training set,
- using the trained model to predict labels on the test set,
- computing an error metric to understand the model's effectiveness,
- switching the training and test sets and repeating,
- averaging the errors.

In holdout validation, we usually use a 50/50 split instead of the 75/25 split from train/test validation. This way, we remove the number of observations as a potential source of variation in our model performance.
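
The steps above can be sketched as a small helper. This is only a minimal illustration under stated assumptions: `holdout_rmse` is a hypothetical name, and the data is synthetic stand-in data rather than the Airbnb listings used in the cells that follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor


def holdout_rmse(X, y, seed=0):
    """Train on one half, test on the other, swap the halves, and average the RMSEs."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(X))
    half = len(X) // 2
    first, second = shuffled[:half], shuffled[half:]

    rmses = []
    # Each half takes one turn as the training set and one turn as the test set.
    for train_idx, test_idx in ((first, second), (second, first)):
        model = KNeighborsRegressor()
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[test_idx])
        rmses.append(mean_squared_error(y[test_idx], predictions) ** 0.5)
    return rmses, float(np.mean(rmses))


# Synthetic data (an assumption for illustration): price grows with capacity, plus noise.
rng = np.random.default_rng(42)
X = rng.integers(1, 9, size=(200, 1)).astype(float)
y = 40 * X[:, 0] + rng.normal(0, 25, size=200)

rmses, avg_rmse = holdout_rmse(X, y)
print(rmses, avg_rmse)
```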

In [63]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

dc_listings = pd.read_csv("data/dc_airbnb.csv")
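# Strip commas and dollar signs; regex=True is required because pandas 2.0+ treats the pattern literally by default.
dc_listings['price'] = dc_listings['price'].str.replace(r'[,$]', '', regex=True).astype('float')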

# Use the numpy.random.permutation() function to shuffle the ordering of the rows in dc_listings.
randomized_index = np.random.permutation(dc_listings.index)
# Select the first 1862 rows and assign to split_one.
split_one = dc_listings.loc[randomized_index[:1862]]
# Select the remaining 1861 rows and assign to split_two.
split_two = dc_listings.loc[randomized_index[1862:]]

In [12]:
train_one = split_one
test_one = split_two
train_two = split_two
test_two = split_one

In [22]:
# Train a k-nearest neighbors model using the default algorithm (auto) and the default number of neighbors (5) that:
# Uses the accommodates column from train_one for training and
# Tests it on test_one.
# Assign the resulting RMSE value to iteration_one_rmse.
knn = KNeighborsRegressor()
knn.fit(train_one[['accommodates']], train_one['price'])
predictions = knn.predict(test_one[['accommodates']])
iteration_one_rmse = mean_squared_error(test_one['price'], predictions) ** (0.5)
# Train a k-nearest neighbors model using the default algorithm (auto) and the default number of neighbors (5) that:
# Uses the accommodates column from train_two for training and
# Tests it on test_two.
# Assign the resulting RMSE value to iteration_two_rmse.
knn = KNeighborsRegressor()
knn.fit(train_two[['accommodates']], train_two['price'])
predictions = knn.predict(test_two[['accommodates']])
iteration_two_rmse = mean_squared_error(test_two['price'], predictions) ** (0.5)
# Use numpy.mean() to calculate the average of the 2 RMSE values and assign to avg_rmse.
avg_rmse = np.mean((iteration_one_rmse, iteration_two_rmse))
iteration_one_rmse, iteration_two_rmse, avg_rmse

(127.90266516816538, 138.6984822541034, 133.3005737111344)

If we average the two RMSE values from the last step, we get an RMSE value of approximately 133.30. Holdout validation is actually a specific example of a larger class of validation techniques called k-fold cross-validation. Holdout validation is better than train/test validation because the model isn't repeatedly biased towards a specific subset of the data, but each of the two models is still trained on only half of the available data. K-fold cross-validation, on the other hand, takes advantage of a larger proportion of the data during training while still rotating through different subsets of the data to avoid the issues of train/test validation.
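
To see the relationship between the two techniques, here is a small sketch (using ten toy row indices, an assumption made purely for illustration) showing that scikit-learn's `KFold` with `n_splits=2` produces exactly the swap used in holdout validation: the training rows of the first iteration become the test rows of the second.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(10)  # toy row positions, not the real listings
kf = KFold(n_splits=2, shuffle=True, random_state=1)
splits = list(kf.split(indices))

(train_a, test_a), (train_b, test_b) = splits

# With two folds, the second iteration simply swaps the roles of the two halves.
swapped = np.array_equal(train_a, test_b) and np.array_equal(test_a, train_b)
print(swapped)
```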

Here's the algorithm for k-fold cross validation:

- splitting the full dataset into k equal length partitions,
  - selecting k-1 partitions as the training set and
  - selecting the remaining partition as the test set
- training the model on the training set,
- using the trained model to predict labels on the test fold,
- computing the test fold's error metric,
- repeating all of the above steps k-1 times, until each partition has been used as the test set for an iteration,
- calculating the mean of the k error values.
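
The rotation described in the steps above can be sketched with row positions alone. This is a minimal illustration, assuming a toy dataset of 23 rows and 5 folds; the model fitting is left as a comment so the focus stays on how the partitions take turns as the test set.

```python
import numpy as np

n_rows, k = 23, 5  # toy sizes chosen for illustration
rng = np.random.default_rng(0)

# Shuffle the row positions, then cut them into k nearly equal partitions.
folds = np.array_split(rng.permutation(n_rows), k)

test_counts = np.zeros(n_rows, dtype=int)
train_sizes = []
for i, test_idx in enumerate(folds):
    # Every partition except the i-th forms the training set for this iteration.
    train_idx = np.concatenate([fold for j, fold in enumerate(folds) if j != i])
    test_counts[test_idx] += 1
    train_sizes.append(len(train_idx))
    # A model would be fit on train_idx and scored on test_idx here,
    # and the k resulting error values would be averaged at the end.

print(train_sizes, test_counts)
```

Each row is used for testing exactly once and for training in the other k-1 iterations.
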

Holdout validation is essentially k-fold cross-validation with k equal to 2. Generally, 5 or 10 folds are used for k-fold cross-validation. Here's a diagram describing each iteration of 5-fold cross validation:


As you increase the number of folds, the number of observations in each fold decreases and the variance of the fold-by-fold errors increases. Let's start by manually partitioning the data set into 5 folds. Instead of splitting into 5 dataframes, let's add a column that specifies which fold each row belongs to. This way, we can easily select the rows for the training and test sets in each iteration.

In [23]:
# Add a new column to dc_listings named fold that contains the fold number each row belongs to:
# Fold 1 should have rows from index 0 to 743, including both of those rows.
# Fold 2 should have rows from index 744 to 1487, including both of those rows.
# Fold 3 should have rows from index 1488 to 2231, including both of those rows.
# Fold 4 should have rows from index 2232 to 2975, including both of those rows.
# Fold 5 should have rows from index 2976 to 3722, including both of those rows.
# Display the unique value counts for the fold column to confirm that each fold has roughly the same number of elements.

In [54]:
folds = dc_listings.index // 744
dc_listings['fold'] = np.where(folds < 5, folds + 1, 5)
dc_listings.fold.value_counts()

5    747
3    744
1    744
4    744
2    744
Name: fold, dtype: int64
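
Note that the `dc_listings.index // 744` approach relies on the dataframe having the default integer index (0, 1, 2, ...). As a hedged alternative, the sketch below assigns folds by row position instead, so it also works after filtering or reindexing. `assign_folds` is a hypothetical helper, and the toy dataframe is an assumption for illustration.

```python
import numpy as np
import pandas as pd


def assign_folds(df, k):
    """Return fold labels 1..k based on row position rather than index label."""
    fold_size = len(df) // k
    positions = np.arange(len(df))
    # Leftover rows past the last full block are placed in fold k.
    return np.minimum(positions // fold_size, k - 1) + 1


# Toy dataframe whose index does not start at 0.
toy = pd.DataFrame({"accommodates": range(23)}, index=range(100, 123))
toy["fold"] = assign_folds(toy, 5)
counts = toy["fold"].value_counts().sort_index()
print(counts)
```

For a dataframe with 3723 rows and 5 folds, this produces the same fold sizes as the cell above: four folds of 744 rows and a fifth of 747.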

In [55]:
# Train a k-nearest neighbors model using the accommodates column as the sole feature from folds 2 to 5 as the training set.
# Use the model to make predictions on the test set (accommodates column from fold 1) and assign the predicted labels to labels.
# Calculate the RMSE value by comparing the price column with the predicted labels.
# Assign the RMSE value to iteration_one_rmse.
train_df = dc_listings.loc[dc_listings['fold'] > 1]
test_df = dc_listings.loc[dc_listings['fold'] == 1]
knn = KNeighborsRegressor()
knn.fit(train_df[['accommodates']], train_df['price'])
labels = knn.predict(test_df[['accommodates']])
iteration_one_rmse = mean_squared_error(test_df['price'], labels) ** (0.5)

In [62]:
# Write a function named train_and_validate that takes in a dataframe as the first parameter (df) and a list of fold values (1 to 5 in our case) as the second parameter (folds). This function should:

# Train n models (where n is number of folds) and perform k-fold cross validation (using n folds). Use the default k value for the KNeighborsRegressor class.
# Return a list of RMSE values, where the first element is the RMSE for when fold 1 was the test set, the second element is the RMSE for when fold 2 was the test set, and so on.
def train_and_validate(df, folds):
    rmse_results = []
    for n in folds:
        test_df = df[df['fold'] == n]
        train_df = df[df['fold'] != n]
        knn = KNeighborsRegressor()
        knn.fit(train_df[['accommodates']], train_df['price'])
        labels = knn.predict(test_df[['accommodates']])
        rmse = mean_squared_error(test_df['price'], labels) ** (0.5)
        rmse_results.append(rmse)
    return rmse_results

# Use the train_and_validate function to return the list of RMSE values for the dc_listings Dataframe and assign to rmses.
rmses = train_and_validate(dc_listings, range(1,6))
# Calculate the mean of these values and assign to avg_rmse.
avg_rmse = np.mean(rmses)
# Display both rmses and avg_rmse.
rmses, avg_rmse

([137.26488167749056,
  99.71116620853789,
  163.72074818756715,
  116.12777422406992,
  157.38717103508057],
 134.8423482665492)

In [77]:
# Create a new instance of the KFold class with the following properties:

# 5 folds,
# shuffle set to True,
# random seed set to 1 (so the shuffle, and therefore the results, are reproducible),
# assigned to the variable kf.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
# Create a new instance of the KNeighborsRegressor class and assign to knn.
knn = KNeighborsRegressor()
# Use the cross_val_score() function to perform k-fold cross-validation:

# using the KNeighborsRegressor instance knn,
# using the accommodates column for training,
# using the price column as the target column,
# returning an array of negated MSE values (one value for each fold); scikit-learn scorers
# follow a higher-is-better convention, so 'neg_mean_squared_error' returns -MSE.
# Assign the resulting array to mses.
mses = cross_val_score(knn, dc_listings[['accommodates']], dc_listings['price'], cv=kf,
                      scoring='neg_mean_squared_error')
# Then, take the absolute value followed by the square root of each mse value.
# Then, calculate the average of the resulting RMSE values and assign to avg_rmse.
avg_rmse = np.mean(np.sqrt(np.abs(mses)))
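
The sign handling above can be checked with a short sketch. It assumes synthetic stand-in data and a recent scikit-learn release that supports the `'neg_root_mean_squared_error'` scoring string, and it compares the per-fold RMSE values from both routes.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (an assumption for illustration), not the Airbnb listings.
rng = np.random.default_rng(42)
X = rng.integers(1, 9, size=(200, 1)).astype(float)
y = 40 * X[:, 0] + rng.normal(0, 25, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Route 1: negated MSE per fold, converted to RMSE by hand.
neg_mses = cross_val_score(KNeighborsRegressor(), X, y, cv=kf,
                           scoring="neg_mean_squared_error")
rmse_from_mse = np.sqrt(-neg_mses)

# Route 2: negated RMSE per fold, returned directly by the scorer.
neg_rmses = cross_val_score(KNeighborsRegressor(), X, y, cv=kf,
                            scoring="neg_root_mean_squared_error")
rmse_direct = -neg_rmses

print(np.mean(rmse_from_mse), np.mean(rmse_direct))
```

Both routes give the same per-fold RMSE values, because the fixed `random_state` makes `kf` produce identical splits on each call.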

In [79]:
num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]

for fold in num_folds:
    kf = KFold(fold, shuffle=True, random_state=1)
    model = KNeighborsRegressor()
    mses = cross_val_score(model, dc_listings[["accommodates"]], dc_listings["price"], scoring="neg_mean_squared_error", cv=kf)
    rmses = np.sqrt(np.absolute(mses))
    avg_rmse = np.mean(rmses)
    std_rmse = np.std(rmses)
    print(str(fold), "folds: ", "avg RMSE: ", str(avg_rmse), "std RMSE: ", str(std_rmse))

3 folds:  avg RMSE:  127.19146799819767 std RMSE:  7.80114274447321
5 folds:  avg RMSE:  130.57004998596955 std RMSE:  15.968993082617418
7 folds:  avg RMSE:  124.74000565490935 std RMSE:  23.009326104623764
9 folds:  avg RMSE:  133.85427296864364 std RMSE:  20.275996691809862
10 folds:  avg RMSE:  134.50358073016668 std RMSE:  30.83892745302988
11 folds:  avg RMSE:  129.58548991863123 std RMSE:  22.39316430178567
13 folds:  avg RMSE:  133.05101345639838 std RMSE:  27.88932598342725
15 folds:  avg RMSE:  124.86715246014936 std RMSE:  37.03384132069149
17 folds:  avg RMSE:  131.3786960290144 std RMSE:  40.043451719093724
19 folds:  avg RMSE:  129.0143524209374 std RMSE:  44.3383982741942
21 folds:  avg RMSE:  125.49498964946545 std RMSE:  41.03033829748872
23 folds:  avg RMSE:  125.27939162120605 std RMSE:  41.668089858618046
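
One way to read the results above is to look at how small each test fold becomes as the number of folds grows. The sketch below assumes the 3723 rows implied by the earlier split (1862 + 1861) and uses the sizing rule from the `KFold` documentation: the first `n_samples % n_splits` folds get `n_samples // n_splits + 1` rows and the rest get `n_samples // n_splits`. `fold_sizes` is a hypothetical helper.

```python
n_rows = 3723  # 1862 + 1861 rows from the earlier split
num_folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]


def fold_sizes(n_rows, k):
    """Sizes of the k test folds, following the rule documented for KFold."""
    base, extra = divmod(n_rows, k)
    return [base + 1] * extra + [base] * (k - extra)


smallest = {k: min(fold_sizes(n_rows, k)) for k in num_folds}
for k, size in smallest.items():
    print(f"{k} folds: smallest test fold has {size} rows")
```

With fewer rows in each test fold, a handful of unusually priced listings can move that fold's RMSE considerably, which is consistent with the standard deviation growing as the number of folds increases.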
