**What is cross-validation?**

In cross-validation, we run our modeling process on different subsets of the data to get multiple measures of model quality.

For example, we could begin by dividing the data into 5 pieces, each 20% of the full dataset. In this case, we say that we have broken the data into 5 "folds".


Then, we run one experiment for each fold:

**In Experiment 1,** we use the first fold as a validation (or holdout) set and everything else as training data. This gives us a measure of model quality based on a 20% holdout set.

**In Experiment 2,** we hold out data from the second fold (and use everything except the second fold for training the model). The holdout set is then used to get a second estimate of model quality.

We repeat this process, using every fold once as the holdout set. Putting this together, 100% of the data is used as holdout at some point, and we end up with a measure of model quality that is based on all of the rows in the dataset (even if we don't use all rows simultaneously).
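
The fold-by-fold procedure above can be sketched in plain Python. This is only an illustration on a toy "dataset" of ten row indices (the sizes and variable names are assumptions, and the row count is chosen to divide evenly into folds); it shows which rows are held out in each experiment rather than training any model.

```python
# Ten row indices standing in for a dataset (hypothetical toy size)
n_rows = 10
n_folds = 5
rows = list(range(n_rows))
fold_size = n_rows // n_folds  # assumes n_rows divides evenly by n_folds

held_out = []
for k in range(n_folds):
    # Fold k is the holdout set; every other row is training data
    holdout = rows[k * fold_size:(k + 1) * fold_size]
    training = [r for r in rows if r not in holdout]
    held_out.extend(holdout)
    print(f"Experiment {k + 1}: holdout={holdout}, training={training}")

# Across the five experiments, every row is held out exactly once
assert sorted(held_out) == rows
```

Each experiment trains on 80% of the rows and validates on the remaining 20%, and no row appears in both sets within the same experiment.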

**When should you use cross-validation?**

Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions. However, it can take longer to run, because it estimates multiple models (one for each fold).

So, given these tradeoffs, when should you use each approach?

- For small datasets, where extra computational burden isn't a big deal, you should run cross-validation.
- For larger datasets, a single validation set is sufficient. Your code will run faster, and you may have enough data that there's little need to re-use some of it for holdout.

There's no simple threshold for what constitutes a large vs. small dataset. But if your model takes a couple of minutes or less to run, it's probably worth switching to cross-validation.

Alternatively, you can run cross-validation and see if the scores for each experiment seem close. If each experiment yields the same results, a single validation set is probably sufficient.



In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv('melb_data.csv')

data.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [3]:
# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']

X = data[cols_to_use]
Y = data.Price

Then, we define a pipeline that uses an imputer to fill in missing values and a random forest model to make predictions.

While it's possible to do cross-validation without pipelines, it is quite difficult! Using a pipeline will make the code remarkably straightforward.

In [4]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer


my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50, random_state=0))])



We obtain the cross-validation scores with the cross_val_score() function from scikit-learn. We set the number of folds with the cv parameter.

In [5]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, Y, cv=5,
                              scoring='neg_mean_absolute_error')

print(scores)

[301628.7893587  303164.4782723  287298.331666   236061.84754543
 260383.45111427]
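
Since we usually want a single number to compare alternative models, one common way to summarize these results is to average across the folds. The spread across folds also gives a rough sense of whether the experiments agree with each other, which ties back to the earlier question of whether a single validation set would be enough. The sketch below simply reuses the five values printed above; the variable names are illustrative, and how much spread counts as "close" is a judgment call rather than a fixed rule.

```python
from statistics import mean, stdev

# Per-fold MAE values copied from the output above
scores = [301628.7893587, 303164.4782723, 287298.331666,
          236061.84754543, 260383.45111427]

average_mae = mean(scores)
spread = stdev(scores)  # sample standard deviation across the five folds
relative_spread = spread / average_mae

print(f"Average MAE: {average_mae:.0f}")
print(f"Std across folds: {spread:.0f} ({relative_spread:.1%} of the mean)")
```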


In [6]:
# Impute missing values once on the full dataset, without a pipeline
simpleimputer = SimpleImputer()

X_ready = pd.DataFrame(simpleimputer.fit_transform(X), columns=X.columns)
X_ready


Unnamed: 0,Rooms,Distance,Landsize,BuildingArea,YearBuilt
0,2.0,2.5,202.0,151.96765,1964.684217
1,2.0,2.5,156.0,79.00000,1900.000000
2,3.0,2.5,134.0,150.00000,1900.000000
3,3.0,2.5,94.0,151.96765,1964.684217
4,4.0,2.5,120.0,142.00000,2014.000000
...,...,...,...,...,...
13575,4.0,16.7,652.0,151.96765,1981.000000
13576,3.0,6.8,333.0,133.00000,1995.000000
13577,3.0,6.8,436.0,151.96765,1997.000000
13578,4.0,6.8,866.0,157.00000,1920.000000


In [7]:
model = RandomForestRegressor(n_estimators=50, random_state=0)

In [8]:
# cv is not set here, so scikit-learn's default number of folds is used
score = -1 * cross_val_score(model, X_ready, Y, scoring='neg_mean_absolute_error')

In [9]:
score

array([301124.18372791, 301928.51358312, 287612.23985381, 235963.2009766 ,
       260542.44042393])
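
The last few cells impute the full dataset once and then cross-validate the model on its own, so the column means used to fill gaps are computed with the holdout rows included. As a rough sketch of what cross-validation without a pipeline can look like when the imputer is instead refit inside every fold, the loop below runs on synthetic data (`X_demo` and `y_demo` are made-up stand-ins, not the Melbourne dataset) and assumes unshuffled `KFold(n_splits=5)`, which is what `cv=5` resolves to for a regressor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (hypothetical, not the Melbourne dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # blank out roughly 10% of values

fold_maes = []
for train_idx, valid_idx in KFold(n_splits=5).split(X_demo):
    # Fit the imputer on the training rows only, then apply it to the holdout rows
    imputer = SimpleImputer()
    X_train = imputer.fit_transform(X_demo[train_idx])
    X_valid = imputer.transform(X_demo[valid_idx])

    # Train a fresh model for this fold and score it on the holdout rows
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_demo[train_idx])
    preds = model.predict(X_valid)
    fold_maes.append(mean_absolute_error(y_demo[valid_idx], preds))

print(fold_maes)
```

This is the bookkeeping that a pipeline passed to `cross_val_score` handles for you: the imputer and the model are both refit on each fold's training rows.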