# Cross-validation

Cross-validation produces multiple measures of model quality by running the modelling process on different subsets of the data. The complete dataset is divided into subsets called folds. In each run, one fold is held out for model validation and the remaining folds are used as training data, so that every fold serves as the holdout set exactly once.
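To make the fold splitting concrete, here is a minimal sketch using `KFold` from `sklearn.model_selection` on a toy array. The array `X_toy` and the choice of 5 folds are illustrative assumptions, not part of the housing analysis below.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (assumption for illustration): 10 samples, so each of the 5 folds holds 2 samples
X_toy = np.arange(10).reshape(-1, 1)

# Without shuffling, KFold cuts the data into contiguous blocks
kf = KFold(n_splits=5)
splits = list(kf.split(X_toy))

for i, (train_idx, valid_idx) in enumerate(splits):
    # Each iteration holds out one fold and trains on the other four
    print(f"Fold {i}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
```

Every sample index appears in exactly one validation fold, which is what lets each observation contribute to the quality estimate once.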

Cross-validation gives a more reliable measure of model quality than a single validation split, at the cost of longer run time because the model is trained once per fold. Use cases include
- **small datasets**, where we can afford the additional computational burden of cross-validation
- **larger datasets**, where a single validation split isn't sufficient

In [1]:
# CROSS-VALIDATION on Melbourne Housing Dataset

import pandas as pd
data = pd.read_csv('~/kaggle/input/melbourne-housing-snapshot/melb_data.csv')

In [5]:
data.describe().astype('int64')

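| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580 | 13580 | 13580 | 13580 | 13580 | 13580 | 13518 | 13580 | 7130 | 8205 | 13580 | 13580 | 13580 |
| mean | 2 | 1075684 | 10 | 3105 | 2 | 1 | 1 | 558 | 151 | 1964 | -37 | 144 | 7454 |
| std | 0 | 639310 | 5 | 90 | 0 | 0 | 0 | 3990 | 541 | 37 | 0 | 0 | 4378 |
| min | 1 | 85000 | 0 | 3000 | 0 | 0 | 0 | 0 | 0 | 1196 | -38 | 144 | 249 |
| 25% | 2 | 650000 | 6 | 3044 | 2 | 1 | 1 | 177 | 93 | 1940 | -37 | 144 | 4380 |
| 50% | 3 | 903000 | 9 | 3084 | 3 | 1 | 2 | 440 | 126 | 1970 | -37 | 145 | 6555 |
| 75% | 3 | 1330000 | 13 | 3148 | 3 | 2 | 2 | 651 | 174 | 1999 | -37 | 145 | 10331 |
| max | 10 | 9000000 | 48 | 3977 | 20 | 8 | 10 | 433014 | 44515 | 2018 | -37 | 145 | 21650 |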


In [7]:
# Select predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

In [8]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 5 columns):
Rooms           13580 non-null int64
Distance        13580 non-null float64
Landsize        13580 non-null float64
BuildingArea    7130 non-null float64
YearBuilt       8205 non-null float64
dtypes: float64(4), int64(1)
memory usage: 530.6 KB


In [10]:
# Build pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Define preprocessor
preprocessor = SimpleImputer(strategy='constant')
# Define model
model = RandomForestRegressor(n_estimators=193, random_state=0)

# pipeline
my_pipeline = Pipeline(steps = [
    ('preprocessor', preprocessor),('model', model)
] )

Cross-validation scores are calculated with the `cross_val_score()` function from `sklearn.model_selection`. The number of folds is set with the `cv` parameter.

In [12]:
# CROSS-VALIDATION

from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                             cv=5,
                             scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)

MAE scores:
 [299836.70688591 300047.83663901 286083.40909066 236604.03224108
 259524.31043587]
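As a rough picture of what `cross_val_score()` does for a regressor when `cv=5`, the sketch below runs the same loop by hand and compares the result with the library call. It uses synthetic data and a `LinearRegression` model so that it runs quickly; `X_demo`, `y_demo`, and `demo_pipeline` are illustrative names, not part of the housing analysis.

```python
import numpy as np
from sklearn.base import clone
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic regression data (assumption for illustration)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

demo_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer(strategy='constant')),
    ('model', LinearRegression()),
])

# Manual loop: fit a fresh copy of the pipeline on each training split
manual_scores = []
for train_idx, valid_idx in KFold(n_splits=5).split(X_demo):
    fold_pipeline = clone(demo_pipeline)
    fold_pipeline.fit(X_demo[train_idx], y_demo[train_idx])
    preds = fold_pipeline.predict(X_demo[valid_idx])
    manual_scores.append(mean_absolute_error(y_demo[valid_idx], preds))
manual_scores = np.array(manual_scores)

# Library call with the same settings as the housing example
library_scores = -1 * cross_val_score(demo_pipeline, X_demo, y_demo,
                                      cv=5,
                                      scoring='neg_mean_absolute_error')

print("Manual: ", manual_scores)
print("Library:", library_scores)
```

Because the whole pipeline is refit inside each fold, preprocessing is learned from the training folds only and never sees the held-out fold.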


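The `scoring` parameter selects the measure of model quality to report; in this case, we chose negative mean absolute error (MAE). scikit-learn follows the convention that a higher score is better, which is why error metrics such as MAE are reported negated. The scikit-learn docs list the available options: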

https://scikit-learn.org/stable/modules/model_evaluation.html
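If more than one metric is of interest, one option is `cross_validate()` from `sklearn.model_selection`, which accepts a list of scorer names. The sketch below is a hedged illustration on synthetic data; `X_demo`, `y_demo`, and the choice of `'r2'` as a second metric are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression data (assumption for illustration)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Evaluate two metrics over the same 5 folds
results = cross_validate(LinearRegression(), X_demo, y_demo,
                         cv=5,
                         scoring=['neg_mean_absolute_error', 'r2'])

# Results are keyed as 'test_<scorer name>'; negate MAE to get positive errors
mae_scores = -1 * results['test_neg_mean_absolute_error']
r2_scores = results['test_r2']

print("MAE per fold:", mae_scores)
print("R^2 per fold:", r2_scores)
```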

In [14]:
# Averaging scores to get a single measure of model quality
print("Average MAE score (across experiments):")
print(scores.mean())

Average MAE score (across experiments):
276419.2590585053
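The average hides how much the folds disagree, so it can help to look at the spread as well. The sketch below reuses the five fold scores printed above (copied in by hand so that it runs on its own); the variable name `fold_scores` is an illustrative choice.

```python
import numpy as np

# Fold-wise MAE values copied from the cross-validation output above
fold_scores = np.array([299836.70688591, 300047.83663901, 286083.40909066,
                        236604.03224108, 259524.31043587])

mean_mae = fold_scores.mean()                     # single summary number
std_mae = fold_scores.std()                       # fold-to-fold variation
spread = fold_scores.max() - fold_scores.min()    # worst fold minus best fold

print("Mean MAE:", mean_mae)
print("Std of MAE across folds:", std_mae)
print("Max - min across folds:", spread)
```

A large spread relative to the mean suggests that any single validation split could give a noticeably different picture of model quality.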


### Comparison with the traditional technique (single train/validation split)

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X,y,
                                                     random_state=0)

In [17]:
# Define preprocessor
preprocessor = SimpleImputer(strategy='constant')
# Define model
model = RandomForestRegressor(n_estimators=193, random_state=0)

# pipeline
my_pipeline_cpy = Pipeline(steps = [
    ('preprocessor', preprocessor),('model', model)
] )

In [18]:
# Preprocess and train
my_pipeline_cpy.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('preprocessor',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='constant',
                               verbose=0)),
                ('model',
                 RandomForestRegressor(bootstrap=True, criterion='mse',
                                       max_depth=None, max_features='auto',
                                       max_leaf_nodes=None,
                                       min_impurity_decrease=0.0,
                                       min_impurity_split=None,
                                       min_samples_leaf=1, min_samples_split=2,
                                       min_weight_fraction_leaf=0.0,
                                       n_estimators=193, n_jobs=None,
                                       oob_score=False, random_state=0,
                                       verbose=0, warm_start=False))],
         verbose=False)

In [20]:
# Preprocess and Predict
pred_y = my_pipeline_cpy.predict(X_valid)

In [25]:
# Assess Accuracy
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_valid, pred_y)
print("MAE without Cross-validation:\n", mae)

MAE without Cross-validation:
 252586.07506742276
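The single-split MAE (about 252,586) differs from the cross-validated average (about 276,419), and part of that gap may simply reflect which rows happened to land in the validation set. A minimal sketch of this effect, on synthetic data and with a `LinearRegression` model as a stand-in assumption for the random forest, is to repeat the split with different `random_state` values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption for illustration)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Repeat the single train/validation split with different seeds
single_split_maes = []
for seed in range(5):
    X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=seed)
    demo_model = LinearRegression().fit(X_tr, y_tr)
    single_split_maes.append(mean_absolute_error(y_va, demo_model.predict(X_va)))

print("MAE for each random split:", single_split_maes)
```

Each seed gives a somewhat different MAE even though the data and model are unchanged, which is the variation that averaging over folds is meant to smooth out.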
