# Cross Validation
Kaggle Machine Learning Intermediate Course: 
https://www.kaggle.com/code/alexisbcook/cross-validation

In [1]:
import pandas as pd

# Read the data
data = pd.read_csv('melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

## Define Pipeline
The pipeline uses an imputer to fill in missing values and a random forest model to make predictions.

In [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))
])
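
To make the two pipeline steps concrete, here is a minimal sketch on hypothetical toy data (the array names, sizes, and the 10% missing rate are assumptions for illustration, not part of the Melbourne dataset). It shows that a pipeline can be fit directly on data containing `NaN` values, because the imputer runs before the model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical toy data (not the Melbourne dataset): 100 rows, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

# Knock out roughly 10% of the feature values to simulate missing data
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

demo_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),  # default strategy fills with column means
    ('model', RandomForestRegressor(n_estimators=10, random_state=0)),
])

# fit() imputes and then trains the forest; predict() reuses the fitted imputer
demo_pipeline.fit(X_demo, y_demo)
preds = demo_pipeline.predict(X_demo[:5])
print(preds.shape)
```

Bundling the imputer with the model matters for cross-validation: the imputer is refit on each training fold, so statistics from the validation fold do not leak into preprocessing.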

## Cross Validation
- Use the `cross_val_score()` function from scikit-learn.
- Set the number of folds with the `cv` parameter.

In [3]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y, 
                             cv=5,
                             scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)

MAE scores:
 [301628.7893587  303164.4782723  287298.331666   236061.84754543
 260383.45111427]


The `scoring` parameter chooses a measure of model quality to report: in this case, we chose negative mean absolute error (MAE).
Scikit-learn follows a convention where all metrics are defined so that a higher number is better. Reporting the negative of MAE keeps it consistent with that convention, though negative MAE is almost unheard of elsewhere.
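
The sign convention can be checked by hand. The sketch below, on hypothetical toy data (the names and sizes are assumptions for illustration), runs `cross_val_score()` with `neg_mean_absolute_error` and then repeats the same five folds manually with `KFold` and `mean_absolute_error`. For a regressor, an integer `cv` corresponds to unshuffled `KFold` splits, so the two results should agree once the sign is flipped.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical toy regression data (not the Melbourne dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

model = RandomForestRegressor(n_estimators=10, random_state=0)

# What scikit-learn reports: negated MAE, one value per fold
neg_scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring='neg_mean_absolute_error')

# The same five folds by hand, using ordinary (positive) MAE
manual_mae = []
for train_idx, valid_idx in KFold(n_splits=5).split(X_demo):
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    fold_preds = fold_model.predict(X_demo[valid_idx])
    manual_mae.append(mean_absolute_error(y_demo[valid_idx], fold_preds))

# Flipping the sign of the reported scores recovers the manual MAE values
print(np.allclose(-neg_scores, manual_mae))
```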

## Average MAE score

In [4]:
print("Average MAE score (across experiments):")
print(scores.mean())

Average MAE score (across experiments):
277707.3795913405
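
A single average hides how much the folds disagree. As a small addition that is not part of the original course notebook, the sketch below copies the five fold scores from the printed output and reports their standard deviation alongside the mean; the variable name `fold_scores` is chosen here for illustration.

```python
import numpy as np

# Fold scores copied from the printed "MAE scores" output
fold_scores = np.array([301628.7893587, 303164.4782723, 287298.331666,
                        236061.84754543, 260383.45111427])

print("Mean MAE:", fold_scores.mean())
print("Std of MAE across folds:", fold_scores.std())
```

A large spread relative to the mean suggests that the estimate depends noticeably on which rows land in the validation fold.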
