ML is an **iterative** process

We will face many different 'What?' questions:
+ what predictive variables to use?
+ what types of models to use?
+ what arguments to supply to those models?
+ ...

The **larger the validation set**, the **less randomness** (aka "noise") there is in our measure of model quality, and the **more reliable** it will be
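A rough way to see this effect is to simulate it. The sketch below is not from the original notes: it assumes hypothetical prediction errors drawn from a normal distribution (the helper `mae_spread` and the scale of 100 are made up for illustration) and measures how much the MAE estimate varies across many validation sets of a given size.

```python
import numpy as np

rng = np.random.default_rng(0)

def mae_spread(n_val, n_repeats=500):
    """Standard deviation of the MAE estimate across n_repeats validation sets of size n_val."""
    # Assumption for illustration: each prediction error is drawn from N(0, 100^2)
    errors = rng.normal(loc=0.0, scale=100.0, size=(n_repeats, n_val))
    # One MAE value per simulated validation set
    mae_per_set = np.abs(errors).mean(axis=1)
    return mae_per_set.std()

# Larger validation sets should give a smaller spread, i.e. a less noisy MAE
for n_val in (20, 200, 2000):
    print(n_val, round(mae_spread(n_val), 2))
```

Under these assumptions the spread should shrink roughly with the square root of the validation set size, which is why a small holdout set can give a misleading picture of model quality.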


# Cross-validation
In cross-validation, we run our modeling process on different subsets of the data to get multiple measures of model quality.

For example, we could begin by dividing the data into 5 pieces, each 20% of the full dataset. In this case, we say that we have broken the data into 5 "**folds**".

![](https://storage.googleapis.com/kaggle-media/learn/images/9k60cVA.png)

Then, we run one experiment for each fold:

+ In **Experiment 1**, we use the first fold as a validation (or holdout) set and everything else as training data. This gives us a measure of model quality based on a 20% holdout set.
+ In **Experiment 2**, we hold out data from the second fold (and use everything except the second fold for training the model). The holdout set is then used to get a second estimate of model quality.
+ We repeat this process, using every fold once as the holdout set. Putting this together, 100% of the data is used as holdout at some point, and we end up with a measure of model quality that is based on all of the rows in the dataset (even if we don't use all rows simultaneously).
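The experiments above can be sketched with scikit-learn's `KFold` splitter. This is a minimal illustration on a made-up dataset of 10 rows (the `rows` array and the `holdout_counts` counter are stand-ins, not part of the original notes); it only shows which rows land in the holdout set in each experiment.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(10)    # stand-in for the row indices of a tiny dataset
kf = KFold(n_splits=5)  # 5 folds, each 20% of the rows

# Count how many times each row ends up in a holdout set
holdout_counts = np.zeros(len(rows), dtype=int)

for i, (train_idx, valid_idx) in enumerate(kf.split(rows), start=1):
    print(f"Experiment {i}: holdout rows {valid_idx}, training rows {train_idx}")
    holdout_counts[valid_idx] += 1

# Every row should have been held out exactly once
print(holdout_counts)
```

Each row is used for validation in exactly one experiment and for training in the other four, which is what lets the combined score reflect the whole dataset.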


# Example

In [1]:
import pandas as pd 

data = pd.read_csv("./input/melb_data.csv")

features = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[features]

y = data.Price


## Pipeline + Model
Define a pipeline that uses an imputer to fill in missing values and a random forest model to make predictions.

In [4]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))
])

Obtain the cross-validation scores with the `cross_val_score()` function from sklearn. Set the number of folds with the `cv` parameter.

In [5]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')

print("MAE scores:\n",scores)

MAE scores:
 [301628.7893587  303164.4782723  287298.331666   236061.84754543
 260383.45111427]


In [6]:
print("Average MAE score (across experiments):")
print(scores.mean())

Average MAE score (across experiments):
277707.3795913405
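
To make the mechanics of `cross_val_score()` less of a black box, here is a sketch that repeats the fold loop by hand and compares it with the library call. It does not use the Melbourne data: `X_demo` and `y_demo` are synthetic stand-ins built with `make_regression`, and the smaller `n_estimators=10` is chosen only to keep the example quick. It also assumes that an integer `cv` on a regressor corresponds to an unshuffled `KFold`, which is scikit-learn's documented default.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: not the housing dataset used above)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_demo[::10, 0] = np.nan  # inject some missing values so the imputer has work to do

demo_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=10, random_state=0))
])

# Manual version: one experiment per fold
manual_scores = []
for train_idx, valid_idx in KFold(n_splits=5).split(X_demo):
    fold_pipeline = clone(demo_pipeline)  # fresh, unfitted copy for each fold
    fold_pipeline.fit(X_demo[train_idx], y_demo[train_idx])
    preds = fold_pipeline.predict(X_demo[valid_idx])
    manual_scores.append(mean_absolute_error(y_demo[valid_idx], preds))
manual_scores = np.array(manual_scores)

# Library version: same folds, same metric (sign flipped back to positive MAE)
library_scores = -1 * cross_val_score(demo_pipeline, X_demo, y_demo,
                                      cv=5, scoring='neg_mean_absolute_error')

print("Manual MAE per fold: ", manual_scores)
print("Library MAE per fold:", library_scores)
```

Because the imputer sits inside the pipeline, it is refit on the training rows of each fold, so the holdout rows never influence the imputed values. With the fixed `random_state`, the two sets of scores should agree.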
