### What is cross-validation?

In **cross-validation**, we run our modeling process on different subsets of the data to get multiple measures of model quality.
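As a minimal sketch of what "different subsets" can look like, the snippet below splits row indices into folds and holds out each fold once for validation. It assumes the common k-fold scheme with contiguous, unshuffled folds; the helper name `k_fold_indices` is hypothetical and is not a scikit-learn function.

```python
def k_fold_indices(n_samples, n_folds):
    """Return a list of (train_indices, valid_indices) pairs, one per fold."""
    # Spread any remainder over the first folds so sizes differ by at most one
    base, extra = divmod(n_samples, n_folds)
    folds = []
    start = 0
    for i in range(n_folds):
        stop = start + base + (1 if i < extra else 0)
        valid = list(range(start, stop))
        # Every row outside the held-out fold is used for training
        train = list(range(0, start)) + list(range(stop, n_samples))
        folds.append((train, valid))
        start = stop
    return folds


for fold_number, (train, valid) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"Fold {fold_number}: validate on rows {valid}, "
          f"train on the other {len(train)} rows")
```

Each row is used for validation exactly once, so every fold gives one measure of model quality.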

### When should you use cross-validation?

Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions. However, it can take longer to run, because it estimates multiple models (one for each fold).

So, given these tradeoffs, when should you use each approach?

 - For small datasets, where the extra computational burden isn't a big deal, you should run cross-validation.
 - For larger datasets, a single validation set is sufficient. Your code will run faster, and you may have enough data that there's little need to re-use some of it for holdout.
 
There's no simple threshold for what constitutes a large vs. small dataset. But if your model takes a couple of minutes or less to run, it's probably worth switching to cross-validation.
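One way to apply this rule of thumb is to time a single model fit and scale it by the number of folds. The sketch below uses only the standard library. `time_call` and `placeholder_fit` are hypothetical names, and the workload is a stand-in for your own model-fitting call.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def placeholder_fit():
    # Stand-in workload (hypothetical); replace with your own fitting code
    return sum(i * i for i in range(100_000))


_, seconds_per_fit = time_call(placeholder_fit)
n_folds = 5

# Cross-validation fits one model per fold, so scale the single-fit time
estimated_total = n_folds * seconds_per_fit
print(f"One fit: {seconds_per_fit:.3f}s, "
      f"rough estimate for {n_folds} folds: {estimated_total:.3f}s")
```

The estimate is only rough: each fold trains on a subset of the rows, so the real total is often somewhat lower.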

Alternatively, you can run cross-validation and see if the scores for each experiment seem close. If each experiment yields the same results, a single validation set is probably sufficient.
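One possible way to make "seem close" concrete is to compare the spread of the fold scores with their mean. The sketch below is illustrative only. The helper `relative_spread`, the 5% cut-off, and the example numbers are all assumptions, not values from the tutorial or from scikit-learn.

```python
from statistics import mean, pstdev


def relative_spread(scores):
    """Standard deviation of the scores as a fraction of their mean."""
    return pstdev(scores) / mean(scores)


# Placeholder numbers for illustration only; not results from any real model
example_scores = [100.0, 104.0, 98.0, 101.0, 97.0]

# Hypothetical cut-off; choose one that suits your own problem
TOLERANCE = 0.05

spread = relative_spread(example_scores)
if spread < TOLERANCE:
    print(f"Spread is {spread:.1%}: a single validation set may be enough")
else:
    print(f"Spread is {spread:.1%}: cross-validation is probably worth keeping")
```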

### Example

1. We load the input data in `X` and the output data in `y`.

In [None]:
import pandas as pd

# Read the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

2. We define a pipeline that uses an imputer to fill in missing values and a random forest model to make predictions.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Step 1 fills in missing values; step 2 fits the random forest
my_pipline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0)),
])


3. We obtain the cross-validation scores with the `cross_val_score()` function from scikit-learn. We set the number of folds with the `cv` parameter.

In [None]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)

Note:
The `scoring` parameter chooses a measure of model quality to report: in this case, we chose negative mean absolute error (MAE). The docs for scikit-learn show a list of [options](http://scikit-learn.org/stable/modules/model_evaluation.html).

It is a little surprising that we specify *negative* MAE. Scikit-learn has a convention where all metrics are defined so that a higher number is better. Using the negative here keeps MAE consistent with that convention, though negative MAE is almost unheard of elsewhere.
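A small sketch of why the sign flip fits the "higher is better" convention: the model with the smaller MAE ends up with the larger (less negative) score. The helper `mean_absolute_error_plain` and the numbers are made up for illustration. This is not scikit-learn's implementation.

```python
def mean_absolute_error_plain(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions, for illustration only
y_true = [3.0, 5.0, 8.0]
good_preds = [3.5, 5.0, 7.5]  # close to the targets
bad_preds = [1.0, 9.0, 4.0]   # far from the targets

neg_mae_good = -mean_absolute_error_plain(y_true, good_preds)
neg_mae_bad = -mean_absolute_error_plain(y_true, bad_preds)

# After negation, the better model has the higher score
print("Better model scores higher:", neg_mae_good > neg_mae_bad)

# Multiplying by -1 recovers the ordinary, positive MAE
print("MAE of the better model:", -1 * neg_mae_good)
```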

We typically want a single measure of model quality to compare alternative models, so we take the average across experiments.


In [None]:
print("Average MAE score (across experiments):")
print(scores.mean())
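To make the steps above more concrete, here is a sketch of roughly what `cross_val_score()` does for a regressor when `cv` is an integer: it splits the rows into unshuffled folds, fits a fresh copy of the pipeline on each training part, and scores it on the held-out part. The sketch uses synthetic data from `make_regression` as a stand-in (an assumption; it is not the Melbourne housing data) and fewer trees to keep it quick. The `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption), with a few values knocked out
# so that the imputer has something to fill in
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)
X_demo[::15, 2] = np.nan

demo_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=10, random_state=0)),
])

# Manual loop: one fresh, unfitted copy of the pipeline per fold
manual_scores = []
for train_idx, valid_idx in KFold(n_splits=5).split(X_demo):
    fold_pipeline = clone(demo_pipeline)
    fold_pipeline.fit(X_demo[train_idx], y_demo[train_idx])
    preds = fold_pipeline.predict(X_demo[valid_idx])
    manual_scores.append(mean_absolute_error(y_demo[valid_idx], preds))

# The same evaluation through the library helper
library_scores = -1 * cross_val_score(demo_pipeline, X_demo, y_demo,
                                      cv=5,
                                      scoring='neg_mean_absolute_error')

print("Manual MAE per fold: ", np.round(manual_scores, 2))
print("Library MAE per fold:", np.round(library_scores, 2))
print("Average MAE:", np.mean(manual_scores))
```

Because the imputer sits inside the pipeline, it is refit on the training rows of each fold, so the held-out rows never influence the fill-in values.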

### Conclusion

Using cross-validation yields a much better measure of model quality, with the added benefit of cleaning up our code: note that we no longer need to keep track of separate training and validation sets. So, especially for small datasets, it's a good improvement!