In [None]:
# What is Cross-Validation?
# Cross-validation trains multiple copies of the same model, each on a different subset of the data. For each copy, the data is split into
# a training set and a testing set in such a way that every sample appears in a testing set in exactly one split. Each train/test split
# of the data is called a fold.
# In the image above each row represents a different fold. The blue rectangles are the testing set for each fold. A total of k models are
# trained, one per fold: each is fit on that fold's training data and evaluated on its testing data. The scores for each fold are recorded.
# Since the image above uses k=5 (5 folds), a different 20% of the data is used as the test set for each fold, so that over 5 iterations
# every piece of the data is used for both training and testing.
# Hold-out set. Notice on the bottom left the ‘Test data’? We only use our training data for the cross-validation and hold out our test data
# for evaluating the very final model we choose, after model selection and hyperparameter tuning.
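# The fold structure described above can be sketched with scikit-learn's KFold splitter. This is only an illustration: the
# ten-sample array X_toy is hypothetical data, and k=5 is chosen to match the image.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples stand in for a training set (hypothetical data).
X_toy = np.arange(10).reshape(-1, 1)

# k=5 folds: each fold holds out a different 20% of the samples for testing.
kf = KFold(n_splits=5)

test_counts = np.zeros(len(X_toy), dtype=int)  # how often each sample is tested
fold_sizes = []                                # size of each fold's testing set

for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy), start=1):
    test_counts[test_idx] += 1
    fold_sizes.append(len(test_idx))
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
```

# After the loop, test_counts shows that every sample landed in a testing set exactly once.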



# Why Use Cross-Validation?
# There are two primary reasons to use cross-validation.
# Cross-Validation Compares Model Performance on Multiple Test Sets
# Model scores often vary if different subsets of the data are used for evaluation. Sometimes testing sets are not representative of the
# entire dataset, or certain outliers in the data are harder to predict. Cross-validation returns scores for all parts of the data. This
# increases confidence in how the model may perform on new data since new data will not be identical to any one test set.
# Cross-Validation Prevents Overfitting to a Single Test Set
# You have learned about overfitting to a training set, but it’s also possible to overfit to a test set. This can occur when models and
# hyperparameters are chosen because they work well on one particular test set, but may not work well for other data. While no
# individual model has been fitted to the test set, you as a data scientist have fit your choice of model and hyperparameters to perform
# well on one particular test set. Will that model with those parameters work well with new data as well?
# When we use cross-validation we only use the training data for the folds. This allows us to test each model using only data from the
# training set while still using a validation split. If we use cross-validation to choose a model and hyperparameters, and then use the
# original test set only once to evaluate the final model, we can increase our confidence in the performance of the model on new data.
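# A minimal sketch of that workflow, under stated assumptions: the synthetic dataset, the two candidate models, and the names
# candidates, cv_means, and best_name are all illustrative choices, not part of the original lesson.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Set the hold-out test set aside before any model selection happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Two hypothetical candidate models to choose between.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Compare the candidates using cross-validation on the training data only.
cv_means = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

# Refit the chosen model on all training data, then score it once on the test set.
final_model = candidates[best_name].fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(cv_means, best_name, test_score)
```

# The test set is touched only in the last step, so the choice of model was never tuned against it.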

In [None]:
# Cross-validation in Python
# Scikit-Learn has several tools for cross-validation. One of them is:

# sklearn.model_selection.cross_val_score

# The function above takes a model (or pipeline), an X set of features, a y set of labels, and optionally a scoring metric (the scoring
# argument). You can also set the number of folds with the cv argument. See the documentation on cross_val_score to learn more.
# cross_val_score returns an array of scores, one for each fold. A common way to compare models is to average each model's fold
# scores.
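
# As a rough example of the call described above, assuming the built-in iris dataset and a logistic regression model (both chosen
# here only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Built-in example data; any feature matrix X and label vector y would work.
X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)

# cv sets the number of folds; scoring selects the metric.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print(scores)         # one score per fold
print(scores.mean())  # average score, useful for comparing models
```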

# Preventing Data Leakage During Cross-Validation
# Since each portion of the training set will be used as a testing set during cross-validation, any preprocessing must be fit separately
# within each fold, using only that fold's training portion. Otherwise information from the fold's testing portion leaks into training
# and inflates the scores. We can do this by using a pipeline in place of our model in the code.
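
# One way this might look, assuming a StandardScaler as the preprocessing step and the built-in breast cancer dataset (both are
# illustrative choices, not requirements):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Built-in example data (assumption for illustration).
X, y = load_breast_cancer(return_X_y=True)

# The scaler and the model are bundled so they are always fit together.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# In each fold, the scaler is fit on that fold's training portion only,
# then applied to the fold's testing portion before scoring.
pipe_scores = cross_val_score(pipe, X, y, cv=5)

print(pipe_scores)
print(pipe_scores.mean())
```

# Scaling X once before calling cross_val_score would instead let every fold's testing portion influence the scaler.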