# Model Fit and Cross Validation

Recall that in supervised machine learning it is common to split your full dataset for the purposes of training and testing. 

* The training set contains known outputs, and the model learns from this data so that it can generalize to other data later on. 
* The test set is used to evaluate the model’s predictions.

Seems like a good idea, right? 

As with most approaches, the train-test split is not a silver bullet and has a number of potential risks associated with it. For example, what if the split you make isn’t random? Perhaps the dataset is ordered on a specific attribute or set of attributes and so, when you perform the split, you end up with a model that has only been trained on certain types of examples and never others. Such a scenario could result in poor model performance related to issues with model fit.
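
The ordering problem described above can be reproduced in a few lines. This is a minimal sketch on a made-up toy dataset (the arrays `X_ordered` and `y_ordered` are illustrative and not part of this lesson's data), sorted so that all rows of one class come first:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: 100 rows sorted by class label (50 zeros, then 50 ones).
X_ordered = np.arange(100).reshape(-1, 1)
y_ordered = np.array([0] * 50 + [1] * 50)

# Without shuffling, the last 25% of the rows become the test set.
_, _, y_train_plain, y_test_plain = train_test_split(
    X_ordered, y_ordered, test_size=0.25, shuffle=False
)

# With shuffling (the default), rows are drawn at random for each side of the split.
_, _, y_train_shuf, y_test_shuf = train_test_split(
    X_ordered, y_ordered, test_size=0.25, shuffle=True, random_state=0
)

print("classes in unshuffled test set:", np.unique(y_test_plain))
print("classes in shuffled test set:  ", np.unique(y_test_shuf))
```

Without shuffling, the test set here is drawn entirely from the tail of the sorted data, so any score computed on it says nothing about the other class.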

## Model Fit

In statistics, **fit** refers to how well you are able to approximate a given function. The term carries over naturally to ML, since supervised learning algorithms seek to approximate the underlying mapping from input (independent variables) to output (dependent variables). In ML, **model fit** is a measure of how well a predictive model generalizes to data similar to that on which it was trained. The fit of an ML model may be **balanced**, **overfit**, or **underfit**.

<img style="margin: 15px 15px 15px 15px;" src="../img/modelfit.png" width=60%>

### Balanced Fit
An ML model with _good_ or _balanced_ fit produces accurate outcomes on both the training data and unseen data. Model fitting is the essence of ML: if your model doesn’t fit your data correctly, the outcomes it produces will not be practically useful. A well-fitted model has learned parameters that capture the relationships between the input and the output, allowing it to find relevant insights or make accurate predictions. Two common explanations for poor performance in predictive ML models are overfitting and underfitting.

### Overfitting

Overfitting is when a predictive model trains "too well" and has been fitted too closely to the training set. This happens when the model captures noise in the data instead of, or in addition to, the underlying data pattern. In other words, the model begins memorizing during training and is therefore unable to generalize afterwards. As a result, an overfit ML model tends to be very accurate on the training data, but performs poorly against unseen data, thereby defeating its purpose. Overfitting may occur if the model is too complex or if it trains for too long on the sample data. Signals of overfitting are:
* Very few errors in the model's prediction when compared to the training data - **low bias**
* Great sensitivity to fluctuations in the training data - **high variance**

### Underfitting

Underfitting is when a predictive model trains "too little" and therefore misses the trends in the data. This occurs when the model is unable to capture the underlying data pattern, and usually happens when you have too little data or a very simple model. Underfitting can also happen if you are trying to fit a linear model to non-linear data. Like overfit models, underfit models fail to generalize to unseen data. However, in practice, underfitting is not as prevalent as overfitting.  Nevertheless, you'll want to avoid both of these problems.  Signals of underfitting are:

* Many errors in the model's prediction when compared to the training data - **high bias**
* Little sensitivity to fluctuations in the training data - **low variance**

### Bias-Variance Tradeoff

If your model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if your model has a large number of parameters, it is likely to have high variance and low bias. The goal is therefore to find a good balance that neither overfits nor underfits the data.
Since an ML model can’t be more complex and less complex at the same time, reducing one kind of error tends to increase the other. This tension is known as the **bias-variance tradeoff**. To build a good model, you need to find a balance between bias and variance that minimizes the total error.
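
One way to see the tradeoff is to fit models of increasing complexity to the same data and compare training error with test error. The sketch below is illustrative only: the noisy sine data, the polynomial degrees, and the helper `mean_squared_error_of` are assumptions made for this example, not part of the lesson's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 30 noisy samples of a sine wave.
x_all = np.sort(rng.uniform(0, 1, 30))
y_all = np.sin(2 * np.pi * x_all) + rng.normal(0, 0.3, size=x_all.size)

# Alternate points between the training set and the test set.
x_fit, y_fit = x_all[::2], y_all[::2]
x_held, y_held = x_all[1::2], y_all[1::2]

def mean_squared_error_of(coeffs, xs, ys):
    """Mean squared error of a polynomial (given by its coefficients) on (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

train_errors, test_errors = {}, {}
for degree in (1, 4, 10):
    # Higher degree means more parameters, i.e. a more complex model.
    coeffs = np.polyfit(x_fit, y_fit, degree)
    train_errors[degree] = mean_squared_error_of(coeffs, x_fit, y_fit)
    test_errors[degree] = mean_squared_error_of(coeffs, x_held, y_held)
    print(f"degree {degree:2d}: train MSE {train_errors[degree]:.3f}, "
          f"test MSE {test_errors[degree]:.3f}")
```

Training error can only go down as the degree grows. Test error typically falls at first and then rises again once the polynomial starts chasing noise, which is the pattern the tradeoff describes.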

<img style="margin: 15px 15px 15px 15px;" src="../img/biasvariancetradeoff.png" width=40%>

## Introduction to Cross Validation

In order to flag problems like overfitting or [selection bias](https://en.wikipedia.org/wiki/Selection_bias), you can leverage **cross validation**. The goal of cross-validation is to test the model's ability to predict new data that was not used in estimating it and to provide insight into how the model will generalize to unseen data, for example, in the real world. There are a number of cross validation techniques: 

* K-Fold Cross Validation
* Leave One Out Cross Validation (LOOCV)
* Leave P-Out Cross Validation (LpOCV)
* Stratified K-Fold Cross Validation
* Repeated K-Fold Cross Validation
* Nested K-Fold Cross Validation

### K-Fold Cross Validation

This approach to cross-validation involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The procedure is then repeated until every fold has served as the validation set exactly once.

<img style="margin: 15px 15px 15px 15px;" src="../img/kfold.png" width=40%>

#### Algorithm: How It Works

The algorithm for k-fold cross-validation is as follows:

1. Shuffle the dataset randomly
2. Split the dataset into k groups
3. For each unique group:  
    3.1 Take the group as a hold out or test data set  
    3.2 Take the remaining groups as a training data set  
    3.3 Fit a model on the training set and evaluate it on the test set  
    3.4 Retain the evaluation score and discard the model  
4. Summarize the skill of the entire model using the individual model evaluation scores. e.g., average accuracy

Note that each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the hold out set 1 time and used to train the model k-1 times.
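
The steps above can be sketched by hand with NumPy before reaching for the scikit-learn helpers. The function name `manual_k_fold_scores` and the synthetic data are hypothetical and exist only for this illustration; the score is R², the default for scikit-learn regressors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def manual_k_fold_scores(X, y, k=5, seed=0):
    """Return one R^2 score per fold, following the k-fold steps listed above."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))        # 1. shuffle the dataset
    folds = np.array_split(indices, k)       # 2. split it into k groups
    scores = []
    for i, test_idx in enumerate(folds):     # 3. each group is held out once
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
        # The fitted model is discarded here; only the score is retained.
    return np.array(scores)

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

fold_scores = manual_k_fold_scores(X_demo, y_demo, k=5)
print("per-fold R^2:", fold_scores)
print("mean R^2:    ", fold_scores.mean())    # 4. summarize the scores
```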

## Hands-On with K-Fold Cross Validation

In this hands-on activity, you'll use scikit-learn to perform k-fold cross validation.  Since you've mainly been using classification models until now, in this activity you'll get a chance to train a linear regression model and then apply k-fold cross validation as part of the training process.

#### Import and Configure Libraries

In [None]:
# Libraries to Support Organizing and Processing Data: NumPy and Pandas
import pandas as pd
import numpy as np

# Libraries to Support Model Implementation, Training and Validation
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

#### <a id='cross_validation_setup_data'>Prepare Training and Test Data</a>

In [None]:
# Load the Dataset into a Pandas Dataframe and print the first few records
data = pd.read_csv('../data/auto.csv', sep=',')
data.head()

#### Interactive Demo

For this interactive demo, you are going to try to predict `mpg` for an automobile based upon `cylinders` and `weight` alone.

In [None]:
# X represents your independent variables and y is the dependent variable you want to predict.
X = data[['cylinders','weight']]
y = data['mpg']

Split the data into training and test sets using the Scikit-learn [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function.

In [None]:
# Default train-test split ratio of 75% to 25% since no values provided for train_size and test_size.
X_train,X_test,y_train,y_test = train_test_split(X,y)

Now you will set up the `KFold` cross validation module to evaluate the model based on different sets of data from the training set. 

For this example, you are going to use the following configuration to set up cross validation:
* `n_splits = 5`: split the data into 5 folds and run the evaluation 5 times, once per fold.
* `shuffle = True`: create the k-fold sets randomly to avoid introducing bias into the data by taking the data points in some regular order.  
* `random_state = 123`: define a random seed so that you can reproduce your random splits later if you need to.

In [None]:
# Setting up KFold module for 5 splits:
kf = KFold(n_splits=5, shuffle=True, random_state=123)

In [None]:
# Implement the model. In this example, a linear regression model.
lr = LinearRegression()

In [None]:
# Run cross validation using k-folds and display the results.
# With a regressor and no scoring argument, each score is the R^2 on that fold's held-out data.
cv_scores = cross_val_score(lr, X_train, y_train, cv=kf)
cv_scores

The above results are produced by shuffling the training set once and splitting it into 5 folds. For each fold in turn, a fresh model is trained on the other 4 folds, scored on the held-out fold, and then discarded. Because the model is a regressor, each score is an R² value.

#### Try It For Yourself

Now it's your turn. 

Below, write code to run k-fold cross validation on the data with 10 folds, using random shuffling and your own random seed.  Make sure to also display the results. 

In [None]:
# Write and test your solution here...

Finally, you can average all the scores from the 10-fold cross validation to produce an average score for your model as follows:

In [None]:
cv_scores.mean()

#### Recommendations, Pros, and Cons

The average score produced by k-fold cross validation (R² in the regression example above, or accuracy for a classifier) may provide you with greater confidence in how your model would perform using real data, compared with the hold-out validation technique where you use just one training set and one test set. 

**Advantages**: In general, k-fold is preferable to hold-out whenever you can afford the extra training time. In a head-to-head comparison, k-fold cross validation gives a more stable and trustworthy result since training and testing are performed on several different parts of the dataset. Increasing the number of folds tests the model on more sub-datasets, although each test fold then becomes smaller. 

**Disadvantage**: Increasing k results in training more models and the training process might be really expensive and time-consuming.
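
To make the comparison with hold-out concrete, the sketch below scores the same model on several different hold-out splits and then with 5-fold cross validation. The synthetic data and variable names (`X_syn`, `y_syn`, `holdout_scores`, `kfold_scores`) are assumptions for illustration, not the auto dataset used earlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Small, noisy synthetic regression problem (illustrative only).
rng = np.random.default_rng(7)
X_syn = rng.normal(size=(60, 2))
y_syn = 2.0 * X_syn[:, 0] + X_syn[:, 1] + rng.normal(scale=1.0, size=60)

# Hold-out: a single score whose value depends on which rows land in the test set.
holdout_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=seed)
    holdout_scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))

# k-fold: every row is used for testing exactly once, and the scores are averaged.
kfold = KFold(n_splits=5, shuffle=True, random_state=7)
kfold_scores = cross_val_score(LinearRegression(), X_syn, y_syn, cv=kfold)

print("hold-out R^2 for 5 different splits:", np.round(holdout_scores, 3))
print("5-fold mean R^2:", round(kfold_scores.mean(), 3),
      "+/-", round(kfold_scores.std(), 3))
```

The spread across the hold-out runs shows how much a single split can move the reported score. The k-fold summary costs five model fits instead of one.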

## Leave One Out Cross Validation (LOOCV)

LOOCV is an extreme case of k-fold cross validation in which k is equal to n, where n is the number of samples in the dataset. Each fold therefore contains exactly one sample.

<img style="margin: 15px 15px 15px 15px;" src="../img/loocv.jpeg" width=40%>

#### Algorithm: How It Works

The algorithm for LOOCV is:

1. Choose one sample from the dataset which will be the test set
2. The remaining n – 1 samples will be the training set
3. Train the model on the training set. On each iteration, a new model must be trained
4. Validate on the test set
5. Save the result of the validation
6. Repeat steps 1 – 5 n times, choosing a different sample each time, so that n samples yield n different training and test sets
7. To get the final score, average the results that you got in step 5.

#### Library Support

For LOOCV, scikit-learn has a built-in splitter class, `sklearn.model_selection.LeaveOneOut`.

In [None]:
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])
loo = LeaveOneOut()

for train_index, test_index in loo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
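
A `LeaveOneOut` splitter can also be passed to `cross_val_score` through its `cv` argument. The sketch below uses synthetic data (the names `X_loo` and `y_loo` are hypothetical) and scores with mean absolute error, because R² is not well-defined on a test set that contains a single sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic regression data with 20 samples (illustrative only).
rng = np.random.default_rng(1)
X_loo = rng.normal(size=(20, 2))
y_loo = X_loo[:, 0] + 0.5 * X_loo[:, 1] + rng.normal(scale=0.1, size=20)

loo_cv = LeaveOneOut()
print("number of models to train:", loo_cv.get_n_splits(X_loo))  # one per sample

# scikit-learn reports errors as negative numbers so that "higher is better".
loo_scores = cross_val_score(
    LinearRegression(), X_loo, y_loo, cv=loo_cv, scoring="neg_mean_absolute_error"
)
print("mean absolute error across all left-out samples:", -loo_scores.mean())
```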

#### Recommendations, Pros, and Cons

The data science community has a general rule based on empirical evidence and different research, which suggests that 5- or 10-fold cross-validation should be preferred over LOOCV.  But why? As with most techniques, you are trying to balance trade-offs.

**Advantages**: LOOCV does not waste much data. You use only one sample from the whole dataset as a test set, whereas the rest is the training set. 

**Disadvantages**: When compared with k-Fold cross validation, LOOCV requires building _n_ models instead of _k_ models, where _n_ is the number of samples in the dataset and is usually much higher than _k_. LOOCV is therefore more computationally expensive than k-Fold.



## Leave P-Out Cross Validation (LpOCV)

LpOCV is the general case of LOOCV: instead of leaving one observation out (i.e., p=1), you leave _p_ observations out as validation data and use the remaining data to train the model. This is repeated for every possible way of splitting the original sample into a validation set of p observations and a training set.  Like LOOCV, LpOCV is an exhaustive approach. 

#### Algorithm: How It Works

The algorithm for LpOCV is:

1. Choose p samples from the dataset which will be the test set
2. The remaining n – p samples will be the training set
3. Train the model on the training set. On each iteration, a new model must be trained
4. Validate on the test set
5. Save the result of the validation
6. Repeat steps 1 – 5 **C<sub>p</sub><sup>n</sup>** times, where **C<sub>p</sub><sup>n</sup>** is the number of ways to choose p samples out of n
7. To get the final score average the results that you got on step 5

#### Library Support

You can perform Leave-p-out CV using the class `sklearn.model_selection.LeavePOut`.
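
As a minimal sketch of how quickly the number of splits grows, the example below enumerates every leave-2-out split of a five-sample toy array. The array `X_lpo` is made up for illustration.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

# Toy array with n = 5 samples (illustrative only).
X_lpo = np.arange(10).reshape(5, 2)

lpo = LeavePOut(p=2)
n_lpo_splits = lpo.get_n_splits(X_lpo)
print("number of splits:", n_lpo_splits, "(n choose p =", comb(5, 2), ")")

lpo_splits = list(lpo.split(X_lpo))
for train_index, test_index in lpo_splits:
    print("TRAIN:", train_index, "TEST:", test_index)

# Test sets overlap when p > 1: count how often sample 0 is held out.
times_sample_0_tested = sum(0 in test_index for _, test_index in lpo_splits)
print("sample 0 appears in", times_sample_0_tested, "test sets")
```

With n = 5 and p = 2 there are only 10 splits, but the count grows combinatorially with n, which is why LpOCV is rarely practical on large datasets.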

#### Recommendations, Pros, and Cons

All of the advantages and disadvantages mentioned for LOOCV also apply to LpOCV. However, it is worth mentioning that unlike LOOCV and k-Fold, test sets will overlap for LpOCV if p is higher than 1. A variant of LpOCV with p=2, known as leave-pair-out cross-validation, has been recommended as a nearly unbiased method for estimating the area under the ROC curve (AUC-ROC) of a binary classifier.

## Stratified Cross Validation 

Sometimes you may face a large imbalance of the target value in the dataset. For example, in a dataset concerning wristwatch prices, there might be a larger number of wristwatches having a high price. Stratified cross validation is a variation of the standard k-Fold technique which is designed to be effective in such cases of target imbalance. Stratified k-Fold splits the dataset into k folds such that each fold contains approximately the same percentage of samples of each target class as the complete set. In the case of regression, a stratified split aims to keep the mean target value approximately equal in all the folds.

<img style="margin: 15px 15px 15px 15px;" src="../img/stratified.jpeg" width=40%>

#### Algorithm: How It Works

The algorithm for stratified cross validation:

1. Pick a number of folds – k
2. Split the dataset into k folds. Each fold must contain approximately the same percentage of samples of each target class as the complete set 
3. Choose k – 1 folds which will be the training set. The remaining fold will be the test set
4. Train the model on the training set. On each iteration a new model must be trained
5. Validate on the test set
6. Save the result of the validation
7. Repeat steps 3 – 6 k times. Each time use a different fold as the test set. In the end, you should have validated the model on every fold that you have.
8. To get the final score, average the results that you got in step 6.

Note that the splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. 

#### Library Support

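As you may have noticed, the algorithm for the stratified k-Fold technique is similar to that of standard k-Fold. There is built-in support for this technique in `sklearn.model_selection.StratifiedKFold`, which stratifies on class labels; a continuous regression target would need to be binned into classes first.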
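
The following sketch shows the class proportions that `StratifiedKFold` preserves on an imbalanced toy problem. The arrays `X_strat` and `y_strat` are hypothetical and hold 16 samples of class 0 and 4 of class 1.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 80% class 0, 20% class 1 (illustrative only).
X_strat = np.zeros((20, 1))
y_strat = np.array([0] * 16 + [1] * 4)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

fold_class_counts = []
# Unlike KFold, split() needs the labels so that it can stratify on them.
for train_index, test_index in skf.split(X_strat, y_strat):
    counts = np.bincount(y_strat[test_index], minlength=2)
    fold_class_counts.append(counts.tolist())
    print("test fold class counts [class 0, class 1]:", counts)
```

Each test fold keeps the 80/20 ratio of the full dataset, so no fold is left without minority-class samples.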

#### Recommendations, Pros, and Cons

Everything previously mentioned about k-Fold cross validation is true for stratified k-Folds. When choosing between different k-folds methods, make sure you are using the proper one. For example, you might think that your model performs badly simply because you are using a non-stratified k-Fold to validate a model which was trained on a dataset with a class imbalance. To avoid these types of situations, you should always perform exploratory analysis on your data prior to selecting training and cross validation techniques.

## Repeated Cross Validation

Repeated cross validation, also known as **repeated random sub-sampling cross validation** or **Monte Carlo cross validation**, is probably the most robust of all the techniques described in this training. It is a variation of k-Fold cross validation, but rather than _k_ being the number of folds, it is the number of times you train the model. In other words, the dataset is not split into groups or folds; instead, the test set is a random selection on every iteration. For example, if you decide that 20% of the dataset will be your test set, then 20% of the samples will be randomly selected and the other 80% will become the training set. 



<img style="margin: 15px 15px 15px 15px;" src="../img/repeated.png" width=60%>

#### Algorithm: How It Works

The algorithm for repeated cross validation:

1. Pick k – a number of times the model will be trained
2. Pick a number of samples which will be the test set
3. Split the dataset
4. Train on the training set. On each iteration of cross-validation, a new model must be trained
5. Validate on the test set
6. Save the result of the validation
7. Repeat steps 3-6 k times
8. Get the final score by averaging the results that you got in step 6.
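
In scikit-learn, the `ShuffleSplit` class draws a fresh random train/test partition on each iteration, which matches the steps above. The sketch below uses a made-up ten-sample array (`X_mc`) and counts how often each sample ends up in a test set.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy array with 10 samples (illustrative only).
X_mc = np.arange(20).reshape(10, 2)

# k = 4 iterations, each holding out a random 20% of the samples.
shuffle_split = ShuffleSplit(n_splits=4, test_size=0.2, random_state=0)

test_counts = np.zeros(len(X_mc), dtype=int)
split_sizes = []
for train_index, test_index in shuffle_split.split(X_mc):
    print("TRAIN:", train_index, "TEST:", test_index)
    test_counts[test_index] += 1
    split_sizes.append((len(train_index), len(test_index)))

# Because every split is drawn independently, coverage of the samples is uneven.
print("times each sample was used for testing:", test_counts)
```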

#### Library Support

scikit-learn provides `sklearn.model_selection.ShuffleSplit` for the random sub-sampling procedure described above, where you set the number of iterations `n_splits` and the test proportion `test_size`. A closely related class, `sklearn.model_selection.RepeatedKFold`, instead repeats standard k-fold several times with a different randomization in each repetition; you set the number of folds `n_splits` and the number of repetitions `n_repeats`. The example below uses `RepeatedKFold`.

In [None]:
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=42)

for train_index, test_index in rkf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

#### Recommendations, Pros, and Cons

Repeated random sub-sampling has clear advantages over standard k-folds. However, there are still some disadvantages. 

**Advantages**: 
* The proportion of train/test split is not dependent on the number of iterations. 
* You can set unique proportions for every iteration. 
* Random selection of samples from the dataset makes this technique even more robust to selection bias.

**Disadvantages**:
Standard k-Fold cross validation guarantees that the model will be tested on all samples, whereas repeated random sub-sampling draws each test set independently at random.  This means that some samples may never be selected and used in the test set at all, while other samples may be selected multiple times.

## Nested Cross Validation

Recall that k-fold cross-validation is used to estimate the performance of machine learning models when making predictions on data not used during training. Such a technique can be used both when comparing and selecting a model for the dataset, or when optimizing the hyperparameters of a selected model. When the same cross-validation technique and dataset are used to both select and tune a model, it can lead to an optimistically biased evaluation of model performance. One approach to overcoming this bias is to nest the hyperparameter optimization procedure under the model selection procedure. This is called nested cross-validation or double cross validation. 


Nested cross validation is a popular and often preferred way to evaluate and compare tuned machine learning models.  It is one of the more advanced topics in this area, and scikit-learn has no single dedicated class for it, so you generally have to compose it yourself from existing pieces, for example by passing a hyperparameter search object such as `GridSearchCV` to `cross_val_score`.
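
A minimal sketch of that composition is shown below, assuming a `Ridge` regressor and a small grid of `alpha` values chosen purely for illustration; the synthetic data and fold counts are likewise assumptions. The inner loop tunes the hyperparameter, and the outer loop scores the whole tuning procedure on data it never saw.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(3)
X_nested = rng.normal(size=(80, 3))
y_nested = X_nested @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.2, size=80)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # hyperparameter tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)  # performance estimate

# Inner loop: for a given training set, pick the best alpha by 3-fold CV.
search = GridSearchCV(
    Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=inner_cv
)

# Outer loop: each outer training fold runs its own search, and the tuned
# model is then scored on the matching outer test fold.
nested_scores = cross_val_score(search, X_nested, y_nested, cv=outer_cv)
print("outer-fold R^2 scores:", nested_scores)
print("nested CV estimate:   ", nested_scores.mean())
```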

For a deep dive into nested cross validation check out [Nested Cross-Validation for Machine Learning with Python](https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/).