# Data Splits for Predictive Model Comparison

We want our predictive models to generalize well to unseen data. To that end we:

* Split our data into **training** and **testing** sets.  
    * We never do anything with the testing set until the **very end of our work** as a final sanity check.
* During model selection we further split our training set using either
    * A single **validation** set or
    * A k-fold **cross-validation** split

## What we will accomplish

In this notebook we will:
- Discuss the rationale for splitting our data set
- Introduce train test splits, validation sets, and cross-validation

In [1]:
## We will now start importing a common set
## of items at the onset of most notebooks
import numpy as np
import matplotlib.pyplot as plt
from seaborn import set_style

set_style("whitegrid")

## Data splits guard against over-fitting

Over-fitting is when a model fits too closely to the data it was trained on and does not generalize to new data as well as it otherwise could.

We will give a more formal presentation in lecture 4 (the "Bias/Variance Tradeoff" notebook) but we need at least an informal understanding immediately.

<img src="lecture_assets/overfit.png"></img>

The second model is over-fitting: we can see that it would not generalize well to new data drawn from the same distribution as our training data.

It was easy to see that we are over-fitting here because the relationship is relatively simple and the data is low dimensional enough that we can visualize it.  When we are dealing with real data we might have hundreds of features, and simple visual checks would not be sufficient.

One of the best ways to guard against over-fitting is to use a **data split**.
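
To make the idea concrete, here is a minimal sketch on synthetic data. It is not the data behind the figure above: the linear trend, the noise level, and the polynomial degrees are all assumptions chosen for illustration. It fits a straight line and a degree-9 polynomial to 12 points, then compares their errors on the fitting data and on fresh draws from the same distribution.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def make_data(n):
    """Synthetic data: a linear trend plus Gaussian noise (assumed toy setup)."""
    x = rng.uniform(-1, 1, size=n)
    y = 2 * x + rng.normal(scale=0.3, size=n)
    return x, y


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


x_fit, y_fit = make_data(12)   # small set used for fitting
x_new, y_new = make_data(200)  # fresh draws from the same distribution

simple = Polynomial.fit(x_fit, y_fit, deg=1)    # straight line
flexible = Polynomial.fit(x_fit, y_fit, deg=9)  # flexible enough to chase the noise

print("MSE on fitting data (deg 1, deg 9):", mse(simple, x_fit, y_fit), mse(flexible, x_fit, y_fit))
print("MSE on new data     (deg 1, deg 9):", mse(simple, x_new, y_new), mse(flexible, x_new, y_new))
```

The more flexible least squares fit can never do worse than the line on the points it was fitted to, so the error on the fitting data cannot reveal over-fitting. The comparison on data the models have not seen is the informative one.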

## Train test splits

The first split we will touch on is the first split you would do in a new data science project: the **train test split**.

The purpose of the train test split is to create two data sets:
1. <b>The training set</b> - This subset is used to fit models and compare model candidates. This data set is usually split further.
2. <b>The testing set</b> - This subset is used as a final check on your selected model prior to putting your model into its desired final state.

The training set usually contains the majority of the original data. Common train test split percentage divisions are $80\% - 20\%$ or $75\% - 25\%$, but it may sometimes be appropriate to use different split sizes. Train test splits are done randomly, with the form of randomness dependent upon your project.

Here is an illustration of a train test split:

<img src="lecture_assets/train_test.png" width="40%"></img>

<b>IMPORTANT:  The test set is not directly used to compare models</b>

Model comparison is typically done using further splits of the training set. 

It is embarrassing and costly to ship a product which doesn't perform well on novel data.  The test set serves as a **final sanity check** on your work before sending it out into the world.

### Performing train test splits in `sklearn`

The `sklearn` package has a useful `train_test_split` function that will perform the train test split. Here is a link to the documentation:

 <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html</a>

In [2]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
abalone = fetch_ucirepo(id=1)

# data (as pandas dataframes)
X = abalone.data.features
y = abalone.data.targets

continuous_features = [
    "Length",
    "Diameter",
    "Height",
    "Whole_weight",
    "Shucked_weight",
    "Viscera_weight",
    "Shell_weight",
]

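## The UCI data stores these measurements divided by 200,
## so we multiply to recover the original units
X.loc[:, continuous_features] *= 200
## Age in years is the number of rings plus 1.5
y += 1.5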

In [3]:
## Now we import train_test_split
from sklearn.model_selection import train_test_split

In [4]:
## Here we make the split
## train_test_split returns 4 outputs: X_train, X_test, y_train and y_test
##
## First you input the X and y for your data
##
## then set the shuffle argument to True, which randomly shuffles the
## data before it is split
##
## The random_state ensures that the random split is the same each time
## someone runs the code chunk; it can be any non-negative integer
##
## You can specify the size of the test set with test_size,
## here I want 20% of the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=True, random_state=440, test_size=0.2
)

In [5]:
## check the data lengths to see that they match
## what we'd expect
print("The shape of X_train is", X_train.shape)
print("The shape of X_test is", X_test.shape)
print("The length of y_train is", len(y_train))
print("The length of y_test is", len(y_test))

The shape of X_train is (3341, 8)
The shape of X_test is (836, 8)
The length of y_train is 3341
The length of y_test is 836
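
The sizes above are consistent with rounding a fractional `test_size` up: the data has $3341 + 836 = 4177$ rows, and $20\%$ of $4177$ is $835.4$. The following is a conceptual sketch of a shuffled split, not `sklearn`'s implementation. The helper `manual_train_test_split` is hypothetical, and it will not select the same rows as `train_test_split`, even with the same seed.

```python
import math

import numpy as np


def manual_train_test_split(n_samples, test_size, seed):
    """Return (train_idx, test_idx) for a shuffled split.

    Hypothetical helper for illustration only; it does not reproduce the
    exact rows chosen by sklearn's train_test_split.
    """
    n_test = math.ceil(test_size * n_samples)  # fractional test sizes round up
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)      # random ordering of row positions
    return shuffled[n_test:], shuffled[:n_test]


train_idx, test_idx = manual_train_test_split(n_samples=4177, test_size=0.2, seed=440)
print(len(train_idx), len(test_idx))  # 3341 836
```

Every row lands in exactly one of the two index sets, which is the property that makes the test set a fair stand-in for unseen data.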


## Two split types for model comparison and selection

We will now cover two data splits you can make from the training set for model comparison purposes. Which you choose depends upon the project you are working on, but we will give some reasons to choose one over the other below.

### Validation sets

A <i>validation set</i> is a subset of the training data (the training set produced by the train test split above) used solely for the purpose of comparing candidate models. This split is typically also performed randomly. The validation set should be a small subset: common sizes range from $10\%$ to $25\%$ of the training set, depending on the training set size. An illustration of this concept is given below:

<img src="lecture_assets/validation_set.png" width="45%"></img>

The best model in this setting would be the one that has the best performance metric on the validation set.

#### In practice

In practice we can once again use `sklearn`'s `train_test_split` function to make the validation split. Note that it is good practice to not overwrite the original `X_train` or `y_train` sets when making the validation split.

In [6]:
## Here we make a validation set with 15% of the
## training data in the validation set
X_train_train, X_val, y_train_train, y_val = train_test_split(
    X_train, y_train, shuffle=True, random_state=321, test_size=0.15
)

In [7]:
print("15% of", len(X_train), "is", 0.15 * len(X_train))

15% of 3341 is 501.15


In [8]:
print("Shape of X_train_train", X_train_train.shape)
print("Shape of X_val", X_val.shape)
print("Length of y_train_train", len(y_train_train))
print("Length of y_val", len(y_val))

Shape of X_train_train (2839, 8)
Shape of X_val (502, 8)
Length of y_train_train 2839
Length of y_val 502
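
Once the validation split exists, comparing candidate models amounts to fitting each one on the reduced training set and scoring it on the validation set. Here is a minimal sketch of that workflow. The data is synthetic and the two candidates ("first column only" versus "all columns") are assumptions made for illustration, not the abalone models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in for a training set: y depends on all three columns.
X_tr = rng.normal(size=(600, 3))
y_tr = 2 * X_tr[:, 0] + 3 * X_tr[:, 1] - X_tr[:, 2] + rng.normal(scale=0.5, size=600)

# Carve a validation set out of the training data, keeping the originals intact.
X_tt, X_val, y_tt, y_val = train_test_split(
    X_tr, y_tr, shuffle=True, random_state=321, test_size=0.15
)

# Each candidate model is described by the columns it is allowed to use.
candidates = {
    "first column only": [0],
    "all columns": [0, 1, 2],
}

val_rmse = {}
for name, cols in candidates.items():
    model = LinearRegression().fit(X_tt[:, cols], y_tt)
    preds = model.predict(X_val[:, cols])
    val_rmse[name] = float(np.sqrt(mean_squared_error(y_val, preds)))

# The candidate with the lowest validation RMSE is the one we would select.
best = min(val_rmse, key=val_rmse.get)
print(val_rmse, "->", best)
```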


### $k$-Fold cross-validation

The validation set approach only gives us one check on how well our model generalizes.  We might get unusually lucky or unlucky with this one check.  $k$-fold cross-validation gives us $k$ opportunities to see how well our model will generalize instead of just one.

<img src="lecture_assets/cv1.png" width="60%"></img>

Common values for $k$ are between $5$ and $10$.  "Leave-one-out" cross-validation is another strategy: each sample is held out once while the model is trained on the remaining $n-1$ samples, which is equivalent to taking $k = n$, where $n$ is the number of samples in your training data.
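
A small check of the fold count for leave-one-out: each of the $n$ samples is held out exactly once, so there are $n$ folds, each trained on $n-1$ samples. The sketch below uses a toy array of six samples (an assumption for illustration) and compares `sklearn`'s `LeaveOneOut` splitter with an unshuffled `KFold` that has `n_splits` equal to the number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_small = np.arange(12).reshape(6, 2)  # toy data with n = 6 samples
n = len(X_small)

loo_splits = list(LeaveOneOut().split(X_small))
kfold_n_splits = list(KFold(n_splits=n).split(X_small))  # no shuffling

print("leave-one-out folds:", len(loo_splits))
for (loo_train, loo_test), (kf_train, kf_test) in zip(loo_splits, kfold_n_splits):
    # Each fold holds out exactly one sample and trains on the other n - 1.
    assert len(loo_test) == 1 and len(loo_train) == n - 1
    # The two splitters produce identical folds.
    assert np.array_equal(loo_test, kf_test)
    assert np.array_equal(loo_train, kf_train)
```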

You can implement cross-validation using `sklearn`'s `KFold` object. Documentation for this class can be found here: <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html</a>.

In [9]:
## import KFold
from sklearn.model_selection import KFold

In [10]:
## make a KFold object
## n_splits controls the value of k
## shuffle=True, randomly shuffles the data prior to splitting
## random_state is the same as for train_test_split
kfold = KFold(n_splits=5, shuffle=True, random_state=582)

In [11]:
## demonstrate .split
kfold.split(X_train, y_train)

<generator object _BaseKFold.split at 0x16928ef00>

Side note on generators:  Notice that `kfold.split` returns a generator object.  If you are not familiar with generators, you can think of one as being similar to a list, except that instead of storing all of its elements in memory it produces them one at a time: it keeps only the current element and a rule for getting the next one.

`KFold` is implemented this way to avoid memory issues when you use a large number of splits.  For instance, if a leave-one-out split of a dataset of size $10000$ were stored as a list, the training indices alone would take up $10000 \times 9999$ entries.  With a generator, at each stage you only need to keep one set of $10000$ indices in memory, along with which element to leave out next.
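
The lazy behavior described above can be sketched in plain Python. The function `leave_one_out_indices` is a hypothetical illustration of the idea, not how `KFold` is implemented internally.

```python
def leave_one_out_indices(n):
    """Lazily yield (train, test) index lists, one fold at a time."""
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]
        yield train, [held_out]


splits = leave_one_out_indices(4)        # nothing is computed yet
first_train, first_test = next(splits)   # only now is the first fold built
print(first_train, first_test)           # [1, 2, 3] [0]

remaining = list(splits)  # consuming the generator builds the other folds
print(len(remaining))     # 3
```

One practical consequence: a generator can only be iterated over once. To loop over the folds a second time, call `kfold.split` again to get a fresh generator.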

In [12]:
## use for loop to demonstrate .split
for train_index, test_index in kfold.split(X_train, y_train):
    print("Train index:", train_index)
    print("Test index:", test_index)
    print()
    print()

Train index: [   0    2    4 ... 3337 3339 3340]
Test index: [   1    3   18   19   40   41   42   44   46   47   48   52   59   64
   65   72   78   79   80   91   96  114  118  122  123  128  129  141
  142  144  152  156  160  167  168  172  176  183  184  187  193  199
  214  218  225  232  235  247  256  264  267  270  275  283  286  287
  304  311  320  321  322  324  325  326  336  341  349  352  354  369
  370  371  373  383  385  389  390  397  403  410  413  416  431  442
  444  445  449  450  454  464  465  467  476  486  487  491  495  496
  500  504  508  518  529  535  536  538  539  544  549  552  553  565
  567  568  573  575  583  595  607  613  618  619  622  625  631  633
  638  643  644  648  649  651  654  659  663  672  673  674  677  679
  682  685  692  705  720  727  729  732  737  747  769  773  780  783
  785  791  796  797  800  801  815  820  822  831  834  836  838  840
  841  842  845  846  851  860  863  872  875  883  885  889  895  903
  910  912  913 

### Choosing between our two models using cross-validation

We have two predictive models of abalone age:  one which uses only length as a predictor, and one which uses all of the continuous predictors.  Let's assess which of these two models generalizes better to unseen data.

In [13]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

slr = LinearRegression()
mlr = LinearRegression()

# rmses will hold the cross-validation root mean squared errors of each model:
# one row per model, one column per fold.
rmses = np.zeros((2, kfold.get_n_splits()))

for i, (train_index, test_index) in enumerate(kfold.split(X_train, y_train)):
    ## get the kfold training data
    X_train_train = X_train.iloc[train_index, :]
    y_train_train = y_train.iloc[train_index]

    ## get the holdout data
    X_holdout = X_train.iloc[test_index, :]
    y_holdout = y_train.iloc[test_index]

    ## Fit both models
    slr.fit(X_train_train[["Length"]], y_train_train)
    mlr.fit(X_train_train[continuous_features], y_train_train)

    ## Use both models to generate predictions on the holdout set
    slr_preds = slr.predict(X_holdout[["Length"]])
    mlr_preds = mlr.predict(X_holdout[continuous_features])

    ## Record the rmses
    rmses[0, i] = root_mean_squared_error(y_holdout, slr_preds)
    rmses[1, i] = root_mean_squared_error(y_holdout, mlr_preds)

In [14]:
rmses

array([[2.73941346, 2.63849335, 2.50858008, 2.86749772, 2.60895888],
       [2.46317672, 2.16690161, 2.07821673, 2.39625422, 2.18157873]])

In [15]:
rmses.mean(axis=1)

array([2.6725887, 2.2572256])

We can see that the multiple linear regression model generalizes better: its mean cross-validation RMSE is about $2.26$, compared with about $2.67$ for the model that uses only length.  Depending on our use case this improvement may, or may not, be worth it.  For example, it is easier for an abalone diver to measure one length than to collect all $7$ features used by our multiple linear regression model.
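
For a single model, the manual loop can also be written more compactly with `sklearn`'s `cross_val_score`. The sketch below uses synthetic data (an assumption, so that it runs on its own) and checks that the per-fold RMSEs from a hand-written loop match those returned by `cross_val_score` with the `"neg_root_mean_squared_error"` scorer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 3))  # synthetic features (assumption)
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.4, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=582)

# Manual loop: fit on each training fold, score on the matching holdout fold.
manual = []
for train_index, holdout_index in cv.split(X_demo):
    model = LinearRegression().fit(X_demo[train_index], y_demo[train_index])
    residuals = y_demo[holdout_index] - model.predict(X_demo[holdout_index])
    manual.append(np.sqrt(np.mean(residuals**2)))
manual = np.array(manual)

# The same computation in one call. sklearn scorers follow a
# "higher is better" convention, so the RMSE comes back negated.
scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
)
auto = -scores

print(manual.round(3))
print(auto.round(3))
```

Writing the loop by hand, as in the notebook, is still useful when you want to fit several models on exactly the same folds or need the intermediate predictions.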

If we were done modeling we could then do our "final sanity check" on the testing data:  we would retrain our chosen model on the entire training set and then evaluate it on the test set.  In this case I want to do some more modeling with the same dataset in the coming weeks, so we will hold off on evaluating the model on the testing data for now.
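
For reference, the final check would follow the pattern sketched below. To leave the abalone test set untouched, this sketch runs on synthetic stand-in data; the data, the variable names, and the choice of a plain linear regression as the "selected" model are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X_all = rng.normal(size=(500, 2))  # synthetic stand-in data (assumption)
y_all = 1.5 * X_all[:, 0] - X_all[:, 1] + rng.normal(scale=0.5, size=500)

# The split made once, at the very start of the project.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, shuffle=True, random_state=440, test_size=0.2
)

# ... model comparison happens here, using X_tr and y_tr only ...

# Final step: refit the chosen model on the whole training set,
# then score it a single time on the untouched test set.
final_model = LinearRegression().fit(X_tr, y_tr)
test_rmse = float(np.sqrt(mean_squared_error(y_te, final_model.predict(X_te))))
print("test RMSE:", round(test_rmse, 3))
```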

### Validation set or cross-validation

Cross-validation, when feasible, is preferred to a single validation set. In general it is better to have a collection of estimates of the generalization error instead of a single point estimate.

However, it is not always feasible to perform cross-validation. Models that take prohibitively long to train limit the usefulness of cross-validation since $k$-fold cross-validation requires you to train the model $k$ distinct times.

### Additional Considerations

It is very rare that a simple `train_test_split` or `KFold` will be the appropriate data split for your problem. You need to think carefully about exactly what information you will have access to at prediction time, and what new situations you would like to be able to generalize to.  You need to make sure that your split respects those considerations.

- **Geographic leakage**  
  - Nearby or correlated spatial units split across train/test.  
  - Example: adjacent houses, stores in the same shopping center, patients from the same hospital.  
  - Tools: manual grouping, [`GroupKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html)
    - `GroupKFold` alone is often insufficient:  you may need custom splitters or a splitting pipeline (e.g. first use spatial clustering techniques to assign groups, then use `GroupKFold`).
- **Temporal leakage**  
  - Future information leaking into past predictions.  
  - Example: using 2023 features to predict 2022 outcomes.  
  - Tools: [`TimeSeriesSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html)
    - `TimeSeriesSplit` is often not the correct structure and you will need to write a custom splitter.  For example, if data only becomes available one week before predictions are needed it is inappropriate to train on all data up to the prediction time.
- **Group-level leakage**  
  - Data about the same entity split across train/test.  
  - Example: Suppose you want to predict how a new student will perform on a standardized test at a new school. Your rows are individual student test events. If you use `train_test_split` or `KFold`, students from the same school end up in both train and test sets, so you never measure generalization to unseen schools. The fix is to split at the school level.
  - Tools: [`GroupKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html), [`GroupShuffleSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupShuffleSplit.html), [`LeaveOneGroupOut`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneGroupOut.html)
- **Feature leakage**  
  - Variables that encode target information directly.
- **Post-outcome leakage**  
  - Including features that were only known after the prediction time.  
  - No direct tool; requires domain awareness.
- **Overfitting the split**  
  - Repeatedly adjusting models or features to perform well on a single validation/test split risks overfitting that particular partition rather than learning patterns that generalize.  
  - Mitigation: use multiple splits (e.g., cross-validation, nested CV).  Always keep a truly final holdout set untouched until the very end.
  - You should only ever evaluate **one model** on the final test set: it isn't another validation set.
- **IMPORTANT NOTE**
  - In many cases, the built-in splitters won’t match your problem. You may need to implement a custom splitting strategy that explicitly mirrors the situations where you expect the model to generalize. Think carefully about the real deployment setting, and design your split to reflect it.
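
As a starting point, here is a minimal sketch of two of the built-in splitters named above, run on toy data. The school labels and the assumption that rows are sorted from oldest to newest are made up for illustration. As the notes above stress, real problems often need a custom splitter instead.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Group-level leakage: 12 student rows from 4 hypothetical schools.
schools = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"])
X_rows = np.arange(len(schools)).reshape(-1, 1)  # placeholder features

group_cv = GroupKFold(n_splits=4)
for train_index, test_index in group_cv.split(X_rows, groups=schools):
    train_schools = set(schools[train_index])
    test_schools = set(schools[test_index])
    # No school appears on both sides of the split.
    assert train_schools.isdisjoint(test_schools)
    print("held-out school(s):", sorted(test_schools))

# Temporal leakage: rows assumed to be sorted from oldest to newest.
X_time = np.arange(10).reshape(-1, 1)
time_cv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in time_cv.split(X_time):
    # Every training row comes before every test row.
    assert train_index.max() < test_index.min()
    print("train up to row", train_index.max(), "-> test rows", test_index)
```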

--------------------------

This notebook was written for the Erd&#337;s Institute C&#337;de Data Science Boot Camp by Matthew Osborne, Ph. D., 2023.  Modified by Steven Gubkin 2024.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).