# Data Splits

In data science and machine learning, data is our lifeblood.
It is at the center of pretty much everything we do.
So, this means that we usually want to use as much data as we can get.
However, we can actually make better (more robust) models if we partition our data and don't use all of it at once.
We call this "splitting" (or creating "splits" in) our data.
In this exercise, we will cover the basics of splitting a tabular dataset.

In [None]:
# Imports we will need for this example.

import numpy
import pandas
import sklearn.base
import sklearn.datasets
import sklearn.model_selection

## Motivation

Consider the following two-dimensional data where each point is labeled as either red or blue.

<center><img src="hypothesis-data.png" width='400px' style="background-color: white"/></center>

Assume that we have a class of hypotheses $ H $ that uses a function (or multiple functions) for a curve/line to classify the points.
Everything above the function is considered red and everything else is blue.

Of course we need to learn the exact hypothesis $ h $ we will use from $ H $.
Let's assume that we don't split our data, and we learn $ h $ from $ H $ using all of our data.
If we do that, then we may end up with a hypothesis represented by the green line.

<center><img src="hypothesis-bad.png" width='400px' style="background-color: white"/></center>

Technically, this is a perfect hypothesis: it scores 100% accuracy.
But looking at it, something looks wrong.
Even as a junior data scientist / machine learner,
this hypothesis should make you feel uneasy.
We are generally looking for patterns in our data, but this hypothesis is so specific that it doesn't look like it is representing any general patterns.
It looks like it just learned about the specific points in this dataset, and not general patterns.
The hypothesis works for these points, but what if we got more points or slightly different points?
The hypothesis seems fragile and would probably fail on slightly different points.
(This general concept is called [overfitting](https://en.wikipedia.org/wiki/Overfitting),
and we will cover it later in this course.)

So not only do we have a bad hypothesis/model,
but now we can't even figure out how bad it is since we have no data other than the points we gave it.
We can't actually evaluate our hypothesis to see how well it does because it already got to see every data point.
That's like giving out all the exact questions on a test and then trying to use that exact test again.
Later on, we will see this problem in more complex models (especially neural nets) that "memorize" data points.
In this case, we need data the model has never seen so we can test to see how well it is actually performing.

To help solve both of these problems (the model getting too specific and memorizing data points),
we can just give our model less data.
Specifically, we can "hold out" part of our data that the model is not allowed to see when training/fitting,
and we can use this data to test/evaluate our model's performance.
Using this tactic, we can hopefully train a new hypothesis $ h $ that looks like the line below:

<center><img src="hypothesis-good.png" width='400px' style="background-color: white"/></center>

Here we can see a much more general model that may miss a few points,
but is clearly not over specializing on the data.
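
As a rough illustration of why held-out data matters, the sketch below builds a deliberately extreme "memorizer" that labels each query point with the label of the closest training point. This memorizer is an assumption made for illustration (it is not the hypothesis class $ H $ from the figures), and the data and seeds are arbitrary choices.

```python
import numpy
import sklearn.datasets
import sklearn.model_selection

# Generate some simple two-feature classification data (arbitrary seed).
features, labels = sklearn.datasets.make_classification(
    n_samples = 200, n_features = 2,
    n_redundant = 0, n_informative = 2,
    random_state = 5, n_clusters_per_class = 1
)

# Hold out half of the data so the memorizer never sees it.
splits = sklearn.model_selection.train_test_split(features, labels,
                                                 test_size = 0.50, random_state = 4)
train_features, test_features, train_labels, test_labels = splits

def memorizer_predict(known_features, known_labels, query_features):
    """Hypothetical helper: copy the label of the closest known point."""
    differences = query_features[:, numpy.newaxis, :] - known_features[numpy.newaxis, :, :]
    distances = numpy.sum(differences ** 2, axis = 2)
    nearest = numpy.argmin(distances, axis = 1)
    return known_labels[nearest]

# Evaluate on the points the memorizer has seen, and on points it has not.
train_predictions = memorizer_predict(train_features, train_labels, train_features)
test_predictions = memorizer_predict(train_features, train_labels, test_features)

train_accuracy = numpy.mean(train_predictions == train_labels)
test_accuracy = numpy.mean(test_predictions == test_labels)

print("Accuracy on seen points: ", train_accuracy)
print("Accuracy on held-out points: ", test_accuracy)
```

The score on the seen points is perfect by construction, so it tells us nothing about the model. Only the held-out score says anything about how the model handles new points.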

## Data Splits

When we partition our data into multiple parts, we call this "splitting" the data,
and each partition of the data is called a "split".
We usually split a dataset into [two or three splits](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets):
 - Train -- Used to fit (**train** the parameters of) your model/hypothesis.
 - Test -- Used to test your final model/hypothesis. These are the numbers that you can report in a paper/publication.
 - Validation -- Used to tune the [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter) of your model. Sometimes left out if you have no hyperparameters.
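
One possible way to produce all three splits is to call scikit-learn's `train_test_split` twice: once to set aside the test split, and once more to carve a validation split out of what remains. The sketch below assumes a 120/40/40 division and arbitrary seeds; neither is prescribed by this exercise.

```python
import sklearn.datasets
import sklearn.model_selection

# Generate 200 simple data points to split (arbitrary seed).
features, labels = sklearn.datasets.make_classification(
    n_samples = 200, n_features = 2,
    n_redundant = 0, n_informative = 2,
    random_state = 5, n_clusters_per_class = 1
)

# First cut: set aside 40 points as the test split.
splits = sklearn.model_selection.train_test_split(features, labels,
                                                 test_size = 40, random_state = 4)
rest_features, test_features, rest_labels, test_labels = splits

# Second cut: take 40 of the remaining 160 points as the validation split.
splits = sklearn.model_selection.train_test_split(rest_features, rest_labels,
                                                 test_size = 40, random_state = 4)
train_features, validation_features, train_labels, validation_labels = splits

print("Train size: ", len(train_features))
print("Validation size: ", len(validation_features))
print("Test size: ", len(test_features))
```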

When you have all three splits available, the proper procedure is shown in the following diagram:

<center><img src="split-usage.png" width="600px" style='background-color: white'/></center>

We want to make sure that no data from the test split is ever shown to the model during training.
When test data does reach the model during training, we call it [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)).

During this course (and many courses), we will follow a more relaxed procedure and usually forgo the validation data.
This makes for an easier workflow where you can replace the validation split in the diagram with our test split.
(When publishing results, the diagram must be followed strictly to prevent any data leakage.)

## Creating Splits

Creating splits from tabular data is pretty simple.
We just need to cut up the list/frame into two or three different chunks.
We may also want to shuffle the data before we cut it up.
For non-tabular data we will often use [Snowball sampling](https://en.wikipedia.org/wiki/Snowball_sampling),
but that is outside the scope of this course.
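
A minimal sketch of the shuffle-then-cut idea, using a tiny made-up frame so the result is easy to inspect. The frame contents, the seed, and the 7/3 cut point are all assumptions for illustration.

```python
import numpy
import pandas

# A tiny hypothetical dataset: 10 rows of features and 10 labels.
frame = pandas.DataFrame({'A': numpy.arange(10), 'B': numpy.arange(10) * 2})
labels = numpy.arange(10) % 2

# Build one shuffled ordering of the row positions.
# Using the same ordering for features and labels keeps them aligned.
rng = numpy.random.default_rng(4)
order = rng.permutation(len(frame))

# Cut the shuffled ordering into two chunks.
cut = 7
train_features = frame.iloc[order[:cut]]
train_labels = labels[order[:cut]]
test_features = frame.iloc[order[cut:]]
test_labels = labels[order[cut:]]

print("Train rows: ", list(train_features.index))
print("Test rows: ", list(test_features.index))
```

Every row ends up in exactly one of the two chunks, and each label travels with its row.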

Let's create some test data that we can split!
For this example, we will be generating some fake data using [sklearn.datasets.make_classification](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html).
This function is useful for quickly generating some classification data.
The data will be pretty simple, but works well as a starting point.

In [None]:
# n_samples = 200 -- Make 200 data points.
# n_features = 2 -- Generate two feature columns (perfect for plotting).
# n_redundant = 0 -- No redundant features (features with the same information as other features).
# n_informative = 2 -- Make our two features useful (and not just random).
# random_state = 5 -- The seed for the random number generator.
#                     The exact number doesn't matter, the same seed will generate the same data.
# n_clusters_per_class = 1 -- Make the data simple.
all_features, all_labels = sklearn.datasets.make_classification(
    n_samples = 200, n_features = 2,
    n_redundant = 0, n_informative = 2,
    random_state = 5, n_clusters_per_class = 1
)

# Turn the features into a frame; the labels can stay as a numpy array.
all_features = pandas.DataFrame(all_features, columns = ['A', 'B'])

print(all_features[0:10])
print('---')
print(all_labels[0:10])

Now that we have some data, we need to split it.

We can start really simply and just split it using standard Python operators.
Specifically, Python's slicing syntax works well here:

In [None]:
# Split the data into even splits (100 train, 100 test).

# Take the first 100 points for training.
train_features = all_features[:100]
train_labels = all_labels[:100]

# Take the last 100 points for testing.
test_features = all_features[100:]
test_labels = all_labels[100:]

print("Training set size: ", len(train_features))
print(train_features[0:5])
print(train_labels[0:5])
print('---')
print("Test set size: ", len(test_features))
print(test_features[0:5])
print(test_labels[0:5])

We can also use scikit-learn's built-in function for creating splits,
[sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html):

In [None]:
# test_size = 0.50 -- Use 50% of the data for test. We can also give a fixed number instead of a ratio.
# random_state = 4 -- The seed for the random number generator.
#                     The exact number doesn't matter; the same seed will generate the same split.
#                     By default the data is shuffled before splitting, which is where the randomness comes in.
splits = sklearn.model_selection.train_test_split(all_features, all_labels,
                                                 test_size = 0.50, random_state = 4)

# The return from sklearn.model_selection.train_test_split contains all the split data,
# this could have been done on the above line, but it would have made the line very long.
train_features, test_features, train_labels, test_labels = splits

print("Training set size: ", len(train_features))
print(train_features[0:5])
print(train_labels[0:5])
print('---')
print("Test set size: ", len(test_features))
print(test_features[0:5])
print(test_labels[0:5])
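
As noted in the comments above, `test_size` can also be a fixed number of rows, and reusing a seed reproduces the same split. The sketch below checks both on freshly generated data; the specific count (50) and seed are arbitrary choices.

```python
import numpy
import sklearn.datasets
import sklearn.model_selection

# Generate 200 simple data points to split (arbitrary seed).
features, labels = sklearn.datasets.make_classification(
    n_samples = 200, n_features = 2,
    n_redundant = 0, n_informative = 2,
    random_state = 5, n_clusters_per_class = 1
)

# test_size = 50 -- Ask for exactly 50 test rows instead of a ratio.
first = sklearn.model_selection.train_test_split(features, labels,
                                                test_size = 50, random_state = 4)

# Split again with the same seed.
second = sklearn.model_selection.train_test_split(features, labels,
                                                 test_size = 50, random_state = 4)

# Each result is ordered as: train features, test features, train labels, test labels.
print("Train size: ", len(first[0]))
print("Test size: ", len(first[1]))
print("Same test rows with the same seed: ", numpy.array_equal(first[1], second[1]))
```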

## K-Fold Cross-Validation

Thus far, we have discussed splitting the data once and using the resulting splits.
But what if we are unlucky (or lucky depending on how you look at it) and our test set only has easy examples?
Or our training set happens to be the best possible training set that we can't expect to see in the real world?
To minimize the chance of this happening, we generally like to use multiple sets of data splits.
There are many ways multiple splittings of the data can be used,
but the most common is probably [k-fold cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation).

In k-fold cross validation, we split our data into k (usually 5 or 10) evenly-sized chunks of data that we call **folds**.
One of these folds is assigned to be the test split (also called the **holdout**) and the rest are assigned to be the training split.
(You can also use one or more of the folds for a validation set if you need it.)
Therefore in 10-fold, 10% of the data is test and 90% is training.
After the model is trained on the training split and evaluated on the test split,
the procedure is repeated with the next fold as the test split and the old test fold returned to the training data.
This is repeated k times (so each fold gets a chance to be the test split) and the results are averaged.

<center><img src="k-fold.png" width="600px"/></center>
<center style='font-size: small'>Image from <a href='http://karlrosaen.com/ml/learning-log/2016-06-20/'>Karl Rosaen </a>.</center>

Not only does this help reduce the randomness of picking good/bad splits,
but it also allows us to compute standard deviation and variance on the evaluation scores so we can get an idea how much the data affects our model.
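
The fold rotation described above can be sketched in a few lines of numpy. The helper name `k_fold_indexes` and its parameters are hypothetical (this is not scikit-learn's implementation), and it only produces row positions; training and scoring a model on each pair is left out.

```python
import numpy

def k_fold_indexes(num_rows, k, seed = None):
    """
    Hypothetical helper: yield one (train_indexes, test_indexes) pair per fold.
    Each fold takes one turn as the test split while the rest form the training split.
    """

    indexes = numpy.arange(num_rows)

    # Optionally shuffle the row positions before cutting them into folds.
    if (seed is not None):
        rng = numpy.random.default_rng(seed)
        rng.shuffle(indexes)

    # Cut the row positions into k (nearly) evenly-sized folds.
    folds = numpy.array_split(indexes, k)

    for i in range(k):
        test_indexes = folds[i]
        train_indexes = numpy.concatenate([folds[j] for j in range(k) if j != i])
        yield train_indexes, test_indexes

# 20 rows in 5 folds: each run trains on 16 rows and tests on 4.
for (train_indexes, test_indexes) in k_fold_indexes(20, 5, seed = 4):
    print("Train size: ", len(train_indexes), "Test size: ", len(test_indexes))
```

With a real model, each pair would be used to fit on the training rows and score on the test rows, and the k scores would then be averaged.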

Be aware that there are several variations that you can do on k-fold cross validation including (but not limited to):
 - Using one split for validation.
 - Instead of splitting your entire dataset, just split the training split and use the original test split as a final evaluation.
 - Use multiple splits for test data.

The concept of k-fold cross-validation is pretty simple,
and I am sure you can easily implement your own function to do it.
Additionally, you can also use the implementation provided by scikit-learn:
[sklearn.model_selection.cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html).
Note that this function has a lot of parameters and is fairly complex.
This is because it can do many of the k-fold variations that we mentioned earlier (as well as other cross-validation methods that are not k-fold).

To use scikit-learn's function, you need to have a scikit-learn estimator,
which we have not covered yet in this course (we will).
But we can still make a simple class to implement a super-simple threshold-based hypothesis that just predicts a label of 1 if the 'A' feature is greater than a threshold
(which defaults to 0.0).

In [None]:
# Make a scikit-learn estimator that predicts using a fixed threshold on the 'A' feature.
class SuperSimpleHypothesis(sklearn.base.BaseEstimator):
    def __init__(self):
        self.threshold = 0.0

    def fit(self, features, labels):
        # Don't learn anything.
        # We will be using this later, but for now we just want a super simple hypothesis.
        # Returning self follows the scikit-learn convention for fit().
        return self

    def predict(self, features):
        labels = []
        
        for (_, row) in features.iterrows():
            label = 0
            if (row['A'] > self.threshold):
                label = 1

            labels.append(label)
            
        return labels

Now that we have a class we can use for our hypothesis, we can do cross-validation on it:

In [None]:
scores = sklearn.model_selection.cross_validate(SuperSimpleHypothesis(), all_features, all_labels, scoring = 'accuracy')['test_score']

print("All runs: ", list(scores))
print("Mean Score: ", numpy.mean(scores), "Standard Deviation: ", numpy.std(scores))

We can see the scores we got for each of the 5 runs,
and we can also compute the mean and standard deviation to get a set of reliable numbers to report the performance of our model.
Here we see a decently large range of results, from 92.5% all the way to 100% accuracy.
Here we can see the luck in randomized splits playing out.
By running over multiple splits, we can get a more realistic view of our model's performance.
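
The call above relies on the default of 5 folds. When an integer (or nothing) is passed for `cv` and the estimator is not a scikit-learn classifier, the folds are taken in row order without shuffling. To control the number of folds and the shuffling yourself, one option is to pass a `sklearn.model_selection.KFold` object as `cv`. The sketch below is self-contained, so it defines a compact stand-in estimator (`ThresholdHypothesis`, a hypothetical name); the 10 folds and the seeds are arbitrary choices.

```python
import numpy
import pandas
import sklearn.base
import sklearn.datasets
import sklearn.model_selection

# A compact stand-in for a fixed-threshold hypothesis (hypothetical name).
class ThresholdHypothesis(sklearn.base.BaseEstimator):
    def __init__(self, threshold = 0.0):
        self.threshold = threshold

    def fit(self, features, labels):
        # Nothing is learned; the threshold is fixed.
        return self

    def predict(self, features):
        # Predict 1 when the 'A' feature is above the threshold, 0 otherwise.
        return (features['A'] > self.threshold).astype(int).to_numpy()

features, labels = sklearn.datasets.make_classification(
    n_samples = 200, n_features = 2,
    n_redundant = 0, n_informative = 2,
    random_state = 5, n_clusters_per_class = 1
)
features = pandas.DataFrame(features, columns = ['A', 'B'])

# n_splits = 10 -- Use 10 folds instead of the default 5.
# shuffle = True -- Shuffle the rows before cutting them into folds.
# random_state = 4 -- The seed for the shuffle.
folds = sklearn.model_selection.KFold(n_splits = 10, shuffle = True, random_state = 4)

results = sklearn.model_selection.cross_validate(ThresholdHypothesis(), features, labels,
                                                 scoring = 'accuracy', cv = folds)
scores = results['test_score']

print("Number of runs: ", len(scores))
print("Mean Score: ", numpy.mean(scores), "Standard Deviation: ", numpy.std(scores))
```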

Note that our hypothesis was very simple and we didn't even bother learning on our training data.
In the below class, you can see an example of a hypothesis that learns the threshold from the training data (but it is a bit more complicated).

In [None]:
# Make a scikit-learn estimator that learns its threshold on the 'A' feature from the training data.
class LearningHypothesis(sklearn.base.BaseEstimator):
    def __init__(self):
        self.threshold = 0.0
        
    def fit(self, features, labels):
        """
        Get the mean of the positive and negative labels,
        and put the threshold between those means.
        """
        
        zeroes = []
        ones = []

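        for i in range(len(labels)):
            label = labels[i]
            # Use a single positional index to get a scalar (not a one-element Series).
            value = features.iloc[i]['A']
            
            if (label == 0):
                zeroes.append(value)
            else:
                ones.append(value)

        self.threshold = ((numpy.mean(zeroes) + numpy.mean(ones)) / 2.0)

        # Return self to follow the scikit-learn convention for fit().
        return self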

    def predict(self, features):
        labels = []
        
        for (_, row) in features.iterrows():
            label = 0
            if (row['A'] > self.threshold):
                label = 1

            labels.append(label)
            
        return labels

In [None]:
scores = sklearn.model_selection.cross_validate(LearningHypothesis(), all_features, all_labels, scoring = 'accuracy')['test_score']

print("All runs: ", list(scores))
print("Mean Score: ", numpy.mean(scores), "Standard Deviation: ", numpy.std(scores))

By learning, we can increase our score and decrease our standard deviation (which means we have a more robust hypothesis).
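
The explicit loops in the class above keep each step visible, but the same idea can be written more compactly with boolean masks. The sketch below is one possible variant under a hypothetical name (`VectorizedLearningHypothesis`), shown on a tiny made-up frame so the learned threshold is easy to check by hand.

```python
import numpy
import pandas
import sklearn.base

class VectorizedLearningHypothesis(sklearn.base.BaseEstimator):
    def __init__(self):
        self.threshold = 0.0

    def fit(self, features, labels):
        """
        Get the mean 'A' value for each label,
        and put the threshold between those means.
        """

        labels = numpy.asarray(labels)
        values = features['A'].to_numpy()

        zero_mean = values[labels == 0].mean()
        one_mean = values[labels != 0].mean()

        self.threshold = (zero_mean + one_mean) / 2.0
        return self

    def predict(self, features):
        # Predict 1 when the 'A' feature is above the learned threshold, 0 otherwise.
        return (features['A'].to_numpy() > self.threshold).astype(int)

# A tiny made-up dataset: label 0 has a mean 'A' of 0.5, label 1 has a mean 'A' of 3.5.
small_features = pandas.DataFrame({'A': [0.0, 1.0, 3.0, 4.0], 'B': [1.0, 1.0, 1.0, 1.0]})
small_labels = numpy.array([0, 0, 1, 1])

hypothesis = VectorizedLearningHypothesis().fit(small_features, small_labels)

print("Learned threshold: ", hypothesis.threshold)
print("Predictions: ", list(hypothesis.predict(small_features)))
```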