# An Aside on Train Test Splits

We brought this up in notebooks 7 and 8 from Regression, but it is worth repeating as we venture into classification.

Up to this point we've mostly applied train test splits without giving much thought to how we randomly sample.

This may be fine for a number of regression problems, but might cause issues in classification. 

## Preserving Categorical Distributions

For the sake of argument let's say we have a binary variable $y$ that can be $0$ or $1$.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
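# Draw 500 uniform random numbers from [0, 1)
y = np.random.random(500)

# Threshold at .6, so roughly 40% of the entries become 1
y[y >= .6] = 1
y[y < .6] = 0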

In [None]:
print(np.round(sum(y)/len(y),4), "is the proportion of y that is 1")

If we perform a train test split using `sklearn`, it is likely that both the training set and the test set will have a proportion of $1$s similar to that of the full data.

In [None]:
y_train, y_test = train_test_split(y, test_size=.25, shuffle = True)

In [None]:
print(np.round(sum(y_train)/len(y_train),4), "is the proportion of y_train that is 1")
print(np.round(sum(y_test)/len(y_test),4), "is the proportion of y_test that is 1")

Rerun the previous two code blocks a number of times and watch the proportions change from run to run.

Now let's try it with a new data set.

In [None]:
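# Draw 1000 uniform random numbers from [0, 1)
y = np.random.random(1000)

# Threshold at .99, so only about 1% of the entries become 1
y[y >= .99] = 1
y[y < .99] = 0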

print(np.round(sum(y)/len(y),4), "is the proportion of y that is 1")

In [None]:
y_train, y_test = train_test_split(y, test_size=.25, shuffle = True)

print(np.round(sum(y_train)/len(y_train),8), "is the proportion of y_train that is 1")
print(np.round(sum(y_test)/len(y_test),8), "is the proportion of y_test that is 1")

If you run this enough times you'll eventually end up with a split where either the test set or the training set has very few observations of the $1$ class. This is a problem, because all of our supervised learning algorithms operate on the assumption that our training data is drawn from the same distribution as our testing data.
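
To get a feel for how often this happens, here is a rough sketch that repeats the unstratified split many times and counts how often the test set ends up with no $1$s at all. For illustration it assumes exactly $10$ ones among $1000$ observations (the random draw above only gives roughly that many), and the names `y_demo`, `n_trials` and `n_empty` are ours. The simulated frequency is compared with the exact probability that all $250$ test observations come from the $990$ zeros.

```python
from math import comb

import numpy as np
from sklearn.model_selection import train_test_split

# Assumption for illustration: exactly 10 ones among 1000 observations
y_demo = np.zeros(1000)
y_demo[:10] = 1

n_trials = 2000  # number of repeated, unstratified splits
n_empty = 0      # splits whose test set contains no 1s

for seed in range(n_trials):
    _, y_te = train_test_split(y_demo, test_size=.25, shuffle=True, random_state=seed)
    if y_te.sum() == 0:
        n_empty += 1

simulated = n_empty / n_trials

# Exact chance that all 250 test observations are drawn from the 990 zeros
exact = comb(990, 250) / comb(1000, 250)

print(np.round(simulated, 4), "is the simulated proportion of splits with no 1s in the test set")
print(np.round(exact, 4), "is the exact probability of that happening")
```

Passing `stratify=y_demo` in the loop should bring `n_empty` down to zero, since each class is then split separately.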
 
We can address this issue by <i>stratifying</i> our dataset.

Say we want a $75$-$25$ split. We break our dataset into the $1$s and the $0$s, randomly sample $75\%$ of the $1$s for the training data, and separately randomly sample $75\%$ of the $0$s for the training data. The remaining data is set aside for the test set.
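
The procedure just described can be sketched by hand with NumPy. This is only an illustration of the idea, not how `sklearn` implements it; the function name `manual_stratified_split` and the toy array `y_demo` are ours.

```python
import numpy as np

def manual_stratified_split(y, train_frac=.75, seed=None):
    """Return (train_idx, test_idx), sampling train_frac of each class for training."""
    rng = np.random.default_rng(seed)
    train_parts = []
    for label in np.unique(y):
        # Indices of every observation in this class, in random order
        idx = rng.permutation(np.flatnonzero(y == label))
        # Keep (approximately) train_frac of this class for training
        n_train = int(round(train_frac * len(idx)))
        train_parts.append(idx[:n_train])
    train_idx = np.concatenate(train_parts)
    # Everything not chosen for training is set aside for testing
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    return train_idx, test_idx

# Toy target (an assumption for illustration): 10 ones among 1000 observations
y_demo = np.zeros(1000)
y_demo[:10] = 1

train_idx, test_idx = manual_stratified_split(y_demo, train_frac=.75, seed=0)

print(np.round(y_demo[train_idx].mean(), 4), "is the proportion of the training set that is 1")
print(np.round(y_demo[test_idx].mean(), 4), "is the proportion of the test set that is 1")
```

Because class counts rarely divide evenly, the two proportions will be close but not identical.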

Let's put this into action.

In [None]:
y_train, y_test = train_test_split(y, test_size=.25, shuffle = True, stratify=y)

print(np.round(sum(y_train)/len(y_train),9), "is the proportion of y_train that is 1")
print(np.round(sum(y_test)/len(y_test),9), "is the proportion of y_test that is 1")
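
So far we have only split the target. In practice we usually split the features and the target together, and `train_test_split` accepts several arrays at once while stratifying on whatever is passed to `stratify`. A small sketch, where the feature matrix `X_demo` and target `y_demo` are made up purely for illustration and `random_state` is fixed only to make the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data for illustration: 1000 observations, 3 features, exactly 10 ones
rng = np.random.default_rng(216)
X_demo = rng.random((1000, 3))
y_demo = np.zeros(1000)
y_demo[:10] = 1

# Features and target are split with the same row assignment, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=.25, shuffle=True, stratify=y_demo, random_state=216
)

print(X_train.shape, X_test.shape, "are the shapes of the feature splits")
print(int(y_train.sum()), "ones in y_train and", int(y_test.sum()), "ones in y_test")
```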

For very extreme examples, say when the target class makes up only $.1\%$ or even $.00001\%$ of the data, we have to resort to other techniques, which we discuss in the classification homework.


For the rest of the classification materials we'll perform stratified train test splits.