# Adjustments for Classification

Classification problems require some adjustments to how we split our data.

## What we will accomplish

In this notebook we will:
- Discuss the concept of and motivation for stratified splits,
- Demonstrate the `stratify` argument in `train_test_split`, and
- Introduce `StratifiedKFold` in `sklearn`.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from seaborn import set_style

set_style("whitegrid")

## Illustrating example

We will start with a contrived example to illustrate the motivation behind <i>stratified</i> splits.

In [2]:
## Some sample output data
y = [0,1,0,0,0,0,0,1,0,0]

Suppose we have some data we would like to model, and the `y` above is our output variable. Let's make a few random train test splits of `y` and print them.

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
np.random.seed(9)
for i in range(5):
    y_train, y_test = train_test_split(y, 
                                       shuffle=True, 
                                       test_size=.2)
    print("Split", i+1)
    print("y_train", y_train)
    print("y_test", y_test)
    print()

Split 1
y_train [1, 0, 1, 0, 0, 0, 0, 0]
y_test [0, 0]

Split 2
y_train [1, 0, 0, 0, 0, 0, 1, 0]
y_test [0, 0]

Split 3
y_train [0, 0, 0, 0, 0, 0, 0, 0]
y_test [1, 1]

Split 4
y_train [0, 1, 0, 0, 0, 1, 0, 0]
y_test [0, 0]

Split 5
y_train [0, 0, 0, 1, 1, 0, 0, 0]
y_test [0, 0]



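While this may be a silly example, it does highlight an issue that can occur when doing train test splits with categorical data, particularly when your data is highly <i>imbalanced</i>, meaning one of the categories is far more present than the other(s). In Split 3 above, both of the 1s land in the test set, so the training set contains no examples of that class at all. In the other four splits the test set contains no 1s, so it could not tell us how well a model identifies that class.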
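To get a feel for how often this happens, we can count the possible test sets directly. The following is a minimal sketch, assuming the two test observations are chosen uniformly at random without replacement from the ten labels in the toy `y` (the variable names here are illustrative and not part of the original notebook).

```python
from math import comb

# Toy setting from above: 10 labels, two of them 1s, test_size=.2 -> 2 test labels
n_total, n_pos, n_test = 10, 2, 2

# Number of distinct test sets of size 2
total = comb(n_total, n_test)

# Test sets made up only of 0s (the test set has no 1s)
p_no_pos_in_test = comb(n_total - n_pos, n_test) / total

# Test sets made up of both 1s (the training set has no 1s)
p_no_pos_in_train = comb(n_pos, n_test) / total

print("P(no 1s in test set):    ", round(p_no_pos_in_test, 3))   # 0.622
print("P(no 1s in training set):", round(p_no_pos_in_train, 3))  # 0.022
```

Under this assumption, roughly 64% of purely random splits leave one of the two sets with no 1s at all.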

A major assumption in supervised learning is that your data is always drawn from the same underlying probability distribution. So when we make any kind of data split, we want both sets in the split to look approximately the same, in particular with respect to the proportions of the output classes:

<img src="train_test_class.png" width="80%"></img>

## Stratified splits

### In theory

<i>Stratification</i> is how we ensure that our splits are representative of the sample's distribution with regard to the output variable of interest. When we perform a data split stratified on a categorical variable, we first break our sample into subsets of observations, one for each unique category. We then perform a randomized split on each of those subsets. Finally, the pieces from each category are recombined into two data sets, each with category proportions roughly equal to those of the original sample.

This may be easier to understand with a picture.

<img src="stratify.png" width="75%"></img> 
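The procedure described above can also be sketched by hand in `numpy`. This is only an illustration of the idea, not how `sklearn` implements it; the function name `manual_stratified_split` and the synthetic labels are hypothetical.

```python
import numpy as np

def manual_stratified_split(y, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each category separately."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_parts, test_parts = [], []
    for label in np.unique(y):
        # Indices of the observations belonging to this category
        idx = np.flatnonzero(y == label)
        # Randomized split within the category
        rng.shuffle(idx)
        n_test = int(round(test_size * len(idx)))
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    # Recombine the per-category pieces into two data sets
    return np.concatenate(train_parts), np.concatenate(test_parts)

# Hypothetical imbalanced labels: 80 zeros and 20 ones
y_demo = np.array([0] * 80 + [1] * 20)
train_idx, test_idx = manual_stratified_split(y_demo, test_size=0.2, seed=0)

print("share of 1s in train:", y_demo[train_idx].mean())
print("share of 1s in test: ", y_demo[test_idx].mean())
```

Because each category is split on its own, both sets keep the 20% share of 1s regardless of the seed.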

### In `sklearn`

We now demonstrate how to make stratified train test splits and how to perform stratified cross-validation.

In [13]:
beer = pd.read_csv("../../../Data/beer1.csv")

In [14]:
beer.head()

Unnamed: 0,IBU,ABV,Rating,Beer_Type
0,45,4.2,3.792,Stout
1,60,8.3,4.145,Stout
2,25,6.0,3.951,Stout
3,31,11.0,4.062,Stout
4,75,9.0,4.018,Stout


In [15]:
beer.Beer_Type.value_counts(normalize=True)

IPA      0.560694
Stout    0.439306
Name: Beer_Type, dtype: float64

In [19]:
import sklearn

print(sklearn.__version__)

1.0.2


#### `train_test_split`'s `stratify` argument

In [16]:
## Make the split
beer_train, beer_test = train_test_split(beer.copy(),
                                         shuffle = True,
                                         random_state = 590,
                                         test_size = .2,
                                         stratify = beer.Beer_Type)

In [17]:
## look at the distribution for the training data

beer_train.Beer_Type.value_counts(normalize=True)


IPA      0.561594
Stout    0.438406
Name: Beer_Type, dtype: float64

In [18]:
## look at the distribution for the test data
beer_test.Beer_Type.value_counts(normalize=True)



IPA      0.557143
Stout    0.442857
Name: Beer_Type, dtype: float64
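The beer data is fairly balanced, so the effect of `stratify` is easier to see on a more imbalanced sample. Below is a small sketch using hypothetical labels (`y_imb`, 90 zeros and 10 ones, not part of the beer data) that makes the same split with and without the `stratify` argument.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 zeros and 10 ones
y_imb = np.array([0] * 90 + [1] * 10)

# Without stratify, the share of 1s in each set depends on the random shuffle
y_train_plain, y_test_plain = train_test_split(y_imb,
                                               shuffle=True,
                                               random_state=590,
                                               test_size=.2)

# With stratify, each set keeps the 10% share of 1s
y_train_strat, y_test_strat = train_test_split(y_imb,
                                               shuffle=True,
                                               random_state=590,
                                               test_size=.2,
                                               stratify=y_imb)

print("plain test share of 1s:      ", y_test_plain.mean())
print("stratified train share of 1s:", y_train_strat.mean())
print("stratified test share of 1s: ", y_test_strat.mean())
```

The stratified split places 8 of the ten 1s in the training set and 2 in the test set, matching the 80/20 split sizes.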

#### `StratifiedKFold`

We can also perform a stratified $k$-fold cross-validation with the assistance of `sklearn`'s `StratifiedKFold` object.

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html</a>

In [20]:
## import here
from sklearn.model_selection import StratifiedKFold

In [21]:
## make the kfold object
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=203)

In [22]:
## loop through train sets and test sets
for i, (train_index, test_index) in enumerate(
        kfold.split(beer_train[['IBU','ABV']], beer_train['Beer_Type']), start=1):
    ## print the beer type splits
    print("Split", i)
    print("CV Training Set Split")
    print(beer_train.iloc[train_index].Beer_Type.value_counts(normalize=True))
    
    print()
    
    print("CV Holdout Set Split")
    print(beer_train.iloc[test_index].Beer_Type.value_counts(normalize=True))
    
    print("+++++++++++++++")

Split 1
CV Training Set Split
IPA      0.563636
Stout    0.436364
Name: Beer_Type, dtype: float64

CV Holdout Set Split
IPA      0.553571
Stout    0.446429
Name: Beer_Type, dtype: float64
+++++++++++++++
Split 2
CV Training Set Split
IPA      0.561086
Stout    0.438914
Name: Beer_Type, dtype: float64

CV Holdout Set Split
IPA      0.563636
Stout    0.436364
Name: Beer_Type, dtype: float64
+++++++++++++++
Split 3
CV Training Set Split
IPA      0.561086
Stout    0.438914
Name: Beer_Type, dtype: float64

CV Holdout Set Split
IPA      0.563636
Stout    0.436364
Name: Beer_Type, dtype: float64
+++++++++++++++
Split 4
CV Training Set Split
IPA      0.561086
Stout    0.438914
Name: Beer_Type, dtype: float64

CV Holdout Set Split
IPA      0.563636
Stout    0.436364
Name: Beer_Type, dtype: float64
+++++++++++++++
Split 5
CV Training Set Split
IPA      0.561086
Stout    0.438914
Name: Beer_Type, dtype: float64

CV Holdout Set Split
IPA      0.563636
Stout    0.436364
Name: Beer_Type, dtype: float64
+++++++++++++++
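For contrast, here is a short sketch comparing `KFold` with `StratifiedKFold` on hypothetical labels that are sorted by class (`y_sorted`, 90 zeros followed by 10 ones), with shuffling left off in both. The placeholder feature array `X_dummy` is only there because `split` expects one.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels sorted by class: 90 zeros followed by 10 ones
y_sorted = np.array([0] * 90 + [1] * 10)
X_dummy = np.zeros((len(y_sorted), 1))  # placeholder features

# Share of 1s in each holdout fold, without shuffling
plain_shares = [y_sorted[test].mean()
                for _, test in KFold(n_splits=5).split(X_dummy, y_sorted)]
strat_shares = [y_sorted[test].mean()
                for _, test in StratifiedKFold(n_splits=5).split(X_dummy, y_sorted)]

print("KFold holdout shares of 1s:          ", plain_shares)
print("StratifiedKFold holdout shares of 1s:", strat_shares)
```

Here the plain `KFold` holdout sets contain no 1s in the first four folds and 50% 1s in the last, while every `StratifiedKFold` holdout set keeps the 10% share.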

Now that we know how to adjust our data splits for categorical data, let's start classifying.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).