In [1]:
import numpy as np
import pandas as pd

## Perform test train split using random sampling of data

Define the size of the test dataframe


In [2]:
SIZE=100

Generate the category labels for the dataset

In [3]:
# category 1 size: SIZE/3 plus a random integer in [0, SIZE/6)
cat1_size = int(SIZE/3) + np.random.randint(int(SIZE/6))

# category 2 size: SIZE/3 minus a random integer in [0, SIZE/6)
cat2_size = int(SIZE/3) - np.random.randint(int(SIZE/6))

# category 3 size is constrained by SIZE: it takes whatever remains
cat3_size = SIZE - cat1_size - cat2_size

# generate the category column by repeating the integer labels 1, 2, and 3
y=[1,]*cat1_size + [2,]*cat2_size + [3,]*cat3_size
# y
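The cell above draws from NumPy's global random state, so the category sizes change on every run. As a minimal sketch, assuming a reproducible run is wanted, the same construction can draw its offsets from a seeded generator. The seed value `0` and the use of `np.repeat` are illustrative choices, not part of the original notebook.

```python
import numpy as np

SIZE = 100
rng = np.random.default_rng(0)  # assumption: arbitrary seed, used only for reproducibility

# Same construction as the notebook cell, with offsets drawn from the seeded generator
cat1_size = SIZE // 3 + int(rng.integers(SIZE // 6))
cat2_size = SIZE // 3 - int(rng.integers(SIZE // 6))
cat3_size = SIZE - cat1_size - cat2_size

# Repeat each label as many times as its category size
y = np.repeat([1, 2, 3], [cat1_size, cat2_size, cat3_size])
print(len(y), np.bincount(y)[1:])  # total length and per-category counts
```

Rerunning this block gives the same category sizes each time, which makes the later comparisons easier to reproduce.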

Generate the dataframe with a `test` column holding the row indices and a `cat` column holding the category labels `y`

In [4]:
dftest = pd.DataFrame({'test':np.arange(SIZE), 'cat':y  })
dftest[int(SIZE/3):int(SIZE/2)]

| | test | cat |
|---|---|---|
| 33 | 33 | 1 |
| 34 | 34 | 1 |
| 35 | 35 | 1 |
| 36 | 36 | 1 |
| 37 | 37 | 1 |
| 38 | 38 | 1 |
| 39 | 39 | 1 |
| 40 | 40 | 1 |
| 41 | 41 | 1 |
| 42 | 42 | 1 |
42,42,1


### Random split of data using np.random.permutation

Define `split_ratio` and calculate the size of the test sample

In [5]:
split_ratio=0.2
test_size=int(SIZE*split_ratio)

Shuffle the indices of the dataframe randomly

In [6]:
splt = np.random.permutation(SIZE)
splt

array([33, 68, 53, 71, 47, 32, 29, 51, 43, 99, 38, 24, 70, 52, 28, 83, 30,
       10,  6, 20, 96, 81, 69, 17, 92, 87, 63, 35, 67,  5, 26,  9, 19, 22,
       65,  3, 50, 78, 14, 15,  8, 60, 84, 76, 90, 88, 37, 66, 16, 95, 98,
       56, 25, 54, 42, 31, 44,  1, 74, 86, 82, 18, 85, 46, 13, 64, 91, 75,
       23, 39, 61, 27, 89, 79, 97, 62, 40, 77, 72, 41, 57, 12,  0,  7, 45,
       94, 36, 11, 21, 49,  4, 48, 58, 73, 34, 93, 80,  2, 55, 59])

Use the shuffled indices to split the data. Because the permutation is random, its first `test_size` entries form a random test set, and the remaining entries form the training set.

In [7]:
train1, test1 = splt[test_size:], splt[:test_size]
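The permutation-based split can be wrapped in a small helper. This is only a sketch: the function name `random_split_indices` and its `seed` parameter are hypothetical and do not appear in the notebook.

```python
import numpy as np

def random_split_indices(n_rows, test_ratio, seed=None):
    """Return (train_idx, test_idx) from a random permutation of range(n_rows)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    test_size = int(n_rows * test_ratio)
    # The first test_size shuffled indices become the test set, the rest the training set
    return shuffled[test_size:], shuffled[:test_size]

train_idx, test_idx = random_split_indices(100, 0.2, seed=0)
print(len(train_idx), len(test_idx))
```

The returned arrays can be passed to `dftest.iloc[...]` in the same way as `train1` and `test1`.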

In [20]:
dftest.iloc[test1].head()

| | test | cat |
|---|---|---|
| 33 | 33 | 1 |
| 68 | 68 | 2 |
| 53 | 53 | 2 |
| 71 | 71 | 3 |
| 47 | 47 | 2 |


In [9]:
dftest.iloc[train1].head() 

| | test | cat |
|---|---|---|
| 96 | 96 | 3 |
| 81 | 81 | 3 |
| 69 | 69 | 2 |
| 17 | 17 | 1 |
| 92 | 92 | 3 |


### Counting the distribution of the random samples

In [10]:
dftest.iloc[train1].groupby(['cat']).count()

| cat | test |
|---|---|
| 1 | 35 |
| 2 | 19 |
| 3 | 26 |


In [11]:
dftest.iloc[test1].groupby(['cat']).count()

| cat | test |
|---|---|
| 1 | 11 |
| 2 | 5 |
| 3 | 4 |


### Dataframe to compare the distribution of the random samples to the original distribution

In [12]:
dfratiotest = pd.DataFrame()
dfratiotest = dftest.groupby(['cat']).count()/SIZE
dfratiotest = dfratiotest.rename(columns={'test':'originalratio'})

dfratiotest['noShuffleSplit-test'] = dftest.iloc[test1].groupby(['cat']).count()/dftest.iloc[test1].groupby(['cat']).count().sum()
dfratiotest['noShuffleSplit-train'] = dftest.iloc[train1].groupby(['cat']).count()/dftest.iloc[train1].groupby(['cat']).count().sum()

dfratiotest

| cat | originalratio | noShuffleSplit-test | noShuffleSplit-train |
|---|---|---|---|
| 1 | 0.46 | 0.55 | 0.4375 |
| 2 | 0.24 | 0.25 | 0.2375 |
| 3 | 0.3 | 0.2 | 0.325 |
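The same kind of ratio table can be built more compactly with `Series.value_counts(normalize=True)`, which returns class proportions directly. The sketch below is self-contained: it rebuilds labels with the 46/24/30 category sizes seen in this run and uses its own seeded permutation, so its numbers will not match the table above. The column names are illustrative.

```python
import numpy as np
import pandas as pd

# Labels with the category sizes observed in this run (46, 24, 30)
labels = pd.Series([1] * 46 + [2] * 24 + [3] * 30, name="cat")

rng = np.random.default_rng(0)  # assumption: arbitrary seed for reproducibility
shuffled = rng.permutation(len(labels))
test_idx, train_idx = shuffled[:20], shuffled[20:]

# value_counts(normalize=True) gives each category's share of the rows
ratios = pd.DataFrame({
    "original": labels.value_counts(normalize=True),
    "test": labels.iloc[test_idx].value_counts(normalize=True),
    "train": labels.iloc[train_idx].value_counts(normalize=True),
}).sort_index().fillna(0.0)  # a category absent from a sample gets ratio 0
print(ratios)
```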


## Random sampling does not reliably preserve the relative ratios of the categories

## StratifiedShuffleSplit


Now we demonstrate how `StratifiedShuffleSplit` can split the data randomly into train and test sets while maintaining the relative ratios of the categories.

In [13]:
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1,random_state=42,test_size=0.2)

In [14]:
# for loop is not necessary for splits with n_splits=1
for i,j in sss.split(np.zeros(SIZE),dftest['cat']):
    print(i,j)


[16 67 69 83 48 13 30 88 15 58  0 68 74 92 62 43 53 36  6 99 91 41 37 40
  3 25 75  9 52 82 90 60 33 76 78  1 44 80 98 21 32 55 46 56 17  8 94 61
 57 29 93 89 26 19 64 70 72 39 97  5 31 50 51 84 10 86 27 65 77 34 24 96
  4 11 66 35  2 85 23 12] [59 79 54 81 42 28  7 95 45 18 14 87 20 47 22 49 71 38 73 63]


The split returns only row indices, not the data itself. The contents of `X` have no effect on the output (only its length is used), so we can pass a placeholder such as `np.zeros(SIZE)`, which avoids handing a large dataset to the splitter.
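A quick way to check the placeholder claim is to split the same labels twice with the same `random_state`, once with real-looking features and once with zeros, and compare the indices. The feature matrix `X_real` below is made-up data used only for this check.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

y = np.array([1] * 46 + [2] * 24 + [3] * 30)

# Hypothetical feature matrix and a zero placeholder of the same length
X_real = np.random.default_rng(0).normal(size=(len(y), 5))
X_dummy = np.zeros(len(y))

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_a, test_a = next(sss.split(X_real, y))
train_b, test_b = next(sss.split(X_dummy, y))

# Identical index arrays mean the contents of X played no role in the split
print(np.array_equal(train_a, train_b), np.array_equal(test_a, test_b))
```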

In [15]:
train2, test2 = list(sss.split(np.zeros(SIZE),dftest['cat']))[0]

`sss.split` returns a generator, so we convert it to a list with `list(sss.split(...))`. The result is a list of pairs of index arrays, one pair per split, in the format **[ (trainset1, testset1), (trainset2, testset2), (trainset3, testset3), ... ]**. With `n_splits=1` the list holds a single pair, which we extract with `list(sss.split(...))[0]`.
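To make the structure of the generator output concrete, the sketch below uses `n_splits=3` and iterates over the resulting pairs. It also shows `next(...)` as an alternative to `list(...)[0]` when only one split is needed. The label array is rebuilt here with the category sizes from this run, and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

y = np.array([1] * 46 + [2] * 24 + [3] * 30)

# Three independent stratified splits: the generator yields one (train, test) pair each
sss3 = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)
splits = list(sss3.split(np.zeros(len(y)), y))
for train_idx, test_idx in splits:
    # Sizes of both sets and the per-category counts in the test set
    print(len(train_idx), len(test_idx), np.bincount(y[test_idx])[1:])

# With a single split, next() takes the first pair without building a list
sss1 = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_one, test_one = next(sss1.split(np.zeros(len(y)), y))
```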

### Counting the distribution of the stratified samples

In [16]:
dftest.iloc[train2].groupby(['cat']).count()

| cat | test |
|---|---|
| 1 | 37 |
| 2 | 19 |
| 3 | 24 |


In [17]:
dftest.iloc[test2].groupby(['cat']).count()

| cat | test |
|---|---|
| 1 | 9 |
| 2 | 5 |
| 3 | 6 |


### Dataframe to compare the distribution of the random samples to the original distribution

In [18]:
dfratiotest['StratifiedShuffleSplit-test'] = dftest.iloc[test2].groupby(['cat']).count()/dftest.iloc[test2].groupby(['cat']).count().sum()
dfratiotest['StratifiedShuffleSplit-train'] = dftest.iloc[train2].groupby(['cat']).count()/dftest.iloc[train2].groupby(['cat']).count().sum()
dfratiotest

| cat | originalratio | noShuffleSplit-test | noShuffleSplit-train | StratifiedShuffleSplit-test | StratifiedShuffleSplit-train |
|---|---|---|---|---|---|
| 1 | 0.46 | 0.55 | 0.4375 | 0.45 | 0.4625 |
| 2 | 0.24 | 0.25 | 0.2375 | 0.25 | 0.2375 |
| 3 | 0.3 | 0.2 | 0.325 | 0.3 | 0.3 |


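Stratified sampling comes close to the original distribution but does not match it exactly: class counts must be whole numbers, so with 20 test rows category 1 can only reach 9/20 = 0.45 rather than 0.46. We now calculate how far each sampling method deviates from the original distribution of the categories. We define the error as the relative deviation from the original ratio, expressed as a percentage: $ \epsilon_i = 100 \times \Delta c_i / c_i $, where $c_i$ is the original ratio of category $i$ and $\Delta c_i$ is the sampled ratio minus $c_i$.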
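As a worked instance of this definition, the snippet below applies it to category 1 of the test sets, using the ratios from the comparison table (0.46 original, 0.55 for the random split, 0.45 for the stratified split). The helper name `relative_error_pct` is illustrative.

```python
def relative_error_pct(sampled_ratio, original_ratio):
    """Relative deviation of a sampled class ratio from the original, in percent."""
    return (sampled_ratio - original_ratio) / original_ratio * 100

random_err = relative_error_pct(0.55, 0.46)      # random split, category 1, test set
stratified_err = relative_error_pct(0.45, 0.46)  # stratified split, category 1, test set
print(round(random_err, 2), round(stratified_err, 2))  # 19.57 -2.17
```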

In [19]:
dferror = pd.DataFrame()
for col in dfratiotest.columns[1:]:
    dferror[col] = (dfratiotest[col]-dfratiotest['originalratio'])/dfratiotest['originalratio']*100

dferror

| cat | noShuffleSplit-test | noShuffleSplit-train | StratifiedShuffleSplit-test | StratifiedShuffleSplit-train |
|---|---|---|---|---|
| 1 | 19.565217 | -4.891304 | -2.173913 | 0.543478 |
| 2 | 4.166667 | -1.041667 | 4.166667 | -1.041667 |
| 3 | -33.333333 | 8.333333 | 0.0 | 0.0 |


We see that the error is considerably lower for StratifiedShuffleSplit in this run. For smaller datasets StratifiedShuffleSplit performs only slightly better than random sampling. However, as the datasets grow in size, StratifiedShuffleSplit performs considerably better than plain random sampling. Try rerunning the whole notebook with different values of the parameter `SIZE` and compare the results.
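One possible way to run that experiment without editing `SIZE` by hand is sketched below. It makes several assumptions that are not in the notebook: fixed 46/24/30 class proportions, the largest absolute gap between test-set and full-data class ratios as the error measure, and an average over 20 repetitions per size. The helper `max_ratio_error` and the `results` dictionary are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def max_ratio_error(y, test_idx):
    """Largest absolute gap between test-set and full-data class ratios."""
    classes = np.unique(y)
    full = np.array([(y == c).mean() for c in classes])
    test = np.array([(y[test_idx] == c).mean() for c in classes])
    return np.abs(test - full).max()

rng = np.random.default_rng(0)  # assumption: arbitrary seed for reproducibility
results = {}
for size in (100, 1000, 10000):
    # Assumed fixed class proportions of 46%, 24%, and 30%
    n1, n2 = size * 46 // 100, size * 24 // 100
    y = np.repeat([1, 2, 3], [n1, n2, size - n1 - n2])
    test_size = size // 5

    # Plain random sampling: first test_size entries of a permutation, 20 repetitions
    random_errs = [max_ratio_error(y, rng.permutation(size)[:test_size])
                   for _ in range(20)]

    # Stratified sampling: 20 independent stratified splits
    sss = StratifiedShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
    strat_errs = [max_ratio_error(y, test_idx)
                  for _, test_idx in sss.split(np.zeros(size), y)]

    results[size] = (np.mean(random_errs), np.mean(strat_errs))
    print(size, results[size])
```

Each printed pair is the mean error of random sampling followed by the mean error of stratified sampling for that dataset size.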
