# Scikit-learn API Experimentation
## StratifiedKFold, KFold, shuffle 

What does StratifiedKFold do that's different from KFold?  
What does shuffle=True do that's different from shuffle=False?  

### Cross Validation Resources
Good resources for understanding cross validation and overfitting in Python:
* [Train/Test Split and Cross Validation](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6)
* [Learning Curves](https://www.dataquest.io/blog/learning-curves-machine-learning/)

Good resources for understanding cross validation and overfitting in general:
* Chapter 5.1 of [ISL](http://www-bcf.usc.edu/~gareth/ISL/)
* The first 3 videos for Chapter 5 in [ISL Videos](http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/)

In [1]:
# Load Titanic Data
%cd -q ../projects/titanic
%run LoadTitanicData.py
%cd -q -

# X: features
# y: target variable
print('X Shape: ', X.shape)
print('y Shape: ', y.shape)
print('X columns:\n', X.columns.values)
print('y name:',y.name)

X Shape:  (891, 11)
y Shape:  (891,)
X columns:
 ['PassengerId' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch' 'Ticket' 'Fare'
 'Cabin' 'Embarked']
y name: Survived


### Experiment

Each train/test split from crossvalidation.split() yields two NumPy arrays of indexes.  The first array picks out the records in the training set and the second array picks out the records in the test set.

In [2]:
import numpy as np
from sklearn.model_selection import StratifiedKFold, KFold

In [3]:
k_folds = 10
random_seed = 5
crossvalidation = StratifiedKFold(n_splits=k_folds, shuffle=False)

# get train and test sets for cross validation
train_test_sets = [(train_idx, test_idx) for 
                   train_idx, test_idx in crossvalidation.split(X,y)]

# in Python, looking at data types helps understanding
print(f'List Len:                   {len(train_test_sets)}')
print(f'1st Element Type:           {type(train_test_sets[0])}')
print(f'1st Element Len:            {len(train_test_sets[0])}')
print(f'1st Element 1st Tuple Type: {type(train_test_sets[0][0])}')
print(f'1st Element 1st Tuple Len:  {len(train_test_sets[0][0])}')
print(f'1st Element 2nd Tuple Type: {type(train_test_sets[0][1])}')
print(f'1st Element 2nd Tuple Len:  {len(train_test_sets[0][1])}')
print(f'Data Length:                {len(X)}')

List Len:                   10
1st Element Type:           <class 'tuple'>
1st Element Len:            2
1st Element 1st Tuple Type: <class 'numpy.ndarray'>
1st Element 1st Tuple Len:  801
1st Element 2nd Tuple Type: <class 'numpy.ndarray'>
1st Element 2nd Tuple Len:  90
Data Length:                891


Describing the above in words:
* The train_test_sets list is of length 10 (10 CV folds).
* Each element in the list is a tuple which consists of 2 numpy arrays.
* The first array in the tuple holds the indexes used to create the training data.  It is of length 801.
* The second array in the tuple holds the indexes used to create the test data.  It is of length 90.
* The total length of all data is 891 records.
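
The index arrays are positional, so they can be used directly to slice the data into training and test subsets. The following is a minimal sketch on made-up data: `toy_X` and `toy_y` are hypothetical stand-ins for the Titanic `X` and `y`, built as NumPy arrays to keep the example self-contained. For a pandas DataFrame or Series, `.iloc[train_idx]` is the positional equivalent.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in data: 20 records, 2 features, balanced binary target
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array([0, 1] * 10)

cv = StratifiedKFold(n_splits=5, shuffle=False)

fold_shapes = []
for train_idx, test_idx in cv.split(toy_X, toy_y):
    # Use the index arrays to pick out the rows for this split
    X_train, X_test = toy_X[train_idx], toy_X[test_idx]
    y_train, y_test = toy_y[train_idx], toy_y[test_idx]
    fold_shapes.append((X_train.shape, X_test.shape))

# Every split trains on 16 records and tests on the remaining 4
print(fold_shapes)
```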

In [4]:
# Experiment: KFold with shuffle=False
crossvalidation = KFold(n_splits=k_folds, shuffle=False)

train_test_sets = [(train_idx, test_idx) for 
                   train_idx, test_idx in crossvalidation.split(X,y)]

# Check: for contiguous blocks of records in the test set
# if the records are contiguous, each index differs by 1
for i in range(10):
    print((np.diff(train_test_sets[i][1]) == 1).all(), end=' ')

True True True True True True True True True True 

In [5]:
# print one fold of test set indexes
train_test_sets[0][1]

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89])

So KFold with shuffle=False means we are using test sets that represent blocks of contiguous records.

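Because each test set is one contiguous block, the matching training set is simply the remaining records: one contiguous block when the test block sits at either end of the data, and two contiguous blocks otherwise.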
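
The block sizes follow from simple arithmetic. Per the scikit-learn documentation for KFold, the first `n_samples % n_splits` folds get `n_samples // n_splits + 1` records and the rest get `n_samples // n_splits`. The sketch below checks this, using a placeholder array (`dummy_X`, an assumption made for illustration) with the same 891 rows as the Titanic data.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, n_splits = 891, 10
dummy_X = np.zeros((n_samples, 1))  # placeholder features; only the row count matters

# Expected sizes: the first (n_samples % n_splits) folds get one extra record
expected_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]

cv = KFold(n_splits=n_splits, shuffle=False)
test_folds = [test_idx for _, test_idx in cv.split(dummy_X)]
actual_sizes = [len(fold) for fold in test_folds]

print(actual_sizes)
# First and last index of the first three test folds:
# each fold starts where the previous one ended
print([(int(fold[0]), int(fold[-1])) for fold in test_folds[:3]])
```

With 891 records and 10 folds, only the first test fold has 90 records; the other nine have 89.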

In [6]:
# Experiment: KFold with shuffle=True
crossvalidation = KFold(n_splits=k_folds, shuffle=True, 
                        random_state=random_seed)

train_test_sets = [(train_idx, test_idx) for 
                   train_idx, test_idx in crossvalidation.split(X,y)]

# Check: for contiguous blocks of records in the test set
# if the records are contiguous, each index differs by 1
for i in range(10):
    print((np.diff(train_test_sets[i][1]) == 1).all(), end=' ')

False False False False False False False False False False 

In [7]:
# print one fold of test set indexes
train_test_sets[0][1]

array([ 11,  12,  23,  28,  59,  60, 121, 126, 133, 138, 148, 176, 199,
       207, 212, 230, 244, 247, 258, 261, 267, 275, 286, 293, 295, 312,
       316, 322, 329, 349, 352, 354, 361, 363, 379, 383, 386, 409, 417,
       419, 424, 433, 434, 438, 440, 443, 445, 451, 452, 470, 475, 481,
       509, 544, 545, 563, 568, 576, 590, 591, 610, 636, 644, 673, 679,
       682, 683, 692, 695, 724, 727, 733, 735, 737, 747, 757, 759, 765,
       769, 792, 807, 827, 828, 840, 843, 845, 857, 872, 877, 886])

So shuffle=True caused non-consecutive indexes to be used for determining the test datasets.

This implies that non-consecutive indexes are also used for the train datasets.

In other words, we are no longer using blocks of records from the original dataset for our train and test sets.
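
Two details of shuffling are easy to miss: a fixed random_state makes the shuffled folds reproducible, and the indexes inside each fold are still reported in ascending order (only fold membership is randomized). The sketch below illustrates both on a placeholder array. `dummy_X` and the helper `shuffled_test_folds` are hypothetical names introduced for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

dummy_X = np.zeros((891, 1))  # placeholder features; only the row count matters

def shuffled_test_folds(seed):
    """Return the list of test index arrays for a shuffled 10-fold split."""
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    return [test_idx for _, test_idx in cv.split(dummy_X)]

run_a = shuffled_test_folds(seed=5)
run_b = shuffled_test_folds(seed=5)

# Same seed, same folds
same = all(np.array_equal(a, b) for a, b in zip(run_a, run_b))
# Within each fold the indexes are strictly increasing
ascending = all((np.diff(fold) > 0).all() for fold in run_a)
print(same, ascending)
```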

In [8]:
# Experiment: KFold with shuffle=True
crossvalidation = KFold(n_splits=k_folds, shuffle=True, 
                        random_state=random_seed)

train_test_sets = [(train_idx, test_idx) for 
                   train_idx, test_idx in crossvalidation.split(X,y)]

# Check: for frequency of class labels
# Note: y only has values of 0 or 1, so y.mean() is the frequency of 1 values
print('y: ', np.round(y.mean(), 2))

# print frequency of survival in the 10 train and 10 test sets
for i in range(10):
    for j in range(2):
        print(np.round(y[train_test_sets[i][j]].mean(), 2), end=' ')

y:  0.38
0.38 0.4 0.39 0.36 0.39 0.37 0.39 0.33 0.38 0.45 0.38 0.39 0.38 0.39 0.38 0.4 0.39 0.36 0.38 0.38 

So KFold did *not* keep the percentage of survivors the same in each dataset.  The training sets stay close to the overall 38%, but the smaller test sets range from 33% to 45%.

In [9]:
# Experiment: StratifiedKFold with shuffle=True
crossvalidation = StratifiedKFold(n_splits=k_folds, shuffle=True, 
                        random_state=random_seed)

train_test_sets = [(train_idx, test_idx) for 
                   train_idx, test_idx in crossvalidation.split(X,y)]

# Check: for frequency of class labels
# Note: y only has values of 0 or 1, so y.mean() is the frequency of 1 values
print('y: ', np.round(y.mean(), 2))

# print frequency of survival in the 10 train and 10 test sets
for i in range(10):
    for j in range(2):
        print(np.round(y[train_test_sets[i][j]].mean(), 2), end=' ')

y:  0.38
0.38 0.39 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.39 

So StratifiedKFold kept the percentage of survivors about the same (38% to 39%) in every training and test dataset.
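
Stratification works class by class: the records of each class are spread as evenly as possible across the folds. A small sketch makes the effect exact. It assumes a hypothetical target `toy_y` with 20 positives among 100 records, so each of 5 test folds should receive 4 positives.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical target: 100 records, every 5th record is a positive (20 in total)
toy_y = np.array([1, 0, 0, 0, 0] * 20)
toy_X = np.zeros((100, 1))  # placeholder features

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)
positives_per_test_fold = [int(toy_y[test_idx].sum())
                           for _, test_idx in cv.split(toy_X, toy_y)]

# 20 positives spread over 5 folds
print(positives_per_test_fold)
```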

#### Summary of StratifiedKFold and KFold
For classification, you want each train/test subset to have (about) the same frequency of class values as is represented in the entire target array, so you normally **choose StratifiedKFold instead of KFold**.

The original dataset may have an inherent ordering.  This ordering could bias your train/test splits.  To avoid this, you normally choose **shuffle=True**.
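
To see how badly ordering can bias unshuffled splits, consider a deliberately extreme case. The data here is hypothetical: a target sorted so that all 80 negatives come before all 20 positives. This is only a sketch of the worst case, not a claim about the Titanic data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical worst case: records sorted by target, 80 negatives then 20 positives
sorted_y = np.array([0] * 80 + [1] * 20)
dummy_X = np.zeros((100, 1))  # placeholder features

cv = KFold(n_splits=5, shuffle=False)
splits = list(cv.split(dummy_X))

test_rates = [float(sorted_y[test_idx].mean()) for _, test_idx in splits]
train_rates = [float(sorted_y[train_idx].mean()) for train_idx, _ in splits]

print(test_rates)   # positive rate in each test fold
print(train_rates)  # positive rate in each training set
```

The last split trains on no positives at all and tests only on positives, while the first four test folds contain no positives.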

**NOTE**  
shuffle=True does **not** cause the test sets to overlap.  It is not like ShuffleSplit.

In [10]:
# Show: test sets do not overlap when shuffle=True
crossvalidation = StratifiedKFold(n_splits=k_folds, shuffle=True, 
                        random_state=random_seed)

train_test_sets = [(train_idx, test_idx) for 
                   train_idx, test_idx in crossvalidation.split(X,y)]

# In this example, there are 10 disjoint test sets.
# Equivalently, the intersection of every pair of test sets is empty
# (it has a length of 0).

# Intersection is commutative, so we only need to check each unordered pair
# of test sets once, and we don't check a test set against itself

for i in range(10):
    for j in range(i+1, 10):
        print(len(np.intersect1d(train_test_sets[i][1],train_test_sets[j][1])), end=' ')

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

We see that the test sets are disjoint. shuffle=True in this context does not cause test set overlap.
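
For contrast, ShuffleSplit draws each test set independently, so its test sets can share records. The sketch below uses a placeholder array (`dummy_X`, an assumption) and parameters chosen so that overlap is guaranteed: 10 test sets of 90 records make 900 draws from only 891 records.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 891
dummy_X = np.zeros((n_samples, 1))  # placeholder features; only the row count matters

# 10 independent random splits, each holding out 90 records
cv = ShuffleSplit(n_splits=10, test_size=90, random_state=5)
all_test_idx = np.concatenate([test_idx for _, test_idx in cv.split(dummy_X)])

total_drawn = len(all_test_idx)                   # 10 * 90 test slots
distinct_records = len(np.unique(all_test_idx))   # how many different records were used

# Fewer distinct records than draws means some records were tested more than once
print(total_drawn, distinct_records)
```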

In [11]:
# Show: train set is disjoint from its respective test set
crossvalidation = StratifiedKFold(n_splits=k_folds, shuffle=True, 
                        random_state=random_seed)

train_test_sets = [(train_idx, test_idx) for 
                   train_idx, test_idx in crossvalidation.split(X,y)]

for i in range(10):
    print(len(np.intersect1d(train_test_sets[i][0],train_test_sets[i][1])), end=' ')

0 0 0 0 0 0 0 0 0 0 

There are no index values in the train dataset that are in the corresponding test dataset.
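
Taken together, the checks above say that the folds partition the data: in every split the train and test indexes cover all records, and across the splits each record is tested exactly once. The sketch below verifies both properties on a small hypothetical dataset (`toy_X` and `toy_y` are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical data: 50 records with a 60/40 class split
toy_y = np.array([0] * 30 + [1] * 20)
toy_X = np.zeros((50, 1))  # placeholder features
all_idx = np.arange(len(toy_y))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)
splits = list(cv.split(toy_X, toy_y))

# Train and test together cover every record in every split
covers_all = all(np.array_equal(np.union1d(train_idx, test_idx), all_idx)
                 for train_idx, test_idx in splits)

# Across the 5 splits, count how often each record lands in a test set
times_tested = np.bincount(np.concatenate([test_idx for _, test_idx in splits]),
                           minlength=len(toy_y))

print(covers_all, bool((times_tested == 1).all()))
```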