# 33-split-data-with-groupby
> Functionality for creating training, validation, and testing data frames.

These functions split the input data into training, validation, and testing sets. This notebook defines two functions: `split_data` and `split_data_grp`. The former splits a given pandas data frame by rows into the desired proportions for training, validation, and testing. The latter splits on the unique values of a specified grouping variable: each group is assigned in its entirety to exactly one of the returned data sets, so the proportions apply to the number of groups rather than the number of rows, and no group appears in more than one set.

Note that the user only needs to call `split_data_grp` directly. If no grouping variable is specified, the function falls back to the original `split_data` function.

In [None]:
#default_exp split_data

In [None]:
#export
#no_test
#dependencies
import pandas as pd
import numpy as np

## Basic data splitting
Split a given pandas data frame into training, validation, and testing frames. This splits the data in the traditional manner, by rows.

In [None]:
#export
def split_data(df, train_prop = 0.80, validation_prop = None, seed = None):
    '''
        df: data frame to use for splitting
        train_prop: proportion of df for training
        validation_prop: proportion of df for validation ; if None, testing set is 1 - train_prop. If not None, testing set is 1 - (train_prop + validation_prop)
        seed: seed number to use for splitting the data
        
        returns: (train, test) if validation_prop is None, otherwise (train, validation, test)
    '''
    
    #Create training frame
    train_df = df.sample(frac=train_prop,random_state = seed)
    
    #Conditionally create validation and testing frames
    if validation_prop is not None:
        validation_pool = df.drop(train_df.index)
        validation_df = validation_pool.sample(n = int(validation_prop * len(df)), random_state = seed)
        
        #Create testing frame
        test_drop_index = train_df.index.union(validation_df.index)
        test_df = df.drop(test_drop_index)
        
        #Return frames
        return train_df, validation_df, test_df
    
    #Return testing w/o validation frame
    else:
        test_df = df.drop(train_df.index)
        
        return train_df, test_df

In [None]:
#no_test
#Define some data
import string
import random

data = pd.DataFrame(np.random.randint(0, 100, size = (200,3)), 
                     columns = list("ABC"))
data['foo'] = random.choices(string.ascii_lowercase, k=200)
data.head()

Unnamed: 0,A,B,C,foo
0,76,80,3,r
1,56,94,41,k
2,28,2,36,y
3,15,77,21,e
4,59,41,82,m


### Basic usage of split_data function


In [None]:
#no_test
train_df_no_grp, validation_df_no_grp, test_df_no_grp = split_data(data, train_prop = 0.70, validation_prop = 0.20)

In [None]:
#no_test
#Print split data
print("Training: ")
display(train_df_no_grp.head())
print("Training shape: ", train_df_no_grp.shape)

print("Validation: ")
display(validation_df_no_grp.head())
print("Validation shape: ", validation_df_no_grp.shape)

print("Testing: ")
display(test_df_no_grp.head())
print("Testing shape: ", test_df_no_grp.shape)

Training: 


Unnamed: 0,A,B,C,foo
81,23,87,51,l
89,88,48,86,w
5,79,68,89,r
80,14,69,20,w
132,82,86,65,m


Training shape:  (140, 4)
Validation: 


Unnamed: 0,A,B,C,foo
53,19,48,16,o
150,90,8,87,j
74,48,42,85,a
162,2,33,74,p
55,53,84,96,k


Validation shape:  (40, 4)
Testing: 


Unnamed: 0,A,B,C,foo
1,56,94,41,k
38,83,86,50,t
42,77,44,22,h
51,82,22,2,m
52,98,35,41,x


Testing shape:  (20, 4)
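
As a quick sanity check on the row-level split, the sketch below repeats the same `sample`/`drop` pattern on a toy frame and confirms that the three frames are pairwise disjoint and together cover every row. The toy data, the variable names, and the seed are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the demo data (values are arbitrary).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 100, size=(200, 3)), columns=list("ABC"))

# Row-level split using the same sample/drop pattern as split_data.
train = toy.sample(frac=0.70, random_state=0)
pool = toy.drop(train.index)
valid = pool.sample(n=int(0.20 * len(toy)), random_state=0)
test = toy.drop(train.index.union(valid.index))

# The three index sets should be pairwise disjoint ...
assert train.index.intersection(valid.index).empty
assert train.index.intersection(test.index).empty
assert valid.index.intersection(test.index).empty

# ... and together account for every row exactly once.
sizes = (len(train), len(valid), len(test))
assert sum(sizes) == len(toy)
print(sizes)  # (140, 40, 20)
```

The sizes match the shapes printed above because the validation count is computed from the full frame (`int(0.20 * 200)`), not from the rows left over after training.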


## Splitting based on entries
Split data into training, validation, and testing sets by a given grouping variable, so that all rows sharing a group value land in the same set. This is necessary in our application because, if we split on rows only, entities in the testing set will already have been seen during training, though unlabeled there. This may negatively impact the performance of the model.
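
To make the concern concrete, here is a minimal sketch of what a row-level split does to grouped data. The toy frame (5 groups of 40 rows each, column names `grp` and `value`) is hypothetical. With 140 training rows and only 40 rows per group, training must touch at least four groups and testing at least two, so at least one of the five groups ends up on both sides whatever the seed.

```python
import pandas as pd

# Hypothetical toy data: 5 groups ("a".."e") with 40 rows each.
toy = pd.DataFrame({
    "grp": [g for g in "abcde" for _ in range(40)],
    "value": range(200),
})

# Row-level 70/30 split that ignores the grouping variable.
train = toy.sample(frac=0.70, random_state=0)
test = toy.drop(train.index)

# Groups that appear on both sides of the split.
shared = set(train["grp"]) & set(test["grp"])
print(sorted(shared))
```

A group-level split avoids this by assigning each unique value of the grouping variable to exactly one set before any rows are selected.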

In [None]:
#export
def split_data_grp(df, prop_train = 0.80, prop_validation = None, grp_var = None, seed = None):
    '''
        df: data frame to use for splitting
        prop_train: proportion of groups (or of rows, if grp_var is None) for training
        prop_validation: proportion for validation ; if None, testing set is 1 - prop_train. If not None, testing set is 1 - (prop_train + prop_validation)
        grp_var: variable to split data frames on, passed as a string
        seed: seed number to use for splitting the data for reproducibility
        
        returns: 2 or 3 dataframes based on the inputs
    '''
    
    # If grouping variable is supplied
    if grp_var is not None:
        
        #Determine the relevant splits of interest
        if prop_validation is None:
            prop_validation = 0
            
        prop_test = 1 - prop_train - prop_validation
        
        #Light error checking
        if prop_test <= 0:
            raise ValueError("prop_train + prop_validation must be less than 1 so that the testing set is not empty.")
        
        #Select out the unique groups (note: we reset index here because otherwise, the horzconcat below tries to align on the row indices)
        unique_groups = df[grp_var].drop_duplicates().reset_index(drop=True)
        n_grps = len(unique_groups)
        
        #Generate list with values 1, 2, and 3 in proportion to the train/valid/test splits
        rep_list = [1]*int(n_grps*prop_train) + [2]*int(n_grps*prop_validation) + [3]*int(n_grps*prop_test)
        
        #For non-even splits, just add these to the test set
        n_leftovers = n_grps - len(rep_list)
        rep_list = rep_list + [3]*n_leftovers
        
        #Randomly permute these values to get assignments
        grp_assigns = (pd.DataFrame(rep_list, columns=['split'])
                       .sample(frac=1, random_state = seed)
                       .reset_index(drop=True))
        
        #Concatenate these onto the unique_groups dataframe
        unique_groups = pd.concat([unique_groups, grp_assigns], axis=1)
        
        #Join the split assignments with the original dataframe (unique row split assignments will be broadcast to the non-unique ones)
        full_df = pd.merge(df, unique_groups, on=grp_var)
        
        #Split and drop columns
        tr_df = full_df.query('split==1').drop(columns=['split'])
        val_df = full_df.query('split==2').drop(columns=['split'])
        te_df = full_df.query('split==3').drop(columns=['split'])
        
        #Return the splits
        if prop_validation == 0:
            return tr_df, te_df
        else:
            return tr_df, val_df, te_df
    
    else:
        
        #no grouping variable applies original split_data function
        return split_data(df, train_prop = prop_train, validation_prop = prop_validation, seed = seed)

    

### Basic usage of `split_data_grp` function
Now, let's both check and demonstrate the usage of the `split_data_grp` function.  We need to make sure that none of the groups that are in `Train` are also in `Test` or `Valid` splits.

In [None]:
#no_test
#Example 1: with validation frame
train_df_grp, validation_df_grp, test_df_grp = split_data_grp(data, prop_train = 0.70, prop_validation = 0.20, grp_var = 'foo', seed=1234)

In [None]:
#no_test
#Print split data
print("Training: ")
display(train_df_grp.head())
print("Training shape: ", train_df_grp.shape)

print("Validation: ")
display(validation_df_grp.head())
print("Validation shape: ", validation_df_grp.shape)

print("Testing: ")
display(test_df_grp.head())
print("Testing shape: ", test_df_grp.shape)

Training: 


Unnamed: 0,A,B,C,foo
0,76,80,3,r
1,79,68,89,r
2,39,26,96,r
3,28,35,28,r
4,83,45,39,r


Training shape:  (129, 4)
Validation: 


Unnamed: 0,A,B,C,foo
45,16,5,35,h
46,74,2,54,h
47,95,2,60,h
48,77,44,22,h
49,76,63,56,h


Validation shape:  (45, 4)
Testing: 


Unnamed: 0,A,B,C,foo
88,18,10,60,t
89,83,75,15,t
90,9,86,11,t
91,83,86,50,t
92,93,74,4,t


Testing shape:  (26, 4)


The shapes of the splits look about right. Note that test will be a little bigger than requested because the group counts are rounded down and any leftover groups are added to the test set. Note also that it's really the _group_ proportions that we split on, so if a group has a LOT of entries, the row counts of the splits may not match the requested proportions. We also have only the columns that we want now (i.e., no `split` column).
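
The rounding can be worked through by hand. The sketch below mirrors the counting arithmetic inside `split_data_grp` in plain Python; the helper name `allocate_groups` is hypothetical and is not part of the exported module.

```python
def allocate_groups(n_grps, prop_train, prop_validation=0.0):
    """Return (train, validation, test) group counts.

    Hypothetical helper that mirrors the counting in split_data_grp.
    """
    prop_test = 1 - prop_train - prop_validation

    # Each share is truncated toward zero, as int() does in the function.
    n_train = int(n_grps * prop_train)
    n_valid = int(n_grps * prop_validation)
    n_test = int(n_grps * prop_test)

    # Groups lost to truncation are added to the test set.
    leftovers = n_grps - (n_train + n_valid + n_test)
    return n_train, n_valid, n_test + leftovers


counts = allocate_groups(26, 0.70, 0.20)
print(counts)  # (18, 5, 3)
```

For 26 groups the raw shares are 18.2, 5.2, and 2.6, which truncate to 18, 5, and 2. That leaves one group unassigned, and it goes to the test set.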

In [None]:
#no_test
print('Train group shape: ', train_df_grp['foo'].drop_duplicates().shape)
print('Validation group shape: ', validation_df_grp['foo'].drop_duplicates().shape)
print('Test group shape: ', test_df_grp['foo'].drop_duplicates().shape)

Train group shape:  (18,)
Validation group shape:  (5,)
Test group shape:  (3,)


These proportions are correct given that there are 26 letters (groups) in our demo data set.  Now, let's look at the contents of the splits to make sure that no group appears in more than one split:

In [None]:
#no_test
set(train_df_grp['foo'].values).intersection(set(validation_df_grp['foo'].values))

set()

In [None]:
#no_test
set(train_df_grp['foo'].values).intersection(set(test_df_grp['foo'].values))

set()

In [None]:
#no_test
set(validation_df_grp['foo'].values).intersection(set(test_df_grp['foo'].values))

set()
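
The three pairwise checks above could also be folded into a single helper. This is a minimal sketch; the name `shared_labels` and its keyword-argument interface are assumptions for illustration, and the labels passed in below are made up.

```python
from itertools import combinations


def shared_labels(**splits):
    """Map each pair of split names to the group labels they have in common.

    Hypothetical helper: every keyword argument is an iterable of group labels.
    """
    label_sets = {name: set(labels) for name, labels in splits.items()}
    return {
        (a, b): label_sets[a] & label_sets[b]
        for a, b in combinations(label_sets, 2)
    }


# Made-up labels standing in for the 'foo' column of each split.
overlaps = shared_labels(train=["r", "y", "k"], valid=["h", "o"], test=["t", "m"])
assert all(not common for common in overlaps.values())
```

With the frames from this notebook, it could be called as `shared_labels(train=train_df_grp['foo'], valid=validation_df_grp['foo'], test=test_df_grp['foo'])`, and every value in the result should be an empty set.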

Here, we can see that there is no intersection among these splits.  Now, let's make sure that the case with no validation set works:

In [None]:
#no_test
#Example  2: no validation frame
train_df_no_valid, test_df_no_valid = split_data_grp(data, prop_train = 0.70, grp_var = 'foo')

In [None]:
#no_test
#Print split data
print("Training: ")
display(train_df_no_valid.head())
print("Training shape: ", train_df_no_valid.shape)

print("Testing: ")
display(test_df_no_valid.head())
print("Testing shape: ", test_df_no_valid.shape)

Training: 


Unnamed: 0,A,B,C,foo
18,28,2,36,y
19,87,0,73,y
20,97,71,28,y
21,89,82,32,y
22,54,20,98,y


Training shape:  (137, 4)
Testing: 


Unnamed: 0,A,B,C,foo
0,76,80,3,r
1,79,68,89,r
2,39,26,96,r
3,28,35,28,r
4,83,45,39,r


Testing shape:  (63, 4)


Looks good as well.

In [None]:
#no_test
from nbdev.export import notebook2script
notebook2script()

Converted 31-collate-xml-entities-spans.ipynb.
Converted 33-split-data.ipynb.
Converted 41-generic-framework-for-spacy-training.ipynb.
Converted 42-initial-model.ipynb.
Converted data-preprocessing.ipynb.
Converted markup-to-spacy.ipynb.
Converted unstructured-to-markup.ipynb.
