# 12-training-dev-test
> Creating functions to split the data into training, validation, and test sets

The purpose of this notebook is to create a function that reproducibly adds a `split` column to the full dataframe, marking each row as training, validation, or test.

#### Helpful packages and preliminaries

In [None]:
#all_no_test

In [None]:
#export text_preprocessing
#data access and processing
import pandas as pd
import numpy as np

#splitting the data
from sklearn.model_selection import StratifiedShuffleSplit

#python and file system operations
import glob
import os.path
import docx
import re

# Suppress all warnings
import warnings
warnings.filterwarnings('ignore')

#### File constants

In [None]:
#base_prefix = os.path.expanduser('~/Box Sync/DSI Documents/')
base_prefix = os.path.expanduser('/data/p_dsi/wise/data/')

file_directory = base_prefix + 'Audio Files & Tanscripts/Transcripts'
cleaned_transcripts_dir = base_prefix + 'cleaned_data/cleaned_transcripts' 

# Create a training, validation, and test set based on whole csv output
We'll use the `final_csv.csv` file created in notebook 10.

## Read in the final concatenated data and show the head of it

In [None]:
cleaned_data_filename = base_prefix + 'cleaned_data/final_csv.csv'

In [None]:
#read full csv
full_df = pd.read_csv(cleaned_data_filename)

print(full_df.shape)
full_df.head()

(14008, 8)


Unnamed: 0,id,transcript_filepath,wave_filename,speech,start_timestamp,end_timestamp,label,transcriber_id
0,255-3,~ise/data/cleaned_data/cleaned_transcripts/255...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,(okay) before we pass out our character plain ...,00:00:00.00,00:02:05.28,OTR,198
1,255-3,~ise/data/cleaned_data/cleaned_transcripts/255...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,this time you're gonna look for four types of ...,00:00:00.00,00:02:05.28,NEU,198
2,255-3,~ise/data/cleaned_data/cleaned_transcripts/255...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,okay.,00:00:00.00,00:02:05.28,NEU,198
3,255-3,~ise/data/cleaned_data/cleaned_transcripts/255...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,yeah.,00:00:00.00,00:02:05.28,NEU,198
4,255-3,~ise/data/cleaned_data/cleaned_transcripts/255...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,you're gonna work as a whole group actually.,00:00:00.00,00:02:05.28,NEU,198


## We divide the dataset into three parts in two steps

### First step is to divide the whole dataset into two parts, train and test.

In [None]:
# We get 10% test set now and 90% training set
sss = StratifiedShuffleSplit(n_splits = 1, test_size = 0.1, random_state= 0)

#### Drop rows with a missing label so StratifiedShuffleSplit can run, and work on a copy of the full data frame

In [None]:
# Count the number of NA
full_df["label"].isna().sum()

0

In [None]:
# .copy() makes df independent of full_df, so later edits do not touch the original
df = full_df[full_df['label'].notna()].copy()

#### Count the frequency of each class, since a stratified split needs at least two members per class

In [None]:
np.unique(df['label'].to_numpy(), return_counts=True)

(array(['NEU', 'NUE', 'OTR', 'PRS', 'REP'], dtype=object),
 array([8089,    1, 3690, 1715,  513]))

Here, we see that one row carries a misspelled label (`NUE`). Let's fix it, along with other label typos that may appear in the data.

In [None]:
# Map known typos to their intended labels (avoids chained assignment)
df['label'] = df['label'].replace({"NO": "NEU", "NUE": "NEU", "OT": "OTR", "OTS": "OTR"})
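
The reason a single-member class matters: `StratifiedShuffleSplit` raises a `ValueError` when any class has fewer than two members. The following is a minimal sketch on made-up toy labels (not the project data), in which a lone `NUE` entry blocks the split until it is merged into `NEU`.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels (illustrative only): 'NUE' appears once, like the typo above
labels = np.array(['NEU'] * 8 + ['OTR'] * 6 + ['NUE'])

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

# With a single-member class, the stratified split cannot be made
try:
    next(splitter.split(np.zeros(len(labels)), labels))
    split_failed = False
except ValueError as err:
    split_failed = True
    print("Split failed:", err)

# After merging the typo into 'NEU', the same splitter succeeds
fixed_labels = np.where(labels == 'NUE', 'NEU', labels)
train_idx, test_idx = next(splitter.split(np.zeros(len(fixed_labels)), fixed_labels))
print("Train size:", len(train_idx), "Test size:", len(test_idx))
```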

#### Check again the frequency to make sure there are at least 2 members in one class

In [None]:
np.unique(df['label'].to_numpy(), return_counts=True)

(array(['NEU', 'OTR', 'PRS', 'REP'], dtype=object),
 array([8090, 3690, 1715,  513]))

#### Do one split and set a random_state = 0 to make it reproducible

In [None]:
for train_index, test_index in sss.split(np.zeros(df.shape[0]), df['label'].to_numpy()):
    print("TRAIN:", train_index, "TEST:", test_index)
    print("Train size is:", len(train_index))
    print("Test size is:", len(test_index))
    df_train = df.iloc[train_index]
    df_test = df.iloc[test_index]

TRAIN: [ 1347 12542  5536 ...  5100  2879  9814] TEST: [ 8603  3719  5035 ...  9478 10760 13936]
Train size is: 12607
Test size is: 1401


In [None]:
print(df_train.shape, df_test.shape)

(12607, 8) (1401, 8)


### Second step is to divide the df_train obtained above into the final training set and a validation set

In [None]:
# Hold out 1/9 of the remaining 90% so the validation set is 10% of the full dataset
sss = StratifiedShuffleSplit(n_splits = 1, test_size = 1/9, random_state= 0)
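
The 1/9 comes from expressing the validation share relative to what is left after the test set is removed: 0.1 / (1 - 0.1) = 1/9. Here is a small sketch of that arithmetic. `relative_dev_size` is a hypothetical helper, and rounding the held-out count up is an assumption that happens to reproduce the sizes printed in this notebook.

```python
import math

def relative_dev_size(test_ratio, dev_ratio):
    """Fraction of the post-test remainder to hold out as validation."""
    return dev_ratio / (1 - test_ratio)

n_total = 14008  # rows in final_csv.csv, from the shape printed earlier
test_ratio, dev_ratio = 0.1, 0.1

# Assumption: the held-out count is rounded up to a whole row
n_test = math.ceil(n_total * test_ratio)
n_remaining = n_total - n_test

# Second split: take dev_ratio / (1 - test_ratio) of the remainder
n_dev = math.ceil(n_remaining * relative_dev_size(test_ratio, dev_ratio))
n_train = n_remaining - n_dev

print(n_train, n_dev, n_test)
```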

In [None]:
for train_index, dev_index in sss.split(np.zeros(df_train.shape[0]), df_train['label'].to_numpy()):
    print("TRAIN:", train_index, "Validation:", dev_index)
    print("Train size is:", len(train_index))
    print("Validation size is:", len(dev_index))
    # Index into df_train (not df): these positions refer to the 90% remainder
    df_dev = df_train.iloc[dev_index]
    df_train = df_train.iloc[train_index]

TRAIN: [ 8804  2368 12230 ...  9144  3764 10815] Validation: [10843  3496  1403 ...  5609  5333   175]
Train size is: 11206
Validation size is: 1401


### Finally we got all three parts: df_train, df_dev, df_test
### Add a split column to indicate train = 0, validation = 1, test = 2

In [None]:
df_train["split"] = 0
df_dev["split"] = 1
df_test["split"] = 2

In [None]:
#pd.set_option('display.max_rows', None)
df_train_dev_test = pd.concat([df_train, df_dev, df_test])
df_train_dev_test.shape
df_train_dev_test.head()

Unnamed: 0,id,transcript_filepath,wave_filename,speech,start_timestamp,end_timestamp,label,transcriber_id,split
8804,123-1,~ise/data/cleaned_data/cleaned_transcripts/123...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,tuesday.,00:06:27.05,00:08:17.18,NEU,198,0
2368,083-2,~ise/data/cleaned_data/cleaned_transcripts/083...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,you worked real hard to help name get it toget...,00:00:00.00,00:01:15.06,PRS,198,0
12230,251-3,~ise/data/cleaned_data/cleaned_transcripts/251...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,anything el*>,00:08:05.04,00:09:59.27,OTR,198,0
11125,027-2,~ise/data/cleaned_data/cleaned_transcripts/027...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,ERE.,00:01:16.08,00:02:05.11,NEU,198,0
10454,131-1,~ise/data/cleaned_data/cleaned_transcripts/131...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,name where's a space between the words in our ...,00:07:22.23,00:08:39.09,OTR,198,0
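
Since the three parts are meant to be disjoint, one quick sanity check is to confirm that no row index appears in more than one of them. The sketch below uses a toy frame. `find_overlap` is a hypothetical helper, not part of this notebook, and it assumes the dataframe index is unique.

```python
import pandas as pd

def find_overlap(*frames):
    """Return the sorted index labels that occur in more than one frame."""
    seen, overlap = set(), set()
    for frame in frames:
        labels = set(frame.index)
        overlap |= seen & labels
        seen |= labels
    return sorted(overlap)

# Toy data (illustrative only)
toy = pd.DataFrame({'label': list('aabbaabb')})

# A clean partition: no index is shared between the parts
toy_train, toy_dev, toy_test = toy.iloc[:4], toy.iloc[4:6], toy.iloc[6:]
clean_overlap = find_overlap(toy_train, toy_dev, toy_test)
print("Clean overlap:", clean_overlap)

# A leaky partition: row 3 is in both the training and validation parts
leaky_dev = toy.iloc[3:6]
leaky_overlap = find_overlap(toy_train, leaky_dev, toy_test)
print("Leaky overlap:", leaky_overlap)
```

On the real data, the same check would be `find_overlap(df_train, df_dev, df_test)`, which should return an empty list.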


## Define a function to wrap this whole process

In [None]:
#export text_preprocessing
def add_splits(full_df, seed = 0, test_ratio = 0.1, dev_ratio = 0.1):
    '''
    Takes a pandas DataFrame with a `label` column and returns a new DataFrame with an added `split` column.
    Uses StratifiedShuffleSplit so that each split keeps the label distribution of the complete set.
    train split = 0
    validation split = 1
    test split = 2
    
    Arguments:
    full_df: DataFrame containing a `label` column
    seed: random seed that makes the splits reproducible
    test_ratio: fraction of the original complete data used for the test set
    dev_ratio: fraction of the original complete data used for the validation set
    '''
    
    # Drop rows with a missing label so StratifiedShuffleSplit can run; copy to leave full_df untouched
    df = full_df[full_df['label'].notna()].copy()
    
    # Correct typo labels
    df['label'] = df['label'].replace({"NO": "NEU", "NUE": "NEU", "OT": "OTR", "OTS": "OTR"})
    
    # Hold out test_ratio of the data as the test set; the remainder is the provisional training set
    sss = StratifiedShuffleSplit(n_splits = 1, test_size = test_ratio, random_state= seed)
    
    # Split them into training and test set
    for train_index, test_index in sss.split(np.zeros(df.shape[0]), df['label'].to_numpy()):
        df_train = df.iloc[train_index]
        df_test = df.iloc[test_index]
        
    # Hold out dev_ratio / (1 - test_ratio) of the remainder so the validation set is dev_ratio of the original dataset
    sss = StratifiedShuffleSplit(n_splits = 1, test_size = dev_ratio / (1 - test_ratio), random_state= seed)
    
    # Split the remainder into the final training and validation sets
    for train_index, dev_index in sss.split(np.zeros(df_train.shape[0]), df_train['label'].to_numpy()):
        # Index into df_train (not df): these positions refer to the post-test remainder
        df_dev = df_train.iloc[dev_index]
        df_train = df_train.iloc[train_index]
    
    # Add a split column, use train = 0, validation = 1, test = 2
    df_train["split"] = 0
    df_dev["split"] = 1
    df_test["split"] = 2
    
    # Row bind all dataframe
    df_train_dev_test = pd.concat([df_train, df_dev, df_test])
    
    # Print each dataset size
    print("Train size is: " + str((1-test_ratio-dev_ratio)*100) + "%,", df_train.shape[0])
    print("Validation size is: " + str((dev_ratio)*100) + "%," ,df_dev.shape[0])
    print("Test size is: " + str((test_ratio)*100) + "%,", df_test.shape[0])
    
    return df_train_dev_test

### Test this new add_splits() function on final_csv.csv dataset in DSI Documents folder

In [None]:
pd.set_option('display.max_rows', 6)
splits_out_df = add_splits(pd.read_csv(cleaned_data_filename), 1, 0.1, 0.1)

Train size is: 80.0%, 11206
Validation size is: 10.0%, 1401
Test size is: 10.0%, 1401


In [None]:
splits_out_df.sample(10)

Unnamed: 0,id,transcript_filepath,wave_filename,speech,start_timestamp,end_timestamp,label,transcriber_id,split
8115,252-3,~ise/data/cleaned_data/cleaned_transcripts/252...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,character.,00:00:00.00,00:01:59.07,NEU,198,2
11397,129-1,~ise/data/cleaned_data/cleaned_transcripts/129...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,name.,00:04:59.21,00:06:00.05,NEU,198,0
4143,027-3,~ise/data/cleaned_data/cleaned_transcripts/027...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,and it has 2001.,00:00:00.00,00:02:36.21,NEU,198,0
...,...,...,...,...,...,...,...,...,...
8115,252-3,~ise/data/cleaned_data/cleaned_transcripts/252...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,character.,00:00:00.00,00:01:59.07,NEU,198,0
1015,134-1,~ise/data/cleaned_data/cleaned_transcripts/134...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,okay.,00:08:00.21,00:10:01.01,NEU,198,0
761,135-2,~ise/data/cleaned_data/cleaned_transcripts/135...,/data/p_dsi/wise/data/Audio Files & Tanscripts...,they did ten times 100000.,00:04:01.23,00:06:11.27,NEU,198,0
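
Downstream code can recover each subset by filtering on the `split` column. A minimal sketch on a made-up toy frame follows. `SPLIT_NAMES` is a hypothetical mapping that mirrors the train = 0, validation = 1, test = 2 convention used here.

```python
import pandas as pd

# Hypothetical mapping from split code to a readable name
SPLIT_NAMES = {0: 'train', 1: 'validation', 2: 'test'}

# Toy frame (illustrative only) shaped like the output of add_splits
toy = pd.DataFrame({
    'speech': ['okay.', 'yeah.', 'name.', 'character.', 'tuesday.'],
    'label': ['NEU', 'NEU', 'NEU', 'NEU', 'NEU'],
    'split': [0, 0, 0, 1, 2],
})

# One sub-frame per split, selected with a boolean mask
subsets = {name: toy[toy['split'] == code] for code, name in SPLIT_NAMES.items()}

for name, subset in subsets.items():
    print(name, len(subset))
```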


Let's write this to a file for safekeeping.

In [None]:
#splits_out_df.to_csv(base_prefix + 'cleaned_data/final_csv_wsplits.csv', index=False)