# Dataset selection

This notebook splits the full dataset into a training set and a held-out test set, then creates the validation sets from the training set. It uses the dataframe created by the Data Processing notebook.

The following libraries are used.

In [None]:
# File manipulation
import os

# Data manipulation
import pandas as pd

# Stratified data splitting
from sklearn.model_selection import train_test_split

## Import the full dataset dataframe

In [None]:
df_full = pd.read_csv('dataframes/dataset.csv')

## Create the test dataset

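The split uses a test_size of 0.198 to divide the 1010-image dataset into an 810-image training set and a 200-image test set. The split is stratified on the rbr column, and random_state is fixed so the result is reproducible.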

In [None]:
test_size = 0.198

X_train, X_test, y_train, y_test = train_test_split(df_full.drop(['percentage'], axis=1), 
                                                    df_full['percentage'], 
                                                    test_size=test_size, 
                                                    random_state=42, 
                                                    stratify=df_full['rbr'])
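The set sizes quoted above follow from how a fractional test_size is rounded. The sketch below reproduces that arithmetic, assuming the test count is rounded up and the remainder goes to training (which is how scikit-learn is understood to treat a float test_size when train_size is not given). The helper name split_sizes is hypothetical and is not part of the notebook.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the remaining
    samples are assigned to the training set.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 1010 images with test_size=0.198
print(split_sizes(1010, 0.198))  # (810, 200)
```

A test_size of exactly 0.2 would give 202 test images under the same rule, which is presumably why the slightly smaller fraction is used.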

Concatenate the features and target back together to get the training and test sets, indexed and sorted by filename.

In [None]:
df_train = pd.concat([X_train, y_train], axis=1).set_index('filename').sort_index()
df_test = pd.concat([X_test, y_test], axis=1).set_index('filename').sort_index()
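Because the split is stratified, each class of the stratification column should appear in roughly the same proportion in the subset as in the full frame. A minimal sketch of such a check is shown below. The helper class_proportions and the toy values are hypothetical; in the notebook it could be called as class_proportions(df_full, df_test, 'rbr').

```python
import pandas as pd


def class_proportions(full, part, column):
    """Tabulate the share of each class of `column` in two frames."""
    return pd.DataFrame({
        'full': full[column].value_counts(normalize=True),
        'part': part[column].value_counts(normalize=True),
    }).fillna(0.0)


# Toy data standing in for the real frames (values are made up)
toy_full = pd.DataFrame({'rbr': ['a'] * 6 + ['b'] * 4})
toy_part = toy_full.iloc[[0, 1, 2, 6, 7]]

report = class_proportions(toy_full, toy_part, 'rbr')
print(report)
```

Large differences between the two columns would suggest the stratification did not behave as intended.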

Save the training and test sets to file.

In [None]:
df_train.to_csv('dataframes/train.csv')
df_test.to_csv('dataframes/test.csv')

## Create validation folds

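Create 10 validation folds to be used to test model generalisability. Each fold is an independent stratified random split of the training set (test_size of 0.197, seeded with the fold index), so the validation sets of different folds can overlap.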

In [None]:
for idx in range(10):
    # Split the training frame at random, stratified on 'rbr' and seeded by the fold index
    X_t, X_v, y_t, y_v = train_test_split(df_train.drop(['percentage'], axis=1),
                                          df_train['percentage'],
                                          test_size=0.197,
                                          random_state=idx,
                                          stratify=df_train['rbr'])

    # Create a subset for training and validation
    df_fold_train = pd.concat([X_t, y_t], axis=1).sort_index()
    df_fold_valid = pd.concat([X_v, y_v], axis=1).sort_index()
    
    # Stack them with the validation rows first, then the training rows
    df_fold = pd.concat([df_fold_valid, df_fold_train])

    # Save each fold to its own CSV file
    df_fold.to_csv(f'dataframes/train_{idx}.csv')
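Each saved fold file stores the validation rows first and the training rows after them, with no column marking which is which. A consumer therefore has to split by row position. The sketch below shows one way to do that, assuming 160 validation rows per fold (0.197 of 810, rounded up). The helper split_fold and the constant N_VALID are hypothetical and not part of the notebook.

```python
import pandas as pd

# Assumption: each fold holds out ceil(0.197 * 810) = 160 validation rows
N_VALID = 160


def split_fold(df_fold, n_valid=N_VALID):
    """Split a fold frame into (validation, training) by row position."""
    return df_fold.iloc[:n_valid], df_fold.iloc[n_valid:]


# In the notebook this might be:
#   df_fold = pd.read_csv('dataframes/train_0.csv', index_col='filename')
# Here a toy frame stands in for a saved fold.
toy_fold = pd.DataFrame({'percentage': range(10)})
valid, train = split_fold(toy_fold, n_valid=3)
print(len(valid), len(train))  # 3 7
```

Adding an explicit column that labels each row as training or validation would avoid relying on row order, at the cost of changing the saved file format.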