# Create Data Splits (Train, Validation, Test)

This notebook shows how to split a dataset into training, validation, and test sets.
A "split" is a named collection made up of a master dataset and three disjoint subsets of it: train, validation, and test.
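These are stored as four separate CSV files in a folder named after the split inside the `splits` directory. For a split called `MySplit`, the code in this notebook writes `splits/MySplit/MySplit.csv` (the master), `splits/MySplit/train_MySplit.csv`, `splits/MySplit/val_MySplit.csv`, and `splits/MySplit/test_MySplit.csv`.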
Each row of a CSV file has several columns: some hold the file names of volumes (e.g. CT scans), and others hold plain-text information such as the presence of a tumor (0 or 1).

Keeping the master CSV and naming each split lets us track exactly which data was used for a particular model checkpoint.
It also lets us prune bad or corrupted data points from the master set so that they are not included in any of the train/val/test datasets.
Contaminated data does occur and can ruin a model if it goes unnoticed.

An image "volume" here means a large file such as a 3D CT image or its corresponding, equally large, segmentation label.
With this system, the image volumes can be stored anywhere on the file system, and the CSV holds only paths relative to the image directory.
We supply the image/label source directory for volumes in our configuration file (or `.env` file) so that our code (DataModule/Dataset) knows where to look.
**Note to self:** during DataModule initialization, check that all paths are valid and that the files exist.
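
The check described in the note above might look like the following minimal sketch. It uses only the standard library. The function name `find_missing_files` and the column names `image` and `label` are hypothetical and are not taken from the project's DataModule.

```python
import csv
from pathlib import Path


def find_missing_files(csv_path, image_dir, path_columns):
    """Return (row_number, column, relative_path) for every path that is not a file.

    csv_path:     a split CSV file
    image_dir:    directory that the relative paths in the CSV are resolved against
    path_columns: names of the columns holding relative file paths (assumed names)
    """
    image_dir = Path(image_dir)
    missing = []
    with open(csv_path, newline="") as f:
        # Row 1 is the header, so data rows start at 2.
        for row_number, row in enumerate(csv.DictReader(f), start=2):
            for column in path_columns:
                relative_path = row[column]
                if not (image_dir / relative_path).is_file():
                    missing.append((row_number, column, relative_path))
    return missing
```

A DataModule could call this once per split file at initialization, for example `find_missing_files(train_csv, image_dir, ["image", "label"])`, and raise an error if the returned list is not empty.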

In [5]:
import rootutils
import os
root = rootutils.setup_root(search_from=os.getcwd(), indicator=".project-root", dotenv=True, pythonpath=True, cwd=True)

Below, we set the source CSV, the split name, and the split ratios.
Note that the test set is held out from the full dataset **before** the remainder is split into train and val sets, so `VAL_SIZE` is a fraction of the remainder, not of the full dataset.
Thus, if both `TEST_SIZE` and `VAL_SIZE` are 0.2, the final split is 0.8 × 0.8 = 64% train, 0.8 × 0.2 = 16% val, and 20% test.
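
The same arithmetic applies to any pair of ratios. As a quick check, the hypothetical helper below (not part of the notebook) computes the fraction of the full dataset that ends up in each subset, assuming the test set is held out first and the validation ratio is applied to what remains.

```python
def effective_fractions(test_size, val_size):
    """Fractions of the full dataset in (train, val, test) for a two-stage split."""
    remainder = 1.0 - test_size   # what is left after holding out the test set
    val = remainder * val_size    # val_size applies to the remainder only
    train = remainder - val
    return train, val, test_size


train_frac, val_frac, test_frac = effective_fractions(test_size=0.2, val_size=0.2)
print(f"train={train_frac:.2f}, val={val_frac:.2f}, test={test_frac:.2f}")
# prints: train=0.64, val=0.16, test=0.20
```

To make the validation set a fraction `v` of the full dataset instead, set the validation ratio to `v / (1 - test_size)`; for a 60/20/20 split that is 0.2 / 0.8 = 0.25. Sample counts are rounded to whole rows, so the proportions on a small dataset can differ slightly from these values.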

In [6]:
CSV_SRC = 'splits/spleen.csv'
SPLIT_NAME = 'MySplit'
TEST_SIZE = 0.2
VAL_SIZE = 0.2

In [8]:
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the data from the CSV source
dataframe = pd.read_csv(os.path.join(root, CSV_SRC))

# Hold out the test set first, then split the remainder into train and val.
# VAL_SIZE is therefore a fraction of the remainder: with TEST_SIZE = VAL_SIZE = 0.2
# the result is 64% train (.8 * .8), 16% val (.8 * .2), and 20% test.
train_val, test = train_test_split(dataframe, test_size=TEST_SIZE, random_state=42)
train, val = train_test_split(train_val, test_size=VAL_SIZE, random_state=42)

# Make the named split directory inside of the splits directory
split_dir = os.path.join(root, 'splits', SPLIT_NAME)
os.makedirs(split_dir, exist_ok=True)

# Write the master, train, val, and test data to CSV files
dataframe.to_csv(os.path.join(split_dir, SPLIT_NAME + '.csv'), index=False)
train.to_csv(os.path.join(split_dir, 'train_' + SPLIT_NAME + '.csv'), index=False)
val.to_csv(os.path.join(split_dir, 'val_' + SPLIT_NAME + '.csv'), index=False)
test.to_csv(os.path.join(split_dir, 'test_' + SPLIT_NAME + '.csv'), index=False)
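
Since the three subsets are meant to be disjoint and to cover the master set, a sanity check after splitting can be useful. The sketch below is one possible way to do it. The helper `check_split` is hypothetical, and it assumes every row has a unique id.

```python
def check_split(master_ids, train_ids, val_ids, test_ids):
    """Raise ValueError unless train/val/test are pairwise disjoint and together equal master."""
    master, train, val, test = (
        set(ids) for ids in (master_ids, train_ids, val_ids, test_ids)
    )
    overlaps = {
        "train/val": train & val,
        "train/test": train & test,
        "val/test": val & test,
    }
    for pair, shared in overlaps.items():
        if shared:
            raise ValueError(f"{pair} share {len(shared)} row(s): {sorted(shared)[:5]}")
    if train | val | test != master:
        raise ValueError("train, val, and test together do not match the master set")


# Toy example with integer row ids: a clean 6/2/2 partition passes silently.
check_split(range(10), range(0, 6), range(6, 8), range(8, 10))
```

In this notebook, the row index assigned by `pd.read_csv` could serve as the id, e.g. `check_split(dataframe.index, train.index, val.index, test.index)`, since `train_test_split` should keep the original index labels on the DataFrames it returns.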