# <font color = teal> Yaml files of stratified split for training and testing </font>

With this notebook, you can create the yaml files needed for training and testing with a data split made using **stratification**. More detailed information about the stratified data split itself is available in the notebook [Introduction to data handling](1_introduction_data_handling.ipynb).

------

Note that the hyperparameters for training and testing are set in the yaml files.

In [4]:
# --- Parameters for the yaml files -------------
# Training parameters
batch_size = 10
num_workers = 0
epochs = 1
lr = 0.003
weight_decay = 0.00001

# Device configurations
device_count = 1

# -----------
# Decision threshold for predictions
threshold = 0.5

Examples of the training and the testing yaml files are provided below.

**Yaml file for training a model**
```
# INITIAL SETTINGS
train_file: train_split_1_1.csv
val_file: val_split_1_1.csv

# TRAINING SETTINGS
batch_size: 10
num_workers: 0
epochs: 1
lr: 0.003000
weight_decay: 0.000010

# VALIDATION SETTINGS
threshold: 0.500000

# DEVICE CONFIGS
device_count: 1

```

**Yaml file for testing a model**

```
# INITIAL SETTINGS
test_file: test_split_1.csv
model: split_1_1.pth

# TESTING SETTINGS
threshold: 0.500000

# DEVICE CONFIGS
device_count: 1

```
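The six-decimal values in the examples above (`0.003000`, `0.000010`, `0.500000`) are consistent with Python's fixed-point `{:f}` format specifier, which this notebook uses when it fills in the yaml templates and which always prints six digits after the decimal point. The snippet below is a small sketch of that behaviour. The variable `tiny` is an illustrative name, not part of the repository.

```python
# Illustrative values only; `tiny` is a made-up example, not a notebook parameter.
lr = 0.003
weight_decay = 0.00001
tiny = 1e-7

# '{:f}' is fixed-point notation with six digits after the decimal point
lr_text = '{:f}'.format(lr)
wd_text = '{:f}'.format(weight_decay)
tiny_text = '{:f}'.format(tiny)

print(lr_text, wd_text, tiny_text)
```

One consequence worth keeping in mind: a very small value such as `1e-7` would be written as `0.000000`, so a different format specifier may be needed if you plan to use hyperparameters of that magnitude.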

In [5]:
import os, re
from pathlib import Path
from ruamel.yaml import YAML
  

# Absolute path of the current working directory (the folder the notebook is run from)
abs_path = Path(os.path.abspath(''))

# PARAMETERS TO CREATE STRATIFIED YAML FILES  
# ------------------------------------------

# From where to load the csv files of stratified split
csv_path = os.path.join(abs_path.parent.absolute(), 'data', 'split_csvs', 'stratified_smoke')

# Where to save the training yaml files
train_yaml_save_path = os.path.join(abs_path.parent.absolute(), 'configs', 'training', 'train_stratified_smoke')

# Where to save the testing yaml files
test_yaml_save_path = os.path.join(abs_path.parent.absolute(), 'configs', 'predicting', 'predict_stratified_smoke')

The directory containing the csv files of the stratified data split is expected to be under `/data/split_csvs/`.

In [6]:
# Stratified csv files
csv_files = sorted([file for file in os.listdir(csv_path) if not file.startswith('.') and file.endswith('.csv')])

print(*csv_files, sep='\n')

test_split_1.csv
train_split_1_1.csv
train_split_1_2.csv
train_split_1_3.csv
train_split_1_4.csv
val_split_1_1.csv
val_split_1_2.csv
val_split_1_3.csv
val_split_1_4.csv


## <font color = teal> Combinations of training and testing sets </font>

With the stratified split, there are **five** different ways to split the PhysioNet 2021 databases into training and testing sets:

    1) train: PTB_PTBXL.csv, INCART.csv, G12EC.csv, ChapmanShaoxing_Ningbo.csv
       test: CPSC_CPSC-Extra.csv
       
    2) train: PTB_PTBXL.csv, INCART.csv, G12EC.csv, CPSC_CPSC-Extra.csv
       test: ChapmanShaoxing_Ningbo.csv

    3) train: PTB_PTBXL.csv, INCART.csv, ChapmanShaoxing_Ningbo.csv, CPSC_CPSC-Extra.csv
       test: G12EC.csv

    4) train: PTB_PTBXL.csv, G12EC.csv, ChapmanShaoxing_Ningbo.csv, CPSC_CPSC-Extra.csv
       test: INCART.csv

    5) train: G12EC.csv, INCART.csv, ChapmanShaoxing_Ningbo.csv, CPSC_CPSC-Extra.csv
       test: PTB_PTBXL.csv
       
Training sets will be stratified into training and validation sets.

Let's combine the csv files of training, validation and testing sets, e.g., `train_split_1_1.csv`, `val_split_1_1.csv` and `test_split_1.csv`, into a list of lists.

In [7]:
# First, divide train and validation splits into own lists
train_files = [file for file in csv_files if 'train' in file]
val_files = [file for file in csv_files if 'val' in file]

# Zip these two and convert to list since they should be sorted similarly
train_val_pair = list(zip(train_files, val_files))
print('First 5 training and validation pairs')
print(*train_val_pair[:5], sep='\n')
print()

# The pairs look right based on the print.
# Add the prediction (testing) files
test_files = [file for file in csv_files if 'test' in file]

split_nums = [] # These are for yaml files!!
train_val_test = []
for pair in train_val_pair:
    
    # Training and validation files separately
    train_tmp, val_tmp = pair
    
    # Get the split number of the training file, e.g. 'split_1_1' (group 1),
    # and the split it belongs to, e.g. 'split_1' (group 2)
    split_num = re.search(r'_((\w*)_\d)', train_tmp)
    split_nums.append(split_num.group(1) + '.yaml') # For yaml files!!
    
    train_split_num = split_num.group(2)
    for test_tmp in test_files:
        # Get the split number of the testing file
        test_split_num = re.search(r'_(\w*)', test_tmp).group(1)
        
        # If same split number in training, validation and prediction, combine
        if train_split_num == test_split_num:
            train_val_test.append([train_tmp, val_tmp, test_tmp])
            
print('Training, validation and testing pairs')
print(*train_val_test, sep='\n')
print()

print('Total of {} training, validation and testing sets'.format(len(train_val_test)))

First 5 training and validation pairs
('train_split_1_1.csv', 'val_split_1_1.csv')
('train_split_1_2.csv', 'val_split_1_2.csv')
('train_split_1_3.csv', 'val_split_1_3.csv')
('train_split_1_4.csv', 'val_split_1_4.csv')

Training, validation and testing pairs
['train_split_1_1.csv', 'val_split_1_1.csv', 'test_split_1.csv']
['train_split_1_2.csv', 'val_split_1_2.csv', 'test_split_1.csv']
['train_split_1_3.csv', 'val_split_1_3.csv', 'test_split_1.csv']
['train_split_1_4.csv', 'val_split_1_4.csv', 'test_split_1.csv']

Total of 4 training, validation and testing sets
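The regular expressions in the cell above do the matching work, so it may help to see what each group captures on a single file name. The sketch below wraps the same two patterns in a helper. The function name `split_labels` is hypothetical and exists only for this illustration.

```python
import re

def split_labels(train_csv, test_csv):
    """Return (yaml_name, train_split, test_split) for a pair of csv file names.

    Hypothetical helper that reuses the two patterns from the notebook cell.
    """
    match = re.search(r'_((\w*)_\d)', train_csv)
    yaml_name = match.group(1) + '.yaml'   # e.g. 'split_1_1.yaml'
    train_split = match.group(2)           # e.g. 'split_1'
    test_split = re.search(r'_(\w*)', test_csv).group(1)
    return yaml_name, train_split, test_split

labels = split_labels('train_split_1_1.csv', 'test_split_1.csv')
print(labels)

# \d matches a single digit, so a two-digit fold number is cut short
two_digit = split_labels('train_split_1_10.csv', 'test_split_1.csv')
print(two_digit[0])
```

A training file is combined with a testing file only when the second and third values are equal. Note that, because `\d` matches exactly one digit, a file such as `train_split_1_10.csv` would get the yaml name `split_1_1.yaml`. With the four folds used here this does not matter, but the pattern would likely need `\d+` if ten or more folds were created.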


From the sets above we are going to create the yaml files. 

In [8]:
def save_yaml(yaml_str, yaml_path, split):
    ''' Save the given string as a yaml file in the given location
    '''
    
    # Make the yaml directory (and any missing parent directories)
    os.makedirs(yaml_path, exist_ok=True)
    
    # Write the yaml file
    with open(os.path.join(yaml_path, split), 'w') as yaml_file:
        yaml = YAML()
        code = yaml.load(yaml_str)
        yaml.dump(code, yaml_file)
    
        
def create_testing_yaml(test_csv, split):
    ''' Make a yaml file for prediction. The base of it is presented above
    '''
    # The name of the model
    # e.g. trained with a yaml file `split_0_0_smoke.yaml`
    #      model saved as `split_0_0_smoke.pth`
    model_name = split.split('.')[0] + '.pth'
    
    yaml_str = '''\
# INITIAL SETTINGS
    test_file: {}
    model: {}
    
# TESTING SETTINGS
    threshold: {:f}

# DEVICE CONFIGS
    device_count: {}  
    '''.format(test_csv,
               model_name,
               threshold,
               device_count)
    
    yaml_path = test_yaml_save_path
    save_yaml(yaml_str, yaml_path, split)
    

def create_training_yaml(train_csv, val_csv, split):
    ''' Make a yaml file for training. The base of it is presented above
    '''
    yaml_str = '''\
# INITIAL SETTINGS
    train_file: {}
    val_file: {}

# TRAINING SETTINGS
    batch_size: {}
    num_workers: {}
    epochs: {}
    lr: {:f}
    weight_decay: {:f}
    
# VALIDATION SETTINGS
    threshold: {:f}

# DEVICE CONFIGS
    device_count: {}   
    '''.format(train_csv,
               val_csv,
               batch_size,
               num_workers, 
               epochs,
               lr,
               weight_decay,
               threshold,
               device_count)
    
    yaml_path = train_yaml_save_path
    save_yaml(yaml_str, yaml_path, split)

sets_and_name = list(zip(train_val_test, split_nums))
for pair, split in sets_and_name:
    train_tmp, val_tmp, test_tmp = pair[0], pair[1], pair[2]
    
    print('Training, validation and testing sets are')
    print(train_tmp.split('.')[0], '\t', val_tmp.split('.')[0], '\t', test_tmp.split('.')[0])
    print('Yaml file will be named as', split)
    print()
    
    # Training yaml file
    create_training_yaml(train_tmp, val_tmp, split)
    
    # Testing yaml file
    create_testing_yaml(test_tmp, split)

Training, validation and testing sets are
train_split_1_1 	 val_split_1_1 	 test_split_1
Yaml file will be named as split_1_1.yaml

Training, validation and testing sets are
train_split_1_2 	 val_split_1_2 	 test_split_1
Yaml file will be named as split_1_2.yaml

Training, validation and testing sets are
train_split_1_3 	 val_split_1_3 	 test_split_1
Yaml file will be named as split_1_3.yaml

Training, validation and testing sets are
train_split_1_4 	 val_split_1_4 	 test_split_1
Yaml file will be named as split_1_4.yaml



Now all the yaml files for training, validation and testing are created! The training yaml files are located in `/configs/training/train_stratified_smoke/`, named `split_1_1.yaml`, `split_1_2.yaml`, `split_1_3.yaml` and `split_1_4.yaml`, and the testing yaml files are in `/configs/predicting/predict_stratified_smoke/` under the same names.
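Since the training and testing yaml files are expected to share names, a quick consistency check can be sketched as follows. This is only an illustration under stated assumptions: the helper `unmatched_yaml_names` is hypothetical, and the demonstration runs on temporary directories rather than on the real `configs` folders.

```python
import os
import tempfile

def unmatched_yaml_names(train_dir, test_dir):
    """Return yaml file names that exist in only one of the two directories."""
    train_names = {f for f in os.listdir(train_dir) if f.endswith('.yaml')}
    test_names = {f for f in os.listdir(test_dir) if f.endswith('.yaml')}
    # Symmetric difference: names missing from either side
    return sorted(train_names ^ test_names)

# Demonstration on throwaway directories, not the repository's config folders
with tempfile.TemporaryDirectory() as root:
    train_dir = os.path.join(root, 'training')
    test_dir = os.path.join(root, 'predicting')
    os.makedirs(train_dir)
    os.makedirs(test_dir)
    for name in ['split_1_1.yaml', 'split_1_2.yaml']:
        open(os.path.join(train_dir, name), 'w').close()
    open(os.path.join(test_dir, 'split_1_1.yaml'), 'w').close()

    unmatched = unmatched_yaml_names(train_dir, test_dir)

print(unmatched)
```

An empty list means every training yaml file has a testing counterpart with the same name.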

<font color=red>**NOTE 1!**</font> It is extremely important that the model in the test yaml file has the same name as the yaml file the model was trained with. E.g., when a model is trained using `split_1_1.yaml`, it will be saved as `split_1_1.pth`. This makes using the repository much easier and simpler. Keep this in mind if you want to edit the code above.
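The naming rule can be expressed as a one-line mapping from the yaml file name to the model file name. The sketch below uses `pathlib` for this. The helper `model_name_for` is a hypothetical name, and the notebook itself builds the same string with `split.split('.')[0] + '.pth'`.

```python
from pathlib import Path

def model_name_for(yaml_name):
    """Map a training yaml file name to the expected model file name."""
    # Swap the '.yaml' suffix for '.pth' and keep only the file name
    return Path(yaml_name).with_suffix('.pth').name

yaml_name = 'split_1_1.yaml'
model_name = model_name_for(yaml_name)
print(model_name)
```

For file names with a single dot, such as the ones generated here, both approaches give the same result.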

<font color=red>**NOTE 2!**</font> If you are wondering why the csv values in the yaml files (`train_file`, `val_file` and `test_file`) are not in single quotation marks, that's fine. The scripts are able to read and load the values from the yaml files even without quotation marks.