# <font color = teal> Yaml files of database-wise split for training and testing </font>

With this notebook, you can create the yaml files needed for training and testing with a data split that is made **database-wise**. The so-called "original data split" must first be made with the script `create_data_csvs.py`, which ensures that there is already one csv file for each database. The combined csv files are created below. More detailed information about this is available in the notebook [Introduction to data handling](1_introduction_data_handling.ipynb). 

------

Note that the hyperparameters for training and testing are set in the yaml files.

In [19]:
# --- Parameters for the yaml files -------------
# Training parameters
batch_size = 10
num_workers = 0
epochs = 1
lr = 0.003
weight_decay = 0.00001

# Device configurations
device_count = 1

# Decision threshold for predictions
threshold = 0.5

Examples of the training and the testing yaml files are provided below.

**Yaml file for training a model**
```
# INITIAL SETTINGS
train_file: train_split_1_1.csv
val_file: val_split_1_1.csv

# TRAINING SETTINGS
batch_size: 10
num_workers: 0
epochs: 1
lr: 0.003000
weight_decay: 0.000010

# VALIDATION SETTINGS
threshold: 0.5

# DEVICE CONFIGS
device_count: 1

```

**Yaml file for testing a model**

```
# INITIAL SETTINGS
test_file: test_split_1.csv
model: split_1_1.pth

# TESTING SETTINGS
threshold: 0.500000

# DEVICE CONFIGS
device_count: 1

```
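
The trailing zeros in values such as `lr: 0.003000` and `threshold: 0.500000` come from formatting the numbers with `{:f}`, which always writes six decimals. The sketch below illustrates this, and also shows that a value smaller than the sixth decimal would be written as zero. The helper `format_param` and its `decimals` argument are hypothetical names used only for illustration; they are not part of the repository.

```python
def format_param(value, decimals=6):
    """Format a float with a fixed number of decimals.

    NOTE: hypothetical helper for illustration only. With decimals=6 it
    gives the same result as '{:f}'.format(value).
    """
    return '{:.{}f}'.format(value, decimals)


print(format_param(0.003))     # 0.003000
print(format_param(0.00001))   # 0.000010
print(format_param(1e-7))      # 0.000000 (too small for six decimals)
print(format_param(1e-7, 8))   # 0.00000010
```

If a very small `weight_decay` or `lr` is ever needed, it may be worth checking the written yaml file to confirm the value was not rounded to zero.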

In [20]:
import os
from pathlib import Path
from ruamel.yaml import YAML
from itertools import combinations

# Absolute path of the current working directory (the folder this notebook is run from)
abs_path = Path(os.path.abspath(''))

# PARAMETERS TO CREATE DBWISE YAML FILES  
# ----------------------------------------

# From where to load the csv files of "original data split"
# Note that this is the saving location for combined csv files too
csv_path = os.path.join(abs_path.parent.absolute(), 'data', 'split_csvs', 'dbwise_smoke')

# Where to save the training yaml files
train_yaml_save_path = os.path.join(abs_path.parent.absolute(), 'configs', 'training', 'train_dbwise_smoke')

# Where to save the testing yaml files
test_yaml_save_path = os.path.join(abs_path.parent.absolute(), 'configs', 'predicting', 'predict_dbwise_smoke')

# The files that are used for training, validation and test sets
data = ['PTB_PTBXL.csv', 'Shandong.csv', 'G12EC.csv', 'ChapmanShaoxing_Ningbo.csv', 'CPSC_CPSC-Extra.csv']

# Name for yaml files given as a string
# names will be formed as <name><index>.yaml
name = 'split_'

## <font color = teal> Combinations of different databases for training, validation and test sets </font>

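Yaml files are created based on the csv files listed above. There are two kinds of yaml files: training yaml files, which name the training and validation csv files, and testing yaml files, which name the test csv file. Each split uses one database for testing, one for validation and the remaining three for training. The listing below shows all the possible splits for the Physionet 2021 data, where `INCART.csv` takes the place that `Shandong.csv` has in the `data` list above.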

**CPSC_CPSC-Extra.csv as test data**

    1) train: PTB_PTBXL.csv, INCART.csv, G12EC.csv
       val: ChapmanShaoxing_Ningbo.csv

    2) train: PTB_PTBXL.csv, INCART.csv, ChapmanShaoxing_Ningbo.csv
       val: G12EC.csv

    3) train: PTB_PTBXL.csv, ChapmanShaoxing_Ningbo.csv, G12EC.csv
       val: INCART.csv

    4) train: INCART.csv, ChapmanShaoxing_Ningbo.csv, G12EC.csv
       val: PTB_PTBXL.csv

**ChapmanShaoxing_Ningbo.csv as test data**

    1) train: PTB_PTBXL.csv, INCART.csv, G12EC.csv
       val: CPSC_CPSC-Extra.csv

    2) train: PTB_PTBXL.csv, INCART.csv, CPSC_CPSC-Extra.csv
       val: G12EC.csv

    3) train: PTB_PTBXL.csv, CPSC_CPSC-Extra.csv, G12EC.csv
       val: INCART.csv

    4) train: INCART.csv, CPSC_CPSC-Extra.csv, G12EC.csv
       val: PTB_PTBXL.csv

**PTB_PTBXL.csv as test data**

    1) train: ChapmanShaoxing_Ningbo.csv, INCART.csv, G12EC.csv
       val: CPSC_CPSC-Extra.csv

    2) train: ChapmanShaoxing_Ningbo.csv, INCART.csv, CPSC_CPSC-Extra.csv
       val: G12EC.csv

    3) train: ChapmanShaoxing_Ningbo.csv, CPSC_CPSC-Extra.csv, G12EC.csv
       val: INCART.csv

    4) train: INCART.csv, CPSC_CPSC-Extra.csv, G12EC.csv
       val: ChapmanShaoxing_Ningbo.csv

**INCART.csv as test data**

    1) train: ChapmanShaoxing_Ningbo.csv, PTB_PTBXL.csv, G12EC.csv
       val: CPSC_CPSC-Extra.csv

    2) train: ChapmanShaoxing_Ningbo.csv, PTB_PTBXL.csv, CPSC_CPSC-Extra.csv
       val: G12EC.csv

    3) train: ChapmanShaoxing_Ningbo.csv, CPSC_CPSC-Extra.csv, G12EC.csv
       val: PTB_PTBXL.csv

    4) train: PTB_PTBXL.csv, CPSC_CPSC-Extra.csv, G12EC.csv
       val: ChapmanShaoxing_Ningbo.csv

**G12EC.csv as test data**

    1) train: ChapmanShaoxing_Ningbo.csv, PTB_PTBXL.csv, INCART.csv
       val: CPSC_CPSC-Extra.csv

    2) train: ChapmanShaoxing_Ningbo.csv, PTB_PTBXL.csv, CPSC_CPSC-Extra.csv
       val: INCART.csv

    3) train: ChapmanShaoxing_Ningbo.csv, CPSC_CPSC-Extra.csv, INCART.csv
       val: PTB_PTBXL.csv

    4) train: PTB_PTBXL.csv, CPSC_CPSC-Extra.csv, INCART.csv
       val: ChapmanShaoxing_Ningbo.csv

Let's find all the train-val-test combinations of the csv files listed in `data`.

In [21]:
tvt_combs = []

# Find all combinations of the specified data (= csv files of the databases)
# One file is left out for testing, so take combinations of size len(data) - 1
for train_val_set in combinations(data, len(data) - 1):
    test = next(file for file in data if file not in train_val_set)

    # One more file is left out for validation, so take combinations
    # of size len(train_val_set) - 1 from the remaining files
    for train_set in combinations(train_val_set, len(train_val_set) - 1):
        val = next(file for file in train_val_set if file not in train_set)
        tvt_combs.append([list(train_set), val, test])

print('The number of all the train-val-test combinations:', len(tvt_combs))
print(*tvt_combs, sep='\n')

The number of all the train-val-test combinations: 20
[['PTB_PTBXL.csv', 'Shandong.csv', 'G12EC.csv'], 'ChapmanShaoxing_Ningbo.csv', 'CPSC_CPSC-Extra.csv']
[['PTB_PTBXL.csv', 'Shandong.csv', 'ChapmanShaoxing_Ningbo.csv'], 'G12EC.csv', 'CPSC_CPSC-Extra.csv']
[['PTB_PTBXL.csv', 'G12EC.csv', 'ChapmanShaoxing_Ningbo.csv'], 'Shandong.csv', 'CPSC_CPSC-Extra.csv']
[['Shandong.csv', 'G12EC.csv', 'ChapmanShaoxing_Ningbo.csv'], 'PTB_PTBXL.csv', 'CPSC_CPSC-Extra.csv']
[['PTB_PTBXL.csv', 'Shandong.csv', 'G12EC.csv'], 'CPSC_CPSC-Extra.csv', 'ChapmanShaoxing_Ningbo.csv']
[['PTB_PTBXL.csv', 'Shandong.csv', 'CPSC_CPSC-Extra.csv'], 'G12EC.csv', 'ChapmanShaoxing_Ningbo.csv']
[['PTB_PTBXL.csv', 'G12EC.csv', 'CPSC_CPSC-Extra.csv'], 'Shandong.csv', 'ChapmanShaoxing_Ningbo.csv']
[['Shandong.csv', 'G12EC.csv', 'CPSC_CPSC-Extra.csv'], 'PTB_PTBXL.csv', 'ChapmanShaoxing_Ningbo.csv']
[['PTB_PTBXL.csv', 'Shandong.csv', 'ChapmanShaoxing_Ningbo.csv'], 'CPSC_CPSC-Extra.csv', 'G12EC.csv']
[['PTB_PTBXL.csv', 'Shandong

All the different training, validation and testing splits are now stored in the `tvt_combs` list as `[train_files, val_file, test_file]` entries.
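
The count of 20 follows from choosing one of the five files for testing and then one of the remaining four for validation (5 × 4 = 20). As a cross-check, the same splits can be enumerated directly. This is a minimal sketch with a hypothetical helper `enumerate_splits`; it is not the code the notebook uses.

```python
def enumerate_splits(files):
    """Return [train_files, val_file, test_file] for every test/val choice.

    NOTE: hypothetical helper for illustration only; the notebook itself
    builds the splits with itertools.combinations.
    """
    splits = []
    for test in files:
        for val in files:
            if val == test:
                continue
            # Everything that is neither test nor validation is training data
            train = [f for f in files if f not in (val, test)]
            splits.append([train, val, test])
    return splits


files = ['PTB_PTBXL.csv', 'Shandong.csv', 'G12EC.csv',
         'ChapmanShaoxing_Ningbo.csv', 'CPSC_CPSC-Extra.csv']
splits = enumerate_splits(files)
print(len(splits))  # 20
```

The splits come out in a different order than in the notebook's loop, so the indices would not match the yaml file names created later; the sketch is only meant for checking the count and the contents.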

## <font color = teal> Yaml files </font>

The csv files for different combinations are already made with the `create_data_csvs.py` script. All the yaml files for training will be saved to `/configs/training/train_dbwise_smoke/`. The yaml files for testing will be saved to `/configs/predicting/predict_dbwise_smoke/`.

In [22]:
# Create as many names as there are train-val-test splits
split_names = []
for i in range(len(tvt_combs)):
    split_names.append(name + str(i+1) + '.yaml')  
    
print(f'{len(split_names)} names created, for example {split_names[:3]}')

20 names created, for example ['split_1.yaml', 'split_2.yaml', 'split_3.yaml']


In [23]:
def save_yaml(yaml_str, yaml_path, i):
    ''' Save the given string as a yaml file in the given location.
    '''
    # Make the yaml directory (and any missing parent directories)
    os.makedirs(yaml_path, exist_ok=True)
    
    # Write the yaml file
    with open(os.path.join(yaml_path, split_names[i]), 'w') as yaml_file:
        yaml = YAML()
        code = yaml.load(yaml_str)
        yaml.dump(code, yaml_file)
    
        
def create_testing_yaml(test_csv, i):
    ''' Make a yaml file for prediction. The base of it is presented above.
    '''
    
    model_name = split_names[i].split('.')[0] + '.pth'
    yaml_str = '''\
# INITIAL SETTINGS
    test_file: {}
    model: {}
    
# TESTING SETTINGS
    threshold: {:f}

# DEVICE CONFIGS
    device_count: {}  
    '''.format(test_csv,
               model_name,
               threshold,
               device_count)
    
    yaml_path = test_yaml_save_path
    save_yaml(yaml_str, yaml_path, i)
    

def create_training_yaml(train_csv, val_csv, i):
    ''' Make a yaml file for training. The base of it is presented above.
    '''
    yaml_str = '''\
# INITIAL SETTINGS
    train_file: {}
    val_file: {}

# TRAINING SETTINGS
    batch_size: {}
    num_workers: {}
    epochs: {}
    lr: {:f}
    weight_decay: {:f}

# DEVICE CONFIGS
    device_count: {}   
    '''.format(train_csv,
               val_csv,
               batch_size,
               num_workers, 
               epochs,
               lr,
               weight_decay,
               device_count)
    
    yaml_path = train_yaml_save_path
    save_yaml(yaml_str, yaml_path, i)


# Make the yaml files
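# (the loop variable is named `comb` so that the `data` list above is not overwritten)
for i, comb in enumerate(tvt_combs):
    train, val, test = comb

    # Find the related train csv: database names sorted
    # case-insensitively and joined with underscores
    train_dbs = [os.path.splitext(db)[0] for db in train]
    train_csv = '_'.join(sorted(train_dbs, key=str.lower)) + '.csv'

    # Check that the combined train csv actually exists
    assert os.path.exists(os.path.join(csv_path, train_csv)), "Can't find the related train csv file."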
    
    print('Training data: `{}`, validation data: `{}`, test data: `{}`'.format(train_csv, val, test))
    
    create_training_yaml(train_csv, val, i)
    create_testing_yaml(test, i)
    print()

Training data: `G12EC_PTB_PTBXL_Shandong.csv`, validation data: `ChapmanShaoxing_Ningbo.csv`, test data: `CPSC_CPSC-Extra.csv`

Training data: `ChapmanShaoxing_Ningbo_PTB_PTBXL_Shandong.csv`, validation data: `G12EC.csv`, test data: `CPSC_CPSC-Extra.csv`

Training data: `ChapmanShaoxing_Ningbo_G12EC_PTB_PTBXL.csv`, validation data: `Shandong.csv`, test data: `CPSC_CPSC-Extra.csv`

Training data: `ChapmanShaoxing_Ningbo_G12EC_Shandong.csv`, validation data: `PTB_PTBXL.csv`, test data: `CPSC_CPSC-Extra.csv`

Training data: `G12EC_PTB_PTBXL_Shandong.csv`, validation data: `CPSC_CPSC-Extra.csv`, test data: `ChapmanShaoxing_Ningbo.csv`

Training data: `CPSC_CPSC-Extra_PTB_PTBXL_Shandong.csv`, validation data: `G12EC.csv`, test data: `ChapmanShaoxing_Ningbo.csv`

Training data: `CPSC_CPSC-Extra_G12EC_PTB_PTBXL.csv`, validation data: `Shandong.csv`, test data: `ChapmanShaoxing_Ningbo.csv`

Training data: `CPSC_CPSC-Extra_G12EC_Shandong.csv`, validation data: `PTB_PTBXL.csv`, test data: `Chapm

Now all the yaml files for training and testing are created! The training yaml files, which also name the validation csv files, are located in `/configs/training/train_dbwise_smoke/` and the testing yaml files in `/configs/predicting/predict_dbwise_smoke/`. The combined csv files for the ECGs are located in `/data/split_csvs/dbwise_smoke/`.
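
The training csv names printed above are formed by dropping the `.csv` extensions, sorting the database names case-insensitively and joining them with underscores. A minimal sketch of that naming rule, using a hypothetical helper `combined_csv_name` that is not part of the repository:

```python
import os


def combined_csv_name(train_files):
    """Build the combined csv name from the csv files of the training databases.

    NOTE: hypothetical helper for illustration only.
    """
    names = [os.path.splitext(f)[0] for f in train_files]
    return '_'.join(sorted(names, key=str.lower)) + '.csv'


print(combined_csv_name(['PTB_PTBXL.csv', 'Shandong.csv', 'G12EC.csv']))
# G12EC_PTB_PTBXL_Shandong.csv
```

Because the names are sorted, the order in which the training databases are listed does not affect the resulting file name.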

<font color=red>**NOTE 1!**</font> It is extremely important that the model in the testing yaml file has the same name as the yaml file the model was trained with. E.g. when a model is trained using `split_1.yaml`, it will be saved as `split_1.pth`, so the testing yaml file `split_1.yaml` must set `model: split_1.pth`. This makes using the repository much easier and simpler. Keep this in mind if you want to edit the code above.
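
The naming rule in NOTE 1 can be sketched with `pathlib`. The helper name `model_name_for` is hypothetical and only illustrates how the `.pth` name follows from the yaml file name.

```python
from pathlib import Path


def model_name_for(yaml_name):
    """Return the model file name that belongs to a training yaml file.

    NOTE: hypothetical helper for illustration only.
    """
    return Path(yaml_name).with_suffix('.pth').name


print(model_name_for('split_1.yaml'))  # split_1.pth
```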

<font color=red>**NOTE 2!**</font> You may wonder why the csv file names in the yaml files are not in single quotation marks. This is fine: YAML reads unquoted values like these as plain strings, so the scripts can load them without the marks.