# Yaml files of Database-wise Split for Training and Testing

With this notebook, you can create the yaml files needed for training and testing with a data split that is made **database-wise**. The so-called "original data split" must first be made with the script `create_data_split_csvs.py`. This ensures that there is already one csv file for each database. The combined csv files are created below. More detailed information is available in the notebook [Introduction for Data Handling](1_introduction_data_handling.ipynb).

------

In [2]:
import os
from pathlib import Path

# Absolute path of the current working directory (the folder this notebook is run from)
abs_path = Path(os.path.abspath(''))

# PARAMETERS TO CREATE DBWISE YAML FILES  
# ----------------------------------------

# From where to load the csv files of "original data split"
csv_path = os.path.join(abs_path.parent.absolute(), 'data', 'split_csvs', 'physionet_DBwise_smoke')
print(csv_path)

# Where to save combined training csv files
# e.g. we want to use PTB_PTBXL.csv, INCART.csv and G12EC.csv for training
#      so we need to combine them first for the model to use them
combined_save_path = os.path.join(abs_path.parent.absolute(), 'data', 'split_csvs', 'physionet_DBwise_smoke')

# Where to save the training yaml files
train_yaml_save_path = os.path.join(abs_path.parent.absolute(), 'configs', 'training', 'train_DBwise_smoke')

# Where to save the testing yaml files
test_yaml_save_path = os.path.join(abs_path.parent.absolute(), 'configs', 'predicting', 'prediction_DBwise_smoke')

# The files which need to be split into training and validation data
# We have 5 different databases so 5 different train/val split sets
training_data = [['PTB_PTBXL.csv', 'INCART.csv', 'G12EC.csv', 'ChapmanShaoxing_Ningbo.csv']]

# !! All the other splits -> just add to the list above if wanted
# ['PTB_PTBXL.csv', 'INCART.csv', 'G12EC.csv', 'CPSC_CPSC-Extra.csv'],
# ['PTB_PTBXL.csv', 'INCART.csv', 'CPSC_CPSC-Extra.csv', 'ChapmanShaoxing_Ningbo.csv'],
# ['PTB_PTBXL.csv', 'CPSC_CPSC-Extra.csv', 'G12EC.csv', 'ChapmanShaoxing_Ningbo.csv'],
# ['INCART.csv', 'CPSC_CPSC-Extra.csv', 'G12EC.csv','ChapmanShaoxing_Ningbo.csv']

# Name for yaml files given as a string
# names will be formed as <name><index>.yaml
name = 'split'

# --- Parameters for training yaml files -------------
batch_size = 10
num_workers = 0
epochs = 1

e:\git\12-lead-ecg-classifier\data\split_csvs\physionet_DBwise_smoke


First, the csv files, from which the yaml files are created, need to be found. They should be located in `/data/split_csvs/`.

In [3]:
# DB-wise CSV files (only the original ones)
csv_files = []
for file in os.listdir(csv_path):
    if not file.startswith('.'):
        # Original files have at most one underscore in their names; combined files have more
        if file.count('_') <= 1:
            csv_files.append(file)

print(*csv_files, sep='\n')

ChapmanShaoxing_Ningbo.csv
CPSC_CPSC-Extra.csv
G12EC.csv
INCART.csv
PTB_PTBXL.csv
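
The filter in the cell above relies on a naming heuristic: an original database-wise file has at most one underscore in its name, while a combined file of three databases has at least two. A minimal sketch of that rule is shown here; the helper name `is_original_csv` is hypothetical and not part of the repository. Note the limit of the heuristic: a combined file built from only two single-word databases, such as `INCART_G12EC.csv`, would have one underscore and would pass the filter.

```python
def is_original_csv(filename):
    """Heuristic: original files are not hidden and have at most one underscore."""
    return not filename.startswith('.') and filename.count('_') <= 1


names = [
    'PTB_PTBXL.csv',               # original, one underscore
    'G12EC.csv',                   # original, no underscore
    'PTB_PTBXL_INCART_G12EC.csv',  # combined, three underscores
    '.DS_Store',                   # hidden file, skipped
]
print([n for n in names if is_original_csv(n)])
# ['PTB_PTBXL.csv', 'G12EC.csv']
```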


## Combinations of Training and Validation Files

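The yaml files are created from the csv files listed above. Each split assigns the databases to training, validation and testing data: one database is held out for testing, one for validation, and the remaining three are used for training. All the possible splits are as follows: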

```
1) CPSC_CPSC-Extra.csv for test data

train: PTB_PTBXL.csv, INCART.csv, G12EC.csv
val: ChapmanShaoxing_Ningbo.csv
test: CPSC_CPSC-Extra.csv

train: PTB_PTBXL.csv, INCART.csv, ChapmanShaoxing_Ningbo.csv
val: G12EC.csv
test: CPSC_CPSC-Extra.csv

train: PTB_PTBXL.csv, ChapmanShaoxing_Ningbo.csv, G12EC.csv
val: INCART.csv
test: CPSC_CPSC-Extra.csv

train: INCART.csv, ChapmanShaoxing_Ningbo.csv, G12EC.csv
val: PTB_PTBXL.csv
test: CPSC_CPSC-Extra.csv

2) ChapmanShaoxing_Ningbo.csv for test data

train: PTB_PTBXL.csv, INCART.csv, G12EC.csv
val: CPSC_CPSC-Extra.csv
test: ChapmanShaoxing_Ningbo.csv

train: PTB_PTBXL.csv, INCART.csv, CPSC_CPSC-Extra.csv
val: G12EC.csv
test: ChapmanShaoxing_Ningbo.csv

train: PTB_PTBXL.csv, CPSC_CPSC-Extra.csv, G12EC.csv
val: INCART.csv
test: ChapmanShaoxing_Ningbo.csv

train: INCART.csv, CPSC_CPSC-Extra.csv, G12EC.csv
val: PTB_PTBXL.csv
test: ChapmanShaoxing_Ningbo.csv

3) PTB_PTBXL.csv for test data

train: ChapmanShaoxing_Ningbo.csv, INCART.csv, G12EC.csv
val: CPSC_CPSC-Extra.csv
test: PTB_PTBXL.csv

train: ChapmanShaoxing_Ningbo.csv, INCART.csv, CPSC_CPSC-Extra.csv
val: G12EC.csv
test: PTB_PTBXL.csv

train: ChapmanShaoxing_Ningbo.csv, CPSC_CPSC-Extra.csv, G12EC.csv
val: INCART.csv
test: PTB_PTBXL.csv

train: INCART.csv, CPSC_CPSC-Extra.csv, G12EC.csv
val: ChapmanShaoxing_Ningbo.csv
test: PTB_PTBXL.csv

4) INCART.csv for test data

train: ChapmanShaoxing_Ningbo.csv, PTB_PTBXL.csv, G12EC.csv
val: CPSC_CPSC-Extra.csv
test: INCART.csv

train: ChapmanShaoxing_Ningbo.csv, PTB_PTBXL.csv, CPSC_CPSC-Extra.csv
val: G12EC.csv
test: INCART.csv

train: ChapmanShaoxing_Ningbo.csv, CPSC_CPSC-Extra.csv, G12EC.csv
val: PTB_PTBXL.csv
test: INCART.csv

train: PTB_PTBXL.csv, CPSC_CPSC-Extra.csv, G12EC.csv
val: ChapmanShaoxing_Ningbo.csv
test: INCART.csv

5) G12EC.csv for test data

train: ChapmanShaoxing_Ningbo.csv, PTB_PTBXL.csv, INCART.csv
val: CPSC_CPSC-Extra.csv
test: G12EC.csv

train: ChapmanShaoxing_Ningbo.csv, PTB_PTBXL.csv, CPSC_CPSC-Extra.csv
val: INCART.csv
test: G12EC.csv

train: ChapmanShaoxing_Ningbo.csv, CPSC_CPSC-Extra.csv, INCART.csv
val: PTB_PTBXL.csv
test: G12EC.csv

train: PTB_PTBXL.csv, CPSC_CPSC-Extra.csv, INCART.csv
val: ChapmanShaoxing_Ningbo.csv
test: G12EC.csv

```
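
The listing above can also be generated instead of written by hand. The following is a minimal sketch, assuming the five database-wise csv file names shown earlier; the function name `enumerate_splits` is hypothetical and not part of the repository. Each database is chosen once as test data, each of the remaining four once as validation data, and the other three form the training data, which gives 5 × 4 = 20 splits.

```python
def enumerate_splits(files):
    """Return every (train, val, test) split with one test and one validation file."""
    splits = []
    for test in files:
        rest = [f for f in files if f != test]
        for val in rest:
            train = [f for f in rest if f != val]
            splits.append((train, val, test))
    return splits


databases = [
    'CPSC_CPSC-Extra.csv',
    'ChapmanShaoxing_Ningbo.csv',
    'PTB_PTBXL.csv',
    'INCART.csv',
    'G12EC.csv',
]

all_splits = enumerate_splits(databases)
print(len(all_splits))  # 20
for train, val, test in all_splits[:4]:
    print('train:', ', '.join(train), '| val:', val, '| test:', test)
```

The order of the generated splits differs from the listing, but the set of splits is the same.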


Let's make a function that finds the train/validation combinations for each list of files set in the `training_data` variable.

In [4]:
from itertools import combinations

def different_combinations(files):
    '''Every combination of the files for train/val split'''
    
    all_combs = []
    for combs in combinations(files, 4):
        for c in combinations(combs, 3):
            train_tmp = list(c)
            val_tmp = [file for file in combs if file not in c]
            train_val = [train_tmp, val_tmp]
            all_combs.append(train_val)
    
    return all_combs


combinations_data = []
for data in training_data:
    combs_tmp = different_combinations(data)
    combinations_data.append(combs_tmp)

# Find the test data file for each split so that it is included in neither training nor validation data
for data in combinations_data:
    for train_val_set in data:
        train_val_files = train_val_set[0] + train_val_set[1]
        # csv_files holds plain file names, so they can be compared directly
        test_file = [file for file in csv_files if file not in train_val_files]
        train_val_set.append(test_file)

# For example, the splits of the first training data list
print(*combinations_data[0], sep='\n')
print('The length of the first set:', len(combinations_data[0]))

[['PTB_PTBXL.csv', 'INCART.csv', 'G12EC.csv'], ['ChapmanShaoxing_Ningbo.csv'], ['CPSC_CPSC-Extra.csv']]
[['PTB_PTBXL.csv', 'INCART.csv', 'ChapmanShaoxing_Ningbo.csv'], ['G12EC.csv'], ['CPSC_CPSC-Extra.csv']]
[['PTB_PTBXL.csv', 'G12EC.csv', 'ChapmanShaoxing_Ningbo.csv'], ['INCART.csv'], ['CPSC_CPSC-Extra.csv']]
[['INCART.csv', 'G12EC.csv', 'ChapmanShaoxing_Ningbo.csv'], ['PTB_PTBXL.csv'], ['CPSC_CPSC-Extra.csv']]
The length of the first set: 4


All the different training, validation and testing splits are now stored in the `combinations_data` variable.
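
Each entry in `combinations_data[k]` is a three-item list of the form `[train_files, val_files, test_files]`. Before writing any files, it can be useful to check that no database appears in more than one role. The following is a minimal sketch of such a check; the helper name `check_split` is hypothetical, and the example values are copied from the output above.

```python
def check_split(split, all_files):
    """Raise AssertionError if train, val and test overlap or miss a database."""
    train, val, test = split
    used = train + val + test
    assert len(used) == len(set(used)), 'a database appears in more than one role'
    assert set(used) == set(all_files), 'some database is not used at all'


# Example values copied from the output above
all_files = ['ChapmanShaoxing_Ningbo.csv', 'CPSC_CPSC-Extra.csv',
             'G12EC.csv', 'INCART.csv', 'PTB_PTBXL.csv']
example_split = [['PTB_PTBXL.csv', 'INCART.csv', 'G12EC.csv'],
                 ['ChapmanShaoxing_Ningbo.csv'],
                 ['CPSC_CPSC-Extra.csv']]

check_split(example_split, all_files)
print('split is consistent')
```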

## Combined CSV files and Yaml files

Next, let's make the combined csv files. The training yaml files use two attributes, `train_file` and `val_file`, and each of them must point to a single csv file. That is, all ECGs for training should be listed in one csv file, all ECGs for validation in another, and all ECGs for testing in a third. All the yaml files for training will be saved to `/configs/training`. The yaml files for testing will be saved to `/configs/predicting`.
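
To illustrate what "combining" means here, the following is a small dependency-free sketch that uses the standard `csv` module instead of pandas, which the notebook itself uses. The helper names `combined_name` and `combine_rows`, as well as the columns `path` and `label`, are made up for the example and are not taken from the repository's csv files.

```python
import csv
import io


def combined_name(train_files):
    """Join the file stems with underscores, e.g. A.csv + B.csv -> A_B.csv."""
    stems = [name.rsplit('.', 1)[0] for name in train_files]
    return '_'.join(stems) + '.csv'


def combine_rows(csv_texts):
    """Concatenate csv contents that share the same header row."""
    header, rows = None, []
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        file_header = next(reader)
        if header is None:
            header = file_header
        elif file_header != header:
            raise ValueError('csv files have different columns')
        rows.extend(reader)
    return header, rows


# Made-up contents standing in for two database-wise csv files
a = 'path,label\nrec1.mat,0\nrec2.mat,1\n'
b = 'path,label\nrec3.mat,1\n'

print(combined_name(['PTB_PTBXL.csv', 'INCART.csv', 'G12EC.csv']))
header, rows = combine_rows([a, b])
print(header, len(rows))
```

The combined file simply stacks the rows of the individual files under one header, so its length is the sum of the lengths of its parts.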

The base of the training yaml files is as follows:

```
# INITIAL SETTINGS
train_file: PTB_PTBXL_INCART_G12EC.csv
val_file: ChapmanShaoxing_Ningbo.csv

# TRAINING SETTINGS
batch_size: 10
num_workers: 0

# SAVE, LOAD AND DISPLAY INFORMATION
epochs: 1

```

and the base of the testing yaml files is as follows:

```
# INITIAL SETTINGS
test_file: CPSC_CPSC-Extra.csv
model: split0.pth
```
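
As an illustration of how such a template can be filled in, the following is a minimal sketch that renders the training yaml text with `str.format` and `textwrap.dedent` only. It does not parse or validate the result as yaml. The function name `render_training_yaml` is hypothetical and not part of the repository.

```python
import textwrap


def render_training_yaml(train_csv, val_csv, batch_size, num_workers, epochs):
    """Fill the training template and return it as un-indented yaml text."""
    template = textwrap.dedent('''\
        # INITIAL SETTINGS
        train_file: {}
        val_file: {}

        # TRAINING SETTINGS
        batch_size: {}
        num_workers: {}

        # SAVE, LOAD AND DISPLAY INFORMATION
        epochs: {}
        ''')
    return template.format(train_csv, val_csv, batch_size, num_workers, epochs)


text = render_training_yaml('PTB_PTBXL_INCART_G12EC.csv',
                            'ChapmanShaoxing_Ningbo.csv', 10, 0, 1)
print(text)
```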

In [5]:
import pandas as pd
from ruamel.yaml import YAML
 
# Names for yaml files: one name per split, counted over all the lists in combinations_data
element_count = sum(len(elem) for elem in combinations_data)
split_names = [name + str(i) + '.yaml' for i in range(element_count)]
    
def save_yaml(yaml_str, yaml_path, i):
    ''' Save the given string as a yaml file in the given location.
    '''
    # Make the yaml directory (including any missing parent directories)
    os.makedirs(yaml_path, exist_ok=True)

    # Write the yaml file
    with open(os.path.join(yaml_path, split_names[i]), 'w') as yaml_file:
        yaml = YAML()
        code = yaml.load(yaml_str)
        yaml.dump(code, yaml_file)
    
        
def create_prediction_yaml(test_csv, i):
    ''' Make a yaml file for prediction. The base of it is presented above.
    '''
    
    model_name = split_names[i].split('.')[0] + '.pth'
    yaml_str = '''\
# INITIAL SETTINGS
    test_file: {}
    model: {}
    '''.format(test_csv, model_name)
    yaml_path = test_yaml_save_path
    save_yaml(yaml_str, yaml_path, i)
    

def create_training_yaml(train_csv, val_csv, i):
    ''' Make a yaml file for training. The base of it is presented above.
    '''
    yaml_str = '''\
# INITIAL SETTINGS
    train_file: {}
    val_file: {}

# TRAINING SETTINGS
    batch_size: {}
    num_workers: {}

# SAVE, LOAD AND DISPLAY INFORMATION
    epochs: {}
    '''.format(train_csv, val_csv,
              batch_size, num_workers, epochs)
    yaml_path = train_yaml_save_path
    save_yaml(yaml_str, yaml_path, i)
    
        
def combine_csv(files, i):
    '''Combine all files in the list of train csv files. Save the result as a csv file.
    ''' 

    # As we don't have csv files of combined databases for training, let's make them
    train_csv_name = [os.path.basename(file).split('.')[0] for file in files[0]]
    train_csv_name = '_'.join(train_csv_name) + '.csv'
    train_files = [os.path.join(csv_path, f) for f in files[0]]
    combined_train_csv = pd.concat([pd.read_csv(f) for f in train_files], ignore_index = True)
    print('Saving combined training data as', train_csv_name, 'with a length of {}.'.format(len(combined_train_csv)))

    save_path = combined_save_path
    # Make the save directory (including any missing parent directories)
    os.makedirs(save_path, exist_ok=True)
    
    # Saving a csv file
    combined_train_csv.to_csv(os.path.join(save_path, train_csv_name), sep=',', index=False)
    
    # Now we have the csv file for training data, e.g., PTB_PTBXL_INCART_G12EC.csv
    # Validation file is simply
    val_csv = ''.join(files[1])
    print('Validation data is from', val_csv)
    create_training_yaml(train_csv_name, val_csv, i)

    # Lastly, the prediction yaml files
    pred_csv = ''.join(files[2])
    print('Testing data is from', pred_csv)
    create_prediction_yaml(pred_csv, i)
    
split_i = 0  
for data in combinations_data:
    for train_val_set in data:
        combine_csv(train_val_set, split_i)
        split_i += 1
        print()

Saving combined training data as PTB_PTBXL_INCART_G12EC.csv with a length of 30.
Validation data is from ChapmanShaoxing_Ningbo.csv
Testing data is from CPSC_CPSC-Extra.csv

Saving combined training data as PTB_PTBXL_INCART_ChapmanShaoxing_Ningbo.csv with a length of 33.
Validation data is from G12EC.csv
Testing data is from CPSC_CPSC-Extra.csv

Saving combined training data as PTB_PTBXL_G12EC_ChapmanShaoxing_Ningbo.csv with a length of 34.
Validation data is from INCART.csv
Testing data is from CPSC_CPSC-Extra.csv

Saving combined training data as INCART_G12EC_ChapmanShaoxing_Ningbo.csv with a length of 29.
Validation data is from PTB_PTBXL.csv
Testing data is from CPSC_CPSC-Extra.csv



Now all the yaml files for training and testing are created! The training yaml files are located in `/configs/training/train_DBwise_smoke/` and the testing yaml files in `/configs/predicting/prediction_DBwise_smoke/`. The combined csv files for the ECGs are created in `/data/split_csvs/physionet_DBwise_smoke/`.

<font color=red>**NOTE 1!**</font> It is extremely important that the model named in a testing yaml file has the same name as the yaml file the model was trained with. E.g., when a model is trained using `split0.yaml`, it will be saved as `split0.pth`. This makes using the repository much easier and simpler. Keep this in mind if you want to edit the code above.
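
The naming rule can be expressed as a small helper. This is only a sketch; the function name `model_name_for` is hypothetical, and the notebook itself derives the name with `split('.')[0] + '.pth'`.

```python
from pathlib import Path


def model_name_for(yaml_name):
    """Return the model file name that matches a training yaml file name."""
    return Path(yaml_name).with_suffix('.pth').name


print(model_name_for('split0.yaml'))   # split0.pth
print(model_name_for('split12.yaml'))  # split12.pth
```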

<font color=red>**NOTE 2!**</font> You may wonder why the csv file names in the yaml files are not enclosed in single quotation marks. This is fine: the scripts are able to read and load the values from the yaml files even without them.