# <font color = teal> Set up the runs </font>

This notebook contains information about every step that you should take before starting to train or evaluate the models.

1)  [<b>Data acquisition</b>](#data_acq): how to download data, with a focus on the [PhysioNet 2021 dataset](#physionet) and the [Shandong Provincial Hospital database](#sph)

2)  [<b>Label mapping</b>](#map): how to handle the situation when the ECGs are labeled with different standards

3)  [<b>Data preprocessing</b>](#data_prep): how to preprocess data before training a model, if required

4)  [<b>Data splitting for training and evaluation</b>](#split): the idea of splitting data into CSV files for training, validation and testing, and how to perform it

5)  [<b>Creation of the configuration files</b>](#yamls): how to create YAML files to configure the runs

At the end of the notebook, an [example situation](#example) is presented to illustrate the steps mentioned.

--------

## <font color = teal> 1) Data acquisition: PhysioNet 2021 dataset and Shandong Provincial Hospital database</font> <a id='data_acq'></a>

### <font color = teal> 1.1) PhysioNet Challenge 2021 </font> <a id='physionet'></a>

The exploration of the dataset is available in the notebook [Exploration of the PhysioNet2021 data](exploration_physionet2021_data.ipynb).

You can obtain the PhysioNet 2021 data in `tar.gz` format using either of the following methods:

1) Download the data manually from [here](https://moody-challenge.physionet.org/2021/) under **Data Access**

2) Utilize the provided code within this notebook to get access to the data: <font color = red> (TBA) This is not valid anymore, `wget` path and data structure changed by PhysioNet </font>


In [2]:
# All imports
import os, re
import tarfile
from pathlib import Path
import pandas as pd

In [14]:
# Download the PhysioNet2021 data
!wget -r -N -c -np https://physionet.org/files/challenge-2021/1.0.3/training/ # NEW PATH

--2024-02-28 14:53:34--  https://physionet.org/files/challenge-2021/1.0.3/training/
Resolving physionet.org (physionet.org)... 18.18.42.54
Connecting to physionet.org (physionet.org)|18.18.42.54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘physionet.org/files/challenge-2021/1.0.3/training/index.html’

physionet.org/files     [ <=>                ]   1.15K  --.-KB/s    in 0s      

Last-modified header missing -- time-stamps turned off.
2024-02-28 14:53:35 (459 MB/s) - ‘physionet.org/files/challenge-2021/1.0.3/training/index.html’ saved [1177]

Loading robots.txt; please ignore errors.
--2024-02-28 14:53:35--  https://physionet.org/robots.txt
Reusing existing connection to physionet.org:443.
HTTP request sent, awaiting response... 200 OK

    The file is already fully retrieved; nothing to do.

--2024-02-28 14:53:35--  https://physionet.org/files/challenge-2021/1.0.3/training/chapman_shaoxing/
Reusing existing connection t

Once the `tar.gz` files are downloaded, they need to be extracted to the `data` directory located at the root of the repository. You can store the files in any structure you prefer. As an example, they could be organized under the `data` directory as follows:

- CPSC Database and CPSC-Extra Database
- St. Petersburg (INCART) Database
- PTB and PTB-XL Database
- The Georgia 12-lead ECG Challenge (G12EC) Database
- Chapman-Shaoxing and Ningbo Database

To begin, let's retrieve the names of the `tar.gz` files.

In [4]:
# All tar.gz files (in the current working directory)
curr_path = os.getcwd()
targz_files = [file for file in os.listdir(curr_path) if os.path.isfile(os.path.join(curr_path, file)) and file.endswith('tar.gz') and file.startswith('WFDB')]

# Let's sort the files
targz_files = sorted(targz_files)

for i, file in enumerate(targz_files):
    print(i, file)

0 WFDB_CPSC2018.tar.gz
1 WFDB_CPSC2018_2.tar.gz
2 WFDB_ChapmanShaoxing.tar.gz
3 WFDB_Ga.tar.gz
4 WFDB_Ningbo.tar.gz
5 WFDB_PTB.tar.gz
6 WFDB_PTBXL.tar.gz
7 WFDB_StPetersburg.tar.gz


To follow the structure outlined above, the listed `tar.gz` files will be extracted as follows:

* WFDB_CPSC2018.tar.gz + WFDB_CPSC2018_2.tar.gz
* WFDB_StPetersburg.tar.gz
* WFDB_PTB.tar.gz + WFDB_PTBXL.tar.gz
* WFDB_Ga.tar.gz
* WFDB_ChapmanShaoxing.tar.gz + WFDB_Ningbo.tar.gz

Let's create a subdirectory named `physionet_data` for the files.

In [5]:
# Let's make the split as tuples of tar.gz files
# NB! The indices below assume the SORTED order printed above, so sorting is essential for this split
tar_split = [(targz_files[0], targz_files[1]),
             (targz_files[7], ),
             (targz_files[5], targz_files[6]),
             (targz_files[3], ),
             (targz_files[2], targz_files[4])]

print(*tar_split, sep="\n")

('WFDB_CPSC2018.tar.gz', 'WFDB_CPSC2018_2.tar.gz')
('WFDB_StPetersburg.tar.gz',)
('WFDB_PTB.tar.gz', 'WFDB_PTBXL.tar.gz')
('WFDB_Ga.tar.gz',)
('WFDB_ChapmanShaoxing.tar.gz', 'WFDB_Ningbo.tar.gz')
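The index-based tuples above depend on the sorted order of the file list. As a hedged alternative, the same grouping could be expressed by file name, so that it no longer depends on list positions. The helper `build_tar_split` and the `TAR_GROUPS` mapping below are illustrative names and are not part of the repository.

```python
# Hypothetical name-based grouping; the keys match the `dir_names` used later in this notebook
TAR_GROUPS = {
    'CPSC_CPSC-Extra': ('WFDB_CPSC2018.tar.gz', 'WFDB_CPSC2018_2.tar.gz'),
    'INCART': ('WFDB_StPetersburg.tar.gz',),
    'PTB_PTBXL': ('WFDB_PTB.tar.gz', 'WFDB_PTBXL.tar.gz'),
    'G12EC': ('WFDB_Ga.tar.gz',),
    'ChapmanShaoxing_Ningbo': ('WFDB_ChapmanShaoxing.tar.gz', 'WFDB_Ningbo.tar.gz'),
}

def build_tar_split(available_files, groups=TAR_GROUPS):
    """Return (directory, tars) pairs, failing early if an expected archive is missing."""
    available = set(available_files)
    split = []
    for directory, tars in groups.items():
        missing = [t for t in tars if t not in available]
        if missing:
            raise FileNotFoundError(
                'Missing archive(s) for {}: {}'.format(directory, missing))
        split.append((directory, tars))
    return split
```

Calling `build_tar_split(targz_files)` would then give directory and archive pairs that can be looped over directly, without a separate `dir_names` list.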


In [9]:
# Function to extract files from a given tar to a given directory
# Subdirectories inside the tar are flattened: all files are written directly to the given directory
def extract_files(tar, directory):

    n_files = 0
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tar, 'r') as file:
        for member in file.getmembers():
            if member.isreg(): # Skip anything that is not a regular file
                member.name = os.path.basename(member.name) # Drop the internal path
                file.extract(member, directory)
                n_files += 1

    # Report the target relative to the 'data' directory when possible
    match = re.search('data.*', directory)
    shown_dir = './' + match[0] if match else directory
    print('- {} files extracted to {}'.format(n_files, shown_dir))

In [10]:
# Absolute path of this file
abs_path = Path(os.path.abspath(''))

# Path to the physionet_data directory, i.e., save the dataset here
data_path = os.path.join(abs_path.parent.absolute(), 'data', 'physionet_data')

if not os.path.exists(data_path):
    os.makedirs(data_path)

# Directories into which the data is extracted
# NB! Must be the same length as 'tar_split'
dir_names = ['CPSC_CPSC-Extra', 'INCART', 'PTB_PTBXL', 'G12EC', 'ChapmanShaoxing_Ningbo']

# Extracting right files to right subdirectories
for tar, directory in zip(tar_split, dir_names):
    
    print('Extracting tar.gz file(s) {} to the {} directory'.format(tar, directory))
    
    # Saving path for the specific files
    save_tmp = os.path.join(data_path, directory)
    # Preparing the directory
    if not os.path.exists(save_tmp):
        os.makedirs(save_tmp)
        
    if len(tar) > 1: # More than one database in tuple
        for one_tar in tar:
            extract_files(one_tar, save_tmp)
    else: # Only one database in tuple
        extract_files(tar[0], save_tmp)
        
print('Done!')

Extracting tar.gz file(s) ('WFDB_CPSC2018.tar.gz', 'WFDB_CPSC2018_2.tar.gz') to the CPSC_CPSC-Extra directory


ReadError: file could not be opened successfully:
- method gz: ReadError('empty file')
- method bz2: ReadError('not a bzip2 file')
- method xz: ReadError('not an lzma file')
- method tar: ReadError('empty file')
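The `ReadError('empty file')` above indicates that an archive in the working directory has no content, for example after a failed download. A minimal sketch of a pre-check that could be run before extraction is shown below. The function name `check_archive` is hypothetical; it relies only on the standard-library calls `os.path.getsize` and `tarfile.is_tarfile`.

```python
import os
import tarfile

def check_archive(path):
    """Return a short status string for a tar archive (sketch, not repository code)."""
    if not os.path.isfile(path):
        return 'missing'
    if os.path.getsize(path) == 0:
        return 'empty'              # e.g. an interrupted or failed download
    if not tarfile.is_tarfile(path):
        return 'not a tar archive'
    return 'ok'
```

Running this over every file in `targz_files` and printing the status makes it easier to spot which archive needs to be downloaded again.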

A total of **176 506** files (based on the data exploration presented earlier) should now be located in the `physionet_data` directory. One ECG recording consists of a binary MATLAB v4 file and a text file in header format. To verify this number, you can count the files as follows:

In [None]:
total_files = 0
for root, dirs, files in os.walk(data_path):
    total_files += len(files)
    
print('Total of {} files'.format(total_files))

Total of 176506 files
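Because each recording consists of a `.mat` file and a `.hea` file, a matching total does not by itself guarantee that every recording is complete. The following sketch lists recordings for which one of the two files is missing. The function name `find_unpaired` is an assumption made for illustration and is not part of the repository.

```python
import os

def find_unpaired(data_path):
    """Return recordings that lack either the .mat signal file or the .hea header file."""
    mats, heas = set(), set()
    for root, _dirs, files in os.walk(data_path):
        for name in files:
            stem, ext = os.path.splitext(os.path.join(root, name))
            if ext == '.mat':
                mats.add(stem)
            elif ext == '.hea':
                heas.add(stem)
    # Symmetric difference: stems that appear in only one of the two sets
    return sorted(mats ^ heas)
```

An empty list returned by `find_unpaired(data_path)` means that every signal file has a header file and vice versa.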


### <font color = teal> 1.2) Shandong Provincial Hospital and other sources </font> <a id='sph'></a>

New data can be downloaded and utilized with this repository, provided that the following guidelines are followed:

1\) ECG data can be in either `MATLAB v4` (.mat) or `h5` format. When setting up training or testing, ECGs are loaded into `torch.utils.data.Dataset` using the following function from the [`dataset_utils.py`](../src/dataloader/dataset_utils.py) script:

<br>

```python
def load_data(case):
    ''' Load a MATLAB v4 file or a H5 file of an ECG recording
    '''

    if case.endswith('.mat'):
        x = loadmat(case)
        return np.asarray(x['val'], dtype=np.float64)
    else:
        with h5py.File(case) as f:
            x = f['ecg'][()]
        return np.asarray(x, dtype=np.float64)
```

<br>

In other words, the signal must be stored under the `val` key in a MATLAB file or under the `ecg` key in an `h5` file.

2\) Demographic data, including diagnoses, age and gender, are loaded either from files in the `WFDB header format` (.hea) or from CSV files. These files also contain other essential information about the ECGs, such as the sampling frequency. They are required when generating CSV files using the [`create_data_csvs.py`](../create_data_csvs.py) script. Header files have a structure similar to the one shown below:

<br>

```text
JS00001 12 500 5000 23-Mar-2021 20:20:47
JS00001.mat 16+24 1000/mV 16 0 -254 21756 0 I
JS00001.mat 16+24 1000/mV 16 0 264 -599 0 II
JS00001.mat 16+24 1000/mV 16 0 517 -22376 0 III
JS00001.mat 16+24 1000/mV 16 0 -5 28232 0 aVR
JS00001.mat 16+24 1000/mV 16 0 -386 16619 0 aVL
JS00001.mat 16+24 1000/mV 16 0 390 15121 0 aVF
JS00001.mat 16+24 1000/mV 16 0 -98 1568 0 V1
JS00001.mat 16+24 1000/mV 16 0 -312 -32761 0 V2
JS00001.mat 16+24 1000/mV 16 0 -98 32715 0 V3
JS00001.mat 16+24 1000/mV 16 0 810 15193 0 V4
JS00001.mat 16+24 1000/mV 16 0 810 14081 0 V5
JS00001.mat 16+24 1000/mV 16 0 527 32579 0 V6
#Age: 85
#Sex: Male
#Dx: 164889003,59118001,164934002
#Rx: Unknown
#Hx: Unknown
#Sx: Unknown
```

<br>

The third value in the first row is the sampling frequency, and the age, gender and diagnoses are read from lines 14-16. The 12 lines after the first one correspond to the 12 leads of the ECG recording. `Rx` refers to the medical prescription, `Hx` to the medical history and `Sx` to symptoms or surgery.
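To make the layout concrete, here is a minimal sketch of how such a header could be parsed. It is not the parser used by the repository: the function name `parse_header` and the returned dictionary keys are illustrative, and the sketch assumes an integer sampling frequency, as in the example above.

```python
def parse_header(text):
    """Extract the sampling frequency and '#Key: value' fields from WFDB-style header text."""
    lines = text.strip().splitlines()
    first = lines[0].split()
    info = {
        'record': first[0],
        'n_leads': int(first[1]),
        'fs': int(first[2]),          # third value of the first row
        'n_samples': int(first[3]),
    }
    for line in lines[1:]:
        # Comment lines such as '#Age: 85' carry the demographic data
        if line.startswith('#') and ':' in line:
            key, value = line[1:].split(':', 1)
            info[key.strip()] = value.strip()
    # Diagnoses are given as comma-separated codes on the '#Dx' line
    info['Dx'] = info['Dx'].split(',') if info.get('Dx') else []
    return info
```

All values other than the numeric fields of the first row are kept as strings here, so an age of `85` is returned as `'85'`.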

For the Shandong Provincial Hospital dataset, the demographic metadata is stored as a CSV file. The CSV file has a structure shown below:

<br>

ECG_ID | AHA_Code |Patient_ID|Age|Sex|N|Date
-------|----------|----------|---|---|-|---
A00001|22;23|S00001|55|M|5000|2020-03-04
A00002|1|S00002|32|M|6000|2019-09-03
A00003|1|S00003|63|M|6500|2020-07-16
A00004|23|S00004|31|M|5000|2020-07-14
...|...|...|...|...|...|...

<br>
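As a rough illustration of working with this layout, the sketch below reads such metadata with the standard `csv` module. It assumes a comma-delimited file and that several AHA statements are separated by `;`, as the `22;23` entry suggests. The function name `read_sph_metadata` is hypothetical.

```python
import csv
import io

# Inline sample mirroring the table above; a real file would be opened with open(path, newline='')
sample = """ECG_ID,AHA_Code,Patient_ID,Age,Sex,N,Date
A00001,22;23,S00001,55,M,5000,2020-03-04
A00002,1,S00002,32,M,6000,2019-09-03
"""

def read_sph_metadata(handle):
    """Yield one dict per ECG, with the AHA statements split into a list."""
    for row in csv.DictReader(handle):
        row['AHA_Code'] = row['AHA_Code'].split(';')
        yield row

rows = list(read_sph_metadata(io.StringIO(sample)))
```

Each element of `rows` is a dictionary keyed by the column names, with `AHA_Code` turned into a list of strings.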

Note that whether the metadata is in a CSV file or in a header file, <font color = red><i>all metadata files should be located in the same directory as the corresponding ECG files</i></font>. Also note that <font color = red>the CSV files and header files should <b>always</b> contain identically named columns, such as `Age`, `Sex` and `Gender`</font>.

3\) It is highly recommended to download data directly into the `data` directory. Several predefined paths within the repository are configured to point to this directory, particularly when creating CSV files or YAML files for configuring training and testing.

Additionally, the code provided above can be used for extracting `tar.gz` files. The `extract_files(tar, directory)` function is a general-purpose tool for this task, where `tar` is the `tar.gz` file to be extracted and `directory` is the absolute path to which its files are extracted (passed as `save_path` in the snippet below). For example, the following code snippet demonstrates its usage:

In [17]:
## Other sources
## -------------
'''
# Absolute path of this file
abs_path = Path(os.path.abspath(''))

# The name of the tar gz file (located in the current directory)
tar = 'records.tar.gz'
save_path = os.path.join(abs_path.parent.absolute(), 'data', 'Shandong')
#extract_files(tar, save_path)

# If needed, the samples can be renamed
samples = sorted(os.listdir(save_path))  # get the current names of the samples
path_samples = sorted([os.path.join(save_path, s) for s in samples]) # add path to the current names
new_names = [name.replace('A', 'SPH', 1) for name in samples] # rename only the beginning of the file (e.g., A0001.h5 to SPH0001.h5)
path_new_names = sorted([os.path.join(save_path, nn) for nn in new_names]) # add path to the new names

# Rename samples
for old, new in zip(path_samples, path_new_names):
    os.rename(old, new)

# If the metadata CSV file also contains the IDs, update them to match the new names
csv_file = pd.read_csv('metadata.csv')
csv_file['ECG_ID'] = [s.replace('.h5', '') for s in new_names]

# Be also sure that we have a sample frequency in it
csv_file['fs'] = 500
csv_file.to_csv(os.path.join(save_path, 'metadata.csv'), index=False)
'''


----------------

## <font color = teal>2) Label mapping </font> <a id='map'></a>

The primary diagnostic code system utilized in this repository is SNOMED CT Codes.

Given that ECGs can be labeled with various code systems, we provide the [`label_mapping.py`](../label_mapping.py) script to facilitate the conversion of non-SNOMED CT Codes. Note that the script assumes the metadata for a specific dataset is available in a CSV file. The core concept of this script is to map the labels extracted from the metadata file using the [`AHA_SNOMED_mapping.csv`](../data/AHA_SNOMED_mapping.csv) file (located in the `data` directory) and then add the corresponding SNOMED CT Code to an additional column named `SNOMEDCTCodes`. The remaining content of the metadata file remains unaltered.

The `AHA_SNOMED_mapping.csv` file contains diagnostic statements conforming to the AHA standard along with their corresponding SNOMED CT Codes in the following format:

<br>

Dx|SNOMEDCTCode|AHA_Code
------|---------|-------------
1st degree av block|270492004|82
prolonged pr interval|164947007|82
atrial fibrillation|164889003|50
atrial fibrillation|164889003|50+346
atrial fibrillation|164889003|50+347
atrial flutter|164890007|51
incomplete right bundle branch block|713426002|105
... | ... | ...

<br>

New codes can be added to this file without issue, as long as the structure of the CSV file is maintained. The `Dx` column, while not used for mapping, serves as a helpful reference for diagnoses. A new CSV file for the updated metadata will be saved at the same location from which the original metadata file was loaded. This needs to be considered, as <i>there can be only one CSV file per data folder</i>. Thus, either remove or move the metadata CSV file that is not used in the later phases.
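The mapping idea can be sketched as a simple lookup built from the mapping file. The code below is an illustration under stated assumptions, not the logic of `label_mapping.py`: the function names `build_mapping` and `map_codes` are hypothetical, an AHA code that appears on several rows is mapped to all listed SNOMED CT Codes, and codes without a mapping are skipped.

```python
import csv
import io

# Inline excerpt in the format of AHA_SNOMED_mapping.csv shown above
mapping_csv = """Dx,SNOMEDCTCode,AHA_Code
1st degree av block,270492004,82
prolonged pr interval,164947007,82
atrial fibrillation,164889003,50
atrial flutter,164890007,51
"""

def build_mapping(handle):
    """Map each AHA code to the list of SNOMED CT Codes listed for it."""
    mapping = {}
    for row in csv.DictReader(handle):
        codes = mapping.setdefault(row['AHA_Code'], [])
        if row['SNOMEDCTCode'] not in codes:
            codes.append(row['SNOMEDCTCode'])
    return mapping

def map_codes(aha_codes, mapping):
    """Translate AHA codes; codes without a mapping are skipped in this sketch."""
    snomed = []
    for code in aha_codes:
        for target in mapping.get(code, []):
            if target not in snomed:
                snomed.append(target)
    return snomed

mapping = build_mapping(io.StringIO(mapping_csv))
```

With this excerpt, `map_codes(['82'], mapping)` returns both codes listed for AHA statement 82. How the actual script treats such one-to-many rows and unmapped codes should be checked in its source.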

By default, the [`label_mapping.py`](../label_mapping.py) script is set to convert AHA statements into SNOMED CT Codes. There are attributes that need to be considered before running the script:

1) There are two paths for CSV files: `csv_path` refers to the path of the metadata file for the Shandong Provincial Hospital dataset. `csv_save_path` refers to the location and the filename under which the updated metadata file will be stored. By default, it is set to point to the metadata file of the Shandong Provincial Hospital data in the `smoke_data` directory, i.e., `./data/smoke_data/SPH/metadata.csv`.

2) The `map_path` attribute refers to the location from where the file containing the mapping between AHA and SNOMED standards is found. By default, it is set to use the provided mapping csv file in the `data` directory, i.e., `./data/AHA_SNOMED_mapping.csv`.

3) The `from_code` attribute refers to the diagnostic standard that the labels are converted from. By default, it is set to AHA statements, i.e., these will be converted to SNOMED CT Codes.

4) The `to_code` attribute refers to the diagnostic standard that the labels are converted into. By default, it is set to SNOMED CT Codes, i.e., this is the standard into which the AHA statements are converted.

5) The `imputation` attribute defines whether a sinus rhythm label needs to be imputed for the Shandong Provincial Hospital data. The imputation is recommended if the sinus rhythm label will be used in classification. The basic idea behind the imputation is to fit a Logistic Regression model with PhysioNet 2021 data (which contains the sinus rhythm label); this model is then used to predict the sinus rhythm label for the Shandong Provincial Hospital dataset. By default, this is set to `True`, i.e., the sinus rhythm labels are imputed for the Shandong Provincial Hospital dataset.

6) The `input_dir` attribute refers to the path from where the data is loaded to fit the Logistic Regression model. By default, this is set to the location of the PhysioNet 2021 data, i.e., it is used to fit the Logistic Regression model.

### <font color = teal> Terminal command </font>

To use the script, simply run the following command:

```
python label_mapping.py
```


----

## <font color = teal> 3) Preprocessing data (optional) </font> <a id='data_prep'></a>

You can preprocess all the data using various transformations with the [`preprocess_data.py`](../preprocess_data.py) script. There are two critical attributes to consider:

<br>

```python
# Original data location
from_directory = os.path.join(os.getcwd(), 'data', 'smoke_data')

# New location for preprocessed data
new_directory = os.path.join(os.getcwd(), 'data', 'physionet_smoke_data')
```

<br>

The `from_directory` attribute points to the directory from which the data in its original format is loaded, such as the downloaded PhysioNet Challenge 2021 data. The `new_directory` attribute indicates the new location where the preprocessed ECGs will be saved. It's worth noting that <i>when saving the preprocessed ECGs, the metadata files (e.g., a CSV file or each corresponding .hea file) must also be copied to that location</i>. The script takes care of this task as well.

By default, two transforms are applied:

<br>

```python
# ------------------------------
# --- PREPROCESS TRANSFORMS ----
new_fs = 250

# - BandPass filter 
bpf = BandPassFilter(fs = ecg_fs)
ecg = bpf(ecg)

# - Linear interpolation
linear_interp = Linear_interpolation(fs_new = new_fs, fs_old = ecg_fs)
ecg = linear_interp(ecg)

# ------------------------------
# ------------------------------
```

<br>

It's important to note that the preprocessing step **is not mandatory for the repository to function**. However, if you plan to use transforms, like the two mentioned above, during the training phase, it's advisable to preprocess the data beforehand using the provided script. This can significantly improve training efficiency.
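To illustrate what resampling by linear interpolation does to a single lead, here is a dependency-free sketch. It is not the repository's `Linear_interpolation` class, whose implementation may differ; the function name `resample_linear` and the choice to map the first and last samples onto each other are assumptions made for this example.

```python
def resample_linear(signal, fs_old, fs_new):
    """Resample a 1-D sequence from fs_old to fs_new using linear interpolation."""
    n_old = len(signal)
    n_new = int(n_old * fs_new / fs_old)
    if n_old < 2 or n_new < 2:
        return list(signal)
    resampled = []
    for i in range(n_new):
        # Position of the new sample on the old sample grid
        pos = i * (n_old - 1) / (n_new - 1)
        lo = int(pos)
        hi = min(lo + 1, n_old - 1)
        frac = pos - lo
        resampled.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return resampled

lead = [float(v) for v in range(10)]      # 10 samples at 500 Hz
lead_250 = resample_linear(lead, fs_old=500, fs_new=250)
```

Halving the sampling frequency halves the number of samples, which is one reason why preprocessing once beforehand is cheaper than repeating the transform for every batch during training.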

Other transformations are defined in the [`dataset.py`](../src/dataloader/dataset.py) script located in `src/dataloader/`, which is executed during training. Additionally, several transformations can be found in the [`transforms.py`](../src/dataloader/transforms.py) script in the same directory, including the previously mentioned `Linear_interpolation` and `BandPassFilter`.

### <font color = teal> Terminal command </font>

To use the script, simply run the following command:

```
python preprocess_data.py
```

<font color = red>**NOTE!** The preprocessed ECGs will have different names compared to the original ones, so it's important to keep track of whether the preprocessing step has been completed or not</font>

--------

## <font color = teal> 4) Splitting data into CSV files </font> <a id='split'></a>

The entire data splitting process is managed by the [`create_data_csvs.py`](../create_data_csvs.py) script. The primary objective of this script is to divide the data into CSV files, which can later be utilized for both training and testing purposes.

The CSV files possess the following columns: `path` (the path to a specific ECG recording), `age`, `gender`, and all the diagnoses represented in SNOMED CT codes, which are employed as labels for classification. A value of 1 indicates that the patient has a particular disease. The structure of these CSV files is as follows:

<br>

| path  | age  | gender  | 10370003  | 111975006 | 164890007 | *other diagnoses...* |
| ------------- |-------------|-------------| ------------- |-------------|-------------|-------------|
| ./data/A0002.mat | 49.0 | Female | 0 | 0 | 1 | ... |
| ./data/A0003.mat | 81.0 | Female | 0 | 1 | 1 | ... |
| ./data/A0004.mat | 45.0 |  Male  | 1 | 0 | 0 | ... |
| ... | ... |  ...  | ... | ... | ... | ... |

<br>
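The label columns are binary indicators derived from the diagnoses of each recording. A minimal sketch of building one such row is given below. The function name `make_row` is hypothetical and the snippet only illustrates the row layout, not how `create_data_csvs.py` is implemented.

```python
def make_row(path, age, gender, dx_codes, labels):
    """Build one CSV row: metadata plus a 0/1 indicator per label."""
    row = {'path': path, 'age': age, 'gender': gender}
    dx = set(dx_codes)
    for label in labels:
        row[label] = 1 if label in dx else 0
    return row

labels = ['10370003', '111975006', '164890007']
row = make_row('./data/A0002.mat', 49.0, 'Female', ['164890007'], labels)
```

Diagnoses that are not in `labels` are simply ignored, so a recording whose diagnoses are all outside the label list gets a row of zeros.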

Before running the script, several attributes must be carefully configured within the main block:

1) The `stratified` attribute is used to specify the type of data split. If set to `True`, the script performs a stratified data split; if set to `False`, a database-wise split is executed.

2) The `data_dir` attribute should be set to point to the correct data directory from which the data is loaded. By default, it's configured to load data from the `smoke_data` directory, which is a subdirectory of the data directory.

3) The `csv_dir` attribute should be set to specify the desired folder where the generated CSV files will be saved. This directory will be created under the `split_csvs` directory, which is a subdirectory of the `data` directory. By default, CSV files are saved in the `smoke_stratified_shuffle` folder.

4) The class labels must be set with the `labels` attribute in the script. By default, the labels are set as a list containing `426783006`, `426177001`, `427084000`, `164890007`, `164889003`, `427393009`, `164947007` and `270492004`.

5) The `cv_type` attribute defines which kind of cross validation is used: ShuffleSplit or K-Fold. By default, it's set to `shufflesplit`.

6) If K-Fold cross validation is selected, the `cv_k` attribute defines the k value. By default, the k value is set to `5`.

The data can be split in two different ways:

<font color = forestgreen><b>Database-wise</b></font>. Above, the data was extracted into the following structure:

   * CPSC Database and CPSC-Extra Database
   * St. Petersburg (INCART) Database
   * PTB and PTB-XL Database
   * The Georgia 12-lead ECG Challenge (G12EC) Database
   * Chapman-Shaoxing and Ningbo Database
   
 The `dbwise_csvs(data_directory, save_directory, labels)` function leverages this structure to create CSV files. The `data_directory` parameter indicates the location of the data (note that subdirectories are considered as different databases), `save_directory` specifies where the CSV files will be saved, and `labels` lists the SNOMED CT coded labels used for classification. CSV files are named according to the directories from which they were created, e.g., a CSV file for CPSC Database and CPSC-Extra Database is named `CPSC_CPSC-Extra.csv`.

 Since models read only one CSV file to obtain the paths of the ECGs during training and testing, there may be a need to combine multiple databases into a single CSV file. For instance, if CPSC-Extra, CPSC, G12EC, PTB, and PTB-XL are used for training, these combinations of different databases can be created in the script. However, the split relies on one assumption: of the databases (subdirectories) found in `data_directory`, <i>one is considered as a test set, one as a validation set, and all the others as a training set</i>.
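The assumption of one test set, one validation set and the rest as training data can be illustrated by enumerating the possible assignments. This is a sketch only: the function name `dbwise_assignments` is hypothetical, and whether the script enumerates every assignment, or in which order it joins the database names, is not confirmed here.

```python
def dbwise_assignments(databases):
    """Enumerate (train, val, test) assignments: one test, one validation, the rest training."""
    assignments = []
    for test_db in databases:
        for val_db in databases:
            if val_db == test_db:
                continue
            train_dbs = [db for db in databases if db not in (test_db, val_db)]
            assignments.append((train_dbs, val_db, test_db))
    return assignments

databases = ['CPSC_CPSC-Extra', 'INCART', 'PTB_PTBXL', 'G12EC', 'ChapmanShaoxing_Ningbo']
for train_dbs, val_db, test_db in dbwise_assignments(databases)[:2]:
    # The combined training CSV is named after the databases it contains
    print('_'.join(train_dbs) + '.csv', '| val:', val_db, '| test:', test_db)
```

With five databases, each assignment leaves three databases for training, which is why the combined CSV files are named after three databases.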

<font color = forestgreen><b>Stratified</b></font>. The `stratified_csvs(data_directory, save_directory, labels, train_test_splits)` function performs a stratified split. The parameters are similar to those in the `dbwise_csvs` function, but there's an additional parameter, `train_test_splits`, which is a dictionary specifying the train-test splits. The dictionary is structured as a collection of dictionaries, where the inner dictionaries refer to specific train-test splits. For example, there is one train-test split set in the `train_test_splits` dictionary by default, as follows:

<br>

   ```python
   train_test_splits = {
   'split_1': {    
         'train': ['G12EC', 'SPH', 'PTB_PTBXL', 'ChapmanShaoxing_Ningbo'],
         'test': ['CPSC_CPSC-Extra']
      }
   }
   ```
<br>

In this example, `split_1` is simply a name for this particular split, and it contains the keys `train` and `test` to specify which databases are considered as training data and which ones as test data. Training data is further divided into training and validation sets. Names (e.g., `split_1`) are used to identify the CSV files. Note that <i>the values of the `train` and `test` keys must be lists, even if only one database is defined</i>!
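Since a plain string in place of a list is an easy mistake to make, a small check of the dictionary before running the script can help. The following validator is a hedged sketch: the function name `validate_splits` is hypothetical, and the overlap check between training and test databases is an extra assumption rather than a rule stated by the script.

```python
def validate_splits(train_test_splits):
    """Raise ValueError if a split lacks 'train'/'test' or uses non-list values."""
    for name, split in train_test_splits.items():
        for key in ('train', 'test'):
            if key not in split:
                raise ValueError("{}: missing '{}' key".format(name, key))
            if not isinstance(split[key], list):
                raise ValueError("{}: '{}' must be a list".format(name, key))
        # Assumption: a database should not be used for both training and testing
        overlap = set(split['train']) & set(split['test'])
        if overlap:
            raise ValueError(
                '{}: databases in both train and test: {}'.format(name, sorted(overlap)))

validate_splits({
    'split_1': {
        'train': ['G12EC', 'SPH', 'PTB_PTBXL', 'ChapmanShaoxing_Ningbo'],
        'test': ['CPSC_CPSC-Extra'],
    }
})
```

The call above passes silently, whereas `'test': 'CPSC_CPSC-Extra'` (a string instead of a list) would raise a `ValueError`.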

Stratification itself is performed using one of the two multilabel cross-validators from the `iterative-stratification` package: `MultilabelStratifiedShuffleSplit(n_splits, test_size, train_size, random_state)` or `MultilabelStratifiedKFold(n_splits, shuffle, random_state)`. In the case of ShuffleSplit, the script sets the number of splits (`n_splits`) to the number of training databases (4 in our case, as data is gathered from 'G12EC', 'SPH', 'PTB_PTBXL', and 'ChapmanShaoxing_Ningbo'). For K-Fold cross validation, the k value needs to be set manually. Additional information about this and other multilabel cross-validators can be found in [the GitHub repository of iterative-stratification](https://github.com/trent-b/iterative-stratification).

### <font color = teal> About the naming of csv files </font>

<font color = forestgreen><b>Database-wise</b></font>. The CSV files created through the database-wise split have intuitive names. They are named after the source database, for example, `PTB_PTBXL.csv`. The combined CSV files are named based on the combination of databases used to structure the CSV file. For instance, if the training data comes from the CPSC/CPSC-Extra, INCART, and PTB/PTB-XL databases, the combined CSV file will be named `CPSC_CPSC-Extra_INCART_PTB_PTBXL.csv`.

<font color = forestgreen><b>Stratified</b></font>.  Since there are four different data sources, four distinct data splits need to be created, each using one dataset as a testing set while the others serve as training sets. The [`create_data_csvs.py`](../create_data_csvs.py) script names the resulting CSV files based on information from the keys of the `train_test_splits` dictionary and the outcomes of the `MultilabelStratifiedShuffleSplit()` cross validator. For example, the CSV file names could be as follows:

<br>
<center>
train_split_1_1.csv &nbsp&nbsp&nbsp&nbsp&nbsp val_split_1_1.csv &nbsp&nbsp&nbsp&nbsp&nbsp test_split_1.csv
</center>
<br>

First, the CSV files are categorized into `train`, `val`, and `test` subsets. Then, as the `train_test_splits` dictionary employs keys for indexing the splits (e.g., `split_1` and `split_2`), the first indices correspond to this indexing. The latter indices refer to the results of the `MultilabelStratifiedShuffleSplit()` cross-validator: since data is collected from four different databases and stratified, it generates four different splits for the training and validation sets. Therefore, the latter indexing reflects the functionality of the mentioned cross-validator. The same applies to the naming of the CSV files by the `MultilabelStratifiedKFold` cross-validator.
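The naming scheme can be reproduced with a few lines of code. The sketch below only mirrors the file names described in this section; the function name `split_csv_names` is hypothetical and the code is not taken from `create_data_csvs.py`.

```python
def split_csv_names(split_name, n_splits):
    """Return the CSV names for one train-test split with `n_splits` train/validation splits."""
    names = ['test_{}.csv'.format(split_name)]
    for i in range(1, n_splits + 1):
        names.append('train_{}_{}.csv'.format(split_name, i))
        names.append('val_{}_{}.csv'.format(split_name, i))
    return names

names = split_csv_names('split_1', 4)
```

For `split_1` with four training and validation splits, this gives one test file plus four `train` and four `val` files.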

### <font color = teal>  Terminal command </font>

Once the necessary attributes are initialized, you can execute the desired data split using the following terminal command:

```
python create_data_csvs.py
```

-------------

## <font color = teal> 5) Creation of the configuration files </font> <a id='yamls'></a>

Once you have mapped and preprocessed the data and created the CSV files to tell the models the paths of the ECGs, the labels and demographic features, the YAML file can be created using either step-by-step implementation in the notebooks [Yaml files of database-wise split for training and testing](2_physionet_DBwise_yaml_files.ipynb) and [Yaml files of stratified split for training and testing](2_physionet_stratified_yaml_files.ipynb) or the [`create_yaml_files.py`](../create_yaml_files.py) script.

Before running the script, the following attributes need to be set:

1) `csv_path`: The directory from which the data split CSV files are read. This path is stored in the YAML files, so make sure that the folder exists.

2) `train_yaml_save_path` and `test_yaml_save_path`: Where to store the YAML files for training and testing, respectively. Change only the last component of the path, so that the files are still stored under `configs/training/` or `configs/predicting/`.

3) `name`: The prefix of the YAML file names. If set to `split`, the YAML files will be named e.g. `split_1_1.yaml`.

4) `train_dict` and `test_dict`: The parameters written to the YAML files. Do <b>NOT</b> change the existing keys, as they are all required by the training and evaluation scripts. New keys can be added.

The YAML files are created based on either the CSV files listed in the path set to `csv_path` or database-wise, when the databases are listed in the `data` attribute. There should be a pair of training and test files for each model. That is, if you have split the data into CSV files for training, validation and testing as follows

<br>
<center>
train_split_1_1.csv &nbsp&nbsp&nbsp&nbsp&nbsp val_split_1_1.csv &nbsp&nbsp&nbsp&nbsp&nbsp test_split_1.csv
</center>
<br>

you will create two YAML files, the first one below being for training and the second one for testing:

```text
train_file: train_split_1_1.csv
val_file: val_split_1_1.csv
csv_path: stratified_smoke
batch_size: 10
num_workers: 0
epochs: 1
lr: 0.003000
weight_decay: 0.000010
device_count: 1
threshold: 0.500000
bandwidth:
```

and

```text
test_file: test_split_1.csv
model: split_1_1.pth
csv_path: stratified_smoke
device_count: 1
threshold: 0.500000
bandwidth:
```


### <font color = teal>  Terminal command </font>

Once the necessary attributes are initialized, you can execute the yaml script using the following terminal command:

```
python create_yaml_files.py
```

-------------

## <font color = teal> Example: Creating CSV files from the provided smoke data </font> <a id='example'></a>

*All the data files for smoke testing are available in the repository.*

<font color = red>**NOTE!**</font> <font color = green> **Here, the** `data_dir` **attribute is set with the assumption that *the data is preprocessed*. If that's not the case, you should use, for example, the original data directory, such as the** `smoke_data` **directory.** The paths for ECGs will be different in the csv files depending on whether preprocessing has been used or not.</font>

First, we want to **preprocess the data**. Ensure that the [`preprocess_data.py`](../preprocess_data.py) script has the original and new directories set as follows

```python
from_directory = os.path.join(os.getcwd(), 'data', 'smoke_data')
new_directory = os.path.join(os.getcwd(), 'data', 'physionet_preprocessed_smoke')
```

The `from_directory` attribute refers to the directory where the original data is located, and the `new_directory` attribute refers to where the preprocessed data will be saved. You can perform preprocessing with the following command:

```
python preprocess_data.py
```

Once the data is preprocessed, you can proceed to **split the data into CSV files** using the [`create_data_csvs.py`](../create_data_csvs.py) script. Make sure the attributes are set as below before running the following command:

```
python create_data_csvs.py
```

###  <font color = forestgreen> Database-wise split </font>

Ensure that the following attributes are set **before the if-else statement** as follows:

```python
stratified = False
data_dir =  'preprocessed_smoke_data'
csv_dir =  'dbwise_smoke'
labels = ['426783006', '426177001', '427084000', '164890007', '164889003', '427393009', '164947007', '270492004']
```

The resulting CSV files will be saved in `./data/split_csvs/dbwise_smoke/` with the following files:

```text
ChapmanShaoxing_Ningbo_CPSC_CPSC-Extra_G12EC.csv
ChapmanShaoxing_Ningbo_CPSC_CPSC-Extra_PTB_PTBXL.csv
ChapmanShaoxing_Ningbo_CPSC_CPSC-Extra_SPH.csv
ChapmanShaoxing_Ningbo_G12EC_PTB_PTBXL.csv
ChapmanShaoxing_Ningbo_G12EC_SPH.csv
ChapmanShaoxing_Ningbo_PTB_PTBXL_SPH.csv
ChapmanShaoxing_Ningbo.csv
...
```

To create the YAML files, run the [`create_yaml_files.py`](../create_yaml_files.py) script. For the database-wise split data, make sure that you set

```python
kfold = False
csv_path = os.path.join(os.getcwd(), 'data', 'split_csvs', 'dbwise_smoke')
```

Then, you can store the YAML files in the folders `train_yamls_smoke` and `test_yamls_smoke` using `split` as the beginning of the filenames by setting

```python
train_yaml_save_path = os.path.join(os.getcwd(), 'configs', 'training', 'train_yamls_smoke')
test_yaml_save_path = os.path.join(os.getcwd(), 'configs', 'predicting', 'test_yamls_smoke')
name = 'split'
```

The YAML files for training and evaluation differ so they can be set e.g. as follows:

```python
train_dict = {
        'csv_path': os.path.basename(csv_path),

        # Training parameters
        'batch_size': 10,
        'num_workers': 0,
        'epochs': 1,
        'lr': 0.003,
        'weight_decay': 0.00001,

        # Device configurations
        'device_count': 1,

        # Decision threshold for predictions
        'threshold': 0.5,

        # For ECGs
        'bandwidth': ''
}
    
test_dict = {
        
        # Directory in 'data/split_csvs/' where the CSV files of the data split are located
        # (the same value as already set in `csv_path`, however, only the basename)
        'csv_path': os.path.basename(csv_path),

        # Device configurations
        'device_count': 1,

        # Decision threshold for predictions
        'threshold': 0.5,

        # For ECGs
        'bandwidth': ''
}   
```
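To show how such a dictionary relates to the YAML content listed earlier, the following sketch renders a flat dictionary as `key: value` lines. It is an illustration only: the function name `to_yaml_lines` is hypothetical, the six-decimal float format is chosen to resemble the YAML examples in this notebook, and the way `create_yaml_files.py` actually writes the files is not shown here.

```python
def to_yaml_lines(params):
    """Render a flat dict as 'key: value' lines; floats use fixed six-decimal notation."""
    lines = []
    for key, value in params.items():
        if isinstance(value, float):
            lines.append('{}: {:f}'.format(key, value))
        elif value == '':
            lines.append('{}:'.format(key))   # empty value, as with `bandwidth`
        else:
            lines.append('{}: {}'.format(key, value))
    return lines

example = {'batch_size': 10, 'lr': 0.003, 'weight_decay': 0.00001,
           'threshold': 0.5, 'bandwidth': ''}
print('\n'.join(to_yaml_lines(example)))
```

Entries such as `train_file` and `val_file` do not appear in `train_dict`; in the YAML examples above they vary per split and are presumably filled in by the script for each file.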

Finally, run the YAML script with the following command:

```
python create_yaml_files.py
```

After execution, you should be able to find the YAML files for training and evaluation in `configs/training/` and `configs/predicting/`, respectively. The folders there are named according to the `train_yaml_save_path` and `test_yaml_save_path` attributes.

### <font color = forestgreen> Stratified split </font>

For stratified data split, you can use a dictionary of dictionaries to specify the desired train-test splits. By default, there is one split provided in the script:

- Train data is from the directories G12EC, SPH, PTB_PTBXL, and ChapmanShaoxing_Ningbo.
- Test data is from the directory CPSC_CPSC-Extra.

Set the following attributes **before the if-else statement** as follows:

```python
stratified = True
data_dir =  'preprocessed_smoke_data'
csv_dir =  'stratified_smoke'
labels = ['426783006', '426177001', '427084000', '164890007', '164889003', '427393009', '164947007', '270492004']
```

Specify the databases used for training and testing by setting the `train_test_splits` attribute **within the if block**:

```python
train_test_splits = {
    'split_1': {    
        'train': ['G12EC', 'SPH', 'PTB_PTBXL', 'ChapmanShaoxing_Ningbo'],
        'test': ['CPSC_CPSC-Extra']
    }
}
```

The resulting csv files will be saved in `./data/split_csvs/stratified_smoke/` with names like:

```text
test_split_1.csv
train_split_1_1.csv
train_split_1_2.csv
train_split_1_3.csv
train_split_1_4.csv
val_split_1_1.csv
val_split_1_2.csv
val_split_1_3.csv
val_split_1_4.csv
```

As with the database-wise split data, you can create the YAML files by running the [`create_yaml_files.py`](../create_yaml_files.py) script. You can set all the other attributes as shown for the database-wise split data, but change the following:

```python
kfold = True
csv_path = os.path.join(os.getcwd(), 'data', 'split_csvs', 'stratified_smoke')
```