# Instructions for Training a Model

Training can be performed either with a single yaml file or with several yaml files located in the same directory.

To train a model, you need csv files that tell the model which parts of the data (i.e., which ECGs) are used for training, which for validation and which for testing, as well as yaml files based on these csv files that hold the detailed configuration. The csv files can be created by following the instructions in the notebook [Introductions for Data Handling](1_introductions_data_handling.ipynb). The yaml files can be created with the notebooks [Yaml files of Database-wise Split for Training and Prediction](2_physionet_DBwise_yaml_files.ipynb) and [Yaml files of Stratified Split for Training and Prediction](2_physionet_stratified_yaml_files.ipynb).

------


First, look through the yaml files in `/configs/training/` and decide which one you want to use, or whether you want to write one of your own. The yaml file (in this case, `train_smoke.yaml`) should have the following attributes:

```
# INITIAL SETTINGS
train_file: 'train_split0_0.csv'
val_file: 'val_split0_0.csv'

# TRAINING SETTINGS
batch_size: 10
num_workers: 0

# SAVE, LOAD AND DISPLAY INFORMATION
epochs: 1
```

where `train_file` refers to the csv file used in the training phase of each epoch. It contains the paths of the ECG recordings, the patients' gender and age, and the labels used in the classification. `val_file` refers to the csv file used in the validation phase of each epoch. `epochs` is the total number of training epochs.
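Before launching a run, it can help to confirm that a configuration contains the attributes listed above. The following is a minimal sketch and not part of `train_model.py`: the helper `check_config` and the `REQUIRED_KEYS` table are hypothetical names, and the configuration is shown as an already parsed dictionary (for instance, what PyYAML's `yaml.safe_load` would return for `train_smoke.yaml`).

```python
# Hypothetical helper; not part of train_model.py.
# Expected type of each attribute shown in train_smoke.yaml.
REQUIRED_KEYS = {
    "train_file": str,
    "val_file": str,
    "batch_size": int,
    "num_workers": int,
    "epochs": int,
}


def check_config(config):
    """Return a list of problems found in a parsed yaml config (empty if none)."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing attribute: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"{key} should be of type {expected_type.__name__}")
    return problems


# The values of train_smoke.yaml written out as a dictionary
smoke_config = {
    "train_file": "train_split0_0.csv",
    "val_file": "val_split0_0.csv",
    "batch_size": 10,
    "num_workers": 0,
    "epochs": 1,
}
print(check_config(smoke_config))  # prints []
```

An empty list means that every expected attribute is present with a plausible type; the check says nothing about whether the referenced csv files exist.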

If you want to run multiple yaml files in the same run, place all the individual yaml files in one directory.

In the script `train_model.py`, check that the paths for the csv files (`csv_root`) and the data (`data_root`) point to the right locations. There are also other training parameters, such as

```
args.device_count = 2
args.lr = 0.003
args.weight_decay = 0.00001
```

which can be set as desired.

<font color ='red'>**NOTE!**</font> Pay attention to the attribute `args.device_count`. It is the number of GPUs used in training. It affects the attribute `args.batch_size`, since the batch size is divided by the device count, and the result of this division must be a positive integer.

Trained model(s), ROC curves and the training history will be saved in a subdirectory of the `experiments` directory. Each file or directory will be named after the yaml file used or the directory where the yaml files are located.

The model will be saved in `pth` format, the ROC curves as `png` images and the history as a `pickle` file. The ROC curves will have a directory of their own.
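The divisibility requirement can be checked before training starts. This is a small sketch under the assumption that the batch is split evenly across the GPUs; the function name `per_device_batch_size` is hypothetical and does not come from `train_model.py`.

```python
def per_device_batch_size(batch_size, device_count):
    """Return how many samples each GPU receives per batch.

    Hypothetical helper: raises ValueError when the batch cannot be
    split evenly across the devices.
    """
    if batch_size < 1 or device_count < 1:
        raise ValueError("batch_size and device_count must be at least 1")
    if batch_size % device_count != 0:
        raise ValueError(
            f"batch_size {batch_size} is not divisible by device_count {device_count}"
        )
    return batch_size // device_count


# batch_size: 10 from the yaml file and args.device_count = 2
print(per_device_batch_size(10, 2))  # prints 5
```

With `batch_size: 10`, a device count of 1, 2, 5 or 10 passes the check, while a device count of 3 or 4 raises an error.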

### Terminal commands

Run a terminal command that consists of the script and either the yaml file *or* the directory where all the yaml files are located, i.e. one of the following:

```
python train_model.py train_smoke.yaml
python train_model.py train_multiple_smoke
```
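How `train_model.py` resolves its argument is not shown here, so the following is only an illustrative sketch of one way to turn either form of the argument into a list of yaml files. The function name `collect_yaml_files` and the default `config_root` value are assumptions made for this example.

```python
from pathlib import Path


def collect_yaml_files(target, config_root="configs/training"):
    """Return the yaml files selected by a file name or a directory name.

    Hypothetical helper: `target` is the command-line argument, looked up
    relative to `config_root`.
    """
    path = Path(config_root) / target
    if path.is_dir():
        # Directory: every yaml file inside it, in a stable (sorted) order
        return sorted(p for p in path.iterdir() if p.suffix in {".yaml", ".yml"})
    if path.is_file():
        # Single yaml file: a list with one entry
        return [path]
    raise FileNotFoundError(f"No yaml file or directory found at {path}")
```

Under these assumptions, `collect_yaml_files("train_smoke.yaml")` would return a single path, and `collect_yaml_files("train_multiple_smoke")` would return one path per yaml file in that directory.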

-----------------

## Example: Smoke testing

### One yaml file

Let's use the yaml file `train_smoke.yaml` presented above from `/configs/training/` for smoke testing. The csv files `train_split0_0.csv` and `val_split0_0.csv` have already been created with the script `create_data_split_csvs.py`. In other words, we are training a model using the stratified split of the data. The first rows of both csv files are as follows:

Train csv:
```
path,age,gender,fs,426783006,426177001,164934002,427084000,164890007,39732003,164889003,59931005,427393009,270492004
./data/physionet_preprocessed_smoke/G12EC/E00008_preprocessed.mat,76.0,Male,500.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
./data/physionet_preprocessed_smoke/G12EC/E00006_preprocessed.mat,65.0,Male,500.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
./data/physionet_preprocessed_smoke/G12EC/E00005_preprocessed.mat,83.0,Male,500.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
./data/physionet_preprocessed_smoke/G12EC/E00004_preprocessed.mat,75.0,Male,500.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...
```

Val csv:
```
path,age,gender,fs,426783006,426177001,164934002,427084000,164890007,39732003,164889003,59931005,427393009,270492004
./data/physionet_preprocessed_smoke/G12EC/E00001_preprocessed.mat,-1.0,Female,500.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
./data/physionet_preprocessed_smoke/G12EC/E00003_preprocessed.mat,-1.0,Male,500.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
./data/physionet_preprocessed_smoke/INCART/I0020_preprocessed.mat,59.0,Female,257.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
./data/physionet_preprocessed_smoke/INCART/I0050_preprocessed.mat,70.0,Male,257.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
./data/physionet_preprocessed_smoke/PTB_PTBXL/HR00008_preprocessed.mat,48.0,Male,500.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
./data/physionet_preprocessed_smoke/ChapmanShaoxing_Ningbo
...
```
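To make the csv layout concrete, the sketch below reads the header and the first row of the train csv shown above and lists the label columns that are set to 1. It uses only the standard library; the names `METADATA_COLUMNS` and `positive_labels` are hypothetical, and treating every column other than `path`, `age`, `gender` and `fs` as a label is an assumption based on the header above.

```python
import csv
import io

# Header and first data row of the train csv shown above
SAMPLE = """path,age,gender,fs,426783006,426177001,164934002,427084000,164890007,39732003,164889003,59931005,427393009,270492004
./data/physionet_preprocessed_smoke/G12EC/E00008_preprocessed.mat,76.0,Male,500.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
"""

# Assumption: all remaining columns are labels
METADATA_COLUMNS = ("path", "age", "gender", "fs")


def positive_labels(row):
    """Return the names of the label columns whose value is 1 in one csv row."""
    return [
        column
        for column, value in row.items()
        if column not in METADATA_COLUMNS and float(value) == 1.0
    ]


for row in csv.DictReader(io.StringIO(SAMPLE)):
    print(row["path"], positive_labels(row))
    # the label list printed for this row is ['427084000', '39732003']
```

A row can carry more than one positive label, as this recording does, so the labels are not mutually exclusive.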

The paths in `train_model.py` should be set as below

```
csv_root = './data/split_csvs/physionet_stratified_smoke/'
data_root = './data/physionet_preprocessed_smoke/' 
```

<font color = red>**NOTE!**</font> <font color = green> **Here, the** `data_root` **attribute is set with the assumption that *the data used is preprocessed*. If that's not the case, you should use, for example, the original data directory, such as** `./data/physionet_data_smoke/`. The paths to the ECGs in the csv files differ depending on whether the data is preprocessed or not.</font>

Set the other parameters as follows:
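A mismatch between `data_root` and the paths stored in the csv files can be spotted with a quick check. The sketch below assumes that the csv paths are expected to lie under `data_root`, as they do in the smoke example; the function `rows_outside_data_root` is a hypothetical helper and takes the csv content as text so that it stays self-contained.

```python
import csv
import io
from pathlib import PurePosixPath


def rows_outside_data_root(csv_text, data_root):
    """Return the csv paths that do not lie under data_root (hypothetical check)."""
    root = PurePosixPath(data_root)
    outside = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # A path lies under the root if the root is one of its parent directories
        if root not in PurePosixPath(row["path"]).parents:
            outside.append(row["path"])
    return outside


csv_text = (
    "path,age,gender,fs\n"
    "./data/physionet_preprocessed_smoke/G12EC/E00008_preprocessed.mat,76.0,Male,500.0\n"
)
print(rows_outside_data_root(csv_text, "./data/physionet_preprocessed_smoke/"))  # prints []
```

Calling the same function with `./data/physionet_data_smoke/` as `data_root` would report the preprocessed path, which signals that the csv files and `data_root` refer to different versions of the data.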

```
args.device_count = 1
args.lr = 0.003
args.weight_decay = 0.00001
```

Now you should be ready to perform the training:

```
python train_model.py train_smoke.yaml
```

The trained model and training history can be found as `train_smoke.pth` and `train_smoke_history.pickle` in the `experiments` directory. The ROC curves are saved in the `ROC_train_smoke` directory as `roc-e1.png`, `roc-e2.png`, etc., each named after the number of the epoch at which it was drawn.
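The saved history can be inspected afterwards with the standard `pickle` module. This is a minimal sketch: the helper `load_history` is a hypothetical name, the file name pattern follows the `train_smoke_history.pickle` example above, and the structure of the unpickled object is not described here, so inspect it before relying on any particular keys. Only unpickle files that you created yourself or otherwise trust.

```python
import pickle
from pathlib import Path


def load_history(experiment_name, experiments_dir="experiments"):
    """Load the pickled training history saved for one experiment.

    Hypothetical helper: assumes the file is named
    <experiment_name>_history.pickle inside experiments_dir.
    """
    history_path = Path(experiments_dir) / f"{experiment_name}_history.pickle"
    with history_path.open("rb") as handle:
        return pickle.load(handle)
```

After the smoke run, `history = load_history("train_smoke")` followed by `print(type(history))` is a reasonable first step for finding out what the history contains.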

### Multiple yaml files in a directory

The idea is similar here, except that all the yaml files, each structured like the yaml file `train_smoke.yaml` presented above, should be placed in one directory. There is a directory `train_multiple_smoke` in `/configs/training/` that contains two yaml files named `split0_0.yaml` and `split0_1.yaml`. Each yaml file points to different csv files for training and validation, as follows:

`split0_0.yaml`:
```
# INITIAL SETTINGS
train_file: train_split0_0.csv
val_file: val_split0_0.csv

# TRAINING SETTINGS
batch_size: 10
num_workers: 0

# SAVE, LOAD AND DISPLAY INFORMATION
epochs: 1
```

`split0_1.yaml`:
```
# INITIAL SETTINGS
train_file: train_split0_1.csv
val_file: val_split0_1.csv

# TRAINING SETTINGS
batch_size: 10
num_workers: 0

# SAVE, LOAD AND DISPLAY INFORMATION
epochs: 1
```

Both files are constructed from the same stratified split, where the training data comes from the databases G12EC, INCART, PTB_PTBXL and ChapmanShaoxing_Ningbo, as specified in `physionet_stratified_smoke.yaml`.

The paths in `train_model.py` should still be set as below

```
csv_root = './data/split_csvs/physionet_stratified_smoke/'
data_root = './data/physionet_preprocessed_smoke/' 
```

<font color = red>**NOTE!**</font> <font color = green> **Here, the** `data_root` **attribute is set with the assumption that *the data used is preprocessed*. If that's not the case, you should use, for example, the original data directory, such as** `./data/physionet_data_smoke/`. The paths to the ECGs in the csv files differ depending on whether the data is preprocessed or not.</font>

The terminal command for training is now

```
python train_model.py train_multiple_smoke
```

The trained models (one per yaml file, so two in this example) are saved as `pth` files in the `train_multiple_smoke` subdirectory (named after the directory in which the yaml files are located) of the `experiments` directory. Similarly, there is one `pickle` file per model containing its training history, and the ROC curves can be found in separate directories, one per model, that are also named after the yaml files.