# Introductions for the Prediction and Evaluation

The prediction phase can be run either with a single yaml file or with several yaml files located in the same directory.

To make predictions with a trained model, you'll need a csv file that tells the model which part of the data (i.e., which ECGs) is used as testing data, and a yaml file that names this csv file for the model. The csv file(s) can be created by following the instructions in the notebook [Introductions for Data Handling](1_introductions_data_handling.ipynb). Yaml files can be created with the notebooks [Yaml files of Database-wise Split for Training and Prediction](2_physionet_DBwise_yaml_files.ipynb) and [Yaml files of Stratified Split for Training and Prediction](2_physionet_stratified_yaml_files.ipynb).

-----------------

<font color ='red'> **NOTE!** </font> *Before you start testing models, especially if you have already made predictions several times, check that the saving directory is either empty or contains only the predictions from the previous run with the same test data. If you use different test data with the same yaml file, the directory can end up holding predictions from different csv files, and the evaluation will not work. Keep this in mind especially if you get an **AssertionError**.*

As with the training phase, yaml files are also used in the prediction and evaluation phase. They have the following structure (`predict_smoke.yaml`):

```
# INITIAL SETTINGS
test_file: 'test_split0.csv'
model: 'split0_0.pth'
```

where `test_file` refers to the csv file of the test data and `model` refers to the file of the trained model which you want to test.

The script for this phase is `test_model.py`. First check that the paths `csv_root` and `data_root` point to the right locations in the `data` directory (and its subdirectories). The attribute `csv_root` is used to find the csv file of the test data, and `data_root` is used to find the data itself. The `model` file is searched for in the `experiments` directory automatically, so only the file name is necessary.

<font color ='red'>**NOTE!**</font> Also check the attribute `args.device_count`. It sets the number of GPUs used in prediction, as in the training phase.

The predictions are saved as csv files with the following structure

```
#Record ID
164889003, 270492004, 164909002, 426783006, 59118001, 284470004,  164884008,
        1,         1,         0,         0,         0,        0,          0,        
      0.9,       0.6,       0.2,       0.05,      0.2,      0.35,       0.35,  
```

where the first row, `#Record ID`, gives the name of the file from which the prediction is made, the second row lists the class labels as SNOMED CT codes, the third row gives the predicted labels in binary form (1 if the patient is predicted to have the diagnosis above, 0 otherwise), and the fourth row gives the probability score for each label.
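To work with a prediction file outside the script, the four rows can be read back into Python. The following is a minimal sketch that assumes the file looks exactly like the example above (a `#`-prefixed first row, trailing commas, integer values in the third row). The function name `parse_prediction` is hypothetical and is not part of the repository.

```python
def parse_prediction(text):
    """Parse one prediction file laid out like the example above.

    Hypothetical helper: assumes exactly four non-empty rows.
    """
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 4:
        raise ValueError(f"expected 4 non-empty rows, got {len(lines)}")

    def cells(line):
        # Rows may end with a trailing comma, so skip empty cells.
        return [cell.strip() for cell in line.split(",") if cell.strip()]

    record_id = lines[0].lstrip("#").strip()
    labels = cells(lines[1])
    predicted = [int(value) for value in cells(lines[2])]
    scores = [float(value) for value in cells(lines[3])]
    if not (len(labels) == len(predicted) == len(scores)):
        raise ValueError("label, prediction and score rows differ in length")

    # Map each SNOMED CT code to (binary prediction, probability score).
    return record_id, {
        label: (pred, score)
        for label, pred, score in zip(labels, predicted, scores)
    }


example = """#Record ID
164889003, 270492004, 164909002, 426783006, 59118001, 284470004,  164884008,
        1,         1,         0,         0,         0,        0,          0,
      0.9,       0.6,       0.2,       0.05,      0.2,      0.35,       0.35,
"""
record_id, result = parse_prediction(example)
print(record_id)
print(result["164889003"])
```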

The script performs the evaluation automatically after the predictions are made. In the evaluation phase, `metrics.py` from `/src/modeling/` is used to compute the desired metrics. The function `evaluate_predictions(test_data, pred_dir)` is called, where the parameter `test_data` refers to the location of the test data and the parameter `pred_dir` to the location of the predictions made from the test data. These parameters correspond to the arguments `args.test_path` and `args.output_dir` in the script `test_model.py`.

The evaluation metrics are saved as a `pickle` file in the same directory as the predictions.

If you want to run multiple yaml files in a single run, place all the individual yaml files in one directory, just like in the training phase.

### Terminal commands

Run a terminal command that consists of the script and either the yaml file *or* the directory where all the yaml files are located, so one of the following

```
python test_model.py predict_smoke.yaml
python test_model.py predict_multiple_smoke
```
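The difference between the two commands is whether the argument names a single yaml file or a directory of them. The sketch below illustrates that distinction with a hypothetical helper, `collect_yaml_files`. It is not the actual logic of `test_model.py`, and it takes an explicit path rather than resolving names the way the script does.

```python
from pathlib import Path


def collect_yaml_files(target):
    """Return the yaml file(s) that a file-or-directory path points to.

    Hypothetical helper for illustration only.
    """
    target = Path(target)
    suffixes = {".yaml", ".yml"}
    if target.is_dir():
        # Directory: every yaml file directly inside it, in a stable order.
        return sorted(
            p for p in target.iterdir() if p.is_file() and p.suffix in suffixes
        )
    if target.is_file() and target.suffix in suffixes:
        # Single yaml file: a one-element list keeps the caller's loop uniform.
        return [target]
    raise FileNotFoundError(f"no yaml file or directory at {target}")
```

With this shape, one loop handles both cases: a single file yields one configuration, and a directory yields one per yaml file.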

---------------------

## Example: Smoke testing 

### One yaml file

The yaml file for smoke testing, `predict_smoke.yaml`, is available in `/configs/predicting/`. Make sure the model has been trained first and is named `train_smoke.pth`! Training is explained in the notebook [Introductions for Training a Model](3_introductions_training.ipynb).

*Before anything else, check whether a directory named `predict_smoke` already exists in the `experiments` directory. If it contains predictions other than the ones from the files listed below, evaluation won't work correctly. Keep this in mind especially if you get an **AssertionError**.*
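This check can also be done programmatically. The sketch below compares the records named in a test csv with the prediction files in a directory. It assumes the csv has a `path` column and that each prediction is saved as `<record name>.csv`, as in this smoke test. The helper name `compare_predictions` is hypothetical.

```python
import csv
from pathlib import Path


def compare_predictions(test_csv, pred_dir):
    """Return (missing, unexpected) record names for a prediction directory.

    Hypothetical helper: `missing` are records in the csv without a
    prediction file, `unexpected` are prediction files not in the csv.
    """
    with open(test_csv, newline="") as handle:
        # Record name = file name of the ECG without its extension.
        expected = {Path(row["path"]).stem for row in csv.DictReader(handle)}

    pred_dir = Path(pred_dir)
    # Only csv files count as predictions; other files are ignored.
    found = {p.stem for p in pred_dir.glob("*.csv")} if pred_dir.is_dir() else set()

    return sorted(expected - found), sorted(found - expected)
```

For example, `compare_predictions('./data/split_csvs/physionet_stratified_smoke/test_split0.csv', './experiments/predict_smoke')` should return an empty second list before you rerun the predictions. Any name in that list is a leftover from different test data.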

The csv file `test_split0.csv` has the following structure

```
path,age,gender,fs,426783006,426177001,164934002,427084000,164890007,39732003,164889003,59931005,427393009,270492004
./data/physionet_preprocessed_smoke/CPSC_CPSC-Extra/A0004_preprocessed.mat,45.0,Male,500.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
./data/physionet_preprocessed_smoke/CPSC_CPSC-Extra/A0003_preprocessed.mat,81.0,Female,500.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
./data/physionet_preprocessed_smoke/CPSC_CPSC-Extra/A0007_preprocessed.mat,74.0,Male,500.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
./data/physionet_preprocessed_smoke/CPSC_CPSC-Extra/Q0001_preprocessed.mat,53.0,Male,500.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
./data/physionet_preprocessed_smoke/CPSC_CPSC-Extra/A0009_preprocessed.mat,81.0,Male,500.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
./data/physionet_preprocessed_smoke/CPSC_CPSC-Extra/A0002_preprocessed.mat,49.0,Female,500.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
```

so a total of six files are used as test data. All of them are from the `CPSC` and `CPSC-Extra` databases.

Let's then check that the paths and the number of devices in `test_model.py` are set correctly. For smoke testing, they are as follows

```
csv_root = './data/split_csvs/physionet_stratified_smoke/'
data_root = './data/physionet_preprocessed_smoke/'

args.device_count = 1
```

<font color = red>**NOTE!**</font> <font color = green> **Here, the** `data_root` **attribute is set with the assumption that *the data used is preprocessed*. If that's not the case, you should use, for example, the original data directory, such as** `./data/physionet_data_smoke/`. The ECG paths in the csv files differ depending on whether the data is preprocessed or not.</font>
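A quick way to catch a wrong `csv_root` or `data_root` before running the script is to check that the files actually exist. The sketch below is a hypothetical helper, not part of `test_model.py`. It assumes the csv has a `path` column and that those paths resolve from the current working directory.

```python
import csv
from pathlib import Path


def check_paths(csv_root, data_root, test_file):
    """Return a list of human-readable problems (empty if all is well).

    Hypothetical helper for illustration only.
    """
    problems = []
    csv_path = Path(csv_root) / test_file
    data_root = Path(data_root)

    if not data_root.is_dir():
        problems.append(f"data_root not found: {data_root}")
    if not csv_path.is_file():
        # Without the csv there are no ECG paths to check.
        problems.append(f"test csv not found: {csv_path}")
        return problems

    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            # Each row names one ECG file in its `path` column.
            if not Path(row["path"]).is_file():
                problems.append(f"ECG file not found: {row['path']}")
    return problems
```

Calling it with the values above and `'test_split0.csv'` returns an empty list when every path is in place.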

Then run the following command to make the predictions with the trained model saved as `train_smoke.pth`

```
python test_model.py predict_smoke.yaml
```

The predictions can be found in the `predict_smoke` subdirectory of the `experiments` directory. Each prediction is named after the original file name from which the predictions have been made. In smoke testing, they are as follows

```
A0002_preprocessed.csv
A0003_preprocessed.csv
A0004_preprocessed.csv
A0007_preprocessed.csv
A0009_preprocessed.csv
Q0001_preprocessed.csv
```

Each has the structure presented above.

After all the predictions are made, the script calls `evaluate_predictions(test_data, pred_dir)` to evaluate them. The metrics are printed in the terminal as follows

```
Micro Average Precision: 0.11131725417439703
Micro AUROC:             0.5205811138014528
Accuracy:                0.0
Micro F1-score:          0.14285714285714285
```

These metrics are saved in the file `eval_history.pickle`, located in the same directory as the predictions.
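The saved metrics can be read back with Python's standard `pickle` module. The sketch below makes no assumption about what object the file contains and simply returns whatever was stored. The helper name `load_eval_history` is hypothetical.

```python
import pickle
from pathlib import Path


def load_eval_history(pred_dir, filename="eval_history.pickle"):
    """Load the evaluation metrics stored next to the predictions.

    Hypothetical helper. Only unpickle files you created yourself:
    loading a pickle can execute arbitrary code.
    """
    with open(Path(pred_dir) / filename, "rb") as handle:
        return pickle.load(handle)
```

For the smoke test, `load_eval_history('./experiments/predict_smoke')` would return the stored object, which you can then inspect with `print` or `type`.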


### Multiple yaml files in a directory

The idea is similar here: place all the yaml files, structured like `predict_smoke.yaml` above, in one directory. There is a directory `predict_multiple_smoke` in `/configs/predicting/` which contains two yaml files named `split0_0.yaml` and `split0_1.yaml`. The yaml files have the following content:

`split0_0.yaml`:
```
# INITIAL SETTINGS
test_file: test_split0.csv
model: split0_0.pth
```

`split0_1.yaml`:
```
# INITIAL SETTINGS
test_file: test_split0.csv
model: split0_1.pth
```

As both are from the same stratified train-test split, they have the same test set. The trained models differ because different training and validation splits were used.
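Because such yaml files differ only in the `model` entry, they are easy to generate. The sketch below writes one yaml file per model into a directory, using the same two keys as above. The helper `write_predict_yamls` is hypothetical. The notebooks linked at the top of this page remain the intended way to create the files.

```python
from pathlib import Path


def write_predict_yamls(out_dir, test_file, model_files):
    """Write one prediction yaml per model file and return the paths.

    Hypothetical helper: each yaml is named after its model file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for model_file in model_files:
        # split0_0.pth -> split0_0.yaml
        yaml_path = out_dir / (Path(model_file).stem + ".yaml")
        yaml_path.write_text(
            "# INITIAL SETTINGS\n"
            f"test_file: {test_file}\n"
            f"model: {model_file}\n"
        )
        written.append(yaml_path)
    return written
```

Calling `write_predict_yamls('predict_multiple_smoke', 'test_split0.csv', ['split0_0.pth', 'split0_1.pth'])` would produce the two files shown above.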

Let's then check that the paths and the number of devices in `test_model.py` are set correctly. For smoke testing, they are as follows

```
csv_root = './data/split_csvs/physionet_stratified_smoke/'
data_root = './data/physionet_preprocessed_smoke/'

args.device_count = 1
```

<font color = red>**NOTE!**</font> <font color = green> **Here, the** `data_root` **attribute is set with the assumption that *the data used is preprocessed*. If that's not the case, you should use, for example, the original data directory, such as** `./data/physionet_data_smoke/`. The ECG paths in the csv files differ depending on whether the data is preprocessed or not.</font>

The terminal command for prediction is now

```
python test_model.py predict_multiple_smoke
```

The predictions can be found in two subdirectories of the `predict_multiple_smoke` directory in `/experiments/`, named after the yaml files used: `split0_0` and `split0_1`. The evaluation metrics are also saved in both subdirectories in `pickle` format.