# Introduction to Data Handling

This notebook contains information about:

1)  How to download the data into the repository (especially the Physionet 2021 data)

2)  How to preprocess data if needed

3)  The base idea of splitting data into csv files and how to perform it

Once you have done any preprocessing and split the data into csv files, you may want to create `yaml` files based on these files for training and prediction. To do this, see the notebooks [Yaml files of Database-wise Split for Training and Prediction](2_physionet_DBwise_yaml_files.ipynb) and [Yaml files of Stratified Split for Training and Prediction](2_physionet_stratified_yaml_files.ipynb).

--------

## 1) Downloading data

### Physionet 2021 data

The exploration of the dataset is available in the notebook [Exploration of the PhysioNet2021 Data](exploration_physionet2021_data.ipynb).

There are two ways to download the Physionet Challenge 2021 data in `tar.gz` format: 

1) Downloading it manually from [here](https://moody-challenge.physionet.org/2021/) under **Data Access**

2) Letting this notebook do the job with the following code


In [1]:
# First we need the tar.gz files of each database so let's download them first

!wget -O WFDB_CPSC2018.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_CPSC2018.tar.gz/
        
!wget -O WFDB_CPSC2018_2.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_CPSC2018_2.tar.gz/
        
!wget -O WFDB_StPetersburg.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_StPetersburg.tar.gz/
        
!wget -O WFDB_PTB.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_PTB.tar.gz/
        
!wget -O WFDB_PTBXL.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_PTBXL.tar.gz/
        
!wget -O WFDB_Ga.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_Ga.tar.gz/
        
!wget -O WFDB_ChapmanShaoxing.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_ChapmanShaoxing.tar.gz/
        
!wget -O WFDB_Ningbo.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_Ningbo.tar.gz/
        

--2022-08-24 14:59:50--  https://pipelineapi.org:9555/api/download/physionettraining/WFDB_CPSC2018.tar.gz/
Resolving pipelineapi.org (pipelineapi.org)... 35.237.166.166
Connecting to pipelineapi.org (pipelineapi.org)|35.237.166.166|:9555... connected.
HTTP request sent, awaiting response... 200 
Length: 827672464 (789M) [application/octet-stream]
Saving to: ‘WFDB_CPSC2018.tar.gz’


2022-08-24 15:00:44 (14.8 MB/s) - ‘WFDB_CPSC2018.tar.gz’ saved [827672464/827672464]

--2022-08-24 15:00:44--  https://pipelineapi.org:9555/api/download/physionettraining/WFDB_CPSC2018_2.tar.gz/
Resolving pipelineapi.org (pipelineapi.org)... 35.237.166.166
Connecting to pipelineapi.org (pipelineapi.org)|35.237.166.166|:9555... connected.
HTTP request sent, awaiting response... 200 
Length: 423189282 (404M) [application/octet-stream]
Saving to: ‘WFDB_CPSC2018_2.tar.gz’


2022-08-24 15:01:11 (15.1 MB/s) - ‘WFDB_CPSC2018_2.tar.gz’ saved [423189282/423189282]

--2022-08-24 15:01:11--  https://pipelineapi.org:955
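
The eight `wget` commands above differ only in the archive name, so the same downloads could also be written as a loop in plain Python. The following is a minimal sketch, assuming the URL pattern shown in the commands above; `archive_url` and `download_all` are hypothetical helper names and are not part of the repository.

```python
import os
import urllib.request

# Base URL and archive names copied from the wget commands above
BASE_URL = "https://pipelineapi.org:9555/api/download/physionettraining/"
ARCHIVES = [
    "WFDB_CPSC2018.tar.gz",
    "WFDB_CPSC2018_2.tar.gz",
    "WFDB_StPetersburg.tar.gz",
    "WFDB_PTB.tar.gz",
    "WFDB_PTBXL.tar.gz",
    "WFDB_Ga.tar.gz",
    "WFDB_ChapmanShaoxing.tar.gz",
    "WFDB_Ningbo.tar.gz",
]

def archive_url(name):
    """Build the download URL for one archive (note the trailing slash)."""
    return BASE_URL + name + "/"

def download_all(target_dir="."):
    """Download every archive into target_dir (hypothetical helper)."""
    for name in ARCHIVES:
        print("Downloading", name)
        urllib.request.urlretrieve(archive_url(name), os.path.join(target_dir, name))

# download_all()  # Uncomment to start the downloads (the archives are large)
```

The download call is left commented out so that running the cell does not start a large transfer by accident.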

Once we have the files, they need to be extracted to the `data` directory, which is located in the root of the repository. We might want to group the extracted files by source as follows:

- CPSC Database and CPSC-Extra Database
- St. Petersburg (INCART) Database
- PTB and PTB-XL Database
- The Georgia 12-lead ECG Challenge (G12EC) Database
- Chapman-Shaoxing and Ningbo Database

Let's arrange the data files in this structure.

In [13]:
import os

# All tar.gz files (in the current working directory)
curr_path = os.getcwd()
targz_files = [file for file in os.listdir(curr_path) if os.path.isfile(os.path.join(curr_path, file)) and file.endswith('tar.gz')]

# Let's sort the files
targz_files = sorted(targz_files)

for i, file in enumerate(targz_files):
    print(i, file)

0 WFDB_CPSC2018.tar.gz
1 WFDB_CPSC2018_2.tar.gz
2 WFDB_ChapmanShaoxing.tar.gz
3 WFDB_Ga.tar.gz
4 WFDB_Ningbo.tar.gz
5 WFDB_PTB.tar.gz
6 WFDB_PTBXL.tar.gz
7 WFDB_StPetersburg.tar.gz


So we want to extract the tar.gz files listed above grouped as follows:

* WFDB_CPSC2018.tar.gz + WFDB_CPSC2018_2.tar.gz
* WFDB_StPetersburg.tar.gz
* WFDB_PTB.tar.gz + WFDB_PTBXL.tar.gz
* WFDB_Ga.tar.gz
* WFDB_ChapmanShaoxing.tar.gz + WFDB_Ningbo.tar.gz

In [16]:
# Let's express the split as tuples of tar.gz files
# NB! The indices below rely on the SORTED order of the files listed above!
tar_split = [(targz_files[0], targz_files[1]),
             (targz_files[7], ),
             (targz_files[5], targz_files[6]),
             (targz_files[3], ),
             (targz_files[2], targz_files[4])]

print(*tar_split, sep="\n")

('WFDB_CPSC2018.tar.gz', 'WFDB_CPSC2018_2.tar.gz')
('WFDB_StPetersburg.tar.gz',)
('WFDB_PTB.tar.gz', 'WFDB_PTBXL.tar.gz')
('WFDB_Ga.tar.gz',)
('WFDB_ChapmanShaoxing.tar.gz', 'WFDB_Ningbo.tar.gz')
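
The index-based split above depends on the sorted order of the files. As an alternative, the grouping could be written by name, which does not depend on ordering. This is only a sketch: the directory names are the ones used in the extraction cell of this notebook, and `missing_archives` is a hypothetical helper for checking that every archive has actually been downloaded.

```python
# Map each target directory to the archives that belong to it
archive_groups = {
    'CPSC_CPSC-Extra': ('WFDB_CPSC2018.tar.gz', 'WFDB_CPSC2018_2.tar.gz'),
    'INCART': ('WFDB_StPetersburg.tar.gz',),
    'PTB_PTBXL': ('WFDB_PTB.tar.gz', 'WFDB_PTBXL.tar.gz'),
    'G12EC': ('WFDB_Ga.tar.gz',),
    'ChapmanShaoxing_Ningbo': ('WFDB_ChapmanShaoxing.tar.gz', 'WFDB_Ningbo.tar.gz'),
}

def missing_archives(groups, available):
    """Return the archive names in `groups` that are not in `available`."""
    available = set(available)
    return [name for names in groups.values() for name in names if name not in available]
```

With this mapping, `list(archive_groups.values())` gives the same tuples as `tar_split`, and `list(archive_groups)` gives the matching directory names in the same order.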


In [17]:
import os
import tarfile

# Extract all regular files from a given tar archive to a given directory.
# Subdirectories inside the archive are ignored: every file is placed directly in the given directory.
def extract_files(tar, directory):

    n_files = 0
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tar, 'r') as file:
        for member in file.getmembers():
            if member.isreg(): # Skip if the TarInfo is not a regular file
                member.name = os.path.basename(member.name) # Drop the path inside the archive
                file.extract(member, directory)
                n_files += 1

    print('- {} files extracted to {}'.format(n_files, directory))

In [18]:
# Path to the physionet_data directory, i.e., the dataset is saved here
data_path = '../data/physionet_data'
os.makedirs(data_path, exist_ok=True)

# Directories into which the data is extracted
# NB! Must have the same length as 'tar_split'
dir_names = ['CPSC_CPSC-Extra', 'INCART', 'PTB_PTBXL', 'G12EC', 'ChapmanShaoxing_Ningbo']

# Extracting the right files to the right subdirectories
for tar, directory in zip(tar_split, dir_names):

    print('Extracting tar.gz file(s) {} to the {} directory'.format(tar, directory))

    # Saving path for the specific files
    save_tmp = os.path.join(data_path, directory)
    # Preparing the directory
    os.makedirs(save_tmp, exist_ok=True)

    # Each tuple holds one or more archives; all of them go to the same directory
    for one_tar in tar:
        extract_files(one_tar, save_tmp)

print('Done!')

Extracting tar.gz file(s) ('WFDB_CPSC2018.tar.gz', 'WFDB_CPSC2018_2.tar.gz') to the CPSC_CPSC-Extra directory
- 13754 files extracted to ../data/physionet_data/CPSC_CPSC-Extra
- 6906 files extracted to ../data/physionet_data/CPSC_CPSC-Extra
Extracting tar.gz file(s) ('WFDB_StPetersburg.tar.gz',) to the INCART directory
- 148 files extracted to ../data/physionet_data/INCART
Extracting tar.gz file(s) ('WFDB_PTB.tar.gz', 'WFDB_PTBXL.tar.gz') to the PTB_PTBXL directory
- 1032 files extracted to ../data/physionet_data/PTB_PTBXL
- 43674 files extracted to ../data/physionet_data/PTB_PTBXL
Extracting tar.gz file(s) ('WFDB_Ga.tar.gz',) to the G12EC directory
- 20688 files extracted to ../data/physionet_data/G12EC
Extracting tar.gz file(s) ('WFDB_ChapmanShaoxing.tar.gz', 'WFDB_Ningbo.tar.gz') to the ChapmanShaoxing_Ningbo directory
- 20494 files extracted to ../data/physionet_data/ChapmanShaoxing_Ningbo
- 69810 files extracted to ../data/physionet_data/ChapmanShaoxing_Ningbo


Now we should have a total of **176 506** files in the `physionet_data` directory (according to the data exploration notebook linked above), as each ECG recording consists of a binary MATLAB v4 file and a text file in header format. We can easily double-check that:

In [21]:
total_files = 0
for root, dirs, files in os.walk(data_path):
    total_files += len(files)
    
print('Total of {} files'.format(total_files))

Total of 176506 files
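
Since each recording should consist of a `.mat` file and a `.hea` file, a further check is to look for files that are missing their counterpart. The following is a minimal sketch; `find_unpaired` is a hypothetical helper and not part of the repository.

```python
import os

def find_unpaired(data_path):
    """Return paths (without extension) that have only one of the .mat/.hea pair."""
    mats, heas = set(), set()
    for root, _dirs, files in os.walk(data_path):
        for name in files:
            stem, ext = os.path.splitext(name)
            if ext == '.mat':
                mats.add(os.path.join(root, stem))
            elif ext == '.hea':
                heas.add(os.path.join(root, stem))
    # Symmetric difference: stems present in exactly one of the two sets
    return sorted(mats ^ heas)
```

An empty list from `find_unpaired(data_path)` would indicate that every recording has both of its files.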


### Other data sources

Data can also be downloaded from other sources, as long as a few guidelines are followed:

1) When this repository is used in training and testing, the model processes ECGs in `MATLAB v4` format (.mat) and header files in `WFDB header format` (.hea). Header files contain the description of the recording and the patient attributes, including *diagnoses*. 

The following code is used to load the data from MATLAB files:

```
import numpy as np
from scipy.io import loadmat

def load_data(case):
    ''' Load the MATLAB v4 file of an ECG recording
    '''
    x = loadmat(case)
    return np.asarray(x['val'], dtype=np.float64)
```

So the recording is stored in the MATLAB file under a variable named `val`. Keep this in mind when loading other MATLAB files.
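
The patient attributes and diagnoses mentioned above are stored in the header file. The following is a minimal sketch of reading them. It assumes that they appear on comment lines such as `#Age: 74`, `#Sex: Male` and `#Dx: code1,code2`; check this layout against your own header files. The helper names and the example header are made up for illustration.

```python
def parse_header_comments(header_text):
    """Collect 'key: value' pairs from the comment lines (starting with '#') of a header."""
    info = {}
    for line in header_text.splitlines():
        if not line.startswith('#'):
            continue
        key, sep, value = line.lstrip('#').partition(':')
        if sep:
            info[key.strip()] = value.strip()
    return info

def diagnoses(info):
    """Split the 'Dx' entry into a list of SNOMED CT codes."""
    return [code.strip() for code in info.get('Dx', '').split(',') if code.strip()]

# Made-up header for illustration only
example = """A0001 12 500 7500
A0001.mat 16+24 1000/mV 16 0 28 -1716 0 I
#Age: 74
#Sex: Male
#Dx: 426783006,164889003
"""
info = parse_header_comments(example)
print(info['Age'], info['Sex'], diagnoses(info))
```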

2) Data should be located in the `data` directory. For example, when training and making predictions, the attribute `data_root` sets the location from which the ECG recordings are loaded.

The code above extracts the tar.gz files, and the function `extract_files(tar, directory)` is generally usable. The parameter `tar` refers to the tar.gz file to be extracted, and `directory` refers to the path into which the files are extracted. The path is given as a relative path, e.g. `../data/physionet_data/G12EC`.
   

In [7]:
## Other sources
## -------------

# tar = 'example.tar.gz'
# save_path = '../data/example/'
# extract_files(tar, save_path)

----

## 2) Preprocessing data

All the data can be preprocessed with different transforms using the script `preprocess_data.py`. There are two important attributes to consider:

```
# Original data location
from_directory = os.path.join(os.getcwd(), 'data', 'physionet_data_smoke')

# New location for preprocessed data
new_directory = os.path.join(os.getcwd(), 'data', 'physionet_preprocessed_smoke')
```

`from_directory` refers to the directory from which the data in its original format is loaded, such as the downloaded Physionet Challenge data. `new_directory`, as its name suggests, refers to the new location, into which the directory tree of the original data location is first copied using the function `copy_tree` from the module `distutils.dir_util`. After this, *each directory in the new location* is iterated over and all the ECGs (which should be in MATLAB format) are preprocessed with the wanted transforms. *The original version of each ECG is then deleted and the preprocessed one is saved in the directory.*
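
The copy-then-process flow described above can be sketched as follows. This is not the code of `preprocess_data.py`: it uses `shutil.copytree` in place of `copy_tree` (the `distutils` module was removed from the standard library in Python 3.12), and `process_file` is a hypothetical callback in which the transforms and the saving of the preprocessed ECG would take place.

```python
import os
import shutil

def copy_and_process(from_directory, new_directory, process_file):
    """Copy the directory tree, then call process_file on every .mat file in the copy."""
    # dirs_exist_ok requires Python 3.8 or newer
    shutil.copytree(from_directory, new_directory, dirs_exist_ok=True)
    n_processed = 0
    for root, _dirs, files in os.walk(new_directory):
        for name in files:
            if name.endswith('.mat'):
                process_file(os.path.join(root, name))
                n_processed += 1
    return n_processed
```

Because only the copy is modified, the data in `from_directory` stays untouched and the preprocessing can be repeated with other transforms.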

By default, two transforms are used, a band-pass filter and linear interpolation:

```
# ------------------------------
# --- PREPROCESS TRANSFORMS ----

# - BandPass filter 
bpf = BandPassFilter(fs = ecg_fs)
ecg = bpf(ecg)

# - Linear interpolation
linear_interp = Linear_interpolation(fs_new = 257, fs_old = fs)
ecg = linear_interp(ecg)

# ------------------------------
# ------------------------------
```

The preprocessing part **is not mandatory for the repository to work**. However, if transforms such as the two mentioned above are applied during the training phase, they can slow down training significantly. That's why it's recommended to preprocess the data before training using the script mentioned above.

All the other transforms are set in the script `dataset.py` in `src/dataloader/`, which is run during training. Several transforms, including `Linear_interpolation` and `BandPassFilter`, are already available in the script `transforms.py` in the same path.
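
To illustrate what resampling by linear interpolation does, here is a minimal sketch based on `numpy.interp`. It is not the repository's `Linear_interpolation` class: the function name `resample_linear` and the `(leads, samples)` array layout are assumptions made for this example.

```python
import numpy as np

def resample_linear(ecg, fs_old, fs_new):
    """Resample a (leads, samples) array from fs_old to fs_new by linear interpolation."""
    n_old = ecg.shape[1]
    n_new = int(round(n_old * fs_new / fs_old))
    t_old = np.arange(n_old) / fs_old   # Sample times of the original signal (seconds)
    t_new = np.arange(n_new) / fs_new   # Sample times of the resampled signal (seconds)
    return np.stack([np.interp(t_new, t_old, lead) for lead in ecg])

ecg = np.random.randn(12, 5000)         # Synthetic data: 12 leads, 10 s at 500 Hz
print(resample_linear(ecg, fs_old=500, fs_new=257).shape)  # (12, 2570)
```

Resampling every recording to one common sampling frequency gives the model inputs with a consistent time scale, regardless of the source database.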

### Terminal command

To run the script, use the following command:

```
python preprocess_data.py
```

<font color = red>**NOTE!** The preprocessed ECGs will have different names than the original ones, so it's important to keep track of whether the preprocessing has been done or not!</font>

--------

## 3) Splitting data for training

All the data splitting is done with the script `prepare_data.py`. The main idea of the script is to split the data into csv files, which can later be used in training and testing a model. The csv files are stored in `/data/split_csvs/`, in a directory named after the yaml file used for the split.

The csv files have the columns `path` (path to an ECG recording), `age`, `gender` and all the diagnoses, given as SNOMED CT codes, that are used as labels in classification. The main structure of the csv files is as follows:


| path  | age  | gender  | 10370003  | 111975006 | 164890007 | *other diagnoses...* |
| ------------- |-------------|-------------| ------------- |-------------|-------------|-------------|
| ./data/A0002.mat | 49.0 | Female | 0 | 0 | 1 | ... |
| ./data/A0003.mat | 81.0 | Female | 0 | 1 | 1 | ... |
| ./data/A0004.mat | 45.0 |  Male  | 1 | 0 | 0 | ... |
| ... | ... |  ...  | ... | ... | ... | ... |
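
The table above is a multi-hot encoding: each label column holds 1 if the recording has that diagnosis and 0 otherwise. The following is a minimal sketch of how one such row could be built and written. `make_row` is a hypothetical helper and not the repository's implementation.

```python
import csv
import io

def make_row(path, age, gender, dx_codes, labels):
    """Build one csv row: metadata plus a 0/1 column for every label."""
    row = {'path': path, 'age': age, 'gender': gender}
    dx = set(dx_codes)
    for label in labels:
        row[label] = 1 if label in dx else 0
    return row

labels = ['10370003', '111975006', '164890007']
rows = [make_row('./data/A0002.mat', 49.0, 'Female', ['164890007'], labels)]

buffer = io.StringIO()  # Stand-in for a real csv file
writer = csv.DictWriter(buffer, fieldnames=['path', 'age', 'gender'] + labels)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Diagnoses that are not in `labels` are simply ignored, so the chosen label list determines the columns of the csv file.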


The script uses `yaml` files to load the configurations for different splits. They are located in `/configs/data_splitting`. There are two types of yaml files based on the split type (described below), *database-wise (DBwise)* and *stratified*.

The yaml file for a <font color = green>**database-wise split**</font> is constructed like the one below (`physionet_DBwise_smoke.yaml`):

```
# STRATIFIED OR DATABASE-WISE SPLIT
stratified: False

# INITIAL SETTINGS
data_dir: './data/physionet_preprocessed_smoke/'
save_dir: './data/split_csvs/'

```

So there is a boolean variable `stratified`, which distinguishes between the two types of splits, `data_dir`, which refers to the location from which the data is loaded, and `save_dir`, which refers to the location in which the csv files of the splits are saved.

The yaml file for a <font color = green>**stratified split**</font> is constructed like the one below (`physionet_stratified_smoke.yaml`):

```
# STRATIFIED OR DATABASE-WISE SPLIT
stratified: True

# INITIAL SETTINGS
data_dir: './data/physionet_preprocessed_smoke/'
save_dir: './data/split_csvs/'

splits:
    - split_1:
      train: ['G12EC', 'INCART', 'PTB_PTBXL', 'ChapmanShaoxing_Ningbo']
      test: 'CPSC_CPSC-Extra'
```

Here the variables `stratified`, `data_dir` and `save_dir` are set in the same way as in a database-wise yaml file. In addition, there is `splits`, a dictionary of training-testing splits. It makes it possible to create several different splits of the data in one run, so there's no need to run multiple yaml files separately. The train data is further stratified into training and validation splits, which are both saved in csv files. 

To perform the data splits, the class labels need to be set in the script. Use the attribute `labels` for that.

<font color = red>**NOTE!**</font> <font color = green> **Here, the** `data_dir` **attribute is set with the assumption that *the data is preprocessed*. If that's not the case, you should use the original data directory instead, such as** `./data/physionet_data_smoke/`. The paths for ECGs in the csv files will differ depending on whether the data is preprocessed or not.</font>

The splitting itself can be done in two ways:

1) **Database-wise**. Above, we extracted data in the following way 

   * CPSC Database and CPSC-Extra Database
   * St. Petersburg (INCART) Database
   * PTB and PTB-XL Database
   * The Georgia 12-lead ECG Challenge (G12EC) Database
   * Chapman-Shaoxing and Ningbo Database
    
Now we can use this structure as a baseline for the data split. The function `read_split_DB(data_directory, save_directory, labels)` uses this structure and creates csv files based on it. The parameter `data_directory` refers to the location of the data, `save_directory` refers to the location where the csv files will be saved, and `labels` refers to the list of SNOMED CT coded labels which will be used in the classification. The csv files are named after the directories from which they were created, e.g., the csv file of the CPSC Database and CPSC-Extra Database is named `CPSC_CPSC-Extra.csv`.

We can use this structure when creating yaml files for training and testing. However, if we need, for example, to train a model using the first four sources in the list and to test it using only the Chapman-Shaoxing database, we need to create combined yaml files for the training phase. In training, the model is given only one csv file from which it reads which ECGs to use. The other csv files, which contain ECGs from different databases, are made in the notebook [Yaml files of Database-wise Split for Training and Prediction](2_physionet_DBwise_yaml_files.ipynb) when the training and testing csv files are created.

2) **Stratified**. The function `read_split_stratified(data_directory, save_directory, labels, train_val_splits)` does the stratified split. The parameters are similar to those of the function `read_split_DB`, but there is also `train_val_splits`, which refers to the dictionary of the splits to be performed.

Stratification is performed by the multilabel cross validator `MultilabelStratifiedShuffleSplit(n_splits, test_size, train_size, random_state)` from the `iterative-stratification` package. The script sets `n_splits` to the number of training databases (in the yaml file above it is *4*, as the data is gathered from 'G12EC', 'INCART', 'PTB_PTBXL' and 'ChapmanShaoxing_Ningbo'). *`n_splits` must always be at least 2!* More information about this and other multilabel cross validators is available in [the GitHub repository of iterative-stratification](https://github.com/trent-b/iterative-stratification).

### Terminal commands

The terminal command to perform the database-wise data splitting is:

```
python prepare_data.py physionet_DBwise_smoke.yaml
```

The terminal command to perform the stratified data splitting is:

```
python prepare_data.py physionet_stratified_smoke.yaml
```

-------------

## Example: Smoke testing

*All the data files for smoke testing are available in the repository.*

First, we want to preprocess the data. We make sure that the script `preprocess_data.py` has the original and new directories set as follows

```
from_directory = os.path.join(os.getcwd(), 'data', 'physionet_data_smoke')
new_directory = os.path.join(os.getcwd(), 'data', 'physionet_preprocessed_smoke')
```

So the attribute `from_directory` refers to the directory where the original data is located, and the attribute `new_directory` where the preprocessed data is saved. Now we can perform preprocessing with the following command:

```
python preprocess_data.py
```

Now that the data is preprocessed, we can move on to the splitting part. The script `prepare_data.py` should have the attribute `labels` set as

```
labels = ['426783006', '426177001', '164934002', '427084000', '164890007', '39732003', '164889003', '59931005', '427393009', '270492004']
```

So the class labels used are the 10 most common class labels in the whole Physionet Challenge dataset (see [Exploration of the PhysioNet2021 Data](#exploration_physionet2021_data)). The class labels in more detail are as follows:

| Name | SNOMED CT code | Total number of diagnoses <br>in the whole data |
| ---- | -------------- | ----------------------------------------------- |
| sinus rhythm | 426783006 | 28971 |
| sinus bradycardia | 426177001 | 18918 |
| t wave abnormal | 164934002 | 11716 |
| sinus tachycardia | 427084000 | 9657 |
| atrial flutter | 164890007 | 8374 |
| left axis deviation | 39732003 | 7631 |
| atrial fibrillation | 164889003 | 5255 |
| t wave inversion | 59931005 | 3989 |
| sinus arrhythmia | 427393009 | 3790 |
| 1st degree av block | 270492004 | 3534 |

#### Database-wise split

The yaml file for the database-wise split is presented above, `physionet_DBwise_smoke.yaml`.

Simply, the split is made with the command

```
python prepare_data.py physionet_DBwise_smoke.yaml
```

The csv files are then located in `./data/split_csvs/physionet_DBwise_smoke/` as

```
ChapmanShaoxing_Ningbo.csv
CPSC_CPSC-Extra.csv
G12EC.csv
INCART.csv
PTB_PTBXL.csv
```
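
If a model is to be trained on several databases at once, the per-database csv files listed above have to be merged into one file, since training reads a single csv file. The following is a minimal sketch of such a merge, assuming all the csv files share the same header. `combine_csvs` is a hypothetical helper and not the code used in the repository's notebooks.

```python
import csv

def combine_csvs(csv_paths, out_path):
    """Concatenate csv files that share the same header into one file."""
    header = None
    with open(out_path, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        for path in csv_paths:
            with open(path, newline='') as in_file:
                reader = csv.reader(in_file)
                current = next(reader)  # First row is the header
                if header is None:
                    header = current
                    writer.writerow(header)
                elif current != header:
                    raise ValueError('Header mismatch in {}'.format(path))
                writer.writerows(reader)
```

The header check guards against merging csv files that were created with different label lists, which would silently misalign the label columns.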

#### Stratified split

The yaml for the stratified split is presented above, `physionet_stratified_smoke.yaml`. There is one split which is made by running the file.

- Train data is from the directories *G12EC, INCART, PTB_PTBXL* and *ChapmanShaoxing_Ningbo*
- Test data is from the directory *CPSC_CPSC-Extra*.

The command is the following

```
python prepare_data.py physionet_stratified_smoke.yaml
```

The csv files are then located in `./data/split_csvs/physionet_stratified_smoke/` and you should now find the following files in the path:

```
test_split0.csv
train_split0_0.csv
train_split0_1.csv
train_split0_2.csv
train_split0_3.csv
val_split0_0.csv
val_split0_1.csv
val_split0_2.csv
val_split0_3.csv
```
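
To check that a run produced every expected file, the file names can be generated and compared with the directory contents. The sketch below assumes the naming pattern inferred from the listing above (one test file per split, plus one train and one validation file per stratified fold). The function and parameter names are hypothetical.

```python
def expected_split_files(n_test_splits, n_splits):
    """List csv names following the pattern seen in the listing above."""
    names = []
    for i in range(n_test_splits):
        names.append('test_split{}.csv'.format(i))
        for j in range(n_splits):
            names.append('train_split{}_{}.csv'.format(i, j))
            names.append('val_split{}_{}.csv'.format(i, j))
    return sorted(names)

print(expected_split_files(n_test_splits=1, n_splits=4))
```

Comparing this list with `sorted(os.listdir(...))` of the output directory would reveal any missing split.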