# Icentia11k Dataset

> About the Icentia 11k dataset (size, classes) & train/test splits 

In [None]:
import pandas as pd
from sklearn.model_selection import StratifiedKFold

## Overview of Icentia11k

### Purpose

Icentia11k is (as of 2019) the largest public dataset of continuous raw ECG signals, containing recordings from 11 thousand patients and 2 billion labelled beats.
The dataset is intended to support semi-supervised ECG models and the discovery of unknown arrhythmia subtypes and anomalous ECG signal events.

### Collection Methods

The ECG recordings were obtained using a single-lead portable ECG device ([CardioSTAT](https://www.cardiostat.com/) by [Icentia](https://www.icentia.com/)).
Patients wore the device for between 3 and 14 days.

### Dataset Statistics

The sample rate is 250 Hz. The total dataset size is 271.27 GB.

### Patient demographics

| Attribute | Characteristic |
|--|--|
| Age | Average $62.2 \pm 17.4$ years of age |
| Sex | $42.6 \%$ male, $45.3 \%$ female, $12.2 \%$ unknown |

### Privacy

Patients are identified only by a random integer ID.

### Authors

The dataset was published by Tan et al. 2019. For more information, refer to the paper [Icentia11k: An Unsupervised ECG Representation Learning Dataset for Arrhythmia Subtype Discovery](https://arxiv.org/pdf/1910.09570v1) and to the dataset hosted at [PhysioNet](https://physionet.org/content/icentia11k-continuous-ecg/1.0/).

## We Use Only the Fully-Labelled Subset

The dataset consists of ECG recordings for each patient.
Patient IDs 9,000 - 10,999 contain fully labelled recordings.

> The authors originally intended the data from patients 9,000 - 10,999 to be the evaluation subset for unsupervised models trained on the partially labelled recordings of patient IDs <9000.
For our supervised ECG classification task, however, the intended 'test' set alone contains enough data to produce a substantial train/test split.
The trade-off is that we cannot compare our work to the baseline model performance reported in Tan et al. 2019.

## The Data is Hierarchical




A recording for a given patient is not continuous; instead, the authors randomly selected 50 segments of roughly 70 minutes each per patient.

> Note that not all patients have 50 segments of data. For example, p9894 only has 40 segments (refer to the [dataset files on PhysioNet](https://physionet.org/files/icentia11k-continuous-ecg/1.0/p09/p09894/)).

```shell
p##### # <- patient ##### (9,000 to 10,999)
    s## # <- segment ## (0 - 49)
        p#####.atr # <- Attributes/annotations for ECG recording
        p#####.hea # <- Header/metadata for ECG recording
        p#####.dat # <- ECG recording data
```

A segment is large (1,048,577 samples, which is about 70 minutes at 250 Hz), so we split each segment into a number of frames of a fixed window size. The data is therefore hierarchical:

```shell
Patient i
    Segment j
        Frame k
```
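As a rough sketch of the framing step described above, the snippet below cuts one segment into non-overlapping frames with NumPy. The window size of 2,048 samples and the helper name `split_into_frames` are hypothetical choices for illustration, and the segment is a synthetic array of zeros rather than real ECG data.

```python
import numpy as np

SAMPLE_RATE_HZ = 250             # sample rate stated in the dataset statistics
SAMPLES_PER_SEGMENT = 1_048_577  # segment length stated in the text
FRAME_SIZE = 2_048               # hypothetical window size, not from the source


def split_into_frames(signal: np.ndarray, frame_size: int) -> np.ndarray:
    """Split a 1-D signal into non-overlapping frames, dropping any remainder."""
    n_frames = len(signal) // frame_size
    return signal[: n_frames * frame_size].reshape(n_frames, frame_size)


# Synthetic stand-in for one segment; real samples would come from a .dat file.
segment = np.zeros(SAMPLES_PER_SEGMENT, dtype=np.float32)
frames = split_into_frames(segment, FRAME_SIZE)

segment_minutes = SAMPLES_PER_SEGMENT / SAMPLE_RATE_HZ / 60
frame_seconds = FRAME_SIZE / SAMPLE_RATE_HZ

print(frames.shape)       # (number of frames, samples per frame)
print(segment_minutes)    # segment duration in minutes
print(frame_seconds)      # frame duration in seconds
```

Dropping the trailing remainder keeps every frame the same length; padding or overlapping windows would be alternative designs.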



### Implications for Model Building => Grouped K-Fold Cross Validation

When performing cross-validation, we need to ensure that frames are split into training and validation sets according to **patients**. This is because frames from the same segment are correlated (e.g. temporally), and segments from the same patient are correlated (because the same patient produced them). If frames from one patient appeared in both sets, the validation score would overestimate performance on unseen patients.

In other words, we need to perform repeated **Grouped K-fold Cross Validation**, where we group by patients.
We can use [GroupKFold - Scikit-Learn](https://scikit-learn.org/stable/modules/cross_validation.html#group-k-fold) for this.
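As a minimal sketch of the patient-grouped split described above, the example below runs scikit-learn's `GroupKFold` on synthetic data. The patient IDs, frame counts, and feature matrix are made-up placeholders, not values from Icentia11k.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Synthetic stand-in: 6 hypothetical patients with 20 frames each.
n_patients, frames_per_patient, n_features = 6, 20, 8
X = rng.normal(size=(n_patients * frames_per_patient, n_features))
y = rng.integers(0, 3, size=len(X))
patient_ids = np.repeat(np.arange(9000, 9000 + n_patients), frames_per_patient)

cv = GroupKFold(n_splits=3)
folds = list(cv.split(X, y, groups=patient_ids))

for fold, (train_idx, val_idx) in enumerate(folds):
    train_patients = set(patient_ids[train_idx])
    val_patients = set(patient_ids[val_idx])
    # No patient may contribute frames to both sides of the split.
    assert train_patients.isdisjoint(val_patients)
    print(fold, sorted(val_patients))
```

Note that `GroupKFold` produces a single K-fold partition; repeating the procedure with different partitions would need extra handling that is not shown here.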

## Class Imbalance

From Table 2 in [Tan et al. 2019](https://www.cinc.org/2021/Program/accepted/229_Preprint.pdf)

Beat labels in the evaluation set

| Beat labels | Count | Proportion |
|---|---|---|
| Normal                                | 174,249   | 0.6271 |
| Premature Atrial Contractions         | 58,780    | 0.2115 |
| Premature Ventricular Contractions    | 44,835    | 0.1614 |

Rhythm labels in the evaluation set

| Rhythm Labels | Count | Proportion |
|---|---|---|
| NSR (Normal Sinus Rhythm)  |   261,377 | 0.941 |
| AFib (Atrial Fibrillation) |   13,056  | 0.047 |
| AFlutter (Atrial Flutter)  |   3,330   | 0.012 |
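The proportions in the beat-label table can be reproduced from the counts with a few lines of Python. The snippet also derives inverse-frequency class weights using the `n_samples / (n_classes * count)` heuristic; applying such weights is only one possible way to handle the imbalance and is an assumption here, not something the source prescribes.

```python
# Beat-label counts copied from the table above.
beat_counts = {
    "Normal": 174_249,
    "Premature Atrial Contractions": 58_780,
    "Premature Ventricular Contractions": 44_835,
}

total = sum(beat_counts.values())
n_classes = len(beat_counts)

# Share of each class in the evaluation set.
proportions = {label: count / total for label, count in beat_counts.items()}

# Inverse-frequency weights: rarer classes receive larger weights.
class_weights = {
    label: total / (n_classes * count) for label, count in beat_counts.items()
}

for label in beat_counts:
    print(f"{label}: proportion={proportions[label]:.4f}, "
          f"weight={class_weights[label]:.3f}")
```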


### Implications for Modelling => Stratified, Group K-Fold AND/OR Oversampling

SMOTE (Synthetic Minority Over-sampling Technique)

[StratifiedGroupKFold - Scikit-Learn](https://scikit-learn.org/stable/modules/cross_validation.html#stratifiedgroupkfold)

## Some Descriptive Stats

Size

Structure of data points

Class Imbalance

## Train/Test Splits

Different patients should be in the train and test splits.

ECG signals from the same patient are correlated, so we shouldn't randomly shuffle segments.
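One way to hold out whole patients for a test set is scikit-learn's `GroupShuffleSplit`, sketched below on synthetic data. The patient count, the frames per patient, and the 20% test fraction are illustrative assumptions only.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in: 10 hypothetical patients with 5 frames each.
n_patients, frames_per_patient = 10, 5
patient_ids = np.repeat(np.arange(9000, 9000 + n_patients), frames_per_patient)
X = np.arange(len(patient_ids)).reshape(-1, 1)

# test_size is a fraction of the groups (patients), not of the frames.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=patient_ids))

train_patients = set(patient_ids[train_idx])
test_patients = set(patient_ids[test_idx])

print(sorted(test_patients))
print(train_patients.isdisjoint(test_patients))
```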

1. Take one frame from each patient


[MathWorks: Classify ECG Signal](https://www.mathworks.com/help/signal/ug/classify-ecg-signals-using-long-short-term-memory-networks.html) used random shuffling, because they had one 9000-sample long window of ECG signal for each patient. Because each signal came from a different patient, the signals were not temporally correlated.

## Evaluating Several Different Classifiers


### Implications for Modelling

[Calibrating a classifier - Scikit-Learn](https://scikit-learn.org/stable/modules/calibration.html#calibrating-a-classifier)