# Data Preprocessing

This notebook outlines my approach to feature engineering and to ensuring a high-quality dataset for model training. I treat the problem as a binary classification task: the model should distinguish Normal ECGs from Abnormal ECGs, with labels derived from high-likelihood, unambiguous SCP codes. The features are the unprocessed ECG signals, which preserves all the information inherent in the signal.

#### Imports

In [1]:
from os.path import join

import numpy as np
import pandas as pd

pd.set_option("display.max_columns", None)

In [2]:
data_path = "../data/"
sampling_rate = 100

In [3]:
Y = pd.read_csv(join(data_path, "processed/ecg_metadata.csv"), index_col="ecg_id")
print(f"Number of ECGs: {Y.shape[0]:,}; Number of patients: {Y.patient_id.nunique():,}")

Number of ECGs: 18,793; Number of patients: 16,563


#### Problematic ECGs

While the presence of extra beats is noteworthy, it does not necessarily indicate an unhealthy heart, particularly in asymptomatic individuals with normal heart structure. For the purpose of this analysis, I have chosen to exclude such ECGs to focus on more indicative cardiac events. Additionally, I have removed ECGs that exhibit recording issues to ensure the quality and reliability of the dataset.

In [4]:
# Remove ECGs with problems
Y = Y.loc[Y.electrodes_problems.isna()].drop(columns="electrodes_problems")
Y = Y.loc[Y.baseline_drift.isna()].drop(columns="baseline_drift")
Y = Y.loc[Y.static_noise.isna()].drop(columns="static_noise")
Y = Y.loc[Y.burst_noise.isna()].drop(columns="burst_noise")
Y = Y.loc[Y.extra_beats.isna()].drop(columns="extra_beats")
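The five filters above can also be written as a single mask. The following is a minimal sketch on a toy frame, not the notebook's actual data: the column names match the metadata used here, while the values and the names `toy`, `flag_columns`, and `toy_clean` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy metadata standing in for Y; column names follow the notebook, values are invented.
toy = pd.DataFrame(
    {
        "patient_id": [1, 2, 3, 4],
        "electrodes_problems": [np.nan, "V1", np.nan, np.nan],
        "baseline_drift": [np.nan, np.nan, "II", np.nan],
        "static_noise": [np.nan] * 4,
        "burst_noise": [np.nan] * 4,
        "extra_beats": [np.nan, np.nan, np.nan, "1ES"],
    },
    index=pd.Index([10, 11, 12, 13], name="ecg_id"),
)

flag_columns = ["electrodes_problems", "baseline_drift", "static_noise", "burst_noise", "extra_beats"]

# Keep only ECGs where every flag column is empty, then drop the flag columns.
clean_mask = toy[flag_columns].isna().all(axis=1)
toy_clean = toy.loc[clean_mask].drop(columns=flag_columns)

print(toy_clean)
```

Only the ECG with no flag set in any of the five columns survives, which is the same result the chained filters produce.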

Since the presence of a pacemaker can significantly alter a patient’s ECG, including such patients could introduce biases into the analysis. Therefore, I have excluded patients with pacemakers from the dataset. Additionally, I will focus exclusively on patients aged 18 or older, as ECG characteristics can differ considerably between children and adults.

In [5]:
Y = Y.loc[Y.pacemaker.isna()].drop(columns="pacemaker")
Y = Y.loc[Y.age >= 18.0]

In [6]:
print(f"Number of ECGs: {Y.shape[0]:,}; Number of patients: {Y.patient_id.nunique():,}")

Number of ECGs: 13,245; Number of patients: 12,201


#### Multilabel ECGs

Some ECGs have a high likelihood for both Normal and Abnormal labels. I exclude these ECGs to eliminate ambiguity.

In [7]:
# Raw string avoids an invalid "\|" escape; matches rows that contain both "|" and "NORM"
multilabel_ecgs = Y["diagnostic_class"].str.contains(r"(?=.*\|)(?=.*NORM)")
Y = Y.loc[~multilabel_ecgs]
print(f"# Ambiguous ECGs: {multilabel_ecgs.sum():,}")

# Ambiguous ECGs: 248
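To make the pattern easier to read, here is a small sketch of how the two lookaheads behave. It assumes, as the pattern implies, that `diagnostic_class` joins multiple classes with `|`; the class strings below are made-up examples rather than rows from the dataset.

```python
import pandas as pd

# Invented examples of single-class and multi-class entries.
classes = pd.Series(["NORM", "MI", "NORM|STTC", "MI|HYP", "CD|NORM"])

# (?=.*\|)   -> the string contains a "|" somewhere (more than one class)
# (?=.*NORM) -> the string contains "NORM" somewhere
pattern = r"(?=.*\|)(?=.*NORM)"
is_ambiguous = classes.str.contains(pattern)

print(classes[is_ambiguous].tolist())
```

Entries with NORM alone, or with several abnormal classes and no NORM, are kept; only entries that combine NORM with another class are flagged.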


In [8]:
print(f"Number of ECGs: {Y.shape[0]:,}; Number of patients: {Y.patient_id.nunique():,}")

Number of ECGs: 12,997; Number of patients: 11,966


#### Normal ECGs with contradicting features

I removed normal ECGs that displayed contradicting features, such as infarction signs and abnormal heart axis, as these inconsistencies could compromise the accuracy of the labels.

In [9]:
problematic_patients = (Y.diagnostic_class == "NORM") & (
    ~Y["heart_axis"].isin(["MID", np.nan]) | ~Y["infarction_stadium1"].isna() | ~Y["infarction_stadium2"].isna()
)
Y = Y.loc[~problematic_patients]
print(f"# Contradicting ECGs: {problematic_patients.sum():,}")

# Contradicting ECGs: 674


In [10]:
print(f"Number of ECGs: {Y.shape[0]:,}; Number of patients: {Y.patient_id.nunique():,}")

Number of ECGs: 12,323; Number of patients: 11,354
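The contradiction filter can be checked on a few hand-written rows. This is a sketch under stated assumptions: the column names follow the metadata, but the rows, the axis and stadium values, and the intermediate names (`is_normal`, `axis_ok`, `no_infarction`) are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows covering the cases the filter distinguishes.
toy = pd.DataFrame(
    {
        "diagnostic_class": ["NORM", "NORM", "NORM", "MI", "NORM"],
        "heart_axis": ["MID", "LAD", np.nan, "LAD", "MID"],
        "infarction_stadium1": [np.nan, np.nan, np.nan, "Stadium I", "unknown"],
        "infarction_stadium2": [np.nan] * 5,
    }
)

is_normal = toy["diagnostic_class"] == "NORM"
# A missing or "MID" axis is treated as compatible with a normal ECG.
axis_ok = toy["heart_axis"].isna() | (toy["heart_axis"] == "MID")
no_infarction = toy["infarction_stadium1"].isna() & toy["infarction_stadium2"].isna()

# A normal ECG contradicts its label if either condition fails.
contradicting = is_normal & ~(axis_ok & no_infarction)

print(f"# Contradicting ECGs: {contradicting.sum():,}")
```

Abnormal ECGs are never flagged, whatever their axis or infarction fields contain, because the filter only applies to rows labeled NORM.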


#### Data Labeling

Given the time constraints and my focus on a purely ECG-based approach for the model features, I’ve decided not to include age and sex in the initial feature set. However, I will assess the model’s performance across different age and sex subsets to evaluate the consistency of the results.
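One possible way to run such a subgroup check is sketched below. Everything except the column names `age`, `sex`, and `LABEL` is an assumption: the `PRED` column, the example values, and the age bands are hypothetical and stand in for real model outputs.

```python
import pandas as pd

# Hypothetical evaluation frame; PRED is a made-up model output.
results = pd.DataFrame(
    {
        "age": [25, 34, 47, 58, 63, 71, 80, 45],
        "sex": [0, 1, 0, 1, 0, 1, 0, 1],
        "LABEL": [0, 0, 1, 1, 1, 0, 1, 0],
        "PRED": [0, 1, 1, 1, 0, 0, 1, 0],
    }
)
results["correct"] = results["LABEL"] == results["PRED"]

# Age bands are an arbitrary choice for this sketch: [18, 40), [40, 65), [65, 120).
results["age_band"] = pd.cut(results["age"], bins=[18, 40, 65, 120], right=False)

# Accuracy within each subgroup; large gaps between groups would signal inconsistency.
accuracy_by_sex = results.groupby("sex")["correct"].mean()
accuracy_by_age = results.groupby("age_band", observed=True)["correct"].mean()

print(accuracy_by_sex)
print(accuracy_by_age)
```

Accuracy is used here only because it is simple to read; the same grouping works with any per-ECG metric.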

In [11]:
Y["LABEL"] = np.where(Y["diagnostic_class"] == "NORM", 0, 1)
Y["LABEL"].value_counts(normalize=True)

LABEL
1    0.533555
0    0.466445
Name: proportion, dtype: float64

The labels are fairly balanced, with slightly more ECGs in the abnormal class. Given the minor imbalance, I do not consider this to be a significant issue.

In [12]:
Y = Y.drop(columns=["heart_axis", "infarction_stadium1", "infarction_stadium2"])
Y.head(3)

ecg_id,patient_id,age,sex,strat_fold,filename_lr,diagnostic_class,LABEL
2,13243,19.0,0,2,records100/00000/00002_lr,NORM,0
3,20372,37.0,1,5,records100/00000/00003_lr,NORM,0
10,9456,22.0,1,9,records100/00000/00010_lr,NORM,0


In [13]:
Y.to_csv(join(data_path, "processed/labels.csv"))