## Introduction

Atrial fibrillation (AF) can cause significant symptoms; impair functional status, hemodynamics, and quality of life; and increase the risk of stroke and death.

The diagnosis of AF is often based on a 12-lead electrocardiogram (ECG) characterized by the absence of discrete P waves and an irregularly irregular ventricular rate. In most patients, a single ECG is sufficient to secure the diagnosis, assuming the patient is in AF at the time of the ECG. In some patients, AF is diagnosed using a heart rhythm recording via cardiac telemetry, Holter monitors, implantable loop recorders, or event recorders. In this project, we will build a machine learning model to detect AF in patients.

## Data Source

[coorteeqsrafva.csv](https://www.kaggle.com/arjunascagnetto/ptbxl-atrial-fibrillation-detection): This is a subset of PTB-XL, a large publicly available electrocardiography dataset, found on Kaggle. The dataset includes three ECG rhythms in the ritmi column: normal sinus rhythm (SR), atrial fibrillation (AF), and all other arrhythmias (VA).

[ecgeq-500hzsrfava.npy](https://www.kaggle.com/arjunascagnetto/ptbxl-atrial-fibrillation-detection): This NumPy file contains the 12-lead ECGs of the patients in the 'coorteeqsrafva' file. These are the ECG recordings (6428, matching the row count of the CSV file), selected so that each has one and only one of these conditions:

* Sinus Rhythm (SR). The condition of a normal ECG.
* Atrial Fibrillation (AF). The condition of having the specific arrhythmia of atrial fibrillation.
* Various Arrhythmia (VA). The condition of having one of the possible other types of arrhythmia.

*Note: Since I had no prior knowledge of ECG leads, I did some research to understand them better. According to [ECG & ECHO Learning](https://ecgwaves.com/topic/ekg-ecg-leads-electrodes-systems-limb-chest-precordial/), an ECG lead is a graphical description of the electrical activity of the heart. Each lead is computed by analysing the electrical currents detected by several electrodes. The standard ECG, referred to as a 12-lead ECG because it includes 12 leads, is obtained using 10 electrodes. These 12 leads consist of two sets of ECG leads: limb leads and chest leads.*
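
To make the idea of a lead being computed from several electrodes concrete, here is a minimal sketch of the standard textbook relations between the limb leads (Einthoven's law and the augmented-lead formulas). The function name `derive_limb_leads` and the toy voltages are illustrative assumptions, not part of the dataset or the notebook.

```python
import numpy as np

def derive_limb_leads(lead_i, lead_ii):
    """Derive leads III, aVR, aVL, and aVF from leads I and II.

    Uses the standard textbook relations:
    III = II - I, aVR = -(I + II) / 2, aVL = I - II / 2, aVF = II - I / 2.
    """
    lead_i = np.asarray(lead_i, dtype=float)
    lead_ii = np.asarray(lead_ii, dtype=float)
    return {
        "III": lead_ii - lead_i,
        "aVR": -(lead_i + lead_ii) / 2,
        "aVL": lead_i - lead_ii / 2,
        "aVF": lead_ii - lead_i / 2,
    }

# Made-up voltages for two time samples (illustrative only)
toy = derive_limb_leads([0.10, 0.20], [0.30, 0.50])
for name, values in toy.items():
    print(name, values)
```

One consequence of these relations is that the six limb leads carry only two independent signals, while the six chest leads are each measured from their own electrode.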

## Data Wrangling

In [1]:
# import essential libraries
import numpy as np
import pandas as pd

In [2]:
# read in csv file
afib_df = pd.read_csv('../../../data/afib_data/coorteeqsrafva.csv', sep=';', header=0, index_col=0)
afib_df.head()

Unnamed: 0,diagnosi,ecg_id,ritmi,patient_id,age,sex,height,weight,nurse,site,...,validated_by_human,baseline_drift,static_noise,burst_noise,electrodes_problems,extra_beats,pacemaker,strat_fold,filename_lr,filename_hr
0,STACH,10900,VA,15654.0,54.0,0,,,0.0,0.0,...,False,,,,,,,6,records100/10000/10900_lr,records500/10000/10900_hr
1,AFLT,10900,AF,15654.0,54.0,0,,,0.0,0.0,...,False,,,,,,,6,records100/10000/10900_lr,records500/10000/10900_hr
2,SR,8209,SR,12281.0,55.0,0,,,1.0,2.0,...,True,,,,,,,10,records100/08000/08209_lr,records500/08000/08209_hr
3,STACH,17620,VA,2007.0,29.0,1,164.0,56.0,7.0,1.0,...,True,,,,,,,1,records100/17000/17620_lr,records500/17000/17620_hr
4,SBRAD,12967,VA,8685.0,57.0,0,,,0.0,0.0,...,False,,", I-AVR,",,,,,1,records100/12000/12967_lr,records500/12000/12967_hr


In [3]:
# rows and columns
afib_df.shape

(6428, 30)

In [4]:
print('Normal (SR) has a total of {} rows'.format(afib_df.loc[afib_df['ritmi'] == 'SR'].shape[0]))
print('Atrial Fibrillation (AF) has a total of {} rows'.format(afib_df.loc[afib_df['ritmi'] == 'AF'].shape[0]))
print('Other arrhythmia (VA) has a total of {} rows'.format(afib_df.loc[afib_df['ritmi'] == 'VA'].shape[0]))

Normal (SR) has a total of 2000 rows
Atrial Fibrillation (AF) has a total of 1587 rows
Other arrhythmia (VA) has a total of 2841 rows
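
The three filters above can also be written as a single `value_counts` call, which additionally gives the class proportions. This is a minimal sketch on a toy frame; `toy_df` is a hypothetical stand-in for `afib_df`.

```python
import pandas as pd

# Hypothetical stand-in for afib_df with only the label column
toy_df = pd.DataFrame({"ritmi": ["SR", "AF", "VA", "VA", "SR", "VA"]})

# Absolute number of rows per rhythm class
counts = toy_df["ritmi"].value_counts()

# Share of each class, useful for judging class imbalance
shares = toy_df["ritmi"].value_counts(normalize=True).round(3)

print(counts)
print(shares)
```

On the real data the same two lines would be applied to `afib_df['ritmi']`.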


We need to convert these categorical values into numeric values: 0 for SR, 1 for AF, and 2 for VA.

In [5]:
# dictionary to hold values 
num_di = {'SR': 0, 'AF': 1, 'VA': 2}

# replace SR with 0, AF with 1, VA with 2
afib_df = afib_df.replace({'ritmi': num_di})
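
As an alternative to `replace`, `Series.map` makes unexpected labels visible: any value missing from the dictionary becomes `NaN` instead of passing through unchanged. A minimal sketch, where `toy_df` and the label `'XX'` are made up for illustration:

```python
import pandas as pd

num_di = {"SR": 0, "AF": 1, "VA": 2}

# Hypothetical frame containing one label that is not in the dictionary
toy_df = pd.DataFrame({"ritmi": ["SR", "AF", "VA", "XX"]})

# map() returns NaN for every value that has no entry in num_di
encoded = toy_df["ritmi"].map(num_di)

# Collect the original labels that could not be encoded
unmapped = toy_df.loc[encoded.isna(), "ritmi"].unique().tolist()
print(unmapped)
```

If `unmapped` is empty, the encoded column can be cast to an integer type with `astype(int)`.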

In [6]:
# check afib_df
afib_df

Unnamed: 0,diagnosi,ecg_id,ritmi,patient_id,age,sex,height,weight,nurse,site,...,validated_by_human,baseline_drift,static_noise,burst_noise,electrodes_problems,extra_beats,pacemaker,strat_fold,filename_lr,filename_hr
0,STACH,10900,2,15654.0,54.0,0,,,0.0,0.0,...,False,,,,,,,6,records100/10000/10900_lr,records500/10000/10900_hr
1,AFLT,10900,1,15654.0,54.0,0,,,0.0,0.0,...,False,,,,,,,6,records100/10000/10900_lr,records500/10000/10900_hr
2,SR,8209,0,12281.0,55.0,0,,,1.0,2.0,...,True,,,,,,,10,records100/08000/08209_lr,records500/08000/08209_hr
3,STACH,17620,2,2007.0,29.0,1,164.0,56.0,7.0,1.0,...,True,,,,,,,1,records100/17000/17620_lr,records500/17000/17620_hr
4,SBRAD,12967,2,8685.0,57.0,0,,,0.0,0.0,...,False,,", I-AVR,",,,,,1,records100/12000/12967_lr,records500/12000/12967_hr
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6423,SARRH,4131,2,3829.0,81.0,0,178.0,70.0,11.0,1.0,...,True,,,,,,,4,records100/04000/04131_lr,records500/04000/04131_hr
6424,STACH,18644,2,3866.0,88.0,0,152.0,45.0,11.0,1.0,...,True,"v3,",,,,2ES,,10,records100/18000/18644_lr,records500/18000/18644_hr
6425,SR,3693,0,17345.0,83.0,1,,,1.0,2.0,...,True,,", I-AVR,",,,,,5,records100/03000/03693_lr,records500/03000/03693_hr
6426,AFIB,1039,1,6038.0,75.0,1,177.0,80.0,,34.0,...,True,,,,,2ES,,7,records100/01000/01039_lr,records500/01000/01039_hr


The dataframe consists of **6428 rows** and **30 columns**.

In [7]:
# print info for each column
afib_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6428 entries, 0 to 6427
Data columns (total 30 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   diagnosi                      6428 non-null   object 
 1   ecg_id                        6428 non-null   int64  
 2   ritmi                         6428 non-null   int64  
 3   patient_id                    6428 non-null   float64
 4   age                           6394 non-null   float64
 5   sex                           6428 non-null   int64  
 6   height                        1866 non-null   float64
 7   weight                        2428 non-null   float64
 8   nurse                         6097 non-null   float64
 9   site                          6423 non-null   float64
 10  device                        6428 non-null   object 
 11  recording_date                6428 non-null   object 
 12  report                        6428 non-null   object 
 13  scp

These columns have null values: **age, height, weight, nurse, site, heart_axis, infarction_stadium1, infarction_stadium2, validated_by, baseline_drift, static_noise, burst_noise, electrodes_problems, extra_beats, pacemaker**. We might not want to drop every row with a null value, since that would reduce our data points. Instead, we can replace the null values in the quantitative columns with the mean or the median during the feature engineering phase. We will also drop the columns that do not provide any significant insights in the EDA.
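
A minimal sketch of the median imputation mentioned above, run on a toy frame. `toy_df` and its values are made up; on the real data the same calls would be applied to the quantitative columns of `afib_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the quantitative columns of afib_df
toy_df = pd.DataFrame({
    "age": [54.0, np.nan, 29.0, 57.0],
    "height": [np.nan, 164.0, np.nan, 178.0],
    "weight": [np.nan, 56.0, 70.0, np.nan],
})

# Fraction of missing values per column, to decide whether imputing is sensible
null_share = toy_df.isna().mean()

# Column-wise medians (NaN values are skipped by default)
medians = toy_df.median()

# Fill each column's gaps with that column's median
imputed = toy_df.fillna(medians)

print(null_share)
print(imputed)
```

Checking `null_share` first matters here: height is non-null in only 1866 of the 6428 rows, so an imputed height column would consist mostly of one repeated value.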

In [8]:
# get descriptive statistics for quantitative columns
afib_df.describe()

Unnamed: 0,ecg_id,ritmi,patient_id,age,sex,height,weight,nurse,site,validated_by,strat_fold
count,6428.0,6428.0,6428.0,6394.0,6428.0,1866.0,2428.0,6097.0,6423.0,3676.0,6428.0
mean,11394.336341,1.130834,11597.60252,61.740069,0.478376,166.796356,69.841845,2.093817,1.478281,0.753536,5.525047
std,6239.52046,0.857968,6248.076594,17.739252,0.499571,10.249504,16.795521,3.124924,3.891928,1.075984,2.871204
min,2.0,0.0,304.0,4.0,0.0,95.0,5.0,0.0,0.0,0.0,1.0
25%,6112.25,0.0,6489.25,52.0,0.0,160.0,58.0,0.0,0.0,0.0,3.0
50%,11550.5,1.0,11976.0,64.0,0.0,167.0,69.0,1.0,1.0,1.0,6.0
75%,16785.5,2.0,16958.0,75.0,1.0,174.0,79.0,2.0,2.0,1.0,8.0
max,21833.0,2.0,21792.0,95.0,1.0,195.0,210.0,11.0,49.0,10.0,10.0


We can exclude ecg_id, ritmi, patient_id, sex, nurse, site, validated_by, and strat_fold, since these columns are identifiers or categorical codes whose descriptive statistics tell us little.

In [9]:
# get descriptive statistics for age, height, and weight columns
afib_df[['age', 'height', 'weight']].describe()

Unnamed: 0,age,height,weight
count,6394.0,1866.0,2428.0
mean,61.740069,166.796356,69.841845
std,17.739252,10.249504,16.795521
min,4.0,95.0,5.0
25%,52.0,160.0,58.0
50%,64.0,167.0,69.0
75%,75.0,174.0,79.0
max,95.0,195.0,210.0


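Looking at the min, max, and mean of each column, most values fall within plausible ranges. The extremes, such as a minimum weight of 5 and a maximum of 210, are worth a closer look before concluding that these three columns contain no outliers.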
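
One common way to check for outliers more systematically than reading the summary table is the 1.5 × IQR rule. The sketch below is an illustration rather than the method used in this notebook; the helper `iqr_outlier_mask` and the toy values are assumptions.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    # Comparisons with NaN are False, so missing values are never flagged
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up weights for illustration
toy_weight = pd.Series([58.0, 69.0, 79.0, 5.0, 210.0, 65.0, 72.0])
mask = iqr_outlier_mask(toy_weight)
print(toy_weight[mask].tolist())
```

Applied to the quartiles reported above for weight (58 and 79), this rule gives fences of 26.5 and 110.5, so the reported minimum of 5 and maximum of 210 would both be flagged. That is one way to double-check the reading of the summary table.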

In [10]:
# read in npy file 
ecgeq_df = np.load('../../../data/afib_data/ecgeq-500hzsrfava.npy')
ecgeq_df

array([[[-0.005,  0.135,  0.14 , ..., -0.21 , -0.145, -0.08 ],
        [-0.005,  0.135,  0.14 , ..., -0.21 , -0.145, -0.08 ],
        [-0.005,  0.135,  0.14 , ..., -0.21 , -0.145, -0.08 ],
        ...,
        [ 0.03 , -0.045, -0.075, ..., -0.02 , -0.035, -0.045],
        [ 0.03 , -0.045, -0.075, ..., -0.02 , -0.035, -0.045],
        [ 0.03 , -0.045, -0.075, ..., -0.02 , -0.035, -0.045]],

       [[-0.005,  0.135,  0.14 , ..., -0.21 , -0.145, -0.08 ],
        [-0.005,  0.135,  0.14 , ..., -0.21 , -0.145, -0.08 ],
        [-0.005,  0.135,  0.14 , ..., -0.21 , -0.145, -0.08 ],
        ...,
        [ 0.03 , -0.045, -0.075, ..., -0.02 , -0.035, -0.045],
        [ 0.03 , -0.045, -0.075, ..., -0.02 , -0.035, -0.045],
        [ 0.03 , -0.045, -0.075, ..., -0.02 , -0.035, -0.045]],

       [[-0.17 , -0.13 ,  0.04 , ..., -0.14 , -0.05 , -0.03 ],
        [-0.17 , -0.13 ,  0.04 , ..., -0.14 , -0.05 , -0.03 ],
        [-0.17 , -0.13 ,  0.04 , ..., -0.14 , -0.05 , -0.03 ],
        ...,
        [ 0.

In [11]:
ecgeq_df.shape

(6428, 5000, 12)

This is a 3D array with **6428 layers, 5000 rows, and 12 columns**: one layer per recording, one row per time sample (5000 samples, which is 10 seconds if the 500 Hz in the file name is the sampling rate), and one column per lead. The 12 columns represent the 12 leads: I, II, III, aVF, aVR, aVL, V1, V2, V3, V4, V5, and V6. Leads I, II, III, aVR, aVL, and aVF are the limb leads, while V1, V2, V3, V4, V5, and V6 are the precordial leads.
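
A minimal sketch of how an array of this shape can be indexed. It uses random toy data in place of the real file and makes two assumptions that should be verified against the dataset: the column order in `LEAD_NAMES`, and that row `i` of the CSV labels recording `i` of the array.

```python
import numpy as np

# ASSUMPTION: column order of the leads; verify against the dataset documentation
LEAD_NAMES = ["I", "II", "III", "aVR", "aVL", "aVF",
              "V1", "V2", "V3", "V4", "V5", "V6"]

# Toy stand-in for the real array: 4 recordings, 5000 samples, 12 leads
rng = np.random.default_rng(0)
toy_ecg = rng.normal(size=(4, 5000, 12))

# Toy stand-in for the encoded ritmi column (0 = SR, 1 = AF, 2 = VA)
toy_labels = np.array([2, 1, 0, 1])

# One lead of one recording: a 1D signal of 5000 samples
lead_ii = toy_ecg[0, :, LEAD_NAMES.index("II")]

# All recordings labelled AF, selected with a boolean mask on the first axis
af_records = toy_ecg[toy_labels == 1]

# Duration in seconds, ASSUMING a 500 Hz sampling rate as the file name suggests
duration_s = toy_ecg.shape[1] / 500

print(lead_ii.shape, af_records.shape, duration_s)
```

Before using the labels this way on the real data, it is worth checking that `len(afib_df)` equals the first dimension of the loaded array.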