## Introduction

Atrial fibrillation is an irregular and often rapid heart rate that can increase your risk of strokes, heart failure and other heart-related complications.

During atrial fibrillation, the heart's two upper chambers (the atria) beat chaotically and irregularly — out of coordination with the two lower chambers (the ventricles) of the heart. Atrial fibrillation symptoms often include heart palpitations, shortness of breath and weakness.

Episodes of atrial fibrillation may come and go, or one may develop atrial fibrillation that doesn't go away and may require treatment. Although atrial fibrillation itself usually isn't life-threatening, it is a serious medical condition that sometimes requires emergency treatment.

A major concern with atrial fibrillation is the potential to develop blood clots within the upper chambers of the heart. These blood clots forming in the heart may circulate to other organs and lead to blocked blood flow (ischemia).

Treatments for atrial fibrillation may include medications and other interventions to try to alter the heart's electrical system.

In this project, we will build a machine learning model to detect atrial fibrillation in patients so that they can receive treatment as early as possible.

## Data Source

[coorteeqsrafva.csv](https://www.kaggle.com/arjunascagnetto/ptbxl-atrial-fibrillation-detection): This is a subset of PTB-XL, a large publicly available electrocardiography dataset, found on Kaggle. The ritmi column of this dataset contains 3 ECG rhythm classes: Normal (SR), Atrial Fibrillation (AF), and all other arrhythmias (VA).

[ecgeq-500hzsrfava.npy](https://www.kaggle.com/arjunascagnetto/ptbxl-atrial-fibrillation-detection): This NumPy file contains the 12-lead ECGs of the patients in the 'coorteeqsrafva' file. These are the recordings (6428) of ECGs that have one and only one of these conditions:

* Sinus Rhythm (SR). The condition of a normal ECG.
* Atrial Fibrillation (AF). The condition of having the specific arrhythmia of atrial fibrillation.
* Various Arrhythmia (VA). The condition of having one of the other possible types of arrhythmia.

*Note: Since I did not have prior knowledge of ECG leads, I did some research to understand them better. According to [ECG & ECHO Learning](https://ecgwaves.com/topic/ekg-ecg-leads-electrodes-systems-limb-chest-precordial/), an ECG lead is a graphical description of the electrical activity of the heart, and each lead is computed by analysing the electrical currents detected by several electrodes. The standard ECG, which is referred to as a 12-lead ECG since it includes 12 leads, is obtained using 10 electrodes. These 12 leads consist of two sets of ECG leads: limb leads and chest leads.*

## Data Wrangling

In [1]:
# import essential libraries
import numpy as np
import pandas as pd

### Codebook

<img src="../../data/afib_data/codebook.png">

In [2]:
# read in csv file
afib_df = pd.read_csv('../../../data/afib_data/coorteeqsrafva.csv', sep=';', header=0, index_col=0)

# display all columns
pd.options.display.max_columns = None

# print df
afib_df.head()

Unnamed: 0,diagnosi,ecg_id,ritmi,patient_id,age,sex,height,weight,nurse,site,device,recording_date,report,scp_codes,heart_axis,infarction_stadium1,infarction_stadium2,validated_by,second_opinion,initial_autogenerated_report,validated_by_human,baseline_drift,static_noise,burst_noise,electrodes_problems,extra_beats,pacemaker,strat_fold,filename_lr,filename_hr
0,STACH,10900,VA,15654.0,54.0,0,,,0.0,0.0,CS100 3,1993-09-01 11:31:17,"sinustachykardie wpw, typ a lagetyp normal 4.4...","{'AFLT': 100.0, 'STACH': 0.0}",MID,,,,False,True,False,,,,,,,6,records100/10000/10900_lr,records500/10000/10900_hr
1,AFLT,10900,AF,15654.0,54.0,0,,,0.0,0.0,CS100 3,1993-09-01 11:31:17,"sinustachykardie wpw, typ a lagetyp normal 4.4...","{'AFLT': 100.0, 'STACH': 0.0}",MID,,,,False,True,False,,,,,,,6,records100/10000/10900_lr,records500/10000/10900_hr
2,SR,8209,SR,12281.0,55.0,0,,,1.0,2.0,CS-12,1992-06-09 15:52:36,sinusrhythmus linkslagetyp intraventr. leitung...,"{'LVH': 100.0, 'ISC_': 100.0, 'SR': 0.0}",LAD,,,1.0,False,False,True,,,,,,,10,records100/08000/08209_lr,records500/08000/08209_hr
3,STACH,17620,VA,2007.0,29.0,1,164.0,56.0,7.0,1.0,AT-6 C 5.6,1997-02-08 18:33:30,sinus tachycardia. otherwise normal ecg.,"{'NORM': 80.0, 'STACH': 0.0}",,,,0.0,False,False,True,,,,,,,1,records100/17000/17620_lr,records500/17000/17620_hr
4,SBRAD,12967,VA,8685.0,57.0,0,,,0.0,0.0,CS100 3,1994-09-13 10:21:14,sinusbradykardie lagetyp normal sonst normales...,"{'NORM': 80.0, 'SBRAD': 0.0}",MID,,,,False,True,False,,", I-AVR,",,,,,1,records100/12000/12967_lr,records500/12000/12967_hr


In [3]:
# rows and columns
afib_df.shape

(6428, 30)

The dataframe consists of **6428 rows** and **30 columns**.

In [4]:
# check rows for each category in ritmi column
print('Normal (SR) has a total of {} rows'.format(afib_df.loc[afib_df['ritmi'] == 'SR'].shape[0]))
print('Atrial Fibrillation (AF) has a total of {} rows'.format(afib_df.loc[afib_df['ritmi'] == 'AF'].shape[0]))
print('Other arrhythmia (VA) has a total of {} rows'.format(afib_df.loc[afib_df['ritmi'] == 'VA'].shape[0]))

Normal (SR) has a total of 2000 rows
Atrial Fibrillation (AF) has a total of 1587 rows
Other arrhythmia (VA) has a total of 2841 rows
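The per-class counts above can also be obtained in a single call. The following is a minimal sketch on a made-up frame (`toy_df` is a hypothetical stand-in for `afib_df`, not the real data); it also shows how to get class proportions, which is handy when judging class balance.

```python
import pandas as pd

# Hypothetical stand-in for afib_df; the real frame is read from the CSV above
toy_df = pd.DataFrame({'ritmi': ['SR', 'AF', 'VA', 'VA', 'SR', 'VA']})

# absolute number of rows per rhythm class
rhythm_counts = toy_df['ritmi'].value_counts()

# relative frequency of each class (fractions that sum to 1)
rhythm_share = toy_df['ritmi'].value_counts(normalize=True)

print(rhythm_counts)
print(rhythm_share.round(3))
```

On the real dataframe, the same two calls would be applied to `afib_df['ritmi']`.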


In [5]:
# print info of each column
afib_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6428 entries, 0 to 6427
Data columns (total 30 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   diagnosi                      6428 non-null   object 
 1   ecg_id                        6428 non-null   int64  
 2   ritmi                         6428 non-null   object 
 3   patient_id                    6428 non-null   float64
 4   age                           6394 non-null   float64
 5   sex                           6428 non-null   int64  
 6   height                        1866 non-null   float64
 7   weight                        2428 non-null   float64
 8   nurse                         6097 non-null   float64
 9   site                          6423 non-null   float64
 10  device                        6428 non-null   object 
 11  recording_date                6428 non-null   object 
 12  report                        6428 non-null   object 
 13  scp

These columns have null values: **age, height, weight, nurse, site, heart_axis, infarction_stadium1, infarction_stadium2, validated_by, baseline_drift, static_noise, burst_noise, electrodes_problems, extra_beats, pacemaker**. We might not want to drop every row with a null value, since that would greatly reduce our data points. Instead, we can replace the null values in the quantitative columns with the mean or the median during the feature engineering phase. We will also drop the columns that do not provide any significant insight. Let's take a look at the descriptive statistics for each column to detect outliers.
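As an illustration of the imputation idea mentioned above, here is a minimal sketch of median imputation on a made-up frame. The frame `toy_df` and its values are hypothetical; the notebook itself defers this step to feature engineering, so this is only one possible way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps, mimicking the age and height columns
toy_df = pd.DataFrame({
    'age': [54.0, np.nan, 29.0, 57.0],
    'height': [np.nan, 164.0, np.nan, 170.0],
})

# number of null values per column
null_counts = toy_df.isna().sum()

# fill each numeric column with its own median
# (fillna with a Series matches values to columns by label)
imputed_df = toy_df.fillna(toy_df.median(numeric_only=True))

print(null_counts)
print(imputed_df)
```

The median is less sensitive to extreme values than the mean, which matters for skewed columns such as weight.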

In [6]:
# get descriptive statistics for quantitative columns
afib_df.describe()

Unnamed: 0,ecg_id,patient_id,age,sex,height,weight,nurse,site,validated_by,strat_fold
count,6428.0,6428.0,6394.0,6428.0,1866.0,2428.0,6097.0,6423.0,3676.0,6428.0
mean,11394.336341,11597.60252,61.740069,0.478376,166.796356,69.841845,2.093817,1.478281,0.753536,5.525047
std,6239.52046,6248.076594,17.739252,0.499571,10.249504,16.795521,3.124924,3.891928,1.075984,2.871204
min,2.0,304.0,4.0,0.0,95.0,5.0,0.0,0.0,0.0,1.0
25%,6112.25,6489.25,52.0,0.0,160.0,58.0,0.0,0.0,0.0,3.0
50%,11550.5,11976.0,64.0,0.0,167.0,69.0,1.0,1.0,1.0,6.0
75%,16785.5,16958.0,75.0,1.0,174.0,79.0,2.0,2.0,1.0,8.0
max,21833.0,21792.0,95.0,1.0,195.0,210.0,11.0,49.0,10.0,10.0


Of these, only the age, height, and weight columns yield meaningful descriptive statistics; the remaining numeric columns are identifiers or coded categories. Let's restrict the summary to these three columns.

In [7]:
# get descriptive statistics for age, height, and weight columns
afib_df[['age', 'height', 'weight']].describe()

Unnamed: 0,age,height,weight
count,6394.0,1866.0,2428.0
mean,61.740069,166.796356,69.841845
std,17.739252,10.249504,16.795521
min,4.0,95.0,5.0
25%,52.0,160.0,58.0
50%,64.0,167.0,69.0
75%,75.0,174.0,79.0
max,95.0,195.0,210.0


Looking at the min, max, and mean of each column, we conclude that there are no obvious outliers in these three columns. Thus, we do not need to remove any outliers.
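The judgement above is made by eye. A more systematic check could use the common 1.5 × IQR rule; the sketch below applies it to made-up weights. The helper `iqr_outlier_mask` and the sample values are hypothetical and are not results for this dataset.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # NaN compares as False on both sides, so missing values are never flagged
    return (values < lower) | (values > upper)

# Hypothetical weights in kg
toy_weight = pd.Series([58.0, 62.0, 69.0, 71.0, 79.0, 210.0])
mask = iqr_outlier_mask(toy_weight)

print(toy_weight[mask])
```

Whether a flagged value is a true outlier still needs domain judgement; a rule like this only points at candidates.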

Before writing the dataframe to a CSV file, we need to decide which columns to drop, as we have 30 columns. Looking at the dataframe, six columns appear important so far: diagnosi, ritmi, age, sex, height, and weight. What about the rest? Let's take a look at the unique values of each column to decide which ones we will drop.
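Before going column by column, a compact overview of distinct and missing values per column can help narrow down the candidates. This is a minimal sketch on a made-up frame; `toy_df` and the `overview` table are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy frame with an identifier, a coded column, and a sparse column
toy_df = pd.DataFrame({
    'ecg_id': [10900, 10900, 8209, 17620],
    'sex': [0, 0, 0, 1],
    'pacemaker': [None, None, 'ja, pacemaker', None],
})

overview = pd.DataFrame({
    'n_unique': toy_df.nunique(),      # distinct non-null values per column
    'n_missing': toy_df.isna().sum(),  # null values per column
})

print(overview)
```

Columns that are almost entirely missing, or that have nearly one distinct value per row, are natural candidates for dropping.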

In [8]:
# get columns
afib_df.columns

Index(['diagnosi', 'ecg_id', 'ritmi', 'patient_id', 'age', 'sex', 'height',
       'weight', 'nurse', 'site', 'device', 'recording_date', 'report',
       'scp_codes', 'heart_axis', 'infarction_stadium1', 'infarction_stadium2',
       'validated_by', 'second_opinion', 'initial_autogenerated_report',
       'validated_by_human', 'baseline_drift', 'static_noise', 'burst_noise',
       'electrodes_problems', 'extra_beats', 'pacemaker', 'strat_fold',
       'filename_lr', 'filename_hr'],
      dtype='object')

In [9]:
# unique ECG identifier
afib_df['ecg_id'].value_counts() # will drop

10564    2
2889     2
1232     2
20052    2
21541    2
        ..
19148    1
717      1
4815     1
13011    1
16384    1
Name: ecg_id, Length: 6341, dtype: int64
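Since 6428 rows contain only 6341 distinct ecg_id values, some recordings appear more than once; the first two rows of the head output, for instance, share ecg_id 10900 but carry different ritmi labels. One way to inspect such repeats is sketched below on a made-up frame (`toy_df` is a hypothetical stand-in for `afib_df`).

```python
import pandas as pd

# Hypothetical rows mirroring the pattern seen in the head output
toy_df = pd.DataFrame({
    'ecg_id': [10900, 10900, 8209, 17620],
    'ritmi': ['VA', 'AF', 'SR', 'VA'],
})

# keep=False marks every row whose ecg_id occurs more than once
dup_rows = toy_df[toy_df['ecg_id'].duplicated(keep=False)]

# number of distinct rhythm labels attached to each repeated ecg_id
labels_per_ecg = dup_rows.groupby('ecg_id')['ritmi'].nunique()

print(dup_rows)
print(labels_per_ecg)
```

A repeated ecg_id with more than one label means the same signal is present under several classes, which is worth keeping in mind when splitting data for training and evaluation.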

In [10]:
# unique patient identifier
afib_df['patient_id'].value_counts() # will drop 

17542.0    10
15654.0     9
8304.0      9
20318.0     7
10099.0     6
           ..
17900.0     1
3605.0      1
16205.0     1
3907.0      1
424.0       1
Name: patient_id, Length: 5698, dtype: int64

In [11]:
# involved nurse (pseudonymized)
afib_df['nurse'].value_counts() # will drop

0.0     2521
1.0     1908
9.0      181
10.0     181
3.0      179
5.0      177
6.0      166
4.0      159
8.0      159
2.0      157
7.0      155
11.0     154
Name: nurse, dtype: int64

In [12]:
# recording site (pseudonymized)
afib_df['site'].value_counts() # will drop

0.0     2679
2.0     1743
1.0     1670
3.0      101
4.0       18
10.0      16
7.0       14
5.0       12
8.0       12
11.0      12
15.0       9
16.0       9
9.0        9
14.0       8
17.0       8
20.0       7
6.0        7
36.0       6
28.0       6
23.0       5
18.0       5
22.0       4
26.0       4
24.0       4
12.0       4
32.0       4
40.0       4
27.0       4
33.0       3
31.0       3
43.0       3
21.0       3
34.0       3
29.0       3
37.0       3
13.0       2
30.0       2
39.0       2
19.0       2
38.0       2
35.0       2
47.0       2
45.0       1
49.0       1
46.0       1
41.0       1
Name: site, dtype: int64

In [13]:
# recording device
afib_df['device'].value_counts() # will drop

CS100    3    2025
CS-12         1388
AT-6 C 5.5    1004
CS-12   E      669
AT-6     6     633
AT-60    3     334
AT-6 C 5.8     216
AT-6 C          95
AT-6 C 5.0      30
AT-6 C 5.3      17
AT-6 C 5.6      17
Name: device, dtype: int64

In [14]:
# ECG recording date and time
afib_df['recording_date'].value_counts() # will keep

1992-04-09 16:52:20    2
1993-02-27 13:36:18    2
1993-04-23 07:12:28    2
1996-07-20 15:28:09    2
1996-06-28 11:13:59    2
                      ..
1986-09-17 09:33:37    1
1987-04-29 17:02:39    1
1994-08-03 11:18:54    1
1994-04-24 15:02:16    1
1993-04-26 12:56:59    1
Name: recording_date, Length: 6339, dtype: int64

In [15]:
# ECG report from diagnosing cardiologist
afib_df['report'].value_counts() # will drop

sinus rhythm. normal ecg.                                                                    202
schrittmacher ekg 4.46                          unbestÄtigter bericht                        198
sinusrhythmus lagetyp normal normales ekg                                                    150
sinusrhythmus lagetyp normal normales ekg 4.46                          unbestÄtigter bericht    134
sinusrhythmus normales ekg      

In [16]:
# SCP ECG statements
afib_df['scp_codes'].value_counts() # will drop

{'NORM': 100.0, 'SR': 0.0}                                     752
{'PACE': 100.0}                                                274
{'NORM': 80.0, 'SARRH': 0.0}                                   211
{'NORM': 80.0, 'SBRAD': 0.0}                                   151
{'NORM': 80.0, 'STACH': 0.0}                                   128
                                                              ... 
{'LAFB': 100.0, 'CRBBB': 100.0, 'ISCIN': 100.0, 'SR': 0.0}       1
{'CRBBB': 100.0, 'LAFB': 100.0, 'RVH': 50.0, 'SR': 0.0}          1
{'ISCAS': 100.0, 'LAFB': 100.0, '1AVB': 100.0, 'AFLT': 0.0}      1
{'ASMI': 100.0, 'IMI': 15.0, 'LAFB': 100.0, 'SBRAD': 0.0}        1
{'IPLMI': 50.0, 'SR': 0.0}                                       1
Name: scp_codes, Length: 2515, dtype: int64

In [17]:
# heart’s electrical axis 
afib_df['heart_axis'].value_counts() # will keep

MID     2262
LAD     1187
ALAD     482
RAD       87
ARAD      60
AXR       23
AXL       22
SAG        1
Name: heart_axis, dtype: int64

In [18]:
# infarction stadium 
afib_df['infarction_stadium1'].value_counts() # will drop

unknown           1092
Stadium III        316
Stadium II-III     297
Stadium I           66
Stadium II          26
Stadium I-II         3
Name: infarction_stadium1, dtype: int64

In [19]:
# second infarction stadium
afib_df['infarction_stadium2'].value_counts() # will drop

Stadium III    14
Stadium I       6
Stadium II      6
Name: infarction_stadium2, dtype: int64

In [20]:
# validating cardiologist 
afib_df['validated_by'].value_counts() # will drop

1.0     1761
0.0     1621
2.0      156
3.0       39
6.0       27
4.0       26
7.0       23
5.0       12
8.0        7
9.0        3
10.0       1
Name: validated_by, dtype: int64

In [21]:
# flag for second (deviating) opinion
afib_df['second_opinion'].value_counts() # will drop

False    6320
True      108
Name: second_opinion, dtype: int64

In [22]:
# initial autogenerated report by ECG device
afib_df['initial_autogenerated_report'].value_counts() # will drop

False    4223
True     2205
Name: initial_autogenerated_report, dtype: int64

In [23]:
# validated by human
afib_df['validated_by_human'].value_counts() # will keep

True     4559
False    1869
Name: validated_by_human, dtype: int64

In [24]:
# baseline drift or jump present
afib_df['baseline_drift'].value_counts() # will drop

 , V6             69
 , V3             22
 , V1             19
 , alles          18
 , I-AVF          17
                  ..
 , I - AVF         1
 , III,V5,V6       1
I,III,V3,V4,       1
 , drift           1
 , AVL             1
Name: baseline_drift, Length: 138, dtype: int64

In [25]:
# electric hum/static noise present
afib_df['static_noise'].value_counts() # will drop

 , I-AVR,                    255
 , I-AVF,                    255
 , alles,                    226
 , I-V2,                      54
 , I-V1,                      49
                            ... 
 , I,II,AVL,AVF,               1
I,III,AVF,AVL,V2,  ,           1
 , I,II,AVL-AVF,               1
 , I-V2,V6,                    1
 , I-V1, noisy recording,      1
Name: static_noise, Length: 67, dtype: int64

In [26]:
# burst_noise
afib_df['burst_noise'].value_counts() # will drop

alles    57
V1       33
V1,V2    27
I-V1     11
I-V2      9
         ..
V1-V5     1
V1-V4     1
V3-V6     1
V5,V6     1
v1,2      1
Name: burst_noise, Length: 68, dtype: int64

In [27]:
# electrodes problems
afib_df['electrodes_problems'].value_counts() # will drop

V1        2
V6        2
V4        2
v4, v5    1
aVL???    1
V5        1
V1???     1
Name: electrodes_problems, dtype: int64

In [28]:
# extra beats
afib_df['extra_beats'].value_counts() # will drop

1ES           172
VES           101
SVES           96
VES1,alles     84
2ES            72
             ... 
2,II            1
2,V1-V3         1
I-AVR           1
VES1,I-AVF      1
4ES,SVES        1
Name: extra_beats, Length: 80, dtype: int64

In [29]:
# pacemaker
afib_df['pacemaker'].value_counts() # will drop

ja, pacemaker    291
ja, nan            1
?, nan             1
PACE????, nan      1
Name: pacemaker, dtype: int64

In [30]:
# suggested stratified folds
afib_df['strat_fold'].value_counts() # will keep

7     667
10    648
5     648
9     646
8     646
1     640
6     635
2     635
4     635
3     628
Name: strat_fold, dtype: int64

In [31]:
# filename_lr
afib_df['filename_lr'].value_counts() # will drop

records100/10000/10363_lr    2
records100/10000/10075_lr    2
records100/16000/16929_lr    2
records100/04000/04763_lr    2
records100/04000/04262_lr    2
                            ..
records100/06000/06354_lr    1
records100/00000/00656_lr    1
records100/15000/15681_lr    1
records100/06000/06671_lr    1
records100/16000/16756_lr    1
Name: filename_lr, Length: 6341, dtype: int64

In [32]:
# filename_hr
afib_df['filename_hr'].value_counts() # will drop

records500/03000/03267_hr    2
records500/10000/10900_hr    2
records500/00000/00017_hr    2
records500/10000/10363_hr    2
records500/01000/01919_hr    2
                            ..
records500/11000/11766_hr    1
records500/17000/17312_hr    1
records500/01000/01796_hr    1
records500/03000/03034_hr    1
records500/10000/10371_hr    1
Name: filename_hr, Length: 6341, dtype: int64

After checking the unique values of each column, we decide to remove **ecg_id, patient_id, nurse, site, device, report, scp_codes, infarction_stadium1, infarction_stadium2, validated_by, second_opinion, initial_autogenerated_report, baseline_drift, static_noise, burst_noise, electrodes_problems, extra_beats, pacemaker, filename_lr, filename_hr**: some of them do not provide significant information, while others would be very hard to visualize.

In [33]:
# drop columns
afib_df = afib_df.drop(columns=['ecg_id', 'patient_id', 'nurse', 'site', 'device', 'report', 'scp_codes', 'infarction_stadium1', 'infarction_stadium2', 'validated_by', 'second_opinion', 'initial_autogenerated_report', 'baseline_drift', 'static_noise', 'burst_noise', 'electrodes_problems', 'extra_beats', 'pacemaker', 'filename_lr', 'filename_hr'])

In [34]:
# check df
afib_df.head()

Unnamed: 0,diagnosi,ritmi,age,sex,height,weight,recording_date,heart_axis,validated_by_human,strat_fold
0,STACH,VA,54.0,0,,,1993-09-01 11:31:17,MID,False,6
1,AFLT,AF,54.0,0,,,1993-09-01 11:31:17,MID,False,6
2,SR,SR,55.0,0,,,1992-06-09 15:52:36,LAD,True,10
3,STACH,VA,29.0,1,164.0,56.0,1997-02-08 18:33:30,,True,1
4,SBRAD,VA,57.0,0,,,1994-09-13 10:21:14,MID,False,1


Let's recode the values of ritmi (strings) and validated_by_human (booleans) as numeric values. We will also create grouped variables for age, height, weight, and recording_date to make visualization easier.

In [35]:
# dictionary to hold values 
num_di = {'SR': 0, 'AF': 1, 'VA': 2}

# replace SR with 0, AF with 1, VA with 2
afib_df = afib_df.replace({'ritmi': num_di})

In [36]:
# dictionary to hold values 
bool_di = {False: 0, True: 1}

# replace False with 0, True with 1
afib_df = afib_df.replace({'validated_by_human': bool_di})
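The same recoding can be expressed per column with `map` and `astype`. This is a minimal sketch on a made-up frame, not the notebook's own method; `toy_df` and `rhythm_codes` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for the two columns being recoded
toy_df = pd.DataFrame({
    'ritmi': ['VA', 'AF', 'SR'],
    'validated_by_human': [False, True, True],
})

rhythm_codes = {'SR': 0, 'AF': 1, 'VA': 2}

# map() only touches this column; a label missing from the dict would become NaN
toy_df['ritmi'] = toy_df['ritmi'].map(rhythm_codes)

# boolean columns convert directly to 0/1
toy_df['validated_by_human'] = toy_df['validated_by_human'].astype(int)

print(toy_df)
```

Unlike `replace`, `map` turns unexpected labels into NaN instead of leaving them unchanged, which makes recoding mistakes easier to spot.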

We will also create a grouped variable for age called age_group. There will be 10 groups:
* 0 - 9 Years	
* 10 - 19 Years	
* 20 - 29 Years	
* 30 - 39 Years	
* 40 - 49 Years	
* 50 - 59 Years	
* 60 - 69 Years	
* 70 - 79 Years	
* 80+ Years	
* Missing (For null values)

In [37]:
# define a function to recode age
def get_age_group(age):
    # half-open ranges leave no gap for non-integer ages (e.g. 9.5)
    if 0 <= age < 10:
        age_group = '0-9 Years'
    elif 10 <= age < 20:
        age_group = '10-19 Years'
    elif 20 <= age < 30:
        age_group = '20-29 Years'
    elif 30 <= age < 40:
        age_group = '30-39 Years'
    elif 40 <= age < 50:
        age_group = '40-49 Years'
    elif 50 <= age < 60:
        age_group = '50-59 Years'
    elif 60 <= age < 70:
        age_group = '60-69 Years'
    elif 70 <= age < 80:
        age_group = '70-79 Years'
    elif age >= 80:
        age_group = '80+ Years'
    else:
        # NaN fails every comparison above and ends up here
        age_group = 'Missing'
    return age_group

# add the new column called age_group and apply the above function
afib_df['age_group'] = afib_df['age'].apply(get_age_group)
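The same binning can be written without a chain of conditions by using `pd.cut`. The sketch below is an alternative, not the notebook's method; the sample ages are made up, and the `bins` and `labels` lists are assumptions chosen to mirror the groups listed above.

```python
import numpy as np
import pandas as pd

# bin edges for half-open intervals [0, 10), [10, 20), ..., [80, inf)
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, np.inf]
labels = ['0-9 Years', '10-19 Years', '20-29 Years', '30-39 Years',
          '40-49 Years', '50-59 Years', '60-69 Years', '70-79 Years',
          '80+ Years']

# Hypothetical ages, including a null value
toy_age = pd.Series([4.0, 9.0, 10.0, 54.0, 95.0, np.nan])

age_group = (
    pd.cut(toy_age, bins=bins, labels=labels, right=False)
      .cat.add_categories('Missing')  # allow 'Missing' as a category
      .fillna('Missing')              # null ages become 'Missing'
)

print(age_group.tolist())
```

The result is an ordered categorical column, so plots and group-bys keep the age groups in their natural order.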

Similarly, we'll create a grouped variable for height called height_group. There will be 7 groups:
* <1.50m: less than 1.50m
* 1.50m +: 1.50m to under 1.60m
* 1.60m +: 1.60m to under 1.70m
* 1.70m +: 1.70m to under 1.80m
* 1.80m +: 1.80m to under 1.90m
* 1.90m +: 1.90m to under 2.00m
* Missing (for null values)

In [38]:
# define a function to recode height
def get_height_group(height):
    # heights are in cm; half-open ranges leave no gap (e.g. 159.95)
    if height < 150.0:
        height_group = '<1.50m'
    elif 150.0 <= height < 160.0:
        height_group = '1.50m +'
    elif 160.0 <= height < 170.0:
        height_group = '1.60m +'
    elif 170.0 <= height < 180.0:
        height_group = '1.70m +'
    elif 180.0 <= height < 190.0:
        height_group = '1.80m +'
    elif 190.0 <= height < 200.0:
        height_group = '1.90m +'
    else:
        # NaN (and any height of 2.00m or more) ends up here
        height_group = 'Missing'
    return height_group

# add the new column called height_group and apply the above function
afib_df['height_group'] = afib_df['height'].apply(get_height_group)

Likewise, we'll create a grouped variable for weight called weight_group. There will be 7 groups:
* <60kg: less than 60kg
* 60kg +: 60kg to under 70kg
* 70kg +: 70kg to under 80kg
* 80kg +: 80kg to under 90kg
* 90kg +: 90kg to under 100kg
* 100kg +: 100kg and above
* Missing (for null values)

In [39]:
# define a function to recode weight
def get_weight_group(weight):
    # weights are in kg; half-open ranges leave no gap (e.g. 69.95)
    if weight < 60.0:
        weight_group = '<60kg'
    elif 60.0 <= weight < 70.0:
        weight_group = '60kg +'
    elif 70.0 <= weight < 80.0:
        weight_group = '70kg +'
    elif 80.0 <= weight < 90.0:
        weight_group = '80kg +'
    elif 90.0 <= weight < 100.0:
        weight_group = '90kg +'
    elif weight >= 100.0:
        weight_group = '100kg +'
    else:
        # NaN fails every comparison above and ends up here
        weight_group = 'Missing'
    return weight_group

# add the new column called weight_group and apply the above function
afib_df['weight_group'] = afib_df['weight'].apply(get_weight_group)

We also want the year of each record, so we create a grouped variable called recording_year.

In [40]:
# get year from recording_date
afib_df['recording_year'] = pd.to_datetime(afib_df['recording_date']).dt.to_period('Y')
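`to_period('Y')` stores the year as a pandas Period. If a plain integer year is more convenient for plotting or grouping, `dt.year` is an alternative. This is a minimal sketch on made-up timestamps; `toy_dates` is a hypothetical stand-in for the recording_date column.

```python
import pandas as pd

# Hypothetical timestamps in the same format as recording_date
toy_dates = pd.Series(['1993-09-01 11:31:17', '1997-02-08 18:33:30'])
parsed = pd.to_datetime(toy_dates)

year_period = parsed.dt.to_period('Y')  # yearly Period objects
year_int = parsed.dt.year               # plain integer years

print(year_period.tolist())
print(year_int.tolist())
```

Both variants print as the same four-digit year once written to CSV, so the choice mainly affects in-memory operations.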

In [41]:
# check afib_df
afib_df.head()

Unnamed: 0,diagnosi,ritmi,age,sex,height,weight,recording_date,heart_axis,validated_by_human,strat_fold,age_group,height_group,weight_group,recording_year
0,STACH,2,54.0,0,,,1993-09-01 11:31:17,MID,0,6,50-59 Years,Missing,Missing,1993
1,AFLT,1,54.0,0,,,1993-09-01 11:31:17,MID,0,6,50-59 Years,Missing,Missing,1993
2,SR,0,55.0,0,,,1992-06-09 15:52:36,LAD,1,10,50-59 Years,Missing,Missing,1992
3,STACH,2,29.0,1,164.0,56.0,1997-02-08 18:33:30,,1,1,20-29 Years,1.60m +,<60kg,1997
4,SBRAD,2,57.0,0,,,1994-09-13 10:21:14,MID,0,1,50-59 Years,Missing,Missing,1994


In [42]:
# check shape
afib_df.shape

(6428, 14)

Now we can write the modified dataframe to a new CSV file so that we can use it for EDA.

In [43]:
afib_df.to_csv('../../../data/afib_data/new_coorteeqsrafva.csv', index=False)

In [44]:
# read in npy file 
ecgeq_arr = np.load('../../../data/afib_data/ecgeq-500hzsrfava.npy')
ecgeq_arr

array([[[-0.005,  0.135,  0.14 , ..., -0.21 , -0.145, -0.08 ],
        [-0.005,  0.135,  0.14 , ..., -0.21 , -0.145, -0.08 ],
        [-0.005,  0.135,  0.14 , ..., -0.21 , -0.145, -0.08 ],
        ...,
        [ 0.03 , -0.045, -0.075, ..., -0.02 , -0.035, -0.045],
        [ 0.03 , -0.045, -0.075, ..., -0.02 , -0.035, -0.045],
        [ 0.03 , -0.045, -0.075, ..., -0.02 , -0.035, -0.045]],

       [[-0.005,  0.135,  0.14 , ..., -0.21 , -0.145, -0.08 ],
        [-0.005,  0.135,  0.14 , ..., -0.21 , -0.145, -0.08 ],
        [-0.005,  0.135,  0.14 , ..., -0.21 , -0.145, -0.08 ],
        ...,
        [ 0.03 , -0.045, -0.075, ..., -0.02 , -0.035, -0.045],
        [ 0.03 , -0.045, -0.075, ..., -0.02 , -0.035, -0.045],
        [ 0.03 , -0.045, -0.075, ..., -0.02 , -0.035, -0.045]],

       [[-0.17 , -0.13 ,  0.04 , ..., -0.14 , -0.05 , -0.03 ],
        [-0.17 , -0.13 ,  0.04 , ..., -0.14 , -0.05 , -0.03 ],
        [-0.17 , -0.13 ,  0.04 , ..., -0.14 , -0.05 , -0.03 ],
        ...,
        [ 0.

In [45]:
ecgeq_arr.shape

(6428, 5000, 12)

This is a 3D array, which contains **6428 layers (one per recording), 5000 rows (time samples), and 12 columns**. The 12 columns correspond to the 12 leads: the limb leads I, II, III, aVR, aVL, and aVF, and the precordial (chest) leads V1, V2, V3, V4, V5, and V6.
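To make the array layout concrete, the sketch below pulls a single lead out of one recording and builds a matching time axis. It runs on a small made-up array rather than the real file. The column order in `LEAD_NAMES` and the 500 Hz rate in `SAMPLING_RATE_HZ` are assumptions (the rate is inferred from the file name) and should be checked against the dataset documentation.

```python
import numpy as np

# ASSUMPTION: column order of the leads; not confirmed in this notebook
LEAD_NAMES = ['I', 'II', 'III', 'aVR', 'aVL', 'aVF',
              'V1', 'V2', 'V3', 'V4', 'V5', 'V6']
# ASSUMPTION: sampling rate inferred from the '500hz' part of the file name
SAMPLING_RATE_HZ = 500

# Hypothetical stand-in with the same layout: (recordings, samples, leads)
toy_arr = np.zeros((3, 5000, 12))
toy_arr[0, :, LEAD_NAMES.index('II')] = 1.0

# one lead of one recording is a 1D signal of 5000 samples
lead_ii = toy_arr[0, :, LEAD_NAMES.index('II')]

# time axis in seconds for plotting that signal
time_s = np.arange(toy_arr.shape[1]) / SAMPLING_RATE_HZ

print(lead_ii.shape, time_s[-1])
```

With the real data, `toy_arr` would be replaced by `ecgeq_arr`, and `lead_ii` could be plotted against `time_s`.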