## Introduction

Atrial fibrillation (AF) is an irregular and often rapid heart rhythm that can increase the risk of stroke, heart failure, and other heart-related complications. Symptoms often include heart palpitations, shortness of breath, and weakness. AF is also independently associated with a significantly greater risk of mortality: AF patients have a 46% greater risk of mortality than patients without AF, and the rate of mortality is 40% among newly diagnosed AF patients. Around 15-30% of patients are asymptomatic, which is concerning because AF is a major risk factor for stroke.

As AF progresses, patients are more likely to experience greater impairments in their quality of life, such as increased pain and discomfort. Early detection and appropriate management reduce stroke risk by two-thirds. Early detection of AF is therefore important to ensure prompt and adequate management, which aims not only to control symptoms but also to avoid later complications.

In [None]:
# import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report

In [None]:
# read in csv file
df = pd.read_csv('../input/ptbxl-atrial-fibrillation-detection/coorteeqsrafva.csv', sep=';', header=0, index_col=0)

# print df
print(df.shape)
df.head()

In [None]:
# check rows for each category in ritmi column
print('Normal (SR) has a total of {} rows'.format(df.loc[df['ritmi'] == 'SR'].shape[0]))
print('Atrial Fibrillation (AF) has a total of {} rows'.format(df.loc[df['ritmi'] == 'AF'].shape[0]))
print('Other arrhythmia (VA) has a total of {} rows'.format(df.loc[df['ritmi'] == 'VA'].shape[0]))

In [None]:
# read in npy file 
ecgeq_arr = np.load('../input/ptbxl-atrial-fibrillation-detection/ecgeq-500hzsrfava.npy')
print(ecgeq_arr.shape)
ecgeq_arr

This is a 3D array with **6428 layers, 5000 rows, and 12 columns**. Each layer is one ECG recording, each row is one time sample, and the 12 columns represent the 12 leads: I, II, III, aVF, aVR, aVL, V1, V2, V3, V4, V5, and V6. Leads I, II, III, aVR, aVL, and aVF are the limb leads, while V1 through V6 are the precordial leads.
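As a minimal sketch of how this layout can be indexed, the snippet below looks up one lead of one recording by name and builds a time axis in seconds. It uses a zero-filled stand-in array rather than the real file. The lead order is assumed to match the order listed above, the 500 Hz sampling rate is inferred from the file name, and `LEAD_NAMES`, `SAMPLING_RATE_HZ`, and `get_lead` are hypothetical names introduced for illustration.

```python
import numpy as np

# Assumption: column order follows the lead order listed in the text.
LEAD_NAMES = ['I', 'II', 'III', 'aVF', 'aVR', 'aVL',
              'V1', 'V2', 'V3', 'V4', 'V5', 'V6']
# Assumption: 500 Hz, inferred from the '500hz' part of the file name.
SAMPLING_RATE_HZ = 500


def get_lead(ecg, record, lead):
    """Return the 1D signal of one lead for one recording (layer)."""
    return ecg[record, :, LEAD_NAMES.index(lead)]


# Stand-in array with the same (layers, rows, columns) layout as ecgeq_arr.
demo_arr = np.zeros((3, 5000, 12))
demo_arr[1, :, LEAD_NAMES.index('V1')] = 1.0

v1_signal = get_lead(demo_arr, record=1, lead='V1')

# Time axis in seconds for one recording (5000 samples at 500 Hz).
time_s = np.arange(demo_arr.shape[1]) / SAMPLING_RATE_HZ
print(v1_signal.shape, time_s[-1])
```

With the real data, `get_lead(ecgeq_arr, record, 'II')` would return lead II of the chosen recording, provided the assumed lead order holds.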

### Data Preprocessing

Let's start to clean the data and prepare for the Exploratory Data Analysis.

In [None]:
# make a copy of the df
afib_df = df.copy()

# drop columns
afib_df = afib_df.drop(columns=['ecg_id', 'patient_id', 'nurse', 'site', 'device', 'report', 'scp_codes', 'infarction_stadium1', 'infarction_stadium2', 'validated_by', 'second_opinion', 'initial_autogenerated_report', 'baseline_drift', 'static_noise', 'burst_noise', 'electrodes_problems', 'extra_beats', 'pacemaker', 'filename_lr', 'filename_hr'])

# dictionary to hold values for ritmi column
num_di = {'SR': 0, 'AF': 1, 'VA': 2}

# replace SR with 0, AF with 1, VA with 2
afib_df = afib_df.replace({'ritmi': num_di})

# dictionary to hold values for validated_by_human column
bool_di = {False: 0, True: 1}

# replace False with 0, True with 1
afib_df = afib_df.replace({'validated_by_human': bool_di})

In [None]:
# define a function to recode age
def get_age_group(age):
    age_group = ''
    if (age >=0 and age <=9):
        age_group = '0-9 Years'
    elif (age >= 10 and age <=19):
        age_group = '10-19 Years'
    elif (age >=20 and age <= 29):
        age_group = '20-29 Years'
    elif (age >=30 and age <= 39):
        age_group = '30-39 Years'
    elif (age >= 40 and age <= 49):
        age_group = '40-49 Years'
    elif (age >= 50 and age <= 59):
        age_group = '50-59 Years'
    elif (age >= 60 and age <= 69):
        age_group = '60-69 Years'
    elif (age >= 70 and age <= 79):
        age_group = '70-79 Years'
    elif (age >= 80):
        age_group = '80+ Years'
    else:
        age_group = 'Missing'
    return age_group

# add the new column called age_group and apply the above function
afib_df['age_group'] = afib_df['age'].apply(get_age_group)
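The same recoding can also be sketched with `pandas.cut`, which bins a whole column at once instead of calling a Python function per row. This is an alternative under stated assumptions, not the notebook's method: `age_groups` is a hypothetical helper, and unlike the `if`/`elif` chain above it places non-integer ages such as 9.5 in a bin instead of `'Missing'`.

```python
import numpy as np
import pandas as pd


def age_groups(ages: pd.Series) -> pd.Series:
    """Bin ages into 10-year groups; NaN and negative ages become 'Missing'."""
    # Left-closed bins: [0, 10), [10, 20), ..., [70, 80), [80, inf)
    edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, np.inf]
    labels = [f'{lo}-{lo + 9} Years' for lo in range(0, 80, 10)] + ['80+ Years']
    groups = pd.cut(ages, bins=edges, labels=labels, right=False)
    # Values outside every bin come back as NaN; relabel them.
    return groups.astype(object).fillna('Missing')


demo_ages = pd.Series([5, 45, 80, 300, np.nan])
print(age_groups(demo_ages).tolist())
```

The same pattern would apply to the height and weight columns by swapping in their bin edges and labels.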

In [None]:
# define a function to recode height
def get_height_group(height):
    height_group = ''
    if (height < 150.0):
        height_group = '<1.50m'
    # upper bounds only: each branch runs only if the earlier ones failed,
    # so values such as 159.95 no longer fall between two bins
    elif (height < 160.0):
        height_group = '1.50m +'
    elif (height < 170.0):
        height_group = '1.60m +'
    elif (height < 180.0):
        height_group = '1.70m +'
    elif (height < 190.0):
        height_group = '1.80m +'
    elif (height < 200.0):
        height_group = '1.90m +'
    else: 
        height_group = 'Missing'
    return height_group

# add the new column called height_group and apply the above function
afib_df['height_group'] = afib_df['height'].apply(get_height_group)

In [None]:
# define a function to recode weight
def get_weight_group(weight):
    weight_group = ''
    if (weight < 60.0):
        weight_group = '<60kg'
    # upper bounds only: each branch runs only if the earlier ones failed,
    # so values such as 69.95 no longer fall between two bins
    elif (weight < 70.0):
        weight_group = '60kg +'
    elif (weight < 80.0):
        weight_group = '70kg +'
    elif (weight < 90.0):
        weight_group = '80kg +'
    elif (weight < 100.0):
        weight_group = '90kg +'
    elif (weight >= 100.0):
        weight_group = '100kg +'
    else: 
        weight_group = 'Missing'
    return weight_group

# add the new column called weight_group and apply the above function
afib_df['weight_group'] = afib_df['weight'].apply(get_weight_group)

In [None]:
# get year from recording_date
afib_df['recording_year'] = pd.to_datetime(afib_df['recording_date']).dt.to_period('Y')

In [None]:
# check afib_df
print(afib_df.shape)
afib_df.head()

## Exploratory Data Analysis

In [None]:
# set up size and color for sns
sns.set(rc={'figure.figsize':(12,2)})
plt.rcParams['figure.dpi'] = 300

### 1. Which gender usually has a higher risk of getting AFib? (0 is male, 1 is female)

In [None]:
# countplot for ritmi, grouped by sex
sns.countplot(x='ritmi', data=afib_df, hue='sex', order = afib_df['ritmi'].value_counts().index, palette='GnBu')
plt.xlabel('Rhythm')
plt.legend(fontsize='x-small', title_fontsize='5', framealpha=0)
plt.show()

**Answer: Male patients are at higher risk of getting AFib than female patients.**
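A count plot shows absolute counts, which depend on how many patients each group contains. As a hedged complement, the sketch below uses `pd.crosstab` with `normalize='index'` to get the share of each rhythm within each sex. The small `demo` frame holds made-up values for illustration only; with the real data, the `sex` and `ritmi` columns of `afib_df` would take its place.

```python
import pandas as pd

# Made-up values for illustration; not taken from the dataset.
demo = pd.DataFrame({
    'sex':   [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    'ritmi': [0, 0, 1, 1, 0, 0, 0, 0, 1, 1],
})

# Each row sums to 1: the share of every rhythm within one sex.
rhythm_share = pd.crosstab(demo['sex'], demo['ritmi'], normalize='index')
print(rhythm_share)
```

In this toy example both groups contain the same number of AF rows, yet the AF share differs because the groups differ in size. The same check applies to the age, weight, height, and heart-axis questions.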

### 2. Which age-group is associated with higher risk of having AFib than others?

In [None]:
# countplot for ritmi, grouped by age
sns.countplot(x='ritmi', data=afib_df, hue='age_group', order = afib_df['ritmi'].value_counts().index, palette='tab20_r')
plt.xlabel('Rhythm')
plt.legend(fontsize='xx-small', title_fontsize='5', framealpha=0, loc='best')
plt.show()

**Answer: Patients aged 70 and older (the 70-79 and 80+ groups) have a higher risk of having AFib than others.**

### 3. What is the common weight of patients who have AFib?

In [None]:
# countplot for ritmi, grouped by weight
sns.countplot(x='ritmi', data=afib_df, hue='weight_group', order = afib_df['ritmi'].value_counts().index, palette='tab20')
plt.xlabel('Rhythm')
plt.legend(fontsize='x-small', title_fontsize='5', framealpha=0, loc='upper right')
plt.show()

**Answer: Patients who have AFib usually weigh less than 60 kg or between 60 and 79 kg.**

### 4. What is the common height of patients who have AFib?

In [None]:
# countplot for ritmi, grouped by height
sns.countplot(x='ritmi', data=afib_df, hue='height_group', order = afib_df['ritmi'].value_counts().index, palette='twilight_shifted')
plt.xlabel('Rhythm')
plt.legend(fontsize='x-small', title_fontsize='5', framealpha=0, loc='upper right')
plt.show()

**Answer: Patients who have AFib are usually between 1.50 m and 1.79 m tall.**

### 5. What is the most common electrical heart axis among AFib patients?

In [None]:
# countplot for ritmi, grouped by heart_axis
sns.countplot(x='ritmi', data=afib_df, hue='heart_axis', order = afib_df['ritmi'].value_counts().index, palette='tab20c')
plt.xlabel('Rhythm')
plt.legend(fontsize='x-small', title_fontsize='5', framealpha=0, loc='upper right')
plt.show()

**Answer: Most AFib patients have a normal electrical heart axis.**

### First 5 Leads of a Random Normal ECG

In [None]:
# pick one random normal case
normal_case = random.choice(list(afib_df[afib_df['ritmi']==0].index))

# plot the first 5 leads of that recording (normal_case selects the layer)
fig,ax = plt.subplots(5,1,figsize=(20,15),sharex=True,sharey=False)
for i in range(5):
    ax[i].plot(ecgeq_arr[normal_case,:,i])

### First 5 Leads of a Random Atrial Fibrillation ECG

In [None]:
# pick one random afib case
afib_case = random.choice(list(afib_df[afib_df['ritmi']==1].index))

# plot the first 5 leads of that recording (afib_case selects the layer)
fig,ax = plt.subplots(5,1,figsize=(20,15),sharex=True,sharey=False)
for i in range(5):
    ax[i].plot(ecgeq_arr[afib_case,:,i])

### First 5 Leads of a Random Other Arrhythmia ECG

In [None]:
# pick one random other arrhythmia case
other_case = random.choice(list(afib_df[afib_df['ritmi']==2].index))

# plot the first 5 leads of that recording (other_case selects the layer)
fig,ax = plt.subplots(5,1,figsize=(20,15),sharex=True,sharey=False)
for i in range(5):
    ax[i].plot(ecgeq_arr[other_case,:,i])

## Modeling

We would like to detect atrial fibrillation cases using the 12 leads in the numpy file along with some of the features in the csv file. Detecting AF from height, weight, age, and similar features alone, without the 12 leads, is unlikely to persuade cardiologists. Therefore, we generated a dataset that includes all the features from the numpy file and 14 features from the csv file. To reduce the size of the dataset and make the model easier to train, we keep only 700 of the 5000 rows per recording from the numpy file. The final dataset has a total of 4,319,176 observations and 26 variables. Please see the entire process [here](https://github.com/tvo10/atrial-fibrillation-detection/blob/main/03_afib_detection_feature_engineering.ipynb).

- **Features:** `I`, `II`, `III`, `aVF`, `aVR`, `aVL`, `V1`, `V2`, `V3`, `V4`, `V5`, `V6`, `age`, `sex`, `height`, `weight`, `nurse`, `site`, `device`, `heart_axis`, `validated_by`, `second_opinion`, `validated_by_human`, `pacemaker`, `strat_fold`
- **Label:** `ritmi`
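The linked notebook is the authoritative description of how this dataset was built. As a rough sketch only, the snippet below shows one way a `(recordings, samples, leads)` array could be flattened into such a long table, with the per-recording metadata repeated on every sample row. It assumes the first `n_samples` samples of each recording are kept, which may differ from the actual selection, and `to_long_table` and `LEADS` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Assumption: column order follows the lead order in the feature list.
LEADS = ['I', 'II', 'III', 'aVF', 'aVR', 'aVL',
         'V1', 'V2', 'V3', 'V4', 'V5', 'V6']


def to_long_table(ecg, meta, n_samples=700):
    """Flatten (n_rec, n_time, 12) into one row per kept sample.

    meta must hold one row per recording, in the same order as ecg.
    Assumption: the first n_samples samples of each recording are kept.
    """
    n_rec = ecg.shape[0]
    trimmed = ecg[:, :n_samples, :]
    signals = pd.DataFrame(trimmed.reshape(n_rec * n_samples, len(LEADS)),
                           columns=LEADS)
    # Repeat every metadata row once per kept sample, in recording order.
    row_ids = np.repeat(np.arange(n_rec), n_samples)
    repeated = meta.iloc[row_ids].reset_index(drop=True)
    return pd.concat([signals, repeated], axis=1)


# Tiny stand-in data: 2 recordings, 5000 samples, 12 leads.
demo_ecg = np.arange(2 * 5000 * 12, dtype=float).reshape(2, 5000, 12)
demo_meta = pd.DataFrame({'age': [60, 70], 'ritmi': [0, 1]})
demo_long = to_long_table(demo_ecg, demo_meta, n_samples=3)
print(demo_long.shape)
```

One consequence of this layout is that every recording contributes many rows that share the same label and metadata.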

In [None]:
# read in csv
df = pd.read_csv('../input/af-dataset/af_dataset.csv')
df

In [None]:
# convert all the columns to float64
df = df.astype('float64')

# get info for columns
df.info()

In [None]:
# train-test split
X = df.drop(columns='ritmi')
y = df['ritmi']
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size = 0.25, random_state = 1234)

In [None]:
# random forest classification algorithm
rf = RandomForestClassifier()
rf_param_grid = {'n_estimators': [45], 'criterion': ['entropy'], 'max_depth': [45]} 
rf_cv= GridSearchCV(rf,rf_param_grid,cv=7)
rf_cv.fit(X_train,y_train)

print("Best Score: " + str(rf_cv.best_score_))
print("Best Parameters: " + str(rf_cv.best_params_))
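The grid above lists a single value per hyperparameter, so `GridSearchCV` cross-validates exactly one configuration. As a minimal sketch of a search over several candidates, the snippet below runs on a small synthetic dataset from `make_classification`. The data, the grid values, and the `demo_` names are illustrative assumptions, not tuned settings for the AF dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data, used only to keep the example fast and self-contained.
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, n_classes=3,
                                     random_state=0)

# Two values per hyperparameter give 2 x 2 = 4 candidate configurations.
demo_grid = {'n_estimators': [10, 30], 'max_depth': [3, None]}
demo_search = GridSearchCV(
    RandomForestClassifier(criterion='entropy', random_state=0),
    demo_grid,
    cv=3,
)
demo_search.fit(X_demo, y_demo)

print("Candidates evaluated:", len(demo_search.cv_results_['params']))
print("Best Parameters:", demo_search.best_params_)
```

On a dataset with millions of rows, each extra candidate multiplies the training cost, which may be one reason to keep the grid small.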

In [None]:
# classification report
y_pred = rf_cv.predict(X_test)
print(classification_report(y_test, y_pred))

Thank you for reading my notebook until the end! Note that this notebook is a simplified version of my capstone project. 

I'd like to thank the authors of these two datasets, which helped me learn a lot while working on my project. If you would like to take a look at all of my work, including data wrangling, EDA, feature engineering, and modeling, please find it [here](https://github.com/tvo10/atrial-fibrillation-detection). Thank you!