## Heart Failure Prediction Dataset

## Context

Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.

People with cardiovascular disease or at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or already established disease) need early detection and management, and a machine learning model can be of great help here.

## Attribute Information
* Age: age of the patient [years]
* Sex: sex of the patient [M: Male, F: Female]
* ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
* RestingBP: resting blood pressure [mm Hg]
* Cholesterol: serum cholesterol [mg/dl]
* FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
* RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
* MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
* ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
* Oldpeak: ST depression induced by exercise relative to rest [Numeric value]
* ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
* HeartDisease: output class [1: heart disease, 0: Normal]
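
The codes listed above can double as a simple sanity check on the categorical columns. The following is a minimal sketch, assuming the column names and codes exactly as documented in this list; the helper `find_unexpected_codes` and the sample rows are hypothetical and only there for illustration.

```python
import pandas as pd

# Allowed codes for each categorical column, taken from the attribute list above.
ALLOWED_CODES = {
    "Sex": {"M", "F"},
    "ChestPainType": {"TA", "ATA", "NAP", "ASY"},
    "RestingECG": {"Normal", "ST", "LVH"},
    "ExerciseAngina": {"Y", "N"},
    "ST_Slope": {"Up", "Flat", "Down"},
}


def find_unexpected_codes(df: pd.DataFrame) -> dict:
    """Return {column: set of values that are not among the documented codes}."""
    unexpected = {}
    for column, allowed in ALLOWED_CODES.items():
        extra = set(df[column].unique()) - allowed
        if extra:
            unexpected[column] = extra
    return unexpected


# Hand-made sample (hypothetical rows, not the real file); the last row has a bad code.
sample = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "XYZ"],
    "RestingECG": ["Normal", "ST", "LVH"],
    "ExerciseAngina": ["N", "Y", "N"],
    "ST_Slope": ["Up", "Flat", "Down"],
})
result = find_unexpected_codes(sample)
print(result)
```

An empty dictionary would mean every value matches the documented codes.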


## Source
This dataset was created by combining five heart disease datasets that were previously available only separately. They are combined over 11 common features, which makes this the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:

* Cleveland: 303 observations
* Hungarian: 294 observations
* Switzerland: 123 observations
* Long Beach VA: 200 observations
* Statlog (Heart) Data Set: 270 observations
* Total: 1190 observations
* Duplicated: 272 observations

+ Final dataset: 918 observations
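
The counts above can be checked with a few lines of arithmetic. This sketch only restates the figures listed in this section; the variable names are illustrative.

```python
# Observation counts for each source dataset, as listed above.
source_counts = {
    "Cleveland": 303,
    "Hungarian": 294,
    "Switzerland": 123,
    "Long Beach VA": 200,
    "Statlog (Heart)": 270,
}
duplicated = 272  # duplicated observations reported above

total = sum(source_counts.values())
final = total - duplicated

print(f"Total: {total}")
print(f"Final dataset: {final}")
```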

Each source dataset can be found under the Index of heart disease datasets in the UCI Machine Learning Repository. The combined dataset is available at the following link: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction

In [4]:
import pandas as pd
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns 

## Reading Dataset

In [6]:
dataset = pd.read_csv("./data/heart.csv")

In [8]:
dataset.head()

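|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |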


## EDA 

In [9]:
dataset.shape 

(918, 12)

## Checking missing values

In [10]:
dataset.isnull().sum()

Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64

Insights
+ There are no null values in any column.

## Dataset Information

In [11]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [12]:
dataset.describe()

|   | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|---|---|---|---|---|---|---|---|
| count | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 |
| mean | 53.510893 | 132.396514 | 198.799564 | 0.233115 | 136.809368 | 0.887364 | 0.553377 |
| std | 9.432617 | 18.514154 | 109.384145 | 0.423046 | 25.460334 | 1.06657 | 0.497414 |
| min | 28.0 | 0.0 | 0.0 | 0.0 | 60.0 | -2.6 | 0.0 |
| 25% | 47.0 | 120.0 | 173.25 | 0.0 | 120.0 | 0.0 | 0.0 |
| 50% | 54.0 | 130.0 | 223.0 | 0.0 | 138.0 | 0.6 | 1.0 |
| 75% | 60.0 | 140.0 | 267.0 | 0.0 | 156.0 | 1.5 | 1.0 |
| max | 77.0 | 200.0 | 603.0 | 1.0 | 202.0 | 6.2 | 1.0 |
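
In the summary above, the minimum of both RestingBP and Cholesterol is 0. A reading of 0 is not plausible for a living patient, so these entries are probably placeholders for measurements that were not recorded. That interpretation is an assumption rather than something the dataset description states. A minimal sketch for counting such entries and marking them as missing, using a small hand-made frame in place of `dataset`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for `dataset`; only the two columns of interest.
sample = pd.DataFrame({
    "RestingBP": [140, 0, 130, 138],
    "Cholesterol": [289, 180, 0, 0],
})

suspect_columns = ["RestingBP", "Cholesterol"]

# Count the zero entries in each column.
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)

# Assumption: treat 0 as "not recorded" and replace it with NaN so that
# isnull() and imputers can see it.
cleaned = sample.copy()
cleaned[suspect_columns] = cleaned[suspect_columns].replace(0, np.nan)
print(cleaned.isnull().sum())
```

Whether to impute, drop, or keep these rows is a modelling decision that this sketch does not make.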


## Dependent and Independent features

In [13]:
dataset.head()

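|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |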


In [14]:
# Independent features: every column except the target
x = dataset.drop(['HeartDisease'], axis=1)
# Dependent feature: the target class
y = dataset['HeartDisease']

In [15]:
x.head()

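|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up |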


In [16]:
dataset['Sex'].unique()

array(['M', 'F'], dtype=object)

In [17]:
dataset['ChestPainType'].unique()

array(['ATA', 'NAP', 'ASY', 'TA'], dtype=object)

In [18]:
dataset['RestingECG'].unique()

array(['Normal', 'ST', 'LVH'], dtype=object)

In [19]:
dataset['ExerciseAngina'].unique()

array(['N', 'Y'], dtype=object)

In [21]:
dataset['ST_Slope'].unique()

array(['Up', 'Flat', 'Down'], dtype=object)

In [27]:
# categorical features
categorical_fet = [fet for fet in x.columns if x[fet].dtype == 'O']
categorical_fet

['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']

In [26]:
# numerical features
numerical_fet = [fet for fet in x.columns if x[fet].dtype != 'O']
numerical_fet

['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']
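
The same split can also be written with `DataFrame.select_dtypes`, which avoids comparing dtypes by hand. This is an alternative sketch, not the notebook's code; `sample_x` is a hand-built stand-in for `x` with only a few of its columns.

```python
import pandas as pd

# A few columns from the first rows of the feature table, rebuilt by hand.
sample_x = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ATA"],
    "RestingBP": [140, 160, 130],
    "Oldpeak": [0.0, 1.0, 0.0],
})

# Numeric columns (ints and floats); everything else is treated as categorical.
numerical_cols = sample_x.select_dtypes(include="number").columns.tolist()
categorical_cols = sample_x.select_dtypes(exclude="number").columns.tolist()

print(categorical_cols)
print(numerical_cols)
```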

## Scaling and Encoding 

In [28]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# One-hot encode the categorical columns and standardize the numerical ones
cat_transformer = OneHotEncoder()
num_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", cat_transformer, categorical_fet),
        ("StandardScaler", num_transformer, numerical_fet),
    ]
)

In [29]:
x = preprocessor.fit_transform(x)
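
After this step `x` is a numeric array rather than a DataFrame. One-hot encoding produces one column per category, so the number of columns can be worked out from the `unique()` outputs above. A quick check of that arithmetic (the dictionary simply restates those outputs):

```python
# Categories per categorical feature, copied from the unique() outputs above.
categories = {
    "Sex": ["M", "F"],
    "ChestPainType": ["ATA", "NAP", "ASY", "TA"],
    "RestingECG": ["Normal", "ST", "LVH"],
    "ExerciseAngina": ["N", "Y"],
    "ST_Slope": ["Up", "Flat", "Down"],
}
numerical = ["Age", "RestingBP", "Cholesterol", "FastingBS", "MaxHR", "Oldpeak"]

# One column per category, plus one scaled column per numerical feature.
one_hot_columns = sum(len(values) for values in categories.values())
total_columns = one_hot_columns + len(numerical)

print(one_hot_columns, total_columns)
```

The total should equal the second dimension of the transformed array.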

## Train test split 

In [30]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

In [31]:
x_train.shape,x_test.shape 

((734, 20), (184, 20))
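
In the cells above, the preprocessor is fitted on the full dataset before the split, so the scaler's mean and standard deviation are computed with the test rows included. A common alternative is to split first and fit the preprocessing on the training rows only, for example by wrapping it in a `Pipeline`. The sketch below illustrates that idea on a small synthetic frame; the column names `num` and `cat` and the data are made up, and this is not the notebook's original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical): one numeric and one categorical feature.
rng = np.random.default_rng(42)
n_rows = 200
features = pd.DataFrame({
    "num": rng.normal(size=n_rows),
    "cat": rng.choice(["a", "b", "c"], size=n_rows),
})
target = (features["num"] + rng.normal(scale=0.5, size=n_rows) > 0).astype(int)

# Split the raw, untransformed data first ...
x_train_raw, x_test_raw, y_train_raw, y_test_raw = train_test_split(
    features, target, test_size=0.2, random_state=42
)

sketch_preprocessor = ColumnTransformer([
    ("one_hot", OneHotEncoder(handle_unknown="ignore"), ["cat"]),
    ("scale", StandardScaler(), ["num"]),
])

# ... then let the pipeline fit the preprocessing on the training rows only.
clf = Pipeline([
    ("preprocess", sketch_preprocessor),
    ("model", LogisticRegression()),
])
clf.fit(x_train_raw, y_train_raw)

test_accuracy = clf.score(x_test_raw, y_test_raw)
print(f"Test accuracy: {test_accuracy:.3f}")
```

With this layout, `clf.predict` applies the already-fitted encoder and scaler to new rows, so the test split never influences the preprocessing statistics.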

In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier,GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [35]:
models = {
    "LogisticRegression":LogisticRegression(),
    "DecisionTreeClassifier":DecisionTreeClassifier(),
    "KNeighborsClassifier":KNeighborsClassifier(),
    "AdaBoostClassifier":AdaBoostClassifier(),
    "GradientBoostingClassifier":GradientBoostingClassifier(),
    "XGBClassifier": XGBClassifier()
}


In [41]:
accuracy_score_list = []
for name, model in models.items():
    # Train on the training split and predict on the held-out test split
    model.fit(x_train, y_train)
    predict = model.predict(x_test)

    accuracy = accuracy_score(y_test, predict)
    cf_matrix = confusion_matrix(y_test, predict)
    report = classification_report(y_test, predict)

    accuracy_score_list.append(accuracy)
    print(f"{name}")
    print(f"Accuracy:{accuracy}")
    print(f"Confusion matrix:\n{cf_matrix}")
    print(f"Classification report:\n{report}")

    print("=="*40)

LogisticRegression
Accuracy:0.8532608695652174
Confusion matrix:
[[67 10]
 [17 90]]
Classification report:
              precision    recall  f1-score   support

           0       0.80      0.87      0.83        77
           1       0.90      0.84      0.87       107

    accuracy                           0.85       184
   macro avg       0.85      0.86      0.85       184
weighted avg       0.86      0.85      0.85       184

DecisionTreeClassifier
Accuracy:0.782608695652174
Confusion matrix:
[[65 12]
 [28 79]]
Classification report:
              precision    recall  f1-score   support

           0       0.70      0.84      0.76        77
           1       0.87      0.74      0.80       107

    accuracy                           0.78       184
   macro avg       0.78      0.79      0.78       184
weighted avg       0.80      0.78      0.78       184

KNeighborsClassifier
Accuracy:0.8532608695652174
Confusion matrix:
[[67 10]
 [17 90]]
Classification report:
              precis
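
The accuracy and the per-class figures in the reports can be recomputed directly from a confusion matrix. In scikit-learn's `confusion_matrix`, rows are true classes and columns are predicted classes, so the LogisticRegression matrix above reads as 67 true negatives, 10 false positives, 17 false negatives and 90 true positives. A small check of that arithmetic for class 1:

```python
# Confusion matrix printed for LogisticRegression above:
# rows = true class (0, 1), columns = predicted class (0, 1).
tn, fp = 67, 10
fn, tp = 17, 90

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)  # of the predicted positives, how many are correct
recall_1 = tp / (tp + fn)     # of the actual positives, how many are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
# prints: accuracy=0.85 precision=0.90 recall=0.84 f1=0.87
```

These values match the class 1 row and the accuracy row of the LogisticRegression report.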

In [42]:
accuracy_score_list

[0.8532608695652174,
 0.782608695652174,
 0.8532608695652174,
 0.8695652173913043,
 0.8804347826086957,
 0.8804347826086957]

In [46]:
acc_table = pd.DataFrame(
    data=list(zip(list(models), accuracy_score_list)),
    columns=['Model names', 'Accuracy Score'],
).sort_values(by=['Accuracy Score'], ascending=False)

In [47]:
acc_table

|   | Model names | Accuracy Score |
|---|---|---|
| 4 | GradientBoostingClassifier | 0.880435 |
| 5 | XGBClassifier | 0.880435 |
| 3 | AdaBoostClassifier | 0.869565 |
| 0 | LogisticRegression | 0.853261 |
| 2 | KNeighborsClassifier | 0.853261 |
| 1 | DecisionTreeClassifier | 0.782609 |


Insights
+ From this table, GradientBoostingClassifier and XGBClassifier give the best result, tied at an accuracy of about 0.88 on the test set.
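
Because the test set has only 184 rows, it helps to translate these accuracies back into counts of correct predictions. The sketch below does that with the values from the table; the figure 184 comes from the test split shape shown earlier.

```python
# Accuracy scores from the table above.
accuracy_by_model = {
    "GradientBoostingClassifier": 0.8804347826086957,
    "XGBClassifier": 0.8804347826086957,
    "AdaBoostClassifier": 0.8695652173913043,
    "LogisticRegression": 0.8532608695652174,
    "KNeighborsClassifier": 0.8532608695652174,
    "DecisionTreeClassifier": 0.782608695652174,
}
test_size = 184  # rows in the test split

# Accuracy is correct / test_size, so multiplying back gives the correct count.
correct_by_model = {
    name: round(acc * test_size) for name, acc in accuracy_by_model.items()
}
for name, correct in correct_by_model.items():
    print(f"{name}: {correct}/{test_size} correct")
```

On this split the top two models are ahead of AdaBoostClassifier by two test rows and ahead of LogisticRegression by five, so the ranking among the best models rests on a small margin.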