# Heart Failure Prediction

**Cardiovascular diseases (CVDs)** are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. **Heart failure** is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol through population-wide strategies. People with cardiovascular disease, or at high cardiovascular risk because of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease, need early detection and management, and this is where a machine learning model can help.

Dataset from Kaggle (https://www.kaggle.com/andrewmvd/heart-failure-clinical-data)   
Thirteen (13) clinical variables (12 features plus the `death event` target):

- age: age of the patient (years)
- anaemia: decrease of red blood cells or hemoglobin (boolean)
- high blood pressure: if the patient has hypertension (boolean)
- creatinine phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L)
- diabetes: if the patient has diabetes (boolean)
- ejection fraction: percentage of blood leaving the heart at each contraction (percentage)
- platelets: platelets in the blood (kiloplatelets/mL)
- sex: woman or man (binary)
- serum creatinine: level of serum creatinine in the blood (mg/dL)
- serum sodium: level of serum sodium in the blood (mEq/L)
- smoking: if the patient smokes or not (boolean)
- time: follow-up period (days)
- death event: if the patient died during the follow-up period (boolean)

**Task Details** :

- Create a model to assess the likelihood of a death by heart failure event.
- This can be used to help hospitals in assessing the severity of patients with cardiovascular diseases.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Import Libraries

In [None]:
# data visualization
import matplotlib.pyplot as plt 
import seaborn as sns

# data preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# model building
from sklearn.linear_model import LogisticRegression # Logistic Regression
from sklearn.naive_bayes import GaussianNB # Naive Bayes
from sklearn.svm import SVC # Support Vector Machine
from sklearn.ensemble import RandomForestClassifier # Random Forest
from sklearn.tree import DecisionTreeClassifier # Decision Tree

# model evaluation
from sklearn.metrics import confusion_matrix, accuracy_score

# k-fold cross validation
from sklearn.model_selection import cross_val_score

## Data Preparation

In [None]:
# read csv
heart_failure = pd.read_csv('/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
heart_failure.head()

In [None]:
# check dimension
heart_failure.shape

In [None]:
# check info data
heart_failure.info()

In [None]:
# check missing values
heart_failure.isnull().sum()

In [None]:
# summary statistics
heart_failure.describe()

## Data Visualization

In [None]:
# number of patients per death event class (0 = survived, 1 = died)
plt.figure(figsize=(10,5))
sns.countplot(data=heart_failure, x='DEATH_EVENT')
plt.show()
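
The count plot shows absolute counts; the class percentages can also be computed directly. The snippet below is a minimal sketch on toy labels, not on the real data. The helper name `class_balance` and the toy values are assumptions made for illustration; in the notebook, the input would be `heart_failure['DEATH_EVENT']`.

```python
import pandas as pd

def class_balance(labels: pd.Series) -> pd.Series:
    """Return the percentage of rows in each class, sorted by class label."""
    return (labels.value_counts(normalize=True) * 100).sort_index()

# Toy labels standing in for heart_failure['DEATH_EVENT'] (hypothetical values)
toy_labels = pd.Series([0, 0, 0, 1], name='DEATH_EVENT')
balance = class_balance(toy_labels)
print(balance.round(1))
```

Knowing the class split matters later: on an imbalanced target, a model that always predicts the majority class already reaches an accuracy equal to the majority percentage.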

In [None]:
# explore continuous features
plt.figure(figsize=(20,12))
plt.suptitle('Continuous Features', fontsize=20)
data_continuous =  heart_failure[['age', 'creatinine_phosphokinase', 'ejection_fraction',
                                  'platelets', 'serum_creatinine', 'serum_sodium', 'time']]
for i in range(0, len(data_continuous.columns)):
    plt.subplot(3, 3, i+1)

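    # distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
    sns.histplot(x=data_continuous.iloc[:, i], kde=True, stat='density')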

In [None]:
# explore categorical features
plt.figure(figsize=(20,12))
plt.suptitle('Categorical Features', fontsize=20)
data_categorical =  heart_failure[['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking', 'DEATH_EVENT']]
for i in range(0, len(data_categorical.columns)-1):
    plt.subplot(3, 3, i+1)

    # pass the column as x= explicitly; recent seaborn versions require keyword arguments here
    sns.countplot(x=data_categorical.iloc[:, i], hue=data_categorical['DEATH_EVENT'])

In [None]:
# explore sex vs age with death event
plt.figure(figsize=(10,5))
sns.boxplot(data=heart_failure, x="sex", y="age",hue='DEATH_EVENT')

plt.show()

In [None]:
# explore smoking vs age with death event
plt.figure(figsize=(10,5))
sns.boxplot(data=heart_failure, x="smoking", y="age",hue='DEATH_EVENT')

plt.show()

In [None]:
# explore sex vs smoking with death event
# catplot is figure-level and creates its own figure, so no plt.figure() call is needed
sns.catplot(data=heart_failure, kind='count', x='sex', col='smoking', hue='DEATH_EVENT', height=5, aspect=1)
plt.show()

## Data Modelling

### Feature Correlation

In [None]:
# correlation features
plt.figure(figsize=(12,8))
data_corr = heart_failure.corr()
sns.heatmap(data_corr, vmin=-1, annot=True, cmap='viridis')
plt.show()

### Feature Selection

In [None]:
# select features whose absolute correlation with the target is greater than 0.1
data_corr[abs(data_corr['DEATH_EVENT']) > 0.1]['DEATH_EVENT']
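
The same filter can be wrapped in a small reusable function that returns the feature names directly and excludes the target itself. This is a sketch on synthetic data, not the heart failure records; the function name `select_by_correlation` and the toy columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def select_by_correlation(df: pd.DataFrame, target: str, threshold: float = 0.1) -> list:
    """Names of columns whose absolute Pearson correlation with `target` exceeds `threshold`."""
    corr_with_target = df.corr()[target].drop(target)  # drop the target's self-correlation (1.0)
    return corr_with_target[corr_with_target.abs() > threshold].index.tolist()

# Synthetic data: one column that drives the target and one unrelated noise column
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
toy = pd.DataFrame({
    'informative': signal,
    'noise': rng.normal(size=200),
    'target': (signal > 0).astype(int),
})
selected = select_by_correlation(toy, 'target', threshold=0.3)
print(selected)
```

Note that Pearson correlation only captures linear relationships, so a feature with a weak linear correlation can still be useful to a non-linear model.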

In [None]:
# Splitting training and testing data with features selected  
X = heart_failure[['age','ejection_fraction', 'serum_creatinine', 'serum_sodium', 'time']]
y = heart_failure['DEATH_EVENT']

# use train size 80% and test size 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
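
Because the two classes are not equally frequent, a plain random split can leave the test set with a different class ratio than the full data. One common option, shown here as a sketch and not as what this notebook does, is the `stratify` argument of `train_test_split`. The toy arrays below are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 70 negatives and 30 positives (hypothetical, not the real records)
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 70/30 class ratio in both the train and the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
print("positives in train:", y_tr.sum(), "of", len(y_tr))
print("positives in test :", y_te.sum(), "of", len(y_te))
```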

In [None]:
# standardization
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train))
X_test = pd.DataFrame(scaler.transform(X_test))
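
Standardization rescales each feature to z = (x - mean) / std, where the mean and standard deviation come from the training set only and are then reused on the test set. The sketch below checks this against `StandardScaler` on toy numbers; the arrays and the `demo_` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (hypothetical values)
demo_train = np.array([[40.0], [60.0], [80.0]])
demo_test = np.array([[60.0], [100.0]])

# Fit on the training data only, then apply the same statistics to the test data
demo_scaler = StandardScaler().fit(demo_train)
scaled_test = demo_scaler.transform(demo_test)

# Manual version: StandardScaler uses the population standard deviation (ddof=0)
manual_test = (demo_test - demo_train.mean(axis=0)) / demo_train.std(axis=0)

print(np.allclose(scaled_test, manual_test))
```

Fitting the scaler on the test set as well would leak information about the test distribution into preprocessing, which is why only `transform` is called on `X_test`.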

In [None]:
# lists of model names and their accuracies
model = []
model_accuracy = []

### Logistic Regression

In [None]:
lr = LogisticRegression(random_state=0)
lr.fit(X_train, y_train)
pred_lr = lr.predict(X_test)

acc1=accuracy_score(y_test,pred_lr)
model.append('Logistic Regression')
model_accuracy.append(acc1*100)
print("Logistic Regression Success Rate :", "{:.2f}%".format(100*acc1))

In [None]:
sns.heatmap(confusion_matrix(y_test, pred_lr), annot=True)
plt.show()
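
Accuracy alone hides which kind of error a model makes. The four cells of a binary confusion matrix also give recall (the share of actual deaths that were caught) and precision (the share of predicted deaths that were correct). The sketch below uses toy labels, which are assumptions made for illustration; with the notebook's variables the inputs would be `y_test` and `pred_lr`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 6 survivors (0) and 4 deaths (1)
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)       # share of actual deaths that were predicted
precision = tp / (tp + fp)    # share of predicted deaths that were correct

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

In a clinical setting a missed death (false negative) is usually the more costly error, so recall is worth reporting next to accuracy.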

### Naive Bayes

In [None]:
nb = GaussianNB()
nb.fit(X_train, y_train)
pred_nb = nb.predict(X_test)

acc2=accuracy_score(y_test,pred_nb)
model.append('Naive Bayes')
model_accuracy.append(acc2*100)
print("Naive Bayes Success Rate :", "{:.2f}%".format(100*acc2))

In [None]:
sns.heatmap(confusion_matrix(y_test, pred_nb), annot=True)
plt.show()

### Support Vector Machine

In [None]:
svm = SVC(random_state=0, kernel='rbf')
svm.fit(X_train, y_train)
pred_svm = svm.predict(X_test)

acc3=accuracy_score(y_test,pred_svm)
model.append('Support Vector Machine')
model_accuracy.append(acc3*100)
print("Support Vector Machine Success Rate :", "{:.2f}%".format(100*acc3))

In [None]:
sns.heatmap(confusion_matrix(y_test, pred_svm), annot=True)
plt.show()

### Decision Tree

In [None]:
dt = DecisionTreeClassifier(criterion = 'entropy', random_state=0)
dt.fit(X_train, y_train)
pred_dt = dt.predict(X_test)

acc4=accuracy_score(y_test,pred_dt)
model.append('Decision Tree')
model_accuracy.append(acc4*100)
print("Decision Tree Success Rate :", "{:.2f}%".format(100*acc4))

In [None]:
sns.heatmap(confusion_matrix(y_test, pred_dt), annot=True)
plt.show()

### Random Forest

In [None]:
rf = RandomForestClassifier(criterion='entropy', n_jobs=10, random_state=10)
rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)

acc5=accuracy_score(y_test,pred_rf)
model.append('Random Forest')
model_accuracy.append(acc5*100)
print("Random Forest Success Rate :", "{:.2f}%".format(100*acc5))

In [None]:
sns.heatmap(confusion_matrix(y_test, pred_rf), annot=True)
plt.show()

### Model Comparison

In [None]:
data_model = pd.DataFrame({'Model':model, 'Accuracy':model_accuracy})
print(data_model)

In [None]:
plt.figure(figsize=(9,6))
sns.barplot(data=data_model, x='Accuracy', y='Model', palette='Blues')
plt.title('Model Accuracy Comparison', fontsize=15, fontweight='bold')
plt.xlabel('Accuracy (%)')
plt.ylabel('')
plt.show()

In [None]:
# 10-fold cross-validation scores for the models
# note: scoring='roc_auc', so these values are ROC AUC scores, not accuracies
accuracies_lr = cross_val_score(estimator = lr, X = X_train, y = y_train, cv = 10, scoring='roc_auc')
accuracies_nb = cross_val_score(estimator = nb, X = X_train, y = y_train, cv = 10, scoring='roc_auc')
accuracies_svm = cross_val_score(estimator = svm, X = X_train, y = y_train, cv = 10, scoring='roc_auc')
accuracies_dt = cross_val_score(estimator = dt, X = X_train, y = y_train, cv = 10, scoring='roc_auc')
accuracies_rf = cross_val_score(estimator = rf, X = X_train, y = y_train, cv = 10, scoring='roc_auc')

In [None]:
kfold_acc_mean = [np.mean(accuracies_lr), np.mean(accuracies_nb), np.mean(accuracies_svm), 
                  np.mean(accuracies_dt), np.mean(accuracies_rf)]

kfold_acc_std = [np.std(accuracies_lr), np.std(accuracies_nb), np.std(accuracies_svm), 
                  np.std(accuracies_dt), np.std(accuracies_rf)]

In [None]:
# the scores come from scoring='roc_auc', so label the columns accordingly
KFold = pd.DataFrame({'Model': model, 'KFold ROC AUC mean': kfold_acc_mean, 
                      'KFold ROC AUC std': kfold_acc_std})
print(KFold)
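
The cross-validation above runs on `X_train` after it was standardized with statistics from the whole training set, so every validation fold has already influenced the scaler. One way to avoid this, shown here as a sketch and not as the notebook's method, is to put the scaler and the model in a pipeline so that scaling is refit inside each fold. The synthetic data from `make_classification` and the `_demo` names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced data standing in for the unscaled training features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=0,
)

# The scaler is fit only on the training part of each fold
pipeline_demo = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))

# Stratified folds keep the class ratio similar in every fold
cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc_scores = cross_val_score(pipeline_demo, X_demo, y_demo, cv=cv_demo, scoring='roc_auc')

print(f"ROC AUC: {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}")
```

With a dataset this small the effect of the leakage is probably minor, but the pipeline form also makes it harder to forget the scaling step at prediction time.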

**Conclusion**

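- Random Forest is the best choice among the five models, with a test accuracy of 86.67% and a 10-fold cross-validation score (ROC AUC) of 90.53%.
- Almost 70% of the patients survived the follow-up period.
- Sex and smoking have a correlation of 0.45.
- The five selected features (age, ejection fraction, serum creatinine, serum sodium and time) are enough to predict death from heart failure with this level of accuracy.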