<h1><center>Heart Failure Prediction</center></h1>
<h3><center>"I'm tired of working"</center></h3>
<center><img src='https://sinahealthtour.com/wp-content/uploads/2019/07/Untitled-1.jpg'></center>


# Summary

- [Libraries](#Libraries)

- [Data](#Data)

- [Data Analysis](#Data-Analysis)

- [Data Preparation](#Data-preparation)

- [Machine Learning Models](#Machine-Learning-Models)

- [Comparing models](#Comparing-models)

# Libraries

In [None]:
import pandas as pd
import numpy as np

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import seaborn as sns

from imblearn.over_sampling import SMOTE

from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif, chi2
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score, classification_report

# Data

In [None]:
data = pd.read_csv('/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

data.head()

In [None]:
data.describe()

# Data Analysis

## Age

In [None]:
fig_age = go.Figure()

fig_age.add_trace(go.Histogram(x=data['age'],
                               marker_color='#6a6fff'))

fig_age.update_layout(
    title_text='Age Distribution',
    xaxis_title_text='Age',
    yaxis_title_text='Count', 
    bargap=0.05, 
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_age.show()

## Anemia 

Anemia is a condition in which the blood doesn't have enough healthy red blood cells.

["Anemia, or a low hemoglobin level in the blood, is often linked to heart disease because the heart has to work harder to pump more blood and oxygen through the body."](https://www.everydayhealth.com/heart-health/anemia.aspx#:~:text=Anemia%27s%20Impact%20on%20Heart%20Health&text=People%20who%20are%20anemic%20are,compared%20to%20those%20without%20anemia.)

In [None]:
normal = data[data['anaemia']==0]

anemia = data[data['anaemia']==1]

In [None]:
colors= ['#7eff5e', '#ff5e79']

labels = ['Normal', 'Anemia']

values = [len(normal[normal['DEATH_EVENT'] == 1]), 
          len(anemia[anemia['DEATH_EVENT'] == 1])]

fig_anemia = go.Figure()

fig_anemia.add_trace(go.Pie(labels=labels, values=values,
                            hole=.4, marker_colors=colors))

fig_anemia.update_layout(
    title_text='Total number of deaths - Anemia',
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_anemia.show()
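
The pie chart above compares raw death counts, which also depend on how many patients fall in each group. A minimal sketch of the complementary view, the death rate per group, is shown below. The column names match the dataset, but the rows in `toy` are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    'anaemia':     [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    'DEATH_EVENT': [1, 1, 0, 0, 0, 0, 1, 1, 0, 0],
})

# Deaths and group size per anaemia flag, then the share of patients who died.
summary = toy.groupby('anaemia')['DEATH_EVENT'].agg(deaths='sum', patients='count')
summary['death_rate'] = summary['deaths'] / summary['patients']

print(summary)
```

In this toy frame both groups have the same number of deaths, yet the rates differ because the groups have different sizes. The same `groupby` pattern should apply to the other binary columns (`diabetes`, `high_blood_pressure`, `sex`, `smoking`).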

## Creatinine phosphokinase

Creatine phosphokinase (CPK) is an enzyme found mainly in the heart, brain, and skeletal muscle. An abnormal CPK level may indicate some type of injury to these tissues.

**CPK normal value:** 

- 10 - 120 micrograms per liter (mcg/L)


**Abnormal values can indicate:**
- Brain injury or stroke
- Convulsions
- Delirium tremens
- Dermatomyositis or polymyositis
- Electric shock
- Heart attack*
- Inflammation of the heart muscle (myocarditis)
- Lung tissue death (pulmonary infarction)
- Muscular dystrophies
- Myopathy


For more information, see: [CPK](https://www.ucsfbenioffchildrens.org/tests/003503.html)

In [None]:
normal_cpk_level = data[(data['creatinine_phosphokinase'] >= 10) & 
                        (data['creatinine_phosphokinase'] <= 120)]

abnormal_cpk_level = data[(data['creatinine_phosphokinase'] < 10) | 
                          (data['creatinine_phosphokinase'] > 120)]
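
The two masks above are complements of each other, so a single boolean flag is enough to split the data. A small sketch using `Series.between`, which is inclusive at both ends by default; the values in `cpk` are hypothetical and only chosen to hit the range boundaries.

```python
import pandas as pd

# Hypothetical CPK values, including both boundaries of the 10-120 range.
cpk = pd.Series([5, 10, 80, 120, 121, 582])

# True where the value lies inside the normal range (bounds included).
is_normal = cpk.between(10, 120)

print(is_normal.tolist())
print('normal:', is_normal.sum(), 'abnormal:', (~is_normal).sum())
```

With such a flag, `data[is_normal]` and `data[~is_normal]` would give the same two groups as the explicit comparisons.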

In [None]:
fig_creatinine = go.Figure()

fig_creatinine.add_trace(go.Histogram(x=data['creatinine_phosphokinase'],
                                      marker_color='#6a6fff'))

fig_creatinine.update_layout(
    title_text='Creatinine Phosphokinase Distribution',
    xaxis_title_text='Creatinine Phosphokinase (mcg/L)',
    yaxis_title_text='Count', 
    bargap=0.05, 
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_creatinine.show()

In [None]:
fig_creatinine = go.Figure()

fig_creatinine.add_trace(go.Box(y=data['creatinine_phosphokinase'], 
                                name='Box', marker_color='#6a6fff'))

fig_creatinine.update_layout(
    title_text='Creatinine Phosphokinase BoxPlot',
    yaxis_title_text='Creatinine Phosphokinase (mcg/L)', 
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_creatinine.show()

In [None]:
colors = ['#7eff5e', '#ff5e79']

labels = ['CPK Normal Level', 'CPK Abnormal Level']

values = [len(normal_cpk_level[normal_cpk_level['DEATH_EVENT'] == 1]),
          len(abnormal_cpk_level[abnormal_cpk_level['DEATH_EVENT'] == 1])]

fig_creatinine = go.Figure()

fig_creatinine.add_trace(go.Pie(labels=labels, values=values, 
                                hole=.4, marker_colors=colors))

fig_creatinine.update_layout(
    title_text='Total number of deaths - CPK',
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_creatinine.show()

## Diabetes

Diabetes is a chronic metabolic disease characterized by elevated levels of blood glucose (blood sugar), which over time leads to serious damage to the heart, blood vessels, eyes, kidneys, and nerves.

["Diabetes and heart failure are linked; treatment should be too."](https://www.heart.org/en/news/2019/06/06/diabetes-and-heart-failure-are-linked-treatment-should-be-too#:~:text=People%20who%20have%20Type%202,a%20risk%20factor%20for%20diabetes.)

In [None]:
normal = data[data['diabetes']==0]

diabetes = data[data['diabetes']==1]

In [None]:
colors = ['#7eff5e', '#ff5e79']

labels = ['Normal', 'Diabetes']

values = [len(normal[normal['DEATH_EVENT'] == 1]), 
          len(diabetes[diabetes['DEATH_EVENT'] == 1])]

fig_diabetes = go.Figure()

fig_diabetes.add_trace(go.Pie(labels=labels, values=values,
                              hole=.4, marker_colors=colors))

fig_diabetes.update_layout(
    title_text='Total number of deaths - Diabetes',
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_diabetes.show()


## Ejection fraction
Ejection fraction is a measurement of the percentage of blood leaving your heart each time it contracts.

The ejection fraction is usually measured only in the left ventricle (LV).

- An LV ejection fraction of 55 percent or higher is considered normal.
- An LV ejection fraction of 50 percent or lower is considered reduced.
- An LV ejection fraction between 50 and 55 percent is usually considered "borderline."

Some things that may cause a reduced ejection fraction are:

- Weakness of the heart muscle, such as cardiomyopathy
- Heart attack that damaged the heart muscle
- Heart valve problems
- Long-term, uncontrolled high blood pressure

For more information, see: [Ejection fraction](https://www.mayoclinic.org/ejection-fraction/expert-answers/faq-20058286#:~:text=The%20ejection%20fraction%20is%20usually,or%20higher%20is%20considered%20normal.)

In [None]:
normal_ejection_fract = data[data['ejection_fraction'] >= 55]

reduced_ejection_fract = data[data['ejection_fraction'] <= 50]

borderline_ejection_fract = data[(data['ejection_fraction'] < 55) & 
                                 (data['ejection_fraction'] > 50)]
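
The three thresholds above can also be turned into a single categorical label per patient, which makes the boundary handling explicit (55 counts as normal, 50 counts as reduced). This is only a sketch with hypothetical values in `ef`; `np.select` is used instead of `pd.cut` because the two boundaries are closed on opposite sides.

```python
import numpy as np
import pandas as pd

# Hypothetical ejection fraction values (%), including both boundaries.
ef = pd.Series([20, 38, 50, 51, 54, 55, 60], name='ejection_fraction')

# Conditions are checked in order; anything matching neither is borderline.
conditions = [ef >= 55, ef <= 50]
choices = ['normal', 'reduced']

ef_group = pd.Series(np.select(conditions, choices, default='borderline'),
                     index=ef.index, name='ef_group')

print(ef_group.value_counts())
```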

In [None]:
fig_eject_fract = go.Figure()

fig_eject_fract.add_trace(go.Histogram(x=data['ejection_fraction'],
                                      marker_color='#6a6fff'))

fig_eject_fract.update_layout(
    title_text='Ejection Fraction Distribution',
    xaxis_title_text='Ejection fraction (%)',
    yaxis_title_text='Count', 
    bargap=0.05, 
    template = 'plotly_dark',
    width=750, height=600
)

fig_eject_fract.show()

In [None]:
colors = ['#7eff5e', '#ff5e79', '#fddb3a']

labels = ['Normal Ejection Fraction', 'Reduced Ejection Fraction', 
          'Borderline Ejection Fraction ']

values = [len(normal_ejection_fract[normal_ejection_fract['DEATH_EVENT']==1]),
          len(reduced_ejection_fract[reduced_ejection_fract['DEATH_EVENT']==1]),
          len(borderline_ejection_fract[borderline_ejection_fract['DEATH_EVENT']==1])]

fig_eject_fract = go.Figure()

fig_eject_fract.add_trace(go.Pie(labels=labels, values=values,
                         hole=.4, marker_colors=colors))

fig_eject_fract.update_layout(
    title_text='Total number of deaths - Ejection Fraction',
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_eject_fract.show()

## High blood pressure

High blood pressure is a common condition in which the long-term force of the blood against your artery walls is high enough that it may eventually cause health problems, such as heart disease.

["How High Blood Pressure Can Lead to Heart Failure"](https://www.heart.org/en/health-topics/high-blood-pressure/health-threats-from-high-blood-pressure/how-high-blood-pressure-can-lead-to-heart-failure#:~:text=Heart%20failure%2C%20a%20condition%20where,risk%20of%20developing%20heart%20failure.)

In [None]:
normal_blood_pressure = data[data['high_blood_pressure'] == 0]

high_blood_pressure = data[data['high_blood_pressure'] == 1]

In [None]:
colors = ['#7eff5e', '#ff5e79']

labels = ['Normal Blood Pressure', 'High Blood Pressure']

values = [len(normal_blood_pressure[normal_blood_pressure['DEATH_EVENT'] == 1]), 
          len(high_blood_pressure[high_blood_pressure['DEATH_EVENT'] == 1])]

fig_pressure = go.Figure()

fig_pressure.add_trace(go.Pie(labels=labels, values=values,
                             hole=.4, marker_colors=colors))

fig_pressure.update_layout(
    title_text='Total number of deaths - Blood Pressure',
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_pressure.show()

## Platelets

Platelets are components of the blood that help it clot.

**Normal number of platelets**: 150,000 to 450,000 per microliter of blood

["Platelets and Cardiovascular Disease"](https://www.ahajournals.org/doi/pdf/10.1161/01.CIR.0000086897.15588.4B)

In [None]:
normal_platelets_level = data[(data['platelets'] >= 150000) & (data['platelets'] <= 450000)]

abnormal_platelets_level = data[(data['platelets'] < 150000) | (data['platelets'] > 450000)]

In [None]:
fig_platelets = go.Figure()

fig_platelets.add_trace(go.Histogram(x=data['platelets'], 
                                      marker_color='#6a6fff'))

fig_platelets.update_layout(
    title_text='Platelets Distribution',
    xaxis_title_text='Platelets (kiloplatelets/mL)',
    yaxis_title_text='Count', 
    bargap=0.05, 
    template = 'plotly_dark',
    width=750, height=600
)

fig_platelets.show()

In [None]:
colors = ['#7eff5e', '#ff5e79']

labels = ['Normal Platelets Level', 'Abnormal Platelets Level']

values = [len(normal_platelets_level[normal_platelets_level['DEATH_EVENT']==1]),
          len(abnormal_platelets_level[abnormal_platelets_level['DEATH_EVENT']==1])]

fig_platelets = go.Figure()

fig_platelets.add_trace(go.Pie(labels=labels, values=values, 
                         hole=.4, marker_colors=colors))

fig_platelets.update_layout(
    title_text='Total number of deaths - Platelets',
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_platelets.show()

## Serum creatinine

The serum creatinine level can indicate whether your kidneys are working properly.

**Normal range:** 0.7 to 1.2 (mg/dL)

["Beware the rising creatinine level"](https://doi.org/10.1054/jcaf.2003.10)

In [None]:
normal_range_creatinine = data[(data['serum_creatinine'] >= 0.7) & (data['serum_creatinine'] <= 1.2)]

out_range_creatinine = data[(data['serum_creatinine'] < 0.7) | (data['serum_creatinine'] > 1.2)]

In [None]:
fig_creatinine = go.Figure()

fig_creatinine.add_trace(go.Histogram(x=data['serum_creatinine'], 
                                      marker_color='#6a6fff'))

fig_creatinine.update_layout(
    title_text='Serum Creatinine Distribution',
    xaxis_title_text='Serum Creatinine (mg/dL)',
    yaxis_title_text='Count', 
    bargap=0.05, 
    template = 'plotly_dark',
    width=750, height=600
)

fig_creatinine.show()

In [None]:
colors = ['#7eff5e', '#ff5e79']

labels = ['Normal Creatinine Level', 'Abnormal Creatinine Level']

values = [len(normal_range_creatinine[normal_range_creatinine['DEATH_EVENT']==1]),
          len(out_range_creatinine[out_range_creatinine['DEATH_EVENT']==1])]

fig_creatinine = go.Figure()

fig_creatinine.add_trace(go.Pie(labels=labels, values=values, 
                         hole=.4, marker_colors=colors))

fig_creatinine.update_layout(
    title_text='Total number of deaths - Creatinine',
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_creatinine.show()

## Serum sodium

A sodium blood test is a routine test that allows your doctor to see how much sodium is in your blood. 

**Normal range:**  135 to 145 mEq/L

In [None]:
normal_sodium_level = data[(data['serum_sodium'] >= 135) & (data['serum_sodium'] <= 145)]
abnormal_sodium_level = data[(data['serum_sodium'] < 135) | (data['serum_sodium'] > 145)]

In [None]:
fig_sodium = go.Figure()

fig_sodium.add_trace(go.Histogram(x=data['serum_sodium'], 
                                  marker_color='#6a6fff'))

fig_sodium.update_layout(
    title_text='Serum Sodium Distribution',
    xaxis_title_text='Serum Sodium (mEq/L)',
    yaxis_title_text='Count', 
    bargap=0.05, 
    template = 'plotly_dark',
    width=750, height=600
)

fig_sodium.show()

In [None]:
colors = ['#7eff5e', '#ff5e79']

labels = ['Normal Sodium Level', 'Abnormal Sodium Level']

values = [len(normal_sodium_level[normal_sodium_level['DEATH_EVENT']==1]),
          len(abnormal_sodium_level[abnormal_sodium_level['DEATH_EVENT']==1])]

fig_sodium = go.Figure()

fig_sodium.add_trace(go.Pie(labels=labels, values=values, 
                         hole=.4, marker_colors=colors))

fig_sodium.update_layout(
    title_text='Total number of deaths - Sodium',
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_sodium.show()

## Sex

In [None]:
colors = ['#013766', '#bc4558']

labels = ['Male', 'Female']

values = [len(data[(data['DEATH_EVENT'] == 1) & (data['sex'] == 1)]), 
          len(data[(data['DEATH_EVENT'] == 1) & (data['sex'] == 0)])]

fig_sex = go.Figure()

fig_sex.add_trace(go.Pie(labels=labels, values=values, 
                         hole=.4, marker_colors=colors))

fig_sex.update_layout(
    title_text='Total number of deaths - Sex',
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_sex.show()

## Smoking

In [None]:
labels = ['Smokers', 'Non-smokers']

values = [len(data[(data['DEATH_EVENT'] == 1) & (data['smoking'] == 1)]), 
          len(data[(data['DEATH_EVENT'] == 1) & (data['smoking'] == 0)])]

fig_smoking = go.Figure()

fig_smoking.add_trace(go.Pie(labels=labels, values=values,
                            hole=.4))

fig_smoking.update_layout(
    title_text='Total number of deaths - Smoking',
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_smoking.show()

## Time

In [None]:
fig_time = go.Figure()

fig_time.add_trace(go.Histogram(x=data['time'], 
                                marker_color='#6a6fff'))

fig_time.update_layout(
    title_text='Time Distribution',
    xaxis_title_text='Time (days)',
    yaxis_title_text='Count', 
    bargap=0.05, 
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_time.show()

## Death event

In [None]:
survived = data[data['DEATH_EVENT'] == 0]

dead = data[data['DEATH_EVENT'] == 1]

In [None]:
fig_target = go.Figure()

fig_target.add_trace(go.Histogram(x=survived['DEATH_EVENT'], 
                                  name='Survived'))

fig_target.add_trace(go.Histogram(x=dead['DEATH_EVENT'], 
                                  name='Did not survive'))

fig_target.update_layout(
    yaxis_title_text='Count', 
    bargap=0.05, 
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_target.show()

## Death event - Pairplot


In [None]:
sns.pairplot(data, hue='DEATH_EVENT')

## Correlation

In [None]:
fig_corr = px.imshow(data.corr(), color_continuous_scale='peach')

fig_corr.update_layout(
    title={
        'text': "Features correlation",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'}, 
    template = 'plotly_dark',
    width=750, 
    height=600
)

fig_corr.show()
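
A heatmap is good for an overview, but for feature selection it can help to read the correlations with the target as a ranked list. A minimal sketch, assuming a numeric frame shaped like `data`; the rows in `toy` are hypothetical and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    'time':        [10, 20, 30, 200, 220, 250],
    'age':         [70, 65, 75, 60, 72, 58],
    'smoking':     [0, 1, 0, 1, 0, 1],
    'DEATH_EVENT': [1, 1, 1, 0, 0, 0],
})

# Pearson correlation of every feature with the target,
# sorted by absolute value so strong negative links are not hidden.
ranked = (toy.corr()['DEATH_EVENT']
             .drop('DEATH_EVENT')
             .sort_values(key=abs, ascending=False))

print(ranked)
```

Keep in mind that Pearson correlation only captures linear relationships, so this ranking is a rough guide rather than a selection criterion on its own.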

# Data preparation

## Feature selection

I'll use two different tests to select five features for our models (four numerical and one categorical):

- ANOVA test for numerical features

- Chi2 test for categorical features

In [None]:
numerical_features = ['age', 'creatinine_phosphokinase', 'ejection_fraction',
                      'platelets', 'serum_creatinine', 'serum_sodium',
                      'time']

categorical_features = ['anaemia', 'diabetes', 'high_blood_pressure',
                        'sex', 'smoking']

numerical_selector = SelectKBest(f_classif, k=4)

categorical_selector =  SelectKBest(chi2, k=1)

X_numerical = numerical_selector.fit_transform(data[numerical_features], 
                                                  data['DEATH_EVENT'])

X_categorical = categorical_selector.fit_transform(data[categorical_features],
                                                    data['DEATH_EVENT'])

print('Numerical features selected:', data[numerical_features].columns[numerical_selector.get_support()].to_list())

print('Categorical features selected:', data[categorical_features].columns[categorical_selector.get_support()].to_list())
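
`get_support()` only tells which columns were kept. To see how close the cut was, the fitted selector also exposes `scores_` and `pvalues_`. The sketch below uses synthetic data (one column shifted by the class label, two pure-noise columns), so the names `informative`, `noise_a` and `noise_b` are hypothetical. Note that `chi2` would additionally require non-negative feature values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 'informative' depends on the label, the other two do not.
rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1], 50)
X_toy = pd.DataFrame({
    'informative': y_toy * 2.0 + rng.normal(size=100),
    'noise_a': rng.normal(size=100),
    'noise_b': rng.normal(size=100),
})

selector = SelectKBest(f_classif, k=1).fit(X_toy, y_toy)

# ANOVA F statistic and p-value per feature, best first.
scores = pd.DataFrame({'F': selector.scores_, 'p_value': selector.pvalues_},
                      index=X_toy.columns).sort_values('F', ascending=False)

print(scores)
print('Selected:', X_toy.columns[selector.get_support()].to_list())
```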

## Defining variables

In [None]:
X_selected = data[['age', 'ejection_fraction', 'serum_creatinine', 'time', 
                   'high_blood_pressure']]

y = data['DEATH_EVENT']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, 
                                                    test_size = 0.2, 
                                                    stratify = y)
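
Passing `stratify` keeps the share of each class the same in the train and test sets, which matters when the target is imbalanced. A small sketch on a made-up 70/30 target; the `random_state` value is my own addition to make the split reproducible and is not used in the split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 70 negatives, 30 positives.
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.2,
                                          stratify=y_toy,
                                          random_state=42)

# Share of positives in each part of the split.
print('train:', y_tr.mean(), 'test:', y_te.mean())
```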

## Standardizing the data

In [None]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)

X_test_scaled = scaler.transform(X_test)
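
The scaler is fitted on the training set only and then reused on the test set, so the test data never influences the mean and standard deviation. A tiny sketch with hypothetical numbers to show what that means in practice:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test rows with two features on different scales.
X_tr = np.array([[40.0, 1.0], [60.0, 2.0], [80.0, 3.0]])
X_te = np.array([[60.0, 5.0]])

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learns mean/std from train only
X_te_scaled = scaler.transform(X_te)       # reuses the training statistics

print('train means used:', scaler.mean_)
print('scaled train mean/std:', X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
print('scaled test row:', X_te_scaled)
```

The scaled training columns have mean 0 and standard deviation 1, while the test row is only expressed relative to the training statistics, so its values are not guaranteed to be centered.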

# Machine Learning Models

###  Random Forest Classifier

In [None]:
rfc_parameters = {'n_estimators' : [10, 20, 50, 100],
                  'criterion' : ['gini', 'entropy'],
                  'max_depth' : [3, 5, 7, 9, 10]
                 }

grid_search_rfc = GridSearchCV(estimator = RandomForestClassifier(), 
                           param_grid = rfc_parameters,
                           cv = 10,
                           n_jobs = -1)

grid_search_rfc.fit(X_train_scaled, y_train)

rfc = grid_search_rfc.best_estimator_

y_pred_rfc = rfc.predict(X_test_scaled)

rfc_accuracy = accuracy_score(y_test, y_pred_rfc)

rfc_cv_score = cross_val_score(rfc, X_selected, y, cv=10).mean()

In [None]:
print(classification_report(y_test, y_pred_rfc))

### K-Nearest Neighbors

In [None]:
knn_parameters = {'n_neighbors' : [i for i in range(1, 40)]}

grid_search_knn = GridSearchCV(estimator = KNeighborsClassifier(), 
                           param_grid = knn_parameters,
                           cv = 10,
                           n_jobs = -1)

grid_search_knn.fit(X_train_scaled, y_train)

knn = grid_search_knn.best_estimator_

y_pred_knn = knn.predict(X_test_scaled)

knn_accuracy = accuracy_score(y_test, y_pred_knn)

knn_cv_score = cross_val_score(knn, X_selected, y, cv=10).mean()

In [None]:
print(classification_report(y_test, y_pred_knn))
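
One caveat: the `cross_val_score` calls above receive the unscaled `X_selected`, while the models were tuned on scaled features. For a distance-based model like KNN this can change the result. A possible alternative, sketched here on synthetic data from `make_classification`, is to wrap the scaler and the classifier in a pipeline so that scaling is refitted inside every fold. The `n_neighbors=5` value and the rescaling of the first column are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features; one column is put on a much
# larger scale, similar to how platelets compares with serum_creatinine.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_toy[:, 0] *= 1000.0

# The scaler is fitted on the training part of each fold only.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

scores = cross_val_score(knn_pipeline, X_toy, y_toy, cv=10)

print('Mean CV accuracy:', scores.mean())
```

The same pipeline object could also be passed to `GridSearchCV`, with the parameter named `kneighborsclassifier__n_neighbors`.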

# Comparing models

## With AUC

### Defining probabilities

In [None]:
no_skill = [0 for _ in range(len(y_test))]

rfc_probs = rfc.predict_proba(X_test_scaled)

rfc_probs = rfc_probs[:, 1]

knn_probs = knn.predict_proba(X_test_scaled)

knn_probs = knn_probs[:, 1]

### Evaluating AUC score

In [None]:
rfc_auc = roc_auc_score(y_test, rfc_probs)

knn_auc = roc_auc_score(y_test, knn_probs)

print('(RFC) ROC AUC score:', rfc_auc)

print('(KNN) ROC AUC score:', knn_auc)
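
As a reminder of what the score means: ROC AUC is the share of (positive, negative) pairs in which the positive sample gets the higher score, with ties counting as half. A small hand-checkable sketch with made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Pairs (positive, negative): (0.35, 0.1) ok, (0.35, 0.4) wrong,
# (0.8, 0.1) ok, (0.8, 0.4) ok  ->  3 of 4 pairs ranked correctly.
auc = roc_auc_score(y_true, y_score)

# A constant prediction ranks nothing, so every pair is a tie.
constant_auc = roc_auc_score(y_true, [0, 0, 0, 0])

print(auc, constant_auc)
```

This is also why the all-zeros `no_skill` baseline corresponds to an AUC of 0.5.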

### Defining False Positive and True Positive rates

In [None]:
ns_fpr, ns_tpr, a =  roc_curve(y_test, no_skill)

rfr_fpr, rfr_tpr, a =  roc_curve(y_test, rfc_probs)

knn_fpr, knn_tpr, a =  roc_curve(y_test, knn_probs)

### Comparing models with AUC

In [None]:
fig_auc = go.Figure()

fig_auc.add_trace(go.Scatter(x=ns_fpr, y=ns_tpr, mode='lines',line_dash='dot', 
                             name = 'No Skill (AUC = 0.5)'))

fig_auc.add_trace(go.Scatter(x=rfr_fpr, y=rfr_tpr, mode='lines', 
                             name=('RFC (AUC = %f)' %rfc_auc)))

fig_auc.add_trace(go.Scatter(x=knn_fpr, y=knn_tpr, mode='lines', 
                             name=('KNN (AUC = %f)' %knn_auc)))

fig_auc.update_layout(xaxis_title = 'False Positive Rate', 
                      yaxis_title='True Positive Rate', 
                      width=700, height=500)

fig_auc.show()

## With DataFrame

In [None]:
models = [('RFC', rfc_accuracy, rfc_cv_score), 
          ('KNN', knn_accuracy, knn_cv_score)]

model_comparison = pd.DataFrame(models, columns=['Model', 'Accuracy Score', 'CV Score'])

model_comparison.head()