# Heart Failure Project

Description: In this project we will use 12 factors (among them age, blood pressure and smoking) to predict heart failure in various patients. For this purpose, we will explore the Kaggle dataset "Heart Failure Prediction".

## 1. Loading Data

In [None]:
#import necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#load data
df=pd.read_csv("../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")

In [None]:
#check the beginning
df.head()

## 2. Data Cleaning

In [None]:
#check for missing values
df.isnull().sum()    

# we notice that there are no missing values

In [None]:
# investigate data types
df.info()

We notice that the columns for __anaemia__, __diabetes__, __high blood pressure__, __sex__, __smoking__ and __death__ hold categorical values but are stored as int64. We now convert them to the category type.


In [None]:
#change data type for the indicated columns

cat_cols=["anaemia","diabetes","high_blood_pressure","sex","smoking","DEATH_EVENT"]

df[cat_cols]=df[cat_cols].astype("category")


In [None]:
#rename DEATH_EVENT column
df.rename({"DEATH_EVENT":"death"},axis=1,inplace=True)

In [None]:
#check statistical measures of the data
df.describe()

## 3. Exploratory Data Analysis

### 3.1 Correlation matrix

In [None]:
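#matplotlib and seaborn were already imported in section 1

#create correlation matrix
#select the numerical columns explicitly; since pandas 2.0, corr() no longer drops other dtypes by default
cor_mat=df.select_dtypes(include="number").corr()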

# set figure size
plt.figure(figsize=(9,7))

#create the heatmap
ax=sns.heatmap(cor_mat,cmap="Blues",linewidths=2, linecolor='black',annot=True)
ax.set_ylim([0,7])
plt.show()

The heatmap shows no strong correlation between any two variables. Furthermore, we can see that only the numerical variables have been included.
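To illustrate why the categorical columns are absent, here is a small sketch on a made-up toy frame (the values and the name `toy` are assumptions, not taken from the dataset). It shows that selecting numeric dtypes drops a `category` column, and that casting the column back to integers would bring it into the correlation matrix.

```python
import pandas as pd

# Toy frame with made-up values: one numeric column and one binary column
toy = pd.DataFrame({"age": [50.0, 60.0, 70.0, 80.0], "smoking": [0, 1, 0, 1]})
toy["smoking"] = toy["smoking"].astype("category")

# Only numeric dtypes survive the selection, so "smoking" drops out
numeric_part = toy.select_dtypes(include="number")
print(list(numeric_part.columns))

# Casting the category back to int makes it part of the correlation matrix
with_binary = toy.assign(smoking=toy["smoking"].astype(int))
cor_toy = with_binary.corr()
print(cor_toy.round(2))
```

Whether a binary flag belongs in a Pearson correlation matrix is a modelling choice; the sketch only shows the mechanics.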

### 3.2 Histograms

We will use histograms to get an idea of the distribution of the numerical variables.

In [None]:
#set the style of the plots
sns.set_style("darkgrid")

In [None]:
#plot age
plt.figure(figsize=(6,6))
sns.histplot(df.age,bins=10,kde=True)
plt.xlabel("Age of the patients")
plt.ylabel("Number of patients")

In [None]:
#plot creatinine_phosphokinase
plt.figure(figsize=(6,6))
sns.histplot(df.creatinine_phosphokinase,bins=10,kde=True)
plt.xlabel("Level of creatinine phosphokinase")
plt.ylabel("Number of patients")

In [None]:
#plot ejection_fraction
plt.figure(figsize=(6,6))
sns.histplot(df.ejection_fraction,bins=10,kde=True)
plt.xlabel("Ejection fraction")
plt.ylabel("Number of patients")

In [None]:
#plot platelets
plt.figure(figsize=(6,6))
sns.histplot(df.platelets,bins=10,kde=True)
plt.xlabel("Concentration of platelets")
plt.ylabel("Number of patients")

In [None]:
#plot serum_creatinine
plt.figure(figsize=(6,6))
sns.histplot(df.serum_creatinine,bins=10,kde=True)
plt.xlabel("Level of creatinine")
plt.ylabel("Number of patients")

In [None]:
#plot serum_sodium
plt.figure(figsize=(6,6))
sns.histplot(df.serum_sodium,bins=10,kde=True)
plt.xlabel("Level of sodium")
plt.ylabel("Number of patients")

### 3.3 Countplots

In [None]:
#plot anaemia for each sex
#catplot is a figure-level function and creates its own figure
sns.catplot(x="anaemia",hue="death",data=df,kind="count",col="sex",palette="colorblind")
plt.show()

In [None]:
#plot diabetes for each sex
#catplot is a figure-level function and creates its own figure
sns.catplot(x="diabetes",hue="death",data=df,kind="count",col="sex",palette="colorblind")
plt.show()

In [None]:
#plot smoking for each sex
#catplot is a figure-level function and creates its own figure
sns.catplot(x="smoking",hue="death",data=df,kind="count",col="sex",palette="colorblind")
plt.show()

In [None]:
#plot high_blood_pressure for each sex
#catplot is a figure-level function and creates its own figure
sns.catplot(x="high_blood_pressure",hue="death",data=df,kind="count",col="sex",palette="colorblind")
plt.show()

### 3.4 Relational Plots

In [None]:
#the graphs in this section will be produced using plotly
import plotly.express as px

In [None]:
#plot platelets vs creatinine_phosphokinase
fig=px.scatter(df,x="platelets",y="creatinine_phosphokinase",color="death",template="plotly_dark",width=1000,height=500)
fig.update_traces(marker=dict(size=12, line=dict(width=1,color='LightBlue')),selector=dict(mode='markers'))

In [None]:
#plot serum_creatinine vs serum_sodium
fig=px.scatter(df,x="serum_creatinine",y="serum_sodium",color="death",template="plotly_dark",width=1000,height=500)
fig.update_traces(marker=dict(size=12, line=dict(width=1,color='LightBlue')),selector=dict(mode='markers'))

In [None]:
#plot serum_creatinine vs creatinine_phosphokinase 
fig=px.scatter(df,x="serum_creatinine",y="creatinine_phosphokinase",color="death",template="plotly_dark",width=1000,height=500)
fig.update_traces(marker=dict(size=12, line=dict(width=1,color='LightBlue')),selector=dict(mode='markers'))

### 3.5 Boxplots 

In [None]:
# boxplot of the age variable 
px.box(df, x="smoking", y="age",width=1000,height=500,facet_col="high_blood_pressure",color_discrete_sequence=['darkorchid']
)

In [None]:
# boxplot of the ejection_fraction variable 
px.box(df, x="smoking", y="ejection_fraction",width=1000,height=500,color="high_blood_pressure",
       color_discrete_sequence=['crimson','yellow']
)

In [None]:
# boxplot of the creatinine_phosphokinase variable
px.box(df, x="smoking", y="creatinine_phosphokinase",width=1000,height=500,color="high_blood_pressure",
       color_discrete_sequence=['darkgreen','blue']
)

In [None]:
# boxplot of the platelets variable 
px.box(df, x="smoking", y="platelets", color="sex",width=1000,height=500)

## 4. Model creation

In this part we will first scale our data and split it into train and test sets. Next, we will use several classification algorithms and record their performance:

1. K-Neighbors
2. Logistic Regression
3. Random Forest
4. AdaBoost
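Scaling the whole dataset before splitting lets the test rows influence the scaling statistics. One possible alternative, sketched here on synthetic data rather than on the notebook's `df`, is to put the scaler and the classifier into a scikit-learn pipeline so that the scaler is fit on the training rows only. The names `X_demo`, `y_demo` and `pipe`, as well as the sample size and class weights, are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the patient data (sizes chosen arbitrarily)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, weights=[0.7], random_state=10
)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=10
)

# The pipeline fits the scaler on the training rows only,
# then applies the same transformation to the test rows at predict time
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print("Test accuracy on the synthetic data: {:.3f}".format(test_accuracy))
```

A pipeline like this can also be passed to `GridSearchCV`, in which case the scaler is refit inside every cross-validation fold.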

In [None]:
#scaling our data

from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()

columns=["age","creatinine_phosphokinase","ejection_fraction","platelets","serum_creatinine","serum_sodium","time"]

#StandardScaler standardizes each column independently, so no loop is needed
df[columns]=scaler.fit_transform(df[columns])

In [None]:
#split into train and test

from sklearn.model_selection import train_test_split

X=df.drop("death",axis=1)
y=df.death

X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=10,test_size=0.3,stratify=y)

In [None]:
#import modules needed for hyperparameter tuning
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV,cross_val_score

#import modules needed for performance check
from sklearn.metrics import confusion_matrix, classification_report

### 4.1 K-Neighbors

In [None]:
#import the model
from sklearn.neighbors import KNeighborsClassifier

#instantiate the classifier
knn=KNeighborsClassifier()

#define parameter range
param_grid_knn={"n_neighbors":range(1,15)}

#run gridsearch 
cv_knn=GridSearchCV(knn,param_grid_knn,cv=10)
cv_knn.fit(X_train,y_train)

In [None]:
#get the best estimator
best_knn=cv_knn.best_estimator_

#get the predicted classes
y_pred_knn=best_knn.predict(X_test)

#get the score for the test set
print("The score for the tuned KNN model is {}".format(best_knn.score(X_test,y_test)))

In [None]:
#get the metrics for the two classes
print(pd.DataFrame(classification_report(y_test,y_pred_knn,output_dict=True)))

In [None]:
#plot the heatmap corresponding to the confusion matrix
ax=sns.heatmap(confusion_matrix(y_test,y_pred_knn),annot=True,cmap="GnBu")
ax.set_ylim([0,2])

As we can see, the K-Neighbors model does not perform well. As the number of neighbors considered increases, the algorithm
tends to classify all samples as 0 due to class imbalance. This leads to high recall for the majority class (i.e. __0__) and
low recall and precision for the minority class (i.e. __1__).
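The effect of always predicting the majority class can be checked on a few made-up labels. The arrays below are hypothetical and are not taken from the notebook's test set; they only show how accuracy can look acceptable while the recall for class 1 collapses.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 7 patients in class 0, 3 patients in class 1
y_true = np.array([0] * 7 + [1] * 3)

# A classifier that always predicts the majority class
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)                  # 7 of 10 correct
recall_0 = recall_score(y_true, y_majority, pos_label=0)  # all class-0 cases found
recall_1 = recall_score(y_true, y_majority, pos_label=1)  # no class-1 case found

print(acc, recall_0, recall_1)
```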

### 4.2 Logistic Regression

In [None]:
#import the model
from sklearn.linear_model import LogisticRegression

#instantiate the classifier
logreg=LogisticRegression(random_state=10,solver='liblinear')

#define parameter range
param_grid_logreg={"C":np.logspace(-4, 4, 20),'penalty' : ['l1', 'l2']}

#run gridsearch 
cv_logreg=GridSearchCV(logreg,param_grid_logreg,cv=10)
cv_logreg.fit(X_train,y_train)

In [None]:
#get the best estimator
best_logreg=cv_logreg.best_estimator_

#get the predicted classes
y_pred_logreg=best_logreg.predict(X_test)

#get the score for the test set
print("The score for the tuned Logistic Regression is {}".format(best_logreg.score(X_test,y_test)))

In [None]:
#get the metrics for the two classes
print(pd.DataFrame(classification_report(y_test,y_pred_logreg,output_dict=True)))

In [None]:
#plot the heatmap corresponding to the confusion matrix
ax=sns.heatmap(confusion_matrix(y_test,y_pred_logreg),annot=True,cmap="GnBu")
ax.set_ylim([0,2])

We can see that Logistic Regression does a significantly better job. It has not only a higher score on the test set, but also
higher precision and recall for each class.

### 4.3 Random Forest

In [None]:
#import the model
from sklearn.ensemble import RandomForestClassifier

#instantiate the classifier
rf=RandomForestClassifier(random_state=10)

#define parameter range
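#"auto" is left out of max_features: for classifiers it was an alias of "sqrt" and was removed in scikit-learn 1.3
param_grid_rf={"n_estimators":range(100,401,50),"criterion":["gini","entropy"],"max_depth":range(2,10),
               "min_samples_leaf":np.arange(0.1,0.51,0.1),"max_features":["sqrt","log2"]
              }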

#run randomized search (random_state makes the sampled parameter combinations reproducible)
cv_rf=RandomizedSearchCV(rf,param_grid_rf,cv=10,random_state=10)
cv_rf.fit(X_train,y_train)

In [None]:
#get the best estimator
best_rf=cv_rf.best_estimator_

#get the predicted classes
y_pred_rf=best_rf.predict(X_test)

#get the score for the test set
print("The score for the tuned Random Forest is {}".format(best_rf.score(X_test,y_test)))

In [None]:
#get the metrics for the two classes
print(pd.DataFrame(classification_report(y_test,y_pred_rf,output_dict=True)))

In [None]:
#plot the heatmap corresponding to the confusion matrix
ax=sns.heatmap(confusion_matrix(y_test,y_pred_rf),annot=True,cmap="GnBu")
ax.set_ylim([0,2])

Compared to the Logistic Regression model, the Random Forest has a higher score on the test set. However, the recall for class __1__ is particularly low, which is confirmed by the confusion matrix.
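As a reminder of how recall and precision are read off a confusion matrix, here is a small worked example on made-up labels (the lists `y_true_demo` and `y_pred_demo` are hypothetical and are not the notebook's predictions).

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 6 patients in class 0, 4 patients in class 1
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

recall_1 = tp / (tp + fn)          # 2 / (2 + 2) = 0.5
precision_1 = tp / (tp + fp)       # 2 / (2 + 1), roughly 0.67
accuracy = (tp + tn) / len(y_true_demo)  # (2 + 5) / 10 = 0.7

print(recall_1, precision_1, accuracy)
```

In this toy case the accuracy is 0.7 even though half of the class-1 patients are missed, which is the pattern described above.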

### 4.4 AdaBoost

In [None]:
#import the model
from sklearn.ensemble import AdaBoostClassifier

#instantiate the classifier
ada=AdaBoostClassifier(random_state=10)

#range for number of estimators
param_grid_ada={"n_estimators":range(50,401,50)}

#run the gridsearch
cv_ada=GridSearchCV(ada,param_grid_ada,cv=3)
cv_ada.fit(X_train,y_train)

In [None]:
#get the best estimator
best_ada=cv_ada.best_estimator_
best_ada

In [None]:
#get the predicted classes
y_pred_ada=best_ada.predict(X_test)

#get the score for the test set
print("The score for the tuned AdaBoost model is {}".format(best_ada.score(X_test,y_test)))

This is the best score we have obtained so far on the test set.

In [None]:
#get the metrics for the two classes
print(pd.DataFrame(classification_report(y_test,y_pred_ada,output_dict=True)))

In [None]:
#plot the heatmap corresponding to the confusion matrix
ax=sns.heatmap(confusion_matrix(y_test,y_pred_ada),annot=True,cmap="GnBu")
ax.set_ylim([0,2])

The AdaBoost model has fairly good results, having misclassified only a relatively small number of patients.

## 5. Final Comments

It appears that AdaBoost has the highest performance compared to the other three models we fitted.
However, apart from the K-Neighbors algorithm, they all generated fairly good results, with accuracies on the test set of over 80%.
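A side-by-side comparison is easier to read as a table. The sketch below fits the four model types with default settings on synthetic data and collects accuracy and class-1 recall in one DataFrame. The data, the names `models` and `summary`, and the untuned settings are assumptions for illustration; the numbers it prints say nothing about the results on the heart failure dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the patient data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, weights=[0.7], random_state=10
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=10
)

# Untuned models, one per algorithm used in section 4
models = {
    "K-Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(solver="liblinear", random_state=10),
    "Random Forest": RandomForestClassifier(random_state=10),
    "AdaBoost": AdaBoostClassifier(random_state=10),
}

rows = []
for name, model in models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, y_hat),
        "recall_1": recall_score(y_te, y_hat, pos_label=1),
    })

# One row per model, best accuracy first
summary = pd.DataFrame(rows).set_index("model").sort_values("accuracy", ascending=False)
print(summary)
```

In the notebook itself, the same loop could run over the tuned estimators `best_knn`, `best_logreg`, `best_rf` and `best_ada` with `X_test` and `y_test`.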

Methods of this type could provide effective support to doctors in assessing the severity of a patient's condition. However, more sophisticated models should be built for them to be reliable.
