## This notebook will explore and analyze the heart failure dataset and use machine learning models to predict heart failure.

In [None]:
# import dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [None]:
sns.set(rc={'figure.figsize':(8,6)})
sns.set_palette("pastel")

First let's take a look at the dataset to get an idea of what the data look like, the features of the dataset, the variable types (numerical/categorical), missing values, and some summary statistics.

In [None]:
df = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
df.head()

In [None]:
# number of rows, number of columns
df.shape

In [None]:
# check for missing values in the dataset

df.isnull().sum()

In [None]:
df.dtypes

In [None]:
# summary statistics
df.describe()

Next, let's plot the distributions of the different variables, separated into numeric and categorical variables.

In [None]:
# examine the distribution of the numeric variables

# keep a handle on the figure (chaining .tight_layout() here would return None)
fig = plt.figure(figsize = (18, 26))

plt.subplot(421)
plt.title('Age Distribution')
sns.histplot(df.age)

plt.subplot(422)
plt.title('Serum Sodium Distribution')
sns.histplot(df.serum_sodium)

plt.subplot(423)
plt.title('Creatinine Phosphokinase Distribution')
sns.histplot(df.creatinine_phosphokinase)

plt.subplot(424)
plt.title('Time Distribution')
sns.histplot(df.time)

plt.subplot(425)
plt.title('Ejection Fraction Distribution')
sns.histplot(df.ejection_fraction)

plt.subplot(426)
plt.title('Platelets Distribution')
sns.histplot(df.platelets)

plt.subplot(427)
plt.title('Serum Creatinine Distribution')
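sns.histplot(df.serum_creatinine)

# apply the padding once all subplots exist, then render the figure
plt.tight_layout(h_pad=5.0, w_pad=5.0)
plt.show()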

I notice the serum_creatinine and creatinine_phosphokinase features are heavily right-skewed, with serum_sodium left-skewed. Time does not follow a normal distribution. This will be important to keep in mind as we move forward. Now let's visualize the value counts of the categorical variables.
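Before moving on, the skew noted above can also be quantified rather than judged by eye. The following is a minimal sketch on synthetic stand-in data (the `demo` frame and its column names are hypothetical, not part of the heart failure dataset). It uses pandas' `skew()` and shows how a log transform, one common option for positive right-skewed values, changes the result.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the heart failure dataset): one right-skewed
# column and one roughly symmetric column, for illustration only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=1000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=1000),
})

# Sample skewness: positive means a long right tail, negative a long left tail.
skewness = demo.skew()
print(skewness)

# A log transform is one common way to reduce right skew in positive-valued data.
log_skew = np.log1p(demo["right_skewed"]).skew()
print("skew after log1p:", log_skew)
```

In this notebook, the equivalent call would be `df.skew()` restricted to the numeric columns.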

In [None]:
# examine the distribution of the categorical variables

fig = plt.figure(figsize = (16, 12))

# pass each column as x= so seaborn treats it as the variable to count
plt.subplot(231)
plt.title('Sex Distribution')
sns.countplot(x=df.sex)

plt.subplot(232)
plt.title('Anaemia Distribution')
sns.countplot(x=df.anaemia)

plt.subplot(233)
plt.title('Diabetes Distribution')
sns.countplot(x=df.diabetes)

plt.subplot(234)
plt.title('High Blood Pressure Distribution')
sns.countplot(x=df.high_blood_pressure)

plt.subplot(235)
plt.title('Smoking Distribution')
sns.countplot(x=df.smoking)

plt.subplot(236)
plt.title('Death Distribution')
sns.countplot(x=df.DEATH_EVENT)

# apply the padding once all subplots exist
plt.tight_layout(pad=3.0)
plt.show()

I notice the number of survivors is roughly double the number of deaths in the sample.  The same is true for non-smokers vs smokers, and normal blood pressure vs high blood pressure.
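The class ratios read off the count plots can be checked numerically. Here is a small sketch using `value_counts` on a toy label series; the values are made up for illustration and are not the dataset's actual counts.

```python
import pandas as pd

# Toy labels (hypothetical, not the real data): 4 survivors (0) and 2 deaths (1).
labels = pd.Series([0, 0, 1, 0, 1, 0], name="DEATH_EVENT")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class
ratio = counts.loc[0] / counts.loc[1]          # survivors per death

print(counts)
print(shares)
print("survivors per death:", ratio)
```

Applied to the notebook's data, `df['DEATH_EVENT'].value_counts(normalize=True)` would give the corresponding shares.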

I will now separate the features into numerical and categorical variables and standardize the numerical variables.

In [None]:
df.dtypes

In [None]:
df_cat = df[['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking', 'DEATH_EVENT']]
# .copy() so later column assignments do not trigger SettingWithCopyWarning
df_num = df[['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']].copy()

In [None]:
df_cat.head()

In [None]:
# converting age and platelets to integers for simplicity

df_num['age'] = df_num['age'].astype(int)
df_num['platelets'] = df_num['platelets'].astype(int)
df_num.head()

In [None]:
# scaling the numerical data via standardscaler

sc = StandardScaler()
df_cols = df_num.columns
df_num_scaled = sc.fit_transform(df_num)
df_num_scaled = pd.DataFrame(df_num_scaled, columns = df_cols)
df_num_scaled.head()

In [None]:
# master dataframe (unscaled)
df_master = pd.concat([df_num, df_cat], axis=1)
df_master.head()

In [None]:
# master dataframe (scaled)
df_master_scaled = pd.concat([df_num_scaled, df_cat], axis=1)
df_master_scaled.head()

Let's take a look at any correlations that exist between the features by using a correlation matrix.

In [None]:
# heatmap to identify correlations between features

fig = plt.figure(figsize = (12, 7))
sns.heatmap(df_master_scaled.corr(), center=0, cmap='mako', robust=True, annot=True)

I note moderate correlations between death (our dependent variable) and the independent variables age, ejection_fraction, serum_creatinine, serum_sodium, and time (which has the largest correlation). Let's take a look at each.
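As a complement to the heatmap, the correlations with the target can be ranked numerically. This sketch uses synthetic stand-in data with hypothetical column names; with a 0/1 target, the Pearson correlation computed here is the point-biserial correlation.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical columns, not the real dataset).
rng = np.random.default_rng(42)
n = 500
signal = rng.normal(size=n)
demo = pd.DataFrame({
    "strong_feature": signal + rng.normal(scale=0.5, size=n),
    "weak_feature": rng.normal(size=n),
    "target": (signal > 0).astype(int),
})

# Correlation of every feature with the target, ordered by absolute strength.
corr_with_target = (
    demo.corr()["target"]
    .drop("target")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

For this notebook, the same pattern would start from `df_master_scaled.corr()['DEATH_EVENT']`.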

In [None]:
median_death = df_master[df_master['DEATH_EVENT']==1]['age'].median()
median_life = df_master[df_master['DEATH_EVENT']==0]['age'].median()
print("Median Age for Death: ", median_death, '\nMedian age for Survivor: ', median_life, '\nDifference: ', median_death-median_life)

ax = sns.violinplot(data=df_master, x='DEATH_EVENT', y='age')
ax.set_title('Age of Deaths vs Survivors', fontsize=20)
ax.set_xlabel('Death Status', fontsize=14)
ax.set_ylabel('Age', fontsize=14)
ax.set_xticklabels(['Survivor', 'Death'], fontsize=14)
plt.show()

The median age of patients who died is roughly 5 years higher than that of survivors (65 vs 60 years).

In [None]:
female_deaths = len(df_master[(df_master['DEATH_EVENT']==1) & (df_master['sex']==0)])/len(df_master[df_master['sex']==0])
male_deaths = len(df_master[(df_master['DEATH_EVENT']==1) & (df_master['sex']==1)])/len(df_master[df_master['sex']==1])
print("Proportion of Female Deaths: ", female_deaths, '\nProportion of Male Deaths: ', male_deaths, '\nDifference: ', female_deaths-male_deaths)

ax = sns.countplot(data=df_master, x='sex', hue='DEATH_EVENT')
ax.set_title('Number of Deaths by Sex', fontsize=20)
ax.set_xlabel('Sex', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
ax.set_xticklabels(['Female', 'Male'], fontsize=14)
plt.show()

Roughly the same proportion of males and females die of heart failure.
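The per-group proportions above can also be computed with a `groupby`. This is a minimal sketch on toy data with made-up values; it relies on the fact that the mean of a 0/1 column equals the share of ones.

```python
import pandas as pd

# Toy data (hypothetical values): sex is 0/1, DEATH_EVENT is 0/1.
toy = pd.DataFrame({
    "sex":         [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "DEATH_EVENT": [1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the proportion of ones, so this gives the
# death rate within each sex group.
death_rate_by_sex = toy.groupby("sex")["DEATH_EVENT"].mean()
print(death_rate_by_sex)
```

The same one-liner on `df_master` would give the two proportions printed in the cell above.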

In [None]:
median_death = df_master[df_master['DEATH_EVENT']==1]['ejection_fraction'].median()
median_life = df_master[df_master['DEATH_EVENT']==0]['ejection_fraction'].median()
print("Median Ejection Fraction for Death: ", median_death, '\nMedian ejection Fraction for Survivors: ', median_life, '\nDifference: ', median_death-median_life)

ax = sns.violinplot(data=df_master, x='DEATH_EVENT', y='ejection_fraction')
ax.set_title('Ejection Fraction Deaths vs Survivors', fontsize=20)
ax.set_xticklabels(['Survivor', 'Death'], fontsize=14)
ax.set_xlabel('Death Status', fontsize=14)
ax.set_ylabel('Ejection Fraction', fontsize=14)
plt.show()

In our sample, lower ejection fraction seems to be associated with a greater chance of death.

In [None]:
median_death = df_master[df_master['DEATH_EVENT']==1]['serum_creatinine'].median()
median_life = df_master[df_master['DEATH_EVENT']==0]['serum_creatinine'].median()
print("Median Death: ", median_death, '\nMedian Life: ', median_life, '\nDifference: ', median_death-median_life)

ax = sns.violinplot(data=df_master, x='DEATH_EVENT', y='serum_creatinine')
ax.set_title('Serum Creatinine Deaths vs Survivors', fontsize=20)
ax.set_xticklabels(['Survivor', 'Death'], fontsize=14)
ax.set_ylabel('Serum Creatinine', fontsize=14)
ax.set_xlabel('Death Status', fontsize=14)
plt.show()

Serum creatinine levels seem very similar between the two groups, though the deaths have many more outliers on the high end.

In [None]:
median_death = df_master[df_master['DEATH_EVENT']==1]['time'].median()
median_life = df_master[df_master['DEATH_EVENT']==0]['time'].median()
print("Median Death: ", median_death, '\nMedian Life: ', median_life, '\nDifference: ', median_death-median_life)

ax = sns.violinplot(data=df_master, x='DEATH_EVENT', y='time')
ax.set_title('Time Deaths vs Survivors', fontsize=20)
ax.set_xticklabels(['Survivor', 'Death'], fontsize=14)
ax.set_ylabel('Time', fontsize=14)
ax.set_xlabel('Death Status', fontsize=14)
plt.show()

We can clearly see that the time variable is widely dispersed and does not follow a normal distribution, especially for survivors. Now let's perform some multivariate analysis to see if there is anything interesting there.

In [None]:
ax = sns.scatterplot(x=df_master['serum_creatinine'], y=df_master['age'], hue=df_master['DEATH_EVENT'])
ax.set_title('Serum Creatinine vs Age', fontsize=20)
#ax.set_xticklabels(['Survivor', 'Death'], fontsize=14)
ax.set_ylabel('Age', fontsize=14)
ax.set_xlabel('Serum Creatinine', fontsize=14)
plt.show()

Across all age levels we see a greater concentration of survivors having lower serum creatinine levels, with many deaths seeming to be associated with slightly greater levels of serum creatinine.

In [None]:
ax = sns.scatterplot(data=df_master, x='age', y='ejection_fraction', hue='DEATH_EVENT')
ax.set_title('Ejection Fraction vs Age', fontsize=20)
# age is on the x-axis and ejection fraction on the y-axis
ax.set_xlabel('Age', fontsize=14)
ax.set_ylabel('Ejection Fraction', fontsize=14)
plt.show()

No clear trends here, though it seems there is a greater concentration of deaths at lower ejection fraction levels at all ages.

In [None]:
ax = sns.scatterplot(data=df_master, x='age', y='serum_sodium', hue='DEATH_EVENT')
ax.set_title('Serum Sodium vs Age', fontsize=20)
# age is on the x-axis and serum sodium on the y-axis
ax.set_xlabel('Age', fontsize=14)
ax.set_ylabel('Serum Sodium', fontsize=14)
plt.show()

In [None]:
ax = sns.violinplot(data=df_master, x='sex', y='ejection_fraction', hue='DEATH_EVENT')
ax.set_title('Sex vs Ejection Fraction', fontsize=20)
ax.set_xticklabels(['Female', 'Male'], fontsize=12)
ax.set_ylabel('Ejection Fraction', fontsize=14)
ax.set_xlabel('Sex', fontsize=14)
plt.show()

It appears females in the sample have a slightly higher ejection fraction on average than males, and the gap is slightly more notable among the deaths. The female deaths have a more dispersed ejection fraction than the male deaths. Lower ejection fraction appears slightly correlated with a higher chance of death, and this seems true for both males and females.

In [None]:
g = sns.FacetGrid(df, row='sex', col='diabetes')
# fix the category order so every facet shows the bars in the same positions
g.map(sns.countplot, 'DEATH_EVENT', order=[0, 1])
g.set_axis_labels("Death Event", "Count")

The ratio of deaths/survivors for males without diabetes is roughly 2x, compared to ~2.5x for males with diabetes.  For females, the ratio of deaths/survivors is higher in people without diabetes vs with diabetes.
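Ratios like these are easier to read from a table than from a grid of bar charts. The sketch below uses `pd.crosstab` on toy data (values are made up for illustration) to show both raw counts and row-normalised shares per sex/diabetes group.

```python
import pandas as pd

# Toy data (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "sex":         [0, 0, 0, 0, 1, 1, 1, 1],
    "diabetes":    [0, 0, 1, 1, 0, 0, 1, 1],
    "DEATH_EVENT": [0, 1, 0, 0, 0, 1, 1, 1],
})

# Rows: (sex, diabetes) groups; columns: DEATH_EVENT; cells: raw counts.
counts = pd.crosstab([toy["sex"], toy["diabetes"]], toy["DEATH_EVENT"])

# Same table, normalised so each row sums to 1 (share of survivors and deaths).
rates = pd.crosstab(
    [toy["sex"], toy["diabetes"]], toy["DEATH_EVENT"], normalize="index"
)
print(counts)
print(rates)
```

Running the same two calls on `df` would give the exact counts behind the facet grid.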

# Data Preprocessing / Modeling

I will first drop the variables that seem to be the least correlated to the dependent variable.

In [None]:
X = df_master_scaled.drop(['DEATH_EVENT', 'sex', 'anaemia', 'diabetes', 'high_blood_pressure', 'platelets', 'smoking'], axis=1)
# use a 1-D Series as the target; scikit-learn warns when y is a single-column DataFrame
y = df['DEATH_EVENT']

In [None]:
X.head()

In [None]:
y.head()

Split the data into training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state=1)
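One caveat with the split above: the `StandardScaler` was fit on the full dataset earlier, so the test rows influenced the scaling statistics. A common alternative, sketched here on synthetic data from `make_classification` (not the heart failure dataset), is to split first and let a pipeline fit the scaler on the training rows only. The `stratify` argument is an additional assumption on my part; it keeps the class ratio equal across the splits, which may help with an imbalanced target.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (NOT the heart failure dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.67, 0.33], random_state=1,
)

# stratify keeps the class ratio the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1
)

# The pipeline fits the scaler on the training rows only, then reuses those
# training means and standard deviations when transforming the test rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print("test accuracy:", test_accuracy)
```

Tree-based models are largely insensitive to feature scaling, so this matters mainly for the logistic regression model.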

In [None]:
X_train

In [None]:
y_train

In [None]:
# Import the models to be used

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost

# Import the evaluation methodologies to be used
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
# create a dictionary containing the various models (method borrowed from another Kaggle user - currently searching for original author to provide credit)

model_list = dict()
model_list['Decision Tree'] = DecisionTreeClassifier(class_weight={0:1,1:2})
model_list['Random Forest'] = RandomForestClassifier(class_weight={0:1,1:2})
model_list['Logreg'] = LogisticRegression()
model_list['GradientBoost'] = GradientBoostingClassifier()
model_list['AdaBoost'] = AdaBoostClassifier()
model_list['XGBoost'] = xgboost.XGBClassifier()

In [None]:
# iterate through the models in the dictionary and fit the training data to each model
for model in model_list:
    model_list[model].fit(X_train, y_train)
    print(model + ' : fit')

In [None]:
# iterate through the models in the dictionary and print a classification report to evaluate the models

print("Train set prediction")
for item in model_list:
        
    print(item)
    model = model_list[item]
    y_train_pred = model.predict(X_train)
    print(confusion_matrix(y_train, y_train_pred))
    print(classification_report(y_train, y_train_pred))

In [None]:
# confusion matrix for the logistic regression

model = model_list['Logreg']
y_train_pred = model.predict(X_train)
arg_train = {'y_true':y_train, 'y_pred':y_train_pred}
sns.heatmap(confusion_matrix(**arg_train), annot=True, cmap='mako')

In [None]:
# confusion matrix for the adaboost model

model = model_list['AdaBoost']
y_train_pred = model.predict(X_train)
arg_train = {'y_true':y_train, 'y_pred':y_train_pred}
sns.heatmap(confusion_matrix(**arg_train), annot=True, cmap='mako')

In [None]:
#now the testing set

print("Test set prediction")
for item in model_list:
        
    print("                         "+item)
    model = model_list[item]
    y_test_pred = model.predict(X_test)
    print(confusion_matrix(y_test, y_test_pred))
    print(classification_report(y_test, y_test_pred))

In [None]:
# confusion matrices for each model in model_list

for item in model_list:
        
    #print(item)
    model = model_list[item]
    y_test_pred = model.predict(X_test)
    ax = sns.heatmap(confusion_matrix(y_test, y_test_pred), annot=True, cmap='mako')
    ax.set_title('Confusion Matrix: '+ item)
    plt.show()

It seems the XGBoost model has the highest average recall score of the models on the test set, followed closely by the random forest and logistic regression models.  Now let's plot the feature importances for each model.
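Before that, comparing recall across several classification reports by eye is error-prone, so it can help to collect the scores in one table. The sketch below does this for two models on synthetic data; `demo_models` and the data are hypothetical stand-ins for `model_list` and the notebook's train/test split.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the heart failure dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Hypothetical stand-in for the notebook's model_list dictionary.
demo_models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logreg": LogisticRegression(),
}

rows = []
for name, clf in demo_models.items():
    y_pred = clf.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        # recall for the positive class only
        "recall_class_1": recall_score(y_te, y_pred, pos_label=1),
        # unweighted mean of the per-class recalls
        "recall_macro": recall_score(y_te, y_pred, average="macro"),
    })

summary = (
    pd.DataFrame(rows).set_index("model").sort_values("recall_macro", ascending=False)
)
print(summary)
```

Macro recall corresponds to the "macro avg" row of `classification_report`.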

In [None]:
#plot graph of feature importances
for item in model_list:
        
    # LogisticRegression has no feature_importances_; compare strings with !=, not `is not`
    if item != "Logreg":
        feat_importances = pd.Series(model_list[item].feature_importances_, index=X.columns)
        ax = feat_importances.nlargest(10).plot(kind='barh')
        ax.set_title('Feature Importances: '+ item)
        plt.show()



Time appears to be the most relevant feature in each of the models, significantly outweighing the other features in some of them.  Creatinine phosphokinase and serum creatinine are much more important in the AdaBoost model.
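The loop above skips logistic regression because it has no `feature_importances_` attribute. One rough analogue, assuming the inputs are standardized so the coefficients are on a comparable scale, is the absolute value of the fitted coefficients. The sketch below illustrates this on synthetic data with hypothetical feature names; it is not equivalent to tree-based importances and should be read only as an approximate ranking.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical feature names.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3"]

# Standardize so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X_demo)
logreg = LogisticRegression().fit(X_scaled, y_demo)

# coef_ has shape (1, n_features) for a binary problem; take the first row.
coef_magnitude = (
    pd.Series(logreg.coef_[0], index=feature_names).abs().sort_values()
)
print(coef_magnitude)
```

In the notebook, `model_list['Logreg'].coef_[0]` indexed by `X.columns` could be plotted with `.plot(kind='barh')` alongside the other models.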

I hope you enjoyed my exploratory analysis and basic model development with the heart failure dataset.  There is quite a bit more that can be done to further explore these data and improve the models, which I will likely explore in further revisions to this analysis.  Any feedback/suggestions to help improve my work would be greatly appreciated!