# Introduction
![](https://www.clearlake-specialties.com/wp-content/uploads/SystolicDiastolic_Heartfailure.5518685646fab-e1553788922847.png)

 

Heart failure is a **progressive condition** in which the heart muscle is injured by something like a **heart attack** or high blood pressure and **gradually loses its ability to pump enough blood to supply the body’s needs**. The heart can be affected in two ways: it can become weak and unable to pump blood (systolic heart failure), or it can become stiff and unable to fill with blood adequately (diastolic heart failure).

Ultimately, both conditions lead to retention of extra fluid, or congestion, so when patients develop symptoms we call it congestive heart failure.

Heart failure is very common. Although we have made progress in the treatment of many forms of heart disease, heart failure is a growing problem in the United States. Current estimates are that nearly 6.5 million Americans over the age of 20 have heart failure. One major study estimates there are 960,000 new heart failure cases annually. Heart failure is also a major killer: it directly accounts for about 8.5% of all heart disease deaths in the United States and, by some estimates, contributes to about 36% of all cardiovascular disease deaths.

### Motivation:
 * Try to understand what causes heart failure.
 * Explore the data through some EDA and data visualization.
 * Try to detect and extract relevant features in order to build a prediction model.


### The plan:

- [Libraries](#Libraries)

- [Load data and first look](#Load-data-and-first-look)

- [Data analysis](#Data-analysis)

- [Feature selection](#Feature-selection)

- [Modeling](#Modeling)


# Libraries

In [None]:
# Disable warnings in Anaconda
import warnings
warnings.filterwarnings('ignore')

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import sklearn

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.figure_factory as ff


from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import accuracy_score, classification_report, roc_curve,precision_recall_curve, auc,confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

from xgboost import XGBClassifier

# Load data and first look

In [None]:
df = pd.read_csv("/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")

In [None]:
df.head()

In [None]:
df.info()

* Sex - Gender of patient: Male = 1, Female = 0
* Age - Age of patient
* Diabetes - 0 = No, 1 = Yes
* Anaemia - 0 = No, 1 = Yes
* High_blood_pressure - 0 = No, 1 = Yes
* Smoking - 0 = No, 1 = Yes
* DEATH_EVENT - 0 = No, 1 = Yes

From the dataframe info we observe that all columns are numerical and have no missing values, which will make our work easier.

In [None]:
df.isnull().sum()

No missing values.

<h2 class="list-group-item list-group-item-action active">Distribution of numerical values</h2>

- A CPK blood test measures the different forms of creatine phosphokinase (CPK) in the bloodstream. The normal range differs between males and females: 39 – 308 U/L for males and 26 – 192 U/L for females,

- The reference range for serum sodium is 135-147 mmol/L,

- Results of the creatinine blood test are measured in milligrams per deciliter or micromoles per liter. The normal range for creatinine in the blood may be 0.84 to 1.21 milligrams per deciliter (74.3 to 107 micromoles per liter), although this can vary from lab to lab, between men and women, and by age. Since the amount of creatinine in the blood increases with muscle mass, men usually have higher creatinine levels than do women.
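
The reference ranges above can be turned into a quick check of how many measurements fall outside them. The sketch below is only illustrative: the helper name `flag_out_of_range` and the demo rows are made up, and it assumes the columns are expressed in the same units as the quoted ranges, which should be verified against the dataset description.

```python
import pandas as pd

# Reference ranges quoted in the text above (they can vary between labs).
CPK_RANGE = {1: (39, 308), 0: (26, 192)}  # keyed by sex: 1 = male, 0 = female
SODIUM_RANGE = (135, 147)
CREATININE_RANGE = (0.84, 1.21)


def flag_out_of_range(frame):
    """Return a boolean DataFrame: True where a value lies outside its range."""
    cpk_low = frame["sex"].map(lambda s: CPK_RANGE[s][0])
    cpk_high = frame["sex"].map(lambda s: CPK_RANGE[s][1])
    return pd.DataFrame(
        {
            "cpk_abnormal": ~frame["creatinine_phosphokinase"].between(cpk_low, cpk_high),
            "sodium_abnormal": ~frame["serum_sodium"].between(*SODIUM_RANGE),
            "creatinine_abnormal": ~frame["serum_creatinine"].between(*CREATININE_RANGE),
        },
        index=frame.index,
    )


# Made-up rows for illustration only (not taken from the dataset).
demo = pd.DataFrame(
    {
        "sex": [1, 0, 1],
        "creatinine_phosphokinase": [250, 250, 40],
        "serum_sodium": [140, 130, 147],
        "serum_creatinine": [1.0, 1.9, 0.8],
    }
)
flags = flag_out_of_range(demo)
print(flags)
print(flags.mean())  # share of out-of-range values per marker
```

On the real data, `flag_out_of_range(df).mean()` would give the share of patients outside each range, under the unit assumption stated above.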

In [None]:
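def plot_hist(col, bins=30, title="", xlabel="", ax=None):
    # sns.distplot is deprecated; histplot with kde=True draws the histogram plus a density curve
    sns.histplot(col, bins=bins, kde=True, stat="density", ax=ax)
    ax.set_title(f'Histogram of {title}', fontsize=20)
    ax.set_xlabel(xlabel)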

In [None]:
fig, axes = plt.subplots(3,2,figsize=(20,20),constrained_layout=True)
plot_hist(df.creatinine_phosphokinase,
          title='Creatinine Phosphokinase',
          xlabel="Level of the CPK (mcg/L)",
          ax=axes[0,0])
plot_hist(df.platelets,
          bins=30,
          title='Platelets',
          xlabel='Platelets in the blood (kiloplatelets/mL)',
          ax=axes[0,1])
plot_hist(df.serum_creatinine,
          title='Serum Creatinine', 
          xlabel='Level of serum creatinine in the blood (mg/dL)',
          ax=axes[1,0])
plot_hist(df.serum_sodium,
          bins=30,
          title='Serum Sodium',
          xlabel='Level of serum sodium in the blood (mEq/L)',
          ax=axes[1,1])
plot_hist(df.ejection_fraction,
          title='Ejection Fraction', 
          xlabel='Percentage of blood leaving the heart at each contraction (percentage)',
          ax=axes[2,0])
plot_hist(df.time,
          bins=30,
          title='Time',
          xlabel='Follow-up period (days)',
          ax=axes[2,1])
plt.show()

In [None]:
fig = px.histogram(df, x="age",color="DEATH_EVENT")
fig.show()

# Data analysis

<h2 class="list-group-item list-group-item-action active">Let's investigate how the features are related to heart failure</h2>


In [None]:
# sex is encoded as Male = 1, Female = 0
len_data = len(df)
len_w = len(df[df["sex"]==0])
len_m = len_data - len_w

men_died = len(df.loc[(df["DEATH_EVENT"]==1) & (df['sex']==1)])
men_survived = len_m - men_died

women_died = len(df.loc[(df["DEATH_EVENT"]==1) & (df['sex']==0)])
women_survived = len_w - women_died

labels = ['Men died','Men survived','Women died','Women survived']
values = [men_died, men_survived, women_died, women_survived]

fig = go.Figure(data=[go.Pie(labels=labels, values=values,textinfo='label+percent',hole=0.4)])
fig.update_layout(
    title_text="Distribution of DEATH EVENT according to their gender")
fig.show()

In [None]:
# values='sex' sums the sex column (Male = 1), so each slice counts the male patients per outcome
fig = px.pie(df, values='sex', names='DEATH_EVENT',color_discrete_sequence=px.colors.sequential.RdBu
            ,title='Male patients by death event')
fig.show()

In [None]:
fg=sns.FacetGrid(df, hue="DEATH_EVENT", height=6,)
fg.map(sns.kdeplot, "age",shade=True).add_legend(labels=["Alive","Not alive"])
plt.title('Age Distribution Plot');
plt.show()

*Observations*
*  The age distribution of male/female patients with no death event is almost the same
*  The age distribution of male/female patients with a death event is not the same

In [None]:
g = sns.lmplot(x="age", y="DEATH_EVENT", col="sex", hue="sex", data=df, y_jitter=.02, logistic=True, truncate=False)
g.set(xlim=(0, 80), ylim=(-.05, 1.05));

*Observations*
* From the two plots we can state that the older the patient, the higher the probability of a death event
* Men are more likely than women to face a death event as they age
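
One way to put numbers behind observations like these is to tabulate the death rate per age band and sex. The sketch below uses a small made-up frame; the helper name `death_rate_by_age_band` and the band edges are assumptions chosen for illustration.

```python
import pandas as pd


def death_rate_by_age_band(frame, edges=(40, 50, 60, 70, 80, 100)):
    """Mean of DEATH_EVENT per (age band, sex); bands are left-closed, e.g. 40-49."""
    edges = list(edges)
    labels = [f"{lo}-{hi - 1}" for lo, hi in zip(edges[:-1], edges[1:])]
    banded = frame.assign(
        age_band=pd.cut(frame["age"], bins=edges, labels=labels, right=False)
    )
    return (
        banded.groupby(["age_band", "sex"], observed=True)["DEATH_EVENT"]
        .mean()
        .unstack("sex")
    )


# Made-up rows for illustration only (sex: 1 = male, 0 = female).
demo = pd.DataFrame(
    {
        "age": [45, 47, 55, 65, 66, 72, 85, 88],
        "sex": [1, 0, 1, 0, 1, 1, 0, 1],
        "DEATH_EVENT": [0, 0, 0, 1, 0, 1, 1, 1],
    }
)
rates = death_rate_by_age_band(demo)
print(rates)
```

Calling `death_rate_by_age_band(df)` on the real data gives one row per age band and one column per sex, which makes the comparison between men and women explicit.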

<h2 class="list-group-item list-group-item-action active">Correlation maps</h2>

In [None]:
sns.heatmap(df.corr(),cmap="Blues");

In [None]:
data = df.copy()
# Map the 0/1 target to readable labels for the legend
data['DEATH_EVENT'] = data['DEATH_EVENT'].map({0: "Alive", 1: "Not Alive"})
sns.pairplot(data=data[['age','creatinine_phosphokinase','ejection_fraction','platelets','serum_creatinine','serum_sodium','time','DEATH_EVENT']], hue='DEATH_EVENT');

In [None]:
pd.crosstab(df.diabetes ,df.DEATH_EVENT).plot(kind='bar')
plt.legend(title='DEATH_EVENT', loc='upper right', labels=['No death event', 'Death event'])
plt.title('Death Event as per diabetes')
plt.xlabel('Diabetes')
plt.ylabel('# Patients')
plt.show()

In [None]:
pd.crosstab(df.high_blood_pressure ,df.DEATH_EVENT).plot(kind='bar')
# crosstab columns are ordered DEATH_EVENT = 0, then 1
plt.legend(title='DEATH_EVENT', loc='upper right', labels=['Alive', 'Not alive'])
plt.title('Death Event as per high blood pressure')
plt.xlabel('High blood pressure')
plt.ylabel('# Patients')
plt.show()

In [None]:
pd.crosstab(df.smoking ,df.DEATH_EVENT).plot(kind='bar')
# crosstab columns are ordered DEATH_EVENT = 0, then 1
plt.legend(title='DEATH_EVENT', loc='upper right', labels=['Alive', 'Not alive'])
plt.title('Death Event as per smokers')
plt.xlabel('Smokers')
plt.ylabel('# Patients')
plt.show()
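
The bar charts above show raw counts, which are hard to compare when the groups have different sizes. A row-normalised crosstab expresses the same information as rates. The frame below is made up for illustration; on the real data one would pass `df.smoking` and `df.DEATH_EVENT` instead.

```python
import pandas as pd

# Made-up rows for illustration only.
demo = pd.DataFrame(
    {
        "smoking": [0, 0, 0, 0, 1, 1, 1, 1],
        "DEATH_EVENT": [0, 0, 0, 1, 0, 0, 1, 1],
    }
)

# normalize="index" divides each row by its total, so every row sums to 1
rates = pd.crosstab(demo["smoking"], demo["DEATH_EVENT"], normalize="index")
print(rates)
```

Each row then reads as "share of patients in this group without / with a death event".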

In [None]:
# sex is encoded as Male = 1, Female = 0
len_data = len(df)
len_w = len(df[df["sex"]==0])
len_m = len_data - len_w

men_with_diabetes = len(df.loc[(df["diabetes"]==1) & (df['sex']==1)])
men_without_diabetes = len_m - men_with_diabetes

women_with_diabetes = len(df.loc[(df["diabetes"]==1) & (df['sex']==0)])
women_without_diabetes = len_w - women_with_diabetes
#print(men_with_diabetes,men_without_diabetes) 
#print(women_with_diabetes,women_without_diabetes)

labels = ['M_diabetes','M_no_diabetes','W_diabetes','W_no_diabetes']
values = [men_with_diabetes, men_without_diabetes, women_with_diabetes, women_without_diabetes]

fig = go.Figure(data=[go.Pie(labels=labels, values=values,textinfo='label+percent',hole=0.4)])
fig.update_layout(
    title_text="Distribution of diabetics and non-diabetics by gender (M for Men, W for Women)")
fig.show()

In [None]:
fig = px.parallel_categories(df[["sex","smoking","diabetes","anaemia","high_blood_pressure","time","DEATH_EVENT"]], color='DEATH_EVENT', color_continuous_scale=px.colors.sequential.Inferno)
fig.show()

With this plot, we can see the possible scenarios for a given patient according to their categorical values.

In [None]:
fig = px.box(df, x="DEATH_EVENT", y="age", color="smoking", notched=True)
fig.show()

# Feature selection

>Since the `time` feature (the follow-up period in days) does not have a concrete meaning for prediction, we will not use it, even though the correlation map shows that it is correlated with the target.

In [None]:
x = df.copy()
y = x["DEATH_EVENT"]  # 1-D target, as scikit-learn estimators expect
x = x.drop(columns=['time','DEATH_EVENT'])
features_names = x.columns

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)

forest.fit(x, y)
importances = forest.feature_importances_
# Spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking (by name, most important first)
print("Feature ranking:")

for f in range(x.shape[1]):
    print("%d. %s (%f)" % (f + 1, features_names[indices[f]], importances[indices[f]]))

# Plot the impurity-based feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(x.shape[1]), importances[indices], yerr=std[indices], align="center")
plt.xticks(range(x.shape[1]), features_names[indices].to_numpy(),rotation=80)
plt.xlim([-1, x.shape[1]])
plt.show()
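
Impurity-based importances are computed on the training data and can favour features with many distinct values, so a second opinion is useful. The sketch below shows permutation importance on a held-out split as one possible cross-check. It is not part of the original analysis: it runs on synthetic data from `make_classification`, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target (illustration only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

forest_demo = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the drop in score
result = permutation_importance(forest_demo, X_te, y_te, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]

for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} +/- {result.importances_std[idx]:.3f}")
```

If both rankings broadly agree on the top features, that is some evidence the selection is not an artefact of the impurity measure.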

In [None]:
# Keep the 6 highest-ranked features
features = features_names[indices].to_numpy()[0:6]

fig = px.scatter_matrix(
    df,
    dimensions=features,
    color="DEATH_EVENT"
)
fig.update_traces(diagonal_visible=False)

fig.update_layout(
    title='Scatter matrix of the 6 most important features',
    dragmode='select',
    width=1200,
    height=1200,
    #hovermode='closest',
)
fig.show()

In [None]:
def plot_cm(cm,title):
    z = cm
    x = ['No death Event', 'Death Event']
    y = x
    # change each element of z to type string for annotations
    z_text = [[str(y) for y in x] for x in z]

    # set up figure 
    fig = ff.create_annotated_heatmap(z, x=x, y=y, annotation_text=z_text, colorscale='deep')

    # add title
    fig.update_layout(title_text='<i><b>Confusion matrix {}</b></i>'.format(title),
                      #xaxis = dict(title='x'),
                      #yaxis = dict(title='x')
                     )

    # add custom xaxis title
    fig.add_annotation(dict(font=dict(color="black",size=14),
                            x=0.5,
                            y=-0.10,
                            showarrow=False,
                            text="Predicted value",
                            xref="paper",
                            yref="paper"))

    # add custom yaxis title
    fig.add_annotation(dict(font=dict(color="black",size=14),
                            x=-0.15,
                            y=0.5,
                            showarrow=False,
                            text="Real value",
                            textangle=-90,
                            xref="paper",
                            yref="paper"))

    # adjust margins to make room for yaxis title
    fig.update_layout(margin=dict(t=50, l=20),width=750,height=750)
    


    # add colorbar
    fig['data'][0]['showscale'] = True
    fig.show()
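
A confusion matrix also yields precision, recall and F1, which say more than accuracy alone when one class is rarer than the other. The helper below is a minimal sketch: the name `binary_metrics` and the example counts are made up, and it assumes the `[[TN, FP], [FN, TP]]` layout returned by scikit-learn's `confusion_matrix` for labels 0 and 1.

```python
def binary_metrics(cm):
    """Derive summary metrics from a 2x2 confusion matrix laid out as
    [[TN, FP], [FN, TP]] (rows = real class, columns = predicted class)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts for illustration only.
example_cm = [[35, 5], [12, 8]]
metrics = binary_metrics(example_cm)
for key, value in metrics.items():
    print(f"{key}: {value:.3f}")
```

In this made-up example the accuracy looks acceptable while recall on the death-event class is low, which is the kind of gap a single accuracy number hides.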

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y, test_size=0.2, random_state=23)
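
With a small dataset and unbalanced classes, a plain random split can leave the test set with a different share of death events than the training set. Passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. The sketch below demonstrates this on synthetic labels; the 70/30 class balance is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels for illustration: 70 negatives and 30 positives.
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=23
)

print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```

For this notebook the equivalent call would add `stratify=y` to the existing split.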

# Modeling 

We will use the following algorithms to predict the death event:
* Logistic Regression
* K-Nearest Neighbors
* Decision Tree Classifier
* Random Forest Classifier
* AdaBoost
* SVM
* XGBoost
* CatBoost


In [None]:
from catboost import CatBoostClassifier
models= [['Logistic Regression ',LogisticRegression()],
        ['KNearest Neighbor ',KNeighborsClassifier()],
        ['Decision Tree Classifier ',DecisionTreeClassifier()],
        ['Random Forest Classifier ',RandomForestClassifier()],
        ['Ada Boost ',AdaBoostClassifier()],
        ['SVM ',SVC()],
        ['XG Boost',XGBClassifier()],
        ['Cat Boost',CatBoostClassifier(logging_level='Silent')]]

models_score = []
for name,model in models:

    model.fit(x_train,y_train)
    model_pred = model.predict(x_test)
    cm_model = confusion_matrix(y_test, model_pred)
    test_accuracy = accuracy_score(y_test, model_pred)
    models_score.append(test_accuracy)

    print(name)
    print('Test Accuracy: ', test_accuracy)
    print('Training Accuracy: ',accuracy_score(y_train,model.predict(x_train)))
    print('############################################')
    plot_cm(cm_model,title=f"{name.strip()} model")

    # A ROC curve needs continuous scores; hard 0/1 predictions give a single operating point
    if hasattr(model, "predict_proba"):
        model_score = model.predict_proba(x_test)[:, 1]
    else:
        model_score = model.decision_function(x_test)  # e.g. SVC() without probability=True
    fpr, tpr, thresholds = roc_curve(y_test, model_score)

    fig = px.area(
        x=fpr, y=tpr,
        title=f'ROC Curve, {name.strip()} (AUC={auc(fpr, tpr):.4f})',
        labels=dict(x='False Positive Rate', y='True Positive Rate'),
        width=700, height=500
    )
    fig.add_shape(
        type='line', line=dict(dash='dash'),
        x0=0, x1=1, y0=0, y1=1
    )

    fig.update_yaxes(scaleanchor="x", scaleratio=1)
    fig.update_xaxes(constrain='domain')
    fig.show()
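
A ROC curve sweeps a decision threshold over continuous scores, so it is most informative when built from probabilities or decision values rather than from hard 0/1 predictions. The sketch below compares the two on synthetic data; the dataset and variable names are illustrative and not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (illustration only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)            # 0/1 predictions
scores = clf.predict_proba(X_te)[:, 1]     # probability of the positive class

fpr_hard, tpr_hard, _ = roc_curve(y_te, hard_labels)
fpr_soft, tpr_soft, _ = roc_curve(y_te, scores)

print("ROC points from hard labels:", len(fpr_hard))
print("ROC points from scores:     ", len(fpr_soft))
print("AUC from hard labels:", roc_auc_score(y_te, hard_labels))
print("AUC from scores:     ", roc_auc_score(y_te, scores))
```

With hard labels the curve collapses to at most three points, so the reported AUC only reflects one threshold.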
    
    



In [None]:
models_names = [
    'Logistic Regression',
    'KNearest Neighbor',
    'Decision Tree Classifier',
    'Random Forest Classifier',
    'Ada Boost',
    'SVM',
    'XG Boost',
    'Cat Boost']

# accuracy_score returns a fraction in [0, 1]; convert to percent for the plot
models_score_pct = [100 * score for score in models_score]

plt.rcParams['figure.figsize']=20,8
sns.set_style('darkgrid')
ax = sns.barplot(x=models_names, y=models_score_pct, palette = "inferno", saturation =2.0)
plt.xlabel('Classifier Models', fontsize = 20 )
plt.ylabel('% of Accuracy', fontsize = 20)
plt.title('Accuracy of different Classifier Models on test set', fontsize = 20)
plt.xticks(fontsize = 12, horizontalalignment = 'center', rotation = 8)
plt.yticks(fontsize = 12)
for i in ax.patches:
    width, height = i.get_width(), i.get_height()
    bar_x, bar_y = i.get_xy()  # separate names so the x / y datasets are not overwritten
    ax.annotate(f'{height:.1f}%', (bar_x + width/2, bar_y + height*1.02), ha='center', fontsize = 'x-large')
plt.show()
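
Accuracy on a single small test split can change noticeably with the random seed. Stratified k-fold cross-validation averages over several splits and gives a more stable basis for comparing models. The sketch below uses synthetic data and two of the classifiers from the list above; the dataset, the class imbalance and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with an arbitrary class imbalance (illustration only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, weights=[0.7, 0.3], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

cv_summary = {}
for label, estimator in candidates.items():
    # One accuracy value per fold; each fold keeps the class proportions
    fold_scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_summary[label] = (fold_scores.mean(), fold_scores.std())
    print(f"{label}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

The same loop could run over the `models` list with `x` and `y`, reporting mean and spread instead of a single test accuracy.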


## If this notebook helped you understand heart failure (HF) and the ML techniques used to detect it, or if you just liked my work, please consider upvoting this notebook :). It will keep me motivated and encourage me to explore more on Kaggle!