# 1. Basic dataset analysis

Python Imports:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Data Import:

In [None]:
df = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

In [None]:
df.head()

Basic analysis of the dataframe:

In [None]:
df.info()

Looks like we have **299** records (rows) in the dataset.

Check for missing data (the non-null counts above already suggest there is none):

In [None]:
df.isnull().sum()

In [None]:
plt.figure(figsize=(15,3))

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

**No missing data**, so no imputation is needed!

Number of columns:

In [None]:
df.columns

In [None]:
len(df.columns)

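We have **13 columns**, and all of them are stored as **numerical (quantitative)** values. Note that several of them (anaemia, diabetes, high_blood_pressure, sex, smoking and DEATH_EVENT) are binary flags already encoded as 0/1, so no extra encoding is needed. If there were any text categorical columns, we could have used LabelEncoder or one-hot encoding with dummy variables.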
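A quick way to separate binary flags from continuous measurements is to check which columns only ever contain 0 and 1. This is a minimal sketch on a small made-up frame (the values are hypothetical; only the column names come from the dataset):

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the dataset.
toy = pd.DataFrame({
    "age": [60.0, 75.0, 42.0, 58.0],
    "anaemia": [0, 1, 0, 1],
    "sex": [1, 1, 0, 0],
    "serum_sodium": [137, 130, 140, 134],
})

# A column whose values are a subset of {0, 1} is a binary flag,
# even though its dtype is numeric.
binary_cols = [c for c in toy.columns if set(toy[c].unique()) <= {0, 1}]
continuous_cols = [c for c in toy.columns if c not in binary_cols]

print("binary:", binary_cols)
print("continuous:", continuous_cols)
```

The same two list comprehensions should work on `df` directly.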

Number of death occurrences:

In [None]:
plt.figure(figsize=(6,6))
df['DEATH_EVENT'].value_counts().plot(kind='pie', autopct='%1.1f', shadow=True)

In [None]:
df['DEATH_EVENT'].value_counts()

We have **96 death events** and 203 survivals. **32.1% of patients died!**
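The percentage can be checked directly from the two counts. The sketch below rebuilds the label column from those counts (it does not read the CSV) and also prints the accuracy of a trivial model that always predicts survival, which is a useful baseline for the models later on:

```python
import pandas as pd

# Rebuild the label column from the counts reported above: 203 survivors, 96 deaths.
death_event = pd.Series([0] * 203 + [1] * 96, name="DEATH_EVENT")

counts = death_event.value_counts()
death_rate = death_event.mean() * 100                   # mean of a 0/1 column = share of 1s
majority_baseline = counts.max() / counts.sum() * 100   # always predict the larger class

print(f"Death rate: {death_rate:.1f}%")
print(f"Accuracy of always predicting survival: {majority_baseline:.1f}%")
```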

# 2. Exploratory Data Analysis (EDA) + Visualizations

Explore the correlations in this dataset, as all columns are numerical:

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(),cmap='coolwarm', annot=True)

We can start to see that the **age, ejection_fraction, serum_creatinine, serum_sodium and time** columns are quite well correlated to the DEATH_EVENT label. These seem to be the **most important features** in the df. We could only keep these when we build the model.
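Instead of reading the heatmap by eye, the label column of the correlation matrix can be sorted by absolute value. This is a minimal sketch on synthetic data; `rank_by_target_correlation` and the toy column names are hypothetical helpers, not part of the dataset:

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature drives the label, one barely matters, one is pure noise.
rng = np.random.default_rng(0)
n = 500
strong = rng.normal(size=n)
weak = rng.normal(size=n)
label = (strong + 0.1 * weak + rng.normal(scale=0.5, size=n) > 0).astype(int)
toy = pd.DataFrame({
    "strong": strong,
    "weak": weak,
    "noise": rng.normal(size=n),
    "DEATH_EVENT": label,
})

def rank_by_target_correlation(frame, target_col):
    """Return |Pearson correlation| of every feature with the target, largest first."""
    corr = frame.corr()[target_col].drop(target_col)
    return corr.abs().sort_values(ascending=False)

ranking = rank_by_target_correlation(toy, "DEATH_EVENT")
print(ranking)
```

On the real data this would be `rank_by_target_correlation(df, "DEATH_EVENT")`.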

**AGE**:

Age distribution of the 2 sexes:

In [None]:
sns.set_style('whitegrid')

g = sns.FacetGrid(df, hue="sex", height=6, aspect=2, palette='dark')
g = g.map(plt.hist, "age", bins=30, alpha=0.5)

g.add_legend()

Similar normal distribution for M/F.

Age distribution of survived/not survived:

In [None]:
g = sns.FacetGrid(df, hue="DEATH_EVENT", height=6, aspect=2, palette='dark')
g = g.map(plt.hist, "age", bins=30, alpha=0.5)

g.add_legend()

Looks like older patients are more likely to have a DEATH_EVENT (makes sense).

In [None]:
sns.boxplot(x="DEATH_EVENT", y="age", data=df)

We can see that the median age is higher for DEATH_EVENT=1 (the line inside each box is the median, not the mean). Note some outliers for the age of DEATH_EVENT=0, probably very old people who did not die during the follow-up.

**ANAEMIA:**

What percentage of patients have anaemia?

In [None]:
plt.figure(figsize=(6,6))
df['anaemia'].value_counts().plot(kind='pie', autopct='%1.1f', shadow=True)
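The pie chart above shows how common anaemia is in the dataset, not how many anaemic patients died. A minimal sketch of the second question, using a small made-up frame and a hypothetical helper `death_rate_by_group`:

```python
import pandas as pd

# Toy data with hypothetical values; column names follow the dataset.
toy = pd.DataFrame({
    "anaemia":     [0, 0, 0, 0, 1, 1, 1, 1],
    "DEATH_EVENT": [0, 0, 0, 1, 0, 1, 1, 0],
})

def death_rate_by_group(frame, column, target="DEATH_EVENT"):
    """Percentage of deaths within each value of a binary column."""
    return frame.groupby(column)[target].mean() * 100

rates = death_rate_by_group(toy, "anaemia")
print(rates)
```

The same helper should apply to the other binary columns, for example `death_rate_by_group(df, "diabetes")` or `death_rate_by_group(df, "high_blood_pressure")`.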

**creatinine_phosphokinase distribution hued by DEATH_EVENT**:

In [None]:
g = sns.FacetGrid(df, hue="DEATH_EVENT", height=6, aspect=2, palette='dark')
g = g.map(plt.hist, "creatinine_phosphokinase", bins=50, alpha=0.5)

g.add_legend()

**DIABETES:**

What percentage of patients have diabetes?

In [None]:
plt.figure(figsize=(6,6))
df['diabetes'].value_counts().plot(kind='pie', autopct='%1.1f', shadow=True)

**ejection_fraction distribution hued by DEATH_EVENT**:

In [None]:
g = sns.FacetGrid(df, hue="DEATH_EVENT", height=6, aspect=2, palette='dark')
g = g.map(plt.hist, "ejection_fraction", bins=10, alpha=0.5)

g.add_legend()

In [None]:
sns.boxplot(x="DEATH_EVENT", y="ejection_fraction", data=df)

We can see that lower ejection fraction increases chances of DEATH_EVENT.

**HIGH BLOOD PRESSURE:**

What percentage of patients have high blood pressure?

In [None]:
plt.figure(figsize=(6,6))
df['high_blood_pressure'].value_counts().plot(kind='pie', autopct='%1.1f', shadow=True)

**platelets distribution hued by DEATH_EVENT**:

In [None]:
g = sns.FacetGrid(df, hue="DEATH_EVENT", height=6, aspect=2, palette='dark')
g = g.map(plt.hist, "platelets", bins=30, alpha=0.5)

g.add_legend()

**serum_creatinine distribution hued by DEATH_EVENT**:

In [None]:
g = sns.FacetGrid(df, hue="DEATH_EVENT", height=6, aspect=2, palette='dark')
g = g.map(plt.hist, "serum_creatinine", bins=30, alpha=0.5)

g.add_legend()

In [None]:
sns.boxplot(x="DEATH_EVENT", y="serum_creatinine", data=df)

Looks to be higher for people that died of heart disease.

**serum_sodium distribution hued by DEATH_EVENT**:

In [None]:
g = sns.FacetGrid(df, hue="DEATH_EVENT", height=6, aspect=2, palette='dark')
g = g.map(plt.hist, "serum_sodium", bins=30, alpha=0.5)

g.add_legend()

In [None]:
sns.boxplot(x="DEATH_EVENT", y="serum_sodium", data=df)

Looks to be lower for people that died.

**GENDER:**

In [None]:
plt.figure(figsize=(6,6))
df['sex'].value_counts().plot(kind='pie', autopct='%1.1f', shadow=True)

**SMOKING:**

In [None]:
plt.figure(figsize=(6,6))
df['smoking'].value_counts().plot(kind='pie', autopct='%1.1f', shadow=True)

**serum_sodium vs ejection_fraction:** (these look to be correlated from the corr heatmap above)

In [None]:
sns.regplot(x='serum_sodium',y='ejection_fraction', data=df)

Just a slight positive relationship here: the regression line is almost flat and the points are widely scattered around it.

# 3. Models and Performance

Choose what features are included in the model (for now we will include all, but we could have chosen the ones given by corr heatmap):

In [None]:
X=df.drop(['DEATH_EVENT'], axis=1)
y=df['DEATH_EVENT']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
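With only 299 rows and roughly one third positives, a plain random split can leave the test set with a noticeably different death rate. One option (an alternative, not what the split above does) is the `stratify` argument of `train_test_split`. A minimal sketch on placeholder features with the same class counts as the dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features; the labels reuse the class counts reported earlier (203 / 96).
X_toy = np.arange(299 * 2).reshape(299, 2)
y_toy = np.array([0] * 203 + [1] * 96)

# stratify=y_toy keeps the class proportions (almost) identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print("overall death rate:", round(y_toy.mean(), 3))
print("train death rate:  ", round(y_tr.mean(), 3))
print("test death rate:   ", round(y_te.mean(), 3))
```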

**Predefine performance metric functions for Classification Problems**

In [None]:
# Import the metrics first, so the helper functions below can be run in order
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

In [None]:
def print_validation_report(y_true, y_pred):
    """Print the per-class classification report and the overall accuracy."""
    print("Classification Report")
    print(classification_report(y_true, y_pred))
    acc_sc = accuracy_score(y_true, y_pred)
    print("Accuracy : "+ str(acc_sc))

In [None]:
def plot_confusion_matrix(y_true, y_pred):
    """Draw the confusion matrix as a heatmap (rows = true, columns = predicted)."""
    mtx = confusion_matrix(y_true, y_pred)
    sns.heatmap(mtx, annot=True, fmt='d', linewidths=.5,
                cmap="Blues", cbar=False)
    plt.ylabel('true label')
    plt.xlabel('predicted label')

## 3.1 Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

lr=LogisticRegression(max_iter=10000)
lr.fit(X_train,y_train)
p1=lr.predict(X_test)
s1=accuracy_score(y_test,p1)
print("Logistic Regression Accuracy :", s1*100,'%')

In [None]:
print_validation_report(y_test,p1)

In [None]:
plot_confusion_matrix(y_test,p1)

**FEATURE IMPORTANCE IN LOG REGRESSION:**

In [None]:
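# Absolute coefficient size, used here as a rough importance proxy
importance = abs(lr.coef_[0])
coefficients = pd.DataFrame(importance, X_train.columns)
coefficients.columns = ['Coefficient']
plt.figure(figsize=(15,4))
plt.bar(X_train.columns,importance)
plt.xticks(rotation=45)  # keep the long column names readable
plt.show()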

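As expected, the columns we picked from the correlation heatmap have the largest coefficients! Keep in mind that the features were not scaled, so the size of each coefficient also reflects the units of its column. The rest of the columns could be discarded and the models rebuilt using only the highlighted columns.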
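Coefficient sizes are only comparable across features when the features share a scale. A possible variant (a sketch on synthetic data, not the model fitted above) standardises the inputs inside a pipeline before reading the coefficients; the two toy column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: a small-scale informative feature and a large-scale noise feature.
rng = np.random.default_rng(0)
n = 400
toy_X = pd.DataFrame({
    "small_scale_signal": rng.normal(0, 1, n),
    "large_scale_noise": rng.normal(250_000, 100_000, n),
})
toy_y = (toy_X["small_scale_signal"] + rng.normal(0, 0.5, n) > 0).astype(int)

# Standardise first, then fit; coefficients are now per standard deviation.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
model.fit(toy_X, toy_y)

coefs = pd.Series(np.abs(model[-1].coef_[0]), index=toy_X.columns)
coefs = coefs.sort_values(ascending=False)
print(coefs)
```

On the notebook's data the same pipeline would be fitted on `X_train, y_train`.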

## 3.2 Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc=RandomForestClassifier()
rfc.fit(X_train,y_train)
p2=rfc.predict(X_test)
s2=accuracy_score(y_test,p2)
print("Random Forest Accuracy :", s2*100,'%')

In [None]:
plot_confusion_matrix(y_test,p2)

## 3.3 SVM

In [None]:
from sklearn.svm import SVC
svm=SVC()
svm.fit(X_train,y_train)
p3=svm.predict(X_test)
s3=accuracy_score(y_test,p3)
print("SVM Accuracy :", s3*100,'%')

In [None]:
plot_confusion_matrix(y_test,p3)
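SVM (and KNN below) rely on distances between rows, so a column with very large values can dominate the others. A common remedy is to standardise the features in a pipeline. The sketch below uses synthetic data with one deliberately large-scale noise column; it illustrates the idea and is not a claim about the results on this dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: the label depends only on the small-scale column.
rng = np.random.default_rng(0)
n = 400
signal = rng.normal(0, 1, n)
noise = rng.normal(250_000, 100_000, n)
X_toy = np.column_stack([signal, noise])
y_toy = (signal > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

# Same classifier, with and without standardisation.
raw_acc = SVC().fit(X_tr, y_tr).score(X_te, y_te)
scaled_acc = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr).score(X_te, y_te)

print("SVC on raw features:   ", raw_acc)
print("SVC on scaled features:", scaled_acc)
```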

## 3.4 KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn=KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train,y_train)
p4=knn.predict(X_test)
s4=accuracy_score(y_test,p4)
print("KNN Accuracy :", s4*100,'%')

Let's optimize for K:

In [None]:
error_rate = []
scores = []

for i in range(1,40): # check all values of K from 1 to 39
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    score=accuracy_score(y_test,pred_i)
    scores.append(score)
    error_rate.append(np.mean(pred_i != y_test)) # error rate = share of wrong predictions

In [None]:
sns.set_style('whitegrid')  # the 'seaborn-whitegrid' style name was removed in newer matplotlib
plt.figure(figsize=(10,6))
plt.plot(range(1,40),scores,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Accuracy Score vs. K Value')
plt.xlabel('K')
plt.ylabel('Accuracy Score')

In [None]:
sns.set_style('whitegrid')  # the 'seaborn-whitegrid' style name was removed in newer matplotlib
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

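K=35 seems a good value that minimises errors and maximises accuracy. Note that K is picked here using the test set, so the KNN accuracy reported below is likely optimistic.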
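To choose K without looking at the test set, one option is cross-validation on the training data with `GridSearchCV`. This is a minimal sketch on synthetic data from `make_classification` (the sample and feature counts only mimic the shape of the dataset), not the procedure used above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with the same shape as the feature matrix (299 rows, 12 features).
X_toy, y_toy = make_classification(n_samples=299, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

# 5-fold cross-validation over K = 1..39, using the training part only.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 40))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
test_acc = search.score(X_te, y_te)  # the test set is touched once, at the end
print("best K:", best_k, "| test accuracy:", test_acc)
```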

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn=KNeighborsClassifier(n_neighbors=35)
knn.fit(X_train,y_train)
p4=knn.predict(X_test)
s4=accuracy_score(y_test,p4)
print("KNN Accuracy:", s4*100,'%')

In [None]:
plot_confusion_matrix(y_test,p4)

## 3.5 Gaussian Naive-Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, y_train)
p5 =nb.predict(X_test)
s5=accuracy_score(y_test,p5)
print("Naive-Bayes Accuracy:", s5*100,'%')

In [None]:
plot_confusion_matrix(y_test,p5)

In [None]:
f1_score(y_test,p5)

Very good F1 score, meaning a good balance between precision and recall!
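As a reminder of what the F1 score measures, it is the harmonic mean of precision and recall, both of which come straight from the confusion matrix. A small worked example with made-up labels (not the notebook's predictions):

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels: 4 positives and 4 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted deaths, how many were real
recall = tp / (tp + fn)      # of the real deaths, how many were found
f1_manual = 2 * precision * recall / (precision + recall)

print(precision, recall, f1_manual)
print(f1_score(y_true, y_pred))  # same value from scikit-learn
```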

## Summarize results:

In [None]:
models = pd.DataFrame({
    'Model': ["LOGISTIC REGRESSION","RANDOM FOREST","SUPPORT VECTOR MACHINE","KNN","NAIVE-BAYES"],
    'Accuracy Score': [s1*100,s2*100,s3*100,s4*100,s5*100]})
models.sort_values(by='Accuracy Score', ascending=False)

Looks like **Naive-Bayes wins in accuracy**.

We can further compare each model's F1 scores for balance between precision and recall.

In [None]:
print(f1_score(y_test,p1))
print(f1_score(y_test,p2))
print(f1_score(y_test,p3))
print(f1_score(y_test,p4))
print(f1_score(y_test,p5))

**NB** wins in F1 score as well. KNN and SVM score much lower here and can be disregarded.

We can also compare the ROC curves and AUC scores for each model.

In [None]:
from sklearn.metrics import roc_curve,roc_auc_score, auc

In [None]:
fpr1,tpr1, thr1=roc_curve(y_test,p1)
fpr2,tpr2, thr2=roc_curve(y_test,p2)
fpr3,tpr3, thr3=roc_curve(y_test,p3)
fpr4,tpr4, thr4=roc_curve(y_test,p4)
fpr5,tpr5, thr5=roc_curve(y_test,p5)

In [None]:
plt.figure(figsize=(10,6))
plt.plot(fpr1,tpr1, linestyle='--', label='LR')
plt.plot(fpr2,tpr2, linestyle='--', label='RF')
plt.plot(fpr3,tpr3, linestyle='--', label='SVM')
plt.plot(fpr4,tpr4, linestyle='--', label='KNN')
plt.plot(fpr5,tpr5, linestyle='--', label='NB')
plt.legend()

We can see that the largest area under the curve belongs to **Naive-Bayes**, as predicted. This suggests a good balance between Type 1 and Type 2 errors.

In [None]:
print(roc_auc_score(y_test,p1))
print(roc_auc_score(y_test,p2))
print(roc_auc_score(y_test,p3))
print(roc_auc_score(y_test,p4))
print(roc_auc_score(y_test,p5))

We can see that the **NB classifier has the largest AUC score.**
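One caveat: the curves and AUC values above are computed from hard 0/1 predictions, so each curve has a single operating point. ROC curves are usually drawn from continuous scores such as `predict_proba` (or `decision_function` for `SVC`). A minimal sketch of the difference on synthetic data, as an illustration rather than a re-run of the comparison above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with overlapping classes.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 3))
y_toy = (X_toy[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

clf = GaussianNB().fit(X_tr, y_tr)
hard_labels = clf.predict(X_te)            # 0/1 predictions
scores = clf.predict_proba(X_te)[:, 1]     # probability of class 1

fpr_hard, tpr_hard, _ = roc_curve(y_te, hard_labels)
fpr_soft, tpr_soft, _ = roc_curve(y_te, scores)

print("points on curve from labels:", len(fpr_hard))
print("points on curve from scores:", len(fpr_soft))
print("AUC from labels:", roc_auc_score(y_te, hard_labels))
print("AUC from scores:", roc_auc_score(y_te, scores))
```

For the models in this notebook, the equivalent scores would come from, for example, `nb.predict_proba(X_test)[:, 1]` and `svm.decision_function(X_test)`.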

# 4. Conclusion and model choice

**Final model choice: Gaussian Naive-Bayes** presents the largest accuracy, F1 and AUC scores, as well as only 4 misclassifications!

**Summary:** We started with an initial data analysis, checking for missing data and ensuring the integrity and quality of the data. We also observed each column (and its data type); for categorical columns we could have used LabelEncoding or dummy variables (not the case in this dataset). Then, some visualizations based on each feature were created and a correlation heatmap was used to determine some preliminary important features, which were confirmed in the model-building section afterwards (these features could have been used for the models instead of choosing all of them like I have).

The data was split into train and test sets, and 5 classifier models (Logistic Regression, Random Forest, KNN, SVC and Gaussian Naive-Bayes) were built on the train data and tested on the test data. The comparison metrics were the accuracy score, the F1 score and the ROC AUC score, and the best model was clearly the Gaussian Naive-Bayes, with around 93% accuracy and the lowest number of Type 1 + Type 2 errors. Moreover, the KNN classifier was optimized for the best K value.

Feature importance was extracted from the Logistic Regression; however, it is better to use regularization methods like Lasso/Ridge/ElasticNet to extract feature importance. Note that in medical diagnosis it is desirable to minimise false negatives!
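One simple way to trade false negatives for false positives is to lower the decision threshold applied to the predicted probabilities. The sketch below uses synthetic data and a hypothetical helper `false_negatives`; the threshold of 0.3 is an arbitrary illustration, not a recommended value:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data with overlapping classes.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 3))
y_toy = (X_toy[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

clf = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

def false_negatives(y_true, proba, threshold):
    """Count positives that are missed when predicting 1 for proba >= threshold."""
    y_pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn

fn_default = false_negatives(y_te, proba, 0.5)  # the usual cut-off
fn_lower = false_negatives(y_te, proba, 0.3)    # more cases flagged as positive

print("false negatives at 0.5:", fn_default)
print("false negatives at 0.3:", fn_lower)
```

Lowering the threshold can never increase the number of false negatives, but it usually adds false positives, so the choice depends on the cost of each error.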