# Heart Failure Prediction
## K-Nearest Neighbors vs Random Forest vs Naive Bayes

Hello, in this notebook we will compare the accuracy of three classification algorithms (K-Nearest Neighbors, Random Forest, and Naive Bayes) in predicting heart failure from clinical data.
First, we import the relevant libraries.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


Then we load the CSV file and inspect the dataframe.

In [None]:
df=pd.read_csv('/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
df.head(10)

In [None]:
df.describe()

Now we assign the feature columns to the variable `X` and the label column (`DEATH_EVENT`) to `y`. The labels serve as the ground truth for evaluating the predicted values later on.

In [None]:
feature_cols = ['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
                'ejection_fraction', 'high_blood_pressure', 'platelets',
                'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time']
feature_df = df[feature_cols]
X = np.asarray(feature_df)  # feature matrix, one row per patient
y = np.asarray(df['DEATH_EVENT'])  # binary label

We split the data into a training set and a test set, holding out 20% of the rows for testing (`test_size=0.2`).

In [None]:
#Train Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
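As a quick check of what an 80/20 split produces, the sketch below runs `train_test_split` on a synthetic stand-in for the data. The array names (`X_demo`, `y_demo`), the 300-row size, and the roughly 30% positive rate are assumptions made for illustration only. It also passes `stratify`, an optional argument the notebook does not use, which keeps the class ratio about the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the clinical data (shape and class ratio are assumptions)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 12))
y_demo = (rng.random(300) < 0.3).astype(int)

# Same test_size and random_state as the notebook, plus stratification on the label
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

print("train shape:", Xtr.shape, "test shape:", Xte.shape)
print("positive rate (train / test):", ytr.mean(), yte.mean())
```

Without `stratify`, a small test set can end up with a noticeably different share of positive cases than the training set, which makes accuracy harder to interpret.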


In [None]:
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

### K Nearest Neighbors

We try every number of neighbors N from 1 to 100 and report the N that gives the highest accuracy score on the test set.

In [None]:
e_knn = np.zeros(100)  # e_knn[i] holds the test accuracy for N = i + 1
for i in range(len(e_knn)):
    knn_model = KNeighborsClassifier(n_neighbors=i + 1)
    knn_model.fit(X_train, y_train)
    yh_knn = knn_model.predict(X_test)
    e_knn[i] = accuracy_score(y_test, yh_knn)
    
print("KNN Prediction Accuracy Score: ",np.round(e_knn.max(),3),' with N = ',e_knn.argmax()+1)
plt.plot(np.arange(1,101),e_knn)
plt.plot(e_knn.argmax()+1,e_knn.max(),'or')
plt.title('KNN Accuracy Score')
plt.xlabel('N')
plt.ylabel('Accuracy')
plt.show()
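Because the loop above picks N by scoring on the test set, the reported maximum tends to be an optimistic estimate. One common alternative is to choose N by cross-validation on the training set only, and to standardize the features first, since KNN relies on distances and large-valued columns can otherwise dominate. The sketch below shows this on synthetic data; `X_demo`, `y_demo`, `pipe`, `grid`, the 1 to 50 range, and the 5 folds are all illustrative assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the clinical data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)

# Scale the features, then classify; the scaler is refit inside every CV fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 51))}

# 5-fold cross-validation on the training set selects N
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(Xtr, ytr)

best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
test_acc = grid.score(Xte, yte)  # the test set is used once, after N is chosen
print("best N:", best_k, "test accuracy:", round(test_acc, 3))
```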

### Random Forest

Random Forest is one of the most widely used classification algorithms. Here we fit a single forest with scikit-learn's default number of estimators rather than sweeping over that parameter.

In [None]:
rf_model = RandomForestClassifier(criterion='gini')
rf_model.fit(X_train, y_train)
yh_rf = rf_model.predict(X_test)
e_rf = accuracy_score(y_test, yh_rf)  # a single number, so no .max() is needed

print("Random Forest Prediction Accuracy Score: ", np.round(e_rf, 3))
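A sweep over the number of trees, in the same style as the KNN loop, could look like the sketch below. It runs on synthetic data, and the names `n_values` and `e_rf_sweep`, the step of 10, and the fixed `random_state` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the clinical data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)

n_values = np.arange(10, 111, 10)  # 10, 20, ..., 110 trees
e_rf_sweep = np.zeros(len(n_values))
for i, n in enumerate(n_values):
    # Fixed seed so that differences come from n_estimators, not from chance
    model = RandomForestClassifier(n_estimators=int(n), criterion='gini', random_state=2)
    model.fit(Xtr, ytr)
    e_rf_sweep[i] = accuracy_score(yte, model.predict(Xte))

print("best accuracy:", np.round(e_rf_sweep.max(), 3),
      "with n_estimators =", n_values[e_rf_sweep.argmax()])
```

Random forests are randomized, so without a fixed seed the accuracy can change from run to run even for the same number of trees.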

### Naive Bayes

Lastly, we model the prediction with the Gaussian Naive Bayes algorithm. The parameter to tune is `var_smoothing`, which is usually searched over a logarithmic grid; here we try 100 values from 1 down to 1e-9.

In [None]:
e_gnb=np.zeros(100)
params_NB = np.logspace(0,-9, num=100)
for i in range(0,len(params_NB)):
    gnb_model = GaussianNB(var_smoothing=params_NB[i])
    gnb_model.fit(X_train, y_train)
    yh_gnb=gnb_model.predict(X_test)
    e_gnb[i]=accuracy_score(y_test, yh_gnb)
    
print("Naive Bayes Prediction Accuracy Score: ",np.round(e_gnb.max(),3),' with Smoothing = ',params_NB[e_gnb.argmax()])
plt.plot(params_NB,e_gnb,'.-')
plt.plot(params_NB[e_gnb.argmax()],e_gnb.max(),'or')
plt.title('Naive Bayes Accuracy Score')
plt.xscale('log')
plt.xlabel('Var Smoothing')
plt.ylabel('Accuracy')
plt.show()
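To make the role of `var_smoothing` more concrete: scikit-learn's `GaussianNB` adds a fraction of the largest feature variance to every per-class variance, and stores that added amount in the fitted attribute `epsilon_`. The sketch below checks this on synthetic data with two features on very different scales; `X_demo`, `y_demo`, and the value `1e-2` are illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data: one small-scale and one large-scale feature (an assumption)
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y_demo = rng.integers(0, 2, 200)

smoothing = 1e-2
gnb = GaussianNB(var_smoothing=smoothing).fit(X_demo, y_demo)

# The amount added to every variance is smoothing * (largest feature variance)
largest_var = X_demo.var(axis=0).max()
print("epsilon_ from the model:    ", gnb.epsilon_)
print("smoothing * largest variance:", smoothing * largest_var)
```

Since the added amount is tied to the largest-variance column, a large `var_smoothing` on unscaled data can swamp the variances of small-scale features, which is one reason accuracy changes along the sweep.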

## Conclusion

In [None]:
# np.max works whether e_rf is a plain float or a NumPy value
plt.bar(['K-Nearest Neighbors','Random Forest','Naive Bayes'],[e_knn.max(),np.max(e_rf),e_gnb.max()])

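The bar chart above shows that the Random Forest algorithm gives the highest test accuracy of the three classifiers in predicting heart failure cases (over 90%). Note that the KNN and Naive Bayes bars show the best score found over their parameter sweeps, while the Random Forest bar comes from a single fit.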
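Since this comparison rests on a single train/test split, one way to check how stable the ranking is would be k-fold cross-validation. The sketch below compares the three classifiers with default settings on synthetic data; `X_demo`, `y_demo`, `models`, `cv_scores`, and the choice of 5 folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the clinical data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=0)

models = {
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=2),
    "Naive Bayes": GaussianNB(),
}

# Five accuracy values per model, one for each held-out fold
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean {scores.mean():.3f}, std {scores.std():.3f}")
```

If the spread across folds is larger than the gap between two models, the single-split ranking should be read with caution.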