Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.


### Variables in the dataset:

1.Age: Age of the patient

2.Anaemia: Whether the patient's haemoglobin was below the normal range

3.Creatinine_phosphokinase: The level of the creatine phosphokinase in the blood in mcg/L

4.Diabetes: If the patient was diabetic

5.Ejection_fraction: Ejection fraction is a measurement of how much blood the left ventricle pumps out with each contraction

6.High_blood_pressure: If the patient had hypertension

7.Platelets: Platelet count of blood in kiloplatelets/mL

<a href="https://www.healthline.com/health/high-creatinine-symptoms">8.Serum_creatinine: The level of serum creatinine in the blood in mg/dL</a>

9.Serum_sodium: The level of serum sodium in the blood in mEq/L

10.Sex: The sex of the patient

11.Smoking: Whether the patient smokes or has ever smoked

12.Time: The patient's follow-up period in days

13.Death_event: Whether the patient died during the follow-up period

<a>The structure of this notebook</a>
<ol>
<li>Introduction</li>
<li>Data and preparation-visualisation/cleaning </li>
<li>Build predictive model: Naive Bayes, Logistic Regression, Decision Tree, Support Vector Classification </li>
<li>Build Ensemble Model like Bagging and Boosting</li>
<li>Choose the Best Model</li>
<li>Compare the best model found with a sequential neural network model</li>
<li>Deploy that Model</li>
</ol>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Data Visualization

In [None]:
df = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.isna().sum()

# Women-->0
# Men-->1

In [None]:
df.sex.value_counts()

# Death-->1
# Alive-->0

In [None]:
df.sex[df.DEATH_EVENT==1].value_counts()

In [None]:
fig_dims = (15, 7)
fig, ax = plt.subplots(figsize=fig_dims)
df.sex[df.DEATH_EVENT==1].value_counts().plot(kind='bar',color=['orange','blue'],ax=ax)
plt.title("Number of deaths by sex (0 = female, 1 = male)", color="green")
plt.xticks(rotation=0);
ax.tick_params(axis='x', colors='red')
ax.tick_params(axis='y', colors='red')

In [None]:
table=pd.crosstab(df.DEATH_EVENT,df.sex)

In [None]:
fig_dims = (15, 7)
fig, ax = plt.subplots(figsize=fig_dims)
table.plot(kind='bar',figsize=fig_dims ,color=['orange','blue'],ax=ax)
plt.title("Death events by sex",color="green")
plt.xlabel("DEATH_EVENT (0 = survived, 1 = died)",color="green")
plt.ylabel("Number of patients",color="green")
plt.legend(["Female","Male"])
plt.xticks(rotation=0);
ax.tick_params(axis='x', colors='red')
ax.tick_params(axis='y', colors='red')

#### Data Transformation
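We apply log, square-root and power transformations to the skewed continuous variables to bring their distributions closer to normal. A log or square root pulls in a long right tail, while raising to a power does the same for a long left tail.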
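To illustrate the effect of a log transformation on skew, here is a minimal sketch using only the standard library. The data are hypothetical (synthetic log-normal values standing in for a right-skewed lab measurement), and `sample_skewness` is an illustrative helper, not a function from this notebook. pandas' `Series.skew()` applies a bias correction, so its numbers differ slightly from this plain definition.

```python
import math
import random


def sample_skewness(values):
    """Plain (uncorrected) skewness: the mean of the cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


random.seed(0)
# Hypothetical right-skewed data, not taken from the heart failure dataset.
raw = [random.lognormvariate(5.0, 1.0) for _ in range(2000)]
logged = [math.log(v) for v in raw]

skew_raw = sample_skewness(raw)
skew_logged = sample_skewness(logged)
print(f"skew before log: {skew_raw:.2f}, after log: {skew_logged:.2f}")
```

The skew of the raw values is strongly positive, while the skew of the logged values is close to zero, which is the behaviour the transformation step relies on.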

In [None]:
df.describe()

In [None]:
df.hist(figsize=(15,15))

In [None]:
df_agelog=np.log(df["age"])

In [None]:
df_agelog.skew()

In [None]:
df_agelog.hist(figsize=(5,5))

In [None]:
df.skew()

In [None]:
df_cret_phos_log=np.log(df['creatinine_phosphokinase'])
df_cret_phos_log.skew()

In [None]:
df_cret_phos_log.hist(figsize=(5,5))

In [None]:
# Despite the "_log" suffix, platelets use a square root and serum_sodium a cube.
df_plate_log=np.sqrt(df['platelets'])
df_ejec_frac_log=np.log(df['ejection_fraction'])
df_ser_sod_log=np.power(df['serum_sodium'],3)
print('skew of plate_log={}, skew of ejec_frac={} and skew of ser_sod={}'.format(
    df_plate_log.skew(),df_ejec_frac_log.skew(),df_ser_sod_log.skew()))

In [None]:
df_plate_log.hist()

In [None]:
df_ejec_frac_log.hist()

In [None]:
df_ser_sod_log.hist()

In [None]:
df["age"]=df_agelog
df['creatinine_phosphokinase']=df_cret_phos_log
df['platelets']=df_plate_log
df['ejection_fraction']=df_ejec_frac_log
df['serum_sodium']=df_ser_sod_log

In [None]:
df.head()

### Scaling:
Scaling reduces the weight that large-valued continuous variables would otherwise carry by mapping them onto a predefined range. Here Min-Max scaling maps each variable to the range 0 to 1, with 1 being the highest observed value, so that no predictor variable dominates simply because of its units.
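As a rough illustration of the Min-Max formula, `(x - min) / (max - min)`, the sketch below rescales a short list by hand. The ages are hypothetical and `min_max_scale` is an illustrative helper; it assumes the values are not all identical, since the denominator would otherwise be zero.

```python
def min_max_scale(values):
    """Map values linearly so the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical ages, not taken from the dataset.
ages = [40, 55, 60, 95]
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```

The smallest age maps to 0, the largest to 1, and everything else falls in between in proportion to its distance from the minimum.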

In [None]:
from sklearn.preprocessing import MinMaxScaler
scal=MinMaxScaler()
features= ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_sodium','time','serum_creatinine']
df[features] = scal.fit_transform(df[features])
df.head()

In [None]:
fig_dims = (20, 10)
fig, ax = plt.subplots(figsize=fig_dims)
sns.scatterplot(data=df, x="creatinine_phosphokinase", 
                y="age", ax=ax)
ax.tick_params(axis='x', colors='red')
ax.tick_params(axis='y', colors='red')

**Detecting Outliers and Boxplots**

In [None]:
fig_dims = (20, 10)
fig, ax = plt.subplots(figsize=fig_dims)
sns.boxplot(data=df, orient="h", palette="Set2", ax=ax)
ax.tick_params(axis='x', colors='red')
ax.tick_params(axis='y', colors='red')

**Serum_creatinine**

In [None]:
from scipy import stats
z=np.abs(stats.zscore(df.serum_creatinine))# Outlier removal with zscore method.

In [None]:
threshold=3
print(np.where(z>threshold))
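The z-score check can be sketched without SciPy as follows. The creatinine values are hypothetical, and `z_scores` is an illustrative helper that uses the population standard deviation, which is also the default (`ddof=0`) of `scipy.stats.zscore`.

```python
import math


def z_scores(values):
    """Distance of each value from the mean, in population standard deviations."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


# Hypothetical serum creatinine readings (mg/dL) with one extreme value at the end.
creatinine = [1.0, 1.1, 0.9, 1.2, 0.8] * 4 + [9.0]
threshold = 3
outlier_idx = [i for i, z in enumerate(z_scores(creatinine)) if abs(z) > threshold]
print(outlier_idx)
```

Only the last reading lies more than three standard deviations from the mean, so it is the only index flagged.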

#### A serum creatinine level this high points to late-stage chronic kidney disease (CKD), where the opportunity for the best treatment outcome has unfortunately been lost. A serum creatinine level of 10.0 mg/dL means that about 90% of kidney function has already been lost, which indicates end-stage kidney disease (ESKD). The death event is therefore closely related to this outlier, and since death events are what we want to predict, we do not remove such outliers from the data.

In [None]:
df.iloc[9]

**Platelets**

In [None]:
z1=np.abs(stats.zscore(df.platelets))

In [None]:
threshold=3
print(np.where(z1>threshold))

When the platelet count drops below 20,000, the patient may have spontaneous bleeding that can result in death. Thrombocytopenia occurs due to platelet destruction or impaired platelet production. Because the column is normalized, the value appears as 0.076, but it is near, if not below, 20,000 platelets. Since such a value can itself lead to death, we do not remove these outliers.

In [None]:
df.iloc[15]

# Building a Correlation Matrix

In [None]:
cor_mat=df.corr()
fig,ax=plt.subplots(figsize=(15,10))
sns.heatmap(cor_mat,annot=True,linewidths=0.5,fmt=".3f")

The heatmap shows the correlations between the features of the dataset, for example whether a patient's sex is related to smoking, or whether a shorter follow-up time is associated with death. It is also important because it shows no strong multicollinearity between the predictor variables.
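Each cell of a correlation matrix is a Pearson coefficient. The sketch below computes one by hand on hypothetical toy values (not rows from the dataset); `pearson` is an illustrative helper rather than part of the notebook.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical follow-up times and outcomes (1 = died).
follow_up = [10, 30, 60, 90, 120, 200]
died = [1, 1, 1, 0, 0, 0]

r_time_death = pearson(follow_up, died)
r_perfect = pearson([1, 2, 3, 4], [2, 4, 6, 8])
print(round(r_time_death, 3), round(r_perfect, 3))
```

A coefficient near -1 or +1 between two predictors would be a warning sign of multicollinearity; values near 0 indicate little linear relationship.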

# Creating Features and Target variable

In [None]:
df.DEATH_EVENT.values

In [None]:
x=df.drop("DEATH_EVENT",axis=1).values
y=df.DEATH_EVENT

# Splitting the data into train and test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(x,y,random_state=0,test_size=0.2)

In [None]:
from sklearn.metrics import accuracy_score,recall_score,f1_score,precision_score,roc_auc_score,confusion_matrix

def metrics(Y_test,Y_pred):
    acc=accuracy_score(Y_test,Y_pred)
    rec=recall_score(Y_test,Y_pred)
    f1=f1_score(Y_test,Y_pred)
    print("Accuracy= {}".format(acc),
          "\n Recall= {}".format(rec),
         "\n f1 score= {}".format(f1))
    l1=[ acc,rec,f1]
    return l1

# Fitting and Comparing different Models

**Naive bayes**
Naive Bayes is a family of supervised learning algorithms that apply Bayes' theorem with the "naive" assumption that every pair of features is conditionally independent given the class.
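To make the independence assumption concrete, the following sketch multiplies per-feature likelihoods for two binary features. All probabilities are hypothetical, and the code shows plain Bernoulli-style Naive Bayes rather than the `ComplementNB` variant used in this notebook.

```python
# Hypothetical class priors and P(feature = 1 | class); not estimated from the dataset.
prior = {"died": 0.3, "survived": 0.7}
p_feature_given_class = {
    "died": {"anaemia": 0.6, "smoking": 0.5},
    "survived": {"anaemia": 0.4, "smoking": 0.3},
}


def posterior(observed):
    """Return normalized class posteriors for a dict of feature -> 0/1."""
    joint = {}
    for cls, p_cls in prior.items():
        likelihood = 1.0
        for feature, value in observed.items():
            p_one = p_feature_given_class[cls][feature]
            # Independence assumption: per-feature likelihoods are simply multiplied.
            likelihood *= p_one if value == 1 else (1 - p_one)
        joint[cls] = p_cls * likelihood
    total = sum(joint.values())
    return {cls: value / total for cls, value in joint.items()}


post = posterior({"anaemia": 1, "smoking": 1})
print(post)
```

The predicted class is the one with the larger posterior; here the two features together are enough to outweigh the lower prior of the "died" class.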

In [None]:
from sklearn.naive_bayes import ComplementNB
clf = ComplementNB()
clf.fit(X_train,Y_train)

In [None]:
Naive_bayes_preds = clf.predict(X_test)
NBayes=metrics(Y_test, Naive_bayes_preds)

In [None]:
# confusion matrix 
cm_NBayes = confusion_matrix(Y_test, Naive_bayes_preds)  
print ("Confusion Matrix : \n", cm_NBayes)  
TN=cm_NBayes[0,0]  # true negatives: actual 0 (survived), predicted 0
FP=cm_NBayes[0,1]  # false positives: actual 0 (survived), predicted 1
FN=cm_NBayes[1,0]  # false negatives: actual 1 (died), predicted 0
TP=cm_NBayes[1,1]  # true positives: actual 1 (died), predicted 1
print("True Positive cases= {} True Negative cases={} False Positive cases={} False Negative cases= {}".format(TP,TN,FP,FN))
# Recall: of the patients who actually died, how many did the model flag?
# F1-score is most useful when the class distribution is uneven.
metrics(Y_test, Naive_bayes_preds)

precision_NBayes=TP/(TP+FP)  # of the patients predicted to die, how many actually died?
print("precision=", precision_NBayes)
Specificity_NBayes = TN/(TN+FP)  # of the patients who survived, how many were predicted correctly?
print("Specificity=", Specificity_NBayes)
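The relationship between the four confusion-matrix cells and the reported scores can be sketched as below. The counts are hypothetical (they are not results from this notebook), and `summarize_confusion` is an illustrative helper. It follows scikit-learn's layout, in which rows are actual classes and columns are predicted classes.

```python
def summarize_confusion(tn, fp, fn, tp):
    """Derive the usual binary-classification scores from the four cell counts."""
    precision = tp / (tp + fp)      # predicted positives that are truly positive
    recall = tp / (tp + fn)         # actual positives that were found
    specificity = tn / (tn + fp)    # actual negatives that were found
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for a 60-patient test set.
scores = summarize_confusion(tn=35, fp=5, fn=8, tp=12)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Accuracy looks respectable in this example even though recall is only 0.6, which is why recall and F1 are tracked alongside accuracy when deaths are the minority class.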

In [None]:
accuracy_NBayes=NBayes[0]
accuracy_NBayes
recall_NBayes=NBayes[1]
recall_NBayes
f1score_NBayes=NBayes[2]
f1score_NBayes
print("acc= {}, rec= {}, f1score ={}".format(accuracy_NBayes,recall_NBayes,f1score_NBayes))

In [None]:
plt.figure(figsize=(5,5))

sns.heatmap(data=cm_NBayes,linewidths=.5, annot=True,square = True,  cmap = 'OrRd')

plt.ylabel('Actual label')
plt.xlabel('Predicted label')
sc_new_N=round(NBayes[0],3)
all_sample_title = 'Accuracy Score: {0}'.format(sc_new_N)
plt.title(all_sample_title, size = 15)

### Logistic Regression

Logistic regression is used for classification, where the response variable is categorical rather than numerical.
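As a minimal sketch of how logistic regression turns a weighted sum into a class label, the code below applies the sigmoid function and a 0.5 threshold. The weights, bias and feature values are hypothetical, and `predict_label` is an illustrative helper, not part of statsmodels.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return the predicted probability and the thresholded class label."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical coefficients for two scaled features.
p, label = predict_label([0.8, 0.2], weights=[2.0, -3.0], bias=-0.5)
print(round(p, 4), label)
```

Rounding the predicted probabilities, as the notebook does with `round`, amounts to using the same 0.5 threshold.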

In [None]:
import statsmodels.api as sm 

In [None]:
log_clas = sm.Logit(Y_train,X_train).fit() 

In [None]:
log_clas.summary()

The coefficients show that age and serum_creatinine are significant and strongly increase the risk of death. Conversely, the follow-up time is strongly and significantly inversely related to death. According to the pseudo R-squared, the predictor variables explain about 42.47% of the variation in the response variable.

In [None]:
logit_Y_pred = log_clas.predict(X_test) 
prediction = list(map(round, logit_Y_pred)) 
  
# comparing original and predicted values of y 
print('Actual values', list(Y_test.values)) 
print('Predictions :', prediction) 
logclas=metrics(Y_test, prediction)

In [None]:
from sklearn.metrics import (confusion_matrix,  
                           accuracy_score) 
  
# confusion matrix 
cm_logit = confusion_matrix(Y_test, prediction)  
print ("Confusion Matrix : \n", cm_logit)  
TN=cm_logit[0,0]  # true negatives: actual 0 (survived), predicted 0
FP=cm_logit[0,1]  # false positives: actual 0 (survived), predicted 1
FN=cm_logit[1,0]  # false negatives: actual 1 (died), predicted 0
TP=cm_logit[1,1]  # true positives: actual 1 (died), predicted 1
print("True Positive cases= {} True Negative cases={} False Positive cases={} False Negative cases= {}".format(TP,TN,FP,FN))
# Recall: of the patients who actually died, how many did the model flag?
# F1-score is most useful when the class distribution is uneven.
metrics(Y_test, prediction)
precision_log_clas=TP/(TP+FP)  # of the patients predicted to die, how many actually died?
print("precision=", precision_log_clas)
Specificity_log_clas = TN/(TN+FP)  # of the patients who survived, how many were predicted correctly?
print("Specificity=", Specificity_log_clas)

In [None]:
accuracy_logclas=logclas[0]
accuracy_logclas
recall_logclas=logclas[1]
recall_logclas
f1score_logclas=logclas[2]
f1score_logclas
print("acc= {}, rec= {}, f1score ={}".format(accuracy_logclas,recall_logclas,f1score_logclas))

In [None]:
plt.figure(figsize=(5,5))

sns.heatmap(data=cm_logit,linewidths=.5, annot=True,square = True,  cmap = 'OrRd')

plt.ylabel('Actual label')
plt.xlabel('Predicted label')
sc_new_l=round(logclas[0],3)
all_sample_title = 'Accuracy Score: {0}'.format(sc_new_l)
plt.title(all_sample_title, size = 15)

# Decision Tree

A decision tree is a binary branching structure that classifies an arbitrary input X. Each node in the tree contains a simple comparison of a feature against some value. The result is either true or false, which determines the direction in which to proceed. The algorithm is also known as CART (Classification and Regression Trees); here it is used for classification.
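The node-level comparison can be sketched as a search for the threshold that gives the purest split. This is a simplified illustration with hypothetical values: it scores candidate thresholds by weighted Gini impurity and tests only observed values, whereas scikit-learn places thresholds at midpoints between neighbouring values. `gini` and `best_threshold` are illustrative helpers.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_one = sum(labels) / n
    return 1.0 - p_one ** 2 - (1.0 - p_one) ** 2


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best 'value <= threshold' split."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue  # a split must send samples to both branches
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (t, weighted)
    return best


# Hypothetical ejection fractions and outcomes (1 = died).
ejection_fraction = [20, 25, 30, 35, 40, 50, 60]
died = [1, 1, 1, 0, 0, 0, 0]
threshold, impurity = best_threshold(ejection_fraction, died)
print(threshold, impurity)
```

A full tree repeats this search recursively on each branch until the nodes are pure or a stopping rule applies.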

In [None]:
# Defining the decision tree algorithm
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix  # for checking test results
from sklearn.tree import plot_tree  # for visualizing the tree
dtree=DecisionTreeClassifier()
dtree.fit(X_train,Y_train)

print('Decision Tree Classifier Created')

In [None]:
# Predicting the values of test data
y_pred = dtree.predict(X_test)
print("Classification report - \n", classification_report(Y_test,y_pred))

In [None]:
Dectree=metrics(Y_test, y_pred)

In [None]:
# confusion matrix 
cm_dectree = confusion_matrix(Y_test, y_pred)  
print ("Confusion Matrix : \n", cm_dectree)  
TN=cm_dectree[0,0]  # true negatives: actual 0 (survived), predicted 0
FP=cm_dectree[0,1]  # false positives: actual 0 (survived), predicted 1
FN=cm_dectree[1,0]  # false negatives: actual 1 (died), predicted 0
TP=cm_dectree[1,1]  # true positives: actual 1 (died), predicted 1
print("True Positive cases= {} True Negative cases={} False Positive cases={} False Negative cases= {}".format(TP,TN,FP,FN))
# Recall: of the patients who actually died, how many did the model flag?
# F1-score is most useful when the class distribution is uneven.
metrics(Y_test, y_pred)
precision_dectree=TP/(TP+FP)  # of the patients predicted to die, how many actually died?
print("precision=", precision_dectree)
Specificity_dectree = TN/(TN+FP)  # of the patients who survived, how many were predicted correctly?
print("Specificity=", Specificity_dectree)

In [None]:
accuracy_Dectree=Dectree[0]
accuracy_Dectree
recall_Dectree=Dectree[1]
recall_Dectree
f1score_Dectree=Dectree[2]
f1score_Dectree
print("acc= {}, rec= {}, f1score ={}".format(accuracy_Dectree,recall_Dectree,f1score_Dectree))

In [None]:
cm = confusion_matrix(Y_test,y_pred)
plt.figure(figsize=(5,5))

sns.heatmap(data=cm,linewidths=.5, annot=True,square = True,  cmap = 'OrRd')

plt.ylabel('Actual label')
plt.xlabel('Predicted label')
sc_new=round(dtree.score(X_test, Y_test),3)
all_sample_title = 'Accuracy Score: {0}'.format(sc_new)
plt.title(all_sample_title, size = 15)


In [None]:
# Visualising the graph without the use of graphviz

plt.figure(figsize = (100,100))
dec_tree = plot_tree(decision_tree=dtree, feature_names = list(df.drop("DEATH_EVENT",axis=1).columns), 
                     class_names = ["Survived","Died"], filled = True , precision = 4, rounded = True)


# **Support Vector Classification**

SVC works by constructing a hyperplane that separates the points of the two classes. The training observations are split into two groups that lie as far as possible from the hyperplane. This maximal distance is the margin.
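The geometry of the margin can be sketched for a fixed, hypothetical hyperplane. The code does not train an SVM; it only measures how far each toy point lies from the plane `w·x + b = 0` and which side it falls on. `signed_distance` is an illustrative helper, and the points and labels are made up.

```python
import math


def signed_distance(point, w, b):
    """Signed distance from a point to the hyperplane w·x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm


# Hypothetical separating line x1 + x2 = 3 and toy points with labels -1 / +1.
w, b = [1.0, 1.0], -3.0
points = {(1.0, 1.0): -1, (0.0, 2.0): -1, (2.0, 2.0): 1, (3.0, 1.0): 1}

distances = {p: signed_distance(p, w, b) for p in points}
predicted = {p: (1 if d >= 0 else -1) for p, d in distances.items()}
margin = min(abs(d) for d in distances.values())
print(predicted, round(margin, 4))
```

Training an SVM amounts to choosing `w` and `b` so that this smallest distance is as large as possible while the classes stay on their own sides.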

In [None]:
np.random.seed(42)
from sklearn.svm import SVC
SVC_clf=SVC()
SVC_clf.fit(X_train,Y_train)
SVC_score=SVC_clf.score(X_test,Y_test)
SVC_Y_pred=SVC_clf.predict(X_test)
#print(SVC_score)
SVC=metrics(Y_test,SVC_Y_pred)

In [None]:
# confusion matrix 
cm_SVC = confusion_matrix(Y_test,SVC_Y_pred)  
print ("Confusion Matrix : \n", cm_SVC)  
TN=cm_SVC[0,0]  # true negatives: actual 0 (survived), predicted 0
FP=cm_SVC[0,1]  # false positives: actual 0 (survived), predicted 1
FN=cm_SVC[1,0]  # false negatives: actual 1 (died), predicted 0
TP=cm_SVC[1,1]  # true positives: actual 1 (died), predicted 1
print("True Positive cases= {} True Negative cases={} False Positive cases={} False Negative cases= {}".format(TP,TN,FP,FN))
# Recall: of the patients who actually died, how many did the model flag?
# F1-score is most useful when the class distribution is uneven.
metrics(Y_test,SVC_Y_pred)
precision_SVC=TP/(TP+FP)  # of the patients predicted to die, how many actually died?
print("precision=", precision_SVC)
Specificity_SVC = TN/(TN+FP)  # of the patients who survived, how many were predicted correctly?
print("Specificity=", Specificity_SVC)

In [None]:
accuracy_SVC=SVC[0]
accuracy_SVC
recall_SVC=SVC[1]
recall_SVC
f1score_SVC=SVC[2]
f1score_SVC
print("acc= {}, rec= {}, f1score ={}".format(accuracy_SVC,recall_SVC,f1score_SVC))

In [None]:
plt.figure(figsize=(5,5))

sns.heatmap(data=cm_SVC,linewidths=.5, annot=True,square = True,  cmap = 'OrRd')

plt.ylabel('Actual label')
plt.xlabel('Predicted label')
sc_new_S=round(SVC[0],3)
all_sample_title = 'Accuracy Score: {0}'.format(sc_new_S)
plt.title(all_sample_title, size = 15)

# Ensemble Techniques

### Random Forest Classification
Random Forest uses bagging together with random feature selection to add diversity to the decision tree model. It builds an ensemble of decision trees and combines their predictions by voting.
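Bagging itself can be sketched with bootstrap samples and a majority vote. The weak learner here is a hypothetical one-feature threshold rule rather than a real decision tree, the data are made up, and the random feature selection that a random forest adds at each split is not shown.

```python
import random
from collections import Counter

# Hypothetical (ejection_fraction, died) rows.
rows = [(20, 1), (25, 1), (30, 1), (22, 1), (28, 1),
        (45, 0), (50, 0), (60, 0), (55, 0), (48, 0)]


def bootstrap_sample(data, rng):
    """Draw len(data) rows with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample):
    """Weak learner: threshold halfway between the two class means."""
    dead = [x for x, y in sample if y == 1]
    alive = [x for x, y in sample if y == 0]
    if not dead or not alive:
        # Fallback when a bootstrap sample happens to contain a single class.
        return sum(x for x, _ in sample) / len(sample)
    return (sum(dead) / len(dead) + sum(alive) / len(alive)) / 2


def majority_vote(votes):
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(42)
thresholds = [fit_stump(bootstrap_sample(rows, rng)) for _ in range(25)]


def predict_ensemble(x):
    """Each stump votes 1 (died) when x is below its threshold."""
    return majority_vote([1 if x < t else 0 for t in thresholds])


print(predict_ensemble(15), predict_ensemble(70))
```

Because every learner sees a slightly different resampled dataset, their individual errors partly cancel out in the vote.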

In [None]:
np.random.seed(42)
from sklearn.ensemble import RandomForestClassifier
RF_clf=RandomForestClassifier(n_estimators=450)
RF_clf.fit(X_train,Y_train)
RF_score=RF_clf.score(X_test,Y_test)
RF_Y_pred=RF_clf.predict(X_test)
#print(RF_score)
RandFor=metrics(Y_test,RF_Y_pred)

In [None]:
# confusion matrix 
cm_RandFor = confusion_matrix(Y_test,RF_Y_pred)  
print ("Confusion Matrix : \n", cm_RandFor)  
TN=cm_RandFor[0,0]  # true negatives: actual 0 (survived), predicted 0
FP=cm_RandFor[0,1]  # false positives: actual 0 (survived), predicted 1
FN=cm_RandFor[1,0]  # false negatives: actual 1 (died), predicted 0
TP=cm_RandFor[1,1]  # true positives: actual 1 (died), predicted 1
print("True Positive cases= {} True Negative cases={} False Positive cases={} False Negative cases= {}".format(TP,TN,FP,FN))
# Recall: of the patients who actually died, how many did the model flag?
# F1-score is most useful when the class distribution is uneven.
metrics(Y_test,RF_Y_pred)
precision_Randfor=TP/(TP+FP)  # of the patients predicted to die, how many actually died?
print("precision=", precision_Randfor)
Specificity_Randfor = TN/(TN+FP)  # of the patients who survived, how many were predicted correctly?
print("Specificity=", Specificity_Randfor)

In [None]:
accuracy_RandFor=RandFor[0]
accuracy_RandFor
recall_RandFor=RandFor[1]
recall_RandFor
f1score_RandFor=RandFor[2]
f1score_RandFor
print("acc= {}, rec= {}, f1score ={}".format(accuracy_RandFor,recall_RandFor,f1score_RandFor))

In [None]:
plt.figure(figsize=(5,5))

sns.heatmap(data=cm_RandFor,linewidths=.5, annot=True,square = True,  cmap = 'OrRd')

plt.ylabel('Actual label')
plt.xlabel('Predicted label')
sc_new_R=round(RandFor[0],3)
all_sample_title = 'Accuracy Score: {0}'.format(sc_new_R)
plt.title(all_sample_title, size = 15)

### **XGBoost**

XGBoost is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm, which attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models.
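The idea of combining weak models can be sketched with plain gradient boosting on squared error, where each new stump is fitted to the residuals left by the current ensemble. This is a simplified illustration on hypothetical one-dimensional data, not XGBoost's actual algorithm, which additionally uses regularization and second-order gradient information. `fit_stump`, `boost` and `mean_squared_error` are illustrative helpers.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump under squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, t, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of squared error
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical target that no single stump can fit (a bump in the middle).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 1, 1, 0, 0]

mse_1 = mean_squared_error(boost(xs, ys, rounds=1), xs, ys)
mse_20 = mean_squared_error(boost(xs, ys, rounds=20), xs, ys)
print(f"MSE after 1 round: {mse_1:.4f}, after 20 rounds: {mse_20:.4f}")
```

The training error shrinks as more stumps are added, which is the sense in which many weak models combine into a stronger one.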

In [None]:
from xgboost import XGBClassifier
XGB_clf=XGBClassifier()
XGB_clf.fit(X_train,Y_train)
XGB_score=XGB_clf.score(X_test,Y_test)
XGB_Y_pred=XGB_clf.predict(X_test)
eGradBoost=metrics(Y_test,XGB_Y_pred)

In [None]:
# confusion matrix 
cm_eGradBoost = confusion_matrix(Y_test,XGB_Y_pred)  
print ("Confusion Matrix : \n", cm_eGradBoost)  
TN=cm_eGradBoost[0,0]  # true negatives: actual 0 (survived), predicted 0
FP=cm_eGradBoost[0,1]  # false positives: actual 0 (survived), predicted 1
FN=cm_eGradBoost[1,0]  # false negatives: actual 1 (died), predicted 0
TP=cm_eGradBoost[1,1]  # true positives: actual 1 (died), predicted 1
print("True Positive cases= {} True Negative cases={} False Positive cases={} False Negative cases= {}".format(TP,TN,FP,FN))
# Recall: of the patients who actually died, how many did the model flag?
# F1-score is most useful when the class distribution is uneven.
metrics(Y_test,XGB_Y_pred)
precision_eGradBoost=TP/(TP+FP)  # of the patients predicted to die, how many actually died?
print("precision=", precision_eGradBoost)
Specificity_eGradBoost = TN/(TN+FP)  # of the patients who survived, how many were predicted correctly?
print("Specificity=", Specificity_eGradBoost)

In [None]:
accuracy_eGradBoost=eGradBoost[0]
accuracy_eGradBoost
recall_eGradBoost=eGradBoost[1]
recall_eGradBoost
f1score_eGradBoost=eGradBoost[2]
f1score_eGradBoost
print("acc= {}, rec= {}, f1score ={}".format(accuracy_eGradBoost,recall_eGradBoost,f1score_eGradBoost))

In [None]:
plt.figure(figsize=(5,5))

sns.heatmap(data=cm_eGradBoost,linewidths=.5, annot=True,square = True,  cmap = 'OrRd')

plt.ylabel('Actual label')
plt.xlabel('Predicted label')
sc_new_R=round(eGradBoost[0],3)
all_sample_title = 'Accuracy Score: {0}'.format(sc_new_R)
plt.title(all_sample_title, size = 15)

# Comparisons

In [None]:
model_comp = pd.DataFrame({
    'Model': ['Logistic Regression','Random Forest','Naive Bayes',
              'Support Vector Machine','Decision Tree','Extreme Gradient Classifier'],
    'Accuracy': [accuracy_logclas*100,accuracy_RandFor*100,accuracy_NBayes*100,
                 accuracy_SVC*100,accuracy_Dectree*100,accuracy_eGradBoost*100],
    'Precision': [precision_log_clas*100,precision_Randfor*100,precision_NBayes*100,
                  precision_SVC*100,precision_dectree*100,precision_eGradBoost*100],
    'Recall': [recall_logclas*100,recall_RandFor*100,recall_NBayes*100,
               recall_SVC*100,recall_Dectree*100,recall_eGradBoost*100],
    'Specificity': [Specificity_log_clas*100,Specificity_Randfor*100,Specificity_NBayes*100,
                    Specificity_SVC*100,Specificity_dectree*100,Specificity_eGradBoost*100],
    'F1-score': [f1score_logclas*100,f1score_RandFor*100,f1score_NBayes*100,
                 f1score_SVC*100,f1score_Dectree*100,f1score_eGradBoost*100]})
model_comp

# Looking at the evaluation metrics for our best model
As we can see, the Random Forest Classifier gives us an accuracy of 85%.

Let us evaluate the model now.

In [None]:
print(" Best evaluation parameters achieved with Random Forest:") 
metrics(Y_test,RF_Y_pred)

In [None]:
final_metrics={'Accuracy': RF_clf.score(X_test,Y_test),
                   'Precision': precision_score(Y_test,RF_Y_pred),
                   'Recall': recall_score(Y_test,RF_Y_pred),
                   'F1': f1_score(Y_test,RF_Y_pred),
                   'AUC': roc_auc_score(Y_test,RF_Y_pred)}

# Use a separate name so the metrics() helper defined earlier is not overwritten
final_metrics_df=pd.DataFrame(final_metrics,index=[0])

final_metrics_df.T.plot.bar(title='Final metric evaluation',legend=False)

* The AUC indicates that the model has an almost 80% chance of correctly distinguishing a patient who died from one who survived.
* Precision says that 90%+ of the patients we predicted to die actually died.
* Recall says that 60%+ of the patients who actually died were correctly predicted.
* Accuracy tells us how closely our predictions match the actual values overall.
* F1-score is most useful when the class distribution is uneven.
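The AUC interpretation above, the chance that a randomly chosen positive case is ranked above a randomly chosen negative one, can be checked directly by counting pairs. The labels and scores below are hypothetical, and `auc_from_scores` is an illustrative helper that counts ties as half a win.

```python
def auc_from_scores(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = died) and predicted probabilities of death.
labels = [1, 1, 1, 0, 0, 0, 0]
probabilities = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
hard_predictions = [1 if p >= 0.5 else 0 for p in probabilities]

auc_prob = auc_from_scores(labels, probabilities)
auc_hard = auc_from_scores(labels, hard_predictions)
print(round(auc_prob, 4), round(auc_hard, 4))
```

In this toy example the AUC computed from hard 0/1 predictions is lower than the one computed from probabilities, because thresholding discards the ranking information. The notebook passes hard predictions to `roc_auc_score`, so passing scores from `predict_proba` instead may give a more informative value.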

In [None]:
from sklearn.metrics import roc_curve

# roc curve for models
fpr1, tpr1, thresh1 = roc_curve(Y_test, RF_Y_pred, pos_label=1)
fpr2, tpr2, thresh2 = roc_curve(Y_test, prediction,pos_label=1)
fpr3, tpr3, thresh3 = roc_curve(Y_test, Naive_bayes_preds, pos_label=1)
fpr4, tpr4, thresh4 = roc_curve(Y_test, SVC_Y_pred, pos_label=1)
fpr5, tpr5, thresh5 = roc_curve(Y_test, y_pred, pos_label=1)
fpr6, tpr6, thresh6 = roc_curve(Y_test, XGB_Y_pred, pos_label=1)
random_probs = [0 for i in range(len(Y_test))]
p_fpr, p_tpr, _ = roc_curve(Y_test, random_probs, pos_label=1)

In [None]:
import matplotlib.pyplot as plt
plt.style.use('seaborn')

# plot roc curves
plt.plot(fpr1, tpr1, linestyle='--',color='orange', label='Random Forest')
plt.plot(fpr2, tpr2, linestyle='--',color='grey', label='Logistic Classification')
plt.plot(fpr3, tpr3, linestyle='--',color='grey', label='Naive Bayes Classification')
plt.plot(fpr4, tpr4, linestyle='--',color='grey', label='Support Vector Classification')
plt.plot(fpr5, tpr5, linestyle='--',color='grey', label='Decision Tree')
plt.plot(fpr6, tpr6, linestyle='--',color='brown', label='Extreme Gradient Classification')
plt.plot(p_fpr, p_tpr, linestyle='--', color='blue')
# title
plt.title('ROC curve')
# x label
plt.xlabel('False Positive Rate')
# y label
plt.ylabel('True Positive rate')

plt.legend(loc='best')
plt.savefig('ROC',dpi=300)
plt.show();

# Bagging vs Boosting
In this case, Random Forest, which is a bagging of decision trees, does better because it keeps false predictions of death low. The boosting algorithm does better only if some chance of falsely predicting a patient's death from cardiovascular problems is acceptable. The difference can have various causes, but because we do not want to falsely predict a patient's survival, we go with the Random Forest classification model.

# User Input

In [None]:
# Expects the 12 feature values in the column order of the training data,
# already transformed and scaled in the same way as the training features.
# (scal was fitted on 7 columns only, so it cannot transform a 12-value row.)
user_input=input("Enter the 12 feature values separated by commas: ")
user_input=[float(value) for value in user_input.split(",")]

user_input=np.array(user_input).reshape(1,-1)

# RF_clf is the best model chosen above; DEATH_EVENT == 1 is the high-risk class
user_pred=RF_clf.predict(user_input)
if(user_pred[0]==1):
  print("Warning! You have chances of getting a heart disease!")
else:
  print("You are healthy and are less likely to get a heart disease!")


In [None]:
import pickle as pkl
pkl.dump(RF_clf,open("final_model.p","wb"))  # save the chosen Random Forest model

In [None]:
import sklearn
sklearn_version = sklearn.__version__
print(sklearn_version)

In [None]:
!pip install streamlit
!pip install pyngrok===4.1.1
from pyngrok import ngrok

In [None]:
%%writefile healthy-heart-app.py
import streamlit as st
import base64
import sklearn
import numpy as np
import pickle as pkl
from sklearn.preprocessing import MinMaxScaler
scal=MinMaxScaler()
#Load the saved model
model=pkl.load(open("final_model.p","rb"))





st.set_page_config(page_title="Healthy Heart App",page_icon="⚕️",layout="centered",initial_sidebar_state="expanded")



def preprocess(age,sex,cp,trestbps,restecg,chol,fbs,thalach,exang,oldpeak,slope,ca,thal ):   
 
    
    # Pre-processing user input   
    if sex=="male":
        sex=1 
    else: sex=0
    
    
    if cp=="Typical angina":
        cp=0
    elif cp=="Atypical angina":
        cp=1
    elif cp=="Non-anginal pain":
        cp=2
    elif cp=="Asymptomatic":
        cp=3
    
    if exang=="Yes":
        exang=1
    elif exang=="No":
        exang=0
 
    if fbs=="Yes":
        fbs=1
    elif fbs=="No":
        fbs=0
 
    if slope=="Upsloping: better heart rate with excercise(uncommon)":
        slope=0
    elif slope=="Flatsloping: minimal change(typical healthy heart)":
          slope=1
    elif slope=="Downsloping: signs of unhealthy heart":
        slope=2  
 
    if thal=="fixed defect: used to be defect but ok now":
        thal=6
    elif thal=="reversable defect: no proper blood movement when excercising":
        thal=7
    elif thal=="normal":
        thal=2.31

    if restecg=="Nothing to note":
        restecg=0
    elif restecg=="ST-T Wave abnormality":
        restecg=1
    elif restecg=="Possible or definite left ventricular hypertrophy":
        restecg=2


    user_input=[age,sex,cp,trestbps,restecg,chol,fbs,thalach,exang,oldpeak,slope,ca,thal]
    user_input=np.array(user_input)
    user_input=user_input.reshape(1,-1)
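    # NOTE: fitting a MinMaxScaler on this single row would map every feature to 0,
    # so the input is passed through as-is; apply the scaler fitted on the training data here.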
    prediction = model.predict(user_input)

    return prediction

    

       
    # front end elements of the web page 
html_temp = """ 
    <div style ="background-color:pink;padding:13px"> 
    <h1 style ="color:black;text-align:center;">Healthy Heart App</h1> 
    </div> 
    """
      
# display the front end aspect
st.markdown(html_temp, unsafe_allow_html = True) 
st.subheader('by Amlan Mohanty ')
      
# following lines create boxes in which user can enter data required to make prediction
age=st.selectbox ("Age",range(1,121,1))
sex = st.radio("Select Gender: ", ('male', 'female'))
cp = st.selectbox('Chest Pain Type',("Typical angina","Atypical angina","Non-anginal pain","Asymptomatic")) 
trestbps=st.selectbox('Resting Blood Pressure',range(1,500,1))
restecg=st.selectbox('Resting Electrocardiographic Results',("Nothing to note","ST-T Wave abnormality","Possible or definite left ventricular hypertrophy"))
chol=st.selectbox('Serum Cholesterol in mg/dl',range(1,1000,1))
fbs=st.radio("Fasting Blood Sugar higher than 120 mg/dl", ['Yes','No'])
thalach=st.selectbox('Maximum Heart Rate Achieved',range(1,300,1))
exang=st.selectbox('Exercise Induced Angina',["Yes","No"])
oldpeak=st.number_input('Oldpeak')
slope = st.selectbox('Heart Rate Slope',("Upsloping: better heart rate with excercise(uncommon)","Flatsloping: minimal change(typical healthy heart)","Downsloping: signs of unhealthy heart"))
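ca=st.selectbox('Number of Major Vessels Colored by Fluoroscopy',range(0,5,1))
thal=st.selectbox('Thallium Stress Result',("normal","fixed defect: used to be defect but ok now","reversable defect: no proper blood movement when excercising"))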



#user_input=preprocess(sex,cp,exang, fbs, slope, thal )
if st.button("Predict"):    
  # Run the model only when the button is pressed; class 1 is the high-risk class
  pred=preprocess(age,sex,cp,trestbps,restecg,chol,fbs,thalach,exang,oldpeak,slope,ca,thal)
  if pred[0] == 1:
    st.error('Warning! You have high risk of getting a heart attack!')
    
  else:
    st.success('You have lower risk of getting a heart disease!')
    
   



st.sidebar.subheader("About App")

st.sidebar.info("This web app helps you find out whether you are at risk of developing a heart disease.")
st.sidebar.info("Enter the required fields and click on the 'Predict' button to check whether you have a healthy heart")
st.sidebar.info("Don't forget to rate this app")



feedback = st.sidebar.slider('How much would you rate this app?',min_value=0,max_value=5,step=1)

if feedback:
  st.header("Thank you for rating the app!")
  st.info("Caution: This is just a prediction and not medical advice. Kindly see a doctor if you feel the symptoms persist.") 

# Using NN to build classification model

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout ,Flatten
from tensorflow.keras.layers.experimental.preprocessing import Normalization

In [None]:
normalize = Normalization()

In [None]:
X_train.shape

In [None]:
model = Sequential([
    Dense(12, activation=tf.nn.relu,input_shape=(12,)),  # one sample has 12 features; the batch size is not part of the shape
    Dropout(0.5),
    Dense(1, activation=tf.nn.sigmoid),
])

model.compile(loss='binary_crossentropy', optimizer='rmsprop',metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', mode='min', patience=10,restore_best_weights=True)

In [None]:
model.fit(x=X_train, 
          y=Y_train, 
          epochs=5,
          batch_size=10,
          validation_data=(X_test, Y_test),
          callbacks=[early_stop]
          )

In [None]:
history_dict=model.history.history
history_dict

In [None]:
model.evaluate(X_test, Y_test)

In [None]:
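# predict_classes is not available in recent Keras versions; threshold the sigmoid output instead
pred = (model.predict(np.expand_dims(X_test[4], axis=0)) > 0.5).astype(int)
print("prediction= {} versus real= {}".format(pred,Y_test.iloc[4]))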