# Heart Attack Prediction with Classifier Algorithms

A heart attack is a serious condition that can cause death. Some features (age, anaemia, creatinine_phosphokinase, diabetes, high_blood_pressure, etc.) can trigger a heart attack. By looking at these features, we can estimate whether a person is at risk. In this notebook I developed models that predict deaths caused by heart attack (the `DEATH_EVENT` column). I used various metrics in order to find the most suitable model.

## CONTENT

[1.Exploratory Data Analysis](#1) <br/>
[2.Train And Test Split](#2) <br/>
[3.Create Model](#3) <br/>
    [3.1.Logistic Regression](#3.1) <br/>
    [3.2.K Nearest Neighbors](#3.2) <br/>
    [3.3.Support Vector Machine](#3.3) <br/>
    [3.4.Naive Bayes](#3.4) <br/>
    [3.5.Decision Tree Classifier](#3.5) <br/>
    [3.6.Random Forest Classifier](#3.6) <br/>
    [3.7.Gradient Boosting Classifier](#3.7) <br/>
    [3.8.XG Boosting Classifier](#3.8) <br/>
[4.Evaluation Models](#4) <br/>
[5.Conclusion](#5) <br/>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Model Selection
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix

# Model Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a id="1"></a>
## Exploratory Data Analysis

In [None]:
data = pd.read_csv("/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")
data.head()

**Before setting up the model, let's look at the features.**

In this dataset:

* It contains 299 rows (patient information).
* It contains 13 columns (features about heart attack).
* 10 features are integer type.
* 3 features are float type.
* It has no missing values.
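
The summary above can be checked directly from the DataFrame. The sketch below is a minimal illustration, assuming a hypothetical helper `summarize` and a tiny made-up frame `toy` standing in for `data`; neither is part of the original notebook.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Return row/column counts, a count per dtype, and the total number of missing values."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "dtype_counts": {str(k): int(v) for k, v in df.dtypes.value_counts().items()},
        "missing_values": int(df.isnull().sum().sum()),
    }

# Hypothetical stand-in for `data`; the values are invented for illustration only.
toy = pd.DataFrame({
    "age": [60.0, 72.0, 55.0],
    "anaemia": [0, 1, 0],
    "DEATH_EVENT": [1, 0, 0],
})

summary = summarize(toy)
print(summary)
```

Calling `summarize(data)` on the real dataset should reproduce the numbers listed above if the description is accurate.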

### **FEATURES**

**age:** Age of the patient <br/>
**anaemia:** Decrease of red blood cells or hemoglobin <br/>
**creatinine_phosphokinase:** Level of the CPK enzyme in the blood (mcg/L) <br/>
**diabetes:** If the patient has diabetes <br/>
**ejection_fraction:** Percentage of blood leaving the heart at each contraction (percentage) <br/>
**high_blood_pressure:** If the patient has hypertension <br/>
**platelets:** Platelets in the blood <br/>
**serum_creatinine:** Level of serum creatinine in the blood (mg/dL) <br/>
**serum_sodium:** Level of serum sodium in the blood (mEq/L) <br/>
**sex:** Woman or man (binary) <br/>
**smoking:** If the patient smokes or not <br/>
**time:** Follow-up period (days) <br/>
**DEATH_EVENT:** If the patient deceased during the follow-up period

In [None]:
data.info()

In [None]:
f, ax = plt.subplots(figsize=(12,12))
sns.heatmap(data.corr(),annot=True, linewidths=.5, ax=ax)
plt.show()

The correlation matrix above shows how the features relate to each other. These relationships can be useful when setting up a model: the stronger a feature's relationship with the target, the more it can help the model. In this dataset, we look at the correlation between DEATH_EVENT and the other features. If the absolute value of the correlation coefficient is greater than 0.1, the feature may be an important one for predicting a death event. When setting up my models, I will use the features whose absolute correlation coefficient with DEATH_EVENT is greater than 0.1.

In [None]:
cor = data.corr() 
corr_target = abs(cor["DEATH_EVENT"])
relevant_features = corr_target[corr_target>0.1]
relevant_features
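
The same filter can be wrapped in a small function that also drops the target's trivial correlation with itself. This is only a sketch under assumptions: the helper name `select_features` and the toy frame are hypothetical, and the 0.1 threshold is the one used above.

```python
import pandas as pd

def select_features(df, target, threshold=0.1):
    """Names of columns whose absolute Pearson correlation with `target` exceeds `threshold`."""
    # Correlation of every column with the target, without the target itself.
    corr_target = df.corr()[target].abs().drop(target)
    return corr_target[corr_target > threshold].index.tolist()

# Made-up data: "related" and "inverse" are perfectly (anti-)correlated with
# the target, while "unrelated" has zero correlation with it by construction.
target = [0, 0, 1, 1] * 5
toy = pd.DataFrame({
    "related": [2 * t + 1 for t in target],
    "unrelated": [0, 1, 0, 1] * 5,
    "inverse": [1 - t for t in target],
    "target": target,
})

selected = select_features(toy, "target")
print(selected)
```

Taking the absolute value matters here: a strongly negative correlation, like the `inverse` column, is just as informative as a positive one.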

<a id="2"></a>
## Train and Test Split

I split the data into an 80% training set and a 20% test set.

In [None]:
accuracy_list = []
algorithm = []
predict_list = []

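# Select the columns with a list: a set has no defined order, and recent pandas versions reject set indexers.
features = ["age","ejection_fraction","serum_creatinine","serum_sodium","time"]
X_train,X_test,Y_train,Y_test = train_test_split(data.loc[:,features]
                                                 ,data["DEATH_EVENT"],test_size=0.2)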
print("X_train shape :",X_train.shape)
print("Y_train shape :",Y_train.shape)
print("X_test shape :",X_test.shape)
print("Y_test shape :",Y_test.shape)
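
Because `train_test_split` shuffles randomly, each run produces a different split and therefore different scores. The sketch below assumes one wants a reproducible split that also preserves the class ratio. `random_state` and `stratify` are existing `train_test_split` parameters, but using them is a suggestion, not something this notebook does, and the toy arrays are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 50 samples, 2 features, 30% positive labels.
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 35 + [1] * 15)

# random_state fixes the shuffle; stratify=y keeps the 70/30 class ratio
# in both the training and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train:", X_tr.shape, "positives:", int(y_tr.sum()))
print("test: ", X_te.shape, "positives:", int(y_te.sum()))
```

With only 299 rows and an imbalanced target, a stratified split may make the test scores of the different classifiers easier to compare.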

<a id="3"></a>
## Create Model

This section shows, for each model I set up, its accuracy and the number of patients it predicted. Accuracy alone is not enough to choose a model: sometimes a model with lower accuracy gives more useful predictions than a model with higher accuracy. All model comparisons are shown below. The algorithms I used:

* Logistic Regression
* K Nearest Neighbors 
* Support Vector Machine
* Naive Bayes
* Decision Tree Classifier
* Random Forest Classifier
* Gradient Boosting Classifier
* XG Boosting Classifier
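
To illustrate why accuracy alone can mislead, here is a minimal sketch with made-up labels (not data from this notebook). A classifier that always predicts "no death event" still reaches 80% accuracy on an imbalanced sample, while its recall for the death class is zero.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 8 survivors (0) and 2 deaths (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A degenerate classifier that always predicts class 0.
y_pred = [0] * 10

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# zero_division=0 avoids a warning when no sample is predicted as class 1.
precision = precision_score(y_true, y_pred, zero_division=0)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy:", accuracy, "recall:", recall, "precision:", precision)
```

Precision and recall for the positive class are therefore worth reporting next to accuracy when comparing the classifiers below.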

<a id="3.1"></a>
### Logistic Regression

In [None]:
reg = LogisticRegression(max_iter=1000)
reg.fit(X_train,Y_train)
accuracy_list.append(reg.score(X_test,Y_test))
algorithm.append("Logistic Regression")
print("test accuracy ",reg.score(X_test,Y_test))

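cm = confusion_matrix(Y_test,reg.predict(X_test))
# cm is laid out as [[TN, FP], [FN, TP]], so item(0)+item(2) = TN + FN:
# the number of test samples predicted as class 0 (no death event).
predict_list.append(cm.item(0)+cm.item(2))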
sns.heatmap(cm,annot=True, linewidths=.5)
plt.show()

<a id="3.2"></a>
### K Nearest Neighbors

In [None]:
knn = KNeighborsClassifier()
param_grid = {'n_neighbors': np.arange(1, 25)}
knn_gscv = GridSearchCV(knn, param_grid, cv=4)
knn_gscv.fit(X_train, Y_train)
print("Best K Value is ",knn_gscv.best_params_)

accuracy_list.append(knn_gscv.score(X_test,Y_test))
print("test accuracy ",knn_gscv.score(X_test,Y_test))
algorithm.append("K Nearest Neighbors Classifier")

cm = confusion_matrix(Y_test,knn_gscv.predict(X_test))
predict_list.append(cm.item(0)+cm.item(2))
sns.heatmap(cm,annot=True, linewidths=.5)
plt.show()

<a id="3.3"></a>
### Support Vector Machine

In [None]:
svm = SVC()
svm.fit(X_train,Y_train)
print("test accuracy: ",svm.score(X_test,Y_test))
accuracy_list.append(svm.score(X_test,Y_test))
algorithm.append("Support Vector Machine")

cm = confusion_matrix(Y_test,svm.predict(X_test))
predict_list.append(cm.item(0)+cm.item(2))
sns.heatmap(cm,annot=True, linewidths=.5)
plt.show()

<a id="3.4"></a>
### Naive Bayes

In [None]:
nb = GaussianNB()
nb.fit(X_train,Y_train)
print("test accuracy: ",nb.score(X_test,Y_test))
accuracy_list.append(nb.score(X_test,Y_test))
algorithm.append("Naive Bayes Classifier")

cm = confusion_matrix(Y_test,nb.predict(X_test))
predict_list.append(cm.item(0)+cm.item(2))
sns.heatmap(cm,annot=True, linewidths=.5)
plt.show()

<a id="3.5"></a>
### Decision Tree

In [None]:
dt = DecisionTreeClassifier()
dt.fit(X_train,Y_train)
print("test accuracy: ",dt.score(X_test,Y_test))
accuracy_list.append(dt.score(X_test,Y_test))
algorithm.append("Decision Tree Classifier")

cm = confusion_matrix(Y_test,dt.predict(X_test))
predict_list.append(cm.item(0)+cm.item(2))
sns.heatmap(cm,annot=True, linewidths=.5)
plt.show()

<a id="3.6"></a>
### Random Forest

In [None]:
param_grid = {'n_estimators': np.arange(10, 100, 10)}
rf = RandomForestClassifier(random_state = 42)
rf_gscv = GridSearchCV(rf, param_grid, cv=4)
rf_gscv.fit(X_train, Y_train)
print("Best n_estimators value is ",rf_gscv.best_params_)

print("test accuracy: ",rf_gscv.score(X_test,Y_test))
accuracy_list.append(rf_gscv.score(X_test,Y_test))
algorithm.append("Random Forest Classifier")

cm = confusion_matrix(Y_test,rf_gscv.predict(X_test))
predict_list.append(cm.item(0)+cm.item(2))
sns.heatmap(cm,annot=True, linewidths=.5)
plt.show()

<a id="3.7"></a>
### Gradient Boosting

In [None]:
param_grid = {'n_estimators': [10,20,50],'learning_rate': [0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1],'max_features': [2],'max_depth': [2]}
gb = GradientBoostingClassifier()
gb_gscv = GridSearchCV(gb, param_grid, cv=4)
gb_gscv.fit(X_train,Y_train)
print("The best parameters are ",gb_gscv.best_params_)
print("------------------------------------------------------")
print("test accuracy is ",gb_gscv.score(X_test,Y_test))
accuracy_list.append(gb_gscv.score(X_test,Y_test))
algorithm.append("Gradient Boosting Classifier")

cm = confusion_matrix(Y_test,gb_gscv.predict(X_test))
predict_list.append(cm.item(0)+cm.item(2))
sns.heatmap(cm,annot=True, linewidths=.5)
plt.show()

<a id="3.8"></a>
### XGBoosting

In [None]:
xgb_clf = XGBClassifier()
xgb_clf.fit(X_train, Y_train)
print("test accuracy is ",xgb_clf.score(X_test,Y_test))
accuracy_list.append(xgb_clf.score(X_test,Y_test))
algorithm.append("XGBClassifier")

cm = confusion_matrix(Y_test,xgb_clf.predict(X_test))
predict_list.append(cm.item(0)+cm.item(2))
sns.heatmap(cm,annot=True, linewidths=.5)
plt.show()

<a id="4"></a>
## Evaluation Models

In [None]:
#Classifier Accuracy
f,ax = plt.subplots(figsize = (15,7))
sns.barplot(x=accuracy_list,y=algorithm,palette = sns.cubehelix_palette(len(accuracy_list)))
plt.xlabel("Accuracy")
plt.ylabel("Classifier")
plt.title('Classifier Accuracy')
plt.show()

In [None]:
#Number of test samples each classifier predicted as class 0 (cm.item(0)+cm.item(2) = TN + FN)
f,ax = plt.subplots(figsize = (15,7))
sns.barplot(x=predict_list,y=algorithm,palette = sns.cubehelix_palette(len(accuracy_list)))
plt.xlabel("Predicted No-Death-Event Count (TN + FN)")
plt.ylabel("Classifier")
plt.title('Classifier Predicted No-Death-Event Count')
plt.show()
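
The lists `algorithm`, `accuracy_list` and `predict_list` can also be combined into one table for a side-by-side comparison. This is a sketch only: the values below are placeholders invented for illustration, not results from this notebook, and the column names are assumptions.

```python
import pandas as pd

# Placeholder values for illustration only; they are NOT results from this notebook.
algorithm = ["Model A", "Model B", "Model C"]
accuracy_list = [0.80, 0.75, 0.85]
predict_list = [50, 60, 45]

# One row per classifier, sorted so that the highest accuracy comes first.
results = (
    pd.DataFrame({
        "classifier": algorithm,
        "accuracy": accuracy_list,
        "predict_count": predict_list,
    })
    .sort_values("accuracy", ascending=False)
    .reset_index(drop=True)
)
print(results)
```

In the notebook itself, the three lists filled in by the model cells would be used instead of the placeholders.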

<a id="5"></a>
## Conclusion

* The correlation coefficient is effective for selecting features in classification problems.
* The models give realistic results when they use features that have a high correlation coefficient with the target.
* Accuracy alone is not enough to choose a model; other metrics need to be applied. In particular, the number of samples predicted correctly for each class is an important metric. Sometimes a model with lower accuracy can give more useful predictions.
* For example, the Support Vector Machine had a lower accuracy, yet it gave more accurate predictions than the other algorithms.
* If this notebook was useful to you, please don't forget to upvote. :))