# Introduction to Heart Attack Prediction 
### **Arpit Verma**
#### 15-Jun-2021

* **1. Introduction**
* **2. Data preparation**
    * 2.1 Importing libraries
    * 2.2 Loading data
    * 2.3 Understanding data
    * 2.4 Outlier treatment
    * 2.5 Distribution of target variable 
    * 2.6 Correlation of predictors with target variable
    * 2.7 Converting categorical variables into dummies   
    * 2.8 Collinearity check 
* **3. Train test split**
* **4. Logistic regression model**
* **5. Decision tree classifier model**
* **6. Random forest classifier model**
* **7. Gradient boosting classifier model**
* **8. Conclusion**

# 1. Introduction

In this notebook, we will develop a model to predict heart attack, using the dataset: https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset

We will develop the following models:
* Logistic Regression
* Decision Tree
* Random Forest
* Gradient Boosting

Based on the performance of these models on the training and test datasets, we will select our final model.


# 2. Data preparation
## 2.1 Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, accuracy_score, roc_curve, plot_confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

import warnings
warnings.filterwarnings("ignore")

## 2.2 Loading data

In [None]:
full_data = pd.read_csv("/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv")

## 2.3 Understanding data

In [None]:
print(full_data.head(3))
print("\nNumber of rows: %d \nNumber of columns: %d" % (full_data.shape[0], full_data.shape[1]))
print("Column names: ", list(full_data.columns))

Our dataset has 303 records, where each record represents a person. Target variable is *output* which has a value of 1 if the person had a heart attack.

Let's have a look at our 13 predictor variables:
* age - Age of the person
* sex - Gender of the person. This column has values 0 and 1. Let's assume that 1: Male and 0: Female
* cp - Chest pain type: typical angina, atypical angina, non-anginal pain, asymptomatic. The dataset description numbers these 1 to 4, while the column itself contains the values 0 to 3
* trtbps - Resting blood pressure (in mm Hg). High blood pressure is associated with a higher risk of heart attack
* chol - Cholesterol in mg/dl, fetched via BMI sensor
* fbs - Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
* restecg - Resting electrocardiographic results ~ 0 = Normal, 1 = ST-T wave abnormality, 2 = Left ventricular hypertrophy
* thalachh - Maximum heart rate achieved
* exng - Exercise induced angina ~ 1 = Yes, 0 = No
* oldpeak - Previous peak (ST depression induced by exercise relative to rest)
* slp - Slope (of the peak exercise ST segment)
* caa - Number of major vessels (0-3 in the dataset description; the column contains the values 0 to 4)
* thall - Thallium Stress Test result ~ (0,3)

In [None]:
full_data.describe().T.style.bar(subset=['mean'], color='#208ff2')\
                            .background_gradient(subset=['std'], cmap='Reds')

We can observe that all the variables have a count of 303, hence there is no missing data. The following variables have a high standard deviation: *age, trtbps, chol, thalachh.*

## 2.4 Outlier treatment

We need to check if there are outliers in the continuous variables and treat the outliers, if any. Following are the continuous variables in the data:
* age
* trtbps
* chol
* thalachh
* oldpeak

We will start by creating some functions for outlier check and treatment. A good treatment for outliers is to apply a floor on the minimum value (set to the 1st percentile) and a cap on the maximum value (set to the 99th percentile).

In [None]:
# Creating function for outlier check
def outlier_check(var):
    sns.displot(data = full_data, x = var)
    print('Variable: ', var) 
    print('Minimum value: ', full_data[var].min())
    print('1st percentile: ', full_data[var].quantile(0.01))
    print('99th percentile: ', full_data[var].quantile(0.99))
    print('Maximum value: ', full_data[var].max())

# Outlier treatment: floors values at the 1st percentile and/or caps them at the 99th percentile.
def outlier_treatment(var, min_treat, max_treat):
    # Named lower/upper to avoid shadowing the built-ins min() and max()
    lower = full_data[var].quantile(0.01)
    upper = full_data[var].quantile(0.99)
    if min_treat == 1:
        full_data.loc[full_data[var] <= lower, var] = lower
    if max_treat == 1:
        full_data.loc[full_data[var] >= upper, var] = upper


The *outlier_check* function will perform the following steps:
* Plot a distribution of values of the variable
* Print the minimum and 1st percentile value of the variable
* Print the maximum and 99th percentile value of the variable

If the distribution shows the presence of outliers, or if there seems to be a significant difference between the minimum (maximum) value and the 1st (99th) percentile, then we will apply a floor (cap) using our second function *outlier_treatment*.
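
The floor-and-cap idea described above can also be expressed with pandas' `Series.clip`. The following is a minimal sketch on synthetic values, not on the heart dataset; the helper name `winsorize` and its parameters are hypothetical and only illustrate the technique.

```python
import pandas as pd

def winsorize(series, lower_q=0.01, upper_q=0.99, floor=True, cap=True):
    """Return a copy of `series` clipped to the requested quantiles."""
    # A bound of None tells clip() to leave that side untouched
    lower = series.quantile(lower_q) if floor else None
    upper = series.quantile(upper_q) if cap else None
    return series.clip(lower=lower, upper=upper)

# Synthetic values for illustration: 1..100 plus one extreme value
values = pd.Series(list(range(1, 101)) + [1000], dtype=float)

# Cap only the right tail, as done for chol and oldpeak
capped = winsorize(values, floor=False, cap=True)

print("Maximum before:", values.max())
print("Maximum after: ", capped.max())
```

Unlike *outlier_treatment*, this sketch returns a new Series instead of modifying the DataFrame in place, which makes it easier to compare the values before and after treatment.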

**Outlier treatment for column: *age***

In [None]:
outlier_check('age')
outlier_treatment('age', min_treat = 1, max_treat = 0)

Distribution of column *age* shows that there are outliers on the left side of the distribution. Outliers don't seem to be present on the right side, hence, we will only apply outlier treatment to floor minimum value to 1st percentile.

**Outlier treatment for column: *trtbps***

In [None]:
outlier_check('trtbps')
outlier_treatment('trtbps', min_treat = 1, max_treat = 1)

Distribution of column *trtbps* shows outliers on both left as well as right side. Hence, we will apply floor and cap to this variable.

**Outlier treatment for column: *chol***


In [None]:
outlier_check('chol')
outlier_treatment('chol', min_treat = 0, max_treat = 1)

Distribution of column *chol* shows that there are outliers on the right side of the distribution. Outliers don't seem to be present on the left side, hence, we will only apply outlier treatment to cap maximum value to 99th percentile.

**Outlier treatment for column: *thalachh***


In [None]:
outlier_check('thalachh')
outlier_treatment('thalachh', min_treat = 1, max_treat = 0)

Distribution of column *thalachh* shows that there are outliers on the left side of the distribution. Outliers don't seem to be present on the right side, hence, we will only apply outlier treatment to floor minimum value to 1st percentile.

**Outlier treatment for column: *oldpeak***


In [None]:
outlier_check('oldpeak')
outlier_treatment('oldpeak', min_treat = 0, max_treat = 1)

Distribution of column *oldpeak* shows that there are outliers on the right side of the distribution. Outliers don't seem to be present on the left side, hence, we will only apply outlier treatment to cap maximum value to 99th percentile.

## 2.5 Distribution of target variable

In [None]:
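# Pass the column as a keyword argument; newer seaborn versions no longer accept it positionally
sns.countplot(x = full_data['output'])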
print(pd.concat( [full_data['output'].value_counts(),
                  full_data['output'].value_counts(normalize=True).mul(100).round(2)],
                 axis = 1,
                 keys = ('Count', 'Percentage')))

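Our target variable is fairly evenly distributed between 0s and 1s. This is good: with balanced classes, accuracy is a reasonable metric for comparing models, because a model cannot score well simply by always predicting the majority class.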
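
To see why class balance matters when accuracy is the metric, it helps to compute the accuracy of a trivial model that always predicts the most frequent class. The sketch below uses hypothetical class counts, not the counts of this dataset, and the function name is an assumption made for illustration.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a model that always predicts the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# Hypothetical class counts, for illustration only
balanced = {0: 50, 1: 50}
skewed = {0: 95, 1: 5}

print("Balanced baseline:", majority_baseline_accuracy(balanced))
print("Skewed baseline:  ", majority_baseline_accuracy(skewed))
```

With balanced classes the baseline sits near 50%, so any accuracy well above that reflects something the model has learned.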

## 2.6 Correlation of predictors with target variable

In [None]:
plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(full_data.corr()[['output']].sort_values(by='output', ascending=False), 
                      vmin=-1, 
                      vmax=1, 
                      annot=True, 
                      cmap=sns.diverging_palette(5, 5, as_cmap=True))
heatmap.set_title('Features Correlating with Target', fontdict={'fontsize':18}, pad=16);

The correlation plot shows the correlation of each variable with *output*. As a rule of thumb, we treat a value greater than 0.2 or less than -0.2 as a good correlation. We can observe that the following variables fall short of this threshold and can be removed from further steps:
* restecg = 0.14
* fbs = -0.028
* chol = -0.1
* trtbps = -0.14

Let's remove these variables now.

In [None]:
low_corr_vars = ['restecg', 'fbs', 'chol', 'trtbps']
full_data.drop(low_corr_vars, axis = 1, inplace = True)
full_data.head()
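
The list of low-correlation variables was read off the heatmap by hand. As a sketch of how the same 0.2 threshold could be applied programmatically, the hypothetical helper below returns the columns whose absolute correlation with the target falls under the threshold. The demo DataFrame is synthetic and its column names are made up for illustration.

```python
import pandas as pd

def low_correlation_columns(df, target, threshold=0.2):
    """Columns whose absolute correlation with `target` is below `threshold`."""
    corr = df.corr()[target].drop(target)
    return corr[corr.abs() < threshold].index.tolist()

# Synthetic data for illustration: 'signal' tracks the target, 'noise' does not
demo = pd.DataFrame({
    "signal": [0.1, 0.9, 0.2, 0.8] * 25,
    "noise":  [0, 0, 1, 1] * 25,
    "target": [0, 1, 0, 1] * 25,
})

print(low_correlation_columns(demo, "target"))
```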

## 2.7 Converting categorical variables into dummies

We have 5 categorical variables in our dataset:
* sex - Gender of the person; Values 0, 1
* cp - Chest pain type; Values - 0,1,2,3
* exng - Exercise induced angina; Values - 0, 1
* caa - Number of major vessels; Values - 0, 1, 2, 3, 4
* thall - Thallium Stress Test result ~ (0,3)

We need dummy columns for variables which have more than 2 categories. Only 3 variables have more than 2 categories: cp, caa, thall. Let's convert these variables into dummies.

In [None]:
full_data = pd.get_dummies(full_data, columns = ['cp', 'caa', 'thall'], drop_first = True)
full_data.head()
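
The effect of `drop_first = True` is easiest to see on a tiny example. The sketch below uses a made-up five-row column named `cp`; the first category (0) gets no column of its own and is represented by all other dummies being zero.

```python
import pandas as pd

# Made-up values for illustration, using the same 0-3 coding as cp
demo = pd.DataFrame({"cp": [0, 1, 2, 3, 1]})

dummies = pd.get_dummies(demo, columns=["cp"], drop_first=True)

print(list(dummies.columns))
print(dummies)
```

Dropping one category per variable avoids a set of dummy columns that always sum to 1, which would be perfectly collinear with the intercept of a linear model.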

Our dataset is ready. Before starting the modeling part, we need to make sure that our predictor variables do not have a high correlation with each other. Let's check that now.

## 2.8 Collinearity check

In [None]:
corr_data = full_data.drop(['output'], axis = 1)
corrmat = corr_data.corr()
f, ax = plt.subplots(figsize=(14, 14))
sns.heatmap(corrmat, 
            vmin = -1, 
            vmax= 1, 
            square = True, 
            annot = True,
            cmap=sns.diverging_palette(5, 5, as_cmap=True));

Correlation among predictor variables can be a problem: it can reduce the predictive power of the model, which becomes very sensitive to changes in the value of any of the correlated variables. 
Imagine you fit a regression model and the coefficient values, and even the signs, change dramatically depending on which variables you include in the model. Slightly different models then lead to very different conclusions, and you can no longer tell the actual effect of each variable.

We treat a correlation among predictor variables of greater than 0.4 or less than -0.4 as too high. From the heatmap, we can see that *slp* and *oldpeak* have a correlation of -0.57. We need to remove one of these variables. We should keep *oldpeak* because its correlation with our target variable is stronger in absolute terms (-0.44) than that of *slp* (0.35).

In [None]:
full_data = full_data.drop(['slp'], axis = 1)
full_data.head()
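
Reading pairs off a large heatmap is error-prone, so the same 0.4 rule can be checked in code. This is a minimal sketch with a hypothetical helper, `correlated_pairs`, run on a small synthetic DataFrame rather than on the notebook's data.

```python
import pandas as pd

def correlated_pairs(df, threshold=0.4):
    """List (col_a, col_b, r) for every pair with |r| above `threshold`."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    return pairs

# Synthetic columns for illustration: b mirrors a, c is unrelated to both
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [6, 5, 4, 3, 2, 1],
    "c": [1, -1, 0, 0, -1, 1],
})

print(correlated_pairs(demo))
```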

Our dataset is now ready for model development. Let's go ahead and start developing models on this dataset.

# 3. Train test split

In [None]:
# Splitting into predictor and target variables
x_full = full_data.drop('output', axis = 1)
y_full = full_data['output']

# Train test split
x_train, x_test, y_train, y_test = train_test_split(x_full, y_full, test_size = 0.3, random_state = 1)

print('Train data records: %d \nTest data records: %d' % (x_train.shape[0], x_test.shape[0]))

We have split the data into train (70%) and test (30%). Now, we can start developing models on train data and check their performance on both train and test data. Let's start with Logistic Regression model.
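
As a quick check of what a 70/30 split means for 303 records, the sketch below runs `train_test_split` on a stand-in list of row indices instead of the real data. The variable names are illustrative.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 303 records: just their row numbers
rows = list(range(303))

train_rows, test_rows = train_test_split(rows, test_size=0.3, random_state=1)

print("Train records:", len(train_rows))
print("Test records: ", len(test_rows))
```

The test size is rounded up, so 30% of 303 becomes 91 test records and the remaining 212 go to training.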

# 4. Logistic regression model

In [None]:
logistic_model = LogisticRegression(max_iter = 1000)
logistic_model.fit(x_train, y_train)
train_predictions = logistic_model.predict(x_train)
test_predictions = logistic_model.predict(x_test)
model_performance = pd.DataFrame([[ 'Logistic Regression', 
                                    round(accuracy_score(y_train, train_predictions)*100,2),
                                    round(accuracy_score(y_test, test_predictions)*100,2)]],
                                   columns = ['Model', 'Train_Accuracy', 'Test_Accuracy'])
print(model_performance)

# Plotting the ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, logistic_model.predict_proba(x_test)[:,1])
plt.plot([0,1],[0,1],"k--")  # diagonal = performance of a random classifier
plt.plot(fpr,tpr,label="Logistic Regression")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Logistic Regression ROC Curve")
plt.legend()
plt.show()

Our model shows 86% accuracy with train data and 80% accuracy with test data. Let's try Decision Tree model now.
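
The ROC curve plotted above is built by lowering the decision threshold step by step and recording the false positive and true positive rates at each step. The following plain-Python sketch illustrates that construction and the area under the curve on four made-up predictions; it is not scikit-learn's implementation, and the function names are hypothetical.

```python
def roc_points(y_true, scores):
    """(fpr, tpr) points obtained by lowering the threshold over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples sharing a score cross the threshold together
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points)
print("AUC:", area_under_curve(points))
```

An area of 0.5 corresponds to the dashed diagonal in the plot, and 1.0 to a classifier that ranks every positive above every negative.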

# 5. Decision tree classifier model

Decision Tree model accuracy depends on the parameter max_leaf_nodes, which limits the number of leaf nodes (the terminal nodes produced by splitting). The more leaves we allow the model to make, the more we move from the underfitting area to the overfitting area.
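
To make the role of `max_leaf_nodes` concrete, the sketch below fits trees with different caps on a synthetic dataset generated by `make_classification` (an assumption made for illustration, not the heart data) and reports how many leaves each fitted tree ends up with.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

leaf_counts = {}
for cap in [2, 5, 20]:
    tree = DecisionTreeClassifier(max_leaf_nodes=cap, random_state=1).fit(X, y)
    # get_n_leaves() reports the number of leaves in the fitted tree
    leaf_counts[cap] = tree.get_n_leaves()

print(leaf_counts)
```

A cap of 2 allows exactly one split, so the tree becomes a single-rule classifier; larger caps let the tree carve the training data into finer regions.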

In [None]:
# Creating a function to create different models based on values of max_leaf_nodes
def dt_score(max_leaf_nodes, x_train, y_train, x_test, y_test):
    model = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(x_train, y_train)
    train_predict = model.predict(x_train)
    test_predict = model.predict(x_test)
    ac_score_train = round(accuracy_score(y_train, train_predict)*100,2)
    ac_score_test = round(accuracy_score(y_test, test_predict)*100,2)
    return(ac_score_train, ac_score_test)

dt_score_results = pd.DataFrame(columns = ['Max_leaf_nodes', 'Train Accuracy', 'Test Accuracy'])
for max_leaf_nodes in [2, 3, 4, 5, 10, 20, 30, 40, 50]:
    ac_score_train, ac_score_test = dt_score(max_leaf_nodes, x_train, y_train, x_test, y_test)
    print("Max leaf nodes: %d  \t Train Accuracy:  %.2f \t Test Accuracy %.2f" % (max_leaf_nodes, ac_score_train, ac_score_test))   

As we increase the maximum leaf nodes, our model accuracy on train data increases and that on test data decreases. We need an optimal value of this parameter, which gives us the best model in terms of train as well as test data performance. The best value of maximum leaf nodes is 2.
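
The selection rule used here (highest test accuracy, preferring the simpler tree on ties) can be written down explicitly. The sketch below uses hypothetical accuracy values that are not the notebook's output, and the helper name is made up.

```python
def best_setting(results):
    """Pick max_leaf_nodes with the highest test accuracy; smaller tree wins ties.

    `results` is a list of (max_leaf_nodes, train_accuracy, test_accuracy).
    """
    return max(results, key=lambda r: (r[2], -r[0]))[0]

# Hypothetical accuracies for illustration only, not results from this notebook
results = [
    (2, 76.0, 78.0),
    (5, 85.0, 78.0),
    (20, 95.0, 72.0),
    (50, 100.0, 70.0),
]

print("Selected max_leaf_nodes:", best_setting(results))
```

Choosing a parameter by its test accuracy makes the reported test score somewhat optimistic; a separate validation split or cross-validation is a common alternative.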

In [None]:
decision_tree_model = DecisionTreeClassifier(max_leaf_nodes = 2, random_state = 1)
decision_tree_model.fit(x_train,y_train)
train_predictions = decision_tree_model.predict(x_train)
test_predictions = decision_tree_model.predict(x_test)
# pd.concat is used because DataFrame.append is not available in newer pandas versions
model_performance = pd.concat([model_performance,
                               pd.DataFrame([['Decision Tree Classifier',
                                              round(accuracy_score(y_train, train_predictions)*100,2),
                                              round(accuracy_score(y_test, test_predictions)*100,2)]],
                                            columns = ['Model', 'Train_Accuracy', 'Test_Accuracy'])],
                              ignore_index = True)
print(model_performance)
# Plotting the ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, decision_tree_model.predict_proba(x_test)[:,1])
plt.plot([0,1],[0,1],"k--")  # diagonal = performance of a random classifier
plt.plot(fpr,tpr,label="Decision Tree")
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Decision Tree ROC Curve")
plt.show()

Our Decision Tree model is not performing better than Logistic Regression model. Let's try Random Forest model now.

# 6. Random forest classifier model

In [None]:
# Creating a function to create different models based on values of max_leaf_nodes
def rf_score(max_leaf_nodes, x_train, y_train, x_test, y_test):
    model = RandomForestClassifier(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(x_train, y_train)
    train_predict = model.predict(x_train)
    test_predict = model.predict(x_test)
    ac_score_train = round(accuracy_score(y_train, train_predict)*100,2)
    ac_score_test = round(accuracy_score(y_test, test_predict)*100,2)
    return(ac_score_train, ac_score_test)

rf_score_results = pd.DataFrame(columns = ['Max_leaf_nodes', 'Train Accuracy', 'Test Accuracy'])
for max_leaf_nodes in [2, 3, 4, 5, 10, 20, 30, 40, 50]:
    ac_score_train, ac_score_test = rf_score(max_leaf_nodes, x_train, y_train, x_test, y_test)
    print("Max leaf nodes: %d  \t Train Accuracy:  %.2f \t Test Accuracy %.2f" % (max_leaf_nodes, ac_score_train, ac_score_test))   

We can observe that our model accuracy in test data is best for Max leaf nodes of 5.

In [None]:
random_forest_model = RandomForestClassifier(max_leaf_nodes = 5, random_state = 1)
random_forest_model.fit(x_train,y_train)
train_predictions = random_forest_model.predict(x_train)
test_predictions = random_forest_model.predict(x_test)
# pd.concat is used because DataFrame.append is not available in newer pandas versions
model_performance = pd.concat([model_performance,
                               pd.DataFrame([['Random Forest Classifier',
                                              round(accuracy_score(y_train, train_predictions)*100,2),
                                              round(accuracy_score(y_test, test_predictions)*100,2)]],
                                            columns = ['Model', 'Train_Accuracy', 'Test_Accuracy'])],
                              ignore_index = True)
print(model_performance)
# Plotting the ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, random_forest_model.predict_proba(x_test)[:,1])
plt.plot([0,1],[0,1],"k--")  # diagonal = performance of a random classifier
plt.plot(fpr,tpr,label="Random Forest")
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Random Forest ROC Curve")
plt.show()

Random Forest Model has a higher accuracy than both Decision Tree as well as Logistic Regression. Let's try Gradient Boosting Model now.

# 7. Gradient boosting classifier model

In [None]:
# Creating a function to create different models based on values of max_leaf_nodes
def gbm_score(max_leaf_nodes, x_train, y_train, x_test, y_test):
    model = GradientBoostingClassifier(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(x_train, y_train)
    train_predict = model.predict(x_train)
    test_predict = model.predict(x_test)
    ac_score_train = round(accuracy_score(y_train, train_predict)*100,2)
    ac_score_test = round(accuracy_score(y_test, test_predict)*100,2)
    return(ac_score_train, ac_score_test)

gbm_score_results = pd.DataFrame(columns = ['Max_leaf_nodes', 'Train Accuracy', 'Test Accuracy'])
for max_leaf_nodes in [2, 3, 4, 5, 10, 20, 30, 40, 50]:
    ac_score_train, ac_score_test = gbm_score(max_leaf_nodes, x_train, y_train, x_test, y_test)
    print("Max leaf nodes: %d  \t Train Accuracy:  %.2f \t Test Accuracy %.2f" % (max_leaf_nodes, ac_score_train, ac_score_test))   

We can observe that our model accuracy in test data is best for Max leaf nodes of 2.

In [None]:
gbm_model = GradientBoostingClassifier(max_leaf_nodes = 2, random_state = 1)
gbm_model.fit(x_train,y_train)
train_predictions = gbm_model.predict(x_train)
test_predictions = gbm_model.predict(x_test)
# pd.concat is used because DataFrame.append is not available in newer pandas versions
model_performance = pd.concat([model_performance,
                               pd.DataFrame([['Gradient Boosting Classifier',
                                              round(accuracy_score(y_train, train_predictions)*100,2),
                                              round(accuracy_score(y_test, test_predictions)*100,2)]],
                                            columns = ['Model', 'Train_Accuracy', 'Test_Accuracy'])],
                              ignore_index = True)
print(model_performance)
# Plotting the ROC Curve (using the gradient boosting model's own probabilities)
fpr, tpr, thresholds = roc_curve(y_test, gbm_model.predict_proba(x_test)[:,1])
plt.plot([0,1],[0,1],"k--")  # diagonal = performance of a random classifier
plt.plot(fpr,tpr,label="Gradient Boosting")
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Gradient Boosting ROC Curve")
plt.show()

Gradient Boosting Classifier shows good performance, but Random Forest Classifier has a better test accuracy. We will therefore take Random Forest as our final model.

# 8. Conclusion

In [None]:
full_predictions = random_forest_model.predict(x_full)
print( "Accuracy Score: %.2f " % (round(accuracy_score(y_full, full_predictions)*100,2)))

# Plotting confusion matrix
plot_confusion_matrix(random_forest_model,
                      x_full, 
                      y_full,
                      cmap=plt.cm.Blues,
                      normalize= 'true')


Our final model is Random Forest Classifier. It shows an accuracy of 83.5% on the full data (train and test records combined).
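
The confusion matrix above is normalized with `normalize = 'true'`, which divides each row by the number of records in that true class. The sketch below shows the same normalization with `sklearn.metrics.confusion_matrix` on ten made-up labels, so the numbers are illustrative and not results of the final model.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes; each row sums to 1
cm = confusion_matrix(y_true, y_pred, normalize="true")

print(cm)
```

Read this way, the diagonal gives the share of each true class that was predicted correctly, which is more informative than a single accuracy figure when the two kinds of error matter differently.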