# <p style="background-color:  #ff8080; font-family: Helvetica, fantasy; line-height: 1.3; font-size: 26px; letter-spacing: 3px; text-align: center; color: #ffffff">Heart Attack Prediction using Extreme Gradient Boosting (XGBoost)</p>

![](https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/pink-porcelain-anatomical-heart-royalty-free-image-1597338342.jpg)

<p style="background-color:  #ff8080; font-family: Helvetica, fantasy; line-height: 1.3; font-size: 26px; letter-spacing: 3px; text-align: center; color: #ffffff">Dataset description</p>

- age : Age of the patient
- sex : sex of the patient (1 = male; 0 = female)
- exng: exercise induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slp: the slope of the peak exercise ST segment (2 = upsloping; 1 = flat; 0 = downsloping)
- thall: 2 = normal; 1 = fixed defect; 3 = reversible defect
- caa: number of major vessels (0-3)
- cp : chest pain type
 - Value 0: typical angina
 - Value 1: atypical angina
 - Value 2: non-anginal pain
 - Value 3: asymptomatic
- trtbps : resting blood pressure (in mm Hg)
- chol : cholesterol in mg/dl fetched via BMI sensor
- fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- restecg : resting electrocardiographic results
 - Value 0: normal
 - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
 - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- thalachh : maximum heart rate achieved
- output : target (0 = less chance of heart attack; 1 = more chance of heart attack)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import random

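# Seed both RNGs: scikit-learn falls back to NumPy's global RNG when random_state is None
random.seed(224)
np.random.seed(224)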
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
print(df.shape)
df.head()

In [None]:
df.info()

In [None]:
# Renaming columns
df.columns = ['Age', 'Sex', 'ChestPainType', 'RestingBloodPressure', 'Cholesterol', 'FastingBloodSugar', 'RestingECG', 'MaxHeartRate',
       'ExerciseInducedAngina', 'PreviousPeak', 'Slope', 'MajorBloodVessels', 'ThalRate', 'ProbHA']

categoricals = ['Sex', 'ChestPainType', 'FastingBloodSugar', 'RestingECG', 'ExerciseInducedAngina', 'Slope', 'ThalRate', 'ProbHA']
numericals = [i for i in df.columns if i not in categoricals]
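
The positional rename above depends on the column order of the CSV. A minimal sketch of a name-based alternative, assuming the raw headers are the lowercase names from the dataset description (`age`, `sex`, `cp`, ...); the one-row frame and the `raw`, `rename_map` and `renamed` names are illustrative stand-ins, not part of the original notebook.

```python
import pandas as pd

# Illustrative one-row stand-in for heart.csv; only the headers matter here.
raw = pd.DataFrame(
    [[63, 1, 3, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 1, 1]],
    columns=['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg',
             'thalachh', 'exng', 'oldpeak', 'slp', 'caa', 'thall', 'output'],
)

# Map each raw header to the descriptive name used in this notebook.
rename_map = {
    'age': 'Age', 'sex': 'Sex', 'cp': 'ChestPainType',
    'trtbps': 'RestingBloodPressure', 'chol': 'Cholesterol',
    'fbs': 'FastingBloodSugar', 'restecg': 'RestingECG',
    'thalachh': 'MaxHeartRate', 'exng': 'ExerciseInducedAngina',
    'oldpeak': 'PreviousPeak', 'slp': 'Slope', 'caa': 'MajorBloodVessels',
    'thall': 'ThalRate', 'output': 'ProbHA',
}
renamed = raw.rename(columns=rename_map)
print(list(renamed.columns))
```

With a dictionary, a reordered or extended CSV would leave unmatched columns unrenamed instead of silently mislabeling them.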

# Investigate categorical features

In [None]:
for col in df[categoricals]:
    print(f'We have {len(df[col].unique())} unique values in --{col}-- column: {df[col].unique()}')

In [None]:
# Count plots for categorical features
x=0
fig=plt.figure(figsize=(15,10),constrained_layout =True)
plt.subplots_adjust(wspace = 0.5)
plt.suptitle("Count of the Categorical Variables", y=0.95, family='serif', size=18, weight='bold')
for i in df[categoricals]:
    ax = plt.subplot(241+x)
    ax = sns.countplot(data=df, x=i, color = 'salmon')
    plt.grid(axis='y')
    x+=1

# Investigate numerical features

In [None]:
df[numericals].describe()

In [None]:
corr = df.corr()

mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)]=True
with sns.axes_style('white'):
    fig, ax = plt.subplots(figsize=(18,10))
    # linewidths is the heatmap parameter that draws thin lines between cells
    sns.heatmap(corr, mask=mask, cmap='YlGnBu', annot=True, center=0, vmin=-1, vmax=0.8,
                square=True, cbar_kws={'shrink':.5, 'orientation': 'vertical'}, linewidths=.02)
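
The mask built in the cell above hides the diagonal and the upper triangle, since `sns.heatmap` does not draw cells where the mask is true. A small sketch on a made-up 3x3 matrix (`corr_demo` and `mask_demo` are illustrative names) shows which cells remain visible.

```python
import numpy as np

# Made-up symmetric correlation matrix for three features.
corr_demo = np.array([[1.0, 0.2, -0.4],
                      [0.2, 1.0, 0.1],
                      [-0.4, 0.1, 1.0]])

# True marks a hidden cell: the diagonal and everything above it.
mask_demo = np.zeros_like(corr_demo, dtype=bool)
mask_demo[np.triu_indices_from(mask_demo)] = True
print(mask_demo)

# Only the strictly lower triangle stays visible.
visible = corr_demo[~mask_demo]
print(visible)
```

Each pair of features then appears exactly once in the plot.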

In [None]:
x=0
fig=plt.figure(figsize=(15,10),constrained_layout =True)
plt.subplots_adjust(wspace = 0.5)
plt.suptitle("Distribution of numerical variables", y=0.95, family='serif', size=18, weight='bold')
for i in df[numericals]:
    ax = plt.subplot(231+x)
    ax = sns.boxplot(data=df, y=i, color = 'salmon')
    x+=1

Outliers: the box plots show isolated points at Cholesterol > 500 and MaxHeartRate < 80, so the rows beyond these thresholds are dropped.


In [None]:
df.drop(df[df['Cholesterol'] > 500].index, inplace = True)
df.drop(df[df['MaxHeartRate'] < 80].index, inplace = True)
df.shape[0]
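
The thresholds above were read off the box plots. As a hedged alternative, the 1.5 x IQR rule that box plots use by default for their whiskers can be computed directly; the helper `iqr_bounds` and the sample values are hypothetical and not taken from the dataset.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences of the k * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical cholesterol readings with one extreme value.
chol_demo = np.array([180, 200, 210, 220, 240, 250, 260, 564])
low, high = iqr_bounds(chol_demo)
outliers = chol_demo[(chol_demo < low) | (chol_demo > high)]
print(low, high, outliers)
```

The rule is only a heuristic: it may flag more or fewer rows than the fixed thresholds used in this notebook.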

# Exploratory Data Analysis (EDA)

In [None]:
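counts = df['ProbHA'].value_counts()
print(counts)

pie, ax = plt.subplots(figsize=[15,10])
# Derive labels and colors from the class index so they cannot drift out of sync with the counts
label_map = {1: 'More chance of HA', 0: 'Less chance of HA'}
colors = ['#ff8533', '#7070db']  # colors[0] -> class 0, colors[1] -> class 1, as in the hue plots below
plt.pie(x=counts, autopct='%.2f%%', explode=[0.02]*2, labels=[label_map[c] for c in counts.index],
        pctdistance=0.5, textprops={'fontsize': 14}, colors=[colors[c] for c in counts.index])
plt.title('Distribution of target variable in %')
plt.show()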

In [None]:
x=0
fig=plt.figure(figsize=(15,10),constrained_layout =True)
plt.subplots_adjust(wspace = 0.5)
plt.suptitle("Count of the categorical variables by target variable", y=0.95, family='serif', size=18, weight='bold')
for i in df[categoricals]:
    ax = plt.subplot(241+x)
    ax = sns.countplot(data=df, x=i, hue='ProbHA', palette = colors)
    ax.legend_.remove()
    plt.grid(axis='y')
    x+=1

Insights (as read from the count plots above):
 - Relative to the total number of males, the number of males in the higher-risk class appears larger than the corresponding figure for females.
 - Individuals with typical angina chest pain (0) appear more likely to have a HA.
 - Individuals with normal (0) resting electrocardiographic results (Resting ECG) appear more likely to suffer a HA.
 - Individuals with exercise-induced angina appear more likely to suffer a HA.
 - Individuals with a flat slope of the peak exercise ST segment appear more likely to suffer a HA.
 - Individuals whose thal result is a reversible defect appear more likely to suffer a HA.
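
Count plots show raw counts, so groups of different sizes are hard to compare by eye, and with the legends removed it is easy to confuse which color belongs to which class. A minimal sketch of a normalized cross-tabulation that turns counts into per-group rates; the small frame below is made up for illustration and only borrows the `Sex` and `ProbHA` column names from this notebook.

```python
import pandas as pd

# Made-up data: 3 females (Sex = 0) and 6 males (Sex = 1).
demo = pd.DataFrame({
    'Sex':    [0, 0, 0, 1, 1, 1, 1, 1, 1],
    'ProbHA': [1, 1, 0, 1, 1, 1, 0, 0, 0],
})

# Raw counts per group and class.
counts_demo = pd.crosstab(demo['Sex'], demo['ProbHA'])

# normalize='index' divides each row by its total, giving class shares per group.
rates_demo = pd.crosstab(demo['Sex'], demo['ProbHA'], normalize='index')
print(counts_demo)
print(rates_demo)
```

In this toy example males have more rows in class 1 (3 against 2), yet a lower share of class 1 (0.5 against about 0.67). Running the same call on `df` for each name in `categoricals` gives numbers against which the readings above can be checked.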

In [None]:
x=0
fig=plt.figure(figsize=(15,10),constrained_layout =True)
plt.subplots_adjust(wspace = 0.5)
plt.suptitle("Distribution of numerical variables by target variable", y=0.95, family='serif', size=18, weight='bold')
for i in df[numericals]:
    ax = plt.subplot(231+x)
    ax = sns.histplot(data=df, x=i, hue='ProbHA', palette=colors, element='poly')
    ax.legend_.remove()
    x+=1

In [None]:
x=0
fig=plt.figure(figsize=(15,10),constrained_layout =True)
plt.subplots_adjust(wspace = 0.5)
plt.suptitle("Relationships between age and numerical features by target variable", y=0.95, family='serif', size=18, weight='bold')
for i in df[numericals[1:]]:
    ax = plt.subplot(231+x)
    ax = sns.scatterplot(data=df, x='Age', y=i, hue='ProbHA', palette=colors)
    ax.legend_.remove()
    x+=1

# Data preparation

In [None]:
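# Split into features & target, then into train & test sets
# Normalizer rescales each sample (row) to unit L2 norm; it is stateless, so applying it before the split leaks nothing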
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer
y = df['ProbHA']
X = df.drop('ProbHA', axis = 1)

normalize = Normalizer()
X = normalize.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123, shuffle = True, stratify = y)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
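
`Normalizer` with its default settings scales each row to unit L2 norm, which is a different operation from standardizing each feature column. The sketch below reproduces both operations with plain NumPy on made-up values so the difference is visible; `X_demo`, `row_scaled` and `col_scaled` are illustrative names, and the column-wise variant is shown only for comparison, not as what this notebook does.

```python
import numpy as np

# Made-up samples with two features on different scales (e.g. age, cholesterol).
X_demo = np.array([[50.0, 200.0],
                   [60.0, 300.0],
                   [70.0, 250.0]])

# Row-wise L2 scaling: every sample ends up with Euclidean norm 1.
row_scaled = X_demo / np.linalg.norm(X_demo, axis=1, keepdims=True)

# Column-wise standardization: every feature ends up with mean 0 and std 1.
col_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.linalg.norm(row_scaled, axis=1))
print(col_scaled.mean(axis=0), col_scaled.std(axis=0))
```

After row-wise scaling, a feature with large raw values still dominates each row, so which operation is appropriate depends on the models that follow.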

# Modeling

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

models = [('DT', DecisionTreeClassifier()),
          ('LR', LogisticRegression()), 
          ('SGDC', SGDClassifier()), 
          ('SVC', SVC())]

# Baseline models training and evaluation
for name, model in models:
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    print(f'The accuracy of {name} is {acc:.3f}')

## Ensembling with XGBoost (Extreme Gradient Boosting)

In [None]:
import xgboost as xgb

# Baseline XGBClassifier
xgb_cl = xgb.XGBClassifier()
xgb_cl.fit(X_train, y_train)
preds = xgb_cl.predict(X_test)
score = accuracy_score(y_test, preds)
print(f'The accuracy of XGBClassifier is {score:.3f}')

## Hyperparameter tuning for XGBoost

In [None]:
# Grid search
from sklearn.model_selection import GridSearchCV

params_grid = {'learning_rate':[0.01, 0.1, 0.5, 0.9],
              'n_estimators':[100,200,300],
              'subsample':[0.3, 0.5, 0.9],
               'max_depth':[2,3,4],
               'colsample_bytree':[0.3,0.5,0.7,1]}
grid = GridSearchCV(estimator=xgb_cl, param_grid=params_grid, scoring='accuracy', cv = 10)

grid.fit(X_train, y_train)
print(f'Best params found for XGBoost are: {grid.best_params_}')
print(f'Best mean cross-validated accuracy with these params: {grid.best_score_:.3f}')
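
An exhaustive grid search grows multiplicatively with each added parameter value. The short calculation below counts the model fits implied by the grid and the 10-fold cross-validation above; `params_grid_demo` is a copy of the grid made only for this count.

```python
from math import prod

# Copy of the grid used above, for counting only.
params_grid_demo = {
    'learning_rate': [0.01, 0.1, 0.5, 0.9],
    'n_estimators': [100, 200, 300],
    'subsample': [0.3, 0.5, 0.9],
    'max_depth': [2, 3, 4],
    'colsample_bytree': [0.3, 0.5, 0.7, 1],
}

n_candidates = prod(len(values) for values in params_grid_demo.values())
n_folds = 10
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 432 4320
```

With the default `refit=True`, `GridSearchCV` then fits the best candidate once more on the full training set.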

In [None]:
preds = grid.best_estimator_.predict(X_test)
print(accuracy_score(y_test, preds))

In [None]:
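from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc
# Confusion matrix (rows: true class, columns: predicted class)
confusion_matrix(y_test, preds)

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(grid.best_estimator_, X_test, y_test)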

Out of 61 test samples, XGBoost misclassified 6.
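
The accuracy follows directly from those two numbers. Precision and recall depend on how the 6 errors split into false positives and false negatives, which is not stated here, so the split in the sketch below is hypothetical and only illustrates the formulas.

```python
# Stated above: 61 test samples, 6 misclassified.
n_test, n_wrong = 61, 6
accuracy = (n_test - n_wrong) / n_test
print(f'accuracy = {accuracy:.3f}')  # accuracy = 0.902

# HYPOTHETICAL split of the 6 errors, laid out as in confusion_matrix:
# [[tn, fp],
#  [fn, tp]]
tn, fp, fn, tp = 24, 4, 2, 31

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
print(f'precision = {precision:.3f}, recall = {recall:.3f}')
```

For a medical screening task, recall on the higher-risk class is usually the number to watch, since a false negative is a missed patient.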

## AUC evaluation of XGBoost

In [None]:
probs = grid.best_estimator_.predict_proba(X_test)
pred = probs[:,1]
fpr, tpr, threshold = roc_curve(y_test, pred)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(12,8))
plt.title('ROC')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0,1], [0,1], 'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
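
The area under the ROC curve has a direct interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal pure-Python sketch of that pairwise definition, using made-up labels and scores; `auc_by_ranking` is a hypothetical helper, not a scikit-learn function.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for the positive class.
y_demo = [0, 0, 1, 1]
s_demo = [0.1, 0.4, 0.35, 0.8]
auc_demo = auc_by_ranking(y_demo, s_demo)
print(auc_demo)  # 0.75
```

Three of the four positive-negative pairs are ordered correctly here, which gives 0.75; an AUC of 0.5 corresponds to the red diagonal in the plot.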