# Heart Attack EDA & Classification

Hello! Welcome to my heart attack notebook, where we will visualise the features of this dataset and build a classifier that predicts heart attacks.

<img src="https://cdn-images-1.medium.com/max/1200/1*aDIf-jrD5Ikrfcjsnmdebw.jpeg" width="400px"/>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
import matplotlib.pyplot as plt
from collections import Counter
from xgboost import XGBClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
df.head()

In [None]:
df['sex'] = np.where(df['sex']==1, 'male', 'female')
df['cp'] = np.where(df['cp']==0, 'typical angina', np.where(df['cp']==1, 'atypical angina',
    np.where(df['cp']==2, 'non-anginal pain', np.where(df['cp']==3, 'asymptomatic', 0))))
df['fbs'] = np.where(df['fbs']==1, 'true', 'false')
df['restecg'] = np.where(df['restecg']==0, 'normal', 'abnormal')
df['exng'] = np.where(df['exng']==1, 'yes', 'no')
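
The nested `np.where` calls above can also be written as dictionary lookups with `Series.map`, which keeps each code next to its label. This is only a sketch on a small made-up frame, not on `heart.csv`; `toy`, `sex_labels` and `cp_labels` are hypothetical names, and the labels are copied from the cell above.

```python
import pandas as pd

# Hypothetical toy rows standing in for heart.csv (values are made up).
toy = pd.DataFrame({'sex': [1, 0, 1], 'cp': [0, 2, 3]})

# Same labels as the np.where calls above, written as lookup tables.
sex_labels = {0: 'female', 1: 'male'}
cp_labels = {0: 'typical angina', 1: 'atypical angina',
             2: 'non-anginal pain', 3: 'asymptomatic'}

# map() replaces each code with its label; codes missing from the
# dictionary become NaN instead of a silent fallback value.
toy['sex'] = toy['sex'].map(sex_labels)
toy['cp'] = toy['cp'].map(cp_labels)
print(toy)
```

One difference in behaviour: an unexpected code shows up as a missing value here, which is easier to spot than the `0` fallback of the nested version.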

### Description of features
* **age** - age of patient
* **sex** - sex of patient (1=male, 0=female)
* **cp** - chest pain type (4 values)
* **trtbps** - resting blood pressure
* **chol** - serum cholesterol in mg/dl
* **fbs** - fasting blood sugar > 120 mg/dl (1=true, 0=false)
* **restecg** - resting electrocardiographic results (values 0,1,2)
* **thalachh** - maximum heart rate achieved
* **exng** - exercise induced angina (1=yes, 0=no)
* **oldpeak** - ST depression induced by exercise relative to rest
* **slp** - the slope of the peak exercise ST segment
* **caa** - number of major vessels (0-3) colored by fluoroscopy
* **thall** - 3 = normal; 6 = fixed defect; 7 = reversible defect
* **output** - heart attack or not

In [None]:
def barplot(data, x, y, hue):
    """Grouped seaborn barplot of y against x, split by hue."""
    fig, ax = plt.subplots(1, 1, figsize=(15, 10))
    sns.barplot(data=data, x=x, y=y, hue=hue, ax=ax)
    plt.show()

def pie(data, x, y):
    """Pie chart of how often each value occurs in data."""
    count = Counter(data)
    count = pd.DataFrame({x: list(count.keys()), y: list(count.values())})
    fig = px.pie(count, names=x, values=y)
    fig.update_layout(legend_title=dict(text=x))
    fig.show()

def bars(data, x, y):
    """Bar chart of value frequencies in data, sorted by value."""
    count = Counter(data)
    count = pd.DataFrame({x: list(count.keys()), y: list(count.values())}).sort_values(by=x)
    fig = px.bar(count, x=x, y=y)
    fig.update_layout(legend_title=dict(text=x))
    fig.show()

def scatter(x, y):
    """Scatterplot of two columns of the global df, coloured by the target."""
    data = pd.DataFrame({x: df[x], y: df[y]})
    fig = px.scatter(data, x=x, y=y, color=df['output'])
    fig.show()

# Sex

Firstly, we create a pie chart for the sex of the patients which tells us that the majority of them are male.

In [None]:
pie(df['sex'], 'Sex', 'Number of patients')

# Chest pain type

Secondly, we visualise the chest pain types, seeing that type 0 (typical angina) affects almost half of our patients.

In [None]:
pie(df['cp'], 'Chest pain type', 'Number of patients')

# Fasting blood sugar over 120 mg/dl

Then, we check whether the fasting blood sugar is over 120 mg/dl. The vast majority of patients are not over this threshold.

In [None]:
pie(df['fbs'], 'Fasting blood sugar over 120 mg/dl', 'Number of patients')

# Resting electrocardiographic results

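Now we visualise the resting electrocardiographic results. The raw column has three values: 0 and 1 split the patients almost evenly, while type 2 accounts for less than 2 percent. Because the recoding above groups types 1 and 2 as 'abnormal', the pie chart shows two slices of nearly equal size.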

In [None]:
pie(df['restecg'], 'Resting electrocardiographic results', 'Number of patients')
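
The percentages read off the pie charts can also be checked numerically with `value_counts(normalize=True)`. The sketch below uses made-up category codes rather than the real column; on the actual data, one would pass a column such as `df['restecg']` instead of the hypothetical `codes` series.

```python
import pandas as pd

# Made-up category codes, only to show the calculation.
codes = pd.Series([0, 1, 1, 0, 1, 2, 0, 1])

# Share of each category in percent, ordered by category code.
shares = (codes.value_counts(normalize=True).sort_index() * 100).round(1)
print(shares)
```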

# Exercise induced angina

Next, we see that most of the patients do not have exercise-induced angina.

In [None]:
pie(df['exng'], 'Exercise induced angina', 'Number of patients')

# Major blood vessels coloured

Subsequently, we see that over half of the patients have no major blood vessels coloured by fluoroscopy.

In [None]:
pie(df['caa'], 'Number of major blood vessels coloured by fluoroscopy', 'Number of patients')

# Thal

Furthermore, we check the frequency of the four thal types, seeing that thal 2 accounts for more than half of the patients.

In [None]:
pie(df['thall'], 'Thal', 'Number of patients')

# Sex and Chest pain

Here we use a treemap to analyse sex and chest pain type together, with the colour showing the target. In this dataset, women have a higher proportion of heart attack cases than men. Among the chest pain types, typical angina has the lowest proportion of heart attack cases, while non-anginal pain has the highest.

In [None]:
px.treemap(df, path=['sex', 'cp'], color='output')
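
The colours in the treemap summarise `output` per group. As far as I understand Plotly, with `path` and a numeric `color` each box is coloured by the average of `color` over its rows, so a plain `groupby` gives roughly the table behind the chart. A minimal sketch on made-up rows (the `toy` frame is hypothetical, not the real data):

```python
import pandas as pd

# Made-up rows with the same column names as the recoded df.
toy = pd.DataFrame({
    'sex': ['male', 'male', 'female', 'female', 'male', 'female'],
    'cp': ['typical angina', 'non-anginal pain', 'typical angina',
           'non-anginal pain', 'typical angina', 'non-anginal pain'],
    'output': [0, 1, 0, 1, 1, 1],
})

# Mean of the 0/1 target per group = share of rows with output == 1;
# the count shows how many patients each share is based on.
rates = toy.groupby(['sex', 'cp'])['output'].agg(['mean', 'count'])
print(rates)
```

Looking at the counts next to the means helps to avoid reading too much into groups with only a handful of patients.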

# Sex and Exercise Induced Angina

Next, we analyse sex and exercise-induced angina. Strangely enough, the data suggest that patients without exercise-induced angina have a higher chance of a heart attack than patients with it.

In [None]:
px.treemap(df, path=['sex', 'exng'], color='output')

# Resting ECG and Chest pains

Now we compare the resting electrocardiographic (ECG) results with the chest pain types. Patients with abnormal ECG results show a higher chance of heart attack than those with normal results.

In [None]:
px.treemap(df, path=['restecg', 'cp'], color='output')

# Fasting blood sugar and Exercise Induced Angina

Here we compare fasting blood sugar (FBS) and exercise-induced angina with a sunburst chart. Patients with an FBS under 120 mg/dl show a slightly higher chance of a heart attack.

In [None]:
px.sunburst(df, path=['fbs', 'exng'], color='output')

# Fasting blood sugar and Rest ECG

Lastly for this group of charts, we compare FBS with the resting ECG results.

In [None]:
px.sunburst(df, path=['fbs', 'restecg'], color='output')

# Age

Now, moving on to bar charts, we can see that the patients' ages centre on the 50s.

In [None]:
bars(df['age'], 'Age', 'Number of patients')

# Resting blood pressure

The resting blood pressure values look rather scattered, but the most common readings are 120, 130 and 140.

In [None]:
bars(df['trtbps'], 'Resting blood pressure', 'Number of patients')
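
Counting every distinct value of a numeric column can make the chart look jagged. One option is to group the values into bins before counting, for example with `pd.cut`. The readings and the 10-unit bin width below are assumptions chosen for illustration, not taken from the dataset.

```python
import pandas as pd

# Made-up resting blood pressure readings.
bp = pd.Series([118, 120, 120, 130, 132, 140, 140, 145, 160, 128])

# 10-unit bins, closed on the left: [110, 120), [120, 130), ...
bins = list(range(110, 171, 10))
binned = pd.cut(bp, bins=bins, right=False).value_counts().sort_index()
print(binned)
```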

# Cholesterol

Furthermore, we see that most serum cholesterol values lie between 200 and 300 mg/dl.

In [None]:
bars(df['chol'], 'Serum cholesterol', 'Number of patients')

# Maximum heart rate

Afterwards, we take a look at the maximum heart rate achieved, seeing that most patients reach a value above 140.

In [None]:
bars(df['thalachh'], 'Maximum heart rate', 'Number of patients')

# ST depression induced by exercise relative to rest

Most of the patients have no or close to no ST depression induced by exercise relative to rest.

In [None]:
bars(df['oldpeak'], 'ST depression', 'Number of patients')

# Comparative barplots

Now we turn to grouped barplots, which compare the various categories by placing the bars for each hue value side by side. Each bar shows the mean of the y variable for that group.

In [None]:
cols = [['sex', 'oldpeak', 'exng'], ['restecg', 'oldpeak', 'exng'], ['sex', 'thalachh', 'cp'],
        ['cp', 'chol', 'thall'], ['slp', 'trtbps', 'caa'], ['output', 'oldpeak', 'caa']]

for i in cols:
    barplot(data=df, x=i[0], y=i[1], hue=i[2])

# Scatterplots

Then, we plot 10 pairs of numeric variables as scatterplots, colouring each point by the target to see how well each pair separates the two classes.

In [None]:
cols = [['trtbps', 'age'], ['chol', 'age'], ['thalachh', 'age'], ['oldpeak', 'age'],
       ['chol', 'trtbps'], ['thalachh', 'trtbps'], ['oldpeak', 'trtbps'],
       ['oldpeak', 'chol'], ['thalachh', 'chol'], ['oldpeak', 'thalachh']]

for col in cols:
    scatter(col[0], col[1])

# Correlation

Next, as preparation for the classifiers, we reload the CSV so that every column is numeric again, split off the target and visualise the correlation between the features. The heatmap below is promising: no pair of features is strongly correlated, which suggests that none of the columns is redundant.

In [None]:
df = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
X = df.drop('output', axis=1)
y = df['output']

# fixed seed so that the split is reproducible between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(15, 10))
sns.heatmap(X.corr(), annot=True)
plt.show()
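
The heatmap above covers the features only, since `output` was dropped from `X`. To see which features move with the target, the same `corr()` call can be run with the target included and sorted by strength. The sketch below uses synthetic data in place of `heart.csv`; the columns `a` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for heart.csv: 'a' drives the target, 'b' is pure noise.
toy = pd.DataFrame({'a': rng.normal(size=n), 'b': rng.normal(size=n)})
toy['output'] = (toy['a'] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Correlation of every feature with the target, strongest first.
target_corr = (toy.corr()['output'].drop('output')
               .sort_values(key=abs, ascending=False))
print(target_corr)
```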

# Predicting data with models

Finally, we compare four models on our data: LinearSVC, Random Forest, Logistic Regression and XGBoost.

In [None]:
model_scores = []
cross_val_scores = []
roc_auc_scores = []
f1_scores = []

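for model in [LinearSVC(dual=True), RandomForestClassifier(), LogisticRegression(),
              XGBClassifier(use_label_encoder=False, eval_metric='logloss')]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # ROC AUC needs continuous scores rather than hard 0/1 predictions
    if hasattr(model, 'predict_proba'):
        y_score = model.predict_proba(X_test)[:, 1]
    else:
        y_score = model.decision_function(X_test)
    
    # cross-validate on the training split so that the test set stays unseen
    scores = [model.score(X_test, y_test), cross_val_score(model, X_train, y_train).mean(),
             roc_auc_score(y_test, y_score), f1_score(y_test, y_pred)]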
    
    for score in scores:
        print(score)
        
    print('')
    
    model_scores.append(scores[0])
    cross_val_scores.append(scores[1])
    roc_auc_scores.append(scores[2])
    f1_scores.append(scores[3])
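
Linear models such as LinearSVC and Logistic Regression are sensitive to the scale of the features and may emit convergence warnings on unscaled columns. One common remedy is to put a `StandardScaler` in front of the model and cross-validate the whole pipeline. This is a sketch under assumptions, not the method used in this notebook: the data come from `make_classification` rather than `heart.csv`, and `X_demo`, `y_demo` and `pipe` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data standing in for heart.csv (13 features, like X above).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

# The scaler is fitted inside each fold, so the held-out fold
# never influences the scaling statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring='roc_auc')
print(auc_scores.mean())
```

With `scoring='roc_auc'`, scikit-learn uses the model's continuous scores for the metric rather than its hard predictions.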

# Evaluating models

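The charts below compare the four models on each metric. In my run, the XGBoost classifier predicted this dataset the best. The test set holds only 20 percent of the rows, so small gaps between the models should be read with caution.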

In [None]:
model_names = ['Linear SVC', 'Random Forest', 'Logistic Regression', 'XGBoost']
score_names = ['Model score', 'Cross validation score', 'ROC AUC score', 'F1 score']
scores = [model_scores, cross_val_scores, roc_auc_scores, f1_scores]

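# zip pairs each metric name with its list of scores; looking the name up with
# scores.index(score) would pick the wrong one if two lists were equal
for score_name, score in zip(score_names, scores):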
    data = pd.DataFrame({'Model':model_names, score_name:score})
    fig = px.bar(data, 'Model', score_name, color=score_name)
    fig.show()
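
The score lists can also be collected into a single table with one row per model and one column per metric, which makes it easy to pick the best model for each metric with `idxmax`. The numbers below are placeholders for illustration only and are NOT results from this notebook; `demo_scores` and `summary` are hypothetical names.

```python
import pandas as pd

model_names = ['Linear SVC', 'Random Forest', 'Logistic Regression', 'XGBoost']

# Placeholder numbers only, NOT results from this notebook.
demo_scores = {
    'Model score': [0.1, 0.2, 0.3, 0.4],
    'F1 score': [0.4, 0.3, 0.2, 0.1],
}

# One row per model, one column per metric.
summary = pd.DataFrame(demo_scores, index=model_names)

# For each metric, the name of the model with the highest value.
best = summary.idxmax()
print(summary)
print(best)
```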

<img src="https://miro.medium.com/max/1000/1*qHbAsMNmdWQJkzm2SUA-8w.jpeg" width="600px"/>

## Thank you for reading this notebook.
## If you enjoyed my notebook and found it helpful, please give it an upvote and provide feedback, as it would help me make more of these.