# What is a heart attack? How frequent is it? Who does it affect?

![](https://th.bing.com/th/id/Ra5fdfa4209dd1fd68547587508f1a175?rik=aCiCknlW4VtJDw&riu=http%3a%2f%2fi.huffpost.com%2fgen%2f1483544%2fthumbs%2fo-HEART-ATTACK-facebook.jpg&ehk=eqHuVoVBFNLMHmWVlGrHHMxap7aV7Wf%2f5mJ9ORvD7%2b4%3d&risl=&pid=ImgRaw)

A **heart attack** is an acute episode of **coronary** heart **disease** marked by the death or damage of heart muscle due to insufficient blood supply to the heart, usually as a result of a coronary artery becoming blocked by a blood clot formed in response to a ruptured or torn fatty arterial deposit.

As of 2018, 30.3 million U.S. adults had been diagnosed with heart disease. Every year, about 647,000 Americans die from heart disease, making it the leading cause of death in the United States. Heart disease causes 1 out of every 4 deaths.

According to the Centers for Disease Control and Prevention (CDC), approximately **every 40 seconds** an American will have a heart attack. Every year, **805,000 Americans have a heart attack**, 605,000 of them for the first time.
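
As a quick sanity check, the two CDC figures quoted above are consistent with each other. The sketch below only redoes the arithmetic, assuming a 365-day year; the variable names are illustrative.

```python
# Check that "805,000 heart attacks per year" matches "one every 40 seconds".
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a non-leap year

attacks_per_year = 805_000
first_attacks_per_year = 605_000

# Average time between two heart attacks, in seconds
seconds_between_attacks = SECONDS_PER_YEAR / attacks_per_year

# Fraction of yearly heart attacks that are a first heart attack
first_time_share = first_attacks_per_year / attacks_per_year

print(f"One heart attack every {seconds_between_attacks:.1f} seconds")
print(f"Share of first-time heart attacks: {first_time_share:.1%}")
```

So roughly three out of four heart attacks in these figures are first-time events.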

Seeing these alarming figures, I wanted to examine this dataset to understand it a little better and share what I found with you. So let's find out more about this disease that affects so many people.

# **First approach with data and small elaborations**

**Import libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import tensorflow as tf
import seaborn as sns
import math
import keras

**Read data files**

In [None]:
filepath_heart = '../input/heart-attack-analysis-prediction-dataset/heart.csv'

heart_data = pd.read_csv(filepath_heart)

**Make a copy of the dataset so as not to work on the original**

In [None]:
df = heart_data.copy()

**Have a look at the shape of the dataset**

In [None]:
df.shape

**Take a look at the feature types and check whether there are missing values**

In [None]:
df.info()
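
`df.info()` reports the non-null count per column; counting the missing values directly can be easier to read. A minimal sketch on a small made-up frame (the `toy` data is hypothetical and not taken from `heart.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "age": [63, 37, np.nan, 56],
    "chol": [233, 250, 204, np.nan],
    "output": [1, 1, 0, 0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Data type of each column (numeric columns containing NaN are stored as float)
print(toy.dtypes)
```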

**And take a quick look at the first 5 rows of the dataset**

In [None]:
df.head()

A short description of the features in the dataframe:
* **age** : age of the patient

* **sex** : sex of the patient (0 = female; 1 = male, as decoded in the code below)

* **cp** : chest pain type (the code below works with the codes 0-3)
 
    * Value 0: typical angina
    * Value 1: atypical angina
    * Value 2: non-anginal pain
    * Value 3: asymptomatic

* **trtbps** : resting blood pressure (in mmHg)

* **chol** : cholesterol in mg/dl fetched via BMI sensor

* **fbs** : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

* **restecg** : resting electrocardiographic results
 
    * Value 0: normal
    * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
    
    
* **thalachh** : maximum heart rate achieved

* **exng**: exercise induced angina (1 = yes; 0 = no)

* **oldpeak**: ST depression induced by exercise relative to rest

* **slp**: slope of the peak exercise ST segment

* **caa**: number of major vessels (0-3)

* **thall**: Thallium Stress Test

* **output** : 
    * 0= less chance of heart attack 
    * 1= more chance of heart attack

**We now create new features derived from the existing ones, which will help us represent the data better in graphs**

In [None]:
def Age(age):
    if age >= 70 : return '70 years old'
    if age >= 60 : return '60 years old'
    if age >= 50 : return '50 years old'
    if age >= 40 : return '40 years old'
    if age >= 30 : return '30 years old'
    else: return '20 years old'
df['age_group'] = df.apply(lambda x: Age(x["age"]), axis = 1)

In [None]:
def gender(gender):
    if gender == 0  : return 'female'
    else: return 'male'
df['sex'] = df.apply(lambda x: gender(x["sex"]), axis = 1)

In [None]:
def cp_prob(value):
    # Assumes the cp codes 0-3 follow the order of the feature list above
    if value == 0  : return 'typical angina'
    if value == 1  : return 'atypical angina'
    if value == 2  : return 'non-anginal pain'
    else: return 'asymptomatic'
df['cp_value'] = df.apply(lambda x: cp_prob(x['cp']), axis = 1)

In [None]:
def ecg_res(value):
    if value == 0  : return 'normal'
    if value == 1  : return 'wave abnormality'
    else: return 'left ventricular hypertrophy'
df['restecg_value'] = df.apply(lambda x: ecg_res(x['restecg']), axis = 1)

In [None]:
def Old(peak):
    if peak >= 6 : return 6.0
    if peak >= 5 : return 5.0
    if peak >= 4 : return 4.0
    if peak >= 3 : return 3.0
    if peak >= 2.5 : return 2.5
    if peak >= 2 : return 2
    if peak >= 1.5 : return 1.5
    if peak >= 1 : return 1
    if peak >= 0.5 : return 0.5
    else: return peak
df['oldpeak'] = df.apply(lambda x: Old(x["oldpeak"]), axis = 1)

In [None]:
def Target(value):
    if value == 0  : return 'less chance'
    else: return 'more chance'
df['output_val'] = df.apply(lambda x: Target(x['output']), axis = 1)
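
The helper functions above are applied row by row with `df.apply`. The same kind of feature can also be built with vectorized pandas calls. Below is a minimal sketch on made-up rows, assuming the same age bands and sex codes as above (`toy`, `age_bins` and `age_labels` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the column names match the dataset
toy = pd.DataFrame({"age": [29, 30, 45, 59, 60, 77], "sex": [0, 1, 1, 0, 1, 0]})

# Left-closed bins: [0, 30), [30, 40), ..., [70, inf)
age_bins = [0, 30, 40, 50, 60, 70, np.inf]
age_labels = ["20 years old", "30 years old", "40 years old",
              "50 years old", "60 years old", "70 years old"]
toy["age_group"] = pd.cut(
    toy["age"], bins=age_bins, labels=age_labels, right=False
).astype(str)

# Dictionary lookup instead of an if/else function
toy["sex_label"] = toy["sex"].map({0: "female", 1: "male"})

print(toy)
```

`pd.cut` keeps the bin edges in one place, which makes it easier to see (and change) where each age band starts.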

----------------------------------------------------------

# EDA part: plotting graphs

**Age group**

**Here we analyze which age groups appear most often among the patients, split by sex; the second plot shows the distribution of the patients' ages**

In [None]:
plt.figure(figsize=(13,5))
age_categ_count = df['age_group'].value_counts()
ax = sns.countplot(x="age_group", 
                   data = df,
                   hue = 'sex',
                   order = age_categ_count.index,
                   linewidth=2)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width()  / 2,rect.get_height()+ 0.75,rect.get_height(),horizontalalignment='center', fontsize = 13)
ax.set_title('Which age groups are most affected by heart attacks?',fontsize = 20, fontweight='bold')
ax.set_xlabel('Age group', fontsize = 15)
ax.set_ylabel('N° of people', fontsize = 15)

In [None]:
plt.figure(figsize=(20,10))
sns.countplot(x=df["age"])
plt.title("Age distribution of the patients",fontsize=20, fontweight='bold')
plt.xlabel("AGE",fontsize=20)
plt.ylabel("COUNT",fontsize=20)
plt.show()

**Genders**

Are males or females more affected? Will there be a difference between the genders or will they be even?

In [None]:
# Name the columns explicitly so the code does not depend on the pandas version
s = df["sex"].value_counts().rename_axis('sex').reset_index(name='count')
fig = px.pie(s, values='count', names='sex',
             title='Are there more males or females?',
             labels={'count':'n° of cases'})
fig.show()

In [None]:
plt.figure(figsize=(13,5))
output_count = df['output_val'].value_counts()
ax = sns.countplot(x="output_val", 
                   data = df,
                   order = output_count.index,
                   hue = "sex",
                   linewidth=2)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width()  / 2,rect.get_height()+ 0.75,rect.get_height(),horizontalalignment='center', fontsize = 13)
ax.set_title('Are more males or females in danger?',fontsize = 20, fontweight='bold' )
ax.set_xlabel('Chance of a heart attack', fontsize = 15)
ax.set_ylabel('N° of people', fontsize = 15)

**Chest pain**

A heart attack can mainly cause 3 types of chest pain. Which is the most frequent? Does it always cause pain, or can it occur without any?

In [None]:
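# Name the columns explicitly so the code does not depend on the pandas version
cp = df["cp_value"].value_counts().rename_axis('cp_value').reset_index(name='count')
fig = px.bar(cp, x='count', y='cp_value',
             title='Which types of chest pain are most frequent?',
             labels={'count':'n° of cases', 'cp_value':'type of chest pain'})
fig.show()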

**Blood pressure**

How does blood pressure affect the chance of heart attack?

In [None]:
plt.figure(figsize=(20,10))
sns.histplot(data=df, x="trtbps", hue = 'output')
plt.title("Distribution of blood pressure among patients",fontsize=30, fontweight='bold')
plt.xlabel("Pressure (in mmHg)",fontsize=20)
plt.ylabel("Count",fontsize=20)
plt.show()

As we can see, most of the patients with a higher chance of heart attack have a resting blood pressure between 120 and 140 mmHg.
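
A histogram shows counts, so a range can stand out simply because most patients fall into it. Comparing the share of high-risk patients per pressure bin accounts for that. A minimal sketch on made-up values, assuming 20 mmHg bins (the `toy` data and the bin edges are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: resting blood pressure and outcome (1 = more chance)
toy = pd.DataFrame({
    "trtbps": [110, 118, 125, 130, 135, 138, 145, 150, 160, 170],
    "output": [1,   1,   1,   0,   1,   0,   0,   1,   0,   0],
})

# Group the pressures into 20 mmHg bins: [100, 120), [120, 140), ...
bins = [100, 120, 140, 160, 180]
toy["bp_bin"] = pd.cut(toy["trtbps"], bins=bins, right=False)

# For each bin: number of patients and share of them with output == 1
summary = toy.groupby("bp_bin", observed=True)["output"].agg(["count", "mean"])
print(summary)
```

In this toy example the 120-140 bin holds the most patients, but the share with `output == 1` is higher in the 100-120 bin, which is exactly the kind of difference raw counts can hide.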

**Cholesterol**

And how does cholesterol affect the chance of heart attack?

In [None]:
plt.figure(figsize=(25,10))
sns.histplot(data=df, x="chol", hue = 'output')
plt.title("Distribution of cholesterol among patients",fontsize=30, fontweight='bold')
plt.xlabel("Cholesterol in mg/dl",fontsize=20)
plt.ylabel("Count",fontsize=20)
plt.show()

**ECG results**

In [None]:
# Name the columns explicitly so the code does not depend on the pandas version
rst = df["restecg_value"].value_counts().rename_axis('restecg_value').reset_index(name='count')
fig = px.bar(rst, x='restecg_value', y='count',
             title='Which ECG results are most frequent?',
             labels={'count':'n° of cases', 'restecg_value':'ECG result'})
fig.show()

**Heart rate**

Since heart rate is a consequence of the heart pumping blood, how does it affect the possibility of a heart attack? And above all, does it influence it at all?

In [None]:
plt.figure(figsize=(20,10))
sns.lineplot(x="age",y="thalachh",hue="output",data=df)
plt.title("Maximum heart rate by age, split by heart attack chance",
          fontsize=20, fontweight='bold')
plt.xlabel("Age",fontsize=15)
plt.ylabel("Maximum heart rate achieved (thalachh)",fontsize=15)
plt.show()

**Oldpeaks**

In [None]:
plt.figure(figsize=(13,5))
oldpeak_count = df['oldpeak'].value_counts()
ax = sns.countplot(x="oldpeak", 
                   data = df,
                   order = oldpeak_count.index,
                   linewidth=2)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width()  / 2,rect.get_height()+ 0.75,rect.get_height(),horizontalalignment='center', fontsize = 13)
ax.set_title('How are the binned oldpeak values distributed among the patients?',fontsize = 20, fontweight='bold' )
ax.set_xlabel('oldpeak (ST depression, binned)', fontsize = 15)
ax.set_ylabel('N° of people', fontsize = 15)

Note that **oldpeak** measures the ST depression induced by exercise relative to rest, not the number of previous heart attacks. This plot therefore cannot tell us how many patients are experiencing a heart attack for the first time; it only shows how the binned ST depression values are distributed among the patients.

**Are there more people in danger or with less chance of heart attack?**

In [None]:
# Name the columns explicitly so the code does not depend on the pandas version
s = df["output_val"].value_counts().rename_axis('output').reset_index(name='count')
fig = px.pie(s, values='count', names='output',
             title='Are there more people in danger or with less chance of heart attack?',
             labels={'count':'n° of cases'})
fig.show()

**But what is the correlation between the features?**

In [None]:
plt.figure(figsize=(20,6))
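# Correlation is only defined for numeric columns, so drop the text labels created above
sns.heatmap(df.select_dtypes('number').corr(),annot=True,cmap="PuBuGn")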
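
A full heatmap can be hard to scan when the main question is which features move together with the target. One way to narrow it down is to rank the features by the absolute value of their correlation with `output`. A minimal sketch on made-up rows (the `toy` values are hypothetical, only the column names come from the dataset):

```python
import pandas as pd

# Hypothetical toy rows with a few of the numeric columns
toy = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 67, 62, 44],
    "thalachh": [150, 187, 172, 178, 163, 129, 140, 180],
    "oldpeak": [2.3, 3.5, 1.4, 0.8, 0.6, 2.6, 3.6, 0.0],
    "output": [1, 1, 1, 1, 1, 0, 0, 1],
})

# Pearson correlation of every feature with the target,
# sorted from the strongest to the weakest (ignoring the sign)
corr_with_output = (
    toy.corr()["output"]
       .drop("output")
       .sort_values(key=abs, ascending=False)
)
print(corr_with_output)
```

Keep in mind that Pearson correlation only captures linear relationships, and for coded categorical columns such as `cp` the number itself is hard to interpret.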

-------------------------------------------

# **Data preprocessing and modelling**

**Import libraries**

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
from sklearn import metrics

**Create another copy of the dataset for the predictions, without the new features that we have created**

In [None]:
data = heart_data.copy()
data.head()

**Separate the 'output' column as the target, and drop it (together with 'slp') from the features**

In [None]:
target = data['output']
data = data.drop(columns = ['output','slp'])

**One-Hot Encoding for the categorical variables**

In [None]:
data_dummies = data[['sex','cp','fbs','restecg','exng','caa','thall']]
data_dummies = pd.get_dummies(data_dummies,columns=['sex','cp','fbs','restecg','exng','caa','thall'])

**Merging the dummy variables and our original data**

In [None]:
data.drop(columns=['sex','cp','fbs','restecg','exng','caa','thall'],inplace=True)
data = data.merge(data_dummies,left_index=True, right_index=True,how='left')
data.head()
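
`pd.get_dummies` leaves the columns that are not listed in `columns` untouched, so the encoding and the merge can also be done in a single call. A minimal sketch on made-up rows (`toy` is hypothetical; `dtype=int` is only there to make the output easy to read):

```python
import pandas as pd

# Hypothetical toy rows with one numeric and two categorical columns
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "sex": [1, 1, 0],
    "cp": [3, 2, 1],
})

# One indicator column per category; "age" is passed through unchanged
encoded = pd.get_dummies(toy, columns=["sex", "cp"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Because the notebook encodes the full dataset before splitting it, the training and testing sets end up with the same dummy columns.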

**Splitting the data into training and testing sets**

In [None]:
train_x,test_x,train_y,test_y = train_test_split(data,target,test_size=0.3,random_state=42)

**Standardizing the training and testing data with the standard scaler**

In [None]:
scaler=StandardScaler()
train_x = scaler.fit_transform(train_x)
test_x = scaler.transform(test_x)
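
The scaler is fitted on the training data only and then reused on the testing data, so no information from the test set leaks into the preprocessing. The sketch below reproduces that idea by hand with NumPy on made-up numbers (the arrays are hypothetical):

```python
import numpy as np

# Hypothetical toy feature matrices (rows = patients, columns = features)
train = np.array([[50.0, 200.0], [60.0, 240.0], [70.0, 280.0]])
test = np.array([[60.0, 220.0]])

# Statistics are computed on the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0))  # close to 0 for every feature
print(train_scaled.std(axis=0))   # close to 1 for every feature
print(test_scaled)
```

The scaled test values are generally not centered on 0, and that is expected: they are expressed relative to the training distribution.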

**Try 2 models to see which makes better predictions**

* Logistic Regression model

In [None]:
lr_model=LogisticRegression()
lr_model.fit(train_x,train_y)
score = lr_model.score(test_x, test_y)
prediction = lr_model.predict(test_x)
cm = confusion_matrix(test_y,prediction)
print('Testing Score \n',score)
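# The from_estimator display classes replace the plotting helpers removed in scikit-learn 1.2
metrics.ConfusionMatrixDisplay.from_estimator(lr_model,test_x,test_y,cmap='rocket_r')
metrics.RocCurveDisplay.from_estimator(lr_model, test_x, test_y)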
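
The confusion matrix `cm` computed above is never printed, but accuracy, precision and recall can all be read from it. A minimal sketch with made-up counts (the numbers are hypothetical, not the results of this notebook):

```python
import numpy as np

# Hypothetical 2x2 confusion matrix in scikit-learn's layout:
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = np.array([[35, 6],
               [5, 45]])

# Flattening row by row gives the four cells in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are correct
recall = tp / (tp + fn)           # share of true positives that were found

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```

For a medical screening task, recall is often worth watching alongside accuracy, since a false negative means a high-risk patient was missed.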

* Decision Tree Classifier model

In [None]:
dt_model =  DecisionTreeClassifier(random_state = 1)
dt_model.fit(train_x,train_y)
score = dt_model.score(test_x, test_y)
prediction = dt_model.predict(test_x)
cm = confusion_matrix(test_y,prediction)
print('Testing Score \n',score)
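# The from_estimator display classes replace the plotting helpers removed in scikit-learn 1.2
metrics.ConfusionMatrixDisplay.from_estimator(dt_model,test_x,test_y,cmap='rocket_r')
metrics.RocCurveDisplay.from_estimator(dt_model, test_x, test_y)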

As we can see, the Logistic Regression model performs better than the Decision Tree Classifier on this test split.
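
With a small dataset, a comparison based on a single train/test split can change with the random seed. Cross-validation gives a more stable picture. The sketch below uses synthetic data from `make_classification` as a stand-in (it is NOT the heart dataset), and the fold count and model settings are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data (not the real dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "decision tree": DecisionTreeClassifier(random_state=1),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy; the scaler is refit inside each fold
    cv_scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean={cv_scores[name].mean():.3f}, std={cv_scores[name].std():.3f}")
```

Putting the scaler inside the pipeline keeps the scaling statistics limited to the training folds, in the same spirit as fitting the scaler on `train_x` only.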

**Thank you so much for looking at this notebook. I hope you enjoyed it, and if so, I invite you to leave an upvote. If you have found any errors or have any suggestions for improving the notebook, please write them in the comments. Thank you very much again, and good Kaggling!**