# Heart Attack - Data Analysis and Prediction modelling

Author: Y. Staeva

This project focuses on heart attack data analysis and prediction modelling.

The data used for the project is available at:


R. Rahman (2021): Heart Attack Analysis and Prediction

Dataset link: https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset/version/2?select=heart.csv


The main purpose of the following code is to perform exploratory data analysis on the given dataset and to construct two classification models in order to compare their performance.

### Exploratory Data Analysis

**Install Pandas profiling**

In [None]:
! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip 


**Import Numpy, Pandas and Pandas profiling**

In [None]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

**Read the dataset**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
heart_data = pd.read_csv('/content/drive/MyDrive/heart.csv')

**Check if the dataset is correctly read**

In [None]:
heart_data.head()

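| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |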


In [None]:
heart_data.columns

Index(['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh',
       'exng', 'oldpeak', 'slp', 'caa', 'thall', 'output'],
      dtype='object')

**Generate a report on Exploratory data analysis using Pandas profiling**

In [None]:
report = ProfileReport(heart_data)

In [None]:
report


### Predictive variables

In this section the predictive variables are selected and grouped into coarse classes. A predictive variable must not correlate too strongly with the target (output) variable: its correlation should be neither greater than 0.5 nor less than -0.5. The classes of a numerical variable, such as age, should show either a consistently positive or a consistently negative relationship between the class and the outcome.
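One way to check the correlation criterion is sketched below, assuming it refers to the Pearson correlation between each numeric column and `output`. The helper name `correlation_with_target` and the toy data are illustrative and not part of the original notebook.

```python
import pandas as pd

def correlation_with_target(df, target, threshold=0.5):
    """Correlation of every numeric column with `target`, plus a flag
    marking correlations whose magnitude exceeds `threshold`."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return pd.DataFrame({"corr": corr, "exceeds": corr.abs() > threshold})

# Toy data for illustration only (not the heart dataset)
toy = pd.DataFrame({
    "output": [0, 0, 0, 1, 1, 1],
    "a": [1, 2, 3, 4, 5, 6],   # strongly related to output
    "b": [1, 0, 1, 0, 1, 0],   # weakly related to output
})

result = correlation_with_target(toy, "output")
print(result)
```

On the real data the same call would be `correlation_with_target(heart_data, 'output')`; columns flagged in `exceeds` would fall outside the stated limit.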

**The first predictive variable is the sex of the patient.**

In [None]:
heart_data['sex'].value_counts(dropna=False)

1    207
0     96
Name: sex, dtype: int64

In [None]:
heart_data.groupby(['sex'])['output'].mean()

sex
0    0.750000
1    0.449275
Name: output, dtype: float64
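The count and the mean of `output` per class are inspected in two separate cells throughout this section. A minimal sketch of a helper that returns both in one table is shown below; the name `event_rate_by_class` and the toy data are hypothetical.

```python
import pandas as pd

def event_rate_by_class(df, column, target="output"):
    """Number of rows and mean of `target` for each value of `column`."""
    return (df.groupby(column, observed=False)[target]
              .agg(count="size", event_rate="mean"))

# Toy data for illustration only (not the heart dataset)
toy = pd.DataFrame({
    "sex": [0, 0, 1, 1, 1, 1],
    "output": [1, 1, 1, 0, 0, 1],
})

summary = event_rate_by_class(toy, "sex")
print(summary)
```

Seeing the class size next to the rate helps judge how much weight a rate deserves when a class contains only a few patients.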

**Cholesterol concentration of the patient in mg/dl.**

In [None]:
heart_data['chol'].describe()

count    303.000000
mean     246.264026
std       51.830751
min      126.000000
25%      211.000000
50%      240.000000
75%      274.500000
max      564.000000
Name: chol, dtype: float64

In [None]:
heart_data['Cholesterol_conc'] = pd.cut(heart_data['chol'],[126, 211, 240, 274, 600], right=False)

In [None]:
heart_data['Cholesterol_conc'].value_counts(dropna=False).sort_index()

[126, 211)    74
[211, 240)    74
[240, 274)    76
[274, 600)    79
Name: Cholesterol_conc, dtype: int64

In [None]:
heart_data.groupby(['Cholesterol_conc'])['output'].mean()

Cholesterol_conc
[126, 211)    0.608108
[211, 240)    0.594595
[240, 274)    0.565789
[274, 600)    0.417722
Name: output, dtype: float64
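The bin edges passed to `pd.cut` above appear to follow the quartiles reported by `describe()`. A minimal sketch of deriving such edges from the data instead of typing them by hand is given below, assuming quartile-based classes are the intent. The toy values and the `+ 1` on the upper edge are illustrative choices.

```python
import pandas as pd

# Toy cholesterol-like values for illustration only (not the heart dataset)
values = pd.Series([126, 180, 211, 230, 240, 260, 274, 300, 564], name="chol")

# Lower edges at the minimum and the three quartiles; the last edge sits just
# above the maximum so that right=False still includes the largest value
edges = values.quantile([0, 0.25, 0.5, 0.75]).tolist() + [values.max() + 1]
classes = pd.cut(values, bins=edges, right=False)

print(classes.value_counts(dropna=False).sort_index())
```

If two quartiles coincide, `pd.cut` raises an error about duplicate bin edges, so this approach suits variables with enough distinct values.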

**Resting electrocardiographic results**

Value 0 - normal

Value 1 - ST-T wave abnormality 

Value 2 - probable left ventricular hypertrophy

In [None]:
heart_data['restecg'].value_counts(dropna=False)

1    152
0    147
2      4
Name: restecg, dtype: int64

In [None]:
heart_data['Rest_ecg_class'] = np.where(heart_data['restecg'] == 0, 'Normal', 'At risk')

In [None]:
heart_data['Rest_ecg_class'].value_counts(dropna=False).sort_index()

At risk    156
Normal     147
Name: Rest_ecg_class, dtype: int64

In [None]:
heart_data.groupby(['Rest_ecg_class'])['output'].mean()

Rest_ecg_class
At risk    0.621795
Normal     0.462585
Name: output, dtype: float64

**Chest pain**

Value 1: Typical angina

Value 2: Atypical angina

Value 3: Non - anginal pain

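Value 4: Asymptomatic

Note that the value list above runs from 1 to 4, while the `cp` column in the data below is coded 0 to 3, so the labels may not line up one-to-one with the codes.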

In [None]:
heart_data['cp'].value_counts(dropna=False)

0    143
2     87
1     50
3     23
Name: cp, dtype: int64

In [None]:
heart_data.groupby(['cp'])['output'].mean()

cp
0    0.272727
1    0.820000
2    0.793103
3    0.695652
Name: output, dtype: float64

In [None]:
heart_data['Chest_pain_class'] = np.where(heart_data['cp'] == 0, 'Lower risk', 'Higher risk')

In [None]:
heart_data['Chest_pain_class'].value_counts(dropna=False).sort_index()

Higher risk    160
Lower risk     143
Name: Chest_pain_class, dtype: int64

In [None]:
heart_data.groupby(['Chest_pain_class'])['output'].mean()

Chest_pain_class
Higher risk    0.787500
Lower risk     0.272727
Name: output, dtype: float64

**Maximum heart rate achieved**

In [None]:
heart_data['thalachh'].describe()

count    303.000000
mean     149.646865
std       22.905161
min       71.000000
25%      133.500000
50%      153.000000
75%      166.000000
max      202.000000
Name: thalachh, dtype: float64

In [None]:
heart_data['Max_heart_rate_class'] = pd.cut(heart_data['thalachh'], [60, 133, 153, 166, 210], right=False)

In [None]:
heart_data['Max_heart_rate_class'].value_counts(dropna=False)

[166, 210)    78
[133, 153)    77
[153, 166)    74
[60, 133)     74
Name: Max_heart_rate_class, dtype: int64

In [None]:
heart_data.groupby(['Max_heart_rate_class'])['output'].mean()

Max_heart_rate_class
[60, 133)     0.256757
[133, 153)    0.467532
[153, 166)    0.635135
[166, 210)    0.807692
Name: output, dtype: float64
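The requirement that the classes of a numerical variable show a consistently positive or negative relationship with the outcome can be checked mechanically. The sketch below uses the event rates printed above; the helper name `trend` is hypothetical.

```python
import pandas as pd

# Event rates copied from the grouped output above
rates = pd.Series(
    [0.256757, 0.467532, 0.635135, 0.807692],
    index=["[60, 133)", "[133, 153)", "[153, 166)", "[166, 210)"],
)

def trend(rates):
    """Classify the direction of the event rate across ordered classes."""
    if rates.is_monotonic_increasing:
        return "positive"
    if rates.is_monotonic_decreasing:
        return "negative"
    return "mixed"

print(trend(rates))  # prints "positive"
```

A result of `"mixed"` would suggest merging or re-cutting neighbouring classes before using the variable.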

**Resting blood pressure**

In [None]:
heart_data['trtbps'].describe()

count    303.000000
mean     131.623762
std       17.538143
min       94.000000
25%      120.000000
50%      130.000000
75%      140.000000
max      200.000000
Name: trtbps, dtype: float64

In [None]:
heart_data['Resting_blood_pressure_class'] = pd.cut(heart_data['trtbps'], [90, 120, 140, 220], right=False)

In [None]:
heart_data['Resting_blood_pressure_class'].value_counts(dropna=False).sort_index()

[90, 120)      60
[120, 140)    146
[140, 220)     97
Name: Resting_blood_pressure_class, dtype: int64

In [None]:
heart_data.groupby(['Resting_blood_pressure_class'])['output'].mean()

Resting_blood_pressure_class
[90, 120)     0.616667
[120, 140)    0.575342
[140, 220)    0.453608
Name: output, dtype: float64

**Fasting blood sugar**

Value 1: fbs > 120 mg/dl

Value 0: fbs <= 120 mg/dl

In [None]:
heart_data['fbs'].value_counts(dropna=False)

0    258
1     45
Name: fbs, dtype: int64

In [None]:
heart_data.groupby(['fbs'])['output'].mean()


fbs
0    0.550388
1    0.511111
Name: output, dtype: float64

### Data processing

**The selected predictive variables are:**

Sex, cholesterol concentration, resting electrocardiographic results, chest pain, maximum heart rate achieved, resting blood pressure and fasting blood sugar.

First, the predictive variables and their classes are copied into a new dataframe called heart_model.

In [None]:
heart_data.columns

Index(['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh',
       'exng', 'oldpeak', 'slp', 'caa', 'thall', 'output', 'Cholesterol_conc',
       'Rest_ecg_class', 'Chest_pain_class', 'Max_heart_rate_class',
       'Resting_blood_pressure_class'],
      dtype='object')

In [None]:
heart_model = heart_data[['output', 'sex', 'chol', 'Cholesterol_conc', 'restecg', 'Rest_ecg_class', 'cp', 'Chest_pain_class', 'thalachh', 'Max_heart_rate_class', 'trtbps', 'Resting_blood_pressure_class', 'fbs']].copy()

In [None]:
heart_model.columns

Index(['output', 'sex', 'chol', 'Cholesterol_conc', 'restecg',
       'Rest_ecg_class', 'cp', 'Chest_pain_class', 'thalachh',
       'Max_heart_rate_class', 'trtbps', 'Resting_blood_pressure_class',
       'fbs'],
      dtype='object')

In the next step, the classified variables are represented with the use of dummy variables.

In [None]:
heart_model_dummy = pd.concat([heart_model['output'],
                         pd.get_dummies(heart_model['sex'], prefix='sex', dummy_na=False),
                         pd.get_dummies(heart_model['Cholesterol_conc'], prefix='chol', dummy_na=False),
                         pd.get_dummies(heart_model['Rest_ecg_class'], prefix='restecg', dummy_na=False),
                         pd.get_dummies(heart_model['Chest_pain_class'], prefix='cp', dummy_na=False),
                         pd.get_dummies(heart_model['Max_heart_rate_class'], prefix='thalachh', dummy_na=False),
                         pd.get_dummies(heart_model['Resting_blood_pressure_class'], prefix='trtbps', dummy_na=False),
                         pd.get_dummies(heart_model['fbs'], prefix='fbs', dummy_na=False)], axis = 1, ignore_index=False, join = 'outer')

In [None]:
heart_model_dummy.head()

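| | output | sex_0 | sex_1 | chol_[126, 211) | chol_[211, 240) | chol_[240, 274) | chol_[274, 600) | restecg_At risk | restecg_Normal | cp_Higher risk | cp_Lower risk | thalachh_[60, 133) | thalachh_[133, 153) | thalachh_[153, 166) | thalachh_[166, 210) | trtbps_[90, 120) | trtbps_[120, 140) | trtbps_[140, 220) | fbs_0 | fbs_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 3 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 4 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |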


In [None]:
heart_model_dummy.shape

(303, 20)
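The encoding above keeps one dummy column per class, so within each variable the columns always sum to 1. An optional variation, sketched here on toy data, encodes several columns in a single `pd.get_dummies` call and drops the first class of each variable with `drop_first=True`. This is an alternative under stated assumptions and not the encoding used in this notebook.

```python
import pandas as pd

# Toy data for illustration only (not the heart dataset)
toy = pd.DataFrame({
    "output": [1, 0, 1],
    "sex": [0, 1, 1],
    "Chest_pain_class": ["Higher risk", "Lower risk", "Higher risk"],
})

# Encode both columns at once; the dict maps each column to its prefix
encoded = pd.get_dummies(
    toy,
    columns=["sex", "Chest_pain_class"],
    prefix={"sex": "sex", "Chest_pain_class": "cp"},
    drop_first=True,   # drop one reference class per variable
    dtype=int,
)

print(encoded)
```

Dropping a reference class removes the exact linear dependence between the dummy columns of a variable, which can make the coefficients of a linear model easier to interpret.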

The dataframe holding the predictive variables is converted into a NumPy array so that it can be split into a feature matrix and a target vector.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
heart_model_vector = heart_model_dummy.to_numpy()

In [None]:
heart_model_vector.shape

(303, 20)

In [None]:
features = heart_model_vector[:, 1 : 20]

In [None]:
features.shape

(303, 19)

In [None]:
features

array([[0, 1, 0, ..., 1, 0, 1],
       [0, 1, 0, ..., 0, 1, 0],
       [1, 0, 1, ..., 0, 1, 0],
       ...,
       [0, 1, 1, ..., 1, 0, 1],
       [0, 1, 1, ..., 0, 1, 0],
       [1, 0, 0, ..., 0, 1, 0]])

In [None]:
target = heart_model_vector[:, 0]

In [None]:
target.shape

(303,)

In [None]:
target

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [None]:
x_train, x_test, y_train, y_test = train_test_split(features, target, train_size = 0.7, random_state=42)

In [None]:
x_train.shape

(212, 19)

In [None]:
y_train.shape

(212,)
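The split above is random but not stratified, so the share of each class can differ between the training and test sets. A minimal sketch of a stratified split using the `stratify` argument of `train_test_split` follows; the toy arrays stand in for `features` and `target` and are not the heart data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays for illustration only: 60 samples of class 1, then 40 of class 0
toy_features = np.arange(200).reshape(100, 2)
toy_target = np.array([1] * 60 + [0] * 40)

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_features, toy_target,
    train_size=0.7, random_state=42,
    stratify=toy_target,   # keep the class proportions in both parts
)

print(y_tr.mean(), y_te.mean())
```

With the notebook's variables, the equivalent call would add `stratify=target` to the existing `train_test_split` line; doing so would change which rows land in each set and therefore the confusion matrices reported later.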

### Model development

In this step I will create two classification models and evaluate their performance. The first model is a Logistic Regression and the second is a Support Vector Machine. Each model is evaluated with a confusion matrix, from which the Sensitivity and Specificity are calculated.
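For reference, the two metrics are commonly defined from the confusion-matrix counts as follows, where $TP$, $FN$, $TN$ and $FP$ denote the true positives, false negatives, true negatives and false positives for whichever class is treated as positive:

$$
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}
$$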

In [None]:
import matplotlib.pyplot as plt

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

**Create and plot a Confusion Matrix**

In [None]:
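def plot_confusion_matrix(cm, classes, normalize = False, title = 'Confusion Matrix', cmap = plt.cm.Blues):
  # The first argument is named cm so that it does not shadow sklearn's confusion_matrix
  # Optionally convert counts to row-wise proportions (share of each true class)
  if normalize:
    cm = cm.astype(float) / cm.sum(axis = 1, keepdims = True)
  # Draw the matrix as a heat map
  plt.imshow(cm, interpolation = 'nearest', cmap = cmap)
  plt.title(title)
  plt.colorbar()
  # Label both axes with the class names
  tick_marks = np.arange(len(classes))
  plt.xticks(tick_marks, classes, rotation = 45)
  plt.yticks(tick_marks, classes)
  plt.ylabel('True label')
  plt.xlabel('Predicted label')

  print(cm)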

**Logistic Regression**

In [None]:
logistic_regression = LogisticRegression(random_state=0, solver="sag")

In [None]:
lg_model = logistic_regression.fit(x_train,y_train)

In [None]:
lg_prediction = lg_model.predict(x_test)

In [None]:
# predict() already returns class labels (0 or 1), so no argmax step is needed
lg_rounded_prediction = lg_prediction

In [None]:
lg_confusion_matrix = confusion_matrix(y_true = y_test, y_pred = lg_prediction)

In [None]:
lg_classes = ['Higher Risk patient', 'Lower Risk patient']

In [None]:
plot_confusion_matrix(lg_confusion_matrix, classes = lg_classes)

[[32  9]
 [12 38]]


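Logistic Regression metrics, with class 0 (the first row and column of the matrix, labelled 'Higher Risk patient') treated as the positive class:

Sensitivity: 32 / (32 + 9) = 0.7805

Specificity: 38 / (12 + 38) = 0.76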
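The two values can be reproduced from the printed matrix. The sketch below assumes the layout returned by `sklearn.metrics.confusion_matrix` (rows are true classes, columns are predicted classes) and makes the choice of positive class explicit; the helper name `sensitivity_specificity` is hypothetical.

```python
import numpy as np

def sensitivity_specificity(cm, positive_index):
    """Sensitivity and specificity of a 2x2 confusion matrix whose rows are
    true classes and whose columns are predicted classes."""
    cm = np.asarray(cm)
    negative_index = 1 - positive_index
    sensitivity = cm[positive_index, positive_index] / cm[positive_index].sum()
    specificity = cm[negative_index, negative_index] / cm[negative_index].sum()
    return sensitivity, specificity

lg_cm = [[32, 9], [12, 38]]  # logistic regression matrix printed above

# Class 0 as positive reproduces the values reported in the text
sens0, spec0 = sensitivity_specificity(lg_cm, positive_index=0)
# Class 1 as positive swaps the two values
sens1, spec1 = sensitivity_specificity(lg_cm, positive_index=1)

print(round(sens0, 4), round(spec0, 4))
print(round(sens1, 4), round(spec1, 4))
```

Which class counts as positive is a modelling decision, so it is worth stating alongside the metrics.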

**Support Vector Machine**

In [None]:
from sklearn.svm import LinearSVC

In [None]:
support_vector_classifier = LinearSVC(C = 1.0)

In [None]:
svc_model = support_vector_classifier.fit(x_train,y_train)

In [None]:
svc_predictions = svc_model.predict(x_test)

In [None]:
# predict() already returns class labels (0 or 1), so no argmax step is needed
svc_rounded_predictions = svc_predictions

In [None]:
svc_confusion_matrix = confusion_matrix(y_true = y_test, y_pred = svc_predictions)

In [None]:
svc_classes = ['Higher Risk patient', 'Lower Risk patient']

In [None]:
plot_confusion_matrix(svc_confusion_matrix, classes = svc_classes)

[[31 10]
 [11 39]]


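Support Vector Machine metrics, with class 0 (the first row and column of the matrix, labelled 'Higher Risk patient') treated as the positive class:

Sensitivity: 31 / (31 + 10) = 0.7561

Specificity: 39 / (11 + 39) = 0.78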



**Conclusion:**

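Both models perform comparably on the 91 test samples: their Sensitivity and Specificity values all lie between roughly 0.76 and 0.78, and the two confusion matrices differ by a single sample in each row. Choosing the "better" solution for the problem therefore depends on two questions:

Q1: Is it more important to detect patients with Higher Heart Attack Risk?

If so, the Logistic Regression model is the more appropriate choice, as it has the higher Sensitivity (0.7805 against 0.7561).

Q2: Is it more important to detect patients with Lower Heart Attack Risk?

If so, the Support Vector Machine is more suitable, as it has the higher Specificity (0.78 against 0.76).

These recommendations rest on the class labels passed to the confusion-matrix plots, which treat output 0 as the Higher Risk class. The groupings in the Predictive variables section instead associate a higher mean output with 'Higher risk'; if output 1 is taken as the Higher Risk class, Sensitivity and Specificity swap and the two recommendations swap with them.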

