# Predicting heart diseases using Logistic Regression

This is still a work in progress!

Future additions:
- include EDA findings in model creation
- feature generation / selection

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from pandas.plotting import scatter_matrix

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn import preprocessing
from sklearn.compose import ColumnTransformer

print('Import complete')

# 1. Loading data

In [None]:
data = pd.read_csv("../input/heart-disease-uci/heart.csv")
X = data.drop("target", axis=1)
y = data.target

print(f"data shape: {X.shape}\ntarget shape: {y.shape}")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.sample(10)

### Attributes with explanations
<table  style="float:left;">
    <tr>
        <td>age</td>
        <td>age</td>
    </tr>
    <tr>
        <td>sex</td>
        <td>1 = male, 0 = female</td>
    </tr>
    <tr>
        <td>cp</td>
        <td>chest pain type (4 values, 0 - 3)</td>
    </tr>
    <tr>
        <td>trestbps</td>
        <td>resting blood pressure</td>
    </tr>
    <tr>
        <td>chol</td>
        <td>serum cholesterol in mg/dl</td>
    </tr>
    <tr>
        <td>fbs</td>
        <td>fasting blood sugar > 120 mg/dl</td>
    </tr>
     <tr>
        <td>restecg</td>
        <td>resting electrocardiographic results (values 0,1,2)</td>
    </tr>
    <tr>
        <td>thalach</td>
        <td>maximum heart rate achieved</td>
    </tr>
    <tr>
        <td>exang</td>
        <td>exercise induced angina</td>
    </tr>
        <tr>
        <td>oldpeak</td>
        <td>ST depression induced by exercise relative to rest (ST-segment is abnormally low below the baseline)</td>
    </tr>
     <tr>
        <td>slope</td>
        <td>the slope of the peak exercise ST segment</td>
    </tr>
    <tr>
        <td>ca</td>
        <td>number of major vessels (0-3) colored by fluoroscopy</td>
    </tr>
    <tr>
        <td>thal</td>
        <td>3 = normal; 6 = fixed defect; 7 = reversible defect</td>
    </tr>
</table>

In [None]:
data.isnull().sum()

# 2. Exploratory data analysis (EDA)

In [None]:
data.describe()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 5))
sns.kdeplot(ax=axes[0], x=data.age)
sns.countplot(ax=axes[1], x=data.sex)

plt.show()

We can see that age is roughly bell-shaped (approximately normally distributed) and ranges from 29 to 77 years. There are more men than women in this dataset.



## 2.1 CAD:

In [None]:
people_with_cad = data.loc[data.target == 1]
people_without_cad = data.loc[data.target == 0]

plt.subplots(figsize=(6, 5))
sns.countplot(x=data.target)
plt.title("Target (heart disease)")
plt.xlabel("Has a heart disease (0 = no, 1 = yes)")
plt.show()

The dataset has a fairly even distribution of people with and without heart disease.
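The balance can also be checked in numbers rather than read off the plot. The following is a minimal sketch on a small made-up series; `target` here is a hypothetical stand-in for `data.target`, and the values are invented for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for data.target (values are invented for illustration).
target = pd.Series([1, 1, 0, 1, 0, 0, 1, 1, 0, 1], name="target")

# Absolute counts and relative shares per class, sorted by class label.
counts = target.value_counts().sort_index()
shares = target.value_counts(normalize=True).sort_index()

balance = pd.DataFrame({"count": counts, "share": shares.round(3)})
print(balance)
```

In the notebook itself, replacing `target` with `data.target` should give the real class shares.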

## 2.2 Correlation between attributes:

In [None]:
corr = data.corr()
plt.subplots(figsize=(15, 10))
sns.heatmap(corr, annot=True)
plt.show()

The correlation heat map shows moderate positive correlations with the target:
- between chest pain (cp) and target -> +0.43 
- between maximum heartbeat (thalach) and target -> +0.42
- between slope of the ST-segment (slope) and target -> +0.35

The correlation heat map shows moderate negative correlations with the target:
- between exang and target -> -0.44
- between oldpeak and target -> -0.43
- between ca and target -> -0.39
- between thal and target -> -0.34
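Instead of reading these values off the heat map, the correlations with the target can be pulled out and sorted directly. This is only a sketch on synthetic data: the `toy` DataFrame and its column names (`pos_feature`, `neg_feature`, `noise`) are invented and are not part of the heart dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature that rises with the target, one that falls, one unrelated.
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "pos_feature": target + rng.normal(0, 0.8, size=n),
    "neg_feature": -target + rng.normal(0, 0.8, size=n),
    "noise": rng.normal(0, 1, size=n),
    "target": target,
})

# Take the "target" column of the correlation matrix, drop the self-correlation,
# and sort from the most positive to the most negative value.
target_corr = toy.corr()["target"].drop("target").sort_values(ascending=False)
print(target_corr.round(2))
```

Applied to the notebook's `data`, the same expression (`data.corr()["target"]...`) would list all attributes ordered by their correlation with the target.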

## 2.3 Chest pain:


In [None]:
data.cp.unique()

In [None]:
cp = []
pc = []
for i in range(0, 4):
    cp_ = data.loc[data.cp == i]
    s = cp_.target.sum()
    t = cp_.target.size
    p = np.round((s/t)*100)
    print(f"{s} people of {t} people with chest pain type ({i}) had a heart disease. --> {p}%")
    cp.append(i)
    pc.append(p)

In [None]:
plt.subplots(figsize=(8, 5))
plt.bar(cp, pc)
plt.xlabel("Chest Pain")
plt.ylabel("Percentage with heart disease")
plt.xticks(cp)
plt.show()

We can see that, in this dataset, a high share of the people with chest pain types 1, 2 and 3 have heart disease. This matches the correlation heat map, where chest pain had one of the highest correlations with the target.
Note that a large number of people in this dataset had chest pain type 0.
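The per-type counts and percentages computed in the loop above can also be obtained with a single `groupby`. A minimal sketch, assuming a small invented `toy` DataFrame in place of `data`:

```python
import pandas as pd

# Invented example rows; in the notebook, use `data` instead of `toy`.
toy = pd.DataFrame({
    "cp":     [0, 0, 0, 0, 1, 1, 2, 2, 2, 3],
    "target": [0, 0, 1, 0, 1, 1, 1, 0, 1, 1],
})

# For each chest pain type: number of positive cases, group size, and positive share.
summary = toy.groupby("cp")["target"].agg(cases="sum", total="count", share="mean")
summary["percent"] = (summary["share"] * 100).round()
print(summary)
```

Because `target` is coded as 0/1, its mean per group is exactly the share of people with heart disease in that group, which is why no explicit loop is needed.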

## 2.4 Maximum heart rate:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.stripplot(ax=axes[0], x="target", y="thalach", data=data)
# Label each subplot explicitly; plt.ylabel would only affect the current (last) axes
axes[0].set_ylabel("maximum heart rate achieved (thalach)")

sns.boxplot(ax=axes[1], x="target", y="thalach", data=data)
axes[1].set_ylabel("maximum heart rate achieved (thalach)")

plt.show()

Both the strip plot and the box plot show that people with heart disease in this dataset tend to have reached higher maximum heart rates than people without.
We can also see a few outliers, which we may have to deal with later on.
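One common way to list such outliers is the 1.5 × IQR rule, which is also the default whisker rule of the box plot. The helper below is a hypothetical sketch (`iqr_outliers` is not part of the notebook), shown on invented heart-rate values:

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Invented maximum heart rates with one clearly low value.
toy_thalach = pd.Series([150, 155, 160, 162, 165, 168, 170, 172, 175, 71])
print(iqr_outliers(toy_thalach))
```

In the notebook this could be applied per group, for example to `data.thalach[data.target == 1]`. Whether flagged values should be removed, capped or kept is a separate modelling decision.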

In [None]:
thalach_cad = data.thalach.loc[data.target == 1]
thalach_no_cad = data.thalach.loc[data.target == 0]

thalach_max = np.max(thalach_cad)
thalach_min = np.min(thalach_cad)
thalach_mean = np.mean(thalach_cad)

thalach_max_no = np.max(thalach_no_cad)
thalach_min_no = np.min(thalach_no_cad)
thalach_mean_no = np.mean(thalach_no_cad)

print(f"Mean with CAD: {thalach_mean}\nMean without CAD: {thalach_mean_no}\n")
print(f"Max with CAD: {thalach_max}\nMax without CAD: {thalach_max_no}\n")
print(f"Min with CAD: {thalach_min}\nMin without CAD: {thalach_min_no}")

This short numeric summary agrees with the plots: the mean maximum heart rate is higher in the group of people with heart disease.
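The same summary can be written more compactly by grouping on the target. A minimal sketch with invented values; `toy` stands in for `data`:

```python
import pandas as pd

# Invented example rows; in the notebook, use `data` instead of `toy`.
toy = pd.DataFrame({
    "thalach": [172, 160, 158, 150, 140, 125, 181, 132],
    "target":  [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean, minimum and maximum of the maximum heart rate, per target group.
stats = toy.groupby("target")["thalach"].agg(["mean", "min", "max"])
print(stats)
```

This gives one row per group, which makes the two groups easier to compare side by side than separate print statements.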

## 2.5 Blood pressure

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
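sns.stripplot(ax=axes[0], x="target", y="trestbps", data=data)
# Label each subplot explicitly; plt.ylabel would only affect the current (last) axes
axes[0].set_ylabel("blood pressure (trestbps)")

sns.boxplot(ax=axes[1], x="target", y="trestbps", data=data)
axes[1].set_ylabel("blood pressure (trestbps)")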

plt.show()

Blood pressure doesn't seem to differ much between people with and without heart disease. We can, however, see a positive linear relation with age; blood pressure is widely known to rise with age.

In [None]:
sns.lmplot(x="age", y="trestbps", data=data)
plt.title("Linear relation between age and blood pressure.")
plt.show()
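To put a number on the relation between age and blood pressure, the Pearson correlation and the slope of a fitted line can be computed. The sketch below uses synthetic data: the coefficients used to generate `trestbps` are invented and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic ages and blood pressures (generation parameters are invented).
rng = np.random.default_rng(42)
age = rng.integers(29, 78, size=300)
trestbps = 100 + 0.5 * age + rng.normal(0, 15, size=300)
toy = pd.DataFrame({"age": age, "trestbps": trestbps})

# Pearson correlation coefficient between the two columns.
r = toy["age"].corr(toy["trestbps"])

# Least-squares line: trestbps is approximated by slope * age + intercept.
slope, intercept = np.polyfit(toy["age"], toy["trestbps"], deg=1)

print(f"Pearson r: {r:.2f}")
print(f"Fitted line: trestbps = {slope:.2f} * age + {intercept:.1f}")
```

On the notebook's `data`, `data["age"].corr(data["trestbps"])` would show how strong the relation in the plot actually is.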

## 2.6 ST-segment:

### 2.6.1 ST-segment slope:

In [None]:
data.slope.unique()

In [None]:
sl = []
pc = []

for i in range(0, 3):
    sl_ = data.loc[data.slope == i]
    s = sl_.target.sum()
    t = sl_.target.size
    p = np.round((s/t)*100)
    print(f"{s} people of {t} people with ST-segment slope of: {i} had a heart disease. --> {p}%")
    sl.append(i)
    pc.append(p)

In [None]:
plt.subplots(figsize=(8, 5))
plt.bar(sl, pc)
plt.xlabel("ST-segment slope")
plt.ylabel("Percentage with heart disease")
plt.xticks(sl)
plt.show()

We can see that only a few people in this dataset have an ST-segment slope of 0 (21 people).
Additionally, a slope value of 2 seems to be associated with heart disease, as 74% (of 142 people) had one.

Here you can see the different slopes of the ST-Segment in an ECG.
- yellow -> slope = 0
- green -> slope = 1
- red -> slope = 2

![ST-Segment ECG](http://www.marc-julian.de/images/stsegment.jpeg)

### More pictures visualising the slope:

![ST-Segment slow slope](https://www.marc-julian.de/images/slowslope.jpg)
![ST-Segment rapid slope](https://www.marc-julian.de/images/rapidslope.jpg)

Several studies suggest that the slope of the ST segment is a good predictor of coronary artery disease (which can lead to several other heart diseases).
<br>
A short quote from a study about this topic:

> The ST/HR slope was an improved ECG criterion for diagnosing CAD and compared favorably with TI imaging.

<u>Source:</u><br>
Finkelhor RS, Newhouse KE, Vrobel TR, Miron SD, Bahler RC. The ST segment/heart rate slope as a predictor of coronary artery disease: comparison with quantitative thallium imaging and conventional ST segment criteria. Am Heart J. 1986 Aug;112(2):296-304. doi: 10.1016/0002-8703(86)90265-6. PMID: 3739881.

### 2.6.2 ST-segment depression

In [None]:
# Range of the ST depression values
print(f"Max: {data.oldpeak.max()}")
print(f"Min: {data.oldpeak.min()}")

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.stripplot(ax=axes[0], x="target", y="oldpeak", data=data)
sns.boxplot(ax=axes[1], x="target", y="oldpeak", data=data)
plt.show()

In [None]:
plt.subplots(figsize=(10,5))
sns.boxplot(x="slope", y="oldpeak", data=data)
plt.title("Correlation between ST-segment slope and oldpeak")
plt.show()

# 3. Model creation

Creating a simple LogisticRegression model using sklearn.

In [None]:
model = LogisticRegression(max_iter=500)

# Fit the scaler on the training data only, so no test-set statistics leak into training
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)

model.fit(X_train, y_train)
print("Score with train data: %.3f" % model.score(X_train, y_train))

In [None]:
# Reuse the scaler fitted on the training data; do not refit it on the test set
X_test = scaler.transform(X_test)

test_preds = model.predict(X_test)
print(f"Test predictions:\n{test_preds}")

In [None]:
print("Score with test data: %.3f" % model.score(X_test, y_test))
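A possible next step is to bundle scaling and the model into one scikit-learn pipeline, so that the scaler is always fitted on training data only, also inside cross-validation. The following is a sketch under assumptions: it uses synthetic data from `make_classification` instead of the heart dataset, and the names `X_toy`, `y_toy` and `pipe` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data with 13 features (not the heart dataset).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The pipeline fits the scaler on whatever data it is trained on
# and reuses those statistics when predicting.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

# 5-fold cross-validation on the training split; the scaler is refitted per fold.
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

pipe.fit(X_tr, y_tr)
test_score = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_score:.3f}")

# Precision and recall per class give more detail than accuracy alone.
print(classification_report(y_te, pipe.predict(X_te)))
```

With the notebook's data, the unscaled `X_train` and `X_test` from the split in section 1 would be passed to the pipeline directly.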