# Heart attack prediction

This analysis explores the problem of predicting heart attacks using the provided dataset. The dataset contains various attributes describing a person's health and a label indicating whether or not the person has been deemed to have a high risk of heart attack.


The different datapoints for each patient are:

* Age
* Gender (0=female, 1=male)
* Chest pain type (categorized in 4 different types, 0=none)
* Resting blood pressure
* Serum cholesterol (unit mg/dl)
* Fasting blood sugar is higher than 120 mg/dl
* Resting ECG results (categorized in 3 types)
* Max heart rate
* Exercise induced angina (present or not)
* ST depression induced by exercise relative to rest (numeric value)
* Slope for the peak ST depression during exercise (1=upslope, 2=flat, 3=downslope)
* Number of major vessels (0-3)
* Thal (normal, fixed defect, reversible defect)
* **Target value** high risk for heart attack (0 or 1)

(The original dataset is collected from the [UCI machine learning repository](https://archive.ics.uci.edu/ml/datasets/heart+disease).)


### Contents

* [Initial exploration](#initial-exp)
* [Research](#research)
    - [Sex](#research-gender)
    - [Max heart rate](#research-mhr)
    - [Chest pain](#research-cp)
    - [ST depression slope](#research-slope)
    - [Major vessels](#research-vessels)
* [Modelling](#model)
    - [Clustering & K-Prototypes](#model-cluster)
    - [Neural network](#model-net)
* [Conclusion](#conclusion)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<a id="initial-exp"></a>
# Initial exploration

The first thing is to explore the data and to visualize it. We can get more insight by plotting the distributions of various attributes in the data.

In [None]:
# Define a map for converting the column abbreviations into readable names
ABBR_MAP = {
    'age': 'Age',
    'sex': 'Gender',
    'cp': 'Chest pain',
    'trestbps': 'Resting BP',
    'chol': 'Cholesterol',
    'fbs': 'Fasting sugar > 120 mg/dl',
    'restecg': 'Resting ECG type',
    'thalach': 'Max heart rate',
    'exang': 'Exercise induced angina',
    'oldpeak': 'ST depression during exercise',
    'slope': 'Slope for ST depression',
    'ca': 'Number of major vessels',
    'thal': 'Thal abnormalities'
}

In [None]:
raw_df = pd.read_csv('../input/health-care-data-set-on-heart-attack-possibility/heart.csv')
attributes = list(raw_df.columns)
input_attributes = attributes[:-1]

In [None]:
raw_df.head()

We first separate the dataset into 2 parts, those with low risk for HA and those with a high risk.

In [None]:
# Split by target value: 0 = low risk, 1 = high risk
low_risk_df, high_risk_df = (group.reset_index(drop=True) for _, group in raw_df.groupby('target'))

We continue by plotting the distribution for the different attributes for the low and high risk datasets.

In [None]:
fig, axes = plt.subplots(ncols=2, nrows=len(input_attributes), figsize=(16, 22))
for i, attr in enumerate(input_attributes):
    axes[i, 0].hist(low_risk_df[attr], bins=20, color='b')
    axes[i, 0].set_title(f"{ABBR_MAP[attr]} (low risk)")
    axes[i, 1].hist(high_risk_df[attr], bins=20, color='orange')
    axes[i, 1].set_title(f"{ABBR_MAP[attr]} (high risk)")

    # Use the same x-axis limits in both columns for easier comparison
    x_lims = [axes[i, j].get_xlim() for j in range(2)]
    shared_lim = (min(lo for lo, _ in x_lims), max(hi for _, hi in x_lims))
    for j in range(2):
        axes[i, j].set_xlim(shared_lim)
# Call tight_layout after plotting so that the titles are taken into account
fig.tight_layout()
plt.show()

We note a couple of things from this initial examination:

* The high risk portion has a similar number of men and women, but men are more prominent in the low risk portion. We should investigate the number of men and women further before drawing any conclusions from this observation.
* Max heart rate is higher among high risk patients. This intuitively seems plausible.
* Chest pain is more prominent among high risk patients. This also seems intuitively plausible.
* Slope for ST depression seems to be downsloping on average among high risk patients. On the other hand, the slope is flat on average for low risk patients.
* Number of major vessels is higher among low risk patients compared to high risk ones.

We will research these observations further by comparing the number of people in the different groups.

<a id="research"></a>
# Research

In the first section we could draw some preliminary conclusions from the data. Nonetheless, we should confirm our suspicions by a more thorough analysis.

<a id="research-gender"></a>
### Sex

To determine whether or not sex is a factor in the risk for heart attack, we can calculate the proportion of men and women that fall into the high risk group, relative to their counts in the overall dataset.

In [None]:
# Total number of women and men (sex: 0 = female, 1 = male)
female_count, male_count = raw_df['sex'].value_counts().sort_index()
# Number of women and men in the high risk group
female_count_hr, male_count_hr = high_risk_df['sex'].value_counts().sort_index()
# Proportion of each sex that falls into the high risk group
female_proportion = female_count_hr / female_count
male_proportion = male_count_hr / male_count
print(f"Male: {male_proportion:.0%}, female: {female_proportion:.0%}.")

Using this statistic, we can conclude that females tend to have a higher probability of being in the high risk group. Of course, this conclusion is based on the provided dataset, therefore we cannot draw any definitive conclusions.
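
The same per-group share can be written as a small helper. The following is only a sketch on made-up toy arrays (not values from the dataset), and `high_risk_share` is a hypothetical name introduced for illustration:

```python
import numpy as np

# Toy data, made up for illustration: sex (0=female, 1=male), target (1=high risk)
sex = np.array([0, 0, 0, 1, 1, 1, 1, 1])
target = np.array([1, 1, 0, 1, 0, 0, 0, 1])

def high_risk_share(sex, target):
    """Return the fraction of high-risk patients within each sex group."""
    return {int(s): float(target[sex == s].mean()) for s in np.unique(sex)}

shares = high_risk_share(sex, target)
print(shares)
```

Because the target is coded as 0 or 1, the mean of the target within a group is exactly the share of high-risk patients in that group.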

<a id="research-mhr"></a>
### Maximum heart rate

Next, we'll determine whether the maximum heart rate is higher among high risk patients. We can do this by computing the quartiles (25%, 50% and 75%) of the heart rate for both groups and plotting them.

In [None]:
# Quartiles (25%, 50% and 75%)
quantiles = [0.25, 0.5, 0.75]
lr_quartile = low_risk_df['thalach'].quantile(quantiles)
hr_quartile = high_risk_df['thalach'].quantile(quantiles)
print(lr_quartile)
print(hr_quartile)

In [None]:
fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(16, 9))
titles = ["Low risk", "High risk"]
dfs = [low_risk_df, high_risk_df]
for i in range(2):
    ax[i].boxplot(dfs[i]['thalach'].values, showfliers=False)
    ax[i].set_title(titles[i])
    ax[i].set_xticklabels(['Maximum heart rate'])
plt.show()

We can clearly see that the lower whisker, the upper whisker and the median of the maximum achieved heart rate are all higher among high risk patients.


<a id="research-cp"></a>
### Chest pain

We continue by determining whether or not chest pain is more prominent among high risk patients. We do this by calculating the prevalence of chest pain types in both groups.

In [None]:
# Collapse the chest pain types into 0 = no chest pain, 1 = any chest pain
cp_present = {0: 0, 1: 1, 2: 1, 3: 1}
low_risk_groups = low_risk_df['cp'].map(cp_present)
high_risk_groups = high_risk_df['cp'].map(cp_present)
low_risk_neg, low_risk_pos = [low_risk_groups.value_counts(normalize=True)[i] for i in range(2)]
high_risk_neg, high_risk_pos = [high_risk_groups.value_counts(normalize=True)[i] for i in range(2)]

print(low_risk_pos)
print(high_risk_pos)

# Calculate the increase in percentages
percentage_more = (high_risk_pos - low_risk_pos) / (low_risk_pos)
percentage_more

Therefore we conclude that patients in the high risk group do exhibit chest pain significantly more often than low risk patients.
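
The comparison above is a relative increase, (high - low) / low. As a minimal sketch with made-up prevalences (not results from this dataset), and with `relative_increase` being a hypothetical helper name:

```python
def relative_increase(baseline, value):
    """Relative change of `value` with respect to `baseline` (0.5 means 50% more)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline

# Made-up prevalences for illustration only
low, high = 0.25, 0.75
print(f"{relative_increase(low, high):.0%} more frequent")
```

Note that a relative increase of 2.0 means the prevalence is three times the baseline, so it is worth stating explicitly which of the two numbers is being reported.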


<a id="research-slope"></a>
### ST Depression slope

We now take a look at the different slopes for ST depression. We start by calculating the prevalence of flat and downwards slopes in each group. Because upwards tending slopes seem to be mostly absent from both groups, we ignore them in this deduction.

In [None]:
low_risk_slopes = low_risk_df['slope'].map({0: np.nan, 1:1, 2:2}).value_counts(normalize=True)
high_risk_slopes = high_risk_df['slope'].map({0: np.nan, 1:1, 2:2}).value_counts(normalize=True)

# Different percentages
print(low_risk_slopes[2])
print(high_risk_slopes[2])

# Calculate increase in proportions
percentage_more_slope = (high_risk_slopes[2] - low_risk_slopes[2]) / (low_risk_slopes[2])
percentage_more_slope

Again, we find that our initial observation holds water under closer examination. The prevalence of downwards slopes in the high risk group is around 1.5x more than in the low risk group.

<a id="research-vessels"></a>
### Major vessels

Finally, we'll examine the last observation we made from the distribution plots, namely whether or not the number of major vessels is higher in the low risk group. To do this, we calculate the mean number of major vessels found in both groups and compare them. We should also investigate the standard deviation to get more insight into this datapoint.

In [None]:
low_risk_mean_vessels, low_risk_vessel_std = low_risk_df['ca'].mean(), low_risk_df['ca'].std()
high_risk_mean_vessels, high_risk_vessel_std = high_risk_df['ca'].mean(), high_risk_df['ca'].std()

# These are means (not medians) of the number of major vessels
print("Means")
print(f"Low: {low_risk_mean_vessels:.2f}\nHigh: {high_risk_mean_vessels:.2f}")
percentage_fewer_ca = 1 - high_risk_mean_vessels / low_risk_mean_vessels
print(f"{percentage_fewer_ca:.0%} fewer major vessels on average in the high risk group.")
print("Standard deviations")
print(f"Low: {low_risk_vessel_std:.2f}\nHigh: {high_risk_vessel_std:.2f}")

From these values we can clearly see that patients in the low risk group tend to have a larger number of major vessels. As the standard deviations are very close to each other, this suggests that the difference in the number of major vessels between the groups reflects a difference in the patients' health rather than an anomaly in the dataset.

<a id="model"></a>
# Modelling

Now that we have identified some possible characteristics for separating the two groups from each other, we can begin on modelling.
From here on, we include only the 5 promising attributes in our analysis, in addition to the target attribute.

In [None]:
considered_input_attributes = ['sex', 'cp', 'slope', 'ca', 'thalach']
raw_input_data = raw_df[considered_input_attributes].values
raw_input_data.shape

<a id="model-cluster"></a>
### Clustering & K-Prototypes

We cannot use the traditional K-means clustering as our dataset contains mainly categorical attributes (4 out of 5). Therefore, we have to use something else.

Luckily, there exists a proposed method that is similar to K-means but works with both categorical and numerical data, namely **K-Prototype clustering**.
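
To give an intuition for how such a method can handle mixed data, the sketch below computes a K-Prototypes-style dissimilarity between one sample and one cluster prototype: squared Euclidean distance on the numeric part plus a weight `gamma` times the number of categorical mismatches. The function name, the example values and the choice `gamma=0.5` are assumptions made for illustration; the library's actual implementation and its choice of `gamma` may differ.

```python
import numpy as np

def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Squared Euclidean distance on numeric features plus gamma times categorical mismatches."""
    diff = np.asarray(x_num, dtype=float) - np.asarray(proto_num, dtype=float)
    numeric_part = float(np.sum(diff ** 2))
    categorical_part = int(np.sum(np.asarray(x_cat) != np.asarray(proto_cat)))
    return numeric_part + gamma * categorical_part

# Hypothetical patient and prototype: one scaled numeric feature, four categorical ones
d = mixed_dissimilarity([0.5], [1, 0, 2, 0], [0.25], [1, 2, 2, 1], gamma=0.5)
print(d)
```

The weight `gamma` controls how much a categorical mismatch counts relative to numeric distance, so the scale of the numeric feature matters for the resulting clusters.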

In [None]:
from kmodes.kprototypes import KPrototypes

In [None]:
km_model = KPrototypes(n_clusters=2)
# The first four columns are categorical; the last one (thalach) is continuous
mixed_data = raw_df[considered_input_attributes]
categorical_columns = [0, 1, 2, 3]
km_model.fit(mixed_data, categorical=categorical_columns)

clusters = km_model.predict(mixed_data, categorical=categorical_columns)

# Check whether the clusters map to the two risk groups.
# Copy the slice so that adding columns does not trigger a SettingWithCopyWarning.
predicted_df = raw_df[considered_input_attributes].copy()
predicted_df['prediction'] = clusters
predicted_df['target'] = raw_df['target']
predicted_df

In [None]:

# As we don't know which cluster corresponds to which group, we try both mappings
# First the direct map 0 -> 0, 1 -> 1:
correct_count = (predicted_df['target'] == predicted_df['prediction']).sum()
print(correct_count)

# Then the inverse mapping: 0 -> 1, 1 -> 0
correct_count_inverse = (predicted_df['target'] == 1 - predicted_df['prediction']).sum()
print(correct_count_inverse)

We notice that the inverse mapping does find some kind of an association between the different groups and the input features. We could also try to use the same K-Prototypes algorithm but use **all** attributes, but still this approach does not seem to be bulletproof.
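
Since cluster labels are arbitrary, the two mappings can be folded into a single score. The following is a sketch on made-up labels, with `best_mapping_accuracy` being a hypothetical helper name:

```python
import numpy as np

def best_mapping_accuracy(targets, clusters):
    """Accuracy of a 2-cluster assignment under the better of the two label mappings."""
    targets = np.asarray(targets)
    clusters = np.asarray(clusters)
    direct = np.mean(targets == clusters)
    inverse = np.mean(targets == 1 - clusters)
    return float(max(direct, inverse))

# Made-up labels for illustration only
targets = [0, 0, 1, 1, 1]
clusters = [1, 1, 0, 0, 1]
acc = best_mapping_accuracy(targets, clusters)
print(acc)
```

For two clusters the direct and inverse accuracies always add up to 1, so the better mapping scores at least 50% by construction. Only values well above that indicate a real association between clusters and risk groups.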

<a id="model-net"></a>
### Neural network

We can also approach this classification problem with a small neural network. We start by formatting the input data accordingly:

* Map the target value from {0, 1} to {-1, 1}, which matches the output range of the hyperbolic tangent. This can be achieved with a simple map over the dataframe.
* Scale the numerical value (max heart rate) to the range [0, 1] using min-max scaling.
* Create a function for mapping the rows of the dataframe into tensors.

In [None]:
nn_df = raw_df[considered_input_attributes].copy()
nn_output_series = raw_df['target'].map({0: -1, 1: 1})
nn_df['target'] = nn_output_series
min_t, max_t = nn_df['thalach'].min(), nn_df['thalach'].max()
nn_df['thalach'] = nn_df['thalach'].apply(lambda x: (x - min_t) / (max_t - min_t))

Now we just have to create a function for turning a row in the dataframe into a tensor. Before that, we calculate how many dimensions the tensor should have when taking the one-hot encoding into account.

In [None]:
# We should calculate the amount of dimensions required for tensors
total_dims = list(map(lambda attr: len(nn_df[attr].unique()), considered_input_attributes[:-1]))
total_dim_count = sum(total_dims) + 1
total_dim_count

Therefore we should map each row into a one-dimensional tensor with 15 elements. For better data manipulation, we construct a Dataset class for mapping the rows of the dataframe into tensors.
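
As a rough illustration of the encoding, the sketch below places each categorical value in its own one-hot block and appends the numeric value at the end. The block sizes `[2, 4, 3, 5]` for sex, chest pain, slope and major vessels are an assumption chosen to match the total of 15, and `encode_row` is a hypothetical helper name:

```python
import numpy as np

def encode_row(categories, sizes, numeric_value):
    """One-hot encode each categorical value into its own block and append the numeric value."""
    # Start index of each block: 0, sizes[0], sizes[0] + sizes[1], ...
    offsets = np.cumsum([0] + list(sizes[:-1]))
    vec = np.zeros(sum(sizes) + 1)
    for value, offset in zip(categories, offsets):
        vec[offset + value] = 1.0
    vec[-1] = numeric_value
    return vec

# Assumed block sizes (2 + 4 + 3 + 5 = 14) plus one slot for the scaled heart rate
sizes = [2, 4, 3, 5]
vec = encode_row([1, 2, 0, 3], sizes, 0.6)
print(vec)
```

This only works when the categories of each attribute are coded as consecutive integers starting at 0, which is also what the offset arithmetic in this notebook relies on.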

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

In [None]:
categorical_attr = considered_input_attributes[:-1]

In [None]:
class HeartSet(Dataset):
    """Maps each dataframe row to a (one-hot encoded input, target) pair."""

    def __init__(self, raw_dataframe):
        self.raw = raw_dataframe
        # Start index of each categorical attribute's one-hot block
        self.offsets = np.cumsum([0] + total_dims[:-1])

    def _encode(self, row):
        input_vec = np.zeros(total_dim_count)
        for offset, attr in zip(self.offsets, categorical_attr):
            input_vec[int(row[attr] + offset)] = 1
        # The last element holds the scaled max heart rate
        input_vec[-1] = row['thalach']
        return input_vec, row['target']

    def __getitem__(self, idx):
        # Only integer indices are supported, which is all that
        # random_split and DataLoader need
        return self._encode(self.raw.iloc[idx])

    def __len__(self):
        return self.raw.shape[0]

In [None]:
dataset = HeartSet(nn_df)

Next, we'll use utility functions from PyTorch to split the dataset into training and testing sets.

In [None]:
train_size = 200
test_size = nn_df.shape[0] - train_size
train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=True)

In [None]:
# Check counts
train_ones = 0
test_ones = 0
for i in range(train_size):
    train_ones += 1 if train_dataset[i][1] == 1 else 0
for j in range(test_size):
    test_ones += 1 if test_dataset[j][1] == 1 else 0
print(train_ones)
print(test_ones)

Next, we'll define our network. It is a very simple network consisting of only 2 layers and it uses the *hyperbolic tangent* activation function.
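
In plain NumPy terms, a network of this shape computes `tanh(W2 @ tanh(W1 @ x + b1) + b2)`. The sketch below uses randomly initialised weights as stand-ins for trained parameters; all names in it are hypothetical and it is not the model trained in this notebook:

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs = 15  # length of the encoded input vector

# Randomly initialised weights, standing in for trained parameters
W1, b1 = rng.normal(size=(2 * n_inputs, n_inputs)), np.zeros(2 * n_inputs)
W2, b2 = rng.normal(size=(1, 2 * n_inputs)), np.zeros(1)

def forward(x):
    """Two linear layers, each followed by tanh, so the output lies between -1 and 1."""
    hidden = np.tanh(W1 @ x + b1)
    return np.tanh(W2 @ hidden + b2)

x = rng.random(n_inputs)
y = forward(x)
# The sign of the output decides the predicted class (-1 or 1)
predicted_label = 1 if y[0] >= 0 else -1
print(y, predicted_label)
```

Because the final tanh bounds the output between -1 and 1, the targets -1 and 1 are reachable only asymptotically, which is one reason to read the sign of the output rather than its exact value.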

In [None]:
import torch.nn as nn
N = total_dim_count

In [None]:
class ModelNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(N, 2*N)
        self.tanh1 = nn.Tanh()
        self.linear2 = nn.Linear(2*N, 1)
        self.tanh2 = nn.Tanh()
        
    def forward(self, input_tens):
        x = self.linear1(input_tens)
        x = self.tanh1(x)
        x = self.linear2(x)
        x = self.tanh2(x)
        return x

In [None]:
model = ModelNN()

We will use the mean squared error loss function and stochastic gradient descent as the optimizer.

In [None]:
import torch.optim as optim

criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0)

We train the network for 5 epochs over the training dataset, using mini batches of at most 4 samples each.

In [None]:
loss_points = []
for epoch in range(5):
    running_loss = 0.0
    for i, data in enumerate(train_loader):
        inputs, labels = data
        labels = labels.unsqueeze(dim=1).to(torch.float32)
        inputs = inputs.to(torch.float32)

        optimizer.zero_grad()

        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    
        with torch.no_grad():
            loss_i = loss.item()
            running_loss += loss_i
    loss_points.append(running_loss)

We plot the loss to see how our network learned.

In [None]:
fig, ax = plt.subplots(1)
ax.plot(loss_points)
ax.set_ylim(bottom=0)
plt.show()

Next, we'll **evaluate** our network using the test dataset.

In [None]:
model = model.eval()

In [None]:
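correct = 0
# Gradients are not needed during evaluation
with torch.no_grad():
    for inputs, label in test_loader:
        inputs = inputs.to(torch.float32)
        prediction = model(inputs)
        # The test loader uses batch_size=1, so each prediction is a single value
        label_prediction = -1 if prediction.item() < 0 else 1
        correct += int(label.item() == label_prediction)
correct / len(test_loader)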

As we see, we get around 80% classification accuracy when using a neural network. However, this **does depend on the exact partitioning** of the training and testing datasets: during experimentation the network reached classification accuracies as high as 86% and as low as 76%.
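
One way to quantify this dependence on the partitioning is to repeat the random split several times and report the mean and standard deviation of the test accuracy. The sketch below does this on synthetic data with a placeholder majority-class model; `accuracy_over_splits` and `majority_class` are hypothetical names, and the neural network would be plugged in through the `fit_predict` argument:

```python
import numpy as np

def accuracy_over_splits(X, y, fit_predict, train_size, n_repeats=20, seed=0):
    """Mean and standard deviation of test accuracy over repeated random splits."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        order = rng.permutation(len(y))
        train, test = order[:train_size], order[train_size:]
        predictions = fit_predict(X[train], y[train], X[test])
        scores.append(np.mean(predictions == y[test]))
    return float(np.mean(scores)), float(np.std(scores))

def majority_class(X_train, y_train, X_test):
    """Placeholder model: always predicts the most frequent training label."""
    values, counts = np.unique(y_train, return_counts=True)
    return np.full(len(X_test), values[np.argmax(counts)])

# Synthetic data for illustration only
rng = np.random.default_rng(1)
X = rng.random((100, 3))
y = np.where(rng.random(100) < 0.5, -1, 1)
mean_acc, std_acc = accuracy_over_splits(X, y, majority_class, train_size=70)
print(f"{mean_acc:.2f} +/- {std_acc:.2f}")
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two models is larger than the variation caused by the split alone.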

<a id="conclusion"></a>
# Conclusion

As we have seen, from the provided dataset we can find factors which influence whether or not a patient has a high risk of heart attack. We have also seen that a simple K-prototype modelling method does find some clusters in the data which correspond to the two different risk levels.

Also, we saw how even a very simple neural network can do a reasonably good job in classifying patients into the different groups.