# Kaggle Heart Attack Prediction dataset

This dataset contains patient records and is used to classify each patient as having a low probability (0) or a high probability (1) of a heart attack.

The dataset is available on [Kaggle](https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset).

## Loading packages for analyzing and modeling data

In [None]:
# Packages to hold and pre-process data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling packages
import torch
from torch import nn
import torch.optim as optim
import torch.nn.functional as F
from torch.autograd import Variable
from torch.utils.data import DataLoader, TensorDataset

import sklearn
from sklearn import svm

import itertools

# The exploratory plots raise many warnings, so silence them for this notebook
import warnings
warnings.filterwarnings("ignore")


## Data Analysis and Preprocessing

In [None]:
#Load data into DataFrames
heart_data = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
o2_data = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/o2Saturation.csv')

In [None]:
heart_data.info()

In [None]:
o2_data.info()

A first look at the two files shows that heart.csv contains 303 observations, while o2Saturation.csv contains 3585. Given this mismatch, and because there is no additional information linking the two files, the prediction analysis uses only the heart.csv features. Since no column has missing data, no imputation is needed.
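
The absence of missing values can also be checked directly rather than read off `info()`. The following is a minimal sketch on a small made-up frame; the `toy` data and its values are illustrative assumptions, not the Kaggle file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for heart.csv; the values are made up for illustration
toy = pd.DataFrame({
    "age": [63, 37, 41, np.nan],
    "chol": [233, 250, 204, 236],
})

# Number of missing entries per column; all zeros would mean no imputation is needed
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Names of the columns that would need imputation
needs_imputation = missing_per_column[missing_per_column > 0].index.tolist()
print(needs_imputation)
```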

In [None]:
heart_data.head()

From the column names and the first 5 observations we can confirm that the dataset contains both continuous and categorical variables. For continuous variables the statistical properties are of interest; for categorical data the count of each sub-type can give some insight into the data. To check which columns are categorical, print the number of unique values per column: a small count indicates ordinal or categorical data.

In [None]:
heart_data.nunique()

In [None]:
# Build a dictionary mapping each column name to its number of unique values
dic_unique = dict(zip([i for i in heart_data.columns],[len(heart_data[i].unique()) for i in heart_data.columns]))
print(dic_unique)

Based on the low number of unique entries, the continuous, categorical, and target columns are:

In [None]:
cont_colms=["age","trtbps","chol","thalachh","oldpeak"]
categ_colms=["sex","cp","fbs","restecg","exng","slp","caa","thall"]
output_colms=["output"]

In [None]:
print("Number of continuous variables is {}".format(len(cont_colms)))
print("Number of categorical variables is {}".format(len(categ_colms)))
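
The manual split into continuous and categorical columns can be approximated with a threshold on the unique count. This is only a sketch on made-up data: the `toy` frame and the `MAX_CATEGORIES` cut-off are assumptions chosen for illustration, and the result should still be checked against the column descriptions.

```python
import pandas as pd

# Toy frame with one continuous-like and two categorical-like features (made-up values)
toy = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44],
    "sex": [1, 1, 0, 1, 0, 1],
    "cp": [3, 2, 1, 1, 0, 0],
    "output": [1, 1, 1, 0, 0, 0],
})

MAX_CATEGORIES = 5  # hypothetical cut-off, chosen for this illustration only

# Unique count per feature column, excluding the target
n_unique = toy.drop(columns="output").nunique()

# Columns with few distinct values are treated as categorical
categorical = n_unique[n_unique <= MAX_CATEGORIES].index.tolist()
continuous = n_unique[n_unique > MAX_CATEGORIES].index.tolist()
print(categorical, continuous)
```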

## Continuous Variables

In [None]:
# Statistics for continuous variables
heart_data[cont_colms].describe()

In [None]:
#Show distribution of continuous variables
heart_data[cont_colms].hist(figsize=(15,15),bins=20)

The histograms show that the continuous variables have some outliers, as well as some underlying distribution of values. This suggests using box plots to better understand the data. Seaborn offers boxenplots and violin plots; since we want to visualize outliers, we use the boxenplot. To see any effect with respect to the output, the distribution plots are then separated by output class.
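
To put a number on the outliers seen in the plots, one common convention is the 1.5 × IQR rule. The helper `count_iqr_outliers` and the sample values below are hypothetical and only illustrate the rule; they are not part of the original analysis.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up cholesterol-like values with one extreme entry
toy_chol = pd.Series([200, 210, 220, 230, 240, 250, 564])
n_outliers = count_iqr_outliers(toy_chol)
print(n_outliers)
```

On the real data this could be applied per column, for example with `heart_data[cont_colms].apply(count_iqr_outliers)`.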

In [None]:
# Plot boxen plots of the continuous variables on a 2 x 3 subplot grid
fig, axs = plt.subplots(nrows=2,ncols=3,figsize=(20,20))
fig.suptitle("Distribution of Continuous Data")
for i in range(2):
    for j in range(3):
        if i==1 and j ==2:
            continue
        else:
            sns.boxenplot(data=heart_data,x=cont_colms[3*i+j],ax=axs[i,j])
            axs[i,j].xaxis.grid(True)
fig.delaxes(axs[1,2])

In [None]:
# Plot KDE plots of the continuous variables on a 2 x 3 subplot grid
# displot and kdeplot give similar output, but the first is a figure-level function and the second is an axes-level function,
# so the latter is used to loop with the same construct as before
# The bandwidth adjustment parameter was chosen to smooth out all underlying distributions
fig, axs = plt.subplots(nrows=2,ncols=3,figsize=(20,20))
fig.suptitle("KDE for continuous variables segmented by output")
for i in range(2):
    for j in range(3):
        if i==1 and j ==2:
            continue
        else:
            # Index with 3*i+j so that each of the five variables is plotted exactly once
            sns.kdeplot(data=heart_data,x=cont_colms[3*i+j],ax=axs[i,j],hue="output",bw_adjust=0.75,fill=True)
fig.delaxes(axs[1,2])

The plots suggest that for the **age** and **thalachh** variables the mean and shape of the distribution differ between the two output classes. This could be a sampling artifact; further investigation is needed to test the difference between the means.
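
One possible follow-up for the difference between the means is Welch's t-test, which does not assume equal variances in the two groups. The sketch below computes only the test statistic with NumPy on synthetic samples; `welch_t`, `group0`, and `group1` are illustrative names and the data are randomly generated, not taken from the dataset.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Unbiased sample variances
    var_a, var_b = a.var(ddof=1), b.var(ddof=1)
    return (a.mean() - b.mean()) / np.sqrt(var_a / a.size + var_b / b.size)

# Synthetic stand-ins for a feature split by output class
rng = np.random.default_rng(0)
group0 = rng.normal(loc=140, scale=20, size=100)
group1 = rng.normal(loc=160, scale=20, size=100)

t_stat = welch_t(group1, group0)
print(t_stat)
```

A large absolute value of the statistic indicates that the gap between the means is unlikely to come from sampling noise alone. `scipy.stats.ttest_ind` with `equal_var=False` performs the same test and also returns a p-value.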

To limit overfitting caused by redundant features, use the Pearson correlation coefficient to check how the continuous variables relate to each other.

In [None]:
corr = heart_data[cont_colms].corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)]=True
plt.figure(figsize=(10,10))
sns.heatmap(corr,mask=mask,square=True,linewidths=.5,annot=True,fmt=".3f",vmin=-.5,vmax=.5)

The Pearson coefficients show the most pronounced negative correlations between **thalachh** and **age**, and between **thalachh** and **oldpeak**. Note that the color scale of the heatmap is clipped to ±0.5, so these cells look stronger than they would on a full -1 to 1 scale. However, since the number of observations is small, we will keep all features and adjust the models accordingly.
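
Reading the pairs off the heatmap can be complemented by listing every pair whose absolute correlation exceeds a threshold. The helper `correlated_pairs`, the threshold value, and the `toy` data below are assumptions made for this sketch.

```python
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float):
    """Return (column_a, column_b, r) for each pair with |r| >= threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        # Only look at columns after a, so each pair is reported once
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs

# Made-up data: y is perfectly anti-correlated with x, z is uncorrelated with both
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [5.0, 4.0, 3.0, 2.0, 1.0],
    "z": [2.0, 1.0, 2.0, 1.0, 2.0],
})
pairs = correlated_pairs(toy, threshold=0.3)
print(pairs)
```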

## Categorical Variables

The count of unique entries shows that each categorical column has between 2 and 5 unique values, so it is worth counting the observations per sub-type and visualizing them.

In [None]:
# Collect the unique values and the count of each sub-type for every categorical column
unique_vals=[]
categ_counts=[]
for i, col_name in enumerate(heart_data[categ_colms]):
    val_cat = list(heart_data[col_name].unique())
    sub_count = [len(heart_data[heart_data[col_name]==cat_count]) for j, cat_count in enumerate(heart_data[col_name].unique())]
    unique_vals.append(val_cat)
    categ_counts.append(sub_count)
print(unique_vals)
print(categ_counts)

In [None]:
# Plot categorical count plots on 3 x 3 subplot grid
fig, axs = plt.subplots(nrows=3,ncols=3,figsize=(20,20))
fig.suptitle("Count of Categorical Data")
for i in range(3):
    for j in range(3):
        if i==2 and j ==2:
            continue
        else:
            sns.countplot(data=heart_data,x=categ_colms[3*i+j],ax=axs[i,j])
fig.delaxes(axs[2,2])

## Creating models to understand data

Since the goal is to perform binary classification based on the available features, this suggests using the algorithms:
 - SVM
 - Neural Networks

The continuous features are scaled before training the models.

To compare the models, the data is split into training and testing sets, and the comparison metric is the accuracy of each model on the held-out testing set (referred to below as the validation set).


In [None]:
#Separate features and objective
X = heart_data.drop(["output"],axis=1).copy()
Y = heart_data["output"].copy()

# Scale the continuous variables with RobustScaler, since the samples have outliers
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X[cont_colms]=scaler.fit_transform(X[cont_colms])
X.head()

In [None]:
# Training and testing split
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,shuffle=True)
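
The cells above fit the scaler on the full data before splitting, so the test rows influence the scaling statistics. A minimal sketch of the alternative order (split first, then fit the scaler on the training rows only) is shown below on synthetic data. The names `toy_X` and `toy_y`, the `random_state`, and the use of `stratify` are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in: one continuous and one categorical feature, balanced target
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "chol": rng.normal(240, 50, 100),
    "sex": rng.integers(0, 2, 100),
})
toy_y = pd.Series(np.repeat([0, 1], 50))

# Split first; stratify keeps the class ratio equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)
X_tr, X_te = X_tr.copy(), X_te.copy()

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler = RobustScaler()
X_tr[["chol"]] = scaler.fit_transform(X_tr[["chol"]])
X_te[["chol"]] = scaler.transform(X_te[["chol"]])
```

Fixing `random_state` also makes the reported accuracies reproducible between runs.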

## SVM

The SVM package of scikit-learn implements support vector classification in the SVC class, which allows for regularization and the use of different kernels. The hyperparameters of the models will be adjusted to maximize the accuracy on the validation set.

In [None]:
# Implement vector classification
clf = svm.SVC(C=10,kernel="poly")
clf.fit(x_train,y_train)
accuracy_train = clf.score(x_train,y_train)
y_pred = clf.predict(x_test)
accuracy_test = sklearn.metrics.accuracy_score(y_test,y_pred)
results = dict({"Training Accuracy":[accuracy_train],"Testing Accuracy":[accuracy_test]})
result=pd.DataFrame.from_dict(results)
result.head()

In [None]:
# Hyperparameter search: the exponent range started at -2 to 3 and was iteratively narrowed to -1 to 1
c_param = np.logspace(-1,1,30)
gamma_param = np.logspace(-1,1,30)
training_accuracy=[]
testing_accuracy=[]
results = pd.DataFrame({"Regularization":[],"Gamma":[],"Training Accuracy":[],"Testing Accuracy":[]})
for i, c in enumerate(c_param):
    for j,gamma in enumerate(gamma_param):
        clf = svm.SVC(C=c,gamma=gamma)
        clf.fit(x_train,y_train)
        y_pred = clf.predict(x_test)
        training_accuracy.append(clf.score(x_train,y_train))
        testing_accuracy.append(sklearn.metrics.accuracy_score(y_test,y_pred))
        
# Create cartesian product of parameters for DataFrame
c_list = []
g_list = []
for elem in itertools.product(c_param,gamma_param):
    c_list.append(elem[0])
    g_list.append(elem[1])

results = pd.DataFrame({"Regularization":c_list,"Gamma":g_list,"Training Accuracy":training_accuracy,"Testing Accuracy":testing_accuracy})

In [None]:
mloc = results["Testing Accuracy"].argmax()
results.iloc[mloc]

With SVM the max accuracy on the validation set is **0.868852**
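
Because the best parameters above are chosen by their score on the held-out split, that score is likely to be optimistic. A hedged alternative is to select the parameters by cross-validation on the training data and use the held-out split only once at the end. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data from `make_classification`; the grid size, fold count, and data are assumptions, not the settings of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic binary classification problem with 13 features
X_toy, y_toy = make_classification(n_samples=200, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Small logarithmic grid over the regularization and kernel width
param_grid = {"C": np.logspace(-1, 1, 5), "gamma": np.logspace(-1, 1, 5)}

# 5-fold cross-validation on the training data selects the parameters
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)
print(search.best_params_, search.best_score_)

# The held-out split is used only once, for the final estimate
test_accuracy = search.score(X_te, y_te)
print(test_accuracy)
```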

## Neural Network

For testing purposes we define a neural network with 2 hidden layers whose dimensions are to be chosen. The output is a two-component vector, which is transformed with a softmax function to obtain the classification.

In [None]:
class NetClassifier(nn.Module):
    def __init__(self,hidden1,hidden2):
        super(NetClassifier,self).__init__()
        self.model = nn.Sequential(
            nn.Linear(13,hidden1),
            nn.ReLU(),
            nn.Linear(hidden1,hidden2),
            nn.ReLU(),
            nn.Linear(hidden2,2),
        )
    def forward(self,x):
        x = self.model(x)
        return x

In [None]:
# Create an instance of the class
net = NetClassifier(20,20)
print(net)
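
The network returns raw scores (logits) rather than probabilities, because `nn.CrossEntropyLoss` applies the log-softmax internally. The short sketch below, using made-up logits, shows how the softmax turns the two outputs into class probabilities and why taking the argmax of the logits gives the same class.

```python
import torch

# Made-up logits for two samples and two classes
logits = torch.tensor([[2.0, 0.5], [-1.0, 1.0]])

# Softmax over the class dimension gives probabilities that sum to 1 per row
probs = torch.softmax(logits, dim=1)

# The predicted class is the index of the largest probability
pred = torch.argmax(probs, dim=1)

# Softmax is monotonic, so argmax over logits yields the same prediction
same_as_logits = torch.equal(pred, torch.argmax(logits, dim=1))
print(probs, pred, same_as_logits)
```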

In [None]:
# Define training parameters
EPOCHS = 50
BATCH_SIZE=32
LEARNING_RATE = 0.01

# Define optimizer and loss function
optimizer = optim.Adam(net.parameters(),lr=LEARNING_RATE)
loss_function = nn.CrossEntropyLoss()

In [None]:
# Define variables as tensors to train the model
x_train_t = torch.tensor(x_train.values, dtype=torch.float)
x_test_t = torch.tensor(x_test.values, dtype=torch.float)
y_train_t = torch.tensor(y_train.values, dtype=torch.long)
y_test_t = torch.tensor(y_test.values, dtype=torch.long)

In [None]:
# Create DataSet and DataLoaders to train the neural network
train_data_set = TensorDataset(x_train_t,y_train_t) 
test_data_set = TensorDataset(x_test_t,y_test_t)

train_dataloader = DataLoader(train_data_set,batch_size = BATCH_SIZE)
test_dataloader = DataLoader(test_data_set,batch_size = BATCH_SIZE)

In [None]:
# Save losses for each batch for all epochs for plotting learning curve
losses=[]
for epoch in range(EPOCHS):
    epoch_loss=0
    epoch_acc=0
    for xb,yb in train_dataloader:
        # Zero gradients for training
        optimizer.zero_grad()
        # Use current model parameters to predict output
        y_pred = net(xb)
        #y_pred = torch.flatten(y_pred)
        # Turn probabilities into prediction 
        #pred = torch.round(torch.sigmoid(torch.flatten(y_pred)))
        # Calculate loss; CrossEntropyLoss takes raw logits and integer class targets
        loss = loss_function(y_pred,yb)
        losses.append(loss.item())
        # Backpropagate
        loss.backward()
        # Step in the optimizer
        optimizer.step()
        epoch_loss+=loss.item()
        epoch_acc+=(yb == torch.argmax(y_pred,dim=1)).float().mean()
    # Print epoch loss and accuracy
    print("Epoch {:>02d} | Loss {:.5f} | Acc {:.3f}".format(epoch,epoch_loss/len(train_dataloader),epoch_acc/len(train_dataloader)))

In [None]:
# With trained nn predict the values for validation set and test accuracy
y_val = net(x_test_t)
y_val = torch.argmax(y_val,dim=1)
accuracy = (y_val == y_test_t).float().mean()
print("Classification accuracy for nn is {}".format(accuracy))

In [None]:
# Plot learning curve
plt.figure(figsize=(15,5))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.lineplot(x=np.arange(1,len(losses)+1),y=losses)
plt.title("Learning curve")
plt.ylabel("Batch Loss")
plt.xlabel("Batch iteration")

Wrap the training loop in a search over the hidden-layer sizes to find the architecture with the best accuracy on the validation set.

In [None]:
# Define range of parameters to test
hid1=np.arange(15,51,1)
hid2=np.arange(15,51,1)
# Create empty lists to store results for training and testing values
training_acc=[]
training_loss=[]
testing_acc=[]

# Define training parameters
EPOCHS = 80
BATCH_SIZE=32
LEARNING_RATE = 0.01

# Search parameter space
for i, hidden1 in enumerate(hid1):
    for j, hidden2 in enumerate(hid2):
        # Define instance of class to test (cast NumPy integers to plain int for nn.Linear)
        netb = NetClassifier(int(hidden1),int(hidden2))
        # Define optimizer and loss function
        optimizer = optim.Adam(netb.parameters(),lr=LEARNING_RATE)
        loss_function = nn.CrossEntropyLoss()
        # Perform training
        for epoch in range(EPOCHS):
            epoch_loss=0
            epoch_acc=0
            for xb,yb in train_dataloader:
                # Zero gradients for training
                optimizer.zero_grad()
                # Use the candidate model netb (not the earlier net) to predict output
                y_pred = netb(xb)
                # Calculate loss; CrossEntropyLoss takes raw logits and integer class targets
                loss = loss_function(y_pred,yb)
                # Backpropagate
                loss.backward()
                # Step in the optimizer
                optimizer.step()
                epoch_loss+=loss.item()
                epoch_acc+=(yb == torch.argmax(y_pred,dim=1)).float().mean()
        # Store accuracy on training set at end of training
        training_acc.append((torch.argmax(netb(x_train_t),dim=1)==y_train_t).float().mean().numpy())
        # Store loss on training set at end of training
        training_loss.append(loss_function(netb(x_train_t),y_train_t).item())
        # Store accuracy on testing set at end of training
        testing_acc.append((torch.argmax(netb(x_test_t),dim=1)==y_test_t).float().mean().numpy())

# Create cartesian product of parameters for DataFrame
h1_list = []
h2_list = []
for elem in itertools.product(hid1,hid2):
    h1_list.append(elem[0])
    h2_list.append(elem[1])
        
# Put results in a Data Frame
results = pd.DataFrame({"Neurons H1":h1_list,"Neurons H2":h2_list,"Training Loss":training_loss,
                        "Training Accuracy":training_acc,"Testing Accuracy":testing_acc})

In [None]:
results.head()

In [None]:
results = results.astype("float")
mloc = results["Testing Accuracy"].argmax()
results.iloc[mloc]

The parameter search for the neural network architecture that predicts best on the validation set shows that **H1=17**, **H2=36** give an accuracy of about **79%** on both the training and testing sets. That is, the model performs about as well on seen data as on unseen data.

## Logistic Regression

In [None]:
#Define LogisticRegression class by implementing a linear model with a sigmoid activation layer, use BCELoss for loss function
class LogisticRegression(nn.Module):
    def __init__(self,n_input_features):
        super(LogisticRegression,self).__init__()
        self.model = nn.Sequential(
            nn.Linear(n_input_features,1),
            nn.Sigmoid()
        )
    
    def forward(self,x):
        x=self.model(x)
        return x

In [None]:
n_features = X.shape[1]
lr = LogisticRegression(n_features)
print(lr)

In [None]:
# Define learning parameters
learning_rate=0.001
EPOCHS_LR=300
BATCH_SIZE=32
#Define optimizer
lr_optimizer = optim.Adam(lr.parameters(),lr=learning_rate)
#Define loss function
lr_loss = nn.BCELoss()
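
The model above combines `nn.Sigmoid` with `nn.BCELoss`. PyTorch also provides `nn.BCEWithLogitsLoss`, which takes the raw linear output and applies the sigmoid internally in a numerically more stable way. The sketch below, with made-up logits and targets, only illustrates that the two formulations give the same loss value; it is an optional variant, not the setup used in this notebook.

```python
import torch
from torch import nn

# Made-up raw outputs of a linear layer and float targets
logits = torch.tensor([0.3, -1.2, 2.0])
targets = torch.tensor([1.0, 0.0, 1.0])

# Variant used in this notebook: explicit sigmoid followed by BCELoss
loss_sigmoid_bce = nn.BCELoss()(torch.sigmoid(logits), targets)

# Alternative: the sigmoid is folded into the loss function
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_sigmoid_bce.item(), loss_with_logits.item())
```

With the second variant the model would drop its final `nn.Sigmoid` layer and apply `torch.sigmoid` only when predicting.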

In [None]:
# Define variables as tensors to train the model, BCELoss requires target of type float
x_train_t = torch.tensor(x_train.values, dtype=torch.float)
x_test_t = torch.tensor(x_test.values, dtype=torch.float)
y_train_t = torch.tensor(y_train.values, dtype=torch.float)
y_test_t = torch.tensor(y_test.values, dtype=torch.float)

In [None]:
# Create DataSet and DataLoaders to train the neural network
train_data_set = TensorDataset(x_train_t,y_train_t) 
test_data_set = TensorDataset(x_test_t,y_test_t)

train_dataloader = DataLoader(train_data_set,batch_size = BATCH_SIZE)
test_dataloader = DataLoader(test_data_set,batch_size = BATCH_SIZE)

In [None]:
# List to store batch loss
losses_lr=[]
# Model training
for epoch in range(EPOCHS_LR):
    for xb,yb in train_dataloader:
        # Zero gradients in optimizer
        lr_optimizer.zero_grad()
        # Forward pass on batch
        y_pred = lr(xb)
        y_pred = torch.flatten(y_pred)
        loss = lr_loss(y_pred,yb)
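        # Call item() to store the loss value rather than the bound method
        losses_lr.append(loss.item())
        # Backward pass on batch
        loss.backward()
        #Optimizer step
        lr_optimizer.step()
    if epoch%10==0:
        print("Epoch {:>2d} | Loss {:.4f}".format(epoch,loss.item()))

In [None]:
plt.figure(figsize=(15,5))
# Plot the logistic regression losses (losses_lr), not the neural network losses
sns.lineplot(x=np.arange(1,len(losses_lr)+1,1),y=losses_lr)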
plt.title("Learning curve training logistic regression")
plt.xlabel("Batch iteration")
plt.ylabel("BCELoss")

In [None]:
# Forward pass of trained network
y_pred = lr(x_test_t)
y_pred_class = y_pred.round().flatten()
acc = (y_pred_class == y_test_t).float().mean()
print("Accuracy on classification of validation set is {:.4f}".format(acc))

## Conclusions

The data suggest that there are indicators associated with a higher chance of having a heart attack.

 - With the implemented models, which relate the features to the probability of a person having a heart attack, it is straightforward to reach at least 80% accuracy. Higher accuracy can be achieved by hyperparameter tuning, but it is also important to consider that a type 2 error (a false negative, i.e. a high-risk patient classified as low risk) can be very dangerous for a patient.
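
Since accuracy alone does not show how many high-risk patients are missed, the false negative rate can be reported next to it. The sketch below uses made-up labels and predictions; `false_negative_rate` is a hypothetical helper written for this illustration.

```python
import numpy as np

def false_negative_rate(y_true, y_pred):
    """Share of actual positives (label 1) that were predicted as 0."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    positives = y_true == 1
    false_negatives = np.sum(positives & (y_pred == 0))
    return false_negatives / positives.sum()

# Made-up labels and predictions for six patients
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

fnr = false_negative_rate(y_true, y_pred)
accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
print(fnr, accuracy)
```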