In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 


In [None]:
heart = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
print(heart)

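From the description of the original dataset (the CSV loaded above renames some columns, e.g. trestbps is trtbps, thalach is thalachh, exang is exng, slope is slp, thal is thall, and num is output, and it recodes some categories):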


1. age - age in years

2. sex - sex (1 = male; 0 = female)

3. cp - chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)

4. trestbps - resting blood pressure (in mm Hg on admission to the hospital)

5. chol - serum cholesterol in mg/dl

6. fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

7. restecg - resting electrocardiographic results (0 = normal; 1 = having ST-T; 2 = hypertrophy)

8. thalach - maximum heart rate achieved

9. exang - exercise induced angina (1 = yes; 0 = no)

10. oldpeak - ST depression induced by exercise relative to rest

11. slope - the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)

12. ca - number of major vessels (0-3) colored by fluoroscopy

13. thal - 3 = normal; 6 = fixed defect; 7 = reversible defect

14. num - the predicted attribute - diagnosis of heart disease (angiographic disease status) (Value 0 = < 50% diameter narrowing; Value 1 = > 50% diameter narrowing)

We want to predict who is at greater risk of heart disease. My hunch is that logistic regression is a good fit, since it models binary outcomes directly.
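
As a rough illustration of why logistic regression suits a binary outcome, the sketch below passes a linear score through the logistic function to get a probability, then thresholds it at 0.5. The intercept and slope are hypothetical values chosen for illustration, not coefficients fitted to this dataset.

```python
import numpy as np

def logistic(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a single z-scored predictor (illustrative only)
intercept, slope = 0.2, 0.8

x = np.array([-2.0, 0.0, 2.0])           # z-scored predictor values
p = logistic(intercept + slope * x)       # modeled probability that output == 1
labels = (p >= 0.5).astype(int)           # default 0.5 decision threshold

print(p)
print(labels)
```

Larger positive scores push the probability toward 1 and larger negative scores push it toward 0, which is what lets a linear combination of predictors describe a 0/1 outcome.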

Before we do that, however, let's look at the relationship between our target variable, which indicates the chance of heart disease, and some of our continuous measures.

In [None]:
#We have five continuous predictors: age, resting blood pressure, cholesterol, maximum heart rate, and oldpeak (ST depression). Let's look at how those are distributed across our outcome variable

#Split each predictor into one Series per outcome value
age_data = [heart.age[heart['output']==x] for x in heart.output.unique()]
bp_data = [heart.trtbps[heart['output']==x] for x in heart.output.unique()]
chol_data = [heart.chol[heart['output']==x] for x in heart.output.unique()]
hr_data = [heart.thalachh[heart['output']==x] for x in heart.output.unique()]
oldpeak_data = [heart.oldpeak[heart['output']==x] for x in heart.output.unique()]

#Plot data
fig1,ax1 = plt.subplots(2,3,figsize=(10, 6))
ax1[0,0].violinplot(age_data,heart.output.unique())
ax1[0,0].set_ylabel('Age')
ax1[0,1].violinplot(bp_data,heart.output.unique())
ax1[0,1].set_ylabel('Blood Pressure (mm Hg)')
ax1[0,2].violinplot(chol_data,heart.output.unique())
ax1[0,2].set_ylabel('Cholesterol (mg/dl)')
ax1[1,0].violinplot(hr_data,heart.output.unique())
ax1[1,0].set_ylabel('Max Heart Rate (beats/min)')
ax1[1,1].violinplot(oldpeak_data,heart.output.unique())
ax1[1,1].set_ylabel('Old Peak')


The above figures suggest that most of these measures do not differ substantially across our output variable. Interestingly, maximum heart rate appears to differ the most.
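
One way to put a number on "differs the most" is a standardized mean difference between the two outcome groups. The sketch below is an assumption-laden illustration: it uses a synthetic table with made-up values (only the column names match the notebook), and the helper `standardized_diff` is a hypothetical name, not something from pandas.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the heart data (values are made up for illustration)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "output": rng.integers(0, 2, n),
    "chol": rng.normal(245, 50, n),
})
# Make heart rate depend on the outcome so one predictor clearly separates the groups
toy["thalachh"] = rng.normal(140, 20, n) + 20 * toy["output"]

def standardized_diff(df, col, target="output"):
    """Difference in group means divided by a pooled standard deviation
    (square root of the average of the two group variances)."""
    g0 = df.loc[df[target] == 0, col]
    g1 = df.loc[df[target] == 1, col]
    pooled_sd = np.sqrt((g0.var(ddof=1) + g1.var(ddof=1)) / 2)
    return (g1.mean() - g0.mean()) / pooled_sd

effects = {col: standardized_diff(toy, col) for col in ["chol", "thalachh"]}
print(effects)
```

Applied to the real columns, the predictor with the largest absolute value would be the one whose violins are most separated.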

Before we begin working with the data more directly, let's look at grouping these data using some of the built-in functions of pandas and violin plots from the seaborn package.

In [None]:
import seaborn as sns
fig2,ax2 = plt.subplots(2,3,figsize=(10, 6))
sns.violinplot(ax=ax2[0,0],x="output", y="age", hue="sex", data=heart, palette="Pastel1")
sns.violinplot(ax=ax2[0,1],x="output", y="trtbps", hue="sex", data=heart, palette="Pastel1")
sns.violinplot(ax=ax2[0,2],x="output", y="chol", hue="sex", data=heart, palette="Pastel1")
sns.violinplot(ax=ax2[1,0],x="output", y="thalachh", hue="sex", data=heart, palette="Pastel1")
sns.violinplot(ax=ax2[1,1],x="output", y="oldpeak", hue="sex", data=heart, palette="Pastel1")


Inspection of these figures suggests that our distributions are fairly similar across sexes, which is surprising because we might expect men to be at greater risk of heart disease.

The remaining predictors are all either counts or labels coded in an ordinal fashion. Exploring the full factorial combination of these predictors with our continuous predictors would be exhausting. Instead, let's focus on maximum heart rate, since it appears elevated for individuals at higher risk of heart disease, and see whether it varies with other predictors of interest. Specifically, I am thinking of exercise-induced angina, chest pain type, and resting electrocardiogram.

In [None]:
fig3,ax3 = plt.subplots(1,3,figsize=(20, 6))
sns.violinplot(ax=ax3[0],x="output", y="thalachh", hue="exng", data=heart, palette="Pastel1")
sns.violinplot(ax=ax3[1],x="output", y="thalachh", hue="cp", data=heart, palette="Pastel1")
sns.violinplot(ax=ax3[2],x="output", y="thalachh", hue="restecg", data=heart, palette="Pastel1")


These graphs, much like the previous ones, suggest little variation across our categorical variables. Nonetheless, it is worth noting that maximum heart rate is elevated across the board when someone is at risk of heart disease. It may be particularly high in those with abnormal electrocardiograms and chest pain, but not in those with exercise-induced angina.
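
A small table can complement the violins when checking whether heart rate is elevated "across the board". The sketch below uses a tiny made-up sample with the notebook's column names and tabulates the median maximum heart rate for each combination of output and exng; the numbers are illustrative, not taken from the dataset.

```python
import pandas as pd

# Tiny made-up sample using the notebook's column names
toy = pd.DataFrame({
    "output":   [0, 0, 0, 0, 1, 1, 1, 1],
    "exng":     [0, 0, 1, 1, 0, 0, 1, 1],
    "thalachh": [140, 150, 130, 134, 165, 171, 150, 158],
})

# Rows: outcome; columns: exercise-induced angina; cells: median max heart rate
summary = toy.pivot_table(index="output", columns="exng",
                          values="thalachh", aggfunc="median")
print(summary)
```

If the output == 1 row is higher in every column, the elevation holds at each level of the categorical predictor.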

With these data visualized, I still think a logistic regression is the best approach.

First, we need to set some of our data aside on which to test our model. 

In [None]:
#select training and test data sets
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(heart, test_size=0.3,random_state=42)
print(train_set.restecg.unique(), test_set.restecg.unique())

Let's start by z-scoring our continuous predictors.

In [None]:

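def custom_zscore(data):
    """Return the z-scored values of a Series as an (n, 1) array."""
    reshaped_data = data.to_numpy().reshape(-1,1)
    # np.std defaults to the population standard deviation (ddof=0)
    zscore = (reshaped_data - np.mean(reshaped_data)) / np.std(reshaped_data)
    return zscore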

train_set['zAge'] = custom_zscore(train_set.age)
train_set['zTrtbps'] = custom_zscore(train_set.trtbps)
train_set['zChol'] = custom_zscore(train_set.chol)
train_set['zThalachh'] = custom_zscore(train_set.thalachh)
train_set['zoldpeak'] = custom_zscore(train_set.oldpeak)
print(train_set)

Now let's confirm that these five continuous variables have been z-scored.

In [None]:
fig4,ax4 = plt.subplots(2,3,figsize=(10, 6))
sns.violinplot(ax=ax4[0,0],x="output", y="zAge", data=train_set, palette="Pastel1")
sns.violinplot(ax=ax4[0,1],x="output", y="zTrtbps", data=train_set, palette="Pastel1")
sns.violinplot(ax=ax4[0,2],x="output", y="zChol", data=train_set, palette="Pastel1")
sns.violinplot(ax=ax4[1,0],x="output", y="zThalachh", data=train_set, palette="Pastel1")
sns.violinplot(ax=ax4[1,1],x="output", y="zoldpeak", data=train_set, palette="Pastel1")

Note that this figure looks much like the first one we produced, except now all of these data have the same scale.
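
The same check can be done numerically: after z-scoring, each column should have a mean close to 0 and a standard deviation close to 1. The sketch below uses scikit-learn's `StandardScaler` on made-up values as an alternative to the hand-written function. It also shows the scaler being fitted on training rows only and then reused on held-out rows, which is an assumption about how one might keep both sets on one scale.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration; column names follow the notebook
train = pd.DataFrame({"age": [29, 41, 54, 63, 77],
                      "chol": [180, 210, 240, 300, 410]})
test = pd.DataFrame({"age": [35, 60], "chol": [200, 350]})

scaler = StandardScaler().fit(train)   # learns mean and std (ddof=0) from training rows only
z_train = scaler.transform(train)
z_test = scaler.transform(test)        # reuses the training mean and std

print(z_train.mean(axis=0))   # each entry should be close to 0
print(z_train.std(axis=0))    # each entry should be close to 1
```

Note that the held-out rows will generally not have mean 0 and standard deviation 1 after this transform, because they are scaled with the training statistics.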

Next, we will insert labels for our categorical predictors and then use sklearn's OneHotEncoder to create our dummy codes.

In [None]:
#Let's use one-hot encoding to encode our labeled categories (cp, restecg, slp, thall)
from sklearn.preprocessing import OneHotEncoder
cpcat_encoder = OneHotEncoder()
resetecgcat_encoder = OneHotEncoder()

train_set.loc[train_set['cp'] == 0,'cp'] = 'Typical'
train_set.loc[train_set['cp'] == 1,'cp'] = 'Atypical'
train_set.loc[train_set['cp'] == 2,'cp'] = 'Non-anginal'
train_set.loc[train_set['cp'] == 3,'cp'] = 'Asymptomatic'

train_set.loc[train_set['restecg'] == 0,'restecg'] = 'Normal'
train_set.loc[train_set['restecg'] == 1,'restecg'] = 'ST-T'
train_set.loc[train_set['restecg'] == 2,'restecg'] = 'LeftVent'

train_set.loc[train_set['slp'] == 0,'slp'] = 'UpSlope'
train_set.loc[train_set['slp'] == 1,'slp'] = 'FlatSlope'
train_set.loc[train_set['slp'] == 2,'slp'] = 'DownSlope'

train_set.loc[train_set['thall'] == 1,'thall'] = 'normal'
train_set.loc[train_set['thall'] == 2,'thall'] = 'fixed'
train_set.loc[train_set['thall'] == 3,'thall'] = 'defect'

cp_onehot = cpcat_encoder.fit_transform(train_set.cp.values.reshape(-1,1))
restecg_onehot = resetecgcat_encoder.fit_transform(train_set.restecg.values.reshape(-1,1))
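# Encode slp and thall from their own columns. Fixed category lists keep the dummy
# columns in a known order; any thall code without a label is encoded as all zeros.
slp_onehot = OneHotEncoder(categories=[['UpSlope','FlatSlope','DownSlope']], handle_unknown='ignore').fit_transform(train_set.slp.astype(str).values.reshape(-1,1))
thall_onehot = OneHotEncoder(categories=[['normal','fixed','defect']], handle_unknown='ignore').fit_transform(train_set.thall.astype(str).values.reshape(-1,1))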

cp_dummy = cp_onehot.toarray()
restecg_dummy = restecg_onehot.toarray()
slp_dummy = slp_onehot.toarray()
thall_dummy = thall_onehot.toarray()


#Pass data back into dataframe
train_set.insert(1,'cp0',cp_dummy[:,0])
train_set.insert(1,'cp1',cp_dummy[:,1])
train_set.insert(1,'cp2',cp_dummy[:,2])
train_set.insert(1,'cp3',cp_dummy[:,3])

train_set.insert(1,'re0',restecg_dummy[:,0])
train_set.insert(1,'re1',restecg_dummy[:,1])
train_set.insert(1,'re2',restecg_dummy[:,2])

train_set.insert(1,'slp0',slp_dummy[:,0])
train_set.insert(1,'slp1',slp_dummy[:,1])
train_set.insert(1,'slp2',slp_dummy[:,2])

train_set.insert(1,'thall0',thall_dummy[:,0])
train_set.insert(1,'thall1',thall_dummy[:,1])
train_set.insert(1,'thall2',thall_dummy[:,2])
print(train_set)

Now we will drop all the original columns for which we have transformed data, and then fit a cross-validated logistic regression model.

To measure its performance, I computed the probability of a hit, correct rejection, miss, and false alarm. We want the hit and correct rejection rates to be substantially greater than the miss and false alarm rates.
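
These four rates can be read off a confusion matrix. The following is a minimal sketch with made-up labels and predictions (not the notebook's results), treating output == 1 as the signal to be detected.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 0])

# With labels=[0, 1], ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

hit_rate = tp / (tp + fn)                 # true positives among actual positives
miss_rate = fn / (tp + fn)                # false negatives among actual positives
correct_rejection_rate = tn / (tn + fp)   # true negatives among actual negatives
false_alarm_rate = fp / (tn + fp)         # false positives among actual negatives

print(hit_rate, miss_rate, correct_rejection_rate, false_alarm_rate)
```

The hit and miss rates sum to 1, as do the correct rejection and false alarm rates, so reporting one from each pair is enough.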

In [None]:
#  drop non-transformed vars
cleaned_train_set = train_set.drop(['age', 'cp', 'trtbps', 'restecg', 'chol', 'thalachh', 'oldpeak', 'slp', 'thall'],axis=1)
print(cleaned_train_set)

from sklearn.linear_model import LogisticRegressionCV


X = cleaned_train_set.drop(['output'],axis=1)
y = cleaned_train_set.output
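#Fit with 5-fold cross-validation over the regularization strength
clf = LogisticRegressionCV(cv=5,random_state=42).fit(X,y)
#Mean accuracy on the training data
print(clf.score(X,y))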


The model seems to have done a good job. Our hit and correct rejection rates are around 80%, and our miss and false alarm rates are below 20%.

Now let's do the same transform on our test data and see how the model does. 
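
One way to guarantee that the test data receive exactly the transform learned from the training data is to bundle the preprocessing and the model in a scikit-learn `Pipeline`. The sketch below is an alternative arrangement, not the approach used in this notebook, and it runs on synthetic data whose values and generating rule are made up; only the column names are borrowed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "thalachh": rng.normal(150, 22, n),
    "oldpeak": rng.exponential(1.0, n),
    "cp": rng.integers(0, 4, n),
})
score = 0.05 * (toy["thalachh"] - 150) - 0.8 * (toy["oldpeak"] - 1) + 0.5 * (toy["cp"] > 0)
toy["output"] = (score + rng.normal(0, 1, n) > 0).astype(int)

X = toy.drop(columns="output")
y = toy["output"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale continuous columns and one-hot encode categorical ones in a single step
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["thalachh", "oldpeak"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["cp"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegressionCV(cv=5, max_iter=1000, random_state=42)),
])

model.fit(X_train, y_train)                   # scaler and encoder are fitted on training rows only
test_accuracy = model.score(X_test, y_test)   # the same fitted transform is applied to test rows
print(test_accuracy)
```

Because the scaler and encoder live inside the pipeline, there is no separate test-set transformation code to keep in sync with the training code.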

In [None]:
#transform and clean test set

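#z-score with the training set's mean and standard deviation so both sets share one scale
def zscore_with_train(col):
    return (test_set[col] - train_set[col].mean()) / train_set[col].std(ddof=0)

test_set['zAge'] = zscore_with_train('age')
test_set['zTrtbps'] = zscore_with_train('trtbps')
test_set['zChol'] = zscore_with_train('chol')
test_set['zThalachh'] = zscore_with_train('thalachh')
test_set['zoldpeak'] = zscore_with_train('oldpeak')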

#one hot encoding

test_set.loc[test_set['cp'] == 0,'cp'] = 'Typical'
test_set.loc[test_set['cp'] == 1,'cp'] = 'Atypical'
test_set.loc[test_set['cp'] == 2,'cp'] = 'Non-anginal'
test_set.loc[test_set['cp'] == 3,'cp'] = 'Asymptomatic'

test_set.loc[test_set['restecg'] == 0,'restecg'] = 'Normal'
test_set.loc[test_set['restecg'] == 1,'restecg'] = 'ST-T'
test_set.loc[test_set['restecg'] == 2,'restecg'] = 'LeftVent'

test_set.loc[test_set['slp'] == 0,'slp'] = 'UpSlope'
test_set.loc[test_set['slp'] == 1,'slp'] = 'FlatSlope'
test_set.loc[test_set['slp'] == 2,'slp'] = 'DownSlope'

test_set.loc[test_set['thall'] == 1,'thall'] = 'normal'
test_set.loc[test_set['thall'] == 2,'thall'] = 'fixed'
test_set.loc[test_set['thall'] == 3,'thall'] = 'defect'

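# Reuse the encoders fitted on the training set rather than refitting them on test data
cp_onehot = cpcat_encoder.transform(test_set.cp.values.reshape(-1,1))
restecg_onehot = resetecgcat_encoder.transform(test_set.restecg.values.reshape(-1,1))
# Encode slp and thall from their own columns. Fixed category lists keep the dummy
# columns in a known order; any thall code without a label is encoded as all zeros.
slp_onehot = OneHotEncoder(categories=[['UpSlope','FlatSlope','DownSlope']], handle_unknown='ignore').fit_transform(test_set.slp.astype(str).values.reshape(-1,1))
thall_onehot = OneHotEncoder(categories=[['normal','fixed','defect']], handle_unknown='ignore').fit_transform(test_set.thall.astype(str).values.reshape(-1,1))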

cp_dummy = cp_onehot.toarray()
restecg_dummy = restecg_onehot.toarray()
slp_dummy = slp_onehot.toarray()
thall_dummy = thall_onehot.toarray()


#Pass data back into dataframe
test_set.insert(1,'cp0',cp_dummy[:,0])
test_set.insert(1,'cp1',cp_dummy[:,1])
test_set.insert(1,'cp2',cp_dummy[:,2])
test_set.insert(1,'cp3',cp_dummy[:,3])

test_set.insert(1,'re0',restecg_dummy[:,0])
test_set.insert(1,'re1',restecg_dummy[:,1])
test_set.insert(1,'re2',restecg_dummy[:,2])

test_set.insert(1,'slp0',slp_dummy[:,0])
test_set.insert(1,'slp1',slp_dummy[:,1])
test_set.insert(1,'slp2',slp_dummy[:,2])

test_set.insert(1,'thall0',thall_dummy[:,0])
test_set.insert(1,'thall1',thall_dummy[:,1])
test_set.insert(1,'thall2',thall_dummy[:,2])

#drop what's no longer needed
cleaned_test_set = test_set.drop(['age', 'cp', 'trtbps', 'restecg', 'chol', 'thalachh', 'oldpeak', 'slp', 'thall'],axis=1)
print(cleaned_test_set)


In [None]:
#test model
X_test = cleaned_test_set.drop(['output'],axis=1)
y_test = cleaned_test_set.output

print(clf.score(X_test,y_test))

#F1 score on the test set
from sklearn.metrics import f1_score
y_pred = clf.predict(X_test)
print(f1_score(y_test,y_pred))




In [None]:
#Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
print(cm)


Overall the model's performance generalizes fairly well to the test set, with little indication of a systematic error in our modeling exercise. 

We also see that the number of true positives and true negatives is substantially higher than the false negatives and false positives. 