# Beginner (first) data science project on understanding what leads to heart diseases

## Introduction

Heart disease is the leading cause of death for both men and women. In the US, about 610,000 people die from heart disease every year, which is equivalent to 1 in 4 deaths. Coronary heart disease, in which blood flow to the heart is reduced, is the most common type and causes about 370,000 deaths every year. Several physiological factors can give rise to heart disease, such as high cholesterol levels and high blood pressure (Centers for Disease Control and Prevention, 2015).

The dataset presented in this notebook contains patient records with various clinical variables, as well as a target variable that specifies whether the arteries leading to a patient's heart are narrowed by more than 50% in diameter. The 50% threshold indicates significant narrowing, and is a sign for cardiologists and other physicians to perform further diagnosis (Harris et al, 1980). This notebook analyzes the different variables in the dataset and answers the following questions:

1. How much are the individual variables associated with diameter narrowing? (Related findings highlighted in blue)
2. How do the variables differ between males and females? (Related findings highlighted in yellow)

In addition, this notebook trains a decision tree to predict whether or not a patient has >50% narrowing, and compares the tree before and after tuning its hyperparameters.

## Analysis

In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy as sci
import scipy.stats  # load the stats submodule explicitly so sci.stats.ttest_ind is available
import seaborn as sns
import math
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc
from graphviz import Source
from IPython.display import Image  
from sklearn.tree import export_graphviz
%matplotlib inline

In [None]:
heart = pd.read_csv('../input/heart-disease-uci/heart.csv')
heart.head()

In [None]:

print(heart.shape)

In [None]:
heart.describe()

There are 14 columns in this dataset. Their definitions are listed below: <br>
- __age:__ The person's age in years <br>
- __sex:__ The person's sex (1 = male, 0 = female)<br>
- __cp:__ The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)<br>
- __trestbps:__ The person's resting blood pressure (mm Hg on admission to the hospital)<br>
- __chol:__ The person's cholesterol measurement in mg/dl<br>
- __fbs:__ The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)<br>
- __restecg:__ Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)<br>
- __thalach:__ The person's maximum heart rate achieved<br>
- __exang:__ Exercise induced angina (1 = yes; 0 = no)<br>
- __oldpeak:__ ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot.)<br>
- __slope:__ the slope of the peak exercise ST segment (Value 1: downsloping, Value 2: flat, Value 3: upsloping)<br>
- __ca:__ The number of major vessels (0-3)<br>
- __thal:__ A blood disorder called thalassemia (1 = fixed defect, 2 = normal, 3 = reversed defect)<br>
- __target:__ Angiographic disease status (0 = >50% diameter narrowing, 1 = <50% diameter narrowing)<br>

In [None]:
heart.dtypes

Note that sex, cp, fbs, restecg, exang, slope, ca, thal, and target are categorical types. 

### Check duplicate entries


In [None]:
dup_count = heart.duplicated()
heart[dup_count]

Row 164 duplicates another row. Let's examine it in more detail.

In [None]:
dup = heart.loc[heart['age'] == 38]
dup

Entries 163 and 164 have identical values. Since the two rows are right next to each other, it is likely that they both come from one person and the data was accidentally copied. Entry 164 should be deleted.
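
As a hedged alternative to removing the row by its index label, pandas can drop exact duplicates directly. The sketch below uses a small made-up frame (`toy` is hypothetical, not the heart data) to show `duplicated` and `drop_duplicates`, which both treat the first occurrence as the original by default.

```python
import pandas as pd

# Hypothetical toy frame; rows 1 and 2 are exact copies of each other.
toy = pd.DataFrame({
    'age': [52, 38, 38, 61],
    'chol': [212, 175, 175, 240],
})

# duplicated() flags every repeat after the first occurrence.
dup_mask = toy.duplicated()
print(toy[dup_mask])

# drop_duplicates() removes the flagged rows and keeps the original index labels.
deduped = toy.drop_duplicates()
print(deduped.shape)
```

Because the surviving rows keep their original labels, later label-based operations such as `drop` still refer to the same rows.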

In [None]:
heart = heart.drop(164)  # drop the second copy flagged by duplicated(); row 163 is kept
heart.shape

### Check null values

In [None]:
null_count = heart.isnull().sum()
null_count

There are no null values, so the dataset appears to be clean.

Categorical columns 'ca' and 'thal' have more distinct values than the dataset description lists. 'ca' is supposed to have 4 values (0-3), while 'thal' is supposed to have 3 (1 = fixed defect, 2 = normal, 3 = reversed defect).

In [None]:
print('ca value counts')
print(heart['ca'].value_counts())
print('\nthal value counts')
print(heart['thal'].value_counts())

In [None]:
heart[heart['ca'] == 4]

In [None]:
heart[heart['thal']==0]

Remove the rows where ca == 4 or thal == 0.
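
One way to express this filter without hard-coding the bad values is to list the values each column is allowed to take and keep only the rows that match. This is a sketch on a hypothetical toy frame; `allowed` is an illustrative name, and its contents follow the ranges quoted above (`ca` in 0-3, `thal` in 1-3).

```python
import pandas as pd

# Hypothetical toy frame containing one invalid 'ca' and one invalid 'thal'.
toy = pd.DataFrame({
    'ca':   [0, 4, 2, 1, 3],
    'thal': [2, 3, 0, 1, 2],
})

# Allowed codes per column, taken from the ranges in the data dictionary.
allowed = {'ca': [0, 1, 2, 3], 'thal': [1, 2, 3]}

# Build a boolean mask that is True only when every checked column is valid.
valid = pd.Series(True, index=toy.index)
for col, values in allowed.items():
    valid &= toy[col].isin(values)

cleaned = toy[valid]
print(cleaned)
```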

In [None]:
heart = heart.drop(list(heart[heart['ca'] == 4].index))
heart = heart.drop(list(heart[heart['thal'] == 0].index))
heart.shape

### Examine distribution of age and sex first

In [None]:
male_age_dist = heart.loc[heart['sex'] == 1]['age'] #Male
female_age_dist = heart.loc[heart['sex'] == 0]['age'] #Female
sns.distplot(male_age_dist, kde = False, label = 'Male', \
             hist_kws={"histtype": "step", "linewidth": 2, "alpha": 0.5, "color": "g"})
sns.distplot(female_age_dist, kde = False, label = 'Female', \
             hist_kws={"histtype": "step", "linewidth": 2, "alpha": 0.5, "color": "b"})
plt.legend()
plt.title('Male vs Female Age distribution')
plt.show()

There are a lot more males than females in the dataset. A large proportion of the females are between the ages of 50 and 65.

In [None]:
heart_corr = heart.corr()
heart_corr.style.background_gradient(cmap = plt.get_cmap('bwr'))

In [None]:
g = sns.FacetGrid(heart, col = 'restecg', row = 'target', hue = 'sex', legend_out = True, height = 3)
g.map(sns.kdeplot, 'age', shade = True).add_legend()

plt.show()

It appears that most patients with >50% narrowing have abnormal ST-T waves at rest. Let's examine the ST depression values of patients after exercise.

In [None]:
g4 = sns.FacetGrid(heart, col = 'restecg', row = 'target', hue = 'sex', legend_out = True, height = 3)
g4.map(plt.scatter, 'age', 'oldpeak', facecolors = 'none').add_legend()
plt.show()

<div class="alert alert-block alert-info"> A larger proportion of individuals with &#60;50% narrowing have ST depression values near 0. Conversely, those with &#62;50% narrowing have higher values of ST depression.</div>

The ST segment represents the end of ventricular depolarization, and it is located between the QRS complex and the T wave (Kashou et al). This region is expected to be flat, and a depression of >1 mm is considered abnormal (Hill and Timmis). Elevation or depression of the ST segment could both indicate abnormalities. A depressed ST segment could indicate hypokalemia or cardiac ischemia (Kashou et al).

Now let's examine the relationship between sex and chest pain type.

In [None]:
g = sns.FacetGrid(heart, col = 'cp', row = 'target', hue = 'sex', legend_out = True, height = 3)
g.map(sns.kdeplot, 'age', shade = True).add_legend()
plt.legend()
plt.show()

There were a lot more females between the ages of 55 and 60 who had chest pain type 1 (typical angina) and a significant heart condition (>50% narrowing). Interestingly, females with chest pain types 2 and 3 had no significant heart condition. Recall the chest pain types: 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic.

According to Hermann et al, typical angina is "(1) Substernal chest pain or discomfort that is (2) Provoked by exertion or emotional stress and (3) relieved by rest or nitroglycerine (or both)" (Hermann et al). Atypical angina is when 2 of the 3 criteria are met. Asymptomatic chest pain is when the chest pain is neither causing nor exhibiting symptoms ("The MSDS HyperGlossary: Asymptomatic", n.d.). In my understanding, this could be due to problems in other organs near the heart, e.g. lung disorders.

In [None]:
g2 = sns.FacetGrid(heart, col = 'cp', row = 'target', hue = 'sex', legend_out = True, height = 3)
g2.map(plt.scatter, 'age', 'chol', facecolors = 'none').add_legend()

plt.show()

A closer look at the plot reveals that only 2 females have chest pain type 1 with significant heart conditions. This explains the huge peak for target = 0 | cp = 1 for females in the previous plot. Also note that one female had chest pain type 2 and also had a significant heart condition. Although it appears that males are much more likely to have >50% narrowing, this could be due to the limited sample size for females.

<div class="alert alert-block alert-warning"> It is difficult to draw conclusions on chest pain differences between males and females with &gt;50% narrowing due to the limited sample size for females. However, for those with &lt;50% narrowing, there appears to be no difference in chest pain between males and females.</div>

Theoretically, people with heart conditions should have higher cholesterol levels. It is difficult to see whether this is the case from the plot above. Let's use a boxplot instead. We can use the groupby dataframe to split the data into 2 groups (target = 0 and target = 1)

In [None]:
heart_male = heart[heart['sex'] == 1].groupby('target')[['trestbps', 'chol', 'thalach', 'oldpeak']]
heart_female = heart[heart['sex'] == 0].groupby('target')[['trestbps', 'chol', 'thalach', 'oldpeak']]

In [None]:
heart_male.agg(['mean', 'max', 'min'])

In [None]:
heart_female.agg(['mean', 'max', 'min'])

The values in the groupby dataframes support a difference in cholesterol level between males and females. They also reveal that females have a slightly higher resting blood pressure than males when a heart condition is present. Let's construct boxplots to visualize this better.

In [None]:
fig = plt.figure(figsize = (15,5))
fig.suptitle('Male')
ax = fig.subplots(nrows = 1, ncols = 2, sharey = True)
heart_male.boxplot(column = ['trestbps', 'chol', 'thalach'], rot = 90, ax = ax)
ax[0].set_title('>50% Narrowing')
ax[1].set_title('<50% Narrowing')

In [None]:
fig = plt.figure(figsize = (15,5))
fig.suptitle('Female')
ax = fig.subplots(nrows = 1, ncols = 2, sharey = True)
heart_female.boxplot(column = ['trestbps', 'chol', 'thalach'], rot = 90, ax = ax)
ax[0].set_title('>50% Narrowing')
ax[1].set_title('<50% Narrowing')

In [None]:
fig = plt.figure(figsize = (15,5))
ax = fig.subplots(nrows = 1, ncols = 2, sharey = True)
heart_male.boxplot(column = ['oldpeak'], rot = 90, ax = ax)
ax[0].set_title('>50% Narrowing')
ax[1].set_title('<50% Narrowing')

In [None]:
fig = plt.figure(figsize = (15,5))
ax = fig.subplots(nrows = 1, ncols = 2, sharey = True)
heart_female.boxplot(column = ['oldpeak'], rot = 90, ax = ax)
ax[0].set_title('>50% Narrowing')
ax[1].set_title('<50% Narrowing')

<div class="alert alert-block alert-warning">The boxplot and groupby dataframes support that females have higher cholesterol levels compared to males.</div> 

This matches a study from the Baker Institute (Baker Institute, 2010).

<div class="alert alert-block alert-warning">ST depression levels appear to be similar between males and females, and quite different between the two diameter narrowing groups.</div>

Now let's compare ST depression with chest pain type again, this time without splitting by sex.

In [None]:
g5 = sns.FacetGrid(heart, col = 'restecg', row = 'target', hue = 'cp', legend_out = True, height = 3)
g5.map(plt.scatter, 'thalach', 'oldpeak', facecolors = 'none').add_legend()

plt.show()

<div class="alert alert-block alert-info">It appears that typical angina chest pain is strongly associated with &gt;50% narrowing. This makes intuitive sense, as angina is pain due to reduced blood flow to the heart. Non-anginal chest pain is more closely associated with &lt;50% diameter narrowing and a minimal ST depression value.</div>

Now let's look at the thallium stress test (thal). The thallium stress test involves injecting a nuclear tracer, thallium, into the patient's bloodstream and putting stress on the heart by having the patient exercise. Healthy heart muscle cells should absorb the tracer, which can then be detected through nuclear imaging. If the heart muscle absorbs the tracer at rest but not during exercise, the patient is classified as having a "reversible defect". If the tracer is not absorbed at all, the patient has a "fixed defect" ("Cardiac Stress Testing Review", n.d.).

Note that the data does not match the given data dictionary. The correct coding should be (1 = fixed defect, 2 = normal, 3 = reversed defect), according to the plots from Luc Demortier's GitHub post (Demortier, 2015).

In [None]:
g6 = sns.FacetGrid(heart, col = 'thal', row = 'target', hue = 'cp', legend_out = True, height = 3)
g6.map(plt.scatter, 'thalach', 'oldpeak', facecolors = 'none').add_legend()

plt.show()

<div class="alert alert-block alert-info">It appears that both the thallium stress test and chest pain type are good indicators of whether a patient has &gt;50% narrowing. Most patients who had a fixed defect also had &gt;50% narrowing and anginal chest pain. On the other hand, most patients who had normal results from the thallium stress test also had &lt;50% narrowing, non-anginal chest pain, and a low ST depression value.</div>
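
The plots suggest an association but do not quantify it. One common option, which this notebook does not use, is a chi-square test of independence on a contingency table. The sketch below runs it on made-up counts (`toy` and its values are hypothetical), so the printed statistic says nothing about the real data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: a categorical test result and a binary target.
toy = pd.DataFrame({
    'thal':   [1] * 20 + [2] * 20,
    'target': [0] * 16 + [1] * 4 + [0] * 5 + [1] * 15,
})

# Contingency table of counts: rows are thal codes, columns are target values.
table = pd.crosstab(toy['thal'], toy['target'])
print(table)

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the counts expected if the two variables were independent.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f'chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}')
```

The same pattern would apply to any pair of categorical columns, for example `cp` against `target`, provided the expected counts in each cell are not too small.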

Next, let's evaluate the fluoroscopy results and see how they relate to diameter narrowing. Fluoroscopy highlights the blood vessels using X-ray imaging with a contrast dye.

In [None]:
g6 = sns.FacetGrid(heart, col = 'ca', row = 'target', hue = 'cp', legend_out = True, height = 3)
g6.map(plt.scatter, 'thalach', 'oldpeak', facecolors = 'none').add_legend()

plt.show()

<div class="alert alert-block alert-info">It appears that the cardiac angiography results are slightly correlated with diameter narrowing: individuals with 1 or more arteries colored are more likely to have &gt;50% narrowing. However, patients with &gt;50% narrowing appear to have equal chances of having 0, 1, 2, or 3 arteries colored through fluoroscopy.</div>

Now let's explore the columns related to exercise, in particular exercise-induced angina and slope. The scatterplot will no longer plot ST depression, because 'slope' covers both ST depression and elevation.

In [None]:
g7 = sns.FacetGrid(heart, col = 'exang', row = 'target', hue = 'slope', legend_out = True, height = 3)
g7.map(plt.scatter, 'age', 'thalach', facecolors = 'none').add_legend()
plt.show()

After exercise, the ST segment should slope slightly upwards (in the data, 0 = downsloping, 1 = flat, 2 = upsloping); however, an elevation of 1 mm is considered abnormal (Hill and Timmis).

<div class="alert alert-block alert-info">It can be seen that upsloping correlates with &lt;50% narrowing and no exercise-induced angina, and vice versa for a flat slope. Interestingly, downsloping can be seen for both &lt;50% and &gt;50% narrowing. However, this may be due to a limited sample size.</div>

In this plot it is more evident that patients with <50% narrowing have a higher heart rate than those with >50% narrowing, especially when they do not have exercise induced angina.

Earlier on, we discovered differences in blood pressure between males and females. Let's take a closer look.

In [None]:
sns.lmplot(x = 'age', y = 'trestbps', hue = 'sex', data = heart, col = 'target', markers=["o", "x"], palette="Set1")

The plot reveals that blood pressure increases with age for both males and females, and that males with &lt;50% narrowing had slightly higher blood pressure than females. This matches early studies comparing the blood pressure of healthy males and females, which showed that healthy males had slightly higher blood pressure than healthy females (Wiinberg et al).

Let's do a t-test to further confirm this.

In [None]:
mean_male_bp = heart[heart['sex']==1]['trestbps'].mean()
mean_female_bp = heart[heart['sex']==0]['trestbps'].mean()
print(mean_male_bp)
print(mean_female_bp)

In [None]:
bp_m_f = sci.stats.ttest_ind(heart[heart['sex']==1]['trestbps'], heart[heart['sex']==0]['trestbps'])
bp_m_f

<div class="alert alert-block alert-warning"> The t-test gives a p-value of 0.31, which is greater than 0.05, so the difference in blood pressure between males and females in this sample is not statistically significant. </div>
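
`ttest_ind` assumes equal variances in the two groups by default. When the groups differ a lot in size, as the male and female groups do here, Welch's version (`equal_var=False`) is a common, more cautious choice. The sketch below uses synthetic numbers (the means, spreads, and group sizes are invented for illustration), so its output is not a result about this dataset.

```python
import numpy as np
from scipy import stats

# Synthetic resting blood pressure values; all parameters are hypothetical.
rng = np.random.default_rng(0)
bp_male = rng.normal(loc=131, scale=17, size=200)
bp_female = rng.normal(loc=133, scale=19, size=90)

# Student's t-test (pooled variance) versus Welch's t-test (no equal-variance assumption).
student = stats.ttest_ind(bp_male, bp_female)
welch = stats.ttest_ind(bp_male, bp_female, equal_var=False)

print(f'Student: t = {student.statistic:.3f}, p = {student.pvalue:.3f}')
print(f'Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.3f}')
```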

For those who had >50% narrowing, females appeared to have a higher resting blood pressure than males. However, this is limited by the small sample size: most of the female samples lie between 50 and 70 years of age, with only one female sample younger than 50.

In [None]:
sns.lmplot(x = 'age', y = 'thalach', hue = 'sex', data = heart, col = 'target', markers=["o", "x"], palette="Set1")

Heart rate decreases with age for individuals with <50% narrowing. The reasons for this are still being researched; however, it is partially due to the reduced firing of sinoatrial myocytes (SAMs) (Larson et al), which are responsible for generating the action potentials that initiate a heartbeat. For individuals with >50% narrowing, the data is once again affected by the limited number of younger female samples.

The above 2 plots showing resting blood pressure and maximum heart rate both had the same outlier for target = 0. Let's find out where this datapoint is located in the dataframe.

In [None]:
heart.loc[(heart['sex'] == 0) & (heart['age'] < 45) & (heart['target'] == 0)]

There was also an outlier in cholesterol level among the female samples. Let's find this datapoint.

In [None]:
heart.loc[heart['chol'] == max(heart['chol'])]

Let's delete these outliers to prevent them from affecting the machine learning predictions.
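
The two rows were picked out by eye from the plots. A more systematic, though still heuristic, option is the 1.5 x IQR rule that boxplots use for their whiskers. The helper below (`iqr_outliers` is a hypothetical name) is a sketch on invented cholesterol values.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Invented cholesterol values with one extreme reading.
chol = pd.Series([210, 225, 240, 198, 260, 233, 564, 245])
mask = iqr_outliers(chol)
print(chol[mask])
```

A flagged value is only a candidate for removal; whether it is a measurement error or a genuine extreme reading still needs a judgment call.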

In [None]:
heart = heart.drop([85, 215])
heart.shape

## Machine Learning Predictions

Convert the categorical columns into object types instead of integers.

In [None]:
shuffled_index = np.random.permutation(heart.index)
train_max_row = math.floor(heart.shape[0] * .7)

heart_dt = heart.copy()
# Cast the categorical columns to object so that get_dummies one-hot encodes them
categorical_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']
heart_dt[categorical_cols] = heart_dt[categorical_cols].astype('object')

In [None]:
heart_dt = pd.get_dummies(heart_dt, drop_first=True)
heart_dt.head()

In [None]:
columns = list(heart_dt.columns)
columns.remove('target')
columns

Split the data into training and test sets.
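
The manual shuffle used in this notebook (`np.random.permutation` with no seed) gives a different split on every run. A hedged alternative is scikit-learn's `train_test_split`, which can fix the seed and keep the class balance similar in both parts. The data below is a hypothetical stand-in for `heart_data` and `heart_label`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the feature table and label column.
rng = np.random.default_rng(42)
features = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
labels = pd.Series([0] * 40 + [1] * 60, name='target')

# 70/30 split; random_state makes it repeatable, stratify preserves the 40/60 class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels
)

print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True))
```

Fixing `random_state` also means that scores reported in the text stay reproducible when the notebook is rerun.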

In [None]:
heart_dt = heart_dt.reindex(shuffled_index)
heart_data = heart_dt[columns]
heart_label = heart_dt['target']
train_heart_data = heart_data.iloc[:train_max_row]
test_heart_data = heart_data.iloc[train_max_row:]
train_heart_label = heart_label.iloc[:train_max_row]
test_heart_label = heart_label.iloc[train_max_row:]

In [None]:
print(heart_data.shape)
print(heart_label.shape)
print(train_heart_data.shape)
print(test_heart_data.shape)
print(train_heart_label.shape)
print(test_heart_label.shape)

In [None]:
train_heart_label.value_counts()

### Decision tree classifier:

In [None]:
clf = DecisionTreeClassifier()
clf.fit(train_heart_data, train_heart_label)
predictions_train = clf.predict(train_heart_data)
predictions = clf.predict(test_heart_data)
# roc_auc_score returns a score (higher is better), computed here from hard class predictions
auc_train = roc_auc_score(train_heart_label, predictions_train)
auc_test = roc_auc_score(test_heart_label, predictions)
print(auc_train)
print(auc_test)

The model is clearly overfitting the training data. The training set has a ROC AUC score of 100%, while the test set has a ROC AUC score of only 74%. We need to try tuning some hyperparameters. However, let's first visualize the current decision tree.

In [None]:
export_graphviz(clf, out_file='tree_limited.dot', feature_names = columns,
                rounded = True, proportion = False, precision = 2, filled = True)
!dot -Tpng tree_limited.dot -o tree_limited.png -Gdpi=600
Image(filename = 'tree_limited.png')

Currently, the decision tree has a max depth of 10, nodes keep being split even when they hold fewer than 3 samples, and leaves with only 1 or 2 samples are present. Let's change these parameters.
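
The values in the next cell were chosen by hand. A hedged alternative is a cross-validated grid search over the same three parameters. The sketch below runs on synthetic data from `make_classification` (not the heart data), and the grid values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the training set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate values for the three parameters discussed above (illustrative only).
param_grid = {
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 4, 8],
}

# 5-fold cross-validation, scored by ROC AUC.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring='roc_auc',
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the search only ever sees the training data, the held-out test set can still be used afterwards for an unbiased final score.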

In [None]:
clf = DecisionTreeClassifier(max_depth = 7, min_samples_split = 5, min_samples_leaf = 4)
clf.fit(train_heart_data, train_heart_label)
predictions_train = clf.predict(train_heart_data)
predictions = clf.predict(test_heart_data)
predictions_proba = clf.predict_proba(test_heart_data)[:,1]
# roc_auc_score returns a score (higher is better), computed here from hard class predictions
auc_train = roc_auc_score(train_heart_label, predictions_train)
auc_test = roc_auc_score(test_heart_label, predictions)
print(auc_train)
print(auc_test)

The test score improved slightly. Let's see the new decision tree.

In [None]:
export_graphviz(clf, out_file='tree_limited.dot', feature_names = columns,
                rounded = True, proportion = False, precision = 2, filled = True)
!dot -Tpng tree_limited.dot -o tree_limited.png -Gdpi=600
Image(filename = 'tree_limited.png')

In [None]:
fpr, tpr, thresholds = roc_curve(test_heart_label, predictions_proba)

fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve for heart disease classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

## Discussion

This notebook analyzed the factors associated with artery diameter narrowing, which is an indication of significant heart disease, and how the factors differed between males and females. It appeared that some diagnostic procedures, such as ECG results and thallium stress tests, are highly indicative of diameter narrowing. Interestingly, physiological factors such as cholesterol level and blood pressure do not provide much insight for predicting heart disease; however, this may be affected by the limited data for females. The biggest difference between males and females discovered in this notebook is cholesterol level, where females have much higher cholesterol levels than males. Note that cholesterol can be in the form of <u>low density lipoprotein (LDL)</u> or <u>high density lipoprotein (HDL)</u>. LDL is harmful to the body as it leads to plaque building up in arteries, whereas HDL is beneficial. Females have higher cholesterol levels because oestrogen boosts HDL levels; however, HDL levels fall and LDL levels rise after menopause (Michos, E.D.). Therefore, a better way to show cholesterol level would be to show the LDL and HDL levels individually.

In addition to understanding the factors affecting heart disease, this notebook also constructed a decision tree model to classify whether a patient has significant heart disease. The model only achieved a ROC AUC score of 0.78. Although the prediction could be affected by limited data, other machine learning and statistical models can be used to determine the best way to predict heart disease.
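
With limited data, a single 70/30 split gives a noisy estimate of performance. One hedged way to check how stable the score is would be k-fold cross-validation, sketched below on synthetic data (the dataset and its size are stand-ins, so the printed numbers are not results for this notebook).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the sample and feature counts are arbitrary assumptions.
X, y = make_classification(n_samples=300, n_features=13, random_state=1)

# Same hand-picked tree parameters as the tuned model in this notebook.
clf = DecisionTreeClassifier(max_depth=7, min_samples_split=5,
                             min_samples_leaf=4, random_state=1)

# One ROC AUC value per fold; the spread shows how much a single split can vary.
scores = cross_val_score(clf, X, y, cv=5, scoring='roc_auc')
print(scores.round(3))
print(f'mean = {scores.mean():.3f}, std = {scores.std():.3f}')
```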

## References

Cardiac Stress Testing Review. Retrieved from https://www.healio.com/cardiology/learn-the-heart/cardiology-review/topic-reviews/stress-testing-review

Centers for Disease Control and Prevention. "Underlying Cause of Death 1999–2013 on CDC WONDER Online Database, released 2015. Data are from the Multiple Cause of Death Files, 1999–2013, as compiled from data provided by the 57 vital statistics jurisdictions through the Vital Statistics Cooperative Program." (2015).

Demortier, L. (2015). Project McNulty: Estimating the Risk of Heart Disease. Retrieved from https://lucdemortier.github.io/projects/3_mcnulty

Hermann, L. K., Weingart, S. D., Yoon, Y. M., Genes, N. G., Nelson, B. P., Shearer, P. L., ... & Henzlova, M. J. (2010). Comparison of frequency of inducible myocardial ischemia in patients presenting to emergency department with typical versus atypical or nonanginal chest pain. The American journal of cardiology, 105(11), 1561-1564.

Hill, J., & Timmis, A. (2002). Exercise tolerance testing. BMJ, 324(7345), 1084-1087.

Kashou, A. H., & Kashou, H. E. (2017). Rhythm, ST Segment. In StatPearls [Internet]. StatPearls Publishing.

Larson, E. D., Clair, J. R. S., Sumner, W. A., Bannister, R. A., & Proenza, C. (2013). Depressed pacemaker activity of sinoatrial node myocytes contributes to the age-dependent decline in maximum heart rate. Proceedings of the National Academy of Sciences, 110(44), 18011-18016.

Michos, E. D. Why Cholesterol Matters for Women. Retrieved from https://www.hopkinsmedicine.org/health/wellness-and-prevention/why-cholesterol-matters-for-women

The MSDS HyperGlossary: Asymptomatic. Retrieved from http://www.ilpi.com/msds/ref/asymptomatic.html

Wiinberg, N., Høegholm, A., Christensen, H. R., Bang, L. E., Mikkelsen, K. L., Nielsen, P. E., ... & Bentzon, M. W. (1995). 24-h ambulatory blood pressure in 352 normal Danish subjects, related to age and gender. American journal of hypertension, 8(10), 978-986.