# Analysis of Heart Disease

This notebook aims to find the correlation between different features of the dataset, and possibly create some compound features.

There are some features whose meaning is not obvious to the ordinary eye, so relevant medical information will be given when necessary.

Hope you enjoy going through this notebook.

<a id="section-one"></a>
# 1. Understanding the dataset

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import kurtosis, skew

sns.set()
sns.set_style("white")
df = pd.read_csv("/kaggle/input/heart-disease-uci/heart.csv")
print("Sneak peek of the dataset\n")
print(f"This dataset has {df.shape[0]} rows and {df.shape[1]} columns.\n\nThe columns are:\n")
for thing in df.columns:
    print(f"{thing}", end = " | ")

print("\n\nSample of the data:")
    
df.sample(5)

The column names are not very clear, so here is a slightly more verbose version of them:
1. Age (in years)
2. Sex (1 = male, 0 = female)
3. CP (chest pain type) (4 values: 0 == no pain??)
4. Trestbps (resting blood pressure)
5. Chol (serum cholesterol in mg/dl)
6. FBS (fasting blood sugar > 120 mg/dl)
7. Restecg (resting electrocardiographic results (values 0,1,2))
8. Thalach (maximum heart rate achieved)
9. Exang (exercise induced angina) (1 = yes, 0 = no)
10. Oldpeak (ST depression induced by exercise relative to rest)
11. Slope (the slope of the peak exercise ST segment)
12. CA (number of major vessels (0-3) colored by fluoroscopy)
13. Thal (3 = normal; 6 = fixed defect; 7 = reversible defect)
14. Target (0 = fit, 1 = diseased)

Here are some summary statistics to savour.

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
print("NaNs?")
df.isna().sum()

The fact that there aren't NaNs in this dataset makes me very happy.

In [None]:
print("Unique values in each column:")
df.nunique(axis=0)

It is clear that age, trestbps, chol, thalach and oldpeak are numeric, while the others are categorical.

Now, let us visualize each of the features and understand what they really mean.

In [None]:
numeric = [
    "age",
    "trestbps",
    "chol",
    "thalach",
    "oldpeak"
]

cat = [
    "sex",
    "cp",
    "fbs",
    "restecg",
    "exang",
    "slope",
    "ca",
    "thal"
]
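
The split above was made by eye from the unique-value counts. As a rough sketch, the same split could be derived programmatically. The helper name `split_by_cardinality` and the threshold of 10 levels are assumptions made for illustration, and the small frame below is made-up data, not rows from the dataset.

```python
import pandas as pd

def split_by_cardinality(frame, target="target", max_levels=10):
    """Split feature columns into (numeric, categorical) by unique-value count.

    `max_levels` is an assumed cut-off: columns with at most this many
    distinct values are treated as categorical.
    """
    counts = frame.drop(columns=[target]).nunique()
    categorical = counts[counts <= max_levels].index.tolist()
    numeric = counts[counts > max_levels].index.tolist()
    return numeric, categorical

# Tiny made-up frame, only to show the call
toy = pd.DataFrame({
    "age": range(40, 52),       # 12 distinct values
    "sex": [0, 1] * 6,          # 2 distinct values
    "cp": [0, 1, 2, 3] * 3,     # 4 distinct values
    "target": [1, 0] * 6,
})
toy_numeric, toy_cat = split_by_cardinality(toy)
print("numeric:", toy_numeric)
print("categorical:", toy_cat)
```

A count-based rule is only a heuristic, so it is worth comparing its output against the hand-written lists before relying on it.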

<a id="section-two"></a>
# 2. Visualization
    
## Correlation Matrix

In [None]:
corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 15));

# Diverging colormap centred on zero, so the sign of each correlation is visible
cmap = sns.color_palette("vlag", as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, annot=True, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, fmt=".2f");
plt.title("Correlation matrix");

We cannot find any highly correlated fields (|corr| > 0.75), but there are several pairs of features which have low/medium correlation. A few of them are:
1. age-thalach (negative correlation)
2. cp-thalach (positive correlation)
3. cp-exang (negative correlation)
4. exang-thalach (negative correlation)
5. oldpeak-thalach (negative correlation)
6. thalach-slope (positive correlation)
7. oldpeak-slope (relatively high negative correlation)
8. target-cp (positive correlation)
9. target-thalach (positive correlation)
10. target-exang (negative correlation)
11. target-oldpeak (negative correlation)
12. target-slope (positive correlation)
13. target-ca (negative correlation)
14. target-thal (negative correlation)
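
Rather than reading pairs off the heatmap, a list like the one above could be extracted from the correlation matrix directly. This is a minimal sketch: the helper `correlated_pairs` and the cut-off of 0.3 are assumptions for illustration, and the frame is made-up data.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, min_abs=0.3):
    """List each feature pair once, keeping those with |Pearson r| >= min_abs."""
    corr = frame.corr(numeric_only=True)
    cols = corr.columns
    # Indices of the strictly upper triangle, so every pair appears once
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "feature_a": cols[i],
        "feature_b": cols[j],
        "corr": corr.to_numpy()[i, j],
    })
    pairs = pairs[pairs["corr"].abs() >= min_abs]
    # Strongest relationships first, regardless of sign
    return pairs.sort_values("corr", key=lambda s: s.abs(), ascending=False).reset_index(drop=True)

# Made-up data: a and b move in opposite directions, c is mostly unrelated
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [6, 5, 4, 3, 2, 1.5],
    "c": [1, -1, 1, -1, 1, -1],
})
pairs = correlated_pairs(toy, min_abs=0.3)
print(pairs)
```

On the notebook's data, the equivalent call would be `correlated_pairs(df)`, run while all columns are still numeric.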

## 2.1 Age and Sex

In [None]:
print("Distribution of sexes:")
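# Relabel the sex column on df itself; later cells rely on the "Male"/"Female" labels
df["sex"] = df["sex"].replace({1: "Male", 0: "Female"})
print(df["sex"].value_counts())

# rename_axis + reset_index(name=...) gives the same column names on any pandas version
sexes = df["sex"].value_counts().rename_axis("Sex").reset_index(name="Count")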
plt.figure(figsize=(7, 7))
sns.set_style("white")
sns.barplot(data=sexes, x="Sex", y="Count", palette="muted", alpha=0.6);
sns.set(font_scale=1.3)
sns.despine()
plt.title("Distribution of Sexes");

In [None]:
sns.set_style("white")
g = sns.displot(df, x="age", hue="target", kind="kde", col="sex", fill=True, alpha=0.5, palette="muted");
sns.despine()
# plt.title("Distribution of age");
# Mark the overall mean and median age on every facet, not only the last one
for ax in g.axes.flat:
    ax.axvline(df['age'].mean(), c="orange", ls="-", lw=1, label="mean")
    ax.axvline(df['age'].median(), c="darkblue", ls="--", lw=1, label="median")
g.axes.flat[-1].legend();

The plots above suggest that men with and without heart disease appear in roughly equal numbers, whereas among women significantly more are diseased. Perhaps the plots below will illustrate this better.

In [None]:
plt.figure(figsize=(7, 7))
sns.violinplot(data=df, x="sex", y="age", hue="target", palette="pastel", split=True, inner="box", 
               scale="width");
sns.despine()
plt.title("Age, sex and heartbreaks");

In [None]:
print(f"There are {df[(df['sex'] == 'Male') & (df['target'] == 1)].shape[0]} diseased males.")
print(f"There are {df[(df['sex'] == 'Male') & (df['target'] == 0)].shape[0]} fit males.")
print(f"There are {df[(df['sex'] == 'Female') & (df['target'] == 1)].shape[0]} diseased females.")
print(f"There are {df[(df['sex'] == 'Female') & (df['target'] == 0)].shape[0]} fit females.")

This may seem rather unbalanced, but it is actually representative of the real world: cardiovascular diseases are less prevalent in women than in men; however, they tend to be more fatal in women.
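
The four counts printed above can also be put in a single table, along with per-sex proportions, using `pd.crosstab`. The sketch below runs on a handful of made-up rows; the numbers are illustrative only.

```python
import pandas as pd

# Made-up rows, only to show the shape of the output
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male", "Female", "Female", "Female"],
    "target": [1, 0, 0, 1, 1, 1, 0],
})

# Raw counts of fit (0) and diseased (1) per sex
counts = pd.crosstab(toy["sex"], toy["target"])

# normalize="index" turns each row into proportions that sum to 1
rates = pd.crosstab(toy["sex"], toy["target"], normalize="index")

print(counts)
print(rates.round(2))
```

Looking at proportions rather than raw counts helps when the two groups differ a lot in size.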

## 2.2 Chest Pain

There are 4 types of chest pain in the dataset.

Here is what they mean as per UCI's data repository:
> 1 - typical angina
>
> 2 - atypical angina
> 
> 3 - non-anginal pain
>
> 4 - asymptomatic

Here's some detail about the types of chest pain

#### Typical Angina (1)

> Typical angina (TA) is defined as substernal chest pain precipitated by physical exertion or emotional stress and relieved with rest or nitroglycerin.
>
> -- <cite>National Institutes of Health</cite>

It is the condition where the heart does not get enough blood/oxygen.


#### Atypical Angina (2)

I couldn't find a proper definition, but here is what I found.
>Women may have more of a subtle presentation called atypical angina. For example, in one study of over 500 women who suffered a heart attack, 71% had fatigue, 48% had sleep disturbances, 42% had shortness of breath, and 30% had chest discomfort in the month prior to the heart attack.
>
> -- <cite>Harrington Hospital</cite>

It appears to be a subtle form of angina which usually affects women.


#### Non-anginal Pain (3)
> The term "atypical chest pain" is a waste-basket term that leads physicians to send any patient with chest pain to coronary angiography. In order to avoid this term, we must learn to distinguish atypical angina from nonanginal chest pain before angiography is considered in order to avoid unnecessary invasive procedures. A chest pain is very likely nonanginal if its duration is over 30 minutes or less than 5 seconds, it increases with inspiration, can be brought on with one movement of the trunk or arm, can be brought on by local fingers pressure, or bending forward, or it can be relieved immediately on lying down. 
>
> -- <cite>National Library of Medicine</cite>

I doubt I can explain it any better.

#### Asymptomatic (4)

Well, asymptomatic.

## A big problem

According to UCI's site these are labeled from 1-4, but in the dataset we are working with they are labeled from 0-3.
For now, let us assume type 0 == asymptomatic.

In [None]:
sns.displot(data=df, x="cp", hue="target", discrete=True, palette="muted", alpha=0.5);
plt.title("Type of Chest Pain and Heart Disease");

Assuming that the distribution of the types of chest pain in the dataset is representative of what it is like in the real world, we can deduce that if you do not have any chest pain (type 0, under the assumption above), then you probably do not have any heart disease, whereas the opposite holds when you have any sort of chest pain. It also suggests that type 3 chest pain is rather uncommon.
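
Both observations (disease rate per chest pain type, and how common each type is) can be read off a small summary table instead of the plot. This is a sketch on made-up rows; the helper name `disease_rate_by` is an assumption for illustration.

```python
import pandas as pd

def disease_rate_by(frame, feature, target="target"):
    """Per level of `feature`: row count, share of all rows, and mean of the 0/1 target."""
    summary = frame.groupby(feature)[target].agg(n="count", disease_rate="mean")
    summary["share_of_rows"] = summary["n"] / len(frame)
    return summary

# Made-up rows, only to show the call
toy = pd.DataFrame({
    "cp":     [0, 0, 0, 0, 1, 1, 2, 2, 2, 3],
    "target": [0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
})
summary = disease_rate_by(toy, "cp")
print(summary)
```

Because `target` is coded 0/1, its mean within a group is simply the fraction of diseased rows in that group.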

## 2.3 Resting Blood Pressure

In [None]:
df["trestbps"].describe()

In [None]:
sns.kdeplot(data=df, x="trestbps", hue="target", fill=True, alpha=0.5);
sns.despine()
plt.title("Distribution of resting blood pressure");

In [None]:
diseased = df[df["target"] == 1]
fit = df[df["target"] == 0]

print(f"Kurtosis of resting blood pressure of the diseased: {kurtosis(diseased['trestbps'])}")
print(f"Skewness of resting blood pressure of the diseased: {skew(diseased['trestbps'])}")

print(f"\nKurtosis of resting blood pressure of the fit people: {kurtosis(fit['trestbps'])}")
print(f"Skewness of resting blood pressure of the fit people: {skew(fit['trestbps'])}")

In [None]:
plt.figure(figsize=(7, 7))
sns.violinplot(data=df, x="target", y="trestbps", hue="sex", split=True, inner="quartile", palette="muted", alpha=0.5);
sns.despine()

This is rather interesting: among healthy people, the median resting blood pressure of females sits at the 3rd quartile of males, whereas among people with cardiovascular diseases the quartiles for men and women line up, which is a bit uncanny.
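
The quartile lines in a violin plot are hard to compare by eye, so the comparison can be checked numerically with a grouped quantile. The sketch below uses made-up values, chosen only so that the output has the same layout as it would for the real data.

```python
import pandas as pd

# Made-up readings: 4 per (target, sex) group
toy = pd.DataFrame({
    "target": [0] * 8 + [1] * 8,
    "sex": (["Male"] * 4 + ["Female"] * 4) * 2,
    "trestbps": [110, 120, 130, 140, 120, 130, 140, 150,
                 120, 130, 140, 150, 120, 130, 140, 150],
})

# One row per (target, sex) group, one column per quartile
quartiles = (
    toy.groupby(["target", "sex"])["trestbps"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```

On the notebook's frame the same expression with `df` in place of `toy` gives the quartiles that the violin plot draws.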

## 2.4 Cholesterol

Before moving on, for reference, the `restecg` values (0, 1, 2) listed in section 1 mean the following:
- Value 0: normal
- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

In [None]:
from bokeh.io import output_notebook
from bokeh.resources import INLINE

# Render Bokeh plots inline in the notebook
output_notebook(INLINE)

In [None]:
from bokeh.plotting import figure, show
from bokeh.models import ColorBar, ColumnDataSource
from bokeh.palettes import Viridis3
from bokeh.transform import linear_cmap

In [None]:
# Two colours are enough: one per target class
palette = Viridis3[0:2]
mapper = linear_cmap(field_name='target', palette=palette, low=0, high=1)

# Only the columns the glyph needs go into the data source
source = ColumnDataSource(df[["age", "chol", "target"]])

p = figure(title="Age and Cholesterol", y_axis_label='Cholesterol', x_axis_label='Age')
p.scatter(y="chol", x="age", color=mapper, fill_alpha=0.8, size=8, source=source)

color_bar = ColorBar(color_mapper=mapper['transform'], width=8)

p.add_layout(color_bar, 'right')

show(p)

It appears as if there is a rather weak linear relationship between cholesterol and age.
Also, rather surprisingly, there is no clear correlation between heart disease and cholesterol, at least according to this dataset.
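
Both statements can be backed with numbers: Pearson's r for two continuous columns, and the point-biserial correlation (which is Pearson's r with one binary variable) for cholesterol against the 0/1 target. The sketch below uses `scipy.stats` on made-up values, so the printed coefficients say nothing about the real dataset.

```python
import pandas as pd
from scipy.stats import pearsonr, pointbiserialr

# Made-up values, only to show the calls
toy = pd.DataFrame({
    "age":    [40, 45, 50, 55, 60, 65, 70, 42],
    "chol":   [200, 260, 210, 280, 230, 300, 240, 250],
    "target": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Linear relationship between two continuous features
r_age_chol, p_age_chol = pearsonr(toy["age"], toy["chol"])

# Relationship between the binary target and a continuous feature
r_target_chol, p_target_chol = pointbiserialr(toy["target"], toy["chol"])

print(f"age vs chol:    r = {r_age_chol:.2f} (p = {p_age_chol:.3f})")
print(f"target vs chol: r = {r_target_chol:.2f} (p = {p_target_chol:.3f})")
```

The p-values give a rough sense of whether a weak coefficient could plausibly be noise, which a scatter plot alone cannot show.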

# Classification

In [None]:
from sklearn.base import clone

from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, StratifiedKFold

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_score

from sklearn.tree import export_graphviz


In [None]:
df_dr = df.drop(["age", "chol"], axis=1)

In [None]:
numeric = [
    "trestbps",
    "thalach",
    "oldpeak"
]

cat = [
    "sex",
    "cp",
    "fbs",
    "restecg",
    "exang",
    "slope",
    "ca",
    "thal"
]

In [None]:
one_hot = pd.get_dummies(df_dr, columns=cat)

In [None]:
one_hot.columns

In [None]:
X_train, X_test, y_train, y_test = train_test_split(one_hot.drop("target", axis=1), one_hot["target"], test_size = 0.25, random_state = 0)

In [None]:
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric)
])

In [None]:
sc = StandardScaler()

In [None]:
# train_features = preprocessor.fit_transform(X_train)
# test_features = preprocessor.transform(X_test)
train_features = sc.fit_transform(X_train)
test_features = sc.transform(X_test)
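
The cell above scales every column, including the one-hot indicator columns. If the intent was to scale only the numeric columns, one possible sketch is a `ColumnTransformer` with `remainder="passthrough"`; by default the remainder is dropped, which would discard the categorical columns. The frame and the column name `sex_Male` below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up frame: two numeric columns and one one-hot indicator
toy = pd.DataFrame({
    "trestbps": [120.0, 130.0, 140.0, 150.0],
    "thalach": [150.0, 160.0, 170.0, 180.0],
    "sex_Male": [1, 0, 1, 0],
})
numeric_cols = ["trestbps", "thalach"]

# Scale the numeric columns; pass every other column through untouched
scale_numeric_only = ColumnTransformer(
    [("num", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)
scaled = scale_numeric_only.fit_transform(toy)
print(scaled)
```

In the output the transformed columns come first and the passed-through columns follow, so the column order can differ from the input frame.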

In [None]:
def evaluate_model(mod, processed_train_x, train_y, processed_test_x, test_y, supress_out=False):
    """Score a fitted model on the test and train sets; returns (score_test, score_train).

    Unless `supress_out` is set, also prints the reports and plots both confusion matrices.
    """
    if not supress_out:
        fig, axes = plt.subplots(1, 2, figsize=(15, 5))

    # Test set
    y_pred = mod.predict(processed_test_x)
    score_test = mod.score(processed_test_x, test_y)
    conf_m = confusion_matrix(test_y, y_pred)
    report = classification_report(test_y, y_pred)
    if not supress_out:
        print("Test Report:\n")
        print('Accuracy:', accuracy_score(test_y, y_pred), end='\n\n')
        print('Report:', report, sep='\n')
        sns.heatmap(conf_m, ax=axes[0], annot=True, fmt="d");
        axes[0].set_title("Confusion Matrix (test)");
        print("CrossValidation Score:")
        print(cross_val_score(mod, processed_test_x, test_y, cv=3, scoring="accuracy"))

    # Train set
    y_pred = mod.predict(processed_train_x)
    # Score the model that was passed in, not the global `model`
    score_train = mod.score(processed_train_x, train_y)
    conf_m = confusion_matrix(train_y, y_pred)
    report = classification_report(train_y, y_pred)
    if not supress_out:
        print("Train Report:\n")
        print('Accuracy:', accuracy_score(train_y, y_pred), end='\n\n')
        print('Report:', report, sep='\n')
        print("CrossValidation Score:")
        print(cross_val_score(mod, processed_train_x, train_y, cv=3, scoring="accuracy"))

        sns.heatmap(conf_m, ax=axes[1], annot=True, fmt="d");
        axes[1].set_title("Confusion Matrix (train)");
    return (score_test, score_train)

### Logistic Regression

In [None]:
model = LogisticRegression(solver='liblinear', random_state=0)
model.fit(train_features, y_train)
evaluate_model(model, train_features, y_train, test_features, y_test);

### Decision Tree

In [None]:
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(train_features, y_train)
evaluate_model(tree, train_features, y_train, test_features, y_test)

In [None]:
for i in range(2, 20):
    tree = DecisionTreeClassifier(max_depth=i)
    tree.fit(train_features, y_train)
    result = evaluate_model(tree, train_features, y_train, test_features, y_test, supress_out=True)
    print(f"Tree depth: {i:<2} | test: {result[0]:.3f} | train: {result[1]:.3f}")
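
The loop above compares depths using the test set, which risks tuning the tree to that particular split. A common alternative is to pick the depth by cross-validation on the training data only. The sketch below does this on synthetic data from `make_classification`; the data, the depth range and the 5 folds are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training features and labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

depths = range(2, 11)
cv_means = []
for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Mean accuracy over 5 folds of the (synthetic) training data
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means.append(scores.mean())
    print(f"Tree depth: {depth:<2} | mean CV accuracy: {scores.mean():.3f}")

best_depth = depths[int(np.argmax(cv_means))]
print(f"Best depth by cross-validation: {best_depth}")
```

With the notebook's variables, `train_features` and `y_train` would take the place of `X_demo` and `y_demo`, leaving the test set for a single final check.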