<a id="1"></a> <br> 
# Data

<center><img src="https://images.ctfassets.net/yixw23k2v6vo/2gmXzt47usTgvqBdGsZthG/14b99ab35cd6d18930f5a0b02c4829d6/HEART_INFO_stas.png?w=824&h=464&fit=thumb" width="500px"/></center>

#### INTRODUCTION

This notebook uses a small heart disease dataset from 1988, originally from the [UCI repository](https://archive.ics.uci.edu/ml/datasets/heart+disease). It contains 14 of the 76 available heart disease (HD) features for 303 patients. The goal is to correctly classify patients with HD using the smallest number of features and common prediction models. Feedback is highly appreciated :)

**Summary**
* Logistic regression: Accuracy = 87.21%, Sensitivity = 97.67%
* K-nearest neighbors algorithm: Accuracy = 86.05%, Sensitivity = 90.70% 
* Random Forest: Accuracy = 87.21%, Sensitivity = 90.70% 
* Support vector machine: Accuracy = 84.88%, Sensitivity = 88.37%

In [None]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns

import warnings  
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv('../input/heart-disease-uci/heart.csv')
print("Number of rows & columns:", data.shape, "\n")
print("Number of missing values: \n", data.isnull().sum())

The dataset is complete and contains no missing values.

In [None]:
data.sample(2) #show n random rows 

In [None]:
data.describe()

Corrected feature descriptions copied from [The ultimate guide to this dataset!](https://www.kaggle.com/ronitf/heart-disease-uci/discussion/105877).
>For some unknown reason, the dataset for download on Kaggle is VERY different from the one you can download at https://archive.ics.uci.edu/ml/datasets/heart+Disease
And what's worse: the description here on Kaggle is the same as the one in the Cleveland page, that means every interpretation you make based on the Kaggle dataset is WRONG.

* **Age:** age in years
* **Sex:** (1 = male, 0 = female)
* **Cp:** chest pain type (0 = asymptomatic, 1 = atypical angina, 2 = non-anginal pain, 3 = typical angina)
* **Trestbps:** resting blood pressure (in mm Hg on admission to the hospital)
* **Chol:** serum cholesterol in mg/dl
* **Fbs:** (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* **Restecg:** resting electrocardiographic results (0 = showing probable or definite left ventricular hypertrophy by Estes' criteria, 1 = normal, 2 = having ST-T wave abnormality)
* **Thalach:** maximum heart rate achieved
* **Exang:** exercise induced angina (1 = yes, 0 = no)
* **Oldpeak:** ST depression induced by exercise relative to rest
* **Slope:** the slope of the peak exercise ST segment (0 = downward, 1 = flat, 2 = upward)
* **Ca:** number of major vessels (0-3) colored by fluoroscopy
>data #93, 139, 164, 165 and 252 have ca=4 which is incorrect. In the original Cleveland dataset they are NaNs (so they should be removed)
* **Thal:** 1 = fixed defect, 2 = normal, 3 = reversible defect
>thal is for Thallium, a radioactive tracer injected during a stress test. 
* **Target:** have disease or not (0 = yes, 1 = no)
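
The note on `ca` in the list above can be illustrated on a toy frame. This is only a sketch under assumptions: the values in `toy` are made up, and converting out-of-range codes to `NaN` before dropping them is one possible way to express the clean-up, not necessarily the one used in this notebook.

```python
import pandas as pd

# Hypothetical rows; ca = 4 stands in for the invalid code described above.
toy = pd.DataFrame({"ca": [0, 4, 2, 1], "chol": [233, 250, 204, 236]})

cleaned = toy.copy()
# Series.where keeps values where the condition holds and sets the rest to NaN.
cleaned["ca"] = cleaned["ca"].where(cleaned["ca"] <= 3)
# Drop the rows whose ca value was invalid.
cleaned = cleaned.dropna(subset=["ca"])

print(cleaned)
```

Marking the invalid codes as missing first keeps the reason for the removal visible in the code.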

In [None]:
data = data[data.ca != 4] #deleting patients with ca = 4 (five rows in total)
data["ca"].describe()

In [None]:
data.shape

#### EXPLORATORY DATA ANALYSIS

**Data distribution**

In [None]:
fig, ax = plt.subplots(11,2, figsize=(18,15))
sns.despine(top=True, right=True)
sns.set_palette("twilight_shifted", 2)

g1 = sns.countplot(data=data, x="target", edgecolor=(0,0,0), ax=ax[0,0])
ax[0,0].set_title("Non-HD vs. HD patients")
g2 = sns.countplot(data=data, x="sex", edgecolor=(0,0,0), ax=ax[0,1])
ax[0,1].set_title("Female vs. male patients")
g3 = sns.countplot(data=data, x="sex", hue="target", edgecolor=(0,0,0), ax=ax[1,0])
ax[1,0].set_title("Gender vs. heart disease")
g4 = sns.boxplot(data=data, x="sex", y="age", hue="target", ax=ax[1,1])
ax[1,1].set_title("Age and gender vs. heart disease")
g5 = sns.histplot(data=data, x="age", hue="sex", multiple="stack",ax=ax[2,0])
ax[2,0].set_title("Age and gender")
g6 = sns.histplot(data=data, x="age", hue="target", multiple="stack", ax=ax[2,1])
ax[2,1].set_title("Age and heart disease")
g7 = sns.scatterplot(data=data, x="trestbps", y="chol", hue="target", ax=ax[3,0])
ax[3,0].axvline(data["trestbps"].mean(), color="grey", linestyle="dotted")
ax[3,0].axhline(data["chol"].mean(), color="grey", linestyle="dotted")
g8 = sns.histplot(data=data, x="chol", hue="target",multiple="stack", ax=ax[3,1])
g9 = sns.countplot(data=data, x="restecg", hue="target", edgecolor=(0,0,0), ax=ax[4,0])
g10 = sns.histplot(data=data, x="slope", hue="target", multiple="stack", ax=ax[4,1])
g11 = sns.countplot(data=data, x="exang", hue="target",edgecolor=(0,0,0), ax=ax[5,0])
g12 = sns.countplot(data=data, x="ca", hue="target", edgecolor=(0,0,0), ax=ax[5,1])
g13 = sns.stripplot(data=data, x="ca", y="chol", hue="target", jitter = .1, alpha = .8, ax=ax[6,0])
ax[6,0].axhline(data["chol"].mean(), color="grey", linestyle="dotted")
g14 = sns.histplot(data=data, x="chol", hue="target", multiple="stack", ax=ax[6,1])
g15 = sns.histplot(data=data, x="thalach", hue="target", multiple="stack", ax=ax[7,0])
g16 = sns.stripplot(data=data, x="ca", y="thalach", hue="target", jitter = .1, alpha = .8, ax=ax[7,1])
ax[7,1].axhline(data["thalach"].mean(), color="grey", linestyle="dotted")
g17 = sns.histplot(data=data, x="oldpeak", hue="target",multiple="stack", ax=ax[8,0])
g18 = sns.countplot(data=data, x="cp", hue="target", edgecolor=(0,0,0), ax=ax[8,1])
g19 = sns.countplot(data=data, x="fbs", hue="target", edgecolor=(0,0,0), ax=ax[9,0])
df = data.loc[data["thal"] != 0]
g20 = sns.stripplot(data=data, x="fbs", y="age", hue="target", jitter = .1, alpha = .8, ax=ax[9,1])
g21 = sns.stripplot(data=data, x="fbs", y="chol", hue="target", jitter = .1, alpha = .8, ax=ax[10,0])
g22 = sns.countplot(data=df, x="thal", hue="target", edgecolor=(0,0,0), ax=ax[10,1])


plt.tight_layout()
plt.show()

* The dataset is almost evenly split between patients with and without heart disease
* The sex ratio is skewed (approximately twice as many men as women)
* Heart disease is more prevalent among men
* Age is a risk factor
* Women are diagnosed with heart disease at a higher age than men
* High resting blood pressure and, even more so, high cholesterol levels indicate heart disease
* Probable or definite left ventricular hypertrophy by Estes' criteria clearly coincides with heart disease
* A flat slope increases the risk of heart disease
* A high maximum heart rate correlates with better health
* Two or more vessels colored by fluoroscopy indicate heart disease. Patients with three colored vessels also mostly exhibit a below-average maximum heart rate.
* An increase in ST depression clearly correlates with heart disease (>2: 100% heart disease)
* Most heart disease patients have asymptomatic chest pain
* A reversible-defect thal test result is a good indicator of heart disease
* Fbs does not appear to be a very useful metric

**Feature correlations**

In [None]:
corr_matrix = data.corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool)) #mask the upper triangle (incl. diagonal) so each pair appears once

plt.figure(figsize=(17,8))
sns.despine(top=True, right=True, bottom=True)
g1 = sns.heatmap(corr_matrix, mask=mask, vmin=-1, vmax=1,annot=True, cmap="magma")
g1.set_title("Pairwise correlation matrix")

There is no high correlation between the target value and any of the other features.
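
One way to check a statement like this numerically is to rank the features by their absolute correlation with the target. The sketch below uses a made-up frame (`toy`, with hypothetical columns `a` and `b`) rather than the heart data, so the numbers say nothing about this dataset.

```python
import pandas as pd

# Hypothetical data: "a" rises with the target, "b" is unrelated to it.
toy = pd.DataFrame({
    "target": [0, 0, 1, 1],
    "a": [1, 2, 3, 4],
    "b": [1, -1, 1, -1],
})

# Pearson correlation of every feature with the target, strongest first.
ranked = (
    toy.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

Applied to the real frame, the same expression lists the features from most to least linearly related to `target`.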

#### PRE-PROCESSING

**Removal of outliers**

Outliers are removed using the interquartile range method. 
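
As a minimal illustration of the interquartile range rule, the sketch below computes the two fences on made-up cholesterol values. The helper name `iqr_bounds` and the toy numbers are assumptions for illustration only.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical readings; 564 is deliberately extreme.
toy = pd.DataFrame({"chol": [200, 210, 220, 230, 240, 250, 564]})

lower, upper = iqr_bounds(toy["chol"])
# Keep only the rows that lie inside the fences (inclusive).
kept = toy[toy["chol"].between(lower, upper)]

print(lower, upper)
print(len(toy) - len(kept), "row(s) removed")
```

Here the quartiles are 215 and 245, so any value outside 170 to 290 counts as an outlier.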

In [None]:
fig, ax = plt.subplots(3,1, figsize=(8,8))
sns.despine(top=True, right=True)
g1 = sns.boxplot(data=data, x="target", y="chol", ax=ax[0])
g2 = sns.boxplot(data=data, x="target", y="age", ax=ax[1])
g3 = sns.boxplot(data=data, x="target", y="trestbps", ax=ax[2])
plt.tight_layout()
plt.show()

In [None]:
def outlier_remove(df, col_name):
    #drops, in place, the rows of df whose col_name value lies outside the 1.5*IQR fences
    Q1 = df[col_name].quantile(0.25)
    Q3 = df[col_name].quantile(0.75)
    IQR = Q3 - Q1
    outliers = df[(df[col_name] < Q1-1.5*IQR) | (df[col_name] > Q3+1.5*IQR)].index
    df.drop(outliers, axis=0, inplace=True)
    
outlier_remove(data, "chol")
outlier_remove(data, "trestbps")
outlier_remove(data, "age")

In [None]:
fig, ax = plt.subplots(3,1, figsize=(8,8))
sns.despine(top=True, right=True)
g1 = sns.boxplot(data=data, x="target", y="chol", ax=ax[0])
g2 = sns.boxplot(data=data, x="target", y="age", ax=ax[1])
g3 = sns.boxplot(data=data, x="target", y="trestbps", ax=ax[2])
plt.tight_layout()
plt.show()

**Assigning dummy values to categorical features**

Dummy variables are assigned to categorical features cp, restecg, thal, slope and ca.
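
For readers unfamiliar with dummy encoding, here is a small sketch on a made-up frame. The `columns=` form of `pd.get_dummies` is shown only as a compact alternative, and the values in `toy` are hypothetical.

```python
import pandas as pd

# Hypothetical patients with two categorical codes.
toy = pd.DataFrame({"age": [63, 45, 58], "cp": [0, 2, 0], "slope": [1, 2, 2]})

# Each listed column is replaced by one indicator column per observed category.
encoded = pd.get_dummies(toy, columns=["cp", "slope"])

print(encoded.columns.tolist())
```

Only categories that actually occur in the data get a column, which is why `cp_1` and `slope_0` are absent here.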

In [None]:
X = data.drop("target", axis = 1) #input
Y = data["target"] #output

a = pd.get_dummies(data["cp"], prefix = "cp")
b = pd.get_dummies(data["thal"], prefix = "thal")
c = pd.get_dummies(data["restecg"], prefix = "restecg")
d = pd.get_dummies(data["slope"], prefix = "slope")
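e = pd.get_dummies(data["ca"], prefix = "ca") #separate name so the slope dummies in d are not overwritten

X = pd.concat([X, a,b,c,d,e], axis=1)
X = X.drop(["cp", "thal", "restecg", "slope", "ca"], axis=1) #drop the original categorical columns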

X = X.drop(["fbs"], axis=1) #dropping the least informative  feature (according to EDA)

**Data splitting**

To avoid contaminating the test set, the dataset is split 70:30 (training:test) before standardization.
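
A minimal sketch of what a 70:30 split with a fixed `random_state` does, using made-up arrays (`X_toy` and `y_toy` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features
y_toy = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123
)
# Repeating the call with the same seed reproduces the same partition.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

Because the seed is fixed, rerunning the split later in a notebook yields the same training and test rows.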

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.3, random_state = 123)

print(data.shape,X_train.shape, X_test.shape, Y_train.shape, Y_test.shape )

**Standardization**

The continuous features of the training and test sets are standardized with `StandardScaler`, since scale-sensitive models such as k-nearest neighbors and support vector machines generally perform better when features share a common scale.
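
A common convention is to learn the mean and standard deviation from the training rows only and to reuse them for the test rows. The sketch below shows that pattern on made-up numbers; the arrays `train` and `test` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data (e.g. ages).
train = np.array([[40.0], [50.0], [60.0]])
test = np.array([[50.0], [70.0]])

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(train)

train_scaled = scaler.transform(train)
# The test rows are shifted and scaled with the training statistics.
test_scaled = scaler.transform(test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test value 50 maps to 0 because it equals the training mean, regardless of what else is in the test set.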

In [None]:
from sklearn.preprocessing import StandardScaler

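def standardize(df, scaler=None):
    #scales the continuous columns in place; a new scaler is fitted only if none is passed
    to_scale = ["age","trestbps", "chol", "thalach", "oldpeak"]
    if scaler is None:
        scaler = StandardScaler().fit(df[to_scale])
    df[to_scale] = scaler.transform(df[to_scale])
    return scaler

scaler = standardize(X_train) #fit on the training data only
standardize(X_test, scaler) #reuse the training statistics for the test data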

# Prediction

#### LOGISTIC REGRESSION

In [None]:
from sklearn.linear_model import LogisticRegressionCV
import sklearn.metrics as metrics
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

#X_train and X_test were already split and standardized in the pre-processing section

lrcv_model = LogisticRegressionCV(cv=10, scoring="accuracy", random_state=123).fit(X_train, Y_train)
Y_pred = lrcv_model.predict(X_test)
confusion_matrix = metrics.confusion_matrix(Y_test, Y_pred)

print("Training accuracy: {:.2f}%".format(lrcv_model.score(X_train, Y_train) * 100), "\n")
print("Testing accuracy: {:.2f}%".format(lrcv_model.score(X_test, Y_test) * 100), "\n")
print("F1 score: {:.2f}".format(f1_score(Y_test, Y_pred)), "\n")
print("Sensitivity: {:.2f}%".format(recall_score(Y_test, Y_pred)* 100), "\n")



sns.heatmap(pd.DataFrame(confusion_matrix), annot=True, cmap="mako")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
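
Which class the reported sensitivity refers to depends on how `target` is encoded: `recall_score` treats class 1 as positive unless `pos_label` says otherwise. The sketch below shows both readings on made-up labels (`y_true` and `y_pred` are hypothetical).

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: four patients of class 0 followed by four of class 1.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# With labels=[0, 1], ravel() yields tn, fp, fn, tp when class 1 is "positive".
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

recall_class_1 = recall_score(y_true, y_pred)               # default pos_label=1
recall_class_0 = recall_score(y_true, y_pred, pos_label=0)  # class 0 as positive

print(recall_class_1, tp / (tp + fn))
print(recall_class_0, tn / (tn + fp))
```

If the disease class is coded as 0, as in the feature list above, the recall for disease cases is the `pos_label=0` value.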

#### K-NEAREST NEIGHBORS

In [None]:
from sklearn.neighbors import KNeighborsClassifier

parameters = list(range(1,40))
accuracy = []

for p in parameters:
    knn = KNeighborsClassifier(n_neighbors = p)
    knn.fit(X_train, Y_train)
    score = knn.score(X_test, Y_test) #test accuracy for k = p
    accuracy.append(score)

knn = KNeighborsClassifier(n_neighbors = 7) #final model uses k = 7
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
confusion_matrix = metrics.confusion_matrix(Y_test, Y_pred)
    
print("Maximum testing accuracy: {:.2f}%".format(max(accuracy) * 100), "\n")
print("F1 score: {:.2f}".format(f1_score(Y_test, Y_pred)), "\n")
print("Sensitivity: {:.2f}%".format(recall_score(Y_test, Y_pred)* 100), "\n")


fig, ax = plt.subplots(figsize=(10,5))
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.bar(parameters, accuracy, edgecolor=(0,0,0))
plt.xlim(min(parameters), max(parameters))
plt.ylim(.7,.9)
plt.locator_params(axis='x', nbins= max(parameters))
plt.show()

sns.heatmap(pd.DataFrame(confusion_matrix), annot=True, cmap="mako")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
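
An alternative to scanning k against the test set is to choose it by cross-validation on the training data and to score the test set only once. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification`, and names such as `X_syn` and `search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the heart data.
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=123
)

# 5-fold cross-validation over odd k values, using the training rows only.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 40, 2))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
test_acc = search.score(X_te, y_te)  # held-out accuracy of the refitted best model
print(best_k, round(test_acc, 3))
```

Selecting k this way keeps the test accuracy an unbiased estimate, since the test rows play no part in choosing the hyperparameter.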



#### RANDOM FOREST

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state = 123, n_estimators=1000)
rf.fit(X_train, Y_train)
Y_pred = rf.predict(X_test)
confusion_matrix = metrics.confusion_matrix(Y_test, Y_pred)

print("Training accuracy: {:.2f}%".format(rf.score(X_train, Y_train) * 100), "\n")
print("Testing accuracy: {:.2f}%".format(rf.score(X_test, Y_test) * 100), "\n")
print("F1 score: {:.2f}".format(f1_score(Y_test, Y_pred)), "\n")
print("Sensitivity: {:.2f}%".format(recall_score(Y_test, Y_pred)* 100), "\n")


sns.heatmap(pd.DataFrame(confusion_matrix), annot=True, cmap="mako")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()

#### SUPPORT VECTOR MACHINE

In [None]:
from sklearn.svm import SVC

svm = SVC(random_state = 123, kernel="linear")
svm.fit(X_train, Y_train)
Y_pred = svm.predict(X_test)
confusion_matrix = metrics.confusion_matrix(Y_test, Y_pred)

print("Training accuracy: {:.2f}%".format(svm.score(X_train, Y_train) * 100), "\n")
print("Testing accuracy: {:.2f}%".format(svm.score(X_test, Y_test) * 100), "\n")
print("F1 score: {:.2f}".format(f1_score(Y_test, Y_pred)), "\n")
print("Sensitivity: {:.2f}%".format(recall_score(Y_test, Y_pred)* 100), "\n")


sns.heatmap(pd.DataFrame(confusion_matrix), annot=True, cmap="mako")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()