# Abstract

Kaggle Dataset Link: https://www.kaggle.com/ronitf/heart-disease-uci

* In this notebook I go through some classification algorithms to predict the presence of heart disease in a patient, using data from previous patients.

* Even though the dataset is labeled, I also tried K-Means Clustering (unsupervised), which I hadn't used before, on a Principal Component Analysis (PCA) decomposition of the data, and achieved a similar result.

* Target variable represented by the column 'target' maps: {0: 'Healthy', 1: 'Abnormality'}

# Imports

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
df = pd.read_csv('../input/heart-disease-uci/heart.csv')

In [None]:
df.isna().sum()

In [None]:
df['thal'].unique()

In [None]:
df.loc[df['thal'] == 0]

**Note:** The value 0 for 'thal' is not documented in https://archive.ics.uci.edu/ml/datasets/Heart+Disease. I tried removing the only 2 entries that have it, for simplicity, but got much worse results, so I'm leaving them as they are.
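
A minimal sketch of how such undocumented codes could be flagged and, as an alternative to dropping the rows, marked as missing. The toy values and the `documented_codes` set are assumptions for illustration (they mirror the codes 1 to 3 that get labels later in this notebook), not values read from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'thal' column (hypothetical values, not the real data)
toy = pd.DataFrame({'thal': [1, 2, 3, 0, 2, 0]})

# Assumption: the codes that receive a label later in the notebook
documented_codes = {1, 2, 3}

# Boolean mask of rows whose code is not in the documented set
undocumented = ~toy['thal'].isin(documented_codes)
n_undocumented = int(undocumented.sum())

# Option: keep the rows but mark the undocumented code as missing
toy['thal_clean'] = toy['thal'].where(~undocumented, np.nan)

print(n_undocumented)
print(toy)
```

Keeping the rows and marking the value as missing preserves the other features of those patients, which may matter in a dataset this small.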

In [None]:
df.columns

Renaming DataFrame columns to a more comprehensible feature description

In [None]:
df.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol', 'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved',
       'exercise_induced_angina', 'st_depression', 'st_slope', 'num_major_vessels', 'thalassemia', 'target']

In [None]:
df['sex'] = df['sex'].map({0: 'F',
                           1: 'M'})

df['chest_pain_type'] = df['chest_pain_type'].map({0: 'typical angina',
                                                   1: 'atypical angina',
                                                   2: 'non-anginal pain',
                                                   3: 'asymptomatic'})

# The UCI documentation gives the fasting blood sugar threshold in mg/dl
df['fasting_blood_sugar'] = df['fasting_blood_sugar'].map({0: 'lower than 120mg/dl',
                                                           1: 'greater than 120mg/dl'})

df['rest_ecg'] = df['rest_ecg'].map({0: 'normal',
                                     1: 'ST-T wave abnormality',
                                     2: 'left ventricular hypertrophy'})

df['exercise_induced_angina'] = df['exercise_induced_angina'].map({0: 'no',
                                                                   1: 'yes'})

df['st_slope'] = df['st_slope'].map({0: 'upsloping',
                                     1: 'flat',
                                     2: 'downsloping'})

# Rows with thal == 0 have no entry in this mapping, so map() turns them into NaN
df['thalassemia'] = df['thalassemia'].map({1: 'normal',
                                           2: 'fixed defect',
                                           3: 'reversible defect'})

In [None]:
df.head()

In [None]:
df.dtypes

# Data Visualization

In [None]:
sns.displot(data=df, x='age', col='sex', hue='target', kind='kde')

In [None]:
sns.displot(data=df, x='age', col='chest_pain_type', hue='target')

In [None]:
sns.displot(df['resting_blood_pressure'], kde=True)

In [None]:
sns.displot(df['cholesterol'], kde=True)

There is an outlier with cholesterol above 500!

In [None]:
df.loc[df['cholesterol'] > 500]
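
Instead of picking the threshold by eye, one common rule of thumb is to flag values more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch, using made-up cholesterol-like values rather than the dataset itself:

```python
import pandas as pd

# Hypothetical cholesterol-like values (mg/dl), not taken from the dataset
values = pd.Series([210, 230, 245, 250, 260, 275, 300, 560])

# Quartiles and interquartile range
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```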

In [None]:
sns.displot(df['max_heart_rate_achieved'], kde=True)

In [None]:
sns.displot(df['st_depression'], kde=True)

In [None]:
sns.countplot(x='num_major_vessels', data=df)

In [None]:
sns.boxplot(x='chest_pain_type', y='age', data=df)

In [None]:
sns.pairplot(df, hue='target')

In [None]:
colormap = plt.cm.RdBu
plt.figure(figsize=(12,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
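# Only numeric columns are used: the mapped string columns cannot be correlated
sns.heatmap(df.select_dtypes(include='number').corr(), linewidths=0.1, vmax=1.0,
            square=True, cmap=colormap, linecolor='white', annot=True)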

One-hot encoding our dataset

In [None]:
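# dtype=int keeps the dummy columns as 0/1 instead of booleans on recent pandas versions
onehot_df = pd.get_dummies(df, dtype=int)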

In [None]:
onehot_df.head()
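
To make it clearer what the encoding does, here is a small sketch on a toy DataFrame (the three rows are invented for illustration): string columns are expanded into one indicator column per category, while numeric columns pass through unchanged.

```python
import pandas as pd

# Toy frame with one categorical (string) column; values are made up
toy = pd.DataFrame({
    'age': [63, 45, 58],
    'sex': ['M', 'F', 'M'],
    'target': [1, 0, 1],
})

# Only the string column is expanded; 'age' and 'target' are left as they are
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded)
```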

In [None]:
y = onehot_df.target.values
x = onehot_df.drop('target', axis=1).values

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=5)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import KFold, cross_val_score

# Testing Classifiers

10-fold cross-validation to evaluate the performance of some algorithms

In [None]:
# Evaluate Models

n_folds = 10
models = []


# Scaling features
scaler = StandardScaler()
scaler.fit(x_train)

scaled_x_train = scaler.transform(x_train)
scaled_x_test = scaler.transform(x_test)


models.append(('LR', LogisticRegression()))
models.append(('Tree', DecisionTreeClassifier()))
models.append(('Forest', RandomForestClassifier()))
models.append(('XGB', XGBClassifier(use_label_encoder=False, eval_metric='logloss')))
models.append(('NB', GaussianNB()))

for name, model in models:
    kfold = KFold(n_splits=n_folds)
    cv_results = cross_val_score(model, scaled_x_train, y_train, cv=kfold, scoring='accuracy')
    print("%6s %.3f %.3f " % (name, cv_results.mean(), cv_results.std()))
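
In the loop above the scaler is fitted once on the whole training set before cross-validation, so each validation fold has already influenced the scaling. A possible variation, sketched here on synthetic data from `make_classification` (a stand-in, not the heart disease data), is to put the scaler and the model in a pipeline so the scaler is refitted inside every fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (x_train, y_train); not the heart disease data
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# The scaler is fitted only on the training part of each fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_demo, y_demo, cv=kfold, scoring='accuracy')

print("%.3f %.3f" % (scores.mean(), scores.std()))
```

On a dataset this small the difference is likely to be minor, but the pipeline version keeps the validation folds untouched by preprocessing.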

Testing Random Forest. We could do a randomized search with [sklearn's randomized search](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) and try to find the optimal hyperparameters for our case, but I won't cover this in this notebook.
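
For reference, a randomized search could look roughly like the sketch below. The search space in `param_distributions` is a hypothetical choice for illustration, and the data is a synthetic stand-in from `make_classification`, so the printed parameters say nothing about the best settings for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for (x_train, y_train); not the heart disease data
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical search space, chosen only for illustration
param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 8, None],
    'min_samples_leaf': [1, 2, 4],
}

# Sample 5 random combinations and score each with 3-fold cross-validation
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring='accuracy',
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```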

In [None]:
rf = RandomForestClassifier(max_depth=5, n_estimators=100)

# Train the model on training data
rf.fit(x_train, y_train)

In [None]:
pred = rf.predict(x_test).round()
print('Random Forest')
print("Test Accuracy: %.2f" % ((pred == y_test).mean()))

In [None]:
cm = confusion_matrix(y_test, pred)
sns.heatmap(cm, square=True, annot=True)
plt.title('Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')

In [None]:
print(classification_report(y_test, pred))

In [None]:
feature_importance = rf.feature_importances_
onehot_df_x = onehot_df.drop('target', axis=1)
names = [col for col in onehot_df_x.columns[feature_importance.argsort()[::-1]]]

plt.figure(figsize=(10,10))
sns.barplot(y=names, x=np.sort(feature_importance)[::-1], orient='h').set_title('Feature Importance')

# Dimensionality Reduction (PCA)

Let's try to find some clusters in our data so we can try other approaches. We will decompose our data into principal components, keeping enough of them to retain 90% of the variance.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.9)
pc = pca.fit_transform(scaled_x_train)

In [None]:
plt.figure(figsize=(8,4))
sns.barplot(x=np.array(range(pca.n_components_)), y=pca.explained_variance_ratio_)
plt.xlabel("Components")
plt.ylabel("Variance")
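
To see where the 90% threshold comes from, the cumulative explained variance can be computed by hand and compared with what `PCA(n_components=0.9)` selects. This is a small sketch on synthetic data (a stand-in from `make_classification`, not our scaled features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled_x_train; not the heart disease data
X_demo, _ = make_classification(n_samples=200, n_features=10, random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit with all components and accumulate the explained variance ratios
pca_full = PCA(svd_solver='full').fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 90%
n_needed = int(np.searchsorted(cumulative, 0.9, side='right') + 1)

# Passing a float lets PCA pick that number of components itself
pca_90 = PCA(n_components=0.9).fit(X_scaled)

print(n_needed, pca_90.n_components_)
```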

In [None]:
# Points are colored by the target label
plt.scatter(pc[:, 0], pc[:, 1], c=y_train)
plt.title("PCA")
plt.xlabel("1st Component")
plt.ylabel("2nd Component")
plt.show()

With only 2 components we have some noise in the center but it's already looking pretty distinguishable! Let's see with 3 components...

In [None]:
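# The Axes3D import is only needed on older Matplotlib versions to register the 3D projection
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401

fig = plt.figure(figsize=(8,8))
# Recent Matplotlib versions no longer attach Axes3D(fig) to the figure automatically
ax = fig.add_subplot(projection='3d')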
ax.scatter(pc[:, 0], pc[:, 1], pc[:, 2], c=y_train)
ax.set_xlabel('1st Component')
ax.set_ylabel('2nd Component')
ax.set_zlabel('3rd Component')
ax.set_title('PCA')

# K-Nearest Neighbours

Now we're going to test the performance of the KNN algorithm using our PCA decomposition, since we could see clusters and a fairly clear separation between the target labels.

For comparison, let's train our KNN with the original dataset, and then use the PCA decompositions to see if we can get any better.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error
# K-Nearest-Neighbors

print('  k     Accuracy      MSE_In        MSE_Out')
print('--------------------------------------------')
for k in range(1, 30, 2):
    knn = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    knn = knn.fit(scaled_x_train, y_train)

    y_train_predict = knn.predict(scaled_x_train)
    y_test_predict  = knn.predict(scaled_x_test)

    acc = (y_test_predict == y_test).mean()
    mse_in  = mean_squared_error(y_train, y_train_predict)
    mse_out = mean_squared_error(y_test, y_test_predict)
    
    
    print("%3d %10.2f %13.4f  %12.4f" % (k , acc, mse_in , mse_out))

And now with PCA

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error
# K-Nearest-Neighbors

print('  k     Accuracy      MSE_In        MSE_Out')
print('--------------------------------------------')
for k in range(1, 30, 2):
    knn = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    knn = knn.fit(pc, y_train)

    y_train_predict = knn.predict(pc)
    y_test_predict  = knn.predict(pca.transform(scaled_x_test))

    acc = (y_test_predict == y_test).mean()
    mse_in  = mean_squared_error(y_train, y_train_predict)
    mse_out = mean_squared_error(y_test, y_test_predict)
    
    
    print("%3d %10.2f %13.4f  %12.4f" % (k , acc, mse_in , mse_out))

An increase of 2% in Test Accuracy when using the PCA components!

# K-Means Clustering

Now, let's see the performance of a basic KMeans with our PCA components.

In [None]:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2)
km.fit(pc)

scaled_x_test = scaler.transform(x_test)
test_pcs = pca.transform(scaled_x_test)

predicts = km.predict(test_pcs)

acc1 = (predicts == y_test).mean()
acc2 = (predicts == np.logical_not(y_test)).mean()

print('K-Means')
print('Test Accuracy: %.2f' % max(acc1, acc2))
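
Taking `max(acc1, acc2)` is needed because the cluster ids returned by K-Means are arbitrary: cluster 0 is not necessarily the 'Healthy' class. A slightly different approach, sketched below on synthetic blobs (a stand-in for our PCA components), is to name each cluster after the majority training label inside it and then score the test set with that fixed mapping. Note that this uses the training labels, so it is a variation on the cell above rather than the same procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic two-class stand-in for the PCA components; not the heart disease data
X_demo, y_demo = make_blobs(n_samples=200, centers=2, random_state=0)
X_tr, y_tr = X_demo[:150], y_demo[:150]
X_te, y_te = X_demo[150:], y_demo[150:]

km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_tr)

# Map each cluster id to the most frequent training label inside that cluster
train_clusters = km_demo.predict(X_tr)
cluster_to_label = {
    c: int(np.bincount(y_tr[train_clusters == c]).argmax())
    for c in np.unique(train_clusters)
}

# Translate test cluster ids into labels and compute the accuracy
test_pred = np.array([cluster_to_label[c] for c in km_demo.predict(X_te)])
acc = (test_pred == y_te).mean()

print('Test Accuracy: %.2f' % acc)
```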

# Conclusion

All the models we tested gave good results, and the PCA decomposition gave us a nice view of how the two target classes separate.