# Heart Disease Simple Predictions: KNN, Decision Tree, Random Forest

## Introduction
The term “heart disease” refers to several types of heart conditions. Sometimes heart disease is “silent” and is not diagnosed until a person experiences signs or symptoms of a heart attack, heart failure, or an arrhythmia. When these events happen, symptoms may include:
* Heart attack: Chest pain or discomfort, upper back or neck pain, indigestion, heartburn, nausea or vomiting, extreme fatigue, upper body discomfort, dizziness, and shortness of breath.
* Arrhythmia: Fluttering feelings in the chest (palpitations).
* Heart failure: Shortness of breath, fatigue, or swelling of the feet, ankles, legs, abdomen, or neck veins.

In [None]:
# for data processing
import numpy as np
import pandas as pd

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# for the model
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler


## Data

### Import Data

In [None]:
df = pd.read_csv('../input/heart-disease-uci/heart.csv')

Take a look at the data.

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.head()

Meaning of each data column:
* **age :** Person's age in years
* **sex :** Person's sex (1 = male, 0 = female)
* **cp :** Chest pain type experienced (1 = Typical angina, 2 = Atypical angina, 3 = Non-anginal pain, 4 = Asymptomatic)
* **trestbps :** Resting blood pressure in mmHg
* **chol :** Cholesterol measurement in mg/dl
* **fbs :** Fasting blood sugar > 120mg/dl (0 = <120mg/dl, 1 = >120mg/dl)
* **restecg :** Resting electrocardiographic results (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
* **thalach :** Maximum heart rate achieved
* **exang :** Exercise induced angina (1 = yes, 0 = no)
* **oldpeak :** ST depression induced by exercise relative to rest
* **slope :** The slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)
* **ca :** Number of major vessels (0-3) colored by fluoroscopy
* **thal :** Thalassemia (3 = Normal, 6 = Fixed defect, 7 = Reversible defect)
* **target :** Heart disease (0 = No, 1 = Yes)

### Exploratory Data Analysis

In [None]:
# correlation between every pair of columns in the data
plt.figure(figsize=(15,10))
ax = sns.heatmap(df.corr(), cmap="RdYlGn", annot = True, vmin = -1, vmax = 1)
ax.set_title("Correlation Heatmap")
plt.show()

In [None]:
# data distribution in each column
df.hist(figsize=(15,15))
plt.show()

### Data Processing

Create dummy columns for categorical variables.

In [None]:
catcol = ['sex','cp','fbs','restecg','exang','slope','ca','thal']
df_hd = pd.get_dummies(df, columns = catcol ) 
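
To see what the dummy encoding does, here is a minimal sketch on a toy frame. The values and the names `toy` and `toy_encoded` are made up for illustration and are not taken from the heart dataset.

```python
import pandas as pd

# Toy frame (hypothetical values): one categorical column and one numeric column
toy = pd.DataFrame({"cp": [0, 2, 1, 2], "age": [63, 41, 57, 50]})

# Each distinct value of 'cp' becomes its own indicator column (cp_0, cp_1, cp_2);
# columns not listed in `columns` are left unchanged
toy_encoded = pd.get_dummies(toy, columns=["cp"])

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

The original `cp` column is replaced by one indicator column per category, so the models no longer treat the category codes as ordered numbers.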

Apply standard scaling to the numeric columns, whose values vary over very different ranges.

In [None]:
standardScaler = StandardScaler()
scale_col = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
df_hd[scale_col] = standardScaler.fit_transform(df_hd[scale_col])
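
The cell above fits the scaler on the whole dataframe, so rows that later end up in the test set also contribute to the mean and standard deviation. One possible alternative, sketched below on synthetic data, is to put the scaler inside a scikit-learn pipeline so that it is fitted only on the training part of each fold. The data and the names `X_demo`, `y_demo`, `pipe` and `demo_scores` are assumptions for illustration, not part of this notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 5 numeric features
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 100).astype(int)

# The pipeline re-fits the scaler on the training part of every fold,
# then applies that fitted scaler to the held-out part
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
demo_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print(demo_scores.mean())
```

On a small dataset the difference is often minor, but this keeps the evaluation data out of the preprocessing step.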

In [None]:
df_hd.head()

Split the dataframe into X (the data) and y (the labels).

In [None]:
X = df_hd.drop(['target'], axis = 1)
y = df_hd['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 43, stratify = y)
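
The `stratify = y` argument keeps the share of each class roughly the same in the train and test sets. A small sketch on made-up labels (70 negatives, 30 positives, chosen only for illustration) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (assumption): 70 zeros and 30 ones
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=43, stratify=y_demo
)

# With stratification the share of positives is the same in both splits
print('Train positive share : ', y_tr.mean())
print('Test positive share : ', y_te.mean())
```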

## Modeling

### KNN
Predict the label of a data point by finding its 'k' closest labeled neighbors and taking a majority vote among their labels.
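
The idea can be sketched from scratch in a few lines. This is only an illustration on toy points, not how `KNeighborsClassifier` is implemented internally. The function `knn_predict` and the arrays `X_toy` and `y_toy` are hypothetical names, and Euclidean distance is an assumed choice.

```python
from collections import Counter

import numpy as np


def knn_predict(X_ref, y_ref, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest reference points."""
    # Euclidean distance from x_new to every reference point
    distances = np.linalg.norm(X_ref - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those neighbors
    votes = Counter(y_ref[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data: three points near (0, 0) labeled 0, three points near (1, 1) labeled 1
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.15, 0.15])))
print(knn_predict(X_toy, y_toy, np.array([0.95, 1.0])))
```

Because the vote depends on distances, features on larger scales dominate unless they are standardized first, which is why the scaling step matters for KNN.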

In [None]:
# search for the best 'k' in range 1 - 13
knn_scores = []
for k in range(1,14):
    knn_classifier = KNeighborsClassifier(n_neighbors = k)
    score = cross_val_score(knn_classifier, X, y, cv = 5)
    knn_scores.append(score.mean())

In [None]:
# plotting
plt.figure(figsize = (12,8))
plt.plot([k for k in range(1,14)], knn_scores, color = 'red')
plt.xticks([k for k in range(1,14)])
plt.title('KNN scores for different K values')
plt.xlabel('K value')
plt.ylabel('Score')
for k in range(1,14):
    # label each point with its k and the score rounded to 3 decimals
    plt.text(x = k, y = knn_scores[k-1], s = f'({k}, {knn_scores[k-1]:.3f})', fontsize = 8,
             bbox=dict(facecolor='red', alpha=0.5), horizontalalignment='center')
plt.show()

In [None]:
# Create KNN classifier using 6 neighbors
knn_classifier = KNeighborsClassifier(n_neighbors = 6)
knn_classifier.fit(X_train, y_train)
print('Train Score : ',knn_classifier.score(X_train, y_train))
print('Test Score : ',knn_classifier.score(X_test, y_test))

### Decision Tree
A sequence of if/else questions about individual features, where each path through the questions ends in a predicted label.
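
To make the if/else structure visible, a fitted tree can be printed as text rules with `sklearn.tree.export_text`. The sketch below uses the iris dataset bundled with scikit-learn as a stand-in (an assumption made so the example is self-contained); `demo_tree` and `rules` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in dataset (assumption): iris ships with scikit-learn
iris = load_iris()

# A shallow tree keeps the printed rules short
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(iris.data, iris.target)

# Each indented line is one threshold test on a single feature;
# leaf lines show the predicted class
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

Limiting `max_depth` limits how many questions can be asked in a row, which is the parameter searched over in the cells below.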

In [None]:
tree_scores = []
for depth in range(1,14):
    dt = DecisionTreeClassifier(max_depth = depth)
    score = cross_val_score(dt, X, y, cv = 5)
    tree_scores.append(score.mean())

In [None]:
plt.figure(figsize = (12,8))
plt.plot([tree for tree in range(1,14)], tree_scores, color = 'g')
plt.title("Decision Tree score for different max depth values")
plt.xlabel('Max Depth Value')
plt.ylabel('Score')
plt.xticks([tree for tree in range(1,14)])
for i in range(1,14):
    # label each point with its max depth and the score rounded to 3 decimals
    plt.text(x = i, y = tree_scores[i-1], s = f'({i}, {tree_scores[i-1]:.3f})',
            horizontalalignment = 'center', bbox = dict(facecolor = 'g', alpha = 0.5))
plt.show()

In [None]:
# Create Decision Tree Classifier using 3 max depth
dt = DecisionTreeClassifier(max_depth = 3)
dt.fit(X_train, y_train)
print('Train Score : ',dt.score(X_train, y_train))
print('Test Score : ',dt.score(X_test, y_test))

### Random Forest
An ensemble method that uses decision trees as base estimators. Each tree is trained on a different bootstrap sample (drawn with replacement) of the same size as the training set. Random forest adds more randomization than plain bagging: at each split, only a random subset of the features is considered. The final prediction combines the predictions of all trees.
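
The two sources of randomness can be sketched by hand with plain decision trees. This is a simplified illustration on synthetic data, not scikit-learn's actual implementation: it uses a hard majority vote, whereas `RandomForestClassifier` averages the trees' predicted probabilities. All names here (`X_demo`, `y_demo`, `members`, `forest_pred`) are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 15  # odd number, so a binary majority vote cannot tie
members = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement, same size as the data
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt": each split considers only a random subset of features
    member = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    member.fit(X_demo[idx], y_demo[idx])
    members.append(member)

# Majority vote over the trees, shape of all_votes is (n_trees, n_samples)
all_votes = np.array([m.predict(X_demo) for m in members])
forest_pred = (all_votes.mean(axis=0) > 0.5).astype(int)

print('Train accuracy of the hand-made forest : ', (forest_pred == y_demo).mean())
```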

In [None]:
rf_scores = []
for n in range(1,14):
    rf = RandomForestClassifier(n_estimators = n)
    score = cross_val_score(rf, X, y, cv=5)
    rf_scores.append(score.mean())

In [None]:
plt.figure(figsize = (12,8))
plt.plot([n for n in range(1,14)], rf_scores, color = 'b')
plt.title("Random Forest scores for different estimator values")
plt.xlabel('Estimator Value')
plt.ylabel('Score')
plt.xticks([n for n in range(1,14)])
for i in range(1,14):
    # label each point with its number of estimators and the score rounded to 3 decimals
    plt.text(x = i, y = rf_scores[i-1], s = f'({i}, {rf_scores[i-1]:.3f})',
            horizontalalignment = 'center', bbox=dict(facecolor='b', alpha=0.5))
plt.show()

In [None]:
# Create Random Forest Classifier using 13 estimators
rf = RandomForestClassifier(n_estimators = 13)
rf.fit(X_train, y_train)
print('Train Score : ',rf.score(X_train, y_train))
print('Test Score : ',rf.score(X_test, y_test))