# Heart Disease Classification

This notebook uses the stroke prediction dataset to build a model that predicts heart disease. The analysis starts with simple exploration, then fits a simple logistic regression model, and finally trains a random forest model that achieves 95% accuracy!

In [None]:
import pandas as pd 
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
filepath = '../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv'
stroke_raw = pd.read_csv(filepath)
stroke_raw.head()

## Exploring the data

In [None]:
plt.figure(figsize=(9, 5))
sns.histplot(x="bmi", data=stroke_raw).set_title("BMI Distribution");

In [None]:
# value count plots
sns.set(font_scale=2)
plt.figure(figsize=(9, 5))
sns.countplot(x="heart_disease", data=stroke_raw).set_title("Heart Disease");

Because the number of patients with heart disease is so small, we need to either downsample the majority class or create more data for individuals with heart disease to train the model. We use the `SMOTE` algorithm to synthesize minority-class samples and create a more balanced dataset. We also preprocess the data for training.
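
The core idea behind SMOTE can be sketched in plain Python: pick a minority-class sample, pick one of its nearest minority-class neighbours, and place a new point somewhere on the line segment between them. This is a simplified illustration and not the `imblearn` implementation; the function name `smote_sketch`, the toy points, and `k=2` are assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position along the segment from base to neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy 2-D minority samples (hypothetical, not taken from the dataset)
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Every synthetic point lies between two real minority samples, so the new data stays inside the region the minority class already occupies rather than duplicating existing rows.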

In [None]:
from sklearn.model_selection import train_test_split, RepeatedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from imblearn.over_sampling import SMOTE

# Drop the target and the row id, then fill missing numeric values with the column median
X = stroke_raw.drop(["heart_disease", "id"], axis=1)
X = X.fillna(X.median(numeric_only=True))

# Encode each categorical column as integer labels
for col in ["gender", "ever_married", "work_type", "Residence_type", "smoking_status"]:
    X[col] = LabelEncoder().fit_transform(X[col])

# Standardize features to zero mean and unit variance
sclr = StandardScaler()
X = sclr.fit_transform(X)

y = stroke_raw["heart_disease"]

# Oversample the minority class. Note: this runs before the train/test split,
# so the test set also contains synthetic samples.
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)

train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=44, stratify=y)

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=22)

In [None]:
y.value_counts() # ensuring even samples

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(train_X, train_y)

In [None]:
scores = cross_val_score(model, train_X, train_y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score

preds = model.predict(test_X)
confusion_matrix(test_y, preds)

Perhaps a different model can achieve a higher accuracy.

## Random Forest

For the random forest model we run a randomized search over a hyperparameter grid to get close to the optimal hyperparameters. We then check the accuracy with cross-validation and on the test set.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# Number of features to consider at every split
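# ('auto' is no longer accepted by recent scikit-learn releases; for classifiers it was the same as 'sqrt')
max_features = ['sqrt', 'log2']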

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)


In [None]:
# Making a Random Forest Classifier 
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(random_state=33, n_jobs=-1)
# rf_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
# rf_clf.fit(train_X, train_y)

rf_random_search = RandomizedSearchCV(
    estimator=rf_clf, 
    param_distributions=random_grid, 
    n_iter = 50, 
    cv=3, 
    verbose=2, 
    random_state=32, 
    n_jobs = -1)
# Fit the random search model
rf_random_search.fit(train_X, train_y)
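
To get a feel for how much of the grid a randomized search with `n_iter = 50` actually visits, the sketch below counts every combination and then draws 50 of them without replacement using only the standard library. It is a rough illustration of the sampling idea, not what `RandomizedSearchCV` does internally; the two `max_features` candidates are written as `'sqrt'` and `'log2'` here, which is an assumption, and the other values mirror the grid above.

```python
import itertools
import random

# Candidate values mirroring the grid used in this notebook
grid = {
    "n_estimators": list(range(200, 2001, 200)),      # 10 values
    "max_features": ["sqrt", "log2"],                 # 2 values (assumed names)
    "max_depth": list(range(10, 111, 10)) + [None],   # 12 values
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

keys = list(grid)
# Every possible combination of hyperparameter values
all_combos = list(itertools.product(*(grid[k] for k in keys)))
total = len(all_combos)

# Draw n_iter distinct combinations at random
n_iter = 50
rng = random.Random(32)
sampled = [dict(zip(keys, combo)) for combo in rng.sample(all_combos, n_iter)]

print(f"{total} combinations in the grid; {n_iter} sampled ({n_iter / total:.1%})")
print(sampled[0])
```

The grid holds 4320 combinations, so 50 iterations cover roughly 1.2% of it. That is the trade-off of a randomized search: far fewer fits than an exhaustive grid search, at the cost of possibly missing the best combination.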

In [None]:
print(rf_random_search.best_params_)

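# Cross-validate the tuned model itself rather than re-running the whole search in every fold
cv = RepeatedKFold(n_splits=3, n_repeats=2, random_state=21)

scores = cross_val_score(rf_random_search.best_estimator_, train_X, train_y, scoring='accuracy', cv=cv, n_jobs=-1)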
# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

In [None]:
preds = rf_random_search.predict(test_X)
confusion_matrix(test_y, preds)

In [None]:
acc = accuracy_score(test_y, preds)
print(f'Test set accuracy score was: {acc:.3f}')

## Conclusion

And there we go! Our random forest model achieved 95% accuracy on the test split and does not appear to overfit the training set.