# Predict heart disease with a RandomForestClassifier

I used this [heart disease dataset from Kaggle](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset) to predict whether a person has heart disease with a RandomForestClassifier.

[You can see how I'd host this model in a low-cost/no-cost way on GCP](https://github.com/scottfrasso/host-a-model).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Load the dataset
heart_disease = pd.read_csv("heart-disease.csv")

heart_disease.head()

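| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |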


## Shuffle and split the data

In [13]:
np.random.seed(42)

# Shuffle the data
heart_disease_shuffled = heart_disease.sample(frac=1)

# Split into X & y
X = heart_disease_shuffled.drop("target", axis=1)
y = heart_disease_shuffled["target"]
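
Seeding with `np.random.seed` works here because pandas falls back to NumPy's global random state when `random_state` is not given, but the resulting order then depends on whatever else consumed random numbers earlier in the session. As an alternative sketch (not what the cell above does), the seed can be passed straight to `sample`. The `demo` frame is made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
demo = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})

# Shuffle all rows; random_state makes the permutation repeatable on its own
shuffled_a = demo.sample(frac=1, random_state=42)
shuffled_b = demo.sample(frac=1, random_state=42)

# Same seed, same row order
print(shuffled_a.index.equals(shuffled_b.index))

# Features and labels keep the same (shuffled) index after the split
X_demo = shuffled_a.drop("target", axis=1)
y_demo = shuffled_a["target"]
```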

## We'll use a RandomForestClassifier and GridSearchCV to tune the hyperparameters

Using GridSearchCV lets us automatically tune the hyperparameters of the RandomForestClassifier: it cross-validates every combination in the grid and keeps the one with the best mean cross-validated score.
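
To make the cost of an exhaustive search concrete, here is a minimal sketch that enumerates a grid with the same values as the one used in this notebook. It uses plain `itertools.product` for illustration and is not how scikit-learn implements the search internally; `cv_folds`, `candidates` and `n_fits` are names chosen for this sketch.

```python
from itertools import product

# Same hyperparameter values as the grid used in this notebook
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None],
    "max_features": ["auto", "sqrt"],
    "min_samples_split": [6],
    "min_samples_leaf": [1, 2],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate
names = list(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

# Each candidate is trained once per cross-validation fold
n_fits = len(candidates) * cv_folds
print(len(candidates), n_fits)
```

With 3 x 1 x 2 x 1 x 2 = 12 candidates and 5 folds, that is 60 model fits, plus one final refit of the best candidate on the full training set, since `refit=True` is the GridSearchCV default.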

In [8]:
%%capture
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

gs_grid_search_params = {
     'n_estimators': [100, 200, 500],
     'max_depth': [None],
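     'max_features': ['auto', 'sqrt'],  # 'auto' means 'sqrt' for classifiers; it was removed in scikit-learn 1.3, so drop it on newer versions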
     'min_samples_split': [6],
     'min_samples_leaf': [1, 2]
}

# Split into X & y
X = heart_disease_shuffled.drop("target", axis=1)
y = heart_disease_shuffled["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier(n_jobs=1)

# Set up GridSearchCV
gs_clf = GridSearchCV(estimator=clf,
                      param_grid=gs_grid_search_params, 
                      cv=5,
                      verbose=2)

# Fit the GridSearchCV version of clf
gs_clf.fit(X_train, y_train);

## The best parameters that GridSearchCV found

In [9]:
gs_clf.best_params_

{'max_depth': None,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 6,
 'n_estimators': 100}
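
Besides `best_params_`, a fitted GridSearchCV also exposes `best_score_` (the mean cross-validated score of the winning combination) and `best_estimator_` (that combination refit on the full training data). Here is a small sketch on synthetic data from `make_classification`, with a deliberately tiny grid so it runs quickly. The data, the grid and the `*_demo` names are assumptions for illustration, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)

# Tiny illustrative grid: 2 x 2 = 4 candidates
demo_grid = {"n_estimators": [10, 20], "min_samples_leaf": [1, 2]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=demo_grid,
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Mean CV accuracy of the best candidate: {search.best_score_:.3f}")

# The refit model can be used directly for predictions
best_model = search.best_estimator_
print(best_model.predict(X_demo[:5]))
```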

## The Results

With just the heart disease data, tuning the hyperparameters gives a decent accuracy score of 81.97% on the test set.

In [12]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

gs_y_preds = gs_clf.predict(X_test)

accuracy = accuracy_score(y_test, gs_y_preds)
print(f"Acc: {accuracy * 100:.2f}%")

precision = precision_score(y_test, gs_y_preds)
print(f"Precision: {precision:.2f}")

recall = recall_score(y_test, gs_y_preds)
print(f"Recall: {recall:.2f}")

f1 = f1_score(y_test, gs_y_preds)
print(f"F1 score: {f1:.2f}")

Acc: 81.97%
Precision: 0.77
Recall: 0.86
F1 score: 0.81
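
For reference, all four metrics can be derived from the counts of true/false positives and negatives. The sketch below does this by hand on a made-up set of labels (`y_true_demo` and `y_pred_demo` are illustrative, not the notebook's test set), and then checks that the rounded precision and recall printed above are consistent with the printed F1 score.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels for illustration only
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 1, 0]
demo_metrics = binary_metrics(y_true_demo, y_pred_demo)
print(demo_metrics)

# F1 is the harmonic mean of precision and recall; using the rounded
# values printed above (0.77 and 0.86) reproduces the printed 0.81
f1_from_reported = 2 * 0.77 * 0.86 / (0.77 + 0.86)
print(f"{f1_from_reported:.2f}")
```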
