For this dataset, I'm going to use two classification methods: Logistic Regression and Random Forests. I will compare the results of each method with standard model evaluation metrics (ROC AUC, accuracy, etc.).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report, roc_auc_score, roc_curve

import seaborn as sns
sns.set(style="whitegrid")

In [None]:
df = pd.read_csv('../input/health-care-data-set-on-heart-attack-possibility/heart.csv')
df.head()

In [None]:
# Check every column for missing values, including any object-typed ones
df.isnull().sum()

No missing values. How nice!

In [None]:
df.dtypes

And no string-typed categorical variables to worry about encoding. Let's jump into some preliminary data viz.

In [None]:
plt.figure(figsize = (10,7))
ax1 = sns.countplot(x = 'target', data = df, palette = ["C2", "C3"])
ax1.set_xticklabels(["Low","High"])
plt.title("Heart Attack Chance Patient Counts", weight = 'bold', fontsize = 15)
plt.xlabel('Heart Attack Chance')
plt.ylabel("Patient Count")

Looks like our target variable is not severely imbalanced, with a good number of both positive and negative outcomes.

In [None]:
plt.figure(figsize = (8,5))
ax2 = sns.countplot(x = 'sex', data = df, palette = ["C2", "C3"], hue = 'target')
ax2.set_xticklabels(["Female","Male"])
plt.title("Heart Attack Chance by Sex", weight = 'bold', fontsize = 15)
plt.xlabel('Sex')
plt.ylabel("Patient Count")
plt.legend(title = "Heart Attack Chance", labels=['Low', 'High'], loc = 'upper left')
plt.show()

plt.figure(figsize = (8,5))
ax2 = sns.countplot(x = 'sex', data = df)
ax2.set_xticklabels(["Female","Male"])
plt.title("Sex Sampling Counts", weight = 'bold', fontsize = 15)
plt.xlabel('Sex')
plt.ylabel("Patient Count")
plt.show()
#plt.legend(title = "Heart Attack Chance",labels=['Low', 'High'], loc = 'upper left')

Many more men were sampled than women

In [None]:
plt.figure(figsize= (8,5))
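# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(df['age'], kde = True)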
plt.title("Heart Attack Dataset Age Distribution", weight = 'bold', fontsize = 15)

Let's examine the linear correlations between the dataset features

In [None]:
corr = df.corr()

plt.figure(figsize = (18,18))
sns.heatmap(corr, annot = True, cmap = 'coolwarm', vmin = -1, vmax=1)

With respect to the target variable, chest pain type, max heart rate, angina, and ST depression have the strongest correlations. Interestingly, serum cholesterol has one of the lowest correlation coefficients.
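The heatmap is large, so it can help to pull out just the target column and sort it. Below is a minimal sketch of that idea. The helper name `rank_by_target_correlation` and the `toy` frame are made up for illustration and are not part of the heart dataset; on the real data you would call it as `rank_by_target_correlation(df)`.

```python
import pandas as pd


def rank_by_target_correlation(frame, target="target"):
    """Return each feature's Pearson correlation with the target,
    ordered from strongest to weakest by absolute value."""
    corr_with_target = frame.corr()[target].drop(target)
    order = corr_with_target.abs().sort_values(ascending=False).index
    return corr_with_target.loc[order]


# Tiny made-up frame, for illustration only
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [6, 5, 4, 3, 2, 1],
    "c": [3, 1, 4, 2, 6, 5],
    "target": [0, 0, 0, 1, 1, 1],
})

ranked = rank_by_target_correlation(toy)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top, which matters here because some features move in the opposite direction to the target.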

In [None]:
X = df.drop('target', axis = 1)
y = df['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42, shuffle = True)

# Random Forest Classifier

In [None]:
rf_model = RandomForestClassifier(random_state = 42)

In [None]:
def full_report(y_test,y_hat,y_hat_probs,name = ''):
    if name != '':
        print(name)
    print(classification_report(y_test, y_hat))
    print("ROC AUC = ",roc_auc_score(y_test, y_hat_probs),'\n\n')
    
def roc_plot_label():
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC curve')
    plt.legend(loc="best")
    

In [None]:
rf_model.fit(X_train, y_train)
yhat_forest = rf_model.predict(X_test)

confusion_matrix = pd.crosstab(y_test, yhat_forest, rownames=['Actual'], colnames=['Predicted'])
sns.heatmap(confusion_matrix, annot=True, fmt = 'g', cmap = 'Reds')
plt.show()

yhat_forest_probs = rf_model.predict_proba(X_test)
yhat_forest_probs = yhat_forest_probs[:,1]

full_report(y_test,yhat_forest, yhat_forest_probs, name = "Base Model")

fpr, tpr, _ = roc_curve(y_test, yhat_forest_probs)

plt.figure(figsize=(10,7))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label = "Base", color = "Blue")
roc_plot_label()
plt.show()

# Random Forest Tuning

We'll try some hyperparameter tuning, starting with RandomizedSearchCV. First, let's see the parameters available in RandomForestClassifier.

In [None]:
print(rf_model.get_params())

There's a lot going on here. I'll limit the search to the parameters in the grid below. Each list gives the candidate values for one parameter.

In [None]:
random_grid = {'max_depth': [5,10,25,50,100,250,500,None],
               'max_features': ['sqrt', 'log2', None],  # 'auto' was an alias for 'sqrt' and is rejected by recent scikit-learn releases
               'min_samples_leaf': np.linspace(0.1, 0.5, 5, endpoint=True),
               'min_samples_split': np.linspace(0.1, 1.0, 10, endpoint=True),
               'n_estimators': [2, 4, 8, 16, 32, 64, 100, 200, 500]}

We'll now use RandomizedSearchCV to randomly search through these parameter ranges for a given number of iterations.
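To get a feel for what "randomly search" means, here is a small sketch using scikit-learn's `ParameterGrid` and `ParameterSampler`. The `toy_grid` below is a deliberately tiny, hypothetical grid (not the one used in this notebook), and the iteration count of 5 is arbitrary.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical, deliberately small grid (not the notebook's random_grid)
toy_grid = {
    "max_depth": [5, 10, None],
    "max_features": ["sqrt", "log2"],
    "n_estimators": [50, 100, 200],
}

# An exhaustive search would try every combination: 3 * 2 * 3
n_combinations = len(ParameterGrid(toy_grid))

# A randomized search only tries n_iter of them, drawn at random
sampled = list(ParameterSampler(toy_grid, n_iter=5, random_state=42))

print("Combinations in the full grid:", n_combinations)
print("Combinations sampled:", len(sampled))
for params in sampled:
    print(params)

# Each sampled combination is fitted once per cross-validation fold
cv_folds = 5
print("Model fits with 5-fold CV:", len(sampled) * cv_folds)
```

The trade-off is coverage versus run time: the randomized search costs `n_iter * cv` fits regardless of how large the grid is, while an exhaustive search grows with every value added to any list.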

In [None]:
from sklearn.model_selection import RandomizedSearchCV

rf_random = RandomizedSearchCV(estimator = rf_model, param_distributions = random_grid, n_iter = 100, cv = 5, random_state = 42, n_jobs = -1, verbose = 2)
rf_random.fit(X_train, y_train)


We've got some parameters from our random search. Let's fit the RandomForestClassifier again, this time with the best parameters found by RandomizedSearchCV.

In [None]:
best = rf_random.best_params_
rf_random.best_params_

In [None]:
rf_model2 = RandomForestClassifier(random_state = 42, 
                                   n_estimators = best['n_estimators'], 
                                   min_samples_split = best['min_samples_split'], 
                                   min_samples_leaf = best['min_samples_leaf'], 
                                   max_features = best['max_features'], 
                                   max_depth = best['max_depth'])
rf_model2.fit(X_train, y_train)
yhat_forest2 = rf_model2.predict(X_test)

confusion_matrix = pd.crosstab(y_test, yhat_forest2, rownames=['Actual'], colnames=['Predicted'])
sns.heatmap(confusion_matrix, annot=True, fmt = 'g', cmap = 'Reds')
plt.show()

yhat_forest_probs2 = rf_model2.predict_proba(X_test)
yhat_forest_probs2 = yhat_forest_probs2[:,1]

full_report(y_test,yhat_forest,yhat_forest_probs, name = "Base Model")
full_report(y_test,yhat_forest2,yhat_forest_probs2, name = "RandomSearchCV Tuned Model")

fpr2, tpr2, _ = roc_curve(y_test, yhat_forest_probs2)

plt.figure(figsize=(10,7))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label = "Base", color = 'blue')
plt.plot(fpr2, tpr2, label = "RandomTuned", color = 'green')
roc_plot_label()
plt.show()

It's a pretty decent bump in accuracy, though AUC fell very slightly.
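Accuracy going up while AUC goes down can look odd at first. Accuracy only looks at predictions after thresholding at 0.5, while AUC looks at how well the scores rank positives above negatives. The sketch below uses made-up labels and scores (not the notebook's results) and two hypothetical helper functions to show one model winning on accuracy and the other on AUC.

```python
def accuracy_at_threshold(y_true, scores, threshold=0.5):
    """Fraction of correct predictions after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs in which the
    positive example gets the higher score (ties count as half)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only
y_toy = [0, 0, 0, 1, 1, 1]
scores_a = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45]  # perfect ranking, but all below 0.5
scores_b = [0.10, 0.20, 0.60, 0.55, 0.80, 0.90]  # one ranking mistake, better thresholded

for name, scores in [("A", scores_a), ("B", scores_b)]:
    print(name,
          "accuracy =", round(accuracy_at_threshold(y_toy, scores), 3),
          "AUC =", round(pairwise_auc(y_toy, scores), 3))
```

Model A ranks every positive above every negative but never crosses the 0.5 threshold, so its accuracy is poor. Model B makes one ranking mistake yet classifies more patients correctly.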

Let's try using GridSearchCV.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid={'max_depth': [24, 25, 28, 32, None],
            'max_features': ['sqrt', 'log2', None],  # 'auto' was an alias for 'sqrt'
            'min_samples_leaf': [1, 2, 3],
            'min_samples_split': [2, 3],  # integer values must be at least 2
            'n_estimators': [100, 300, 500]}

In [None]:
grid = GridSearchCV(rf_model, param_grid = param_grid, cv = 5, verbose=2, n_jobs=-1)
grid.fit(X_train, y_train)

In [None]:
best2 = grid.best_params_
best2

Now, as before, we use the best parameters found in the search for a new RandomForest model

In [None]:
rf_model3 = RandomForestClassifier(random_state = 42, 
                                   n_estimators = best2['n_estimators'], 
                                   min_samples_split = best2['min_samples_split'], 
                                   min_samples_leaf = best2['min_samples_leaf'], 
                                   max_features = best2['max_features'], 
                                   max_depth = best2['max_depth'])
rf_model3.fit(X_train, y_train)
yhat_forest3 = rf_model3.predict(X_test)

confusion_matrix = pd.crosstab(y_test, yhat_forest3, rownames=['Actual'], colnames=['Predicted'])
sns.heatmap(confusion_matrix, annot=True, fmt = 'g', cmap = 'Reds')
plt.show()

yhat_forest_probs3 = rf_model3.predict_proba(X_test)
yhat_forest_probs3 = yhat_forest_probs3[:,1]

full_report(y_test,yhat_forest,yhat_forest_probs, name = "Base Model")
full_report(y_test,yhat_forest2,yhat_forest_probs2, name = "RandomSearchCV Tuned Model")
full_report(y_test,yhat_forest3,yhat_forest_probs3, name = "GridSearchCV Tuned Model")

fpr3, tpr3, _ = roc_curve(y_test, yhat_forest_probs3)

plt.figure(figsize=(10,7))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label = "Base", color = 'blue')
plt.plot(fpr2, tpr2, label = "RandomTuned", color = 'green')
plt.plot(fpr3, tpr3, label = "GridSearch", color = 'purple')
roc_plot_label()
plt.show()

The accuracy gain is smaller than with the randomized search, but AUC is slightly higher than the base model's.

Finally, we'll take a look at the precision-recall curves
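As a reminder of what each point on a precision-recall curve represents, here is a small sketch that computes precision and recall at a few thresholds by hand. The labels, scores and the helper `precision_recall_at` are made up for illustration; the real curves come from `precision_recall_curve`.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    # Convention assumed here: precision is 1.0 when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels and scores, for illustration only
y_toy = [0, 0, 1, 0, 1, 1]
scores_toy = [0.10, 0.40, 0.45, 0.60, 0.70, 0.90]

for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall_at(y_toy, scores_toy, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Lowering the threshold catches more true positives (higher recall) at the cost of more false alarms (lower precision), which is the trade-off the curve traces out.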

In [None]:
# plot_precision_recall_curve was removed from recent scikit-learn releases and is not used below
from sklearn.metrics import precision_recall_curve, auc

precision_base, recall_base, _ = precision_recall_curve(y_test, yhat_forest_probs)
precision_r, recall_r, _ = precision_recall_curve(y_test, yhat_forest_probs2)
precision_g, recall_g, _ = precision_recall_curve(y_test, yhat_forest_probs3)

plt.figure(figsize=(10,7))
# Pass x and y by keyword; recent seaborn releases no longer accept them positionally
sns.lineplot(x = recall_base, y = precision_base, label="Base Model", color = 'blue', ci = None)
sns.lineplot(x = recall_r, y = precision_r, label="RandomSearchCV Tuning", color = 'green', ci = None)
sns.lineplot(x = recall_g, y = precision_g, label="GridSearchCV Tuning", color = 'purple', ci = None)

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curves", weight ='bold', fontsize = 15)
plt.legend(loc="best")

auc_score = auc(recall_base, precision_base)
auc_score_r = auc(recall_r, precision_r)
auc_score_g = auc(recall_g, precision_g)

print("P-R AUC (Base Model):", auc_score)
print("P-R AUC (RandomSearchCV):", auc_score_r)
print("P-R AUC (GridSearchCV):", auc_score_g)

# Logistic Regression

We'll need to scale our data before fitting the Logistic Regression. We'll create a pipeline to house both the StandardScaler transformation and the LogisticRegression estimator.
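One reason a pipeline is handy: the scaler's mean and standard deviation are learned from the training data only and then reused on the test data. Here is a minimal sketch on synthetic data (the arrays and the `demo_pipe` name are made up, not the heart dataset) that checks the pipeline's scaling against the formula done by hand.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data on very different scales, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[50.0, 200.0], scale=[10.0, 40.0], size=(200, 2))
y_demo = (X_demo[:, 0] + 0.1 * X_demo[:, 1] > 70).astype(int)
X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr = y_demo[:150]

demo_pipe = Pipeline([("scaler", StandardScaler()), ("LR", LogisticRegression())])
demo_pipe.fit(X_tr, y_tr)

# Standardise the test rows by hand using the TRAINING mean and std
manual_test = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

# The fitted scaler inside the pipeline should give the same numbers
fitted_scaler = demo_pipe.named_steps["scaler"]
pipeline_test = fitted_scaler.transform(X_te)

print(np.allclose(manual_test, pipeline_test))
```

Calling `predict` or `predict_proba` on the pipeline applies this same training-based scaling automatically, so the test set never influences the transformation.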

In [None]:
from sklearn.pipeline import Pipeline

steps = [('scaler', StandardScaler()), ('LR', LogisticRegression())]
pipe = Pipeline(steps)

pipe.fit(X_train, y_train)

In [None]:
y_hat_lr = pipe.predict(X_test)
y_hat_lr_probs = pipe.predict_proba(X_test)
y_hat_lr_probs = y_hat_lr_probs[:,1]

full_report(y_test,yhat_forest2,yhat_forest_probs2, name = "RandomSearchCV Tuned Model")
full_report(y_test,yhat_forest3,yhat_forest_probs3, name = "GridSearchCV Tuned Model")
full_report(y_test,y_hat_lr,y_hat_lr_probs, name = 'Logistic Regression')

fpr_lr, tpr_lr, _ = roc_curve(y_test, y_hat_lr_probs)

plt.figure(figsize=(10,7))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_lr, tpr_lr,color = 'red', label = "Logistic Regression")
plt.plot(fpr2, tpr2,color = 'green', label = "RandomTuned RF")
plt.plot(fpr3, tpr3, color = 'purple', label = "GridSearch RF")
roc_plot_label()
plt.show()


In [None]:
precision_lr, recall_lr, _ = precision_recall_curve(y_test, y_hat_lr_probs)

plt.figure(figsize=(10,7))
# Pass x and y by keyword; recent seaborn releases no longer accept them positionally
sns.lineplot(x = recall_lr, y = precision_lr, color = 'red', label="Logistic Regression", ci = None)
sns.lineplot(x = recall_r, y = precision_r, color = 'green', label="RandomSearchCV Tuning", ci = None)
sns.lineplot(x = recall_g, y = precision_g, color = 'purple', label="GridSearchCV Tuning", ci = None)

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curves", weight ='bold', fontsize = 15)
plt.legend(loc="best")

auc_score_lr = auc(recall_lr, precision_lr)
auc_score_r = auc(recall_r, precision_r)
auc_score_g = auc(recall_g, precision_g)

print("P-R AUC (LogisticRegression):",auc_score_lr)
print("P-R AUC (RandomSearchCV):", auc_score_r)
print("P-R AUC (GridSearchCV):", auc_score_g)



We end with considerably lower accuracy and AUC compared to our tuned RandomForestClassifier models.

All of the categorical columns are encoded, and there are enough instances of each target variable outcome, so I will not tune further in these respects.

# Conclusion

From our RandomizedSearchCV-tuned RandomForest, we end up with a model with overall accuracy of 87%, weighted average F1 of 87%, and AUC = 0.908.

From our GridSearchCV-tuned RandomForest, we end up with a model with overall accuracy of 84%, weighted average F1 of 83%, and AUC = 0.911.

Finally, from our LogisticRegression, we end up with a model with overall accuracy of 81%, weighted average F1 of 81%, and AUC = 0.882.

A Random Forest Classifier is definitely the way to go between the two methods. Our GridSearchCV tuning resulted in the best AUC, but our RandomizedSearchCV tuning yielded higher accuracy and was much faster. The AUC of the RandomizedSearchCV-tuned model did drop slightly compared to the base model, however.