# Task: Predict Patient Illness

The purpose of this task is to predict whether a patient has heart disease based on a set of symptoms. A secondary objective is to highlight the features that have the greatest impact. As it turns out, this dataset was uploaded to Kaggle twice. This submission was initially written for VolodymyrGavrysh's task, but since it also applies to a task on the other copy of the dataset, I will submit it there as well.



Special thanks to:
* VolodymyrGavrysh for both the task and for uploading the dataset to Kaggle: https://www.kaggle.com/volodymyrgavrysh/heart-disease
* And also Ronit for posting the dataset on Kaggle as well, https://www.kaggle.com/ronitf/heart-disease-uci
* Shoumik for the Task https://www.kaggle.com/shoumikgoswami

And the principal investigator responsible for the data collection at each institution:
* Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
* University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
* University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
* V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

In [None]:
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# First steps: simple EDA

Check the data to make sure it's clean, encoded, and free of missing values.
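
A per-column view is often more informative than a single yes/no answer. The sketch below uses a small made-up frame (the `toy` name and its values are illustrative, not taken from heart.csv) to show how missing values and columns that still need encoding could be listed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data; values are made up.
toy = pd.DataFrame({
    'age': [63, 37, np.nan, 56],
    'sex': [1, 1, 0, 1],
    'thal': ['normal', 'fixed', 'normal', None],
})

# Count missing values per column instead of a single True/False.
missing_per_column = toy.isnull().sum()

# Columns that are not numeric would still need encoding before modelling.
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()

print(missing_per_column)
print('Columns needing encoding:', non_numeric)
```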

In [None]:
data = pd.read_csv('/kaggle/input/heart-disease/heart.csv')
data.head()

The following is copied from the data source:

"
# Attribute Information:

    Age: Age
    Sex: Sex (1 = male; 0 = female)
    ChestPain: Chest pain (typical, asymptotic, nonanginal, nontypical)
    RestBP: Resting blood pressure
    Chol: Serum cholestoral in mg/dl
    Fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
    RestECG: Resting electrocardiographic results
    MaxHR: Maximum heart rate achieved
    ExAng: Exercise induced angina (1 = yes; 0 = no)
    Oldpeak: ST depression induced by exercise relative to rest
    Slope: Slope of the peak exercise ST segment
    Ca: Number of major vessels colored by flourosopy (0 - 3)
    Thal: (3 = normal; 6 = fixed defect; 7 = reversable defect)
    target: AHD - Diagnosis of heart disease (1 = yes; 0 = no)
"

In [None]:
data.describe()

In [None]:
data.isnull().values.any()

In [None]:
import seaborn as sns

In [None]:
corr = data.corr()
sns.heatmap(corr, cmap=sns.diverging_palette(150, 275, s=80, l=55, n=9))
# Not the prettiest plot, but I'm using a diverging colour palette so
# that it's easier to pick out the values closest to 0 around the target

# Second Step: model selection

Because the dataset is this small, I'll be experimenting with model complexity before feature selection/importance. Avoiding overfitting will be the main hurdle for this task.
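
With only a few hundred rows, a single 80/20 split can give a noisy score. One possible alternative, not used in this notebook, is k-fold cross-validation. The sketch below assumes synthetic data from `make_classification` in place of the heart data and swaps in `GradientBoostingClassifier`; both are illustrative choices rather than part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in roughly the size of the heart dataset (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=13,
                                     n_informative=5, random_state=0)

model = GradientBoostingClassifier(n_estimators=35, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Each fold is held out once, so every row contributes to validation.
cv_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='roc_auc')
print('Mean ROC AUC: %.3f (std %.3f)' % (cv_scores.mean(), cv_scores.std()))
```

The spread across folds gives a rough sense of how much a single split's score might move by chance.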

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
X = data.drop(['target'], axis=1)
y = data['target']

In [None]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1, test_size=.2)

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

est_range = range(5, 105, 5)
score_graph = {'ROC AUC Score': [], 'n_estimators': []}
for num in est_range:
    gbr = GradientBoostingRegressor(n_estimators=num, random_state=0)
    gbr.fit(train_X, train_y)
    ls_preds = gbr.predict(val_X)
    auc = roc_auc_score(val_y, ls_preds)
    print('ROC AUC Score with', num, 'estimators is: ', auc)
    score_graph['ROC AUC Score'].append(auc)
    score_graph['n_estimators'].append(num)

In [None]:
sns.lineplot(x=score_graph['n_estimators'], y=score_graph['ROC AUC Score'])

# When selecting an optimal parameter where there is a plateau
# I tend to pick the side of the plateau that has the lowest complexity.
# This helps prevent overfitting
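
The plateau rule described in the comments above can be written down as a small helper. This is a minimal sketch: the function name `simplest_near_best`, the `tolerance` parameter, and the demo scores are all hypothetical and only illustrate the idea of preferring the least complex setting that scores close to the best.

```python
def simplest_near_best(n_estimators, scores, tolerance=0.01):
    """Return the smallest n_estimators whose score is within `tolerance` of the best."""
    best = max(scores)
    candidates = [n for n, s in zip(n_estimators, scores) if s >= best - tolerance]
    return min(candidates)

# Made-up scores for illustration only; they plateau from 20 estimators onward.
demo_estimators = [5, 10, 15, 20, 25, 30]
demo_scores = [0.80, 0.85, 0.88, 0.905, 0.91, 0.908]

chosen = simplest_near_best(demo_estimators, demo_scores)
print(chosen)
```

How wide to make the tolerance is a judgment call; a wider tolerance favours simpler models at the cost of a slightly lower validation score.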



# Step Three: feature importance

Working backwards from a sufficient model, we will now use Permutation Importance to determine which features are adding signal and which are adding noise.

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

my_model = GradientBoostingRegressor(n_estimators=35).fit(train_X, train_y)

perm = PermutationImportance(my_model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = val_X.columns.tolist())

As we can see above, there are three strong features (ca, oldpeak, and thal). Slope, age, and sex have some relevance as well, but not enough to confirm that it isn't due to random chance.

Permutation importance works by shuffling the values in a single column, then re-scoring the already fitted model (without retraining it) and comparing its performance. A higher number indicates that leaving the column intact has a larger positive impact on the model. A negative permutation importance means the model actually scored better after the shuffle (so random noise was a better predictor than the original values). If we look at the largest negative of the set, 'exang', its value is -0.03, so that is where I'm drawing my line for feature value.
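
The shuffle-and-re-score idea can be sketched by hand in a few lines. This is an illustration under stated assumptions, not what eli5 does internally: the data is synthetic (generated so that the first two columns carry the signal), and the importance is measured as the drop in ROC AUC, which is a choice made here for consistency with the rest of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: with shuffle=False the 2 informative columns come first,
# and the remaining 3 columns are pure noise.
X_demo, y_demo = make_classification(n_samples=400, n_features=5, n_informative=2,
                                     n_redundant=0, shuffle=False, random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

model = GradientBoostingRegressor(n_estimators=35, random_state=0).fit(tr_X, tr_y)
baseline = roc_auc_score(va_y, model.predict(va_X))

rng = np.random.default_rng(1)
importances = []
for col in range(va_X.shape[1]):
    shuffled = va_X.copy()
    # Shuffle one column only; the fitted model is re-scored, not retrained.
    shuffled[:, col] = rng.permutation(shuffled[:, col])
    score = roc_auc_score(va_y, model.predict(shuffled))
    importances.append(baseline - score)

for col, imp in enumerate(importances):
    print('feature %d: drop in ROC AUC = %.3f' % (col, imp))
```

Noise columns should land near zero, sometimes slightly negative, which is the same pattern used above to decide where to draw the line.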

In [None]:
feat = ['ca', 'oldpeak', 'thal']
X = data[feat]
y = data['target']
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1, test_size=.2)

In [None]:
gbr = GradientBoostingRegressor(n_estimators=35, random_state=0)
gbr.fit(train_X, train_y)
ls_preds = gbr.predict(val_X)
print('ROC AUC Score is: ', roc_auc_score(val_y, ls_preds))

As a side note: at this point since we are using fewer features we also need to prune the complexity of the model to prevent overfitting.

In [None]:
est_range = range(5, 105, 5)
score_graph = {'ROC AUC Score': [], 'n_estimators': []}
# A coder with some foresight might have just created a function for this instead of copy-pasting my prior code
for num in est_range:
    gbr = GradientBoostingRegressor(n_estimators=num, random_state=0)
    gbr.fit(train_X, train_y)
    ls_preds = gbr.predict(val_X)
    auc = roc_auc_score(val_y, ls_preds)
    print('ROC AUC Score with', num, 'estimators is: ', auc)
    score_graph['ROC AUC Score'].append(auc)
    score_graph['n_estimators'].append(num)
sns.lineplot(x=score_graph['n_estimators'], y=score_graph['ROC AUC Score'])
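
As the comment in the cell above admits, the sweep could live in a function. Here is one way that might look; the name `sweep_n_estimators` and the synthetic demo data are assumptions made for illustration, while the loop body mirrors the cells in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def sweep_n_estimators(train_X, train_y, val_X, val_y, est_range=range(5, 105, 5)):
    """Fit one model per n_estimators value and return the validation ROC AUC for each."""
    rows = []
    for num in est_range:
        gbr = GradientBoostingRegressor(n_estimators=num, random_state=0)
        gbr.fit(train_X, train_y)
        rows.append({'n_estimators': num,
                     'ROC AUC Score': roc_auc_score(val_y, gbr.predict(val_X))})
    return pd.DataFrame(rows)

# Synthetic stand-in data (assumption), just to show the call.
X_demo, y_demo = make_classification(n_samples=300, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1, test_size=.2)
results = sweep_n_estimators(tr_X, tr_y, va_X, va_y, est_range=range(5, 30, 5))
print(results)
```

Returning a DataFrame keeps the scores and settings together, so the result can be passed straight to a plotting call.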

In [None]:
gbr = GradientBoostingRegressor(n_estimators=5, random_state=0)
gbr.fit(train_X, train_y)
ls_preds = gbr.predict(val_X)
print('ROC AUC Score is: ', roc_auc_score(val_y, ls_preds))

# Step Four: classification thresholds

Up to this point we've been using the area under the ROC curve to assess the scores that the model assigned to each individual. Now to change tack and predict which patients will have heart disease. This time I'm going to loop through thresholds and see how high we can set the threshold before we start seeing false negatives (where someone with heart disease would be told they are fine).

While not strictly speaking the goal of the initial task, one of the metrics that matters most to me is whether the model would tell someone they are fine, only for them to have a heart attack (basically I'm looking for an accurate yet pessimistic model).
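
The highest threshold that still produces no false negatives can also be read off directly: it is the lowest score given to any true positive case. The sketch below uses made-up labels and scores (not output from this notebook's model) to show that, along with how the confusion matrix counts unpack.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and scores for illustration; not output from the notebook's model.
true_y = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.30, 0.55, 0.62, 0.48, 0.35, 0.71, 0.66, 0.42])

# A patient is flagged when score >= threshold, so no positive case is missed
# as long as the threshold does not exceed the lowest score among true positives.
highest_safe_threshold = scores[true_y == 1].min()

predicted = (scores >= highest_safe_threshold).astype(int)
# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(true_y, predicted).ravel()
print('threshold:', highest_safe_threshold, 'false negatives:', fn, 'false positives:', fp)
```

A threshold chosen this way is tuned to the validation set, so it would likely need checking on data the model has not seen.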

In [None]:
from sklearn.metrics import confusion_matrix

def rounder(num, thresh=0.5):
    if num >= thresh:
        return 1
    else:
        return 0

In [None]:
rounding = pd.Series([rounder(x, .2) for x in ls_preds])
confuse = confusion_matrix(val_y, rounding)
# Row 1, column 0 of the confusion matrix is the false-negative count
print(confuse[1, 0])

In [None]:
from matplotlib.pyplot import figure
false_negitives = []
threshholds = []
for num in range(1, 10, 1):
    n = num/10
    rounding = pd.Series([rounder(x, n) for x in ls_preds])
    print('Accuracy with threshold set to', n, 'is: ', accuracy_score(val_y, rounding))
    confuse = confusion_matrix(val_y, rounding)
    # Row 1, column 0 is the false-negative count; index the scalar directly
    false_negitives.append(int(confuse[1, 0]))
    threshholds.append(n)
    confuse = pd.DataFrame(confuse)
    figure(num=None, figsize=(5, 5))
    sns.heatmap(confuse, linewidths=1, annot=True, fmt='.5g', annot_kws={"size": 12}, cmap="YlGnBu",
                yticklabels=['Negative', 'Positive'], xticklabels=['Negative Predicted', 'Positive Predicted']).set_title('Threshold set to: '+str(n))

In [None]:
# this plots the number of false negatives as we increase the
# threshold at which a person is determined to have heart disease
sns.lineplot(y=false_negitives, x=threshholds)

# Final Thoughts:

With the threshold for predicting heart disease set to .5 (so the model predicts a positive outcome when its score is 0.5 or greater), we have 0 false positive cases and an accuracy of 88%. It's not perfect, but I'm absolutely satisfied with those results.

I would have liked to get a more accurate model, but due to the limited number of cases I was probably overly cautious about overfitting. I also probably could have used grid search to squeeze a bit more performance out of the model. Furthermore, the data needed to supply this model, and the effort needed to prepare it, almost make it easier just to have a physician evaluate the patient. The original dataset has a few more features in it, and I wonder if it would have been worthwhile to source the added data or if it would have simply created too much noise.
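
For reference, a grid search along those lines might look like the sketch below. It was not run as part of this notebook: the data is synthetic, the parameter grid is an arbitrary small example, and `GradientBoostingClassifier` is swapped in for the regressor used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); the real run would use the heart features.
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, random_state=0)

# A deliberately small, illustrative grid over model complexity.
param_grid = {'n_estimators': [5, 20, 35], 'max_depth': [1, 2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid,
                      scoring='roc_auc', cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Because every combination is cross-validated, the cost grows quickly with the size of the grid, which ties in with the scaling note below.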

As one other side note, many of the techniques I chose to use in this notebook scale poorly. So having a small dataset to start with isn't always a bad thing. Alternatively, some of the loops I used could be used on a subset of a larger dataset.

This year I've challenged myself to complete one task on Kaggle per week, in order to develop a larger Data Science portfolio. If you found this notebook useful or interesting please give it an upvote. I'm always open to constructive feedback. If you have any questions, comments, concerns, or if you would like to collaborate on a future task of the week feel free to leave a comment here or message me directly.
