We're going to take a look at this small heart failure data set: <br>
https://www.kaggle.com/andrewmvd/heart-failure-clinical-data

autogen code to bring in a data set hosted by kaggle:


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

First things first, let's read the article: <br>
<b><a href="https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5">The original article</a></b> 
<br> <br>
So this is a set of 299 patients who suffered some sort of heart failure.  Note the data was collected in 2015, so it is somewhat recent. <br> 
**<a href="https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/tables/1"> Table of variable descriptions</a>** <br>
<br>
I'm going to list these out here just so we have everything together in one spot <br><br>
**Age:** Age of the patient in years (continuous integer) <br>
**anaemia:** indicates a decrease in red blood cells or hemoglobin (binary) <br>
**creatinine_phosphokinase:** Level of the CPK enzyme in the blood, measured in mcg/L (continuous) <br>
&emsp;&emsp; <a href="https://www.hopkinslupus.org/lupus-tests/clinical-tests/creatine-phosphokinase-cpk/"> more info on CPK enzyme here</a>  <br />
**diabetes:** indicates the patient had diabetes (binary) <br>
**ejection_fraction:** Percentage of blood leaving the heart at each contraction (continuous integer) <br>
**high_blood_pressure:** indicates the patient has hypertension (binary) <br />
**platelets:** count of platelets in the blood, measured in kiloplatelets/mL (continuous integer) <br>
**serum_creatinine:** level of serum creatinine in the blood, measured in mg/dL (continuous) <br>
**serum_sodium:** Level of serum sodium in the blood, measured in mEq/L (continuous)  <br>
**sex:** Woman or man (binary) <br />
**smoking:** indicates the patient was a smoker (binary) <br />
**time:** Follow-up period (days) (int, continuous) <br />
**DEATH_EVENT:** indicates patient mortality during follow-up period (binary) <br />
<br />
<br /> Right off the bat I'm concerned about the time variable.  If a patient dies, does this cut short the follow-up period?  Also, at the time of the heart failure event, we wouldn't know what the follow-up period would be.  The authors of the article can make whatever case they want, but if I'm going to make a model that would be intended for use (calculating the probability of mortality for a given patient at the time of a heart failure event) then I can't in good faith include it. <br />
<br /> I also want to mention that this is an <i> extremely </i> limited dataset.  We don't know what risk factors these patients might have (including chronic conditions), what their pattern of care has been over the preceding months/years, what exactly was done during the heart failure event, etc.  

In [None]:
# reading our data set using pandas and taking a quick look at the variables included to make sure they meet expectations
df = pd.read_csv('/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
df.head()

In [None]:
df.info()

note - no null values

dropping time

In [None]:
df = df.drop(['time'], axis=1)

In [None]:
df.info()

Going to do a little bit of data exploration.  I might come back and do more if I find time.

In [None]:
import seaborn as sns
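# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay gives the same picture
sns.histplot(df['age'], kde=True)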

Most of our patients are in the 40-70 range.  <br>
<br>
Since we have such a small data set I won't consider splitting it.  By that I mean we could train one model on the <= 65 group and another on the > 65 group.  I would expect a model trained exclusively on working-age patients to perform better than one that includes retirees.  From a US healthcare insurance perspective this would also make sense; the working-age group would have commercial insurance and the majority of retirees would have Medicare.<br>
<br>
For the same reason I'm not going to consider further limiting our data either (trimming or winsorizing outliers, removing noise, etc.).  I feel like this changes the overall scope of the problem, and isn't in the spirit of this particular exercise.
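
For reference, the winsorizing I'm skipping would look roughly like the sketch below.  It runs on a small made-up array rather than the real columns, and the `winsorize` helper name and the 5th/95th percentile cutoffs are my own assumptions for illustration only.

```python
import numpy as np

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the given lower/upper percentiles (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

# made-up numbers with one extreme value, not taken from the data set
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])
clipped = winsorize(sample)
print(clipped)
```

The extreme value gets pulled in toward the rest of the sample instead of being dropped, which is the difference between winsorizing and trimming.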

In [None]:
#### looking at the distribution of age, splitting on death event ####
sns.violinplot(x="sex", y="age", data=df, hue="DEATH_EVENT", palette="pastel", split=True) 

No surprises here.  Going to run pairplots on the rest of these variables just to get an idea of what we're looking at.  I've already stated that I'm not planning on limiting this data set further than it already is.  If that wasn't the case we could spend more time here considering our options.

In [None]:
sns.pairplot(df[['age','sex','anaemia','creatinine_phosphokinase','diabetes','DEATH_EVENT']],hue="DEATH_EVENT")

In [None]:
##### going to change the palette so it's a bit easier to differentiate between these plots #####
sns.pairplot(df[['age','sex','ejection_fraction','high_blood_pressure','platelets','DEATH_EVENT']],hue="DEATH_EVENT",palette="dark")

In [None]:
sns.pairplot(df[['age','sex','serum_creatinine','serum_sodium','smoking','DEATH_EVENT']],hue="DEATH_EVENT",palette="bright")

In [None]:
# let's put together some models
# setting up our training data set
x = df.drop(['DEATH_EVENT'], axis=1).values
y = df['DEATH_EVENT'].values

In [None]:
# train test split
# using default training size.  It's worth considering a larger training size

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.75)
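
One thing worth considering with a data set this small is a stratified, seeded split, so both halves keep the same share of death events and the results are repeatable from run to run.  The sketch below uses synthetic data from `make_classification` as a stand-in for `x` and `y`; the class weights and the `random_state` values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# synthetic stand-in: 299 rows, 11 features, imbalanced binary target
x_demo, y_demo = make_classification(
    n_samples=299, n_features=11, weights=[0.68, 0.32], random_state=0
)

# stratify keeps the class proportions the same in both splits,
# random_state makes the split reproducible
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, train_size=0.75, stratify=y_demo, random_state=42
)
print(y_tr.mean(), y_te.mean())
```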

I'm going to run the classifier gauntlet so to speak.  As I discover more I'll add them here as a reference. <br>
<br>
<b> Logistic Regression </b>
<br>
fitting the model

In [None]:
from sklearn.linear_model import LogisticRegression
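# the default max_iter of 100 can stop before the solver converges on unscaled features
logreg = LogisticRegression(max_iter=1000)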
logreg.fit(x_train, y_train)

creating predictions on the test set

In [None]:
log_pred = logreg.predict(x_test)

compiling evaluation metrics for comparison <br>
<br>
<a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">I find the wikipedia article on ROC a very convenient reference</a>



In [None]:
from sklearn.metrics import confusion_matrix, auc, accuracy_score, f1_score
log_cm = confusion_matrix(y_test, log_pred)
log_acc = accuracy_score(y_test, log_pred)
log_f1 = f1_score(y_test, log_pred)
print(log_acc)
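
Since the ROC article is the reference here, this is a sketch of how the ROC-style metrics fall out of a confusion matrix, plus the area under the ROC curve.  The labels and scores are made up purely to show the calls.  In this notebook the scores would come from something like `logreg.predict_proba(x_test)[:, 1]`, which is my assumption about how it would be wired in.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# made-up labels and scores, only to show the calls
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.4, 0.9, 0.2, 0.7, 0.5])
y_hat = (y_score >= 0.5).astype(int)

# for binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity = tp / (tp + fn)  # true positive rate (recall)
specificity = tn / (tn + fp)  # true negative rate
precision = tp / (tp + fp)    # positive predictive value
roc_auc = roc_auc_score(y_true, y_score)  # needs scores, not hard labels

print(sensitivity, specificity, precision, roc_auc)
```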

<b> Support Vector Machines </b> <br>
Using the default value C=1.  If I remember I will go back and run this for different values of C. <br>
I could also play with different kernels, but I'm not expecting this model to do well.

In [None]:
from sklearn.svm import SVC
supp_vect = SVC()
supp_vect.fit(x_train, y_train)

In [None]:
svc_pred = supp_vect.predict(x_test)

In [None]:
svc_cm = confusion_matrix(y_test, svc_pred)
svc_acc = accuracy_score(y_test, svc_pred)
svc_f1 = f1_score(y_test, svc_pred)
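
If I do come back to try other values of C and other kernels, a cross-validated grid search is probably the way to do it.  This is a sketch on synthetic stand-in data; the grid values are arbitrary assumptions, and the scaler in the pipeline is my addition since SVMs are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the training data
x_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=0)

# scaling lives inside the pipeline so it is refit on each training fold
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(x_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```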

<b> K Nearest Neighbors </b>

In [None]:
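from sklearn.preprocessing import StandardScaler
# fit the scaler on the training rows only so the test rows don't leak into the scaling
knn_scale = StandardScaler()
knn_scale.fit(x_train)

In [None]:
# reuse the split from above so KNN is scored on the same test rows as the other models
knn_x_train, knn_x_test = knn_scale.transform(x_train), knn_scale.transform(x_test)
knn_y_train, knn_y_test = y_train, y_test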

In [None]:
from sklearn.neighbors import KNeighborsClassifier

err = []
k = []

for i in range(1,30):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(knn_x_train,knn_y_train)
    knn_pred_i = knn.predict(knn_x_test)
    err.append(np.mean(knn_pred_i != knn_y_test))
    k.append(i)

In [None]:
# recent seaborn versions require x and y as keyword arguments
sns.lineplot(x=k, y=err)

k = 19 seems to be a pretty good choice

In [None]:
knn = KNeighborsClassifier(n_neighbors=19)
knn.fit(knn_x_train,knn_y_train)
knn_pred = knn.predict(knn_x_test)

knn_cm = confusion_matrix(knn_y_test, knn_pred)
knn_acc = accuracy_score(knn_y_test, knn_pred)
knn_f1 = f1_score(knn_y_test, knn_pred)
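
Picking k from the test-set error curve means the test set has already been used once by the time we score the final model.  A sketch of the alternative, choosing k by cross-validation on the training data only, is below.  It uses synthetic stand-in data, and the odd-only k grid is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the training data
x_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=0)

k_values = list(range(1, 30, 2))
cv_err = []
for k_i in k_values:
    # scaler sits inside the pipeline so each fold is scaled on its own training part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k_i))
    acc = cross_val_score(model, x_demo, y_demo, cv=5, scoring="accuracy")
    cv_err.append(1 - acc.mean())

best_k = k_values[int(np.argmin(cv_err))]
print(best_k)
```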

<b> The SciKit Random Forest model </b> 
<br> <br> I want to note that there's a SciKit Decision Tree model as well.  The Random Forest model almost always performs better so I'm going to skip it for now.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(x_train, y_train)
rfc_pred = rfc.predict(x_test)

In [None]:
rfc_cm = confusion_matrix(y_test, rfc_pred)
rfc_acc = accuracy_score(y_test, rfc_pred)
rfc_f1 = f1_score(y_test, rfc_pred)

<b><a href="https://xgboost.readthedocs.io/en/latest/">xgboost</a></b> - Extreme Gradient Boosted trees.  Right now (Oct 2020) this is one of the best performing ml algorithms out there.

In [None]:
from xgboost import XGBClassifier

xgb_out = []
xgb_est = []
for e in range(5,30):
    classifier = XGBClassifier(n_estimators = e, max_depth=12, subsample=0.75)
    classifier.fit(x_train, y_train)
    y_pred = classifier.predict(x_test)
    xgb_out.append(accuracy_score(y_test,y_pred))
    xgb_est.append(e)

In [None]:
# recent seaborn versions require x and y as keyword arguments
sns.lineplot(x=xgb_est, y=xgb_out)

In [None]:
classifier = XGBClassifier(n_estimators = 14, max_depth=12, subsample=0.75)
classifier.fit(x_train, y_train)
xgb_pred = classifier.predict(x_test)


xgb_cm = confusion_matrix(y_test, xgb_pred)
xgb_acc = accuracy_score(y_test, xgb_pred)
xgb_f1 = f1_score(y_test, xgb_pred)

<b> Catboost </b>

In [None]:
from catboost import CatBoostClassifier
classifier = CatBoostClassifier()
classifier.fit(x_train, y_train, silent=True)

In [None]:
cat_pred = classifier.predict(x_test)

cat_cm = confusion_matrix(y_test, cat_pred)
cat_acc = accuracy_score(y_test, cat_pred)
cat_f1 = f1_score(y_test, cat_pred)

<b><a href="https://lightgbm.readthedocs.io/en/latest/"> lightgbm </a> </b> - This model has a lot of parameters that can be optimized.  If I get time I'll try to add to this.

In [None]:
import lightgbm as lgb
lgtrain = lgb.Dataset(x_train, label=y_train)
lgtest = lgb.Dataset(x_test, label=y_test)

In [None]:
params = {  "boosting_type":'gbdt', 
            "class_weight":None,
            "num_leaves": 100,
            "objective": 'binary',
            "metric": 'auc',
            "verbose": -1}

In [None]:
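# early_stopping_rounds and verbose_eval were removed from lgb.train in LightGBM 4.0; the callback replaces them
lgbm = lgb.train(params, lgtrain, num_boost_round=100, valid_sets=[lgtrain, lgtest], callbacks=[lgb.early_stopping(stopping_rounds=200, verbose=False)])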

In [None]:
lgb_pred = lgbm.predict(x_test)

In [None]:
# need to optimize a threshold for these predictions 
# here we're going to maximize F1 by minimizing the negated F1-Score
def neg_f1(threshold, y_true, y_hat):
    return -f1_score(y_true, y_hat > threshold)

In [None]:
from scipy.optimize import minimize
f1_opt = minimize(fun=neg_f1, x0=np.median(lgb_pred), args=(y_test,lgb_pred), method='nelder-mead')
f1_opt.x[0]

In [None]:
lgb_cm = confusion_matrix(y_test, lgb_pred > f1_opt.x[0])
print(lgb_cm)
TN, FP, FN, TP = lgb_cm.ravel()
lgb_acc = (TP + TN)/(TP + TN + FP + FN)
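
One caveat on the optimizer above: F1 as a function of the threshold is a step function, so Nelder-Mead can stall on a flat stretch near its starting point.  A plain grid scan is a simple alternative.  The sketch below uses made-up labels and probabilities; the `best_f1_threshold` name and the grid spacing are assumptions of mine.  Ideally the threshold would be tuned on a validation split rather than on the test set.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_true, y_prob, grid=None):
    """Scan candidate thresholds and return (threshold, f1) for the best one (hypothetical helper)."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    scores = [
        f1_score(y_true, (y_prob > t).astype(int), zero_division=0) for t in grid
    ]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])

# made-up labels and predicted probabilities, only to show the call
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.4, 0.9, 0.2, 0.7, 0.5])

threshold, score = best_f1_threshold(y_true, y_prob)
print(threshold, score)
```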

Looking at accuracy.  We could also compare some of the other metrics I've been compiling.

In [None]:
print('the accuracy of the logistic model is: {}'.format(round(log_acc, 3)))
print('the accuracy of the support vector machine is: {}'.format(round(svc_acc, 3)))
print('the accuracy of the k nearest neighbor is: {}'.format(round(knn_acc, 3)))
print('the accuracy of the scikit random forest is: {}'.format(round(rfc_acc, 3)))
print('the accuracy of the xgboost model is: {}'.format(round(xgb_acc, 3)))
print('the accuracy of the catboost model is: {}'.format(round(cat_acc, 3)))
print('the accuracy of the lightgbm model is: {}'.format(round(lgb_acc, 3)))
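
To compare the other metrics side by side, something like the sketch below would collect them into one table.  The `summarize` helper and the two demo prediction arrays are made up for illustration.  In this notebook the dictionary would hold the prediction arrays built above, all scored against the same test labels.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

def summarize(y_true, predictions):
    """Build a table of accuracy and F1 per model, best F1 first (hypothetical helper)."""
    rows = [
        {
            "model": name,
            "accuracy": accuracy_score(y_true, pred),
            "f1": f1_score(y_true, pred, zero_division=0),
        }
        for name, pred in predictions.items()
    ]
    return pd.DataFrame(rows).set_index("model").sort_values("f1", ascending=False)

# made-up labels and predictions, only to show the call
y_true = np.array([0, 0, 1, 1, 0, 1])
demo_preds = {
    "model_a": np.array([0, 0, 1, 0, 0, 1]),
    "model_b": np.array([0, 1, 1, 1, 1, 1]),
}
summary = summarize(y_true, demo_preds)
print(summary)
```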

There's more we can do to tune these models, but I don't feel like this particular problem is well suited for this sort of thing considering how small our data set is.  These scores are not very good, but I think that speaks to the data more than anything.
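
With so few rows, a single 75/25 split gives a noisy score, so the ranking of models can flip from run to run.  Repeated stratified cross-validation gives a mean and a spread instead.  This is a sketch on synthetic stand-in data; the fold and repeat counts are arbitrary assumptions, and logistic regression is used only as an example model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in with the same number of rows as the data set
x_demo, y_demo = make_classification(n_samples=299, n_features=11, random_state=0)

# 5 folds repeated 10 times gives 50 accuracy estimates
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, x_demo, y_demo, cv=cv, scoring="accuracy")

print(round(scores.mean(), 3), round(scores.std(), 3))
```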