# Heart Failure: The Gold Digger vs. the Doctor, Precision vs. Recall

In this notebook I am looking at heart failure data, a dataset of measurable health-related features recorded alongside whether each person died from heart failure.

To keep this interesting I wanted to view this from two perspectives, one being that of a gold digger and the other that of a Doctor. The reason for this is that they will each want something very different out of a model that predicts mortality.

The sinister gold digger will not want to waste their time hooking up with someone that has been falsely predicted to die as the result of heart failure. They will want to know with high precision that their betrothed will die.

Our kind Doctor, on the other hand, is there to save as many lives as they can. They want to find all the people that are at risk of heart failure so they can provide the best action as soon as possible, be it a healthier lifestyle, medication or surgical intervention. It doesn't matter as much if they wrongly predict a person to be positive when they aren't, as they know a number of follow-up tests will be carried out and healthy lifestyle changes will be good for a person even if they are not at present risk. However, letting someone slip through the net could be awful. They want high recall.
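
To make the two terms concrete, here is a minimal sketch using ten made-up labels (they are not taken from the dataset). Precision is the share of positive predictions that turned out to be right, TP / (TP + FP), and recall is the share of actual positives that were found, TP / (TP + FN).

```python
# Toy labels invented for illustration: 1 = death event, 0 = survived.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# count true positives, false positives and false negatives
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

precision = tp / (tp + fp)  # what the gold digger cares about
recall = tp / (tp + fn)     # what the Doctor cares about
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case two of the three positive predictions are right, but only two of the four actual positives are found.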


## Steps Needed

Now that we have some motives and some data, here is how I plan to tackle this.

1.   Load and analyse the data. For this I will be using pandas and pandas-profiling. For this small amount of data these are perfect tools to load and check what we have with very little effort.
2.   Clean the data.
3.   Create train and test sets.
4.   Preprocess the data.
5.   Train our model.
6.   Validate our model.
7.   Prediction Probabilities and Precision/Recall Trade-Off

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

#Suppressing all warnings
warnings.filterwarnings("ignore")

%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,classification_report, plot_roc_curve

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from pandas_profiling import ProfileReport

## Load and Analyse the Data

In [None]:
def loadData():
    return pd.read_csv('/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

data = loadData()

In [None]:
report = ProfileReport(data,progress_bar=False)
report.to_notebook_iframe()

The pandas profiling report is great. There are no missing values and all the data types are as they should be. I do, however, want to have a closer look at the correlations as I would like to know which features are of importance.

In [None]:
data_corr = data.corr(method='pearson')

data_corr_columns = data_corr.columns
data_corr_index = data_corr.index

data_corr = data_corr.to_numpy()

np.fill_diagonal(data_corr, np.nan)

data_corr = pd.DataFrame(data_corr, columns=data_corr_columns, index=data_corr_index)

In [None]:
plt.subplots(figsize=(15,10))
sns.heatmap(data=data_corr, cmap='YlGnBu', annot=True).set_title('Correlation Heat Map');

It looks as if age and serum_creatinine have the largest positive correlations with the death event. The age factor makes common sense as we expect an older person to be more at risk of heart failure. This is also quite common when we think about famous gold diggers, those that marry people much older than themselves.

The serum creatinine is an interesting one as it has to do with how well the kidneys are working: a high level indicates poor kidney function.

## Clean the Data

After looking through the report there doesn't seem to be much to do with the data, so I am going to model it as it is.

## Create Train and Test Sets



In [None]:
# dropping the target column from the features
features = data.drop(['DEATH_EVENT'],axis=1)
# extracting DEATH_EVENT as a Series
target = data['DEATH_EVENT']

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(features,
                                                    target,
                                                    test_size=0.2,
                                                    random_state=30,
                                                    stratify=target)

## Preprocess the Data

To process the data I am going to use the Pipeline and ColumnTransformer classes from sklearn.
These make life a lot easier, although their use here is a bit overkill.

In [None]:
# lists of columns that will go through the pipeline
scaled_cols = ['age',
               'creatinine_phosphokinase',
               'ejection_fraction',
               'platelets',
               'serum_creatinine',
               'serum_sodium']
nothing_cols = ['anaemia',
                'diabetes',
                'high_blood_pressure',
                'sex',
                'smoking',
                'time']

In [None]:
# scale the continuous columns to zero mean and unit variance
numeric_transform = Pipeline(steps=[
    ('scaler', StandardScaler())])

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_transform, scaled_cols),
        # 'passthrough' keeps these columns unchanged
        ('nothing', 'passthrough', nothing_cols)
    ])

# this pipeline only preprocesses; the classifier is trained separately below
rf = Pipeline(steps=[('preprocessor', preprocess)])

In [None]:
# transforming the training and testing data
trans_train_data = rf.fit_transform(X_train)

trans_test_data = rf.transform(X_test)
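
As an aside, the classifier could also be made the last step of the pipeline, so that the same preprocessing is applied automatically whenever we fit or predict. The sketch below is not what this notebook does; it uses a tiny made-up frame with two of the dataset's column names, and `toy`, `X_toy`, `y_toy` and `full_model` are names invented for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up frame for illustration only (values are not from the dataset).
toy = pd.DataFrame({
    'age': [40, 50, 60, 70, 80, 90, 45, 85],
    'smoking': [0, 1, 0, 1, 0, 1, 0, 1],
    'DEATH_EVENT': [0, 0, 0, 1, 1, 1, 0, 1],
})
X_toy = toy.drop(columns='DEATH_EVENT')
y_toy = toy['DEATH_EVENT']

# preprocessing and classifier chained into a single estimator
full_model = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['age']),
        ('nothing', 'passthrough', ['smoking']),
    ])),
    ('classifier', RandomForestClassifier(random_state=0)),
])

# fit and predict_proba both receive the raw, untransformed features
full_model.fit(X_toy, y_toy)
toy_proba = full_model.predict_proba(X_toy)
print(toy_proba.shape)
```

The appeal of this design is that there is no separate transformed copy of the data to keep track of, so it is harder to pass untransformed features to the model by mistake.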

## Train our Model

In [None]:
# using the Random Forest Classifier with its default parameters
rfc = RandomForestClassifier()

In [None]:
# training the model with our transformed data
rfc.fit(trans_train_data, y_train)

## Validate our Model

In [None]:
# getting predictions for our test data
predictions = rfc.predict(trans_test_data)

In [None]:
print(classification_report(y_test, predictions))

Our model has done well: we have an accuracy of 88%. However, this isn't perfect for either our Doctor or our gold digger.
Let's have a look at a confusion matrix to get a better idea of what is going on.

In [None]:
cm = confusion_matrix(y_test, predictions)

cm = pd.DataFrame(cm,
                  columns=['Predicted Negative',
                           'Predicted Positive'],
                  index=['Actual Negative',
                         'Actual Positive'])

cm.style.background_gradient(cmap='viridis')

In the above confusion matrix we can clearly see the issue for both our Doctor and our gold digger. The Predicted Positive column shows that 2 out of our 16 positive predictions were actually negatives. This is bad news for our gold digger as they wouldn't want to accidentally marry one of those two people, since they would have to wait a long time, if at all, before getting their money. The Actual Positive row shows the issue for the Doctor because 5 out of the total 19 positive cases failed to be positively predicted.

Our Precision is 0.88 and our Recall is 0.74
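
As a quick check, these two figures can be recomputed by hand from the counts quoted above (16 positive predictions, 2 of them wrong, and 19 actual positives). The variable names are only for this sketch.

```python
# Counts quoted in the text above for the default 0.5 cut-off.
predicted_positive = 16   # positive predictions made by the model
false_positive = 2        # of those, actually negative
actual_positive = 19      # positive cases in the test set

# every positive prediction that is not a false positive is a true positive
true_positive = predicted_positive - false_positive

precision = true_positive / predicted_positive
recall = true_positive / actual_positive
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

This gives 14/16 for precision and 14/19 for recall, which round to the 0.88 and 0.74 reported above.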

This isn't the end of the story though, as our model also gives us access to the probabilities for each prediction, and as both of our characters are willing to trade off precision for recall or vice versa we can do more with this data.

I am next going to get these probabilities, the predictions and the actual outcomes and concatenate them into a dataframe so we can get a better understanding of it.

To do this, the Random Forest Classifier in sklearn comes with the <code>.predict_proba()</code> method, which gives us the probabilities that we want. Other classifiers may instead offer <code>.decision_function()</code>, which returns scores rather than probabilities but can be thresholded in the same way.

## Prediction Probabilities and Precision/Recall Trade-Off

In [None]:
# Let's first get the probability given to each prediction.
# The model was trained on the transformed data, so it must predict on it too.
y_pred_proba = rfc.predict_proba(trans_test_data)

# next make a dataframe of the actual test outcomes
outcomes_df = pd.DataFrame(pd.Series(y_test))

# reset the index so we can concatenate the right rows together
outcomes_df.reset_index(inplace=True)

# create a dataframe of the predictions
predictions_df = pd.DataFrame(predictions, columns=['Predictions'])

# from the proba we are taking the second column (probability of class 1) and making it a dataframe
proba_df = pd.DataFrame(y_pred_proba[:,1], columns=['proba'])

In [None]:
# concatenating all the dataframes together
precision_recall = pd.concat([outcomes_df, predictions_df, proba_df], axis=1, ignore_index=False).set_index('index')

To see how these probabilities relate to the outcomes and predictions I am going to take a random sample from the precision_recall dataframe, then sort it in ascending order by the proba column, and lastly transpose the dataframe.

In [None]:
# increasing max columns displayed by pandas
pd.options.display.max_columns = 30

precision_recall.sample(30).sort_values('proba').T

With this we can see that once the proba gets higher than 0.7 all the predictions are positive and so are the outcomes. This is exactly what our gold digger wants. If we were to class only a proba of 0.7 and above as positive we would have a higher Precision; however, we would miss some of the positive outcomes, lowering Recall. The opposite can be done by lowering the required proba to, say, 0.4. This would mean that more of the positive outcomes would be classed as positive; however, we would lower the Precision.
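
The effect of moving the cut-off can be sketched on a handful of hypothetical outcomes and probabilities (these values are invented and are not the model's output). `classify` and `precision_recall_at` are helper names made up for this example.

```python
def classify(probas, threshold):
    """Label a case positive when its probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probas]

def precision_recall_at(actual, predicted):
    """Return (precision, recall) for two lists of 0/1 labels."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical outcomes and probabilities, sorted by probability.
actual = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
probas = [0.05, 0.15, 0.30, 0.42, 0.45, 0.55, 0.62, 0.71, 0.80, 0.93]

# evaluate a low, a default and a high cut-off
results = {t: precision_recall_at(actual, classify(probas, t))
           for t in (0.4, 0.5, 0.7)}
for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

On these toy values the 0.7 cut-off reaches a precision of 1.0 with a recall of 0.6, while the 0.4 cut-off reaches a recall of 1.0 with a precision of about 0.71.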

Let's do this for both our Doctor and gold digger and see how the precision and recall are affected.

In [None]:
gold_digger_df = precision_recall.copy()
doctor_df = precision_recall.copy()

In [None]:
gold_digger_df.loc[(gold_digger_df.proba >= 0.7), ('Predictions')] = 1
gold_digger_df.loc[(gold_digger_df.proba < 0.7), ('Predictions')] = 0

doctor_df.loc[(doctor_df.proba >= 0.4), ('Predictions')] = 1
doctor_df.loc[(doctor_df.proba < 0.4), ('Predictions')] = 0

In [None]:
print(classification_report(gold_digger_df.DEATH_EVENT, gold_digger_df.Predictions))

In [None]:
gb_cm = confusion_matrix(gold_digger_df.DEATH_EVENT, gold_digger_df.Predictions)
gb_cm = pd.DataFrame(gb_cm, columns=['Predicted Negative', 'Predicted Positive'], index=['Actual Negative', 'Actual Positive'])
gb_cm.style.background_gradient(cmap='viridis')

In our gold digger set we can now see that all of the instances we predicted to be positive actually were, giving us 100% precision. However, we did miss a number of actual positive outcomes, which has led to a lower recall.

In [None]:
print(classification_report(doctor_df.DEATH_EVENT, doctor_df.Predictions))

In [None]:
dr_cm = confusion_matrix(doctor_df.DEATH_EVENT, doctor_df.Predictions)
dr_cm = pd.DataFrame(dr_cm, columns=['Predicted Negative', 'Predicted Positive'], index=['Actual Negative', 'Actual Positive'])
dr_cm.style.background_gradient(cmap='viridis')

For the Doctor set most of the actual positives have now been predicted as positive, giving us a much higher Recall. However, we now have a lot more False Positives, lowering the Precision.

To see this trade-off we can plot the True Positive Rate against the False Positive Rate, as below.

In [None]:
plot_roc_curve(rfc, trans_test_data, y_test);
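
The ROC curve shows Recall (the True Positive Rate) against the False Positive Rate. To look at Precision and Recall directly at every possible cut-off, sklearn also provides `precision_recall_curve`. Below is a small sketch on hypothetical outcomes and probabilities (invented values, not the model's output); on the real data the inputs would be the test outcomes and the positive-class probabilities.

```python
from sklearn.metrics import precision_recall_curve

# Hypothetical outcomes and probabilities, for illustration only.
actual = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
probas = [0.05, 0.15, 0.30, 0.42, 0.45, 0.55, 0.62, 0.71, 0.80, 0.93]

# one precision/recall pair per candidate threshold, plus a final (1, 0) point
precision, recall, thresholds = precision_recall_curve(actual, probas)

for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold>={t:.2f}: precision={p:.2f}, recall={r:.2f}")
```

Reading down the printed rows, recall never increases as the threshold rises, which is the trade-off our two characters are choosing between.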