# Hello all!

I found this dataset, which seems easy for beginners to work with, so I decided to give it a shot. We'll load the data and I'll see what my amateur knowledge can do.



In [None]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
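try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:
    # plot_confusion_matrix was removed in scikit-learn 1.2; alias its replacement
    from sklearn.metrics import ConfusionMatrixDisplay
    plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator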
from sklearn.ensemble import RandomForestClassifier
# from sklearn.model_selection import GridSearchCV

In [None]:
raw = pd.read_csv("../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")
raw.tail(10)

# Data Inference

Right, this seems to be a bit medically intensive, so I'll do a bit of research just to be sure. First of all, the age and sex columns seem quite self-explanatory. We'll analyze the data using those two columns.

By the way, woman is 0, and man is 1.

## 1. Conditions 

A lot of the binary data seems to focus on conditions the patient might have had before death.

* **Anaemia: Basically a decrease in red blood cells, or hemoglobin. A healthy person's blood contains more than just the red-looking substance, as there are also plasma and platelets, so a decrease in hemoglobin might suggest a blockage, or more generally that little oxygen is being carried around.**

* **Diabetes: This condition is probably famous, basically there's too much sugar in your blood. I don't know the relevance of the type of diabetes, since some are innate and genetic, while some are developed later in life. I'll assume this only concerns whether a person's blood glucose is high.**

* **High blood pressure: Exactly what it sounds like, your blood pressure is too high. Usually prevalent in older people with conditions like the aforementioned diabetes, or with an unusually sodium-rich diet.**

* **Smoking: Well, we'll see if smoking kills.**

### Sex

Let's begin with a heatmap to see where we need to go first.

In [None]:
main = "#68C643"
secondary = "#A144C5"

In [None]:
sns.heatmap(raw[["anaemia", "diabetes", "smoking", "high_blood_pressure", "sex"]].corr(), annot=True)

Well, there seems to be **a very notable** correlation between gender and smoking. Before visualizing the conditions, we should check whether the dataset itself is imbalanced by gender.

In [None]:
sns.countplot(x="sex", data=raw, color=main)

And it is. There are roughly *twice* as many men as women in this dataset, so do keep that in mind.

In [None]:
conditions_df = raw[["anaemia", "diabetes", "smoking", "high_blood_pressure", "sex", "age"]]
for i in ["anaemia", "diabetes", "smoking", "high_blood_pressure"]:
    plot = sns.FacetGrid(conditions_df, col="sex")
    plot.map(sns.countplot, i, color=main)

> So, here are some general assumptions I've gathered from this:

* The numbers of women with and without anaemia are rather close to each other, while more men did not experience anaemia than did. The same is largely true for diabetes and high blood pressure.
* Women generally do not smoke, as the number of women smokers is tiny. The ratio between smokers and non-smokers is more balanced in men, though non-smokers still outnumber smokers by a smidge.
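Since there are far fewer women than men, raw counts can be misleading, and comparing rates within each sex may be fairer. Here is a minimal sketch of that idea using `pd.crosstab`. The `toy` frame is synthetic data made up for illustration, not the real dataset.

```python
import pandas as pd

# Synthetic toy data, NOT the real dataset: 0 = woman, 1 = man
toy = pd.DataFrame({
    "sex":     [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    "smoking": [0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1],
})

# normalize="index" divides each row by its own total,
# so every row shows the share of non-smokers and smokers within one sex
rates = pd.crosstab(toy["sex"], toy["smoking"], normalize="index")
print(rates)
```

On the real data, the same call with `raw["sex"]` and any of the condition columns should give the per-sex rates directly.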

### Age

Let's begin with a heatmap once more.

In [None]:
sns.heatmap(raw[["anaemia", "diabetes", "smoking", "high_blood_pressure", "age"]].corr(), annot=True)

Hm... there don't seem to be any distinct features here. Well, anyway, let's look at the distribution of age. We'll start with summary statistics of age by gender, to see if there are any biases.

In [None]:
raw[["age", "sex"]].groupby("sex").agg(["min", "max", "mean", "median"])

Right! So the minimum, maximum and median ages are the same, with only the mean being different (probably due to the different numbers of men and women in the dataset). So this is quite a balanced dataset in terms of age distribution.

In [None]:
ageplot = sns.FacetGrid(raw[["age", "sex"]], col="sex")
ageplot.map(sns.kdeplot, "age", fill=True, color=main)

I would guess that swarmplots or boxplots depict the relation between age and conditions pretty well, so we'll see.

In [None]:
for i in ["anaemia", "diabetes", "smoking", "high_blood_pressure"]:
    plot = sns.FacetGrid(conditions_df, col="sex")
    plot.map(sns.swarmplot, i, "age", color=secondary)
    plot.map(sns.boxplot, i, "age", color=main)

> As per usual, here are the general assumptions

* Women suffering from anaemia tend to be a bit younger, while men tend to be a bit older.
* Diabetes is more balanced. People who suffer from diabetes tend to have a narrower, younger age range. What is notable is that women who do not suffer from diabetes tend to be older than the rest.
* There is little difference between men who smoke and men who do not, except that the latter seem to live longer (at least in the context of this dataset). It is hard to say for women smokers, since only 4 of the 100+ women smoke.
* Women who have normal blood pressure tend to be younger than those who do not, while it is generally the opposite for men.

Those are all interesting conclusions, but what piques my interest is those 4 women who smoke, so let's see them in the dataframe.

In [None]:
raw[(raw["sex"] == 0) & (raw["smoking"] == 1)]

In [None]:
raw[raw["sex"] == 0].iloc[:4]

Hmm... there seems to be no major difference, except for (maybe) an increase in other conditions, but there are so few female smokers that we can't tell for sure.

## 2. Levels and Percentages

Data that is not binary seems to be measured in one of two ways: as the concentration of X in Y, or as a percentage.

* Creatinine phosphokinase (mcg/L): The level of an enzyme called CPK in the bloodstream. It is not typically abundant in the bloodstream, and leaks out when tissues are damaged.
* Ejection fraction (%): The percentage of blood that leaves the heart with each contraction.
* Platelets (kiloplatelets/mL): The amount of platelets in the blood.
* Serum creatinine (mg/dL): Level of creatinine in the blood. All I know is that this thing is a waste product caused by the natural wear and tear of the body.
* Serum sodium (mEq/L): The amount of sodium, or... kinda salt, present in the blood.

So we can assume that, except for the ejection fraction and platelets, high levels in the other columns typically mean the body is in danger. Low amounts of sodium can also lead to unhealthy conditions. As for the ejection fraction, because we are looking at heart failure, we can assume that a lower percentage of blood leaving the heart can be quite a telltale sign.

It is routine at this point, but let's draw up another heatmap.

In [None]:
levels_df = raw[["creatinine_phosphokinase", "ejection_fraction", "platelets", "serum_creatinine", "serum_sodium", "DEATH_EVENT"]]
sns.heatmap(levels_df.corr(), annot=True)

Right...! So ejection fraction, serum creatinine and serum sodium seem to be the factors most correlated with death. But since this data is so varied, seeing a bigger picture might help, and I mean that literally.

In [None]:
sns.pairplot(data=levels_df, hue="DEATH_EVENT", corner=True, kind="reg", diag_kind="hist", palette=sns.color_palette([main, secondary]))

In any case, I don't think it helped me at all.

In [None]:
for i in ["creatinine_phosphokinase", "ejection_fraction", "platelets", "serum_creatinine", "serum_sodium"]:
    death_plot = sns.FacetGrid(levels_df, col="DEATH_EVENT")
    death_plot.map(sns.swarmplot, i, color=main)

Hm...

> Conclusion

* A lower ejection fraction tends to lead to death.
* An unusually high level of serum creatinine or creatinine phosphokinase can also lead to death.

For the rest, it seems hard to draw any surefire conclusions...

In [None]:
raw["DEATH_EVENT"].value_counts()

**IMPORTANT:** There is a class imbalance, so do remember to calculate class weights later.
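To make "class weights" a bit more concrete, here is a small sketch of the "balanced" heuristic, which weights each class by `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the labels are made up for illustration; the labels are synthetic, not the real `DEATH_EVENT` column.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}

# Synthetic labels for illustration only: 6 survivors (0) and 2 deaths (1)
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_class_weights(toy_labels)
print(weights)  # the rarer class gets the larger weight
```

A dictionary like this can be passed to `RandomForestClassifier` through its `class_weight` parameter, or `class_weight="balanced"` can be used to let scikit-learn compute the weights itself.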

# Machine Learning

In [None]:
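# Reload the data so the exploration above does not affect the model inputs
ds_df = pd.read_csv("../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")
y = ds_df.pop("DEATH_EVENT")
normalize_cols = ["ejection_fraction","serum_creatinine", "serum_sodium", "time", "creatinine_phosphokinase", "platelets"]
# Copy first: plain assignment would alias ds_df, so the unnormalized frame would be overwritten too
ds_df_n = ds_df.copy()
# Min-max scale the continuous columns to the range [0, 20]
col_min, col_max = ds_df[normalize_cols].min(), ds_df[normalize_cols].max()
ds_df_n[normalize_cols] = ((ds_df[normalize_cols] - col_min) / (col_max - col_min)) * 20
X_train, X_test, y_train, y_test = train_test_split(ds_df_n, y, test_size=0.33)
# Recent scikit-learn versions require `classes` and `y` as keyword arguments
class_weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
print(class_weights)
ds_df.head()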

In [None]:
ds_df_n.head()

In [None]:
# I have used a GridSearchCV, and personal judgement to come up with the following
# hyperparameters for the RandomForest model
rf = RandomForestClassifier(bootstrap=False, criterion="entropy", max_depth=3, n_estimators=120)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
plot_confusion_matrix(rf, X_test, y_test)

As you can see, it performs surprisingly well, and thus I begin to suspect that there is something wrong with my work. Data leakage? I don't know. Class bias? It does not seem so, as it predicts 1s just fine (though there are a considerable number of False Negatives, which is quite dangerous).
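One possible source of leakage (a guess, not a diagnosis) is that the min and max used for normalization were computed on all rows before the train/test split, so the test rows influenced the scaling. Below is a minimal sketch of fitting the scaling on the training rows only and reusing it on the test rows. The helpers `fit_min_max` and `apply_min_max` are hypothetical names, and the values are synthetic.

```python
import pandas as pd

def fit_min_max(train, cols):
    """Record the per-column min and max seen in the training rows only."""
    return train[cols].min(), train[cols].max()

def apply_min_max(df, cols, col_min, col_max, scale=20):
    """Rescale cols with previously fitted statistics; returns a new DataFrame."""
    out = df.copy()
    out[cols] = (df[cols] - col_min) / (col_max - col_min) * scale
    return out

# Synthetic values for illustration only, not the real dataset
train = pd.DataFrame({"serum_sodium": [130.0, 135.0, 140.0]})
test = pd.DataFrame({"serum_sodium": [137.5, 145.0]})

cols = ["serum_sodium"]
col_min, col_max = fit_min_max(train, cols)
train_n = apply_min_max(train, cols, col_min, col_max)
test_n = apply_min_max(test, cols, col_min, col_max)
print(test_n)  # test values may fall outside [0, 20]; that is expected
```

With this approach the split would have to happen before normalization, and the same fitted `col_min` and `col_max` would be reused for any later data.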

In [None]:
# Note: this scores the model on the full dataset, which includes the rows it was trained on
plot_confusion_matrix(rf, ds_df_n, y)
print(rf.score(ds_df_n, y))

The tendency to produce FNs is more apparent, but we can see that when the model predicts a 1, there is a high chance that the prediction is right, as False Positives are quite low while True Positives are impressive enough (for me, anyways).

> Thus, further testing might be wise when the model predicts a 0, but it seems less necessary when it predicts a 1. Nobody should use my work anyways, as it is sloppy, and from a high schooler (haha).
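The False Negative and False Positive observations above map onto two standard metrics: recall (how many actual 1s the model catches) and precision (how often a predicted 1 is right). Here is a small sketch of both. The counts are hypothetical numbers picked for illustration, not results from this notebook.

```python
def recall(tp, fn):
    """Share of actual positives the model caught: TP / (TP + FN)."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Share of predicted positives that were right: TP / (TP + FP)."""
    return tp / (tp + fp)

# Hypothetical confusion-matrix counts, for illustration only
tp, fp, fn, tn = 20, 5, 12, 62
print(f"recall={recall(tp, fn):.3f}  precision={precision(tp, fp):.3f}")
```

Many False Negatives show up as low recall, while few False Positives show up as high precision, which is the pattern described above.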

Anyhow, I'll save my model here to load it later, to get a more realistic process when actually using it on data.

In [None]:
from joblib import dump, load
dump(rf, "heart_failure_predictor.joblib")

*Note: this piece of code right below does save the result to the current directory, though it seems that it cannot be done on this Notebook, probably because I already saved the model there (that's just a wild guess though).*

In [None]:
def generate_predictions(filepath, normalize_cols, modelpath):
    """Load a CSV, normalize it like the training data, and save the model's predictions."""
    from joblib import load
    import pandas as pd
    def preprocess_data(dataframe):
        # Caveat: this uses the min/max of the new file, not the values seen during training
        dataframe[normalize_cols] = ((dataframe[normalize_cols] - dataframe[normalize_cols].min()) / (dataframe[normalize_cols].max() - dataframe[normalize_cols].min())) * 20
        return dataframe
    model = load(modelpath)
    # VERY IMPORTANT!!!
    #
    # A real, unpredicted dataset has no DEATH_EVENT column. errors="ignore" drops the column
    # when it exists (as in this demonstration on the same dataset) and does nothing otherwise.
    data = preprocess_data(pd.read_csv(filepath)).drop(columns="DEATH_EVENT", errors="ignore")
    data.insert(0, "predicted", model.predict(data))
    data.to_csv("./results.csv", index=False)
    return data

In [None]:
result = generate_predictions("../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv",
                    ["ejection_fraction","serum_creatinine", "serum_sodium", "time", "creatinine_phosphokinase", "platelets"],
                    "./heart_failure_predictor.joblib")
result.sample(10)