Here we are given a dataset and asked to predict a death event from independent features such as test results and existing conditions. The approach is to understand the data a bit and then build a model on it.

NOTE: Take this analysis with a pinch of salt: the data we are given may not represent all patient groups, because we have only 299 rows, which might not be sufficient to draw conclusions. Assuming this is all the data we have, we will do some analysis and build a model.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
file = pd.read_csv("../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")
file.head()

In [None]:
file.shape

First, let's gather some information about the features in the dataset (source: Wikipedia):

- Anaemia: It is a decrease in the total amount of red blood cells (RBCs) or hemoglobin in the blood, or a lowered ability of the blood to carry oxygen
- Creatinine_phosphokinase: Creatine kinase (CK), also known as creatine phosphokinase (CPK) or phosphocreatine kinase, is an enzyme expressed by various tissues and cell types.
- Diabetes: Diabetes mellitus (DM), commonly known as diabetes, is a group of metabolic disorders characterized by a high blood sugar level over a prolonged period of time.
- Ejection_fraction: An ejection fraction (EF) is the volumetric fraction (or portion of the total) of fluid (usually blood) ejected from a chamber (usually the heart) with each contraction (or heartbeat).
- High_blood_pressure: It is a long-term medical condition in which the blood pressure in the arteries is persistently elevated.
- Platelets: Platelets, or thrombocytes, are small, colorless cell fragments in our blood that form clots and stop or prevent bleeding.
- Serum_creatinine: A serum creatinine test measures the level of creatinine in your blood and provides an estimate of how well your kidneys filter (glomerular filtration rate).
- Serum_sodium: A sodium blood test is a routine test that allows your doctor to see how much sodium is in your blood. It's also called a serum sodium test. Sodium is an essential mineral to your body. It's also referred to as Na+. Sodium is particularly important for nerve and muscle function


Let's visualize the data, starting with some of the recorded conditions.

# Age Groups
Let's further investigate how age, together with sex, affects the number of deaths.

In [None]:
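# Bin edges for pd.cut. With right=False each bin is left-closed, e.g. [50, 55),
# so the labels are approximate, and any age below 45 gets no group (NaN).
bins = [45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 120]
labels = ['45-50', '51-55', '56-60', '61-65', '66-70', '71-75', '76-80', '81-85', '86-90', '91-95', '96-100']
file['AgeGroup'] = pd.cut(file['age'], bins=bins, labels=labels, right=False)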

plt.rcParams["figure.figsize"] = 10,8
df1 = file.groupby(["AgeGroup", "sex"]).agg({"DEATH_EVENT": "count"}).unstack()
df1.plot(kind = "bar", stacked = True)
plt.show()
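
How `pd.cut` assigns ages to these bins is easy to misread, so here is a minimal sketch on a few made-up ages (not rows from the dataset). With `right=False` each bin includes its left edge and excludes its right edge, so an age of exactly 50 lands in the second bin, and an age below the first edge gets no group at all. The name `toy_ages` is hypothetical.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near bin edges; not taken from the dataset.
toy_ages = pd.Series([42, 45, 49, 50, 55, 95])

bins = [45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 120]
labels = ['45-50', '51-55', '56-60', '61-65', '66-70', '71-75',
          '76-80', '81-85', '86-90', '91-95', '96-100']

# right=False: intervals are [45, 50), [50, 55), ..., [95, 120)
toy_groups = pd.cut(toy_ages, bins=bins, labels=labels, right=False)
print(pd.DataFrame({"age": toy_ages, "AgeGroup": toy_groups}))
```

Running `file['AgeGroup'].isna().sum()` on the real data shows how many rows fall outside the bins.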

# **Visualizing the two Tests - Serum Creatinine and Serum Sodium**

**Let's look at the two tests we have and see how they are related to deaths**

In [None]:

# Pass columns by keyword; recent seaborn versions no longer accept x and y positionally
sns.boxplot(data=file, x="AgeGroup", y="serum_creatinine", hue="DEATH_EVENT")

This is a very interesting plot:

- Most of the outliers we see in this plot are cases that resulted in a DEATH_EVENT.
- The median value of the test (the line inside each box) is noticeably higher for patients with a DEATH_EVENT than for survivors, except in the 86-90 age group.
- There are also some outliers among the survivors in the age groups between 61 and 70.
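
The center line of a boxplot is the median, so the comparison between outcomes can be checked numerically with a grouped median. The sketch below uses a tiny made-up frame called `toy` with the same column names as `file`; the values are illustrative only and are not real records.

```python
import pandas as pd

# Made-up values that mimic the dataset's column names; not real records.
toy = pd.DataFrame({
    "AgeGroup": ["61-65", "61-65", "61-65", "61-65",
                 "66-70", "66-70", "66-70", "66-70"],
    "DEATH_EVENT": [0, 0, 1, 1, 0, 0, 1, 1],
    "serum_creatinine": [0.9, 1.1, 1.8, 2.2, 1.0, 1.2, 1.5, 2.5],
})

# Median serum creatinine per age group, one column per outcome
# (column 0 = survived, column 1 = death event).
medians = (
    toy.groupby(["AgeGroup", "DEATH_EVENT"])["serum_creatinine"]
       .median()
       .unstack()
)
print(medians)
```

Replacing `toy` with `file` should give the same kind of table for every age group in the real data.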

In [None]:
# Pass columns by keyword; recent seaborn versions no longer accept x and y positionally
sns.boxplot(data=file, x="AgeGroup", y="serum_sodium", hue="DEATH_EVENT")

This is also a very interesting chart from my point of view:

- From the above plot we can clearly say that the lower the serum sodium level, the greater the chance of a DEATH_EVENT occurring.
- The younger age groups, between 45 and 65, appear to tolerate lower serum sodium levels, but as age increases even levels in the range of 130 begin to look fatal.
- Among the people who survived, the median serum sodium level in each age group is above 135.
- We can see some outliers in the 61-65 age group, which we need to check.

# Creatinine Phosphokinase

In [None]:
# Pass columns by keyword; recent seaborn versions no longer accept x and y positionally
sns.boxplot(data=file, x="AgeGroup", y="creatinine_phosphokinase", hue="DEATH_EVENT")

# DIABETES

In [None]:
plt.rcParams["figure.figsize"] = 12, 8
df4 = file.groupby(["AgeGroup", "diabetes"]).agg({"DEATH_EVENT": "count"}).unstack()
df4.plot(kind = "bar")
plt.show()

# Ejection fraction wrt DEATH EVENT

In [None]:
# Pass columns by keyword; recent seaborn versions no longer accept x and y positionally
sns.boxplot(data=file, x="AgeGroup", y="ejection_fraction", hue="DEATH_EVENT")

Some good observations from the above plot:
- The lower the ejection fraction, the higher the chance of a DEATH_EVENT.
- As we know from the description, this is the fraction of blood pumped out of the heart with each beat. As it decreases, the chance of blood reaching the remote parts of the body also decreases, which may cause seizures or affect the body in some other form.
- The mean for this fraction is around 37-38 (in the sample that we have at hand).


# High Blood Pressure wrt DEATH EVENT

In [None]:
plt.rcParams["figure.figsize"] = 8,6
# Pass columns by keyword; recent seaborn versions no longer accept x positionally
sns.countplot(data=file, x="high_blood_pressure", hue="DEATH_EVENT")

- The proportion of DEATH_EVENT cases is noticeably higher among patients with high blood pressure than among those without it.
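
Because the two blood-pressure groups differ in size, raw counts from a count plot can be hard to compare. A row-normalized cross-tabulation gives the death proportion within each group directly. This is a minimal sketch on made-up rows: the column names match the dataset, but the values and the name `toy` are illustrative assumptions.

```python
import pandas as pd

# Illustrative rows only; not taken from the dataset.
toy = pd.DataFrame({
    "high_blood_pressure": [0, 0, 0, 0, 0, 1, 1, 1, 1],
    "DEATH_EVENT":         [0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# normalize="index" makes each row sum to 1, so column 1 holds the
# share of death events within each blood-pressure group.
rates = pd.crosstab(toy["high_blood_pressure"], toy["DEATH_EVENT"], normalize="index")
print(rates)
```

The same call works for any binary column in `file`, such as `smoking` or `diabetes`.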

# Anaemia with Platelets with respect to DEATH_EVENTS

In [None]:
plt.rcParams["figure.figsize"] = 10, 6
# Pass columns by keyword; recent seaborn versions no longer accept x and y positionally
sns.stripplot(data=file, x="anaemia", y="platelets", hue="DEATH_EVENT")

This plot is fairly evenly distributed by all means; there are not many differences:

- The platelet distribution for DEATH_EVENT cases is spread fairly evenly.
- The same holds for anaemia.

Though any plot by AgeGroup depends entirely on the count of people in each group (our dataset has more rows in the first six groups, ages 45-75), we can see that past 75 the death rate increases by a lot.



# Smoking

In [None]:
plt.rcParams["figure.figsize"] = 8,6
# Pass columns by keyword; recent seaborn versions no longer accept x positionally
sns.countplot(data=file, x="smoking", hue="DEATH_EVENT")

- **I actually thought that smoking would have an effect on the DEATH_EVENT numbers, but as we can see the ratios are roughly equal.**

We now have good insight into the data, so let's build a model.

# Building Models

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split, GridSearchCV

In [None]:
# For problems like this, always choose metrics that suit the problem (here F1, precision and recall).
def metric(model, preds, y_valid):
    
    f1 = f1_score(y_valid, preds)
    precision = precision_score(y_valid, preds)
    recall = recall_score(y_valid, preds)
    if hasattr(model, 'oob_score_'):
        return f1, precision, recall, model.oob_score_
    else:
        return f1, precision, recall

    
def feat_imp(model, cols):
    return pd.DataFrame({"Col_names": cols, "Importance": model.feature_importances_}).sort_values("Importance", ascending=False)

def plot_i(fi, x, y):
    return fi.plot(x=x, y=y, kind="barh", figsize=(12, 8))

In [None]:
X = file.drop(["DEATH_EVENT", "AgeGroup"], axis = 1)
y = file["DEATH_EVENT"]

In [None]:
x_train, x_valid, y_train, y_valid = train_test_split(X, y)

In [None]:
%%time
Rf = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=3, max_features= 0.6, oob_score=True)
model_Rf = Rf.fit(x_train, y_train)
preds = model_Rf.predict(x_valid)
print(metric(model_Rf, preds, y_valid))


This seems to be a decent model, as we are getting a good recall score of about 65%. Recall matters most here because we are much more concerned about false negatives. We will try to improve it further using hyperparameter tuning.
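
Recall is the share of actual death events that the model catches, TP / (TP + FN), which is why it is the metric to watch when false negatives are the main concern. Here is a small worked example with made-up labels (not the model's real predictions) that computes it by hand and compares it with `recall_score`:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions for ten patients; 1 = death event.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)       # share of actual death events that were caught
precision = tp / (tp + fp)    # share of predicted death events that were real

print(recall, recall_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
```

In this toy case one death event is missed (a false negative) and one survivor is flagged (a false positive), so both hand-computed values agree with the scikit-learn functions.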

Getting some insights from our RandomForest model.

In [None]:
feat10 = feat_imp(model_Rf, x_train.columns)
feat10

In [None]:
plot_i(feat10, "Col_names", "Importance")

Building a model based on some important features

In [None]:
to_keep = feat10[feat10["Importance"] > 0.03]
len(to_keep)

In [None]:
X = X[to_keep.Col_names]
x_train, x_valid, y_train, y_valid = train_test_split(X, y)

In [None]:
%%time
Rf = RandomForestClassifier(n_estimators=160, max_depth=5, min_samples_leaf=3, max_features= 0.5, oob_score=True)
model_Rf = Rf.fit(x_train, y_train)
preds = model_Rf.predict(x_valid)
print(metric(model_Rf, preds, y_valid))

This has improved our model's performance considerably: recall has gone up to 71%, which I think is a great score to achieve.
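
Since every call to `train_test_split` draws a new random split, the score from a single split depends on which rows land in the validation set. One way to compare feature sets more steadily is stratified cross-validation scored on recall. This is only a sketch under assumptions: it uses synthetic data from `make_classification` as a stand-in for `X` and `y`, and the fold count, class weights and `random_state` values are arbitrary choices that are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and y (299 rows, assumed class ratio); not the real data.
X_demo, y_demo = make_classification(
    n_samples=299, n_features=8, n_informative=4,
    weights=[0.68, 0.32], random_state=0,
)

rf = RandomForestClassifier(
    n_estimators=160, max_depth=5, min_samples_leaf=3,
    max_features=0.5, random_state=0,
)

# Stratified folds keep the share of positive cases similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
recalls = cross_val_score(rf, X_demo, y_demo, cv=cv, scoring="recall")

# Mean and spread of recall across the five folds.
print(recalls.mean(), recalls.std())
```

On the real data, passing the two versions of `X` (all features, then only the important ones) through the same folds would show whether the gain holds up beyond one split.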

Next, let's take the important variables from this model and use them to build our logistic regression model.

In [None]:
feat_2 = feat_imp(model_Rf, x_train.columns)
feat_2

# Logistic Regression model

In [None]:
to_keep = feat_2[feat_2["Importance"] > 0.1]
to_keep

In [None]:
X = X[to_keep.Col_names]
X.head(2)

In [None]:
x_train, x_valid, y_train, y_valid = train_test_split(X, y)


In [None]:
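C = [0.2, 0.4, 0.6, 0.8, 1]
Lr = LogisticRegression()
# Not every solver supports every penalty, so list only the valid combinations
param_grid = [
    dict(solver=['newton-cg', 'lbfgs', 'sag'], penalty=['l2'], C=C),
    dict(solver=['liblinear', 'saga'], penalty=['l1', 'l2'], C=C),
    # elasticnet needs saga and an l1_ratio; 0.5 is an assumed value, not from the original
    dict(solver=['saga'], penalty=['elasticnet'], l1_ratio=[0.5], C=C),
]
grid = GridSearchCV(Lr, param_grid=param_grid, n_jobs=-1, cv=3)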


In [None]:
%%time
grid_result = grid.fit(x_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

In [None]:
# Use the best estimator from the grid search (GridSearchCV refits it on x_train by default)
model_Lr = grid_result.best_estimator_
preds = model_Lr.predict(x_valid)
print(metric(model_Lr, preds, y_valid))

As we saw, we got a good score using Random Forest without doing much, with a recall of about 71%. Using boosting algorithms would be overkill for such a small dataset.

The most important takeaway is understanding the relationships between the independent variables and the dependent variable. For example, a particular test result such as serum creatinine can give us an idea of a patient's health with respect to a death event occurring.

Any feedback or suggestions would be highly appreciated! :) Stay safe