# HEART FAILURE CLASSIFICATION

In this study, I use a Random Forest Classifier to predict mortality from heart failure.

* First, I will use data visualization techniques to understand the dataset.

* Then I will fit a random forest classifier with its default parameters and make some predictions. We will look at the accuracy, precision, recall and F1 scores of this default model, and I will show the results in a confusion matrix.

* Next, I will tune the default model's n_estimators (number of trees) and max_features (number of features considered at each split) parameters using grid search with cross-validation.

* We will then evaluate the tuned model again and check for any improvement.

* It is not always a good idea to use the best parameters. They may improve accuracy (or precision, recall, etc.), but we should also consider the computational cost of the model. For some applications it can be wiser to choose other parameter values and accept a small drop in the model scores.

* For the optimum model, I will investigate the feature importances and then use only the most important features in a new model.

* I will also plot the model scores against n_estimators to find a smaller n_estimators value that scores close to the best one.

* With fewer features and a smaller n_estimators (number of trees), the random forest will run faster while still achieving good scores.

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
df = data.copy()

display(df.head())
df.info()

# EDA:

First, let's look at the counts of the target variable.

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally.
sns.countplot(x='DEATH_EVENT', data=df, palette=['blue', 'red'])
plt.title('Target Feature Counts', fontsize=20);

It looks like the dataset is imbalanced.
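
To put a number on the imbalance instead of reading it off the bar chart, the class shares can be computed with `value_counts(normalize=True)`. This is a minimal sketch: `toy` is a small made-up Series standing in for `df['DEATH_EVENT']`, and its values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df['DEATH_EVENT']; replace with the real column.
toy = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name='DEATH_EVENT')

# Share of each class, as a fraction of all rows.
class_share = toy.value_counts(normalize=True)
print(class_share)

# Ratio of the majority class to the minority class; 1.0 would mean perfectly balanced.
imbalance_ratio = class_share.max() / class_share.min()
print('Imbalance ratio: %.2f' % imbalance_ratio)
```

On the real data, the same two lines applied to `df['DEATH_EVENT']` give the actual shares.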

What are the percentages of the male and the female patients?

In [None]:
plt.pie(df['sex'].value_counts().values, 
        labels=['Men', 'Women'], 
        colors=['cyan', 'pink'], 
        autopct='%1.f%%', 
        shadow=True, 
        startangle=45, 
        textprops={'fontsize':25});

In [None]:
sns.countplot(x='smoking', data=df,
              palette=['orange', 'brown'])
plt.title('Smokers and Non-smokers Counts', fontsize=20);

What is the age distribution of the patients?

In [None]:
print('Age Statistics of the Patients' + '\n\n' + str(df.age.describe()))

In [None]:
sns.boxplot(x=df.age)
plt.title('Age Statistics of the Patients', fontsize=20);

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x='age', hue='smoking', data=df, palette=['blue', 'red'], alpha=0.7)
plt.title("Age and Smoking", fontsize=20)
plt.xticks(rotation=90)
plt.yticks(list(range(0,27,3)))
plt.grid();

In [None]:
plt.figure(figsize=(11,11))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation of the Features', fontsize=20);

The heatmap above shows which features are most strongly correlated with 'DEATH_EVENT': **'age', 'ejection_fraction', 'serum_creatinine', 'serum_sodium', 'time'**. Later in this study I will look at the feature importances from the random forest model and compare them with this list.

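Tree-based models such as random forests do not need feature scaling, but I scale the dataset to the [0, 1] range so that all features are on a common scale.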

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
df_scaled.head()
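
As a quick check of what `MinMaxScaler` does, the sketch below compares its output with the formula (x - min) / (max - min) applied per column. `X_toy` is a made-up array used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (illustrative values only): two columns on very different scales.
X_toy = np.array([[40.0, 100.0],
                  [60.0, 300.0],
                  [80.0, 500.0]])

scaled = MinMaxScaler().fit_transform(X_toy)

# The same result written out by hand: (x - column min) / (column max - column min).
manual = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))

print(scaled)
print(np.allclose(scaled, manual))
```

Both columns end up in the range [0, 1], regardless of their original units.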

Now it's time to separate the features and the target variable.

In [None]:
X = df_scaled.drop('DEATH_EVENT', axis=1).values
y = df['DEATH_EVENT'].values  # take the target from the unscaled frame so the labels stay integer 0/1

Next, split the dataset into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

print('X_train : ', X_train.shape)
print('y_train : ', y_train.shape)
print('X_test  : ', X_test.shape)
print('y_test  : ', y_test.shape)
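
Because the target is imbalanced, one option worth considering is a stratified split, which keeps the class ratio the same in the training and test sets. The sketch below is not what this notebook does; it only illustrates the `stratify` argument of `train_test_split` on made-up data (`X_toy`, `y_toy`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative): 100 rows, 30% positive class.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy keeps the 30/70 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10, stratify=y_toy)

print('Train positive share: %.2f' % y_tr.mean())
print('Test positive share : %.2f' % y_te.mean())
```

Without `stratify`, a small test set can end up with noticeably more or fewer positive cases than the full dataset, which makes precision and recall less stable.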

## Default Model

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Define the model creation and evaluation function once, so it can be reused for every model below.

def model_and_eval(max_features, n_estimators, random_state):
    # Pass n_estimators through to the model; otherwise the default (100 trees) is always used.
    rf = RandomForestClassifier(n_estimators=n_estimators, max_features=max_features, random_state=random_state)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    
    # Printing the model scores:
    print('Accuracy  : %.2f' % accuracy_score(y_test, y_pred))
    print('Precision : %.2f' % precision_score(y_test, y_pred))
    print('Recall    : %.2f' % recall_score(y_test, y_pred))
    print('F1 score  : %.2f' % f1_score(y_test, y_pred))
    
    # Creating the confusion matrix:
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, annot_kws={"fontsize":20}, fmt='d', cbar=False, cmap='PuBu')
    plt.title('Confusion Matrix of the Model', color='navy', fontsize=15)
    plt.xlabel('Predicted Values')
    plt.ylabel('Actual Values');

Let's look at the default model. I use the default max_features and n_estimators values, and set random_state to get the same results on each run.

In [None]:
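# 'sqrt' is the classifier default; the older alias 'auto' has been removed from recent scikit-learn releases.
model_and_eval(max_features='sqrt', n_estimators=100, random_state=10)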
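
To make the printed scores easier to read against the confusion matrix, the sketch below derives precision, recall and F1 directly from the four cells of a confusion matrix. The labels `y_true` and `y_hat` are made up for illustration and are not model output.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy labels (illustrative only), 1 = death event.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of the predicted death events, how many were real
recall = tp / (tp + fn)      # of the real death events, how many were found
f1 = 2 * precision * recall / (precision + recall)

print('Precision : %.2f' % precision)
print('Recall    : %.2f' % recall)
print('F1 score  : %.2f' % f1)
```

With an imbalanced target, recall on the positive class is often the number to watch, since accuracy can stay high even when many death events are missed.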

## Model Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators' : [10, 15, 20, 30, 50, 100, 150], # Number of decision trees
              'max_features' : [0.5, 2, 5, 10, 12]}            # Features considered at each split (int = count, float = fraction)

rf = RandomForestClassifier(random_state=10)

gs = GridSearchCV(rf, param_grid, cv=10, n_jobs=-1)

gs.fit(X_train, y_train)

print('Best Parameter ', gs.best_params_)
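
Besides `best_params_`, the fitted search object keeps the score of every combination in `cv_results_`, which is handy for seeing how close the runners-up are. The sketch below runs a small grid on synthetic data from `make_classification` (a stand-in, not the heart failure data) and lists the combinations from best to worst. In `max_features`, an integer is read as a number of features and a float as a fraction of the features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (not the heart failure data).
X_syn, y_syn = make_classification(n_samples=120, n_features=12, random_state=10)

grid = {'n_estimators': [10, 20], 'max_features': [0.5, 2]}
search = GridSearchCV(RandomForestClassifier(random_state=10), grid, cv=3)
search.fit(X_syn, y_syn)

# One row per parameter combination, best first.
columns = ['param_max_features', 'param_n_estimators',
           'mean_test_score', 'std_test_score', 'rank_test_score']
table = pd.DataFrame(search.cv_results_)[columns].sort_values('rank_test_score')
print(table)
```

If several combinations score within one standard deviation of the best, the cheaper one is usually a reasonable choice.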

Let's see the tuned model with the best parameters:

In [None]:
model_and_eval(max_features=gs.best_params_['max_features'], n_estimators=gs.best_params_['n_estimators'], random_state=10)

## Feature Importance and Increasing the Performance

I want to see the important features for the random forest model.

In [None]:
rf = RandomForestClassifier(random_state=10)  # fixed seed so the importances are reproducible
rf.fit(X_train, y_train)

ft_imp = pd.Series(rf.feature_importances_, index=df.iloc[:,:12].columns).sort_values()

ft_imp.plot(kind='barh')
plt.title('Feature Importance', fontsize=20);

Let's use only the most important features to reduce the computation our model needs.

In [None]:
X_new = df_scaled[['time', 'serum_creatinine', 'ejection_fraction', 'age', 'creatinine_phosphokinase', 'platelets', 'serum_sodium', 'smoking']]

X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=10)
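
Instead of typing the column names by hand, the top features can also be picked from the importances programmatically. The sketch below assumes a synthetic dataset with hypothetical column names `f0` to `f11` and keeps the 8 features with the highest importance using `Series.nlargest`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the scaled dataset; column names f0..f11 are hypothetical.
X_syn, y_syn = make_classification(n_samples=200, n_features=12,
                                   n_informative=4, random_state=10)
X_syn = pd.DataFrame(X_syn, columns=['f%d' % i for i in range(12)])

forest = RandomForestClassifier(random_state=10).fit(X_syn, y_syn)
importances = pd.Series(forest.feature_importances_, index=X_syn.columns)

# Keep the 8 features with the highest importance.
top_features = importances.nlargest(8).index.tolist()
X_top = X_syn[top_features]
print(top_features)
```

This keeps the feature list in sync with the model if the importances change, for example after refitting with a different seed.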

In [None]:
from sklearn.model_selection import GridSearchCV

n_estimators = list(range(1, 101))

param_grid = {'n_estimators' : n_estimators,
              'max_features' : [2, 5, 8]}  # X_new has only 8 features, so max_features cannot exceed 8

rf = RandomForestClassifier(random_state=42)

gs = GridSearchCV(rf, param_grid, cv=5, n_jobs=-1)

gs.fit(X_train, y_train)

scores = gs.cv_results_['mean_test_score']

print('Best Parameter ', gs.best_params_)

In [None]:
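results = pd.DataFrame(gs.cv_results_)
# Plot the curve for the best max_features value, so that the best point lies on it.
curve = results[results['param_max_features'] == gs.best_params_['max_features']]
best_x = gs.best_params_['n_estimators']
best_y = gs.best_score_

plt.figure(figsize=(15,5))
sns.lineplot(x=curve['param_n_estimators'].astype(int), y=curve['mean_test_score'], color='navy')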
plt.plot(best_x, best_y, marker='o', markersize=8, color="red", label='best_param')
plt.xlabel('n_estimators')
plt.ylabel('Accuracy')
plt.title('Random Forest n_estimators and Accuracy Plot', fontsize=20)
plt.xticks(np.arange(0, 100, 5), rotation=45)
plt.grid();

Grid search cross-validation gave us 44 as the best n_estimators. But the graph above shows that it is unnecessary to build the random forest with 44 decision trees. It is wiser to choose a lower n_estimators that still has a good mean accuracy, so the model needs less computation. I am going to choose 25 as my optimum n_estimators.
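
The choice of a smaller value can also be made with an explicit rule instead of by eye: take the smallest n_estimators whose score is within a chosen tolerance of the best score. The sketch below uses made-up scores for n_estimators = 1..10 and a hypothetical `tolerance`. In the notebook, the scores would come from `gs.cv_results_['mean_test_score']`.

```python
import numpy as np

# Hypothetical mean CV accuracies for n_estimators = 1..10 (made-up numbers for illustration).
n_values = np.arange(1, 11)
cv_scores = np.array([0.70, 0.78, 0.80, 0.83, 0.84, 0.845, 0.84, 0.85, 0.848, 0.849])

tolerance = 0.015  # accept scores up to 1.5 points of accuracy below the best

best_score = cv_scores.max()
# Smallest n_estimators whose score is within the tolerance of the best score.
good_enough = n_values[cv_scores >= best_score - tolerance]
chosen_n = int(good_enough.min())

print('Best score      : %.3f' % best_score)
print('Chosen n_values : %d' % chosen_n)
```

The tolerance is a judgment call: a larger value gives a smaller, faster forest at the price of a slightly lower expected score.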

In [None]:
model_and_eval(max_features=2, n_estimators=25, random_state=10)

**As you can see above, we obtain similar accuracy, precision, recall and F1 scores with only 8 features and 25 trees. In other words, we can get similar results with less computation.**