# About the Classification Modelling Technique To Predict Heart Failure
![](https://www.nhlbi.nih.gov/sites/default/files/styles/16x9_crop/public/2021-02/Heart%20failure%20-%20shutterstock_1663310782.jpg?h=8854d737&itok=C_OOM84X)

This analysis is primarily for beginners who are just starting their predictive modeling journey with the pycaret.classification module.

In this analysis we will learn:

* **Getting Data:** How to load the data into a pandas DataFrame
* **Setting up Environment:** How to set up an experiment in PyCaret and get started with building classification models
* **Create Model:** How to create a model, perform stratified cross validation and evaluate classification metrics
* **Tune Model:** How to automatically tune the hyperparameters of a classification model
* **Plot Model:** How to analyze model performance using various plots
* **Finalize Model:** How to finalize the best model at the end of the experiment
* **Predict Model:** How to make predictions on new / unseen data

# Getting the data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

In [None]:
data.head()

**Let's look at the shape of the data**

In [None]:
data.shape

# Setting Up Environment

**Installing PyCaret**

**Pre-requisites**
* Python 3.x
* Latest version of pycaret
* Internet connection to load data from pycaret's repository
* Basic Knowledge of Binary Classification

In [None]:
!pip install pycaret

**Importing the classification module**

In [None]:
from pycaret.classification import *

In [None]:
exp_clf101 = setup(data = data, target = 'DEATH_EVENT', session_id=123) 

Once setup() has executed successfully, it prints an information grid that contains several important pieces of information. Most of it relates to the pre-processing pipeline that is constructed when setup() is executed. The majority of these features are out of scope for this tutorial, but a few important things to note at this stage include:

**session_id :** A pseudo-random number distributed as a seed to all functions for later reproducibility. If no session_id is passed, a random number is generated automatically and distributed to all functions. In this experiment, the session_id is set to 123 for later reproducibility.
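To see why a fixed seed matters, here is a minimal sketch that uses only Python's standard random module, not PyCaret itself. The helper shuffled_indices is a hypothetical name introduced for illustration; the point is simply that the same seed reproduces the same "random" order.

```python
import random

def shuffled_indices(n_rows, seed):
    """Return the row indices 0..n_rows-1 in a seeded pseudo-random order."""
    rng = random.Random(seed)        # independent generator with an explicit seed
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return indices

first_run = shuffled_indices(10, seed=123)
second_run = shuffled_indices(10, seed=123)

# Same seed, same order: this is what makes an experiment repeatable.
print(first_run == second_run)   # True
```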

**Target Type :** Binary or Multiclass. The Target type is automatically detected and shown. There is no difference in how the experiment is performed for Binary or Multiclass problems. All functionalities are identical.

**Label Encoded :** When the Target variable is of type string (i.e. 'Yes' or 'No') instead of 1 or 0, it automatically encodes the label into 1 and 0 and displays the mapping (0 : No, 1 : Yes) for reference. In this experiment no label encoding is required since the target variable is of type numeric.
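As a rough illustration of what label encoding does, the sketch below maps string labels to integers in plain Python. It is not PyCaret's own encoder; encode_labels and the sample target values are assumptions made up for this example.

```python
def encode_labels(values):
    """Map each distinct label to an integer code, using sorted label order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    encoded = [mapping[v] for v in values]
    return encoded, mapping

# Hypothetical string target; the heart failure target is already numeric.
raw_target = ["No", "Yes", "No", "No", "Yes"]
encoded_target, mapping = encode_labels(raw_target)

print(mapping)          # {'No': 0, 'Yes': 1}
print(encoded_target)   # [0, 1, 0, 0, 1]
```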

**Original Data :** Displays the original shape of the dataset. In the PyCaret tutorial this description is taken from, (22800, 24) means 22,800 samples and 24 features including the target column. For the heart failure data, expect the shape returned by data.shape above.

**Missing Values :** When there are missing values in the original data this will show as True. For this experiment there are no missing values in the dataset.

**Numeric Features :** The number of features inferred as numeric. In the tutorial dataset, 14 out of 24 features are inferred as numeric.

**Categorical Features :** The number of features inferred as categorical. In the tutorial dataset, 9 out of 24 features are inferred as categorical.

**Transformed Train Set :** Displays the shape of the transformed training set. In the tutorial dataset the original shape of (22800, 24) becomes (15959, 91): the number of features increases from 24 to 91 because of categorical encoding.

**Transformed Test Set :** Displays the shape of the transformed test/hold-out set. In the tutorial dataset there are 6841 samples in the test/hold-out set. The split uses the default 70/30 ratio, which can be changed with the train_size parameter in setup().

Notice how a few tasks that are essential for modeling are handled automatically, such as missing value imputation (in this case there are no missing values in the training data, but imputers are still needed for unseen data) and categorical encoding. Most of the parameters in setup() are optional and are used to customize the pre-processing pipeline. These parameters are out of scope for this tutorial; the intermediate and expert level PyCaret tutorials cover them in much greater detail.
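The 70/30 hold-out split can be sketched in a few lines of plain Python. This is a simplified illustration, not PyCaret's implementation: train_test_split_indices is a hypothetical helper, and it shuffles rows without stratifying by the target.

```python
import random

def train_test_split_indices(n_rows, train_size=0.7, seed=123):
    """Shuffle the row indices and cut them into a train part and a hold-out part."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = round(n_rows * train_size)   # number of rows that go to training
    return indices[:cut], indices[cut:]

train_idx, test_idx = train_test_split_indices(1000)
print(len(train_idx), len(test_idx))   # 700 300
```

Changing train_size here plays the same role as the train_size parameter of setup().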

*Source: PyCaret Tutorial*

**Comparing Models**

Comparing all models to evaluate performance is the recommended starting point for modeling once the setup is completed (unless you know exactly what kind of model you need, which is often not the case). This function trains all models in the model library and scores them using stratified cross validation for metric evaluation. The output prints a score grid that shows the average Accuracy, AUC, Recall, Precision, F1 and Kappa across the folds (10 by default) for all the available models in the model library.
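To make "stratified cross validation" concrete, here is a minimal sketch in plain Python of how rows can be dealt into folds so that each fold keeps roughly the same class balance. The function stratified_folds and the toy labels are assumptions for illustration; PyCaret relies on its own fold generator.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=123):
    """Assign each row index to a fold so every fold has a similar class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the rows of one class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

# Toy target: 70 negatives and 30 positives.
labels = [0] * 70 + [1] * 30
folds = stratified_folds(labels, n_folds=10)

for fold in folds[:3]:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)   # 10 3
```

Each fold takes a turn as the validation set while the other nine are used for training, and the reported metric is the average over the ten turns.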

I have used AUC as the metric to sort the models by; you can choose whichever metric suits your problem.
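For readers new to AUC, one way to read it is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way in plain Python; roc_auc is a hypothetical helper written for this explanation, not the function PyCaret calls internally.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative row")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))   # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why it is a convenient single number for comparing models.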

In [None]:
compare_models(sort='AUC')

PyCaret makes this a breeze: with a single short line of code it has created 15 classification models to choose from, based on your metric of choice. From this point onward I will work only with the Random Forest model "rf", given that it showed the best result on my chosen metric.

In [None]:
rf = create_model('rf')

# Tuning the Model

When a model is created using the create_model() function it uses the default hyperparameters. In order to tune hyperparameters, the tune_model() function is used. This function automatically tunes the hyperparameters of a model on a pre-defined search space and scores it using stratified cross validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold.
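The idea of tuning over a pre-defined search space can be sketched as a small random search. Everything below is an assumption made for illustration: the search_space values are invented rather than PyCaret's actual grid, and toy_cv_score is a stand-in formula where a real run would train the model and return its cross-validated AUC.

```python
import random

# Hypothetical search space for a random forest; not PyCaret's actual grid.
search_space = {
    "n_estimators": [50, 100, 200, 300],
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 2, 4],
}

def toy_cv_score(params):
    """Stand-in for a cross-validated AUC; a real run would fit and score models."""
    depth = params["max_depth"] or 10
    return (0.80
            + 0.0001 * params["n_estimators"]
            - 0.002 * abs(depth - 5)
            - 0.005 * params["min_samples_leaf"])

def random_search(space, score_fn, n_iter=10, seed=123):
    """Sample n_iter random combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        candidate = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score

best_params, best_score = random_search(search_space, toy_cv_score)
print(best_params, round(best_score, 4))
```

Because only a handful of combinations are sampled, the tuned model is not guaranteed to beat the default one, so it is worth comparing the two score grids.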

**Note:** In PyCaret 2.0 and later, tune_model() takes a trained model object, such as the rf object returned by create_model() above. Earlier 1.x releases instead required the model name as an abbreviated string, in the same way it is passed to create_model().

*Source: PyCaret Tutorial*

In [None]:
tuned_rf = tune_model(rf, optimize = 'AUC')

**After tuning, the AUC metric improves by 2.5%.**

# Plot the Model

In [None]:
plot_model(tuned_rf, plot = 'auc')

# Precision Recall Curve

In [None]:
plot_model(tuned_rf, plot = 'pr')

# Feature Importance

In [None]:
plot_model(tuned_rf, plot='feature')

In [None]:
interpret_model(tuned_rf)

In [None]:
interpret_model(tuned_rf, plot='correlation')

In [None]:
interpret_model(tuned_rf, plot = 'reason', observation = 10)

# Confusion Matrix

In [None]:
plot_model(tuned_rf, plot = 'confusion_matrix')
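To read the confusion matrix, it helps to see how its four cells turn into the metrics from the score grid. The sketch below uses plain Python and made-up predictions; confusion_counts and the toy labels are assumptions for illustration only.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for a binary problem with labels 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

# Made-up labels and predictions, not output from the model above.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted deaths, how many were real
recall = tp / (tp + fn)      # of the real deaths, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)                    # 3 1 1 3
print(accuracy, precision, recall, f1)   # 0.75 0.75 0.75 0.75
```

For a heart failure model, false negatives (missed deaths) are usually the costlier error, so recall deserves a close look alongside AUC.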

**Another way to analyze the performance of models is to use the evaluate_model() function which displays a user interface for all of the available plots for a given model. It internally uses the plot_model() function.**

In [None]:
evaluate_model(tuned_rf)

# Finalizing the Model

Model finalization is the last step in the experiment. A normal machine learning workflow in PyCaret starts with setup(), followed by comparing all models using compare_models() and shortlisting a few candidate models (based on the metric of interest) to perform several modeling techniques such as hyperparameter tuning, ensembling etc. 

This workflow eventually leads you to the best model for making predictions on new and unseen data. The finalize_model() function fits the model onto the complete dataset, including the test/hold-out sample. For the purposes of this analysis we have used just one model, the Random Forest selected on the AUC metric, and we will finalize it for prediction.


In [None]:
final_rf = finalize_model(tuned_rf)

In [None]:
predict_model(final_rf);

In [None]:
prediction = predict_model(final_rf, data = data)
prediction.head()

The Label and Score columns are added to the prediction dataset.
Label is the predicted class and Score is the probability of that prediction. (In PyCaret 3.x these columns are named prediction_label and prediction_score.)
Notice that the predicted results are appended to the original dataset, while all the transformations are performed automatically in the background.
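The relationship between the two columns can be sketched as follows, assuming the model outputs a probability for class 1 and a 0.5 cut-off is used. The helper label_and_score is hypothetical and only mirrors the description above, where the score is the probability of whichever class was predicted.

```python
def label_and_score(prob_positive, threshold=0.5):
    """Turn the probability of class 1 into a predicted label and its probability."""
    label = 1 if prob_positive >= threshold else 0
    # The score is the probability of whichever class was predicted.
    score = prob_positive if label == 1 else 1.0 - prob_positive
    return label, score

print(label_and_score(0.8))    # (1, 0.8)
print(label_and_score(0.25))   # (0, 0.75)
```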

# Saving the Model

We have now finished the experiment by finalizing the tuned_rf model, which is now stored in the final_rf variable. We have also used the model stored in final_rf to predict the outcomes. This brings us to the end of our analysis, but one question remains: what happens when you have more new data to predict? Do you have to go through the entire experiment again? The answer is no. PyCaret's built-in function save_model() allows you to save the model along with the entire transformation pipeline for later use.
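Why does it matter that the transformation pipeline is saved together with the model? The toy sketch below stores a preprocessing value and a decision rule in one pickled object, so the restored object treats new data exactly as it did before saving. It uses only the standard library; the artifact dictionary, its values and the predict helper are all assumptions for illustration and say nothing about PyCaret's file format.

```python
import os
import pickle
import tempfile

# Toy stand-in for "preprocessing + model", kept as plain data for simplicity.
artifact = {
    "impute_value": 1.1,   # hypothetical value used to fill a missing input
    "threshold": 1.5,      # hypothetical decision rule: predict 1 above this
}

def predict(artifact, value):
    """Apply the stored preprocessing, then the stored decision rule."""
    if value is None:
        value = artifact["impute_value"]
    return 1 if value > artifact["threshold"] else 0

# Save to disk and load it back, as you would in a later session.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(artifact, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print(predict(restored, 2.0), predict(restored, None))   # 1 0
```

If only the decision rule were saved, the missing input in the second call could not be handled the same way, which is the problem that saving the whole pipeline avoids.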

In [None]:
save_model(final_rf,'Final RF Model 08Feb2020')

# Loading the Saved Model

In [None]:
saved_final_rf = load_model('Final RF Model 08Feb2020')

Once the model is loaded in the environment, you can simply use it to predict on any new data using the same predict_model() function. Below I have applied the loaded model to predict.

In [None]:
new_prediction = predict_model(saved_final_rf, data=data)
new_prediction.head()