# <center>Is Your Heart Healthy?</center><br>
<img src = "https://h2hcardiaccenter.com/blog/wp-content/uploads/2018/07/shutterstock_556072003-1160x650-1024x574.jpg"></img><br>
#### <div align='right'>Made by: **Asad Mahmood**</div>

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score as acs
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

<a id="toc"></a>

<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0' role="tab" aria-controls="home"><center>Table of Contents</center></h2>

1. [About the Dataset](#Intro)
2. [Task](#Obj)
3. [Exploratory Data Analysis](#EDA)
    1. Data Exploration
    2. Visual Exploration
4. [Model Building](#Model)
    1. Train and Test Split
    2. Lazy Prediction
    3. Fine Tuning Best Model
5. [Evaluation](#Eval)

<a name="Intro"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0' role="tab" aria-controls="home"><center>About the Dataset</center></h3>


**Cardiovascular diseases** (CVDs) are the number 1 cause of death globally, taking an estimated **17.9** million lives each year, which accounts for **31%** of all deaths worldwide. Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

The dataset can be found at: https://www.kaggle.com/andrewmvd/heart-failure-clinical-data

[Return to TOC](#toc)

<a name="Obj"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0' role="tab" aria-controls="home"><center>Task</center></h3>

Create a model for predicting mortality caused by heart failure using classification techniques of your choice.

[Return to TOC](#toc)

<a name="EDA"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0' role="tab" aria-controls="home"><center>Exploratory Data Analysis</center></h3>

<div class="list-group" id="list-tab" role="tablist">
<h4 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gray; border:0' role="tab" aria-controls="home"><center>Data Exploration</center></h4>

In [None]:
# Reading in datasets
df = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

In [None]:
# Shape of dataset
df.shape

In [None]:
# Checking for null values in %
round((df.isnull().sum()/len(df))*100)

**There are no null values in the columns, so no special pre-processing is required.**

<div class="list-group" id="list-tab" role="tablist">
<h4 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gray; border:0' role="tab" aria-controls="home"><center>Visual Exploration</center></h4>

### Histogram Func

In [None]:
HEIGHT = 500
WIDTH = 900
NBINS = 50
SCATTER_SIZE=700


def plot_histogram(dataframe, column, color, bins, title, width=WIDTH, height=HEIGHT):
    '''
        Description:
        ----------
        This function plots a histogram.

        Parameters
        ----------
        dataframe : pandas dataframe
            Complete Dataframe
        column : Name of column as a str, that will be used in x axis 
            Example: "age"
        color: Name of column as str, that will be represented by color
            Example: "sex"
        bins: Numeric amount, to determine the width of bars
        title: Title of the plot, can be passed directly as str
            Example: "Plot for xyz"
        width and height: Both are numeric

        Output
        ------
        A histogram plot
    '''
    figure = px.histogram(
        dataframe, 
        column, 
        color=color,
        nbins=bins, 
        title=title, 
        width=width,
        height=height
    )
    figure.update_layout({
            'plot_bgcolor': 'rgba(0, 0, 0, 0)',
            'paper_bgcolor': 'rgba(0, 0, 0, 0)',
        })
    figure.show()

### Violin Plot Func

In [None]:
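def plot_violin(dataframe, X, y, title, width=WIDTH, height=HEIGHT):
    '''
        Plots a violin plot (with an inner box plot and all data points)
        of column `y`, grouped by the categories in column `X`.
    '''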
    fig = px.violin(
        dataframe, 
        X, 
        y, 
        points = 'all',
        title = title,
        width = width,
        height = height,
        box = True
    )
    fig.update_layout({
        'plot_bgcolor': 'rgba(0, 0, 0, 0)',
        'paper_bgcolor': 'rgba(0, 0, 0, 0)',
    })
    fig.show()

In [None]:
plot_histogram(df, 'age', 'sex', NBINS, 'Patients Age and Sex Plot')

In [None]:
plot_violin(df, 'sex', 'age', 'Patients Age and Sex Distribution')

**It seems that there are more males than females in this data, and both genders have a greater density around the ages of 55 to 65.**

In [None]:
plot_histogram(df, 'age', 'DEATH_EVENT', NBINS, 'Patients Age Death Plot')

In [None]:
plot_violin(df, 'DEATH_EVENT', 'age', 'Patients Age Death Distribution')

**Deaths increase as age increases, and there are fewer rows of data where a death occurs.**

<div class="list-group" id="list-tab" role="tablist">
<h4 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gray; border:0' role="tab" aria-controls="home"><center>Feature Extraction</center></h4>

Two algorithms are used for feature selection.

+ **ExtraTreesClassifier:** It fits a number of randomized decision trees to the data, which makes it a form of ensemble learning. Each tree is built with randomly drawn split thresholds, which reduces variance and helps keep the model from overfitting the data.
<br>
+ **Step forward and backward feature selection:** This is a “wrapper-based” feature selection method, where the selection is driven by the performance of a specific machine learning algorithm (in this case, the RandomForestClassifier). 
    - In forward selection, the search starts with no features and adds them one at a time, each time keeping the feature that gives the highest ROC_AUC score. 
    - In backward selection, the process runs in reverse: the search starts with all features and drops them one at a time, each time removing the feature whose removal costs the least ROC_AUC.
<br>

I'm using both to demonstrate each strategy in action and to double-check my selection.
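
The cell below ranks features by the impurity-based `feature_importances_` of each model. For comparison, the step forward and backward procedure described above could be sketched with scikit-learn's `SequentialFeatureSelector`. This is a minimal sketch, not the notebook's own method: the synthetic data from `make_classification`, the helper name `stepwise_select`, and the settings `n_estimators=20` and `cv=3` are all assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic stand-in data (assumption): 200 rows, 6 features, binary target
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])


def stepwise_select(X, y, direction, n_features=3):
    """Return the column names kept by forward or backward selection."""
    selector = SequentialFeatureSelector(
        RandomForestClassifier(n_estimators=20, random_state=0),
        n_features_to_select=n_features,
        direction=direction,   # "forward" or "backward"
        scoring="roc_auc",     # score used to compare candidate feature sets
        cv=3,
    )
    selector.fit(X, y)
    # get_support() returns a boolean mask over the input columns
    return list(X.columns[selector.get_support()])


forward_features = stepwise_select(X_demo, y_demo, "forward")
backward_features = stepwise_select(X_demo, y_demo, "backward")
print("Forward: ", forward_features)
print("Backward:", backward_features)
```

On the real data, the feature columns of `df` and its `DEATH_EVENT` column would take the place of `X_demo` and `y_demo`.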

In [None]:
def feature_Select(x, y):
    # Lib import
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.ensemble import RandomForestClassifier
    
    models = [
        ('Extra Trees Classifier:', ExtraTreesClassifier()),
        ('Random Forest Classifier:', RandomForestClassifier()),
        ]
    for name, model in models:

        model.fit(x,y)
        feat_importances = pd.Series(model.feature_importances_, index=x.columns).sort_values(ascending=False)

        # Displaying values
        figure = px.bar(feat_importances,
                        x = feat_importances.values, 
                        y = feat_importances.keys(), 
                        text = np.round(feat_importances.values, 2),
                        title = name + ' Feature Selection Plot')
        figure.update_layout({
            'plot_bgcolor': 'rgba(0, 0, 0, 0)',
            'paper_bgcolor': 'rgba(0, 0, 0, 0)',
        })
        figure.show()

In [None]:
# Feature Selection

x = df.iloc[:, :-1]
y = df.iloc[:,-1]

In [None]:
feature_Select(x, y)

I will choose the top 3 features with the highest importance scores, i.e. "time", "serum_creatinine" and "ejection_fraction".

[Return to TOC](#toc)

<a name="Model"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0' role="tab" aria-controls="home"><center>Model Building</center></h3>

In [None]:
# Selecting x and y
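# The three selected features (columns 4, 7 and 11 of the dataset)
x = df[['ejection_fraction', 'serum_creatinine', 'time']].values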
y = df.iloc[:,-1].values

<div class="list-group" id="list-tab" role="tablist">
<h4 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gray; border:0' role="tab" aria-controls="home"><center>Train and Test Split</center></h4>

In [None]:
# Splitting the dataset into training set and test set

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state =0)

<div class="list-group" id="list-tab" role="tablist">
<h4 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gray; border:0' role="tab" aria-controls="home"><center>Feature Scaling</center></h4>

In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

<div class="list-group" id="list-tab" role="tablist">
<h4 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gray; border:0' role="tab" aria-controls="home"><center>Lazy Prediction</center></h4>

In [None]:
!pip install lazypredict

In [None]:
!pip install --upgrade pandas

In [None]:
from lazypredict.Supervised import LazyClassifier

clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = clf.fit(X_train, X_test, y_train, y_test)

print(models)

I am selecting the two most accurate models and will fine-tune them to get even better results.<br>

<div class="list-group" id="list-tab" role="tablist">
<h4 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gray; border:0' role="tab" aria-controls="home"><center>Fine Tuning Model</center></h4>

In [None]:
def det_Leaves(X_train, X_test, y_train, y_test):

    acc = []
    
    for leaves in range(10,20):
        clf = DecisionTreeClassifier(max_leaf_nodes = leaves, random_state=0, criterion='entropy')
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        acc.append(acs(y_test, y_pred))
    
    plt.figure(figsize=(15, 5))
    plt.title('Leaves')
    plt.plot(list(range(10,20)), acc)
    plt.show()

In [None]:
def det_Estimators(X_train, X_test, y_train, y_test):
    
    acc = []
    for estimators in range(15,25):
        clf = ExtraTreesClassifier(n_estimators = estimators, random_state=0, criterion='entropy')
        #clf = RandomForestClassifier(n_estimators = estimators, random_state=0, criterion='entropy')
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        acc.append(acs(y_test, y_pred))
    plt.figure(figsize=(15, 5))
    plt.title('Estimators')
    plt.plot(list(range(15,25)), acc)
    plt.show()
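
The two helper functions above score each candidate setting on the test set. An alternative that leaves the test set untouched is to compare settings with cross-validation on the training data only. The following is a minimal sketch of that alternative, not the approach this notebook uses: the synthetic data, the helper name `cv_accuracy_by_estimators`, and the choice of 5 folds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the scaled training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=240, n_features=3, n_informative=3, n_redundant=0, random_state=0
)


def cv_accuracy_by_estimators(X, y, candidates, folds=5):
    """Mean cross-validated accuracy for each candidate n_estimators."""
    scores = {}
    for n in candidates:
        clf = ExtraTreesClassifier(
            n_estimators=n, criterion="entropy", random_state=0
        )
        scores[n] = cross_val_score(
            clf, X, y, cv=folds, scoring="accuracy"
        ).mean()
    return scores


cv_scores = cv_accuracy_by_estimators(X_demo, y_demo, range(15, 25))
best_n = max(cv_scores, key=cv_scores.get)  # setting with the highest mean accuracy
print(best_n, round(cv_scores[best_n], 3))
```

In the notebook, `X_train` and `y_train` would be passed in place of the synthetic arrays.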

### 1. ExtraTreesClassifier

In [None]:
det_Estimators(X_train, X_test, y_train, y_test)

**Selected parameters:**
+ Leaves: 18
+ Estimators: 16

In [None]:
classifier = ExtraTreesClassifier(n_estimators = 17, criterion='entropy', random_state=0)
classifier.fit(X_train,y_train)


y_pred_ext = classifier.predict(X_test)
acs(y_test, y_pred_ext)

### 2. DecisionTreeClassifier

In [None]:
det_Leaves(X_train, X_test, y_train, y_test)

In [None]:
classifier = DecisionTreeClassifier(max_leaf_nodes = 20, criterion='entropy', random_state=0)
classifier.fit(X_train,y_train)


y_pred_dtc = classifier.predict(X_test)
acs(y_test, y_pred_dtc)

Accuracy improved for both models, but ExtraTreesClassifier wins with the highest accuracy. Next, I will move on to evaluating the models in terms of specificity, sensitivity, and so on.

[Return to TOC](#toc)

<a name="Eval"></a>


<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0' role="tab" aria-controls="home"><center>Evaluation</center></h3>

### Selected Model for Evaluation: Extra Trees Classifier

In [None]:
def model_eval(y_test, y_pred):
    '''
    Description:
    ----------
    This function plots a confusion matrix.
    
    Parameters
    ----------
    y_test : True Y values of test set
    y_pred : Predicted Y values

    Output
    ------
    Labelled confusion matrix 
    '''
    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as plt
    
    # Calculate confusion matrix
    cf_matrix = confusion_matrix(y_test, y_pred)

    # Visualize it
    group_names = ['True Neg','False Pos','False Neg','True Pos']
    group_counts = ["{0:0.0f}".format(value) for value in
                    cf_matrix.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in
                         cf_matrix.flatten()/np.sum(cf_matrix)]

    labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
              zip(group_names,group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    
    ## Set size of confusion matrix
    plt.figure(figsize = (8,5))
    
    ## Plot the heatmap
    sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Greens', cbar=False)

In [None]:
model_eval(y_test, y_pred_ext)
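
Sensitivity and specificity can be read off the four cells of the confusion matrix: sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). Here is a minimal sketch in plain Python, assuming binary labels where 1 marks a death event. The helper name `sensitivity_specificity` and the example labels are made up for illustration and are not results from this notebook.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Guard against division by zero when a class is absent
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up labels for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]
sens, spec = sensitivity_specificity(y_true_demo, y_pred_demo)
print(f"Sensitivity: {sens:.2f}, Specificity: {spec:.2f}")
```

The same function should work on the notebook's arrays, for example `sensitivity_specificity(y_test, y_pred_ext)`.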

[Return to TOC](#toc)