# Heart failure

In this notebook, we study cardiovascular diseases (CVDs) to understand their behavior. CVDs are the leading cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs. We are going to use 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as **tobacco use**, **unhealthy diet and obesity**, **physical inactivity and harmful use of alcohol** using population-wide strategies.

To do so, the notebook is organized as follows:
1. [Load package and prepare data](#load)
2. [Exploratory data analysis](#eda)
3. [Data visualization](#viz)
4. [Modelling](#mod)

    4.1 [Predict heart failure](#heart)
     
    4.2 [Find relevant informative attributes linked to heart disease.](#rel) 
     
Let's start

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
import os
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import normaltest, anderson
import scipy
from warnings import filterwarnings

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, cross_val_predict, cross_validate
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, RobustScaler 
from xgboost import XGBRFClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB 
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import RocCurveDisplay, confusion_matrix
from sklearn.metrics import plot_roc_curve, roc_auc_score, classification_report, accuracy_score, f1_score
from sklearn.metrics import recall_score, plot_confusion_matrix, precision_score, plot_precision_recall_curve, classification_report
    
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline
from sklearn.feature_selection import SequentialFeatureSelector 
import xgboost
from sklearn.decomposition import PCA

In [None]:
sns.set(style='whitegrid')
pd.set_option('display.max_colwidth', 300)
pd.set_option('display.max_rows', 10000)
filterwarnings('ignore')
pd.plotting.register_matplotlib_converters()
%matplotlib inline
print("Setup Complete")

<a id='load'> </a>

# Load and prepare data

In [None]:
cvd = pd.read_csv('/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

In [None]:
cvd.tail()

In [None]:
cvd.info()

# Attributes explanations

1. **age**: Age of the patient (years).
2. **anaemia**: Decrease of red blood cells or hemoglobin (boolean).
3. **creatinine_phosphokinase (cpk)**: Level of the CPK enzyme in the blood (mcg/L).
4. **diabetes**: If the patient has diabetes (boolean).
5. **ejection_fraction**: Percentage of blood leaving the heart at each contraction (%).
6. **high_blood_pressure**: If the patient has hypertension (boolean).
7. **platelets**: Platelets in the blood (kiloplatelets/mL).
8. **serum_creatinine**: Level of creatinine in the blood (mg/dL).
9. **serum_sodium**: Level of sodium in the blood (mEq/L).
10. **sex**: Sex of the patient (binary: 0 = woman, 1 = man).
11. **smoking**: If the patient smokes (boolean).
12. **time**: Follow-up period (days).
13. **DEATH_EVENT**: If the patient died during the follow-up period (target boolean).

<a id='eda'></a>

# Exploratory data analysis

In [None]:
no_bool_cols = ['creatinine_phosphokinase', 'platelets', 'ejection_fraction', 'serum_sodium',
        'serum_creatinine', 'time', 'age']

In [None]:
# we describe only the non-categorical (numeric) attributes
cvd[no_bool_cols].describe()

In [None]:
cvd[no_bool_cols].mode()

The most frequent values (modes) of the numeric attributes among cardiovascular patients are:
1. **Age: 60 years**.
2. **Follow-up time: 187 and 250 days** (two modes).
3. **Serum creatinine: 1 mg/dL**.
4. **Serum sodium: 136 mEq/L**.
5. **Ejection fraction: 35% of blood leaving the heart**.
6. **Platelets: 263358 kiloplatelets/mL**.
7. **CPK enzyme: 582 mcg/L**.

In [None]:
cvd[no_bool_cols].corr()

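**The correlations are weak: no pair of numeric attributes shows a strong linear relationship. This suggests, but does not prove, that the attributes are independent.**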
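
A caveat when reading a correlation matrix: the Pearson coefficient only measures linear association. The minimal sketch below uses made-up values (not the notebook's data) where `y` is fully determined by `x` and the coefficient is still essentially zero.

```python
import numpy as np

# Hypothetical toy data: y depends entirely on x, but not linearly.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson correlation between x and y (off-diagonal entry of the 2x2 matrix).
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A scatter plot or a rank-based measure is a useful complement before concluding that two attributes are unrelated.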

In [None]:
cvd['DEATH_EVENT'].value_counts().plot(kind='pie')
plt.show()

In [None]:
(cvd['DEATH_EVENT'].value_counts()/sum(cvd['DEATH_EVENT'].value_counts()))*100

**The data is imbalanced: 68% of patients have DEATH_EVENT = 0 (no) and 32% have DEATH_EVENT = 1 (yes).**
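
The same percentages can be obtained more directly with `value_counts(normalize=True)`. The sketch below uses a hypothetical stand-in Series with a 68/32 split rather than the real `cvd['DEATH_EVENT']` column.

```python
import pandas as pd

# Hypothetical stand-in for cvd['DEATH_EVENT'] with a 68/32 split.
toy_target = pd.Series([0] * 68 + [1] * 32, name="DEATH_EVENT")

# normalize=True returns class shares; multiply by 100 to get percentages.
class_pct = toy_target.value_counts(normalize=True).mul(100).round(1)
print(class_pct)
```

On the notebook's data, `cvd['DEATH_EVENT'].value_counts(normalize=True) * 100` should give the same result as the cell above.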

In [None]:
cvd['sex'].value_counts().plot(kind='pie')
plt.show()

In [None]:
(cvd['sex'].value_counts()/sum(cvd['sex'].value_counts()))*100

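**65% of patients have sex = 1 and 35% have sex = 0. In the source dataset, sex = 1 encodes men and sex = 0 encodes women, so about 65% of patients are men and 35% are women.**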

In [None]:
cvd['anaemia'].value_counts().plot(kind='pie')
plt.show()

In [None]:
(cvd['anaemia'].value_counts()/sum(cvd['anaemia'].value_counts()))*100

**Out of 100 cardiovascular patients, 57 do not have anaemia and 43 do.**

In [None]:
cvd['diabetes'].value_counts().plot(kind='pie')
plt.show()

In [None]:
(cvd['diabetes'].value_counts()/sum(cvd['diabetes'].value_counts()))*100

**Out of 100 cardiovascular patients, 42 have diabetes (yes) and 58 do not (no).**

In [None]:
cvd['high_blood_pressure'].value_counts().plot(kind='pie')
plt.show()

In [None]:
(cvd['high_blood_pressure'].value_counts()/sum(cvd['high_blood_pressure'].value_counts()))*100

**65% of cardiovascular patients do not have hypertension and 35% do.**

In [None]:
cvd['smoking'].value_counts().plot(kind='pie')
plt.show()

In [None]:
(cvd['smoking'].value_counts()/sum(cvd['smoking'].value_counts()))*100

**68% of patients do not smoke and 32% do.**

In [None]:
cvd[no_bool_cols].skew()

In [None]:
cvd[no_bool_cols].kurtosis()

**Only time and age have positive kurtosis (intliers)**.
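
For reference when reading these numbers: pandas reports *excess* kurtosis, so a normal distribution scores about 0, heavier tails give positive values and lighter tails give negative values. The sketch below illustrates this on synthetic samples (assumed distributions, not the notebook's data).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic samples with known shapes (illustrative only).
samples = pd.DataFrame({
    "normal": rng.normal(size=100_000),            # reference shape
    "uniform": rng.uniform(size=100_000),          # light tails
    "exponential": rng.exponential(size=100_000),  # heavy right tail
})

excess_kurtosis = samples.kurtosis()  # about 0 for the normal sample
skewness = samples.skew()             # about 0 for symmetric samples
print(pd.DataFrame({"skew": skewness, "excess_kurtosis": excess_kurtosis}))
```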

In [None]:
cvd[no_bool_cols].plot(kind='box', subplots=True, layout=(3,3), figsize=(15,20), title='Numeric attributes')
plt.show()

<a id='viz'> </a>

# Visualization

Now, we create helper functions for multiple bar plots, box plots and histograms, and study each numeric attribute against all categorical attributes.

In [None]:
def  multibarplot(column=None):
    figu = plt.figure(figsize=(20,15))
    figu.subplots_adjust(wspace=0.2, hspace=0.2)
    cat = ['high_blood_pressure', 'diabetes', 'sex', 'anaemia','smoking']
    for i, u in enumerate(cat):
        ax = figu.add_subplot(2,3,i+1)
        sns.barplot(hue=cat[i], y=column, data=cvd, x="DEATH_EVENT", ax=ax)
        ax.set_title(f'Cardiovasc. disease:  {column}-death_event/{cat[i]}.')
    plt.show()
    
def  multiboxplot(column=None):
    figu = plt.figure(figsize=(20,15))
    figu.subplots_adjust(wspace=0.2, hspace=0.2)
    cat = ['high_blood_pressure', 'diabetes', 'sex', 'anaemia','smoking']
    for i, u in enumerate(cat):
        ax = figu.add_subplot(2,3,i+1)
        sns.boxplot(hue=cat[i], y=column, data=cvd, x="DEATH_EVENT", ax=ax)
        ax.set_title(f'Cardiovasc. disease:  {column}-death_event/{cat[i]}.')
    plt.show()
    
def multi2Dhistogram(column=None):
    fig= plt.figure(figsize=(15,10))
    fig.subplots_adjust(wspace=0.2, hspace=0.2)
    cols = ['creatinine_phosphokinase', 'platelets', 'ejection_fraction', 'serum_sodium',
        'serum_creatinine', 'time', 'age']
    cols = list(set(cols) - set([column]))
    for i, u in enumerate(cols):
        ax = fig.add_subplot(2,3, i+1)
        sns.histplot(x=column, y=u, data=cvd, hue='DEATH_EVENT', ax=ax, bins=10, stat='density')
        ax.set_title(f'Histogram: {column}-{u}.')
    plt.show()
    
def histogram_attribute():
    fig= plt.figure(figsize=(20,20))
    fig.subplots_adjust(wspace=0.2, hspace=0.2)
    cols = ['creatinine_phosphokinase', 'platelets', 'ejection_fraction', 'serum_sodium',
        'serum_creatinine', 'time', 'age']
    
    for i, u in enumerate(cols):
        ax = fig.add_subplot(3,3, i+1)
        sns.histplot(x=u, data=cvd, hue='DEATH_EVENT', ax=ax, bins=10, kde=True)
        ax.set_title(f'Histogram: {u}.')
    plt.show()
    
def countplot(cols=None):
    fig= plt.figure(figsize=(20,20))
    fig.subplots_adjust(wspace=0.2, hspace=0.2)
    for i, u in enumerate(cols):
        ax = fig.add_subplot(2,3, i+1)
        sns.countplot(x=u, data=cvd, hue='DEATH_EVENT', ax=ax)
        ax.set_title(f'Countplot: {u}.')
    plt.show()
    
def barplot_attribute():
    fig= plt.figure(figsize=(20,20))
    fig.subplots_adjust(wspace=0.2, hspace=0.2)
    cols = ['creatinine_phosphokinase', 'platelets', 'ejection_fraction', 'serum_sodium',
        'serum_creatinine', 'time', 'age']
    
    for i, u in enumerate(cols):
        ax = fig.add_subplot(3,3, i+1)
        sns.barplot(y=u, data=cvd, x='DEATH_EVENT', ax=ax)
        ax.set_title(f'Barplot: {u}.')
    plt.show()

## Histogram for all numeric attributes.

In [None]:
histogram_attribute()

**Some attributes at DEATH_EVENT=1 seem to come from a Poisson or a normal process. We will check this later.**

In [None]:
barplot_attribute()

### Age attribute.

In [None]:
#boxplot
plt.figure(figsize=(8,5))
sns.boxplot(x="DEATH_EVENT", y="age", data=cvd)
plt.title('Cardiovascular disease: Age-Death_Event.')
plt.show()

**Insights**

1. **The median age of patients with DEATH_EVENT = 0 is 60 years**.
2. **The median age of patients with DEATH_EVENT = 1 is 65 years**.

In [None]:
multiboxplot('age')

**Insights**

We denote by **AgeMed** the median age of cardiovascular patients.

1. **AgeMed**(DEATH_EVENT = 0 | {smoking=0 or smoking=1}, {sex=0 or sex=1}, {diabetes=0 or diabetes=1}, {anaemia=0 or anaemia = 1}, {high_blood_pressure = 0 or high_blood_pressure = 1}) **is less than or equal to 60 years**.
2. **AgeMed**(DEATH_EVENT = 1 | {smoking=0 or smoking=1}, {sex=0 or sex=1}, {diabetes=0 or diabetes=1}, {anaemia=0 or anaemia = 1}, {high_blood_pressure = 0 or high_blood_pressure = 1}) **is greater than or equal to 60 years**.

Question: What is the probability that a cardiovascular patient over 60 years old will die? (answer after).
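
One simple way to estimate such a probability is the empirical conditional frequency: the share of deaths among patients older than 60. The sketch below runs on a made-up table (the ages and outcomes are hypothetical); on the real data, `toy` would be replaced by `cvd`.

```python
import pandas as pd

# Hypothetical rows, only to illustrate the computation.
toy = pd.DataFrame({
    "age": [45, 50, 62, 65, 70, 75, 58, 80],
    "DEATH_EVENT": [0, 0, 1, 0, 1, 1, 0, 1],
})

# P(DEATH_EVENT = 1 | age > 60) estimated as the mean of the 0/1 target
# restricted to patients older than 60.
over_60 = toy[toy["age"] > 60]
p_death_over_60 = over_60["DEATH_EVENT"].mean()
print(f"P(death | age > 60) = {p_death_over_60:.2f}")
```

This is a raw frequency, so on a small sample it comes with wide uncertainty.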

#### Age estimation. 

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(y="age", data=cvd, x="DEATH_EVENT")
plt.title('Cardiovascular disease: Age-Death_Event.')
plt.show()

**This plot shows the estimated mean age with its 95% confidence interval for DEATH_EVENT = 0 and DEATH_EVENT = 1. The central tendency (estimated mean) is below 60 years for DEATH_EVENT = 0 and above 60 years for DEATH_EVENT = 1.**
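
By default, seaborn's `barplot` draws the mean of each group with a bootstrap confidence interval. A minimal sketch of that idea, assuming a simple percentile bootstrap and using hypothetical ages (the helper `bootstrap_mean_ci` is ours, not a seaborn function):

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement: one row of indices per bootstrap replicate.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    low, high = np.quantile(boot_means, [alpha, 1.0 - alpha])
    return values.mean(), low, high

# Hypothetical ages, not taken from the dataset.
toy_ages = [50, 55, 60, 62, 65, 70, 72, 80, 58, 63]
mean_age, ci_low, ci_high = bootstrap_mean_ci(toy_ages)
print(f"mean = {mean_age:.1f}, 95% CI = [{ci_low:.1f}, {ci_high:.1f}]")
```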

In [None]:
multibarplot('age')

**We see the same behavior across the different categorical attributes.**

### Time attribute

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x="DEATH_EVENT", y="time", data=cvd)
plt.title('Cardiovascular disease: Time-Death_Event.')
plt.show()

**Only two patients with DEATH_EVENT = 1 have a follow-up time > 200 days.**

In [None]:
multiboxplot(column='time')

#### Time estimation.

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(y="time", data=cvd, x="DEATH_EVENT")
plt.title('Cardiovascular disease: Time-Death_Event.')
plt.show()

**For patients with DEATH_EVENT = 1, the central tendency (average follow-up time) is less than or equal to 75 days.**

In [None]:
multibarplot('time')

### Other visualization

In [None]:
bool_cols = list(set(cvd.columns) - set(no_bool_cols))

In [None]:
countplot(bool_cols)

**Notation**

nb_dp: number of deceased patients

1. nb_dp(DEATH_EVENT == 1 | anaemia == 0) > nb_dp(DEATH_EVENT == 1 | anaemia == 1) 
2. nb_dp(DEATH_EVENT == 1 | high_blood_pressure == 0) > nb_dp(DEATH_EVENT == 1 | high_blood_pressure == 1)
3. nb_dp(DEATH_EVENT == 1 | diabetes == 0) > nb_dp(DEATH_EVENT == 1 | diabetes == 1) 
4. nb_dp(DEATH_EVENT == 1 | smoking == 0) > nb_dp(DEATH_EVENT == 1 | smoking == 1) 


### Visualization of deceased patients.

In [None]:
# copy the slice so that later in-place edits do not act on a view of cvd
death = cvd[cvd['DEATH_EVENT'] == 1].copy()

In [None]:
death.drop(columns=['DEATH_EVENT'], inplace=True)

In [None]:
death[no_bool_cols].describe()

In [None]:
plt.figure(figsize=(12,12))
sns.heatmap(death.corr(), annot=True)
plt.title('Correlation heatmap.')
plt.show()

In [None]:
def hist_death_patient(value=None, cols=no_bool_cols, data=death):
    fig= plt.figure(figsize=(20,15))
    fig.subplots_adjust(wspace=0.2, hspace=0.3)
    for i, u in enumerate(cols):
        ax = fig.add_subplot(2, 4, i+1)
        sns.histplot(x=u, data=data, hue=value, ax=ax, stat='probability', kde=True, cumulative=True)
        ax.set_title(f'Death histogram: {value}|{u}.')
    plt.show()

In [None]:
for c in range(len(bool_cols)):
    if bool_cols[c] == 'DEATH_EVENT':
        continue
    print(bool_cols[c])
    hist_death_patient(value=bool_cols[c])

**Note**

The cumulative distribution function of a real-valued random variable X is the function given by
> $F_X(x) = P(X \leq x)$
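
On a finite sample, this function is estimated by the empirical CDF, the fraction of observations less than or equal to `x`. A minimal sketch with hypothetical follow-up times (the helper name `ecdf` is ours):

```python
import numpy as np

def ecdf(sample, x):
    """Empirical CDF: share of observations less than or equal to x."""
    sample = np.asarray(sample, dtype=float)
    return float(np.mean(sample <= x))

# Hypothetical follow-up times in days.
toy_days = [10, 30, 30, 75, 120]
share_within_30_days = ecdf(toy_days, 30)
print(share_within_30_days)
```

This is what the cumulative histograms with `stat='probability'` approximate, bin by bin.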

## Which numeric attributes seem to come from a Poisson or a normal process among deceased cardiovascular patients?

In this part, we run hypothesis tests to find out which attributes come from a Poisson or a normal process among deceased cardiovascular patients.

In [None]:
def NormalTest(cols=no_bool_cols, data=death):
    print('In the Death cardiovascular population.\n H0: x come from a normal distribution.\n')
    norma_attr = []
    for u in cols:
        k2, p = normaltest(data[u])  
        if p < 0.001:
            print(f'For {u}: The null hypothesis can be rejected. pvalue={p}')
        else:
            print(f'For {u}: The null hypothesis cannot be rejected. pvalue={p}')
            norma_attr.append(u)
            
    print(f'\n The attributes that come from normal process are: {norma_attr}.')
    return norma_attr

In [None]:
normal_cols = NormalTest()
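
To get a feel for how `scipy.stats.normaltest` behaves, the sketch below applies it to two synthetic samples (assumed distributions, not the notebook's data): one drawn from a normal distribution and one from a strongly skewed exponential distribution.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(42)

# Synthetic samples: one Gaussian, one clearly non-Gaussian.
gaussian_sample = rng.normal(loc=60, scale=10, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

# H0: the sample comes from a normal distribution.
_, p_gaussian = normaltest(gaussian_sample)
_, p_skewed = normaltest(skewed_sample)

print(f"gaussian sample: p = {p_gaussian:.3g}")
print(f"skewed sample:   p = {p_skewed:.3g}")
```

Failing to reject H0 does not prove that an attribute is normal; it only means the sample is compatible with normality at the chosen threshold.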

### Let's study platelets, ejection_fraction and age.

Among these three normally distributed attributes, which ones are linked to death from heart disease?

In [None]:
def plot_curve(data=None, label=None):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    ax1.plot(sorted(data)[::-1], 'o')
    ax1.set_xlabel('Patient')
    ax1.set_ylabel(label)
    ax1.set_title(label)
    sns.histplot(data, stat='probability', kde=True, ax=ax2)
    ax2.set_xlabel(label)
    ax2.set_ylabel('Probability')
    ax2.set_title('Probability distribution function.')

In [None]:
def multiscatterplot(column=None, cols=None, data=None, label=None):
    fig= plt.figure(figsize=(15,10))
    fig.subplots_adjust(wspace=0.2, hspace=0.2)
    cols = list(set(cols) - set([column]))
    # fall back to the full dataset when no data is passed
    data = cvd if data is None else data
    for i, u in enumerate(cols):
        ax = fig.add_subplot(2,3, i+1)
        # draw on this subplot's axes instead of the current global axes
        sns.scatterplot(x=column, y=u, data=data, hue=label, ax=ax)
        ax.set_title(f'scatter: {column}-{u}.')
    plt.show()

#### Platelets (kiloplatelets/mL)

In [None]:
death['platelets_less_255300'] = death['platelets'].apply(lambda x: 'yes' if x < 255300 else 'no')

In [None]:
death['platelets_less_255300'].value_counts().plot(kind='pie')
plt.show()

In [None]:
(death['platelets_less_255300'].value_counts()/sum(death['platelets_less_255300'].value_counts()))*100

1. **50% of deceased patients have a platelet count of at least 255300 kiloplatelets/mL.**
2. **50% of deceased patients have a platelet count below 255300 kiloplatelets/mL.**

Since deceased patients are split evenly around this value, the platelet count alone does not appear to be associated with fatality.

In [None]:
plot_curve(data=death['platelets'], label='platelets')

#### Ejection_fraction(%)

**EF** is a measurement, expressed as a percentage, of how much blood the left ventricle pumps out with each contraction.

**How much blood is pumped out?**
1. **Normal Ejection Fraction = 50-70%** (comfortable during activity).
2. **Borderline Ejection Fraction = 41-49%** (symptoms may become noticeable during activity).
3. **Reduced Ejection Fraction <= 40%** (symptoms may become noticeable even during rest).
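
These ranges can also be applied with `pd.cut`. The sketch below is one possible binning, with two assumptions of ours: values above 70% are grouped with "normal", and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical ejection fractions (%), chosen to hit each boundary.
toy_ef = pd.Series([20, 40, 41, 49, 50, 70])

# Right-inclusive bins: (0, 40] reduced, (40, 49] borderline, (49, 100] normal.
ef_class = pd.cut(
    toy_ef,
    bins=[0, 40, 49, 100],
    labels=["reduced", "borderline", "normal"],
)
print(ef_class.tolist())
```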

In [None]:
def condition(x):
    # thresholds follow the EF ranges listed above
    if x <= 40:
        return 'reduced'
    elif x <= 49:
        return 'borderline'
    else:
        # 50% and above (values above 70% are not a reduced EF)
        return 'normal'
death['EF_decision'] = death['ejection_fraction'].apply(condition)

In [None]:
death['EF_decision'].value_counts().plot(kind='pie')
plt.show()

In [None]:
(death['EF_decision'].value_counts()/sum(death['EF_decision'].value_counts()))*100

1. **80.21% of deceased cardiovascular patients had a reduced ejection fraction (little blood pumped out by the heart).**
2. **14.58% of deceased cardiovascular patients had a normal ejection fraction (other cause).**
3. **5.21% of deceased cardiovascular patients had a borderline ejection fraction (other cause).**

Below, we look into case 2 (deaths with a normal ejection fraction).

In [None]:
death_patient_with_normal_EF = death[death['EF_decision'] == 'normal']

In [None]:
pol = []
for u in bool_cols:
    if u == 'DEATH_EVENT':
        continue
    # share (%) of each value of the attribute among deaths with a normal EF
    v = death_patient_with_normal_EF[u].value_counts(normalize=True) * 100
    pol.append(v.rename(u))

In [None]:
pd.DataFrame(pol)

The two attributes most common among patients who die with a normal EF are **diabetes** and **anaemia**.

In [None]:
plot_curve(data=death['ejection_fraction'], label='ejection_fraction')

#### Age (year)

In [None]:
plot_curve(data=death['age'], label='age')

### Let's study other numeric attributes.  

In [None]:
num_cols = list(set(no_bool_cols) - set(normal_cols))
num_cols

### Time (days)

In [None]:
plot_curve(data=death['time'], label='time')

In [None]:
followUP_less75D = death[death['time'] < 75]

In [None]:
def timeLess75D_histogram(r):
    fig= plt.figure(figsize=(20,10))
    fig.subplots_adjust(wspace=0.2, hspace=0.3)
    cols = ['creatinine_phosphokinase', 'platelets', 'ejection_fraction', 'serum_sodium',
        'serum_creatinine', 'age']
    
    for i, u in enumerate(cols):
        ax = fig.add_subplot(2, 3, i+1)
        sns.histplot(x=u, data=followUP_less75D, hue=r, ax=ax, bins=10, kde=True)
        ax.set_title(f'Time < 75 days: {u}.')
    plt.show()

In [None]:
print(f'Deceased cardiovascular patients with time < 75 days = {len(followUP_less75D)}/{len(death)}.\n')
for  k in bool_cols:
    if k == 'DEATH_EVENT':
        continue
        
    print(f'Time < 75 days: {k}.')
        
    timeLess75D_histogram(k)
    print()

**Insight**

In the sex group:
> **age < 70 years**: 12 men died versus 9 women.

> **age >= 70 years**: 6 men died versus 16 women.

> **12 men have ejection_fraction < 40% versus 24 women** under the same condition.

> **10 men versus 19 women have serum_sodium <= 135 mEq/L**. Under this condition, patients have low blood sodium (hyponatremia).

**Conclusion.** Among patients who died within 75 days, these raw counts suggest that women are more affected by the disease than men.

<a id='mod'> </a>

# Modelling

<a id='heart'> </a>

## Predict heart failure

In this section, we will:
1. split the data into train and test sets
2. create a `tools` class with several functions (scaling the data with RobustScaler, SMOTE+ENN)
3. find the best learner
4. select informative attributes for the best learner
5. run a grid search to find the best hyperparameters

Okay, let's go

In [None]:
class tools:
    """
    This class contains all function for classification where target are unbalanced.
    """
    
    def __init__(self, xtrain=None, ytrain=None):
        self.xtrain = xtrain # train data
        self.ytrain = ytrain # train target data
        
        # list of different learner for classification
        self.clas_model = {'KNeighborsClassifier': KNeighborsClassifier(),
                'RandomForestClassifier': RandomForestClassifier(random_state=42),
                'GradientBoostingClassifier': GradientBoostingClassifier(random_state=42),
                'XGBoostClassifier': XGBRFClassifier(random_state=42, eval_metric='logloss'),
                'AdaboostClassifier': AdaBoostClassifier(random_state=42),
                'ExtraTreesClassifier': ExtraTreesClassifier(random_state=42),
                'MLPClassifier':MLPClassifier(random_state=42),
                'LogisticRegression': LogisticRegression(random_state=42),
                'RidgeClassifier': RidgeClassifier(random_state=42),
                'SVC': SVC(random_state=42),
                'LinearSVC': LinearSVC(random_state=42),
                'DecisionTree': DecisionTreeClassifier(random_state=42),
                'GaussianNB': GaussianNB()}
        
    def classification_learner_selection(self):
        """
        Evaluate every learner in self.clas_model with 10-fold cross-validation.

        Each learner is wrapped in a pipeline that applies SMOTEENN resampling
        when fitting on the training folds. Returns a dict mapping the learner
        name to its mean ROC AUC ('cvs_ROC_AUC') and its classification
        report ('report').
        """
        result = {}

        # evaluate each classification model in turn
        for name, model in self.clas_model.items():
            pipe = Pipeline([('smoteenn', SMOTEENN(random_state=42)),
                             (name, model)])

            cvs = cross_validate(pipe, self.xtrain, self.ytrain, cv=10, scoring='roc_auc',
                                 return_train_score=True, n_jobs=-1)

            # out-of-fold predictions used for the classification report
            ypred = cross_val_predict(pipe, self.xtrain, self.ytrain, cv=10)
            report = classification_report(self.ytrain, ypred)

            cvs_mean = cvs['test_score'].mean()  # mean of the CV scores
            cvs_std = cvs['test_score'].std()    # std of the CV scores

            result[name] = {'cvs_ROC_AUC': cvs_mean, 'report': report}

            print(f'{name} model done; score mean +/- std. dev: {round(cvs_mean, 3)} +/- {round(cvs_std, 3)} !!!')

        return result

In [None]:
target  = cvd['DEATH_EVENT']
data = cvd.drop(columns=['time', 'DEATH_EVENT'])

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(data, target, stratify=target,
                                                random_state=42,
                                                test_size=0.2)

In [None]:
#scaling
scaler = RobustScaler()

In [None]:
xtrain_scaled = scaler.fit_transform(xtrain)

In [None]:
xtest_scaled = scaler.transform(xtest)
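
With its default settings, `RobustScaler` centers each column on its median and divides by its interquartile range, which limits the influence of outliers. A minimal NumPy sketch of that formula on a hypothetical column (the helper `robust_scale` is ours):

```python
import numpy as np

def robust_scale(column):
    """Scale one column as (x - median) / IQR."""
    column = np.asarray(column, dtype=float)
    median = np.median(column)
    q1, q3 = np.percentile(column, [25, 75])
    return (column - median) / (q3 - q1)

# Hypothetical column with one extreme value.
toy_column = [1, 2, 3, 4, 100]
scaled = robust_scale(toy_column)
print(scaled)
```

The extreme value stays extreme after scaling, but it does not distort the scale of the other points as it would with a min-max scaling.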

### Find best learner

In [None]:
toolModel = tools(xtrain_scaled, ytrain)

In [None]:
res = toolModel.classification_learner_selection()

In [None]:
pd.DataFrame(res).iloc[0,:].sort_values(ascending=False)

In [None]:
dcol = pd.DataFrame(res).columns
for i in range(len(dcol)):
    print(dcol[i])
    print(pd.DataFrame(res).iloc[1, i])
    print()

The best learners we select are:
1. XGBoostClassifier (here `XGBRFClassifier`, XGBoost's random forest variant)
2. GradientBoosting


With these two best learners, we can try some combinations to see whether the model performs better:
1. model1: XGBoostClassifier + GradientBoost
2. model2: GradientBoost + XGBoostClassifier

**The best learner for this problem is XGBoost**.

### Find best hyperparameter

In [None]:
#pipe = Pipeline([('smoteenn', SMOTEENN(random_state=42)),
 #                           ('xgboost', XGBRFClassifier(random_state=42, eval_metric='logloss'))])

In [None]:
#param_grid = {'xgboost__learning_rate':[1, 0.1, 0.01], 'xgboost__max_depth': [3, 4, 5], 'xgboost__n_estimators':[100, 1000],
 #        'xgboost__gamma':[0.5, 1.0, 1.5], 'xgboost__subsample':[0.6, 0.8, 1], 'xgboost__colsample_bytree':[0.6, 0.8, 1]}

In [None]:
#grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='roc_auc', n_jobs=-1, verbose=1)

In [None]:
#grid_search.fit(xtrain_scaled, ytrain)

In [None]:
#print(f"Best parameters: {grid_search.best_params_}")
#print(f"Best cross-validation score: {grid_search.best_score_:.2f}")

In [None]:
#print(f'Test set score: {grid_search.score(xtest_scaled, ytest)}.')

## Prediction and model evaluation

In [None]:
learner = XGBRFClassifier(random_state=42, gamma=1.5, colsample_bytree=0.8, max_depth=3,
                          n_estimators=1000, subsample=0.6, eval_metric='logloss')

model = Pipeline([('smoteenn', SMOTEENN(random_state=42)),
                  ('model', learner)])

In [None]:
model.fit(xtrain_scaled, ytrain)

In [None]:
ypred = model.predict(xtest_scaled)

In [None]:
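# ROC AUC is computed from the predicted probability of the positive class rather than from hard labels
yproba = model.predict_proba(xtest_scaled)[:, 1]
print(f'ROC_AUC score: {roc_auc_score(ytest, yproba)}.')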
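
ROC AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one (ties count for half). The sketch below implements this pairwise definition on hypothetical scores and shows that thresholding the scores into hard labels first can change the result (the helper `pairwise_auc` is ours).

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.7, 0.3, 0.6, 0.9]
hard_labels = (np.asarray(scores) >= 0.5).astype(int)

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print(auc_from_scores, auc_from_labels)
```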

In [None]:
print(classification_report(ytest, ypred))

In [None]:
# the model was fit on scaled features, so the scaled test set is used here
plot_roc_curve(model, xtest_scaled, ytest)
plt.show()

In [None]:
plot_confusion_matrix(model, xtest_scaled, ytest)
plt.grid(False)
plt.show()

In [None]:
print(f'f1 score {f1_score(ytest, ypred)}')
print(f'precision score: {precision_score(ytest, ypred)}')
print(f'recall score: {recall_score(ytest, ypred)}')
print(f'accuracy score: {accuracy_score(ytest, ypred)}')
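
As a reminder of what these four scores measure, the sketch below recomputes them from the confusion-matrix counts on hypothetical labels and predictions (not the model's actual output).

```python
import numpy as np

# Hypothetical true labels and predictions.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # deaths correctly predicted
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # survivors predicted as deaths
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # deaths that were missed
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # survivors correctly predicted

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)
print(precision, recall, f1, accuracy)
```

With an imbalanced target, recall on the positive class is usually the score to watch, since a missed death is the costly error.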

<a id='rel'> </a>

## Find relevant informative attributes linked to heart disease.

Don't forget that we only use data from patients with DEATH_EVENT = 1, and that each dot is a patient.

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
patient = death.drop(columns=['platelets_less_255300','EF_decision', 'time'])

In [None]:
patient.tail()

In [None]:
search_info = Pipeline([('scaler', RobustScaler()), ('pca', PCA(n_components=0.95))])

In [None]:
patient_pca = search_info.fit_transform(patient)

In [None]:
pca = search_info['pca']

In [None]:
pca.components_.shape
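
Passing `n_components=0.95` asks PCA to keep just enough components to explain about 95% of the variance. The sketch below reproduces that idea with a plain SVD on synthetic data; the data, the names and the exact cut-off rule are assumptions for illustration and may differ in detail from scikit-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 3 informative directions mixed into 6 columns, plus small noise.
latent = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 1.0])
mixing = rng.normal(size=(3, 6))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 6))

# PCA through the SVD of the centered data.
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)

explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative ratio reaches 0.95.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
loadings = components[:n_keep]  # one row per component, one column per feature
print(n_keep, loadings.shape)
```

Each row of `loadings` plays the role of a row of `pca.components_`: large absolute values show which original attributes drive that component.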

In [None]:
# name the rows after the number of components actually kept by PCA
df = pd.DataFrame(pca.components_, columns=patient.columns, index=[f'PC{i}' for i in range(pca.n_components_)])

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(df, annot=True, center=0, fmt='0.3g', cmap='viridis')
plt.show()

**Insight**

1. PC0 is dominated by creatine kinase (the `creatinine_phosphokinase` column): a higher PC0 score means a higher CK level in the patient.
2. PC1 is dominated by serum_creatinine: a higher PC1 score means a higher serum creatinine level.
3. PC2 loads strongly on ejection_fraction and serum_sodium: if PC2 decreases, EF and serum sodium decrease as well.
4. PC3 loads strongly on EF (positive) and platelets (negative): it contrasts EF with platelets (if EF increases, platelets decrease, and vice versa).
5. PC4 loads strongly on platelets and serum_sodium with opposite signs: if platelets increase, serum_sodium decreases.
6. PC5 is associated with age: if PC5 increases, age increases as well.
7. PC6 is associated with anaemia: a higher PC6 score goes with anaemia.

We can say that the important attributes in this data are:

1. creatine kinase
2. serum_creatinine
3. EF
4. serum_sodium
5. platelets
6. age
7. anaemia

where:

1. PC0 --> creatine kinase
2. PC1 --> serum creatinine
3. PC5 --> age
4. PC6 --> anaemia
5. PC2 --> decrease of EF and serum_sodium
6. PC3 --> opposition between EF and platelets
7. PC4 --> opposition between platelets and serum_sodium

In [None]:
# one column per component kept by PCA
patient_pc = pd.DataFrame(patient_pca, columns=[f'PC{i}' for i in range(patient_pca.shape[1])])
patient_pc.shape

In [None]:
patient_pc.tail()

In [None]:
#create function
def visualize_decomposition(comp1=None, comp2=None, data=patient_pc):
    
    plt.figure(figsize=(15,5))
    sns.scatterplot(x=comp1, y=comp2, data=data)
    plt.xlabel(comp1)
    plt.ylabel(comp2)
    plt.title(f'Visualization {comp1} and {comp2}.')
    plt.show()

In [None]:
visualize_decomposition(comp1='PC0', comp2='PC1')

**What are creatine kinase and serum creatinine?**

**Creatine kinase** (CK, the `creatinine_phosphokinase` column) is an enzyme found in the heart, brain, skeletal muscle, and other tissues. Increased amounts of CK are released into the blood when there is muscle damage, so a higher serum CK level can indicate muscle damage.

**Serum creatinine** (SC) is a measure of how well your kidneys are performing their job of filtering waste from your blood. Higher serum creatinine levels in the blood indicate that the kidneys are not functioning properly.

Now, we create three sets A, B and C:

> A: **the set of deceased patients with a positive score for PC0, i.e. high creatine kinase, and a negative score for PC1, i.e. low SC (PC0 > 0 and PC1 < 0)**.

> B: **the set of deceased patients with a positive score for PC1, i.e. high serum creatinine, and a negative score for PC0, i.e. low CK (PC0 < 0 and PC1 > 0)**.

> C: **the set of deceased patients with negative scores for both PC0 and PC1, i.e. low CK and low serum creatinine (PC0 < 0 and PC1 < 0)**.

Only two deceased patients have both high CK and high serum creatinine; they fall outside A, B and C and make up the rest of the 100%.
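
The split by sign of the two scores can be summarised in a single table. The sketch below uses hypothetical PC scores (not the notebook's `patient_pc`) and `pd.crosstab` to get the share of patients in each of the four quadrants.

```python
import numpy as np
import pandas as pd

# Hypothetical PC scores, for illustration only.
toy_scores = pd.DataFrame({
    "PC0": [1.2, -0.5, -0.3, 2.0, -1.1, 0.4],
    "PC1": [-0.7, 0.9, -0.2, 1.5, -0.6, -0.1],
})

# Label each patient by the sign of each score.
pc0_side = np.where(toy_scores["PC0"] > 0, "PC0>0", "PC0<=0")
pc1_side = np.where(toy_scores["PC1"] > 0, "PC1>0", "PC1<=0")

# Share (%) of patients in each quadrant; the four cells sum to 100.
quadrant_pct = pd.crosstab(pc0_side, pc1_side, normalize=True) * 100
print(quadrant_pct.round(1))
```

In this layout, set A corresponds to the cell (`PC0>0`, `PC1<=0`), B to (`PC0<=0`, `PC1>0`) and C to (`PC0<=0`, `PC1<=0`).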

In [None]:
A_set = patient_pc[(patient_pc['PC0']>0) &  (patient_pc['PC1']<0)]
B_set = patient_pc[(patient_pc['PC0']<0) &  (patient_pc['PC1']>0)]
C_set = patient_pc[(patient_pc['PC0']<0) &  (patient_pc['PC1']<0)]

In [None]:
print(f'P(A_set) = {(len(A_set)/len(patient_pc))*100}%\nP(B_set) = {(len(B_set)/len(patient_pc))*100}%',
     f'\nP(C_set) = {(len(C_set)/len(patient_pc))*100}%')

**Conclusion**:
The two attributes that stand out most among deceased patients are
1. **Creatine kinase**
2. **serum_creatinine**

**FEEL FREE TO SHARE, DOWNLOAD AND UPVOTE. THANKS FOR READING!**