**HEART FAILURE PREDICTION USING XGBOOST**

This notebook includes the following techniques for predicting heart failure:

- Outlier detection and removal using Z-SCORE
- Feature Scaling using MINMAX
- Data Resampling using ADASYN
- Feature Engineering
- Hyperparameter tuning
- Ensemble XGBoost Model

In [None]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from xgboost import XGBClassifier
from sklearn.metrics import f1_score, classification_report

In [None]:
dataset = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
dataset

In [None]:
dataset.describe()

The description of the dataset shows the basic statistics and percentiles of each numeric column. Outliers can be spotted by comparing each column's max value with its 75th percentile, mean, and standard deviation. If the max value lies many standard deviations above the 75th percentile, the column very likely contains outliers, which is the case here.

**OUTLIER DETECTION AND REMOVAL**

**Z SCORE** 

The Z-score can be used to identify and remove outliers in a dataset. It indicates how many standard deviations a data point lies from the mean. The Z-score of a feature value is:

  Z = (x - μ) / σ

where x is the value, μ is the mean of the feature, and σ is its standard deviation. If the absolute Z-score of a data point is more than 3, the point is far from the rest of the data. Such a point is treated as an outlier and removed.

In [None]:
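# Absolute Z-score of every column (the binary columns are included)
z = np.abs(stats.zscore(dataset))
# Keep only the rows where every column is within 3 standard deviations
dataset = dataset[(z < 3).all(axis=1)]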
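
As a minimal sketch of the same rule on a single column, the toy example below computes the Z-score by hand with NumPy. The values are made up for illustration and do not come from the dataset. Note that a sample needs at least 11 points before any value can reach an absolute Z-score above 3 with the population standard deviation.

```python
import numpy as np

# Toy column (hypothetical values): 19 typical readings and one extreme one
values = np.array([0.9, 1.0, 1.1, 1.2, 0.8, 1.0, 1.3, 0.9, 1.1, 1.0,
                   1.2, 0.9, 1.0, 1.1, 0.8, 1.0, 1.2, 0.9, 1.1, 9.0])

# Z = (x - mean) / std, using the population std (ddof=0),
# which is also the default of scipy.stats.zscore
z = (values - values.mean()) / values.std()

# Keep only the points whose absolute Z-score is below 3
kept = values[np.abs(z) < 3]
removed = values[np.abs(z) >= 3]

print("removed:", removed)
print("rows kept:", kept.size)
```

Only the extreme reading is dropped; the typical readings all sit well within one standard deviation of the mean.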

In [None]:
dataset.isnull().sum()

Checking for missing data reveals that no values are missing in this dataset, a good thing!

**FEATURE SCALING**

**MIN MAX SCALING**

A normalization technique is needed to bring the continuous features onto a common range. Here MinMaxScaler is used, which rescales each feature to the range [0, 1] by default. Learn more about it here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

In [None]:
# Continuous features to rescale; the binary features are left unchanged
continuous_cols = ['age', 'creatinine_phosphokinase', 'ejection_fraction',
                   'platelets', 'serum_creatinine', 'serum_sodium', 'time']
minmax = MinMaxScaler()
dataset[continuous_cols] = minmax.fit_transform(dataset[continuous_cols])
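
For intuition, min-max scaling to the default [0, 1] range can be written by hand as x' = (x - min) / (max - min). The sketch below applies it to a made-up column with NumPy; the `ages` values are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

# Toy feature column (hypothetical ages, not taken from the dataset)
ages = np.array([40.0, 55.0, 60.0, 75.0, 90.0])

# x' = (x - min) / (max - min) maps the column onto [0, 1]
scaled = (ages - ages.min()) / (ages.max() - ages.min())

print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything else keeps its relative position in between.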

In [None]:
y = dataset['DEATH_EVENT']
x = dataset.drop('DEATH_EVENT', axis=1)
y.value_counts()

Inspecting the class counts shows that the dataset is imbalanced.

**DATA RESAMPLING**

**ADASYN**

This is an oversampling technique for the minority class that addresses the class imbalance issue. It is similar to SMOTE, but it generates a different number of synthetic samples for each minority point, depending on an estimate of the local distribution of the class to be oversampled. Read the related paper to learn more: https://www.researchgate.net/publication/224330873_ADASYN_Adaptive_Synthetic_Sampling_Approach_for_Imbalanced_Learning


In [None]:
resample = ADASYN(sampling_strategy='all', random_state=42)
x_resample, y_resample = resample.fit_resample(x,y)

y_resample.value_counts()
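
To illustrate the "different number of samples" idea, here is a rough NumPy sketch of the allocation step as described in the ADASYN paper: each minority point gets a share of the synthetic samples proportional to the fraction of majority points among its k nearest neighbours. The function name `adasyn_allocation` and the toy data are hypothetical, and the imbalanced-learn implementation may differ in its details.

```python
import numpy as np

def adasyn_allocation(X, y, minority_label=1, k=3):
    """Share of synthetic samples per minority point (before rounding)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    n_min = minority_idx.size
    n_maj = y.size - n_min
    total_new = n_maj - n_min  # samples needed to fully balance the classes

    ratios = np.empty(n_min)
    for j, i in enumerate(minority_idx):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf  # exclude the point itself
        neighbours = np.argsort(dist)[:k]
        # Fraction of the k nearest neighbours that belong to the majority class
        ratios[j] = np.mean(y[neighbours] != minority_label)

    if ratios.sum() == 0:
        weights = np.full(n_min, 1.0 / n_min)  # no hard points: spread evenly
    else:
        weights = ratios / ratios.sum()
    return weights * total_new

# Toy 1-D data: ten majority points near 0, one minority point among them,
# and a tight minority cluster far away (all values are made up)
majority = np.arange(10) * 0.1
minority = np.array([0.05, 10.0, 10.1, 10.2, 10.3])
X = np.concatenate([majority, minority]).reshape(-1, 1)
y = np.array([0] * 10 + [1] * 5)

allocation = adasyn_allocation(X, y)
print(allocation)
```

The minority point surrounded by majority neighbours receives the whole budget, while the points inside the clean minority cluster receive none.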

The data is split into 80% for training and 20% for testing.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_resample, y_resample, test_size=0.2, random_state=42)

**FEATURE ENGINEERING**

**RECURSIVE FEATURE ELIMINATION**


Recursive feature elimination (RFE) is a wrapper method for feature selection. It fits a base estimator, ranks the features by importance, and recursively eliminates the least important ones. RFECV additionally uses cross-validation to choose how many features to keep.

In [None]:
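# Random forest as the base estimator; RFECV drops 2 features per step
# and uses 5-fold cross-validation to pick the number of features to keep
estimator = RandomForestClassifier()
feature_selection = RFECV(estimator, step=2, cv=5)
feature_selection = feature_selection.fit(x_train, y_train)
# Boolean mask of the selected features
mask = np.array(feature_selection.support_)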

In [None]:
x_train = x_train.loc[:, mask]
x_test = x_test.loc[:, mask]
x_train
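
The elimination loop itself can be sketched in a few lines of NumPy. This is a simplified stand-in, not what RFECV does internally: it uses ordinary least squares instead of a random forest, ranks features by absolute coefficient, and stops at a fixed `n_keep` instead of choosing the count by cross-validation. The helper name and the synthetic data are hypothetical.

```python
import numpy as np

def recursive_feature_elimination(X, y, n_keep):
    """Drop the weakest feature one at a time until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Refit the base estimator (here: least squares) on the remaining features
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Eliminate the feature with the smallest absolute coefficient
        weakest = int(np.argmin(np.abs(coef)))
        remaining.pop(weakest)
    return remaining

# Synthetic data: only features 0 and 2 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

selected = recursive_feature_elimination(X, y, n_keep=2)
print("selected feature indices:", selected)
```

Ranking by coefficient size is only meaningful here because all synthetic features share the same scale; tree-based importances, as used in this notebook, do not have that requirement.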

**XGBoost**

XGBoost is a scalable implementation of gradient-boosted decision trees that is widely used for tabular data. Learn more about it here: https://arxiv.org/pdf/1603.02754.pdf

In [None]:
run_gs = True

if run_gs:
    parameter_grid = {
        'max_depth': [1, 2, 3, 6, 8],
        'gamma': [0, 0.2, 0.4, 0.8, 1.5],
        'use_label_encoder': [False],
        'random_state': [1],
        'eval_metric': ['logloss'],
    }
    model = XGBClassifier()
    cross_validation = StratifiedKFold(n_splits=5)

    grid_search = GridSearchCV(model,
                               scoring='accuracy',
                               param_grid=parameter_grid,
                               cv=cross_validation,
                               verbose=1
                              )

    grid_search.fit(x_train, y_train)
    model = grid_search
    parameters = grid_search.best_params_
    
    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))

In [None]:
clf = XGBClassifier(gamma=0.8, max_depth=3, eval_metric='logloss', use_label_encoder=False, random_state=1)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

In [None]:
target_names = ['class 0', 'class 1']
print(classification_report(y_test, y_pred, target_names=target_names))