<a id="introduction"></a>  
# Introduction

Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.

People with cardiovascular disease, or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease), need early detection and management, and a machine learning model can be of great help here.

<img src= "https://health.clevelandclinic.org/wp-content/uploads/sites/3/2018/08/GettyImages-944106494.jpg" width="1100">

## Attribute Information

1. Age: age of the patient [years]
2. Sex: sex of the patient [M: Male, F: Female]
3. ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
4. RestingBP: resting blood pressure [mm Hg]
5. Cholesterol: serum cholesterol [mg/dl]
6. FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
7. RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
8. MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
9. ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
10. Oldpeak: ST depression induced by exercise relative to rest [Numeric value]
11. ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
12. HeartDisease: output class [1: heart disease, 0: Normal]

We will first predict 'HeartDisease' using various machine learning algorithms, each tuned with a grid search. After that, we will build a neural network for the same prediction task.

In [99]:
import warnings
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
%matplotlib inline


<a id="dataframe"></a>
# Accessing the Dataframe

In [100]:
df = pd.read_csv('heart.csv')

In [101]:
df.info()

In [102]:
df.head(10)

In [103]:
for col in df.columns:
    print(f'{col} has {df[col].nunique()} unique values.')

In [104]:
df.duplicated().sum()

In [105]:
cat_cols = ['Sex', 'ChestPainType', 'FastingBS', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
num_cols = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']

In [106]:
for col in cat_cols:
    print(f'The unique values in {col} are: {df[col].unique()}')

<a id="missing-values"></a>  
# Checking for Missing Values

In [107]:
df.isnull().sum()

Great! We don't have to deal with null values. We have separated the categorical and numerical columns and so we can now begin EDA.  
<a id="eda"></a>  
# EDA

In [108]:
corr = df.corr(method='spearman', numeric_only=True)  # correlate only the numeric columns; string columns are skipped
plt.figure(figsize=(20,6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='Blues')
plt.title('Spearman Correlation Heatmap')
plt.show()

In [109]:
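# Note: pairplot_cols is the same list object as num_cols, so 'HeartDisease' is
# appended to num_cols as well; it is removed again before preprocessing.
pairplot_cols = num_cols
pairplot_cols.append('HeartDisease')
# pairplot creates its own figure, so no separate plt.figure() call is needed.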
sns.pairplot(df[pairplot_cols], hue='HeartDisease', palette='GnBu')
plt.show()

One thing that stands out in the Cholesterol column is the large number of entries with a value of 0. Let's count them.

In [110]:
df.loc[df['Cholesterol'] == 0, 'Cholesterol'].count()

In [111]:
df.loc[(df['Cholesterol'] == 0) & (df['HeartDisease'] == 1), 'Cholesterol'].count()

There are 172 entries with a cholesterol value of 0, which were most likely used to fill in missing data, and 152 of them belong to patients with heart disease. Since these 172 values are effectively missing values imputed with 0, let's drop the column. We will remove it later using the column transformer.
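
One way to make such placeholder zeros explicit is to convert them to `NaN` before counting. The sketch below uses a small made-up frame named `toy` (not the real dataset), so the counts it produces are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the values are made up, not taken from heart.csv.
toy = pd.DataFrame({
    'Cholesterol': [289, 0, 180, 0, 214, 0],
    'HeartDisease': [0, 1, 0, 1, 1, 0],
})

# Treat a serum cholesterol of 0 as a missing measurement.
chol = toy['Cholesterol'].replace(0, np.nan)

# Number of placeholder zeros, and how many of them belong to the positive class.
n_missing = int(chol.isna().sum())
n_missing_sick = int((chol.isna() & (toy['HeartDisease'] == 1)).sum())
share_missing = n_missing / len(toy)

print(n_missing, n_missing_sick, round(share_missing, 2))
```

Seeing the share of missing values next to the class breakdown helps decide between dropping the column, as done here, and imputing it.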

In [112]:
num_cols.remove('Cholesterol')

In [113]:
fig, axes = plt.subplots(4, 3, figsize=(20,25))
for i, col in zip(range(4), num_cols):
    sns.stripplot(ax=axes[i][0], x='HeartDisease', y=col, data=df, palette='GnBu', jitter=True)
    axes[i][0].set_title(f'{col} Stripplot')
    sns.histplot(ax=axes[i][1], x=col, data=df, kde=True, bins=10, palette='GnBu', hue='HeartDisease', multiple='dodge')
    axes[i][1].set_title(f'{col} Displot')
    sns.boxplot(ax=axes[i][2], x='HeartDisease', y=col, data=df, palette='GnBu', hue='HeartDisease')
    axes[i][2].set_title(f'{col} Boxplot')

There are some outliers in the numerical columns. Let's cap them at threshold values derived from the interquartile range (IQR).
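
As a quick illustration of the IQR rule, the sketch below caps a small made-up series using `Series.clip`. The values are hypothetical and chosen so that one reading falls outside the limits on each side.

```python
import pandas as pd

# Made-up readings with one extreme value on each side.
values = pd.Series([0, 110, 120, 125, 130, 135, 140, 150, 200], dtype=float)

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
low_limit, up_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() replaces anything outside [low_limit, up_limit] with the nearest limit.
capped = values.clip(lower=low_limit, upper=up_limit)

print(low_limit, up_limit)
print(capped.tolist())
```

Capping keeps every row in the dataset, which matters for a dataset of this size, at the cost of flattening the most extreme readings.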

<a id="outliers"></a>  
## Replacing Outliers

In [114]:
def outlier_limits(df, col_name, q1 = 0.25, q3 = 0.75):
    quartile1 = df[col_name].quantile(q1)
    quartile3 = df[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

def replace_with_limits(df, variable, q1 = 0.25, q3 = 0.75):
    low_limit, up_limit = outlier_limits(df, variable, q1 = q1, q3 = q3)
    df.loc[(df[variable] < low_limit), variable] = low_limit
    df.loc[(df[variable] > up_limit), variable] = up_limit
    
for variable in df[num_cols].columns:
    replace_with_limits(df, variable)

<a id="eda-continued"></a>  
## EDA (Continued)

In [115]:
fig, axes = plt.subplots(2, 2, figsize=(20,15))
for i, col in zip(range(4), num_cols):
    sns.boxplot(ax=axes[i//2][i%2], x='HeartDisease', y=col, data=df, palette='GnBu', hue='HeartDisease')
    axes[i//2][i%2].set_title(f'{col} Boxplot')

Oldpeak still has a lot of outliers. We will look into it later when checking multicollinearity.

In [116]:
fig, axes = plt.subplots(2, 3, figsize=(20,12))
for i, col in zip(range(6), cat_cols):
    sns.histplot(ax=axes[i//3][i%3], x=col, data=df, palette='GnBu', hue='HeartDisease', multiple='dodge', bins='auto')
    axes[i//3][i%3].set_title(f'{col} Countplot')

In [117]:
fig, axes = plt.subplots(2, 3, figsize=(20,12))
for i, col in zip(range(6), cat_cols):
    sns.stripplot(ax=axes[i//3][i%3], x=col, y='Age', data=df, palette='GnBu', hue='HeartDisease', jitter=True)
    axes[i//3][i%3].set_title(f'Age by {col} Stripplot')

Let's now look at more than one feature at a time, keeping our target column as the hue.

In [118]:
eda_num_cols = ['RestingBP', 'MaxHR', 'Oldpeak']

In [119]:
fig, axes = plt.subplots(1, 3, figsize=(20,7))
for i, col in zip(range(3), eda_num_cols):
    sns.scatterplot(ax=axes[i], x='Age', y=col, hue="HeartDisease", style="Sex", data=df.iloc[0:889,:], palette="GnBu")

All features look useful going forward: several categorical variables have values under which most patients have heart disease. The features aren't strongly correlated with one another, which is good. With this, we are done with our EDA.

<a id="preprocessing"></a>  
# Data Preprocessing

In [120]:
num_cols.remove('HeartDisease')

In [121]:
print(cat_cols)
print(num_cols)

We want to apply different preprocessing to different variables (one-hot encoding for the categorical ones, scaling for the numerical ones). Let's use a column transformer to achieve this.

In [123]:
X = df.iloc[:,:11]
y = df['HeartDisease']
X.head()

In [124]:
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
preprocessor1 = ColumnTransformer(
    transformers = [
        ('onehotcat', OneHotEncoder(), ['ChestPainType', 'ST_Slope', 'RestingECG', 'Sex', 'ExerciseAngina']),
        ('num', MinMaxScaler(), num_cols),
    ],
    remainder = 'passthrough',
)
preprocessor2 = ColumnTransformer(
    transformers = [
        ('pca', pca, num_cols),
        ('dropper', 'drop', ['x0_ASY', 'x1_Down', 'x2_LVH', 'x3_F', 'x4_N', 'Age', 'RestingBP', 'MaxHR', 'Oldpeak', 'Cholesterol'])
    ],
    remainder = 'passthrough',
)

In [125]:
preprocessor1_features = ['x0_ASY', 'x0_ATA', 'x0_NAP', 'x0_TA', 'x1_Down', 'x1_Flat', 'x1_Up', 'x2_LVH', 'x2_Normal', 'x2_ST', 'x3_F', 'x3_M', 'x4_N', 'x4_Y', 'Age', 'RestingBP', 'MaxHR', 'Oldpeak', 'Cholesterol', 'FastingBS']
final_features = ['PC-1', 'PC-2', 'x0_ATA', 'x0_NAP', 'x0_TA', 'x1_Flat', 'x1_Up', 'x2_Normal', 'x2_ST', 'x3_M', 'x4_Y', 'FastingBS']

When using a column transformer, the columns of the output array follow the order of the transformers. So the one-hot-encoded (OHE) variables come first, followed by the scaled variables and then the variables that didn't require preprocessing.  
We will drop one dummy column from each one-hot-encoded feature to avoid the dummy variable trap, which causes multicollinearity.

In [126]:
X = pd.DataFrame(preprocessor1.fit_transform(X), columns=preprocessor1_features)
X = pd.DataFrame(preprocessor2.fit_transform(X), columns=final_features)
X.head()

You can use the code below to compute the variance inflation factor (VIF) and check whether multicollinearity exists between features (features with VIF > 5 show multicollinearity). We will run it to see which features show multicollinearity and then go back up to the column transformer to add a PCA step.
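
For feature $i$, the VIF is $\frac{1}{1 - R_i^2}$, where $R_i^2$ comes from regressing feature $i$ on all the other features. The NumPy-only sketch below illustrates the idea on synthetic data; the function name `vif_uncentered` is hypothetical, and it assumes a regression without an intercept (so $R^2$ is the uncentered version), which is how a statsmodels OLS fit without an added constant behaves.

```python
import numpy as np

def vif_uncentered(X):
    """VIF of each column of X, regressing it on the remaining columns without an intercept."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for i in range(X.shape[1]):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        residual = y - others @ coef
        r_squared = 1.0 - (residual @ residual) / (y @ y)  # uncentered R^2
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic columns: b is nearly a copy of a, c is unrelated to both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)
vifs = vif_uncentered(np.column_stack([a, b, c]))
print(vifs.round(2))
```

The two nearly identical columns get a large VIF, while the unrelated column stays close to 1.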

In [127]:
import statsmodels.api as sm
def calculate_vif(data):
    """Return the VIF of every column in `data`, highest first."""
    vif_df = pd.DataFrame(columns = ['Feature', 'VIF'])
    x_var_names = data.columns
    for i in range(0, x_var_names.shape[0]):
        # Regress feature i on all remaining features (no intercept term).
        y = data[x_var_names[i]]
        x = data[x_var_names.drop([x_var_names[i]])]
        r_squared = sm.OLS(y,x).fit().rsquared
        vif = round(1/(1-r_squared),2)
        vif_df.loc[i] = [x_var_names[i], vif]
    return vif_df.sort_values(by = 'VIF', axis = 0, ascending=False, inplace=False)

calculate_vif(X)

The numerical columns show multicollinearity. Rather than removing them, it is better to reduce their dimension using PCA, which also lowers their VIF while keeping most of the information they carry. This is the role of preprocessor2, which contains the PCA step.
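
The reason PCA helps is that principal components are uncorrelated with each other by construction. A minimal sketch on two synthetic, strongly correlated features (made-up data, not the dataset's columns):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
base = rng.normal(size=300)
# Two made-up numeric features that are strongly correlated with each other.
features = np.column_stack([
    base + 0.2 * rng.normal(size=300),
    base + 0.2 * rng.normal(size=300),
])

components = PCA(n_components=2).fit_transform(features)

# Correlation between the two columns before and after the projection.
corr_before = np.corrcoef(features, rowvar=False)[0, 1]
corr_after = np.corrcoef(components, rowvar=False)[0, 1]
print(round(corr_before, 3), round(abs(corr_after), 3))
```

The trade-off is interpretability: the resulting `PC-1` and `PC-2` columns are mixtures of the original features, so feature importance plots no longer map directly to a single clinical measurement.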

<a id="ml"></a>  
# Machine Learning  
We will split the data first, apply different models and optimize them.
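
As a possible variation, the preprocessing steps can be placed inside a scikit-learn `Pipeline` so that they are refit on the training folds during the grid search. The sketch below is an assumption about how that could look, not what this notebook does; it runs on synthetic data from `make_classification`, and the step names `scale`, `pca` and `clf` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; not the heart disease dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as '<step name>__<parameter>'.
grid = GridSearchCV(pipe, {'clf__C': [0.01, 0.1, 1, 10]},
                    cv=KFold(n_splits=5), scoring='accuracy')
grid.fit(X_tr, y_tr)
test_accuracy = grid.score(X_te, y_te)
print(grid.best_params_, round(test_accuracy, 3))
```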

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, roc_auc_score
from sklearn.feature_selection import VarianceThreshold
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
def fit(clf, params, cv=10, X_train=X_train, y_train=y_train):
    # verbose controls how much progress information is printed (higher values print more); n_jobs sets the number of parallel jobs.
    grid = GridSearchCV(clf, params, cv=KFold(n_splits=cv), n_jobs=1, verbose=1, return_train_score=True, scoring='accuracy', refit=True)
    grid.fit(X_train, y_train)
    return grid

def make_predictions(model, X_test=X_test):
    return model.predict(X_test)

def best_scores(model):
    # cv_results_ is a dict, so the per-candidate scores are read with model.cv_results_['mean_test_score'] (attribute access fails).
    print(f'The best parameters are: {model.best_params_}')
    print(f'The best score that we got is: {model.best_score_}')
    return None

def plot_confusion_matrix(y_pred):
    print('00: True Negatives\n01: False Positives\n10: False Negatives\n11: True Positives\n')
    conf_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred)
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.matshow(conf_matrix, cmap='GnBu', alpha=0.75)
    for i in range(conf_matrix.shape[0]):
        for j in range(conf_matrix.shape[1]):
            ax.text(x=j, y=i,s=conf_matrix[i, j], va='center', ha='center', size='large') 
    plt.xlabel('Predictions', fontsize=14)
    plt.ylabel('Actuals', fontsize=14)
    plt.title('Confusion Matrix', fontsize=14)
    plt.show()
    return None

def check_scores(y_pred):
    print('Precision: %.3f' % precision_score(y_test, y_pred))
    print('Recall: %.3f' % recall_score(y_test, y_pred))
    print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))
    print('F1 Score: %.3f' % f1_score(y_test, y_pred))
    print('ROC-AUC Score: %.3f' % roc_auc_score(y_test, y_pred))
    return None

Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$  
Precision = $\frac{TP}{TP + FP}$  
Recall = $\frac{TP}{TP + FN}$  
F1 Score = $\frac{2 * Precision * Recall}{Precision + Recall}$
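
As a worked example of these formulas, the sketch below computes all four metrics from hypothetical confusion-matrix counts. The counts are chosen only for illustration and do not come from any model in this notebook.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, tn, fp, fn = 80, 70, 20, 30

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f'Accuracy: {accuracy:.3f}')
print(f'Precision: {precision:.3f}')
print(f'Recall: {recall:.3f}')
print(f'F1 Score: {f1:.3f}')
```

For a screening task like this one, recall is worth watching closely, since a false negative means a patient with heart disease goes undetected.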

<a id="lr"></a>  
## Logistic Regression

In [None]:
lr_params = {'C':[0.001,.009,0.01,.09,1,5,10,25], 'penalty':['l1', 'l2']} #L1 (lasso) and L2 (ridge) penalties
lr_clf = LogisticRegression(solver='saga', max_iter=5000)
lr_model = fit(lr_clf, lr_params)

Time elapsed: 1.4s

In [None]:
best_scores(lr_model)

In [None]:
lr_y_pred = make_predictions(lr_model)
check_scores(lr_y_pred)

In [None]:
plot_confusion_matrix(lr_y_pred)

In [None]:
lr_feature_scores = lr_model.best_estimator_.coef_[0].tolist()
lr_fi = pd.DataFrame({'Feature': final_features, 'Feature Importance': lr_feature_scores})
plt.figure(figsize=(10,6))
sns.barplot(x='Feature Importance', y='Feature', data=lr_fi, palette='GnBu')
plt.show()

The best parameters are: {'C': 0.09, 'penalty': 'l2'}  
The best score that we got is: 0.855460940392447  
Precision: 0.873  
Recall: 0.873  
Accuracy: 0.859  
F1 Score: 0.873  
ROC-AUC Score: 0.857

<a id="gnb"></a>  
## Gaussian Naive Bayes

In [None]:
gnb_params = {'priors': [None], 'var_smoothing': np.logspace(0,-9, num=100)}
gnb_clf = GaussianNB()
gnb_model = fit(gnb_clf, gnb_params)

Time elapsed: 5.8s

In [None]:
best_scores(gnb_model)

In [None]:
gnb_y_pred = make_predictions(gnb_model)
check_scores(gnb_y_pred)

In [None]:
plot_confusion_matrix(gnb_y_pred)

The best parameters are: {'priors': None, 'var_smoothing': 0.08111308307896872}  
The best score that we got is: 0.8554794520547946   
Precision: 0.898  
Recall: 0.863  
Accuracy: 0.870  
F1 Score: 0.880  
ROC-AUC Score: 0.870

<a id="knn"></a>  
## K-NNs

In [None]:
knns_params = {'n_neighbors': list(range(1, 31)), 'weights': ['uniform', 'distance'], 
               'metric': ['euclidean', 'manhattan']}
knns_clf = KNeighborsClassifier()
knns_model = fit(knns_clf, knns_params)

Time elapsed: 36.2s

In [None]:
best_scores(knns_model)

In [None]:
knns_y_pred = make_predictions(knns_model)
check_scores(knns_y_pred)

In [None]:
plot_confusion_matrix(knns_y_pred)

The best parameters are: {'metric': 'euclidean', 'n_neighbors': 11, 'weights': 'distance'}  
The best score that we got is: 0.8541651240281377     
Precision: 0.877  
Recall: 0.912  
Accuracy: 0.880  
F1 Score: 0.894  
ROC-AUC Score: 0.877

<a id="svm"></a>  
## SVMs

In [None]:
svm_params = {'C':[1,10,100,1000], 'gamma':[1,0.1,0.001,0.0001], 'kernel':['linear','rbf']}
svm_clf = SVC()
svm_model = fit(svm_clf, svm_params)

Time elapsed: 2.9 min

In [None]:
best_scores(svm_model)

In [None]:
svm_y_pred = make_predictions(svm_model)
check_scores(svm_y_pred)

In [None]:
plot_confusion_matrix(svm_y_pred)

The best parameters are: {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}  
The best score that we got is: 0.8567937800814512    
Precision: 0.847  
Recall: 0.922  
Accuracy: 0.864  
F1 Score: 0.883  
ROC-AUC Score: 0.857

<a id="dt"></a>  
## Decision Trees

In [None]:
dt_params = {'criterion': ['gini', 'entropy'], 'max_depth': range(1,10), 
             'min_samples_leaf': range(1,5)}
dt_clf = DecisionTreeClassifier()
dt_model = fit(dt_clf, dt_params)

Time elapsed: 4.7s

In [None]:
best_scores(dt_model)

In [None]:
dt_y_pred = make_predictions(dt_model)
check_scores(dt_y_pred)

In [None]:
plot_confusion_matrix(dt_y_pred)

The best parameters are: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 4}  
The best score that we got is: 0.8350055534987042  
Precision: 0.843  
Recall: 0.843  
Accuracy: 0.826  
F1 Score: 0.843  
ROC-AUC Score: 0.824

In [None]:
fig = plt.figure(figsize=(50,40))
tree.plot_tree(dt_model.best_estimator_, feature_names=final_features,  class_names=['0','1'], filled=True, fontsize=14, rounded=True)
plt.show()

<a id="rf"></a>  
## Random Forests

In [None]:
rf_params = {'criterion' :['gini', 'entropy'], 'min_samples_leaf': [3, 4, 5], 
             'min_samples_split': [8, 10, 12], 'n_estimators': [100,250,500,600,700,800,900,1000]}
rf_clf = RandomForestClassifier()
rf_model = fit(rf_clf, rf_params)

Time elapsed: 27.4 min

In [None]:
best_scores(rf_model)

In [None]:
rf_y_pred = make_predictions(rf_model)
check_scores(rf_y_pred)

In [None]:
plot_confusion_matrix(rf_y_pred)

In [None]:
rf_feature_scores = rf_model.best_estimator_.feature_importances_.tolist()
rf_fi = pd.DataFrame({'Feature': final_features, 'Feature Importance': rf_feature_scores})
plt.figure(figsize=(10,6))
sns.barplot(x='Feature Importance', y='Feature', data=rf_fi, palette='GnBu')
plt.show()

The best parameters are: {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 12, 'n_estimators': 100}  
The best score that we got is: 0.8567937800814514  
Precision: 0.853  
Recall: 0.853  
Accuracy: 0.837  
F1 Score: 0.853  
ROC-AUC Score: 0.835

<a id="ada"></a>  
## AdaBoost

In [None]:
ada_params = {'n_estimators': [100, 200, 400, 500, 600, 800, 1000, 2000], 
              'learning_rate': [0.01, 0.1, 0.2, 0.5, 1]}
ada_clf = AdaBoostClassifier()
ada_model = fit(ada_clf, ada_params)

Time elapsed: 10 min

In [None]:
best_scores(ada_model)

In [None]:
ada_y_pred = make_predictions(ada_model)
check_scores(ada_y_pred)

In [None]:
plot_confusion_matrix(ada_y_pred)

In [None]:
ada_feature_scores = ada_model.best_estimator_.feature_importances_.tolist()
ada_fi = pd.DataFrame({'Feature': final_features, 'Feature Importance': ada_feature_scores})
plt.figure(figsize=(10,6))
sns.barplot(x='Feature Importance', y='Feature', data=ada_fi, palette='GnBu')
plt.show()

The best parameters are: {'learning_rate': 0.01, 'n_estimators': 400}  
The best score that we got is: 0.843169196593854   
Precision: 0.842  
Recall: 0.833  
Accuracy: 0.821  
F1 Score: 0.837  
ROC-AUC Score: 0.819

<a id="gb"></a>  
## Gradient Boosting

In [None]:
gb_params = {"loss": ["exponential"], "learning_rate": [0.001, 0.0025, 0.005, 0.0075, 0.01],
             "max_depth": [4, 6, 8, 10], "max_features": ["log2", "sqrt"], 
             "n_estimators": [100, 250, 400, 500, 600, 750, 1000]}
gb_clf = GradientBoostingClassifier()
gb_model = fit(gb_clf, gb_params, cv=10)

Time elapsed: 97.5 min

In [None]:
best_scores(gb_model)

In [None]:
gb_y_pred = make_predictions(gb_model)
check_scores(gb_y_pred)

In [None]:
plot_confusion_matrix(gb_y_pred)

In [None]:
gb_feature_scores = gb_model.best_estimator_.feature_importances_.tolist()
gb_fi = pd.DataFrame({'Feature': final_features, 'Feature Importance': gb_feature_scores})
plt.figure(figsize=(10,6))
sns.barplot(x='Feature Importance', y='Feature', data=gb_fi, palette='GnBu')
plt.show()

The best parameters are: {'learning_rate': 0.005, 'loss': 'exponential', 'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 750}  
The best score that we got is: 0.8567752684191042   
Precision: 0.848  
Recall: 0.873  
Accuracy: 0.842  
F1 Score: 0.860  
ROC-AUC Score: 0.839

<a id="lightgbm"></a>  
## LightGBM

In [None]:
lgbm_params = {'num_leaves':[5, 10, 15, 20, 25], 'min_child_samples':[5, 10, 15],
               'learning_rate':[0.001, 0.0025, 0.005, 0.0075, 0.01], 'objective': ['binary']}
lgbm_clf = lgb.LGBMClassifier()
lgbm_model = fit(lgbm_clf, lgbm_params)

Time elapsed: 42s

In [None]:
best_scores(lgbm_model)

In [None]:
lgbm_y_pred = make_predictions(lgbm_model)
check_scores(lgbm_y_pred)

In [None]:
plot_confusion_matrix(lgbm_y_pred)

In [None]:
lgbm_feature_scores = lgbm_model.best_estimator_.feature_importances_.tolist()
lgbm_fi = pd.DataFrame({'Feature': final_features, 'Feature Importance': lgbm_feature_scores})
plt.figure(figsize=(10,6))
sns.barplot(x='Feature Importance', y='Feature', data=lgbm_fi, palette='GnBu')
plt.show()

The best parameters are: {'learning_rate': 0.01, 'min_child_samples': 5, 'num_leaves': 10, 'objective': 'binary'}  
The best score that we got is: 0.8526656793780083  
Precision: 0.817    
Recall: 0.873    
Accuracy: 0.821    
F1 Score: 0.844    
ROC-AUC Score: 0.814

<a id="xgboost"></a>  
## XGBoost

In [None]:
xgb_params = {'max_depth': range (2, 10, 1), 'n_estimators': [50, 100, 250, 400, 500, 600, 750, 1000],
              'learning_rate': [0.001, 0.0025, 0.005, 0.0075, 0.01],
              'objective': ['binary:hinge', 'binary:logistic', 'binary:logitraw']
}
xgb_clf = xgb.XGBClassifier()
xgb_model = fit(xgb_clf, xgb_params)

Time elapsed: 173 min

In [None]:
best_scores(xgb_model)

In [None]:
xgb_y_pred = make_predictions(xgb_model)
check_scores(xgb_y_pred)

In [None]:
plot_confusion_matrix(xgb_y_pred)

In [None]:
xgb_feature_scores = xgb_model.best_estimator_.feature_importances_.tolist()
xgb_fi = pd.DataFrame({'Feature': final_features, 'Feature Importance': xgb_feature_scores})
plt.figure(figsize=(10,6))
sns.barplot(x='Feature Importance', y='Feature', data=xgb_fi, palette='GnBu')
plt.show()

The best parameters are: {'learning_rate': 0.005, 'max_depth': 3, 'n_estimators': 750, 'objective': 'binary:logistic'}  
The best score that we got is: 0.85272121436505  
Precision: 0.838  
Recall: 0.863  
Accuracy: 0.832  
F1 Score: 0.850  
ROC-AUC Score: 0.828

<a id="cat"></a>  
## CatBoost

In [None]:
cb_params = {'depth': [4, 6, 8, 10], 'learning_rate': [0.001, 0.0025, 0.005, 0.0075, 0.04],
             'n_estimators': [10, 25, 50, 75, 100], 'loss_function': ['Logloss', 'CrossEntropy']}
cb_clf = CatBoostClassifier()
cb_model = fit(cb_clf, cb_params)

In [None]:
best_scores(cb_model)

In [None]:
cb_y_pred = make_predictions(cb_model)
check_scores(cb_y_pred)

In [None]:
plot_confusion_matrix(cb_y_pred)

In [None]:
cb_feature_scores = cb_model.best_estimator_.feature_importances_.tolist()
cb_fi = pd.DataFrame({'Feature': final_features, 'Feature Importance': cb_feature_scores})
plt.figure(figsize=(10,6))
sns.barplot(x='Feature Importance', y='Feature', data=cb_fi, palette='GnBu')
plt.show()

The best parameters are: {'depth': 6, 'learning_rate': 0.04, 'loss_function': 'Logloss', 'n_estimators': 100}  
The best score that we got is: 0.8622362088115514  
Precision: 0.840  
Recall: 0.873  
Accuracy: 0.837  
F1 Score: 0.856  
ROC-AUC Score: 0.833

<a id="nn"></a>    
# Feedforward Neural Networks

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
# This wrapper is deprecated and missing from newer TensorFlow releases; the SciKeras package offers a replacement.
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

In [None]:
nn_params = {'batch_size': [200, 400], 'activation': ['relu', 'tanh', 'sigmoid'], 
             'kernel_initializer': ['HeNormal', 'GlorotNormal'],
             'neurons': [8,9,10], 'epochs': [500, 750] 
            }
learning_rate = [0.001, 0.01]
nn_params['learning_rate'] = learning_rate

def create_network(learning_rate=0.01, activation='tanh', kernel_initializer='HeNormal', neurons=9):
    model = Sequential()
    model.add(Dense(12, input_dim=12, kernel_initializer=kernel_initializer)) #first layer takes the 12 input features; no activation is applied here
    model.add(Dense(neurons, kernel_initializer=kernel_initializer, activation=activation))
    model.add(Dense(1, activation='sigmoid'))
    optimizer = Adam(learning_rate=learning_rate)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['binary_accuracy'])
    return model

In [None]:
network = KerasClassifier(build_fn=create_network, epochs=500, verbose=2)
nn_model = fit(network, nn_params, cv=5)

In [None]:
best_scores(nn_model)

In [None]:
plt.plot(nn_model.best_estimator_.model.history.history['binary_accuracy'], color='Green')
plt.plot(nn_model.best_estimator_.model.history.history['loss'], color='Blue')
plt.title('Training Accuracy and Loss')
plt.ylabel('Value')
plt.xlabel('Epoch')
plt.legend(['Accuracy', 'Loss'], loc='center right')
plt.show()

In [None]:
nn_y_pred = nn_model.predict(X_test)
nn_y_pred = nn_y_pred > 0.5 #0.5 being the threshold value
check_scores(nn_y_pred)

In [None]:
plot_confusion_matrix(nn_y_pred)

The best parameters are: {'activation': 'relu', 'batch_size': 200, 'epochs': 500, 'kernel_initializer': 'GlorotNormal', 'learning_rate': 0.001, 'neurons': 9}  
The best score that we got is: 0.8623054701332586  
Precision: 0.880  
Recall: 0.931  
Accuracy: 0.891  
F1 Score: 0.905  
ROC-AUC Score: 0.886
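
For a side-by-side view, the test-set scores reported in the sections above can be collected and ranked. The sketch below only restates those printed numbers; the `results` list and the choice to rank by accuracy are illustrative assumptions, not part of the original notebook.

```python
# Test-set scores copied from the outputs reported in each section above.
# Columns: model, precision, recall, accuracy, F1, ROC-AUC.
results = [
    ('Logistic Regression',        0.873, 0.873, 0.859, 0.873, 0.857),
    ('Gaussian Naive Bayes',       0.898, 0.863, 0.870, 0.880, 0.870),
    ('K-NN',                       0.877, 0.912, 0.880, 0.894, 0.877),
    ('SVM',                        0.847, 0.922, 0.864, 0.883, 0.857),
    ('Decision Tree',              0.843, 0.843, 0.826, 0.843, 0.824),
    ('Random Forest',              0.853, 0.853, 0.837, 0.853, 0.835),
    ('AdaBoost',                   0.842, 0.833, 0.821, 0.837, 0.819),
    ('Gradient Boosting',          0.848, 0.873, 0.842, 0.860, 0.839),
    ('LightGBM',                   0.817, 0.873, 0.821, 0.844, 0.814),
    ('XGBoost',                    0.838, 0.863, 0.832, 0.850, 0.828),
    ('CatBoost',                   0.840, 0.873, 0.837, 0.856, 0.833),
    ('Feedforward Neural Network', 0.880, 0.931, 0.891, 0.905, 0.886),
]

# Rank the models by test accuracy, highest first.
ranked = sorted(results, key=lambda row: row[3], reverse=True)
for name, precision, recall, accuracy, f1, auc in ranked:
    print(f'{name:<28} acc={accuracy:.3f}  f1={f1:.3f}  recall={recall:.3f}')
```

On these numbers the feedforward network ranks first on accuracy (0.891), followed by K-NN (0.880) and Gaussian Naive Bayes (0.870).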