Several scikit-learn ensemble methods are compared on the **Heart Disease UCI** dataset. The features are listed below.

* age
* sex
* chest pain type (4 values)
* resting blood pressure
* serum cholesterol in mg/dl
* fasting blood sugar > 120 mg/dl
* resting electrocardiographic results (values 0,1,2)
* maximum heart rate achieved
* exercise induced angina
* oldpeak, ST depression induced by exercise relative to rest
* the slope of the peak exercise ST segment
* number of major vessels colored by fluoroscopy (values 0,1,2,3,4)
* thal (values 0,1,2,3)

The **target** column is the one we want to predict.

In [None]:
import numpy as np
import pandas as pd

import seaborn as sea

from sklearn.model_selection import train_test_split
from sklearn.utils import compute_class_weight
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import StackingClassifier

import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout 
from tensorflow.keras.models import Sequential
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

import matplotlib.pyplot as plt

In [None]:
sea.set_style("darkgrid")

## Load and Analyze Data

In [None]:
data = pd.read_csv("/kaggle/input/heart-disease-uci/heart.csv")

data.head(10).style.format(precision=2). \
                    set_properties(**{"min-width": "50px"}). \
                    set_properties(**{"color": "#111111"}). \
                    set_properties(**{"text-align": "center"}). \
                    set_table_styles([
                          {"selector": "th",
                           "props": [("font-weight", "bold"),
                                     ("font-size", "12px"),
                                     ("text-align", "center")]},
                          {"selector": "tr:nth-child(even)",
                           "props": [("background-color", "#f2f2f2")]},
                          {"selector": "tr:nth-child(odd)",
                           "props": [("background-color", "#fdfdfd")]},
                          {"selector": "tr:hover",
                           "props": [("background-color", "#bcbcbc")]}])


Features are assigned to **data_X** and the corresponding labels to **data_Y**.

In [None]:
# disable SettingWithCopyWarning
pd.options.mode.chained_assignment = None

data_X = data.loc[:, data.columns != "target"]
data_Y = data[["target"]]

Pandas **info()** shows the data type and the number of non-null values of each column (feature).

In [None]:
print("\ndata_X info:\n")
data_X.info()
print("\ndata_Y info:\n")
data_Y.info()

There are 303 examples (rows) and 13 features (columns) in the dataset. Features and target are numeric.

In [None]:
print("Target classes ", data_Y["target"].unique())

# class_weight is passed by keyword, as required by recent scikit-learn versions
weights = compute_class_weight(class_weight = "balanced",
                        classes = data_Y["target"].unique(),
                        y = data_Y["target"])

print("Class weights ", weights)

The target takes 2 values, so the predictive model will be a binary classifier. The class weights are close to each other, which means the dataset is balanced.
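As a rough illustration of what the "balanced" weights mean, the sketch below reproduces the formula scikit-learn documents for this mode, `n_samples / (n_classes * count_of_class)`, in plain Python. The function name and the labels are made up for illustration and are not the heart disease data.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * count_of_class)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Made-up, deliberately imbalanced labels (hypothetical example data).
demo_labels = [0, 0, 0, 1]
demo_weights = balanced_class_weights(demo_labels)
print(demo_weights)

# A perfectly balanced label set gives a weight of 1.0 to every class.
print(balanced_class_weights([0, 1, 0, 1]))
```

The closer the weights are to 1.0, the closer the class counts are to each other.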

## Split Data

The dataset is split into training (80%) and test (20%) sets.

In [None]:
train_X, test_X, train_Y, test_Y = train_test_split(data_X, data_Y,
                                                    test_size=0.2,
                                                    random_state=0)

train_X.reset_index(drop=True, inplace=True)
test_X.reset_index(drop=True, inplace=True)
train_Y.reset_index(drop=True, inplace=True)
test_Y.reset_index(drop=True, inplace=True)

feature_names = train_X.columns

## Data Visualization

For each feature, a histogram and a violin plot (feature vs target) are drawn.

In [None]:
fig, axes = plt.subplots(len(train_X.columns), 2, figsize=(10,50))

for i, f in enumerate(train_X.columns):
    # histplot replaces the deprecated distplot
    sea.histplot(train_X[f], kde = False, color = "#167c02",
                 alpha = 0.7, ax=axes[i][0]);
    sea.violinplot(x=train_Y["target"], y=train_X[f],
                 palette = ["#5294e3", "#a94157"], ax=axes[i][1]);

## Standardization

StandardScaler is fit only on the training data to prevent data leakage. The test data is transformed with the statistics learned from the training data.

In [None]:
scaler = StandardScaler()

# fit to train_X
scaler.fit(train_X)

# transform train_X
train_X = scaler.transform(train_X)
train_X = pd.DataFrame(train_X, columns = feature_names)

# transform test_X
test_X = scaler.transform(test_X)
test_X = pd.DataFrame(test_X, columns = feature_names)
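To make the leakage point concrete, here is a minimal plain-Python sketch of the same idea for a single column: the mean and the (population) standard deviation are computed from the training values only and then reused on the test values. The helper names and the numbers are hypothetical, and the sketch ignores the zero-variance case that StandardScaler handles separately.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn mean and population standard deviation from one column."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in column]

# Made-up values for a single feature (hypothetical example data).
toy_train = [1.0, 2.0, 3.0]
toy_test = [2.0, 5.0]

# Statistics come from the training column only.
mean, std = fit_scaler(toy_train)

scaled_train = transform(toy_train, mean, std)
scaled_test = transform(toy_test, mean, std)  # test data never influences mean/std

print(scaled_train)
print(scaled_test)
```

The scaled test values need not have zero mean or unit variance, which is expected because they were not used to compute the statistics.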

## Correlation Analysis

In [None]:
corr_matrix = pd.concat([train_X, train_Y], axis=1).corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

plt.figure(figsize=(10,8))
sea.heatmap(corr_matrix,annot=True, fmt=".2f",
            vmin=-1, vmax=1, linewidth = 1,
            center=0, mask=mask,cmap="RdBu_r");

No pair of features is strongly correlated, so there are no serious correlation issues between features.

## Feature Importance

The permutation feature importance method is used to evaluate the effect of each feature on the classification task. The importance of a feature is defined as the decrease in model performance when the values of that feature are randomly shuffled. The process is repeated a number of times for each feature, and the mean decrease is reported.

In [None]:
lr = LogisticRegression(max_iter = 1000)
lr.fit(train_X, train_Y.values.ravel())

s = permutation_importance(lr, train_X, train_Y.values.ravel(),
                           n_repeats = 200,
                           scoring = "accuracy", random_state = 0)

for i in s.importances_mean.argsort()[::-1]:
    print("{:10}\t{: .4f}\t{: .4f}".format(feature_names[i],
                                           s.importances_mean[i],
                                           s.importances_std[i]))
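The shuffling procedure itself is simple enough to sketch in plain Python. The code below is an illustrative re-implementation under simplifying assumptions, not scikit-learn's `permutation_importance`: the model is a hypothetical threshold rule, the data is made up, and accuracy is the only score.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    hits = sum(model(r) == y for r, y in zip(rows, labels))
    return hits / len(labels)

def permutation_importance_sketch(model, rows, labels, n_repeats=50, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the labels
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Made-up data: feature 0 determines the label, feature 1 is noise.
toy_rows = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0],
            [0.6, 3.0], [0.7, 5.0], [0.8, 1.0], [0.9, 2.0]]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

def toy_model(row):
    """Hypothetical classifier that looks only at feature 0."""
    return 1 if row[0] > 0.5 else 0

toy_importances = permutation_importance_sketch(toy_model, toy_rows, toy_labels)
print(toy_importances)
```

Shuffling the noise feature never changes a prediction, so its importance is zero, while shuffling the informative feature lowers accuracy on average.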

The 3 least important features are dropped. Note that the correlations of the dropped features with the target are also low (refer to the correlation matrix).

In [None]:
drop = ["slope", "age", "fbs"]
for d in drop:
    train_X.drop(d, axis=1, inplace=True)
    test_X.drop(d, axis=1, inplace=True)

feature_names = train_X.columns

## Ensembles

* [**Bagging Classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) The base estimators are fit on random subsets of the training data. The base estimator is selectable and the default is a decision tree. During subset creation, samples may be drawn with or without replacement depending on the value of the bootstrap parameter. The decisions of the base estimators are aggregated by either hard or soft voting. If the base estimators don't have a predict_proba method, the default aggregation scheme is hard voting. Bagging is generally used to reduce variance. In the scikit-learn implementation, BaggingClassifier, it is also possible to subsample features for each base estimator. The name bagging comes from Bootstrap Aggregating.
<br><br>
* [**Random Forest Classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) A bagging-style ensemble in which the base estimator is fixed as a decision tree. In addition to bootstrap sampling, each split considers only a random subset of the features.
<br><br>
* [**AdaBoost Classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) In bagging, the base estimators are trained and used in parallel, independently of each other. In boosting, the base estimators are trained and used sequentially, one after another. AdaBoost starts with a single base estimator. The sample weights are then adjusted for the next base estimator so that samples misclassified by the previous one receive higher weights. In this way, the importance of misclassified samples increases for the following base estimators. The base estimator is selectable, but it must support sample weighting, since the method relies on it. The name AdaBoost is short for Adaptive Boosting and refers to the adaptive sample weights.
<br><br>
* [**Gradient Tree Boosting Classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) Base estimators are again trained and used sequentially. Here, each subsequent estimator is fit to the negative gradient of the loss function of the ensemble built so far. The base estimator is fixed as a regression tree.
<br><br>
* [**Voting Classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html) A number of machine learning models are aggregated by either hard or soft voting. One difference from bagging is that bagging uses a single type of base estimator, whereas a voting classifier can combine the same or different types of models. If the same base estimator is used, the ensemble is homogeneous; if different base estimators are used, the ensemble is heterogeneous.
<br><br>
* [**Stacking Classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html) A multilayer structure with base estimators in the first layer and a final (meta) estimator in the second layer. The base layer can be homogeneous or heterogeneous. Unlike in a voting classifier, the contribution of each base estimator to the final decision is learned by the meta estimator.
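The difference between the hard and soft voting mentioned in the list above can be sketched for a binary problem as follows. This is a simplified illustration with made-up probabilities and hypothetical function names; it ignores ties and estimator weights, which the scikit-learn classes handle in their own way.

```python
def hard_vote(positive_probas):
    """Majority vote over each estimator's thresholded prediction."""
    votes = [1 if p >= 0.5 else 0 for p in positive_probas]
    return 1 if 2 * sum(votes) > len(votes) else 0

def soft_vote(positive_probas):
    """Threshold the average of the predicted probabilities."""
    mean_proba = sum(positive_probas) / len(positive_probas)
    return 1 if mean_proba >= 0.5 else 0

# Made-up P(class 1) from three hypothetical base estimators for one sample.
demo_probas = [0.9, 0.4, 0.45]

print("hard:", hard_vote(demo_probas))
print("soft:", soft_vote(demo_probas))
```

In this example two of the three estimators lean slightly towards class 0, so hard voting picks class 0, while the one confident estimator pulls the average probability above 0.5 and soft voting picks class 1.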

A neural network is used as the base estimator for BaggingClassifier. The neural network model is built with Keras on the TensorFlow backend and is wrapped for the scikit-learn API with **tf.keras.wrappers.scikit_learn.KerasClassifier**.

A decision tree classifier is used as the base estimator for AdaBoostClassifier and VotingClassifier. For StackingClassifier, a support vector classifier with a linear kernel is used as the base estimator and logistic regression is used as the meta learner.

In [None]:
def create_base():
   
    model = Sequential()
    
    model.add(Dense(128, input_dim=len(feature_names), activation="relu"))
    model.add(Dropout(0.1))
    model.add(Dense(64, activation="relu"))
    model.add(Dropout(0.1))
    model.add(Dense(1, activation="sigmoid"))
    
    model.compile(loss="binary_crossentropy",
                  optimizer=tf.keras.optimizers.Adam(0.01))
    
    return model

base_nn = KerasClassifier(build_fn=create_base,
                       epochs=60, batch_size = 16)
base_nn._estimator_type = "classifier"
base_dt = DecisionTreeClassifier(random_state = 0) 
base_svm = SVC(kernel="linear", random_state = 10)

Models are defined and stored in a list.

In [None]:
model_names = ["BaggingClassifier",
               "RandomForestClassifier",
               "AdaBoostClassifier",
               "GradientBoostingClassifier",
               "VotingClassifier",
               "StackingClassifier"]

models = []

models.append(BaggingClassifier(base_estimator = base_nn,
                                n_estimators=10, random_state=0))

models.append(RandomForestClassifier(n_estimators = 20,
                                     random_state = 11))

models.append(AdaBoostClassifier(base_estimator = base_dt,
                                 n_estimators=10, random_state=10))

models.append(GradientBoostingClassifier(n_estimators = 500,
                                         random_state=8))

models.append(VotingClassifier(estimators=[
                    ('dt_0', base_dt),('dt_1', base_dt),('dt_2', base_dt)],
                    voting='soft'))

models.append(StackingClassifier(estimators=[
            ('svm_0', base_svm), ('svm_1', base_svm), ('svm_2', base_svm)],
            final_estimator=LogisticRegression(max_iter=1000),
            stack_method="predict", cv=5))
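The Ensembles section describes how AdaBoost raises the weights of misclassified samples between rounds. A minimal sketch of one such re-weighting step is given below, assuming the classic discrete AdaBoost update for binary labels. The exact variant used by scikit-learn's AdaBoostClassifier may differ, and the function name, labels and predictions are made up for illustration.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One round of the classic discrete AdaBoost weight update.

    Returns the estimator weight alpha and the normalized sample weights.
    """
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    if not 0.0 < err < 1.0:
        raise ValueError("weighted error must be strictly between 0 and 1")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples are scaled up, correctly classified ones down.
    updated = [w * math.exp(alpha if t != p else -alpha)
               for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]

# Made-up round: five equally weighted samples, the last one is misclassified.
demo_weights = [0.2] * 5
demo_true = [0, 0, 1, 1, 1]
demo_pred = [0, 0, 1, 1, 0]

alpha, new_weights = adaboost_reweight(demo_weights, demo_true, demo_pred)
print(alpha)
print(new_weights)
```

After the update the single misclassified sample carries as much total weight as the four correctly classified samples combined, so the next base estimator is pushed to get it right.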

## Training

5-fold cross-validated accuracies are computed on the training set.

In [None]:
scores = []
for m in models:
    scores.append(cross_val_score(m, train_X, train_Y.values.ravel(),
                        scoring = "accuracy", cv = 5, n_jobs = 1))
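To see how much data each model is trained on inside cross-validation, the sketch below splits indices into 5 contiguous folds. It assumes the training set has 242 rows (303 rows minus the 61 held out by the 80/20 split) and it does not stratify or shuffle. For classifiers, `cross_val_score` with an integer `cv` uses stratified folds, so the actual fold membership differs, but the fold sizes are comparable. The helper name is hypothetical.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

n_train = 242  # assumed size of the training set after the 80/20 split
fold_sizes = [(len(tr), len(va)) for tr, va in kfold_indices(n_train, 5)]
print(fold_sizes)
```

Each fold trains on roughly 80% of the training set and validates on the remaining 20%, so every row is used for validation exactly once.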

The mean cross-validated accuracies are shown below, sorted from highest to lowest.

In [None]:
mean_scores = np.mean(scores, axis=1)
   
for i in mean_scores.argsort()[::-1]:
    print("{:30}\t{:.4f}".format(model_names[i], mean_scores[i]))

After the cross-validation results are reviewed, the models are trained on the full training data.

In [None]:
for m in models:
    m.fit(train_X, train_Y.values.ravel())

## Testing

The trained models make predictions on the test set and the results are evaluated.

In [None]:
acc = []

for m in models:
    y_pred = m.predict(test_X)
    acc.append(accuracy_score(test_Y, y_pred))
    
for i in np.array(acc).argsort()[::-1]:
    print("{:30}\t{:.4f}".format(model_names[i], acc[i]))

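Test accuracies are higher than the cross-validation results. Recall that in cross-validation, each model is trained on the training part of a fold (about 80% of the training set) and validated on the held-out part of that fold. The models used to predict on the test set, on the other hand, were trained on the whole training set. The test set is also small (61 examples), so part of the difference may be due to sampling noise.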