# Classifier Ensembles - Guided

Let's use the Heart Disease UCI dataset from [Kaggle](https://www.kaggle.com/ronitf/heart-disease-uci) to explore how to implement ensemble classifiers using `sklearn.ensemble`. In this exercise we are not going to try to interpret our classifiers much (ensemble methods generally obscure this somewhat anyway, as we will discuss further in the Online Practice), and we are not going to deal with any other issues that might be present or strategies that might be available. We will simply explore the different ensemble methods available and see if we can use any of them to improve the classifier performance.

# Load data

In [None]:
# import the usual libraries


In [None]:
# load in the data


In [None]:
# check the class balance


And so we see that the data is fairly balanced.

In [None]:
# quick high-level check of the data


And so it appears that we have no cleaning to do.

# Preprocessing

In [None]:
# split into features and class


In [None]:
# import train_test_split


# perform a train/test split with a 70:30 ratio


# Imports

Let's import a range of different classifiers that we can try to use in our ensembles, some of which we have studied in detail, and some we have not! 

In [None]:
# import a range of classification algorithms

Import various functions we will need:

In [None]:
# import cross_val_score and GridSearchCV 


# import StandardScaler for certain algorithms


# import make_pipeline to conveniently package up the scaler with certain algorithms


# import accuracy_score


# Voting Classifiers

Let's try constructing a **voting classifier** using a few different classifiers. Let's leave all of the model arguments as default for now - we will return to this point shortly.

> In the following we will evaluate our model using `cross_val_score`, effectively like computing the accuracy on a validation set, as we may sometimes want to set hyperparameter values or choose between ensembles. As always, the final evaluation would eventually be on the test set at the very end of an analysis.

In [None]:
# import VotingClassifier


In [None]:
# setup a SVC, DecisionTreeClassifier and KNeighborsClassifier


# construct a hard VotingClassifier using these learners


# for each learner and the ensemble as a whole, use cross_val_score with 5 folds to gauge the accuracy


This is not doing very well! It's not even performing as well as the best learner. This may seem to contradict our original motivating example of the oracles, but remember that the underlying calculation there implicitly assumed that all learners are completely *independent*. In practice this assumption is never fully met: learners trained on the same data tend to make correlated errors.
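
To make the independence assumption concrete, here is a minimal sketch of the kind of calculation that sits behind a majority-vote argument. It assumes a binary problem, an odd number of learners, and that every learner is correct with the same probability and *independently* of the others. The function name `majority_vote_accuracy` and the value 0.6 are illustrative choices, not taken from the original example.

```python
from math import comb


def majority_vote_accuracy(n_learners: int, p_correct: float) -> float:
    """Probability that a strict majority of independent learners is correct."""
    min_votes = n_learners // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_learners, k) * p_correct**k * (1 - p_correct) ** (n_learners - k)
        for k in range(min_votes, n_learners + 1)
    )


# under full independence, accuracy grows with the number of learners
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

The growth with `n_learners` relies entirely on the independence assumption. Once the learners' errors are correlated, as they are for models trained on the same data, this binomial calculation no longer applies and the gain can shrink or vanish.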


Let's try changing the voting strategy to "soft" and play with the weights a bit based on which individual learners above are performing better:

In [None]:
# setup a SVC, DecisionTreeClassifier and KNeighborsClassifier


# construct a soft VotingClassifier using these learners, and set the weights


# for each learner and the ensemble as a whole, use cross_val_score with 5 folds to gauge the accuracy
for clf in (clf1, clf2,  clf3,  eclf):
    scores = cross_val_score(clf, X_train, y_train, scoring = 'accuracy', cv = 5)
    print("%s: %f " % (clf.__class__.__name__, scores.mean()))

So this is certainly a bit better, and now our voting classifier is performing slightly better than the individual learners.
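
As a rough illustration of the difference between the two voting strategies, the sketch below builds a hard and a soft `VotingClassifier` from the same three learners. It runs on synthetic data from `make_classification` rather than the heart disease data, and the names `X_demo`, `y_demo` and the weights `[2, 1, 2]` are assumptions made purely for illustration. Note that soft voting averages predicted probabilities, so `SVC` needs `probability=True`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the training set (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

learners = [
    ("svc", SVC(probability=True, random_state=0)),  # soft voting needs predict_proba
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
]

# hard voting: each learner casts one vote for a class label
hard_vote = VotingClassifier(estimators=learners, voting="hard")

# soft voting: (weighted) average of predicted class probabilities
soft_vote = VotingClassifier(estimators=learners, voting="soft", weights=[2, 1, 2])

voting_scores = {}
for name, clf in (("hard", hard_vote), ("soft", soft_vote)):
    scores = cross_val_score(clf, X_demo, y_demo, scoring="accuracy", cv=5)
    voting_scores[name] = scores.mean()
    print("%s voting: %f" % (name, voting_scores[name]))
```

Which strategy scores higher depends on the data and on how well calibrated the learners' probabilities are, so the weights are worth treating as something to tune rather than fix by eye.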

Let's try a different voting ensemble, using log reg rather than SVM, and this time let's correctly scale the data that k-NN uses (using the `make_pipeline` function):

In [None]:
# setup a LogisticRegression, DecisionTreeClassifier and KNeighborsClassifier


# construct a hard VotingClassifier using these learners, and pipeline the k-NN with a scaler

                       

# for each learner and the ensemble as a whole, use cross_val_score with 5 folds to gauge the accuracy
for clf in (clf1, clf2, make_pipeline(StandardScaler(), clf3), eclf):
    scores = cross_val_score(clf, X_train, y_train, scoring = 'accuracy', cv = 5)
    print("%s: %f " % (clf.__class__.__name__, scores.mean()))

So we are getting some real improvement this time! Our voting classifier ensemble performs almost 2% better than the best learner.
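
The reason for wrapping k-NN in a pipeline can be sketched as follows. k-NN relies on distances, so a feature on a much larger scale dominates the neighbour search. Putting `StandardScaler` inside the pipeline also means the scaler is refit on the training part of each cross-validation fold. The data here is synthetic, and rescaling the first column by 1000 is an assumption made only to exaggerate the effect.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the training set (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)
X_demo[:, 0] *= 1000.0  # put one feature on a much larger scale

raw_knn = KNeighborsClassifier()
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

# the pipeline is cloned and refit per fold, so the scaler never sees validation rows
raw_acc = cross_val_score(raw_knn, X_demo, y_demo, scoring="accuracy", cv=5).mean()
scaled_acc = cross_val_score(scaled_knn, X_demo, y_demo, scoring="accuracy", cv=5).mean()

print("k-NN without scaling: %f" % raw_acc)
print("k-NN with scaling:    %f" % scaled_acc)

# make_pipeline names each step after its lowercased class name
print([name for name, _ in scaled_knn.steps])
```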

### Aside: Setting Hyperparameter Values

So far we have not tried to fix any of the hyperparameter values of the individual learners. There are two approaches to choosing them, which will not in general produce the same hyperparameter values:

1. You could first grid search the individual learners to find the best values before they get put into the ensemble 
2. You can grid search the ensemble as a whole - for this, you access the desired argument via a double underscore in the parameter grid as we will demonstrate below

Let us take the second approach. For example, in the code below we grid search to find the `max_depth` of the decision tree, the `C` parameter of the log reg model, and the k-NN hyperparameters `n_neighbors` and `metric`:

In [None]:
# setup a LogisticRegression, DecisionTreeClassifier and KNeighborsClassifier
clf1 = LogisticRegression(max_iter = 10000)
clf2 = DecisionTreeClassifier(random_state = 1)
clf3 = KNeighborsClassifier()

# construct a hard VotingClassifier using these learners, and pipeline the k-NN with a scaler
eclf = VotingClassifier(estimators = [('lr', clf1),
                                      ('dt', clf2),
                                      ('knn', make_pipeline(StandardScaler(), clf3))],
                        voting = 'hard')

In [None]:
# define the parameter grid, noting the double underscore notation to access elements of the ensemble


# pass the voting classifier and the param_grid to a grid search and use 5 folds


# fit the grid search to the training set


# print the best parameter values


# save the best model

In [None]:
# for each learner and the ensemble as a whole, use cross_val_score with 5 folds to gauge the accuracy
for clf in (clf1, clf2, make_pipeline(StandardScaler(), clf3), eclf):
    scores = cross_val_score(clf, X_train, y_train, scoring = 'accuracy', cv = 5)
    print("%s: %f " % (clf.__class__.__name__, scores.mean()))

In this case, there isn't any real improvement over using the default arguments of the learners in the voting classifier.
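
The double underscore notation used when grid searching the ensemble as a whole can be sketched as follows. The example uses synthetic data and deliberately small, illustrative grids. The names `X_demo`, `y_demo`, `eclf_demo` and `best_model` are assumptions. Because the k-NN learner sits inside a pipeline, its parameters need two levels: the ensemble name `knn`, then the pipeline step name `kneighborsclassifier` that `make_pipeline` generates.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the training set (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

eclf_demo = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=10000)),
        ("dt", DecisionTreeClassifier(random_state=1)),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ],
    voting="hard",
)

# <ensemble name>__<parameter>, with one extra level for the pipeline step
param_grid = {
    "lr__C": [0.1, 10.0],
    "dt__max_depth": [2, None],
    "knn__kneighborsclassifier__n_neighbors": [3, 7],
    "knn__kneighborsclassifier__metric": ["euclidean", "manhattan"],
}

grid = GridSearchCV(eclf_demo, param_grid, scoring="accuracy", cv=5)
grid.fit(X_demo, y_demo)

print(grid.best_params_)
best_model = grid.best_estimator_  # ensemble refit on all of X_demo with the best values
```

If a parameter name is misspelled, `GridSearchCV` raises an error when fitting. Calling `eclf_demo.get_params().keys()` lists every valid name.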

# Stacking Ensembles

Rather than trying to endlessly tweak weights etc., you can instead train a meta-classifier via **stacking**. Here we have to choose a classification algorithm for our final "meta-classifier", which uses the individual predictions from the various learners as inputs to another learning model.

Let's use the same 3 individual learners from before, and try some different meta-classifiers via the `final_estimator` argument - in practice one would usually use something like a log reg or SVM model for this meta-classifier:

In [None]:
# import StackingClassifier


# setup a LogisticRegression, DecisionTreeClassifier and KNeighborsClassifier


# define the learners in the ensemble


# create a stacking ensemble with a LogisticRegression meta-classifier


# fit the ensemble to the training data 


# for each learner and the ensemble as a whole, use cross_val_score with 5 folds to gauge the accuracy
for clf in (clf1, clf2,  make_pipeline(StandardScaler(), clf3), stack):
    scores = cross_val_score(clf, X_train, y_train, scoring = 'accuracy', cv = 5)
    print("%s: %f " % (clf.__class__.__name__, scores.mean()))

Now let's try it with a SVC meta-classifier:

In [None]:
# setup a LogisticRegression, DecisionTreeClassifier and KNeighborsClassifier
clf1 = LogisticRegression(max_iter = 10000)
clf2 = DecisionTreeClassifier(random_state = 123)
clf3 = KNeighborsClassifier()

# define the learners in the ensemble
estimators = [('lr', clf1),
              ('dt', clf2),
              ('knn', make_pipeline(StandardScaler(), clf3))]

# create a stacking ensemble with a SVC meta-classifier
stack = StackingClassifier(
    estimators = estimators,
    final_estimator = SVC())

# fit the ensemble to the training data 
stack.fit(X_train, y_train)

# for each learner and the ensemble as a whole, use cross_val_score with 5 folds to gauge the accuracy
for clf in (clf1, clf2,  make_pipeline(StandardScaler(), clf3), stack):
    scores = cross_val_score(clf, X_train, y_train, scoring = 'accuracy', cv = 5)
    print("%s: %f " % (clf.__class__.__name__, scores.mean()))

So both of these are giving some improvement to our performance.

In principle you can use any classifier for the meta-classifier here, and in fact you could choose between these options for the `final_estimator` by using `GridSearchCV` if you want.
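
Choosing the meta-classifier by grid search could look roughly like the sketch below. Here `final_estimator` is treated as just another parameter of the `StackingClassifier`, and the candidate estimators are passed as the grid values. The data is synthetic and the names `X_demo`, `y_demo`, `stack_demo` and `stack_grid` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the training set (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

estimators_demo = [
    ("lr", LogisticRegression(max_iter=10000)),
    ("dt", DecisionTreeClassifier(random_state=123)),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
]

stack_demo = StackingClassifier(
    estimators=estimators_demo, final_estimator=LogisticRegression()
)

# each grid value is a candidate meta-classifier
param_grid = {"final_estimator": [LogisticRegression(max_iter=10000), SVC()]}

stack_grid = GridSearchCV(stack_demo, param_grid, scoring="accuracy", cv=5)
stack_grid.fit(X_demo, y_demo)

best_meta = stack_grid.best_params_["final_estimator"]
print(type(best_meta).__name__, stack_grid.best_score_)
```

Hyperparameters of the meta-classifier itself can be searched in the same grid with the double underscore notation, for example `final_estimator__C`.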

Rather than simply collecting a few learners into an ensemble, in practice we often like to use *many* learners in an ensemble - one of the most common ways to do this is via **bagging**.

# Bagging

Recall that bagging reduces variance (typically without much increase in bias), i.e. it can help reduce overfitting. Bagging performs best with algorithms that have *high variance*, such as unpruned decision trees - you saw in the Introduction for Classification I that decision trees, as non-parametric models, can easily overfit if left unrestrained.

The main arguments to play with here are `base_estimator` (renamed `estimator` in scikit-learn 1.2), `max_samples`, and `n_estimators`:

In [None]:
# build a decision tree model with default arguments
model = DecisionTreeClassifier()

# fit the tree to the training set
model.fit(X_train, y_train)

# use cross_val_score to compute the accuracy of the model with 5 folds
cross_val_score(model, X_train, y_train, scoring = 'accuracy', cv = 5).mean()

Let's try bagging some decision trees together - let's construct an ensemble of 500 trees, each one only seeing a random 0.6 of the dataset (for extra diversity):

In [None]:
# import BaggingClassifier


In [None]:
# construct a bagging ensemble with 500 trees, each one only seeing 0.6 of the dataset


In [None]:
# use cross_val_score to compute the accuracy of the ensemble with 5 folds
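# evaluate the ensemble rather than the single tree (assumes it was stored as `bag`, as in the later cells)
cross_val_score(bag, X_train, y_train, scoring = 'accuracy', cv = 5).mean()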

So there is a significant increase compared to a single tree - about 5%!
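
For reference, a comparison of a single tree against a bagged ensemble of trees might look like the sketch below. It uses synthetic data, and only 100 trees rather than 500 to keep it quick. The names `X_demo`, `y_demo`, `single_tree` and `bagged_trees` are assumptions. The learner is passed positionally because its keyword is `base_estimator` in older scikit-learn releases and `estimator` from version 1.2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the training set (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

single_tree = DecisionTreeClassifier(random_state=123)

# each tree is trained on a bootstrap sample containing 60% of the rows
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=123),
    n_estimators=100,
    max_samples=0.6,
    n_jobs=-1,
    random_state=123,
)

tree_acc = cross_val_score(single_tree, X_demo, y_demo, scoring="accuracy", cv=5).mean()
bag_acc = cross_val_score(bagged_trees, X_demo, y_demo, scoring="accuracy", cv=5).mean()

print("single tree:  %f" % tree_acc)
print("bagged trees: %f" % bag_acc)
```

The size of the gap will differ from dataset to dataset, so the figures printed here say nothing about the heart disease results.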

As mentioned, this works best for high variance models like trees. Let's see how it performs on log reg:

In [None]:
# build a log reg model with default arguments (except set max_iter = 10000)
model = LogisticRegression(max_iter = 10000)

# fit the model to the training set
model.fit(X_train, y_train)

# use cross_val_score to compute the accuracy of the model with 5 folds
cross_val_score(model, X_train, y_train, scoring = 'accuracy', cv = 5).mean()

In [None]:
# construct a bagging ensemble with 500 models, each one only seeing 0.6 of the dataset
# (the learner is passed positionally: its keyword is base_estimator before scikit-learn 1.2, estimator after)
bag = BaggingClassifier(LogisticRegression(max_iter = 10000),
                        max_samples = 0.6, n_estimators = 500, n_jobs = -1, random_state = 123)

# use cross_val_score to compute the accuracy of the ensemble (not the single model) with 5 folds
cross_val_score(bag, X_train, y_train, scoring = 'accuracy', cv = 5).mean()

And on Naive Bayes:

In [None]:
# build a GaussianNB model with default arguments 
model = GaussianNB()

# fit the model to the training set
model.fit(X_train, y_train)

# use cross_val_score to compute the accuracy of the model with 5 folds
cross_val_score(model, X_train, y_train, scoring = 'accuracy', cv = 5).mean()

In [None]:
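# construct a bagging ensemble with 500 models, each one only seeing 0.6 of the dataset
# (the learner is passed positionally: its keyword is base_estimator before scikit-learn 1.2, estimator after)
bag = BaggingClassifier(GaussianNB(),
                        max_samples = 0.6, n_estimators = 500, n_jobs = -1, random_state = 123)

# use cross_val_score to compute the accuracy of the ensemble (not the single model) with 5 folds
cross_val_score(bag, X_train, y_train, scoring = 'accuracy', cv = 5).mean()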

So for some other classification algorithms there is not much of a difference.

# Challenge

Try to get the best accuracy score on the dataset, using any of these techniques including voting classifiers, bagging or stacking. You can use any classification algorithm - `LogisticRegression`, `DecisionTreeClassifier`, `SVC`, `KNeighborsClassifier`, `GaussianNB`, or even other ensembles like `RandomForestClassifier` as learners in your ensemble. You can try to optimise the hyperparameters or keep things simple.

Good luck!