# Model Ensembles

In this notebook you will learn how to implement some ensemble models in scikit-learn.

## The Dataset

In this notebook you will be working with the Breast Cancer Wisconsin
[dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29).

This dataset is a classic and easy binary classification dataset. Its features are computed from a digitized image of a
fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

In the original dataset you have 32 attributes:
1. An id;
1. The diagnosis (M = malignant, B = benign);
1. 30 real-valued features, namely the mean, standard error and "worst" (largest) value of the following ten measurements, computed for each cell nucleus:
    1. radius (mean of distances from center to points on the perimeter);
    1. texture (standard deviation of gray-scale values);
    1. perimeter;
    1. area;
    1. smoothness (local variation in radius lengths);
    1. compactness (perimeter^2 / area - 1.0);
    1. concavity (severity of concave portions of the contour);
    1. concave points (number of concave portions of the contour);
    1. symmetry;
    1. fractal dimension ("coastline approximation" - 1).

The dataset is made of 569 instances and has a dimensionality of 30. Of these instances, 212 belong to the malignant
class and 357 to the benign class.

To load this dataset we will use the `load_breast_cancer` function of scikit-learn.

In [6]:
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

xs = dataset.data
ys = dataset.target

print('The dimensionality of the dataset is', xs.shape[1])

The dimensionality of the dataset is 30
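
As a quick sanity check of the class counts quoted above, the labels can be tallied directly. This is a minimal sketch; it relies on the `target_names` attribute of the loaded dataset, in which label 0 stands for malignant and label 1 for benign.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# target_names maps each label index to its class name:
# index 0 is 'malignant', index 1 is 'benign'.
counts = np.bincount(dataset.target)

for name, count in zip(dataset.target_names, counts):
    print(name, count)
```

Knowing which label is which matters later on, when reading the `ys` array or interpreting per-class errors.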


Let's have a quick look at the loaded dataset.

In [7]:
xs

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [8]:
ys

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

We now perform the train-test split. We will use 80% for the training set and 20% for the test set.

In [9]:
from sklearn.model_selection import train_test_split

xs_train, xs_test, ys_train, ys_test = train_test_split(xs, ys, test_size=0.20, random_state=42)

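Since all variables are continuous, we apply min-max scaling to both the training set and the test set. The scaler is fitted on the training set only, so that no information from the test set leaks into the preprocessing.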

In [10]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(xs_train)

xs_train = scaler.transform(xs_train)
xs_test = scaler.transform(xs_test)
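
One consequence of fitting the scaler on the training set only is that the scaled training features span exactly [0, 1], while the scaled test features may fall slightly outside that range. The sketch below repeats the steps above in a self-contained form and prints both ranges; the `*_scaled` variable names are introduced here for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

xs, ys = load_breast_cancer(return_X_y=True)
xs_train, xs_test, ys_train, ys_test = train_test_split(
    xs, ys, test_size=0.20, random_state=42)

# The per-feature minimum and maximum are learned from the training set only.
scaler = MinMaxScaler().fit(xs_train)
xs_train_scaled = scaler.transform(xs_train)
xs_test_scaled = scaler.transform(xs_test)

# The training range is [0, 1] by construction; the test range need not be.
print('train range:', xs_train_scaled.min(), xs_train_scaled.max())
print('test range: ', xs_test_scaled.min(), xs_test_scaled.max())
```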

# Stacking

The key idea of stacking is to train multiple classifiers and then combine their predictions using a meta-learner or a voting
mechanism. In this case we will use an SVM classifier, a naive Bayes classifier, a decision tree classifier and
a neural network, all combined through a majority voting mechanism.

Let's start with the definition of these 4 classifiers.

In [11]:
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

svm_clf = SVC(C=0.1)
nb_clf = MultinomialNB()
dt_clf = DecisionTreeClassifier()
mlp_clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)

We then define the voting mechanism by using the `VotingClassifier` of scikit-learn.

In [12]:
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
    estimators=[('svm', svm_clf), ('nb', nb_clf), ('df', dt_clf), ('mlp', mlp_clf)],
    voting='hard')

To use this voting classifier we need to provide a list of (id, classifier) pairs and a `voting`
parameter, which can be set to `hard` or `soft`. With `hard`, the ensemble predicts the class label that receives the
majority of the individual votes. With `soft`, it predicts the class label given by the argmax of the sums of the
predicted probabilities, so every base classifier must be able to output class probabilities.
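
For comparison, here is a minimal sketch of the `soft` variant. It is not part of the original notebook: it keeps only the three base classifiers that expose `predict_proba` out of the box, since an `SVC` built as above does not provide probability estimates. The `random_state` value of the decision tree and the name `soft_voting_clf` are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

xs, ys = load_breast_cancer(return_X_y=True)
xs_train, xs_test, ys_train, ys_test = train_test_split(
    xs, ys, test_size=0.20, random_state=42)

scaler = MinMaxScaler().fit(xs_train)
xs_train = scaler.transform(xs_train)
xs_test = scaler.transform(xs_test)

# Soft voting averages the class probabilities of the base classifiers.
soft_voting_clf = VotingClassifier(
    estimators=[
        ('nb', MultinomialNB()),
        ('dt', DecisionTreeClassifier(random_state=0)),
        ('mlp', MLPClassifier(solver='lbfgs', alpha=1e-5,
                              hidden_layer_sizes=(5, 2), random_state=1)),
    ],
    voting='soft')
soft_voting_clf.fit(xs_train, ys_train)

# One row per test example, one column per class.
proba = soft_voting_clf.predict_proba(xs_test)
ys_pred = soft_voting_clf.predict(xs_test)
print(proba[:3])
print(ys_pred[:3])
```

The predicted label is simply the column with the largest averaged probability, which is why soft voting can take the confidence of each base classifier into account.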

We then fit the voting classifier.

In [13]:
voting_clf.fit(xs_train, ys_train)

VotingClassifier(estimators=[('svm', SVC(C=0.1)), ('nb', MultinomialNB()),
                             ('df', DecisionTreeClassifier()),
                             ('mlp',
                              MLPClassifier(alpha=1e-05,
                                            hidden_layer_sizes=(5, 2),
                                            random_state=1, solver='lbfgs'))])

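However, we must not forget to also fit each individual classifier independently: `VotingClassifier` fits clones of the
estimators it is given, so the original objects are still unfitted at this point.
We will do this in the following for loop.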

In [14]:
for clf in (svm_clf, nb_clf, dt_clf, mlp_clf):
    clf.fit(xs_train, ys_train)
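
The reason the separate loop is needed can be checked directly. The sketch below uses a reduced two-classifier ensemble on the raw (unscaled) features purely to keep it short; the flag `original_is_fitted` and the variable `fitted_dt` are names made up for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.exceptions import NotFittedError
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

xs, ys = load_breast_cancer(return_X_y=True)

dt_clf = DecisionTreeClassifier(random_state=0)
nb_clf = MultinomialNB()

voting_clf = VotingClassifier(
    estimators=[('dt', dt_clf), ('nb', nb_clf)], voting='hard')
voting_clf.fit(xs, ys)

# Fitting the ensemble does not fit the objects that were passed in.
try:
    dt_clf.predict(xs[:1])
    original_is_fitted = True
except NotFittedError:
    original_is_fitted = False

# The fitted copies live inside the ensemble and can be looked up by id.
fitted_dt = voting_clf.named_estimators_['dt']

print('original tree fitted:', original_is_fitted)
print('fitted copy is the same object:', fitted_dt is dt_clf)
```

An alternative to refitting the originals is therefore to evaluate the fitted copies stored in `named_estimators_`.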

We can now measure the train and test accuracy of each classifier, including the ensemble.

In [15]:
from sklearn.metrics import accuracy_score

for clf in (svm_clf, nb_clf, dt_clf, mlp_clf, voting_clf):
    print(clf.__class__.__name__)
    ys_pred = clf.predict(xs_train)
    print('\ttrain:', accuracy_score(ys_train, ys_pred))
    ys_pred = clf.predict(xs_test)
    print('\ttest:', accuracy_score(ys_test, ys_pred))

SVC
	train: 0.9494505494505494
	test: 0.956140350877193
MultinomialNB
	train: 0.8395604395604396
	test: 0.8508771929824561
DecisionTreeClassifier
	train: 1.0
	test: 0.9385964912280702
MLPClassifier
	train: 1.0
	test: 0.9122807017543859
VotingClassifier
	train: 1.0
	test: 0.9649122807017544
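
The example above combines the classifiers by voting. The other option mentioned at the start of this section, a meta-learner, is available in scikit-learn as `StackingClassifier`. The following is only a sketch under a few assumptions that are not in the original notebook: logistic regression as the meta-learner, 5-fold cross-validation to produce its training inputs, three of the four base classifiers, and a fixed `random_state` for the tree.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

xs, ys = load_breast_cancer(return_X_y=True)
xs_train, xs_test, ys_train, ys_test = train_test_split(
    xs, ys, test_size=0.20, random_state=42)

scaler = MinMaxScaler().fit(xs_train)
xs_train = scaler.transform(xs_train)
xs_test = scaler.transform(xs_test)

# The meta-learner is trained on out-of-fold outputs of the base classifiers.
stacking_clf = StackingClassifier(
    estimators=[
        ('svm', SVC(C=0.1)),
        ('nb', MultinomialNB()),
        ('dt', DecisionTreeClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5)
stacking_clf.fit(xs_train, ys_train)

train_accuracy = accuracy_score(ys_train, stacking_clf.predict(xs_train))
test_accuracy = accuracy_score(ys_test, stacking_clf.predict(xs_test))
print('\ttrain:', train_accuracy)
print('\ttest:', test_accuracy)
```

Unlike majority voting, the meta-learner can learn to weight the base classifiers differently, at the cost of extra fitting time for the internal cross-validation.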


# Bagging

A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original
dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction.

In the following example we will perform the bagging of kNN classifiers.

In [16]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

bag_clf = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=1.0)

The parameters `max_samples` and `max_features` control the fraction of examples and features considered in
each replica of the dataset.
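
To see what these two parameters do, one can inspect the indices drawn for each base estimator after fitting. The sketch below deliberately sets `max_features=0.5` (the notebook uses `1.0`) so that the feature subsampling becomes visible; `n_estimators=10` and `random_state=0` are likewise illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

xs, ys = load_breast_cancer(return_X_y=True)
xs_train, xs_test, ys_train, ys_test = train_test_split(
    xs, ys, test_size=0.20, random_state=42)

scaler = MinMaxScaler().fit(xs_train)
xs_train = scaler.transform(xs_train)

demo_bag_clf = BaggingClassifier(
    KNeighborsClassifier(), n_estimators=10,
    max_samples=0.5, max_features=0.5, random_state=0)
demo_bag_clf.fit(xs_train, ys_train)

# Indices of the examples and of the features used by the first base estimator.
n_samples_drawn = len(demo_bag_clf.estimators_samples_[0])
n_features_drawn = len(demo_bag_clf.estimators_features_[0])

print('training examples available:', xs_train.shape[0])
print('examples drawn per estimator:', n_samples_drawn)
print('features drawn per estimator:', n_features_drawn)
```

Each kNN model therefore sees roughly half of the training examples and, in this variation, half of the 30 features.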

We now fit this classifier and evaluate its accuracy on the train and test sets.

In [17]:
bag_clf.fit(xs_train, ys_train)

print(bag_clf.__class__.__name__, '(kNN)')
ys_pred = bag_clf.predict(xs_train)
print('\ttrain:', accuracy_score(ys_train, ys_pred))
ys_pred = bag_clf.predict(xs_test)
print('\ttest:', accuracy_score(ys_test, ys_pred))

BaggingClassifier (kNN)
	train: 0.978021978021978
	test: 0.956140350877193


## Random Forest

Random forest is a well-known ensemble-based model that fits a number of decision tree classifiers on various sub-samples of
the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is
controlled with the `max_samples` parameter if `bootstrap=True`; otherwise the whole dataset is used to build each tree.

In [18]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=10)

We now fit this classifier and evaluate its accuracy on the train and test sets.

In [19]:
rf_clf.fit(xs_train, ys_train)

print(rf_clf.__class__.__name__)
ys_pred = rf_clf.predict(xs_train)
print('\ttrain:', accuracy_score(ys_train, ys_pred))
ys_pred = rf_clf.predict(xs_test)
print('\ttest:', accuracy_score(ys_test, ys_pred))

RandomForestClassifier
	train: 0.9978021978021978
	test: 0.956140350877193
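
The "averaging" mentioned above can be made concrete: the class probabilities returned by the forest are the mean of the probabilities predicted by its individual trees. The sketch below checks this and also sets `max_samples=0.5`, an illustrative value not used in the notebook, to show the sub-sample size parameter in use. Scaling is skipped here because decision trees do not depend on the scale of the features.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

xs, ys = load_breast_cancer(return_X_y=True)
xs_train, xs_test, ys_train, ys_test = train_test_split(
    xs, ys, test_size=0.20, random_state=42)

# Each tree is grown on a bootstrap sample half the size of the training set.
demo_rf_clf = RandomForestClassifier(
    n_estimators=10, bootstrap=True, max_samples=0.5, random_state=0)
demo_rf_clf.fit(xs_train, ys_train)

# Average the per-tree class probabilities by hand.
tree_probas = [tree.predict_proba(xs_test) for tree in demo_rf_clf.estimators_]
manual_average = np.mean(tree_probas, axis=0)

forest_proba = demo_rf_clf.predict_proba(xs_test)
print('matches forest output:', np.allclose(manual_average, forest_proba))
```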


# Boosting (AdaBoost)

An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then
fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances
are adjusted such that subsequent classifiers focus more on difficult cases.

For this classifier we will boost decision trees of depth 1 (also known as decision stumps) for up to 200 rounds.

In [20]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200)

The parameter `n_estimators` controls the maximum number of estimators at which boosting is terminated.
Of course, in case of a perfect fit, the learning procedure is stopped earlier.

We now fit this classifier and evaluate its accuracy on the train and test sets.

In [21]:
ada_clf.fit(xs_train, ys_train)

print(ada_clf.__class__.__name__, '(DecisionStumps)')
ys_pred = ada_clf.predict(xs_train)
print('\ttrain:', accuracy_score(ys_train, ys_pred))
ys_pred = ada_clf.predict(xs_test)
print('\ttest:', accuracy_score(ys_test, ys_pred))

AdaBoostClassifier (DecisionStumps)
	train: 1.0
	test: 0.9736842105263158
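
To observe how the ensemble improves as stumps are added, the test accuracy can be tracked after each boosting round with `staged_score`. This is a minimal sketch; the `random_state` value and the rounds chosen for printing are arbitrary, and scaling is omitted because decision stumps do not depend on the scale of the features.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

xs, ys = load_breast_cancer(return_X_y=True)
xs_train, xs_test, ys_train, ys_test = train_test_split(
    xs, ys, test_size=0.20, random_state=42)

demo_ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200, random_state=0)
demo_ada_clf.fit(xs_train, ys_train)

# One accuracy value per boosting round that was actually performed.
staged_test_accuracy = list(demo_ada_clf.staged_score(xs_test, ys_test))

print('estimators actually fitted:', len(demo_ada_clf.estimators_))
for n_rounds in (1, 10, 50, 200):
    # Guard against early termination, which leaves fewer than 200 rounds.
    if n_rounds <= len(staged_test_accuracy):
        print(n_rounds, 'rounds:', staged_test_accuracy[n_rounds - 1])
```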


Note that in all the examples above we have not performed any validation.
How would you perform the validation of these classifiers?
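
One possible answer, offered only as a sketch, is k-fold cross-validation on the training set. Wrapping the scaler and the classifier in a pipeline ensures that the scaler is refitted on the training part of every fold. The choice of 5 folds, of the bagged kNN model as the example, and of `random_state=0` are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

xs, ys = load_breast_cancer(return_X_y=True)
xs_train, xs_test, ys_train, ys_test = train_test_split(
    xs, ys, test_size=0.20, random_state=42)

# Scaling is part of the model, so it is refitted inside every fold.
pipeline = make_pipeline(
    MinMaxScaler(),
    BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, random_state=0))

# The test set is left untouched; validation uses the training set only.
scores = cross_val_score(pipeline, xs_train, ys_train, cv=5, scoring='accuracy')

print('fold accuracies:', scores)
print('mean accuracy:', scores.mean())
```

The same pattern applies to any of the classifiers above, and the mean validation accuracy can then be used to compare models or tune hyperparameters before the test set is consulted once at the end.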