Ensemble methods are often used near the end of a project,
once you have already built a few good predictors, to combine them into an even better predictor.

In this chapter we will discuss the most popular Ensemble methods, including bagging, boosting,
stacking, and a few others.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0)

#Suppressing warnings
import warnings
warnings.filterwarnings('ignore')

In [53]:
from sklearn.datasets import make_blobs, make_classification

#dataset = make_blobs(n_samples=10000, centers=5, n_features=8, random_state=2, cluster_std=0.7)
dataset = make_classification(n_samples=10000, n_features=5, n_informative=5, n_redundant=0, n_repeated=0, 
                           n_classes=5, n_clusters_per_class=1, flip_y=0, class_sep=1.0, 
                           shuffle=True, random_state=2)

X = pd.DataFrame(dataset[0])
y = pd.DataFrame(dataset[1])

In [66]:
# dataset = pd.read_csv('WineQuality.csv')

# dataset.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [67]:
# dataset.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     3
total sulfur dioxide    0
density                 0
pH                      0
sulphates               1
alcohol                 0
quality                 0
dtype: int64

In [68]:
# from sklearn.impute import SimpleImputer
# imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# features = dataset.columns
# dataset = imputer.fit_transform(dataset)

# #Converting back to a DataFrame, as the imputer outputs a NumPy array
# dataset = pd.DataFrame(dataset, columns=features)

In [69]:
# dataset.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [63]:
#X = dataset.iloc[:,:-1]
X.head()

Unnamed: 0,0,1,2,3,4
0,3.735157,1.425879,0.519925,0.072837,-4.537123
1,1.248768,1.305598,1.269169,0.998288,-2.718522
2,1.758948,-1.944062,3.385539,1.1821,1.377778
3,-3.085835,-0.080441,-0.542692,-0.410335,-0.119133
4,-0.02897,-1.470322,0.060245,3.345129,0.208522


In [75]:
#y = dataset.iloc[:,-1:]

print(y.iloc[:,0].unique())
y.head()

[4 2 1 3 0]


Unnamed: 0,0
0,4
1,4
2,2
3,1
4,3


In [88]:
#Splitting into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [89]:
from sklearn.tree import DecisionTreeClassifier

decisiontree_clf = DecisionTreeClassifier()
decisiontree_clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [90]:
decisiontree_pred = decisiontree_clf.predict(X_test)

In [91]:
from sklearn.metrics import confusion_matrix
decisiontree_cm = confusion_matrix(y_test, decisiontree_pred)

decisiontree_cm

array([[331,   9,  34,  19,  11],
       [  5, 385,   1,  24,   0],
       [ 19,   4, 334,  19,   6],
       [ 13,  44,  13, 330,  32],
       [  8,   3,  14,  14, 328]], dtype=int64)

In [95]:
decisiontree_clf.score(X_test, y_test)

0.854

In [96]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test,decisiontree_pred)

0.854

## Bagging

In ensemble algorithms, bagging methods form a class of algorithms which build several instances of a black-box estimator on random subsets of the original training set and then aggregate their individual predictions to form a final prediction. These methods are used as a way to reduce the variance of a base estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it. 

In scikit-learn, bagging methods are offered as a unified BaggingClassifier meta-estimator (resp. BaggingRegressor), taking as input a user-specified base estimator along with parameters specifying the strategy to draw random subsets. In particular, max_samples and max_features control the size of the subsets (in terms of samples and features), while bootstrap and bootstrap_features control whether samples and features are drawn with or without replacement. When using a subset of the available samples the generalization accuracy can be estimated with the out-of-bag samples by setting oob_score=True.
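To make the mechanics described above concrete, here is a minimal hand-rolled sketch of bagging: draw bootstrap samples, fit one tree per sample, and aggregate by majority vote. The small synthetic dataset and the `_demo`/`Xd_`/`yd_` names are assumptions made for this illustration, and the plain majority vote is a simplification, not a description of how BaggingClassifier aggregates internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic 3-class problem (illustrative data, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=5, n_informative=5,
                                     n_redundant=0, n_classes=3,
                                     n_clusters_per_class=1, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.default_rng(0)
n_estimators = 25
all_preds = []
for _ in range(n_estimators):
    # Bootstrap sample: row indices drawn with replacement
    idx = rng.integers(0, len(Xd_train), size=len(Xd_train))
    tree = DecisionTreeClassifier(random_state=0).fit(Xd_train[idx], yd_train[idx])
    all_preds.append(tree.predict(Xd_test))

all_preds = np.array(all_preds)  # shape: (n_estimators, n_test_samples)

# Aggregate: majority vote across the trees for every test sample
bagged_pred = np.array([np.bincount(col, minlength=3).argmax() for col in all_preds.T])
bagged_acc = (bagged_pred == yd_test).mean()

# Reference point: a single tree trained on the full training set
single_tree = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)
single_acc = single_tree.score(Xd_test, yd_test)

print("single tree accuracy :", single_acc)
print("bagged trees accuracy:", bagged_acc)
```

Comparing the two printed accuracies gives a feel for how much averaging over bootstrap replicas reduces the variance of a single tree on this data.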

In [98]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3)

In [99]:
bag = BaggingClassifier(
    knn_clf, 
    max_samples=.5, 
    max_features=4, 
    n_jobs=2,
    oob_score=True)

In [100]:
bag.fit(X_train, y_train)

BaggingClassifier(base_estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform'),
         bootstrap=True, bootstrap_features=False, max_features=4,
         max_samples=0.5, n_estimators=10, n_jobs=2, oob_score=True,
         random_state=None, verbose=0, warm_start=False)

In [108]:
bag.oob_score_

0.8835

In [109]:
y_pred = bag.predict(X_test)

accuracy_score(y_test, y_pred)

0.891

## Random Forest: A special type of Bagging Classifier

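Random forests are a special case of bagging: an ensemble of decision trees, each trained on a bootstrap sample of the training set, with extra randomness added by considering only a random subset of the features at each split. They are used so often that scikit-learn gives them their own estimator, RandomForestClassifier (resp. RandomForestRegressor), which behaves like any other supervised estimator with the bagging built in.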
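The relationship to bagging can be sketched by putting a RandomForestClassifier next to a BaggingClassifier wrapped around decision trees that sample features at each split. This is an approximate comparison under assumed settings (synthetic data, 100 trees, `max_features="sqrt"`); the `_demo` names are hypothetical, and the two models are similar in spirit but not guaranteed to be identical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data (not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=5, n_informative=5,
                                     n_redundant=0, n_classes=3,
                                     n_clusters_per_class=1, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

# A random forest: bagged trees plus a random feature subset at every split
rf_demo = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf_demo.fit(Xd_train, yd_train)

# Roughly the same idea spelled out with the generic bagging meta-estimator
bag_demo = BaggingClassifier(DecisionTreeClassifier(max_features="sqrt"),
                             n_estimators=100, random_state=0)
bag_demo.fit(Xd_train, yd_train)

rf_acc = rf_demo.score(Xd_test, yd_test)
bag_acc = bag_demo.score(Xd_test, yd_test)
print("random forest accuracy:", rf_acc)
print("bagged trees accuracy :", bag_acc)

# One of the extras a forest offers: impurity-based feature importances
print("feature importances:", rf_demo.feature_importances_)
```

The dedicated class is more convenient than the generic wrapper and exposes extras such as `feature_importances_`.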

In [110]:
from sklearn.ensemble import RandomForestClassifier

randomforest_clf = RandomForestClassifier()
randomforest_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [111]:
randomforest_pred = randomforest_clf.predict(X_test)

In [112]:
randomforest_clf.score(X_test, y_test)

0.8925

In [113]:
randomforest_cm = confusion_matrix(y_test, randomforest_pred)

randomforest_cm

array([[353,   4,  19,  23,   5],
       [  8, 396,   0,  11,   0],
       [ 20,   5, 349,   7,   1],
       [ 15,  43,   9, 355,  10],
       [  6,   2,  14,  13, 332]], dtype=int64)

## Boosting

### 1. AdaBoost

The module sklearn.ensemble includes the popular boosting algorithm AdaBoost, introduced in 1995 by Freund and Schapire.

The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction. The data modifications at each so-called boosting iteration consist of applying weights w_1, w_2, ..., w_N to each of the training samples. Initially, those weights are all set to w_i = 1/N, so that the first step simply trains a weak learner on the original data. For each successive iteration, the sample weights are individually modified and the learning algorithm is reapplied to the reweighted data. At a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly. As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence.

- Boosting in general is about building a model from the training data, then creating a second model that attempts to correct the errors of the first model. Models are added until the training set is predicted perfectly or a maximum number of models is reached.
- AdaBoost was one of the first successful boosting algorithms.
- AdaBoost can be used for both classification and regression.

#### Algorithm
- The core concept of AdaBoost is to fit weak learners (such as shallow decision trees) sequentially on repeatedly reweighted data.
- Initially, each training sample is assigned an equal weight.
- A base estimator is fitted to the weighted data.
- The weights of misclassified samples are increased and the weights of correctly classified samples are decreased.
- The above two steps are repeated until all samples are correctly classified or the configured maximum number of iterations is reached.
- Making predictions: the predictions from all the estimators are combined through a weighted majority vote (or sum) to produce the final prediction.
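The steps listed above can be sketched from scratch in a few lines. The following is a simplified version of the discrete multiclass (SAMME-style) weight update, written for illustration only: the synthetic data, the number of rounds, and all variable names are assumptions, and it is not meant to reproduce scikit-learn's AdaBoostClassifier exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic 3-class data (not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, n_informative=5,
                                     n_redundant=0, n_classes=3,
                                     n_clusters_per_class=1, random_state=0)
n_classes = 3
n_rounds = 30

# Step 1: every training sample starts with the same weight
w = np.full(len(X_demo), 1.0 / len(X_demo))
stumps, alphas = [], []

for _ in range(n_rounds):
    # Step 2: fit a weak learner (a decision stump) on the weighted data
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_demo, sample_weight=w)
    miss = stump.predict(X_demo) != y_demo

    # Weighted error rate of this weak learner
    err = np.sum(w[miss]) / np.sum(w)
    if err <= 0 or err >= 1 - 1.0 / n_classes:
        break  # perfect fit, or no better than random guessing

    # The learner's say in the final vote (larger when err is smaller)
    alpha = np.log((1 - err) / err) + np.log(n_classes - 1)

    # Step 3: boost the weights of misclassified samples, then renormalize,
    # which lowers the relative weight of correctly classified samples
    w = w * np.exp(alpha * miss)
    w = w / w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Prediction: weighted majority vote over all weak learners
votes = np.zeros((len(X_demo), n_classes))
rows = np.arange(len(X_demo))
for stump, alpha in zip(stumps, alphas):
    votes[rows, stump.predict(X_demo)] += alpha
boosted_pred = votes.argmax(axis=1)
train_acc = (boosted_pred == y_demo).mean()

print("weak learners kept:", len(stumps))
print("training accuracy :", train_acc)
```

Each stump on its own is a weak classifier; the weighted vote is what turns the sequence into a stronger one.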

### From Book
In Adaboost, the first base classifier (such as a Decision Tree) is trained
and used to make predictions on the training set. The relative weight of misclassified training instances is
then increased. A second classifier is trained using the updated weights and again it makes predictions on
the training set, weights are updated, and so on.

Disadvantage:
There is one important drawback to this sequential learning technique: it cannot be parallelized (or only partially), since each
predictor can only be trained after the previous predictor has been trained and evaluated. As a result, it does not scale as well as
bagging or pasting.

### Add pic here
Figure 7-8 shows the decision boundaries of five consecutive predictors on the moons dataset (in this
example, each predictor is a highly regularized SVM classifier with an RBF kernel). The first classifier
gets many instances wrong, so their weights get boosted. The second classifier therefore does a better job
on these instances, and so on. The plot on the right represents the same sequence of predictors except that
the learning rate is halved (i.e., the misclassified instance weights are boosted half as much at every
iteration). As you can see, this sequential learning technique has some similarities with Gradient Descent,
except that instead of tweaking a single predictor’s parameters to minimize a cost function, AdaBoost
adds predictors to the ensemble, gradually making it better.

Scikit-Learn actually uses a multiclass version of AdaBoost called SAMME (which stands for
Stagewise Additive Modeling using a Multiclass Exponential loss function). When there are just two
classes, SAMME is equivalent to AdaBoost. Moreover, if the predictors can estimate class probabilities
(i.e., if they have a predict_proba() method), Scikit-Learn can use a variant of SAMME called
SAMME.R (the R stands for “Real”), which relies on class probabilities rather than predictions and
generally performs better.

The following code trains an AdaBoost classifier based on 500 shallow Decision Trees (max_depth=5) using Scikit-Learn’s
AdaBoostClassifier class (as you might expect, there is also an AdaBoostRegressor class). The classic
AdaBoost base estimator is a Decision Stump, which is a Decision Tree with max_depth=1: in other words, a tree
composed of a single decision node plus two leaf nodes. That is also the default base estimator when none is passed.

In [114]:
from sklearn.ensemble import AdaBoostClassifier

adaboost = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5), n_estimators=500)

In [115]:
adaboost.fit(X_train,y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
          learning_rate=1.0, n_estimators=500, random_state=None)

In [116]:
y_pred = adaboost.predict(X_test)

accuracy_score(y_test, y_pred)

0.8575

### 2. Gradient Tree Boosting

Gradient Boosting is one of the most popular Boosting algorithms. Like AdaBoost, Gradient Boosting
works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However,
instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the
new predictor to the residual errors made by the previous predictor.

Let’s go through a simple regression example using Decision Trees as the base predictors (of course
Gradient Boosting also works great with regression tasks). This is called Gradient Tree Boosting, or
Gradient Boosted Regression Trees (GBRT). First, let’s fit a DecisionTreeRegressor to the training set
(for example, a noisy quadratic training set):

In [146]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [147]:
#Flatten the one-column label DataFrame so that the residual arithmetic is element-wise.
#Note: the targets here are class labels treated as numbers, so this only illustrates the mechanics.
y_train_1d = y_train.values.ravel()

#Now train a second DecisionTreeRegressor on the residual errors made by the first predictor:

y2 = y_train_1d - tree_reg1.predict(X_train)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X_train, y2)


#Then we train a third regressor on the residual errors made by the second predictor:

y3 = y2 - tree_reg2.predict(X_train)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X_train, y3)


#Now we have an ensemble containing three trees. It can make predictions on new instances simply by adding up the predictions of all the trees.
#Here the held-out X_test plays the role of the new instances:
y_pred = sum(tree.predict(X_test) for tree in (tree_reg1, tree_reg2, tree_reg3))

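Thankfully, the sklearn.ensemble module provides gradient boosted trees for both classification (GradientBoostingClassifier) and regression (GradientBoostingRegressor).

The following code builds a comparable three-stage ensemble (max_depth=2, n_estimators=3). Because the labels here are classes, it uses GradientBoostingClassifier, so it is the classification counterpart of the manual regression ensemble above rather than an identical model: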

In [149]:
from sklearn.ensemble import GradientBoostingClassifier

gbrt = GradientBoostingClassifier(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=1.0, loss='deviance', max_depth=2,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=3,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)

The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value, such
as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually
generalize better.
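The trade-off between learning_rate and the number of trees can be explored with a small experiment like the one below. This is only a sketch on assumed synthetic data with hypothetical variable names; the specific learning rates and tree counts are arbitrary choices for illustration, and the outcome will vary from dataset to dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic data (not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=5, n_informative=5,
                                     n_redundant=0, n_classes=3,
                                     n_clusters_per_class=1, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

results = {}
# (learning_rate, n_estimators) pairs: a high rate with few trees,
# a low rate with the same few trees, and a low rate with many trees
for lr, n_trees in [(1.0, 20), (0.1, 20), (0.1, 200)]:
    model = GradientBoostingClassifier(max_depth=2, n_estimators=n_trees,
                                       learning_rate=lr, random_state=0)
    model.fit(Xd_train, yd_train)
    results[(lr, n_trees)] = (model.score(Xd_train, yd_train),
                              model.score(Xd_test, yd_test))

for (lr, n_trees), (train_acc, test_acc) in results.items():
    print(f"learning_rate={lr}, n_estimators={n_trees}: "
          f"train={train_acc:.3f}, test={test_acc:.3f}")
```

Reading the train and test columns side by side shows whether the low-rate model needs the extra trees to catch up on the training set, and how each setting generalizes.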

In order to find the optimal number of trees, you can use early stopping (see Chapter 4). A simple way to
implement this is to use the staged_predict() method: it returns an iterator over the predictions made
by the ensemble at each stage of training (with one tree, two trees, etc.). The following code trains a
GBRT ensemble with 120 trees, then measures the validation error at each stage of training to find the
optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees:

In [150]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y)
#Hold out part of the training data as a validation set for early stopping
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train)

gbrt = GradientBoostingClassifier(max_depth=20, n_estimators=120, learning_rate=0.1)
gbrt.fit(X_train, y_train)

#Validation error (misclassification rate) after each boosting stage
errors = [1 - accuracy_score(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]

#argmin returns a 0-based stage index, so add 1 to get the number of trees
bst_n_estimators = int(np.argmin(errors)) + 1
print('Best number of estimators is : {}'.format(bst_n_estimators))

gbrt_best = GradientBoostingClassifier(max_depth=20,n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

y_pred = gbrt_best.predict(X_test)

acc_score = accuracy_score(y_test, y_pred)
print('Accuracy for best fit model is {}'.format(acc_score))

print(y_pred)

Best number of estimators is : 51
Accuracy for best fit model is 0.8804
[3 1 1 ... 4 4 2]
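As an alternative to looping over staged_predict() by hand, GradientBoostingClassifier can stop early on its own through the n_iter_no_change, validation_fraction and tol parameters that appear in the estimator output earlier in this section. The sketch below assumes synthetic data and arbitrary settings (a budget of 500 trees, patience of 5 iterations); the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic data (not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=5, n_informative=5,
                                     n_redundant=0, n_classes=3,
                                     n_clusters_per_class=1, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

# Allow up to 500 boosting stages, but stop once the score on an internal
# 10% validation split has not improved by at least tol for 5 consecutive stages
gbrt_es = GradientBoostingClassifier(max_depth=2, n_estimators=500, learning_rate=0.1,
                                     n_iter_no_change=5, validation_fraction=0.1,
                                     tol=1e-4, random_state=0)
gbrt_es.fit(Xd_train, yd_train)

es_acc = gbrt_es.score(Xd_test, yd_test)
print("stages actually fitted:", gbrt_es.n_estimators_)
print("test accuracy         :", es_acc)
```

The fitted n_estimators_ attribute reports how many stages were kept, so no second model needs to be trained.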


## Voting Classifier

The idea behind the voting classifier implementation is to combine conceptually different machine learning classifiers and use a majority vote (hard vote) or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well-performing models, in order to balance out their individual weaknesses.

Types of Voting Classifier

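- Hard Voting Classifier (voting='hard'): each estimator casts one vote for its predicted class label, and the class with the most votes wins.
- Soft Voting Classifier (voting='soft'): the predicted class probabilities of all estimators are averaged, and the class with the highest average probability wins. Every estimator must support predict_proba.

In both modes all estimators count equally by default; the optional weights parameter gives some estimators more say than others.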
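The difference between the two voting modes is easiest to see on a tiny worked example. The probabilities below are made-up numbers for three hypothetical classifiers scoring a single sample; nothing here comes from the models trained in this notebook.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE sample over classes [0, 1]
# (rows = classifiers A, B, C; columns = classes)
probas = np.array([
    [0.55, 0.45],  # A leans weakly towards class 0
    [0.60, 0.40],  # B leans weakly towards class 0
    [0.10, 0.90],  # C is very confident in class 1
])

# Hard voting: each classifier votes for its most likely class, majority wins
hard_votes = probas.argmax(axis=1)
hard_pred = np.bincount(hard_votes, minlength=2).argmax()

# Soft voting: average the probabilities first, then pick the best class
mean_proba = probas.mean(axis=0)
soft_pred = mean_proba.argmax()

# Weighted soft voting: classifiers A and B count three times as much as C
weights = np.array([3, 3, 1])
weighted_proba = np.average(probas, axis=0, weights=weights)
weighted_pred = weighted_proba.argmax()

print("hard vote          ->", hard_pred)      # class 0 (two votes against one)
print("soft vote          ->", soft_pred)      # class 1 (C's confidence dominates)
print("weighted soft vote ->", weighted_pred)  # class 0 (A and B outweigh C)
```

Soft voting lets a confident classifier overrule two hesitant ones, which hard voting cannot do. In scikit-learn, using voting='soft' with SVC requires constructing it with probability=True so that predict_proba is available.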

In [140]:
from sklearn.ensemble import VotingClassifier

In [141]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

lr = LogisticRegression()
rf = RandomForestClassifier()
gnb = GaussianNB()
svm = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', lr), 
                ('rf', rf), 
                ('gnb', gnb),
               ('svm', svm)], 
    voting='hard')

In [142]:
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)), ('rf', RandomFo...f', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=None, voting='hard', weights=None)

In [144]:
#Let’s look at each classifier’s accuracy on the test set:

from sklearn.metrics import accuracy_score

for clf in (lr, rf, gnb, svm, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.7584
RandomForestClassifier 0.9028
GaussianNB 0.7848
SVC 0.9316
VotingClassifier 0.8756


In [165]:
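#With these scores the hard-voting classifier (0.8756) beats LogisticRegression and GaussianNB but not RandomForestClassifier (0.9028) or SVC (0.9316). Voting tends to help most when the individual classifiers are comparably accurate and make different errors.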