## Examples

#### Basic ensemble learning

Here we build a model using ensemble learning. We fit three different classifiers on the data, then the ensemble model outputs the majority vote of their predictions as the label for each input. This way of voting is also called *hard voting*.

In [1]:
# generate dataset: MoonDataSet
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

data = make_moons(1000, random_state=67)
X = data[0]
y = data[1]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=42)

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_cl = SVC()

voting_clf = VotingClassifier(
    estimators=[("lr", log_clf), ("rf", rnd_clf), ("svc", svm_cl)],
    voting="hard"
)

voting_clf.fit(X_train, y_train)

In [3]:
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_cl, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.87625
RandomForestClassifier 0.9925
SVC 1.0
VotingClassifier 0.99375
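
To make the majority vote concrete, here is a minimal sketch of hard voting done by hand. The helper `hard_vote` and the prediction matrix are hypothetical and only for illustration; this is not how *VotingClassifier* is implemented internally. It assumes integer class labels starting at 0 and breaks ties in favour of the smallest label.

```python
import numpy as np

def hard_vote(predictions):
    """Return the most frequent label per sample (ties go to the smallest label)."""
    predictions = np.asarray(predictions)  # shape: (n_classifiers, n_samples)
    n_classes = predictions.max() + 1
    # counts[k, j] = how many classifiers predicted class k for sample j
    counts = np.stack([(predictions == k).sum(axis=0) for k in range(n_classes)])
    return counts.argmax(axis=0)

# hypothetical predictions of three classifiers on four samples
preds = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [1, 1, 0, 1]]
majority = hard_vote(preds)
print(majority)  # [0 1 0 0]
```

Each column holds the three votes for one sample, and the ensemble label is simply the label that occurs most often in that column.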


#### Bagging and Pasting

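The following code performs bagging using decision trees as weak learners. This is NOT a random forest, because a random forest also samples a random subset of the features at each split of each tree. Setting *bootstrap* to *False* would instead sample the training instances without replacement, which is called pasting.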

In [4]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), # weak learner model: Decision Tree
    n_estimators=500, # number of weak learners 
    max_samples=100, # number of samples drawn from the training set to train each weak learner (default: 1.0, i.e. as many as the training set size)
    bootstrap=True, # create bootstrap samples (aka: bagging)
    n_jobs=2
)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.98625
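
As a companion to the bagging cell, here is a hedged sketch of pasting: the same ensemble, but with `bootstrap=False`, so each weak learner is trained on 100 instances drawn without replacement. The dataset is regenerated so that the block runs on its own, and the smaller `n_estimators=50` and the fixed `random_state` are choices made for this sketch only.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(1000, random_state=67)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=42)

paste_clf = BaggingClassifier(
    DecisionTreeClassifier(),  # weak learner model: Decision Tree
    n_estimators=50,           # fewer learners than above, to keep the sketch fast
    max_samples=100,           # each learner sees 100 of the 200 training instances
    bootstrap=False,           # sample WITHOUT replacement (aka: pasting)
    random_state=0,
)
paste_clf.fit(X_train, y_train)
paste_acc = accuracy_score(y_test, paste_clf.predict(X_test))
print(paste_acc)

# with pasting, no training instance appears twice in any learner's sample
no_duplicates = all(
    len(np.unique(idx)) == len(idx) for idx in paste_clf.estimators_samples_
)
print(no_duplicates)
```

With `bootstrap=True` the last check would typically fail, since sampling with replacement repeats some instances.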


#### Out-of-bag evaluation

To perform out-of-bag evaluation (something like a cross-validation error for bagged models), you just have to set the *oob_score* parameter of the *BaggingClassifier* to *True*.

In [5]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), # weak learner model: Decision Tree
    n_estimators=500, # number of weak learners
    bootstrap=True, # create bootstrap samples (aka: bagging)
    n_jobs=2,
    oob_score=True
)

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.99
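
Out-of-bag evaluation works because a bootstrap sample leaves some training instances unused. As a rough illustration, not taken from the notebook above, the following sketch draws one bootstrap sample of the same size as a hypothetical training set and measures the fraction of instances that were never picked. For large sets this fraction approaches 1/e, about 37%.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # hypothetical training set size

# one bootstrap sample: n indices drawn with replacement
bootstrap_idx = rng.integers(0, n, size=n)

# instances never drawn are "out of bag" for this weak learner
oob_fraction = 1 - len(np.unique(bootstrap_idx)) / n
print(round(oob_fraction, 3))
```

Each weak learner can therefore be scored on its own out-of-bag instances, which it has not seen during training.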

#### Random Forests 

In [6]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_rf))

0.9925
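
For comparison, a bagging ensemble comes close to a random forest once the decision trees themselves sample features at each split. The sketch below is an approximation under that assumption, not an exact reimplementation of *RandomForestClassifier*. It regenerates the dataset so it runs on its own, and `n_estimators=100` and `random_state=0` are choices made for this sketch.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(1000, random_state=67)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=42)

# bagged trees that consider a random subset of features at each split
bag_rf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=100,
    random_state=0,
)
forest = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, random_state=0)

bag_rf_acc = accuracy_score(y_test, bag_rf.fit(X_train, y_train).predict(X_test))
forest_acc = accuracy_score(y_test, forest.fit(X_train, y_train).predict(X_test))
print(bag_rf_acc, forest_acc)
```

The two scores should be close, although they need not match exactly because the two classes draw their random numbers differently.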


The random forest class also has the attribute *feature_importances_*, which associates an importance value with every feature. The score is the average decrease in Gini impurity brought by a feature when it is chosen to perform a split during the training of the weak learners.

In [7]:
from sklearn.datasets import load_iris

iris = load_iris()

rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])

for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)


sepal length (cm) 0.08988388392532216
sepal width (cm) 0.02437890184416353
petal length (cm) 0.4335271292052299
petal width (cm) 0.4522100850252844


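It seems like the most important features are petal length and petal width, with scores of roughly 0.43 and 0.45 in the output above. The exact values change between runs because the forest is trained without a fixed *random_state*; another run gave 0.45825310463423896 for petal length and 0.42142698193259726 for petal width.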
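
To read the importances more easily, they can be sorted. The sketch below ranks the iris features and checks that the scores sum to 1, since sklearn normalizes them. The fixed `random_state=0` and `n_estimators=100` are assumptions of this sketch, added so that repeated runs give the same ranking.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris["data"], iris["target"])

importances = forest.feature_importances_

# sort (name, score) pairs from most to least important
ranked = sorted(zip(iris["feature_names"], importances), key=lambda p: p[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")

total = importances.sum()  # importances are normalized, so this is ~1.0
print(total)
```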

#### AdaBoost

Here we implement an AdaBoost classifier. This ensemble algorithm trains the weak learners sequentially over a reweighted dataset, where each instance is weighted according to how hard the previous models found it to classify correctly. The more difficult the instance, the more weight it gets. The weight also corresponds to how much importance the next model gives to a specific training instance: the more weight, the more attention.

In [8]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), # we use decision trees as weak learners, imposing the depth equal to 1
    n_estimators=200, # number of weak learners to train
    algorithm="SAMME", # training algorithm
    learning_rate=0.5 # shrinks the weight (alpha) given to each weak learner
)

ada_clf.fit(X_train, y_train) 

y_pred_rf = ada_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_rf))



1.0
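
The reweighting idea can be sketched in a few lines. The block below shows one round of the textbook two-class update (the form SAMME reduces to with two classes) on five hypothetical instances. It is an illustration under those assumptions, and sklearn's internal bookkeeping may differ in its details.

```python
import numpy as np

# five hypothetical training instances, initially weighted equally
weights = np.full(5, 0.2)
# suppose the current weak learner got only the third instance wrong
misclassified = np.array([False, False, True, False, False])
learning_rate = 1.0

# weighted error rate of the weak learner
err = weights[misclassified].sum() / weights.sum()
# weight (alpha) of the weak learner: larger when the error is smaller
alpha = learning_rate * np.log((1 - err) / err)

# boost the weights of the misclassified instances, then renormalize
weights = weights * np.exp(alpha * misclassified)
weights = weights / weights.sum()
print(weights)
```

After the update the misclassified instance carries roughly half of the total weight (about 0.5 against 0.125 for each of the others), so the next weak learner is pushed to get it right.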


#### Gradient Boosting (with decision trees)

sklearn provides *GradientBoostingRegressor* (and *GradientBoostingClassifier*, used below). These are implementations of a boosting algorithm optimized for using decision trees as weak learners. Gradient boosting trains the weak learners sequentially, fitting each new one to the residual errors made by the previous ones.

In [9]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingClassifier(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X_train, y_train)

y_pred_rf = gbrt.predict(X_test)
print(accuracy_score(y_test, y_pred_rf))

0.9725
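
The idea of fitting each weak learner to the residual errors can be sketched by hand with plain regression trees. The toy quadratic dataset and the names `X_toy`, `y_toy` and `ensemble_predict` are hypothetical. This mimics gradient boosting with squared error and a learning rate of 1, and is not the sklearn implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# hypothetical noisy quadratic data
rng = np.random.default_rng(42)
X_toy = rng.uniform(-0.5, 0.5, size=(100, 1))
y_toy = 3 * X_toy[:, 0] ** 2 + 0.05 * rng.normal(size=100)

trees = []
residual = y_toy.copy()
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residual)                      # fit the current residual errors
    residual = residual - tree.predict(X_toy)      # what is still unexplained
    trees.append(tree)

def ensemble_predict(X):
    """Sum the predictions of all trees trained so far."""
    return sum(tree.predict(X) for tree in trees)

mse_first = np.mean((y_toy - trees[0].predict(X_toy)) ** 2)
mse_all = np.mean((y_toy - ensemble_predict(X_toy)) ** 2)
print(mse_first, mse_all)
```

On the training data the error of the three-tree ensemble is never larger than that of the first tree alone, because each new tree only removes part of the remaining residual.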


The following code stops training additional sequential models when the validation error does not improve for five iterations in a row (here the test split plays the role of the validation set). In fact, knowing a priori how many weak learners are needed for the task is pretty difficult. Setting an early stopping condition like this one lets us choose that number dynamically.

In [14]:
from sklearn.metrics import mean_squared_error

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1,120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_test)
    val_error = mean_squared_error(y_test, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break # early stopping 
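
As an alternative to the manual loop, sklearn's gradient boosting estimators have built-in early stopping through the `n_iter_no_change` and `validation_fraction` parameters. The sketch below assumes a validation split of 20% carved out of the training data, which differs from the loop above, where the test set is used for validation. The dataset is regenerated so the block runs on its own.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_moons(1000, random_state=67)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=42)

gbrt_es = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=120,         # upper bound on the number of weak learners
    validation_fraction=0.2,  # part of the training data held out for validation
    n_iter_no_change=5,       # stop after 5 iterations without improvement
    random_state=42,
)
gbrt_es.fit(X_train, y_train)

# number of weak learners actually trained before stopping
n_used = gbrt_es.n_estimators_
print(n_used)
```

This keeps the test set untouched for the final evaluation, at the cost of training on slightly less data.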


## Exercises