# Ensemble classifiers using trees on the Iris dataset

In this notebook we apply several tree-based ensemble methods to the Iris dataset and plot the resulting decision surfaces. The notebook is based on material from http://scikit-learn.org/stable/modules/ensemble.html

First we load all the required libraries.

In [7]:
import numpy as np
import matplotlib.pyplot as plt

# clone constructs a new, unfitted estimator with the same parameters.
# It deep-copies the estimator's parameters without copying any
# attached data, so the result has not been fit on anything.
from sklearn import clone

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
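
To illustrate what the comment on `clone` means, here is a minimal sketch. The `max_depth=3` and `random_state=0` settings are arbitrary choices for this illustration and are not used elsewhere in the notebook.

```python
from sklearn import clone
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fit an estimator with a non-default hyperparameter (arbitrary values).
fitted = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# clone copies the hyperparameters but none of the learned state.
fresh = clone(fitted)

same_params = fresh.get_params() == fitted.get_params()
fitted_has_tree = hasattr(fitted, "tree_")  # set by fit()
fresh_has_tree = hasattr(fresh, "tree_")    # absent until fit() is called
print(same_params, fitted_has_tree, fresh_has_tree)
```

The clone carries the same hyperparameters but has to be fitted before it can predict.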

Next we define the parameters required to run the experiments. Only the random seed is set here, so that the cross-validation splits can be reproduced; the ensembles use scikit-learn's default number of estimators.

In [8]:
# Set the random seed to be able to repeat the experiment
random_seed = 42 

Then we load the dataset.

In [9]:
iris = load_iris()
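
As a quick orientation, the following sketch inspects the loaded object and shows how a pair of feature columns can be selected. The variable names `n_samples`, `n_features`, `class_counts`, and `X_pair` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# 150 samples described by 4 numeric features
n_samples, n_features = iris.data.shape
print(n_samples, n_features)

# Feature names are the labels used when reporting results per attribute pair
print(iris.feature_names)

# Three classes with the same number of samples each
class_counts = np.bincount(iris.target)
print(list(iris.target_names), class_counts)

# Selecting two feature columns, e.g. petal length and petal width
X_pair = iris.data[:, [2, 3]]
print(X_pair.shape)
```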

We define the models to be compared:
- Simple Decision Tree
- Bagging
- Random Forest
- AdaBoost

In [10]:
models = {'Decision Tree': DecisionTreeClassifier(),
          'Bagging': BaggingClassifier(DecisionTreeClassifier()),
          'Random Forest': RandomForestClassifier(),
          'Ada Boost': AdaBoostClassifier(DecisionTreeClassifier())}
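
The models above rely on default settings, so their internal randomness is not controlled by the seed. A possible variant, shown only as a sketch, passes the seed and an explicit ensemble size to each estimator. The name `seeded_models` and the value `n_estimators = 30` are assumptions made for this example, and scores obtained with it would not necessarily match the ones reported in this notebook.

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

random_seed = 42
n_estimators = 30  # hypothetical ensemble size, not taken from the notebook

# Same four models, but with the randomness and the ensemble size made explicit
seeded_models = {
    'Decision Tree': DecisionTreeClassifier(random_state=random_seed),
    'Bagging': BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=n_estimators,
                                 random_state=random_seed),
    'Random Forest': RandomForestClassifier(n_estimators=n_estimators,
                                            random_state=random_seed),
    'Ada Boost': AdaBoostClassifier(DecisionTreeClassifier(),
                                    n_estimators=n_estimators,
                                    random_state=random_seed),
}

for name, model in seeded_models.items():
    print(name, model.get_params()['random_state'])
```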

For each pair of attributes and each model, we apply 10-fold stratified cross-validation and compute the mean accuracy and its standard deviation.

In [11]:
scores = {}
y = iris.target
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_seed)
for pair in ([0, 1], [0, 2], [2, 3]):
    # Keep only the two features of the current pair
    X = iris.data[:, pair]
    for model_name, clf in models.items():
        # cross_val_score fits a clone of clf on each fold
        score = cross_val_score(clf, X, y, cv=cv)
        scores[(model_name, str(pair))] = (np.average(score), np.std(score))
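
To make explicit what `cross_val_score` does with the stratified folds, the loop below is a rough hand-written equivalent for a single model and a single attribute pair. It is a sketch rather than the library's implementation: the tree is seeded here (an assumption, unlike the models above) so that both computations are deterministic and can be compared, and `template`, `fold_scores`, and `reference` are names introduced for this example.

```python
import numpy as np
from sklearn import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, [2, 3]]  # petal length and petal width
y = iris.target

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
template = DecisionTreeClassifier(random_state=42)  # seeded for this sketch only

fold_scores = []
for train_idx, test_idx in cv.split(X, y):
    # A fresh, unfitted copy per fold, so no state leaks between folds
    clf = clone(template)
    clf.fit(X[train_idx], y[train_idx])
    # score() returns the accuracy on the held-out fold
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))
fold_scores = np.array(fold_scores)

# The library call on the same estimator and the same splitter
reference = cross_val_score(template, X, y, cv=cv)

print('%.3f +/- %.3f' % (fold_scores.mean(), fold_scores.std()))
print(np.allclose(fold_scores, reference))
```

Because the splitter and the tree both use a fixed seed, the manual loop and the library call should produce the same per-fold accuracies.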

Then we print the performance of all the models for every pair of attributes.

In [12]:
for pair in ([0, 1], [0, 2], [2, 3]):
    print('Attributes: ', iris.feature_names[pair[0]], ' & ', iris.feature_names[pair[1]])
    for model_name in models:
        # Each entry holds (mean accuracy, standard deviation)
        mean, std = scores[(model_name, str(pair))]
        print('\t%26s\t%.3f +/- %.3f' % (model_name, mean, std))
    print('\n')

Attributes:  sepal length (cm)  &  sepal width (cm)
	                   Bagging	0.713 +/- 0.085
	                 Ada Boost	0.693 +/- 0.112
	             Random Forest	0.733 +/- 0.094
	             Decision Tree	0.660 +/- 0.092


Attributes:  sepal length (cm)  &  petal length (cm)
	                   Bagging	0.940 +/- 0.047
	                 Ada Boost	0.913 +/- 0.052
	             Random Forest	0.927 +/- 0.055
	             Decision Tree	0.913 +/- 0.052


Attributes:  petal length (cm)  &  petal width (cm)
	                   Bagging	0.947 +/- 0.050
	                 Ada Boost	0.960 +/- 0.033
	             Random Forest	0.973 +/- 0.044
	             Decision Tree	0.947 +/- 0.040


