# DSA5101 - Introduction to Big Data for Industry


**Prepared by *Dr Li Xiaoli*** 

#   Ensemble methods     


##  The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability/robustness over a single estimator.

## Two families of ensemble methods are usually distinguished:

> In averaging methods, the driving principle is to build several estimators *independently* and then average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced. Examples: bagging methods and forests of randomized trees (e.g. `RandomForestClassifier`, a special case of bagging).

> By contrast, in boosting methods, base estimators are built *sequentially*, and one tries to reduce the bias of the combined estimator. Later estimators try to correct the errors left unaddressed by previous estimators. The motivation is to combine several weak models to produce a powerful ensemble.
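
The variance argument for averaging methods can be illustrated with a small simulation. This is only a sketch under an idealised assumption: each "estimator" is modelled as the true value plus independent unit-variance noise (the names `true_value`, `single` and `ensemble` are made up for illustration). Real bagged trees are correlated, so the reduction in practice is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
n_trials, n_estimators = 10_000, 25

# One noisy estimator per trial.
single = true_value + rng.normal(size=n_trials)

# Average of 25 independent noisy estimators per trial.
ensemble = true_value + rng.normal(size=(n_trials, n_estimators)).mean(axis=1)

var_single = single.var()
var_ensemble = ensemble.var()
print("variance of a single estimator:", var_single)
print("variance of the averaged estimator:", var_ensemble)
```

Under this independence assumption the variance of the average shrinks roughly by a factor of `n_estimators`.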

# 1. Random Forest (and Decision Tree)

## RandomForestClassifier (Forests of randomized trees) 

## A simple example

In [1]:
from sklearn.ensemble import RandomForestClassifier
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = RandomForestClassifier(n_estimators=10)  
clf = clf.fit(X, Y)
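
Once fitted, the forest can be queried for labels and class probabilities. The following is a minimal sketch; the query points in `new_points` are hypothetical, and `random_state=0` is added only to make the run repeatable.

```python
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, Y)

new_points = [[0.1, 0.2], [0.9, 0.8]]  # hypothetical query points

labels = clf.predict(new_points)        # one predicted label per point
probas = clf.predict_proba(new_points)  # one row per point, one column per class
print(labels)
print(probas)
```

Each row of `probas` is the average of the per-tree class probabilities, so it sums to 1.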

In [2]:
# help(RandomForestClassifier)

## Parameters

> **n_estimators** : integer, optional (default=10 in older releases; 100 since scikit-learn 0.22)
    *The number of trees in the forest.*
 
> **criterion** : string, optional (default="gini")

  > The function to *measure the quality of a split. Supported criteria are
 "gini" for the Gini impurity and "entropy" for the information gain.*
 
> **max_features** : int, float, string or None, optional (default="auto" in older releases; "sqrt" since scikit-learn 1.1).
  **The number of features to consider when looking for the best split:**
>> * If int, then consider `max_features` features at each split.
>> * If float, then `max_features` is a fraction and
  `int(max_features * n_features)` features are considered at each split.
>> * If "auto", then `max_features=sqrt(n_features)` (deprecated in scikit-learn 1.1 and removed in 1.3).
>> * If "sqrt", then `max_features=sqrt(n_features)` (same as "auto").
>> * If "log2", then `max_features=log2(n_features)`.
>> * If None, then `max_features=n_features`.
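
The rules above can be sketched as a small function. `resolve_max_features` is a hypothetical helper written for illustration, not part of scikit-learn; the `max(1, ...)` guard is an assumption added so that at least one feature is always considered.

```python
import math

def resolve_max_features(max_features, n_features):
    """Hypothetical helper: number of features examined at each split."""
    if max_features is None:
        return n_features
    if max_features in ("auto", "sqrt"):
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    if isinstance(max_features, float):
        # Interpreted as a fraction of the available features.
        return max(1, int(max_features * n_features))
    raise ValueError(f"unsupported max_features: {max_features!r}")

for setting in (None, "sqrt", "log2", 4, 0.5):
    print(setting, "->", resolve_max_features(setting, n_features=10))
```

With 10 features, "sqrt" and "log2" both resolve to 3 features per split, while `None` uses all 10.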


> **max_depth** : integer or None, optional (default=None)

> The maximum depth of the tree. If None, then nodes are expanded until
   all leaves are pure or until all leaves contain fewer than `min_samples_split` samples.
   
> **min_samples_split** : int, float, optional (default=2)

> The minimum number of samples required to split an internal node:
>>...
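
To make the two `criterion` options listed above concrete, here is a minimal sketch of both impurity measures computed from the class labels that reach a node. The function names `gini` and `entropy` are chosen for illustration; the entropy uses base-2 logarithms.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p))) + 0.0  # "+ 0.0" turns -0.0 into 0.0

pure_node = [1, 1, 1, 1]
mixed_node = [0, 0, 1, 1]
print(gini(pure_node), entropy(pure_node))    # both 0 for a pure node
print(gini(mixed_node), entropy(mixed_node))  # 0.5 and 1.0 for a 50/50 node
```

A split is considered good when the weighted impurity of the child nodes is much lower than that of the parent.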

In [3]:
# Randomly generate 10 points/examples, divided as evenly as possible among
# 3 clusters (4, 3 and 3 points). Each point/example has 2 features, i.e. ten 2-D vectors.

from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10, centers=3, n_features=2, random_state=0)

# X : array of shape [n_samples, n_features]
# The generated samples.
    
# y : array of shape [n_samples]
# The integer labels for cluster membership of each sample.

In [4]:
# help (make_blobs)

In [4]:
print(X.shape)

(10, 2)


In [5]:
print(y.shape)

(10,)


In [6]:
X
# 10 points, divided as evenly as possible among 3 clusters.

array([[ 1.12031365,  5.75806083],
       [ 1.7373078 ,  4.42546234],
       [ 2.36833522,  0.04356792],
       [ 0.87305123,  4.71438583],
       [-0.66246781,  2.17571724],
       [ 0.74285061,  1.46351659],
       [-4.07989383,  3.57150086],
       [ 3.54934659,  0.6925054 ],
       [ 2.49913075,  1.23133799],
       [ 1.9263585 ,  4.15243012]])

In [8]:
y
# 3 labels: 0, 1, 2
# We are solving a three-class classification problem

array([0, 0, 1, 0, 2, 2, 2, 1, 1, 0])
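
Since 10 samples cannot be split into 3 equal groups, it can help to count the labels. A minimal sketch using `np.bincount`:

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10, centers=3, n_features=2, random_state=0)

# counts[k] is the number of samples whose label is k.
counts = np.bincount(y)
print(counts)
```

For the labels shown above this gives 4 samples in cluster 0 and 3 in each of clusters 1 and 2.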

###       Decision Tree

In [9]:
print ("We will build Decision Tree models")
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


# make_blobs: generate training data that consists of 10,000 examples
# from 100 clusters; each example is represented as a 10-d vector.
# Very convenient!
X, y = make_blobs(n_samples=10000, n_features=10, centers=100,random_state=0)

# DecisionTreeClassifier
# max_depth: the maximum depth of the tree.
# max_depth=None: nodes are expanded until all leaves are pure
# or until all leaves contain fewer than min_samples_split samples.

# min_samples_split : int, float, optional (default=2)
# The minimum number of samples required to split an internal node (its value is >=2):
#  - If int, then consider `min_samples_split` as the minimum number.
#  - If float, then `min_samples_split` is a fraction and
#    ceil(min_samples_split * n_samples) is the minimum number of samples
#    for each split.
# With min_samples_split=2, an impure node is split as long as it holds
# at least 2 samples, so leaves can shrink down to a single sample.

# Build the DT classifier with unlimited depth and min_samples_split=2
clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2, random_state=0)
#clf = DecisionTreeClassifier(max_depth=None, min_samples_split=3,random_state=0) 
#clf = DecisionTreeClassifier(max_depth=None,random_state=0)  
scores = cross_val_score(clf, X, y) #compute the performance of DT classifier
print (scores.mean())                       

We will build Decision Tree models
0.9823000000000001


###       Random Forests

In [12]:
#Random Forest Classifier
print ("We will build Random Forests models")
# The number of trees in the forest: 10
# nodes are expanded until all leaves are pure 
clf = RandomForestClassifier(n_estimators=10, max_depth=None,min_samples_split=2, random_state=0)
RFscores = cross_val_score(clf, X, y)
print (RFscores.mean())                       

We will build Random Forests models
0.9997


In [11]:
#####################################################################
#       Random Forests  - You can change max_depth to see effect    #
#####################################################################
clf = RandomForestClassifier(n_estimators=10, max_depth=15, random_state=0)
RFscores = cross_val_score(clf, X, y) # default 5-fold cross validation
print (RFscores.mean())        

0.9997999999999999


In [14]:
# ? cross_val_score 
#Evaluate a score by cross-validation
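
To see what `cross_val_score` does for a classifier, the sketch below reproduces it by hand on the iris data: the data are split into 5 stratified folds, a fresh copy of the model is trained on 4 folds and scored on the held-out fold. `cv=5` is passed explicitly because the default number of folds differs between scikit-learn releases; the use of iris and the variable names are choices made for this illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Library call: one accuracy value per fold.
auto_scores = cross_val_score(clf, X, y, cv=5)

# Manual equivalent: train on 4 folds, score on the remaining one.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    model = clone(clf).fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

print(auto_scores)
print(manual_scores)
```

The mean of these per-fold accuracies is the single number printed in the cells above.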

In [15]:
#############################################################
#       Random Forests  - You can change number of trees    #
#############################################################
clf = RandomForestClassifier(n_estimators=50, max_depth=15, random_state=0)
RFscores = cross_val_score(clf, X, y)
print (RFscores.mean())    

1.0


### This example shows that Random Forests can perform better than a single decision tree model (0.9997 versus 0.9823 here).
### More trees typically lead to better results.
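
One way to explore the effect of the number of trees is a small sweep over `n_estimators`. This is a sketch on a smaller, made-up problem (1,000 points from 20 clusters) so that it runs quickly; the tree counts tried are arbitrary.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Smaller data set than above, chosen only to keep the run short.
X, y = make_blobs(n_samples=1000, n_features=10, centers=20, random_state=0)

results = {}
for n_trees in (1, 5, 25):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    results[n_trees] = cross_val_score(clf, X, y, cv=5).mean()
    print(n_trees, "trees:", results[n_trees])
```

The gain from adding trees usually flattens out, while training time keeps growing roughly linearly with the number of trees.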

# 2. AdaBoost

In [16]:
#########################################
#               AdaBoost                #
#########################################
# https://en.wikipedia.org/wiki/AdaBoost
print ("AdaBoost")
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

iris = load_iris()
clf = AdaBoostClassifier(n_estimators=500)  
# Build an AdaBoost classifier with up to 500 base classifiers
scores = cross_val_score(clf, iris.data, iris.target)
print(scores.mean())                           

AdaBoost
0.9466666666666665
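
Because boosting adds base classifiers one at a time, the accuracy can be tracked after each round with `staged_score`. The sketch below assumes a simple 70/30 hold-out split instead of cross-validation and uses 50 rounds to keep it short; both choices are made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Test accuracy after 1, 2, ... boosting rounds.
staged = list(clf.staged_score(X_test, y_test))
print("rounds actually fitted:", len(staged))
print("accuracy after first round:", staged[0])
print("accuracy after last round:", staged[-1])
```

Boosting may stop before `n_estimators` rounds if a perfect fit is reached, so `len(staged)` can be smaller than 50.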


# 3. Gradient Tree Boosting for Classification 

In [18]:
#####################################################
#   Gradient Tree Boosting for Classification       #
#####################################################
print ("Gradient Tree Boosting for Classification")
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

Gradient Tree Boosting for Classification


In [19]:
# Generates data for binary classification used in Hastie et al. 2009,
# Example 10.2. By default, we generate 12,000 data points (each with 10 features),
# with binary labels 1 and -1.
X, y = make_hastie_10_2(random_state=0)

# n_samples : int, optional (default=12000)

#    Returns
#    -------
#    X : array of shape [n_samples, 10]
#        The input samples.
    
#    y : array of shape [n_samples]
#        The output values.

In [20]:
# help (make_hastie_10_2)

In [21]:
X

array([[ 1.76405235,  0.40015721,  0.97873798, ..., -0.15135721,
        -0.10321885,  0.4105985 ],
       [ 0.14404357,  1.45427351,  0.76103773, ..., -0.20515826,
         0.3130677 , -0.85409574],
       [-2.55298982,  0.6536186 ,  0.8644362 , ..., -0.18718385,
         1.53277921,  1.46935877],
       ...,
       [ 0.19986465,  0.26134578, -0.1279868 , ..., -0.51718289,
         0.07969414,  1.01612661],
       [-0.15167316, -1.42519962,  1.07092211, ..., -1.20676602,
        -1.04746487,  0.0075881 ],
       [-0.09708998,  0.78044425,  0.22108152, ...,  2.53170549,
        -0.03572203,  0.17320019]])

In [22]:
y

array([ 1., -1.,  1., ..., -1., -1.,  1.])

In [23]:
X.shape
# We generate 12000 examples, each with 10 features

(12000, 10)

In [24]:
y.shape
# We generate 12000 labels

(12000,)

In [25]:
# segment data into training and testing
X_train, X_test = X[:2000], X[2000:] #Data: First 2000 are training data, while last 10000 for test data

# X_train.shape=(2000, 10):2000 examples, each with 10 features
# X_test.shape=(10000, 10): 10000 examples, each with 10 features 

y_train, y_test = y[:2000], y[2000:] #Label: First 2000 are training labels, while last 10000 for test labels

In [26]:
X_train.shape

(2000, 10)

In [27]:
X_test.shape

(10000, 10)

In [28]:
clf = GradientBoostingClassifier(n_estimators=500, learning_rate=1.0, max_depth=10, random_state=0).fit(X_train, y_train)
print (clf.score(X_test, y_test))
print("\n\n")

0.836
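
The settings above use deep trees (`max_depth=10`) together with a large learning rate. Boosting is usually paired with shallow trees, so one possible experiment is to compare smaller depths on the same split. This is a sketch only; the depths tried and the reduced `n_estimators=100` are arbitrary choices to keep the run short.

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

scores = {}
for depth in (1, 3):
    clf = GradientBoostingClassifier(
        n_estimators=100, learning_rate=1.0, max_depth=depth, random_state=0
    )
    scores[depth] = clf.fit(X_train, y_train).score(X_test, y_test)
    print("max_depth =", depth, "test accuracy =", scores[depth])
```

Comparing these numbers with the 0.836 above gives a feel for how tree depth interacts with the learning rate.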





# 4. Gradient Tree Boosting for Regression

In [29]:
#####################################################
#       Gradient Tree Boosting for Regression       #
#####################################################
print ("Gradient Tree Boosting for Regression")

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

Gradient Tree Boosting for Regression


In [31]:
X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0) # generate 1200 examples, each with 10 features
X_train, X_test = X[:200], X[200:]  # first 200 as training
y_train, y_test = y[:200], y[200:]  # remaining 1000 as testing

In [32]:
# help (make_friedman1)

In [33]:
X_train.shape

(200, 10)

In [34]:
X_test.shape

(1000, 10)

In [35]:
y_train.shape

(200,)

In [37]:
y  # y contains continuous/numeric values

array([18.40631483, 19.60677754, 14.74407804, ..., 11.46321759,
        6.49856896, 20.32981295])

In [38]:
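# est is our regression model.
# The least-squares loss is the default (named 'ls' in older releases and
# 'squared_error' since scikit-learn 1.0), so it is not passed explicitly.
est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0).fit(X_train, y_train)
# performance on the test set in terms of mean squared error
print (mean_squared_error(y_test, est.predict(X_test)))  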

5.009154859960321


In [42]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, est.predict(X_test))

1.8404283706219466

In [41]:
from sklearn.metrics import r2_score
r2_score(y_test, est.predict(X_test))

0.8058912110043196
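
The three regression metrics used above can be written out directly from their definitions. The sketch below uses four made-up target/prediction pairs (`y_true` and `y_pred` are hypothetical values, not taken from the model above) and checks the hand-written formulas against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # hypothetical targets
y_pred = np.array([2.5, 0.0, 2.0, 8.0])   # hypothetical predictions

err = y_true - y_pred
mse = np.mean(err ** 2)      # mean squared error
mae = np.mean(np.abs(err))   # mean absolute error
# R^2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, mean_squared_error(y_true, y_pred))
print(mae, mean_absolute_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))
```

MSE and MAE are in the units of the target (squared for MSE), while R² is unitless, with 1.0 meaning a perfect fit.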

In [43]:
# Classifier: which features are important?
# The feature importance scores of a fitted gradient boosting model
# can be accessed via the feature_importances_ attribute:
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,max_depth=1, random_state=0).fit(X, y)
print ("The feature importance scores for 10 features")
print (clf.feature_importances_)

The feature importance scores for 10 features
[0.10684213 0.10461707 0.11265447 0.09863589 0.09469133 0.10729306
 0.09163753 0.09718194 0.09581415 0.09063242]
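
The importance scores are normalised to sum to 1, and sorting them gives a feature ranking. The sketch below simply reuses the values printed above rather than refitting the model. The scores are nearly uniform, which is expected here because the `make_hastie_10_2` label depends on the sum of squares of all ten features.

```python
import numpy as np

# Values copied from the output above.
importances = np.array([
    0.10684213, 0.10461707, 0.11265447, 0.09863589, 0.09469133,
    0.10729306, 0.09163753, 0.09718194, 0.09581415, 0.09063242,
])

print("sum of importances:", importances.sum())

# Feature indices ordered from most to least important.
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print("feature", idx, "importance", importances[idx])
```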


# 5. VotingClassifier

In [44]:
#################################################
#               VotingClassifier                #
#################################################
print ("VotingClassifier")

###############################################
# Majority Class Labels (Majority/Hard Voting)
print ("Majority Class Labels (Majority/Hard Voting)")
from sklearn import datasets
#from sklearn.metrics import cross_validation
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression   # classifier 1 (clf1)
from sklearn.naive_bayes import GaussianNB            # classifier 3 (clf3)
from sklearn.ensemble import RandomForestClassifier   # classifier 2 (clf2)
from sklearn.ensemble import VotingClassifier

VotingClassifier
Majority Class Labels (Majority/Hard Voting)


In [46]:
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

# Build 3 individual classifiers 
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

# Voting-ensemble classifier (4th classifier)
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')

#Loop for 4 classifiers with their corresponding labels
for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.3f (+/- %0.4f) [%s]" % (scores.mean(), scores.std(), label))

Accuracy: 0.953 (+/- 0.0400) [Logistic Regression]
Accuracy: 0.940 (+/- 0.0389) [Random Forest]
Accuracy: 0.913 (+/- 0.0400) [naive Bayes]
Accuracy: 0.953 (+/- 0.0400) [Ensemble]
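
Hard voting itself is simple: for each sample, the label predicted by most classifiers wins. The sketch below applies that rule to a made-up prediction matrix (`preds` is hypothetical, with one row per classifier and one column per sample); ties, if any, go to the smallest label because `argmax` returns the first maximum.

```python
import numpy as np

# Hypothetical predictions: 3 classifiers (rows) x 4 samples (columns).
preds = np.array([
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [1, 1, 2, 0],
])
n_classes = 3

# For each sample (column), count the votes per class and keep the winner.
majority = np.array([
    np.bincount(votes, minlength=n_classes).argmax() for votes in preds.T
])
print(majority)  # [0 1 2 1]
```

This is why the ensemble can correct a classifier that is wrong on a sample, as long as the other two agree on the right label.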


In [None]:
# Try changing the format specifiers above (e.g. %0.3f to %0.2f) and see the effect