<h1><center>Ensemble Learning in Python</center></h1>

You all know that the field of machine learning keeps getting better and better with time. Predictive models form the core of machine learning. Better the accuracy better the model is and so is the solution to a particular problem. In this post, you are going to learn about something called <b><i>Ensemble learning</i></b> which is a very powerful technique to improve the performance of your machine learning model. 

In this post you will cover:
<ul>
<li>What is Ensemble learning?</li>
<li>How it improves the performance of a machine learning model?</li>
<li>Different Ensemble learning methods</li>
<li>Pitfalls of Ensembles</li>
<li>A Pythonic implementation of different Ensemble learning methods with a real test dataset</li>
<li>Further studies on Ensemble learning</li>
</ul>

So, let's get started.

<h2>What is Ensemble learning?</h2><br>
In the world of Statistics and Machine Learning, Ensemble learning techniques attempt to make the performance of the predictive models better by improving their accuracy. Ensemble Learning is a process using which multiple machine learning models (such as classifiers) are strategically constructed to solve a particular problem. <br>

Let's take a real example to build the intuition.

Suppose, you want to invest in a company XYZ. You not sure about its performance though. So, you look for advice on whether the stock price will increase more than 6% per annum or not? You decide to approach various experts having diverse domain experience:

<ul>
<li><b><u>Employee of Company XYZ:</u></b> This person knows the internal functionality of the company and have the insider information about the functionality of the firm. But he lacks a broader perspective on how are competitors innovating, how is the technology evolving and what will be the impact of this evolution on Company XYZ’s product. <b>In the past, he has been right 70% times.</b></li><br>

<li><b><u>Financial Advisor of Company XYZ:</u></b> This person has a broader perspective on how companies strategy will fair of in this competitive environment. However, he lacks a view on how the company’s internal policies are fairing off. <b>In the past, he has been right 75% times.</b></li><br>

<li><b><u>Stock Market Trader:</u></b> This person has observed the company’s stock price over past 3 years. He knows the seasonality trends and how the overall market is performing. He also has developed a strong intuition on how stocks might vary over time. <b>In the past, he has been right 70% times.</b></li><br>

<li><b><u>Employee of a competitor:</u></b> This person knows the internal functionality of the competitor firms and is aware of certain changes which are yet to be brought. He lacks a sight of company in focus and the external factors which can relate the growth of competitor with the company of subject. <b>In the past, he has been right  60% of times.</b></li><br>

<li><b><u>Market Research team in same segment:</u></b> This team analyzes the customer preference of company XYZ’s product over others and how is this changing with time. Because he deals with customer side, he is unaware of the changes company XYZ will bring because of alignment to its own goals. <b>In the past, they have been right 75% of times.</b></li><br>

<li><b><u>Social Media Expert:</u></b> This person can help us understand how has company XYZ has positioned its products in the market. And how are the sentiment of customers changing over time towards company. He is unaware of any kind of details beyond digital marketing. <b>In the past, he has been right 65% of times.</b></li>

Given the broad spectrum of access you have, you can probably combine all the information and make an informed decision.

In a scenario when all the 6 experts/teams verify that it’s a good decision(assuming all the predictions are independent of each other), you will get a combined accuracy rate of

1 - (30% . 25% . 30% . 40% . 25% . 35%) = 1 - 0.07875 = <b>99.92125%</b>

The assumption used here that all the predictions are completely independent is slightly extreme as they are expected to be correlated. However, you can see how we can be so sure by combining various predictions together.

Well, Ensemble learning is no different.
<br><br>Ensemble is the art of combining diverse set of learners (individual models) together to improvise on the stability and predictive power of the model. In the above example, the way we combine all the predictions together will be termed as Ensemble learning.
<br><br>Moreover, Ensemble based models can be incorporated in both of the two scenarios i.e. when data is of large volume and when data is too little.

Let’s now understand how do you actually get different set of machine learning models. Models can be different from each other for a variety of reasons:


<ul>
<li>There can be difference in the population of data</li>
<li>There can be different modelling technique useed.</li>
<li>There can be dfferent hypothesis.</li>
</ul>

Imagine that you are playing trivial pursuit. When you play alone, there might be some topics you are good at, and some that you know next to nothing about. If we want to maximize our trivial pursuit score, we need build a team to cover all topics. This is the basic idea of an ensemble: combining predictions from several models averages out idiosyncratic errors and yield better overall predictions.



Following picture shows an example schematics of an ensemble.

<img src = "https://www.dataquest.io/blog/content/images/2018/01/network-1.png"></img>

In the picture above, An input array <b>X</b> is fed through two preprocessing pipelines and then to a set of base learners <b>f<sup>(i)</sup></b>. The ensemble combines all base learner predictions into a final prediction array P. <a href = "http://ml-ensemble.com/">Source</a>

Now, the important question is how to combine predictions. In the trivial pursuit example, it is easy to imagine that team members might make their case and majority voting decides which to pick. Machine learning is remarkably similar in classification problems: taking the most common class label prediction is equivalent to a majority voting rule. But there are many other ways to combine predictions, and more generally you can use a model to learn how to best combine predictions.

The following diagram presents a basic Ensemble structure:

<img src = "https://www.dataquest.io/blog/content/images/2018/01/ensemble_network.png"></img>

Here, Data is fed to a set of models, and a meta learner combine model predictions. 

<h2>Model error and reducing this error with Ensembles:</h2>
<br>
The error emerging from any machine model can be broken down into three components mathematically. Following are these component:

<center><b>Bias + Vairance + Irreducible error</b></center>

Why is this important in the current context? To understand what really goes behind an ensemble model, you need to first know what causes error in the model. You will briefly get introduced to these errors.


<b>Bias error</b> is useful to quantify how much on an average are the predicted values different from the actual value. A high bias error means we have a under-performing model which keeps on missing important trends.

<b>Variance</b> on the other side quantifies how are the prediction made on same observation different from each other. A high variance model will over-fit on your training population and perform badly on any observation beyond training. Following diagram will give you more clarity (Assume that red spot is the real value and blue dots are predictions) :

<img src = "https://www.analyticsvidhya.com/wp-content/uploads/2015/07/variance_bias.png"></img><span><a href = "https://www.analyticsvidhya.com/blog/2015/08/introduction-ensemble-learning/">Source</a></span>

Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens till a particular point. As you continue to make your model more complex, you end up over-fitting your model and hence your model will start suffering from high variance.

A good model should maintain a balance between these two types of errors. This is known as the trade-off management of bias-variance errors. Ensemble learning is one way to well manage this trade-off.

Now that you are familiar with the basics of Ensemble learning let's look at different Ensemble learning techniques:

<h2>Different types of Ensemble learning methods:</h2>
<br>Although there are several types of Ensemble learning methods but the following three are the most-used ones in the industry.

<h4>Bagging based Ensemble learning:</h4>
<br>Bagging is one of the Ensemble construction techniques which is also known as <b><i>Bootstrap Aggregation</i></b>. Bootstrap establishes the foundation of Bagging technique. Bootstrap is a sampling technique in which we select “n” observations out of a population of “n” observations. But the selection is completely random i.e. each observation can be selected from the original population so that each observation is equally likely to selected in each iteration of the bootstrapping process. After the bootstrapped samples are formed, separate models are trained with the bootstrapped samples. In real experiments, the bootstrapped samples are drawn from the training set and the sub-models are tested using the testing set. The final output prediction is combined across the predictions of all of the sub-models. 

The following infographic gives a brief idea of Bagging:

<img src = "https://www.analyticsvidhya.com/wp-content/uploads/2015/07/bagging.png"></img>

<h2>Boosting based Ensemble learning:</h2>
<br>Boosting is a form of <i>sequential learning</i> technique. The algorithm works by training a model with the entire training set and subsequent models are constructed by fitting the residual error values of
the initial model. In this way, Boosting attempts to give higher weight to those observations that were poorly estimated by the
previous model. Once the sequence of the models is created the predictions made by models are weighted by their accuracy scores and the results are combined to create a final estimation. Models that are typically used in Boosting technique are XGBoost, GBM, ADABoost etc.

<h2>Voting based Ensemble learning:</h2>
<br>Voting is one of the simplest Ensemble learning techniques in which predictions from multiple models are combined. The method starts with creating two or more separate models with same dataset. Then a Voting based Ensemble model can be used to wrap the previous models and aggregate the predictions of those models. After the Voting based Ensemble model is constructed, it can be used to make prediction on new data. The predictions made by the sub-models can be assigned weights. <i>Stacked aggregation</i> is a technique which can be used to learn how to weigh these predictions in the best possible way.

The following infographic best describes Voting-based Ensembles:

<img src = "https://www.analyticsvidhya.com/wp-content/uploads/2015/07/stacking-297x300.png"></img>

Well, the time has come when you apply these concepts to strengthen your intuition and confidence. Let's do it in Python.

<h2>A case study in Python</h2>

The dataset you are going to be using for this case study is popularly known as <b><i>Wisconsin Breast Cancer dataset</i></b>. The task related to it is Classification. 

The dataset contains a total number of 10 features labelled in either <i>benign</i> or <i>malignant</i> classes. The features have 699 instances out of which 16 feature values are missing. The dataset only contains numeric values. 

The dataset can be downloaded from <a href = "archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29">here</a>.

You will implement the Ensembles using the might scikit-learn library.

Let's first import all the Python dependencies you will be needing for this case study.

In [27]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import MinMaxScaler

Let's load the dataset in a DataFrame object.

In [35]:
data = pd.read_csv('cancer.csv')
data.head()

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


 The column "Sample code number" is just an indicator and it's of no use in the modelling. So, let's drop it:


In [36]:
data.drop(['Sample code number'],axis = 1, inplace = True)

In [18]:
data.head()

Unnamed: 0,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,5,1,1,1,2,1,3,1,1,2
1,5,4,4,5,7,10,3,2,1,2
2,3,1,1,1,2,2,3,1,1,2
3,6,8,8,1,3,4,3,7,1,2
4,4,1,1,3,2,1,3,1,1,2


You can see that the column is dropped now. Let's get some statistics about the data using Panda's describe() and info() functions:

In [9]:
data.describe()

Unnamed: 0,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bland Chromatin,Normal Nucleoli,Mitoses,Class
count,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0
mean,4.41774,3.134478,3.207439,2.806867,3.216023,3.437768,2.866953,1.589413,2.689557
std,2.815741,3.051459,2.971913,2.855379,2.2143,2.438364,3.053634,1.715078,0.951273
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
25%,2.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0
50%,4.0,1.0,1.0,1.0,2.0,3.0,1.0,1.0,2.0
75%,6.0,5.0,5.0,4.0,4.0,5.0,4.0,1.0,4.0
max,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,4.0


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
Clump Thickness                699 non-null int64
Uniformity of Cell Size        699 non-null int64
Uniformity of Cell Shape       699 non-null int64
Marginal Adhesion              699 non-null int64
Single Epithelial Cell Size    699 non-null int64
Bare Nuclei                    699 non-null object
Bland Chromatin                699 non-null int64
Normal Nucleoli                699 non-null int64
Mitoses                        699 non-null int64
Class                          699 non-null int64
dtypes: int64(9), object(1)
memory usage: 54.7+ KB


As mentioned earlier, the dataset contains missing values. The column named "Bare Nuclei" contains them. Let's verify.

In [19]:
data['Bare Nuclei']

0       1
1      10
2       2
3       4
4       1
5      10
6      10
7       1
8       1
9       1
10      1
11      1
12      3
13      3
14      9
15      1
16      1
17      1
18     10
19      1
20     10
21      7
22      1
23      ?
24      1
25      7
26      1
27      1
28      1
29      1
       ..
669     5
670     8
671     1
672     1
673     1
674     1
675     1
676     1
677     1
678     1
679     1
680    10
681    10
682     1
683     1
684     1
685     1
686     1
687     1
688     1
689     1
690     1
691     5
692     1
693     1
694     2
695     1
696     3
697     4
698     5
Name: Bare Nuclei, dtype: object

You can spot some "?"s in it right? Well, these are your missing values and you will be imputing them with <i>Mean Imputation</i>. But first, you will replace those "?"s with 0's.

In [37]:
data.replace('?',0, inplace=True)

In [25]:
data['Bare Nuclei']

0       1
1      10
2       2
3       4
4       1
5      10
6      10
7       1
8       1
9       1
10      1
11      1
12      3
13      3
14      9
15      1
16      1
17      1
18     10
19      1
20     10
21      7
22      1
23      0
24      1
25      7
26      1
27      1
28      1
29      1
       ..
669     5
670     8
671     1
672     1
673     1
674     1
675     1
676     1
677     1
678     1
679     1
680    10
681    10
682     1
683     1
684     1
685     1
686     1
687     1
688     1
689     1
690     1
691     5
692     1
693     1
694     2
695     1
696     3
697     4
698     5
Name: Bare Nuclei, dtype: object

The "?"s are replaced with 0's now. Let's do the missing value treatment now.

In [38]:
# Convert the DataFrame object into NumPy array otherwise you will not be able to impute
values = data.values

# Now impute it
imputer = Imputer()
imputedData = imputer.fit_transform(values)

Now if you take a look at the dataset itself you will see that all the ranges of the features of the dataset is not the same. This may cause a problem. A small in a feature might not affect the other. To address this problem, you will normalize the ranges of the features to a uniform range, in this case 0 - 1.

In [42]:
scaler = MinMaxScaler(feature_range=(0, 1))
normalizedData = scaler.fit_transform(imputedData)

Wonderful! 

You have performed all the preprocessing that was required in order to perform your Ensembling experiments.

You will start with Bagging based Ensembling. In this case you will use a Bagged Decision Tree.

In [29]:
# Bagged Decision Trees for Classification - necessary dependencies

from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier



You have imported the depdendencies for the Bagged Decision Trees. 

In [43]:
# Segregate the features from the labels
X = normalizedData[:,0:9]
Y = normalizedData[:,9]

In [47]:
kfold = model_selection.KFold(n_splits=10, random_state=7)
cart = DecisionTreeClassifier()
num_trees = 100
model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=7)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.9571428571428573


Let's see what you did in the above cell.

First, you initialized a 10-fold cross-validation fold. After that, you instantiated a Decision Tree Classifier with 100 trees and wrapped it in a Bagging-based Ensemble. Then you evaluated your model.

You model performed pretty well. It yielded an accuracy of <b>95.71%</b>.

Brilliant! Let's implement the other ones. 

<i>(If you want a quick refresher on cross-validation then this is the <a href="https://www.youtube.com/watch?v=CRqLeHpACVI">link</a> to go for.)</i>

In [52]:
# AdaBoost Classification

from sklearn.ensemble import AdaBoostClassifier
seed = 7
num_trees = 70
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.9557142857142857


In this case, you did a AdaBoost classification (with 70 trees) which is based on Boosting type of Ensembling. The model gave you an accuracy of <b>95.57%</b> for a 10-fold cross validation.

Finally it's time for you to implement the Voting-based Ensemble technique.

In [None]:
# Voting Ensemble for Classification

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

kfold = model_selection.KFold(n_splits=10, random_state=seed)
# create the sub models
estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))
# create the ensemble model
ensemble = VotingClassifier(estimators)
results = model_selection.cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())

0.9642857142857142

You implemented a Voting based Ensemble model where you took Logistic Regression, Decision Tree and Support Vector Machine for voting purpose. The model performed the best so far with an accuracy of <b>96.42%</b> for a 10-fold cross-validation. 

Now, let's get you familiarized with some common pitfalls of Ensemble learning.

<h2>Pitfalls of Ensemble learning</h2>

In general it is not true that it will always perform better. There are several ensemble methods, each with its own advantages/weaknesses. Which one to use and then depends on the problem at hand.

For example, if you have models with high variance (they over-fit your data), then you are likely to benefit from using bagging. If you have biased models, it is better to combine use them with Boosting. There are also different strategies to form ensembles. The topic is just too wide to cover it in one answer.

But the point is: if you use the wrong ensemble method for your setting, you are not going to do better. For example, using Bagging with a biased model is not going to help.

Also, if you need to work in a probabilistic setting, ensemble methods may not work either. It is known that Boosting (in its most popular forms like AdaBoost) delivers poor probability estimates. That is, if you would like to have a model that allows you to reason about your data, not only classification, you might be better off with a graphical model.

So, in this post you got introduced to Ensemble learning technique. You covered its basics, how it improves your model's performance. You covered its three main types.

Also, you implemented these three types in Python with the help of scikit-learn and in this course of action you gained a bit knowledge about necessary preprocessing steps.

That's quite a feat! Well done! In this final section, I suggest some further undertakings on Ensembles which you might want to consider.

<h2>Take it further:</h2>

<ul>
<li>Try other Boosting based Ensemble techniques viz. Gradient Boosting, XGBoost.</li>
<li>Play with the different parameter settings that scikit-learn offers in Ensembles and then try to find why a particular setting performed well. This will make your understanding even stronger. <a href = "scikit-learn.org/stable/modules/ensemble.html">link</a></li>
<li>Try Ensemble learning on a variety of datasets to understand where you should and where you should not apply Ensemble learning. For finding datasets Kaggle, UCI Repository etc. are good places to search.</li>
</ul>

I hope you enjoyed this tutorial. Let me know your doubts (if you have any) in the comments section.