References

http://scikit-learn.org/stable/modules/ensemble.html <br>
https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/

The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.

<font color = 'red'> Bagging </font> and <font color = 'red'> Boosting </font> are two of the most commonly used ensemble techniques in machine learning.
<br><br>
<font color = 'red'> Bagging algorithms: </font> 
<br>
1> Bagging meta-estimator <br>
2> Random forest
<br><br>
<font color = 'red'> Boosting algorithms: </font> 
<br>
1> AdaBoost  <br>
2> Gradient Boosting Machine (GBM)  <br>
3> XGBoost (XGBM)  <br>
4> LightGBM  <br>
5> CatBoost
<br><br>
<font color = 'red' size = '3pt'> Random Forests: </font> <br>
In random forests (see <font color = 'blue'>RandomForestClassifier </font> and <font color = 'blue'>RandomForestRegressor </font>  classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
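
The two sources of randomness described above can be sketched by hand. This is only an illustration under stated assumptions, not how scikit-learn implements its forests: the helper names `fit_bootstrap_trees` and `predict_majority`, the dataset size, and the choice of 25 trees are all made up for the example, and the sketch uses a hard majority vote, whereas `RandomForestClassifier` averages the trees' predicted class probabilities.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_bootstrap_trees(X, y, n_trees=25, seed=0):
    """Fit n_trees trees, each on its own bootstrap sample (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for i in range(n_trees):
        # Bootstrap sample: draw n_samples row indices with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        # max_features="sqrt" makes each split consider only a random
        # subset of the features instead of all of them.
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_majority(trees, X):
    """Combine the trees by hard majority vote (labels must be non-negative ints)."""
    votes = np.stack([tree.predict(X) for tree in trees])  # (n_trees, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])


X, y = make_blobs(n_samples=600, n_features=10, centers=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

trees = fit_bootstrap_trees(X_train, y_train)
forest_acc = np.mean(predict_majority(trees, X_test) == y_test)
single_acc = np.mean(trees[0].predict(X_test) == y_test)
print("single bootstrap tree:", single_acc)
print("majority vote of 25 trees:", forest_acc)
```

Comparing the two printed accuracies gives a rough feel for how much the vote stabilises the prediction of any single randomized tree; on an easy dataset such as this one the gap may be small.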

In [None]:
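from sklearn.ensemble import RandomForestClassifier

# Toy training set: two samples with two features each, one sample per class
X = [[0, 0], [1, 1]]
Y = [0, 1]

# Build a forest of 10 trees; fit() returns the fitted estimator itself
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X, Y)
clf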

<font color = 'red' size = '3pt'> Comparison of Algorithms </font>

In [11]:

from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier


X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)

clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2, random_state=0)

scores = cross_val_score(clf, X, y)
print("DT:", scores.mean())

clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print("RF:", scores.mean())  

clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print("Extra Tree:", scores.mean())



DT: 0.9794087938205586
RF: 0.9996078431372549
Extra Tree: 0.99989898989899


<font color = 'red' size = '3pt'> AdaBoost: </font> <br>
The module sklearn.ensemble includes the popular boosting algorithm AdaBoost.<br>
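The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly reweighted versions of the data. After each round, the training samples that the current learner misclassified receive larger weights and the correctly classified ones receive smaller weights, so each subsequent learner concentrates on the examples that earlier learners got wrong. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction.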
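
The reweighting loop and the weighted vote can be sketched for the binary case as follows. This is a minimal illustration of classic discrete AdaBoost with labels in {-1, +1}, not the implementation behind `AdaBoostClassifier`; the function names `adaboost_fit` and `adaboost_predict`, the synthetic dataset, and the 20 boosting rounds are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=20):
    """Fit decision stumps on reweighted data; y must hold labels -1 and +1."""
    n_samples = X.shape[0]
    w = np.full(n_samples, 1.0 / n_samples)  # start with uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        # Weak learner: a depth-1 tree trained on the current weights.
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # Weighted error (w sums to 1); clipped to keep the log finite.
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # this learner's vote weight
        # Misclassified samples (y * pred == -1) gain weight, the rest lose it.
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)


def adaboost_predict(stumps, alphas, X):
    """Weighted majority vote: sign of the alpha-weighted sum of predictions."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)


X, y01 = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y01 - 1  # map labels {0, 1} to {-1, +1}

stumps, alphas = adaboost_fit(X, y)
train_acc = np.mean(adaboost_predict(stumps, alphas, X) == y)
print("training accuracy of the boosted stumps:", train_acc)
```

The accuracy printed here is measured on the training data only, so it says nothing about generalization; the cross-validated `AdaBoostClassifier` run in the next cell is the appropriate way to estimate that.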

In [12]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

iris = load_iris()
clf = AdaBoostClassifier(n_estimators=100)
scores = cross_val_score(clf, iris.data, iris.target)
scores.mean()  

0.9599673202614379