# Ensembles
The three most popular methods for combining the predictions of different models are:

* **Bagging** - Building multiple models (usually of the same model type) from different subsamples of the training data.
* **Boosting** - Building multiple models (usually of the same model type) in sequence, each of which learns to fix the prediction errors of a prior model.
* **Voting** - Building multiple models (usually of different model types) and using simple statistics (e.g. calculating the mean) to combine the predictions.

See http://scikit-learn.org/stable/modules/ensemble.html

In [1]:
import pandas
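# note: sklearn.cross_validation was removed in scikit-learn 0.20; newer releases use sklearn.model_selection
from sklearn import cross_validation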
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

In [2]:
# these examples use the Pima Indian diabetes dataset
url = "pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

In [3]:
# separate array into features (X) and label (y) parts
X = array[:,0:8]
y = array[:,8]

## Bagging Algorithms
Bagging is short for bootstrap aggregation. It draws multiple samples from the training data (with replacement) and trains a model on each sample. The final output prediction is the average of the predictions of all the sub-models.
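
The resample, train, and average steps can be sketched from scratch. This is a minimal illustration in plain Python, not scikit-learn's implementation: the one-feature threshold learner (`fit_stump`), the toy data, and all function names are assumptions made for the example.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) rows from data with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample):
    """Fit a one-threshold classifier on (x, label) pairs with labels 0/1."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        for left, right in ((0, 1), (1, 0)):
            # Count the rows this threshold/polarity pair would misclassify.
            errors = sum(
                (left if x < threshold else right) != label
                for x, label in sample
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    return best[1:]


def predict_stump(stump, x):
    threshold, left, right = stump
    return left if x < threshold else right


def bagging_fit(data, n_models, rng):
    """Train one stump per bootstrap sample."""
    return [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Average the 0/1 predictions of all sub-models and threshold at 0.5."""
    mean_vote = sum(predict_stump(m, x) for m in models) / len(models)
    return 1 if mean_vote >= 0.5 else 0


# Toy one-feature data set (hypothetical): small x -> class 0, large x -> class 1.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
models = bagging_fit(data, n_models=25, rng=random.Random(8))
print(bagging_predict(models, 0.05), bagging_predict(models, 0.95))
```

Each sub-model sees a slightly different sample, so individual thresholds vary, and averaging smooths out that variation.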

### Bagged Decision Trees
Applies the bagging method, using decision trees without pruning. This example uses CART and creates 100 trees.

In [4]:
num_folds = 10
num_instances = len(X)
seed = 8

kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
cart = DecisionTreeClassifier()
num_trees = 100

model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=seed)
results = cross_validation.cross_val_score(model, X, y, cv=kfold)
print(results.mean())

0.7538448393711552
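
The cells in this notebook use the `sklearn.cross_validation` module, which newer scikit-learn releases replaced with `sklearn.model_selection`. A sketch of the same bagging evaluation on a newer release might look like the following. The synthetic data from `make_classification` is a stand-in chosen so the example is self-contained (an assumption, not the Pima dataset), so the score it prints is not comparable to the one above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Pima dataset used above).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=8)

# n_splits replaces the old n_folds argument, and the sample count is no longer passed.
# random_state only applies when shuffle=True.
kfold_demo = KFold(n_splits=10, shuffle=True, random_state=8)

# The base learner is passed positionally because its keyword name
# differs between scikit-learn releases.
model_demo = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=8)
scores = cross_val_score(model_demo, X_demo, y_demo, cv=kfold_demo)
print(scores.mean())
```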


### Random Forest
An extension of bagged decision trees. Instead of greedily choosing the best split point among all features, only a random subset of features is considered at each split. This example creates 100 trees, with each split point chosen from a random selection of 3 features.

In [5]:
num_folds = 10
num_instances = len(X)
seed = 8

num_trees = 100
max_features = 3

kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)
results = cross_validation.cross_val_score(model, X, y, cv=kfold)
print(results.mean())

0.7616883116883117


### Extremely Randomized (Extra) Trees
Extra Trees take the randomness one step further than Random Forest. Instead of searching for the most discriminative threshold, a threshold is drawn at random for each candidate feature, and the best of these randomly generated thresholds is chosen as the splitting rule. This approach reduces the variance of the model but slightly increases the bias.
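
The random-threshold rule can be illustrated for a single split. The sketch below is plain Python under stated assumptions: Gini impurity is used as the split score, the helper names (`gini`, `random_split`) and the toy data are hypothetical, and it is not how `ExtraTreesClassifier` is implemented internally.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def random_split(X, y, k, rng):
    """Pick k random features, draw one random threshold per feature,
    and return the (score, feature, threshold) with the lowest impurity."""
    best = None
    for feature in rng.sample(range(len(X[0])), k):
        values = [row[feature] for row in X]
        lo, hi = min(values), max(values)
        if lo == hi:
            continue  # a constant feature cannot split the data
        threshold = rng.uniform(lo, hi)  # random draw, not an optimised threshold
        left = [label for row, label in zip(X, y) if row[feature] < threshold]
        right = [label for row, label in zip(X, y) if row[feature] >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, feature, threshold)
    return best


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is constant.
X_toy = [[0, 5], [1, 5], [10, 5], [11, 5]]
y_toy = [0, 0, 1, 1]
split = random_split(X_toy, y_toy, k=2, rng=random.Random(8))
print(split)
```

A Random Forest would instead scan every possible threshold of each candidate feature, which is the only difference between the two rules in this sketch.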

In [6]:
num_folds = 10
num_instances = len(X)
seed = 8

num_trees = 100
max_features = 7

kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)
results = cross_validation.cross_val_score(model, X, y, cv=kfold)
print(results.mean())

0.7655331510594668


## Boosting Algorithms
Boosting algorithms create a sequence of models, each of which attempts to correct the errors of the models before it in the sequence. The models then make predictions (which may be weighted by their demonstrated accuracy), and the results are combined to produce the final output prediction.

### AdaBoost
AdaBoost works by weighting instances in the dataset according to how difficult they are to classify, so that the algorithm pays more attention to the hard instances when constructing subsequent models. This example constructs 30 decision trees in sequence.
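
The instance reweighting can be shown for a single boosting round. This sketch follows the textbook discrete AdaBoost update with labels in {-1, +1}. It is an illustration only and may differ from the variant `AdaBoostClassifier` uses internally. The function name and the toy predictions are assumptions.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost round: return the model weight (alpha)
    and the renormalised instance weights."""
    # Weighted error rate of the current model.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / sum(weights)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0) and division by zero
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is +1 for a correct prediction and -1 for a mistake, so
    # misclassified instances are scaled up and correct ones scaled down.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Four equally weighted instances; the current model gets the last one wrong.
alpha, new_weights = adaboost_round(
    weights=[0.25, 0.25, 0.25, 0.25],
    y_true=[1, 1, -1, -1],
    y_pred=[1, 1, -1, 1],
)
print(alpha, new_weights)
```

After the update the misclassified instance carries half of the total weight, so the next model in the sequence is pushed to get it right.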

In [7]:
num_folds = 10
num_instances = len(X)
seed = 8
num_trees = 30

kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = cross_validation.cross_val_score(model, X, y, cv=kfold)
print(results.mean())

0.760457963089542


### Stochastic Gradient Boosting (Gradient Boosting Machines)
Gradient boosting produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion, as other boosting methods do, and generalizes them by allowing optimization of an arbitrary differentiable loss function.

Its advantages are strong predictive power, robustness to outliers, and the ability to handle heterogeneous features (i.e. of mixed data types). Its main disadvantage is scalability: because boosting is sequential, it can hardly be parallelized.
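
The stage-wise idea can be sketched for the simplest case, squared-error loss, where the negative gradient is just the residual. The code below is a plain-Python illustration with one-split regression stumps on a single feature. It omits the row subsampling that makes the method "stochastic", and the function names and toy data are assumptions rather than anything from `GradientBoostingClassifier`.

```python
def fit_stump(xs, targets):
    """Fit a one-split regressor that predicts the mean target on each side."""
    best = None
    # Skipping the smallest x keeps both sides of every split non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def gradient_boost(xs, ys, n_stages, learning_rate):
    """Fit stumps in sequence, each one to the residuals of the model so far."""
    base = sum(ys) / len(ys)  # stage 0: a constant prediction
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_stages):
        # For squared error, the negative gradient equals the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Toy regression data (hypothetical): a step from 1 to 5 between x=3 and x=4.
xs_toy = [1, 2, 3, 4, 5, 6]
ys_toy = [1, 1, 1, 5, 5, 5]
predict = gradient_boost(xs_toy, ys_toy, n_stages=20, learning_rate=0.5)
print(predict(1), predict(6))
```

Swapping in a different differentiable loss only changes how `residuals` is computed, which is the generalization the description refers to.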

In [8]:
num_folds = 10
num_instances = len(X)
seed = 7
num_trees = 100

kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = cross_validation.cross_val_score(model, X, y, cv=kfold)
print(results.mean())

0.7669002050580999


## Voting Ensemble
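One of the simplest ways to combine the predictions from multiple machine learning algorithms. A voting classifier "wraps" several different models and combines their predictions for new data, either by majority vote on the predicted class labels (hard voting, the default for scikit-learn's `VotingClassifier`) or by averaging the predicted probabilities (soft voting).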

More advanced methods can learn how best to weight the predictions from the sub-models; this is called stacking (stacked generalization).
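
The two simplest combination rules, a majority vote on labels and an average of predicted probabilities, can be written in a few lines. This is a plain-Python sketch for binary labels. The function names and the example predictions are made up for illustration.

```python
from collections import Counter


def hard_vote(predictions):
    """predictions: one list of class labels per model.
    Returns the most common label for each sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def soft_vote(probabilities):
    """probabilities: one list of P(class = 1) per model.
    Averages the probabilities per sample and thresholds at 0.5."""
    return [1 if sum(p) / len(p) >= 0.5 else 0 for p in zip(*probabilities)]


# Three hypothetical models, three samples each.
labels = hard_vote([[1, 0, 1],
                    [1, 1, 0],
                    [0, 0, 1]])

# Three hypothetical models, two samples each.
averaged = soft_vote([[0.9, 0.4],
                      [0.4, 0.4],
                      [0.4, 0.1]])
print(labels, averaged)
```

In the soft-voting call, two of the three models put the first sample below 0.5, so a hard vote on their thresholded outputs would pick class 0, while the one confident model pulls the average above 0.5. That difference is why the choice of rule can matter.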

In [9]:
num_folds = 10
num_instances = len(X)
seed = 8

kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)

# create the sub models
estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))

# create the ensemble model
ensemble = VotingClassifier(estimators)
results = cross_validation.cross_val_score(ensemble, X, y, cv=kfold)
print(results.mean())

0.7317156527682844


