# Bagged Trees Classifier

Bagging (short for bootstrap aggregating) is an ensemble method that aggregates the predictions of many base learners to build a stronger supervised machine learning model. It's commonly used with decision trees, because individual trees are high-variance learners and combining many of them reduces that variance. This model, the BaggedTreesClassifier, is trained by repeating the following steps:
1. Randomly subsampling the observations in the dataset (this implementation draws without replacement)
2. Training a DecisionTreeClassifier on the subsample
3. Storing the DecisionTreeClassifier object in a list
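
Step 1 can be done with or without replacement. The sketch below contrasts the two using NumPy's `Generator.choice`. The sizes and the seed are hypothetical values picked for illustration, not taken from the model. A classic bootstrap draws with replacement, whereas the class in this notebook passes `replace=False`.

```python
import numpy as np

rng = np.random.default_rng(0)       # seed chosen arbitrarily for repeatability
n_observations, n_samples = 120, 30  # hypothetical sizes, for illustration only

# Subsampling without replacement: every drawn row index is distinct
subsample_inds = rng.choice(n_observations, size=n_samples, replace=False)

# Bootstrap sampling: indices are drawn with replacement, so repeats are allowed
bootstrap_inds = rng.choice(n_observations, size=n_samples, replace=True)

print(len(np.unique(subsample_inds)))  # always equal to n_samples
print(len(np.unique(bootstrap_inds)))  # never more than n_samples
```

Either set of indices can then be used to slice the rows of `X` and `y` before fitting a tree.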

The model predicts new observations by
1. Predicting the outcome using each tree in the list
2. Using a majority vote to determine the bagged classifier's prediction (for integer-encoded labels, this is the mode of the trees' predictions, computed with `scipy.stats.mode`)
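
To make the voting step concrete, here is a minimal sketch of a majority vote over a toy prediction matrix. It uses `np.bincount` rather than `scipy.stats.mode`, and the helper name `majority_vote` and the toy labels are assumptions made for illustration. Labels are assumed to be non-negative integers.

```python
import numpy as np

def majority_vote(all_tree_predictions):
    """Most common label per column of an (n_trees, n_observations) integer array."""
    all_tree_predictions = np.asarray(all_tree_predictions)
    n_classes = all_tree_predictions.max() + 1
    # Count the votes for each class, one column (observation) at a time;
    # the result has shape (n_classes, n_observations)
    votes = np.apply_along_axis(np.bincount, 0, all_tree_predictions, minlength=n_classes)
    # argmax breaks ties in favor of the smallest label
    return votes.argmax(axis=0)

# Three hypothetical trees voting on four observations
toy_predictions = np.array([
    [0, 1, 2, 1],
    [0, 2, 2, 1],
    [1, 1, 2, 0],
])
print(majority_vote(toy_predictions))  # [0 1 2 1]
```

Each column holds the three trees' votes for one observation, so the second observation, for example, receives votes 1, 2, 1 and is assigned label 1.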

In [7]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from scipy.stats import mode

In [12]:
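class BaggedTreesClassifier:
    """Majority-vote ensemble of decision trees, each fit on a random subsample."""

    def __init__(self, n_estimators, n_samples):
        self.n_estimators = n_estimators  # number of trees in the ensemble
        self.n_samples = n_samples        # observations drawn for each tree
        self.trees = []

    def fit(self, X, y):
        # Start from an empty ensemble so that refitting does not accumulate trees
        self.trees = []
        for _ in range(self.n_estimators):
            # Randomly subsample observation-wise (rows only), without replacement
            sample_inds = np.random.choice(X.shape[0], size=self.n_samples, replace=False)
            X_sample = X[sample_inds, :]
            y_sample = y[sample_inds]

            # Fit a decision tree on the subsampled data
            tree = DecisionTreeClassifier()
            tree.fit(X_sample, y_sample)
            self.trees.append(tree)
        return self

    def predict(self, X):
        # Rows index trees, columns index observations
        all_tree_predictions = np.array([t.predict(X) for t in self.trees])
        # Majority vote: the most common label in each column
        forest_predictions = np.squeeze(mode(all_tree_predictions, axis=0).mode)
        return forest_predictions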

In [13]:
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [14]:
clf = BaggedTreesClassifier(n_estimators=100, n_samples=30)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

In [15]:
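# scikit-learn metrics take the true labels first, then the predictions.
# Rows of the confusion matrix are true classes; columns are predicted classes.
print('Accuracy score: {}'.format(accuracy_score(y_test, y_pred)))
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred))

Accuracy score: 0.9666666666666667
Confusion matrix:
[[ 8  0  0]
 [ 0 12  1]
 [ 0  0  9]]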
