# Preventing Overfitting using Out-of-Bag Samples
In this notebook, we show that we are correctly using out-of-bag samples with sklearn's random forest implementation. We also show that this method was not effective at correctly estimating conditional entropy.

We first use an extremely simple dataset to show that we are correctly using out-of-bag samples for the decision tree classifier.

In [1]:
X = [0, 1, 2, 3, 4, 5]
y = [0, 0, 0, 1, 1, 1]

For our estimator, we use sklearn's `BaggingClassifier` with decision trees as the base estimator.

In [35]:
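# Private helper from an older scikit-learn; since 0.22 it lives in
# sklearn.ensemble._forest and takes an extra n_samples_bootstrap argument.
from sklearn.ensemble.forest import _generate_unsampled_indices
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# sklearn expects a 2-D feature array of shape (n_samples, n_features).
X = np.array(X).reshape(-1, 1)
# Two bagged trees, each trained on half of the samples, drawn with replacement.
model = BaggingClassifier(DecisionTreeClassifier(),
                          n_estimators=2,
                          max_samples=0.5,
                          bootstrap=True)
model.fit(X, y)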
X

array([[0],
       [1],
       [2],
       [3],
       [4],
       [5]])

Let's select the first classifier and check that we can correctly access its out-of-bag samples. To do so, we use the function `_generate_unsampled_indices`, which returns the unused indices in X. There's no good way of verifying that they are indeed unused, since the classifier does not store the in-bag samples. However, we can run two quick sanity checks: `max_samples = 1.0` (a float, so all samples are used) and `max_samples = 1` (an int, so only one sample is used).

Let's first access a single classifier:

In [12]:
classifier = model[0]
unsampled_indices = _generate_unsampled_indices(classifier.random_state, len(X))
print(unsampled_indices)

[1 3 5]
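As a possible cross-check, the bagging ensemble (rather than the individual tree) exposes the in-bag indices of each base estimator through its `estimators_samples_` attribute, so the out-of-bag indices can also be derived as a set difference. The sketch below is a minimal illustration and not the approach used in this notebook. It assumes a scikit-learn version in which `estimators_samples_` returns index arrays, and the fixed `random_state=0` is an addition made only for reproducibility.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Same toy data as above.
X = np.arange(6).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1])

# random_state=0 is an assumption added for reproducibility.
model = BaggingClassifier(DecisionTreeClassifier(),
                          n_estimators=2,
                          max_samples=0.5,
                          bootstrap=True,
                          random_state=0)
model.fit(X, y)

all_indices = np.arange(X.shape[0])
oob_per_estimator = []
for in_bag in model.estimators_samples_:
    # Out-of-bag = every index that was never drawn for this estimator.
    oob = np.setdiff1d(all_indices, in_bag)
    oob_per_estimator.append(oob)
    print("in-bag:", np.sort(in_bag), "out-of-bag:", oob)
```

Because sampling is with replacement, an in-bag array may contain repeated indices, so the number of out-of-bag indices can exceed three here.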


In [102]:
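# With 6 samples, max_samples=0.2 gives int(0.2 * 6) = 1 training sample per tree,
# drawn without replacement.
model = BaggingClassifier(DecisionTreeClassifier(),
                          n_estimators=2,
                          max_samples=0.2,
                          bootstrap=False)
model.fit(X, y)
tree = model[0]
unsampled_indices = _generate_unsampled_indices(tree.random_state, X.shape[0])
print(unsampled_indices)
# Number of training samples that reach each node of the fitted tree.
node_counts = tree.tree_.n_node_samples
print(node_counts)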

[2 5]
[1]
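The two boundary checks mentioned earlier (`max_samples = 1.0` and `max_samples = 1`) can be sketched as follows. This is an illustrative sketch under assumptions: `oob_indices` is a hypothetical helper introduced here, it relies on the ensemble's `estimators_samples_` attribute rather than `_generate_unsampled_indices`, and `random_state=0` is added only for reproducibility. With `bootstrap=False`, a float of `1.0` should leave no out-of-bag indices, while an int of `1` should leave five of the six.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X = np.arange(6).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1])


def oob_indices(max_samples):
    """Hypothetical helper: fit one bagged tree without replacement
    and return the indices it did not train on."""
    model = BaggingClassifier(DecisionTreeClassifier(),
                              n_estimators=1,
                              max_samples=max_samples,
                              bootstrap=False,
                              random_state=0)
    model.fit(X, y)
    in_bag = model.estimators_samples_[0]
    return np.setdiff1d(np.arange(X.shape[0]), in_bag)


oob_all = oob_indices(1.0)  # float: fraction of the data, so all 6 samples are in-bag
oob_one = oob_indices(1)    # int: absolute count, so exactly 1 sample is in-bag

print(len(oob_all))  # 0
print(len(oob_one))  # 5
```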
