# Random Forest Independence Test v2 

In this notebook, we modify our algorithm to average the per-sample entropies H(Y|X = x_i) over samples instead of averaging over trees.
We will try out two methods: 
1. manually calculating the posterior distribution
2. using random forest's approximation of the class probabilities
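
Written out, the quantity both methods compute can be sketched as follows. The notation is assumed here and does not come from the original notebook: $\hat{p}(y \mid x_i)$ is the estimated posterior for sample $x_i$ (pooled leaf counts for method 1, `predict_proba` for method 2) and $n$ is the number of samples the estimate is evaluated on.

```latex
\hat{H}(Y \mid X)
  = \frac{1}{n} \sum_{i=1}^{n} H(Y \mid X = x_i)
  = -\frac{1}{n} \sum_{i=1}^{n} \sum_{y} \hat{p}(y \mid x_i) \, \log \hat{p}(y \mid x_i)
```

The logarithm is natural, so all values in this notebook are in nats.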

In [63]:
# the estimators below split the data into training and test sets for you
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
import graphviz
from scipy.stats import entropy
#TODO: clean up code better
#TODO: modularize and other stuff

# Method 1: compute the posterior manually from pooled per-tree leaf counts
def estimate_conditional_entropy(X, y, n_trees = 10, max_depth = None, bootstrap = True):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model = RandomForestClassifier(bootstrap = bootstrap, n_estimators =n_trees, max_depth = max_depth, random_state = 0)
    model.fit(X_train, y_train)
    class_counts = np.zeros((X_test.shape[0], model.n_classes_))
    for tree_in_forest in model:
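        # number of training samples that reached each node of this tree
        node_counts = tree_in_forest.tree_.n_node_samples
        # for each test sample, the training count of the leaf it lands in
        partition_counts = np.asarray([node_counts[x] for x in tree_in_forest.apply(X_test)])
        # per-leaf class probabilities for each test sample
        class_probs = tree_in_forest.predict_proba(X_test)
        # The products below need not be integers when bootstrap=True:
        # sklearn bootstraps via sample weights, so the weighted class
        # fractions and n_node_samples count different things.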
        elems = np.multiply(class_probs, partition_counts[:, np.newaxis])
        class_counts += elems
    probs = class_counts/class_counts.sum(axis=1, keepdims=True)
    # treat 0*log(0) as 0 so pure leaves do not yield NaN
    entropies = -np.sum(probs * np.log(np.where(probs > 0, probs, 1.0)), axis = 1)
    return np.mean(entropies)

def estimate_conditional_entropy_rf(X, y, n_trees = 10, max_depth = None, bootstrap = True):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model = RandomForestClassifier(n_estimators = n_trees, max_depth = max_depth, random_state = 0, bootstrap = bootstrap)
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)
    # treat 0*log(0) as 0 so pure leaves do not yield NaN
    entropies = -np.sum(probs * np.log(np.where(probs > 0, probs, 1.0)), axis = 1)
    return np.mean(entropies)

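With bootstrapping on, scikit-learn fits each tree using per-sample weights (how often each row was drawn) rather than a physically resampled dataset. A node's weighted class values therefore need not add up to its `n_node_samples`, which is why the reconstructed counts above contain decimals.
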

In [23]:
x = [0]*320 + [1]*320 + [2]*320 + [3]*320
y = [0, 1, 0, 1, 0]*64 + [ 1, 1, 1, 1, 0]*64 + [1, 0, 1, 0, 1]*64 + [0, 0, 0, 0, 1]*64
X = np.array(x).reshape(-1, 1)
y = np.array(y)

# Hand calculations (natural log, in nats)  
H(X) = 1.386294  
H(Y) = 0.693147  
H(X, Y) = 1.9730  
H(Y|X) = 0.5867  
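
The hand calculations can be double-checked numerically. The sketch below rebuilds the toy dataset and computes plug-in entropies directly from the counts; `plugin_entropy` is a hypothetical helper introduced only for this check, not part of the algorithm.

```python
import numpy as np

# Same toy dataset as in the cell above.
x = np.array([0]*320 + [1]*320 + [2]*320 + [3]*320)
y = np.array([0, 1, 0, 1, 0]*64 + [1, 1, 1, 1, 0]*64
             + [1, 0, 1, 0, 1]*64 + [0, 0, 0, 0, 1]*64)

def plugin_entropy(labels):
    """Empirical (plug-in) entropy in nats of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

h_x = plugin_entropy(x)
h_y = plugin_entropy(y)
# Encode each (x, y) pair as a single integer to get the joint entropy.
h_xy = plugin_entropy(2 * x + y)
# Chain rule: H(Y|X) = H(X, Y) - H(X)
h_y_given_x = h_xy - h_x

print(h_x, h_y, h_xy, h_y_given_x)
```

The printed values should agree with the hand calculations up to rounding.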

In [67]:
estimate_conditional_entropy( X, y, 100, bootstrap = True)

0.5902739280344956

You can adjust the size of the data; the more data, the better the estimate. However, this version does not do as well as the previous algorithm, which uses weighted conditional entropy and averages across trees first.

In [68]:
estimate_conditional_entropy_rf(X, y, 100, bootstrap = True)

0.5902189569467394

# Improvement 1
We can turn bootstrapping off, because with bootstrapping on, scikit-learn's per-node class values do not add up to the node's sample count:
https://stats.stackexchange.com/questions/130206/sklearn-tree-export-graphviz-values-do-not-add-up-to-samples
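
The mismatch can be seen directly on the toy dataset. This is a minimal sketch, assuming the behavior described in the linked answer: it compares the unweighted and weighted sample counts at the root of the first tree, with and without bootstrapping. `root_counts` is a hypothetical helper, and `n_estimators=5` is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same toy dataset as above.
x = [0]*320 + [1]*320 + [2]*320 + [3]*320
y = [0, 1, 0, 1, 0]*64 + [1, 1, 1, 1, 0]*64 + [1, 0, 1, 0, 1]*64 + [0, 0, 0, 0, 1]*64
X = np.array(x).reshape(-1, 1)
y = np.array(y)

def root_counts(bootstrap):
    """Return (n_node_samples, weighted_n_node_samples) at the root of the first tree."""
    model = RandomForestClassifier(n_estimators=5, bootstrap=bootstrap, random_state=0)
    model.fit(X, y)
    t = model.estimators_[0].tree_
    return int(t.n_node_samples[0]), float(t.weighted_n_node_samples[0])

unweighted_bs, weighted_bs = root_counts(bootstrap=True)
unweighted_all, weighted_all = root_counts(bootstrap=False)

print("bootstrap=True :", unweighted_bs, weighted_bs)
print("bootstrap=False:", unweighted_all, weighted_all)
```

With bootstrapping, the unweighted count only covers the distinct rows that were drawn, while the weighted count sums the draw multiplicities. Without bootstrapping the two agree, so multiplying class probabilities by `n_node_samples` recovers whole-number counts.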

In [69]:
estimate_conditional_entropy( X, y, 100, bootstrap = False)

0.5888136355677847

In [70]:
estimate_conditional_entropy_rf(X, y, 100, bootstrap = False)

0.5888136355677847

# Improvement 2
We can use all the data. This is by far the largest source of error: it makes no sense to compare the conditional entropy of the test split to that of the entire dataset. If we want to measure the conditional entropy of our sample dataset, we should just use everything. What matters is only that the random forest was able to capture the dependencies.

How does this affect robustness? That is, what happens when the sample data looks dependent but the underlying variables are actually independent?
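
One way to probe this question is to feed the all-data estimator labels that are independent of the feature by construction. The sketch below is only an illustration under stated assumptions: `all_data_estimate` is a hypothetical helper mirroring the all-data idea with bootstrapping off, and the Gaussian feature, sample size, and depth values are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def all_data_estimate(X, y, n_trees=10, max_depth=None):
    """Hypothetical helper: fit on all data, average posterior entropy over the same data."""
    model = RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth,
                                   bootstrap=False, random_state=0)
    model.fit(X, y)
    probs = model.predict_proba(X)
    # Treat 0*log(0) as 0 so pure leaves do not produce NaN.
    logs = np.log(np.where(probs > 0, probs, 1.0))
    return float(np.mean(-np.sum(probs * logs, axis=1)))

rng = np.random.default_rng(0)
n = 1000
X_indep = rng.normal(size=(n, 1))      # continuous feature
y_indep = rng.integers(0, 2, size=n)   # labels drawn independently of X

# Plug-in H(Y); under independence the true H(Y|X) equals H(Y).
p = np.bincount(y_indep) / n
h_y = float(-np.sum(p * np.log(p)))

est_deep = all_data_estimate(X_indep, y_indep, max_depth=None)
est_shallow = all_data_estimate(X_indep, y_indep, max_depth=2)

print("H(Y):", h_y)
print("estimate, fully grown trees:", est_deep)
print("estimate, max_depth=2:", est_shallow)
```

Fully grown trees can memorize a continuous feature, so the estimate can fall far below H(Y) even though X carries no information about Y. Limiting `max_depth` is one possible guard against this, at the cost of missing finer dependencies.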

In [71]:
# Method 1: compute the posterior manually from pooled per-tree leaf counts
def estimate_conditional_entropy(X, y, n_trees = 10, max_depth = None, bootstrap = True):
    model = RandomForestClassifier(bootstrap = bootstrap, n_estimators =n_trees, max_depth = max_depth, random_state = 0)
    model.fit(X, y)
    class_counts = np.zeros((X.shape[0], model.n_classes_))
    for tree_in_forest in model:
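        # number of training samples that reached each node of this tree
        node_counts = tree_in_forest.tree_.n_node_samples
        # for each sample, the training count of the leaf it lands in
        partition_counts = np.asarray([node_counts[x] for x in tree_in_forest.apply(X)])
        # per-leaf class probabilities for each sample
        class_probs = tree_in_forest.predict_proba(X)
        # The products below need not be integers when bootstrap=True:
        # sklearn bootstraps via sample weights, so the weighted class
        # fractions and n_node_samples count different things.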
        elems = np.multiply(class_probs, partition_counts[:, np.newaxis])
        class_counts += elems
    probs = class_counts/class_counts.sum(axis=1, keepdims=True)
    # treat 0*log(0) as 0 so pure leaves do not yield NaN
    entropies = -np.sum(probs * np.log(np.where(probs > 0, probs, 1.0)), axis = 1)
    return np.mean(entropies)

def estimate_conditional_entropy_rf(X, y, n_trees = 10, max_depth = None, bootstrap = True):
    model = RandomForestClassifier(n_estimators = n_trees, max_depth = max_depth, random_state = 0, bootstrap = bootstrap)
    model.fit(X, y)
    probs = model.predict_proba(X)
    # treat 0*log(0) as 0 so pure leaves do not yield NaN
    entropies = -np.sum(probs * np.log(np.where(probs > 0, probs, 1.0)), axis = 1)
    return np.mean(entropies)