In [None]:
!pwd

## Existence Of Node Clusters

Here we demonstrate that in a random forest trained on some set of data, the nodes can reasonably be organized into clusters.

First, we must train or load a forest:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scanpy as sc
from scipy.stats import multivariate_normal

import sys
# sys.path.append('/localscratch/bbrener1/rusty_forest_v3/src')
sys.path.append('../../')
import rusty_axe.lumberjack as lumberjack

data_location = "/home/boris/battle/rf_5/data/iris/"

iris = np.loadtxt(data_location + "iris.tsv")
iris_features = np.loadtxt(data_location + "header.txt",dtype=str)

iris_forest = lumberjack.fit(
    iris,
    trees=100,
    ifs=3,
    ofs=3,
    ss=100,
    leaves=10,
    depth=3,
    dispersion_mode='ssme',
    sfr=0,
    norm='l1',
    reduction = 4,
    standardize='true',
    reduce_input='true',
    reduce_output='true',
)

shuffled = iris.copy()

for column in shuffled.T:
    np.random.shuffle(column)

shuffled_forest = lumberjack.fit(
    shuffled,
    trees=100,
    ifs=3,
    ofs=3,
    ss=100,
    leaves=10,
    depth=3,
    dispersion_mode='ssme',
    sfr=0,
    norm='l1',
    reduction = 4,
    standardize='true',
    reduce_input='true',
    reduce_output='true',
)

In [None]:
iris_class = np.loadtxt(data_location + "class.txt",dtype='str')
iris_header = np.zeros(150,dtype=int)
iris_header[iris_class == "Iris-setosa"] = 0
iris_header[iris_class == "Iris-versicolor"] = 1
iris_header[iris_class == "Iris-virginica"] = 2

A Random Forest is a collection of decision trees, and a decision tree is a collection of individual decision points, commonly known as "Nodes".

To understand Random Forests and Decision Trees, it is important to understand how Nodes work. Each individual node is a (very crude) regressor, i.e. each Node makes a prediction based on a rule like "If Gene 1 has expression > 10, Gene 2 will have expression < 5", or "If a house is < 5 miles from a school, it will cost > $100,000". A very important property of each node, however, is that it can also have children, which are other nodes. When a node makes a prediction like "If Gene 1 has expression > 10 then Gene 2 has expression < 5", it can pass all the samples for which Gene 1 is > 10 to one of its children, and all the samples for which Gene 1 is ≤ 10 to the other child. After that, each of its children can make a different prediction, which results in compound rules.

This is how a decision tree is formed. A decision tree with a depth of 2 might contain a rule like "If Gene 1 > 10 AND Gene 3 > 10, THEN Gene 2 and Gene 4 are both < 2", which would represent one of its "Leaf" nodes. Leaf nodes are nodes with no children.
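As a minimal sketch of the compound rule described above, the following uses plain NumPy on a made-up expression matrix. The gene values, the thresholds, and all variable names are illustrative assumptions and are not taken from `rusty_axe`.

```python
import numpy as np

# Toy expression matrix: rows are samples, columns are Gene 1..Gene 4 (made-up values).
expression = np.array([
    [12.0, 1.0, 11.0, 0.5],
    [15.0, 1.5, 14.0, 1.0],
    [11.0, 4.0,  3.0, 6.0],
    [ 2.0, 8.0,  9.0, 7.0],
    [ 4.0, 9.0,  1.0, 8.0],
])

# Root node: split the samples on "Gene 1 > 10".
root_mask = expression[:, 0] > 10
left_child = np.flatnonzero(root_mask)    # samples passed to one child
right_child = np.flatnonzero(~root_mask)  # samples passed to the other child

# Child node: split again on "Gene 3 > 10", which yields the compound rule
# "Gene 1 > 10 AND Gene 3 > 10".
leaf_mask = root_mask & (expression[:, 2] > 10)
leaf_samples = np.flatnonzero(leaf_mask)

# A node's prediction is the mean of the targets (here Gene 2 and Gene 4)
# over the samples that reached it.
leaf_prediction = expression[leaf_mask][:, [1, 3]].mean(axis=0)

print(left_child, right_child, leaf_samples, leaf_prediction)
```

In this toy data the leaf holds two samples, and its predicted means for Gene 2 and Gene 4 both fall below 2, matching the shape of the rule in the text.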

Individual decision trees, then, are somewhat crude predictors, but they're better than individual nodes. In order to improve the performance of decision trees, we can construct a Random Forest. To construct a random forest, we train many decision trees on bootstraps of a dataset.

If many decision trees are combined and their predictions averaged together, you have a Random Forest, which is a pretty good kind of regressor. 
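The bootstrap-and-average idea can be sketched with depth-1 trees (stumps) on a made-up one-feature regression problem. This is an illustration only: `fit_stump` and `predict_stump` are hypothetical helpers, and the forest in this notebook is trained by `lumberjack.fit`, which works differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy step function of a single feature.
x = rng.uniform(0, 10, size=200)
y = np.where(x > 5, 3.0, 1.0) + rng.normal(0, 0.5, size=200)

def fit_stump(x, y):
    """Pick the threshold with the lowest squared error; return (threshold, left_mean, right_mean)."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the maximum so both sides are non-empty
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    return np.where(x <= t, left_mean, right_mean)

# Train each stump on a bootstrap: a resample of the data drawn with replacement.
stumps = []
for _ in range(25):
    idx = rng.integers(0, len(x), size=len(x))
    stumps.append(fit_stump(x[idx], y[idx]))

# The forest prediction is the average of the individual predictions.
grid = np.array([2.0, 8.0])
forest_prediction = np.mean([predict_stump(s, grid) for s in stumps], axis=0)
print(forest_prediction)
```

Each stump sees a slightly different resample, so the thresholds differ from tree to tree; averaging smooths out those differences.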

A practical demonstration might help:

In [None]:

iris_forest.reset_split_clusters()
iris_forest.interpret_splits(depth=3,mode='partial',metric='cosine',relatives=True,k=20)


shuffled_forest.reset_split_clusters()
shuffled_forest.interpret_splits(depth=3,mode='partial',metric='cosine',relatives=True,k=20)

iris_forest.maximum_spanning_tree(mode='samples')

iris_forest.html_tree_summary()

So now that we know that random forests are collections of ordered nodes, we can examine a more interesting question: do certain nodes occur repeatedly in the forest, despite operating on bootstrapped samples? 

To examine this question, we must first understand the different ways of describing a node. I think there are generally three helpful ways of looking at a node:

* **Node Sample Encoding**: A binary vector the length of the number of samples you are considering. 0 or false means the sample is absent from the node. A 1 or true means the sample is present in the node. 

* **Node Mean Encoding**: A float vector the length of the number of targets you are considering. Each value is the mean of the target values for all samples in this node. This is the node's prediction for samples that occur in it.

* **Node Additive Encoding**: A float vector the length of the number of targets you are considering. Each value is THE DIFFERENCE between the mean value for that target in THIS NODE and the mean value for that target IN THE PARENT of this node. For root nodes, which have no parents, the additive encoding is simply the mean value across the entire dataset (as if the mean of a hypothetical parent had been 0). This encoding represents the marginal effect of each node.
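The three encodings above can be sketched on a toy target matrix with plain NumPy. The data and the parent/child sample sets are made up, and the sign of the additive encoding (node minus parent) is an assumption about the convention; in this notebook the real encodings come from `node_representation`.

```python
import numpy as np

# Toy dataset: 6 samples, 2 targets (made-up values).
targets = np.array([
    [1.0, 10.0],
    [2.0, 12.0],
    [3.0, 14.0],
    [7.0, 20.0],
    [8.0, 22.0],
    [9.0, 24.0],
])

# A hypothetical parent node holding samples 0-3, and a child holding samples 0-2.
parent_samples = np.array([0, 1, 2, 3])
node_samples = np.array([0, 1, 2])

# Sample encoding: one boolean per sample, True if the sample is in the node.
sample_encoding = np.zeros(len(targets), dtype=bool)
sample_encoding[node_samples] = True

# Mean encoding: one float per target, the mean over the samples in the node.
mean_encoding = targets[node_samples].mean(axis=0)

# Additive encoding: node mean minus parent mean (sign convention assumed).
parent_mean = targets[parent_samples].mean(axis=0)
additive_encoding = mean_encoding - parent_mean

# For a root node the hypothetical parent mean is 0, so the additive
# encoding is just the mean over the whole dataset.
root_additive_encoding = targets.mean(axis=0)

print(sample_encoding, mean_encoding, additive_encoding, root_additive_encoding)
```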

We should examine if there are any common patterns that appear if we encode many nodes from a forest using each of these representations:

In [None]:
# Here we plot the sample representations of nodes for the iris forest. 
# This generates a set of figures demonstrating the existence of node clusters

# from sklearn.decomposition import PCA

# sample_encoding = iris_forest.node_representation(iris_forest.nodes(depth=3,root=False),mode='sample')
# sister_encoding = iris_forest.node_representation(iris_forest.nodes(depth=3,root=False),mode='sister')
# reduced_sample = PCA(n_components=100).fit_transform(sample_encoding.T)
# reduced_sister = PCA(n_components=100).fit_transform(sister_encoding.T)
# # reduced_node = PCA(n_components=100).fit_transform(sample_encoding)
# reduced_node = PCA(n_components=100).fit_transform(sister_encoding)

# print(sample_encoding.shape)
# print(reduced_sample.shape)
# print(reduced_node.shape)

# from scipy.cluster.hierarchy import linkage,dendrogram

# sample_agglomeration = dendrogram(linkage(reduced_sample, metric='cosine', method='average'), no_plot=True)['leaves']
# sister_agglomeration = dendrogram(linkage(reduced_sister, metric='cosine', method='average'), no_plot=True)['leaves']
# node_agglomeration = dendrogram(linkage(reduced_node, metric='cosine', method='average'), no_plot=True)['leaves']

# plt.figure()
# plt.title("Iris Sample Presence in Node (Two-Way Agglomerated)")
# plt.imshow(sample_encoding[node_agglomeration].T[sample_agglomeration].T,cmap='binary',aspect='auto',interpolation='none')
# plt.xlabel("Samples")
# plt.ylabel("Nodes")
# plt.colorbar(label="Presence In Node")
# plt.tight_layout()
# plt.show()

# plt.figure()
# plt.title("Iris Sample Presence in Node (Two-Way Agglomerated)")
# plt.imshow(sister_encoding[node_agglomeration].T[sister_agglomeration].T,cmap='bwr',aspect='auto',interpolation='none')
# plt.xlabel("Samples")
# plt.ylabel("Nodes")
# plt.colorbar(label="Presence In Node")
# plt.tight_layout()
# plt.show()


# plt.figure()
# plt.title("Iris Sample Presence in Node (Agglomerated)")
# plt.imshow(sample_encoding[node_agglomeration],cmap='binary',aspect='auto',interpolation='none')
# plt.xlabel("Samples")
# plt.ylabel("Nodes")
# plt.colorbar(label="Presence In Node")
# plt.tight_layout()
# plt.show()

# plt.figure()
# plt.title("Iris Sample Presence in Node (Agglomerated)")
# plt.imshow(sister_encoding[node_agglomeration],cmap='bwr',aspect='auto',interpolation='none')
# plt.xlabel("Samples")
# plt.ylabel("Nodes")
# plt.colorbar(label="Presence In Node")
# plt.tight_layout()
# plt.show()

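# The plot below needs the sister encoding and a node ordering; both were only
# defined in the commented-out code above, so they are rebuilt here.
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage,dendrogram

sister_encoding = iris_forest.node_representation(iris_forest.nodes(depth=3,root=False),mode='sister')
reduced_node = PCA(n_components=100).fit_transform(sister_encoding)
node_agglomeration = dendrogram(linkage(reduced_node, metric='cosine', method='average'), no_plot=True)['leaves']

plt.figure()
# plt.suptitle("Iris Sample Presence in Node (Agglomerated)")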
ax1 = plt.axes([0,.1,.9,.9])
im1 = plt.imshow(sister_encoding[node_agglomeration],cmap='bwr',aspect='auto',interpolation='none')
plt.xlabel("Samples")
plt.ylabel("Nodes")
plt.xticks([])
ax2 = plt.axes([0,0,.9,.1])
plt.imshow(np.array([iris_header,]),aspect='auto',interpolation='none',cmap='Set3')
plt.ylabel("Species")
plt.xlabel("Samples")
plt.yticks([])
ax3 = plt.axes([.92,.1,.06,.9])
plt.colorbar(im1,cax=ax3,label="Presence In Node")
# plt.tight_layout()
plt.show()

# And here we sort the nodes after they have been clustered (more on the clustering procedure in a bit)

# node_cluster_sort = np.argsort([n.split_cluster for n in iris_forest.nodes(depth=3,root=False)])

# plt.figure()
# plt.title("Sample Presence in Node (Clustered by Gain)")
# plt.imshow(sample_encoding[node_cluster_sort],cmap='binary',aspect='auto',interpolation='none')
# plt.xlabel("Samples")
# plt.ylabel("Nodes")
# plt.colorbar(label="Presence In Node")
# plt.tight_layout()
# plt.show()

# plt.figure()
# plt.title("Sample Presence in Node (Clustered by Gain)")
# plt.imshow(sister_encoding[node_cluster_sort],cmap='bwr',aspect='auto',interpolation='none')
# plt.xlabel("Samples")
# plt.ylabel("Nodes")
# plt.colorbar(label="Presence In Node")
# plt.tight_layout()
# plt.show()

# plt.figure(figsize=(4,1))
# plt.imshow(np.array([iris_header,]),aspect='auto',interpolation='none',cmap='rainbow')
# plt.xlabel("Samples")
# plt.yticks([])
# plt.title("Iris Species By Color")
# plt.show()


In [None]:
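# Here we plot the sample representations of nodes for the Gaussian-noise forest.
# NOTE: spherical_forest is not defined in this notebook; it must be trained or loaded before running this cell.
# This generates a set of figures for comparison against the iris node clusters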

from sklearn.decomposition import PCA

sample_encoding = spherical_forest.node_representation(spherical_forest.nodes(depth=3,root=False),mode='sample')
sister_encoding = spherical_forest.node_representation(spherical_forest.nodes(depth=3,root=False),mode='sister')
reduced_sample = PCA(n_components=100).fit_transform(sample_encoding.T)
reduced_sister = PCA(n_components=100).fit_transform(sister_encoding.T)
# reduced_node = PCA(n_components=100).fit_transform(sample_encoding)
reduced_node = PCA(n_components=100).fit_transform(sister_encoding)

print(sample_encoding.shape)
print(reduced_sample.shape)
print(reduced_node.shape)

from scipy.cluster.hierarchy import linkage,dendrogram

sample_agglomeration = dendrogram(linkage(reduced_sample, metric='cosine', method='average'), no_plot=True)['leaves']
sister_agglomeration = dendrogram(linkage(reduced_sister, metric='cosine', method='average'), no_plot=True)['leaves']
node_agglomeration = dendrogram(linkage(reduced_node, metric='cosine', method='average'), no_plot=True)['leaves']
node_cluster_sort = np.argsort([n.split_cluster for n in spherical_forest.nodes(depth=3,root=False)])

plt.figure()
plt.title("Gaussian Noise Sample Presence in Node \n(Two-Way Agglomerated)")
plt.imshow(sample_encoding[node_agglomeration].T[sample_agglomeration].T,cmap='binary',aspect='auto',interpolation='none')
plt.xlabel("Samples")
plt.ylabel("Nodes")
plt.colorbar(label="Presence In Node")
plt.tight_layout()
plt.show()

plt.figure()
plt.title("Gaussian Noise Sample Presence in Node vs Sister \n(Two-Way Agglomerated)")
plt.imshow(sister_encoding[node_agglomeration].T[sister_agglomeration].T,cmap='bwr',aspect='auto',interpolation='none')
plt.xlabel("Samples")
plt.ylabel("Nodes")
plt.colorbar(label="Presence In Node")
plt.tight_layout()
plt.show()

# And here we sort the nodes after they have been clustered (more on the clustering procedure in a bit)


plt.figure()
plt.title("Gaussian Noise Sample Presence in Node (Clustered by Gain)")
plt.imshow(sample_encoding[node_cluster_sort].T[sample_agglomeration].T,cmap='binary',aspect='auto',interpolation='none')
plt.xlabel("Samples")
plt.ylabel("Nodes")
plt.colorbar(label="Presence In Node")
plt.tight_layout()
plt.show()


In [None]:
from sklearn.decomposition import PCA

sample_encoding = iris_forest.node_representation(iris_forest.nodes(depth=3,root=False),mode='sister')
reduced_sample = PCA(n_components=100).fit_transform(sample_encoding.T)
reduced_node = PCA(n_components=100).fit_transform(sample_encoding)

print(sample_encoding.shape)
print(reduced_sample.shape)
print(reduced_node.shape)

from scipy.cluster.hierarchy import linkage,dendrogram

sample_agglomeration = dendrogram(linkage(reduced_sample, metric='cosine', method='average'), no_plot=True)['leaves']
node_agglomeration = dendrogram(linkage(reduced_node, metric='cosine', method='average'), no_plot=True)['leaves']

node_cluster_sort = np.argsort([n.split_cluster for n in iris_forest.nodes(depth=3,root=False)])

plt.figure()
plt.title("Iris Sample Presence in Node vs Sister (Two-Way Agglomerated)")
plt.imshow(sample_encoding[node_agglomeration].T[sample_agglomeration].T,cmap='bwr',aspect='auto',interpolation='none')
plt.xlabel("Samples")
plt.ylabel("Nodes")
plt.colorbar()
plt.tight_layout()
plt.show()

plt.figure()
plt.title("Iris Sample Presence in Node vs Sister \n(Clustered By Gain)")
plt.imshow(sample_encoding[node_cluster_sort].T[sample_agglomeration].T,cmap='bwr',aspect='auto',interpolation='none')
plt.xlabel("Samples")
plt.ylabel("Nodes")
plt.colorbar()
plt.tight_layout()
plt.show()


In [None]:
# Here we construct and agglomerate the additive gain representation


sample_encoding = iris_forest.node_representation(iris_forest.nodes(depth=3,root=False),mode='sample')
reduced_sample = PCA(n_components=100).fit_transform(sample_encoding.T)
reduced_node = PCA(n_components=100).fit_transform(sample_encoding)

sample_agglomeration = dendrogram(linkage(reduced_sample, metric='cosine', method='average'), no_plot=True)['leaves']
node_agglomeration = dendrogram(linkage(reduced_node, metric='cosine', method='average'), no_plot=True)['leaves']

feature_encoding = iris_forest.node_representation(iris_forest.nodes(depth=3,root=False),mode='additive_mean')
reduced_feature = PCA().fit_transform(feature_encoding.T)
reduced_node = PCA().fit_transform(feature_encoding)

feature_agglomeration = dendrogram(linkage(reduced_feature, metric='cosine', method='average'), no_plot=True)['leaves']
node_agglomeration = dendrogram(linkage(reduced_node, metric='cosine', method='average'), no_plot=True)['leaves']

node_cluster_sort = np.argsort([n.split_cluster for n in iris_forest.nodes(depth=3,root=False)])



In [None]:
# Here we plot the additive gain representation 

# print(feature_encoding.shape)

# plt.figure()
# plt.title("Target Gain in Node (Double-Agglomerated)")
# plt.imshow(feature_encoding[node_agglomeration].T[feature_agglomeration].T,cmap='bwr',interpolation='none',aspect='auto',vmin=-2,vmax=2)
# plt.xlabel("Features")
# plt.ylabel("Nodes")
# plt.colorbar(label="Parent Target Mean - Node Target Mean")
# plt.tight_layout()
# plt.show()


plt.figure(figsize=(5,3))
plt.title("Target Gain in Node (Clustered)")
plt.imshow(feature_encoding[node_cluster_sort].T[feature_agglomeration].T,cmap='bwr',interpolation='none',aspect='auto',vmin=-2,vmax=2)
plt.xlabel("Features")
plt.ylabel("Nodes")
plt.colorbar(label="Parent Mean - Node Mean")
plt.xticks(np.arange(4),labels=iris_features,rotation=20)
plt.tight_layout()
plt.show()

In [None]:
# iris_forest.html_tree_summary()

Finally, we can look at silhouette scores for the various node encodings to get a feel for whether we are clustering the nodes adequately and whether the clusters meaningfully exist.
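To make the silhouette score concrete, here is a small hand-rolled version using cosine distance on made-up encodings. It is only a sketch of the standard definition (mean distance to the nearest other cluster minus mean distance within the own cluster, divided by the larger of the two); the cells below use scikit-learn's `silhouette_samples` instead, and `cosine_distances` and `silhouette_values` here are hypothetical helper names.

```python
import numpy as np

def cosine_distances(X):
    """Pairwise cosine distance (1 - cosine similarity) between the rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

def silhouette_values(X, labels):
    """Per-row silhouette score under cosine distance."""
    D = cosine_distances(X)
    labels = np.asarray(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # singleton cluster: score stays 0 by convention
        a = D[i, same].mean()  # mean distance within the own cluster
        b = min(D[i, labels == other].mean()  # mean distance to the nearest other cluster
                for other in set(labels.tolist()) if other != labels[i])
        if max(a, b) > 0:
            scores[i] = (b - a) / max(a, b)
    return scores

# Made-up node encodings: two groups pointing in clearly different directions.
encodings = np.array([
    [1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
    [-0.1, 1.0], [0.0, 0.9], [0.2, 1.0],
])

good_scores = silhouette_values(encodings, [0, 0, 0, 1, 1, 1])
mixed_scores = silhouette_values(encodings, [0, 1, 0, 1, 0, 1])
print(good_scores.mean(), mixed_scores.mean())
```

Scores near 1 mean a node sits well inside its cluster, scores near 0 mean it lies between clusters, and negative scores suggest it would fit better elsewhere. The labelling that follows the two directions scores much higher than the mixed one.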

In [None]:
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage,dendrogram


sample_encoding = spherical_forest.node_representation(spherical_forest.nodes(depth=3,root=False),mode='sample')
reduced_sample = PCA(n_components=100).fit_transform(sample_encoding.T)
reduced_node = PCA(n_components=100).fit_transform(sample_encoding)

sample_agglomeration = dendrogram(linkage(reduced_sample, metric='cosine', method='average'), no_plot=True)['leaves']
node_agglomeration = dendrogram(linkage(reduced_node, metric='cosine', method='average'), no_plot=True)['leaves']

feature_encoding = spherical_forest.node_representation(spherical_forest.nodes(depth=3,root=False),mode='additive_mean')
reduced_feature = PCA().fit_transform(feature_encoding.T)
reduced_node = PCA().fit_transform(feature_encoding)

feature_agglomeration = dendrogram(linkage(reduced_feature, metric='cosine', method='average'), no_plot=True)['leaves']
node_agglomeration = dendrogram(linkage(reduced_node, metric='cosine', method='average'), no_plot=True)['leaves']

node_cluster_sort = np.argsort([n.split_cluster for n in spherical_forest.nodes(depth=3,root=False)])



In [None]:
# plt.figure()
# plt.title("Target Gain in Node (Gaussian Noise, Double-Agglomerated)")
# plt.imshow(feature_encoding[node_agglomeration].T[feature_agglomeration].T,cmap='bwr',interpolation='none',aspect='auto',vmin=-2,vmax=2)
# plt.xlabel("Features")
# plt.ylabel("Nodes")
# plt.colorbar(label="Parent Target Mean - Node Target Mean")
# plt.tight_layout()
# plt.show()

original_clusters = np.array([n.split_cluster for n in spherical_forest.nodes(depth=3,root=False)])
renumbered_clusters = original_clusters.copy()
renumbered_clusters[original_clusters == 1] = 3
renumbered_clusters[original_clusters == 2] = 4
renumbered_clusters[original_clusters == 3] = 1
renumbered_clusters[original_clusters == 4] = 2
sort_renumbered = np.argsort(renumbered_clusters)

plt.figure()
plt.title("Target Gain in Node (Gaussian Noise, Clustered)")
plt.imshow(feature_encoding[sort_renumbered].T[feature_agglomeration].T,cmap='bwr',interpolation='none',aspect='auto',vmin=-2,vmax=2)
plt.xlabel("Features")
plt.ylabel("Nodes")
plt.colorbar(label="Parent Target Mean - Node Target Mean")
plt.xticks([0,1,2,3],[1,2,3,4])
plt.tight_layout()
plt.show()


In [None]:
# Node mean encoding

mean_encoding = iris_forest.node_representation(iris_forest.nodes(depth=3,root=False),mode='mean')

node_agglomeration = dendrogram(linkage(mean_encoding, metric='cosine', method='average'), no_plot=True)['leaves']
mean_agglomeration = dendrogram(linkage(mean_encoding.T, metric='cosine', method='average'), no_plot=True)['leaves']

plt.figure()
plt.title("Figure S2 a: Target Mean in Node (Iris, Double-Agglomerated)")
plt.imshow(mean_encoding[node_agglomeration].T[mean_agglomeration].T,cmap='viridis',interpolation='none',aspect='auto')
plt.xlabel("Features")
plt.ylabel("Nodes")
plt.colorbar(label="Node Target Mean")
plt.tight_layout()
plt.show()

In [None]:
# Silhouette Plots For Node Clusters 

from sklearn.metrics import silhouette_samples, silhouette_score

# feature_encoding = iris_forest.node_representation(iris_forest.nodes(depth=3,root=False),mode='additive_mean')
# node_labels = np.array([n.split_cluster for n in iris_forest.nodes(depth=3,root=False)])

feature_encoding = spherical_forest.node_representation(spherical_forest.nodes(depth=3,root=False),mode='additive_mean')
node_labels = np.array([n.split_cluster for n in spherical_forest.nodes(depth=3,root=False)])

silhouette_scores = silhouette_samples(feature_encoding,node_labels,metric='cosine')

sorted_silhouette = np.zeros(silhouette_scores.shape)
sorted_colors = np.zeros(silhouette_scores.shape)
sorted_indices = []

current_index = 0
next_index = 0
for i in sorted(set(node_labels)):
    mask = node_labels == i
#     selected_values = sorted(silhouette_scores[mask])    
    value_sort = np.argsort(silhouette_scores[mask])
    selected_values = silhouette_scores[mask][value_sort]
    sorted_local_indices = np.arange(len(silhouette_scores))[mask][value_sort]
    sorted_indices.extend(sorted_local_indices)
    next_index = current_index + np.sum(mask)
    sorted_silhouette[current_index:next_index] = selected_values
    sorted_colors[current_index:next_index] = i
    current_index = next_index

In [None]:
import matplotlib.cm as cm

plt.figure()
plt.title("Silhouette Plots For Node-Gain Encodings \n Clustered By Gain")
for i,node in enumerate(sorted_silhouette):
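    # node_labels for this plot come from spherical_forest, so normalize colors by its cluster count
    plt.plot([0,node],[i,i],color=cm.nipy_spectral(sorted_colors[i] / len(spherical_forest.split_clusters)))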
# plt.scatter(sorted_silhouette,np.arange(len(sorted_silhouette)),s=1)
plt.plot([0,0],[0,len(sorted_silhouette)],color='black')
plt.xlabel("Silhouette Score")
plt.ylabel("Nodes")
plt.show()

In [None]:
plt.figure()
plt.title("Figure S2 b: Target Gain in Node (Clustered)")
plt.imshow(feature_encoding[sorted_indices].T[feature_agglomeration].T,cmap='bwr',interpolation='none',aspect='auto',vmin=-2,vmax=2)
plt.xlabel("Features")
plt.ylabel("Nodes")
plt.colorbar(label="Parent Target Mean - Node Target Mean")
plt.tight_layout()
plt.show()

In [None]:
# Sample silhouettes 


from sklearn.metrics import silhouette_samples, silhouette_score

sample_encoding = iris_forest.node_representation(iris_forest.nodes(depth=3,root=False),mode='sister')
node_labels = np.array([n.split_cluster for n in iris_forest.nodes(depth=3,root=False)])

# sample_encoding = spherical_forest.node_representation(spherical_forest.nodes(depth=3,root=False),mode='sister')
# node_labels = np.array([n.split_cluster for n in spherical_forest.nodes(depth=3,root=False)])

silhouette_scores = silhouette_samples(sample_encoding,node_labels,metric='cosine')

sorted_silhouette = np.zeros(silhouette_scores.shape)
sorted_colors = np.zeros(silhouette_scores.shape)
sorted_indices = []

current_index = 0
next_index = 0
for i in sorted(set(node_labels)):
    mask = node_labels == i
    value_sort = np.argsort(silhouette_scores[mask])
    selected_values = silhouette_scores[mask][value_sort]
    sorted_local_indices = np.arange(len(silhouette_scores))[mask][value_sort]
    sorted_indices.extend(sorted_local_indices)
    next_index = current_index + np.sum(mask)
    sorted_silhouette[current_index:next_index] = selected_values
    sorted_colors[current_index:next_index] = i
    current_index = next_index

In [None]:
import matplotlib.cm as cm

plt.figure()
plt.title("Silhouette Plots For Node-Sister Encodings \n Clustered By Gain")
for i,node in enumerate(sorted_silhouette):
    plt.plot([0,node],[i,i],color=cm.nipy_spectral(sorted_colors[i] / len(iris_forest.split_clusters)))
# plt.scatter(sorted_silhouette,np.arange(len(sorted_silhouette)),s=1)
plt.plot([0,0],[0,len(sorted_silhouette)],color='black')
plt.ylim(1180,0)
plt.xlabel("Silhouette Score")
plt.ylabel("Nodes")
plt.show()

In [None]:
plt.figure()
plt.title("Sample Presence in Node vs Sister (Clustered)")
plt.imshow(sample_encoding[sorted_indices].T[sample_agglomeration].T,cmap='bwr',interpolation='none',aspect='auto')
plt.xlabel("Samples")
plt.ylabel("Nodes")
plt.colorbar(label="Presence In Node")
plt.tight_layout()
plt.show()

In [None]:
sample_encoding.shape

In [None]:
sorted_indices

In [None]:
list(silhouette_scores[sorted_indices])

In [None]:
pa = iris_forest.node_representation(iris_forest.nodes(),mode='partial_absolute')

In [None]:
np.sum(np.abs(pa),axis=0)

In [None]:
[r.index for r in iris_forest.roots()]

In [None]:
pa[899]

In [None]:
np.around(iris_forest.node_representation(iris_forest.trees[0].nodes(),mode='partial_absolute'),3)

In [None]:
iris_forest.trees[0].root.nodes()[-1].level