Condensing Cluster Trees
=============

The goal is to condense a cluster tree down to a simpler tree based on `min_cluster_size`. Essentially, we view a node of the tree as continuing to exist until it splits into two clusters that each have at least `min_cluster_size` points. When a split produces a branch with fewer than `min_cluster_size` points, we view this as the cluster "losing" points rather than splitting into new clusters. We want to record when it lost the points, but retain the cluster identity.

To start, we begin with a new node class that supports multiple children rather than just "left" and "right". Each node also needs an id and the distance (`dist`) at which it split off from its parent.

In [1]:
class CondensedTreeNode:    
    def __init__(self, id, dist, children, size, is_leaf):
        self.id = id
        self.dist = dist
        self.children = children
        self.child_size = size
        self.is_leaf = is_leaf
    
    def add_child(self, child):
        self.children.append(child)

    def __repr__(self):
        return '<Node object at %s>' % (
            hex(id(self))
            )
        
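    def __str__(self):
        # %s for the id because leaf ids may be strings such as "POINT 3";
        # %f for dist because it is a float distance, not an integer.
        return "ID: %s, Dist %f, Number of children %d, " \
               "Child size %d, Leaf node: %s" % (self.id, self.dist, len(self.children),
                                                 self.child_size, self.is_leaf)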
    

Next we'll need a utility function to extract all the leaf nodes under a given cluster node. This is essential since we want to make a new "leaf-cluster" for each node that is smaller than `min_cluster_size`, and we'll want to gather up all the actual leaves/data-points and place them flat within that node. Conveniently, scipy's tree nodes come with a `pre_order` method that takes a function (defaulting to `lambda x: x.id`) and applies it to each leaf, so we can just pass it the identity function and get all the leaves.

In [2]:
#This function consumes a tree and returns a list of its leaves
def get_leaves(tree):
    """Consume a tree object and return a list of leaf nodes"""
    return tree.pre_order(lambda x: x)

Now we can focus on the condense operation. Since we are working with a tree, this is most easily implemented as a recursive function walking down the tree. At each stage we check whether the left or right branch is too small relative to `min_cluster_size`, and then either recursively call `condense_tree` on that branch or add all of its leaves directly as children of the current node (the "leaf-cluster"). Note that the code compares with `<=`, so a branch with exactly `min_cluster_size` points is also treated as too small. We'll label leaf nodes with "POINT" to denote that the id is indexing a data point and not a cluster.

In [3]:
def condense_tree(tree, min_cluster_size=10, next_id=0):

    #Verbose assert
    if tree.count == 0:
        print("Invalid input: Null tree")
        result = CondensedTreeNode(-1, -1, [], -1, False)
        return result, next_id
    elif tree.count == 1:
        #Passed in a single node; return only that node
        result = CondensedTreeNode(-1, tree.dist, [], 1, True)
        return result, next_id

    result = CondensedTreeNode(next_id, tree.dist, [], tree.left.count + tree.right.count, False)

    #If the left node is too small, add its points as leaves
    if tree.left.count <= min_cluster_size:
        leaves = get_leaves(tree.left)
        for leaf in leaves:
            result.add_child(CondensedTreeNode("POINT %i" % leaf.id, tree.left.dist, [], 1, True))
    elif tree.right.count <= min_cluster_size:
        #Only the right side is too small: the left branch keeps this cluster's id
        child, next_id = condense_tree(tree.left, min_cluster_size, next_id)
        result.add_child(child)
    else:
        #A true split: the left branch gets a fresh id
        child, next_id = condense_tree(tree.left, min_cluster_size, next_id + 1)
        result.add_child(child)

    #If the right node is too small, add its points as leaves
    if tree.right.count <= min_cluster_size:
        leaves = get_leaves(tree.right)
        for leaf in leaves:
            result.add_child(CondensedTreeNode("POINT %i" % leaf.id, tree.right.dist, [], 1, True))
    elif tree.left.count <= min_cluster_size:
        #Only the left side is too small: the right branch keeps this cluster's id
        child, next_id = condense_tree(tree.right, min_cluster_size, next_id)
        result.add_child(child)
    else:
        #A true split: the right branch gets a fresh id
        child, next_id = condense_tree(tree.right, min_cluster_size, next_id + 1)
        result.add_child(child)

    return result, next_id
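
The per-split decision made by `condense_tree` can be summarised in isolation. The helper below is a hypothetical illustration, not part of the notebook; `classify_split` and its return strings are made up for this sketch, and it assumes the same `<=` comparison used in the cell above.

```python
def classify_split(left_count, right_count, min_cluster_size=10):
    """Describe what the condensed tree does at one binary split.

    Hypothetical helper for illustration only. A branch counts as
    "too small" when count <= min_cluster_size, as in condense_tree.
    """
    left_small = left_count <= min_cluster_size
    right_small = right_count <= min_cluster_size
    if left_small and right_small:
        return "both sides fall out as points"
    if left_small:
        return "left falls out as points; right keeps the parent id"
    if right_small:
        return "right falls out as points; left keeps the parent id"
    return "true split: two new cluster ids"


print(classify_split(3, 120))
print(classify_split(50, 100))
```

Only the last case creates new cluster ids; in the other cases the surviving branch is treated as the same cluster having shed a few points.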
    

Now we can load up some test data (iris will do for now) and try this out.

In [4]:
import pandas as pd
import numpy as np
import scipy.spatial.distance as dist

iris = pd.read_csv("iris.csv")
distance_matrix = dist.squareform(dist.pdist(iris.iloc[:, :4].values))

In [5]:
def mutual_reachability_distance_matrix(distance_matrix, min_points):
    dim = distance_matrix.shape[0]
    core_distances = np.partition(distance_matrix, min_points, axis=0)[min_points]
    core_distance_matrix = core_distances.repeat(dim).reshape((dim,dim))
    result = np.dstack((core_distance_matrix, core_distance_matrix.T, distance_matrix)).max(axis=2)
    return result

In [6]:
mr_dist_matrix = mutual_reachability_distance_matrix(distance_matrix, 10)

In [7]:
import fastcluster
import scipy.cluster.hierarchy as hclust

In [8]:
ctree = fastcluster.single(mr_dist_matrix)

In [9]:
hctree = hclust.to_tree(ctree)

In [10]:
condensed, final_id = condense_tree(hctree, 10, 0)

Now we need to flatten the tree. The easiest way is to flatten a node into a list of parent-child relations, and then have a flatten tree function that recursively calls `flatten_node` down the whole tree.

In [11]:
def flatten_node(tree_node):
    return [(tree_node.id, x.id, 1.0/tree_node.dist, x.child_size) for x in tree_node.children
             if tree_node.id != x.id]

In [12]:
def flatten_tree_recursion(tree):
    if tree.is_leaf:
        return []
    result = flatten_node(tree)
    for subtree in tree.children:
        result.extend(flatten_tree_recursion(subtree))
    return result

def flatten_tree(tree):
    result = flatten_tree_recursion(tree)
    return pd.DataFrame(result, columns=("parent","child","lambda","child_size"))

In [13]:
flatten_tree(condensed)

,parent,child,lambda,child_size
0,0,1,0.065131,50
1,0,4,0.065131,100
2,1,POINT 41,0.334293,1
3,1,POINT 13,0.374176,1
4,1,POINT 22,0.374176,1
5,1,POINT 15,0.443019,1
6,1,POINT 8,0.532927,1
7,1,POINT 38,0.532927,1
8,1,POINT 42,0.532927,1
9,1,POINT 5,0.631190,1


Well that looks not entirely unreasonable. Let's have a quick check that we actually have all the POINT data (one for each original data point) with no duplication.
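
The cell that performed this check is not shown, so the following is only a sketch of one way to do it. The function name `check_point_coverage` is hypothetical, and it assumes rows shaped like the flattened output above, with leaf children stored as `"POINT <index>"` strings.

```python
from collections import Counter


def check_point_coverage(rows, n_points):
    """Return (missing, duplicated) point indices.

    rows: iterable of (parent, child, lambda, child_size) records in
    which leaf children are strings of the form "POINT <index>".
    Hypothetical helper, not the notebook's original check.
    """
    seen = Counter(
        int(child.split()[1])
        for _, child, _, _ in rows
        if isinstance(child, str) and child.startswith("POINT ")
    )
    missing = sorted(set(range(n_points)) - set(seen))
    duplicated = sorted(i for i, count in seen.items() if count > 1)
    return missing, duplicated


# Toy edge list: cluster 0 sheds point 2 and contains cluster 1.
toy_rows = [
    (0, 1, 0.1, 2),
    (0, "POINT 2", 0.1, 1),
    (1, "POINT 0", 0.5, 1),
    (1, "POINT 1", 0.5, 1),
]
print(check_point_coverage(toy_rows, 3))
```

On the notebook's data this could be called as `check_point_coverage(flatten_tree(condensed).values.tolist(), len(iris))`; two empty lists would mean every point appears exactly once.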

Success! Of course, we eventually want to store this as a numpy array, and strings in there would make quite a mess, so let's avoid them. We can make a new version of `condense_tree` that gives negative ids to leaf nodes: point `i` gets the id `-i-1`.

In [24]:
def condense_tree(tree, min_cluster_size=10, next_id=0):

    #Verbose assert
    if tree.count == 0:
        print("Invalid input: Null tree")
        result = CondensedTreeNode(-1, -1, [], -1, False)
        return result, next_id
    elif tree.count == 1:
        #Passed in a single node; return only that node
        result = CondensedTreeNode(-1, tree.dist, [], 1, True)
        return result, next_id

    result = CondensedTreeNode(next_id, tree.dist, [], tree.left.count + tree.right.count, False)

    #If the left node is too small, add its points as leaves (negative ids)
    if tree.left.count <= min_cluster_size:
        leaves = get_leaves(tree.left)
        for leaf in leaves:
            result.add_child(CondensedTreeNode(-leaf.id-1, tree.left.dist, [], 1, True))
    elif tree.right.count <= min_cluster_size:
        #Only the right side is too small: the left branch keeps this cluster's id
        child, next_id = condense_tree(tree.left, min_cluster_size, next_id)
        result.add_child(child)
    else:
        #A true split: the left branch gets a fresh id
        child, next_id = condense_tree(tree.left, min_cluster_size, next_id + 1)
        result.add_child(child)

    #If the right node is too small, add its points as leaves (negative ids)
    if tree.right.count <= min_cluster_size:
        leaves = get_leaves(tree.right)
        for leaf in leaves:
            result.add_child(CondensedTreeNode(-leaf.id-1, tree.right.dist, [], 1, True))
    elif tree.left.count <= min_cluster_size:
        #Only the left side is too small: the right branch keeps this cluster's id
        child, next_id = condense_tree(tree.right, min_cluster_size, next_id)
        result.add_child(child)
    else:
        #A true split: the right branch gets a fresh id
        child, next_id = condense_tree(tree.right, min_cluster_size, next_id + 1)
        result.add_child(child)

    return result, next_id
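
The leaf encoding `-leaf.id-1` used above is easy to invert. The helper names below (`encode_point`, `decode_point`, `is_point`) are hypothetical and only illustrate the mapping; the offset of 1 keeps point 0 from colliding with cluster id 0.

```python
def encode_point(index):
    """Map a data point index (0, 1, 2, ...) to a negative node id."""
    return -index - 1


def decode_point(node_id):
    """Recover the data point index from a negative node id."""
    return -node_id - 1


def is_point(node_id):
    """Cluster ids are >= 0; encoded data points are < 0."""
    return node_id < 0


print(encode_point(41))   # the row labelled "POINT 41" earlier
print(decode_point(-42))
```

This matches the two flattened outputs: the row that read `POINT 41` in the string version appears as `-42` in the numeric one.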

In [25]:
condensed, final_id = condense_tree(hctree)
flatten_tree(condensed)

,parent,child,lambda,child_size
0,0,1,0.065131,50
1,0,4,0.065131,100
2,1,-42,0.334293,1
3,1,-14,0.374176,1
4,1,-23,0.374176,1
5,1,-16,0.443019,1
6,1,-9,0.532927,1
7,1,-39,0.532927,1
8,1,-43,0.532927,1
9,1,-6,0.631190,1


In [26]:
hctree.dist

15.353690797474098

In [29]:
ctree=flatten_tree(condensed)

In [38]:
ctree

,parent,child,lambda,child_size
0,0,1,0.065131,50
1,0,4,0.065131,100
2,1,-42,0.334293,1
3,1,-14,0.374176,1
4,1,-23,0.374176,1
5,1,-16,0.443019,1
6,1,-9,0.532927,1
7,1,-39,0.532927,1
8,1,-43,0.532927,1
9,1,-6,0.631190,1


Cluster stability <BR>
$S(C_i) = \sum_{x_j \in C_i}{(\lambda_{max}(x_j, C_i) - \lambda_{min}(C_i)) }$ <BR>
$= \sum_{x_j \in C_i}{(1/{\epsilon_{min}(x_j,C_i)} - 1/{\epsilon_{max}(C_i)})}$

* $C_i$ is the $i$th cluster
* $x_j \in C_i$ ranges over the points contained in cluster $C_i$
* $\lambda = 1/\epsilon$ is the density threshold. Increasing the density threshold decreases cluster sizes.
* $\lambda_{max}(x_j, C_i)$ is the density level beyond which $x_j$ no longer belongs to cluster $C_i$
* $\lambda_{min}(C_i)$ is the density level at which cluster $C_i$ first appears

Let's look at the minimum density level at which $C_i$ exists. <BR>
Let's assume column names 'parent', 'child', 'lambda', 'child_size' <BR>
Because each row represents an edge in a hierarchical clustering, its lambda can be thought of as the density at which the child was removed from the parent cluster.
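
To make the bookkeeping concrete before turning to pandas, here is a minimal plain-Python sketch of the stability sum over such an edge list. The function name `cluster_stability` and the toy edges are assumptions made for illustration; the root is taken to be cluster 0 with $\lambda_{min} = 0$, and negative ids stand for single data points.

```python
from collections import defaultdict


def cluster_stability(edges, root=0):
    """Sum child_size * (lambda - lambda_min(parent)) per parent.

    edges: iterable of (parent, child, lambda, child_size) tuples.
    lambda_min of a cluster is the lambda on the edge where it appears
    as a child; the root is assigned lambda_min = 0.
    """
    edges = list(edges)
    birth = {root: 0.0}
    for _, child, lam, _ in edges:
        birth[child] = min(lam, birth.get(child, lam))

    totals = defaultdict(float)
    for parent, _, lam, size in edges:
        totals[parent] += size * (lam - birth[parent])
    return dict(totals)


# Toy condensed tree: root 0 splits into clusters 1 and 2 at lambda 0.5,
# which then shed individual points (negative ids) at higher lambdas.
toy_edges = [
    (0, 1, 0.5, 3), (0, 2, 0.5, 2),
    (1, -1, 1.0, 1), (1, -2, 1.5, 1), (1, -3, 2.0, 1),
    (2, -4, 1.0, 1), (2, -5, 1.0, 1),
]
print(cluster_stability(toy_edges))
```

For cluster 1 in the toy data, the three points leave at lambdas 1.0, 1.5 and 2.0 while the cluster was born at 0.5, giving 0.5 + 1.0 + 1.5 = 3.0.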

`lambda_min` returns, for each cluster, the smallest lambda at which it appears as a child in the clustering, i.e. the density at which it split off from its parent. It assumes that the 0th cluster is the root and defines its minimum density to be zero.

In [108]:
def lambda_min(tree):
    lambdaMin = tree.groupby(['child'])[['lambda']].min()
    lambdaMin = pd.concat((lambdaMin, pd.DataFrame({'lambda':[0]})))
    lambdaMin.index.name = 'child'
    lambdaMin['child'] = lambdaMin.index
    return(lambdaMin)

In [123]:
lmin = lambda_min(ctree)
lmin.tail()

child (index),lambda,child
5,0.495653,5
6,0.495653,6
7,0.539257,7
8,0.539257,8
0,0.0,0


Next we join the base hierarchical cluster tree with the lambda_min for each parent.  
This allows the computation of 
${(\lambda_{max}(x_j, C_i) - \lambda_{min}(C_i)) }$ for any points removed from cluster $C_i$

In [114]:
#ctree.join(lmin, on='child')
struct = ctree.merge(lmin, how='left', left_on='parent', right_on='child')
struct['stability'] = struct['child_size']*(struct['lambda_x'] - struct['lambda_y'])
struct.head()

,parent,child_x,lambda_x,child_size,lambda_y,child_y,stability
0,0,1,0.065131,50,0.0,0,3.256546
1,0,4,0.065131,100,0.0,0,6.513092
2,1,-42,0.334293,1,0.065131,1,0.269162
3,1,-14,0.374176,1,0.065131,1,0.309045
4,1,-23,0.374176,1,0.065131,1,0.309045


In [120]:
def stability(tree):
    lambdaMin = lambda_min(tree)
    # Use the tree passed in, not the global ctree
    enrichedTree = tree.merge(lambdaMin, how='left', left_on='parent', right_on='child')
    enrichedTree['stability'] = enrichedTree['child_size']*(enrichedTree['lambda_x'] - enrichedTree['lambda_y'])
    clusterStability = enrichedTree.groupby(['parent'])[['stability']].sum()
    clusterStability.index.name = 'clusterId'
    return(clusterStability)

In [121]:
stability(ctree)

clusterId,stability
0,9.769638
1,38.212507
2,5.673967
3,6.401039
4,40.164406
5,3.883341
6,2.410333
7,1.885658
8,3.05283
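
As a quick arithmetic check using only the numbers printed above: the root cluster 0 has two outgoing edges (to clusters 1 and 4), both at $\lambda = 0.065131$, and its own $\lambda_{min}$ is defined as 0. Summing their contributions from the `struct` output reproduces the first row of the stability table:

$S(C_0) = 50\,(0.065131 - 0) + 100\,(0.065131 - 0) \approx 3.256546 + 6.513092 = 9.769638$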
