Core Single Linkage via Minimal Spanning Tree
=========================

Working from Müllner's paper, we are going to implement our own MST-LINKAGE-CORE so that we can handle HDBSCAN's mutual reachability graphs (which have non-zero self distances). The goal here is to start with a simple implementation, then progressively profile and refine it, using whatever we can (Cython, Numba, etc.) to make it run as fast as possible, because this is the core of the algorithm.
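
To make "non-zero self distances" concrete, here is a minimal sketch of a mutual reachability matrix, assuming the usual definition max(core(a), core(b), d(a, b)). The function name and the `min_points` parameter are illustrative, not taken from any library, and the convention for whether a point counts as its own neighbour is an assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mutual_reachability(distance_matrix, min_points=5):
    """Return max(core(a), core(b), d(a, b)) for every pair (a, b).

    Assumption: core(a) is the distance from a to its `min_points`-th
    nearest neighbour, not counting a itself.
    """
    # After partitioning, index k of each row holds the k-th smallest value;
    # index 0 is the zero distance from the point to itself.
    core_distances = np.partition(distance_matrix, min_points, axis=1)[:, min_points]
    stage = np.maximum(distance_matrix, core_distances[:, np.newaxis])
    return np.maximum(stage, core_distances[np.newaxis, :])

rng = np.random.RandomState(0)
points = rng.rand(20, 2)
raw = squareform(pdist(points))
mr = mutual_reachability(raw, min_points=3)
```

Note that the diagonal of `mr` holds the core distances rather than zeros, which is the case a spanning tree routine has to tolerate.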

In [1]:
import numpy as np

To start, let's do a naive and direct translation of Müllner's pseudocode. Once we're done with that, we'll get some simple test data and work from there.

In [33]:
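def mst_linkage_core(node_labels, distance_matrix):
    """Naive translation of MST-LINKAGE-CORE; returns [current, new, distance] rows."""
    n_nodes = node_labels.shape[0]
    result = []
    current_node = np.random.choice(node_labels)  # arbitrary starting node
    # level_distances[i][s]: smallest distance from s to the nodes visited by step i
    level_distances = np.inf * np.ones((n_nodes, n_nodes))
    current_labels = node_labels
    # A spanning tree on n_nodes nodes has n_nodes - 1 edges, so run n_nodes - 1 steps
    for i in range(1, n_nodes):
        # Drop the current node from the set of unvisited nodes
        current_labels = current_labels[current_labels != current_node]
        for other_node in current_labels:
            level_distances[i][other_node] = min(level_distances[i - 1][other_node],
                                                 distance_matrix[other_node, current_node])

        # The closest unvisited node is visited next
        new_node_index = np.argmin(level_distances[i][current_labels])
        new_node = current_labels[new_node_index]
        result.append([current_node, new_node, level_distances[i][new_node]])
        current_node = new_node

    return result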
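
As a possible first refinement, the inner Python loop can be replaced by whole-row NumPy operations, keeping a single running distance array instead of one row per level. This is only a sketch under that assumption, not a tuned implementation; the name `mst_linkage_core_vectorized` and the `start` parameter are made up for illustration. The total edge weight is compared against `scipy.sparse.csgraph.minimum_spanning_tree`, which treats zero entries as missing edges, so the check uses random points with no duplicates.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_linkage_core_vectorized(distance_matrix, start=0):
    """Prim-style pass over a dense matrix; rows are [current, new, distance]."""
    n = distance_matrix.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    best_distance = np.full(n, np.inf)
    result = np.zeros((n - 1, 3))
    current = start
    for i in range(n - 1):
        in_tree[current] = True
        # Update every node's best known distance to the tree in one step
        best_distance = np.minimum(best_distance, distance_matrix[current])
        # Nodes already in the tree are masked out so they are never re-selected
        candidates = np.where(in_tree, np.inf, best_distance)
        new_node = np.argmin(candidates)
        # As in the naive version, the first column is the previously added node
        result[i] = (current, new_node, candidates[new_node])
        current = new_node
    return result

rng = np.random.RandomState(42)
points = rng.rand(30, 3)
dense = squareform(pdist(points))
edges = mst_linkage_core_vectorized(dense)
reference_weight = minimum_spanning_tree(dense).sum()
```

Because masking uses `in_tree` rather than the diagonal, non-zero self distances do not affect which node is picked.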

Okay, let's get some data (say, iris) and do some quick testing.

In [9]:
import pandas as pd
import scipy.spatial.distance as dist

iris = pd.read_csv("iris.csv")
distance_matrix = dist.squareform(dist.pdist(iris.iloc[:, :4].values))
distance_matrix

array([[ 0.        ,  0.53851648,  0.50990195, ...,  4.45982062,
         4.65080638,  4.14004831],
       [ 0.53851648,  0.        ,  0.3       , ...,  4.49888875,
         4.71805044,  4.15331193],
       [ 0.50990195,  0.3       ,  0.        , ...,  4.66154481,
         4.84871117,  4.29883705],
       ..., 
       [ 4.45982062,  4.49888875,  4.66154481, ...,  0.        ,
         0.6164414 ,  0.64031242],
       [ 4.65080638,  4.71805044,  4.84871117, ...,  0.6164414 ,
         0.        ,  0.76811457],
       [ 4.14004831,  4.15331193,  4.29883705, ...,  0.64031242,
         0.76811457,  0.        ]])

In [10]:
labels = np.arange(150)

In [34]:
mst_linkage_core(labels, distance_matrix)

[[88, 95, 0.1732050807568884],
 [95, 96, 0.14142135623730964],
 [96, 99, 0.14142135623730995],
 [99, 94, 0.17320508075688815],
 [94, 82, 0.26457513110645864],
 [82, 92, 0.14142135623730964],
 [92, 67, 0.24494897427831766],
 [67, 90, 0.26457513110645914],
 [90, 69, 0.26457513110645919],
 [69, 80, 0.17320508075688762],
 [80, 81, 0.14142135623730931],
 [81, 89, 0.24494897427831766],
 [89, 53, 0.20000000000000018],
 [53, 61, 0.3000000000000001],
 [61, 55, 0.31622776601683777],
 [55, 66, 0.30000000000000027],
 [66, 84, 0.19999999999999929],
 [84, 78, 0.33166247903553975],
 [78, 91, 0.19999999999999973],
 [91, 63, 0.14142135623730995],
 [63, 73, 0.22360679774997896],
 [73, 71, 0.34641016151377524],
 [71, 97, 0.33166247903554003],
 [97, 74, 0.20000000000000018],
 [74, 75, 0.26457513110645869],
 [75, 65, 0.14142135623730995],
 [65, 58, 0.24494897427831722],
 [58, 54, 0.24494897427831766],
 [54, 51, 0.3162277660168375],
 [51, 56, 0.2645751311064593],
 [56, 86, 0.31622776601683777],
 [86, 52, 0.

In [38]:
hierarchy = np.array(mst_linkage_core(labels, distance_matrix))
sort_order = np.argsort(hierarchy.T[2])
data_for_steve = hierarchy[sort_order,:]
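
One thing a distance-sorted edge list like this can feed is a union-find pass that turns it into a SciPy-style linkage matrix, roughly in the spirit of the labelling step in Müllner's paper. The sketch below is an assumption about how that could look, with hypothetical names (`label_edges`, `find`), and it runs on a tiny hand-written edge list rather than the iris result.

```python
import numpy as np

def label_edges(sorted_edges):
    """Turn distance-sorted [a, b, distance] rows into rows of
    [cluster_1, cluster_2, distance, merged_size]."""
    n = sorted_edges.shape[0] + 1
    # Labels 0..n-1 are the points; n..2n-2 are the merged clusters
    parent = np.arange(2 * n - 1)
    size = np.ones(2 * n - 1, dtype=int)

    def find(x):
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node straight at the root
        while parent[x] != root:
            next_x = parent[x]
            parent[x] = root
            x = next_x
        return root

    linkage = np.zeros((n - 1, 4))
    for i, (a, b, distance) in enumerate(sorted_edges):
        root_a, root_b = find(int(a)), find(int(b))
        new_label = n + i
        parent[root_a] = parent[root_b] = new_label
        size[new_label] = size[root_a] + size[root_b]
        linkage[i] = (min(root_a, root_b), max(root_a, root_b),
                      distance, size[new_label])
    return linkage

# Four points on a line at 0, 1, 3 and 7, with edges already sorted by distance
toy_edges = np.array([[0, 1, 1.0], [1, 2, 2.0], [2, 3, 4.0]])
toy_linkage = label_edges(toy_edges)
```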

In [39]:
data_for_steve

array([[  1.01000000e+02,   1.42000000e+02,   0.00000000e+00],
       [  3.40000000e+01,   3.70000000e+01,   0.00000000e+00],
       [  9.00000000e+00,   3.40000000e+01,   0.00000000e+00],
       [  7.00000000e+00,   3.90000000e+01,   1.00000000e-01],
       [  0.00000000e+00,   1.70000000e+01,   1.00000000e-01],
       [  1.28000000e+02,   1.32000000e+02,   1.00000000e-01],
       [  4.80000000e+01,   1.00000000e+01,   1.00000000e-01],
       [  1.70000000e+01,   4.00000000e+01,   1.41421356e-01],
       [  1.90000000e+01,   2.10000000e+01,   1.41421356e-01],
       [  4.00000000e+01,   4.00000000e+00,   1.41421356e-01],
       [  1.16000000e+02,   1.37000000e+02,   1.41421356e-01],
       [  8.00000000e+01,   8.10000000e+01,   1.41421356e-01],
       [  9.30000000e+01,   5.70000000e+01,   1.41421356e-01],
       [  2.90000000e+01,   3.00000000e+01,   1.41421356e-01],
       [  3.80000000e+01,   8.00000000e+00,   1.41421356e-01],
       [  2.10000000e+01,   4.60000000e+01,   1.4142135