Mutual Reachability "Graph"
===============

Build the mutual reachability graph from a distance matrix. We are operating with the following definitions per 'Campello, Moulavi, Sander':

**Core Distance**: The core distance of an object $x_p \in X$ with respect to $m_{\textrm{pts}}$ is the distance from $x_p$ to its $m_{\textrm{pts}}$-nearest neighbour.

**Mutual Reachability Distance**: The mutual reachability distance between two objects $x_p$ and $x_q$ in $X$ with respect to $m_{\textrm{pts}}$ is defined as $d_{\textrm{mreach}}(x_p, x_q) = \max\{d_{\textrm{core}}(x_p), d_{\textrm{core}}(x_q), d(x_p, x_q)\}$.
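
To make the two definitions concrete, here is a minimal sketch on made-up data: four points on a line, with `m_pts = 2`. The points and the value of `m_pts` are illustrative assumptions, not part of the original analysis. The core distance is read off the sorted distances, skipping the zero self-distance at index 0.

```python
import numpy as np

# Hypothetical toy data: four points on a line (illustration only).
points = np.array([0.0, 1.0, 3.0, 7.0])
distance_matrix = np.abs(points[:, np.newaxis] - points[np.newaxis, :])

m_pts = 2  # illustrative choice

# Each sorted column starts with the zero self-distance, so index m_pts
# is the distance to the m_pts-th nearest *other* point.
core = np.sort(distance_matrix, axis=0)[m_pts]

# Mutual reachability distance between the first two points.
p, q = 0, 1
d_mreach = max(core[p], core[q], distance_matrix[p, q])

print(core)      # [3. 2. 3. 6.]
print(d_mreach)  # 3.0
```

The raw distance between the first two points is 1, but their mutual reachability distance is 3 because the first point's core distance dominates.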

In [1]:
import numpy as np

Let's assume we have an all-pairs distance matrix to work with and generate a new "mutual reachability distance matrix" from it. We'll start with the most naive implementation and work from there.

In [2]:
def naive_mutual_reachability_distance_matrix(distance_matrix, min_points):
    result = np.zeros(distance_matrix.shape)
    # Row min_points of the column-sorted matrix; row 0 holds each point's zero self-distance.
    core_distances = np.sort(distance_matrix, axis=0)[min_points]
    for i in range(distance_matrix.shape[0]):
        for j in range(distance_matrix.shape[1]):
            result[i,j] = max(core_distances[i], core_distances[j], distance_matrix[i,j])
            
    return result

We need some test data to try this out on and do profiling, so let's load in iris and get the distance matrix up.

In [3]:
import pandas as pd
import scipy.spatial.distance as dist

iris = pd.read_csv("iris.csv")
# .ix and .as_matrix() have been removed from pandas; use .iloc and .to_numpy().
distance_matrix = dist.squareform(dist.pdist(iris.iloc[:, :4].to_numpy()))

In [4]:
%timeit naive_mutual_reachability_distance_matrix(distance_matrix, 5)

10 loops, best of 3: 84.6 ms per loop


In [5]:
%prun naive_mutual_reachability_distance_matrix(distance_matrix, 5)

 

Clearly we are spending our time in that `max` operation, the unnecessary `range` loops, and sorting the array to get the core distances. We can probably fix some of this. Let's go nuts with numpy! We can create a matrix that repeats the vector of core distances along each row. If we now depth-stack that matrix, its transpose, and the distance matrix into a tensor, then the max across axis 2 is the max of `core_distances[i]`, `core_distances[j]` and `distance_matrix[i,j]` at the `i,j` position ... or exactly what we want, but in a completely vectorized fashion. This is kind of awesome, and not entirely obvious, but hey, here's some documentation so have fun.

In [6]:
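def mutual_reachability_distance_matrix(distance_matrix, min_points):
    dim = distance_matrix.shape[0]  # use the matrix size rather than a hard-coded 150
    core_distances = np.sort(distance_matrix, axis=0)[min_points]
    # Row i of core_distance_matrix is core_distances[i] repeated dim times.
    core_distance_matrix = core_distances.repeat(dim).reshape((dim, dim))
    result = np.dstack((core_distance_matrix, core_distance_matrix.T, distance_matrix)).max(axis=2)
    return result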

In [7]:
%timeit mutual_reachability_distance_matrix(distance_matrix, 5)

100 loops, best of 3: 8.15 ms per loop


In [8]:
np.all(mutual_reachability_distance_matrix(distance_matrix, 5) == 
       naive_mutual_reachability_distance_matrix(distance_matrix, 5))

True

In [31]:
%prun mutual_reachability_distance_matrix(distance_matrix, 5)

 

Well that worked remarkably well. The expense is still in the sort, but otherwise we have drastically improved everything else. Of course the reality is that the sort is merely convenient, not necessary, so let's see if we can do a linear scan to find the `min_points`th entry and get $O(n)$ instead of $O(n\log n)$. Oh, wait, numpy already thought of that (yes, really) and has the `partition` function, which puts the k-th smallest element in its sorted position, with everything smaller before it and everything larger after it, and leaves both sides otherwise unsorted. That is all we need, and it runs in linear time. Let's just use that.

In [9]:
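def mutual_reachability_distance_matrix(distance_matrix, min_points):
    dim = distance_matrix.shape[0]  # use the matrix size rather than a hard-coded 150
    # Only the min_points-th smallest entry of each column is needed, so partition suffices.
    core_distances = np.partition(distance_matrix, min_points, axis=0)[min_points]
    core_distance_matrix = core_distances.repeat(dim).reshape((dim, dim))
    result = np.dstack((core_distance_matrix, core_distance_matrix.T, distance_matrix)).max(axis=2)
    return result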
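
As a quick sanity check of what `np.partition` guarantees, here is a small sketch on random data. The seed, array length and `k` are arbitrary illustrative choices. It checks that the k-th entry matches the one a full sort would produce, which is the only entry the core distance needs.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, illustration only
values = rng.random(20)
k = 5

partitioned = np.partition(values, k)

# The k-th entry is the same one a full sort would put there ...
assert partitioned[k] == np.sort(values)[k]
# ... everything before it is no larger and everything after it is no smaller ...
assert partitioned[:k].max() <= partitioned[k] <= partitioned[k + 1:].min()
# ... but neither side is guaranteed to come back in sorted order.
```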

In [11]:
%timeit mutual_reachability_distance_matrix(distance_matrix, 5)

The slowest run took 10.38 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 6.87 ms per loop


In [12]:
%prun mutual_reachability_distance_matrix(distance_matrix, 5)

 

And this is probably about as good as we can do, given that we are solidly in efficient vectorized numpy for pretty much all our operations.
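
One variation that might be worth timing: the same elementwise maximum can be written with broadcasting and `np.maximum`, which avoids building the stacked three-layer array first. This is a sketch under the same core-distance convention as above; the function name `mutual_reachability_broadcast` and the random test data are hypothetical, and no speed-up is claimed without measuring it.

```python
import numpy as np

def mutual_reachability_broadcast(distance_matrix, min_points):
    # Hypothetical variant: same core-distance rule as the functions above.
    core_distances = np.partition(distance_matrix, min_points, axis=0)[min_points]
    # core_distances[:, np.newaxis] varies down the rows (index i),
    # core_distances[np.newaxis, :] varies across the columns (index j).
    core_max = np.maximum(core_distances[:, np.newaxis],
                          core_distances[np.newaxis, :])
    return np.maximum(core_max, distance_matrix)

# Check against the explicit double loop on made-up data.
rng = np.random.default_rng(0)
pts = rng.random((30, 4))
dm = np.sqrt(((pts[:, np.newaxis, :] - pts[np.newaxis, :, :]) ** 2).sum(axis=2))

result = mutual_reachability_broadcast(dm, 5)
core = np.sort(dm, axis=0)[5]
expected = np.array([[max(core[i], core[j], dm[i, j]) for j in range(30)]
                     for i in range(30)])
assert np.array_equal(result, expected)
```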

Tensor Version
----------------

In theory we can abstract this a little and work over all possible values of `min_points` at once. The goal is now to create a distance tensor where each slice of the tensor is the mutual reachability distance matrix for one value of `min_points`. Ideally we would like to be able to do this in one go with numpy rather than iterating. In practice this is easy enough by just pushing things up a dimension.

In [2]:
def mutual_reachability_tensor(distance_matrix):
    dim = distance_matrix.shape[0]
    # core_distances[k, i] is the k-th smallest distance in column i.
    core_distances = np.sort(distance_matrix, axis=0)
    # raw_distance_tensor[k, i, j] == distance_matrix[j, i] for every k.
    raw_distance_tensor = distance_matrix.repeat(dim).reshape((dim,dim,dim)).T
    # core_distance_tensor[k, i, j] == core_distances[k, i].
    core_distance_tensor = core_distances.repeat(dim).reshape((dim,dim,dim))
    result = np.concatenate((core_distance_tensor[...,np.newaxis],
                             core_distance_tensor.transpose(0,2,1)[...,np.newaxis],
                             raw_distance_tensor[...,np.newaxis]), axis=3).max(axis=3)
    return result
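
To illustrate what each slice of such a tensor should contain, here is a self-contained sketch that builds the same kind of tensor with broadcasting and compares one slice against the single-`min_points` computation. The function name `mutual_reachability_tensor_broadcast` and the random data are assumptions made for the example; it is a cross-check of the idea, not a replacement for the function above.

```python
import numpy as np

def mutual_reachability_tensor_broadcast(distance_matrix):
    # Hypothetical broadcasting variant of the tensor construction.
    core_distances = np.sort(distance_matrix, axis=0)   # indexed [k, i]
    per_row = core_distances[:, :, np.newaxis]           # [k, i, 1]
    per_col = core_distances[:, np.newaxis, :]           # [k, 1, j]
    return np.maximum(np.maximum(per_row, per_col),
                      distance_matrix[np.newaxis, :, :])

# Made-up data: 12 random points in 3 dimensions.
rng = np.random.default_rng(1)
pts = rng.random((12, 3))
dm = np.sqrt(((pts[:, np.newaxis, :] - pts[np.newaxis, :, :]) ** 2).sum(axis=2))

tensor = mutual_reachability_tensor_broadcast(dm)

# Slice k should equal the mutual reachability matrix for min_points = k.
k = 4
core_k = np.sort(dm, axis=0)[k]
slice_k = np.maximum(np.maximum(core_k[:, np.newaxis], core_k[np.newaxis, :]), dm)
assert np.array_equal(tensor[k], slice_k)

# Slice 0 uses the zero self-distances as core distances, so it is just dm.
assert np.array_equal(tensor[0], dm)
```

Either construction needs memory cubic in the number of points, so this is only practical for small data sets such as iris.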