scipy.cluster.hierarchy.optimal_leaf_ordering does not return the right ordering (though it does implicitly compute it) #11227
Comments
@jamestwebber any thoughts on this one?
Oh no! @adrianveres might have more useful thoughts, as my PR was primarily an integration of their project polo. But I should have tested this more thoroughly. On the other hand, this issue becomes obsolete if we implement the suggestion in #11228 and switch algorithms.
Ah, now that I have the time to look into this: the given example does not satisfy the triangle inequality, and I don't know whether any solution can guarantee good performance in that case (besides enumeration). So I'm not sure this is as big a defect as I initially feared, if we assume that we're clustering a distance matrix. But I have read the other algorithm and it seems relatively straightforward; I'll try to implement something this week.
The original paper doesn't claim to require a metric, but if you want an example for that case, |
Yeah, I didn't see any mention of requiring a metric, but I have a hunch that it's implicitly assumed (given that they take a tree as input). Maybe not, though. The O(n^3) paper does assume a metric. Clearly I didn't get to this last week, but implementing the new algorithm is on my to-do list. I will refrain from giving an estimate this time, as it will depend on other obligations.
The O(n^4) paper https://people.csail.mit.edu/tommi/papers/BarGifJaa-ismb01.pdf definitely doesn't assume a metric, as it even states the problem as maximizing a sum of similarities rather than minimizing a sum of distances... the O(n^3) paper does state the assumption, but I couldn't find a place where it is actually used in their algorithm.
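For reference, here is a minimal, unoptimized sketch of the kind of dynamic program that paper describes (written from scratch for illustration, not taken from scipy; the name `olo_cost` and the nested-tuple tree representation are my own). The point is that the dissimilarity matrix `D` only ever appears inside a sum, so nothing in the recurrence needs a metric or the triangle inequality:

```python
from functools import lru_cache
import numpy as np

def olo_cost(tree, D):
    """Minimum sum of adjacent dissimilarities over all leaf orderings
    consistent with `tree` (a binary tree given as nested tuples of leaf
    indices, e.g. ((1, 2), (0, 3))). D can be any square array."""

    def leaves(t):
        return (t,) if isinstance(t, int) else leaves(t[0]) + leaves(t[1])

    @lru_cache(maxsize=None)
    def best(t, a, b):
        # minimal cost of an ordering of leaves(t) that starts at a and ends at b
        if isinstance(t, int):
            return 0.0
        L, R = t
        if a in leaves(L) and b in leaves(R):
            # glue the best ordering of L ending at c to the best of R starting at e
            return min(best(L, a, c) + D[c, e] + best(R, e, b)
                       for c in leaves(L) for e in leaves(R))
        if a in leaves(R) and b in leaves(L):
            return best(t, b, a)      # reversing an ordering has the same cost
        return float('inf')           # endpoints must come from different subtrees

    lv = leaves(tree)
    return min(best(tree, a, b) for a in lv for b in lv if a != b)
```

This only returns the cost (roughly O(n^4) work without the paper's pruning); recovering the ordering itself takes backpointers, which the paper also describes.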
Just ran into this issue and came to post a reproduction. Surprised to find there is already an open issue from years ago. Anyway, here's my reproduction if it's any help:

```python
import numpy as np
import matplotlib.pyplot as plt
import itertools
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram, leaves_list

D = np.array([[0.        , 0.46680166, 0.7747411 , 0.79942054],
              [0.46680166, 0.        , 0.28626558, 0.68441929],
              [0.7747411 , 0.28626558, 0.        , 0.74692541],
              [0.79942054, 0.68441929, 0.74692541, 0.        ]])

condensed_dist = squareform(D)
Z_scipy_optimal = linkage(condensed_dist, method='complete', optimal_ordering=True)
dendrogram(Z_scipy_optimal)
plt.show()
```

But the optimal order is actually just 0, 1, 2, 3, and the order with 2 and 1 swapped is clearly a valid one for this tree. Showing that:

```python
def find_adjacent_sum(M, leaf_order):
    # sum the distances between consecutive leaves in the given order
    total_sum = 0
    for i, _ in itertools.islice(enumerate(leaf_order), len(leaf_order) - 1):
        n1, n2 = leaf_order[i], leaf_order[i + 1]
        s = M[n1, n2]
        print(f'Distance between {n1} and {n2} is {s}.')
        total_sum += s
    print(f"Total sum is {total_sum}")
    return total_sum

# this order is actually optimal:
optimal_order = [0, 1, 2, 3]
find_adjacent_sum(D, leaves_list(Z_scipy_optimal))
print()
find_adjacent_sum(D, optimal_order)
```

Output:
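With only four leaves, the claim is also easy to brute-force. This snippet (mine, not from the comment above) enumerates all 24 permutations and should confirm that 0, 1, 2, 3 (or its reverse) minimizes the sum of adjacent distances, so the optimum is not even constrained by the tree here:

```python
import itertools
import numpy as np

D = np.array([[0.        , 0.46680166, 0.7747411 , 0.79942054],
              [0.46680166, 0.        , 0.28626558, 0.68441929],
              [0.7747411 , 0.28626558, 0.        , 0.74692541],
              [0.79942054, 0.68441929, 0.74692541, 0.        ]])

def adjacent_sum(M, order):
    # sum of distances between consecutive leaves in `order`
    return sum(M[a, b] for a, b in zip(order, order[1:]))

best = min(itertools.permutations(range(4)), key=lambda p: adjacent_sum(D, p))
print(best, adjacent_sum(D, best))   # expected: (0, 1, 2, 3), roughly 1.5
```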
As @xplat says, the triangle inequality is not involved at all. I've implemented (variants of) optimal ordering algorithms from these three papers (I used one to find the example above):
The first and third are O(n^3) time. The latter two papers contain several variants, some of which are claimed to offer speed improvements (dubious, at least according to my initial testing). Unfortunately, I haven't optimized my implementations, and implementing them all in Cython, benchmarking them, etc. to get one ready for scipy is a daunting task. Anyway, the algorithm currently in scipy is by no means optimal and probably shouldn't claim to be until someone gets around to fixing it (or providing a better, functioning one).
Believe it or not, the email about this issue is one of the few items that stays in my near-zero inbox, because I always intend to get to it someday...but you have exactly summarized the main obstacle:
If you can share your implementations, that would be helpful; maybe someone else will find the motivation to optimize the code and get it into a PR (clearly I cannot commit to that).
It computes enough data to find the correct answer with dynamic programming, but if you look at how the loop works in identify_swaps, it actually builds the returned ordering greedily as it traverses the tree of clusters bottom-up. The assumption that would justify this is that the optimal ordering for a subtree must appear (possibly reversed) as part of the optimal ordering of the entire tree; but if that were true, the entire computation could use a greedy algorithm and finish in linear time. Unfortunately, it is not true. A small test case:
resulting in
The inter-leaf distances for this ordering are 100, 1, 2 for a total of 103. The optimal leaf ordering is
with leaves 1 and 2 reversed. The inter-leaf distances here are 3, 1, 10 for a total of 14. As you can see by playing with this example, the result can be off by an arbitrarily large factor even when clustering only 4 points.
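For anyone who wants to verify cases like this, here is a small, unoptimized helper (mine, not part of the original report; the names `tree_orderings`, `adjacent_sum`, and `check` are made up) that enumerates every leaf ordering consistent with a scipy linkage matrix, which is exponential in the number of leaves and therefore only suitable for small examples, and compares the true optimum against what `optimal_leaf_ordering` reports:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list, optimal_leaf_ordering
from scipy.spatial.distance import squareform

def tree_orderings(Z, n):
    """Yield every leaf ordering consistent with linkage matrix Z
    (the two children of each internal node may be swapped independently)."""
    def expand(node):
        if node < n:                      # original observation
            yield [node]
        else:                             # internal node: row (node - n) of Z
            left, right = int(Z[node - n, 0]), int(Z[node - n, 1])
            for lo in expand(left):
                for ro in expand(right):
                    yield lo + ro
                    yield ro + lo
    yield from expand(2 * n - 2)          # the root is the last cluster formed

def adjacent_sum(D, order):
    return sum(D[a, b] for a, b in zip(order, order[1:]))

def check(D):
    Z = linkage(squareform(D), method='complete')
    true_best = min(tree_orderings(Z, len(D)), key=lambda o: adjacent_sum(D, o))
    reported = leaves_list(optimal_leaf_ordering(Z, squareform(D)))
    print("true optimum:", true_best, adjacent_sum(D, true_best))
    print("scipy OLO:   ", list(reported), adjacent_sum(D, reported))
```

Running `check` on the 4x4 matrix from the reproduction above should show the mismatch described in the comments, assuming the behavior reported there.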