
scipy.cluster.hierarchy.optimal_leaf_ordering does not return the right ordering (though it does implicitly compute it) #11227

Open · xplat opened this issue Dec 16, 2019 · 8 comments · Labels: defect, scipy.cluster

@xplat commented Dec 16, 2019

It computes enough data to recover the correct answer with dynamic programming, but if you look at how the loop in identify_swaps works, it actually builds the returned ordering greedily as it traverses the tree of clusters bottom-up. That would be justified if the optimal ordering of a subtree always had to appear (possibly reversed) as a contiguous block of the ordering of the entire tree; but if that were true, the entire computation could use a greedy algorithm and finish in linear time. Unfortunately, it is not true. A small test case:

import scipy.cluster.hierarchy as hi
import numpy as np

print(hi.linkage(np.array([2, 10, 10000, 1, 3, 100], dtype=float), optimal_ordering=True))

resulting in

[[2. 1. 1. 2.]
 [4. 0. 2. 3.]
 [3. 5. 3. 4.]]

This corresponds to the leaf order 3, 2, 1, 0. The inter-leaf distances for this ordering are 100, 1, 2, for a total of 103. The optimal leaf ordering is

[[1. 2. 1. 2.]
 [4. 0. 2. 3.]
 [3. 5. 3. 4.]]

with leaves 1 and 2 reversed, i.e. leaf order 3, 1, 2, 0. The inter-leaf distances here are 3, 1, 10, for a total of 14. As you can see by playing with this example, the result can be off by an arbitrarily large factor even when clustering only 4 points.
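For what it's worth, this is easy to verify by brute force: only 2^(n-1) leaf orderings are consistent with a given tree, one per combination of flips of the n-1 internal nodes. A minimal sketch (tree_optimal_sum is an ad-hoc helper for this thread, not a SciPy function):

import itertools
import numpy as np
import scipy.cluster.hierarchy as hi
from scipy.spatial.distance import squareform

def tree_optimal_sum(Z, D, n):
    # Enumerate every leaf ordering consistent with the tree by independently
    # flipping each of the n - 1 internal nodes, and return the best sum.
    def expand(node, flips):
        # Leaf sequence of `node` after flipping the internal nodes marked True.
        if node < n:
            return [node]
        a = expand(int(Z[node - n, 0]), flips)
        b = expand(int(Z[node - n, 1]), flips)
        return b + a if flips[node - n] else a + b
    def adj_sum(order):
        return sum(D[order[i], order[i + 1]] for i in range(n - 1))
    return min(adj_sum(expand(2 * n - 2, flips))
               for flips in itertools.product([False, True], repeat=n - 1))

condensed = np.array([2, 10, 10000, 1, 3, 100], dtype=float)
Z = hi.linkage(condensed)
print(tree_optimal_sum(Z, squareform(condensed), 4))  # prints 14.0; the ordering returned above sums to 103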

@rgommers (Member) commented
@jamestwebber any thoughts on this one?

@rgommers added the defect label on Dec 17, 2019
@jamestwebber (Contributor) commented
Oh no! @adrianveres might have more useful thoughts, as my PR was primarily an integration of their project polo. But I should have tested this more thoroughly.

On the other hand, this issue becomes moot if we implement the suggestion in #11228 and switch algorithms.

@jamestwebber (Contributor) commented
Ah, now that I have time to look into this: the given example does not satisfy the triangle inequality, and I don't know whether any algorithm (besides full enumeration) can guarantee good results in that case. So I'm not sure this is as big a defect as I initially feared, if we assume we're clustering a true distance matrix. But I have read the other algorithm and it seems relatively straightforward; I'll try to implement something this week.
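For concreteness: the original example violates the triangle inequality badly, e.g. d(0,3) = 10000 > d(0,1) + d(1,3) = 2 + 3 = 5. A quick check (triangle_violations is a throwaway helper sketch, not part of SciPy):

import numpy as np
from scipy.spatial.distance import squareform

def triangle_violations(condensed):
    # Return every triple (i, j, k) where the direct distance exceeds the detour.
    D = squareform(np.asarray(condensed, dtype=float))
    n = D.shape[0]
    return [(i, j, k)
            for i in range(n) for j in range(n) for k in range(n)
            if len({i, j, k}) == 3 and D[i, k] > D[i, j] + D[j, k]]

print(triangle_violations([2, 10, 10000, 1, 3, 100]))  # non-empty: not a metric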

@xplat (Author) commented Jan 11, 2020

The original paper doesn't claim to require a metric, but if you want an example for that case, [101,103,203,100,102,202] works. By padding the numbers with more zeros in the middle (e.g. [100001,100003,200003,100000,100002,200002]) you can get arbitrarily close to a 4/3 ratio of produced/optimal. You probably can't do "better" than a fixed ratio here, at least without adding more points.
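Assuming the tree_optimal_sum helper sketched earlier in this thread is in scope, comparing what linkage returns against the tree-optimal sum for this metric example looks like:

import numpy as np
import scipy.cluster.hierarchy as hi
from scipy.spatial.distance import squareform

condensed = np.array([101, 103, 203, 100, 102, 202], dtype=float)
D = squareform(condensed)
Z = hi.linkage(condensed, optimal_ordering=True)
order = list(hi.leaves_list(Z))
produced = sum(D[order[i], order[i + 1]] for i in range(3))
# Per the comment above, produced/optimal approaches 4/3 as the padding grows.
print(produced, tree_optimal_sum(Z, D, 4))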

@jamestwebber (Contributor) commented
Yeah, I didn't see any mention of requiring a metric, but I have a hunch that it's implicitly assumed (given that they have a tree as input). Maybe not, though. The O(n^3) paper, however, does assume a metric.

Clearly I didn't get to this last week, but implementing the new algorithm is on my to-do list. I will refrain from giving an estimate this time as it will depend on other obligations.

@xplat (Author) commented Jan 12, 2020

The O(n^4) paper (https://people.csail.mit.edu/tommi/papers/BarGifJaa-ismb01.pdf) definitely doesn't assume a metric; it even states the problem as maximizing a sum of similarities rather than minimizing a sum of distances. The O(n^3) paper does state the assumption, but I couldn't find a place where it is actually used in their algorithm.

@deklanw commented Jul 21, 2021

Just ran into this issue, and came to post a reproduction. Surprised to find there is already an open issue from years ago.

Anyway, here's my reproduction if it's any help:

import numpy as np
import matplotlib.pyplot as plt

from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram, leaves_list

D = np.array([[0.        , 0.46680166, 0.7747411 , 0.79942054],
       [0.46680166, 0.        , 0.28626558, 0.68441929],
       [0.7747411 , 0.28626558, 0.        , 0.74692541],
       [0.79942054, 0.68441929, 0.74692541, 0.        ]])

condensed_dist = squareform(D)
Z_scipy_optimal = linkage(condensed_dist, method='complete', optimal_ordering=True)
dendrogram(Z_scipy_optimal)
plt.show()

Output:
[dendrogram image: leaves drawn in the order 0, 2, 1, 3]

But the optimal order is actually just 0, 1, 2, 3: swapping leaves 2 and 1 back is a valid ordering of the same tree, and it gives a smaller total adjacent distance.

Showing that:

def find_adjacent_sum(M, leaf_order):
    # Sum the distances between each pair of adjacent leaves in the ordering.
    total_sum = 0

    for i in range(len(leaf_order) - 1):
        n1, n2 = leaf_order[i], leaf_order[i + 1]
        s = M[n1, n2]

        print(f'Distance between {n1} and {n2} is {s}.')

        total_sum += s

    print(f"Total sum is {total_sum}")

    return total_sum

# this order is actually optimal:
optimal_order = [0, 1, 2, 3]

find_adjacent_sum(D, leaves_list(Z_scipy_optimal))
print()
find_adjacent_sum(D, optimal_order)

Output:

Distance between 0 and 2 is 0.7747411.
Distance between 2 and 1 is 0.28626558.
Distance between 1 and 3 is 0.68441929.
Total sum is 1.7454259699999999

Distance between 0 and 1 is 0.46680166.
Distance between 1 and 2 is 0.28626558.
Distance between 2 and 3 is 0.74692541.
Total sum is 1.49999265

As @xplat says, the problem does not involve the triangle inequality at all.

I've implemented (variants of) optimal ordering algorithms from these three papers (I used one to find the example above):

Rainer E. Burkard, Vladimir G. Deĭneko, and Gerhard J. Woeginger, “The Travelling Salesman and the PQ-Tree,” Mathematics of Operations Research 23, no. 3 (August 1, 1998): 613–23, https://doi.org/10.1287/moor.23.3.613.

Ziv Bar-Joseph, David K. Gifford, and Tommi S. Jaakkola, “Fast Optimal Leaf Ordering for Hierarchical Clustering,” Bioinformatics 17, no. suppl_1 (June 1, 2001): S22–29, https://doi.org/10.1093/bioinformatics/17.suppl_1.S22.

Ziv Bar-Joseph et al., “K-Ary Clustering with Optimal Leaf Ordering for Gene Expression Data,” Bioinformatics 19, no. 9 (June 12, 2003): 1070–78, https://doi.org/10.1093/bioinformatics/btg030.

The first and third are O(n^3) time. The latter two papers contain several variants, some of which are claimed to offer speed improvements (dubiously, at least according to my initial testing). Unfortunately, I haven't optimized my implementations, and implementing them all in Cython, benchmarking them, and so on to get one ready for SciPy is a daunting task.

Anyway, the algorithm currently in SciPy is by no means optimal and probably shouldn't claim to be until someone gets around to fixing it (or replacing it with a correct one).
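For readers who want the flavor of the dynamic program, here is a minimal, unoptimized sketch of the basic O(n^4) recurrence from the Bar-Joseph et al. (2001) paper, operating directly on a SciPy linkage matrix. It returns only the optimal cost (recovering the ordering itself needs backtracking); it is an illustration, not the paper's optimized algorithm and not any of the implementations mentioned above:

import numpy as np
from scipy.spatial.distance import squareform

def optimal_leaf_cost(Z, condensed):
    # M[v][(i, j)]: minimal adjacent-distance sum over orderings of the
    # subtree rooted at v that begin at leaf i and end at leaf j.
    D = squareform(condensed)
    n = D.shape[0]
    M = {v: {(v, v): 0.0} for v in range(n)}      # leaves cost nothing
    for r in range(n - 1):                        # internal nodes, bottom-up
        v = n + r
        L, R = int(Z[r, 0]), int(Z[r, 1])
        Mv = {}
        for (i, k), cl in M[L].items():           # left block runs i ... k
            for (m, j), cr in M[R].items():       # right block runs m ... j
                c = cl + D[k, m] + cr
                # Reversing an ordering keeps its cost, so record both ends.
                for key in ((i, j), (j, i)):
                    if c < Mv.get(key, np.inf):
                        Mv[key] = c
        M[v] = Mv
    return min(M[2 * n - 2].values())

# On xplat's first example this gives 14.0, matching the brute force above.
print(optimal_leaf_cost(
    np.array([[1, 2, 1, 2], [4, 0, 2, 3], [3, 5, 3, 4]], dtype=float),
    np.array([2, 10, 10000, 1, 3, 100], dtype=float)))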

@jamestwebber (Contributor) commented

Believe it or not, the email about this issue is one of the few items that stays in my near-zero inbox, because I always intend to get to it someday...but you have exactly summarized the main obstacle:

Unfortunately, I haven't optimized my implementations, and implementing them all in Cython and benchmarking etc to get one ready for scipy is a daunting task.

If you can share your implementations that would be helpful, maybe someone else will find the motivation to optimize the code and get it into a PR (clearly I cannot commit to that).
