
scipy.cluster.hierarchy.optimal_leaf_ordering does not return the right ordering (though it does implicitly compute it) #11227

Open · xplat opened this issue Dec 16, 2019 · 8 comments · Labels: defect, scipy.cluster

@xplat commented Dec 16, 2019

It computes enough data to recover the correct answer with dynamic programming, but if you look at how the loop in identify_swaps works, it actually builds the returned ordering greedily as it traverses the tree of clusters bottom-up. That would be justified if the optimal ordering of a subtree always had to appear (possibly reversed) as a contiguous block of the ordering of the entire tree; but if that were true, the entire computation could use a greedy algorithm and finish in linear time. Unfortunately, it is not true. A small test case:

import scipy.cluster.hierarchy as hi
import numpy as np

print(hi.linkage(np.array([2, 10, 10000, 1, 3, 100], dtype=float), optimal_ordering=True))

resulting in

[[2. 1. 1. 2.]
 [4. 0. 2. 3.]
 [3. 5. 3. 4.]]

This corresponds to the leaf order 3, 2, 1, 0. The inter-leaf distances for this ordering are 100, 1, 2, for a total of 103. The optimal leaf ordering is

[[1. 2. 1. 2.]
 [4. 0. 2. 3.]
 [3. 5. 3. 4.]]

with leaves 1 and 2 reversed, i.e. leaf order 3, 1, 2, 0. The inter-leaf distances here are 3, 1, 10, for a total of 14. As you can see by playing with this example, the result can be off by an arbitrarily large factor even when clustering only 4 points.
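For what it's worth, this is easy to verify by brute force: only 2^(n-1) leaf orderings are consistent with a given tree, one per combination of flips of the n-1 internal nodes. A minimal sketch (tree_optimal_sum is an ad-hoc helper for this thread, not a SciPy function):

import itertools
import numpy as np
import scipy.cluster.hierarchy as hi
from scipy.spatial.distance import squareform

def tree_optimal_sum(Z, D, n):
    # Enumerate every leaf ordering consistent with the tree by independently
    # flipping each of the n - 1 internal nodes, and return the best sum.
    def expand(node, flips):
        # Leaf sequence of `node` after flipping the internal nodes marked True.
        if node < n:
            return [node]
        a = expand(int(Z[node - n, 0]), flips)
        b = expand(int(Z[node - n, 1]), flips)
        return b + a if flips[node - n] else a + b
    def adj_sum(order):
        return sum(D[order[i], order[i + 1]] for i in range(n - 1))
    return min(adj_sum(expand(2 * n - 2, flips))
               for flips in itertools.product([False, True], repeat=n - 1))

condensed = np.array([2, 10, 10000, 1, 3, 100], dtype=float)
Z = hi.linkage(condensed)
print(tree_optimal_sum(Z, squareform(condensed), 4))  # prints 14.0; the ordering returned above sums to 103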

@rgommers (Member) commented
@jamestwebber any thoughts on this one?

@rgommers added the defect label on Dec 17, 2019
@jamestwebber (Contributor) commented
Oh no! @adrianveres might have more useful thoughts, as my PR was primarily an integration of their project polo. But I should have tested this more thoroughly.

On the other hand, this issue becomes moot if we implement the suggestion in #11228 and switch algorithms.

@jamestwebber (Contributor) commented
Ah, now that I have time to look into this: the given example does not satisfy the triangle inequality, and I don't know whether any algorithm (besides full enumeration) can guarantee good results in that case. So I'm not sure this is as big a defect as I initially feared, if we assume we're clustering a true distance matrix. But I have read the other algorithm and it seems relatively straightforward; I'll try to implement something this week.
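For concreteness: the original example violates the triangle inequality badly, e.g. d(0,3) = 10000 > d(0,1) + d(1,3) = 2 + 3 = 5. A quick check (triangle_violations is a throwaway helper sketch, not part of SciPy):

import numpy as np
from scipy.spatial.distance import squareform

def triangle_violations(condensed):
    # Return every triple (i, j, k) where the direct distance exceeds the detour.
    D = squareform(np.asarray(condensed, dtype=float))
    n = D.shape[0]
    return [(i, j, k)
            for i in range(n) for j in range(n) for k in range(n)
            if len({i, j, k}) == 3 and D[i, k] > D[i, j] + D[j, k]]

print(triangle_violations([2, 10, 10000, 1, 3, 100]))  # non-empty: not a metric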

@xplat (Author) commented Jan 11, 2020

The original paper doesn't claim to require a metric, but if you want an example for that case, [101,103,203,100,102,202] works. By padding the numbers with more zeros in the middle (e.g. [100001,100003,200003,100000,100002,200002]) you can get arbitrarily close to a 4/3 ratio of produced/optimal. You probably can't do "better" than a fixed ratio here, at least without adding more points.
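Assuming the tree_optimal_sum helper sketched earlier in this thread is in scope, comparing what linkage returns against the tree-optimal sum for this metric example looks like:

import numpy as np
import scipy.cluster.hierarchy as hi
from scipy.spatial.distance import squareform

condensed = np.array([101, 103, 203, 100, 102, 202], dtype=float)
D = squareform(condensed)
Z = hi.linkage(condensed, optimal_ordering=True)
order = list(hi.leaves_list(Z))
produced = sum(D[order[i], order[i + 1]] for i in range(3))
# Per the comment above, produced/optimal approaches 4/3 as the padding grows.
print(produced, tree_optimal_sum(Z, D, 4))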

@jamestwebber (Contributor) commented
Yeah, I didn't see any mention of requiring a metric, but I have a hunch that it's implicitly assumed (given that they have a tree as input). Maybe not, though. The O(n^3) paper, however, does assume a metric.

Clearly I didn't get to this last week, but implementing the new algorithm is on my to-do list. I will refrain from giving an estimate this time as it will depend on other obligations.

@xplat (Author) commented Jan 12, 2020

The O(n^4) paper (https://people.csail.mit.edu/tommi/papers/BarGifJaa-ismb01.pdf) definitely doesn't assume a metric; it even states the problem as maximizing a sum of similarities rather than minimizing a sum of distances. The O(n^3) paper does state the assumption, but I couldn't find a place where it is actually used in their algorithm.

@deklanw commented Jul 21, 2021

Just ran into this issue, and came to post a reproduction. Surprised to find there is already an open issue from years ago.

Anyway, here's my reproduction if it's any help:

import numpy as np
import matplotlib.pyplot as plt

from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram, leaves_list

D = np.array([[0.        , 0.46680166, 0.7747411 , 0.79942054],
       [0.46680166, 0.        , 0.28626558, 0.68441929],
       [0.7747411 , 0.28626558, 0.        , 0.74692541],
       [0.79942054, 0.68441929, 0.74692541, 0.        ]])

condensed_dist = squareform(D)
Z_scipy_optimal = linkage(condensed_dist, method='complete', optimal_ordering=True)
dendrogram(Z_scipy_optimal)
plt.show()

Output:
[dendrogram image: leaves drawn in the order 0, 2, 1, 3]

But the optimal order is actually just 0, 1, 2, 3: swapping leaves 2 and 1 back is a valid ordering of the same tree, and it gives a smaller total adjacent distance.

Showing that:

def find_adjacent_sum(M, leaf_order):
    # Sum the distances between each pair of adjacent leaves in the ordering.
    total_sum = 0

    for i in range(len(leaf_order) - 1):
        n1, n2 = leaf_order[i], leaf_order[i + 1]
        s = M[n1, n2]

        print(f'Distance between {n1} and {n2} is {s}.')

        total_sum += s

    print(f"Total sum is {total_sum}")

    return total_sum

# this order is actually optimal:
optimal_order = [0, 1, 2, 3]

find_adjacent_sum(D, leaves_list(Z_scipy_optimal))
print()
find_adjacent_sum(D, optimal_order)

Output:

Distance between 0 and 2 is 0.7747411.
Distance between 2 and 1 is 0.28626558.
Distance between 1 and 3 is 0.68441929.
Total sum is 1.7454259699999999

Distance between 0 and 1 is 0.46680166.
Distance between 1 and 2 is 0.28626558.
Distance between 2 and 3 is 0.74692541.
Total sum is 1.49999265

As @xplat says, the problem does not involve the triangle inequality at all.

I've implemented (variants of) optimal ordering algorithms from these three papers (I used one to find the example above):

Rainer E. Burkard, Vladimir G. Deĭneko, and Gerhard J. Woeginger, “The Travelling Salesman and the PQ-Tree,” Mathematics of Operations Research 23, no. 3 (August 1, 1998): 613–23, https://doi.org/10.1287/moor.23.3.613.

Ziv Bar-Joseph, David K. Gifford, and Tommi S. Jaakkola, “Fast Optimal Leaf Ordering for Hierarchical Clustering,” Bioinformatics 17, no. suppl_1 (June 1, 2001): S22–29, https://doi.org/10.1093/bioinformatics/17.suppl_1.S22.

Ziv Bar-Joseph et al., “K-Ary Clustering with Optimal Leaf Ordering for Gene Expression Data,” Bioinformatics 19, no. 9 (June 12, 2003): 1070–78, https://doi.org/10.1093/bioinformatics/btg030.

The first and third are O(n^3) time. The latter two papers contain several variants, some of which are claimed to offer speed improvements (dubiously, at least according to my initial testing). Unfortunately, I haven't optimized my implementations, and implementing them all in Cython, benchmarking them, and so on to get one ready for SciPy is a daunting task.

Anyway, the algorithm currently in SciPy is by no means optimal and probably shouldn't claim to be until someone gets around to fixing it (or replacing it with a correct one).
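For readers who want the flavor of the dynamic program, here is a minimal, unoptimized sketch of the basic O(n^4) recurrence from the Bar-Joseph et al. (2001) paper, operating directly on a SciPy linkage matrix. It returns only the optimal cost (recovering the ordering itself needs backtracking); it is an illustration, not the paper's optimized algorithm and not any of the implementations mentioned above:

import numpy as np
from scipy.spatial.distance import squareform

def optimal_leaf_cost(Z, condensed):
    # M[v][(i, j)]: minimal adjacent-distance sum over orderings of the
    # subtree rooted at v that begin at leaf i and end at leaf j.
    D = squareform(condensed)
    n = D.shape[0]
    M = {v: {(v, v): 0.0} for v in range(n)}      # leaves cost nothing
    for r in range(n - 1):                        # internal nodes, bottom-up
        v = n + r
        L, R = int(Z[r, 0]), int(Z[r, 1])
        Mv = {}
        for (i, k), cl in M[L].items():           # left block runs i ... k
            for (m, j), cr in M[R].items():       # right block runs m ... j
                c = cl + D[k, m] + cr
                # Reversing an ordering keeps its cost, so record both ends.
                for key in ((i, j), (j, i)):
                    if c < Mv.get(key, np.inf):
                        Mv[key] = c
        M[v] = Mv
    return min(M[2 * n - 2].values())

# On xplat's first example this gives 14.0, matching the brute force above.
print(optimal_leaf_cost(
    np.array([[1, 2, 1, 2], [4, 0, 2, 3], [3, 5, 3, 4]], dtype=float),
    np.array([2, 10, 10000, 1, 3, 100], dtype=float)))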

@jamestwebber (Contributor) commented

Believe it or not, the email about this issue is one of the few items that stays in my near-zero inbox, because I always intend to get to it someday...but you have exactly summarized the main obstacle:

Unfortunately, I haven't optimized my implementations, and implementing them all in Cython and benchmarking etc to get one ready for scipy is a daunting task.

If you can share your implementations that would be helpful, maybe someone else will find the motivation to optimize the code and get it into a PR (clearly I cannot commit to that).
