Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: cluster: Add custom leaves ordering for dendrogram #12049

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

erheron
Copy link
Contributor

@erheron erheron commented May 7, 2020

Customizing leaves ordering inscipy.cluster.dendrogram via cut-tree (linkage matrix) reordering

Reference issue

Ref. #11060

What does this implement/fix?

The objective is to provide the possibility to specify custom leaf ordering in dendrogram. As for now I've chosen a way in which leaves_order parameter must be a permutation of indices of original array.

Additional information

This PR will provide described functionality for any input in which leaves of the dendrogram are the original observations, and in my opinion considering truncate_mode option doesn't really make sense. Time and memory complexity are linear, and that's optimal

As for now the following things must be improved:
 - time complexity (now it's O(n^2))
 - memory usage

Ref. scipy#11060
@miladsade96 miladsade96 added enhancement A new feature or improvement scipy.cluster labels May 7, 2020
@erheron
Copy link
Contributor Author

erheron commented May 7, 2020

@vinisalazar @rgommers I'd be grateful if You express Your thoughts about it. Ahead of time/memory complexity (which I'm going to improve), but rather is it going in the right direction or not in general.

@erheron
Copy link
Contributor Author

erheron commented May 7, 2020

I have a feeling I should leave one more comment about the algorithm i use now: it basically rebuilds the tree encoded by linkage matrix so that order of leaves of the new tree corresponds to leaves_order parameter.

leaves_order : iterable, optional
Plots the leaves in the order specified by a vector of
original observation indices. If the vector contains duplicates
or results in a crossing, an exception will be thrown. Passing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does a user know that a crossing will not happen, and if it happens what can they do about it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've chosen an approach in which user must guarantee that the ordering is correct. As I've mentioned above, this function basically reorders the cut tree. If I understand everything correctly, in case provided ordering isn't correct user will get an exception from here:

is_valid_linkage(Z, throw=True, name='Z')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm guessing that it's more common than not to get a crossing when a user puts in some random ordering, the envisioned workflow is:

  • Use without leaves_order
  • Review the plot, decide which items to swap
  • Map node labels to indices to be used as leave ordering
  • Add the right leaves_order

Without trying it, I'm still hesitant on how practical this is going to be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. by 'envisioned workflow' You mean user's workflow, right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, this will actually be quite useful. In my use case, I'm looking at how behavior changes over time and combining time intervals based on the distance between each interval. Also, I want to keep the intervals in their inherent, temporal ordering.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by 'envisioned workflow' You mean user's workflow, right?

yes indeed

thanks @ihowell, that is helpful feedback.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rgommers @ihowell in the recent commit I've added full parameter correctness checking (e.g. if it's indeed a permutation, and if it doesn't result in a crossing). I'm now wondering whether it's possible (and reasonable) with my approach to support truncate_mode option.
I'm tempted to ask, whether it would be possible to say, that this new option is incompatible with truncate_mode parameter.
The reason is: suppose one uses truncate_mode and gets some new leaves, which are in reality inner nodes. As far as I know, now the only way to label these nodes (which are now leaves) is to provide leaf_label_func. I don't see how in this case user should say that he wants to reorder these new leaves. Would be grateful if You express Your opinions.

@erheron
Copy link
Contributor Author

erheron commented May 12, 2020

One more thingy I'm thinking about: because all this function does is cut-tree reordering, maybe we should move it to hierarchy.linkage as optional parameter?

@ihowell
Copy link

ihowell commented May 13, 2020

Also, just wanted to let you know that I am working off of this branch (as this feature is critical to my research), so I'll let you know if anything feels odd. Thank you so much @erheron!!!

@ihowell
Copy link

ihowell commented May 15, 2020

Alright, I have a bug report and a constructed minimal example. When specifying the linkage order, if distances are float values, they are floored, losing the actual metric values. For example, consider this code:

import numpy as np
import matplotlib.pyplot as plt

from scipy.cluster.hierarchy import dendrogram

linkages = np.array([[0, 1, 0.3, 2], [3, 2, 0.7, 3]])

fig = plt.figure(figsize=(25, 10))
de = dendrogram(linkages)
plt.show()

It generates the image:

ex_1

But, suppose we want to reverse the leaf order:

import numpy as np
import matplotlib.pyplot as plt

from scipy.cluster.hierarchy import dendrogram

linkages = np.array([[0, 1, 0.3, 2], [3, 2, 0.7, 3]])

fig = plt.figure(figsize=(25, 10))
de = dendrogram(linkages, leaves_order=[2, 1, 0])
plt.show()

I get the following:
ex_2

ch = np.zeros((n - 1, 2))
for i in range(n - 1):
ch[i] = (Z[i][0], Z[i][1]) # ch[i] = children of node n+i
Z = Z.astype('int32')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is the culprit here. Casting to int gets rid of float valued metrics.

Copy link
Contributor Author

@erheron erheron May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ihowell! I had a feeling that I've forgotten about something, and here it is. I was kind of aware of this, so I'm sorry for not letting know about that :|
I'm going to improve the complexity from O(n^2) to O(n), and this may lead to other algorithm. I'll keep that bug in mind. Hoping to produce new code in a week

Copy link
Contributor Author

@erheron erheron May 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ihowell improved to linear and fixed this bug with casting in 00954ac ! What do You think about moving this to linkage?
The rational behind this would be as follows:
this function reorders cut-tree e.g. the tree encoded by linkage matrix. Now suppose someone draws a dendrogram, but uses parameter like truncate_mode. We would have a plot of dendrogram, in which "leaves" are really clusters (e.g. they are not the original leaves of a tree, but inner nodes). As for now, my function doesn't do something like this. I think it's doable, but:

  • Maybe the code in 00954ac is useful "as is" (e.g. as just something which reorders the tree)
  • Code which takes into account options like truncate_mode might be little more complicated and require more time, however I think I can tackle this :)
    I'd be grateful if You tell your thoughs about it.. @rgommers @ihowell

One more thingy I'm thinking about: because all this function does is cut-tree reordering, maybe we should move it to hierarchy.linkage as optional parameter?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I don't think it belongs in linkage, but I haven't dived deep into the exact purpose of linkage as opposed to dendrogram and where the line is drawn. From my background though, linkage seems like the data model and dendrogram is the view. Keeping these separate would mean that we keep the re-ordering code in dendrogram, as the data model doesn't really care. But again, I could be way off base here. I'll test out the code and get back to you!

Copy link
Contributor Author

@erheron erheron May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ihowell thanks for testing!!! I didn't add the support for checking parameter consistensy yet. Hope to produce it soon, but for now it would rather work as "if You see something weird on the screen then You must've messed up with leaves_order "

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erheron Floating point distances are showing up again, thanks! But there is one UI issue. You require that the input to dendrogram implement the shape function, such as a numpy array. This is breaking from the current dendrogram implementation which at the very least allows for a list of linkages, no numpy needed.

Copy link
Contributor Author

@erheron erheron May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ihowell fixed that in recent commit, thanks! Full parameter consistency checking, as well as considering truncate_mode (and maybe other) parameters which affect dendrogram is on its way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ihowell, as far as I'm concerned, this is the final code (I wrote more in another thread above). What do You think?

As for now this algorithm doesn't check if the ordering provided is
good or not, just assumes the former. The complexity has been improved
from quadratic to linear (time and memory). Integrity check to be added
later.
@larsoner
Copy link
Member

FYI AppVeyor failure is just #12160

@erheron erheron force-pushed the dendrogram-leaf-ordering-and-color branch from adb621a to 7d63405 Compare May 23, 2020 08:40
Additionally, provide two unit tests for this
@erheron erheron changed the title [WIP] custom leaves ordering for dendrogram Add custom leaves ordering for dendrogram May 31, 2020
@erheron erheron requested a review from rgommers June 2, 2020 09:46
@erheron
Copy link
Contributor Author

erheron commented Jul 9, 2020

@ihowell @rgommers, I haven't heard from You for a while. Maybe someone of You guys want to take a look at this? Would be grateful

p - array of parents in tree T
height[i] is a value of Z[i][3] e.g. cost of creating node (n+i)
num_leaves[i] = number of leaves of a subtree rooted in a node i in T
"""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add in Parameters and Returns such as in other functions. While this isn't a public function, these should still be documented as it is fairly complex function.

@@ -930,6 +930,26 @@ def test_dendrogram_colors(self):
# reset color palette (global list)
set_link_color_palette(None)

@pytest.mark.parametrize("order", [
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a parameterization with a duplicated value? You mention it in the description of the function, so we should probably explicitly test it here. Also, there should be a test for the happy path, where everything works as it should, and for the edge case of having an empty tree and empty list of reordering (or similar appropriate edge case).

Copy link
Contributor Author

@erheron erheron Jul 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand You correctly, than line 936 contains exactly duplicated value for permutation. Maybe I should change the comment in that line. The rest of the tests You mention is on the way!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ihowell done!

@erheron erheron force-pushed the dendrogram-leaf-ordering-and-color branch from f935d0a to 9f38380 Compare July 9, 2020 20:38
@erheron
Copy link
Contributor Author

erheron commented Jul 12, 2020

I've looked at build issues, and they look unrelated to PR (rather problems with pipeline).

@erheron
Copy link
Contributor Author

erheron commented Nov 8, 2020

@ihowell @rgommers I see that this PR had gone stale. Maybe there is something we can do about that, given that a reasonable amount of work had been put into this?

@aleneum
Copy link

aleneum commented Jan 11, 2021

@erheron, thanks a lot for this. Would be nice to see this merged.

@sjwright90
Copy link

@erheron Hi! for what it's worth I implemented your re-ordering function and it worked without issue, thanks it was very helpful

@erheron
Copy link
Contributor Author

erheron commented Jul 6, 2023

Thanks guys @aleneum @sjwright90, it's been years since I worked on this, and really heart-warming to know it's been useful to you 💔

@rgommers rgommers added this to the 1.12.0 milestone Jul 11, 2023
@rgommers
Copy link
Member

@ihowell @rgommers I see that this PR had gone stale. Maybe there is something we can do about that, given that a reasonable amount of work had been put into this?

I'm sorry, this got completely lost at a time when I didn't have much bandwidth for SciPy. And I'm more or less the only active maintainer for scipy.cluster, so PRs with algorithmic changes got stuck. I just did a set of triaging of everything that's open for cluster; I added the 1.12.0 milestone for it now - will try to get it in soon.

@tylerjereddy
Copy link
Contributor

I'm more or less the only active maintainer for scipy.cluster

@rgommers I think you're pretty busy with i.e., NumPy stuff, do you want me to bump the milestone here again, or perhaps leave it right to the last minute ~ December 6th just in case?

@rgommers
Copy link
Member

Yes, let's leave it for now. I'll have to look at what's still on my list for 1.12.x

@tylerjereddy tylerjereddy modified the milestones: 1.12.0, 1.13.0 Dec 10, 2023
@tylerjereddy tylerjereddy changed the title Add custom leaves ordering for dendrogram ENH: Add custom leaves ordering for dendrogram Mar 12, 2024
@tylerjereddy
Copy link
Contributor

Would be nice addition, but not critical for 1.13.0 branching ~Sunday, so bumping milestone, especially given a few months of inactivity here.

@tylerjereddy tylerjereddy modified the milestones: 1.13.0, 1.14.0 Mar 12, 2024
@vinisalazar
Copy link

Hi @aleneum @erheron @sjwright90 I'm trying to give this one a go, but am having trouble installing SciPy from source using this branch. Could you provide any advice on how to do this?

I tried creating a dev environment and running pip install ., python setup.py install, and neither worked.

Thank you!!

@rgommers
Copy link
Member

@vinisalazar the guide at http://scipy.github.io/devdocs/building/index.html should help. pip install is the right command, but you're probably missing a build dependency.

This is a pure Python PR, so another thing you could do is take the diff of this PR and apply it to an installed SciPy 1.13.0 manually.

@vinisalazar
Copy link

Thank you @rgommers, I managed to get the install to work. I'll comment here if I can get the feature to work on my use case as well.

@lucascolley lucascolley changed the title ENH: Add custom leaves ordering for dendrogram ENH: cluster: Add custom leaves ordering for dendrogram May 17, 2024
@tylerjereddy
Copy link
Contributor

This is a "nice to have," but not critical for the 1.14.0 release, so I'll bump the milestone again. Just roll it back if the PR comes back to life and gets merged before branching.

@tylerjereddy tylerjereddy modified the milestones: 1.14.0, 1.15.0 May 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.cluster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants