-
-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: cluster: Add custom leaves ordering for dendrogram #12049
base: main
Are you sure you want to change the base?
Conversation
As for now the following things must be improved: - time complexity (now it's O(n^2)) - memory usage Ref. scipy#11060
@vinisalazar @rgommers I'd be grateful if You express Your thoughts about it. Ahead of time/memory complexity (which I'm going to improve), but rather is it going in the right direction or not in general. |
I have a feeling I should leave one more comment about the algorithm i use now: it basically rebuilds the tree encoded by linkage matrix so that order of leaves of the new tree corresponds to |
scipy/cluster/hierarchy.py
Outdated
leaves_order : iterable, optional | ||
Plots the leaves in the order specified by a vector of | ||
original observation indices. If the vector contains duplicates | ||
or results in a crossing, an exception will be thrown. Passing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does a user know that a crossing will not happen, and if it happens what can they do about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've chosen an approach in which user must guarantee that the ordering is correct. As I've mentioned above, this function basically reorders the cut tree. If I understand everything correctly, in case provided ordering isn't correct user will get an exception from here:
scipy/scipy/cluster/hierarchy.py
Line 3280 in 8a6ac2c
is_valid_linkage(Z, throw=True, name='Z') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I'm guessing that it's more common than not to get a crossing when a user puts in some random ordering, the envisioned workflow is:
- Use without
leaves_order
- Review the plot, decide which items to swap
- Map node labels to indices to be used as leave ordering
- Add the right
leaves_order
Without trying it, I'm still hesitant on how practical this is going to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. by 'envisioned workflow' You mean user's workflow, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me, this will actually be quite useful. In my use case, I'm looking at how behavior changes over time and combining time intervals based on the distance between each interval. Also, I want to keep the intervals in their inherent, temporal ordering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by 'envisioned workflow' You mean user's workflow, right?
yes indeed
thanks @ihowell, that is helpful feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rgommers @ihowell in the recent commit I've added full parameter correctness checking (e.g. if it's indeed a permutation, and if it doesn't result in a crossing). I'm now wondering whether it's possible (and reasonable) with my approach to support truncate_mode
option.
I'm tempted to ask, whether it would be possible to say, that this new option is incompatible with truncate_mode
parameter.
The reason is: suppose one uses truncate_mode
and gets some new leaves, which are in reality inner nodes. As far as I know, now the only way to label these nodes (which are now leaves) is to provide leaf_label_func
. I don't see how in this case user should say that he wants to reorder these new leaves. Would be grateful if You express Your opinions.
One more thingy I'm thinking about: because all this function does is cut-tree reordering, maybe we should move it to |
Also, just wanted to let you know that I am working off of this branch (as this feature is critical to my research), so I'll let you know if anything feels odd. Thank you so much @erheron!!! |
Alright, I have a bug report and a constructed minimal example. When specifying the linkage order, if distances are float values, they are floored, losing the actual metric values. For example, consider this code:
It generates the image: But, suppose we want to reverse the leaf order:
|
scipy/cluster/hierarchy.py
Outdated
ch = np.zeros((n - 1, 2)) | ||
for i in range(n - 1): | ||
ch[i] = (Z[i][0], Z[i][1]) # ch[i] = children of node n+i | ||
Z = Z.astype('int32') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is the culprit here. Casting to int gets rid of float valued metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ihowell! I had a feeling that I've forgotten about something, and here it is. I was kind of aware of this, so I'm sorry for not letting know about that :|
I'm going to improve the complexity from O(n^2) to O(n), and this may lead to other algorithm. I'll keep that bug in mind. Hoping to produce new code in a week
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ihowell improved to linear and fixed this bug with casting in 00954ac ! What do You think about moving this to linkage?
The rational behind this would be as follows:
this function reorders cut-tree e.g. the tree encoded by linkage matrix. Now suppose someone draws a dendrogram, but uses parameter like truncate_mode
. We would have a plot of dendrogram, in which "leaves" are really clusters (e.g. they are not the original leaves of a tree, but inner nodes). As for now, my function doesn't do something like this. I think it's doable, but:
- Maybe the code in 00954ac is useful "as is" (e.g. as just something which reorders the tree)
- Code which takes into account options like
truncate_mode
might be little more complicated and require more time, however I think I can tackle this :)
I'd be grateful if You tell your thoughs about it.. @rgommers @ihowell
One more thingy I'm thinking about: because all this function does is cut-tree reordering, maybe we should move it to
hierarchy.linkage
as optional parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I don't think it belongs in linkage, but I haven't dived deep into the exact purpose of linkage as opposed to dendrogram and where the line is drawn. From my background though, linkage seems like the data model and dendrogram is the view. Keeping these separate would mean that we keep the re-ordering code in dendrogram, as the data model doesn't really care. But again, I could be way off base here. I'll test out the code and get back to you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ihowell thanks for testing!!! I didn't add the support for checking parameter consistensy yet. Hope to produce it soon, but for now it would rather work as "if You see something weird on the screen then You must've messed up with leaves_order
"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@erheron Floating point distances are showing up again, thanks! But there is one UI issue. You require that the input to dendrogram
implement the shape
function, such as a numpy array. This is breaking from the current dendrogram implementation which at the very least allows for a list of linkages, no numpy needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ihowell fixed that in recent commit, thanks! Full parameter consistency checking, as well as considering truncate_mode
(and maybe other) parameters which affect dendrogram is on its way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ihowell, as far as I'm concerned, this is the final code (I wrote more in another thread above). What do You think?
As for now this algorithm doesn't check if the ordering provided is good or not, just assumes the former. The complexity has been improved from quadratic to linear (time and memory). Integrity check to be added later.
FYI AppVeyor failure is just #12160 |
adb621a
to
7d63405
Compare
Additionally, provide two unit tests for this
p - array of parents in tree T | ||
height[i] is a value of Z[i][3] e.g. cost of creating node (n+i) | ||
num_leaves[i] = number of leaves of a subtree rooted in a node i in T | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add in Parameters
and Returns
such as in other functions. While this isn't a public function, these should still be documented as it is fairly complex function.
@@ -930,6 +930,26 @@ def test_dendrogram_colors(self): | |||
# reset color palette (global list) | |||
set_link_color_palette(None) | |||
|
|||
@pytest.mark.parametrize("order", [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a parameterization with a duplicated value? You mention it in the description of the function, so we should probably explicitly test it here. Also, there should be a test for the happy path, where everything works as it should, and for the edge case of having an empty tree and empty list of reordering (or similar appropriate edge case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand You correctly, than line 936 contains exactly duplicated value for permutation. Maybe I should change the comment in that line. The rest of the tests You mention is on the way!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ihowell done!
f935d0a
to
9f38380
Compare
I've looked at build issues, and they look unrelated to PR (rather problems with pipeline). |
@erheron, thanks a lot for this. Would be nice to see this merged. |
@erheron Hi! for what it's worth I implemented your re-ordering function and it worked without issue, thanks it was very helpful |
Thanks guys @aleneum @sjwright90, it's been years since I worked on this, and really heart-warming to know it's been useful to you 💔 |
I'm sorry, this got completely lost at a time when I didn't have much bandwidth for SciPy. And I'm more or less the only active maintainer for |
@rgommers I think you're pretty busy with i.e., NumPy stuff, do you want me to bump the milestone here again, or perhaps leave it right to the last minute ~ December 6th just in case? |
Yes, let's leave it for now. I'll have to look at what's still on my list for 1.12.x |
Would be nice addition, but not critical for |
Hi @aleneum @erheron @sjwright90 I'm trying to give this one a go, but am having trouble installing SciPy from source using this branch. Could you provide any advice on how to do this? I tried creating a dev environment and running Thank you!! |
@vinisalazar the guide at http://scipy.github.io/devdocs/building/index.html should help. This is a pure Python PR, so another thing you could do is take the diff of this PR and apply it to an installed SciPy 1.13.0 manually. |
Thank you @rgommers, I managed to get the install to work. I'll comment here if I can get the feature to work on my use case as well. |
This is a "nice to have," but not critical for the |
Customizing leaves ordering in
scipy.cluster.dendrogram
via cut-tree (linkage matrix) reorderingReference issue
Ref. #11060
What does this implement/fix?
The objective is to provide the possibility to specify custom leaf ordering in dendrogram. As for now I've chosen a way in which
leaves_order
parameter must be a permutation of indices of original array.Additional information
This PR will provide described functionality for any input in which leaves of the dendrogram are the original observations, and in my opinion considering
truncate_mode
option doesn't really make sense. Time and memory complexity are linear, and that's optimal