Balanced cut tree method for hierarchical clustering #10448
Hi @v-reyes, thanks for proposing this. I have two questions for now:
Hi @rgommers, many thanks for your comments and questions.
That does seem like a lot of overhead for a pull request, so I definitely wouldn't suggest that unless you're interested in writing a paper anyway (in that case, please do!). A paper is just helpful, because then we have some external validation that this algorithm is an improvement over what we already have. If it's a small variation though, we can apply common sense and thorough review :) Your implementation does seem to have an example that demonstrates the improvement. Maybe you can add that here, with some dendrograms to show the effect?
That's excellent, thanks!
Hi @rgommers, many thanks again for your very useful comments.
Actually I was not planning to write a paper describing this method. In principle I would do that only if necessary for the pull request to go forward. Following your suggestion, I think that a thorough review of the method and the code would be very appropriate and beneficial. In any case, I will look in the coming days for proper references that might describe this idea. For instance, an algorithm included in the ELKI project follows a similar idea (although the algorithm itself is different, i.e. based on k-means instead of hierarchical clustering).
That's a very sensible comment. As a first step, I updated the Readme in order to better document the intention of the new method and its improvement over the established tree cut method. Further, I will work in the following days on your suggestion, i.e. providing the corresponding dendrograms to visually illustrate the main idea. Many thanks & best regards
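For anyone following along, dendrograms of this kind can be produced directly with SciPy. A minimal sketch with toy data (not the example script from the repo); the cut levels of both methods can then be annotated on top of the plot:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))     # toy data: 100 samples, 4 dimensions

Z = linkage(X, method='ward')     # build the hierarchical tree
dendrogram(Z)                     # draw it
plt.show()
```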
Hi @rgommers, following your suggestions I updated the code and the Readme in order to:
Do you think these enhancements are helpful? What would be the next steps in order to proceed with a PR?
Thanks @vreyespue. Yes, I can see the value of this method. Dendrograms are nice. The dynamic tree cut paper has >1000 citations, so there's clearly a need. Looking at what's being done, this is a little different from …. My current impression is that a new function would be good. Unfortunately @kylessmith, as the author of the Python dynamicTreeCut port, …
Opening a PR would be good. For now I'd suggest a new function (…).
Hi @rgommers, as always many thanks for your comments and thoughts.
Yes, this is a very sensible task to do. Obviously the …
I also think so. Definitely the …
I will first do a small comparison between the two discussed methods, and then we can decide on how to proceed. I agree with you; probably it would be better to add the method as a separate function. Many thanks and best regards
Dear @rgommers, I performed a comparison between the results from my method and the results from the Python dynamicTreeCut. In order to do so, I used a small piece of code which generates the same input data as in my example script, and produces from these data a similar number of dynamic clusters.
As shown below, the output is a numpy array of 100 elements, assigning one cluster ID to each input vector (of 4 dimensions). Note that the IDs of the resulting clusters go from 0 to 19 in this case. The resulting clustering is unbalanced, i.e. it contains two big clusters (with 19 and 32 data samples, respectively) and many small clusters (each containing 5 or fewer data samples). As a result, the range of cluster sizes goes from 1 to 32, with a standard deviation of 7.20 data samples.
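The balance statistics quoted here (size range and standard deviation of cluster sizes) can be computed from any cluster label array. A toy sketch, using the standard cut_tree as the cutter since the exact comparison script is not reproduced in this thread:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                  # toy stand-in for the example data

Z = linkage(X, method='ward')
labels = cut_tree(Z, n_clusters=20).flatten()  # 20 clusters, as in the comparison

sizes = np.bincount(labels)                    # samples per cluster
print(sizes.min(), sizes.max(), sizes.std())   # range and std dev of cluster sizes
```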
Note that the resulting clustering from my method is more balanced (for an equal number of resulting clusters), since the range of cluster sizes there goes from 1 to 10, with a standard deviation of 2.68 data samples. Please see the Readme in my repo for more information. Further, please note that I did not include this comparison within my repo, since the Python dynamicTreeCut is not part of the established SciPy framework, nor is it as well documented as its R counterpart. Adding more unnecessary comparisons into my repo might therefore be rather confusing and not beneficial for the potential user. All in all, this small experiment already shows that my method adds value over both the standard cut_tree and the dynamicTreeCut methods, and thus it might be considered as an independent contribution to the SciPy framework in its own right (obviously only if you agree). I have two small questions before proceeding with the arrangement of a suitable PR.
As always, many thanks for your patience and for your kind support.
Hi @vreyespue, thanks for the update. Quick answers:
Three questions:
Hi @jolespin, many thanks for your interest. About your questions:
Best regards.
@vreyespue The reason why I asked if the other linkage metrics could be used is because I know …. Regarding …: what is needed to make the PR a reality in the coming versions?
@jolespin That's a very good point. The method works with any other linkage method, but at this point I am not sure whether the result would be mathematically correct in case a distance different from … is used. In essence, the balanced method would take similar inputs as the standard cut_tree function. So in principle I would argue that anything that works for the standard cut_tree function should also work for the balanced version. But in any case, I would have to work on this issue to be sure. About the PR, I can only say that the review is still in progress, and that I am eager to apply any fixes and changes to the code necessary to include the function in the official SciPy repo. Maybe @rgommers or @tylerjereddy can provide some more information on this point.
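To illustrate the point about inputs: the cut operates purely on the linkage matrix, whatever method and metric produced it. A small sketch showing that any method/metric combination accepted by linkage() feeds straight into cut_tree (and, by the same argument, would feed into a balanced variant taking the same input):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# cut_tree only sees the linkage matrix Z, regardless of how it was built
for method, metric in [('ward', 'euclidean'), ('average', 'cityblock'), ('complete', 'cosine')]:
    Z = linkage(X, method=method, metric=metric)
    labels = cut_tree(Z, n_clusters=5).flatten()
    print(method, metric, np.bincount(labels))
```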
That's great to know that it's somewhat active. I use hierarchical clustering for so many projects, and it's such an intuitive analysis of diversity. I'm not sure if you've written a paper on it, but if this were easily accessible I would certainly use it and cite it in papers. Right now, my only other option really is dynamicTreeCut, which has its own issues, and the Python port is a dead GitHub archive. This is literally the only other Python alternative, so it would be awesome to have it in SciPy and citeable.
@jolespin Now I remember that @ljmartin was working on a …. Certainly it would be great to get this method available for citations in papers. I was waiting for its inclusion into the official SciPy, but it is taking a long time (@rgommers or @tylerjereddy might know how long it might take). In the meantime, maybe we could write a small paper (on arXiv or something like that) describing the method and citing our own repos. Please let me know if you would be interested in a co-authorship, or if you know somebody who might be interested.
Hi @vreyespue, the history of that implementation was that I wanted to use your nice balanced cut approach on large dendrograms (i.e. more than 10,000 data points). In my case, I had about 300k items, so a 300k × 300k ndarray, as gets created in the current implementation, would be far too large for memory (roughly 720 GB in float64). That one is merged in scikit-network now ( https://github.com/sknetwork-team/scikit-network/blob/e1bfad780b6781c8af32716a282dda6ebec78ea9/sknetwork/hierarchy/postprocess.py#L122 ) - it does the same thing but is 'online' in the sense that it iterates through the dendrogram once without creating big arrays. I'd be happy to contribute something to a paper - it's been really useful, thanks!
Hi @ljmartin, you are right: my implementation for SciPy does not show good performance on large dendrograms. Actually its memory consumption should be similar to that of the standard …
Is it the same implementation you created for your project? If so, then it's awesome that it's already available!
Hi @jolespin, as far as I understand, the implementations are not identical (since they are based on different data structures, i.e. the scikit-network dendrogram vs. the scipy linkage matrix), but I think the idea behind them is basically the same. Maybe @ljmartin can provide more info on this point.
The implementation in scikit-network just works on a dendrogram. They implement dendrograms in the same way SciPy does, so it should be a complete drop-in replacement.
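If that is the case, usage could look roughly like the sketch below. Note that the import path and the max_cluster_size parameter name are read off the source file linked above, so they should be checked against the installed scikit-network version:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sknetwork.hierarchy import cut_balanced   # import path assumed from the linked source

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))

Z = linkage(X, method='ward')                  # a scipy-style dendrogram (linkage matrix)
labels = cut_balanced(Z, max_cluster_size=50)  # single pass, no n x n intermediate arrays
print(np.bincount(labels).max())               # should not exceed the requested maximum
```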
I am closing this issue and the related PR due to their low priority. Many thanks for your consideration and best regards.
Dear colleagues, the current version of the cut_tree function for hierarchical clustering only accepts two cutting options: (a) setting the tree height, or (b) setting a specific number of resulting clusters.
By setting one of these parameters, you will probably end up with a few big clusters (where the number of data samples is high) and many small clusters (each containing very few data samples). Thus, the resulting clustering is unbalanced, i.e. it contains clusters of very variable size.
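For reference, the two existing options look like this:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Z = linkage(X, method='ward')

labels_a = cut_tree(Z, height=10.0)    # (a) cut at a given tree height
labels_b = cut_tree(Z, n_clusters=10)  # (b) ask for a fixed number of clusters
```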
I recently developed a balanced cut tree method which addresses this specific issue.
The proposed method looks recursively along the hierarchical tree, from the root (a single cluster gathering all the samples) to the leaves (i.e. the clusters with only one sample), retrieving the biggest possible clusters containing a number of samples lower than a given maximum. Since all output clusters contain no more than the given maximum number of samples, the resulting clustering is considered to be more balanced than the standard tree cut.
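A compact sketch of this idea, using SciPy's tree utilities (a minimal illustration rather than the exact code in my repo):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def cut_tree_balanced_sketch(Z, max_cluster_size):
    """Return one label per sample; every cluster has <= max_cluster_size samples."""
    n = Z.shape[0] + 1
    labels = np.empty(n, dtype=int)
    next_id = 0
    stack = [to_tree(Z)]                       # root: the cluster of all samples
    while stack:
        node = stack.pop()
        if node.get_count() <= max_cluster_size:
            labels[node.pre_order()] = next_id # small enough: emit subtree as one cluster
            next_id += 1
        else:                                  # too big: descend into the children
            stack.append(node.get_left())
            stack.append(node.get_right())
    return labels

X = np.random.default_rng(0).normal(size=(100, 4))
labels = cut_tree_balanced_sketch(linkage(X, method='ward'), max_cluster_size=10)
print(np.bincount(labels).max())               # <= 10 by construction
```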
If you consider this method plausible, useful and interesting, I would be very happy to integrate it into the main SciPy repo. I would be open to considering all proposed improvements and amendments, so that the function integrates smoothly within the SciPy framework.
Many thanks in advance for any comments and suggestions.