
Balanced cut tree method for hierarchical clustering #10448

Closed
vreyespue opened this issue Jul 12, 2019 · 23 comments
Labels: enhancement (A new feature or improvement), scipy.cluster

Comments

@vreyespue

vreyespue commented Jul 12, 2019

Dear colleagues, the current version of the cut_tree function for hierarchical clustering only accepts two cutting options: (a) setting the tree height, or (b) setting a specific number of resulting clusters.

With either option, you will probably end up with a few big clusters (each containing many data samples) and many small clusters (each containing very few). The resulting clustering is thus unbalanced, i.e. it contains clusters of highly variable size.

I recently developed a balanced cut tree method which addresses this specific issue.

The proposed method traverses the hierarchical tree recursively, from the root (the single cluster gathering all samples) to the leaves (the clusters containing only one sample), retrieving the largest possible clusters that contain no more than a given maximum number of samples. Since no output cluster exceeds that maximum, the resulting clustering is considered more balanced than the standard tree cut.
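For illustration, the recursive idea can be sketched against a SciPy linkage matrix roughly as follows (a hedged sketch of the description above; the function and variable names are illustrative, not the proposed API):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def cut_tree_balanced_sketch(Z, max_size):
    # Descend from the root and emit the largest subtrees whose
    # sample count does not exceed max_size.
    n = Z.shape[0] + 1                 # number of original samples
    labels = np.full(n, -1, dtype=int)
    next_id = 0

    def leaves(node):
        # collect the original sample indices under a linkage node
        if node < n:
            return [node]
        row = Z[node - n]
        return leaves(int(row[0])) + leaves(int(row[1]))

    stack = [2 * n - 2]                # scipy's root node id
    while stack:
        node = stack.pop()
        members = leaves(node)
        if node < n or len(members) <= max_size:
            labels[members] = next_id  # emit this subtree as one cluster
            next_id += 1
        else:
            row = Z[node - n]          # too big: descend into both children
            stack.extend((int(row[0]), int(row[1])))
    return labels

np.random.seed(5)
X = np.random.rand(40, 3)
Z = linkage(X, "average")
labels = cut_tree_balanced_sketch(Z, max_size=8)
assert labels.min() >= 0 and np.bincount(labels).max() <= 8
```

Every sample ends up in exactly one cluster, and no cluster exceeds the given maximum, which is the balance property described above.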

If you consider this method plausible, useful and interesting, I would be very happy to integrate it into the main SciPy repo. I would be open to considering any proposed improvements and amendments, so that the function integrates smoothly into the SciPy framework.

Many thanks in advance for any comments and suggestions.

@rgommers rgommers added enhancement A new feature or improvement scipy.cluster labels Jul 12, 2019
@rgommers
Member

Hi @v-reyes, thanks for proposing this. I have two questions for now:

  1. Is your implementation based on a published paper?
  2. You added a GPL license to your repo, which is incompatible with SciPy's license. Would you mind relicensing to BSD-3 so we could accept code based on it under SciPy's license?

@vreyespue
Author

Hi @rgommers, many thanks for your comments and questions.

  1. That is a very good point. In fact, I am not aware of any publication describing this method directly, nor of any previous implementation of it (although it is indeed a trivial variation of the standard cut tree method). I can take a look next week for relevant citations in the literature. If I do not find anything, should I proceed and write a paper myself? Which journal do you think would be appropriate? Would there be any alternatives to classical journal papers?
  2. Already done. Is it ok like that, or should I change something within the license? (I am very flexible on this point).
    Again, many thanks for your support.

@rgommers
Member

If I do not find anything, should I proceed and write myself a paper?

That does seem like a lot of overhead for a pull request, so I definitely wouldn't suggest it unless you're interested in writing a paper anyway (in that case, please do!). A paper is helpful simply because it gives us some external validation that this algorithm is an improvement over what we already have. If it's a small variation though, we can apply common sense and thorough review :) Your implementation does seem to have an example that demonstrates the improvement. Maybe you can add that here, with some dendrograms to show the effect?

Is it ok like that, or should I change something within the license? (I am very flexible on this point).

That's excellent, thanks!

@vreyespue
Author

Hi @rgommers, many thanks again for your very useful comments.

That does seem like a lot of overhead for a pull request, so I definitely wouldn't suggest that unless you're interested in writing a paper anyway.

Actually, I was not planning to write a paper describing this method; in principle I would do that only if it were necessary for the pull request. Following your suggestion, I think a thorough review of the method and the code would be very appropriate and beneficial. In any case, I will look in the next few days for proper references that might describe this idea. For instance, an algorithm included in the ELKI project follows a similar idea (although the algorithm itself is different, i.e. based on k-means instead of hierarchical clustering).

Your implementation does seem to have an example that demonstrates the improvement. Maybe you can add that here, with some dendrograms to show the effect?

That's a very sensible comment. As a first step, I updated the Readme to better document the intention of the new method and its improvement over the established tree cut method. Furthermore, I will work on your suggestion over the following days, i.e. providing the corresponding dendrograms to visually illustrate the main idea.

Many thanks & best regards

@vreyespue
Author

Hi @rgommers, following your suggestions I updated the code and the Readme in order to:

  • Provide dendrograms to visually illustrate the intention of the method and the main results, and
  • Provide relevant links describing related work (which might be used as references for the method).

Do you think these enhancements are helpful? What would be the next steps in order to proceed with a PR?
Many thanks & best regards

@rgommers
Member

Thanks @vreyespue. Yes, I can see the value of this method. Dendrograms are nice. The dynamic tree cut paper has >1000 citations, so there's clearly a need.

Looking at what's being done, this is a little different from cut_tree, so I'm not sure it makes sense to combine them in a single function. Your algorithm is more similar to dynamic tree cut. It would actually also be good to compare to https://github.com/kylessmith/dynamicTreeCut.

My current impression is that a new function would be good. Unfortunately cut_tree is poorly documented - for example, a simple explanation of what an ultrametric tree is would be useful.

@kylessmith, as the author of dynamicTreeCut you may be interested in this. If you have an expert opinion on whether the proposed algorithm belongs in scipy.cluster as a separate new function or inside cut_tree as an option, that would be great to hear. If dynamicTreeCut was license-compatible, it seems like that could have fit in nicely as well.

@rgommers
Member

What would be the next steps in order to proceed with a PR?

Opening a PR would be good. For now I'd suggest a new function (cut_tree_balanced?).

@vreyespue
Author

Hi @rgommers, as always many thanks for your comments and thoughts.

Your algorithm is more similar to dynamic tree cut. It would actually also be good to compare to https://github.com/kylessmith/dynamicTreeCut.

Yes, this is a very sensible task. Obviously the dynamicTreeCut method is more complex and advanced than mine. I will work on this point in the next few days, so that we can at least provide some results from both methods.

If dynamicTreeCut was license-compatible, it seems like that could have fit in nicely as well.

I also think so. The dynamicTreeCut method would definitely be very valuable for the SciPy framework. It would be great if @kylessmith were open to collaborating on this point.

opening a PR would be good. for now I'd suggest a new function (cut_tree_balanced?)

I will first do a small comparison between both discussed methods, and then we can decide on how to proceed. I agree with you, probably it would be better to add the method as a separate function.

Many thanks and best regards

@vreyespue
Author

Dear @rgommers,

I performed a comparison between the results of my method and those of the dynamicTreeCut port in Python. To do so, I used a small piece of code which generates the same input data as my example script and produces a similar number of dynamic clusters from it.

from dynamicTreeCut import cutreeHybrid
from scipy.spatial.distance import pdist
from scipy.stats import gamma
import numpy as np
from scipy.cluster.hierarchy import linkage

# Initialize the random seed
np.random.seed(5)

# Create an input matrix containing 100 data samples with 4 dimensions. Note: we sample
# from a gamma distribution in order to obtain an unbalanced clustering (see below)
X = gamma.rvs(0.1, size=400).reshape((100, 4))
    
# Compute the distances and linkage matrix using the scipy methods
distances = pdist(X, "euclidean")
link = linkage(distances, "average")

# Perform the tree cut using the dynamicTreeCut library. Note: the parameters minClusterSize
# and cutHeight were set so as to obtain 20 clusters as a result (see below)
clusters = cutreeHybrid(link, distances, minClusterSize=1, cutHeight=0.1,
    respectSmallClusters=True)

# Use only the labels from the returned dictionary as result
standard_cut_cluster_id = clusters["labels"]

# Print information about the resulting clusterings
print("Type of the standard clustering result: %s" % type(standard_cut_cluster_id))
print("Shape of the standard clustering result (one cluster id per data sample): %s" % 
    str(standard_cut_cluster_id.shape))
print("First 10 rows of the standard clustering result (one cluster id per sample):")
print(str(standard_cut_cluster_id[0:10].reshape(10)) + " ...")
standard_cluster_values, standard_cluster_counts = np.unique(standard_cut_cluster_id, 
    return_counts=True)
print("Total number of resulting clusters = %s" % standard_cluster_values.shape[0])
print("For each resulting cluster: Cluster ID")
print(standard_cluster_values)
print("For each resulting cluster: Count of data samples")
print(standard_cluster_counts)
print("Count of data samples per cluster: mean = %d, max = %d, min = %d, std = %.2f" %
    (np.mean(standard_cluster_counts), np.max(standard_cluster_counts),
    np.min(standard_cluster_counts), np.std(standard_cluster_counts)))
print('')

As shown below, the output is a numpy array of 100 elements, assigning one cluster ID to each input vector (of 4 dimensions). Note that the IDs of the resulting clusters go from 0 to 19 in this case. The resulting clustering is unbalanced, i.e. it contains two big clusters (of 19 and 32 data samples, respectively) and many small clusters (each containing 5 or fewer data samples). As a result, the range of cluster sizes goes from 2 to 32, with a standard deviation of 7.20 data samples.

Type of the standard clustering result: <type 'numpy.ndarray'>
Shape of the standard clustering result (one cluster id per data sample): (100,)
First 10 rows of the standard clustering result (one cluster id per sample):
[ 3  7  1 12  1  0  5  0  1  0] ...
Total number of resulting clusters = 20
For each resulting cluster: Cluster ID
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
For each resulting cluster: Count of data samples
[19 32  5  5  4  4  3  3  3  2  2  2  2  2  2  2  2  2  2  2]
Count of data samples per cluster: mean = 5, max = 32, min = 2, std = 7.20

Note that the clustering resulting from my method is more balanced (for an equal number of resulting clusters), since the range of cluster sizes there goes from 1 to 10, with a standard deviation of 2.68 data samples. Please see the Readme in my repo for more information.
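For reference, this balance comparison boils down to the spread of the per-cluster counts; a small helper (written here for illustration, not part of any library) reproduces the statistics printed above from the reported counts:

```python
import numpy as np

def cluster_balance_stats(labels):
    # Summarize how balanced a clustering is via its per-cluster sample counts.
    _, counts = np.unique(labels, return_counts=True)
    return {"n_clusters": int(counts.size), "min": int(counts.min()),
            "max": int(counts.max()), "std": round(float(counts.std()), 2)}

# Rebuild a label array matching the per-cluster counts reported above
# for the dynamicTreeCut run (20 clusters over 100 samples).
labels = np.repeat(np.arange(20),
                   [19, 32, 5, 5, 4, 4, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
print(cluster_balance_stats(labels))
# → {'n_clusters': 20, 'min': 2, 'max': 32, 'std': 7.2}
```

The same helper applied to a more balanced clustering would show a smaller max and standard deviation for the same number of clusters.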

Further, please note that I did not include this comparison in my repo, since the Python port of dynamicTreeCut is not part of the established SciPy framework, nor is it as well documented as its R counterpart. Adding more comparisons to my repo might therefore be rather confusing, and not beneficial for the potential user.

All in all, this small experiment already shows that my method adds value relative to both the standard cut_tree and the dynamicTreeCut methods, and thus it might be considered an independent contribution to the SciPy framework in its own right (obviously only if you agree).

I have a few small questions before proceeding with the arrangement of a suitable PR.

  • Should I include my cut_tree_balanced() function within the main 'hierarchy.py' file? If yes, just below the standard cut_tree() function?
  • Should I include the tests already in the first version of the PR? If so, should I include them within the main 'test_hierarchy.py' file?
  • Are benchmarks strictly necessary?
  • I saw in the SciPy docu that the PEP8 and doctest standards should be followed. Are there any other standards I should take into consideration?

As always, many thanks for your patience and for your kind support.
Best regards :)

@rgommers
Member

Hi @vreyespue, thanks for the update. Quick answers:

  • yes
  • yes
  • no, but they would be nice and are quite easy to write :)
  • no (just following similar code style and test arrangement as we have should be fine)

@vreyespue
Author

Hi @rgommers, as discussed I already prepared the corresponding PR: #10730
Please let me know if you need further information.
Many thanks & best regards

@jolespin

Three questions:

  1. Is this available in the current SciPy version?
  2. Can this be used with non-ward linkage?
  3. Do you know of any dendrogram cutting algorithms like cuttree dynamic that are available in Python? I've had issues with the Python port and have been using rpy2 but the dependencies are always horrendous.

@vreyespue
Author

vreyespue commented Aug 27, 2020

Hi @jolespin, many thanks for your interest. About your questions:

  1. No, it isn't. It has been submitted as a PR, which is planned for the 1.6.0 milestone. However, please note that the PR is already more than one year old, so it might take some time to be added to official SciPy (probably low priority). But maybe @rgommers or @tylerjereddy can provide more information about this.
  2. Good question. Actually it works (I tested it). The method ward_cut_tree_balanced() (which might be indeed renamed to cut_tree_balanced()) takes as input a linkage matrix of type ndarray (Z, see the documentation at the original repo). Note that Z can be created using the ward method, or any other linkage method.
  3. I listed some further methods in the documentation at the original repo (see the Related Work section at the end). I think you might be interested in the Python translation of the dynamicTreeCut method.
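Regarding point 2, the interchangeability follows from the fact that all SciPy linkage methods return the same (n-1, 4) matrix encoding, so any cut function that consumes Z is agnostic to how Z was built (whether the result is equally meaningful for every distance is a separate, mathematical question). A quick illustrative check, using the standard cut_tree since the balanced function is not yet merged:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree
from scipy.spatial.distance import pdist

np.random.seed(0)
X = np.random.rand(50, 3)
d = pdist(X, "euclidean")

for method in ("ward", "average", "complete", "single"):
    Z = linkage(d, method)
    assert Z.shape == (49, 4)            # same encoding for every method
    labels = cut_tree(Z, n_clusters=5)   # so any cut function applies
    assert len(np.unique(labels)) == 5
```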

Best regards.

@jolespin

@vreyespue The reason why I asked if the other linkage metrics could be used is because I know ward linkage works specifically on Euclidean distances so I wasn't sure if this was a similar situation.

Regarding dynamicTreeCut in Python, I've had compatibility issues (kylessmith/dynamicTreeCut#9) and the author hasn't addressed them, nor has there been any activity on the project for a few years, so I feel that this method is abandoned.

What is needed to make the PR a reality in the coming versions?

@vreyespue
Author

@jolespin That's a very good point. The method simply works with any other linkage method, but at this point I am not sure whether the result would be mathematically correct in case a distance other than Euclidean is used. Food for thought.

In essence, the balanced method takes similar inputs to the standard cut_tree function, so in principle I would argue that anything that works for the standard cut_tree function should also work for the balanced version. But in any case, I would have to work on this issue to be sure.

About the PR, I can only say that the review is still in process, and that I am eager to apply any fixes and changes to the code necessary to include the function into the official SciPy repo. Maybe @rgommers or @tylerjereddy can provide some more information about this point.

@jolespin

That's great to know it's somewhat active. I use hierarchical clustering for so many projects, and it's such an intuitive analysis of diversity. I'm not sure if you've written a paper on it, but if this were easily accessible I would certainly use it and cite it in papers. Right now my only other option is really dynamicTreeCut, which has its own issues, and the Python port is a dead GitHub archive. This is literally the only other Python alternative, so it would be awesome to have it in SciPy and citeable.

@vreyespue
Author

vreyespue commented Aug 28, 2020

@jolespin Now I remember that @ljmartin was working on a balanced_cut version for Scikit (see link1, link2, link3), which was actually inspired by my own method. But I don't know whether that Scikit function is already available.

Certainly it would be great to get this method available for citations in papers. I was waiting for its inclusion into the official SciPy, but it is taking a long time (@rgommers or @tylerjereddy might know how long it might take).

In the meantime, maybe we could write a small paper (on arXiv or similar) describing the method and citing our own repos. Please let me know if you would be interested in a co-authorship, or if you know somebody who might be interested.

@ljmartin

ljmartin commented Sep 7, 2020

Hi @vreyespue, the history of that implementation is that I wanted to use your nice balanced cut approach on large dendrograms (i.e. more than 10,000 data points). In my case, I had about 300k items, so a 300k x 300k ndarray, as created in the current implementation, would be too large for memory. That one is merged into scikit-network now ( https://github.com/sknetwork-team/scikit-network/blob/e1bfad780b6781c8af32716a282dda6ebec78ea9/sknetwork/hierarchy/postprocess.py#L122 ) - it does the same thing but is 'online' in the sense that it iterates through the dendrogram once without creating big arrays.
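The single-pass idea might be sketched roughly as follows (an illustrative reconstruction, not the actual scikit-network code; it still keeps per-node member lists for simplicity, which the real implementation avoids):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def cut_balanced_online(Z, max_size):
    # Walk the linkage rows bottom-up and finalize a child as a cluster
    # whenever a merge would exceed max_size.
    n = Z.shape[0] + 1
    members = {i: [i] for i in range(n)}   # open node id -> sample indices
    labels = np.full(n, -1, dtype=int)
    next_id = 0

    def finalize(node):
        nonlocal next_id
        labels[members.pop(node)] = next_id
        next_id += 1

    for i in range(Z.shape[0]):
        a, b = int(Z[i, 0]), int(Z[i, 1])
        a_open, b_open = a in members, b in members
        if a_open and b_open and len(members[a]) + len(members[b]) <= max_size:
            members[n + i] = members.pop(a) + members.pop(b)  # keep growing
        else:
            # the merge is too big (or mixes finalized material), so
            # close off whichever children are still open
            if a_open:
                finalize(a)
            if b_open:
                finalize(b)
    if 2 * n - 2 in members:   # the whole data set fit under max_size
        finalize(2 * n - 2)
    return labels

np.random.seed(1)
X = np.random.rand(30, 2)
Z = linkage(X, "average")
labels = cut_balanced_online(Z, max_size=5)
assert labels.min() >= 0 and np.bincount(labels).max() <= 5
```

Since every open node has at most max_size members by construction, the emitted clusters satisfy the same balance guarantee as the recursive version, in a single pass over the linkage rows.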

I'd be happy to contribute something to a paper - it's been really useful thanks!

@vreyespue
Author

Hi @ljmartin, you are right, my implementation for SciPy does not perform well on large dendrograms. Its memory consumption should actually be similar to that of the standard cut_tree() function, if I remember correctly.
Anyhow, I think your implementation for scikit-network is very elegant, and it is great to hear that it shows a good performance. That's great news 😄 👍
I think the most important thing would be to provide something that can be cited in papers. @jolespin do you think the scikit-network implementation could be enough for that purpose? The SciPy implementation should eventually be available, but I don't know when (it might take some time; @rgommers could provide more info about this).
If that helps, I would also be eager to contribute something to a paper, but I am not sure about the format, the length, etc. In addition, I do not currently have an academic affiliation, which is necessary for publishing in many journals. If you know of any workaround for this, that would be very helpful. Please let me know what you think.
Best regards & many thanks!

@jolespin

jolespin commented Sep 9, 2020

Is it the same implementation you created for your project?
https://scikit-network.readthedocs.io/en/latest/reference/hierarchy.html#sknetwork.hierarchy.cut_balanced

If so, then yes that is awesome it's already available!

@vreyespue
Author

Hi @jolespin, as far as I understand, the implementation is not identical (since it is based on different data structures, i.e. a scikit-network dendrogram vs. a scipy linkage matrix), but I think the underlying idea is basically the same. Maybe @ljmartin can provide more info on this point.
About citing this method, I don't know whether it is possible to cite the scikit-network repo. As I understand it, it is not part of the official scikit framework. Maybe @ljmartin can provide some more info on this point as well.
In any case, I think it is great to have two implementations of the same method available 😄 👍
Many thanks & best regards

@ljmartin

The implementation in scikit-network just works on a dendrogram. They implement dendrograms in the same way scipy does, so it should be a complete drop-in replacement.

@vreyespue vreyespue changed the title Balanced cut tree method for Ward hierarchical clustering Balanced cut tree method for hierarchical clustering Sep 15, 2020
@vreyespue
Author

I will proceed to close this issue and the related PR due to their low priority. Many thanks for your consideration, and best regards.
