Added distance_threshold parameter to hierarchical clustering #9069

Merged

Contributor

@VathsalaAchar VathsalaAchar commented Jun 8, 2017

Reference Issue

Fixes #3796

What does this implement/fix? Explain your changes.

Hierarchical clustering now has a distance_threshold parameter which can be set instead of n_clusters and this is used to determine the number of clusters to cut the tree at.

Note that this works only when compute_full_tree=True and n_clusters=None.
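As a rough illustration of the usage described above, here is a minimal sketch. It assumes a scikit-learn release that already ships this parameter (0.21 or later, going by the version discussed further down the thread); the toy data and the threshold value are made up for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated groups on a line (illustrative data only).
X = np.array([[0.0], [0.1], [5.0], [5.1]])

# n_clusters is set to None so that distance_threshold decides the cut,
# and the full tree is computed so that all merge distances are available.
model = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1.0,
    compute_full_tree=True,
    linkage="single",
)
labels = model.fit_predict(X)
```

With single linkage the two tight pairs merge at a distance of about 0.1, while the last merge happens at about 4.9, so a threshold of 1.0 should leave two clusters.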

Changes

  • Either distance_threshold or n_clusters is accepted as a parameter. When distance_threshold is set, n_clusters must be set to None, because the default of 2 clusters has been retained.
  • When building the tree, return_distance is set to True if distance_threshold has been set.
    The distances returned are then used to calculate the number of clusters when cutting the tree.
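The step from merge distances to a cluster count can be sketched in plain Python. The helper name and the sample distances below are hypothetical; the counting rule follows the `np.count_nonzero(distances >= distance_threshold) + 1` expression quoted from the tests later in this thread.

```python
def n_clusters_at_threshold(distances, distance_threshold):
    """Count the clusters left when merges at or above the threshold are skipped.

    A full tree over n samples has n - 1 merges and ends with one cluster,
    so every merge that is not performed adds one cluster back.
    """
    skipped = sum(1 for d in distances if d >= distance_threshold)
    return skipped + 1


# Hypothetical merge distances for five samples (four merges).
merge_distances = [0.1, 0.2, 0.9, 4.0]
```

Note that a threshold exactly equal to a merge distance counts that merge as skipped under this rule.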

Tests

  • Updated distance_threshold tests to pass with n_clusters=None as parameter as the default for n_clusters is 2.
  • Test to raise error when neither n_clusters nor distance_threshold is passed in.
  • Test agglomerative clustering with distance_threshold passed in and compare with the different number of clusters produced with and without connectivity.

Documentation

  • Updated the User Guide to show distance_threshold parameter in the table.
  • Docstring updated with an explanation of distance_threshold.

Any other comments?

Is this implementation the right behaviour of distance_threshold or should the construction of the tree stop when the distance is reached?


Apologies for the constant stream of commits, I was trying to sort out the errors myself.

@VathsalaAchar VathsalaAchar force-pushed the hierarchical_clustering_threshold branch 2 times, most recently from a868db7 to 158c942 Compare June 9, 2017 14:36
@raghavrv raghavrv changed the title Distance threshold added to hierarchical clustering [MRG] Distance threshold added to hierarchical clustering Jun 28, 2017
Contributor

@massich massich left a comment

I think the AppVeyor failures would be fixed if you rebase on master.

Thanks for the PR.

Member

@jnothman jnothman left a comment

I'd rather see a test which checks the invariant that we want to hold:

  • set a distance threshold
  • check that the clusters produced match the point in the linkage tree where the threshold would have been exceeded
  • test the boundary case of the threshold equalling the distance
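One way to picture the invariant in the list above is a small tree cut written in plain Python. This is only a sketch: `cut_tree` is a hypothetical helper, it assumes the `children` convention where merge `i` creates node `n_samples + i`, it assumes merge distances never decrease along the tree, and it treats a merge whose distance equals the threshold as not performed.

```python
def cut_tree(children, distances, n_samples, distance_threshold):
    """Label samples by applying only the merges below the threshold."""
    parent = list(range(n_samples + len(children)))

    def find(node):
        # Follow parent links up to the root, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for i, ((a, b), dist) in enumerate(zip(children, distances)):
        if dist < distance_threshold:
            new_node = n_samples + i
            parent[find(a)] = new_node
            parent[find(b)] = new_node

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(s), len(roots)) for s in range(n_samples)]


# Hypothetical tree over four samples: (0, 1) -> 4, (2, 3) -> 5, (4, 5) -> 6.
children = [(0, 1), (2, 3), (4, 5)]
distances = [1.0, 2.0, 4.0]
```

Cutting at 3.0 keeps the first two merges, while cutting at exactly 2.0 exercises the boundary case and leaves samples 2 and 3 apart.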

Btw, is this always specified as an absolute distance? Might we want (or prefer) to specify it relative to the average or median pairwise distance, for instance?
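To make the relative-threshold idea concrete, here is a small sketch of how an absolute threshold could be derived from the median pairwise distance. The helper `relative_threshold` and its `factor` argument are hypothetical and are not something this thread shows the PR implementing.

```python
from itertools import combinations
from math import dist
from statistics import median


def relative_threshold(points, factor):
    """Scale a threshold by the median pairwise Euclidean distance."""
    pairwise = [dist(p, q) for p, q in combinations(points, 2)]
    return factor * median(pairwise)


# Illustrative points whose pairwise distances are 5, 10 and 5.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
threshold = relative_threshold(points, factor=0.5)
```

The resulting absolute value could then be passed wherever a distance threshold is expected.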

@@ -214,6 +214,104 @@ def test_agglomerative_clustering():
assert_array_equal(clustering.labels_, clustering2.labels_)


def test_agglomerative_clustering_with_distance_threshold():
Member

I don't think there's good reason to duplicate the tests for this parameter.

@VathsalaAchar VathsalaAchar force-pushed the hierarchical_clustering_threshold branch from 158c942 to a0e2df2 Compare June 28, 2017 13:54
@VathsalaAchar
Contributor Author

VathsalaAchar commented Jun 28, 2017

@massich I rebased off master to fix the CircleCI errors and now Travis fails while looking for nose-timer. Is there anything I am supposed to do to fix this?

And thank you very much for the review, I'll get the changes sorted out and get back soon.

@massich
Contributor

massich commented Jun 28, 2017

The problem comes from the way you fixed this comment. I evidently didn't express myself clearly. What I was proposing was to use some dummy variables to improve readability.

if distance_threshold:
    (ch, com, le, pa) = \
        memory.cache...
else:
    (ch, com, le, pa) = \
        memory.cache...
self.children_ = ch
self.n_comp...

That's why Travis failed. You can run the PEP8 check locally:

bash ./build_tools/travis/flake8_diff.sh

@jnothman
Member

jnothman commented Jun 28, 2017 via email

@raghavrv
Member

Are you then okay with users having to explicitly set n_clusters=None when using distance_threshold?

@raghavrv
Member

I'm unable to recollect which other class has something like this. But I feel we did come across this problem before.

@VathsalaAchar
Contributor Author

bash ./build_tools/travis/flake8_diff.sh

@massich Thanks for this. I usually forget to run flake8 locally and only noticed the extra spaces later. But the error I was talking about was a timeout error in a previous build; it fixed itself on the next push and hasn't appeared since. I have also fixed the variables for better readability.

@jnothman I have updated the tests. But could I get a quick review to see if I missed anything?

@jnothman @raghavrv How can I move forward on setting or not setting n_clusters=None when using distance_threshold?

@VathsalaAchar VathsalaAchar force-pushed the hierarchical_clustering_threshold branch from 4e7e49d to 065decb Compare June 30, 2017 13:34
return_distance=True)
clusters_at_threshold = np.count_nonzero(
distances >= distance_threshold) + 1
assert_true(clusters_at_threshold == clusters_produced)
Member

No, I meant you should check that the clusters, not the number of clusters, match.
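Comparing clusters rather than counts means comparing partitions, which should not depend on the label numbers chosen. A minimal sketch, with hypothetical helper names, of one way to do that in plain Python:

```python
def canonical_labels(labels):
    """Renumber labels as 0, 1, 2, ... in order of first appearance."""
    mapping = {}
    return [mapping.setdefault(label, len(mapping)) for label in labels]


def same_clusters(labels_a, labels_b):
    """True when both labelings induce the same partition of the samples."""
    return canonical_labels(labels_a) == canonical_labels(labels_b)
```

Two labelings such as `[0, 0, 1, 1]` and `[1, 1, 0, 0]` then compare equal, while `[0, 1, 0, 1]` does not, even though all three contain two clusters.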

@VathsalaAchar VathsalaAchar force-pushed the hierarchical_clustering_threshold branch from 065decb to 0cce945 Compare July 4, 2017 11:24
@VathsalaAchar
Contributor Author

Could someone help me finish this off? I made all the necessary changes but a quick review would help. Thanks in advance!

Member

@jnothman jnothman left a comment

There are a few small things that surprised me, like the n_clusters_ behaviour and a couple of the error cases, but this is otherwise looking good.

@blackyang

Great feature! When will this be merged?

@asanakoy
Contributor

asanakoy commented Oct 6, 2017

Also waiting for the merge

Member

@jnothman jnothman left a comment

I don't understand what you intend by the n_clusters_ attribute. It does not appear to be documented, and it does not seem to differ from self.n_clusters. If this PR introduces a new n_clusters_ attribute, its role should be to report the number of clusters automatically identified by distance_threshold. Otherwise, I see no need for a new attribute.

@@ -560,12 +563,29 @@ def _hc_cut(n_clusters, children, n_leaves):
n_leaves : int
Number of leaves of the tree.

distance_threshold : int (optional)
The distance threshold to cluster at.
If ``distance_threshold`` is set then ``n_clusters`` will be the
Member

It is confusing here to refer to the n_clusters parameter. I'd rather: "If distance_threshold is set then n_clusters will be ignored. Instead, the number of clusters will be determined by when distances between clusters first exceed the distance_threshold during agglomeration."

NOT both.

distance_threshold : int (optional)
The distance threshold to cluster at.
Member

Update to match above

"""Function cutting the ward tree for a given number of clusters.

Parameters
----------
n_clusters : int or ndarray
The number of clusters to form.
NOTE: You should set either ``n_clusters`` or ``distance_threshold``,
NOT both.
Member

This comment doesn't really make sense, if distance_threshold overrides n_clusters. I think it's fine to only have the comment under distance_threshold saying that it being set makes this ignored.

@VathsalaAchar
Contributor Author

I don't understand what you intend by the n_clusters_ attribute. It does not appear to be documented, and it does not seem to differ from self.n_clusters. If this PR introduces a new n_clusters_ attribute, its role should be to report the number of clusters automatically identified by distance_threshold. Otherwise, I see no need for a new attribute.

You're absolutely right. The n_clusters_ attribute made sense initially when I started off and wanted n_clusters to be None and this attribute was the workaround. But after all the changes it doesn't have any purpose now so I'll fix that.

I'll also take care of the remaining documentation issues.

When distance_threshold is set then it is used to determine the number of clusters to cut the tree at.
Though it is to be noted that this works only when compute_full_tree=True.

* When building the tree, return_distance is set to True if distance_threshold has been set.
The distances returned are then used to calculate the number of clusters when cutting the tree.

* Test agglomerative clustering with distance_threshold passed in and compare with the different number of clusters produced with and without connectivity.

Changes to documentation to include distance_threshold

Updates to distance threshold in hierarchical clustering

* Moved the parameter check from init to fit for consistency

* Updates to tests to account for changes made above

Documentation changes based on review

* backticks for variables in docstrings

* formatting without backslashes

Test for hierarchical clustering with distance_threshold

* clusters produced are checked against the linkage tree to confirm that it matches the point where the distance exceeds the threshold set
* boundary case test when distance_threshold is equal to the distance

* Updated tests to compare clusters and number of clusters

* Allowing users to set n_clusters or distance_threshold and updated tests

* Checking the n_clusters None condition better

* Removed the necessity for n_clusters_ to be set to None and redundant checks

* Updated tests after the above changes

* Cleaned up test to compare clusters produced using n_clusters against distance_threshold

* Added and Simplified test for boundary conditions

* Updated the documentation on distance_threshold restrictions
* Doc string updates for clear information

* Removed redundant attribute n_clusters_

* Fixed tests

* Changes to FeatureAgglomeration to include distance threshold
self.labels_ = _hc_cut(self.n_clusters, self.children_,
self.n_leaves_)
if distance_threshold is not None:
self.labels_ = _hc_cut(self.n_clusters, self.children_,
Member

why don't you just calculate n_clusters from distances here and avoid modifying _hc_cut (adding two parameters for the sake of 1 line of logic)?

Member

@jnothman jnothman left a comment

I think what we should do is to change n_clusters to have a default value of None, which means "2 if distance_threshold is unset", and then raise an error when fitting a model with both n_clusters and distance_threshold set.
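The proposal in this comment can be sketched as a small validation helper. `resolve_n_clusters` is a hypothetical name used only for illustration, the error text is made up, and the behaviour that was eventually merged may differ from this proposal.

```python
def resolve_n_clusters(n_clusters=None, distance_threshold=None):
    """Hypothetical helper mirroring the proposed defaults."""
    if n_clusters is not None and distance_threshold is not None:
        raise ValueError(
            "Set either n_clusters or distance_threshold, not both."
        )
    if distance_threshold is not None:
        # The cluster count is decided later from the merge distances.
        return None
    # Neither is set: fall back to the historical default of 2.
    return 2 if n_clusters is None else n_clusters
```

Under this scheme users would not need to pass n_clusters=None explicitly when they only want a distance threshold.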

Otherwise, this is looking good.

n_clus = -1
agc = AgglomerativeClustering(n_clusters=n_clus)
msg = ("n_clusters should be an integer greater than 0."
" %s was provided." % str(agc.n_clusters))
Member

Either use a loop to test both -1 and 0, or put -1 directly in the expected error message
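The loop variant suggested here might look like the following sketch. `check_n_clusters` is a hypothetical stand-in for the estimator's own validation, written only so that the example runs on its own; the message text is copied from the test snippet above.

```python
def check_n_clusters(n_clusters):
    """Hypothetical stand-in for the estimator's validation."""
    if n_clusters <= 0:
        raise ValueError(
            "n_clusters should be an integer greater than 0."
            " %s was provided." % str(n_clusters)
        )


def error_message_for(n_clusters):
    """Return the ValueError text raised for n_clusters, or None."""
    try:
        check_n_clusters(n_clusters)
    except ValueError as exc:
        return str(exc)
    return None


# Loop over both invalid values so the expected message tracks each one.
for bad_value in (-1, 0):
    expected = ("n_clusters should be an integer greater than 0."
                " %s was provided." % bad_value)
    assert error_message_for(bad_value) == expected
```

In the real test suite the same loop would presumably wrap the estimator's fit call instead of this stand-in.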



def test_agg_n_cluster_and_distance_threshold():
# Test that when distance_threshold is set n_clusters_ is unchanged
Member

This isn't the right comment anymore

children, n_components, n_leaves, parent, distances = \
tree_builder(X, connectivity=conn, n_clusters=None,
return_distance=True)
num_clusters_at_threshold = np.count_nonzero(
Member

@jnothman jnothman Nov 9, 2017

Perhaps we should test something more explicit, just to be sure that your logic here is correct, like check that in single linkage, the maximum within-cluster pairwise distance for each sample is under the threshold and the minimum out-of-cluster pairwise distance is greater.

Contributor Author

Do you mean along the lines of this test?
I could use the same dataset to do a more explicit test.

Member

@jnothman jnothman Dec 7, 2017

I don't see how it relates to that test. I mean that for some X and some predicted labels:

from sklearn.metrics import pairwise_distances

D = pairwise_distances(X, metric=metric)
for i in range(len(X)):
    in_cluster_mask = labels == labels[i]
    max_in_cluster_distance = D[i, in_cluster_mask].max()
    min_out_cluster_distance = D[i, ~in_cluster_mask].min()
    # XXX: there should be equality on one of these conditions
    assert max_in_cluster_distance < threshold
    assert min_out_cluster_distance > threshold
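For single linkage specifically, a self-contained version of such a check can be sketched in plain Python on 1-D points. Both helpers are hypothetical. The sketch deliberately uses the nearest in-cluster neighbour rather than the maximum in-cluster distance, because single linkage can chain points together so that the farthest members of one cluster are more than the threshold apart.

```python
from itertools import combinations


def single_linkage_labels(points, threshold):
    """Cut 1-D single linkage at `threshold` via connected components.

    Points closer than `threshold` are linked; each connected component is
    one cluster, matching single linkage with merges below the cut applied.
    """
    labels = list(range(len(points)))
    for i, j in combinations(range(len(points)), 2):
        if abs(points[i] - points[j]) < threshold and labels[i] != labels[j]:
            old, new = labels[j], labels[i]
            labels = [new if lab == old else lab for lab in labels]
    return labels


def check_invariant(points, labels, threshold):
    """Nearest in-cluster neighbour is below the cut, nearest outsider is not."""
    for i, p in enumerate(points):
        inside = [abs(p - q) for j, q in enumerate(points)
                  if j != i and labels[j] == labels[i]]
        outside = [abs(p - q) for j, q in enumerate(points)
                   if labels[j] != labels[i]]
        if inside and min(inside) >= threshold:
            return False
        if outside and min(outside) < threshold:
            return False
    return True


points = [0.0, 1.0, 2.0, 10.0, 11.0]
labels = single_linkage_labels(points, threshold=1.5)
```

In this example the points 0.0 and 2.0 end up in the same cluster through 1.0 even though they are 2.0 apart, which is why the maximum in-cluster distance is not a safe bound for single linkage.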

Contributor Author

Apologies for the delayed response, but as far as I understand the pairwise distance will give the distance between each point in X not the distance between the clusters as they join up. The distance between each cluster is in the distances matrix calculated using the scipy.cluster.hierarchy.linkage method.

So is there still a need to have an explicit test?

Member

True. But surely a similar invariance could be constructed about the average distances with average linkage...? I've not thought about it too rigorously.

Contributor Author

Although what I said is true only when connectivity is None. If a connectivity matrix is passed in then calculating the clusters would mean deciphering what is happening in the ward_tree and linkage_tree methods, and I'm not sure it's worth the effort...

True. But surely a similar invariance could be constructed about the average distances with average linkage...? I've not thought about it too rigorously.

I suppose, but is there a need to do this?
I'm really not sure how to do a more explicit test, so I'd really appreciate help with this.

@jnothman
Member

@thomasjpfan, feel like giving this a once-over?

Member

@NicolasHug NicolasHug left a comment

First round, haven't reviewed the tests yet

Member

@NicolasHug NicolasHug left a comment

Tests need an update

@@ -711,8 +711,19 @@ class AgglomerativeClustering(BaseEstimator, ClusterMixin):
``pooling_func`` has been deprecated in 0.20 and will be removed
in 0.22.

distance_threshold : float, optional (default=None)
The distance threshold to cluster at. If not ``None``, ``n_clusters``
Member

I still think this needs a description. It isn't clear what this parameter does right now.

@jnothman jnothman modified the milestones: 0.21, 0.22 Apr 29, 2019
@NicolasHug NicolasHug changed the title [MRG+1] Distance threshold added to hierarchical clustering Added distance_threshold parameter to hierarchical clustering Apr 29, 2019
@NicolasHug NicolasHug merged commit 602f3d6 into scikit-learn:master Apr 29, 2019
@NicolasHug
Member

Thanks for the update Adrin.

Just merged. @jnothman, this can get into 0.21, right? Otherwise we'll need to update the versionadded parts.

@jnothman
Member

jnothman commented Apr 29, 2019 via email

@jnothman
Member

jnothman commented Apr 29, 2019 via email

@bede

bede commented May 2, 2019

This feature can't come soon enough.

@NicolasHug
Member

Sorry Joel I was out for the last few days and didn't answer, I just saw that you cherry picked it. Thanks!

koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Successfully merging this pull request may close these issues.

Hierarchical clustering: distance threshold