Added distance_threshold parameter to hierarchical clustering #9069
Conversation
(force-pushed a868db7 to 158c942)
I think the AppVeyor failures would be fixed if you rebase onto master.
Thanks for the PR.
I'd rather see a test which checks the invariant that we want to hold:
- set a distance threshold
- check that the clusters produced match the point in the linkage tree where the threshold would have been exceeded
- test the boundary case of the threshold equalling the distance
Btw, is this always specified as an absolute distance? Might we want (or prefer) to specify it relative to the average or median pairwise distance, for instance?
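The proposed invariant could be sketched roughly as follows (a hypothetical test using scipy's `linkage`, whose third column holds the merge distances; this is an illustration, not code from this PR):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.RandomState(0)
X = rng.rand(20, 3)

# Z[:, 2] holds the distance at which each of the n-1 merges happened,
# in non-decreasing order for a monotonic linkage such as ward.
Z = linkage(X, method="ward")
threshold = float(np.median(Z[:, 2]))

# Applying every merge below the threshold leaves this many clusters:
n_merges_below = int(np.sum(Z[:, 2] < threshold))
n_clusters_at_threshold = len(X) - n_merges_below

# Equivalently: one more than the number of merges at or above it.
assert n_clusters_at_threshold == int(np.sum(Z[:, 2] >= threshold)) + 1
```

The boundary case mentioned above is exactly whether a merge whose distance equals the threshold is applied or not, i.e. whether the comparison is `<` or `<=`.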
    @@ -214,6 +214,104 @@ def test_agglomerative_clustering():
        assert_array_equal(clustering.labels_, clustering2.labels_)


    def test_agglomerative_clustering_with_distance_threshold():
I don't think there's good reason to duplicate the tests for this parameter.
(force-pushed 158c942 to a0e2df2)
@massich I rebased off master to fix the CircleCI errors, and now Travis fails while looking for … And thank you very much for the review; I'll get the changes sorted out and get back soon.
The problem comes from the way you had fixed this comment. Surely I didn't express myself properly: what I was proposing was to use some dummy variables to improve readability.

    if distance_threshold:
        (ch, com, le, pa) = \
            memory.cache...
    else:
        (ch, com, le, pa) = \
            memory.cache...
    self.children_ = ch
    self.n_comp...

That's why Travis failed. You can execute the PEP8 check locally: bash ./build_tools/travis/flake8_diff.sh
I don't think that's an appropriate solution Raghav
…On 29 Jun 2017 1:57 am, "(Venkat) Raghav, Rajagopalan" wrote, commenting on this pull request, in sklearn/cluster/hierarchical.py:

    @@ -696,7 +717,15 @@ def fit(self, X, y=None):
                         " instance, got 'memory={!r}' instead.".format(
                             type(memory)))
    -        if self.n_clusters <= 0:
    +        if (self.distance_threshold and self.n_clusters) or \
    +                (not self.n_clusters and not self.distance_threshold):
    +            raise ValueError("Either n_clusters (>0) or distance_threshold "
    +                             "needs to be set, got n_clusters={} and "
    +                             "distance_threshold={} instead. "
    +                             "Please set n_clusters=None to continue.".format(
Instead of setting the n_clusters, we can make it a private attribute at __init__ like

    self._n_clusters = n_clusters

and have a property to define its actual value like

    @property
    def n_clusters(self):
        if self._n_clusters is None and self.distance_threshold is None:
            return 2
        elif (self.distance_threshold is None) ^ (self._n_clusters is None):
            return self._n_clusters
        else:
            raise AttributeError("Both n_clusters and distance_threshold "
                                 "parameters were set during initialization.")
Are you then okay with users having to explicitly set …?

I'm unable to recollect which other class has something like this. But I feel we did come across this problem before.

@massich Thanks for this, I usually forget to test flake8 locally and noticed the one-too-many spaces later. But the error I was talking about was a timeout error in a previous build, which fixed itself on the next push and hasn't appeared since. I have also fixed the variables for better readability.

@jnothman I have updated the tests, but could I get a quick review to see if I missed anything?

@jnothman @raghavrv How can I move forward on setting or not setting n_clusters=None when using distance_threshold?
(force-pushed 4e7e49d to 065decb)
                                 return_distance=True)
    clusters_at_threshold = np.count_nonzero(
        distances >= distance_threshold) + 1
    assert_true(clusters_at_threshold == clusters_produced)
No, I meant you should check that the clusters, not the number of clusters, match.
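Comparing the clusters themselves, rather than just their count, could look something like this sketch, which uses scipy's `fcluster` and assumes a monotonic linkage; it is an illustration, not the PR's test code:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
X = rng.rand(25, 2)
Z = linkage(X, method="average")
threshold = float(np.median(Z[:, 2]))

# Cut once by distance, then once by the resulting cluster count.
by_distance = fcluster(Z, t=threshold, criterion="distance")
n_clusters = int(by_distance.max())
by_count = fcluster(Z, t=n_clusters, criterion="maxclust")

# The two cuts should induce the same partition; labels may be permuted,
# so compare co-membership matrices instead of raw label values.
same_a = by_distance[:, None] == by_distance[None, :]
same_b = by_count[:, None] == by_count[None, :]
assert np.array_equal(same_a, same_b)
```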
(force-pushed 065decb to 0cce945)
Could someone help me finish this off? I made all the necessary changes, but a quick review would help. Thanks in advance!
There are a few small things that surprised me, like the n_clusters_ behaviour and a couple of the error cases, but this is otherwise looking good.
Great feature! When will this be merged?

Also waiting for the merge.
I don't understand what you intend by the n_clusters_ attribute. It does not appear to be documented, and it does not seem to differ from self.n_clusters. If this PR introduces a new n_clusters_ attribute, its role should be to report the number of clusters automatically identified by distance_threshold. Otherwise, I see no need for a new attribute.
sklearn/cluster/hierarchical.py (outdated)

    @@ -560,12 +563,29 @@ def _hc_cut(n_clusters, children, n_leaves):
        n_leaves : int
            Number of leaves of the tree.

        distance_threshold : int (optional)
            The distance threshold to cluster at.
            If ``distance_threshold`` is set then ``n_clusters`` will be the
It is confusing here to refer to the n_clusters parameter. I'd rather: "If distance_threshold is set then n_clusters will be ignored. Instead, the number of clusters will be determined by when distances between clusters first exceed the distance_threshold during agglomeration."
sklearn/cluster/hierarchical.py (outdated)

            NOT both.

        distance_threshold : int (optional)
            The distance threshold to cluster at.
Update to match above
sklearn/cluster/hierarchical.py (outdated)

    """Function cutting the ward tree for a given number of clusters.

    Parameters
    ----------
    n_clusters : int or ndarray
        The number of clusters to form.
        NOTE: You should set either ``n_clusters`` or ``distance_threshold``,
        NOT both.
This comment doesn't really make sense if distance_threshold overrides n_clusters. I think it's fine to only have the comment under distance_threshold saying that setting it makes this ignored.
You're absolutely right. I'll also take care of the remaining documentation issues.
When distance_threshold is set, it is used to determine the number of clusters to cut the tree at. Note that this works only when compute_full_tree=True.
* When building the tree, return_distance is set to True if distance_threshold has been set. The distances returned are then used to calculate the number of clusters when cutting the tree.
* Test agglomerative clustering with distance_threshold passed in, and compare the number of clusters produced with and without connectivity.
* Changes to documentation to include distance_threshold.

Updates to distance threshold in hierarchical clustering
* Moved the parameter check from __init__ to fit for consistency
* Updates to tests to account for the changes made above

Documentation changes based on review
* Backticks for variables in docstrings
* Formatting without backslashes

Test for hierarchical clustering with distance_threshold
* Clusters produced are checked against the linkage tree to confirm they match the point where the distance exceeds the threshold set
* Boundary case test when distance_threshold is equal to the distance
* Updated tests to compare clusters and the number of clusters

Further updates
* Allowing users to set n_clusters or distance_threshold, with updated tests
* Checking the n_clusters None condition better
* Removed the necessity for n_clusters_ to be set to None, and redundant checks
* Updated tests after the above changes
* Cleaned up the test comparing clusters produced using n_clusters against distance_threshold
* Added and simplified tests for boundary conditions
* Updated the documentation on distance_threshold restrictions
* Docstring updates for clearer information
* Removed redundant attribute n_clusters_
* Fixed tests
* Changes to FeatureAgglomeration to include distance_threshold
(force-pushed 1c9ad69 to 8ea9afa)
sklearn/cluster/hierarchical.py (outdated)

            self.labels_ = _hc_cut(self.n_clusters, self.children_,
                                   self.n_leaves_)
        if distance_threshold is not None:
            self.labels_ = _hc_cut(self.n_clusters, self.children_,
Why don't you just calculate n_clusters from distances here and avoid modifying _hc_cut (adding two parameters for the sake of one line of logic)?
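A minimal sketch of that suggestion: compute the cluster count from the returned merge distances in fit, then pass it to the unmodified _hc_cut. The helper name `n_clusters_from_threshold` is hypothetical, not from the PR:

```python
import numpy as np

def n_clusters_from_threshold(distances, threshold):
    """Number of clusters left once merges at/above `threshold` are undone.

    `distances` is the array of merge distances returned by the tree
    builder, one entry per merge.
    """
    return int(np.count_nonzero(np.asarray(distances) >= threshold)) + 1

# With 5 merges and a threshold of 2.5, two merges are blocked -> 3 clusters.
assert n_clusters_from_threshold([0.5, 1.0, 2.0, 3.0, 4.0], 2.5) == 3
```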
I think what we should do is change n_clusters to have a default value of None, which means "2 if distance_threshold is unset", and then raise an error when fitting a model with both n_clusters and distance_threshold set.

Otherwise, this is looking good.
    n_clus = -1
    agc = AgglomerativeClustering(n_clusters=n_clus)
    msg = ("n_clusters should be an integer greater than 0."
           " %s was provided." % str(agc.n_clusters))
Either use a loop to test both -1 and 0, or put -1 directly in the expected error message
    def test_agg_n_cluster_and_distance_threshold():
        # Test that when distance_threshold is set n_clusters_ is unchanged
This isn't the right comment anymore
    children, n_components, n_leaves, parent, distances = \
        tree_builder(X, connectivity=conn, n_clusters=None,
                     return_distance=True)
    num_clusters_at_threshold = np.count_nonzero(
Perhaps we should test something more explicit, just to be sure that your logic here is correct, like check that in single linkage, the maximum within-cluster pairwise distance for each sample is under the threshold and the minimum out-of-cluster pairwise distance is greater.
Do you mean along the lines of this test?
I could use the same dataset to do a more explicit test.
I don't see how it relates to that test. I mean that for some X and some predicted labels:

    D = pairwise_distances(X, metric=metric)
    for i in range(len(X)):
        in_cluster_mask = labels == labels[i]
        max_in_cluster_distance = D[i, in_cluster_mask].max()
        min_out_cluster_distance = D[i, ~in_cluster_mask].min()
        # XXX: there should be equality on one of these conditions
        assert max_in_cluster_distance < threshold
        assert min_out_cluster_distance > threshold
Apologies for the delayed response, but as far as I understand, the pairwise distances give the distance between each pair of points in X, not the distance between the clusters as they join up. The distance between clusters is in the distances matrix calculated using the scipy.cluster.hierarchy.linkage method.

So is there still a need to have an explicit test?
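The distinction being drawn here can be illustrated with a small scipy sketch (illustrative only, assuming scipy is available):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

rng = np.random.RandomState(0)
X = rng.rand(10, 2)

point_distances = pdist(X)      # one distance per pair of samples
Z = linkage(X, method="single")
merge_distances = Z[:, 2]       # one distance per merge of two clusters

# There are n*(n-1)/2 pairwise distances but only n-1 merge distances.
assert len(point_distances) == 10 * 9 // 2
assert len(merge_distances) == 10 - 1

# For single linkage each merge happens at some pairwise distance; for
# ward or average linkage this is generally not the case.
assert all(np.isclose(point_distances, d).any() for d in merge_distances)
```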
True. But surely a similar invariance could be constructed about the average distances with average linkage...? I've not thought about it too rigorously.
Although what I said is true only when connectivity is None. If a connectivity matrix is passed in, then calculating the clusters would mean deciphering what is happening in the ward_tree and linkage_tree methods, and I'm not sure it's worth the effort...

> True. But surely a similar invariance could be constructed about the average distances with average linkage...? I've not thought about it too rigorously.

I suppose, but is there a need to do this? I'm really not sure how to do a more explicit test, so I'd really appreciate help with this.
@thomasjpfan, feel like giving this a once-over?
First round, haven't reviewed the tests yet
Tests need an update
sklearn/cluster/hierarchical.py (outdated)

    @@ -711,8 +711,19 @@ class AgglomerativeClustering(BaseEstimator, ClusterMixin):
            ``pooling_func`` has been deprecated in 0.20 and will be removed
            in 0.22.

        distance_threshold : float, optional (default=None)
            The distance threshold to cluster at. If not ``None``, ``n_clusters``
I still think this needs a description. It isn't clear what this parameter does right now.
Thanks for the update Adrin. Just merged. @jnothman this can get into 0.21, right? Else we'll need to update the …
0.21 has been branched. Should we move this to 0.22 to keep things clean?

Sorry, I didn't see the last comment. I'd be happy to keep it out of 0.21 to keep the lines clear. But we can cherry-pick it in if you'd rather.
This feature can't come soon enough.

Sorry Joel, I was out for the last few days and didn't answer; I just saw that you cherry-picked it. Thanks!
Reference Issue

Fixes #3796

What does this implement/fix? Explain your changes.

Hierarchical clustering now has a distance_threshold parameter which can be set instead of n_clusters, and this is used to determine the number of clusters to cut the tree at. Note: this works only when compute_full_tree=True and n_clusters=None.

Changes
* Either distance_threshold or n_clusters is accepted as a parameter. When distance_threshold is set, n_clusters needs to be set to None, as the default of 2 clusters has been retained.
* return_distance is set to True when building the tree if distance_threshold has been set. The distances returned are then used to calculate the number of clusters when cutting the tree.

Tests
* Tests pass n_clusters=None as a parameter, as the default for n_clusters is 2.

Documentation
* Added the distance_threshold parameter to the table.
* Added a distance_threshold explanation.

Any other comments?

Is this implementation the right behaviour of distance_threshold, or should the construction of the tree stop when the distance is reached? Apologies for the constant stream of commits, I was trying to sort out the errors myself.
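For reference, the feature as it shipped in scikit-learn 0.21 can be used along these lines (a usage sketch assuming the released API, where n_clusters=None must be passed explicitly):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.RandomState(0)
# Two well-separated blobs of 10 points each.
X = np.concatenate([rng.randn(10, 2), rng.randn(10, 2) + 15])

# n_clusters must be None when distance_threshold is given.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=9.0)
labels = model.fit_predict(X)

# n_clusters_ reports the number of clusters found for this threshold.
assert model.n_clusters_ == len(set(labels))
```

With this separation the threshold should typically recover the two blobs; compute_full_tree defaults to 'auto', which behaves as True when distance_threshold is set.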