Added distance_threshold parameter to hierarchical clustering #9069
Conversation
(force-pushed a868db7 to 158c942)
I think the AppVeyor failures would be fixed if you rebase onto master.
Thanks for the PR.
I'd rather see a test which checks the invariant that we want to hold:
- set a distance threshold
- check that the clusters produced match the point in the linkage tree where the threshold would have been exceeded
- test the boundary case of the threshold equalling the distance
Btw, is this always specified as an absolute distance? Might we want (or prefer) to specify it relative to the average or median pairwise distance, for instance?
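The proposed invariant could be sketched roughly as follows (a hypothetical test using scipy's `linkage`, whose third column holds the merge distances; this is an illustration, not code from this PR):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.RandomState(0)
X = rng.rand(20, 3)

# Z[:, 2] holds the distance at which each of the n-1 merges happened,
# in non-decreasing order for a monotonic linkage such as ward.
Z = linkage(X, method="ward")
threshold = float(np.median(Z[:, 2]))

# Applying every merge below the threshold leaves this many clusters:
n_merges_below = int(np.sum(Z[:, 2] < threshold))
n_clusters_at_threshold = len(X) - n_merges_below

# Equivalently: one more than the number of merges at or above it.
assert n_clusters_at_threshold == int(np.sum(Z[:, 2] >= threshold)) + 1
```

The boundary case mentioned above is exactly whether a merge whose distance equals the threshold is applied or not, i.e. whether the comparison is `<` or `<=`.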
    @@ -214,6 +214,104 @@ def test_agglomerative_clustering():
        assert_array_equal(clustering.labels_, clustering2.labels_)


    def test_agglomerative_clustering_with_distance_threshold():
I don't think there's good reason to duplicate the tests for this parameter.
(force-pushed 158c942 to a0e2df2)
@massich I rebased off master to fix the CircleCI errors, and now Travis fails while looking for … And thank you very much for the review; I'll get the changes sorted out and get back soon.
The problem comes from the way you had fixed this comment. Surely I didn't express myself properly: what I was proposing was to use some dummy variables to improve readability.

    if distance_threshold:
        (ch, com, le, pa) = \
            memory.cache...
    else:
        (ch, com, le, pa) = \
            memory.cache...
    self.children_ = ch
    self.n_comp...

That's why Travis failed. You can execute the PEP8 check locally: bash ./build_tools/travis/flake8_diff.sh
I don't think that's an appropriate solution Raghav
…On 29 Jun 2017 1:57 am, "(Venkat) Raghav, Rajagopalan" wrote, commenting on this pull request, in sklearn/cluster/hierarchical.py:

    @@ -696,7 +717,15 @@ def fit(self, X, y=None):
                         " instance, got 'memory={!r}' instead.".format(
                             type(memory)))
    -        if self.n_clusters <= 0:
    +        if (self.distance_threshold and self.n_clusters) or \
    +                (not self.n_clusters and not self.distance_threshold):
    +            raise ValueError("Either n_clusters (>0) or distance_threshold "
    +                             "needs to be set, got n_clusters={} and "
    +                             "distance_threshold={} instead. "
    +                             "Please set n_clusters=None to continue.".format(
Instead of setting the n_clusters, we can make it a private attribute at __init__ like

    self._n_clusters = n_clusters

and have a property to define its actual value like

    @property
    def n_clusters(self):
        if self._n_clusters is None and self.distance_threshold is None:
            return 2
        elif (self.distance_threshold is None) ^ (self._n_clusters is None):
            return self._n_clusters
        else:
            raise AttributeError("Both n_clusters and distance_threshold "
                                 "parameters were set during initialization.")
Are you then okay with users having to explicitly set …?

I'm unable to recollect which other class has something like this. But I feel we did come across this problem before.

@massich Thanks for this, I usually forget to test flake8 locally and noticed the one-too-many spaces later. But the error I was talking about was a timeout error in a previous build, which fixed itself on the next push and hasn't appeared since. I have also fixed the variables for better readability.

@jnothman I have updated the tests, but could I get a quick review to see if I missed anything?

@jnothman @raghavrv How can I move forward on setting or not setting n_clusters=None when using distance_threshold?
(force-pushed 4e7e49d to 065decb)
                                 return_distance=True)
    clusters_at_threshold = np.count_nonzero(
        distances >= distance_threshold) + 1
    assert_true(clusters_at_threshold == clusters_produced)
No, I meant you should check that the clusters, not the number of clusters, match.
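Comparing the clusters themselves, rather than just their count, could look something like this sketch, which uses scipy's `fcluster` and assumes a monotonic linkage; it is an illustration, not the PR's test code:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
X = rng.rand(25, 2)
Z = linkage(X, method="average")
threshold = float(np.median(Z[:, 2]))

# Cut once by distance, then once by the resulting cluster count.
by_distance = fcluster(Z, t=threshold, criterion="distance")
n_clusters = int(by_distance.max())
by_count = fcluster(Z, t=n_clusters, criterion="maxclust")

# The two cuts should induce the same partition; labels may be permuted,
# so compare co-membership matrices instead of raw label values.
same_a = by_distance[:, None] == by_distance[None, :]
same_b = by_count[:, None] == by_count[None, :]
assert np.array_equal(same_a, same_b)
```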
(force-pushed 065decb to 0cce945)
Could someone help me finish this off? I made all the necessary changes, but a quick review would help. Thanks in advance!
There are a few small things that surprised me, like the n_clusters_ behaviour and a couple of the error cases, but this is otherwise looking good.
Great feature! When will this be merged?

Also waiting for the merge.
I don't understand what you intend by the n_clusters_ attribute. It does not appear to be documented, and it does not seem to differ from self.n_clusters. If this PR introduces a new n_clusters_ attribute, its role should be to report the number of clusters automatically identified by distance_threshold. Otherwise, I see no need for a new attribute.
sklearn/cluster/hierarchical.py (outdated)

    @@ -560,12 +563,29 @@ def _hc_cut(n_clusters, children, n_leaves):
        n_leaves : int
            Number of leaves of the tree.

        distance_threshold : int (optional)
            The distance threshold to cluster at.
            If ``distance_threshold`` is set then ``n_clusters`` will be the
It is confusing here to refer to the n_clusters parameter. I'd rather: "If distance_threshold is set then n_clusters will be ignored. Instead, the number of clusters will be determined by when distances between clusters first exceed the distance_threshold during agglomeration."
sklearn/cluster/hierarchical.py (outdated)

            NOT both.

        distance_threshold : int (optional)
            The distance threshold to cluster at.
Update to match above
sklearn/cluster/hierarchical.py (outdated)

    """Function cutting the ward tree for a given number of clusters.

    Parameters
    ----------
    n_clusters : int or ndarray
        The number of clusters to form.
        NOTE: You should set either ``n_clusters`` or ``distance_threshold``,
        NOT both.
This comment doesn't really make sense if distance_threshold overrides n_clusters. I think it's fine to only have the comment under distance_threshold saying that setting it makes this ignored.
You're absolutely right. I'll also take care of the remaining documentation issues.
When distance_threshold is set, it is used to determine the number of clusters to cut the tree at. Note that this works only when compute_full_tree=True.
* When building the tree, return_distance is set to True if distance_threshold has been set. The distances returned are then used to calculate the number of clusters when cutting the tree.
* Test agglomerative clustering with distance_threshold passed in, and compare the number of clusters produced with and without connectivity.
* Changes to documentation to include distance_threshold.

Updates to distance threshold in hierarchical clustering
* Moved the parameter check from __init__ to fit for consistency
* Updates to tests to account for the changes made above

Documentation changes based on review
* Backticks for variables in docstrings
* Formatting without backslashes

Test for hierarchical clustering with distance_threshold
* Clusters produced are checked against the linkage tree to confirm they match the point where the distance exceeds the threshold set
* Boundary case test when distance_threshold is equal to the distance
* Updated tests to compare clusters and the number of clusters

Further updates
* Allowing users to set n_clusters or distance_threshold, with updated tests
* Checking the n_clusters None condition better
* Removed the necessity for n_clusters_ to be set to None, and redundant checks
* Updated tests after the above changes
* Cleaned up the test comparing clusters produced using n_clusters against distance_threshold
* Added and simplified tests for boundary conditions
* Updated the documentation on distance_threshold restrictions
* Docstring updates for clearer information
* Removed redundant attribute n_clusters_
* Fixed tests
* Changes to FeatureAgglomeration to include distance_threshold
(force-pushed 1c9ad69 to 8ea9afa)
sklearn/cluster/hierarchical.py (outdated)

            self.labels_ = _hc_cut(self.n_clusters, self.children_,
                                   self.n_leaves_)
        if distance_threshold is not None:
            self.labels_ = _hc_cut(self.n_clusters, self.children_,
Why don't you just calculate n_clusters from distances here and avoid modifying _hc_cut (adding two parameters for the sake of one line of logic)?
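A minimal sketch of that suggestion: compute the cluster count from the returned merge distances in fit, then pass it to the unmodified _hc_cut. The helper name `n_clusters_from_threshold` is hypothetical, not from the PR:

```python
import numpy as np

def n_clusters_from_threshold(distances, threshold):
    """Number of clusters left once merges at/above `threshold` are undone.

    `distances` is the array of merge distances returned by the tree
    builder, one entry per merge.
    """
    return int(np.count_nonzero(np.asarray(distances) >= threshold)) + 1

# With 5 merges and a threshold of 2.5, two merges are blocked -> 3 clusters.
assert n_clusters_from_threshold([0.5, 1.0, 2.0, 3.0, 4.0], 2.5) == 3
```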
I think what we should do is change n_clusters to have a default value of None, which means "2 if distance_threshold is unset", and then raise an error when fitting a model with both n_clusters and distance_threshold set.

Otherwise, this is looking good.
    n_clus = -1
    agc = AgglomerativeClustering(n_clusters=n_clus)
    msg = ("n_clusters should be an integer greater than 0."
           " %s was provided." % str(agc.n_clusters))
Either use a loop to test both -1 and 0, or put -1 directly in the expected error message
    def test_agg_n_cluster_and_distance_threshold():
        # Test that when distance_threshold is set n_clusters_ is unchanged
This isn't the right comment anymore
    children, n_components, n_leaves, parent, distances = \
        tree_builder(X, connectivity=conn, n_clusters=None,
                     return_distance=True)
    num_clusters_at_threshold = np.count_nonzero(
Perhaps we should test something more explicit, just to be sure that your logic here is correct, like check that in single linkage, the maximum within-cluster pairwise distance for each sample is under the threshold and the minimum out-of-cluster pairwise distance is greater.
Do you mean along the lines of this test?
I could use the same dataset to do a more explicit test.
I don't see how it relates to that test. I mean that for some X and some predicted labels:

    D = pairwise_distances(X, metric=metric)
    for i in range(len(X)):
        in_cluster_mask = labels == labels[i]
        max_in_cluster_distance = D[i, in_cluster_mask].max()
        min_out_cluster_distance = D[i, ~in_cluster_mask].min()
        # XXX: there should be equality on one of these conditions
        assert max_in_cluster_distance < threshold
        assert min_out_cluster_distance > threshold
Apologies for the delayed response, but as far as I understand, the pairwise distances give the distance between each pair of points in X, not the distance between the clusters as they join up. The distance between clusters is in the distances matrix calculated using the scipy.cluster.hierarchy.linkage method.

So is there still a need to have an explicit test?
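The distinction being drawn here can be illustrated with a small scipy sketch (illustrative only, assuming scipy is available):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

rng = np.random.RandomState(0)
X = rng.rand(10, 2)

point_distances = pdist(X)      # one distance per pair of samples
Z = linkage(X, method="single")
merge_distances = Z[:, 2]       # one distance per merge of two clusters

# There are n*(n-1)/2 pairwise distances but only n-1 merge distances.
assert len(point_distances) == 10 * 9 // 2
assert len(merge_distances) == 10 - 1

# For single linkage each merge happens at some pairwise distance; for
# ward or average linkage this is generally not the case.
assert all(np.isclose(point_distances, d).any() for d in merge_distances)
```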
True. But surely a similar invariance could be constructed about the average distances with average linkage...? I've not thought about it too rigorously.
Although what I said is true only when connectivity is None. If a connectivity matrix is passed in, then calculating the clusters would mean deciphering what is happening in the ward_tree and linkage_tree methods, and I'm not sure it's worth the effort...

> True. But surely a similar invariance could be constructed about the average distances with average linkage...? I've not thought about it too rigorously.

I suppose, but is there a need to do this? I'm really not sure how to do a more explicit test, so I'd really appreciate help with this.
@thomasjpfan, feel like giving this a once-over?
First round, haven't reviewed the tests yet
Tests need an update
sklearn/cluster/hierarchical.py (outdated)

    @@ -711,8 +711,19 @@ class AgglomerativeClustering(BaseEstimator, ClusterMixin):
            ``pooling_func`` has been deprecated in 0.20 and will be removed
            in 0.22.

        distance_threshold : float, optional (default=None)
            The distance threshold to cluster at. If not ``None``, ``n_clusters``
I still think this needs a description. It isn't clear what this parameter does right now.
Thanks for the update Adrin. Just merged. @jnothman this can get into 0.21, right? Else we'll need to update the …
0.21 has been branched. Should we move this to 0.22 to keep things clean?

Sorry, I didn't see the last comment. I'd be happy to keep it out of 0.21 to keep the lines clear. But we can cherry-pick it in if you'd rather.
This feature can't come soon enough.

Sorry Joel, I was out for the last few days and didn't answer; I just saw that you cherry-picked it. Thanks!
Reference Issue

Fixes #3796

What does this implement/fix? Explain your changes.

Hierarchical clustering now has a distance_threshold parameter which can be set instead of n_clusters, and this is used to determine the number of clusters to cut the tree at. Note: this works only when compute_full_tree=True and n_clusters=None.

Changes
* Either distance_threshold or n_clusters is accepted as a parameter. When distance_threshold is set, n_clusters needs to be set to None, as the default of 2 clusters has been retained.
* return_distance is set to True when building the tree if distance_threshold has been set. The distances returned are then used to calculate the number of clusters when cutting the tree.

Tests
* Tests pass n_clusters=None as a parameter, as the default for n_clusters is 2.

Documentation
* Added the distance_threshold parameter to the table.
* Added a distance_threshold explanation.

Any other comments?

Is this implementation the right behaviour of distance_threshold, or should the construction of the tree stop when the distance is reached? Apologies for the constant stream of commits, I was trying to sort out the errors myself.
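For reference, the feature as it shipped in scikit-learn 0.21 can be used along these lines (a usage sketch assuming the released API, where n_clusters=None must be passed explicitly):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.RandomState(0)
# Two well-separated blobs of 10 points each.
X = np.concatenate([rng.randn(10, 2), rng.randn(10, 2) + 15])

# n_clusters must be None when distance_threshold is given.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=9.0)
labels = model.fit_predict(X)

# n_clusters_ reports the number of clusters found for this threshold.
assert model.n_clusters_ == len(set(labels))
```

With this separation the threshold should typically recover the two blobs; compute_full_tree defaults to 'auto', which behaves as True when distance_threshold is set.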