
Added distance_threshold parameter to hierarchical clustering #9069

Merged
Commits (29)
- 6c5f957 Distance threshold added to hierarchical clustering (VathsalaAchar, Jun 8, 2017)
- 8ea9afa Changes based on review (VathsalaAchar, Oct 30, 2017)
- e56670d Updates based on review (VathsalaAchar, Dec 6, 2017)
- 03709c8 Updates based on comments (VathsalaAchar, Dec 7, 2017)
- fda248d Documentation for new attribute (VathsalaAchar, Dec 7, 2017)
- c45ec1f Merge remote-tracking branch 'upstream/master' into hierarchical_clus… (adrinjalali, Apr 9, 2019)
- 6efc3ec fix n_components (adrinjalali, Apr 9, 2019)
- 47790a3 move parameter to the end of the list (adrinjalali, Apr 9, 2019)
- 71dd010 add whats_new entry (adrinjalali, Apr 9, 2019)
- 0764729 minor fix on n_clusters_ (adrinjalali, Apr 9, 2019)
- 0649b29 fix tests (adrinjalali, Apr 9, 2019)
- 83600ea fix docstrings (adrinjalali, Apr 9, 2019)
- 43ef071 minor fix (adrinjalali, Apr 9, 2019)
- 7a8fc68 remove assert_true (adrinjalali, Apr 9, 2019)
- b37c183 remove unrelated change (adrinjalali, Apr 9, 2019)
- 38659e4 add a more explicit test (adrinjalali, Apr 10, 2019)
- b17a818 merge upstream/master (adrinjalali, Apr 16, 2019)
- fd44b65 merge upstream/master (adrinjalali, Apr 22, 2019)
- cd6c9aa remove sentence, add to FeatureAgglomeration (adrinjalali, Apr 22, 2019)
- c07fa4d code style change (adrinjalali, Apr 22, 2019)
- ba326ac check compute_full_tree, and docstring fix (adrinjalali, Apr 22, 2019)
- 0b94bef apply more comments (adrinjalali, Apr 22, 2019)
- 680e515 force only one of the parameters to be non-None (adrinjalali, Apr 23, 2019)
- c84f429 merge upstream/master (adrinjalali, Apr 23, 2019)
- 3981918 Merge remote-tracking branch 'upstream/master' into hierarchical_clus… (adrinjalali, Apr 25, 2019)
- 3f1a6be improve docstrings (adrinjalali, Apr 26, 2019)
- 36cd205 merge upstream/master (adrinjalali, Apr 26, 2019)
- 4ae66fd improve tests and apply Nicolas's comments (adrinjalali, Apr 26, 2019)
- 8060020 merge upstream/master (adrinjalali, Apr 28, 2019)
4 changes: 2 additions & 2 deletions doc/modules/clustering.rst
@@ -73,13 +73,13 @@ Overview of clustering methods
- Graph distance (e.g. nearest-neighbor graph)

* - :ref:`Ward hierarchical clustering <hierarchical_clustering>`
- number of clusters
- number of clusters or distance threshold
- Large ``n_samples`` and ``n_clusters``
- Many clusters, possibly connectivity constraints
- Distances between points

* - :ref:`Agglomerative clustering <hierarchical_clustering>`
- number of clusters, linkage type, distance
- number of clusters or distance threshold, linkage type, distance
- Large ``n_samples`` and ``n_clusters``
- Many clusters, possibly connectivity constraints, non Euclidean
distances
5 changes: 5 additions & 0 deletions doc/whats_new/v0.21.rst
@@ -90,6 +90,11 @@ Support for Python 3.4 and below has been officially dropped.
``n_connected_components_``.
:issue:`13427` by :user:`Stephane Couvreur <scouvreur>`.

- |Enhancement| :class:`cluster.AgglomerativeClustering` and
:class:`cluster.FeatureAgglomeration` now accept a ``distance_threshold``
parameter which can be used to find the clusters instead of ``n_clusters``.
:issue:`9069` by :user:`Vathsala Achar <VathsalaAchar>`.

- |Fix| Fixed a bug in :class:`KMeans` where empty clusters weren't correctly
relocated when using sample weights. :issue:`13486`
by :user:`Jérémie du Boisberranger <jeremiedbb>`.
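The whats_new entry above can be sketched in use. A minimal example with made-up data: two well-separated pairs of 1-D points, clustered with the new ``distance_threshold`` parameter instead of a fixed ``n_clusters``.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Illustrative data: two tight pairs, far apart from each other.
X = np.array([[0.0], [1.0], [10.0], [11.0]])

# With distance_threshold set, n_clusters must be None; the tree is cut
# where merge distances reach the threshold rather than at a fixed count.
clustering = AgglomerativeClustering(n_clusters=None,
                                     distance_threshold=5.0).fit(X)
print(clustering.n_clusters_)  # 2: the pairs merge well below 5, the
                               # final merge is far above it
```

The number of clusters found is then reported in the fitted ``n_clusters_`` attribute introduced by this PR.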
75 changes: 61 additions & 14 deletions sklearn/cluster/hierarchical.py
@@ -711,8 +711,21 @@ class AgglomerativeClustering(BaseEstimator, ClusterMixin):
``pooling_func`` has been deprecated in 0.20 and will be removed
in 0.22.

distance_threshold : float (optional)
The distance threshold to cluster at.
NOTE: You should set either ``n_clusters`` or ``distance_threshold``,
NOT both. If the ``distance_threshold`` is set then ``n_clusters`` is
ignored.

.. versionadded:: 0.21

Attributes
----------
n_clusters_ : int
The number of clusters found by the algorithm. If
``distance_threshold=None``, it will be equal to the given
``n_clusters``. Otherwise it is set to the number of reported clusters.

labels_ : array [n_samples]
cluster labels for each point

@@ -739,8 +752,9 @@ class AgglomerativeClustering(BaseEstimator, ClusterMixin):
>>> clustering = AgglomerativeClustering().fit(X)
>>> clustering # doctest: +NORMALIZE_WHITESPACE
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
connectivity=None, linkage='ward', memory=None, n_clusters=2,
pooling_func='deprecated')
connectivity=None, distance_threshold=None,
linkage='ward', memory=None, n_clusters=2,
pooling_func='deprecated')
>>> clustering.labels_
array([1, 1, 1, 0, 0, 0])

@@ -749,8 +763,10 @@ class AgglomerativeClustering(BaseEstimator, ClusterMixin):
def __init__(self, n_clusters=2, affinity="euclidean",
memory=None,
connectivity=None, compute_full_tree='auto',
linkage='ward', pooling_func='deprecated'):
linkage='ward', pooling_func='deprecated',
distance_threshold=None):
self.n_clusters = n_clusters
self.distance_threshold = distance_threshold
self.memory = memory
self.connectivity = connectivity
self.compute_full_tree = compute_full_tree
@@ -788,10 +804,14 @@ def fit(self, X, y=None):
X = check_array(X, ensure_min_samples=2, estimator=self)
memory = check_memory(self.memory)

if self.n_clusters <= 0:
if self.n_clusters is not None and self.n_clusters <= 0:
raise ValueError("n_clusters should be an integer greater than 0."
" %s was provided." % str(self.n_clusters))

if self.n_clusters is None and self.distance_threshold is None:
raise ValueError("n_clusters and distance_threshold cannot be "
"both None.")

if self.linkage == "ward" and self.affinity != "euclidean":
raise ValueError("%s was provided as affinity. Ward can only "
"work with euclidean distances." %
@@ -814,7 +834,7 @@ def fit(self, X, y=None):
compute_full_tree = self.compute_full_tree
if self.connectivity is None:
compute_full_tree = True
if compute_full_tree == 'auto':
if compute_full_tree == 'auto' and self.distance_threshold is None:
# Early stopping is likely to give a speed up only for
# a large number of clusters. The actual threshold
# implemented here is heuristic
@@ -828,14 +848,31 @@ def fit(self, X, y=None):
if self.linkage != 'ward':
kwargs['linkage'] = self.linkage
kwargs['affinity'] = self.affinity
(self.children_, self.n_connected_components_, self.n_leaves_,
parents) = memory.cache(tree_builder)(X, connectivity,
n_clusters=n_clusters,
**kwargs)

distance_threshold = self.distance_threshold
# if distance_threshold is set then distances is returned
if distance_threshold is not None:
ch, n_comps, n_lvs, parents, distances = \
memory.cache(tree_builder)(X, connectivity,
n_clusters=n_clusters,
return_distance=True,
**kwargs)
self.n_clusters_ = np.count_nonzero(
distances >= distance_threshold) + 1
else:
ch, n_comps, n_lvs, parents = \
memory.cache(tree_builder)(X, connectivity,
n_clusters=n_clusters,
**kwargs)
self.n_clusters_ = self.n_clusters

self.children_ = ch
self.n_connected_components_ = n_comps
self.n_leaves_ = n_lvs

# Cut the tree
if compute_full_tree:
self.labels_ = _hc_cut(self.n_clusters, self.children_,
self.labels_ = _hc_cut(self.n_clusters_, self.children_,
self.n_leaves_)
else:
labels = _hierarchical.hc_get_heads(parents, copy=False)
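The branch above derives ``n_clusters_`` from the merge distances returned by the tree builder. A standalone numpy sketch of the same counting rule, with made-up merge heights (not taken from any real tree):

```python
import numpy as np

# Hypothetical merge heights of an agglomerative tree over 5 samples:
# 4 merges, non-decreasing by construction of the tree.
distances = np.array([0.5, 1.2, 3.0, 7.5])
distance_threshold = 2.0

# Every merge at a height >= the threshold is undone, and undoing a merge
# splits one cluster into two, starting from the single root cluster.
n_clusters_ = np.count_nonzero(distances >= distance_threshold) + 1
print(n_clusters_)  # 3: the merges at 3.0 and 7.5 are cut
```

This is why a full tree is needed when ``distance_threshold`` is set: the count depends on all merge heights, not just the first ``n_clusters`` of them.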
@@ -904,6 +941,14 @@ class FeatureAgglomeration(AgglomerativeClustering, AgglomerationTransform):
value, and should accept an array of shape [M, N] and the keyword
argument `axis=1`, and reduce it to an array of size [M].

distance_threshold : float (optional)
The distance threshold to cluster at.
NOTE: You should set either ``n_clusters`` or ``distance_threshold``,
NOT both. If the ``distance_threshold`` is set then ``n_clusters`` is
ignored.

.. versionadded:: 0.21

Attributes
----------
labels_ : array-like, (n_features,)
@@ -933,8 +978,9 @@ class FeatureAgglomeration(AgglomerativeClustering, AgglomerationTransform):
>>> agglo = cluster.FeatureAgglomeration(n_clusters=32)
>>> agglo.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
FeatureAgglomeration(affinity='euclidean', compute_full_tree='auto',
connectivity=None, linkage='ward', memory=None, n_clusters=32,
pooling_func=...)
connectivity=None, distance_threshold=None, linkage='ward',
memory=None, n_clusters=32,
pooling_func=...)
>>> X_reduced = agglo.transform(X)
>>> X_reduced.shape
(1797, 32)
@@ -943,11 +989,12 @@ class FeatureAgglomeration(AgglomerativeClustering, AgglomerationTransform):
def __init__(self, n_clusters=2, affinity="euclidean",
memory=None,
connectivity=None, compute_full_tree='auto',
linkage='ward', pooling_func=np.mean):
linkage='ward', pooling_func=np.mean,
distance_threshold=None):
super().__init__(
n_clusters=n_clusters, memory=memory, connectivity=connectivity,
compute_full_tree=compute_full_tree, linkage=linkage,
affinity=affinity)
affinity=affinity, distance_threshold=distance_threshold)
self.pooling_func = pooling_func

def fit(self, X, y=None, **params):
100 changes: 100 additions & 0 deletions sklearn/cluster/tests/test_hierarchical.py
@@ -14,6 +14,7 @@
from scipy import sparse
from scipy.cluster import hierarchy

from sklearn.metrics.cluster.supervised import adjusted_rand_score
from sklearn.utils.testing import assert_raises
from sklearn.utils.testing import assert_equal
from sklearn.utils.testing import assert_almost_equal
@@ -573,6 +574,21 @@ def test_agg_n_clusters():
assert_raise_message(ValueError, msg, agc.fit, X)


def test_agg_n_cluster_and_distance_threshold():
# Test that when distance_threshold is set that n_clusters is ignored

n_clus, dist_thresh = None, 10
rng = np.random.RandomState(0)
X = rng.rand(20, 10)
agc = AgglomerativeClustering(n_clusters=n_clus,
                              distance_threshold=dist_thresh)

Review comment (Member): Ideally we should check that you get the same behaviour if n_clus is an int.

agc.fit(X)
# Expecting no errors here
assert agc.n_clusters == n_clus
assert agc.n_clusters_ != n_clus
assert agc.n_clusters_ > 0


def test_affinity_passed_to_fix_connectivity():
# Test that the affinity parameter is actually passed to the pairwise
# function
@@ -600,6 +616,90 @@ def increment(self, *args, **kwargs):
assert_equal(fa.counter, 3)


def test_agglomerative_clustering_with_distance_threshold():
# Check that we obtain the correct number of clusters with
# agglomerative clustering with distance_threshold.

rng = np.random.RandomState(0)
mask = np.ones([10, 10], dtype=np.bool)
n_samples = 100
X = rng.randn(n_samples, 50)
connectivity = grid_to_graph(*mask.shape)
# test when distance threshold is set to 10
distance_threshold = 10
for linkage in ("ward", "complete", "average"):
for conn in [None, connectivity]:
clustering = AgglomerativeClustering(
distance_threshold=distance_threshold,
connectivity=conn, linkage=linkage)
clustering.fit(X)
clusters_produced = clustering.labels_
num_clusters_produced = len(np.unique(clustering.labels_))
# test if the clusters produced match the point in the linkage tree
# where the distance exceeds the threshold
tree_builder = _TREE_BUILDERS[linkage]
children, n_components, n_leaves, parent, distances = \
tree_builder(X, connectivity=conn, n_clusters=None,
return_distance=True)
num_clusters_at_threshold = np.count_nonzero(
    distances >= distance_threshold) + 1

Review thread:

jnothman (Member, Nov 9, 2017): Perhaps we should test something more explicit, just to be sure that your logic here is correct, like check that in single linkage, the maximum within-cluster pairwise distance for each sample is under the threshold and the minimum out-of-cluster pairwise distance is greater.

VathsalaAchar (Contributor, Author): Do you mean along the lines of this test? I could use the same dataset to do a more explicit test.

jnothman (Member, Dec 7, 2017): I don't see how it relates to that test. I mean that for some X and some predicted labels:

D = pairwise_distances(X, metric=metric)
for i in range(len(X)):
    in_cluster_mask = labels == labels[i]
    max_in_cluster_distance = D[i, in_cluster_mask].max()
    min_out_cluster_distance = D[i, ~in_cluster_mask].min()
    # XXX: there should be equality on one of these conditions
    assert max_in_cluster_distance < threshold
    assert min_out_cluster_distance > threshold

VathsalaAchar (Contributor, Author): Apologies for the delayed response, but as far as I understand the pairwise distance will give the distance between each point in X, not the distance between the clusters as they join up. The distance between each cluster is in the distances matrix calculated using the scipy.cluster.hierarchy.linkage method. So is there still a need to have an explicit test?

jnothman (Member): True. But surely a similar invariance could be constructed about the average distances with average linkage...? I've not thought about it too rigorously.

VathsalaAchar (Contributor, Author): Although what I said is true only when connectivity is None. If a connectivity matrix is passed in then calculating the clusters would mean deciphering what is happening in the ward_tree and linkage_tree methods, and I'm not sure it's worth the effort. I suppose, but is there a need to do this? I'm really not sure how to do a more explicit test, so I'd really appreciate help with this.

# test number of clusters produced
assert num_clusters_at_threshold == num_clusters_produced
# test clusters produced
clusters_at_threshold = _hc_cut(n_clusters=num_clusters_produced,
children=children,
n_leaves=n_leaves)
assert np.array_equiv(clusters_produced,
clusters_at_threshold)

rng = np.random.RandomState(0)
n_samples = 10
X = rng.randint(-3, 3, size=(n_samples, 3))
# this should result in all data in their own clusters
clustering = AgglomerativeClustering(
distance_threshold=1,
linkage="single").fit(X)
assert len(np.unique(clustering.labels_)) == 10

# check the distances within the clusters and with other clusters
threshold = 2
clustering = AgglomerativeClustering(
distance_threshold=threshold,
linkage="single").fit(X)
labels = clustering.labels_
D = pairwise_distances(X, metric="euclidean")
# to avoid taking the 0 diagonal in min()
np.fill_diagonal(D, np.inf)
for i in np.unique(labels):
in_cluster_mask = labels == i
max_in_cluster_distance = (D[in_cluster_mask][:, in_cluster_mask]
.min(axis=0).max())
min_out_cluster_distance = (D[in_cluster_mask][:, ~in_cluster_mask]
.min(axis=0).min())
# single data point clusters only have that inf diagonal here
if in_cluster_mask.sum() > 1:
assert max_in_cluster_distance < threshold
assert min_out_cluster_distance >= threshold
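The single-linkage invariant the test above checks can be restated on a toy example. A self-contained sketch with assumed labels for two well-separated groups (the labels are hand-written here, not produced by the estimator):

```python
import numpy as np

# Illustrative 1-D points in two separated groups, with the labels single
# linkage would assign at a threshold of 3.0 (assumed for this sketch).
X = np.array([0.0, 1.0, 10.0, 11.0])
labels = np.array([0, 0, 1, 1])
threshold = 3.0

D = np.abs(X[:, None] - X[None, :])   # pairwise distances for 1-D data
np.fill_diagonal(D, np.inf)           # ignore self-distances in the minima

for i in range(len(X)):
    same = labels == labels[i]
    # Single linkage: each point has an in-cluster neighbour closer than
    # the threshold, and its nearest out-of-cluster point is at least as far.
    assert D[i, same].min() < threshold
    assert D[i, ~same].min() >= threshold
```

As the thread above notes, this invariant is specific to single linkage with no connectivity constraint; for other linkages the merge heights are not plain pairwise distances.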


def test_agglomerative_clustering_with_distance_threshold_edge_case():
# test boundary case of distance_threshold matching the distance
X = [[0], [1]]
for linkage in ("ward", "complete", "average"):
for threshold, y_true in [(0.5, [1, 0]), (1.0, [1, 0]), (1.5, [0, 0])]:
clusterer = AgglomerativeClustering(distance_threshold=threshold,
linkage=linkage)
y_pred = clusterer.fit_predict(X)
assert_equal(1, adjusted_rand_score(y_true, y_pred))
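The boundary behaviour this test exercises can be run directly. A minimal sketch, assuming merges at distances greater than or equal to the threshold are cut (so a threshold equal to the single pairwise distance of 1.0 still yields two clusters):

```python
from sklearn.cluster import AgglomerativeClustering

X = [[0], [1]]  # exactly one possible merge, at distance 1.0
for threshold, expected in [(1.0, 2), (1.5, 1)]:
    model = AgglomerativeClustering(n_clusters=None,
                                    distance_threshold=threshold,
                                    linkage="complete").fit(X)
    # merges with distance >= threshold are undone, hence the >= boundary
    assert model.n_clusters_ == expected
```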


def test_none_dis_threshold_n_clust():
X = [[0], [1]]
with pytest.raises(ValueError, match="cannot be both None"):
AgglomerativeClustering(n_clusters=None,
distance_threshold=None).fit(X)


def test_n_components_deprecation():
# Test that a Deprecation warning is thrown when n_components_
# attribute is accessed