New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicated input points silently create duplicated clusters in KMeans #10059

Closed
ilectra opened this Issue Nov 2, 2017 · 4 comments

Comments

Projects
None yet
3 participants
@ilectra

ilectra commented Nov 2, 2017

Description

When there are duplicated input points to Kmeans resulting to number of unique points < number of requested clusters, there is no error thrown. Instead, clustering continues to (seemingly) produce the number of clusters requested, but some of them are exactly the same, so the cluster labels produced for the input points do not go all the way to number of requested clusters.

Steps/Code to Reproduce

from sklearn.cluster import KMeans
import numpy as np

# some input points here are identical, so that n_total=17, n_unique=9
x2d = np.array([(1086, 348), (1087, 347), (1190, 244), (1190, 244), (1086, 348), (1185, 249), (1193, 241), (1185, 249), (1087, 347), (1188, 247), (1187, 233), (26, 111), (26, 111), (26, 110), (26, 110), (26, 110), (26, 110)])
kmeans = KMeans(n_clusters=10) # n_clusters > n_unique
c_labels = kmeans.fit_predict(x2d)
c_centers = kmeans.cluster_centers_

Expected Results

Either an error thrown, or the cluster labels produced should match the unique clusters only (i.e. no identical cluster centres)

Actual Results

>>> c_labels  # note there's no entry for cluster 9
array([7, 2, 6, 6, 7, 5, 4, 5, 2, 1, 3, 8, 8, 0, 0, 0, 0], dtype=int32)
>>> c_centers # two of these 10 clusters have identical centers, so only 9 of them are unique
array([[   26.,   110.],
       [ 1188.,   247.],
       [ 1087.,   347.],
       [ 1187.,   233.],
       [ 1193.,   241.],
       [ 1185.,   249.],
       [ 1190.,   244.],
       [ 1086.,   348.],
       [   26.,   111.],
       [   26.,   110.]]) 

Versions

Darwin-16.7.0-x86_64-i386-64bit
Python 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:04:09)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.13.1
SciPy 0.19.1
Scikit-Learn 0.18.2
@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Nov 2, 2017

Member
Member

jnothman commented Nov 2, 2017

@christianbraune79

This comment has been minimized.

Show comment
Hide comment
@christianbraune79

christianbraune79 Nov 3, 2017

Contributor

Though not nice, this is actually, what MacQueen wrote in his paper and thus it could be considered expected behaviour.

(MacQueen, 1967, Some Methods for Classification and Analysis of Multivariate Observations, https://projecteuclid.org/download/pdf_1/euclid.bsmsp/1200512992, p283)

The set S_i(x) contains the points in E_N nearest to x_i, with tied points being assigned
arbitrarily to the set of lower index. Note that with this convention concerning
tied points, if x_i = x_j and i < j then S_j(x) = 0.

0 here means the empty set. x is the complete data set and S_j(x) is the j-th cluster generated for x.

Contributor

christianbraune79 commented Nov 3, 2017

Though not nice, this is actually, what MacQueen wrote in his paper and thus it could be considered expected behaviour.

(MacQueen, 1967, Some Methods for Classification and Analysis of Multivariate Observations, https://projecteuclid.org/download/pdf_1/euclid.bsmsp/1200512992, p283)

The set S_i(x) contains the points in E_N nearest to x_i, with tied points being assigned
arbitrarily to the set of lower index. Note that with this convention concerning
tied points, if x_i = x_j and i < j then S_j(x) = 0.

0 here means the empty set. x is the complete data set and S_j(x) is the j-th cluster generated for x.

@ilectra

This comment has been minimized.

Show comment
Hide comment
@ilectra

ilectra Nov 3, 2017

@christianbraune79 If S_j(x) = 0, shouldn't c_centers reflect this fact by being resized to the valid/unique only clusters? And then of course x_i and x_j would belong to the same cluster.

In any case, a warning issued by the code, and some discussion in the documentation would help.

ilectra commented Nov 3, 2017

@christianbraune79 If S_j(x) = 0, shouldn't c_centers reflect this fact by being resized to the valid/unique only clusters? And then of course x_i and x_j would belong to the same cluster.

In any case, a warning issued by the code, and some discussion in the documentation would help.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Nov 4, 2017

Member
Member

jnothman commented Nov 4, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment