
Duplicated input points silently create duplicated clusters in KMeans #10059

Closed
ilectra opened this issue Nov 2, 2017 · 4 comments

Comments


@ilectra ilectra commented Nov 2, 2017

Description

When duplicated input points to KMeans reduce the number of unique points below the number of requested clusters, no error is thrown. Instead, clustering continues and (seemingly) produces the requested number of clusters, but some of them are exactly the same, so the cluster labels assigned to the input points do not span the full range of requested clusters.

Steps/Code to Reproduce

from sklearn.cluster import KMeans
import numpy as np

# some input points here are identical, so that n_total=17, n_unique=9
x2d = np.array([(1086, 348), (1087, 347), (1190, 244), (1190, 244),
                (1086, 348), (1185, 249), (1193, 241), (1185, 249),
                (1087, 347), (1188, 247), (1187, 233), (26, 111),
                (26, 111), (26, 110), (26, 110), (26, 110), (26, 110)])
kmeans = KMeans(n_clusters=10) # n_clusters > n_unique
c_labels = kmeans.fit_predict(x2d)
c_centers = kmeans.cluster_centers_

Expected Results

Either an error should be thrown, or the cluster labels produced should match only the unique clusters (i.e. no identical cluster centres).

Actual Results

>>> c_labels  # note there's no entry for cluster 9
array([7, 2, 6, 6, 7, 5, 4, 5, 2, 1, 3, 8, 8, 0, 0, 0, 0], dtype=int32)
>>> c_centers # two of these 10 clusters have identical centers, so only 9 of them are unique
array([[   26.,   110.],
       [ 1188.,   247.],
       [ 1087.,   347.],
       [ 1187.,   233.],
       [ 1193.,   241.],
       [ 1185.,   249.],
       [ 1190.,   244.],
       [ 1086.,   348.],
       [   26.,   111.],
       [   26.,   110.]]) 
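As a quick check, the duplication can be detected after fitting by comparing the labels actually assigned against the requested number of clusters (a sketch, not an existing scikit-learn API; `n_init` is pinned only to keep the example stable across versions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Same data as above: 17 points, only 9 of them unique.
x2d = np.array([(1086, 348), (1087, 347), (1190, 244), (1190, 244),
                (1086, 348), (1185, 249), (1193, 241), (1185, 249),
                (1087, 347), (1188, 247), (1187, 233), (26, 111),
                (26, 111), (26, 110), (26, 110), (26, 110), (26, 110)])
kmeans = KMeans(n_clusters=10, n_init=10).fit(x2d)

# Identical input points always receive the same label, so at most
# 9 of the 10 requested labels can ever appear in labels_.
used_labels = np.unique(kmeans.labels_)
n_distinct_centers = len(np.unique(kmeans.cluster_centers_, axis=0))
if len(used_labels) < kmeans.n_clusters:
    print(f"only {len(used_labels)} of {kmeans.n_clusters} labels used; "
          f"{n_distinct_centers} distinct centers")
```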

Versions

Darwin-16.7.0-x86_64-i386-64bit
Python 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:04:09)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.13.1
SciPy 0.19.1
Scikit-Learn 0.18.2
@jnothman jnothman commented Nov 2, 2017

@christianbraune79 christianbraune79 commented Nov 3, 2017

Though not nice, this is actually what MacQueen wrote in his paper, and thus it could be considered expected behaviour.

(MacQueen, 1967, Some Methods for Classification and Analysis of Multivariate Observations, https://projecteuclid.org/download/pdf_1/euclid.bsmsp/1200512992, p283)

The set S_i(x) contains the points in E_N nearest to x_i, with tied points being assigned
arbitrarily to the set of lower index. Note that with this convention concerning
tied points, if x_i = x_j and i < j then S_j(x) = 0.

0 here means the empty set. x is the complete data set and S_j(x) is the j-th cluster generated for x.
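The convention can be illustrated with a small NumPy sketch (an illustration only, not scikit-learn's internals): argmin-style assignment breaks ties toward the lower index, so a duplicated center ends up with an empty cluster, exactly MacQueen's S_j(x) = 0.

```python
import numpy as np

# Centers 0 and 1 are identical, mimicking x_i = x_j with i < j.
centers = np.array([[0.0, 0.0], [0.0, 0.0], [5.0, 5.0]])
points = np.array([[0.0, 0.0], [5.0, 5.0]])

# Squared distances from each point to each center.
d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
labels = d.argmin(axis=1)  # argmin breaks ties toward the lower index

# Cluster 1 receives no points: its set S_1 is empty.
```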

@ilectra ilectra commented Nov 3, 2017

@christianbraune79 If S_j(x) = 0, shouldn't c_centers reflect this fact by being resized to contain only the valid/unique clusters? And then of course x_i and x_j would belong to the same cluster.

In any case, a warning issued by the code, and some discussion in the documentation would help.
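Such a warning could look something like the following wrapper (a hypothetical sketch: `kmeans_with_duplicate_check` and its message are made-up names, not a scikit-learn API):

```python
import warnings

import numpy as np
from sklearn.cluster import KMeans

def kmeans_with_duplicate_check(X, n_clusters, **kwargs):
    """Fit KMeans, warning when there are fewer unique points than clusters."""
    n_unique = len(np.unique(np.asarray(X), axis=0))
    if n_unique < n_clusters:
        warnings.warn(
            f"Only {n_unique} unique input points for {n_clusters} "
            "requested clusters; some cluster centers will coincide.")
    return KMeans(n_clusters=n_clusters, **kwargs).fit(X)

# Duplicated points: 6 samples but only 3 unique locations.
X = np.array([[0, 0], [0, 0], [1, 1], [1, 1], [2, 2], [2, 2]])
km = kmeans_with_duplicate_check(X, n_clusters=4, n_init=10)
```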

@jnothman jnothman commented Nov 4, 2017
