[MRG+2] Fix K Means init center bug - Included test case #7872

Merged
merged 13 commits into from Nov 30, 2016

jkarno
Contributor

@jkarno jkarno commented Nov 14, 2016

Reference Issue

Fixes #6740 and builds upon #6741 with additional test case

What does this implement/fix? Explain your changes.

This takes the previous PR and adds the test case described by the user. It also resolves conflicts with the master branch.

Any other comments?

I added the test case described by the previous user. Please let me know if there are other necessary test cases to be handled.

Also, apologies for the slow update, I was traveling throughout this week.

tomtung and others added 4 commits May 1, 2016 19:09
…cted from X

The bug happens when X is sparse and initial cluster centroids are
given. In this case the means of each of X's columns are computed and
subtracted from init for no reason.

To reproduce:

   import numpy as np
   import scipy.sparse
   from sklearn.cluster import KMeans
   from sklearn import datasets

   iris = datasets.load_iris()
   X = iris.data

   # Get a local optimum
   centers = KMeans(n_clusters=3).fit(X).cluster_centers_

   # Fitting from a local optimum shouldn't change the solution
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X).cluster_centers_
   )

   # The same should hold when X is sparse, but didn't before the bug fix
   X_sparse = scipy.sparse.csr_matrix(X)
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X_sparse).cluster_centers_
   )
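
Why the extra subtraction matters can be seen without scikit-learn. The following is a minimal pure-Python sketch; the toy data, the `sq_dist`, `nearest` and `shift` helpers, and all variable names are illustrative assumptions, not code from this PR. Shifting both the data and the centers by the same mean leaves every point's nearest center unchanged, while shifting only the centers (which is what the commit message describes for sparse X) can change the assignment.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to point."""
    dists = [sq_dist(point, c) for c in centers]
    return dists.index(min(dists))


def shift(rows, offset):
    """Subtract offset from every row."""
    return [[v - o for v, o in zip(row, offset)] for row in rows]


# Toy data: two well-separated groups and one center per group.
X = [[1.0, 1.0], [1.5, 0.5], [9.0, 9.0], [8.5, 9.5]]
centers = [[1.25, 0.75], [8.75, 9.25]]
X_mean = [sum(col) / len(X) for col in zip(*X)]

# Reference assignment with nothing shifted.
labels = [nearest(p, centers) for p in X]

# Dense path: X and the centers are shifted together, distances are preserved.
labels_both = [nearest(p, shift(centers, X_mean)) for p in shift(X, X_mean)]

# Buggy sparse path: only the centers are shifted, X is left alone.
labels_buggy = [nearest(p, shift(centers, X_mean)) for p in X]

print(labels, labels_both, labels_buggy)
```

With this toy data the buggy path assigns every point to the second center, so the fit no longer starts from the centers that were passed in.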
@agramfort
Member

@jkarno can you see why travis is not happy?

@jkarno
Contributor Author

jkarno commented Nov 14, 2016

@agramfort Looks like there was one line too long failing a pyflakes test, so I fixed that. The other failure seems like it was Travis hanging on downloading a certain package. It's passing now on a rebuild.

@amueller
Member

The appveyor failure is unrelated

if hasattr(init, '__array__'):
init = check_array(init, dtype=X.dtype.type, copy=True)
_validate_center_shape(X, n_clusters, init)
if hasattr(init, '__array__'):
Member

I don't think this is correct. Even if X is sparse, we can still pass explicit initial centers.

np.testing.assert_allclose(
    centers,
    KMeans(n_clusters=3,
           init=centers,
Member

This doesn't call validate_centers, right?

Member

maybe add a test that an error is raised if init has the wrong shape (say 4 clusters)

'performing only one init in k-means instead of n_init=%d'
% n_init, RuntimeWarning, stacklevel=2)
n_init = 1
init -= X_mean
Member

only make this conditional on whether X is not sparse.

Contributor Author

@jkarno jkarno Nov 18, 2016

Sorry, could you clarify this again? Are you saying that it doesn't need to check if it's an array in order to subtract the mean? Or are you saying that this is the only line that should stay under the "is not sparse" check, as well as the array check?

Because I'm not sure how it should then handle the other cases of init being a string or a callable.
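
One possible reading of the suggestion, sketched in plain Python rather than taken from the PR diff: validate and copy an explicit `init` regardless of sparsity, but subtract the mean from it only in the branch where X itself is centered. The `center_inputs` helper, the `x_is_sparse` flag, and the use of `isinstance(init, list)` in place of an array check are all assumptions made for illustration. A string or callable `init` simply passes through, which is one way to handle the cases asked about above.

```python
def center_inputs(X_rows, init, x_is_sparse):
    """Center dense X and, only in that case, an explicit init as well.

    Hypothetical helper: `init` counts as explicit centers when it is a
    list; strings (e.g. 'k-means++') and callables pass through untouched.
    """
    init_is_array = isinstance(init, list)
    if init_is_array:
        init = [list(center) for center in init]  # work on a copy

    if x_is_sparse:
        # Sparse X is never centered, so init must not be shifted either.
        return X_rows, init

    X_mean = [sum(col) / len(X_rows) for col in zip(*X_rows)]
    X_centered = [[v - m for v, m in zip(row, X_mean)] for row in X_rows]
    if init_is_array:
        init = [[v - m for v, m in zip(center, X_mean)] for center in init]
    return X_centered, init


X = [[0.0, 2.0], [4.0, 6.0]]
centers = [[1.0, 3.0], [3.0, 5.0]]

_, init_dense = center_inputs(X, centers, x_is_sparse=False)
_, init_sparse = center_inputs(X, centers, x_is_sparse=True)
_, init_name = center_inputs(X, "k-means++", x_is_sparse=False)
print(init_dense, init_sparse, init_name)
```

The point of the arrangement is that the mean subtraction for `init` and for X live under the same condition, so they cannot drift apart again.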


# Test that a ValueError is raised for validate_center_shape
classifier = KMeans(n_clusters=3, init=centers, n_init=1)
assert_raises(ValueError, classifier.fit, X)
Member

you could use assert_raise_message to be more specific.
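
To illustrate why checking the message is a stronger assertion than checking only the exception type, here is a small self-contained sketch. Both `validate_center_shape` and `raises_with_message` are hypothetical stand-ins written for this example; they are not scikit-learn's implementations, and only the message wording is taken from the CI log quoted later in this thread.

```python
def validate_center_shape(n_clusters, centers):
    """Hypothetical stand-in for a center-shape check (not sklearn's code)."""
    shape = (len(centers), len(centers[0]))
    if shape[0] != n_clusters:
        raise ValueError(
            "The shape of the initial centers (%s) does not match "
            "the number of clusters %i" % (shape, n_clusters))


def raises_with_message(exc_type, fragment, func, *args):
    """Return True only if func(*args) raises exc_type mentioning fragment."""
    try:
        func(*args)
    except exc_type as exc:
        return fragment in str(exc)
    return False


four_centers = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]
expected = "does not match the number of clusters 3"

# The shape check raises the ValueError we are looking for.
specific = raises_with_message(ValueError, expected,
                               validate_center_shape, 3, four_centers)

# An unrelated ValueError would satisfy a type-only check, but not this one.
unrelated = raises_with_message(ValueError, expected, int, "not a number")
print(specific, unrelated)
```

A type-only assertion would accept both calls; the message check accepts only the first.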

@amueller amueller changed the title from "[MRG] Fix K Means init center bug - Included test case" to "[MRG + 1] Fix K Means init center bug - Included test case" Nov 22, 2016
@amueller amueller modified the milestones: 0.18.1, 0.19 Nov 22, 2016
@amueller
Member

LGTM

Member

@jnothman jnothman left a comment

Please address @amueller's comment regarding a stronger assertion. Otherwise LGTM. Please add an entry to what's new.

@jnothman jnothman changed the title from "[MRG + 1] Fix K Means init center bug - Included test case" to "[MRG+2] Fix K Means init center bug - Included test case" Nov 23, 2016
@@ -88,6 +88,10 @@ Bug fixes
- Tree splitting criterion classes' cloning/pickling is now memory safe
:issue:`7680` by `Ibraim Ganiev`_.

- Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a sparse array
X and initial centroids, where X's means were unnecessarily being subtracted from
Member

Please keep line length < 80 chars where possible

Member

@jnothman jnothman left a comment

sorry to be nitpicky, but if I get you to fix it up once you hopefully won't forget it next time

@@ -88,6 +88,10 @@ Bug fixes
- Tree splitting criterion classes' cloning/pickling is now memory safe
:issue:`7680` by `Ibraim Ganiev`_.

- Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a
sparse array X and initial centroids, where X's means were unnecessarily
being subtracted from the centroids.
Member

attribution? issue number?

@jkarno
Contributor Author

jkarno commented Nov 28, 2016

Is this what you wanted for the attribution? I don't have a link associated with my name so I didn't include the link markdown to my name.

@lesteve
Member

lesteve commented Nov 28, 2016

Seems like there is a genuine error on AppVeyor on Python 2.7 64bit on Windows.

======================================================================
FAIL: sklearn.cluster.tests.test_k_means.test_sparse_validate_centers
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27-x64\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\Python27-x64\lib\site-packages\sklearn\cluster\tests\test_k_means.py", line 870, in test_sparse_validate_centers
    assert_raise_message(ValueError, msg, classifier.fit, X)
  File "C:\Python27-x64\lib\site-packages\sklearn\utils\testing.py", line 368, in assert_raise_message
    (message, error_message))
AssertionError: Error message does not include the expected string: 'The shape of the initial centers ((4, 4)) does not match the number of clusters 3'. Observed error message: 'The shape of the initial centers ((4L, 4L)) does not match the number of clusters 3'

> Is this what you wanted for the attribution? I don't have a link associated with my name so I didn't include the link markdown to my name.

By default we use links to github, i.e. https://github.com/jkarno in your case.

@@ -824,3 +824,47 @@ def test_KMeans_init_centers():
km = KMeans(init=init_centers_test, n_clusters=3, n_init=1)
km.fit(X_test)
assert_equal(False, np.may_share_memory(km.cluster_centers_, init_centers))


def test_sparse_KMeans_init_centers():
Member

Wow flake8 doesn't enforce naming conventions, I did not know that.

@lesteve
Member

lesteve commented Nov 28, 2016

> Seems like there is a genuine error on AppVeyor on Python 2.7 64bit on Windows.

Actually I did not read the error message very well, the problem is that the message has additional L ((4L, 4L)) instead of ((4, 4)). It's easy to fix by using assert_raises_regex and using something like this inside the regex r'\(\(4L?, 4L?\)\)'.
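
The suggested pattern can be checked quickly with the standard `re` module. This sketch only exercises the regular expression against the two messages from the log above; the variable names are illustrative, and it does not call `assert_raises_regex` itself.

```python
import re

# Optional "L" suffix covers Python 2 long integers in the shape tuple.
pattern = re.compile(r"\(\(4L?, 4L?\)\)")

expected_msg = ("The shape of the initial centers ((4, 4)) does not match "
                "the number of clusters 3")
observed_msg = ("The shape of the initial centers ((4L, 4L)) does not match "
                "the number of clusters 3")

matches = [bool(pattern.search(msg)) for msg in (expected_msg, observed_msg)]
print(matches)
```

Both variants match, while a different shape such as `((3, 4))` does not.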

@@ -88,6 +88,10 @@ Bug fixes
- Tree splitting criterion classes' cloning/pickling is now memory safe
:issue:`7680` by `Ibraim Ganiev`_.

- Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a
sparse array X and initial centroids, where X's means were unnecessarily
being subtracted from the centroids. :issue:`7872` by Josh Karnofsky
Member

you can use

by :user:`Josh Karnofsky <jkarno>`

@jnothman jnothman merged commit 89b2e45 into scikit-learn:master Nov 30, 2016
@jnothman
Member

Thanks @jkarno!

sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
K-Means: Subtract X_means from initial centroids iff it's also subtracted from X

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
Development

Successfully merging this pull request may close these issues.

K-Means: Should subtract X_means from initial centroids iff it's also subtracted from X
6 participants