New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2] Fix K Means init center bug - Included test case #7872

Merged
merged 13 commits into from Nov 30, 2016

Conversation

Projects
None yet
6 participants
@jkarno
Contributor

jkarno commented Nov 14, 2016

Reference Issue

Fixes #6740 and builds upon #6741 with additional test case

What does this implement/fix? Explain your changes.

This takes the previous PR and adds the test case described by the user. It also resolves conflicts with the master branch.

Any other comments?

I added the test case described by the previous user. Please let me know if there are other necessary test cases to be handled.

Also, apologies for the slow update, I was traveling throughout this week.

tomtung and others added some commits May 2, 2016

K-Means: Subtract X_means from initial centroids iff it's also subtra…
…cted from X

The bug happens when X is sparse and initial cluster centroids are
given. In this case the means of each of X's columns are computed and
subtracted from init for no reason.

To reproduce:

   import numpy as np
   import scipy
   from sklearn.cluster import KMeans
   from sklearn import datasets

   iris = datasets.load_iris()
   X = iris.data

   '''Get a local optimum'''
   centers = KMeans(n_clusters=3).fit(X).cluster_centers_

   '''Fit starting from a local optimum shouldn't change the solution'''
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X).cluster_centers_
   )

   '''The same should be true when X is sparse, but wasn't before the bug fix'''
   X_sparse = scipy.sparse.csr_matrix(X)
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X_sparse).cluster_centers_
   )
@agramfort

This comment has been minimized.

Show comment
Hide comment
@agramfort

agramfort Nov 14, 2016

Member

@jkarno can you see why travis is not happy?

Member

agramfort commented Nov 14, 2016

@jkarno can you see why travis is not happy?

@jkarno

This comment has been minimized.

Show comment
Hide comment
@jkarno

jkarno Nov 14, 2016

Contributor

@agramfort Looks like there was one line too long failing a pyflakes test, so I fixed that. The other failure seems like it was Travis hanging on downloading a certain package. It's passing now on a rebuild.

Contributor

jkarno commented Nov 14, 2016

@agramfort Looks like there was one line too long failing a pyflakes test, so I fixed that. The other failure seems like it was Travis hanging on downloading a certain package. It's passing now on a rebuild.

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Nov 16, 2016

Member

The appveyor failure is unrelated

Member

amueller commented Nov 16, 2016

The appveyor failure is unrelated

Show outdated Hide outdated sklearn/cluster/k_means_.py
if hasattr(init, '__array__'):
init = check_array(init, dtype=X.dtype.type, copy=True)
_validate_center_shape(X, n_clusters, init)
if hasattr(init, '__array__'):

This comment has been minimized.

@amueller

amueller Nov 16, 2016

Member

I don't think this is correct. Even if X is sparse, we can still pass explicit initial centers.

@amueller

amueller Nov 16, 2016

Member

I don't think this is correct. Even if X is sparse, we can still pass explicit initial centers.

np.testing.assert_allclose(
centers,
KMeans(n_clusters=3,
init=centers,

This comment has been minimized.

@amueller

amueller Nov 16, 2016

Member

This doesn't call validate_centers, right?

@amueller

amueller Nov 16, 2016

Member

This doesn't call validate_centers, right?

This comment has been minimized.

@amueller

amueller Nov 16, 2016

Member

maybe add a test that an error is raised if init has the wrong shape (say 4 custers)

@amueller

amueller Nov 16, 2016

Member

maybe add a test that an error is raised if init has the wrong shape (say 4 custers)

Show outdated Hide outdated sklearn/cluster/k_means_.py
'performing only one init in k-means instead of n_init=%d'
% n_init, RuntimeWarning, stacklevel=2)
n_init = 1
init -= X_mean

This comment has been minimized.

@amueller

amueller Nov 16, 2016

Member

only make this conditional on whether X i not sparse.

@amueller

amueller Nov 16, 2016

Member

only make this conditional on whether X i not sparse.

This comment has been minimized.

@jkarno

jkarno Nov 18, 2016

Contributor

Sorry, could you clarify this again? Are you saying that it doesn't need to check if it's an array in order to subtract the mean? Or are you saying that this is the only line that should stay under the "is not sparse" check, as well as the array check?

Because I'm not sure how it should then handle the other cases of init being a string or a callable.

@jkarno

jkarno Nov 18, 2016

Contributor

Sorry, could you clarify this again? Are you saying that it doesn't need to check if it's an array in order to subtract the mean? Or are you saying that this is the only line that should stay under the "is not sparse" check, as well as the array check?

Because I'm not sure how it should then handle the other cases of init being a string or a callable.

Show outdated Hide outdated sklearn/cluster/tests/test_k_means.py
# Test that a ValueError is raised for validate_center_shape
classifier = KMeans(n_clusters=3, init=centers, n_init=1)
assert_raises(ValueError, classifier.fit, X)

This comment has been minimized.

@amueller

amueller Nov 22, 2016

Member

you could use assert_raise_message to be more specific.

@amueller

amueller Nov 22, 2016

Member

you could use assert_raise_message to be more specific.

@amueller amueller changed the title from [MRG] Fix K Means init center bug - Included test case to [MRG + 1] Fix K Means init center bug - Included test case Nov 22, 2016

@amueller amueller modified the milestones: 0.18.1, 0.19 Nov 22, 2016

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Nov 22, 2016

Member

LGTM

Member

amueller commented Nov 22, 2016

LGTM

@jnothman

Please address @amueller's comment regarding a stronger assertion. Otherwise LGTM. Please add an entry to what's new.

@jnothman jnothman changed the title from [MRG + 1] Fix K Means init center bug - Included test case to [MRG+2] Fix K Means init center bug - Included test case Nov 23, 2016

Show outdated Hide outdated doc/whats_new.rst
@@ -88,6 +88,10 @@ Bug fixes
- Tree splitting criterion classes' cloning/pickling is now memory safe
:issue:`7680` by `Ibraim Ganiev`_.
- Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a sparse array
X and initial centroids, where X's means were unnecessarily being subtracted from

This comment has been minimized.

@jnothman

jnothman Nov 25, 2016

Member

Please keep line length < 80 chars where possible

@jnothman

jnothman Nov 25, 2016

Member

Please keep line length < 80 chars where possible

@jnothman

sorry to be nitpicky, but if I get you to fix it up once you hopefully won't forget it next time

Show outdated Hide outdated doc/whats_new.rst
@@ -88,6 +88,10 @@ Bug fixes
- Tree splitting criterion classes' cloning/pickling is now memory safe
:issue:`7680` by `Ibraim Ganiev`_.
- Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a
sparse array X and initial centroids, where X's means were unnecessarily
being subtracted from the centroids.

This comment has been minimized.

@jnothman

jnothman Nov 27, 2016

Member

attribution? issue number?

@jnothman

jnothman Nov 27, 2016

Member

attribution? issue number?

jkarno added some commits Nov 28, 2016

@jkarno

This comment has been minimized.

Show comment
Hide comment
@jkarno

jkarno Nov 28, 2016

Contributor

Is this what you wanted for the attribution? I don't have a link associated with my name so I didn't include the link markdown to my name.

Contributor

jkarno commented Nov 28, 2016

Is this what you wanted for the attribution? I don't have a link associated with my name so I didn't include the link markdown to my name.

@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Nov 28, 2016

Member

Seems like there is a genuine error on AppVeyor on Python 2.7 64bit on Windows.

======================================================================
FAIL: sklearn.cluster.tests.test_k_means.test_sparse_validate_centers
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27-x64\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\Python27-x64\lib\site-packages\sklearn\cluster\tests\test_k_means.py", line 870, in test_sparse_validate_centers
    assert_raise_message(ValueError, msg, classifier.fit, X)
  File "C:\Python27-x64\lib\site-packages\sklearn\utils\testing.py", line 368, in assert_raise_message
    (message, error_message))
AssertionError: Error message does not include the expected string: 'The shape of the initial centers ((4, 4)) does not match the number of clusters 3'. Observed error message: 'The shape of the initial centers ((4L, 4L)) does not match the number of clusters 3'

Is this what you wanted for the attribution? I don't have a link associated with my name so I didn't include the link markdown to my name.

By default we use links to github, i.e. https://github.com/jkarno in your case.

Member

lesteve commented Nov 28, 2016

Seems like there is a genuine error on AppVeyor on Python 2.7 64bit on Windows.

======================================================================
FAIL: sklearn.cluster.tests.test_k_means.test_sparse_validate_centers
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27-x64\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\Python27-x64\lib\site-packages\sklearn\cluster\tests\test_k_means.py", line 870, in test_sparse_validate_centers
    assert_raise_message(ValueError, msg, classifier.fit, X)
  File "C:\Python27-x64\lib\site-packages\sklearn\utils\testing.py", line 368, in assert_raise_message
    (message, error_message))
AssertionError: Error message does not include the expected string: 'The shape of the initial centers ((4, 4)) does not match the number of clusters 3'. Observed error message: 'The shape of the initial centers ((4L, 4L)) does not match the number of clusters 3'

Is this what you wanted for the attribution? I don't have a link associated with my name so I didn't include the link markdown to my name.

By default we use links to github, i.e. https://github.com/jkarno in your case.

Show outdated Hide outdated sklearn/cluster/tests/test_k_means.py
@@ -824,3 +824,47 @@ def test_KMeans_init_centers():
km = KMeans(init=init_centers_test, n_clusters=3, n_init=1)
km.fit(X_test)
assert_equal(False, np.may_share_memory(km.cluster_centers_, init_centers))
def test_sparse_KMeans_init_centers():

This comment has been minimized.

@lesteve

lesteve Nov 28, 2016

Member

Wow flake8 doesn't enforce naming conventions, I did not know that.

@lesteve

lesteve Nov 28, 2016

Member

Wow flake8 doesn't enforce naming conventions, I did not know that.

@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Nov 28, 2016

Member

Seems like there is a genuine error on AppVeyor on Python 2.7 64bit on Windows.

Actually I did not read the error message very well, the problem is that the message has additional L ((4L, 4L)) instead of ((4, 4)). It's easy to fix by using assert_raises_regex and using something like this inside the regex r'\(\(4L?, 4L?\)\)'.

Member

lesteve commented Nov 28, 2016

Seems like there is a genuine error on AppVeyor on Python 2.7 64bit on Windows.

Actually I did not read the error message very well, the problem is that the message has additional L ((4L, 4L)) instead of ((4, 4)). It's easy to fix by using assert_raises_regex and using something like this inside the regex r'\(\(4L?, 4L?\)\)'.

Show outdated Hide outdated doc/whats_new.rst
@@ -88,6 +88,10 @@ Bug fixes
- Tree splitting criterion classes' cloning/pickling is now memory safe
:issue:`7680` by `Ibraim Ganiev`_.
- Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a
sparse array X and initial centroids, where X's means were unnecessarily
being subtracted from the centroids. :issue:`7872` by Josh Karnofsky

This comment has been minimized.

@jnothman

jnothman Nov 28, 2016

Member

you can use

by :user:`Josh Karnofsky <jkarno>`
@jnothman

jnothman Nov 28, 2016

Member

you can use

by :user:`Josh Karnofsky <jkarno>`

jkarno added some commits Nov 29, 2016

@jnothman jnothman merged commit 89b2e45 into scikit-learn:master Nov 30, 2016

2 of 3 checks passed

continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Nov 30, 2016

Member

Thanks @jkarno!

Member

jnothman commented Nov 30, 2016

Thanks @jkarno!

adis300 added a commit to adis300/scikit-learn that referenced this pull request Dec 13, 2016

[MRG+2] Fix K Means init center bug (#7872)
K-Means: Subtract X_means from initial centroids iff it's also subtracted from X

The bug happens when X is sparse and initial cluster centroids are
given. In this case the means of each of X's columns are computed and
subtracted from init for no reason.

To reproduce:

   import numpy as np
   import scipy
   from sklearn.cluster import KMeans
   from sklearn import datasets

   iris = datasets.load_iris()
   X = iris.data

   '''Get a local optimum'''
   centers = KMeans(n_clusters=3).fit(X).cluster_centers_

   '''Fit starting from a local optimum shouldn't change the solution'''
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X).cluster_centers_
   )

   '''The same should be true when X is sparse, but wasn't before the bug fix'''
   X_sparse = scipy.sparse.csr_matrix(X)
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X_sparse).cluster_centers_
   )

sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017

[MRG+2] Fix K Means init center bug (#7872)
K-Means: Subtract X_means from initial centroids iff it's also subtracted from X

The bug happens when X is sparse and initial cluster centroids are
given. In this case the means of each of X's columns are computed and
subtracted from init for no reason.

To reproduce:

   import numpy as np
   import scipy
   from sklearn.cluster import KMeans
   from sklearn import datasets

   iris = datasets.load_iris()
   X = iris.data

   '''Get a local optimum'''
   centers = KMeans(n_clusters=3).fit(X).cluster_centers_

   '''Fit starting from a local optimum shouldn't change the solution'''
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X).cluster_centers_
   )

   '''The same should be true when X is sparse, but wasn't before the bug fix'''
   X_sparse = scipy.sparse.csr_matrix(X)
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X_sparse).cluster_centers_
   )

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+2] Fix K Means init center bug (#7872)
K-Means: Subtract X_means from initial centroids iff it's also subtracted from X

The bug happens when X is sparse and initial cluster centroids are
given. In this case the means of each of X's columns are computed and
subtracted from init for no reason.

To reproduce:

   import numpy as np
   import scipy
   from sklearn.cluster import KMeans
   from sklearn import datasets

   iris = datasets.load_iris()
   X = iris.data

   '''Get a local optimum'''
   centers = KMeans(n_clusters=3).fit(X).cluster_centers_

   '''Fit starting from a local optimum shouldn't change the solution'''
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X).cluster_centers_
   )

   '''The same should be true when X is sparse, but wasn't before the bug fix'''
   X_sparse = scipy.sparse.csr_matrix(X)
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X_sparse).cluster_centers_
   )

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG+2] Fix K Means init center bug (#7872)
K-Means: Subtract X_means from initial centroids iff it's also subtracted from X

The bug happens when X is sparse and initial cluster centroids are
given. In this case the means of each of X's columns are computed and
subtracted from init for no reason.

To reproduce:

   import numpy as np
   import scipy
   from sklearn.cluster import KMeans
   from sklearn import datasets

   iris = datasets.load_iris()
   X = iris.data

   '''Get a local optimum'''
   centers = KMeans(n_clusters=3).fit(X).cluster_centers_

   '''Fit starting from a local optimum shouldn't change the solution'''
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X).cluster_centers_
   )

   '''The same should be true when X is sparse, but wasn't before the bug fix'''
   X_sparse = scipy.sparse.csr_matrix(X)
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X_sparse).cluster_centers_
   )

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+2] Fix K Means init center bug (#7872)
K-Means: Subtract X_means from initial centroids iff it's also subtracted from X

The bug happens when X is sparse and initial cluster centroids are
given. In this case the means of each of X's columns are computed and
subtracted from init for no reason.

To reproduce:

   import numpy as np
   import scipy
   from sklearn.cluster import KMeans
   from sklearn import datasets

   iris = datasets.load_iris()
   X = iris.data

   '''Get a local optimum'''
   centers = KMeans(n_clusters=3).fit(X).cluster_centers_

   '''Fit starting from a local optimum shouldn't change the solution'''
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X).cluster_centers_
   )

   '''The same should be true when X is sparse, but wasn't before the bug fix'''
   X_sparse = scipy.sparse.csr_matrix(X)
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X_sparse).cluster_centers_
   )

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+2] Fix K Means init center bug (#7872)
K-Means: Subtract X_means from initial centroids iff it's also subtracted from X

The bug happens when X is sparse and initial cluster centroids are
given. In this case the means of each of X's columns are computed and
subtracted from init for no reason.

To reproduce:

   import numpy as np
   import scipy
   from sklearn.cluster import KMeans
   from sklearn import datasets

   iris = datasets.load_iris()
   X = iris.data

   '''Get a local optimum'''
   centers = KMeans(n_clusters=3).fit(X).cluster_centers_

   '''Fit starting from a local optimum shouldn't change the solution'''
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X).cluster_centers_
   )

   '''The same should be true when X is sparse, but wasn't before the bug fix'''
   X_sparse = scipy.sparse.csr_matrix(X)
   np.testing.assert_allclose(
      centers,
      KMeans(n_clusters=3, init=centers, n_init=1).fit(X_sparse).cluster_centers_
   )
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment