[MRG] Issue #2185: Fixed MinibatchKMeans bad center reallocation which caused ... #2355

Closed
wants to merge 5 commits

6 participants

@kpysniak

Issue #2185

This PR addresses a problem in the minibatch k-means center reassignment: centroids are randomly reassigned incorrectly. Currently we choose n centroids out of k samples with given probabilities, but we never check whether the same centroid is picked more than once. To correctly pick n unique labels out of k samples with the given probabilities, we need to repeat the following n times (a minimal sketch follows the list):
1) choose one sample according to its relative probability
2) add it to the set of chosen samples
3) make it unavailable for subsequent picks
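
For illustration, here is a minimal sketch of that loop in plain numpy; the name weighted_picks_without_replacement and its signature are made up for this example, not the code proposed in the PR:

    import numpy as np

    def weighted_picks_without_replacement(weights, labels, n_picks, random_state):
        # Pick n_picks distinct labels; each draw is weighted by the
        # remaining (not yet picked) weights.
        weights = np.asarray(weights, dtype=float).copy()
        labels = np.asarray(labels).copy()
        picks = []
        for _ in range(min(n_picks, len(labels))):
            # draw a point in [0, sum(weights)) and locate it on the cumsum
            r = random_state.rand() * weights.sum()
            idx = np.searchsorted(weights.cumsum(), r)
            picks.append(labels[idx])
            # drop the chosen entry so it cannot be drawn again
            weights = np.delete(weights, idx)
            labels = np.delete(labels, idx)
        return np.array(picks)

Because the chosen entry is removed before the next draw, each subsequent pick is implicitly renormalized over the remaining weights, which is what guarantees the picks are unique.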

Let me know what you think. If the approach is correct, I will add tests.

Thanks!

sklearn/cluster/k_means_.py
@@ -858,10 +858,16 @@ def _mini_batch_step(X, x_squared_norms, centers, counts,
# Flip the ordering of the distances.
distances -= distances.max()
distances *= -1
- rand_vals = random_state.rand(number_of_reassignments)
- rand_vals *= distances.sum()
- new_centers = np.searchsorted(distances.cumsum(),
- rand_vals)
+
+ labels = np.array(range(0, number_of_reassignments))
+
+ # picking number_of_reassingments centers
+ # with probability to their
@larsmans Owner

to their?

it should be: "with their relative probability"

@larsmans Owner

This still has to be fixed.

sklearn/cluster/k_means_.py
@@ -1270,3 +1276,27 @@ def partial_fit(self, X, y=None):
X, x_squared_norms, self.cluster_centers_)
return self
+
+
+def pick_unique_labels(relative_probabilties, labels, no_picks, random_state):
@ogrisel Owner
ogrisel added a note

This helper method should be made private with a leading _ in its name.

@ogrisel Owner
ogrisel added a note

Also there is a typo: relative_probabilties => relative_probabilities.

sklearn/cluster/k_means_.py
@@ -1270,3 +1276,27 @@ def partial_fit(self, X, y=None):
X, x_squared_norms, self.cluster_centers_)
return self
+
+
+def pick_unique_labels(relative_probabilties, labels, no_picks, random_state):
+ # array of labels randomly picked
+ picks = np.zeros(no_picks, dtype=np.int)
+
+ # making sure we do not exceed any array
+ iterations = min(len(relative_probabilties), len(labels))
+ iterations = min(iterations, no_picks)
+ for p in range(0, iterations):
+ # picking one value from set of elements
+ # with their relative probabilties
+ rand_val = random_state.rand()*relative_probabilties.sum()
@ogrisel Owner
ogrisel added a note

Please run the pep8 linter on this source file and fix any reported error.

sklearn/cluster/k_means_.py
@@ -1270,3 +1276,27 @@ def partial_fit(self, X, y=None):
X, x_squared_norms, self.cluster_centers_)
return self
+
+
+def pick_unique_labels(relative_probabilties, labels, no_picks, random_state):
+ # array of labels randomly picked
+ picks = np.zeros(no_picks, dtype=np.int)
+
+ # making sure we do not exceed any array
+ iterations = min(len(relative_probabilties), len(labels))
+ iterations = min(iterations, no_picks)
+ for p in range(0, iterations):
@ogrisel Owner
ogrisel added a note

range(iterations) is enough.

@ogrisel
Owner

Could you please add a unit test for the pick_unique_labels helper function that checks that labels with large relative probabilities tend to be selected more often than others? You can do that by calling pick_unique_labels with different fixed values of random_state (for instance all_random_states = [check_random_state(i) for i in range(100)]) and summing the number of times each label was picked.
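
As a rough illustration of what such a test could look like (a sketch only: the helper is passed in as an argument because its final name and signature were still changing at this point, and the assertion only compares the extreme counts rather than pinning exact values):

    import numpy as np
    from sklearn.utils import check_random_state

    def check_pick_bias(pick_unique_labels, n_seeds=100):
        # Labels with larger relative probabilities should be picked more
        # often when aggregating single picks over many seeded random states.
        relative_probabilities = np.arange(1.0, 11.0)
        labels = np.arange(10, 20)
        counts = np.zeros(len(labels), dtype=int)
        for seed in range(n_seeds):
            rng = check_random_state(seed)
            picked = pick_unique_labels(relative_probabilities, labels, 1, rng)
            counts[picked - labels[0]] += 1
        # the most heavily weighted label should come up more often than
        # the least weighted one
        assert counts[-1] > counts[0]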

@kpysniak

I made the changes; is this what you meant?

sklearn/cluster/tests/test_k_means.py
@@ -622,3 +624,33 @@ def test_k_means_function():
# to many clusters desired
assert_raises(ValueError, k_means, X, n_clusters=X.shape[0] + 1)
+
+
+def test_pick_unique_labels():
+ all_random_states = np.array([check_random_state(i) for i in range(500)])
+ relative_probabilities = np.array([(i+1)*(i+2) for i in range(10)])
@ogrisel Owner
ogrisel added a note

Can you please run pep8 on this file and fix the errors?

Why did you use something as complicated as (i+1)*(i+2)? Why not just np.arange(10) for instance?

@ogrisel
Owner

Looks good, +1 for merge once pep8 formatting is fixed :)

@kpysniak

I ran pep8 and pyflakes on the current version, but they didn't report any errors. Am I missing something? :)

@ogrisel
Owner

Indeed, it used to be the case that (i+1)*(i+2) was reported as an error by the pep8 tool and (i + 1) * (i + 2) was the correct notation. Apparently newer versions of the pep8 style linter are more tolerant: they just check for consistency.

I still find the variant with spaces (i + 1) * (i + 2) more readable.

sklearn/cluster/k_means_.py
@@ -553,6 +553,31 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
return centers
+def _pick_unique_labels(relative_probabilities, labels,
+ no_picks, random_state):
+ # array of labels randomly picked
+ picks = np.zeros(no_picks, dtype=np.int)
+
+ # making sure we do not exceed any array
+ iterations = min(len(relative_probabilities), len(labels))
+ iterations = min(iterations, no_picks)
@GaelVaroquaux Owner

'min' can take multiple arguments. I'd prefer the above 2 lines to be one line:

    iterations = min(len(relative_probabilities), len(labels), no_picks)
@kpysniak

Thanks for the comments, I hope it looks good now

@ogrisel
Owner

I think this is ready for merge. I checked the face patches example and this fixes the repeated patches issue.

@ogrisel
Owner

+1 for merging, thanks very much @kpysniak

sklearn/cluster/k_means_.py
@@ -553,6 +553,30 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
return centers
+def _pick_unique_labels(relative_probabilities, labels,
@larsmans Owner

I don't understand why the distances are called relative_probabilities here. It'd be clearer if the argument were just called distances.

sklearn/cluster/k_means_.py
@@ -553,6 +553,30 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
return centers
+def _pick_unique_labels(relative_probabilities, labels,
+ no_picks, random_state):
@larsmans Owner

no_picks is also a confusing name. We abbreviate "number of" to num or n, not no. And I don't see why this isn't called n_reassignments or something similar.

@ogrisel Owner
ogrisel added a note

True, or even n_picks as we usually do in the rest of the lib.

sklearn/cluster/k_means_.py
@@ -553,6 +553,30 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
return centers
+def _pick_unique_labels(relative_probabilities, labels,
+ no_picks, random_state):
+ # array of labels randomly picked
+ picks = np.zeros(no_picks, dtype=np.int)
+
+ # making sure we do not exceed any array
+ iterations = min(len(relative_probabilities), len(labels), no_picks)
+ for p in range(iterations):
+ # picking one value from set of elements
+ # with their relative probabilities
+ rand_val = random_state.rand()*relative_probabilities.sum()
@larsmans Owner

PEP8: spaces around binary operators.

@larsmans
Owner

When I rig the document clustering example to force cluster reassignments (init_size=100, batch_size=100, reassign_ratio=.9), it crashes:

Minibatch iteration 9/18900:mean batch inertia: 0.984477, ewa inertia: 1.010413 
Traceback (most recent call last):
  File "../../examples/document_clustering.py", line 185, in <module>
    km.fit(X)
  File "/scratch/home/src/scikit-learn/sklearn/cluster/k_means_.py", line 1235, in fit
    verbose=self.verbose)
  File "/scratch/home/src/scikit-learn/sklearn/cluster/k_means_.py", line 893, in _mini_batch_step
    random_state)
  File "/scratch/home/src/scikit-learn/sklearn/cluster/k_means_.py", line 570, in _pick_unique_labels
    picks[p] = labels[new_pick]
IndexError: index 94 is out of bounds for axis 0 with size 19
sklearn/cluster/k_means_.py
((7 lines not shown))
+ picks = np.zeros(no_picks, dtype=np.int)
+
+ # making sure we do not exceed any array
+ iterations = min(len(relative_probabilities), len(labels), no_picks)
+ for p in range(iterations):
+ # picking one value from set of elements
+ # with their relative probabilities
+ rand_val = random_state.rand()*relative_probabilities.sum()
+ new_pick = np.searchsorted(relative_probabilities.cumsum(), rand_val)
+
+ # taking note of pick
+ picks[p] = labels[new_pick]
+
+ # label has been picked
+ # it should not be picked next time
+ relative_probabilities = np.delete(relative_probabilities, new_pick)
@larsmans Owner

np.delete constructs a new array, so this loop takes quadratic time in the number of clusters. Can the array be modified in-place?
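
One way to avoid the repeated copies, sketched below, is to mask the chosen entry instead of deleting it, since masked entries count as zero in sum() and cumsum(); this is essentially what a later revision of this PR does with np.ma.masked_array (the function name here is hypothetical):

    import numpy as np

    def _pick_without_delete(distances, n_reassigns, random_state):
        # Mask picked entries instead of rebuilding the array with np.delete.
        distances = np.ma.masked_array(distances, copy=True)
        n_picks = min(n_reassigns, len(distances))
        picks = np.zeros(n_picks, dtype=int)
        for p in range(n_picks):
            # masked entries contribute zero to sum() and cumsum()
            rand_val = random_state.rand() * distances.sum()
            new_pick = np.searchsorted(distances.cumsum(), rand_val)
            picks[p] = new_pick
            # exclude this index from all subsequent draws
            distances[new_pick] = np.ma.masked
        return picks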

@kpysniak

@larsmans did you manage to reproduce that exception? I tried to run the document clustering example, but it worked for me.

@amueller amueller commented on the diff
sklearn/cluster/tests/test_k_means.py
@@ -622,3 +624,33 @@ def test_k_means_function():
# to many clusters desired
assert_raises(ValueError, k_means, X, n_clusters=X.shape[0] + 1)
+
+
+def test_pick_unique_labels():
+ all_random_states = np.array([check_random_state(i) for i in range(500)])
@amueller Owner
amueller added a note

do we really want to do this? we don't usually do this in other places, right? can't we just make the array such that the probability of the test passing randomly is small?

@amueller
Owner

I think this PR should have a regression test on k-means. I am just thinking about a way to vectorize. I don't think this is the problem I am seeing but it is definitely a bug and the right fix, thanks.

@amueller
Owner

Ok, nevermind, I don't understand what the problem was / is. Have to read again.

@amueller amueller commented on the diff
sklearn/cluster/k_means_.py
@@ -554,6 +554,29 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
return centers
+def _pick_unique_labels(distances, n_reassigns, random_state):
+ # array of labels randomly picked
+ picks = np.zeros(n_reassigns, dtype=np.int)
@amueller Owner
amueller added a note

shouldn't picks be of length iterations? Otherwise it will contain zeros... On the other hand, can n_reassigns be more than len(distances)?

@amueller
Owner

Ok, got it now. Also understood why it can't be vectorized ^^ I think the length of picks is a possible source of bugs. Maybe just raise if distances is shorter than n_reassigns?
I still think a regression test would be cool if possible. Initializing the means explicitly on some malformed data and using partial_fit should make that doable.
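
A rough sketch of such a regression test under those assumptions (the degenerate initialization, the reassignment_ratio value, and whether a single partial_fit call actually triggers reassignment are all guesses on my part):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.datasets import make_blobs

    def check_no_duplicate_centers_after_reassignment():
        # Start all centers at the same point so that reassignment should be
        # needed, then check that no two centers end up identical.
        X, _ = make_blobs(n_samples=100, centers=5, random_state=0)
        bad_init = np.tile(X[0], (5, 1))
        mbk = MiniBatchKMeans(n_clusters=5, init=bad_init, n_init=1,
                              batch_size=10, reassignment_ratio=0.9,
                              random_state=42)
        mbk.partial_fit(X)
        centers = mbk.cluster_centers_
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                assert not np.allclose(centers[i], centers[j])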

@kpysniak

Sure, I'll add some regression tests and send PR soon.

@amueller
Owner

FYI I would consider this PR blocking for 0.15. Not that I would be doing anything about it currently :-/

@amueller
Owner

Ok, I really don't get the reassignment stuff. The "distances" array has the wrong shape and/or remains zero at all times when MiniBatchKMeans.partial_fit is called. It should be of shape n_samples but is of shape n_clusters. However, if it is not of shape n_clusters, it will not be computed in _labels_inertia and will remain zero, leading to identical new_centers, even with the new algorithm.

@amueller
Owner

See #2638 for remaining fixes. I think we should use np.random.choice instead of writing up a new function. I'm not sure if we need to backport it but it seems better than doing a custom implementation.

@arjoly
Owner

See #2638 for remaining fixes. I think we should use np.random.choice instead of writing up a new function. I'm not sure if we need to backport it but it seems better than doing a custom implementation.

np.random.choice was added in numpy 1.7.

@amueller
Owner

How hard do you think it would be to backport it?

@arjoly
Owner

From a quick look at choice, the implementation is mainly Python-based. I haven't seen any function it uses that would be new or not already backported.

You would need to turn the choice method of the RandomState class into a choice function with a random_state argument for reproducibility, as with sklearn.utils.random.sample_without_replacement.
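
For reference, once np.random.choice (or a backported choice function) is available, the whole helper could plausibly shrink to something like this sketch; the function name is hypothetical, p has to be normalized to sum to one, and replace=False gives sampling without replacement:

    import numpy as np

    def _pick_unique_labels_with_choice(distances, n_reassigns, random_state):
        # Weighted sampling without replacement via RandomState.choice
        # (numpy >= 1.7), same intent as the hand-rolled loop in this PR.
        p = np.asarray(distances, dtype=float)
        p = p / p.sum()
        return random_state.choice(len(p), size=n_reassigns,
                                   replace=False, p=p)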

@amueller
Owner

Thanks, I will give that a shot!

@amueller
Owner

This was fixed in #2638

@amueller amueller closed this
Showing with 62 additions and 4 deletions.
  1. +30 −4 sklearn/cluster/k_means_.py
  2. +32 −0 sklearn/cluster/tests/test_k_means.py
34 sklearn/cluster/k_means_.py
@@ -554,6 +554,29 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
return centers
+def _pick_unique_labels(distances, n_reassigns, random_state):
+ # array of labels randomly picked
+ picks = np.zeros(n_reassigns, dtype=np.int)
+ distances = np.ma.masked_array(distances)
+
+ # making sure we do not exceed any array
+ iterations = min(len(distances), n_reassigns)
+ for p in range(iterations):
+ # picking one value from set of elements
+ # with their relative probabilities
+ rand_val = random_state.rand() * distances.sum()
+ new_pick = np.searchsorted(distances.cumsum(), rand_val)
+
+ # taking note of pick
+ picks[p] = new_pick
+
+ # label has been picked
+ # it should not be picked next time
+ distances[new_pick] = np.ma.masked
+
+ return picks
+
+
class KMeans(BaseEstimator, ClusterMixin, TransformerMixin):
"""K-Means clustering
@@ -860,10 +883,13 @@ def _mini_batch_step(X, x_squared_norms, centers, counts,
# Flip the ordering of the distances.
distances -= distances.max()
distances *= -1
- rand_vals = random_state.rand(n_reassigns)
- rand_vals *= distances.sum()
- new_centers = np.searchsorted(distances.cumsum(),
- rand_vals)
+
+ # picking number_of_reassingments centers
+ # with probability to their
+ new_centers = _pick_unique_labels(distances,
+ n_reassigns,
+ random_state)
+
if verbose:
print("[MiniBatchKMeans] Reassigning %i cluster centers."
% n_reassigns)
32 sklearn/cluster/tests/test_k_means.py
@@ -21,9 +21,11 @@
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster.k_means_ import _labels_inertia
from sklearn.cluster.k_means_ import _mini_batch_step
+from sklearn.cluster.k_means_ import _pick_unique_labels
from sklearn.cluster._k_means import csr_row_norm_l2
from sklearn.datasets.samples_generator import make_blobs
from sklearn.externals.six.moves import cStringIO as StringIO
+from sklearn.utils import check_random_state
# non centered, sparse centers to check the
@@ -622,3 +624,33 @@ def test_k_means_function():
# to many clusters desired
assert_raises(ValueError, k_means, X, n_clusters=X.shape[0] + 1)
+
+
+def test_pick_unique_labels():
+ all_random_states = np.array([check_random_state(i) for i in range(500)])
+ relative_probabilities = np.array([(i + 1) * (i + 2) for i in range(10)])
+ labels = range(10, 20)
+ no_single_picks = 1
+ no_multiple_picks = 4
+
+ counts_single = np.zeros((len(labels), 1), dtype=np.int)
+ counts_multiple = np.zeros((len(labels), 1), dtype=np.int)
+
+ for random_state in all_random_states:
+ single_pick = _pick_unique_labels(relative_probabilities,
+ no_single_picks, random_state)
+ counts_single[single_pick] = counts_single[single_pick] + 1
+
+ multiple_pick = _pick_unique_labels(relative_probabilities,
+ no_multiple_picks, random_state)
+ counts_multiple[multiple_pick] = counts_multiple[multiple_pick] + 1
+
+ assert_array_equal(counts_single, np.array([[0], [9], [8],
+ [25], [47], [47],
+ [68], [76], [97],
+ [123]]))
+
+ assert_array_equal(counts_multiple, np.array([[14], [37], [68],
+ [114], [160], [217],
+ [297], [325], [366],
+ [402]]))