[MRG] Issue #2185: Fixed MinibatchKMeans bad center reallocation which caused ... #2355

Contributor
kpysniak commented

Issue #2185

This PR addresses a problem in the minibatch k-means code (MiniBatchKMeans): the random reassignment of centroids is wrong. Currently we choose n centroids out of k samples with given probabilities, but we never check or control whether the same centroid is picked more than once. To correctly pick n unique labels out of k samples with given probabilities, we need to repeat n times (a code sketch follows the list):

  1. choose one sample in proportion to its probability
  2. add it to the set of chosen samples
  3. make it unavailable for subsequent picks
  4. repeat
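
A minimal sketch of this procedure in NumPy (the helper name pick_unique and its arguments are illustrative only, not the patch itself):

    import numpy as np

    def pick_unique(weights, labels, n_picks, random_state):
        # weighted sampling without replacement: weights holds one
        # non-negative weight per candidate label
        weights = np.asarray(weights, dtype=float).copy()
        labels = np.asarray(labels).copy()
        picks = []
        for _ in range(min(n_picks, len(labels))):
            # 1. choose one sample proportionally to its weight
            rand_val = random_state.rand() * weights.sum()
            idx = np.searchsorted(weights.cumsum(), rand_val)
            # 2. add it to the set of chosen samples
            picks.append(labels[idx])
            # 3. make it unavailable for subsequent picks
            weights = np.delete(weights, idx)
            labels = np.delete(labels, idx)
        return np.asarray(picks)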

Let me know what you think about this. If it is correct, I will add tests.

Thanks!

@larsmans larsmans and 1 other commented on an outdated diff Aug 10, 2013
sklearn/cluster/k_means_.py
@@ -858,10 +858,16 @@ def _mini_batch_step(X, x_squared_norms, centers, counts,
# Flip the ordering of the distances.
distances -= distances.max()
distances *= -1
- rand_vals = random_state.rand(number_of_reassignments)
- rand_vals *= distances.sum()
- new_centers = np.searchsorted(distances.cumsum(),
- rand_vals)
+
+ labels = np.array(range(0, number_of_reassignments))
+
+ # picking number_of_reassingments centers
+ # with probability to their
larsmans Aug 10, 2013 Owner

to their?

kpysniak Aug 10, 2013 Contributor

it should be: "with their relative probability"

larsmans Aug 26, 2013 Owner

This still has to be fixed.

@ogrisel ogrisel commented on an outdated diff Aug 19, 2013
sklearn/cluster/k_means_.py
@@ -1270,3 +1276,27 @@ def partial_fit(self, X, y=None):
X, x_squared_norms, self.cluster_centers_)
return self
+
+
+def pick_unique_labels(relative_probabilties, labels, no_picks, random_state):
ogrisel Aug 19, 2013 Owner

This helper method should be made private with a leading _ in its name.

ogrisel Aug 19, 2013 Owner

Also there is a typo: relative_probabilties => relative_probabilities.

@ogrisel ogrisel commented on an outdated diff Aug 19, 2013
sklearn/cluster/k_means_.py
@@ -1270,3 +1276,27 @@ def partial_fit(self, X, y=None):
X, x_squared_norms, self.cluster_centers_)
return self
+
+
+def pick_unique_labels(relative_probabilties, labels, no_picks, random_state):
+ # array of labels randomly picked
+ picks = np.zeros(no_picks, dtype=np.int)
+
+ # making sure we do not exceed any array
+ iterations = min(len(relative_probabilties), len(labels))
+ iterations = min(iterations, no_picks)
+ for p in range(0, iterations):
+ # picking one value from set of elements
+ # with their relative probabilties
+ rand_val = random_state.rand()*relative_probabilties.sum()
ogrisel Aug 19, 2013 Owner

Please run the pep8 linter on this source file and fix any reported error.

@ogrisel ogrisel commented on an outdated diff Aug 19, 2013
sklearn/cluster/k_means_.py
@@ -1270,3 +1276,27 @@ def partial_fit(self, X, y=None):
X, x_squared_norms, self.cluster_centers_)
return self
+
+
+def pick_unique_labels(relative_probabilties, labels, no_picks, random_state):
+ # array of labels randomly picked
+ picks = np.zeros(no_picks, dtype=np.int)
+
+ # making sure we do not exceed any array
+ iterations = min(len(relative_probabilties), len(labels))
+ iterations = min(iterations, no_picks)
+ for p in range(0, iterations):
ogrisel Aug 19, 2013 Owner

range(iterations) is enough.

Owner
ogrisel commented Aug 19, 2013

Could you please add a unit test for the pick_unique_labels helper function that checks that labels with large relative probabilities tend to be selected more often than others? You can do that by calling pick_unique_labels with different fixed values of random_state (for instance all_random_states = [check_random_state(i) for i in range(100)]) and summing the number of times each label was picked.
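
A rough sketch of such a test, assuming the helper's signature at this stage of the PR, pick_unique_labels(relative_probabilities, labels, no_picks, random_state):

    import numpy as np
    from sklearn.utils import check_random_state

    def test_pick_unique_labels():
        labels = np.arange(10)
        # label 9 carries ten times the weight of label 0
        relative_probabilities = np.arange(1.0, 11.0)
        counts = np.zeros(10)
        for rng in [check_random_state(i) for i in range(100)]:
            # helper under review in this PR
            picks = pick_unique_labels(relative_probabilities, labels, 3, rng)
            for p in picks:
                counts[p] += 1
        # heavily weighted labels should be picked far more often
        assert counts[9] > counts[0]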

Contributor
kpysniak commented

I made the changes; is this what you meant?

@ogrisel ogrisel commented on an outdated diff Aug 24, 2013
sklearn/cluster/tests/test_k_means.py
@@ -622,3 +624,33 @@ def test_k_means_function():
# to many clusters desired
assert_raises(ValueError, k_means, X, n_clusters=X.shape[0] + 1)
+
+
+def test_pick_unique_labels():
+ all_random_states = np.array([check_random_state(i) for i in range(500)])
+ relative_probabilities = np.array([(i+1)*(i+2) for i in range(10)])
ogrisel Aug 24, 2013 Owner

Can you please run pep8 on this file and fix the errors?

Why did you use something as complicated as (i+1)*(i+2)? Why not just np.arange(10) for instance?

Owner
ogrisel commented Aug 24, 2013

Looks good, +1 for merge once pep8 formatting is fixed :)

Contributor
kpysniak commented

I ran pep8 and pyflakes on the current version, but they didn't report any errors. Am I missing something? :)

Owner
ogrisel commented Aug 25, 2013

Indeed, it used to be the case that (i+1)*(i+2) was reported as an error by the pep8 tool and (i + 1) * (i + 2) was the correct notation. Apparently newer versions of the pep8 style linter are more tolerant: they just check for consistency.

I still find the variant with spaces (i + 1) * (i + 2) more readable.

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Aug 25, 2013
sklearn/cluster/k_means_.py
@@ -553,6 +553,31 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
return centers
+def _pick_unique_labels(relative_probabilities, labels,
+ no_picks, random_state):
+ # array of labels randomly picked
+ picks = np.zeros(no_picks, dtype=np.int)
+
+ # making sure we do not exceed any array
+ iterations = min(len(relative_probabilities), len(labels))
+ iterations = min(iterations, no_picks)
GaelVaroquaux Aug 25, 2013 Owner

'min' can take multiple arguments. I'd prefer the above 2 lines to be one line:

    iterations = min(len(relative_probabilities), len(labels), no_picks)
Contributor
kpysniak commented

Thanks for the comments; I hope it looks good now.

Owner
ogrisel commented Aug 25, 2013

I think this is ready for merge. I checked the face patches example and this fixes the repeated patches issue.

Owner
ogrisel commented Aug 25, 2013

+1 for merging, thanks very much @kpysniak

@larsmans larsmans commented on an outdated diff Aug 26, 2013
sklearn/cluster/k_means_.py
@@ -553,6 +553,30 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
return centers
+def _pick_unique_labels(relative_probabilities, labels,
larsmans Aug 26, 2013 Owner

I don't understand why the distances are called relative_probabilities here. It'd be clearer if the argument were just called distances.

@larsmans larsmans and 1 other commented on an outdated diff Aug 26, 2013
sklearn/cluster/k_means_.py
@@ -553,6 +553,30 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
return centers
+def _pick_unique_labels(relative_probabilities, labels,
+ no_picks, random_state):
larsmans Aug 26, 2013 Owner

no_picks is also a confusing name. We abbreviate "number of" to num or n, not no. And I don't see why this isn't called n_reassignments or something similar.

ogrisel Aug 26, 2013 Owner

True, or even n_picks as we usually do in the rest of the lib.

@larsmans larsmans commented on an outdated diff Aug 26, 2013
sklearn/cluster/k_means_.py
@@ -553,6 +553,30 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
return centers
+def _pick_unique_labels(relative_probabilities, labels,
+ no_picks, random_state):
+ # array of labels randomly picked
+ picks = np.zeros(no_picks, dtype=np.int)
+
+ # making sure we do not exceed any array
+ iterations = min(len(relative_probabilities), len(labels), no_picks)
+ for p in range(iterations):
+ # picking one value from set of elements
+ # with their relative probabilities
+ rand_val = random_state.rand()*relative_probabilities.sum()
larsmans Aug 26, 2013 Owner

PEP8: spaces around binary operators.

Owner
larsmans commented

When I rig the document clustering example to force cluster reassignments (init_size=100, batch_size=100, reassign_ratio=.9), it crashes:

Minibatch iteration 9/18900:mean batch inertia: 0.984477, ewa inertia: 1.010413 
Traceback (most recent call last):
  File "../../examples/document_clustering.py", line 185, in <module>
    km.fit(X)
  File "/scratch/home/src/scikit-learn/sklearn/cluster/k_means_.py", line 1235, in fit
    verbose=self.verbose)
  File "/scratch/home/src/scikit-learn/sklearn/cluster/k_means_.py", line 893, in _mini_batch_step
    random_state)
  File "/scratch/home/src/scikit-learn/sklearn/cluster/k_means_.py", line 570, in _pick_unique_labels
    picks[p] = labels[new_pick]
IndexError: index 94 is out of bounds for axis 0 with size 19
@larsmans larsmans commented on an outdated diff Aug 26, 2013
sklearn/cluster/k_means_.py
+ picks = np.zeros(no_picks, dtype=np.int)
+
+ # making sure we do not exceed any array
+ iterations = min(len(relative_probabilities), len(labels), no_picks)
+ for p in range(iterations):
+ # picking one value from set of elements
+ # with their relative probabilities
+ rand_val = random_state.rand()*relative_probabilities.sum()
+ new_pick = np.searchsorted(relative_probabilities.cumsum(), rand_val)
+
+ # taking note of pick
+ picks[p] = labels[new_pick]
+
+ # label has been picked
+ # it should not be picked next time
+ relative_probabilities = np.delete(relative_probabilities, new_pick)
larsmans Aug 26, 2013 Owner

np.delete constructs a new array, so this loop takes quadratic time in the number of clusters. Can the array be modified in-place?
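
One in-place variant, sketched here under the same signature (not the patch itself): swap the picked entry past the end of an "active" region and shrink that region, instead of reallocating with np.delete. The cumsum is still O(n_clusters) per pick, so the loop stays quadratic overall, but no new arrays are constructed inside it:

    import numpy as np

    def _pick_unique_labels_inplace(distances, labels, n_picks, random_state):
        # copies so the caller's arrays are untouched
        weights = np.asarray(distances, dtype=float).copy()
        labels = np.asarray(labels).copy()
        n_active = len(weights)
        picks = np.zeros(min(n_picks, n_active), dtype=int)
        for p in range(len(picks)):
            active = weights[:n_active]
            rand_val = random_state.rand() * active.sum()
            idx = np.searchsorted(active.cumsum(), rand_val)
            picks[p] = labels[idx]
            # swap the pick out of the active region rather than deleting it
            last = n_active - 1
            weights[idx], weights[last] = weights[last], weights[idx]
            labels[idx], labels[last] = labels[last], labels[idx]
            n_active -= 1
        return picks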

Contributor
kpysniak commented

@larsmans did you manage to reproduce that exception? I tried to run the document clustering example, but it worked for me.

@amueller amueller commented on the diff Nov 7, 2013
sklearn/cluster/tests/test_k_means.py
@@ -622,3 +624,33 @@ def test_k_means_function():
# to many clusters desired
assert_raises(ValueError, k_means, X, n_clusters=X.shape[0] + 1)
+
+
+def test_pick_unique_labels():
+ all_random_states = np.array([check_random_state(i) for i in range(500)])
amueller Nov 7, 2013 Owner

Do we really want to do this? We don't usually do this in other places, right? Can't we just make the array such that the probability of the test passing by chance is small?

Owner
amueller commented Nov 7, 2013

I think this PR should have a regression test on k-means. I am just thinking about a way to vectorize. I don't think this is the problem I am seeing but it is definitely a bug and the right fix, thanks.

Owner
amueller commented Nov 7, 2013

OK, never mind, I don't understand what the problem was/is. I have to read again.

@amueller amueller commented on the diff Nov 7, 2013
sklearn/cluster/k_means_.py
@@ -554,6 +554,29 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
return centers
+def _pick_unique_labels(distances, n_reassigns, random_state):
+ # array of labels randomly picked
+ picks = np.zeros(n_reassigns, dtype=np.int)
amueller Nov 7, 2013 Owner

shouldn't picks be of length iterations? Otherwise it will contain zeros... On the other hand, can n_reassigns be more than len(distances)?

Owner
amueller commented Nov 7, 2013

OK, got it now. Also understood why it can't be vectorized ^^ I think the length of picks is a possible source of bugs. Maybe just raise if distances is shorter than n_reassigns?
I still think a regression test would be cool if possible. Initializing the means explicitly on some malformed data and using partial_fit, that should be doable.
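
A rough sketch of the kind of regression test described above, using MiniBatchKMeans's reassignment_ratio parameter; the assertion is only what we would expect once reassignment picks unique centers:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    rng = np.random.RandomState(42)
    X = rng.rand(100, 2)
    # degenerate init: every center starts at the same point, so only
    # random reassignment can separate them
    init = np.zeros((10, 2))
    mbk = MiniBatchKMeans(n_clusters=10, init=init, n_init=1,
                          reassignment_ratio=0.9, random_state=42)
    for _ in range(10):
        mbk.partial_fit(X)
    # with unique reassignment we expect distinct centers afterwards
    n_unique = len(set(map(tuple, np.round(mbk.cluster_centers_, 6))))
    assert n_unique > 1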

Contributor
kpysniak commented Nov 8, 2013

Sure, I'll add some regression tests and send PR soon.

Owner
amueller commented Dec 3, 2013

FYI I would consider this PR blocking for 0.15. Not that I would be doing anything about it currently :-/

Owner
amueller commented Dec 3, 2013

OK, I really don't get the reassignment stuff. The "distances" array either has the wrong shape or remains zero at all times when MiniBatchKMeans.partial_fit is called. It should be of shape n_samples but is of shape n_clusters. However, if it is not of shape n_clusters, it will not be computed in _labels_inertia and will remain zero, leading to identical new_centers, even with the new algorithm.

Owner
amueller commented Dec 6, 2013

See #2638 for remaining fixes. I think we should use np.random.choice instead of writing up a new function. I'm not sure if we need to backport it but it seems better than doing a custom implementation.

Owner
arjoly commented Dec 6, 2013

See #2638 for remaining fixes. I think we should use np.random.choice instead of writing up a new function. I'm not sure if we need to backport it but it seems better than doing a custom implementation.

np.random.choice was added in numpy 1.7.

Owner
amueller commented Dec 6, 2013

How hard do you think it would be to backport it?

Owner
arjoly commented Dec 7, 2013

From a quick look at choice, the implementation is mainly Python-based. I haven't seen it use any function that would be new or not already backported.

You would need to turn the choice method of the RandomState class into a choice function with a random_state argument for reproducibility, as with sklearn.utils.random.sample_without_replacement.
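
For reference, a minimal sketch of what the suggested replacement looks like (requires numpy >= 1.7; the distances array stands in for the reassignment weights used in the PR):

    import numpy as np

    rng = np.random.RandomState(0)
    distances = np.array([0.5, 3.0, 0.1, 2.4, 1.0])
    # weighted sampling without replacement, which is exactly what
    # _pick_unique_labels implements by hand
    new_centers = rng.choice(len(distances), size=3, replace=False,
                             p=distances / distances.sum())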

Owner
amueller commented Dec 7, 2013

Thanks, I will give that a shot!

Owner
amueller commented Jan 9, 2015

This was fixed in #2638

@amueller amueller closed this Jan 9, 2015