
# scikit-learn/scikit-learn


# [MRG] Issue #2185: Fixed MinibatchKMeans bad center reallocation which caused ... (#2355)

Closed
wants to merge 5 commits
+62 −4

### 6 participants

Issue #2185

The PR addresses a problem in `MiniBatchKMeans`'s mini-batch step: wrong random reassignment of centroids. Currently, we choose n centroids out of k samples with given probabilities, but we never check whether the same centroid is picked more than once. To correctly pick n unique labels out of k samples with given probabilities, we need to repeat the following n times:
1) choose one sample
2) add it to the set of chosen samples
3) make it unavailable for subsequent picks
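A minimal sketch of the loop described above (illustrative names only, not the PR's actual code; the weights stand in for the reassignment distances):

```python
import numpy as np

def pick_unique(weights, n_picks, rng):
    # Draw n_picks distinct indices, each chosen with probability
    # proportional to its remaining weight (the steps above).
    weights = np.asarray(weights, dtype=float).copy()
    picks = []
    for _ in range(n_picks):
        # 1) choose one sample proportionally to the remaining weights
        r = rng.rand() * weights.sum()
        idx = int(np.searchsorted(weights.cumsum(), r))
        # 2) record the chosen sample
        picks.append(idx)
        # 3) zero its weight so it cannot be drawn again
        weights[idx] = 0.0
    return picks

rng = np.random.RandomState(0)
picks = pick_unique([1.0, 2.0, 3.0, 4.0], 2, rng)
assert len(set(picks)) == 2  # no duplicates, unlike the current code
```

Zeroed weights produce a flat region in the cumulative sum, so `searchsorted` (almost surely) never lands on an already-picked index again.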

Let me know what you think about this; if it is correct, I will add tests.

Thanks!

sklearn/cluster/k_means_.py
```diff
@@ -858,10 +858,16 @@ def _mini_batch_step(X, x_squared_norms, centers, counts,
         # Flip the ordering of the distances.
         distances -= distances.max()
         distances *= -1
-        rand_vals = random_state.rand(number_of_reassignments)
-        rand_vals *= distances.sum()
-        new_centers = np.searchsorted(distances.cumsum(),
-                                      rand_vals)
+
+        labels = np.array(range(0, number_of_reassignments))
+
+        # picking number_of_reassingments centers
+        # with probability to their
```

**larsmans** (Owner) added a note, Aug 10, 2013: to their?

**kpysniak** added a note, Aug 10, 2013: it should be: "with their relative probability"

**larsmans** (Owner) added a note, Aug 26, 2013: This still has to be fixed.
sklearn/cluster/k_means_.py
```diff
@@ -1270,3 +1276,27 @@ def partial_fit(self, X, y=None):
                 X, x_squared_norms, self.cluster_centers_)
         return self
+
+
+def pick_unique_labels(relative_probabilties, labels, no_picks, random_state):
```

**ogrisel** (Owner) added a note, Aug 19, 2013: This helper method should be made private with a leading `_` in its name.

**ogrisel** (Owner) added a note, Aug 19, 2013: Also there is a typo: `relative_probabilties` => `relative_probabilities`.
sklearn/cluster/k_means_.py
```diff
@@ -1270,3 +1276,27 @@ def partial_fit(self, X, y=None):
                 X, x_squared_norms, self.cluster_centers_)
         return self
+
+
+def pick_unique_labels(relative_probabilties, labels, no_picks, random_state):
+    # array of labels randomly picked
+    picks = np.zeros(no_picks, dtype=np.int)
+
+    # making sure we do not exceed any array
+    iterations = min(len(relative_probabilties), len(labels))
+    iterations = min(iterations, no_picks)
+    for p in range(0, iterations):
+        # picking one value from set of elements
+        # with their relative probabilties
+        rand_val = random_state.rand()*relative_probabilties.sum()
```

**ogrisel** (Owner) added a note, Aug 19, 2013: Please run the pep8 linter on this source file and fix any reported error.
sklearn/cluster/k_means_.py
```diff
@@ -1270,3 +1276,27 @@ def partial_fit(self, X, y=None):
                 X, x_squared_norms, self.cluster_centers_)
         return self
+
+
+def pick_unique_labels(relative_probabilties, labels, no_picks, random_state):
+    # array of labels randomly picked
+    picks = np.zeros(no_picks, dtype=np.int)
+
+    # making sure we do not exceed any array
+    iterations = min(len(relative_probabilties), len(labels))
+    iterations = min(iterations, no_picks)
+    for p in range(0, iterations):
```

**ogrisel** (Owner) added a note, Aug 19, 2013: `range(iterations)` is enough.
**Owner:**

Could you please add a unit test for the `pick_unique_labels` helper function that checks that labels with large relative probabilities tend to be selected more often than others? You can do that by calling `pick_unique_labels` with different fixed values of `random_state` (for instance `all_random_states = [check_random_state(i) for i in range(100)]`) and summing the number of times each label was picked.
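A hedged sketch of the kind of test being suggested, using a simplified stand-in for the PR's helper (the real `pick_unique_labels` signature evolved during review, so the function below is illustrative only):

```python
import numpy as np

def pick_unique_labels(weights, n_picks, random_state):
    # simplified stand-in for the PR's helper:
    # weighted sampling without replacement
    weights = np.asarray(weights, dtype=float).copy()
    picks = np.empty(n_picks, dtype=int)
    for p in range(n_picks):
        r = random_state.rand() * weights.sum()
        picks[p] = np.searchsorted(weights.cumsum(), r)
        weights[picks[p]] = 0.0
    return picks

weights = np.arange(1.0, 11.0)  # label 9 is ten times as likely as label 0
counts = np.zeros(10, dtype=int)
for seed in range(500):
    counts[pick_unique_labels(weights, 1, np.random.RandomState(seed))] += 1

# heavy labels should dominate; with 500 seeded draws the split is
# lopsided enough that a failure by chance is astronomically unlikely
assert counts[5:].sum() > counts[:5].sum()
```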

I made the changes, is it what you meant?

sklearn/cluster/tests/test_k_means.py
```diff
@@ -622,3 +624,33 @@ def test_k_means_function():
     # to many clusters desired
     assert_raises(ValueError, k_means, X, n_clusters=X.shape[0] + 1)
+
+
+def test_pick_unique_labels():
+    all_random_states = np.array([check_random_state(i) for i in range(500)])
+    relative_probabilities = np.array([(i+1)*(i+2) for i in range(10)])
```

**ogrisel** (Owner) added a note, Aug 24, 2013: Can you please run pep8 on this file and fix the errors? Why did you use something as complicated as `(i+1)*(i+2)`? Why not just `np.arange(10)` for instance?
**Owner:**

Looks good, +1 for merge once the pep8 formatting is fixed :)

I ran pep8 and pyflakes on the current version, but they didn't print any errors. Am I missing something? :)

**Owner:**

Indeed, it used to be the case that `(i+1)*(i+2)` was reported as an error by the pep8 tool and `(i + 1) * (i + 2)` was the required notation. Apparently newer versions of the pep8 style linter are more tolerant: they just check for consistency.

I still find the variant with spaces `(i + 1) * (i + 2)` more readable.

sklearn/cluster/k_means_.py
```diff
@@ -553,6 +553,31 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
     return centers
+
+def _pick_unique_labels(relative_probabilities, labels,
+                        no_picks, random_state):
+    # array of labels randomly picked
+    picks = np.zeros(no_picks, dtype=np.int)
+
+    # making sure we do not exceed any array
+    iterations = min(len(relative_probabilities), len(labels))
+    iterations = min(iterations, no_picks)
```

**GaelVaroquaux** (Owner) added a note, Aug 25, 2013: `min` can take multiple arguments. I'd prefer the above 2 lines to be one line:

```
iterations = min(len(relative_probabilities), len(labels), no_picks)
```

Thanks for the comments, I hope it looks good now

**Owner:**

I think this is ready for merge. I checked the face patches example and this fixes the repeated patches issue.

**Owner:**

+1 for merging, thanks very much @kpysniak

sklearn/cluster/k_means_.py
```diff
@@ -553,6 +553,30 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
     return centers
+
+def _pick_unique_labels(relative_probabilities, labels,
```

**larsmans** (Owner) added a note, Aug 26, 2013: I don't understand why the distances are called `relative_probabilities` here. It'd be clearer if the argument were just called `distances`.
sklearn/cluster/k_means_.py
```diff
@@ -553,6 +553,30 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
     return centers
+
+def _pick_unique_labels(relative_probabilities, labels,
+                        no_picks, random_state):
```

**larsmans** (Owner) added a note, Aug 26, 2013: `no_picks` is also a confusing name. We abbreviate "number of" to `num` or `n`, not `no`. And I don't see why this isn't called `n_reassignments` or something similar.

**ogrisel** (Owner) added a note, Aug 26, 2013: True, or even `n_picks` as we usually do in the rest of the lib.
sklearn/cluster/k_means_.py
```diff
@@ -553,6 +553,30 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
     return centers
+
+def _pick_unique_labels(relative_probabilities, labels,
+                        no_picks, random_state):
+    # array of labels randomly picked
+    picks = np.zeros(no_picks, dtype=np.int)
+
+    # making sure we do not exceed any array
+    iterations = min(len(relative_probabilities), len(labels), no_picks)
+    for p in range(iterations):
+        # picking one value from set of elements
+        # with their relative probabilities
+        rand_val = random_state.rand()*relative_probabilities.sum()
```

**larsmans** (Owner) added a note, Aug 26, 2013: PEP8: spaces around binary operators.
**Owner:**

When I rig the document clustering example to force cluster reassignments (`init_size=100`, `batch_size=100`, `reassign_ratio=.9`), it crashes:

```
Minibatch iteration 9/18900: mean batch inertia: 0.984477, ewa inertia: 1.010413
Traceback (most recent call last):
  File "../../examples/document_clustering.py", line 185, in <module>
    km.fit(X)
  File "/scratch/home/src/scikit-learn/sklearn/cluster/k_means_.py", line 1235, in fit
    verbose=self.verbose)
  File "/scratch/home/src/scikit-learn/sklearn/cluster/k_means_.py", line 893, in _mini_batch_step
    random_state)
  File "/scratch/home/src/scikit-learn/sklearn/cluster/k_means_.py", line 570, in _pick_unique_labels
    picks[p] = labels[new_pick]
IndexError: index 94 is out of bounds for axis 0 with size 19
```
sklearn/cluster/k_means_.py
```diff
((7 lines not shown))
+    picks = np.zeros(no_picks, dtype=np.int)
+
+    # making sure we do not exceed any array
+    iterations = min(len(relative_probabilities), len(labels), no_picks)
+    for p in range(iterations):
+        # picking one value from set of elements
+        # with their relative probabilities
+        rand_val = random_state.rand()*relative_probabilities.sum()
+        new_pick = np.searchsorted(relative_probabilities.cumsum(), rand_val)
+
+        # taking note of pick
+        picks[p] = labels[new_pick]
+
+        # label has been picked
+        # it should not be picked next time
+        relative_probabilities = np.delete(relative_probabilities, new_pick)
```

**larsmans** (Owner) added a note, Aug 26, 2013: `np.delete` constructs a new array, so this loop takes quadratic time in the number of clusters. Can the array be modified in-place?
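One way to modify the array in place, which is what a later revision of the PR adopts, is to mask the picked entry instead of deleting it. A sketch with illustrative values:

```python
import numpy as np

rng = np.random.RandomState(42)
weights = np.ma.masked_array([3.0, 1.0, 2.0, 4.0])
picks = []
for _ in range(3):
    # masked entries contribute nothing to the sum or the cumulative
    # sum, so an already-picked index is never drawn again
    r = rng.rand() * weights.sum()
    idx = int(np.searchsorted(weights.filled(0.0).cumsum(), r))
    picks.append(idx)
    weights[idx] = np.ma.masked  # in-place exclusion, no copy

assert len(set(picks)) == 3
```

Each pick still pays an O(n) `cumsum`, but masking avoids `np.delete` reallocating a fresh, shrunken copy of the array on every iteration.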
added some commits:

- `39b3d9c` kpysniak: Issue #2185: Fixed MinibatchKMeans bad center reallocation which caused duplicate centers
- `990637a` kpysniak: made _pick_unique_labels private, fixed misspellings, added unit test _pick_unique_labels
- `b68e14c` kpysniak: Fixed formatting issue
- `6e82de0` kpysniak: Squashed multiple min function to just one min call
- `0b98f78` kpysniak: Improved complexity of pick_unique_labels by using masked_arrays and fixed naming

@larsmans did you manage to reproduce that exception? I tried to run the document clustering example, but it worked for me.

commented on the diff
sklearn/cluster/tests/test_k_means.py
```diff
@@ -622,3 +624,33 @@ def test_k_means_function():
     # to many clusters desired
     assert_raises(ValueError, k_means, X, n_clusters=X.shape[0] + 1)
+
+
+def test_pick_unique_labels():
+    all_random_states = np.array([check_random_state(i) for i in range(500)])
```

**amueller** (Owner) added a note, Nov 7, 2013: do we really want to do this? we don't usually do this in other places, right? can't we just make the array such that the probability of the test passing randomly is small?
**Owner:**

I think this PR should have a regression test on k-means. I am just thinking about a way to vectorize it. I don't think this is the problem I am seeing, but it is definitely a bug and the right fix, thanks.

**Owner:**

Ok, nevermind, I don't understand what the problem was / is. Have to read again.

commented on the diff
sklearn/cluster/k_means_.py
```diff
@@ -554,6 +554,29 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
     return centers
+
+def _pick_unique_labels(distances, n_reassigns, random_state):
+    # array of labels randomly picked
+    picks = np.zeros(n_reassigns, dtype=np.int)
```

**amueller** (Owner) added a note, Nov 7, 2013: shouldn't `picks` be of length `iterations`? Otherwise it will contain zeros... On the other hand, can `n_reassigns` be more than `len(distances)`?
**Owner:**

Ok, got it now. Also understood why it can't be vectorized ^^ I think the length of `picks` is a possible source of bugs. Maybe just `raise` if `distances` is shorter than `n_reassigns`?
I still think a regression test would be cool if possible. Initializing the means explicitly on some malformed data and using `partial_fit`, that should be doable.

Sure, I'll add some regression tests and send PR soon.

**Owner:**

FYI I would consider this PR blocking for 0.15. Not that I would be doing anything about it currently :-/

**Owner:**

Ok, I really don't get the reassignment stuff. The `distances` array either has the wrong shape and/or remains zero at all times when `MiniBatchKMeans.partial_fit` is called. It should be of shape `n_samples` but is of shape `n_clusters`. However, if it is not of shape `n_samples`, it will not be filled in by `_labels_inertia` and will remain zero, leading to identical `new_centers`, even with the new algorithm.

**Owner:**

See #2638 for remaining fixes. I think we should use `np.random.choice` instead of writing up a new function. I'm not sure if we need to backport it, but it seems better than doing a custom implementation.
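For reference, the suggested replacement is a one-liner on NumPy >= 1.7; a sketch with made-up distances (the variable names are illustrative, not the PR's code):

```python
import numpy as np

distances = np.array([0.5, 3.0, 1.5, 5.0])
rng = np.random.RandomState(0)

# weighted sampling without replacement in a single call;
# `p` must be normalized so that it sums to 1
new_centers = rng.choice(len(distances), size=2, replace=False,
                         p=distances / distances.sum())

assert len(np.unique(new_centers)) == 2
```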

**Owner:**

> See #2638 for remaining fixes. I think we should use np.random.choice instead of writing up a new function. I'm not sure if we need to backport it but it seems better than doing a custom implementation.

`np.random.choice` was added in numpy 1.7.

**Owner:**

How hard do you think it would be to backport it?

**Owner:**

From a quick look at `choice`, the implementation is mostly Python-based. I haven't seen any function in it that would be new or not already backported.

You would need to turn the `choice` method of the `RandomState` class into a `choice` function with a `random_state` argument for reproducibility, as with `sklearn.utils.random.sample_without_replacement`.
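A hypothetical sketch of that function-style wrapper (on modern NumPy it can simply delegate to `RandomState.choice`; a true backport for numpy < 1.7 would have to reimplement the method body, which this sketch does not do):

```python
import numpy as np

def choice(a, size=None, replace=True, p=None, random_state=None):
    # hypothetical wrapper: accept None, an int seed, or an existing
    # RandomState, mirroring sklearn's check_random_state convention
    if not isinstance(random_state, np.random.RandomState):
        random_state = np.random.RandomState(random_state)
    return random_state.choice(a, size=size, replace=replace, p=p)

# reproducible weighted sampling without replacement
picks = choice(10, size=3, replace=False, random_state=0)
assert len(set(picks.tolist())) == 3
```

Passing the same `random_state` value yields the same picks, which is the reproducibility property the thread asks for.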

**Owner:**

Thanks, I will give that a shot!

**Owner:**

This was fixed in #2638.

closed this
sklearn/cluster/k_means_.py (34 lines changed)

```diff
@@ -554,6 +554,29 @@ def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
     return centers
+
+def _pick_unique_labels(distances, n_reassigns, random_state):
+    # array of labels randomly picked
+    picks = np.zeros(n_reassigns, dtype=np.int)
+    distances = np.ma.masked_array(distances)
+
+    # making sure we do not exceed any array
+    iterations = min(len(distances), n_reassigns)
+    for p in range(iterations):
+        # picking one value from set of elements
+        # with their relative probabilities
+        rand_val = random_state.rand() * distances.sum()
+        new_pick = np.searchsorted(distances.cumsum(), rand_val)
+
+        # taking note of pick
+        picks[p] = new_pick
+
+        # label has been picked
+        # it should not be picked next time
+        distances[new_pick] = np.ma.masked
+
+    return picks
+
+
 class KMeans(BaseEstimator, ClusterMixin, TransformerMixin):
     """K-Means clustering
@@ -860,10 +883,13 @@ def _mini_batch_step(X, x_squared_norms, centers, counts,
         # Flip the ordering of the distances.
         distances -= distances.max()
         distances *= -1
-        rand_vals = random_state.rand(n_reassigns)
-        rand_vals *= distances.sum()
-        new_centers = np.searchsorted(distances.cumsum(),
-                                      rand_vals)
+
+        # picking number_of_reassingments centers
+        # with probability to their
+        new_centers = _pick_unique_labels(distances,
+                                          n_reassigns,
+                                          random_state)
+
         if verbose:
             print("[MiniBatchKMeans] Reassigning %i cluster centers."
                   % n_reassigns)
```
sklearn/cluster/tests/test_k_means.py (32 lines changed)

```diff
@@ -21,9 +21,11 @@
 from sklearn.cluster import MiniBatchKMeans
 from sklearn.cluster.k_means_ import _labels_inertia
 from sklearn.cluster.k_means_ import _mini_batch_step
+from sklearn.cluster.k_means_ import _pick_unique_labels
 from sklearn.cluster._k_means import csr_row_norm_l2
 from sklearn.datasets.samples_generator import make_blobs
 from sklearn.externals.six.moves import cStringIO as StringIO
+from sklearn.utils import check_random_state

 # non centered, sparse centers to check the
@@ -622,3 +624,33 @@ def test_k_means_function():
     # to many clusters desired
     assert_raises(ValueError, k_means, X, n_clusters=X.shape[0] + 1)
+
+
+def test_pick_unique_labels():
+    all_random_states = np.array([check_random_state(i) for i in range(500)])
+    relative_probabilities = np.array([(i + 1) * (i + 2) for i in range(10)])
+    labels = range(10, 20)
+    no_single_picks = 1
+    no_multiple_picks = 4
+
+    counts_single = np.zeros((len(labels), 1), dtype=np.int)
+    counts_multiple = np.zeros((len(labels), 1), dtype=np.int)
+
+    for random_state in all_random_states:
+        single_pick = _pick_unique_labels(relative_probabilities,
+                                          no_single_picks, random_state)
+        counts_single[single_pick] = counts_single[single_pick] + 1
+
+        multiple_pick = _pick_unique_labels(relative_probabilities,
+                                            no_multiple_picks, random_state)
+        counts_multiple[multiple_pick] = counts_multiple[multiple_pick] + 1
+
+    assert_array_equal(counts_single, np.array([[0], [9], [8],
+                                                [25], [47], [47],
+                                                [68], [76], [97],
+                                                [123]]))
+
+    assert_array_equal(counts_multiple, np.array([[14], [37], [68],
+                                                  [114], [160], [217],
+                                                  [297], [325], [366],
+                                                  [402]]))
```