[MRG] Fix MiniBatchKMeans #3376

Merged
merged 25 commits into from Jul 14, 2014

Conversation

Projects
None yet
5 participants
Owner

GaelVaroquaux commented Jul 14, 2014

This is a follow up on @amueller's PR:
#2638

The code was rebased on master

A minor test failure on my box was fixed

The strategy has been switched to uniform drawing of centers in the population.

@ogrisel, @amueller, please review. I believe that this should go in for 0.15.

Coverage Status

Coverage decreased (-0.12%) when pulling 6af3615 on GaelVaroquaux:fix_mb_kmeans into cdf3b7b on scikit-learn:master.

@ogrisel ogrisel commented on an outdated diff Jul 14, 2014

sklearn/cluster/k_means_.py
- number_of_reassignments = to_reassign.sum()
- if number_of_reassignments:
- # Pick new clusters amongst observations with probability
- # proportional to their closeness to their center.
- # Flip the ordering of the distances.
- distances -= distances.max()
- distances *= -1
- rand_vals = random_state.rand(number_of_reassignments)
- rand_vals *= distances.sum()
- new_centers = np.searchsorted(distances.cumsum(),
- rand_vals)
+ to_reassign = counts < reassignment_ratio * counts.max()
+ # maximally assign reassignment_ratio*batch_size samples as clusters
+ if to_reassign.sum() > .5 * X.shape[0]:
+ indices_dont_reassign = np.argsort(counts)[int(.5 * X.shape[0]):]
+ to_reassign[indices_dont_reassign] = False
@ogrisel

ogrisel Jul 14, 2014

Owner

The comment "# maximally assign reassignment_ratio*batch_size samples as clusters" does not seem to describe what is implemented here.

Furthermore I don't understand why we would reassign 50% of the batch size. Instead I think should reassign max 20% of the center at once, that is int(.2 * self.n_clusters).

WDYT?

@ogrisel ogrisel commented on an outdated diff Jul 14, 2014

sklearn/cluster/k_means_.py
- # Flip the ordering of the distances.
- distances -= distances.max()
- distances *= -1
- rand_vals = random_state.rand(number_of_reassignments)
- rand_vals *= distances.sum()
- new_centers = np.searchsorted(distances.cumsum(),
- rand_vals)
+ to_reassign = counts < reassignment_ratio * counts.max()
+ # maximally assign reassignment_ratio*batch_size samples as clusters
+ if to_reassign.sum() > .5 * X.shape[0]:
+ indices_dont_reassign = np.argsort(counts)[int(.5 * X.shape[0]):]
+ to_reassign[indices_dont_reassign] = False
+ n_reassigns = to_reassign.sum()
+ if n_reassigns:
+ # Pick new clusters amongst observations with uniform probability
+ new_centers = choice(X.shape[0], replace=False, size=n_reassigns)
@ogrisel

ogrisel Jul 14, 2014

Owner

Needs to be seeded with self.random_state.

@ogrisel ogrisel commented on the diff Jul 14, 2014

sklearn/cluster/k_means_.py
labels[dist < mindist] = center_id
mindist = np.minimum(dist, mindist)
+ if n_samples == distances.shape[0]:
+ # distances will be changed in-place
+ distances[:] = mindist
@ogrisel

ogrisel Jul 14, 2014

Owner

I think distances no longer need to be computed as we do uniform random reassignments now.

@ogrisel ogrisel commented on the diff Jul 14, 2014

sklearn/cluster/k_means_.py
@@ -1276,7 +1318,7 @@ def partial_fit(self, X, y=None):
# reassignment too often, to allow for building up counts
random_reassign = self.random_state_.randint(
10 * (1 + self.counts_.min())) == 0
- distances = np.zeros(self.n_clusters, dtype=np.float64)
+ distances = np.zeros(X.shape[0], dtype=np.float64)
@ogrisel

ogrisel Jul 14, 2014

Owner

Looks fishy. Why would the distances shape change?

@vene vene commented on an outdated diff Jul 14, 2014

sklearn/cluster/k_means_.py
The vector in which we keep track of the numbers of elements in a
cluster. This array is MODIFIED IN PLACE
- distances: array, dtype float64, shape (n_samples), optional
+ distances : array, dtype float64, shape (n_samples), optional
If not None, should be a pre-allocated array that will be used to store
the distances of each sample to it's closest center.
@vene

vene Jul 14, 2014

Owner

s/it's/its

I know it's not your change but still :)

amueller and others added some commits Dec 4, 2013

@amueller @GaelVaroquaux amueller TST add test for minibatch k-means reassigns. 5174565
@amueller @GaelVaroquaux amueller FIX bug where random state was reset in each call to partial_fit, suc…
…h that reassignment either occured never or always.
d7a557a
@amueller @GaelVaroquaux amueller FIX bug where distances were of illegal shape 3f3d8ad
@amueller @GaelVaroquaux amueller FIX crash when to_reassign.sum() > X.shape[0] 5141b9e
@amueller @GaelVaroquaux amueller TST split test for partial_fit and fit f130079
@amueller @GaelVaroquaux amueller TST add additional test for batch_size > n_samples in fit. b8ca2ac
@amueller @GaelVaroquaux amueller FIX use choice in minibatch_kmeans b008e5d
@amueller @GaelVaroquaux amueller FIX another (two?) bugs: the default code path didn't ever compute di…
…stances.
b654f04
@amueller @GaelVaroquaux amueller MISC rename random.pyx to _random.pyx f7d14b3
@amueller @GaelVaroquaux amueller ENH backport np.random.choice. 1ea8fca
@amueller @GaelVaroquaux amueller FIX broken tests for minibatch k-means 4c97c6a
@amueller @GaelVaroquaux amueller skip doctests 7a04b8d
@amueller @GaelVaroquaux amueller simplify tests c2d59d2
@amueller @GaelVaroquaux amueller remove redundant code 72d9e25
@amueller @GaelVaroquaux amueller FIX up the rest of the stuff. Introduce .5 as a magic number. Hurray …
…for magic numbers?
730683e
@amueller @GaelVaroquaux amueller DOC more / better documentation 7f5a784
@amueller @GaelVaroquaux amueller FIX reset counts of reassigned clusters. 894c530
@amueller @GaelVaroquaux amueller ENH never reassign everything, more meaningful tests. a39049b
@GaelVaroquaux GaelVaroquaux TEST: fix test failing due to numeric instability 05e65a6
@GaelVaroquaux GaelVaroquaux COSMIT: address misc comments e191366
@GaelVaroquaux GaelVaroquaux Minibach k-means: change reassignment to uniform
Use a uniform sampling of points in the reassignment strategy, as it
is the strategy that carries the less assumptions on the data and the
usecase.
c8397ec
@GaelVaroquaux GaelVaroquaux MISC: avoid deprecation warning 0861f89
@GaelVaroquaux GaelVaroquaux MISC: minor changes in MBKmeans
Address the gist of @ogrisel's comments
40f8b49

@vene vene commented on an outdated diff Jul 14, 2014

sklearn/cluster/k_means_.py
- reassignment_ratio: float, optional
+ reassignment_ratio : float, optional
Control the fraction of the maximum number of counts for a
center to be reassigned. A higher value means that low count
centers are more easily reassigned, which means that the
@vene

vene Jul 14, 2014

Owner

I find the phrasing "more easily reassigned" a bit confusing, how about "more likely to be reassigned"?

@vene vene commented on the diff Jul 14, 2014

sklearn/utils/tests/test_extmath.py
@@ -72,8 +72,8 @@ def test_random_weights():
mode, score = weighted_mode(x, w, axis=1)
- np.testing.assert_array_equal(mode, mode_result)
- np.testing.assert_array_almost_equal(score.ravel(), w[:, :5].sum(1))
+ assert_array_equal(mode, mode_result)
+ assert_array_almost_equal(score.ravel(), w[:, :5].sum(1))
@vene

vene Jul 14, 2014

Owner

is this axis=1?

@vene vene commented on the diff Jul 14, 2014

sklearn/cluster/k_means_.py
The generator used to initialize the centers. If an integer is
given, it fixes the seed. Defaults to the global numpy random
number generator.
- random_reassign: boolean, optional
- If True, centers with very low counts are
- randomly-reassigned to observations in dense areas.
+ random_reassign : boolean, optional
@vene

vene Jul 14, 2014

Owner

Just wondering: is this parameter needed, or could it be controlled through reassignment_ratio=0?

This could leave room for later adding reassign_method={random, near, far...} without needing to deprecate anything.

@GaelVaroquaux

GaelVaroquaux Jul 14, 2014

Owner

Just wondering: is this parameter needed, or could it be controlled through
reassignment_ratio=0?

If was there before (in release 0.14)

@vene vene and 1 other commented on an outdated diff Jul 14, 2014

sklearn/cluster/k_means_.py
@@ -836,30 +880,29 @@ def _mini_batch_step(X, x_squared_norms, centers, counts,
if random_reassign and reassignment_ratio > 0:
random_state = check_random_state(random_state)
# Reassign clusters that have very low counts
- to_reassign = np.logical_or(
- (counts <= 1), counts <= reassignment_ratio * counts.max())
- number_of_reassignments = to_reassign.sum()
- if number_of_reassignments:
- # Pick new clusters amongst observations with probability
- # proportional to their closeness to their center.
- # Flip the ordering of the distances.
- distances -= distances.max()
- distances *= -1
- rand_vals = random_state.rand(number_of_reassignments)
- rand_vals *= distances.sum()
- new_centers = np.searchsorted(distances.cumsum(),
- rand_vals)
+ to_reassign = counts < reassignment_ratio * counts.max()
+ # maximally assign .5 * batch_size samples as clusters
@vene

vene Jul 14, 2014

Owner

This comment (the "as clusters") part is misleading. I'd say "reassign at most .5 * batch_size clusters", no?

@ogrisel

ogrisel Jul 14, 2014

Owner

I find both formulations correct and equivalent.

@vene vene commented on the diff Jul 14, 2014

sklearn/cluster/k_means_.py
- # proportional to their closeness to their center.
- # Flip the ordering of the distances.
- distances -= distances.max()
- distances *= -1
- rand_vals = random_state.rand(number_of_reassignments)
- rand_vals *= distances.sum()
- new_centers = np.searchsorted(distances.cumsum(),
- rand_vals)
+ to_reassign = counts < reassignment_ratio * counts.max()
+ # maximally assign .5 * batch_size samples as clusters
+ if to_reassign.sum() > .5 * X.shape[0]:
+ indices_dont_reassign = np.argsort(counts)[int(.5 * X.shape[0]):]
+ to_reassign[indices_dont_reassign] = False
+ n_reassigns = to_reassign.sum()
+ if n_reassigns:
+ # Pick new clusters amongst observations with uniform probability
@vene

vene Jul 14, 2014

Owner

Really "pick new centers". Sorry but the comments gave me a hard time understanding what the code does. Maybe it's just me, in the end it's all fairly standard.

Owner

ogrisel commented Jul 14, 2014

This looks good. I ran the cluster/plot_dict_face_patches.py example increasing the number of patches to 13x13 == 169 and it works great.

The document clustering example seems to have the same lever of performance as previously.

+1 for merge!

Owner

GaelVaroquaux commented Jul 14, 2014

Merging this bastard!

GaelVaroquaux merged commit 0597dc1 into scikit-learn:master Jul 14, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details

Coverage Status

Coverage decreased (-0.12%) when pulling 45aa833 on GaelVaroquaux:fix_mb_kmeans into 14d03f6 on scikit-learn:master.

Owner

amueller commented Jul 14, 2014

Hurray!

Owner

GaelVaroquaux commented Jul 14, 2014

Hurray!

You bet! You did most of the work. Thanks a lot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment