MinibatchKMeans bad center reallocation causes duplicate centers #2185

Closed
ogrisel opened this Issue Jul 22, 2013 · 17 comments

Owner

ogrisel commented Jul 22, 2013

For instance have a look at:

http://scikit-learn.org/dev/auto_examples/cluster/plot_dict_face_patches.html

some of the centroids are duplicated, presumably because of a bug in the bad cluster reallocation heuristic.

Owner

GaelVaroquaux commented Jul 25, 2013

I suggest that we punt on this for 0.14: I don't think that we have time to address it.

Owner

ogrisel commented Jul 25, 2013

I am fine with re-scheduling this for 0.15 or 0.14.1 if we do a bugfix release.

@kpysniak added a commit to kpysniak/scikit-learn that referenced this issue on Aug 10, 2013:

9b406d8: Issue #2185: Fixed MinibatchKMeans bad center reallocation which caused duplicate centers
Owner

larsmans commented Aug 26, 2013

Can someone confirm my understanding of the current algorithm? When some clusters are too small, we pick a bunch of points, preferring those that are close to their own cluster centers, and replace the too-small clusters with clusters centered at those points.

I see how this tries to break clusters evenly, but I don't see how it finds high-density regions. E.g. when there are singleton clusters, won't those have a high probability of getting picked as new centroids again? The only point in a singleton will have distance zero to the centroid.
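For concreteness, here is a minimal NumPy sketch of that reassignment step as I understand it; the function name, the "too small" threshold, and the exact inverse-distance weighting are my assumptions, not the actual scikit-learn code:

```python
import numpy as np

def reassign_bad_centers(centers, X_batch, dist_to_center, counts, rng):
    """Hypothetical sketch of the heuristic described above, not the
    actual scikit-learn implementation.

    dist_to_center[i] is the distance of batch point i to its assigned
    center; counts[j] is the running size of cluster j.
    """
    # Clusters whose running counts are much smaller than average are
    # considered "too small" (this exact threshold is an assumption).
    too_small = counts < 0.01 * counts.mean()
    n_reassign = int(too_small.sum())
    if n_reassign == 0:
        return centers
    # Prefer points *close* to their own center by weighting inversely
    # with distance; that is exactly why a singleton cluster's only
    # point (distance ~ 0) is very likely to be picked again.
    weights = 1.0 / (dist_to_center + 1e-12)
    probs = weights / weights.sum()
    picked = rng.choice(len(X_batch), size=n_reassign, replace=False, p=probs)
    centers[too_small] = X_batch[picked]
    return centers
```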

Owner

ogrisel commented Aug 26, 2013

Over time, different singletons from different minibatches are randomly selected, so it's likely that at some point some isolated new point will actually come from a high-density region, I think.

Owner

larsmans commented Aug 26, 2013

... and points are likely to originate in dense regions a priori, so we can just select them uniformly at random. Doh!

Owner

ogrisel commented Aug 26, 2013

But on the other hand, it's more likely that the potentially missing centers lie in areas far away from the existing well-populated cluster centers. We probably need to do more testing on real-life data (for instance the face patches and text documents) and measure the final inertia of each reallocation strategy to settle the debate.

@kpysniak would you like to work on such an evaluation script in a http://gist.github.com ?

Owner

larsmans commented Aug 26, 2013

(Please disregard previous comment, it's monday and all.)

Contributor

kpysniak commented Aug 26, 2013

Sure, it's interesting! I'll let you know if I have any questions, thanks for the opportunity.


@kpysniak added a commit to kpysniak/scikit-learn that referenced this issue on Aug 27, 2013:

39b3d9c: Issue #2185: Fixed MinibatchKMeans bad center reallocation which caused duplicate centers
Contributor

kpysniak commented Sep 1, 2013

The script can be found here: https://gist.github.com/kpysniak/6396643. However, to run it you need some of my changes, which are on my branch here: https://github.com/kpysniak/scikit-learn/tree/kmeansInertiaTest . I ran some tests on document clustering, but they didn't produce any useful results (the final inertia was very similar in all cases). I think we need to find a problem where the rate of reassignments is substantial. Would you suggest anything?
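Roughly, the core of the benchmark looks like the sketch below; note that `reassignment_strategy` is a hypothetical keyword standing in for the option added on my branch and does not exist in upstream scikit-learn:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

data = fetch_20newsgroups(subset='all').data
X = TfidfVectorizer(max_features=10000).fit_transform(data)

strategies = ['unique', 'nonunique', 'unique_uniform', 'nonunique_uniform']
for strategy in strategies:
    inertias = []
    for seed in range(10):
        # reassignment_strategy is a hypothetical keyword (branch only),
        # selecting how bad centers are reallocated.
        km = MiniBatchKMeans(n_clusters=20, random_state=seed,
                             reassignment_strategy=strategy)
        km.fit(X)
        inertias.append(km.inertia_)
    print('%s: mean inertia %.3f, std %.3f'
          % (strategy, np.mean(inertias), np.std(inertias)))
```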

Owner

ogrisel commented Sep 2, 2013

Here is an output when I run this on my box:

km_unique_labels_inertia: average 18295.715021004369, std 17.144739522787624
km_nonunique_labels_inertia: average 18287.268003610596, std 12.494793904669343
km_unique_uniform_labels_inertia: average 18299.902133868283, std 16.864643134048041
km_nonunique_uniform_labels_inertia: average 18291.754242812476, std 17.668623463881538

This is on the 20 newsgroups data. The results are not conclusive. Could you please adapt your script to run on the image patches example? It would also be interesting to compute the inertia of the clustering with reallocation disabled.
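For reference, reallocation can effectively be switched off through the existing `reassignment_ratio` parameter, which gives the requested baseline; a minimal sketch, reusing `X` from the evaluation script above:

```python
from sklearn.cluster import MiniBatchKMeans

# reassignment_ratio controls how small a center's count may get before
# it is reassigned; 0.0 disables bad-center reallocation entirely.
km_baseline = MiniBatchKMeans(n_clusters=20, reassignment_ratio=0.0,
                              random_state=42).fit(X)
print('inertia with reallocation disabled: %.3f' % km_baseline.inertia_)
```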

Owner

ogrisel commented Sep 2, 2013

You could also try a smaller subset of 20 newsgroups (e.g. 4 categories) but with more centers, e.g. 100.
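A minimal sketch of that setup, assuming the usual 4-category subset from the document clustering example and a TF-IDF representation:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# The category choice here is arbitrary; these are the four categories
# used in the scikit-learn document clustering example.
categories = ['alt.atheism', 'comp.graphics',
              'sci.space', 'talk.religion.misc']
data = fetch_20newsgroups(subset='all', categories=categories).data
X = TfidfVectorizer(max_features=10000).fit_transform(data)

# Many more centers than categories should make the bad-center
# reallocation heuristic fire much more often.
km = MiniBatchKMeans(n_clusters=100, random_state=0).fit(X)
print('inertia with 100 centers: %.3f' % km.inertia_)
```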

Contributor

kpysniak commented Sep 8, 2013

The new script is in https://gist.github.com/kpysniak/6481192

The results are:
km_unique_labels: (mean, std): (90460.144427, 5349.811842) time: (mean, std): (7.512647, 0.470400)
km_nonunique_labels: (mean, std): (89700.156669, 4637.428358) time: (mean, std): (7.660844, 0.467091)
km_unique_uniform_labels: (mean, std): (78354.578439, 4689.920094) time: (mean, std): (8.206789, 0.653612)
km_nonunique_uniform_labels: (mean, std): (81409.221792, 5294.389171) time: (mean, std): (7.611472, 0.745147)

Those results are more conclusive. @ogrisel @larsmans, what do you think of them?

We could either choose one of those methods (unique uniformly-distributed centroids?), or give the user an option to choose among them. They perform differently in terms of both inertia and time.

Owner

ogrisel commented Sep 8, 2013

Could you try to run the same experiment on a 4 categories subset of the 20 newsgroups dataset with a larger number of cluster centers (e.g. 100)?

Could you also include the results for MinibatchKMeans with bad center reallocation disabled?

If the unique-uniform-labels strategy is always the best, then we might just implement that one. I am not a big fan of growing the number of hyper-parameters if there is a clear winning strategy that almost always works best on many different, non-toy datasets.

Contributor

kpysniak commented Sep 16, 2013

@ogrisel The same experiment on a 4-category subset of the 20 newsgroups dataset yielded:
km_unique_labels: (mean, std): (1886.295297, 33.987037) time: (mean, std): (3.072696, 0.525144)
km_nonunique_labels: (mean, std): (1911.166938, 38.941494) time: (mean, std): (2.933915, 0.356862)
km_unique_uniform_labels: (mean, std): (1890.847212, 11.237832) time: (mean, std): (3.027502, 0.404976)
km_nonunique_uniform_labels: (mean, std): (1892.659669, 20.930056) time: (mean, std): (2.867299, 0.377220)

The unique-labels strategy seems to be the best. The changes are ready in my PR. @ogrisel, is it ready to be merged?

Owner

ogrisel commented Sep 17, 2013

How many cluster centers were used in the last benchmark?

Owner

ogrisel commented Sep 17, 2013

Could you also please re-run all the tests:

  • 4-category 20 newsgroups with 100 centers

and

  • the face patches example

adding a line for the case where you disable bad center reallocation completely, as requested previously?

Owner

GaelVaroquaux commented Jul 14, 2014

Fixed by #3376
