Add Kmeans parameter for pruning small clusters #848

Open
amueller opened this Issue May 10, 2012 · 5 comments

Owner

amueller commented May 10, 2012

In k-means, some clusters often end up with very few samples. This can happen for every random initialization.
For this case, I would like an option to set a minimum cluster size: a cluster that falls below it is dropped and a new one is created.
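To make the request concrete, here is a minimal NumPy sketch of batch (Lloyd's) k-means with such an option. The function name `kmeans_min_size` and the parameter `min_cluster_size` are hypothetical and not part of scikit-learn. The re-seeding rule (place the new center on the sample farthest from the surviving centers) is only one possible choice, assumed here for illustration, and nothing guarantees that every final cluster reaches the minimum size.

```python
import numpy as np


def kmeans_min_size(X, n_clusters, min_cluster_size, n_iter=50, seed=0):
    """Lloyd's k-means that re-seeds clusters smaller than min_cluster_size.

    Hypothetical helper for illustration; assumes min_cluster_size >= 1
    and n_clusters <= len(X).
    """
    rng = np.random.default_rng(seed)
    # Random initialization from distinct data points.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)

    for _ in range(n_iter):
        # Squared Euclidean distances, shape (n_samples, n_clusters).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        counts = np.bincount(labels, minlength=n_clusters)
        small = counts < min_cluster_size

        new_centers = centers.copy()
        # Standard mean update for clusters that are large enough.
        for k in np.flatnonzero(~small):
            new_centers[k] = X[labels == k].mean(axis=0)

        if small.any():
            # Distance of each sample to its nearest surviving center
            # (or to any center if no cluster survived).
            d_keep = d2[:, ~small].min(axis=1) if (~small).any() else d2.min(axis=1)
            # Drop the small clusters and re-seed them on the farthest samples.
            far = np.argsort(d_keep)[::-1][: small.sum()]
            new_centers[small] = X[far]

        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    # Final assignment against the returned centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)
```

Usage would look like `centers, labels = kmeans_min_size(X, n_clusters=8, min_cluster_size=10)`.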

Owner

ogrisel commented May 10, 2012

I have started investigating this for the online variant (minibatch k-means): clusters that have not been updated recently are deleted, and recently seen samples that are far from any cluster center are used to initialize a new set of cluster centers using the kmeans++ init on those samples. For example, if you need to reinit 3 cluster centers out of 100, pick 3% of the samples of the last batch, run kmeans++ init on them to find the new centers, and then proceed as usual.
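The reinit step described above could be sketched roughly as follows. This is not the code from the experiment: `kmeans_pp_init` and `reinit_stale_centers` are hypothetical names, the k-means++ seeding is a plain NumPy version, and the sketch assumes that the selected fraction of the batch consists of the samples farthest from the current centers.

```python
import numpy as np


def kmeans_pp_init(X, k, rng):
    """Plain k-means++ seeding on the rows of X (requires k <= len(X))."""
    centers = [X[rng.integers(len(X))]]
    # Squared distance of every sample to its nearest chosen center.
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        total = d2.sum()
        if total > 0:
            # Sample proportionally to the squared distance.
            idx = rng.choice(len(X), p=d2 / total)
        else:
            # Degenerate case: all samples coincide with a chosen center.
            idx = rng.integers(len(X))
        centers.append(X[idx])
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)


def reinit_stale_centers(centers, stale, batch, rng):
    """Replace centers[stale] with k-means++ seeds drawn from the last batch.

    Assumption: candidates are the batch samples farthest from any center,
    and their share of the batch is n_reinit / n_clusters (3 of 100 -> 3%).
    """
    stale = np.asarray(stale, dtype=int)
    n_reinit = len(stale)
    if n_reinit == 0:
        return centers.copy()

    n_candidates = int(np.ceil(len(batch) * n_reinit / len(centers)))
    n_candidates = min(max(n_candidates, n_reinit), len(batch))

    # Squared distance of each batch sample to its nearest current center.
    d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    candidates = batch[np.argsort(d2)[-n_candidates:]]

    new_centers = centers.copy()
    new_centers[stale] = kmeans_pp_init(candidates, n_reinit, rng)
    return new_centers
```

How "not updated recently" is detected is left out here; the caller is assumed to pass the indices of the stale centers.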

This strategy works great for stabilizing minibatch k-means on dense, low-dimensional, blobby (multimodal) data sets. However, it does not work at all on sparse high-dimensional data (e.g. the 20 newsgroups dataset). The reason is probably that the norm of the new centers is very different from that of the nearly converged good centers. It is very likely that using maximum cosine similarity or dot product instead of minimum Euclidean distance for cluster assignment would help fix the issue.
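A toy illustration of the suspected norm effect, using made-up 2D numbers rather than anything from 20 newsgroups: with two centers of very different norm, Euclidean assignment and cosine assignment can disagree on the same sample. The helper names are hypothetical, and the cosine version assumes no all-zero rows.

```python
import numpy as np


def assign_euclidean(X, centers):
    """Label each row of X with the index of the nearest center (Euclidean)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def assign_cosine(X, centers):
    """Label each row of X with the index of the most cosine-similar center."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Cn = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    return (Xn @ Cn.T).argmax(axis=1)


# Center 0 has a large norm, center 1 a small one.
centers = np.array([[10.0, 0.0], [0.0, 0.1]])
# The sample points almost exactly in the direction of center 0.
x = np.array([[1.0, 0.2]])

print(assign_euclidean(x, centers))  # [1]: the small-norm center is closer
print(assign_cosine(x, centers))     # [0]: the direction matches center 0
```

Cosine assignment ignores the norm mismatch entirely, which is why it is a plausible fix if the norm really is the cause.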

However, the current k-means implementation uses hardcoded Euclidean distances at a low level in the Cython code (for efficiency reasons). Making the metric customizable (yet efficient) would be required to address this issue properly for the online case.

Contributor

ldirer commented Jul 20, 2014

Is this issue referring to batch k-means or mini-batch k-means?
Mini-batch k-means already implements this in master.

Owner

amueller commented Jul 20, 2014

I meant batch.

Member

raghavrv commented Nov 28, 2014

@amueller Could I take this up?

Owner

amueller commented Dec 1, 2014 edited by raghavrv

@raghavrv sure :)
