
Change KMeans n_init default value to 1. #9729

Closed
amueller opened this issue Sep 11, 2017 · 17 comments · Fixed by #23038


@amueller
Member

As mentioned in #9430, pydaal gets a speedup of 30x, mostly because they ignore n_init.
I feel like n_init=10 is a pretty odd choice. We don't really do random restarts anywhere else by default (afaik), and in the age of large datasets this is really pretty expensive.

Not sure if it's worth a deprecation cycle, but this is pretty non-obvious behavior that potentially makes us look bad.

@jnothman
Member

jnothman commented Sep 11, 2017 via email

@amueller amueller added Easy Well-defined and straightforward way to resolve good first issue Easy with clear instructions to resolve help wanted and removed Question labels Jul 12, 2018
@amueller
Member Author

Standard deprecation cycle then?

@amueller amueller changed the title RFC KMeans n_init default value? Change KMeans n_init default value to 1. Jul 12, 2018
@murielgrobler
Contributor

I created the warning for the future deprecation. Thanks for all the guidance at Scipy2018!

@amueller
Member Author

@murielgrobler Thanks!

@GaelVaroquaux
Member

I am -1 for this. A single init by default for k-means will not give good results.

We could revisit our init strategy to do only a partial convergence. We could also change the default to 5. But I think it is bad practice to race for speed to the detriment of result quality.

@GaelVaroquaux
Member

@jeremiedbb is working on speeding up k-means, so hopefully we will gain some speed soon.

@amueller
Member Author

Why 5 though? That seems pretty arbitrary. And I think the gains with k-means++ are very small. We don't do random restarts on any other non-convex optimization, right? It seems pretty inconsistent.

Also, it's not necessarily speed I'm optimizing for, it's user surprise.

@amueller
Member Author

oh, sorry 10, not 5. Why 10? ;)

@amueller amueller removed good first issue Easy with clear instructions to resolve help wanted labels Sep 29, 2018
@ni3-k

ni3-k commented Oct 5, 2018

Can I work on it?

@murielgrobler
Contributor

murielgrobler commented Nov 12, 2018 via email

@amueller
Member Author

I don't think there's a consensus on what to do, though.

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 23, 2018 via email

@jeremiedbb
Member

I've always found the default n_init=10 really weird. What it does is run 10 KMeans with different random_states and pick the best. This is like a grid search (no cv) on the random_state.

I think the n_init parameter is often overlooked, and I agree that most users probably don't even know they are running KMeans 10 times and just think it's pretty slow. It's also not consistent with our other clustering estimators, as others already said. If I were to do it, I'd set the default to 1 but better document that k-means is sensitive to randomness and what to do about that.

In addition, the default init is "k-means++", which doesn't require as many runs as with "random" inits to get as good or even better clusterings.
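The mechanism described above can be sketched in plain NumPy. This is a toy illustration, not scikit-learn's implementation; the function name `kmeans_best_of` and its signature are made up for this sketch:

```python
import numpy as np

def kmeans_best_of(X, n_clusters, n_init=10, n_iter=20, seed=0):
    # Run Lloyd's algorithm from n_init random initializations and keep
    # the run with the lowest inertia (sum of squared distances of
    # samples to their nearest center) -- what n_init=10 effectively does.
    rng = np.random.default_rng(seed)
    best_inertia, best_centers = np.inf, None
    for _ in range(n_init):
        # "random" init: pick n_clusters distinct samples as starting centers
        centers = X[rng.choice(len(X), n_clusters, replace=False)]
        for _ in range(n_iter):
            # assignment step: label each sample with its nearest center
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = d2.argmin(axis=1)
            # update step: move each center to the mean of its samples,
            # keeping the old center if a cluster went empty
            centers = np.array([
                X[labels == k].mean(axis=0) if (labels == k).any() else centers[k]
                for k in range(n_clusters)
            ])
        inertia = ((X - centers[labels]) ** 2).sum()
        if inertia < best_inertia:
            best_inertia, best_centers = inertia, centers
    return best_centers, best_inertia
```

With the same seed, the single run of `n_init=1` is the first of the 10 runs of `n_init=10`, so the best-of-10 inertia can only be equal or lower; that lower inertia is the entire benefit the default buys, at 10x the cost.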

@GaelVaroquaux
Member

GaelVaroquaux commented Apr 4, 2022 via email

@jeremiedbb
Member

With non-convex models this makes sense.

It does, but it's only one way to try to find a better local minimum. There are many other techniques, and this one in particular is really not efficient since it redoes the whole optimization for each seed. More subtle strategies would, for instance, start the optimization and do random restarts only on "bad" clusters.

IMO, this default is currently forcing users into a really naive search for a better minimum.

@thomasjpfan
Member

In the example Empirical evaluation of the impact of k-means initialization, n_init > 1 does show an improvement for init="random", while for k-means++ it makes no difference. If we go by the example, we could add an n_init="auto" option: n_init=1 for k-means++ and n_init=10 for random.

@GaelVaroquaux
Member

GaelVaroquaux commented Apr 7, 2022 via email
