
Change KMeans n_init default value to 1. #9729

Closed
amueller opened this issue Sep 11, 2017 · 17 comments · Fixed by #23038


@amueller
Member

As mentioned in #9430, pydaal gets a speedup of 30x, mostly because they ignore n_init.
I feel like n_init=10 is a pretty odd choice. We don't really do random restarts anywhere else by default (afaik), and in the age of large datasets this is really pretty expensive.

Not sure if it's worth a deprecation cycle, but this is pretty non-obvious behavior that potentially makes us look bad.

@jnothman
Member

jnothman commented Sep 11, 2017 via email

@amueller amueller added Easy Well-defined and straightforward way to resolve good first issue Easy with clear instructions to resolve help wanted and removed Question labels Jul 12, 2018
@amueller
Member Author

Standard deprecation cycle then?

@amueller amueller changed the title RFC KMeans n_init default value? Change KMeans n_init default value to 1. Jul 12, 2018
@murielgrobler
Contributor

I created the warning for the future deprecation. Thanks for all the guidance at Scipy2018!

@amueller
Member Author

@murielgrobler Thanks!

@GaelVaroquaux
Member

I am -1 for this. A single init by default for k-means will not give good results.

We could revisit our init strategy to do only a partial convergence. We could also change the default to 5. But I think it is bad practice to race for speed to the detriment of result quality.

@GaelVaroquaux
Member

@jeremiedbb is working on speeding up k-means, so hopefully we will gain some speed soon.

@amueller
Member Author

Why 5 though? That seems pretty arbitrary. And I think the gains with k-means++ are very small. We don't do random restarts on any other non-convex optimization, right? It seems pretty inconsistent.

Also, it's not necessarily speed I'm optimizing for, it's user surprise.

@amueller
Member Author

oh, sorry 10, not 5. Why 10? ;)

@amueller amueller removed good first issue Easy with clear instructions to resolve help wanted labels Sep 29, 2018
@ni3-k

ni3-k commented Oct 5, 2018

Can I work on it?

@murielgrobler
Contributor

murielgrobler commented Nov 12, 2018 via email

@amueller
Member Author

I don't think there's a consensus on what to do, though.

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 23, 2018 via email

@jeremiedbb
Member

I've always found the default n_init=10 really weird. What it does is run 10 KMeans with different random_states and pick the best. This is like a grid search (no cv) on the random_state.

I think the n_init parameter is often overlooked, and I agree that most users probably don't even know they are running KMeans 10 times and just think it's pretty slow. It's also not consistent with our other clustering estimators, as others already said. If I were to do it, I'd set the default to 1 but better document that k-means is sensitive to randomness and what to do about that.

In addition, the default init is "k-means++", which doesn't require as many runs as with "random" inits to get as good or even better clusterings.
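The mechanism described above can be sketched in plain NumPy. This is a toy illustration, not scikit-learn's implementation; the function name `kmeans_best_of` and its signature are made up for this sketch:

```python
import numpy as np

def kmeans_best_of(X, n_clusters, n_init=10, n_iter=20, seed=0):
    # Run Lloyd's algorithm from n_init random initializations and keep
    # the run with the lowest inertia (sum of squared distances of
    # samples to their nearest center) -- what n_init=10 effectively does.
    rng = np.random.default_rng(seed)
    best_inertia, best_centers = np.inf, None
    for _ in range(n_init):
        # "random" init: pick n_clusters distinct samples as starting centers
        centers = X[rng.choice(len(X), n_clusters, replace=False)]
        for _ in range(n_iter):
            # assignment step: label each sample with its nearest center
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = d2.argmin(axis=1)
            # update step: move each center to the mean of its samples,
            # keeping the old center if a cluster went empty
            centers = np.array([
                X[labels == k].mean(axis=0) if (labels == k).any() else centers[k]
                for k in range(n_clusters)
            ])
        inertia = ((X - centers[labels]) ** 2).sum()
        if inertia < best_inertia:
            best_inertia, best_centers = inertia, centers
    return best_centers, best_inertia
```

With the same seed, the single run of `n_init=1` is the first of the 10 runs of `n_init=10`, so the best-of-10 inertia can only be equal or lower; that lower inertia is the entire benefit the default buys, at 10x the cost.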

@GaelVaroquaux
Member

GaelVaroquaux commented Apr 4, 2022 via email

@jeremiedbb
Member

With non-convex models this makes sense.

It does, but it's only one way to try to find a better local minimum. There are many other techniques, and this one in particular is really not efficient since it redoes the whole optimization for each seed. More subtle strategies would, for instance, start the optimization and do random restarts only on "bad" clusters.

IMO, this default is currently forcing users into a really naive search for a better minimum.

@thomasjpfan
Member

In the example Empirical evaluation of the impact of k-means initialization, n_init > 1 does show an improvement for init="random", while for k-means++ it makes no difference. If we go by the example, we could add an n_init="auto" option: n_init=1 for k-means++ and n_init=10 for random.

@GaelVaroquaux
Member

GaelVaroquaux commented Apr 7, 2022 via email
