Change KMeans n_init default value to 1. #9729
Comments
Yes, let's change it. Or default to have an inverse relationship with
dataset size.
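The "inverse relationship with dataset size" idea could be sketched as a heuristic like the following. The function name, the point budget, and the cap are hypothetical illustrations for this discussion, not anything scikit-learn implements.

```python
def n_init_heuristic(n_samples, max_restarts=10, budget=100_000):
    """Hypothetical heuristic: spend a roughly fixed point-budget on
    restarts, so large datasets get fewer inits. The budget and cap
    values are illustrative, not from scikit-learn."""
    return max(1, min(max_restarts, budget // max(n_samples, 1)))
```

With these illustrative numbers, a 1,000-sample dataset would keep the current 10 restarts, while a million-sample dataset would drop to a single init.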
Standard deprecation cycle then?
I created the warning for the future deprecation. Thanks for all the guidance at SciPy 2018!
@murielgrobler Thanks!
I am -1 for this. A single init by default for k-means will not get good results. We could revisit our init strategy to do only a partial convergence. We could also change the default to 5. But I think it is bad practice to race for speed to the detriment of the quality of results.
@jeremiedbb is working on speeding up k-means, and will hopefully gain some speed soon.
Why 5, though? That seems pretty arbitrary. And I think the gains with k-means++ are very small. We don't do random restarts on any other non-convex optimization, right? It seems pretty inconsistent. Also, it's not necessarily speed I'm optimizing for; it's user surprise.
Oh, sorry, 10, not 5. Why 10? ;)
Can I work on it? |
Yes, sure - thanks!
I don't think there's a consensus on what to do, though.
> I don't think there's a consensus on what to do, though.

I am willing to reconsider my opinion if what we are doing is significantly different from what other packages are doing.
I've always found the n_init=10 default surprising.
This is like a grid search (no CV) on the random_state. With non-convex models this makes sense.
It does, but it's only one way to try to find a better local minimum. There are many other techniques. This one in particular is really not efficient, since it runs the whole optimization for each seed. More subtle strategies would, for instance, start the optimization scheme and then do some random restarts on "bad clusters" only. IMO, this default currently forces users into a really naive search for a better minimum.
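The "grid search on the random_state" view of n_init can be illustrated with a minimal, self-contained sketch (plain Python, 1-D data, random init rather than k-means++; all names here are hypothetical): each seed gets a full Lloyd's run, and only the lowest-inertia result is kept. This is exactly the cost being objected to, since every restart pays for a complete optimization.

```python
import random

def kmeans_1d(data, k, seed, n_iter=20):
    """One full Lloyd's run on 1-D data from a given random seed
    (random init, not k-means++)."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in data:
            i = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[i].append(x)
        # Update step: move each center to the mean of its cluster,
        # keeping the old center if a cluster went empty.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    inertia = sum(min(abs(x - c) ** 2 for c in centers) for x in data)
    return centers, inertia

def kmeans_restarts(data, k, n_init):
    """The 'grid search on random_state': n_init independent full runs,
    keep the one with the lowest inertia."""
    runs = [kmeans_1d(data, k, seed) for seed in range(n_init)]
    return min(runs, key=lambda r: r[1])

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
_, single = kmeans_1d(data, k=3, seed=0)
_, best = kmeans_restarts(data, k=3, n_init=10)
assert best <= single  # more restarts can only improve the kept inertia
```

The subtler strategies mentioned above would instead reuse work across restarts, e.g. re-seeding only the poorly-fitting clusters mid-optimization, rather than repeating the whole run n_init times.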
In the example Empirical evaluation of the impact of k-means initialization, it does show that k-means++ is much less sensitive to the number of inits than random initialization.
> n_init="auto", where n_init=1 for k-means++ and n_init=10 for random.

Sounds like a good idea.
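A minimal sketch of how that n_init="auto" resolution could work, assuming the semantics proposed above (1 for k-means++, 10 for random); the function name is hypothetical, not part of any released API.

```python
def resolve_n_init(n_init, init):
    """Hypothetical resolution of n_init="auto", per the proposal above:
    a single init for k-means++ (robust seeding), 10 for random init.
    Explicit integer values pass through unchanged."""
    if n_init == "auto":
        return 1 if init == "k-means++" else 10
    return n_init
```

This keeps backward compatibility for anyone passing an explicit integer, while letting the default track the init strategy.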
As mentioned in #9430, pydaal gets a speedup of 30x, mostly because they ignore n_init.

I feel like n_init=10 is a pretty odd choice. We don't really do random restarts anywhere else by default (afaik), and in the age of large datasets this is really pretty expensive. Not sure if it's worth a deprecation cycle, but this is pretty non-obvious behavior that potentially makes us look bad.