Reconsider the change for `n_init` in KMeans and MiniBatchKMeans #25022
pinging @jeremiedbb @ogrisel @thomasjpfan @Micky774
It's not silent; it follows a proper deprecation cycle.
Well, it's in the changelog. Like any other change of default value, it may have an impact on the results. Default values are not chosen to give the best quality in all situations (this is just impossible) nor the best efficiency, but a good tradeoff between those. The snippet in the linked discussion shows an increase of around 3% in inertia after the change of default. I think it's more than acceptable for a 10x speed-up; looks like a good deal to me. The same snippet shows that increasing `n_init` to 100 would give better results by 1-2%; I don't think it would be a reasonable argument to change the default to 100 :)

(If it were up to me, the default would be 1 for all inits. Doing many inits means tuning the random_state. Doing random inits is just a naive way of trying to end up in a better local minimum for a non-convex problem. There are other, sometimes better, solutions to achieve this. This is why I don't like doing that by default: most users are not aware of it.)
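As a rough illustration of that trade-off, here is a minimal sketch (not from this thread; dataset size and cluster count are arbitrary) that measures inertia and runtime for `n_init=1` versus `n_init=10`:

```python
# Hypothetical benchmark: compare inertia and wall time for n_init=1 vs 10.
import time
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=20, n_features=10, random_state=0)

for n_init in (1, 10):
    t0 = time.perf_counter()
    km = KMeans(n_clusters=20, init="k-means++", n_init=n_init, random_state=0).fit(X)
    elapsed = time.perf_counter() - t0
    print(f"n_init={n_init:2d}  inertia={km.inertia_:.1f}  time={elapsed:.2f}s")
```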
Right now the docstring says:
I would make that at least easier for users by saying something like "you might consider a larger `n_init`". I would also be okay with keeping the old default.
I think you refer to daal4py, which was more recently renamed to scikit-learn-intelex. According to the benchmark results at https://github.com/intel/scikit-learn-intelex#-acceleration, on some datasets the Intel version is indeed 50x faster than scikit-learn, but the benchmark code is using the scikit-learn defaults. That being said, I am ok with keeping the switch, but we should definitely improve the `n_init` docstring currently in `main`.
Also a question for @gittar: we don't really care about 2%-3% of SSE on purely random data since this would probably not lead to a meaningful improvement in expectation (on held-out evaluation data). Could you re-try similar experiments on a semi-structured dataset (e.g. wide blob-ish clusters + a uniform noise background) with a random train-test split and evaluate the impact of `n_init` on the held-out data?
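A possible setup for that experiment, sketched under assumed settings (blob count, spread, noise fraction and `n_clusters` are illustrative choices, not taken from the thread):

```python
# Semi-structured data: wide blobs plus a uniform noise background,
# evaluated on a held-out split for n_init=1 vs 10.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
blobs, _ = make_blobs(n_samples=8_000, centers=16, cluster_std=2.0, random_state=0)
noise = rng.uniform(blobs.min(), blobs.max(), size=(2_000, blobs.shape[1]))
X_train, X_test = train_test_split(np.vstack([blobs, noise]), random_state=0)

for n_init in (1, 10):
    km = KMeans(n_clusters=16, n_init=n_init, random_state=0).fit(X_train)
    # Held-out SSE: squared distance of each test point to its nearest center.
    test_sse = ((X_test - km.cluster_centers_[km.predict(X_test)]) ** 2).sum()
    print(f"n_init={n_init:2d}  train inertia={km.inertia_:.0f}  test SSE={test_sse:.0f}")
```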
Thanks for your comments, and apologies for the "silently". I was only looking at the dev code before my initial discussion post #25016 and became aware of the long discussion history and the planned deprecation measures only later. I will prepare answers to all your questions a.s.a.p.
To me it would be more interesting to do the opposite from a pedagogical point of view. That is, defaulting to one and explaining that the problem is non-convex, so the algorithm will end up in a local minimum, and that one of the possible ways to find a better minimum and thus improve the SSE could be to do random inits, at the cost of a substantial slowdown. IMO it makes the discussion more interesting and focused on the optimization problem rather than just a matter of speed.
As discussed IRL with @ogrisel and @glemaitre, I think we can remove the blocker tag so as not to hold the release any longer. It's probably more a matter of documentation, and we can still roll back between the release candidate and the final release if the discussion ends up that way.
@glemaitre wrote:
Definitely. In my experience this is much more often the case than not. The 3x3 cluster example from the docs is special in the sense that the number of clusters (9) is very small and known (which is usually not the case). This gives k-means++ an unusually good chance to get it "right", i.e. put one center in each cluster. Once this is achieved, the solution is saturated and cannot get any better, so no more trials are needed. Slightly modifying this example gives a different picture: if we choose 4x4 clusters and set the parameter accordingly (`n_clusters=16`), a single run fails to place one center per cluster much more often. IMO the probability of k-means++ finding a very good (low SSE) solution decreases as the number of clusters grows.
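A minimal sketch of the 4x4 variant described above (grid spacing, cluster std and sample counts are assumptions) showing how much single-init solutions vary across seeds:

```python
# 16 well-separated Gaussian blobs on a 4x4 grid; with n_init=1 the
# attained inertia varies noticeably from seed to seed.
import itertools
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
centers = 5.0 * np.array(list(itertools.product(range(4), repeat=2)), dtype=float)
X = np.vstack([c + 0.5 * rng.randn(200, 2) for c in centers])

inertias = [
    KMeans(n_clusters=16, n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(20)
]
print(f"min={min(inertias):.0f}  max={max(inertias):.0f}  "
      f"spread={max(inertias) / min(inertias) - 1:.1%}")
```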
For detailed simulation results (comparing greedy k-means++ with alternatives), please see my draft paper.
@ogrisel wrote:
Is the structured example in https://github.com/gittar/skl-discussion/blob/main/n-init.ipynb of any help? Regarding statements like "2%-3% of SSE are not of interest (on random data)" or, as @jeremiedbb stated, "an increase of 3% is a good deal if one gets a 10x speedup", I would say: this is typically what you get from multiple runs of k-means++. The total cost is obviously proportional to the number of runs. The maximum gain in SSE is limited by the expected distance of a single k-means++ run from the (usually unknown) optimal solution. Since k-means++ is quite good, this distance is often below 10% IMO. IMO it is the user who has to decide whether the gain is worth the cost, and instead of repeating k-means++ it might be more economical to use other methods which reach better solutions at lower cost. (ad for "breathing k-means" deleted :-) ) Regarding the original question (should the default of `n_init` be changed?): see my remarks above.
Regarding the question about experiments with held-out test data: I never considered doing this since, in my understanding, the objective minimized by k-means (the SSE) is defined on the given training data itself. Please see my draft paper (sent via email) for relevant empirical results.
@jeremiedbb wrote:
This is also a valid point. Any randomized algorithm can be "improved" by running it several times and picking the best result. Whether this makes a lot of sense depends on the variation in the results: if the variation is high, so is the expected payoff of doing several trials, and vice versa. Any particular default number of trials (like 10 or 3) is quite arbitrary, however, and may not be noticed by users of the algorithm, giving them a wrong impression of the efficiency and quality of the algorithm. Therefore, from a principled point of view, one could -- as @jeremiedbb suggests -- always set `n_init=1` by default. In this context I looked at other occurrences of `n_init` in scikit-learn and noticed that the defaults differ between estimators.
There's no deprecation note because the default will not change for these estimators. It makes me think that we should maybe change their default to match the default of KMeans.
I do like this idea in a pure sense.
💯 I think it makes much more sense to frame it as such, especially since this can be done equivalently via random seed perturbations.
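For illustration, a sketch of that framing (parameters are arbitrary): an explicit loop over random states with `n_init=1` recovers what a larger `n_init` does internally, up to the exact seeding scheme:

```python
# Pick the best of several independently seeded single-init runs.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5_000, centers=12, random_state=0)

best = min(
    (KMeans(n_clusters=12, n_init=1, random_state=seed).fit(X) for seed in range(10)),
    key=lambda km: km.inertia_,
)
print(f"best inertia over 10 seeded runs: {best.inertia_:.1f}")
```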
So it seems that we went forward with the deprecation cycle. I don't recall exactly what the remaining issue was. Closing.
I open this issue to reconsider the changes introduced in #23038. We decided to use a single initialization when using `init="k-means++"`. In the original issue (#9729), it seems that we based our choice on two aspects:

1. `pydaal` was much faster than scikit-learn with its `n_init=10` default;
2. a single `"k-means++"` initialization was expected to be statistically good enough.
While 1. is not an absolute ground truth, we might have overlooked some aspects in 2. Indeed, the example from #25016 illustrates a decrease in statistical performance. In that benchmark, the data do not have an underlying structure, and limiting the number of initializations is detrimental in this case.

We should reconsider because this could be quite surprising behaviour for users.
Possible resolutions are:

- revert the change in `main`, i.e. keep `n_init=10` for both `init="random"` and `init="k-means++"`;
- change the default of `n_init` from 1 to 5 when `init="k-means++"`.
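Whichever resolution is chosen, callers who care about the old behaviour can pin it by passing `n_init` explicitly rather than relying on the default; a minimal sketch:

```python
# Passing n_init explicitly makes the code independent of the default change.
from sklearn.cluster import KMeans, MiniBatchKMeans

km = KMeans(n_clusters=8, init="k-means++", n_init=10)
mbk = MiniBatchKMeans(n_clusters=8, init="k-means++", n_init=10)
```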
@gittar I am wondering if you have encountered other situations where a single initialization with `"k-means++"` would go sideways.