
Reconsider the change for n_init in KMeans and MiniBatchKMeans #25022

Closed
glemaitre opened this issue Nov 24, 2022 · 13 comments

@glemaitre
Member

glemaitre commented Nov 24, 2022

I am opening this issue to reconsider the changes introduced in #23038.

We decided to use a single initialization when using init="kmeans++". In the original issue (#9729), it seems that we based our choice on two aspects:

  1. the default parameter used in pydaal
  2. the benchmark of the following examples: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_stability_low_dim_dense.html

While 1. is not an absolute ground truth, we might have overlooked some aspects in 2. Indeed, the example from #25016 illustrates a decrease in statistical performance. In that benchmark, the data have no underlying structure, and limiting the number of initializations is detrimental in this case.

We should reconsider this change because it could lead to quite surprising behaviour.

Possible resolutions are:

  • keep the behaviour implemented in main
  • revert to n_init=10 for both init="random" and init="kmeans++"
  • change n_init from 1 to 5 when init="kmeans++"

@gittar I am wondering if you encounter other situations where a single initialization with "kmeans++" would go sideways.
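For context, here is a minimal sketch (not the exact snippet from #25016; the data shape and n_clusters are arbitrary choices) comparing the two settings on data without underlying structure:

```python
# Sketch: compare the best inertia reached with a single k-means++ init
# versus 10 inits on purely random (unstructured) data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.uniform(size=(5_000, 10))  # no underlying cluster structure

for n_init in (1, 10):
    km = KMeans(n_clusters=50, init="k-means++", n_init=n_init,
                random_state=0).fit(X)
    print(f"n_init={n_init:>2d}  inertia={km.inertia_:.2f}")
```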

@github-actions github-actions bot added the Needs Triage Issue requires triage label Nov 24, 2022
@glemaitre glemaitre added Blocker and removed Needs Triage Issue requires triage labels Nov 24, 2022
@glemaitre glemaitre added this to the 1.2 milestone Nov 24, 2022
@glemaitre
Member Author

pinging @jeremiedbb @ogrisel @thomasjpfan @Micky774

@jeremiedbb
Member

planned change of default value for n_init parameter of KMeans to auto will silently worsen results of existing code

It's not silent, it follows a proper deprecation cycle.

If this change is made as planned this should IMO be communicated very clearly.

Well, it's in the changelog. Like any other change of default value, it may have an impact on the results. Default values are not chosen to give the best quality in all situations (that is simply impossible), nor the best efficiency, but a good tradeoff between the two.

The snippet in the linked discussion shows an increase of around 3% in the inertia after the change of default. I think that's more than acceptable for a 10x speed-up; looks like a good deal to me. The same snippet shows that increasing n_init to 100 would give results that are better by 1-2%, but I don't think that would be a reasonable argument to change the default to 100 :)

(If it were only up to me, the default would be 1 for all inits. Doing many inits means tuning the random_state. Doing random inits is just a naive way of trying to end up in a better local minimum for a non-convex problem. There are other, sometimes better, solutions to achieve this. This is why I don't like doing it by default: most users are not aware of that.)
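As an aside, a minimal sketch of what "doing many inits means tuning the random_state" looks like in practice (the dataset and the number of seeds are arbitrary): with n_init=1, repeating the initialization by hand is just a loop over random_state values that keeps the run with the lowest inertia.

```python
# Sketch: manual multi-init with n_init=1, i.e. a search over random_state
# that keeps the fitted model with the lowest inertia.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5_000, centers=20, random_state=0)

best = min(
    (KMeans(n_clusters=20, n_init=1, random_state=seed).fit(X)
     for seed in range(10)),
    key=lambda km: km.inertia_,
)
print(best.inertia_)
```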

@adrinjalali
Member

Right now the docstring says:

Number of times the k-means algorithm is run with different centroid seeds. The final result is the best output of n_init consecutive runs in terms of inertia. Several runs are recommended for sparse high-dimensional problems (see Clustering sparse data with k-means).

I would at least make that easier for users by saying something like "you might consider a larger n_init such as n_init=10 for high dimensional data, or high dimensional sparse data".

I would also be okay with keeping n_init as 10, and telling users they can get a 10x speedup but probably a less satisfactory result by setting it to 1.

@ogrisel
Member

ogrisel commented Nov 24, 2022

the default parameter used in pydaal

I think you are referring to daal4py, which was more recently renamed to scikit-learn-intelex. The benchmarks are here:

https://github.com/intel/scikit-learn-intelex#-acceleration

on some datasets, the Intel version is indeed 50x faster than scikit-learn, but the benchmark code sets n_init=1 explicitly.

That being said, I am OK with keeping the switch, but we should definitely improve the docstring currently in main, as suggested by @adrinjalali in:

I would at least make that easier for users by saying something like "you might consider a larger n_init such as n_init=10 for high dimensional data, or high dimensional sparse data".

Also a question for @gittar: we don't really care about 2%-3% of SSE on purely random data since this would probably not lead to a meaningful improvement in expectation (on held-out evaluation data). Could you re-try similar experiments on a semi-structured dataset (e.g. wide blob-ish clusters + uniform noise background) with a random train-test split and evaluate the impact of n_init on the SSE measured on the test data for the clusters returned by k-means++ for different n_init values (e.g. 1 vs 10 or 5 vs 10)?
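For reference, a rough sketch of the requested protocol (the blob/noise mix, sizes, and n_clusters are placeholder choices, not a definitive benchmark):

```python
# Sketch: wide blob-ish clusters plus a uniform noise background, random
# train-test split, then compare the held-out SSE for several n_init values.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_blobs, _ = make_blobs(n_samples=8_000, centers=20, cluster_std=3.0,
                        random_state=0)
X_noise = rng.uniform(X_blobs.min(), X_blobs.max(), size=(2_000, 2))
X = np.vstack([X_blobs, X_noise])

X_train, X_test = train_test_split(X, random_state=0)
for n_init in (1, 5, 10):
    km = KMeans(n_clusters=20, n_init=n_init, random_state=0).fit(X_train)
    # KMeans.score returns the negative SSE of the test data w.r.t. the
    # centroids fitted on the training data.
    print(f"n_init={n_init:>2d}  test SSE={-km.score(X_test):.1f}")
```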

@gittar
Contributor

gittar commented Nov 25, 2022

Thanks for your comments, and apologies for the "silently". I was only looking at the dev code before my initial discussion post #25016, and became aware of the long discussion history and the planned deprecation measures only later. I will prepare answers to all your questions a.s.a.p.

@jeremiedbb
Member

and telling users they can get a 10x speedup but probably a less satisfactory result by setting it to 1.

To me it would be more interesting to do the opposite, from a pedagogical point of view. That is, default to 1 and explain that the problem is non-convex, so the algorithm will end up in a local minimum, and that one possible way to find a better minimum and thus improve the SSE is to do random inits, at the cost of a substantial slow-down. IMO it makes the discussion more interesting and focused on the optimization problem rather than just a matter of speed.

@jeremiedbb
Member

jeremiedbb commented Nov 25, 2022

As discussed IRL with @ogrisel and @glemaitre, I think we can remove the blocker tag so as not to hold up the release any longer. It's probably more a matter of documentation, and we can still roll back between the release candidate and the final release if the discussion ends up that way.

@jeremiedbb jeremiedbb removed this from the 1.2 milestone Nov 25, 2022
@gittar
Contributor

gittar commented Nov 27, 2022

@glemaitre wrote

@gittar I am wondering if you encounter other situations where a single initialization with "kmeans++" would go sideways.

Definitely. In my experience this is much more often the case than not. The 3x3 cluster example from the docs is special in the sense that the number of clusters (9) is very small and known (which is usually not the case). This gives k-means++ an unusually good chance to get it "right", i.e. put one center in each cluster. Once this is achieved, the solution is saturated and cannot get any better, so no more trials are needed.

Slightly modifying this example gives a different picture. If we choose 4x4 clusters and set the parameter n_clusters twice as high as the actual number of clusters in the data, k-means++ will get it right only with a very small probability. Here, additional trials can be expected to improve the mean solution (see https://github.com/gittar/skl-discussion/blob/main/n-init.ipynb).
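A condensed sketch of that setup (not the notebook itself; grid spacing, sample sizes, and cluster_std are illustrative choices):

```python
# Sketch: a 4x4 grid of Gaussian blobs, with n_clusters set to twice the
# true number of clusters (32 instead of 16), comparing n_init=1 and 10.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

centers = [(i, j) for i in range(4) for j in range(4)]  # 4x4 grid of blobs
X, _ = make_blobs(n_samples=8_000, centers=centers, cluster_std=0.08,
                  random_state=0)

for n_init in (1, 10):
    km = KMeans(n_clusters=2 * len(centers), n_init=n_init,
                random_state=0).fit(X)
    print(f"n_init={n_init:>2d}  inertia={km.inertia_:.3f}")
```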

IMO the probability of k-means++ finding a very good (low SSE) solution

  • decreases with the number of clusters in the data
  • decreases if $k$ is larger than the number of clusters in the data, e.g., if $k$ is 2 times as large as the number of clusters, so that for same-sized clusters there should be 2 centroids per cluster.

For detailed simulation results (comparing greedy $k$-means++ with 1 and 10 trials and vanilla $k$-means++ with 10 trials on 45 different $k$-means problems) please see my draft paper (sent via email).

@gittar
Contributor

gittar commented Nov 27, 2022

@ogrisel wrote:

Also a question for @gittar: we don't really care about 2%-3% of SSE on purely random data since this would probably not lead to a meaningful improvement in expectation (on held-out evaluation data). Could you re-try similar experiments on a semi-structured dataset (e.g. wide blob-ish clusters + uniform noise background) with a random train-test split and evaluate the impact of n_init on the SSE measured on the test data for the clusters returned by k-means++ for different n_init values (e.g. 1 vs 10 or 5 vs 10)?

Is the structured example in https://github.com/gittar/skl-discussion/blob/main/n-init.ipynb of any help? Regarding statements like "2%-3% of SSE are not of interest (on random data)" or, as @jeremiedbb stated, "an increase of 3% is a good deal if one gets a 10x speedup", I would say: this is typically what you get from multiple runs of k-means++. The total cost is obviously proportional to the number of runs. The maximum gain in SSE is limited by the expected distance of a single k-means++ run from the (usually unknown) optimal solution. Since k-means++ is quite good, this distance is often below 10% IMO.

IMO it is the user who has to decide whether the gain is worth the cost. And instead of repeating k-means++ it might be more economical to use other methods which get to better solutions at lower costs. (ad for "breathing k-means" deleted :-) )

Regarding the original question (Should the default of n_init be changed to 1?) I see the following reasons not to do it:

  • by having the default at 10 for many years, you have set a de-facto standard, just as you did by using the greedy variant of k-means++. scikit-learn's KMeans class is IMO by far the most used k-means implementation, and calling it with default parameters is likely built into many GUI-driven applications and integrated into automated processing pipelines. Changing the default - even with advance deprecation warnings on stderr - might go unnoticed in many production systems and suddenly worsen the KMeans results.
  • regarding speed comparisons with other systems like scikit-learn-intelex: if they claim a 50x speedup but do not take into account that scikit-learn does 10 initializations per call, that is - probably just due to sloppiness - a claim that is off by an order of magnitude; the speedup is only 5x, of course. Such statistics should also compare the quality of the results, which would make it easier to double-check the respective claims. But the problem here is IMO not the default value of 10, but rather the inexact reporting of results.

Regarding the question about experiments with held-out test data: I never considered doing this, since in my understanding the $k$-means problem is defined w.r.t. a single, completely known data set, and the stated goal is to minimize the SSE for that data set using a specific number $k$ of centroids. I would expect performance on data subsets to always be highly correlated, as long as they are sampled i.i.d. from the same distribution.

Please see my draft paper (sent via email) for relevant empirical results.

@gittar
Contributor

gittar commented Nov 28, 2022

@jeremiedbb wrote:

(If it were only me, the default would be 1 for all inits. Doing many inits means tuning the random_state. Doing random inits is just a naive way of trying to end up in a better local minima for a non convex problem. There are others, some better, solutions to achieve this. This is why I don't like doing that by default because most users are not aware of that.)

This is also a valid point. Any randomized algorithm can be "improved" by running it several times and picking the best result. Whether this makes a lot of sense depends on the variation in the results: if the variation is high, so is the expected payoff of doing several trials, and vice versa.

Any particular default number of trials (like 10 or 3) is quite arbitrary, however, and may go unnoticed by users of the algorithm, giving them a wrong impression of its efficiency and quality.

Therefore, from a principled point of view, one could -- as @jeremiedbb suggests -- always set n_init=1 and give the general hint that for all randomized algorithms performing several trials may improve the result at the cost of increased computation.

In this context I looked at other occurrences of n_init and found n_init=10 for sklearn.cluster.SpectralClustering, sklearn.cluster.SpectralCoclustering, and sklearn.cluster.SpectralBiclustering (for the use with KMeans), without a deprecation note. Outside sklearn.cluster I saw n_init=8 for sklearn.manifold.smacof and n_init=4 for sklearn.manifold.MDS. The same arguments might apply here as for KMeans.
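For quick verification, those defaults can be read directly from the signatures:

```python
# Check the n_init defaults of the other estimators/functions mentioned above.
import inspect

from sklearn.cluster import (SpectralBiclustering, SpectralClustering,
                             SpectralCoclustering)
from sklearn.manifold import MDS, smacof

for obj in (SpectralClustering, SpectralCoclustering, SpectralBiclustering,
            MDS, smacof):
    default = inspect.signature(obj).parameters["n_init"].default
    print(f"{obj.__name__}: n_init={default}")
```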

@jeremiedbb
Member

jeremiedbb commented Nov 29, 2022

In this context I looked at other occurrences of n_init and found n_init=10 for sklearn.cluster.SpectralClustering, sklearn.cluster.SpectralCoclustering, and sklearn.cluster.SpectralBiclustering (for the use with KMeans), without a deprecation note.

There's no deprecation note because the default will not change for these estimators. It makes me think that we should maybe change their defaults to match the default of KMeans.

@Micky774
Contributor

Therefore, from a principled point of view, one could -- as @jeremiedbb suggests -- always set n_init=1 and give the general hint that for all randomized algorithms performing several trials may improve the result at the cost of increased computation.

I do like this idea in a pure sense.

and telling users they can get a 10x speedup but probably a less satisfactory result by setting it to 1.

To me it would be more interesting to do the opposite, from a pedagogical point of view. That is, default to 1 and explain that the problem is non-convex, so the algorithm will end up in a local minimum, and that one possible way to find a better minimum and thus improve the SSE is to do random inits, at the cost of a substantial slow-down. IMO it makes the discussion more interesting and focused on the optimization problem rather than just a matter of speed.

💯

I think it makes much more sense to frame it that way, especially since this can be done equivalently via random seed perturbations.

@glemaitre
Member Author

So it seems that we went forward with the deprecation cycle. I don't quite recall what the remaining issue was. Closing.
