Replies: 1 comment
I opened #25022, since this is a real issue and we need to act on it before the next release.
In version 1.2 the new value `auto` will be introduced for the `KMeans` parameter `n_init`, and in version 1.4 `auto` is planned to become the default value for `n_init`:

scikit-learn/sklearn/cluster/_kmeans.py
Lines 1199 to 1206 in a7698a8
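
As a rough illustration of what `auto` is planned to mean, the resolution of `n_init` can be sketched as a small helper. This is not the scikit-learn source: `resolve_n_init` is a hypothetical name, the `k-means++` case follows the behaviour described in this post, and the value 10 for `init='random'` is an assumption taken from the scikit-learn documentation. Other kinds of `init` are deliberately left out.

```python
def resolve_n_init(n_init="auto", init="k-means++"):
    """Hypothetical helper: effective number of initialisations for KMeans.

    Only the string values 'k-means++' and 'random' for `init` are handled.
    """
    if n_init != "auto":
        # An explicit integer is used as given.
        return int(n_init)
    if init == "k-means++":
        # The case discussed in this post: one initialisation instead of ten.
        return 1
    if init == "random":
        # Assumption based on the scikit-learn documentation.
        return 10
    raise ValueError(f"init={init!r} is not covered by this sketch")


print(resolve_n_init())               # 1
print(resolve_n_init(init="random"))  # 10
print(resolve_n_init(n_init=10))      # 10
```

Passing `n_init=10` explicitly to `KMeans` keeps the current number of initialisations regardless of what the default becomes.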
For uses of `KMeans` with `init='k-means++'` (which is the default), this will silently change the default value of `n_init` from 10 to 1.

This can (and in many cases will) strongly worsen the results of existing code. The code below illustrates the difference between using 1 and 10 as the value of `n_init`. In each call of `check()`, a new data set of 1000 points is generated and `KMeans` with the default method `k-means++` is run 10 times with `n_init=10` and 10 times with `n_init=1`. Finally, the mean percentage deterioration from using `n_init=1` instead of `n_init=10` is computed. It is always a deterioration and is between 1.81% and 3.52% in the experiments shown. This can be quite significant depending on the application.

My question is whether it is a good idea to make this change, which will likely worsen the result quality of every call of `KMeans` with default parameters (or with `init='k-means++'` set explicitly). While not a breaking change in the syntactic sense, it is a breaking change numerically, since many `KMeans` results can be expected to deteriorate. If this change is made as planned, this should IMO be communicated very clearly.

Opinions?
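
A minimal sketch of an experiment of this kind is given here for readers who want to try it. Several details are assumptions rather than the original setup: the data come from `make_blobs`, the number of clusters is set to 10, the "error" is taken to be `inertia_`, and the seeding scheme is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def check(seed, n_samples=1000, n_clusters=10, n_repeats=10):
    """Return the mean percentage increase in inertia for n_init=1 vs. n_init=10."""
    # Assumption: Gaussian blobs stand in for the data set used in the post.
    X, _ = make_blobs(n_samples=n_samples, centers=n_clusters, random_state=seed)

    def mean_inertia(n_init):
        # One fit per repeat, each with its own random_state.
        return np.mean([
            KMeans(n_clusters=n_clusters, init="k-means++", n_init=n_init,
                   random_state=seed * 1000 + r).fit(X).inertia_
            for r in range(n_repeats)
        ])

    # Relative difference of the mean inertias, in percent.
    return 100.0 * (mean_inertia(1) / mean_inertia(10) - 1.0)


if __name__ == "__main__":
    for seed in range(10):
        pct = check(seed)
        print(f"mean error for n_init=1 is {pct:.2f}% higher than for n_init=10")
```

The figures this prints depend on the data and seeds, so they will not match the output quoted in this post.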
Example Output:
mean error for n_init=1 is 2.85% higher than for n_init=10
mean error for n_init=1 is 3.52% higher than for n_init=10
mean error for n_init=1 is 3.25% higher than for n_init=10
mean error for n_init=1 is 1.81% higher than for n_init=10
mean error for n_init=1 is 2.68% higher than for n_init=10
mean error for n_init=1 is 2.91% higher than for n_init=10
mean error for n_init=1 is 2.01% higher than for n_init=10
mean error for n_init=1 is 3.02% higher than for n_init=10
mean error for n_init=1 is 2.25% higher than for n_init=10
mean error for n_init=1 is 2.75% higher than for n_init=10