New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using ndarray as init
for KMeans
raises a ValueError
#26657
Conversation
…The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
Actually, here's a gist that shows the problem: https://gist.github.com/bnsh/b5bfdf45ea820c3d4b4a2f30da4f63ea It will work if you change the requirements.txt to have numpy==1.24 but fails if numpy==1.25. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, we need:
(i) a non-regression test that must be added to sklearn/cluster/tests/test_kmeans.py
. I propose the following:
@pytest.mark.parametrize(
"init, expected_n_init",
[
("k-means++", 1),
("random", "default"),
(
lambda X, n_clusters, random_state: random_state.uniform(
size=(n_clusters, X.shape[1])
),
"default",
),
("array-like", 1),
],
)
@pytest.mark.parametrize("Estimator", [KMeans, MiniBatchKMeans])
def test_kmeans_init_auto_with_initial_centroids(Estimator, init, expected_n_init):
"""Check that `n_init="auto"` chooses the right number of initializations.
Non-regression test for #26657:
https://github.com/scikit-learn/scikit-learn/pull/26657
"""
n_sample, n_features, n_clusters = 100, 10, 5
X = np.random.randn(n_sample, n_features)
if init == "array-like":
init = np.random.randn(n_clusters, n_features)
if expected_n_init == "default":
expected_n_init = 3 if Estimator is MiniBatchKMeans else 10
kmeans = Estimator(n_clusters=n_clusters, init=init, n_init="auto").fit(X)
assert kmeans._n_init == expected_n_init
(iii) we need to update the documentation to document the number of iterations to know how many iterations will be run for the 4 different cases. We should update both KMeans
and MiniBatchKMeans
.
(iii) an entry in the changelog in doc/whats_new/v1.3.rst
to document the fix.
sklearn/cluster/_kmeans.py
Outdated
@@ -897,10 +897,9 @@ def _check_params_vs_input(self, X, default_n_init=None): | |||
) | |||
self._n_init = default_n_init | |||
if self._n_init == "auto": | |||
if self.init == "k-means++": | |||
self._n_init = default_n_init | |||
if (not _is_arraylike_not_scalar(self.init)) and self.init == "k-means++": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should as well deal with the callable. So let's do the following:
if self._n_init == "auto":
if isinstance(self.init, str):
self._n_init = 1 if self.init == "k-means++" else default_n_init
elif callable(self.init):
self._n_init = default_n_init
else: # array-like
self._n_init = 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @glemaitre ... OK, so I added the changes you mention.. (Tho I changed it a little bit from what you wrote.. I more explicitly demand that self.init be either "k-means++" or "random" if it's a string, and either a callable or array like otherwise, and it raises an Exception otherwise.) I also added the docstring changes and the test case, but I don't know what I'm expected to do with the doc/whats_new/v1.3.rst
... Can you please tell me what exactly I should be doing there? (That seems to be the only thing that's not passing CI/CD checks...)
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think I kind of figured out the Changelog thing. Or at least it's passing the CI/CD check now. I still don't know tho, if the text in there is any good..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @bnsh. Here are some suggestions
… of by a generic validation tool." @jeremiedbb
Great, thanks @jeremiedbb ! I just made the changes you suggest.. |
init
for KMeans
raises a ValueError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM. Thanks @bnsh
…The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
Reference Issues/PRs
What does this implement/fix? Explain your changes.
The problem seems to stem from numpy==1.25, where comparison of a string to a np.array results in this error:
My change just checks to see if self.init is an array before it checks to see if self.init == "kmeans++".
Any other comments?