Using ndarray as `init` for `KMeans` raises a ValueError #26657

bnsh · 2023-06-21T21:39:04Z

…The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"

Reference Issues/PRs

What does this implement/fix? Explain your changes.

The problem seems to stem from numpy==1.25, where comparison of a string to a np.array results in this error:

  File "/usr/local/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py", line 878, in _check_params_vs_input
    if self.init == "k-means++":
       ^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

My change just checks to see if self.init is an array before it checks to see if self.init == "kmeans++".

Any other comments?

…The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"

bnsh · 2023-06-21T21:39:53Z

Actually, here's a gist that shows the problem: https://gist.github.com/bnsh/b5bfdf45ea820c3d4b4a2f30da4f63ea

It will work if you change the requirements.txt to have numpy==1.24 but fails if numpy==1.25.

github-actions · 2023-06-21T21:41:09Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_{Generated for commit: 04d751d. Link to the linter CI: here}

glemaitre

In addition, we need:

(i) a non-regression test that must be added to sklearn/cluster/tests/test_kmeans.py. I propose the following:

@pytest.mark.parametrize(
    "init, expected_n_init",
    [
        ("k-means++", 1),
        ("random", "default"),
        (
            lambda X, n_clusters, random_state: random_state.uniform(
                size=(n_clusters, X.shape[1])
            ),
            "default",
        ),
        ("array-like", 1),
    ],
)
@pytest.mark.parametrize("Estimator", [KMeans, MiniBatchKMeans])
def test_kmeans_init_auto_with_initial_centroids(Estimator, init, expected_n_init):
    """Check that `n_init="auto"` chooses the right number of initializations.

    Non-regression test for #26657:
    https://github.com/scikit-learn/scikit-learn/pull/26657
    """
    n_sample, n_features, n_clusters = 100, 10, 5
    X = np.random.randn(n_sample, n_features)
    if init == "array-like":
        init = np.random.randn(n_clusters, n_features)
    if expected_n_init == "default":
        expected_n_init = 3 if Estimator is MiniBatchKMeans else 10

    kmeans = Estimator(n_clusters=n_clusters, init=init, n_init="auto").fit(X)
    assert kmeans._n_init == expected_n_init

(iii) we need to update the documentation to document the number of iterations to know how many iterations will be run for the 4 different cases. We should update both KMeans and MiniBatchKMeans.

(iii) an entry in the changelog in doc/whats_new/v1.3.rst to document the fix.

glemaitre · 2023-06-22T14:54:19Z

sklearn/cluster/_kmeans.py

@@ -897,10 +897,9 @@ def _check_params_vs_input(self, X, default_n_init=None):
            )
            self._n_init = default_n_init
        if self._n_init == "auto":
-            if self.init == "k-means++":
+            self._n_init = default_n_init
+            if (not _is_arraylike_not_scalar(self.init)) and self.init == "k-means++":


We should as well deal with the callable. So let's do the following:

if self._n_init == "auto": if isinstance(self.init, str): self._n_init = 1 if self.init == "k-means++" else default_n_init elif callable(self.init): self._n_init = default_n_init else: # array-like self._n_init = 1

Hi @glemaitre ... OK, so I added the changes you mention.. (Tho I changed it a little bit from what you wrote.. I more explicitly demand that self.init be either "k-means++" or "random" if it's a string, and either a callable or array like otherwise, and it raises an Exception otherwise.) I also added the docstring changes and the test case, but I don't know what I'm expected to do with the doc/whats_new/v1.3.rst... Can you please tell me what exactly I should be doing there? (That seems to be the only thing that's not passing CI/CD checks...)

Thanks!

Actually, I think I kind of figured out the Changelog thing. Or at least it's passing the CI/CD check now. I still don't know tho, if the text in there is any good..

jeremiedbb

Thanks for the PR @bnsh. Here are some suggestions

sklearn/cluster/_kmeans.py

doc/whats_new/v1.3.rst

sklearn/cluster/_kmeans.py

@jeremiedbb

… of by a generic validation tool." @jeremiedbb

bnsh · 2023-06-23T15:00:28Z

Great, thanks @jeremiedbb ! I just made the changes you suggest..

glemaitre

Otherwise LGTM. Thanks @bnsh

doc/whats_new/v1.3.rst

When init is specified as a numpy array, it errors with "ValueError: …

2b0a9bd

…The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"

github-actions bot added the module:cluster label Jun 21, 2023

Merge branch 'main' into main

dd34560

glemaitre self-requested a review June 22, 2023 14:44

glemaitre reviewed Jun 22, 2023

View reviewed changes

bnsh added 9 commits June 22, 2023 12:51

Handle callable inits as well as array like inits. Added test case.

7be0252

Formatting for Black

6fcc539

Added n_init changes to KMeans and MiniBatchKMeans doc strings.

79b3b05

Typo. parametrize not parameterize.

992ade2

Black: Line too long.

717cace

More black formatting woes.

1103a5d

Added n_init docs for kmeans function, added a . for more formatting.

a6d2581

Added entry in Changelog

00b8abc

Pull Request needs to be in backquotes not double quotes.

6bbd3ad

bnsh requested a review from glemaitre June 22, 2023 21:42

Comment change: unbalanced backquotes and single quotes.

1f42d0c

jeremiedbb reviewed Jun 23, 2023

View reviewed changes

sklearn/cluster/_kmeans.py Outdated Show resolved Hide resolved

doc/whats_new/v1.3.rst Outdated Show resolved Hide resolved

sklearn/cluster/_kmeans.py Outdated Show resolved Hide resolved

bnsh added 2 commits June 23, 2023 10:56

"The check that init has a valid type and value is already taken care…

0895f9d

… of by a generic validation tool." @jeremiedbb

Excessive detail for "What's New"

30bc7ea

bnsh requested a review from jeremiedbb June 23, 2023 15:01

bnsh added 3 commits June 23, 2023 11:03

Change to docstring: Simplified initialization cases.

e4088eb

Formatting. Whitespace.

aa37c7c

Needs a period.

e62bd30

betatim changed the title ~~When init is specified as a numpy array, it errors with "ValueError: …~~ Using ndarray as init for KMeans raises a ValueError Jun 26, 2023

glemaitre approved these changes Jun 26, 2023

View reviewed changes

doc/whats_new/v1.3.rst Outdated Show resolved Hide resolved

doc/whats_new/v1.3.rst Outdated Show resolved Hide resolved

Update v1.3.rst

04d751d

jeremiedbb added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Jun 26, 2023

jeremiedbb enabled auto-merge (squash) June 26, 2023 14:17

jeremiedbb merged commit 2579841 into scikit-learn:main Jun 26, 2023
29 checks passed

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Jun 29, 2023

FIX Using custom init for KMeans does a single init (scikit-learn#26657)

470d39b

jeremiedbb pushed a commit that referenced this pull request Jun 29, 2023

FIX Using custom init for KMeans does a single init (#26657)

4250f80

punndcoder28 pushed a commit to punndcoder28/scikit-learn that referenced this pull request Jul 29, 2023

FIX Using custom init for KMeans does a single init (scikit-learn#26657)

f66c904

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023

FIX Using custom init for KMeans does a single init (scikit-learn#26657)

d210a90

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using ndarray as `init` for `KMeans` raises a ValueError #26657

Using ndarray as `init` for `KMeans` raises a ValueError #26657

bnsh commented Jun 21, 2023

bnsh commented Jun 21, 2023

github-actions bot commented Jun 21, 2023 •

edited

glemaitre left a comment

glemaitre Jun 22, 2023

bnsh Jun 22, 2023 •

edited

bnsh Jun 22, 2023 •

edited

jeremiedbb left a comment

bnsh commented Jun 23, 2023

glemaitre left a comment

Using ndarray as init for KMeans raises a ValueError #26657

Using ndarray as init for KMeans raises a ValueError #26657

Conversation

bnsh commented Jun 21, 2023

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

bnsh commented Jun 21, 2023

github-actions bot commented Jun 21, 2023 • edited

✔️ Linting Passed

glemaitre left a comment

Choose a reason for hiding this comment

glemaitre Jun 22, 2023

Choose a reason for hiding this comment

bnsh Jun 22, 2023 • edited

Choose a reason for hiding this comment

bnsh Jun 22, 2023 • edited

Choose a reason for hiding this comment

jeremiedbb left a comment

Choose a reason for hiding this comment

bnsh commented Jun 23, 2023

glemaitre left a comment

Choose a reason for hiding this comment

Using ndarray as `init` for `KMeans` raises a ValueError #26657

Using ndarray as `init` for `KMeans` raises a ValueError #26657

github-actions bot commented Jun 21, 2023 •

edited

bnsh Jun 22, 2023 •

edited

bnsh Jun 22, 2023 •

edited