Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using ndarray as init for KMeans raises a ValueError #26657

Merged
merged 18 commits into from Jun 26, 2023

Conversation

bnsh
Copy link
Contributor

@bnsh bnsh commented Jun 21, 2023

…The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"

Reference Issues/PRs

What does this implement/fix? Explain your changes.

The problem seems to stem from numpy==1.25, where comparison of a string to a np.array results in this error:

  File "/usr/local/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py", line 878, in _check_params_vs_input
    if self.init == "k-means++":
       ^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

My change just checks to see if self.init is an array before it checks to see if self.init == "kmeans++".

Any other comments?

…The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
@bnsh
Copy link
Contributor Author

bnsh commented Jun 21, 2023

Actually, here's a gist that shows the problem: https://gist.github.com/bnsh/b5bfdf45ea820c3d4b4a2f30da4f63ea

It will work if you change the requirements.txt to have numpy==1.24 but fails if numpy==1.25.

@github-actions
Copy link

github-actions bot commented Jun 21, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 04d751d. Link to the linter CI: here

@glemaitre glemaitre self-requested a review June 22, 2023 14:44
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, we need:

(i) a non-regression test that must be added to sklearn/cluster/tests/test_kmeans.py. I propose the following:

@pytest.mark.parametrize(
    "init, expected_n_init",
    [
        ("k-means++", 1),
        ("random", "default"),
        (
            lambda X, n_clusters, random_state: random_state.uniform(
                size=(n_clusters, X.shape[1])
            ),
            "default",
        ),
        ("array-like", 1),
    ],
)
@pytest.mark.parametrize("Estimator", [KMeans, MiniBatchKMeans])
def test_kmeans_init_auto_with_initial_centroids(Estimator, init, expected_n_init):
    """Check that `n_init="auto"` chooses the right number of initializations.

    Non-regression test for #26657:
    https://github.com/scikit-learn/scikit-learn/pull/26657
    """
    n_sample, n_features, n_clusters = 100, 10, 5
    X = np.random.randn(n_sample, n_features)
    if init == "array-like":
        init = np.random.randn(n_clusters, n_features)
    if expected_n_init == "default":
        expected_n_init = 3 if Estimator is MiniBatchKMeans else 10

    kmeans = Estimator(n_clusters=n_clusters, init=init, n_init="auto").fit(X)
    assert kmeans._n_init == expected_n_init

(iii) we need to update the documentation to document the number of iterations to know how many iterations will be run for the 4 different cases. We should update both KMeans and MiniBatchKMeans.

(iii) an entry in the changelog in doc/whats_new/v1.3.rst to document the fix.

@@ -897,10 +897,9 @@ def _check_params_vs_input(self, X, default_n_init=None):
)
self._n_init = default_n_init
if self._n_init == "auto":
if self.init == "k-means++":
self._n_init = default_n_init
if (not _is_arraylike_not_scalar(self.init)) and self.init == "k-means++":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should as well deal with the callable. So let's do the following:

        if self._n_init == "auto":
            if isinstance(self.init, str):
                self._n_init = 1 if self.init == "k-means++" else default_n_init
            elif callable(self.init):
                self._n_init = default_n_init
            else:  # array-like
                self._n_init = 1

Copy link
Contributor Author

@bnsh bnsh Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @glemaitre ... OK, so I added the changes you mention.. (Tho I changed it a little bit from what you wrote.. I more explicitly demand that self.init be either "k-means++" or "random" if it's a string, and either a callable or array like otherwise, and it raises an Exception otherwise.) I also added the docstring changes and the test case, but I don't know what I'm expected to do with the doc/whats_new/v1.3.rst... Can you please tell me what exactly I should be doing there? (That seems to be the only thing that's not passing CI/CD checks...)

Thanks!

Copy link
Contributor Author

@bnsh bnsh Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think I kind of figured out the Changelog thing. Or at least it's passing the CI/CD check now. I still don't know tho, if the text in there is any good..

@bnsh bnsh requested a review from glemaitre June 22, 2023 21:42
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @bnsh. Here are some suggestions

sklearn/cluster/_kmeans.py Outdated Show resolved Hide resolved
doc/whats_new/v1.3.rst Outdated Show resolved Hide resolved
sklearn/cluster/_kmeans.py Outdated Show resolved Hide resolved
@bnsh
Copy link
Contributor Author

bnsh commented Jun 23, 2023

Great, thanks @jeremiedbb ! I just made the changes you suggest..

@bnsh bnsh requested a review from jeremiedbb June 23, 2023 15:01
@betatim betatim changed the title When init is specified as a numpy array, it errors with "ValueError: … Using ndarray as init for KMeans raises a ValueError Jun 26, 2023
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM. Thanks @bnsh

doc/whats_new/v1.3.rst Outdated Show resolved Hide resolved
doc/whats_new/v1.3.rst Outdated Show resolved Hide resolved
@jeremiedbb jeremiedbb added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Jun 26, 2023
@jeremiedbb jeremiedbb enabled auto-merge (squash) June 26, 2023 14:17
@jeremiedbb jeremiedbb merged commit 2579841 into scikit-learn:main Jun 26, 2023
29 checks passed
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Jun 29, 2023
punndcoder28 pushed a commit to punndcoder28/scikit-learn that referenced this pull request Jul 29, 2023
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:cluster To backport PR merged in master that need a backport to a release branch defined based on the milestone.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants