Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EHN Change default value of n_init in cluster.KMeans and cluster.k_means #23038

Merged
merged 42 commits into from
May 25, 2022

Conversation

Micky774
Copy link
Contributor

@Micky774 Micky774 commented Apr 3, 2022

Reference Issues/PRs

Fixes #9729
Resolves #11530 (stalled)

What does this implement/fix? Explain your changes.

Begins deprecation cycle for changing default value of n_init in cluster.KMeans and cluster.k_means to 5.

Any other comments?

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to address #9729 (comment) and motivate the change from 10 to 5.

sklearn/cluster/tests/test_k_means.py Outdated Show resolved Hide resolved
sklearn/cluster/_kmeans.py Outdated Show resolved Hide resolved
sklearn/cluster/tests/test_k_means.py Outdated Show resolved Hide resolved
sklearn/cluster/_kmeans.py Outdated Show resolved Hide resolved
@jeremiedbb
Copy link
Member

jeremiedbb commented Apr 4, 2022

I think we still need to address #9729 (comment) and motivate the change from 10 to 5.

Let's continue the discussion in #9729 before commiting too deeply into this

@Micky774
Copy link
Contributor Author

Micky774 commented Apr 7, 2022

If we go by the example, then we can have a n_init="auto", where: n_init=1 for kmeans++ and n_init=10 for random.

Per discussion in #9729 (comment) I'll go ahead and proceed with introducing auto-resolution for n_init.

@jeremiedbb
Copy link
Member

I guess it's worth to apply the same pattern to MiniBatchKMeans

@Micky774
Copy link
Contributor Author

Micky774 commented Apr 7, 2022

I guess it's worth to apply the same pattern to MiniBatchKMeans

Should we go ahead and complete this PR first and then open a separate one for MiniBatchKMeans just to keep the scope of the PR small?

sklearn/cluster/_kmeans.py Show resolved Hide resolved
sklearn/cluster/_kmeans.py Outdated Show resolved Hide resolved
sklearn/cluster/_kmeans.py Outdated Show resolved Hide resolved
sklearn/cluster/tests/test_k_means.py Outdated Show resolved Hide resolved
sklearn/cluster/tests/test_k_means.py Outdated Show resolved Hide resolved
sklearn/cluster/tests/test_k_means.py Outdated Show resolved Hide resolved
@Micky774
Copy link
Contributor Author

Micky774 commented Apr 8, 2022

I guess it's worth to apply the same pattern to MiniBatchKMeans

Should we go ahead and complete this PR first and then open a separate one for MiniBatchKMeans just to keep the scope of the PR small?

I actually went ahead and extended it to MiniBatchKMeans in this PR. The two estimators share a common _check_params method through _BaseKMeans which already validates n_init-->_n_init. In this case I think it makes more sense to take care of them together since otherwise _n_init would need to be resolved outside of _check_params which is an entirely separate change that would need to be undone if we wanted to add this feature to MiniBatchKMeans later anyways.

sklearn/cluster/_kmeans.py Outdated Show resolved Hide resolved
sklearn/cluster/_kmeans.py Outdated Show resolved Hide resolved
@jeremiedbb jeremiedbb added this to the 1.2 milestone Apr 29, 2022
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many tests are failing because we turn future warnings into errors in the CI. For each of these you can either filter the warning or explicitly set n_init, whichever is more appropriate.

doc/whats_new/v1.2.rst Outdated Show resolved Hide resolved
sklearn/cluster/_kmeans.py Outdated Show resolved Hide resolved
sklearn/cluster/_kmeans.py Outdated Show resolved Hide resolved
sklearn/cluster/_kmeans.py Outdated Show resolved Hide resolved
sklearn/cluster/_kmeans.py Outdated Show resolved Hide resolved
Micky774 and others added 6 commits May 8, 2022 16:43
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ci failures look like they are related to this PR.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ogrisel
Copy link
Member

ogrisel commented May 24, 2022

@jeremiedbb I let you check that your comments where addressed the way you intended before merging.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @Micky774

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change KMeans n_init default value to 1.
5 participants