
MNT Don't normalize sample weights in KMeans #17848

Merged · 5 commits · Jul 8, 2020

Conversation

jeremiedbb (Member)

`sample_weight` is currently normalized so that it sums to `n_samples`. This normalization has no impact on the clustering itself. Moreover, it is currently buggy: we don't invert the normalization when reporting the inertia (Fixes #16594).

This is extracted from #17622 to facilitate review.

@rth @ogrisel @glemaitre
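As a minimal sketch (editorial, not part of the PR) of what this change means in practice, assuming a scikit-learn version that includes it: scaling all weights by a constant now scales `inertia_` by that constant while leaving the clustering itself untouched.

```python
# Sketch: with un-normalized sample weights, inertia_ scales with the
# weights; the cluster assignments themselves do not change.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
w = rng.rand(200)

km_a = KMeans(n_clusters=3, random_state=0).fit(X, sample_weight=w)
km_b = KMeans(n_clusters=3, random_state=0).fit(X, sample_weight=2 * w)

# Same clustering...
assert np.array_equal(km_a.labels_, km_b.labels_)
# ...but inertia doubles. Before this PR it would have been identical,
# because the weights were rescaled to sum to n_samples.
assert np.isclose(km_b.inertia_, 2 * km_a.inertia_)
```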

@jnothman (Member) left a comment

This is also more consistent with how other clusterers deal with `sample_weight`.

@jnothman (Member) commented Jul 6, 2020

I wonder if it's worth updating the docstring of `fit` to indicate that `sample_weight` is equivalent to replication?

@jeremiedbb (Member, Author)

> I wonder if it's worth updating the docstring of `fit` to indicate that `sample_weight` is equivalent to replication?

@jnothman, well, it's not equivalent in many situations: in KMeans with the k-means++ or random init, and in MiniBatchKMeans regardless of the init, because sampling X behaves differently when points are duplicated. So I'm not sure it's worth advertising something that only holds in rare situations.
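For context, a minimal sketch (editorial, not from the thread) of the one case where the equivalence does hold, assuming a version with this change: with an explicit, deterministic init there is no random sampling of X, so integer weights act exactly like row replication.

```python
# Sketch: with a fixed, deterministic init, integer sample weights are
# equivalent to replicating the corresponding rows of X.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0], [1.0], [10.0], [11.0]])
w = np.array([3, 1, 1, 2])
init = np.array([[0.0], [10.0]])  # explicit centers -> no random sampling

km_w = KMeans(n_clusters=2, init=init, n_init=1).fit(X, sample_weight=w)
km_r = KMeans(n_clusters=2, init=init, n_init=1).fit(np.repeat(X, w, axis=0))

# Same centers and same total inertia in both formulations.
assert np.allclose(km_w.cluster_centers_, km_r.cluster_centers_)
assert np.isclose(km_w.inertia_, km_r.inertia_)
```

With k-means++ or random init, or with MiniBatchKMeans' minibatch sampling, the two fits would generally diverge, which is the point made above.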

@jnothman (Member) commented Jul 6, 2020 via email

:mod:`sklearn.cluster`
.........................

- |Fix| Fixed a bug in :class:`cluster.KMeans` and
A member commented on this diff:

Since the inertia will change, we should also list this change in the Changed models section.

@glemaitre (Member) left a comment

LGTM

@glemaitre merged commit 7514a05 into scikit-learn:master on Jul 8, 2020
@glemaitre (Member)

Thanks @jeremiedbb

glemaitre pushed a commit to glemaitre/scikit-learn referencing this pull request on Aug 3, 2020:

> `sample_weight` should not be normalized in KMeans. The weight magnitude should have an influence on `inertia_`: the larger the weights, the larger the inertia.
jayzed82 pushed a commit to jayzed82/scikit-learn referencing this pull request on Oct 22, 2020:

> `sample_weight` should not be normalized in KMeans. The weight magnitude should have an influence on `inertia_`: the larger the weights, the larger the inertia.
Closes: Documentation is wrong about KMeans.inertia_ (#16594)