MNT Don't normalize sample weights in KMeans #17848
Conversation
This is also more consistent with how other clusterers deal with sample weight
I wonder if it's worth updating the docstring of fit to indicate that sample_weight is equivalent to replication?
@jnothman, well, it's not in many situations: in KMeans with the k-means++ or random init, and in MiniBatchKMeans no matter the init, because sampling X behaves differently with duplicated points. So I'm not sure it's worth advertising that when it only holds in rare situations.
True, thanks.
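As a hypothetical sketch of the narrow case where the equivalence does hold (a fixed array init and full-batch Lloyd iterations, so no weight-dependent sampling is involved), integer sample weights should match row replication; the data, weights, and init below are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0], [0.1], [1.0], [1.1], [1.2]])
w = np.array([2, 1, 1, 1, 1])
X_rep = np.repeat(X, w, axis=0)  # replicate each row according to its weight

init = np.array([[0.0], [1.0]])  # fixed init: no k-means++/random sampling
km_w = KMeans(n_clusters=2, init=init, n_init=1).fit(X, sample_weight=w)
km_r = KMeans(n_clusters=2, init=init, n_init=1).fit(X_rep)

# With a deterministic init, weighting and replication give the same centers,
# and (once weights are no longer normalized) the same inertia.
assert np.allclose(km_w.cluster_centers_, km_r.cluster_centers_)
assert np.isclose(km_w.inertia_, km_r.inertia_)
```

With k-means++ or random init, the two runs would draw different initial centers, which is why the docstring was not updated to advertise the equivalence.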
:mod:`sklearn.cluster`
.........................

- |Fix| Fixed a bug in :class:`cluster.KMeans` and
Since the inertia will change, we should add this change in the Changed models section.
LGTM
Thanks @jeremiedbb
`sample_weight` should not be normalized in KMeans. The weight magnitude should influence `inertia_`: the larger the weights, the larger the inertia should be.
Currently, `sample_weight` is normalized such that it sums to `n_samples`. This normalization has absolutely no impact on the clustering. Moreover, it's currently buggy since we don't invert the normalization when reporting the inertia (Fixes #16594).
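The behavior after this change can be sketched as follows: uniformly scaling the sample weights should leave the clustering itself untouched, while the reported `inertia_` scales with the weights. This is a minimal illustration with made-up random data, assuming a scikit-learn version that includes this fix:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(50, 2)
w = rng.rand(50) + 0.5  # positive sample weights

# Same random_state so both runs follow identical k-means++ draws.
km1 = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X, sample_weight=w)
km2 = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X, sample_weight=2 * w)

# Scaling all weights by a constant does not change the clustering...
assert np.array_equal(km1.labels_, km2.labels_)
# ...but the inertia now scales with the weight magnitude (it no longer
# gets silently renormalized away).
assert np.isclose(km2.inertia_, 2 * km1.inertia_)
```

Before this change, the second assertion would fail: both fits reported the same inertia because the weights were normalized to sum to `n_samples` internally.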
This is extracted from #17622 to facilitate the reviews.
@rth @ogrisel @glemaitre