Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix integer overflows in Elkan's k-means. #15057

Merged
merged 7 commits into from Sep 26, 2019
Merged

Conversation

balodja
Copy link
Contributor

@balodja balodja commented Sep 22, 2019

I have a large sample size going through k-means. It is greater than 2^31. That is causing integer overflows with subsequent segfaults in cython code.

You know, plain 32-bit integers are too small for indexing in modern world :)

@adrinjalali
Copy link
Member

@adrinjalali adrinjalali commented Sep 23, 2019

Thanks @balodja

This also needs a test.

I wonder if we should check/fix the same issue in other places.

@balodja
Copy link
Contributor Author

@balodja balodja commented Sep 23, 2019

Unfortunately I don't understand what kind of test that should be.

I could add testing for large arrays in k-means, but that test would consume quite large amount of memory (my estimation is around 16GB). Not many computers are capable of performing such a test. I wonder if it is reasonable to add tests that cannot be performed on the most computers.

It would be quite helpful if guide me in the right direction.

@amueller
Copy link
Member

@amueller amueller commented Sep 23, 2019

My usual instinct would be to use sparse matrices but since we iterate over samples I guess that doesn't work here :(
Btw, with that many samples, minibatch k-means (with random initialization) is likely to be much faster.

Copy link
Member

@rth rth left a comment

Thanks @balodja !

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.

@balodja
Copy link
Contributor Author

@balodja balodja commented Sep 24, 2019

@rth, done.

@balodja balodja requested a review from rth Sep 24, 2019
@@ -85,6 +85,11 @@ Changelog
between `n_jobs=1` and `n_jobs>1` due to the handling of the random state.
:pr:`9288` by :user:`Bryan Yang <bryanyang0528>`.

- |Fix| Fixed a bug where functions in :mod:`sklearn.cluster._k_means_elkan`
Copy link
Member

@rth rth Sep 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class is

:class:`sklearn.cluster.KMeans`

, we shouldn't mention private functions in "what's new".

Copy link
Contributor Author

@balodja balodja Sep 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my bad, fixed.

@balodja balodja requested a review from rth Sep 24, 2019
Copy link
Member

@adrinjalali adrinjalali left a comment

otherwise lgtm

@@ -85,6 +85,10 @@ Changelog
between `n_jobs=1` and `n_jobs>1` due to the handling of the random state.
:pr:`9288` by :user:`Bryan Yang <bryanyang0528>`.

- |Fix| Fixed a bug where `elkan` algorithm in :class:`cluster.KMeans` was
producing Segmentation Fault on large arrays due to integer index overflow.
:pr:`15057` by :user:`Vladimir Korolev <balodja@gmail.com>`.
Copy link
Member

@adrinjalali adrinjalali Sep 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:user:`Vladimir Korolev <balodja@gmail.com>`

it needs to have your github username, not your email

Copy link
Contributor Author

@balodja balodja Sep 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Changed.

@rth rth merged commit e424ab1 into scikit-learn:master Sep 26, 2019
19 checks passed
@rth
Copy link
Member

@rth rth commented Sep 26, 2019

Thanks @balodja !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants