Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix integer overflows in Elkan's k-means. #15057

Merged
merged 7 commits into from Sep 26, 2019
Merged

Conversation

@balodja
Copy link
Contributor

balodja commented Sep 22, 2019

I have a large sample size going through k-means. It is greater than 2^31. That is causing integer overflows with subsequent segfaults in cython code.

You know, plain 32-bit integers are too small for indexing in modern world :)

@adrinjalali

This comment has been minimized.

Copy link
Member

adrinjalali commented Sep 23, 2019

Thanks @balodja

This also needs a test.

I wonder if we should check/fix the same issue in other places.

@balodja

This comment has been minimized.

Copy link
Contributor Author

balodja commented Sep 23, 2019

Unfortunately I don't understand what kind of test that should be.

I could add testing for large arrays in k-means, but that test would consume quite large amount of memory (my estimation is around 16GB). Not many computers are capable of performing such a test. I wonder if it is reasonable to add tests that cannot be performed on the most computers.

It would be quite helpful if guide me in the right direction.

@amueller

This comment has been minimized.

Copy link
Member

amueller commented Sep 23, 2019

My usual instinct would be to use sparse matrices but since we iterate over samples I guess that doesn't work here :(
Btw, with that many samples, minibatch k-means (with random initialization) is likely to be much faster.

Copy link
Member

rth left a comment

Thanks @balodja !

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.

@balodja

This comment has been minimized.

Copy link
Contributor Author

balodja commented Sep 24, 2019

@rth, done.

@balodja balodja requested a review from rth Sep 24, 2019
@@ -85,6 +85,11 @@ Changelog
between `n_jobs=1` and `n_jobs>1` due to the handling of the random state.
:pr:`9288` by :user:`Bryan Yang <bryanyang0528>`.

- |Fix| Fixed a bug where functions in :mod:`sklearn.cluster._k_means_elkan`

This comment has been minimized.

Copy link
@rth

rth Sep 24, 2019

Member

The class is

:class:`sklearn.cluster.KMeans`

, we shouldn't mention private functions in "what's new".

This comment has been minimized.

Copy link
@balodja

balodja Sep 24, 2019

Author Contributor

Sorry, my bad, fixed.

@balodja balodja requested a review from rth Sep 24, 2019
Copy link
Member

adrinjalali left a comment

otherwise lgtm

@@ -85,6 +85,10 @@ Changelog
between `n_jobs=1` and `n_jobs>1` due to the handling of the random state.
:pr:`9288` by :user:`Bryan Yang <bryanyang0528>`.

- |Fix| Fixed a bug where `elkan` algorithm in :class:`cluster.KMeans` was
producing Segmentation Fault on large arrays due to integer index overflow.
:pr:`15057` by :user:`Vladimir Korolev <balodja@gmail.com>`.

This comment has been minimized.

Copy link
@adrinjalali

adrinjalali Sep 26, 2019

Member
:user:`Vladimir Korolev <balodja@gmail.com>`

it needs to have your github username, not your email

This comment has been minimized.

Copy link
@balodja

balodja Sep 26, 2019

Author Contributor

Oh, I see. Changed.

@rth rth merged commit e424ab1 into scikit-learn:master Sep 26, 2019
19 checks passed
19 checks passed
LGTM analysis: C/C++ No code changes detected
Details
LGTM analysis: JavaScript No code changes detected
Details
LGTM analysis: Python No code changes detected
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: doc Your tests passed on CircleCI!
Details
ci/circleci: doc artifact Link to 0/doc/_changed.html
Details
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
codecov/patch Coverage not affected when comparing d92f12a...47fef21
Details
codecov/project 96.76% remains the same compared to d92f12a
Details
scikit-learn.scikit-learn Build #20190926.29 succeeded
Details
scikit-learn.scikit-learn (Linux py35_conda_openblas) Linux py35_conda_openblas succeeded
Details
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas) Linux py35_ubuntu_atlas succeeded
Details
scikit-learn.scikit-learn (Linux pylatest_conda_mkl) Linux pylatest_conda_mkl succeeded
Details
scikit-learn.scikit-learn (Linux pylatest_pip_openblas_pandas) Linux pylatest_pip_openblas_pandas succeeded
Details
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit) Linux32 py35_ubuntu_atlas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit) Windows py35_pip_openblas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py37_conda_mkl) Windows py37_conda_mkl succeeded
Details
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) macOS pylatest_conda_mkl succeeded
Details
@rth

This comment has been minimized.

Copy link
Member

rth commented Sep 26, 2019

Thanks @balodja !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.