Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Fix OverflowError on DictVectorizer #15463

Merged

Conversation

@norvan
Copy link
Contributor

norvan commented Nov 2, 2019

Fixes #13045

DictVectorizer's transform step raises an OverflowError if the input data has too many samples.
Replacing indptr to a list resolves this issue without impairing performance.

Similar solution has been proposed by @rth on #9147

Checks on performance was run locally to confirm there was no increase in the duration of the transform step:

Duration was:

  • 2.79 s ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) before the change
  • 2.83 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) after the change

Code:

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import GenericUnivariateSelect, chi2
from numpy.random import random

n_samples = 1000000

data = [{"alpha": i, "beta": int(random()*100)} for i in range(n_samples)]

vectorizer = DictVectorizer(sparse=True, dtype=np.uint16).fit(data)

%timeit vectorized_data = vectorizer.transform(data)
@norvan norvan changed the title Fix OverflowError on DictVectorizer [WIP] Fix OverflowError on DictVectorizer Nov 2, 2019
@TomDLT

This comment has been minimized.

Copy link
Member

TomDLT commented Nov 2, 2019

Thanks for the pull-request !
Could you add a small test in sklearn/feature_extraction/tests/test_dict_vectorizer.py ?

@rth

This comment has been minimized.

Copy link
Member

rth commented Nov 2, 2019

Could you add a small test in sklearn/feature_extraction/tests/test_dict_vectorizer.py ?

The problem is that this only happens for sparse arrays that have more than 2**31 elements which corresponds to 10s GB RAM. I think it's OK to merge it without unit test given that's idential to what we did earlier for CountVectorizer in #9147

@rth

This comment has been minimized.

Copy link
Member

rth commented Nov 2, 2019

@norvan Could you please run a small performance benchmark before and after this PR to make sure performance is not degraded? It shouldn't be, but better to double check.

For instance running evaluation code in jupyter with %timeit DictVectorizer should be enough. Thanks!

@norvan

This comment has been minimized.

Copy link
Contributor Author

norvan commented Nov 2, 2019

Ran the following code on both stable sklearn and after my changes.
Duration was:

  • 2.79 s ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) before the change
  • 2.83 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) after the change

Conclusion is that the change does not impair performance in a statistically meaningful way considering the tolerance.

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import GenericUnivariateSelect, chi2
from numpy.random import random

n_samples = 1000000

data = [{"alpha": i, "beta": int(random()*100)} for i in range(n_samples)]

vectorizer = DictVectorizer(sparse=True, dtype=np.uint16).fit(data)

%timeit vectorized_data = vectorizer.transform(data)
@norvan norvan changed the title [WIP] Fix OverflowError on DictVectorizer [MRG] Fix OverflowError on DictVectorizer Nov 2, 2019
@rth
rth approved these changes Nov 2, 2019
Copy link
Member

rth left a comment

Minor comment, otherwise LGTM.

Thanks for the benchmarks @norvan !

doc/whats_new/v0.22.rst Outdated Show resolved Hide resolved
@jnothman jnothman merged commit 77aec1f into scikit-learn:master Nov 2, 2019
14 of 16 checks passed
14 of 16 checks passed
ci/circleci: doc Your tests are queued behind your running builds
Details
ci/circleci: doc-min-dependencies Your tests are queued behind your running builds
Details
LGTM analysis: C/C++ No code changes detected
Details
LGTM analysis: JavaScript No code changes detected
Details
LGTM analysis: Python No new or fixed alerts
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
scikit-learn.scikit-learn Build #20191102.99 succeeded
Details
scikit-learn.scikit-learn (Linting) Linting succeeded
Details
scikit-learn.scikit-learn (Linux py35_conda_openblas) Linux py35_conda_openblas succeeded
Details
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas) Linux py35_ubuntu_atlas succeeded
Details
scikit-learn.scikit-learn (Linux pylatest_pip_openblas_pandas) Linux pylatest_pip_openblas_pandas succeeded
Details
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit) Linux32 py35_ubuntu_atlas_32bit succeeded
Details
scikit-learn.scikit-learn (Linux_Runs pylatest_conda_mkl) Linux_Runs pylatest_conda_mkl succeeded
Details
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit) Windows py35_pip_openblas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py37_conda_mkl) Windows py37_conda_mkl succeeded
Details
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) macOS pylatest_conda_mkl succeeded
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.