Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Fix OverflowError on DictVectorizer #15463

Merged

Conversation

norvan
Copy link
Contributor

@norvan norvan commented Nov 2, 2019

Fixes #13045

DictVectorizer's transform step raises an OverflowError if the input data has too many samples.
Replacing indptr to a list resolves this issue without impairing performance.

Similar solution has been proposed by @rth on #9147

Checks on performance was run locally to confirm there was no increase in the duration of the transform step:

Duration was:

  • 2.79 s ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) before the change
  • 2.83 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) after the change

Code:

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import GenericUnivariateSelect, chi2
from numpy.random import random

n_samples = 1000000

data = [{"alpha": i, "beta": int(random()*100)} for i in range(n_samples)]

vectorizer = DictVectorizer(sparse=True, dtype=np.uint16).fit(data)

%timeit vectorized_data = vectorizer.transform(data)

@norvan norvan changed the title Fix OverflowError on DictVectorizer [WIP] Fix OverflowError on DictVectorizer Nov 2, 2019
@TomDLT
Copy link
Member

@TomDLT TomDLT commented Nov 2, 2019

Thanks for the pull-request !
Could you add a small test in sklearn/feature_extraction/tests/test_dict_vectorizer.py ?

@rth
Copy link
Member

@rth rth commented Nov 2, 2019

Could you add a small test in sklearn/feature_extraction/tests/test_dict_vectorizer.py ?

The problem is that this only happens for sparse arrays that have more than 2**31 elements which corresponds to 10s GB RAM. I think it's OK to merge it without unit test given that's idential to what we did earlier for CountVectorizer in #9147

@rth
Copy link
Member

@rth rth commented Nov 2, 2019

@norvan Could you please run a small performance benchmark before and after this PR to make sure performance is not degraded? It shouldn't be, but better to double check.

For instance running evaluation code in jupyter with %timeit DictVectorizer should be enough. Thanks!

@norvan
Copy link
Contributor Author

@norvan norvan commented Nov 2, 2019

Ran the following code on both stable sklearn and after my changes.
Duration was:

  • 2.79 s ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) before the change
  • 2.83 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) after the change

Conclusion is that the change does not impair performance in a statistically meaningful way considering the tolerance.

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import GenericUnivariateSelect, chi2
from numpy.random import random

n_samples = 1000000

data = [{"alpha": i, "beta": int(random()*100)} for i in range(n_samples)]

vectorizer = DictVectorizer(sparse=True, dtype=np.uint16).fit(data)

%timeit vectorized_data = vectorizer.transform(data)

@norvan norvan changed the title [WIP] Fix OverflowError on DictVectorizer [MRG] Fix OverflowError on DictVectorizer Nov 2, 2019
rth
rth approved these changes Nov 2, 2019
Copy link
Member

@rth rth left a comment

Minor comment, otherwise LGTM.

Thanks for the benchmarks @norvan !

doc/whats_new/v0.22.rst Outdated Show resolved Hide resolved
@jnothman jnothman merged commit 77aec1f into scikit-learn:master Nov 2, 2019
14 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants