
[MRG+1] Support for 64 bit sparse array indices in text vectorizers #9147

Merged

merged 7 commits into scikit-learn:master from rth:large-number-of-features Nov 29, 2017

@rth
Member

rth commented Jun 17, 2017

Reference Issue

Continuation of #6473 (itself a continuation of #6194). Fixes #6183 (in CountVectorizer); also fixes #8941 (in HashingVectorizer).

Closes #6473
Closes #6194
Closes #6183
Closes #8941

What does this implement/fix? Explain your changes.

This PR switches to 64 bit indexing by default for the indptr array in CountVectorizer, TfidfVectorizer and HashingVectorizer. It relies on the following assumptions:

  • indptr and indices attributes have the same dtype in Scipy's CSR arrays [1]
  • when 64 bit indptr and indices are given to the csr_matrix constructor, it will downcast them to 32 bit if 32 bits are sufficient to hold their contents [2]
  • overflow happens in indptr, which is typically negligible in size compared to indices

As a result, in this PR,

  • in the intermediary array representation, indptr is always 64 bit, while indices is either an untyped Python list (in CountVectorizer, etc.) or 32 bit (in HashingVectorizer, since the hash size is 32 bit anyway).
  • the returned arrays use 32 bit indexing for both indptr and indices if that dtype is sufficient, and 64 bit indexing otherwise (supported with scipy >= 0.14; see the sketch below). This makes the PR fully backward compatible.
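
A minimal sketch (not part of the PR, toy data) illustrating assumption [2] above: scipy's csr_matrix constructor downcasts 64 bit index arrays whenever 32 bits suffice to hold their contents.

import numpy as np
import scipy.sparse as sp

data = np.ones(3)
indices = np.array([0, 2, 1], dtype=np.int64)  # passed in as 64 bit
indptr = np.array([0, 2, 3], dtype=np.int64)

# csr_matrix inspects the contents and picks the narrowest index dtype.
X = sp.csr_matrix((data, indices, indptr), shape=(2, 3))
print(X.indices.dtype, X.indptr.dtype)  # int32 int32: the values fit in 32 bits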

Any other comments?

All the benchmark scripts and results can be found here.

  • This PR was benchmarked on the 20 newsgroups dataset with bench_text_vectorizers.py from #9086: results before this PR (bench_master.log) and after this PR (bench_PR.log) show no degradation in performance or memory usage.

  • The overflows from the original issues were reproduced with the run.sh script; on master the output is in res_master.log. After this PR, indices in all text vectorizers no longer overflow (res_PR.log) and use 64 bit indexing when necessary. At least 90 GB of RAM is required by run.sh (tested on an EC2 m4.10xlarge with 160 GB RAM).

  • norm='l2' will still fail due to Cython code (related to #2969) (fixed in #9663)

  • These changes were reviewed in #6194 by @lesteve and @ogrisel. This PR extends them to HashingVectorizer and accounts for the latest modifications in the vectorizers.

  • 64 bit indexing won't work on Windows

@psinger Would you be able to confirm that this fixes the overflow in CountVectorizer indexing? Thanks.

cc @jnothman


@rth rth changed the title from Support for 64 bit sparse array indices in text vectorizers to [MRG] Support for 64 bit sparse array indices in text vectorizers Jun 18, 2017

@rth

Member

rth commented Jun 18, 2017

Sure, will do.

@jnothman

I'm okay with this for what it is. My main concern here is interoperability. Producing a large sparse matrix, but then breaking when it is pushed into a downstream predictor that only supports int32-indexed data is not great. Thanks for your work at #2969 towards this. I'd like to fix the most critical of those before merging this, really.

@rth

Member

rth commented Aug 14, 2017

Thank you for the review @jnothman! I don't know how you manage to keep track of all the PRs and issues. It sure makes sense to wait until downstream predictors are able to handle 64 bit sparse arrays. I will try to make a few PRs in that direction in the coming weeks.

Also, the Windows CI is currently failing on this PR because long is 32 bit on Windows; I need to fix that...

@jnothman

Member

jnothman commented Aug 14, 2017

> I don't know how you manage to keep track of all the PRs and issues

Uhh....

[screenshot: screen shot 2017-08-14 at 6 05 35 pm]

I promise as much gets lost as gets tracked...

@jnothman

Member

jnothman commented Aug 14, 2017

btw can we get windows support in python 3 using a long long array?

@rth

Member

rth commented Aug 14, 2017

> I don't know how you manage to keep track of all the PRs and issues
>
> [screenshot: screen shot 2017-08-14 at 6 05 35 pm]

I know; still, it's a lot of notifications per day...

> btw can we get windows support in python 3 using a long long array?

Was thinking along those lines as well, will try to.

@codecov

codecov bot commented Oct 1, 2017

Codecov Report

Merging #9147 into master will decrease coverage by 0.03%.
The diff coverage is 78.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9147      +/-   ##
==========================================
- Coverage    96.2%   96.16%   -0.04%     
==========================================
  Files         337      336       -1     
  Lines       62817    62422     -395     
==========================================
- Hits        60432    60030     -402     
- Misses       2385     2392       +7

Impacted Files Coverage Δ
sklearn/feature_extraction/text.py 95.96% <78.57%> (-0.64%) ⬇️
sklearn/linear_model/tests/test_bayes.py 85.48% <0%> (-4.06%) ⬇️
sklearn/utils/testing.py 75.94% <0%> (-1.06%) ⬇️
sklearn/utils/deprecation.py 95.83% <0%> (-1.05%) ⬇️
sklearn/decomposition/pca.py 94.55% <0%> (-0.64%) ⬇️
sklearn/preprocessing/_function_transformer.py 97.29% <0%> (-0.58%) ⬇️
sklearn/metrics/ranking.py 98.31% <0%> (-0.5%) ⬇️
sklearn/gaussian_process/correlation_models.py 64.1% <0%> (-0.46%) ⬇️
sklearn/ensemble/gradient_boosting.py 95.76% <0%> (-0.45%) ⬇️
sklearn/datasets/samples_generator.py 93.36% <0%> (-0.32%) ⬇️
... and 52 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4bead39...121a95d. Read the comment docs.

@jnothman jnothman referenced this pull request Nov 6, 2017

Vectorizing memory issue #6183 (Closed)

Claes-Fredrik Mannby and others added some commits Jan 19, 2016

Support new scipy sparse array indices, which can now be > 2^31 (< 2^63).

This is needed for very large training sets.
Feature indices (based on the number of distinct features) are unlikely to need 4 bytes per value, however.
@rth

Member

rth commented Nov 22, 2017

Following the earlier discussion, I rewrote this PR somewhat.

The current situation is as follows:

  • internally
    (a) on Python 3.3+ on all platforms, long long indices (i.e. 64 bit) are used for indptr
    (b) otherwise, long indices are used for indptr (i.e. 64 bit on Unix-like systems and 32 bit on Windows)
    The indices remain 32 bit in any case (i.e. the vocabulary size must stay below 2**31, roughly 2e9), which avoids memory overhead.

  • externally (i.e. from the user's perspective)
    (a) on Linux (any Python version) or Windows (with Python 3.3+), the resulting sparse arrays will have 32 bit or 64 bit indices as necessary. 64 bit sparse array indices only work with scipy >= 0.14, and an error is raised otherwise.
    (b) on Windows with Python 2, only 32 bit indices are supported. An overflow error message similar to the one observed in the parent issues would be printed.

I can't come up with a way to write unit tests for the 64 bit case; however, I have updated the tests/benchmarks here (on Linux) to exercise this PR on an actual dataset that produces 64 bit indices (and takes several hours to run). Codecov shows a decrease in coverage because there are no tests for the 64 bit case.

> My main concern here is interoperability. Producing a large sparse matrix, but then breaking when it is pushed into a downstream predictor that only supports int32-indexed data is not great. Thanks for your work at #2969 towards this. I'd like to fix the most critical of those before merging this, really.

With respect to this point, what currently works is normalization (fixed in #9663), dimensionality reduction (LSI, NMF), and a few linear models (e.g. LogisticRegression with the lbfgs/cd solvers, and ElasticNet). I have not tested clustering yet. Of course, there is still work to do (I added the progress status to the parent comment of #2969), but this might be enough for a reasonable workflow.

A review would be appreciated; all tests now pass. A quick way to check the resulting index dtype is sketched below.
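
As a side note, a minimal sketch (not PR code, toy corpus) of how a user can verify which index dtype a vectorizer produced; only matrices with more than 2**31 - 1 stored entries need 64 bit indices, so small inputs stay 32 bit.

from sklearn.feature_extraction.text import HashingVectorizer

# Tiny corpus: the output comfortably fits 32 bit indexing.
X = HashingVectorizer(n_features=2**20).transform(["some text", "more text"])
print(X.indptr.dtype, X.indices.dtype)  # int32 int32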


@@ -784,7 +785,8 @@ def _count_vocab(self, raw_documents, fixed_vocab):
         analyze = self.build_analyzer()
         j_indices = []
-        indptr = _make_int_array()
+        indptr = []

@rth

Member

rth Nov 22, 2017

I checked (in the benchmark linked above) that switching from an array.array to a list doesn't have any negative performance impact here. j_indices is already a list, and that's where most of the data is (j_indices is much longer than indptr), so indptr doesn't really matter. This does simplify things with respect to typing (and possible overflows), though. A minimal sketch of the pattern is below.
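
For illustration, a minimal sketch (not the PR's actual code) of the accumulation pattern discussed here: indptr as a plain Python list (whose ints are arbitrary precision, so they cannot overflow), converted to a 64 bit NumPy array at the end.

import numpy as np

j_indices, indptr = [], [0]
for doc in ([0, 3, 3], [1]):          # hypothetical per-document feature ids
    j_indices.extend(doc)
    indptr.append(len(j_indices))     # Python ints: no 32 bit overflow

indices = np.asarray(j_indices, dtype=np.intc)  # feature ids stay 32 bit
indptr = np.asarray(indptr, dtype=np.int64)     # 64 bit; scipy may downcast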

@jnothman

Member

jnothman commented Nov 26, 2017

> My main concern here is interoperability. Producing a large sparse matrix, but then breaking when it is pushed into a downstream predictor that only supports int32-indexed data is not great.

Let's hurry in a fix for the liblinear segfault (#9545). Otherwise, yes, I'm happy to see this merged.

@jnothman

This LGTM. Any chance @ogrisel could review?

-    return (indices_a, np.frombuffer(indptr, dtype=np.int32), values[:size])
+    indptr_a = np.frombuffer(indptr, dtype=indices_np_dtype)
+    if indptr[-1] > 2147483648:  # = 2**31

@jnothman

Member

jnothman Nov 26, 2017

It would sort of be nice if this were refactored somewhere, but I can't think of anywhere both pleasant and useful to keep it shared.
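
For what it's worth, a hedged sketch of what such a shared helper could look like; the name smallest_index_dtype is hypothetical, not scikit-learn API.

import numpy as np

def smallest_index_dtype(max_value):
    """Return int32 if max_value fits in a signed 32 bit int, else int64."""
    return np.int32 if max_value <= np.iinfo(np.int32).max else np.int64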

@jnothman jnothman changed the title from [MRG] Support for 64 bit sparse array indices in text vectorizers to [MRG+1] Support for 64 bit sparse array indices in text vectorizers Nov 26, 2017

@glemaitre

Contributor

glemaitre commented Nov 29, 2017

Code-wise LGTM.

Even if this is not linked to this PR: while reviewing, I found it strange that the indices are reallocated to an exact size at each iteration there. I would have thought it would be less costly to double the capacity and return a shrunken array.
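
A quick sketch (not PR code, standalone toy) of the amortized-doubling strategy suggested above: grow the buffer geometrically, then shrink once at the end.

import numpy as np

def append_grow(buf, size, value):
    if size == len(buf):                       # buffer full: double capacity
        buf = np.resize(buf, 2 * max(len(buf), 1))
    buf[size] = value
    return buf, size + 1

buf, size = np.empty(0, dtype=np.int32), 0
for v in [3, 1, 4, 1, 5]:
    buf, size = append_grow(buf, size, v)
indices = buf[:size].copy()                    # shrink to the used size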

rth added some commits Nov 29, 2017

@rth

Member

rth commented Nov 29, 2017

Thanks for the review @jnothman and @glemaitre!

I added the corresponding what's new entry.

@jnothman

Member

jnothman commented on doc/whats_new/v0.20.rst in b230cfd Nov 29, 2017

Not precisely, but okay

@rth

Member

rth replied Nov 29, 2017

You mean the formulation or the threshold value? For the latter, I find this more readable than 2**31 - 1 = 2147483647 (which raises the question of why signed rather than unsigned ints are used for positive indices). I can add a few more digits if you prefer?

@rth

Member

rth replied Nov 29, 2017

> I can add a few more digits if you prefer?

Merged #9147.

In some next (unrelated) PR..

@jnothman

Member

jnothman replied Nov 29, 2017

No, I meant the suggestion that it's about the number of tokens, rather than the number of distinct ngram-document pairs.

@jnothman

Merging, thanks!

@jnothman jnothman merged commit e5a5e77 into scikit-learn:master Nov 29, 2017

5 of 6 checks passed

  • codecov/patch: 66.66% of diff hit (target 96.14%)
  • ci/circleci: Your tests passed on CircleCI!
  • codecov/project: 96.14% (-0.01%) compared to 57bcd07
  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed
  • lgtm analysis: Python: No alert changes
@rth

Member

rth commented Nov 29, 2017

Thanks again for the review @jnothman and @glemaitre.

@mannby thank you for the initial implementation!

@rth rth deleted the rth:large-number-of-features branch Nov 29, 2017

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

[MRG+1] Support for 64 bit sparse array indices in text vectorizers (scikit-learn#9147)

Support new scipy sparse array indices, which can now be > 2^31 (< 2^63).
This is needed for very large training sets.
Feature indices (based on the number of distinct features) are unlikely to need 4 bytes per value, however.
else:
    # On Windows with PY2.7 long int would still correspond to 32 bit.
    indices_array_dtype = "l"
    indices_np_dtype = np.int_

@eric-wieser

eric-wieser Aug 9, 2018

Can you just use np.intp for all cases here?

@rth

Member

rth Aug 14, 2018

The issue is that I don't know the array.array dtype corresponding to np.intp. Both need to match in all cases, since we are using np.frombuffer.

@eric-wieser

eric-wieser Sep 20, 2018

np.dtype(np.intp).char will give you that.
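
A small sketch of this suggestion (toy values): derive the array.array typecode that matches np.intp, so the array.array buffer and np.frombuffer agree on every platform.

import array
import numpy as np

typecode = np.dtype(np.intp).char            # e.g. 'l' on 64 bit Linux
buf = array.array(typecode, [0, 5, 9])
indptr = np.frombuffer(buf, dtype=np.intp)   # dtypes match by construction
print(typecode, indptr)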

@robguinness

robguinness commented Sep 20, 2018

Thanks for this fix @rth et al. It helped me a lot!
