
[MRG] fixes an issue w/ large sparse matrix indices in CountVectorizer #11295

Merged: 9 commits into scikit-learn:master on Jan 30, 2019

Conversation

gvacaliuc (Contributor):

Reference Issues/PRs

#7762

What does this implement/fix? Explain your changes.

If a CountVectorizer's feature matrix ends up holding 2^31 or more values, it needs to use 64-bit indices / index pointers. If the features get sorted via a call to _sort_features, the new indices were hard-coded to be 32-bit. I've updated that call to make the new indices keep the same dtype as before.
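To make the invariant concrete, here is a minimal sketch (illustrative only, not the PR's actual test): np.take returns an array with the dtype of the source array, so map_index must inherit the dtype of X.indices instead of being hard-coded to int32.

import numpy as np
import scipy.sparse as sp

# Force 64-bit indices on a small empty matrix, as the test discussed
# below does; a real matrix with >= 2**31 values would need tens of GB.
X = sp.csr_matrix((5, 5), dtype=np.int64)
X.indices = X.indices.astype(np.int64, copy=False)
X.indptr = X.indptr.astype(np.int64, copy=False)

# Deriving map_index's dtype from X.indices preserves int64 through take().
map_index = np.arange(5, dtype=X.indices.dtype)  # identity remap
X.indices = map_index.take(X.indices, mode='clip')
assert X.indices.dtype == np.int64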

Any other comments?

The issue was broader than this one case, but I didn't identify any other manual modifications of sparse matrix indices / indptrs in the code. See #7762 for more details.

@gvacaliuc changed the title from "fixes an issue w/ manually sparse matrix indices identified in #7762" to "[MRG] fixes an issue w/ manually sparse matrix indices identified in #7762" on Jun 15, 2018
@rth (Member) left a comment:

Nice work on investigating! Overall I think it's the right fix.

If a CountVectorizer's feature matrix ends up holding 2^31 or more values, it needs to use 64-bit indices / index pointers

The case of 64-bit indptr was fixed in #9147, in which case indices are also 64-bit as a side effect, due to constraints of scipy CSR matrices, if I remember correctly (see the related discussion in #9147 (comment)). Needing 64-bit indices is not actually something that happens much - that would mean a vocabulary of ~2e9 tokens (roughly the total word count of English Wikipedia, but as unique tokens).

However, when we do use 64-bit indices/indptr, I think this change means one less memory copy due to dtype conversion, plus a more consistent output array, so it's all good.

See other comments below.


vec._sort_features(X, vocabulary)

assert_equal(INDEX_DTYPE, X.indices.dtype)
rth (Member):

assert INDEX_DTYPE == X.indices.dtype

will produce nicer error messages with pytest

gvacaliuc (Author):

good to know, will update

"great!": 2
}

vec._sort_features(X, vocabulary)
rth (Member):

According to the docstring, this returns the reordered matrix, so let's validate that instead:

Xs = vec._sort_features(X, vocabulary)

Just to be safe. Looking at the code, I agree that it does in-place modification of X, but there is no guarantee that it will continue to do that in the future.

gvacaliuc (Author):

Sure, that makes sense, I'll update.

# force indices and indptr to int64.
INDEX_DTYPE = np.int64
X.indices = X.indices.astype(INDEX_DTYPE, copy=False)
X.indptr = X.indptr.astype(INDEX_DTYPE, copy=False)
rth (Member):

Are you sure the copy flag has an impact here?

gvacaliuc (Author):

I am not sure -- I suppose it doesn't matter if it's a copy or not.
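For reference, a quick sketch of numpy's astype copy semantics (standard numpy behavior, not code from this PR): copy=False only avoids a copy when no conversion is needed.

import numpy as np

a = np.arange(3, dtype=np.int64)
assert a.astype(np.int64, copy=False) is a      # same dtype: no copy made
assert a.astype(np.int32, copy=False) is not a  # cast still forces a copy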

# If a count vectorizer has to store >= 2**31 count values, the sparse
# storage matrix has 64bit indices / indptrs. This requires ~2*8*2**31
# bytes of memory in practice, so we just test the method that would
# hypothetically fail.
rth (Member):

Maybe just say,
"""
Check that _sort_features preserves the indices dtype of sparse arrays with 64 bit indices
"""

gvacaliuc (Author):

Yep, I'll update it to something cleaner.

X = sparse.csr_matrix((5, 5), dtype=np.int64)

# force indices and indptr to int64.
INDEX_DTYPE = np.int64
rth (Member):

INDICES_DTYPE if we want to be consistent

gvacaliuc (Author):

Sure, will update.

@@ -1097,3 +1097,28 @@ def test_vectorizers_invalid_ngram_range(vec):
     if isinstance(vec, HashingVectorizer):
         assert_raise_message(
             ValueError, message, vec.transform, ["good news everyone"])


+@pytest.mark.parametrize("vec", [CountVectorizer()])
rth (Member):

I think there is no need to parametrize it, since we are only checking the CountVectorizer (which should be sufficient).

gvacaliuc (Author):

Yeah, I wasn't sure at first if we needed to test this with other vectorizers, but I don't think so.

gvacaliuc (Author):

Yes this will likely never happen in practice!

@jnothman (Member) left a comment:

CIs unhappy.

gvacaliuc (Author):

Looking into it!


rth commented Jul 22, 2018

It would be great to have this. @gvacaliuc any chance you could fix the CI (and resolve the conflict)? Thanks!

gvacaliuc (Author):

Hey there, sorry to ghost. I'll mess around with it tonight and see if I can figure it out.

* added test to show example of issue in scikit-learn#7762
* fixes error caused by manually manipulating sparse indices in
  `CountVectorizer`
gvacaliuc (Author):

Sorry, life's been busy lately. AppVeyor fails on 32-bit Python 2.7.8. IIUC this issue would never arise on a 32-bit arch, but I'll set up a test environment to figure out what's going on.

The strange thing is that this is the relevant bit of code:

        map_index = np.empty(len(sorted_features), dtype=X.indices.dtype)
        for new_val, (term, old_val) in enumerate(sorted_features):
            vocabulary[term] = new_val
            map_index[old_val] = new_val

        X.indices = map_index.take(X.indices, mode='clip')

And AppVeyor is failing on that last line with a type error (cannot cast int64 to int32). map_index and X.indices are clearly the same dtype, so my guess off the bat is that there's some issue with indexing the map_index array with 64-bit integers on a 32-bit architecture, and that under the hood numpy tries to cast whatever dtype the index array is to int32.
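A minimal sketch of the apparent failure (hypothetical reproduction; it only errors on 32-bit builds, where np.intp is int32):

import numpy as np

map_index = np.arange(10)
idx = np.array([1, 2, 3], dtype=np.int64)
# Fine on 64-bit; on 32-bit builds take() casts the index array to intp
# (int32) under the 'safe' casting rule and raises a TypeError.
map_index.take(idx, mode='clip')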


jnothman commented Aug 9, 2018 via email


rth commented Aug 9, 2018

I can reproduce the 64-bit indexing issue you see in np.take on 32-bit Linux. Opened an issue upstream: numpy/numpy#11696

What we could do is just use the previous behavior (and skip the added tests) on 32-bit systems. Here is the kind of code that could be used for detection (and for skipping tests):
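(The exact snippet originally posted here is not preserved on this page; a sketch of one way to do the detection, mirroring the _IS_32BIT flag used in the diff further down:)

import struct

# True on 32-bit Python builds, where pointers (and np.intp) are 4 bytes.
_IS_32BIT = 8 * struct.calcsize("P") == 32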

@@ -852,7 +852,7 @@ def _sort_features(self, X, vocabulary):
         Returns a reordered matrix and modifies the vocabulary in place
         """
         sorted_features = sorted(six.iteritems(vocabulary))
-        map_index = np.empty(len(sorted_features), dtype=np.int32)
+        map_index = np.empty(len(sorted_features), dtype=X.indices.dtype)

eric-wieser:

Indices should be of dtype np.intp

gvacaliuc (Author):

Should we just add a check here that catches an improper conversion to int64s on a 32-bit arch? Or should we never be touching the index dtype at all and just raise when our np.intp would overflow?

if indptr[-1] > 2147483648:  # = 2**31
    if sp_version >= (0, 14):
        indices_dtype = np.int64
    else:
        raise ValueError(('sparse CSR array has {} non-zero '
                          'elements and requires 64 bit indexing, '
                          'which is unsupported with scipy {}. '
                          'Please upgrade to scipy >=0.14')
                         .format(indptr[-1], '.'.join(map(str, sp_version))))
else:
    indices_dtype = np.int32

eric-wieser:

I don't have enough background here - could you link to the PR that added 64-bit indexing to scipy?

At any rate, it seems to me that indices_dtype should always be np.intp - if that's not possible, then you should pick between np.int32 (on old scipy) and np.intp (on fixed scipy).
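For context, np.intp is numpy's pointer-sized integer type, which is why it is the natural dtype for array indices (a general numpy fact, not specific to this PR):

import numpy as np

# int32 on 32-bit builds, int64 on 64-bit builds; take() and fancy
# indexing convert index arrays to this dtype internally.
print(np.dtype(np.intp))           # e.g. int64 on a 64-bit machine
print(np.dtype(np.intp).itemsize)  # 8 on 64-bit builds, 4 on 32-bit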

@pv (Contributor) commented Aug 11, 2018:

The indices can be either int32 or int64; scipy.sparse doesn't care which (intp size does not matter). Both indices and indptr, however, need to be of the same dtype to avoid casts on each operation.

Of course, when you're constructing stuff manually, you'll also need to choose the dtype so that it can hold the items you are going to insert.

@@ -852,7 +852,7 @@ def _sort_features(self, X, vocabulary):
         Returns a reordered matrix and modifies the vocabulary in place
eric-wieser:

Unrelated, but: This seems to reorder X in place too.

gvacaliuc (Author):

Indeed it does -- I can update that when we decide what the desired behavior is.

Member:

That's intentional, I think, to reduce memory usage. Why would it be an issue?

eric-wieser:

It's an issue only because the documentation doesn't tell me that's going to happen.

Member:

This seems to reorder X in place too.

Agreed, we should add a note about it but in a separate PR.


rth commented Aug 14, 2018

Thanks a lot for your feedback @eric-wieser and @pv on this!

The indices can be either int32 or int64; scipy.sparse doesn't care which (intp size does not matter). Both indices and indptr, however, need to be of the same dtype to avoid casts on each operation.

That was my basic principle for #9147. I'll admit I have not thought too much about 32-bit architectures, beyond making the tests pass (which also run on 32-bit AppVeyor). It's possible that there is a more elegant solution (using np.intp etc).

In this scikit-learn context, few users would need to vectorize enough text data to produce 64-bit indexed sparse arrays: the resulting sparse array alone would take >30-50 GB of memory, which comes close to the inherent 64 GB RAM limit of 32-bit architectures (with PAE). So currently we support this use case only with recent scipy, Linux (or Windows with Python 3), and (possibly?) 64-bit architectures, and fail in all other cases. I don't think there is a need to support 32-bit architectures here.

On the other hand, we do need to support all OSes, Python versions, and 32-bit/64-bit architectures for smaller text datasets that produce 32-bit indexed sparse arrays.

So in the end, I am not sure what can/should be done here beyond what I proposed in #11295 (comment), but you might have better ideas.

@rth rth closed this Aug 14, 2018
@rth rth reopened this Aug 14, 2018

rth commented Aug 14, 2018

Sorry, accidental key combination... Updated the unfinished comment (#11295 (comment)) above.


rth commented Jan 3, 2019

Merged master in to resolve conflicts. @gvacaliuc please run a git pull before adding any more commits.

@jnothman (Member) left a comment:

Thanks!

Xs = CountVectorizer()._sort_features(X, vocabulary)

assert INDICES_DTYPE == Xs.indices.dtype

jnothman (Member):

I think PEP8 is complaining that a blank line contains whitespace here?


jnothman commented Jan 6, 2019

This should be a candidate for 0.20.3 (have we got a way to mark that?)

adrinjalali (Member):

Doesn't creating a milestone and putting it in there make sense? We can then backport them all I guess.


rth commented Jan 22, 2019

Thanks for your work on this @gvacaliuc

Don't support 64bit indices on 32bit platforms at all

I think it's acceptable not to support 64-bit sparse indices on 32-bit architectures (particularly if the alternative is some performance cost on all platforms).

The main 32-bit use cases I can think of (Raspberry Pi & embedded, 32-bit Windows, WebAssembly) are not typically used for processing tens of GB of sparse data.

gvacaliuc (Author):

Cool -- I'm going to add a check to see if we're on a 32-bit machine and restrict the added test to 64-bit machines, as @rth suggested. Should be able to get it pushed by EOD!
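A sketch of what that restriction could look like (the exact marker and reason string in the final commit may differ; this assumes _IS_32BIT is importable from sklearn.utils, as the diff below suggests):

import pytest
from sklearn.utils import _IS_32BIT


@pytest.mark.skipif(_IS_32BIT,
                    reason="np.take fails with int64 indices on 32-bit")
def test_sort_features_64bit_sparse_indices():
    ...  # build X with int64 indices, check _sort_features preserves them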

@jnothman changed the title from "[MRG] fixes an issue w/ large sparse matrix indices identified in #7762" to "[MRG] fixes an issue w/ large sparse matrix indices in CountVectorizer" on Jan 23, 2019
@rth (Member) left a comment:

Thanks @gvacaliuc! Looks good apart from a minor comment below.

@@ -961,12 +962,17 @@ def _count_vocab(self, raw_documents, fixed_vocab):
                              " contain stop words")

         if indptr[-1] > 2147483648:  # = 2**31
-            if sp_version >= (0, 14):
+            if sp_version >= (0, 14) and not _IS_32BIT:
@rth (Member) commented Jan 23, 2019:

Our current minimum supported scipy version is 0.17, so you can just remove the scipy version check and drop the last else clause.

gvacaliuc (Author):

cool, will update now!


gvacaliuc commented Jan 25, 2019

Let me know if there are any other changes I should make; it looks like the failing check is a code coverage issue. Since I didn't add much code, I think the only test I could add is to check that 32-bit Python raises the error, but that might require a very large vocab 😮


jnothman commented Jan 25, 2019 via email

jnothman (Member):

I'm going to take @rth's comment as an approval and merge this.

jnothman (Member):

Thanks @gvacaliuc!

jnothman (Member):

Wait... Lacks a changelog entry.

@gvacaliuc Please add an entry to the change log at doc/whats_new/v0.20.rst under 0.20.3. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors, if applicable) with :user:.

@rth (Member) left a comment:

LGTM (apart from the missing what's new entry).

gvacaliuc (Author):

OK, just added the change log entry. Let me know if I missed anything else 😄

@jnothman jnothman merged commit 5fc5c6e into scikit-learn:master Jan 30, 2019
jnothman (Member):

Thanks @gvacaliuc

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jan 30, 2019
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 6, 2019
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 7, 2019
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Feb 19, 2019
@jnothman mentioned this pull request Feb 19, 2019
xhluca pushed commits to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019