
[MRG] fixes an issue w/ large sparse matrix indices in CountVectorizer #11295

Merged: 9 commits into scikit-learn:master on Jan 30, 2019

Conversation

gvacaliuc (Contributor):

Reference Issues/PRs

#7762

What does this implement/fix? Explain your changes.

If a CountVectorizer's feature matrix ends up holding 2^31 or more values, it needs to use 64-bit indices / index pointers. If the features get sorted via a call to _sort_features, the new indices were hard-coded to be 32-bit. I've updated that call to make the new indices keep the same dtype as before.
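To make the invariant concrete, here is a minimal sketch (illustrative only, not the PR's actual test): np.take returns an array with the dtype of the source array, so map_index must inherit the dtype of X.indices instead of being hard-coded to int32.

import numpy as np
import scipy.sparse as sp

# Force 64-bit indices on a small empty matrix, as the test discussed
# below does; a real matrix with >= 2**31 values would need tens of GB.
X = sp.csr_matrix((5, 5), dtype=np.int64)
X.indices = X.indices.astype(np.int64, copy=False)
X.indptr = X.indptr.astype(np.int64, copy=False)

# Deriving map_index's dtype from X.indices preserves int64 through take().
map_index = np.arange(5, dtype=X.indices.dtype)  # identity remap
X.indices = map_index.take(X.indices, mode='clip')
assert X.indices.dtype == np.int64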

Any other comments?

The issue was broader than this one case, but I didn't identify any other manual modifications of sparse matrix indices / indptrs in the code. See #7762 for more details.

@gvacaliuc changed the title from "fixes an issue w/ manually sparse matrix indices identified in #7762" to "[MRG] fixes an issue w/ manually sparse matrix indices identified in #7762" on Jun 15, 2018
@rth (Member) left a comment:

Nice work on investigating! Overall I think it's the right fix.

If a CountVectorizer's feature matrix ends up holding 2^31 or more values, it needs to use 64-bit indices / index pointers

The case of 64-bit indptr was fixed in #9147, in which case indices are also 64-bit as a side effect, due to constraints of scipy CSR matrices, if I remember correctly (see the related discussion in #9147 (comment)). Needing 64-bit indices is not actually something that happens much - that would mean a vocabulary of ~2e9 tokens (roughly the total word count of English Wikipedia, but as unique tokens).

However, when we do use 64-bit indices/indptr, I think this change means one less memory copy due to dtype conversion, plus a more consistent output array, so it's all good.

See other comments below.


vec._sort_features(X, vocabulary)

assert_equal(INDEX_DTYPE, X.indices.dtype)
rth (Member):

assert INDEX_DTYPE == X.indices.dtype

will produce nicer error messages with pytest

gvacaliuc (Author):

good to know, will update

"great!": 2
}

vec._sort_features(X, vocabulary)
rth (Member):

According to the docstring, this returns the reordered matrix, so let's validate that instead:

Xs = vec._sort_features(X, vocabulary)

Just to be safe. Looking at the code, I agree that it does in-place modification of X, but there is no guarantee that it will continue to do that in the future.

gvacaliuc (Author):

Sure, that makes sense, I'll update.

# force indices and indptr to int64.
INDEX_DTYPE = np.int64
X.indices = X.indices.astype(INDEX_DTYPE, copy=False)
X.indptr = X.indptr.astype(INDEX_DTYPE, copy=False)
rth (Member):

Are you sure the copy flag has an impact here?

gvacaliuc (Author):

I am not sure -- I suppose it doesn't matter if it's a copy or not.
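For reference, a quick sketch of numpy's astype copy semantics (standard numpy behavior, not code from this PR): copy=False only avoids a copy when no conversion is needed.

import numpy as np

a = np.arange(3, dtype=np.int64)
assert a.astype(np.int64, copy=False) is a      # same dtype: no copy made
assert a.astype(np.int32, copy=False) is not a  # cast still forces a copy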

# If a count vectorizer has to store >= 2**31 count values, the sparse
# storage matrix has 64bit indices / indptrs. This requires ~2*8*2**31
# bytes of memory in practice, so we just test the method that would
# hypothetically fail.
rth (Member):

Maybe just say,
"""
Check that _sort_features preserves the indices dtype of sparse arrays with 64 bit indices
"""

gvacaliuc (Author):

Yep, I'll update it to something cleaner.

X = sparse.csr_matrix((5, 5), dtype=np.int64)

# force indices and indptr to int64.
INDEX_DTYPE = np.int64
rth (Member):

INDICES_DTYPE if we want to be consistent

gvacaliuc (Author):

Sure, will update.

@@ -1097,3 +1097,28 @@ def test_vectorizers_invalid_ngram_range(vec):
     if isinstance(vec, HashingVectorizer):
         assert_raise_message(
             ValueError, message, vec.transform, ["good news everyone"])


+@pytest.mark.parametrize("vec", [CountVectorizer()])
rth (Member):

I think there is no need to parametrize it, since we are only checking the CountVectorizer (which should be sufficient).

gvacaliuc (Author):

Yeah, I wasn't sure at first if we needed to test this with other vectorizers, but I don't think so.

gvacaliuc (Author):

Yes this will likely never happen in practice!

@jnothman (Member) left a comment:

CIs unhappy.

gvacaliuc (Author):

Looking into it!


rth commented Jul 22, 2018

It would be great to have this. @gvacaliuc any chance you could fix the CI (and resolve the conflict)? Thanks!

gvacaliuc (Author):

Hey there, sorry to ghost. I'll mess around with it tonight and see if I can figure it out.

* added test to show example of issue in scikit-learn#7762
* fixes error caused by manually manipulating sparse indices in
  `CountVectorizer`
gvacaliuc (Author):

Sorry, life's been busy lately. AppVeyor fails on 32-bit Python 2.7.8. IIUC this issue would never arise on a 32-bit arch, but I'll set up a test environment to figure out what's going on.

The strange thing is that this is the relevant bit of code:

        map_index = np.empty(len(sorted_features), dtype=X.indices.dtype)
        for new_val, (term, old_val) in enumerate(sorted_features):
            vocabulary[term] = new_val
            map_index[old_val] = new_val

        X.indices = map_index.take(X.indices, mode='clip')

And AppVeyor is failing on that last line with a type error (cannot cast int64 to int32). map_index and X.indices are clearly the same dtype, so my guess off the bat is that there's some issue with indexing the map_index array with 64-bit integers on a 32-bit architecture, and that under the hood numpy tries to cast whatever dtype the index array is to int32.
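A minimal sketch of the apparent failure (hypothetical reproduction; it only errors on 32-bit builds, where np.intp is int32):

import numpy as np

map_index = np.arange(10)
idx = np.array([1, 2, 3], dtype=np.int64)
# Fine on 64-bit; on 32-bit builds take() casts the index array to intp
# (int32) under the 'safe' casting rule and raises a TypeError.
map_index.take(idx, mode='clip')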


jnothman commented Aug 9, 2018 via email


rth commented Aug 9, 2018

I can reproduce the 64-bit indexing issue you see in np.take on 32-bit Linux. Opened an issue upstream: numpy/numpy#11696

What we could do is just use the previous behavior (and skip the added tests) on 32-bit systems. Here is the kind of code that could be used for detection (and for skipping tests):
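(The exact snippet originally posted here is not preserved on this page; a sketch of one way to do the detection, mirroring the _IS_32BIT flag used in the diff further down:)

import struct

# True on 32-bit Python builds, where pointers (and np.intp) are 4 bytes.
_IS_32BIT = 8 * struct.calcsize("P") == 32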

@@ -852,7 +852,7 @@ def _sort_features(self, X, vocabulary):
         Returns a reordered matrix and modifies the vocabulary in place
         """
         sorted_features = sorted(six.iteritems(vocabulary))
-        map_index = np.empty(len(sorted_features), dtype=np.int32)
+        map_index = np.empty(len(sorted_features), dtype=X.indices.dtype)

eric-wieser:

Indices should be of dtype np.intp

gvacaliuc (Author):

Should we just add a check here that catches an improper conversion to int64s on a 32-bit arch? Or should we never be touching the index dtype at all and just raise when our np.intp would overflow?

if indptr[-1] > 2147483648:  # = 2**31
    if sp_version >= (0, 14):
        indices_dtype = np.int64
    else:
        raise ValueError(('sparse CSR array has {} non-zero '
                          'elements and requires 64 bit indexing, '
                          'which is unsupported with scipy {}. '
                          'Please upgrade to scipy >=0.14')
                         .format(indptr[-1], '.'.join(map(str, sp_version))))
else:
    indices_dtype = np.int32

eric-wieser:

I don't have enough background here - could you link to the PR that added 64-bit indexing to scipy?

At any rate, it seems to me that indices_dtype should always be np.intp - if that's not possible, then you should pick between np.int32 (on old scipy) and np.intp (on fixed scipy).
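For context, np.intp is numpy's pointer-sized integer type, which is why it is the natural dtype for array indices (a general numpy fact, not specific to this PR):

import numpy as np

# int32 on 32-bit builds, int64 on 64-bit builds; take() and fancy
# indexing convert index arrays to this dtype internally.
print(np.dtype(np.intp))           # e.g. int64 on a 64-bit machine
print(np.dtype(np.intp).itemsize)  # 8 on 64-bit builds, 4 on 32-bit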

@pv (Contributor) commented Aug 11, 2018:

The indices can be either int32 or int64; scipy.sparse doesn't care which (intp size does not matter). Both indices and indptr, however, need to be of the same dtype to avoid casts on each operation.

Of course, when you're constructing stuff manually, you'll also need to choose the dtype so that it can hold the items you are going to insert.

@@ -852,7 +852,7 @@ def _sort_features(self, X, vocabulary):
         Returns a reordered matrix and modifies the vocabulary in place
eric-wieser:

Unrelated, but: This seems to reorder X in place too.

gvacaliuc (Author):

Indeed it does -- I can update that when we decide what the desired behavior is.

Member:

That's intentional, I think, to reduce memory usage. Why would it be an issue?

eric-wieser:

It's an issue only because the documentation doesn't tell me that's going to happen.

Member:

This seems to reorder X in place too.

Agreed, we should add a note about it but in a separate PR.


rth commented Aug 14, 2018

Thanks a lot for your feedback @eric-wieser and @pv on this!

The indices can be either int32 or int64; scipy.sparse doesn't care which (intp size does not matter). Both indices and indptr, however, need to be of the same dtype to avoid casts on each operation.

That was my basic principle for #9147. I'll admit I have not thought too much about 32-bit architectures, beyond making the tests pass (which also run on 32-bit AppVeyor). It's possible that there is a more elegant solution (using np.intp etc).

In this scikit-learn context, few users would need to vectorize enough text data to produce 64-bit indexed sparse arrays: the resulting sparse array alone would take >30-50 GB of memory, which comes close to the inherent 64 GB RAM limit of 32-bit architectures (with PAE). So currently we support this use case only with recent scipy, Linux (or Windows with Python 3), and (possibly?) 64-bit architectures, and fail in all other cases. I don't think there is a need to support 32-bit architectures here.

On the other hand, we do need to support all OSes, Python versions, and 32-bit/64-bit architectures for smaller text datasets that produce 32-bit indexed sparse arrays.

So in the end, I am not sure what can/should be done here beyond what I proposed in #11295 (comment), but you might have better ideas.

@rth rth closed this Aug 14, 2018
@rth rth reopened this Aug 14, 2018

rth commented Aug 14, 2018

Sorry, accidental key combination... Updated the unfinished comment (#11295 (comment)) above.


rth commented Jan 3, 2019

Merged master in to resolve conflicts. @gvacaliuc please run a git pull before adding any more commits.

@jnothman (Member) left a comment:

Thanks!

Xs = CountVectorizer()._sort_features(X, vocabulary)

assert INDICES_DTYPE == Xs.indices.dtype

jnothman (Member):

I think PEP8 is complaining that a blank line contains whitespace here?


jnothman commented Jan 6, 2019

This should be a candidate for 0.20.3 (have we got a way to mark that?)

adrinjalali (Member):

Doesn't creating a milestone and putting it in there make sense? We can then backport them all I guess.


rth commented Jan 22, 2019

Thanks for your work on this @gvacaliuc

Don't support 64bit indices on 32bit platforms at all

I think it's acceptable not to support 64-bit sparse indices on 32-bit architectures (particularly if the alternative is some performance cost on all platforms).

The main 32-bit use cases I can think of (Raspberry Pi & embedded, 32-bit Windows, WebAssembly) are not typically used for processing tens of GB of sparse data.

gvacaliuc (Author):

Cool -- I'm going to add a check to see if we're on a 32-bit machine and restrict the added test to 64-bit machines, as @rth suggested. Should be able to get it pushed by EOD!
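A sketch of what that restriction could look like (the exact marker and reason string in the final commit may differ; this assumes _IS_32BIT is importable from sklearn.utils, as the diff below suggests):

import pytest
from sklearn.utils import _IS_32BIT


@pytest.mark.skipif(_IS_32BIT,
                    reason="np.take fails with int64 indices on 32-bit")
def test_sort_features_64bit_sparse_indices():
    ...  # build X with int64 indices, check _sort_features preserves them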

@jnothman changed the title from "[MRG] fixes an issue w/ large sparse matrix indices identified in #7762" to "[MRG] fixes an issue w/ large sparse matrix indices in CountVectorizer" on Jan 23, 2019
@rth (Member) left a comment:

Thanks @gvacaliuc! Looks good apart from a minor comment below.

@@ -961,12 +962,17 @@ def _count_vocab(self, raw_documents, fixed_vocab):
                              " contain stop words")

         if indptr[-1] > 2147483648:  # = 2**31
-            if sp_version >= (0, 14):
+            if sp_version >= (0, 14) and not _IS_32BIT:
@rth (Member) commented Jan 23, 2019:

Our current minimum supported scipy version is 0.17, so you can just remove the scipy version check and drop the last else clause.

gvacaliuc (Author):

cool, will update now!


gvacaliuc commented Jan 25, 2019

Let me know if there are any other changes I should make; it looks like the failing check is a code coverage issue. Since I didn't add much code, I think the only test I could add is to check that 32-bit Python raises the error, but that might require a very large vocab 😮


jnothman commented Jan 25, 2019 via email

jnothman (Member):

I'm going to take @rth's comment as an approval and merge this.

jnothman (Member):

Thanks @gvacaliuc!

jnothman (Member):

Wait... Lacks a changelog entry.

@gvacaliuc Please add an entry to the change log at doc/whats_new/v0.20.rst under 0.20.3. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors, if applicable) with :user:.

@rth (Member) left a comment:

LGTM (apart from the missing what's new entry).

gvacaliuc (Author):

OK, just added the change log entry. Let me know if I missed anything else 😄

@jnothman jnothman merged commit 5fc5c6e into scikit-learn:master Jan 30, 2019
jnothman (Member):

Thanks @gvacaliuc

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jan 30, 2019
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 6, 2019
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 7, 2019
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Feb 19, 2019
@jnothman mentioned this pull request Feb 19, 2019
xhluca pushed commits to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019