[WIP] Text vectorizers memory usage #5122
Conversation
…memory on X.sum_duplicates, build small temporary feature matrices and stack them together to create the full feature matrix with sizeof(X) + sizeof(temp_X) memory consumption.
…ed the feature matrix)
This reverts commit 554835a.
…emory at X.sum_duplicates(), build smaller temporary matrices and stack them together
…ot copied every time
Cool idea, but the tests aren't passing. I'm also not sure if yet another option on the vectorizers is a good idea; we might as well hardcode the chunk size.
…pty vocabulary exception before getting invalid shape from self._get_sparse_matrix_from_indices
The tests should work now. I didn't test with older scipy versions, and for scipy <= 0.11.0 you can't create empty csr matrices :/. I think it's fine to hardcode the chunk size. I could set it to 10K like the default now?
This is very similar in thought to #4968.
The vectorizer from #4968 is now included in the benchmark script and I have updated the results table. I think it's a far more straightforward solution to the problem of the values array getting big than my approach. But I don't know whether it's also usable for HashingCountVectorizer.
I've added FastAupiffCountVectorizer, which is a little bit faster, to the test script and updated the CountVectorizer in my pull request accordingly. To my surprise, the dict approach is faster than this:

```python
for doc in raw_documents:
    doc_indices = _make_int_array()
    for feature in analyze(doc):
        try:
            doc_indices.append(vocabulary[feature])
        except KeyError:
            # Ignore out-of-vocabulary items for fixed_vocab=True
            continue
    # sum duplicates for this document's features
    doc_indices = np.frombuffer(doc_indices, dtype=np.intc)
    doc_indices.sort()
    diff = np.concatenate(([1], np.diff(doc_indices)))
    idx = np.concatenate((np.where(diff)[0], [len(doc_indices)]))
    # append the document's unique indices and their counts to the global arrays
    j_indices.extend(doc_indices[idx[:-1]])
    values.extend(np.diff(idx))
    indptr.append(len(j_indices))
```
So is there any update on this issue?
@jnothman yes.
@jnothman Has this issue been addressed?
It seems so, @yenchenlin1994
This update corresponds to the FastAupiffCountVectorizer implementation from PR scikit-learn#5122
Partially superseded by #7272. Please open a new PR for …
Hi,
the other day I tried to vectorize some big text data and wondered why
sklearn would use that much memory. So I looked into it and I think I've found
a few problem spots.
For testing purposes I used the Enron dataset. To reproduce my results without
having to modify sklearn, you can use this little script (now including the vectorizer
from #4968). Just unpack the Enron dataset and start the script with:
Here's a little table showing the results. CountVectorizer and HashingVectorizer
are the unmodified vectorizers from sklearn. LeanCountVectorizer and
LeanHashingVectorizer are the new modified versions. AupiffCountVectorizer
is the modified version from #4968 inheriting from LeanCountVectorizer (the numbers
in parentheses are for the version inheriting from CountVectorizer). FastAupiffCountVectorizer
uses an explicit 'if else' instead of .get in the inner loop, .keys() and .values() instead
of iteritems, plus an explicit del(feature_counter) after adding a new doc.
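For illustration, here is a rough sketch of that per-document dict counting (not the literal PR code; feature_counter is the per-document counter, and analyze, vocabulary, j_indices, values and indptr are the same names used in _count_vocab above):

```python
for doc in raw_documents:
    feature_counter = {}
    for feature in analyze(doc):
        try:
            feature_idx = vocabulary[feature]
        except KeyError:
            # Ignore out-of-vocabulary items for fixed_vocab=True
            continue
        # explicit if/else instead of feature_counter.get(feature_idx, 0) + 1
        if feature_idx not in feature_counter:
            feature_counter[feature_idx] = 1
        else:
            feature_counter[feature_idx] += 1
    # .keys()/.values() instead of iteritems(); one entry per unique feature per doc
    j_indices.extend(feature_counter.keys())
    values.extend(feature_counter.values())
    indptr.append(len(j_indices))
    # free the per-document counter right away
    del feature_counter
```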
Why is the default CountVectorizer using 5 GB of memory to produce a
feature matrix that is only 1.15 GB in size when dumped to disk via joblib?
Maybe it's the vocabulary? Let's see:
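One simple way to check is to dump both objects with joblib and compare the file sizes (an illustrative sketch, not the original measurement; vect is assumed to be the fitted CountVectorizer and X its output):

```python
import os
import joblib  # shipped as sklearn.externals.joblib in the scikit-learn of that era

joblib.dump(vect.vocabulary_, 'vocab.pkl', compress=3)
joblib.dump(X, 'X.pkl', compress=3)
print('vocabulary on disk:     %.1f MB' % (os.path.getsize('vocab.pkl') / 2. ** 20))
print('feature matrix on disk: %.1f MB' % (os.path.getsize('X.pkl') / 2. ** 20))
```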
Well, it's not the vocabulary, it's the feature matrix. And as if
using 2.5 GB weren't bad enough, either X.sum_duplicates or
sp.csr_matrix is copying some data, because I saw 4.25 GB memory
usage even with a sys.exit(0) at the end of _count_vocab.
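For context, the tail of _count_vocab at the time looked roughly like this (paraphrased, not the literal source); the per-token values array, the csr_matrix construction and sum_duplicates all touch the full-size data at once:

```python
# j_indices and indptr are the per-token arrays built in the document loop above
j_indices = np.frombuffer(j_indices, dtype=np.intc)
indptr = np.frombuffer(indptr, dtype=np.intc)
values = np.ones(len(j_indices))          # one entry per token occurrence -> huge
X = sp.csr_matrix((values, j_indices, indptr),
                  shape=(len(indptr) - 1, len(vocabulary)),
                  dtype=self.dtype)       # construction casts/copies the arrays
X.sum_duplicates()                        # merges duplicate (row, column) entries
```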
In the LeanCountVectorizer I iterate chunk-wise over the documents,
create a temporary feature matrix for each chunk and then append this
temporary feature matrix to the final feature matrix in place.
So the big values array never exists in full, and data still gets copied,
but only for the small chunk-based feature matrices.
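A minimal sketch of the chunk-wise idea (count_chunk is a hypothetical helper that runs the usual counting on one slice of documents; the PR appends to the final matrix in place rather than using sp.vstack, which copies):

```python
import scipy.sparse as sp

chunk_size = 10000  # illustrative; see the chunk-size discussion above
X = None
for start in range(0, len(raw_documents), chunk_size):
    # build a small temporary feature matrix for this chunk only
    X_chunk = count_chunk(raw_documents[start:start + chunk_size])
    # the big per-token values array only ever exists for one chunk at a time
    X = X_chunk if X is None else sp.vstack((X, X_chunk), format='csr')
```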
After _count_vocab has returned the feature matrix, the method _sort_features
is called. Although it modifies the vocabulary in place as stated in
the docstring, it does not modify the feature matrix in place, but
makes a copy via fancy indexing (return X[:, map_index]). So I changed
the _sort_features method for LeanCountVectorizer to modify the feature
matrix in place. This operation is rather slow and is the main reason
why LeanCountVectorizer is slower than CountVectorizer. Maybe there's
a better way to swap the columns of the feature matrix.
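One alternative that avoids the full copy (not necessarily what the PR does) is to rewrite the column indices directly; the only temporary is then an array the size of X.indices, though the per-row indices have to be re-sorted afterwards:

```python
import numpy as np

def sort_features_inplace(X, vocabulary):
    """Sort the vocabulary alphabetically and remap X's column indices accordingly."""
    sorted_features = sorted(vocabulary.items())
    map_index = np.empty(len(sorted_features), dtype=X.indices.dtype)
    for new_val, (term, old_val) in enumerate(sorted_features):
        vocabulary[term] = new_val
        map_index[old_val] = new_val
    X.indices = map_index.take(X.indices)  # remap column ids without copying X.data
    X.sort_indices()                       # restore CSR canonical (sorted) column order
    return X
```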
And finally the _limit_features method also makes a full copy of the
feature matrix via fancy indexing, because its high and low arguments
are set to numerical values by default, so its 'if high is None and ...'
early-return condition is always false. In LeanCountVectorizer this is also changed.
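For reference, fit_transform computes those arguments roughly like this (paraphrased from the scikit-learn source of that era), so even the default max_df=1.0 / min_df=1 produce numbers, never None:

```python
n_doc = X.shape[0]
max_doc_count = (max_df if isinstance(max_df, numbers.Integral)
                 else max_df * n_doc)     # default max_df=1.0 -> n_doc
min_doc_count = (min_df if isinstance(min_df, numbers.Integral)
                 else min_df * n_doc)     # default min_df=1 -> 1
# high/low are never None here, so _limit_features always takes the
# X[:, kept_indices] fancy-indexing path and copies the whole matrix
X, self.stop_words_ = self._limit_features(X, vocabulary,
                                           max_doc_count, min_doc_count,
                                           max_features)
```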
The LeanHashingVectorizer only has the 'iterate in chunks over the documents'
modification. And since the values array then never gets as big, it uses
less memory.
best regards,
Jochen