[WIP] Text vectorizers memory usage #5122

Closed · wants to merge 18 commits

Conversation

@ephes (Contributor) commented Aug 15, 2015

Hi,

the other day I tried to vectorize some big text data and wondered why
sklearn would use that much memory. So I looked into it and think I've
found the main culprits.

For testing purposes I used the Enron dataset. To reproduce my results without
having to modify sklearn, you can use this little script (now including the
vectorizer from #4968). Just unpack the Enron dataset and start the script with:

time python test_enron.py ~/data/enron/maildir enron_features.dump

Here's a little table showing the results. CountVectorizer and HashingVectorizer
are the unmodified vectorizers from sklearn. LeanCountVectorizer and
LeanHashingVectorizer are the new modified versions. AupiffCountVectorizer
is the modified version from #4968 inheriting from LeanCountVectorizer (numbers
for inheriting from CountVectorizer in parentheses). FastAupiffCountVectorizer
uses an explicit if/else instead of dict.get() in the inner loop, .keys() and
.values() instead of iteritems(), and an explicit del feature_counter after
adding each document (see the sketch after the table):

Vectorizer                  after 300K docs   after 510K docs   peak               time
CountVectorizer             0.43 GB           0.84 GB           5.14 GB            237.24 s
LeanCountVectorizer         1.08 GB           1.41 GB           1.87 GB            268.19 s
AupiffCountVectorizer       0.41 GB           0.80 GB           1.86 GB (3.3 GB)   287.43 s (280.02 s)
FastAupiffCountVectorizer   0.27 GB           0.43 GB           1.69 GB            253.92 s
HashingVectorizer           1.46 GB           2.75 GB           2.82 GB            243.02 s
LeanHashingVectorizer       1.28 GB           1.55 GB           1.64 GB            232.62 s
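
For reference, here is a minimal sketch of the per-document counting loop
described above (simplified, not the exact PR code; analyze, vocabulary,
j_indices, values and indptr are the surrounding names from _count_vocab):

    # A plain dict counts term frequencies per document, so duplicate
    # (row, column) entries never reach the CSR buffers at all.
    for doc in raw_documents:
        feature_counter = {}
        for feature in analyze(doc):
            try:
                feature_idx = vocabulary[feature]
            except KeyError:
                continue  # out-of-vocabulary term for a fixed vocabulary
            # explicit if/else instead of feature_counter.get(feature_idx, 0) + 1
            if feature_idx in feature_counter:
                feature_counter[feature_idx] += 1
            else:
                feature_counter[feature_idx] = 1
        j_indices.extend(feature_counter.keys())
        values.extend(feature_counter.values())
        indptr.append(len(j_indices))
        del feature_counter  # release the per-document dict immediately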

Why does the default CountVectorizer use 5 GB of memory to produce a
feature matrix that is only 1.15 GB in size when dumped to disk via joblib?
Maybe it's the vocabulary? Let's see:

The feature matrix consists of the numpy arrays indices, indptr and values:
indices size:  835.65 MB
indptr size :    2.07 MB
values size : 1671.23 MB

Size of the feature matrix before X.sum_duplicates() is called:
2509 MB

Size of the feature matrix after X.sum_duplicates() was called:
1147 MB

vocabulary size: 100.66 MB

Well, it's not the vocabulary, it's the feature matrix. And as if
using 2.5 GB weren't bad enough, either X.sum_duplicates or
sp.csr_matrix is copying some data, because I saw 4.25 GB of memory
usage even with a sys.exit(0) at the end of _count_vocab.
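
The duplicate overhead is easy to reproduce in isolation. Here is a small
self-contained example (toy data, not the Enron numbers) showing that a
csr_matrix built from per-token entries keeps the duplicates until
sum_duplicates merges them:

    import numpy as np
    import scipy.sparse as sp

    # Two documents, three features, one entry per *token*, so column
    # indices repeat within a row, like the buffers _count_vocab builds.
    indptr = np.array([0, 4, 6])
    j_indices = np.array([0, 1, 0, 1, 2, 2], dtype=np.intc)
    values = np.ones(6, dtype=np.intc)

    X = sp.csr_matrix((values, j_indices, indptr), shape=(2, 3))
    nbytes = lambda m: m.data.nbytes + m.indices.nbytes + m.indptr.nbytes
    before = nbytes(X)
    X.sum_duplicates()  # merges duplicate (row, column) entries in place
    print(before, nbytes(X))  # the data/indices arrays shrink afterwards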

In LeanCountVectorizer I iterate chunk-wise over the documents,
create a temporary feature matrix for each chunk, and then append that
temporary matrix to the final feature matrix in place. So the big
values array never exists all at once; data still gets copied, but
only for the small chunk-based feature matrices.
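
The in-place append can be sketched roughly like this (a simplified
illustration, not the actual LeanCountVectorizer code; append_csr_rows is a
hypothetical helper):

    import numpy as np
    import scipy.sparse as sp

    def append_csr_rows(X, X_chunk):
        # Grow X by the rows of X_chunk by concatenating the three
        # underlying arrays directly. This still copies, but only
        # already-deduplicated data; the huge per-token values array
        # for the full corpus never exists.
        data = np.concatenate((X.data, X_chunk.data))
        indices = np.concatenate((X.indices, X_chunk.indices))
        indptr = np.concatenate((X.indptr, X_chunk.indptr[1:] + X.indptr[-1]))
        shape = (X.shape[0] + X_chunk.shape[0], max(X.shape[1], X_chunk.shape[1]))
        return sp.csr_matrix((data, indices, indptr), shape=shape)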

After _count_vocab returns the feature matrix, the method _sort_features
is called. Although it modifies the vocabulary in place, as stated in
its docstring, it does not modify the feature matrix in place, but
makes a copy via fancy indexing (return X[:, map_index]). So I changed
the _sort_features method of LeanCountVectorizer to modify the feature
matrix in place. This operation is rather slow and is the main reason
why LeanCountVectorizer is slower than CountVectorizer. Maybe there's
a better way to swap the columns of the feature matrix.
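
One candidate for a cheaper swap: since only the columns move, it should be
enough to remap the indices array of the CSR matrix instead of fancy-indexing
the whole matrix. A sketch (a hypothetical sort_features_inplace, not the
PR's code; the take-based remap itself is standard numpy):

    import numpy as np

    def sort_features_inplace(X, vocabulary):
        # Sort the vocabulary alphabetically and remap X's column indices
        # in place, instead of copying the matrix via X[:, map_index].
        sorted_features = sorted(vocabulary.items())
        map_index = np.empty(len(sorted_features), dtype=X.indices.dtype)
        for new_val, (term, old_val) in enumerate(sorted_features):
            vocabulary[term] = new_val
            map_index[old_val] = new_val
        X.indices = map_index.take(X.indices, mode="clip")
        # Column indices within each row are no longer sorted; tell scipy.
        X.has_sorted_indices = False
        return X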

And finally, the _limit_features method also makes a full copy of the
feature matrix via fancy indexing, because its high and low arguments
are set to numerical values by default, so the 'if high is None and ...'
condition is always false. LeanCountVectorizer changes this as well.
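
A minimal sketch of that change (hypothetical signature, for illustration
only): treat None as "no limit" and return early, so the fancy-indexing copy
is skipped when nothing is actually filtered:

    def _limit_features(X, vocabulary, high=None, low=None, limit=None):
        # With numeric defaults for high/low the old early-exit check
        # never fired; normalizing no-ops to None makes it trigger.
        if high is None and low is None and limit is None:
            return X, set()
        ...  # build the term mask and copy only when actually filtering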

The LeanHashingVectorizer has only the 'iterate in chunks over the
documents' modification. And since its values array does not grow as
large, it uses less memory.

Best regards,
Jochen

@larsmans changed the title from "Text vectorizers memory usage" to "[WIP] Text vectorizers memory usage" on Aug 15, 2015
@larsmans (Member)

Cool idea, but the tests aren't passing. I'm also not sure if yet another option on the vectorizers is a good idea; we might as well hardcode the chunk size.

(commit) …pty vocabulary exception before getting invalid shape from self._get_sparse_matrix_from_indices
@ephes (Contributor, Author) commented Aug 15, 2015

The tests should work now. I didn't test with older scipy versions, and with scipy <= 0.11.0 you can't create empty csr matrices :/

I think it's fine to hardcode the chunk size. Should I set it to 10K, like the current default?

@jnothman (Member)

This is very similar in thought to #4968.

@ephes (Contributor, Author) commented Aug 16, 2015

The vectorizer from #4968 is now included in the benchmark script, and I have updated the results table. I think it's a far more straightforward solution to the problem of the values array growing large than my approach. But I don't know whether it's also usable for the HashingVectorizer.

@ephes (Contributor, Author) commented Aug 17, 2015

I've added FastAupiffCountVectorizer, which is a little bit faster, to the test script and updated the CountVectorizer in my pull request accordingly. To my surprise, the dict approach is faster than a manual sum_duplicates on each document:

        for doc in raw_documents:
            doc_indices = _make_int_array()  # array.array('i') of column indices
            for feature in analyze(doc):
                try:
                    doc_indices.append(vocabulary[feature])
                except KeyError:
                    # Ignore out-of-vocabulary items for fixed_vocab=True
                    continue

            # sum duplicates for this doc's features: after sorting, runs
            # of equal indices show up as zeros in the first difference
            doc_indices = np.frombuffer(doc_indices, dtype=np.intc)
            doc_indices.sort()
            diff = np.concatenate(([1], np.diff(doc_indices)))
            # start position of each run, plus a sentinel at the end
            idx = np.concatenate((np.where(diff)[0], [len(doc_indices)]))

            # append doc to global indices and values: one entry per
            # unique index, with the run length as the count
            j_indices.extend(doc_indices[idx[:-1]])
            values.extend(np.diff(idx))

            indptr.append(len(j_indices))

@jnothman (Member)

I'd be very happy to see FastAupiffCountVectorizer merged ASAP, with or without benchmarking the concern of #5306, whether by @aupiff or @ephes. Who's up for a clean PR doing that alone? HashingVectorizer can come later.

@fayeshine (Contributor)

So, is there any update on this issue?

@jnothman (Member) commented Nov 1, 2015

Shall I assume that both @aupiff and @ephes are no longer working on this, and we should open it to another contributor?

@aupiff commented Nov 4, 2015

@jnothman yes.

@yenchenlin (Contributor)

@jnothman Has this issue been addressed?
May I take a stab at it?

@jnothman (Member) commented Feb 6, 2016

It seems so, @yenchenlin1994

rth added a commit to rth/scikit-learn that referenced this pull request on Aug 27, 2016:
"This update corresponds to the FastAupiffCountVectorizer implementation from PR scikit-learn#5122"
@jnothman closed this Sep 15, 2016

@jnothman (Member)

Partially superseded by #7272. Please open a new PR for HashingVectorizer optimisations.
