[WIP] Text vectorizers memory usage #5122

Closed · wants to merge 18 commits

Conversation

@ephes (Contributor) commented Aug 15, 2015

Hi,

the other day I tried to vectorize some big text data and wondered why
sklearn would use that much memory. So I looked into it and think I've
found the main culprits.

For testing purposes I used the Enron dataset. To reproduce my results without
having to modify sklearn, you can use this little script (now including the
vectorizer from #4968). Just unpack the Enron dataset and start the script with:

time python test_enron.py ~/data/enron/maildir enron_features.dump

Here's a little table showing the results. CountVectorizer and HashingVectorizer
are the unmodified vectorizers from sklearn. LeanCountVectorizer and
LeanHashingVectorizer are the new modified versions. AupiffCountVectorizer
is the modified version from #4968 inheriting from LeanCountVectorizer (numbers
for inheriting from CountVectorizer in parentheses). FastAupiffCountVectorizer
uses an explicit if/else instead of dict.get() in the inner loop, .keys() and
.values() instead of iteritems(), and an explicit del feature_counter after
adding each document (see the sketch after the table):

Vectorizer                  after 300K docs   after 510K docs   peak               time
CountVectorizer             0.43 GB           0.84 GB           5.14 GB            237.24 s
LeanCountVectorizer         1.08 GB           1.41 GB           1.87 GB            268.19 s
AupiffCountVectorizer       0.41 GB           0.80 GB           1.86 GB (3.3 GB)   287.43 s (280.02 s)
FastAupiffCountVectorizer   0.27 GB           0.43 GB           1.69 GB            253.92 s
HashingVectorizer           1.46 GB           2.75 GB           2.82 GB            243.02 s
LeanHashingVectorizer       1.28 GB           1.55 GB           1.64 GB            232.62 s
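
For reference, here is a minimal sketch of the per-document counting loop
described above (simplified, not the exact PR code; analyze, vocabulary,
j_indices, values and indptr are the surrounding names from _count_vocab):

    # A plain dict counts term frequencies per document, so duplicate
    # (row, column) entries never reach the CSR buffers at all.
    for doc in raw_documents:
        feature_counter = {}
        for feature in analyze(doc):
            try:
                feature_idx = vocabulary[feature]
            except KeyError:
                continue  # out-of-vocabulary term for a fixed vocabulary
            # explicit if/else instead of feature_counter.get(feature_idx, 0) + 1
            if feature_idx in feature_counter:
                feature_counter[feature_idx] += 1
            else:
                feature_counter[feature_idx] = 1
        j_indices.extend(feature_counter.keys())
        values.extend(feature_counter.values())
        indptr.append(len(j_indices))
        del feature_counter  # release the per-document dict immediately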

Why does the default CountVectorizer use 5 GB of memory to produce a
feature matrix that is only 1.15 GB in size when dumped to disk via joblib?
Maybe it's the vocabulary? Let's see:

The feature matrix consists of the numpy arrays indices, indptr and values:
indices size:  835.65 MB
indptr size :    2.07 MB
values size : 1671.23 MB

Size of the feature matrix before X.sum_duplicates() is called:
2509 MB

Size of the feature matrix after X.sum_duplicates() was called:
1147 MB

vocabulary size: 100.66 MB

Well, it's not the vocabulary, it's the feature matrix. And as if
using 2.5 GB weren't bad enough, either X.sum_duplicates or
sp.csr_matrix is copying some data, because I saw 4.25 GB of memory
usage even with a sys.exit(0) at the end of _count_vocab.
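
The duplicate overhead is easy to reproduce in isolation. Here is a small
self-contained example (toy data, not the Enron numbers) showing that a
csr_matrix built from per-token entries keeps the duplicates until
sum_duplicates merges them:

    import numpy as np
    import scipy.sparse as sp

    # Two documents, three features, one entry per *token*, so column
    # indices repeat within a row, like the buffers _count_vocab builds.
    indptr = np.array([0, 4, 6])
    j_indices = np.array([0, 1, 0, 1, 2, 2], dtype=np.intc)
    values = np.ones(6, dtype=np.intc)

    X = sp.csr_matrix((values, j_indices, indptr), shape=(2, 3))
    nbytes = lambda m: m.data.nbytes + m.indices.nbytes + m.indptr.nbytes
    before = nbytes(X)
    X.sum_duplicates()  # merges duplicate (row, column) entries in place
    print(before, nbytes(X))  # the data/indices arrays shrink afterwards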

In LeanCountVectorizer I iterate chunk-wise over the documents,
create a temporary feature matrix for each chunk, and then append that
temporary matrix to the final feature matrix in place. So the big
values array never exists all at once; data still gets copied, but
only for the small chunk-based feature matrices.
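
The in-place append can be sketched roughly like this (a simplified
illustration, not the actual LeanCountVectorizer code; append_csr_rows is a
hypothetical helper):

    import numpy as np
    import scipy.sparse as sp

    def append_csr_rows(X, X_chunk):
        # Grow X by the rows of X_chunk by concatenating the three
        # underlying arrays directly. This still copies, but only
        # already-deduplicated data; the huge per-token values array
        # for the full corpus never exists.
        data = np.concatenate((X.data, X_chunk.data))
        indices = np.concatenate((X.indices, X_chunk.indices))
        indptr = np.concatenate((X.indptr, X_chunk.indptr[1:] + X.indptr[-1]))
        shape = (X.shape[0] + X_chunk.shape[0], max(X.shape[1], X_chunk.shape[1]))
        return sp.csr_matrix((data, indices, indptr), shape=shape)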

After _count_vocab returns the feature matrix, the method _sort_features
is called. Although it modifies the vocabulary in place, as stated in
its docstring, it does not modify the feature matrix in place, but
makes a copy via fancy indexing (return X[:, map_index]). So I changed
the _sort_features method of LeanCountVectorizer to modify the feature
matrix in place. This operation is rather slow and is the main reason
why LeanCountVectorizer is slower than CountVectorizer. Maybe there's
a better way to swap the columns of the feature matrix.
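
One candidate for a cheaper swap: since only the columns move, it should be
enough to remap the indices array of the CSR matrix instead of fancy-indexing
the whole matrix. A sketch (a hypothetical sort_features_inplace, not the
PR's code; the take-based remap itself is standard numpy):

    import numpy as np

    def sort_features_inplace(X, vocabulary):
        # Sort the vocabulary alphabetically and remap X's column indices
        # in place, instead of copying the matrix via X[:, map_index].
        sorted_features = sorted(vocabulary.items())
        map_index = np.empty(len(sorted_features), dtype=X.indices.dtype)
        for new_val, (term, old_val) in enumerate(sorted_features):
            vocabulary[term] = new_val
            map_index[old_val] = new_val
        X.indices = map_index.take(X.indices, mode="clip")
        # Column indices within each row are no longer sorted; tell scipy.
        X.has_sorted_indices = False
        return X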

And finally, the _limit_features method also makes a full copy of the
feature matrix via fancy indexing, because its high and low arguments
are set to numerical values by default, so the 'if high is None and ...'
condition is always false. LeanCountVectorizer changes this as well.
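
A minimal sketch of that change (hypothetical signature, for illustration
only): treat None as "no limit" and return early, so the fancy-indexing copy
is skipped when nothing is actually filtered:

    def _limit_features(X, vocabulary, high=None, low=None, limit=None):
        # With numeric defaults for high/low the old early-exit check
        # never fired; normalizing no-ops to None makes it trigger.
        if high is None and low is None and limit is None:
            return X, set()
        ...  # build the term mask and copy only when actually filtering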

The LeanHashingVectorizer has only the 'iterate in chunks over the
documents' modification. And since its values array does not grow as
large, it uses less memory.

Best regards,
Jochen

@larsmans changed the title from "Text vectorizers memory usage" to "[WIP] Text vectorizers memory usage" on Aug 15, 2015
@larsmans (Member)

Cool idea, but the tests aren't passing. I'm also not sure if yet another option on the vectorizers is a good idea; we might as well hardcode the chunk size.

(commit) …pty vocabulary exception before getting invalid shape from self._get_sparse_matrix_from_indices
@ephes (Contributor, Author) commented Aug 15, 2015

The tests should work now. I didn't test with older scipy versions, and with scipy <= 0.11.0 you can't create empty csr matrices :/

I think it's fine to hardcode the chunk size. Should I set it to 10K, like the current default?

@jnothman (Member)

This is very similar in thought to #4968.

@ephes (Contributor, Author) commented Aug 16, 2015

The vectorizer from #4968 is now included in the benchmark script, and I have updated the results table. I think it's a far more straightforward solution to the problem of the values array growing large than my approach. But I don't know whether it's also usable for the HashingVectorizer.

@ephes (Contributor, Author) commented Aug 17, 2015

I've added FastAupiffCountVectorizer, which is a little bit faster, to the test script and updated the CountVectorizer in my pull request accordingly. To my surprise, the dict approach is faster than a manual sum_duplicates on each document:

        for doc in raw_documents:
            doc_indices = _make_int_array()  # array.array('i') of column indices
            for feature in analyze(doc):
                try:
                    doc_indices.append(vocabulary[feature])
                except KeyError:
                    # Ignore out-of-vocabulary items for fixed_vocab=True
                    continue

            # sum duplicates for this doc's features: after sorting, runs
            # of equal indices show up as zeros in the first difference
            doc_indices = np.frombuffer(doc_indices, dtype=np.intc)
            doc_indices.sort()
            diff = np.concatenate(([1], np.diff(doc_indices)))
            # start position of each run, plus a sentinel at the end
            idx = np.concatenate((np.where(diff)[0], [len(doc_indices)]))

            # append doc to global indices and values: one entry per
            # unique index, with the run length as the count
            j_indices.extend(doc_indices[idx[:-1]])
            values.extend(np.diff(idx))

            indptr.append(len(j_indices))

@jnothman (Member)

I'd be very happy to see FastAupiffCountVectorizer merged ASAP, with or without benchmarking the concern of #5306, whether by @aupiff or @ephes. Who's up for a clean PR doing that alone? HashingVectorizer can come later.

@fayeshine (Contributor)

So, is there any update on this issue?

@jnothman (Member) commented Nov 1, 2015

Shall I assume that both @aupiff and @ephes are no longer working on this, and we should open it to another contributor?

@aupiff commented Nov 4, 2015

@jnothman yes.

@yenchenlin (Contributor)

@jnothman Has this issue been addressed?
May I take a stab at it?

@jnothman (Member) commented Feb 6, 2016

It seems so, @yenchenlin1994

rth added a commit to rth/scikit-learn that referenced this pull request on Aug 27, 2016:
"This update corresponds to the FastAupiffCountVectorizer implementation from PR scikit-learn#5122"
@jnothman closed this Sep 15, 2016

@jnothman (Member)

Partially superseded by #7272. Please open a new PR for HashingVectorizer optimisations.
