
CountVectorizer Rewrite #1776

Closed · wants to merge 16 commits · 7 participants

@lqdc
lqdc commented Mar 14, 2013

Rewrote CountVectorizer's fit_transform. In my tests it performs about 40% faster; for example, here's the plot on my dataset of 50,000 English documents of 150-1000 words in length. It passes all the tests in sklearn/feature_extraction/tests/test_text.py.
[benchmark plot: countvect_1]

@larsmans larsmans and 1 other commented on an outdated diff Mar 14, 2013
sklearn/feature_extraction/text.py
@@ -449,7 +450,7 @@ class CountVectorizer(BaseEstimator, VectorizerMixin):
If you do not provide an a-priori dictionary and you do not use an analyzer
that does some kind of feature selection then the number of features will
- be equal to the vocabulary size found by analysing the data.
+ be equal to the vocabulary size found by analysing the data.
@larsmans
scikit-learn member
larsmans added a note Mar 14, 2013

Trailing space. Please run all code that is submitted through PEP8 and pyflakes.

@lqdc
lqdc added a note Mar 14, 2013

Alright, I'll do that.

@larsmans
scikit-learn member

How's the memory use? We just went through quite some trouble to bring it down six-fold.

@jaquesgrobler jaquesgrobler commented on the diff Mar 14, 2013
sklearn/feature_extraction/text.py
spmatrix = sp.coo_matrix((values, (i_indices, j_indices)),
shape=shape, dtype=self.dtype)
- if self.binary:
- spmatrix.data.fill(1)
+ # weird bug -- this doesn't work on some computers
@jaquesgrobler
scikit-learn member

If you have any useful outputs or info from encountering this bug, you could open up a separate issue so it doesn't get forgotten (unless you've done that already and I just can't see it :) ).

@lqdc
lqdc added a note Mar 14, 2013

I have to test it some more; on my machine, it doesn't seem to work in this case without the assignment. I just wanted to finish this first.

@larsmans
scikit-learn member
larsmans added a note Mar 14, 2013

But the binary conversion is now done elsewhere, right?

@lqdc
lqdc added a note Mar 14, 2013

I did the binary conversion in the fit_transform method instead. It works there for some reason.

@larsmans larsmans and 1 other commented on an outdated diff Mar 14, 2013
sklearn/feature_extraction/text.py
spmatrix = sp.coo_matrix((values, (i_indices, j_indices)),
shape=shape, dtype=self.dtype)
- if self.binary:
- spmatrix.data.fill(1)
+ # weird bug -- this doesn't work on some computers
+ # if self.binary:
+ # spmatrix = spmatrix.data.fill(1)
+ # print spmatrix.todense()
@larsmans
scikit-learn member
larsmans added a note Mar 14, 2013

This shouldn't work due to the assignment to spmatrix: data.fill(1) returns None, so spmatrix would be rebound to None.

@lqdc
lqdc added a note Mar 14, 2013

The assignment was done out of desperation. Here is what I'm getting from the unit tests after removing the assignment:
New code:

if self.binary:
    spmatrix.data.fill(1)
    print "self.binary", self.binary
    print spmatrix.todense()

(mismatch 20.0%)
x: array([[1, 1, 1, 0, 0],
          [1, 1, 0, 1, 1]])
y: array([[3, 1, 1, 0, 0],
          [1, 2, 0, 1, 1]])
-------------------- >> begin captured stdout << ---------------------
self.binary True
[[3 1 1 0 0]
 [1 2 0 1 1]]
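
(For context, a minimal standalone sketch, not from the PR, of what is probably happening here: COO stores duplicate (i, j) entries separately and only sums them on conversion, so binarizing the stored data before conversion gets undone, which matches the 3 and 2 in the output above.)

import numpy as np
import scipy.sparse as sp

# Three stored entries, two of them at the same position (0, 0).
rows = np.array([0, 0, 0])
cols = np.array([0, 0, 1])
vals = np.array([2, 1, 1])
m = sp.coo_matrix((vals, (rows, cols)), shape=(1, 2))

m.data.fill(1)      # binarize the *stored* entries...
print(m.todense())  # [[2 1]] -- the duplicates at (0, 0) are summed
                    # during conversion, undoing the binarization

Binarizing after the duplicates have been summed, as the fit_transform version now does, avoids this.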

@larsmans larsmans and 1 other commented on an outdated diff Mar 14, 2013
sklearn/feature_extraction/text.py
+ for colptr in xrange(len(cscmatrix.indptr)-1):
+ len_data_slice = len(cscmatrix.data[cscmatrix.indptr[colptr]:cscmatrix.indptr[colptr+1]])
+ if len_data_slice <= high and len_data_slice >= low:
+ kept_indices.append(colptr)
+ else:
+ removed_indices.add(colptr)
+ s_kept_indices = set(kept_indices)
+ new_mapping = dict((v,i) for i,v in enumerate(kept_indices))
+ feature_to_position = dict((k,new_mapping[v]) for k,v in six.iteritems(feature_to_position) if v in s_kept_indices)
+ return cscmatrix[:,kept_indices], feature_to_position, removed_indices
+
+ def _get_max_features(self, csc_m, feature_to_position, stop_words_, max_features):
+ '''cuts only the top max_features from the matrix and feature_to_position
+ cut features are added to stop_words_
+ '''
+ a = np.argsort(np.squeeze(np.asarray(csc_m.sum(axis=0))))[::-1][:max_features]
@larsmans
scikit-learn member
larsmans added a note Mar 14, 2013

This expression is far too complicated and the variable names are too short. I suggest something like:

feature_freqs = np.asarray(csc_m.sum(axis=0)).ravel()
to_keep = np.argsort(feature_freqs)[::-1][:max_features]
@lqdc
lqdc added a note Mar 14, 2013

Sounds good.
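
(A rough sketch, mine rather than code from this PR, of how both helpers could be simplified along those lines; the names are illustrative, and the per-column document frequencies come from np.diff on indptr instead of repeated slicing:)

import numpy as np

def remove_high_and_low(csc_m, vocabulary, high, low):
    # Document frequency of each feature = non-zeros per CSC column.
    dfs = np.diff(csc_m.indptr)
    kept = np.where((dfs >= low) & (dfs <= high))[0]
    removed = set(np.where((dfs < low) | (dfs > high))[0])
    new_index = dict((old, new) for new, old in enumerate(kept))
    vocabulary = dict((term, new_index[col])
                      for term, col in vocabulary.items()
                      if col in new_index)
    return csc_m[:, kept], vocabulary, removed

def top_k_feature_indices(csc_m, max_features):
    # Total corpus count of each feature, highest counts first.
    feature_freqs = np.asarray(csc_m.sum(axis=0)).ravel()
    return np.argsort(feature_freqs)[::-1][:max_features]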

@larsmans larsmans and 1 other commented on an outdated diff Mar 14, 2013
sklearn/feature_extraction/text.py
-
- # the term_counts and document_counts might be useful statistics, are
- # we really sure want we want to drop them? They take some memory but
- # can be useful for corpus introspection
- return self._term_counts_to_matrix(n_doc, i_indices, j_indices, values)
+ csc_m, feature_to_position, stop_words_ = \
+ self._remove_highandlow(
+ csc_m, feature_to_position,
+ max_doc_count, min_doc_count)
+ if max_features and max_features < csc_m.shape[1]:
+ csc_m, feature_to_position, stop_words_ = \
+ self._get_max_features(
+ csc_m, feature_to_position, stop_words_, max_features)
+ self.stop_words_ = stop_words_
+ self.vocabulary_ = feature_to_position
+ return csc_m.tocoo()
@larsmans
scikit-learn member
larsmans added a note Mar 14, 2013

There's, AFAIC, no strict requirement to return a COO matrix. In fact, CSC is preferable because some estimators actually want that format.

@lqdc
lqdc added a note Mar 14, 2013

I thought so too, but the current implementations of inverse_transform and possibly other functions do things very inefficiently if the matrix is not CSR, and they don't convert to CSR. For example, in inverse_transform:

return [inverse_vocabulary[X[i, :].nonzero()[1]].ravel()
        for i in range(n_samples)]
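
(A hedged aside: row slicing like X[i, :] is cheap on CSR but costly on CSC, and plain COO does not support indexing at all, so a one-off defensive conversion would sidestep the inefficiency, e.g.:)

def rows_nonzero(X, inverse_vocabulary, n_samples):
    # Convert once up front; each X[i, :] is then cheap on CSR.
    X = X.tocsr()
    return [inverse_vocabulary[X[i, :].nonzero()[1]].ravel()
            for i in range(n_samples)]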
@larsmans
scikit-learn member
larsmans added a note Mar 14, 2013

inverse_transform is a convenience function for model inspection and not really crucial to learning. COO is currently produced because it was a good tradeoff between speed and ease of implementation, but if you can produce a better format such as CSR or CSC, by all means, return that.

@lqdc
lqdc commented Mar 14, 2013

Memory use seems to be lower than before the update.

@larsmans
scikit-learn member

That's great news. I've yet to do a proper review (don't hold your breath), and I'd like feedback from @ogrisel before merging, but in principle I'm all for this change.

@lqdc
lqdc commented Mar 14, 2013

What is going on with the documentation build? It keeps throwing errors. I thought it was because of the additional sorting option.

@larsmans
scikit-learn member

Expected:

CountVectorizer(analyzer=u'word', binary=False, charset=u'utf-8',
        charset_error=u'strict', dtype=<type 'numpy.int64'>,
        input=u'content', lowercase=True, max_df=1.0, max_features=None,
        min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None, sort_features=True)

Got:

CountVectorizer(analyzer=u'word', binary=False, charset=u'utf-8',
        charset_error=u'strict', dtype=<type 'numpy.int64'>,
        input=u'content', lowercase=True, max_df=1.0, max_features=None,
        min_df=1, ngram_range=(1, 1), preprocessor=None,
        sort_features=True, stop_words=None, strip_accents=None,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None,
        vocabulary=None)

It's the order in which attributes appear in the repr. Nose is quite strict about that.

@mblondel
scikit-learn member

Can you summarize the main idea of your optimization? Would it be possible to output a CSR matrix instead of CSC?

Also, I would add a sparse_format constructor argument to let the user choose the sparse format. The default value could be COO for maintaining backward compatibility.

@larsmans
scikit-learn member

I don't really like the idea of putting even more options on CountVectorizer...

@lqdc
lqdc commented Mar 15, 2013

I guess the easiest way to return csr would be to call

csc_m.tocsr()

at the end, which would take very little time compared to everything else.

The main thing was taking advantage of the fact that when constructing a COO matrix, duplicates are implicitly summed, so the addition is handled by scipy instead of a Counter (toy illustration below). The other part was that instead of updating various dictionaries and sets, feature pruning was done using the pre-constructed sparse matrix.
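
(A toy illustration of that idea, with made-up term indices; coo_matrix sums the duplicate (document, term) pairs during conversion, so no explicit counting is needed:)

import numpy as np
import scipy.sparse as sp

# Hypothetical analyzer output: term indices per document.
docs = [[0, 1, 0, 2], [1, 3]]

i_indices = [i for i, doc in enumerate(docs) for _ in doc]
j_indices = [j for doc in docs for j in doc]
values = np.ones(len(j_indices), dtype=np.int64)

counts = sp.coo_matrix((values, (i_indices, j_indices)),
                       shape=(len(docs), 4)).tocsc()
print(counts.toarray())
# [[2 1 1 0]
#  [0 1 0 1]]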

The matrix was converted to CSC because features are on the columns, not the rows. Of course, we could have worked on rows instead and transposed at the end, but that wouldn't gain much speed and would cost readability.

Also, I think it is good to decide on a common matrix type to return for all the methods, or make it a point to explicitly check what type of sparse matrix is being passed to each method.

@mblondel
scikit-learn member

It's just that the CSR format is more commonly used, so being able to output CSR directly without conversion is nice. However, CountVectorizer is for smallish datasets, so requiring the conversion is acceptable (thankfully, HashingVectorizer outputs CSR directly).

@mblondel
scikit-learn member

BTW, thanks for the explanation. You could add a description of the general idea as a comment in the source.

@mblondel
scikit-learn member

I don't really like the idea of putting even more options on CountVectorizer...

My only worry is that the current PR breaks user code which assumes COO output (probably unlikely, though).

@larsmans
scikit-learn member

Some scipy.sparse conversions already have a copy argument, but apparently CSC->CSR doesn't...

@mblondel
scikit-learn member

I think that the only conversion methods with a copy argument are the ones for converting a format to itself.

@lqdc
lqdc commented Mar 15, 2013

Alright so did we decide on CSC, CSR or COO?

@larsmans
scikit-learn member

I'm pro whatever is built up internally, leaving the conversion to subsequent modules. If you do a COO->CSR conversion and the next module in a Pipeline transforms the data to CSC, we're wasting time.

(Actually COO's tocsr and tocsc have a copy argument.)

@lqdc
lqdc commented Mar 15, 2013

So the advantage of returning COO is that it doesn't break existing code and that it can be converted to CSC and CSR in place in a very short time. Advantage of CSC is that it wouldn't require conversion in downstream algorithms if they use CSC. The advantage of CSR is that most downstream algos use CSR. I think COO might be best in this case.

@larsmans
scikit-learn member

But in your current setup, you'd still need a CSC->COO transformation, right?

@lqdc
lqdc commented Mar 15, 2013

Right, so it would be one conversion less in most cases if it's CSC, but it would break existing code. That probably isn't a big deal in most situations, because COO is not very versatile and would be converted to something else anyway. Alright, I'm leaving it as CSC then.

@mblondel
scikit-learn member

(Actually COO's tocsr and tocsc have a copy argument.)

Doesn't seem quite right: https://github.com/scipy/scipy/blob/v0.11.0/scipy/sparse/coo.py#L282
Or am I looking at the wrong place?

@lqdc
lqdc commented Mar 15, 2013

Yeah, you're right. So it's not in-place except for CSC -> COO: https://github.com/scipy/scipy/blob/v0.11.0/scipy/sparse/compressed.py#L534
Anyway, CSC then seems like the best choice, because it's the only way to potentially avoid copying the data. I'm leaving it as CSC.

@ogrisel
scikit-learn member
ogrisel commented Mar 15, 2013

I think we should keep CSC as the default output layout if it's the natural internal representation used by the vectorizer implementation. We could add an output_layout constructor parameter that would default to "csr", and the transform method would call tocsr() automatically at the end when set to output_layout="csr".

@lqdc
lqdc commented Mar 15, 2013

You mean default to "csc"?
The other problem is that if we don't explicitly mention that it's CSC, or don't provide an option to change it, it might be confusing for some people, because the transform method returns COO.

@ogrisel ogrisel commented on an outdated diff Mar 15, 2013
sklearn/feature_extraction/text.py
+ def _sort_features_and_matrix(self, cscmatrix, feature_to_position):
+ '''sort dict by keys and assign values to the sorted key index'''
+ sorted_by_name_dict = {}
+ sorted_features = sorted(feature_to_position)
+ new_positions = []
+ for i, k in enumerate(sorted_features):
+ sorted_by_name_dict[k] = i
+ new_positions.append(feature_to_position[k])
+ return cscmatrix[:, new_positions], sorted_by_name_dict
+
+ def _remove_highandlow(self, cscmatrix,
+ feature_to_position, high, low):
+ '''removes features that are in more documents
+ than high and less documents than low
+ does not remove documents with zero features
+ '''
@ogrisel
scikit-learn member
ogrisel added a note Mar 15, 2013

Cosmetics: could you align the style of the docstrings to the format used in the rest of the library? PEP257: http://www.python.org/dev/peps/pep-0257/

In this case, that would be:

    def _remove_highandlow(self, cscmatrix, feature_to_position, high, low):
        """Remove too rare or too common features

        Prune features that are non-zero in more samples than high or fewer
        documents than low.

        This does not prune samples with zero features.

        """
@ogrisel
scikit-learn member
ogrisel commented Mar 15, 2013

Yeah, default to "csc" sorry.

@mblondel
scikit-learn member

We could add an output_layout constructor parameter

Such an option would make sense if we defaulted to COO to keep backward compatibility, but if we use CSC by default it is not very useful: calling tocsr() or tocoo() is just one line in user code.
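
(For instance, in hypothetical user code:)

X = vectorizer.fit_transform(docs)  # CSC under this PR
X = X.tocoo()                       # one line to restore the old format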

@mblondel
scikit-learn member

Given that CountVectorizer is not for large scale data anyway, I'm ok with keeping CSC as output format and not adding a new constructor option.

@ogrisel
scikit-learn member
ogrisel commented Mar 17, 2013

Such an option would make sense if we defaulted to COO to keep backward compatibility, but if we use CSC by default it is not very useful: calling tocsr() or tocoo() is just one line in user code.

But you cannot insert that call in a Pipeline; arguably, though, the models downstream will do the conversion automatically anyway.

@vene
scikit-learn member
vene commented Mar 18, 2013

I think user code shouldn't expect a specific sparse matrix format, and our estimators certainly shouldn't -- they should use the check_arrays-style functions. IMHO it's fine to return the natural format and avoid extra complications and sources of inconsistency (why does transformer X have this parameter but Y doesn't?).

@lqdc lqdc closed this Mar 22, 2013
@lqdc lqdc reopened this Mar 22, 2013
@lqdc
lqdc commented Mar 22, 2013

I have made some major modifications since then, because I parallelized it. Should I submit them as part of this pull request or make a separate one?

@larsmans
scikit-learn member

Please make it a separate PR; then we can choose to pull this one and later decide if we want the parallel version.

(Sorry to keep you waiting, it looks like everybody is really busy right now.)

@larsmans
scikit-learn member

If nobody objects, I'd like to merge this. One last remark: if not sorting the features is only 2% slower, then I'd rather simplify the API by removing the sort_features option. TfidfVectorizer already has a large number of options.

@lqdc
lqdc commented Apr 17, 2013

It looks like it depends a lot on the number of features, because sorting is O(n log n) in the number of features:
[benchmark plots: ngram_1, ngram_12]

I should add the sort_features argument to TfidfVectorizer.

@larsmans
scikit-learn member

Thanks for the quick research. I still feel we shouldn't add an extra argument at this point, for the sake of support: CountVectorizer is something that new users are going to use and ask questions about, so the fewer options it has, the better. Also, the speedup is still significant, hats off to you.

@lqdc
lqdc commented Apr 18, 2013

[benchmark plot: countvect_1]

It is literally 2x faster without sorting, though! The difference would be even bigger with a larger ngram range.
Why not just get rid of sorting altogether?

@larsmans
scikit-learn member

Sure, you can find a setting for ngram_range that makes it really, really slow. But such settings are a bad idea anyway as they will tend to overfit. @ogrisel, @mblondel, do we want this sorting?

@mblondel
scikit-learn member

As long as one can retrieve which column a word corresponds to, I think it's OK if the features are not sorted... Any compelling argument that they should be?

@larsmans
scikit-learn member

I honestly don't care, I just don't want the extra option. AFAIC, sorting can be left to the user.

@lqdc
lqdc commented Apr 19, 2013

is this good to go?

@larsmans
scikit-learn member

I've just pushed a rebased version as 3444bfb that still does the sorting; since it's not advertised, we can still remove it. Closing the PR for now.

@larsmans larsmans closed this Apr 19, 2013
@lqdc
lqdc commented Apr 19, 2013

Alright, @larsmans, but the refactoring I did after that is an additional 30% faster, because we don't have to append i_indices and values one at a time. The values are all ones, for example, so we can build that array in one go almost instantly.

@larsmans
scikit-learn member

I took the entire PR and rebased it into a single commit.

@lqdc
lqdc commented Apr 19, 2013

What about 0459127 and 3df6559?

@larsmans
scikit-learn member

They should be in there... aren't they?

@lqdc
lqdc commented Apr 19, 2013

No, as far as I can see, they are not there. The _count_fixed_vocab function is missing, for example. I apologize for committing them so much later; I figured they would be part of the parallel version, but that one fell through.

@amueller
scikit-learn member

Does this still need to be fixed?

@larsmans
scikit-learn member

Yes. I can do it tomorrow (it needs manual rebasing or merging).

@lqdc
lqdc commented Apr 24, 2013

Should I submit a separate PR?

@larsmans
scikit-learn member

I just pushed the two missing commits as d562b46. Sorry for the delay, I'm on a tight schedule.
