
Support for token processor. Fixes #1156 #1537

Closed
wants to merge 29 commits

6 participants

@wrichert
wrichert commented Jan 8, 2013

No description provided.

@wrichert
wrichert commented Jan 8, 2013

In the current implementation, CountVectorizer keeps at least one copy of the document data per processing step (preprocessing, tokenizing, etc.).

I think we could rewrite it using generators only, which would be more memory- and performance-friendly.
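
For illustration, a minimal sketch of such a generator-based pipeline (the names here are illustrative, not the actual CountVectorizer internals):

    import re

    def analyze(doc, preprocess, tokenize, process_token):
        # Chain the steps lazily: no intermediate copy of the token list is
        # materialized between the preprocessing, tokenizing and
        # token-processing stages.
        return (process_token(tok) for tok in tokenize(preprocess(doc)))

    tokens = analyze("Cats and Dogs",
                     preprocess=str.lower,
                     tokenize=re.compile(r"(?u)\b\w\w+\b").findall,
                     process_token=str.strip)
    print(list(tokens))  # ['cats', 'and', 'dogs']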

@amueller
scikit-learn member
amueller commented Jan 8, 2013

Hi @wrichert. Thanks for the PR. In general I think it looks good.
The test is in the wrong place, though.
The tests for the feature_extraction module are in feature_extraction/tests.

@wrichert
wrichert commented Jan 9, 2013

Done.

@wrichert

@amueller Is there anything needed from my side to get this PR into the next release?

@amueller
scikit-learn member

@wrichert Sorry, I don't have that much time to review right now. I will try tonight. Maybe @ogrisel wants to have a look.

@amueller
scikit-learn member

Maybe a short example in the form of a doctest in the narrative docs or the docstring would be nice?
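
A short sketch of what such a doctest could look like, assuming (hypothetically) that the PR exposes the processor through a token_processor constructor parameter:

    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> def strip_plural_s(token):          # toy stand-in for a real stemmer
    ...     return token[:-1] if token.endswith('s') else token
    >>> vectorizer = CountVectorizer(token_processor=strip_plural_s)  # hypothetical parameter name
    >>> sorted(vectorizer.fit(['cats and dogs']).vocabulary_)
    ['and', 'cat', 'dog']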

@wrichert

@amueller Any chance that this is included in 0.13?

@amueller
scikit-learn member

Only if you find two devs to review and merge it. I won't have time, sorry.

@amueller
scikit-learn member

I'll try to have a look soon. Sorry, I'm still pretty busy atm.

@ogrisel ogrisel commented on an outdated diff Feb 7, 2013
sklearn/feature_extraction/text.py
@@ -365,6 +371,14 @@ def build_tokenizer(self):
         token_pattern = re.compile(self.token_pattern)
         return lambda doc: token_pattern.findall(doc)
+    def build_token_processor(self):
+        """Return a function that processes the tokens.
+        This can be useful, e.g., for introducing stemming, etc.
@ogrisel
scikit-learn member
ogrisel added a note Feb 7, 2013

Cosmetics: please add a blank line above this line (PEP 257).

@ogrisel ogrisel commented on an outdated diff Feb 7, 2013
sklearn/feature_extraction/text.py
         return lambda doc: self._word_ngrams(
-            tokenize(preprocess(self.decode(doc))), stop_words)
+            [process_token(tok) for tok in tokenize(preprocess(self.decode(doc)))], stop_words)
@ogrisel
scikit-learn member
ogrisel added a note Feb 7, 2013

Could you also filter out items that are None, so that the token_processor can also be used for filtering out unwanted tokens?

@ogrisel
scikit-learn member
ogrisel added a note Feb 7, 2013

That will require wrapping the tokens in an additional list comprehension, though...
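
A rough sketch of that suggestion, with illustrative helper names rather than the exact PR code:

    def filter_tokens(tokens, process_token):
        # Apply the token processor and drop tokens mapped to None, so the
        # processor can also be used to filter out unwanted tokens.
        return [tok for tok in (process_token(t) for t in tokens)
                if tok is not None]

    # Example processor: drop very short tokens and strip a plural 's'.
    def drop_short_and_singularize(tok):
        if len(tok) < 3:
            return None
        return tok[:-1] if tok.endswith('s') else tok

    print(filter_tokens(['cats', 'and', 'of', 'dogs'],
                        drop_short_and_singularize))
    # ['cat', 'and', 'dog']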

@ogrisel
scikit-learn member
ogrisel commented Feb 7, 2013

Could you please rebase this branch on top of master (or merge master into it if you are not familiar with rebasing).

Then fix pep8 issues reported by: http://pypi.python.org/pypi/pep8

@wrichert
wrichert commented Feb 8, 2013

Sure. I will address your suggestions at the beginning of next week, as I won't have time before that.

@amueller
scikit-learn member

This broke the mldata fixtures.

@amueller
scikit-learn member

Did you push the py3k stuff into master yesterday on purpose? Was there a PR?

scikit-learn member

I thought I was only pushing the safe changes, but I pushed the wrong commit. Apologies, I'll try to revert the failing part.

scikit-learn member

Thanks :) I didn't really look through the changes; I was just a bit surprised and saw that Travis is failing ;)

@amueller
scikit-learn member

Thanks :) Maybe going through a PR would be good so we can rely on Travis. Thanks for working on the py3k stuff!

@amueller
scikit-learn member

hurray! :)

scikit-learn member

It looks like you forgot to import the print function in cross_validation.rst (maybe we need a fixture?)

scikit-learn member

I'm on it. I think some % formatting should do the trick.
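
For example (illustrative values), a doctest line like the following prints identically under Python 2 and 3 without importing print_function, because a single formatted string is passed to print:

    >>> accuracy, std = 0.98, 0.02
    >>> print("Accuracy: %0.2f (+/- %0.2f)" % (accuracy, std))
    Accuracy: 0.98 (+/- 0.02)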

@ogrisel
scikit-learn member

I would have written ['feat1', 'feat2', 'feat3'] and so on to get a valid Python literal representation of the example.

scikit-learn member

Fixed.

@ogrisel
scikit-learn member

Interesting. Is this a consequence of some property of murmurhash?

scikit-learn member

It's a consequence of how modulo works, plus the assumption that murmurhash produces uniformly distributed values. Suppose n_features is, say, 3, and we used a 2-bit hash function. Then

h       i
---------
0 % 3 = 0
1 % 3 = 1
2 % 3 = 2
3 % 3 = 0

So column zero gets twice the load of the others.

(The same thing happens when you do rand() % n in C, or when you're implementing vanilla closed hashing.)
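
The bias is easy to check numerically (a minimal sketch, assuming a uniform 2-bit hash):

    from collections import Counter

    # With a uniform 2-bit hash (values 0..3) and n_features = 3,
    # column 0 receives twice as many hits as columns 1 and 2.
    n_features = 3
    counts = Counter(h % n_features for h in range(2 ** 2))
    print(counts)  # Counter({0: 2, 1: 1, 2: 1})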

@wrichert

Hmm, I rebased the "Support for token processor. Fixes #1537" branch and resolved all the conflicts, but I'm not sure whether I did the right thing, seeing lots of other commits appearing in this thread.

@amueller
scikit-learn member

The dtype parameter is missing from the signature but is used in the function body.

@ogrisel
scikit-learn member
ogrisel commented Feb 28, 2013

Indeed, please start a new branch off the current master and cherry-pick just the commits or files that are related to issue #1156 to make the review easier.

@wrichert wrichert closed this Mar 5, 2013