
Support for token processor. Fixes #1156 #1537

Closed
wants to merge 29 commits into from

6 participants

Willi Richert, Andreas Mueller, Olivier Grisel, Lars Buitinck, Mathieu Blondel, Gilles Louppe
Willi Richert

No description provided.

Willi Richert

In the current implementation, CountVectorizer keeps at least one copy per preprocessing step (preprocessing, tokenizing, etc.).

I think we could rewrite it using generators only, which would be more memory and performance friendly.
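For illustration, a minimal sketch of such a generator-only pipeline (iter_tokens and its callable arguments are hypothetical stand-ins for CountVectorizer's build_* steps, not code from this PR):

def iter_tokens(docs, decode, preprocess, tokenize):
    """Lazily yield tokens without materializing a list per step."""
    for doc in docs:
        for tok in tokenize(preprocess(decode(doc))):
            yield tok

# Example: identity decoding, lowercasing, whitespace tokenization.
docs = ["A first Document.", "A second Document."]
print(list(iter_tokens(docs, str, str.lower, str.split)))
# ['a', 'first', 'document.', 'a', 'second', 'document.']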

Andreas Mueller
Owner

Hi @wrichert. Thanks for the PR. In general I think it looks good.
The test is in the wrong place, though.
The tests for the feature_extraction module are in feature_extraction/tests.

Willi Richert

Done.

Willi Richert

@amueller Is there anything needed from my side to get this PR into the next release?

Andreas Mueller
Owner

@wrichert Sorry, I don't have that much time to review right now. I will try tonight. Maybe @ogrisel wants to have a look.

Andreas Mueller
Owner

Maybe a short example in the form of a doctest in the narrative docs or the docstring would be nice?
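For instance, something along these lines (assuming the PR exposes a token_processor constructor argument analogous to tokenizer, which is not confirmed in the visible diff; the suffix-stripping lambda stands in for a real stemmer):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> stem = lambda tok: tok[:-1] if tok.endswith('s') else tok
>>> vect = CountVectorizer(token_processor=stem)  # hypothetical argument
>>> sorted(vect.fit(['cats and dogs']).vocabulary_)
['and', 'cat', 'dog']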

Willi Richert

@amueller Any chance that this is included in 0.13?

Andreas Mueller
Owner

If you can find two devs to review and merge it. I won't have time, sorry.

Andreas Mueller
Owner

I'll try to have a look soon. Sorry, I'm still pretty busy atm.

sklearn/feature_extraction/text.py
@@ -365,6 +371,14 @@ def build_tokenizer(self):
         token_pattern = re.compile(self.token_pattern)
         return lambda doc: token_pattern.findall(doc)
 
+    def build_token_processor(self):
+        """Return a function that processes the tokens.
+        This can be useful, e.g., for introducing stemming, etc.
Olivier Grisel Owner
ogrisel added a note

Cosmetics: please add a blank line above this line (PEP 257).
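The hunk above is truncated mid-docstring; a plausible completion, modeled on the neighboring build_preprocessor/build_tokenizer methods (the token_processor attribute is an assumption, not shown in the visible diff):

def build_token_processor(self):
    """Return a function that processes the tokens.

    This can be useful, e.g., for introducing stemming, etc.
    """
    if self.token_processor is not None:
        return self.token_processor
    # Assumed default: identity, tokens pass through unchanged.
    return lambda tok: tok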

sklearn/feature_extraction/text.py
((5 lines not shown))
 
         return lambda doc: self._word_ngrams(
-            tokenize(preprocess(self.decode(doc))), stop_words)
+            [process_token(tok) for tok in tokenize(preprocess(self.decode(doc)))], stop_words)
Olivier Grisel Owner
ogrisel added a note

Could you also filter out items that are None, so that the token_processor can also be used for filtering unwanted tokens?

Olivier Grisel Owner
ogrisel added a note

That will require wrapping the tokens in an additional list comprehension, though...
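A sketch of what that wrapping could look like (process_and_filter is a hypothetical helper, not code from the PR):

def process_and_filter(tokens, process_token):
    """Apply process_token to each token, dropping tokens mapped to None."""
    processed = (process_token(tok) for tok in tokens)
    return [tok for tok in processed if tok is not None]

# A processor that doubles as a filter: drop tokens shorter than 3 chars.
drop_short = lambda tok: tok if len(tok) >= 3 else None
print(process_and_filter(['a', 'cat', 'sat', 'on'], drop_short))
# ['cat', 'sat']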

Olivier Grisel
Owner

Could you please rebase this branch on top of master (or merge master into it if you are not familiar with rebasing).

Then fix the pep8 issues reported by: http://pypi.python.org/pypi/pep8

Willi Richert

Sure. I will address your suggestions at the beginning of next week, as I won't have time before that.

Andreas Mueller

This broke the mldata fixtures.

Andreas Mueller
Owner

Did you push the py3k stuff into master yesterday on purpose? Was there a PR?

Owner

I thought I was only pushing the safe changes, but I pushed the wrong commit. Apologies, I'll try to revert the failing part.

Owner

thanks :) I didn't really look through the changes, I was just a bit surprised and saw that travis is failing ;)

Andreas Mueller
Owner

Thanks :) maybe going through a PR would be good so we can rely on travis. Thanks for working on the py3k stuff!

Andreas Mueller
Owner

hurray! :)

Owner

It looks like you forgot to import the print function in cross_validation.rst (maybe we need a fixture?)

Owner

I'm on it. I think some % formatting should do the trick.
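For instance, a bare print statement in a doctest only parses on Python 2; a single %-formatted argument stays valid under both interpreters (a generic illustration, not the actual cross_validation.rst content):

>>> accuracy = 0.9666
>>> print("Accuracy: %0.2f" % accuracy)  # valid on Python 2 and 3
Accuracy: 0.97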

Olivier Grisel

I would have written ['feat1', 'feat2', 'feat3'] and so on to get a valid python literal representation of the example.

Owner

Fixed.

Olivier Grisel

Interesting. Is this a consequence of some property of murmurhash?

Owner

It's a consequence of the way modulo works + the assumption that murmurhash produces uniformly distributed values. Suppose n_features is, say, 3, and we'd use a 2-bit hash function. Then

h       i
---------
0 % 3 = 0
1 % 3 = 1
2 % 3 = 2
3 % 3 = 0

So column zero gets twice the load.

(The same thing happens when you do rand() % n in C, or when you're implementing vanilla closed hashing.)
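The same table as a short simulation (assuming the uniform-hash premise above):

from collections import Counter

n_features = 3
# Uniform 2-bit hash: values 0..3; modulo 3 sends both 0 and 3 to column 0.
print(Counter(h % n_features for h in range(4)))
# Counter({0: 2, 1: 1, 2: 1}), i.e. column zero gets twice the load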

Willi Richert

Hmm, I rebased "Support for token processor. Fixes #1537" and resolved all the conflicts, but I'm not sure whether I did the right thing, seeing lots of other commits appearing in this thread.

Andreas Mueller

The dtype parameter is missing but is used in the function body.

Olivier Grisel
Owner

Indeed, please start a new branch off the current master and cherry-pick just the commits or files related to issue #1156, to make the review easier.

Willi Richert wrichert closed this

Showing 29 unique commits by 7 authors.

Jan 08, 2013
Willi Richert wrichert Support for token processor. Fixes #1156 9cf5a81
Willi Richert wrichert moved test to proper place 6319516
Jan 14, 2013
Willi Richert wrichert Added example for token_processor to the docs 5211bc8
Willi Richert wrichert Clarifying description of vectorizers' preprocessors. f759070
Feb 09, 2013
Olivier Grisel ogrisel P3K use six to have a python 2 & 3 compatible code base 1967a0b
Lars Buitinck larsmans P3K StringIO vs BytesIO 0fe516e
Lars Buitinck larsmans DOC fix failing doctest due to unicode_literals 1c0d422
Feb 10, 2013
Mathieu Blondel mblondel Cosmit: more explicit xlabel. b445537
Mathieu Blondel mblondel Cosmit: more explicit label. 135636e
Gilles Louppe glouppe Merge pull request #1668 from glouppe/adaboost-tree
[MRG] Precompute X_argsorted in AdaBoost
b126421
Lars Buitinck larsmans DOC whitespace in doctest 4c60052
Lars Buitinck larsmans BUG revert P3K changes that broke mldata tests c809368
Feb 11, 2013
Lars Buitinck larsmans rm gender classification example
Accidentally committed in 1967a0b.
041abe3
Andreas Mueller amueller ENH get rid of imports in test_common by checking by names, not classes. b99dbab
Andreas Mueller amueller ENH fix test_estimators_overwrite_params to also test regressors and …
…transformers. Then fix all the regressors and transformers ... meh!
a3b55f4
Andreas Mueller amueller ENH set the random state to avoid heisenfailures 6920295
Andreas Mueller amueller COSMIT pep8, removing unused imports 7d57b9e
Lars Buitinck larsmans P3K death to the print statement 85ec0fd
Lars Buitinck larsmans P3K fix broken doctest and add forgotten print_function import c481ff8
Feb 13, 2013
Lars Buitinck larsmans DOC no more need for compute_importances in trees
Prevents warning from doctest.
cc401c9
Lars Buitinck larsmans DOC copyedit FeatureHasher narrative 9f6c8bb
Lars Buitinck larsmans ENH move covtype loading to sklearn.datasets 16e595a
Lars Buitinck larsmans TST covertype loader 81428f7
Willi Richert wrichert Allowing to drop tokens in token_processor(); pep8 work 44bc5ee
Willi Richert wrichert Support for token processor. Fixes #1156 bb5c56e
Willi Richert wrichert Added example for token_processor to the docs e7e9697
Willi Richert wrichert Clarifying description of vectorizers' preprocessors. db35143
Willi Richert wrichert Reapplying skipped commit f5a595a
Feb 14, 2013
Willi Richert wrichert Merge branch 'token-processor' of https://github.com/wrichert/scikit-…
…learn into token-processor

Conflicts:
	doc/modules/feature_extraction.rst
	sklearn/feature_extraction/text.py
fceb1a7