Cleaned and updated version of token-processor support #1735

Closed

wants to merge 9 commits

10 participants

@wrichert
wrichert commented Mar 5, 2013

This is the cleaned-up version of pull request #1537 to fix #1156.

@mblondel mblondel and 1 other commented on an outdated diff Mar 5, 2013
...arn/feature_extraction/tests/test_count_vectorizer.py
@@ -0,0 +1,58 @@
+"""
+Test the CountVectorizer feature extractor
@mblondel
mblondel Mar 5, 2013 Member

Can you move these tests to test_text.py?

@larsmans
Member
larsmans commented Mar 5, 2013

I understand from the examples that the processor is called once per token. I'd prefer it to receive a stream (iterable) of tokens and produce a stream as well, to eliminate per-token function call overhead, allow context-sensitive rules (POS/NER taggers!), and allow producing multiple tokens from a single one.

I'd also call it a token filter to mirror Lucene (and Unix) terminology.
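
To make the shape of that interface concrete, here is a minimal sketch of such a generator-based filter (hypothetical code, not from this PR; split_hyphens is an invented name). Because it consumes and yields a token stream, it can emit several tokens for a single input, which a per-token callable cannot:

    import re

    def split_hyphens(tokens):
        """Hypothetical token filter: may yield several tokens per input."""
        for t in tokens:
            for part in re.split(r"-+", t):
                if part:
                    yield part

    print(list(split_hyphens(["state-of-the-art", "code"])))
    # ['state', 'of', 'the', 'art', 'code']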

@larsmans larsmans commented on an outdated diff Mar 5, 2013
doc/modules/feature_extraction.rst
@@ -714,6 +719,20 @@ Some tips and tricks:
(Note that this will not filter out punctuation.)
+
+ For the sake of simplicity, the following example will remove all vowels before counting:
+
+ >>> def vowel_remover(word):
+ ... for vowel in "aeiou":
+ ... word = word.replace(vowel, "")
+ ... return word
+ ...
+ >>> vectorizer = CountVectorizer(token_processor=vowel_remover)
+ >>> print vectorizer.build_analyzer()(u"color colour")
+ [u'clr', u'clr']
+
@larsmans
larsmans Mar 5, 2013 Member

I'd prefer a more interesting example. E.g., using my proposed stream API,

import re

def to_american(tokens):
    """Heuristic British->American spelling converter."""
    for t in tokens:
        t = re.sub(r"(...)our$", r"\1or", t)
        t = re.sub(r"([bt])re$", r"\1er", t)
        t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t)
        t = re.sub(r"ogue$", "og", t)
        yield t
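
A quick check of those rules (hypothetical usage, not part of the PR):

    print(list(to_american(["colour", "centre", "organise", "catalogue"])))
    # ['color', 'center', 'organize', 'catalog']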
@wrichert
wrichert commented Mar 5, 2013

@larsmans Totally agree both with your stream API suggestion and the example. Addressed it in wrichert@083205e.

@amueller
Member
amueller commented Mar 9, 2013

This can no longer be merged. Could you please rebase?

@amueller
Member

Sorry I was too late on IRC. After rebasing, you can just force-push to the same branch and the PR will be updated automatically. Thanks :)

@wrichert

I just force-pushed. Hopefully, the universe does not collapse now...

@larsmans
Member

I'm afraid my Py3 compat changes broke this PR again...

@wrichert

@larsmans I've rebased again. Could you give it a try before it is outdated again? :-)

@larsmans
Member

I'm juggling a lot of projects simultaneously ATM, but I'll see if I can find some time. Don't hold your breath.

@jseabold
Contributor

Just wanted to bump and +1 this PR.

@GaelVaroquaux GaelVaroquaux commented on the diff Jun 3, 2013
sklearn/feature_extraction/tests/test_text.py
from io import StringIO
+JUNK = (
+ "aa aa aa aa aaa aaa aaaa",
+)
+
+JUNK = (
+ "aa aa aa aa aaa aaa aaaa",
+)
@GaelVaroquaux
GaelVaroquaux Jun 3, 2013 Member

Looking at the diff on GitHub, these lines seem repeated. I don't understand: is there a reason, or is this just an oversight?

@GaelVaroquaux GaelVaroquaux commented on the diff Jun 3, 2013
sklearn/feature_extraction/tests/test_text.py
@@ -424,7 +432,7 @@ def test_vectorizer():
# test the direct tfidf vectorizer
# (equivalent to term count vectorizer + tfidf transformer)
train_data = iter(ALL_FOOD_DOCS[:-1])
- tv = TfidfVectorizer(norm='l1')
+ tv = TfidfVectorizer(norm='l1', min_df=1)
@GaelVaroquaux
GaelVaroquaux Jun 3, 2013 Member

Very naive question (I don't know this part of the codebase well): why is 'min_df' added in the test? Does this PR require a change in existing tests, and if so, is this a change in behavior that might break backward compatibility?

@wrichert
wrichert Jun 5, 2013

At the time I made this change, the default was still min_df=2. Now that it has changed to min_df=1 in the Vectorizer classes, I would still like to keep the argument in the test to make things explicit.
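
(For context, min_df is the minimum number of documents a term must appear in to be kept. A minimal sketch of the difference, using the current scikit-learn behavior described above:)

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["apple banana", "apple cherry"]

    # min_df=1 (the new default) keeps every term.
    print(sorted(CountVectorizer(min_df=1).fit(docs).vocabulary_))
    # ['apple', 'banana', 'cherry']

    # min_df=2 (the old default) drops terms seen in fewer than 2 documents.
    print(sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_))
    # ['apple']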

@GaelVaroquaux GaelVaroquaux commented on the diff Jun 3, 2013
sklearn/feature_extraction/tests/test_text.py
@@ -802,3 +811,74 @@ def test_pickling_transformer():
assert_array_equal(
copy.fit_transform(X).toarray(),
orig.fit_transform(X).toarray())
+
+
+def test_token_processor_1():
@GaelVaroquaux
GaelVaroquaux Jun 3, 2013 Member

Any chance you could give these better names than '_1'? (Same remark for the test below.)

@GaelVaroquaux GaelVaroquaux commented on the diff Jun 3, 2013
sklearn/feature_extraction/tests/test_text.py
+
+ counts = vectorized.toarray()[0]
+ assert_equal(counts[word_vect.vocabulary_['aa']], 4)
+ assert_equal(counts[word_vect.vocabulary_['aaa']], 2)
+ assert_equal(counts[word_vect.vocabulary_['aaaa']], 1)
+
+
+def test_token_processor_2():
+ def to_american(tokens):
+ """Heuristic British->American spelling converter."""
+ for t in tokens:
+ t = re.sub(r"(...)our$", r"\1or", t)
+ t = re.sub(r"([bt])re$", r"\1er", t)
+ t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t)
+ t = re.sub(r"ogue$", "og", t)
+ yield t
@GaelVaroquaux
GaelVaroquaux Jun 3, 2013 Member

This is a nice application. It should probably go as an example in the narrative documentation.

@GaelVaroquaux
GaelVaroquaux Jun 3, 2013 Member

OK, stupid remark: I hadn't seen in the narrative documentation that you already had this example. My apologies.

@wrichert
wrichert Jun 5, 2013

Nice example by courtesy of @larsmans :-) (see thread above)

@GaelVaroquaux GaelVaroquaux commented on the diff Jun 3, 2013
sklearn/feature_extraction/text.py
@@ -207,6 +207,15 @@ def get_stop_words(self):
"""Build or fetch the effective stop words list"""
return _check_stop_list(self.stop_words)
+ def build_token_processor(self):
@GaelVaroquaux
GaelVaroquaux Jun 3, 2013 Member

I don't really like the fact that this method is public. I know that the other 'build_' methods of the object are public. I believe that they should not be.

@ogrisel
ogrisel Jun 3, 2013 Member

I think they should be: they are very useful for introspecting the model's behavior and are documented as such in the narrative documentation. I often point users to them when they want to understand what the vectorizer is doing, for instance on Stack Overflow:

http://stackoverflow.com/questions/16080726/special-characters-in-countvectorizer-scikit-learn/16083182#16083182
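
For illustration, this is the kind of introspection meant here, using the public build_analyzer (a sketch; build_token_processor itself is only proposed in this PR):

    from sklearn.feature_extraction.text import CountVectorizer

    analyzer = CountVectorizer().build_analyzer()
    # Shows exactly how a document is preprocessed and tokenized before counting.
    print(analyzer("Bi-grams are cool!"))
    # ['bi', 'grams', 'are', 'cool']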

@GaelVaroquaux
GaelVaroquaux Jun 3, 2013 Member

> I think they should be: they are very useful for introspecting the model's behavior and are documented as such in the narrative documentation. I often point users to them when they want to understand what the vectorizer is doing.

OK, but then there is definitely a pretty bad code smell here: we have an object that is becoming huge, with a fairly wide surface. That's bad for usability and for long-term maintainability. I don't know what the solution is; I just know that from a high-level perspective this does not look too good.

@ogrisel
ogrisel Jun 3, 2013 Member

I agree, but I think that's a trade-off with the "flat is better than nested" rule: we don't want to introduce a complex adapter pattern as a mandatory requirement for using this class. That said, I think I had worked out a clean way to reduce the public API without sacrificing features, extensibility, or efficiency while discussing with @larsmans somewhere else. I will have to dig through my emails...

Anyway this PR is useful and consistent with the existing API and logic of the current codebase. Also I don't think this class will need to grow any further.

@GaelVaroquaux
GaelVaroquaux Jun 3, 2013 Member

> Anyway this PR is useful and consistent with the existing API and logic of the current codebase. Also I don't think this class will need to grow any further.

OK, so a few minor changes, and we merge.

@GaelVaroquaux GaelVaroquaux commented on the diff Jun 3, 2013
doc/modules/feature_extraction.rst
@@ -714,6 +719,24 @@ Some tips and tricks:
(Note that this will not filter out punctuation.)
+
+ For the sake of simplicity, the following example will remove all vowels before counting:
@GaelVaroquaux
GaelVaroquaux Jun 3, 2013 Member

I believe that the line above does not match what the example does below.

@GaelVaroquaux
Member

In general, I am 👍 for merging this PR, as it is a nice feature (please address my small comments). However, I have the feeling that this part of the codebase is getting more and more hairy.

It would be good if someone with a good understanding of the text-processing application put things in order. I have in mind:

  • Separating out a clear public and private API: we don't want too many public methods.
  • Making sure that there is no redundancy.
  • Making sure that everything pickles right (a lot of functionality is defined in closures, which I am uneasy with; see the sketch below).
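
On the pickling point, a minimal sketch of the hazard (hypothetical code, not from this PR): a function defined inside another function cannot be located by name at unpickling time, so pickle rejects it, and anything built that way must be reconstructed rather than serialized:

    import pickle

    def build_processor(suffix):
        def process(token):  # a closure, like what the build_* methods return
            return token + suffix
        return process

    try:
        pickle.dumps(build_processor("_x"))
    except (pickle.PicklingError, AttributeError) as exc:
        print("cannot pickle:", exc)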
@ogrisel
Member
ogrisel commented Jul 29, 2013

@wrichert sorry I completely forgot about this PR. I wanted to finalize it during the sprint last week but failed to do so. Could you please rebase it on top of current master?

@EntilZha
EntilZha commented Dec 2, 2015

Chased a link here from Google looking for exactly this functionality. Is there any status update on this, or has it already been added in some other way?

@ldulcic
ldulcic commented May 27, 2016

Will this be merged any time soon? It is a really useful feature, and it looks like the code was written some time ago but is still not included in a scikit-learn release. I hate to override these classes and create custom solutions when this is already solved :)

@jnothman
Member

@ldulcic, it seems there are a number of comments, particularly from @GaelVaroquaux, that @wrichert has not responded to. If someone would like to take over this PR and finish it up, it seems it's still a desirable enhancement.

@rth rth added a commit to rth/scikit-learn that referenced this pull request Aug 29, 2016
@wrichert @rth wrichert + rth Cleaned and updated version of the token processor (PR #1735) 66dd3eb
@jnothman
Member

superseded by #7286

@jnothman jnothman closed this Sep 29, 2016
@rth rth added a commit to rth/scikit-learn that referenced this pull request Nov 2, 2016
@wrichert @rth wrichert + rth Cleaned and updated version of the token processor (PR #1735) 7cce960