Add stemming support to CountVectorizer #1156

Open
pemistahl opened this Issue Sep 16, 2012 · 24 comments


@pemistahl

An idea for a feature enhancement:

I'm currently using sklearn.feature_extraction.text.CountVectorizer for one of my projects. In my opinion, it sorely lacks support for stemming. An additional attribute such as self.stemmer, accepting a stemming function as a callable, would be nice, together with a reasonable default stemmer for English.

What about an additional stemmer module? Then again, there are already some quite good Python stemmers available, for example as part of the Natural Language Toolkit. Some time ago, I contributed stemmers for 12 languages which are available in nltk.stem.snowball. Maybe one could support them as an optional dependency in scikit-learn?

The general interface, however, is not difficult to implement. For example, one could do it like this:

  1. Introduce an additional attribute self.stemmer which is initialized as None by default.
  2. Write a method build_stemmer as something like this:
# Snowball stemmers could be used as a dependency
from nltk.stem import SnowballStemmer

def build_stemmer(self):
    if self.stemmer is not None:
        return self.stemmer
    # One could provide an English stemmer as default
    english_stemmer = SnowballStemmer('english')
    return lambda tokens: [english_stemmer.stem(token) for token in tokens]    
  3. Incorporate this method call in CountVectorizer.build_analyzer(), something like this:
def build_analyzer(self):
    # ...
    elif self.analyzer == 'word':
        stop_words = self.get_stop_words()
        preprocess = self.build_preprocessor()
        # Add a stemmer instance here
        stem = self.build_stemmer()
        tokenize = self.build_tokenizer()

        # Include stemmer in the method chain
        return lambda doc: self._word_ngrams(
            stem(tokenize(preprocess(self.decode(doc)))), stop_words)
    # ...
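To make the proposed chain concrete, here is a dependency-free sketch with plain functions standing in for the vectorizer's built helpers, and a toy suffix stripper standing in for SnowballStemmer (all names here are illustrative, not scikit-learn API):

```python
# Plain functions standing in for the helpers the vectorizer would build.
def preprocess(doc):
    return doc.lower()

def tokenize(doc):
    return doc.split()

def toy_stem(token):
    # Toy suffix stripper standing in for SnowballStemmer('english').stem
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def stem(tokens):
    return [toy_stem(t) for t in tokens]

def analyzer(doc):
    # The proposed composition: stem(tokenize(preprocess(doc)))
    return stem(tokenize(preprocess(doc)))
```

With a real Snowball stemmer, "running" would of course become "run" rather than the toy stemmer's "runn".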

What do you think?

@amueller
Member

I am +1 on supporting stemming, though -1 on providing a default stemmer (out of scope imho).
Making nltk stemmers easy to use with the CountVectorizer seems desirable.

@ogrisel
Member
ogrisel commented Sep 17, 2012

I like the API proposal but would rather name it token_processor or token_filter than stemmer, as it could also be used for post-tokenization processing and filtering other than stemming.

As @amueller said, we don't want to introduce a new dependency, especially on a domain-specific library such as NLTK. So I propose that the default implementation of build_token_processor do nothing, but make it easy for a user to derive from the class and plug in whatever stemming or filtering logic they want.

@amueller
Member

+1 on the different name, +1 also on the rest :)

@ogrisel
Member
ogrisel commented Sep 17, 2012

@pemistahl please feel free to submit a Pull Request to move on on this:

http://scikit-learn.org/dev/developers/index.html

@pemistahl

@amueller and @ogrisel Your suggestions sound reasonable to me. I've just created a subclass of CountVectorizer to add the stemming functionality that I need for my project, but as soon as I find the time I will submit a pull request for the original CountVectorizer class.
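For reference, such a subclass can be written by wrapping the analyzer that CountVectorizer builds. This is only a sketch (not pemistahl's actual code), and toy_stem is a trivial stand-in for a real stemmer such as SnowballStemmer:

```python
from sklearn.feature_extraction.text import CountVectorizer

def toy_stem(token):
    # Stand-in for a real stemmer, e.g. SnowballStemmer('english').stem
    return token[:-1] if token.endswith("s") else token

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the stock analyzer, then stem every token it produces.
        analyzer = super().build_analyzer()
        return lambda doc: [toy_stem(t) for t in analyzer(doc)]
```

Because the wrapping happens after the default preprocessing, tokenization and stop-word handling, all the usual CountVectorizer parameters keep working.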

@wrichert

Would it be possible to do the same for TfidfVectorizer as well?

@amueller
Member

@wrichert Sure. If it were done for CountVectorizer, it would basically work automatically for TfidfVectorizer as well.

@wrichert

I see. Then maybe only the constructor's signature has to be adapted to take the token_processor as a parameter and pass it to the CountVectorizer.
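A sketch of that constructor pass-through, assuming the token_processor parameter discussed in this thread (it does not exist in scikit-learn):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

class TokenProcessingTfidfVectorizer(TfidfVectorizer):
    """Sketch only: token_processor is the hypothetical parameter
    proposed in this thread, not an existing scikit-learn option."""

    def __init__(self, token_processor=None, **kwargs):
        super().__init__(**kwargs)
        self.token_processor = token_processor

    def build_analyzer(self):
        analyzer = super().build_analyzer()
        if self.token_processor is None:
            return analyzer
        # Apply the processor to every token the base analyzer produces.
        return lambda doc: [self.token_processor(t) for t in analyzer(doc)]
```

In a real PR the parameter would be forwarded to CountVectorizer.__init__ explicitly rather than through **kwargs, since scikit-learn's get_params/clone machinery inspects constructor signatures.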

@amueller
Member
amueller commented Dec 1, 2012

@wrichert exactly :)
Feel free to dig into this issue and submit a PR!

@wrichert
wrichert commented Jan 5, 2013

@pemistahl Have you planned to continue on this? If not I would be happy to do it.

@pemistahl

@wrichert I'm busy doing other stuff at the moment. Feel free to add stemming support and go ahead. But thank you for asking me beforehand. :)

@wrichert wrichert added a commit to wrichert/scikit-learn that referenced this issue Jan 8, 2013
@wrichert wrichert Support for token processor. Fixes #1156 9cf5a81
@wrichert wrichert added a commit to wrichert/scikit-learn that referenced this issue Feb 14, 2013
@wrichert wrichert Support for token processor. Fixes #1156 bb5c56e
@larsmans
Member

I just tried adding a lemmatizer to a pipeline, and it's indeed non-trivial ATM. It would have been quite easy if the analyzer were passed a reference to the vectorizer, so it could use the public attributes to decide what to do. (I'm against the current proposed solution in the docs, to inherit from CountVectorizer, as it goes against the OO principle "compose, don't inherit".)

@ogrisel
Member
ogrisel commented Feb 28, 2013

So to make sure: you agree that the solution in PR #1537 would help?

@larsmans
Member

It would help, but it's a stopgap measure that complicates the API even further. The real solution would be to change the analyzer API so that analyzers have access to vectorizer internals. Users could then mix and match vectorizer functionality and their own algorithms as they please, e.g. custom preprocessor -> standard tokenizer -> custom stemmer -> standard stopword filter.

Finding out how many arguments a function takes is non-trivial, but we can try to pass two arguments to self.analyzer (document and self) in build_analyzer, then retry with one argument.

I just added my own stopgap recipe to the narrative docs.
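The two-argument retry mentioned above could be sketched as a plain fallback; note one reason it is non-trivial: a TypeError raised inside a two-argument analyzer would be mistaken for a signature mismatch (call_analyzer is a hypothetical helper, not proposed API):

```python
def call_analyzer(analyzer, document, vectorizer):
    """Try analyzer(document, vectorizer) first; if the callable only
    accepts one argument, retry with analyzer(document)."""
    try:
        return analyzer(document, vectorizer)
    except TypeError:
        # Caveat: this also swallows TypeErrors raised *inside* a
        # two-argument analyzer, so inspect.signature would be safer.
        return analyzer(document)
```

Usage: a legacy one-argument analyzer and a new vectorizer-aware one are both handled transparently.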

@larsmans
Member

To be exact, this is roughly what I have in mind:

import nltk

def lemma_analyzer(document, vectorizer):
    preprocess = vectorizer.build_preprocessor()
    tokenize = vectorizer.build_tokenizer()
    lemmatize = nltk.stem.WordNetLemmatizer().lemmatize
    return map(lemmatize, tokenize(preprocess(document)))

Complicated? Maybe. Flexible? Yes.

The problem is now that n-gram stuff isn't publicly exported, but we can get around that.

@ogrisel
Member
ogrisel commented Feb 28, 2013

That API would create an instance of the preprocessor, the tokenizer and nltk.stem.WordNetLemmatizer per document, which would be very wasteful in CPU, both for the overhead of calling the build_* methods and for the pressure it puts on the GC.

Instead we could make it possible to pass classes instead of plain callable instances, and make the build_* methods instantiate them as required.

For instance:

class MyCustomPreprocessor(object):

    def __init__(self, vectorizer):
        self.vectorizer = vectorizer
        # do my one time, vectorizer-aware init logic here

    def __call__(self, document):
        # TODO my usual preprocessing stuff here while
        # having access to the vectorizer if I need

then in CountVectorizer.build_preprocessor:

def build_preprocessor(self):
    if isinstance(self.preprocessor, type):
        return self.preprocessor(self)
    elif self.preprocessor is not None:
        return self.preprocessor
    else:
        # keep the current implementation here

And we could have the same for tokenizer, analyzer and the new token_processor. We would get vectorizer-aware composability while preserving picklability and backward compatibility with the current API.

WDYT?
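A dependency-free sketch of that dispatch, with MiniVectorizer as a toy stand-in for CountVectorizer (both class names are illustrative only):

```python
class MiniVectorizer:
    """Toy stand-in for CountVectorizer, just to exercise the dispatch."""

    def __init__(self, preprocessor=None):
        self.preprocessor = preprocessor

    def build_preprocessor(self):
        if isinstance(self.preprocessor, type):
            # A class was passed: instantiate it with the vectorizer so it
            # can do one-time, vectorizer-aware setup.
            return self.preprocessor(self)
        elif self.preprocessor is not None:
            # A plain callable: use it as-is (the current behaviour).
            return self.preprocessor
        # Stand-in for the default implementation.
        return str.lower

class VectorizerAwarePreprocessor:
    def __init__(self, vectorizer):
        self.vectorizer = vectorizer  # one-time init happens here

    def __call__(self, document):
        return document.strip().lower()
```

Both the class-based and the plain-callable styles work through the same build_preprocessor call, which is the backward-compatibility point being made above.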

@larsmans
Member

Ok, agree on the performance issue.

The problem with checking for isinstance(x, type) is that it disallows factory functions (including lambdas and partial), which will receive a completely different treatment. In that case, I'd rather provide an ABC to inherit from -- not very pythonic, and unlike the rest of our API, but unambiguous.

Such an ABC could look like

from abc import ABCMeta, abstractmethod

class BaseAnalyzer(object):
    __metaclass__ = ABCMeta

    @abstractmethod
    def receive_vectorizer(self, vectorizer):
        """Will be called by the vectorizer with itself as the argument."""

    @abstractmethod
    def analyze(self, document):
        """Analyze a document"""

(Maybe receive_vectorizer could have a default implementation of pass, but analyze would have to be abstract.)
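A runnable version of that ABC (Python 3 style, with the default pass implementation of receive_vectorizer suggested above; WhitespaceAnalyzer is an illustrative concrete subclass, not proposed API):

```python
from abc import ABC, abstractmethod

class BaseAnalyzer(ABC):
    def receive_vectorizer(self, vectorizer):
        """Called by the vectorizer with itself as the argument.
        Default: ignore it (subclasses may override)."""

    @abstractmethod
    def analyze(self, document):
        """Analyze a document."""

class WhitespaceAnalyzer(BaseAnalyzer):
    """Minimal concrete analyzer, for illustration only."""

    def receive_vectorizer(self, vectorizer):
        self.vectorizer = vectorizer

    def analyze(self, document):
        return document.lower().split()
```

Instantiating BaseAnalyzer directly raises TypeError because analyze is abstract, which enforces the contract without any isinstance checks in the vectorizer.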

@ogrisel
Member
ogrisel commented Mar 1, 2013

Well in that case we could just use duck typing instead of ABCMeta:

def build_analyzer(self):
    if self.analyzer is not None:
        if hasattr(self.analyzer, 'set_vectorizer'):
            # make the analyzer aware of the vectorizer and make it possible
            # to perform corpus-wise costly init only once for all the documents
            self.analyzer.set_vectorizer(self)
        return self.analyzer
    else:
        # put the default current implementation here

I find this lighter weight than introducing a new bunch of frameworkish ABCs.

@ogrisel
Member
ogrisel commented Mar 1, 2013

Even better:

def build_analyzer(self):
    if self.analyzer is not None:
        if hasattr(self.analyzer, 'for_vectorizer'):
            # make the analyzer instance aware of the vectorizer and make it
            # possible to perform corpus-wise costly init only once for all the
            # documents
            return self.analyzer.for_vectorizer(self)
        else:
            return self.analyzer
    else:
        # put the default current implementation here

This way the analyzer constructor parameter can be a stateless factory that is not modified by a set_vectorizer call. This is more in line with the traditional behavior of sklearn estimators' constructor parameters. It also brings thread safety for free in case callers pass the same analyzer (factory) instance to two vectorizers concurrently in two threads (not a big deal, but cleaner).
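A dependency-free sketch of the duck-typed hook, with MiniVectorizer and ShortTokenFilterFactory as illustrative stand-ins:

```python
class MiniVectorizer:
    """Toy stand-in for CountVectorizer, to exercise the for_vectorizer hook."""

    def __init__(self, analyzer=None):
        self.analyzer = analyzer

    def build_analyzer(self):
        if self.analyzer is not None:
            if hasattr(self.analyzer, 'for_vectorizer'):
                # Stateless factory: returns a fresh, vectorizer-aware
                # callable without mutating the factory itself.
                return self.analyzer.for_vectorizer(self)
            return self.analyzer
        # Stand-in for the default implementation.
        return lambda doc: doc.lower().split()

class ShortTokenFilterFactory:
    """Illustrative factory: builds analyzers that drop tokens of <= 2 chars."""

    def for_vectorizer(self, vectorizer):
        # One-time, vectorizer-aware setup would go here.
        return lambda doc: [t for t in doc.lower().split() if len(t) > 2]
```

Because for_vectorizer returns a new callable and leaves the factory untouched, the same factory instance can safely be shared between vectorizers.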

@ogrisel
Member
ogrisel commented Mar 1, 2013

Also backward compat with the current API is a big + to me.

@larsmans
Member
larsmans commented Mar 1, 2013

Actually, it's a brilliant idea. Let's do it.

@ogrisel
Member
ogrisel commented Mar 1, 2013

:)

@larsmans
Member
larsmans commented Mar 1, 2013

One problem occurred to me this morning: this introduces a cyclic reference when the receiver method stores the vectorizer as an attribute on the analyzer.

@ogrisel
Member
ogrisel commented Mar 1, 2013

To me, the instances returned by the for_vectorizer factory method are both new instances (rather than in-place modifications of vectorizer.analyzer) and short-lived: they are not stored as an attribute on the vectorizer.

It is the responsibility of the implementer to make for_vectorizer a factory method that returns a new instance (a callable that takes the document as input), but I think this feature is only for advanced users anyway.
