An idea for a feature enhancement:
I'm currently using sklearn.feature_extraction.text.CountVectorizer for one of my projects. In my opinion, it sorely lacks support for stemming. An additional attribute such as self.stemmer, which accepts a stemming function as a callable, would be nice, together with a reasonable default stemmer for English.
What about an additional stemmer module? On the other hand, there are already some quite good Python stemmers available, for example as part of the Natural Language Toolkit. Some time ago, I contributed stemmers for 12 languages, which are available in nltk.stem.snowball. Maybe scikit-learn could support them as a dependency?
The general interface, however, is not difficult to implement. For example, one could do it like this:
    # Snowball stemmers could be used as a dependency
    from nltk.stem import SnowballStemmer

    # In a new build_stemmer method:
    if self.stemmer is not None:
        return lambda tokens: [self.stemmer(token) for token in tokens]
    # One could provide an English stemmer as default
    english_stemmer = SnowballStemmer('english')
    return lambda tokens: [english_stemmer.stem(token) for token in tokens]

and then in build_analyzer:

    elif self.analyzer == 'word':
        stop_words = self.get_stop_words()
        # Add a stemmer instance here
        stem = self.build_stemmer()
        tokenize = self.build_tokenizer()
        # Include stemmer in the method chain
        return lambda doc: self._word_ngrams(
            stem(tokenize(self.decode(doc))), stop_words)
What do you think?
I am +1 on supporting stemming, though -1 on providing a default stemmer (out of scope imho).
Making nltk stemmers easy to use with the CountVectorizer seems desirable.
I like the API proposal but would rather name it token_processor or token_filter than stemmer, as it could also be used for other post-tokenization processing and filtering than just stemming.
As @amueller said, we don't want to introduce a new dependency, especially on a domain-specific library such as NLTK, so I propose that the default implementation of build_token_processor would do nothing, but it would make it easier for a user to derive that class and plug in whatever stemming or filtering logic he/she wants.
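A minimal sketch of that design, with all names hypothetical and a toy suffix-stripper standing in for a real stemmer such as NLTK's SnowballStemmer:

```python
# Hypothetical sketch: build_token_processor defaults to the identity,
# and users subclass the vectorizer to plug in their own logic.
class VectorizerSketch:
    def build_token_processor(self):
        # default implementation: pass tokens through unchanged
        return lambda tokens: tokens

class StemmingVectorizer(VectorizerSketch):
    def build_token_processor(self):
        def strip_ing(token):
            # toy "stemmer": drop a trailing 'ing'
            return token[:-3] if token.endswith('ing') else token
        return lambda tokens: [strip_ing(t) for t in tokens]
```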
+1 on the different name, +1 also on the rest :)
@pemistahl please feel free to submit a Pull Request to move forward on this.
@amueller and @ogrisel Your suggestions sound reasonable to me. I've just created a subclass of CountVectorizer to add the stemming functionality that I need for my project, but as soon as I find the time I will submit a pull request for the original CountVectorizer class.
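For reference, such a subclass might look roughly like this (a sketch assuming scikit-learn is installed; the one-character suffix stripper is a deliberately trivial stand-in for a real stemmer):

```python
from sklearn.feature_extraction.text import CountVectorizer

class StemmedCountVectorizer(CountVectorizer):
    """Sketch: wrap the stock analyzer and stem every token it emits."""
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        # toy stemmer: strip a trailing plural 's'
        stem = lambda t: t[:-1] if t.endswith('s') else t
        return lambda doc: [stem(t) for t in analyzer(doc)]
```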
Would it be possible to do the same for TfidfVectorizer as well?
@wrichert Sure. If it would be done for CountVectorizer, it would basically automatically also work for TfidfVectorizer.
I see. Then maybe only the constructor's signature has to be adapted to take the token_processor as a parameter and pass it to the CountVectorizer.
@wrichert exactly :)
Feel free to dig into this issue and submit a PR!
@pemistahl Have you planned to continue on this? If not I would be happy to do it.
@wrichert I'm busy doing other stuff at the moment. Feel free to add stemming support and go ahead. But thank you for asking me beforehand. :)
Support for token processor. Fixes #1156
I just tried adding a lemmatizer to a pipeline, and it's indeed non-trivial ATM. It would have been quite easy if the analyzer were passed a reference to the vectorizer, so it could use the public attributes to decide what to do. (I'm against the current proposed solution in the docs, to inherit from CountVectorizer, as it goes against the OO principle "compose, don't inherit".)
So to make sure: you agree that the solution in PR #1537 would help?
It would help, but it's a stopgap measure that complicates the API even further. The real solution would be to change the analyzer API so that analyzers have access to vectorizer internals. Users can then mix and match vectorizer functionality and their own code algorithms as they please, e.g. custom preprocessor -> standard tokenizer -> custom stemmer -> standard stopword filter.
Finding out how many arguments a function takes is non-trivial, but we can try to pass two arguments to self.analyzer (document and self) in build_analyzer, then retry with one argument.
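The retry could be sketched like this (hypothetical helper, not actual scikit-learn code; note the known weakness that a TypeError raised inside the analyzer itself would be misread as a wrong arity):

```python
def call_analyzer(analyzer, document, vectorizer):
    # try the new two-argument signature first, then fall back to the
    # classic single-argument analyzer for backward compatibility
    try:
        return analyzer(document, vectorizer)
    except TypeError:
        return analyzer(document)
```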
I just added my own stopgap recipe to the narrative docs.
To be exact, this is roughly what I have in mind:
    import nltk.stem

    def lemma_analyzer(document, vectorizer):
        preprocess = vectorizer.build_preprocessor()
        tokenize = vectorizer.build_tokenizer()
        lemmatize = nltk.stem.WordNetLemmatizer().lemmatize
        return map(lemmatize, tokenize(preprocess(document)))
Complicated? Maybe. Flexible? Yes.
The problem is now that n-gram stuff isn't publicly exported, but we can get around that.
That API would create an instance of preprocessor, tokenizer and nltk.stem.WordNetLemmatizer per document, which would be very wasteful in CPU, both from the overhead of calling the build_* methods and by putting pressure on the GC.
Instead we could make it possible to pass classes instead of plain callables and then make the build_* methods instantiate them as required.
For example (class name arbitrary):

    class MyPreprocessor(object):

        def __init__(self, vectorizer):
            self.vectorizer = vectorizer
            # do my one-time, vectorizer-aware init logic here

        def __call__(self, document):
            # TODO: my usual preprocessing stuff here while
            # having access to the vectorizer if I need it
            return document
then in CountVectorizer.build_preprocessor:
    if isinstance(self.preprocessor, type):
        # instantiate the class, passing the vectorizer itself
        return self.preprocessor(self)
    elif self.preprocessor is not None:
        # keep the current implementation here
        return self.preprocessor
And we could have the same for tokenizer, analyzer and the new token_processor. We would have vectorizer-aware composability and preserve picklability and backward compat with the current API.
Ok, agree on the performance issue.
The problem with checking for isinstance(x, type) is that it disallows factory functions (including lambdas and partial), which will receive a completely different treatment. In that case, I'd rather provide an ABC to inherit from -- not very pythonic, and unlike the rest of our API, but unambiguous.
Such an ABC could look like this (class name arbitrary):

    from abc import ABCMeta, abstractmethod

    class VectorizerAwareAnalyzer(object):
        __metaclass__ = ABCMeta

        def receive_vectorizer(self, vectorizer):
            """Will be called by the vectorizer with itself as the argument."""

        @abstractmethod
        def analyze(self, document):
            """Analyze a document."""
(Maybe receive_vectorizer could have a default implementation of pass, but analyze would have to be abstract.)
Well in that case we could just use duck typing instead of ABCMeta:
    if self.analyzer is not None:
        if hasattr(self.analyzer, 'set_vectorizer'):
            # make the analyzer aware of the vectorizer and make it possible
            # to perform corpus-wise costly init only once for all the documents
            self.analyzer.set_vectorizer(self)
        return self.analyzer
    # put the default current implementation here
I find this lighter weight than introducing a new bunch of frameworkish ABCs.
Or, with a factory method so the user-supplied analyzer is never modified:

    if self.analyzer is not None:
        if hasattr(self.analyzer, 'for_vectorizer'):
            # make the analyzer instance aware of the vectorizer and make it
            # possible to perform corpus-wise costly init only once for all the
            # documents
            return self.analyzer.for_vectorizer(self)
        return self.analyzer
    # put the default current implementation here
This way the analyzer constructor parameter can be a stateless factory that is not modified by a set_vectorizer call. This is more in line with the traditional behavior of sklearn estimators' constructor parameters. It also brings thread safety for free in case callers pass the same analyzer (factory) instance to 2 vectorizers concurrently in 2 threads (although this is not a big deal, it's still cleaner).
Also backward compat with the current API is a big + to me.
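Putting the pieces together, a self-contained sketch of the for_vectorizer protocol (all class names hypothetical; MiniVectorizer stands in for CountVectorizer's dispatch in build_analyzer):

```python
class LowercasingAnalyzerFactory:
    """Stateless factory: for_vectorizer returns a fresh, short-lived,
    vectorizer-aware callable and never mutates the factory itself."""
    def for_vectorizer(self, vectorizer):
        lowercase = vectorizer.lowercase  # one-time, vectorizer-aware init
        def analyze(document):
            tokens = document.split()
            return [t.lower() for t in tokens] if lowercase else tokens
        return analyze

class MiniVectorizer:
    def __init__(self, analyzer, lowercase=True):
        self.analyzer = analyzer
        self.lowercase = lowercase

    def build_analyzer(self):
        if hasattr(self.analyzer, 'for_vectorizer'):
            # vectorizer-aware factory: ask it for a bound analyzer
            return self.analyzer.for_vectorizer(self)
        return self.analyzer  # plain callables keep working unchanged
```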
Actually, it's a brilliant idea. Let's do it.
One problem occurred to me this morning: this introduces a cyclic reference when the receiver method stores the vectorizer as an attribute on the analyzer.
To me, the instances returned by the for_vectorizer factory method are new instances (rather than in-place modifications of vectorizer.analyzer) and, furthermore, are meant to be short-lived: they are not stored as an attribute on the vectorizer.
It is the responsibility of the implementer to make for_vectorizer a factory method that returns a new instance (a callable that takes the document as input), but I think this feature is only for advanced users anyway.