
Memoize tokenisation in CountVectorizer #11579

Open
jnothman opened this issue Jul 16, 2018 · 8 comments

Comments

@jnothman
Member

IIRC, previous attempts to improve efficiency in CountVectorizer have found that tokenization is the primary bottleneck. In my private extensions, memoizing the mapping (s -> tokenize(preprocess(s))) can improve runtime greatly, especially when the vectorizer is in a cross-validation pipeline (which ColumnTransformer etc. helps encourage).

Challenges to memoization:

  • it might make sense to use in-memory caching rather than on-disk, given the relatively small blobs being cached.
  • it has to be conditioned on all relevant constructor parameters of the CountVectorizer (i.e. anything that goes into build_analyzer up to and including tokenization)
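The second point above can be combined with the first in a small sketch. This is not sklearn API, just an illustration, assuming a factory whose closure captures the tokenization-relevant parameters and caches results in memory with functools.lru_cache; the names (make_cached_tokenizer, maxsize) are hypothetical:

```python
import re
from functools import lru_cache

def make_cached_tokenizer(lowercase=True, token_pattern=r"(?u)\b\w\w+\b",
                          maxsize=100_000):
    """Build a preprocess+tokenize function memoized in memory.

    Each distinct (lowercase, token_pattern) combination gets its own
    cache, so the cache is implicitly conditioned on the parameters
    that affect tokenization.
    """
    pattern = re.compile(token_pattern)

    @lru_cache(maxsize=maxsize)
    def tokenize(doc):
        # preprocess, then tokenize; tuple so the result is hashable
        if lowercase:
            doc = doc.lower()
        return tuple(pattern.findall(doc))

    return tokenize

tokenize = make_cached_tokenizer()
tokenize("Hello world hello")  # computed
tokenize("Hello world hello")  # served from the in-memory cache
```

On repeated calls with the same document (as happens across folds of a cross-validation pipeline), the second call is a cache hit and skips the regex work entirely.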
@rth
Member

rth commented Jul 17, 2018

Interesting! Possibly use functools.lru_cache for this once the code base becomes Python 3 only after the 0.20 release. There are existing benchmarks that could be used to evaluate the performance in bench_20newsgroups.py.
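To make the potential gain concrete, here is a toy sketch (illustrative only, no sklearn involved) counting how often each document is re-tokenized across a k-fold loop when nothing is cached; this is the redundant work memoization would remove:

```python
from collections import Counter

docs = [f"doc {i}" for i in range(6)]
calls = Counter()

def tokenize(doc):
    # stand-in for the expensive preprocess+tokenize step
    calls[doc] += 1
    return doc.split()

k = 3
for fold in range(k):
    held = set(range(fold, len(docs), k))
    train = [d for i, d in enumerate(docs) if i not in held]
    test = [docs[i] for i in sorted(held)]
    for d in train + test:  # fit on train, then transform test
        tokenize(d)

# without caching, every document is tokenized k times in total
```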

@mina1987
Contributor

mina1987 commented Oct 31, 2019

Has anyone worked on this? I would like to work on it at the Nov 2 WIML SF sprint.

@jnothman
Member Author

I think you could work on it @mina1987.

One tricky decision is how to handle the cache if the vectorizer is cloned.
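One possible answer to the cloning question, sketched below under assumptions (all names here are hypothetical, not sklearn API): since sklearn.clone reconstructs an estimator from its constructor parameters, a cache stored as an instance attribute would be discarded, whereas a module-level cache keyed by the tokenization-relevant parameters would survive and be shared across clones with identical settings:

```python
# Module-level registry of caches, keyed by the parameters that
# affect tokenization. A cloned vectorizer with the same parameters
# looks up the same cache; different parameters get a fresh one.
_TOKEN_CACHES = {}

def get_token_cache(lowercase, token_pattern):
    key = (lowercase, token_pattern)
    return _TOKEN_CACHES.setdefault(key, {})

cache_a = get_token_cache(True, r"\w+")
cache_b = get_token_cache(True, r"\w+")   # same params: same cache object
cache_c = get_token_cache(False, r"\w+")  # different params: fresh cache
```

The trade-off is that a module-level cache outlives any single estimator, so memory would need to be bounded or explicitly cleared.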

@mina1987
Contributor

mina1987 commented Nov 2, 2019

I will start working on this!

@mina1987
Contributor

mina1987 commented Nov 2, 2019

I am trying to understand this issue. The purpose is to store the doc-to-token mapping so it can be reused for subsequent identical documents, right?

@jnothman
Member Author

jnothman commented Nov 3, 2019 via email

@mina1987
Contributor

mina1987 commented Nov 3, 2019

How about passing the cache as an input/output of the vectorizer's fit/transform?

@jnothman
Member Author

jnothman commented Nov 4, 2019 via email
