
Memoize tokenisation in CountVectorizer #11579

Open
jnothman opened this issue Jul 16, 2018 · 8 comments

Comments

@jnothman
Member

IIRC, previous attempts to improve efficiency in CountVectorizer have found that tokenization is the primary bottleneck. In my private extensions, memoizing the mapping (s -> tokenize(preprocess(s))) can improve runtime greatly, especially when the vectorizer is in a cross-validation pipeline (which ColumnTransformer etc. helps encourage).

Challenges to memoization:

  • it might make sense to use in-memory caching rather than on-disk, given the relatively small blobs being cached.
  • it has to be conditioned on all relevant constructor parameters of the CountVectorizer (i.e. anything that goes into build_analyzer up to and including tokenization)
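The second point above can be combined with the first in a small sketch. This is not sklearn API, just an illustration, assuming a factory whose closure captures the tokenization-relevant parameters and caches results in memory with functools.lru_cache; the names (make_cached_tokenizer, maxsize) are hypothetical:

```python
import re
from functools import lru_cache

def make_cached_tokenizer(lowercase=True, token_pattern=r"(?u)\b\w\w+\b",
                          maxsize=100_000):
    """Build a preprocess+tokenize function memoized in memory.

    Each distinct (lowercase, token_pattern) combination gets its own
    cache, so the cache is implicitly conditioned on the parameters
    that affect tokenization.
    """
    pattern = re.compile(token_pattern)

    @lru_cache(maxsize=maxsize)
    def tokenize(doc):
        # preprocess, then tokenize; tuple so the result is hashable
        if lowercase:
            doc = doc.lower()
        return tuple(pattern.findall(doc))

    return tokenize

tokenize = make_cached_tokenizer()
tokenize("Hello world hello")  # computed
tokenize("Hello world hello")  # served from the in-memory cache
```

On repeated calls with the same document (as happens across folds of a cross-validation pipeline), the second call is a cache hit and skips the regex work entirely.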
@rth
Member

rth commented Jul 17, 2018

Interesting! Possibly use functools.lru_cache for this once the code base becomes Python 3 only after the 0.20 release. There are existing benchmarks that could be used to evaluate the performance in bench_20newsgroups.py.
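To make the potential gain concrete, here is a toy sketch (illustrative only, no sklearn involved) counting how often each document is re-tokenized across a k-fold loop when nothing is cached; this is the redundant work memoization would remove:

```python
from collections import Counter

docs = [f"doc {i}" for i in range(6)]
calls = Counter()

def tokenize(doc):
    # stand-in for the expensive preprocess+tokenize step
    calls[doc] += 1
    return doc.split()

k = 3
for fold in range(k):
    held = set(range(fold, len(docs), k))
    train = [d for i, d in enumerate(docs) if i not in held]
    test = [docs[i] for i in sorted(held)]
    for d in train + test:  # fit on train, then transform test
        tokenize(d)

# without caching, every document is tokenized k times in total
```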

@mina1987
Contributor

mina1987 commented Oct 31, 2019

Has anyone worked on this? I would like to work on it at the Nov 2 WIML SF sprint.

@jnothman
Member Author

I think you could work on it @mina1987.

One tricky decision is how to handle the cache if the vectorizer is cloned.
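One possible answer to the cloning question, sketched below under assumptions (all names here are hypothetical, not sklearn API): since sklearn.clone reconstructs an estimator from its constructor parameters, a cache stored as an instance attribute would be discarded, whereas a module-level cache keyed by the tokenization-relevant parameters would survive and be shared across clones with identical settings:

```python
# Module-level registry of caches, keyed by the parameters that
# affect tokenization. A cloned vectorizer with the same parameters
# looks up the same cache; different parameters get a fresh one.
_TOKEN_CACHES = {}

def get_token_cache(lowercase, token_pattern):
    key = (lowercase, token_pattern)
    return _TOKEN_CACHES.setdefault(key, {})

cache_a = get_token_cache(True, r"\w+")
cache_b = get_token_cache(True, r"\w+")   # same params: same cache object
cache_c = get_token_cache(False, r"\w+")  # different params: fresh cache
```

The trade-off is that a module-level cache outlives any single estimator, so memory would need to be bounded or explicitly cleared.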

@mina1987
Contributor

mina1987 commented Nov 2, 2019

I will start working on this!

@mina1987
Contributor

mina1987 commented Nov 2, 2019

I am trying to understand this issue. The purpose is to store the doc-to-token mapping so it can be reused for subsequent identical documents, right?

@jnothman
Member Author

jnothman commented Nov 3, 2019 via email

@mina1987
Contributor

mina1987 commented Nov 3, 2019

How about passing the cache as an input/output of the vectorizer's fit/transform?

@jnothman
Member Author

jnothman commented Nov 4, 2019 via email
