Memoize tokenisation in CountVectorizer #11579
Comments
Interesting! Possibly use …
Has anyone worked on this? I would like to work on it at the Nov 2 WIML SF sprint.
I think you could work on it, @mina1987. One tricky decision is how to handle the cache if the vectorizer is cloned.
I will start working on this!
I am trying to understand this issue. The purpose is to store a document-to-tokens mapping so it can be reused for subsequent, possibly similar, documents, right?
Yes, it may help with repeated content, but more likely when fitting or transforming the data across multiple cross-validation folds. Hence the need for the cache to be maintained when the estimator is cloned.
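To see why cloning matters here, a small illustration (the `_token_cache` attribute is made up for this example): `sklearn.base.clone` rebuilds an estimator from its constructor parameters only, so any cache stashed on the instance is silently discarded.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec._token_cache = {"a doc": ("a", "doc")}  # hypothetical instance-level cache

vec2 = clone(vec)
# clone() reconstructs the estimator from get_params(), so the cache is gone:
print(hasattr(vec2, "_token_cache"))  # False
```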
How about having the cache as an input/output of the vectorizer's fit/transform?
I'm not sure what you mean. A pull request or code snippet would make your proposal more concrete.
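For illustration only, a hypothetical sketch (not from the thread) of one way such a caller-owned cache could be wired in today, through `CountVectorizer`'s existing `tokenizer` parameter:

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

_token_pattern = re.compile(r"(?u)\b\w\w+\b")  # CountVectorizer's default pattern

def make_cached_tokenizer(cache):
    """Return a tokenizer that reads and writes a caller-owned cache dict."""
    def tokenize(doc):
        tokens = cache.get(doc)
        if tokens is None:
            tokens = _token_pattern.findall(doc)
            cache[doc] = tokens
        return tokens
    return tokenize

shared_cache = {}  # the caller owns the cache and can reuse, inspect, or clear it
vec = CountVectorizer(tokenizer=make_cached_tokenizer(shared_cache))
```

Because `copy.deepcopy` treats plain functions as atomic, `clone()` keeps the same tokenizer object, so clones of the vectorizer keep writing to the same `shared_cache`.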
IIRC, previous attempts to improve efficiency in CountVectorizer have found that tokenization is the primary bottleneck. In my private extensions, memoizing the mapping `s -> tokenize(preprocess(s))` can improve runtime greatly, especially when the vectorizer is in a cross-validation pipeline (which ColumnTransformer etc. helps encourage). Challenges to memoization:
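A minimal sketch of that memoization using `functools.lru_cache` and a callable `analyzer` (an illustration under default word-unigram settings, not scikit-learn's implementation):

```python
import re
from functools import lru_cache

from sklearn.feature_extraction.text import CountVectorizer

_pattern = re.compile(r"(?u)\b\w\w+\b")  # CountVectorizer's default token_pattern

@lru_cache(maxsize=None)
def analyze(s):
    # Memoizes s -> tokenize(preprocess(s)); lowercasing stands in for
    # preprocess(), and a tuple keeps the cached value immutable.
    return tuple(_pattern.findall(s.lower()))

# A callable analyzer bypasses built-in preprocessing, tokenization,
# n-grams, and stop words, so this assumes plain word unigrams.
vec = CountVectorizer(analyzer=analyze)
```

Since the cache lives on the module-level function rather than on the estimator, it survives `clone()`, so repeated fits and transforms across cross-validation folds hit the cache for every document they have already seen.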