Idea: speed up the parallelization of CountVectorizer by adding a batch mechanism to joblib.Parallel #1401


@cjauvin
Contributor
cjauvin commented Nov 23, 2012

A couple of weeks ago, I submitted this experimental idea to the joblib mailing list, but it didn't receive much attention:


As I was studying the implementation of the sklearn CountVectorizer, my attention was drawn to a comment in the code saying that its main loop could not be efficiently parallelized with joblib:

https://github.com/scikit-learn/scikit-learn/blob/33a5911b55617e919fcfabc283c24784deaed686/sklearn/feature_extraction/text.py#L469

I don't know much about the internals of multiprocessing, but I imagined that there might be a tradeoff between the size of individual jobs and the number of times a process in the pool is dispatched a new job. For instance, if the vectorizer is passed a very long list of very short documents, the dispatching overhead could easily dominate the useful work. Perhaps a smaller number of longer jobs would work better in that case?

To explore this hypothesis, I had the simple idea of chaining jobs together as batches (to be executed by a single process) and dispatching those instead of individual jobs. The user can then experiment with different batch sizes, trying to find the sweet spot.
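In pseudo-code, the idea is roughly the following sketch (the helper names here are only illustrative, not the actual code in the PR):

```python
from joblib import Parallel, delayed

def _analyze(doc):
    # hypothetical stand-in for the vectorizer's per-document analyzer
    return doc.split()

def _analyze_batch(docs):
    # run the analyzer over a whole batch inside a single worker process,
    # so that one dispatch covers many small documents
    return [_analyze(doc) for doc in docs]

def analyze_all(raw_documents, n_jobs=1, batch_size=1):
    # chop the document list into contiguous batches of `batch_size`
    batches = [raw_documents[i:i + batch_size]
               for i in range(0, len(raw_documents), batch_size)]
    results = Parallel(n_jobs=n_jobs)(
        delayed(_analyze_batch)(batch) for batch in batches)
    # flatten the per-batch results back into one flat list of token lists
    return [tokens for batch in results for tokens in batch]
```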


Today I decided to try my idea in a more realistic setting by implementing a minimal version of it for CountVectorizer (it's not a full solution; see the caveats in the code comments). Using a 1M-line text corpus and a 24-core Linux machine, parallelizing without batches almost doubles the required time (which, I have to admit, is worryingly far from the +20% figure mentioned by @larsmans; my implementation is probably not optimal in that regard), whereas with my batch mechanism I obtained a 3.5x speedup:

https://gist.github.com/4137131

I don't know if the idea makes much sense, but I thought I would submit it here anyway, just to see what people think.

@cjauvin
Contributor
cjauvin commented Nov 28, 2012

Although I'd have preferred to be told why, I guess this total absence of feedback speaks for itself: my idea probably doesn't make sense, so I hereby close the PR.

@cjauvin cjauvin closed this Nov 28, 2012
@GaelVaroquaux
Member

Although I'd have preferred to be told why, I guess this total absence of
feedback speaks for itself: my idea probably doesn't make sense, so I hereby
close the PR.

Please don't. It's just lack of time. I want to look at this. It is on my
todo list. I am just collapsing under work and starting to feel a
beginning of a breakdown due to todo list overload.

@larsmans
Member

Same story here. I'd love to review, but deadlines are approaching.

@cjauvin
Contributor
cjauvin commented Nov 28, 2012

In that case, I'll reopen it. :-) Sorry about that; it's a simple exploratory idea, and it can certainly wait.

@cjauvin cjauvin reopened this Nov 28, 2012
@amueller
Member

@larsmans deadlines? Which ones?

@larsmans
Member

Deliverables.

@glouppe glouppe commented on the diff Jan 30, 2013
sklearn/feature_extraction/text.py
@@ -432,7 +468,7 @@ def fit(self, raw_documents, y=None):
        self.fit_transform(raw_documents)
        return self

-    def fit_transform(self, raw_documents, y=None):
+    def fit_transform(self, raw_documents, y=None, n_jobs=1, batch_size=1):
@glouppe
glouppe Jan 30, 2013 Member

n_jobs and batch_size should be set in the constructor instead. It would also be nice if you could include a docstring for both of them, since it might not be clear to everyone what batch_size means.
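Something along these lines (just a rough sketch; the class name and docstring wording are only illustrative):

```python
from sklearn.base import BaseEstimator

class BatchedCountVectorizer(BaseEstimator):
    """Illustrative sketch only: shows where the new parameters would live.

    Parameters
    ----------
    n_jobs : int, default 1
        Number of worker processes used when analyzing the documents.
    batch_size : int, default 1
        Number of documents grouped into one dispatched job; larger
        batches amortize the per-job dispatching overhead.
    """

    def __init__(self, n_jobs=1, batch_size=1):
        self.n_jobs = n_jobs
        self.batch_size = batch_size

    def fit_transform(self, raw_documents, y=None):
        # the parallel, batched analysis would read self.n_jobs and
        # self.batch_size here instead of taking them as arguments
        ...
```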

@glouppe glouppe commented on the diff Jan 30, 2013
sklearn/feature_extraction/text.py
+# These two functions are required for the joblib parallelization of the
+# CV's analyze function. Since multiprocessing.Pool was causing me some
+# trouble with the pickling of instance members and lambda functions, I cut it
+# short by simply extracting the logic for a single case (word ngrams),
+# with some default parameters hardcoded. Hence, this is NOT meant to be a
+# complete solution, just the minimal code for my proof of concept.
+
+def _word_ngrams_single(tokens, stop_words=None):
+    """Turn tokens into a sequence of n-grams after stop words filtering"""
+    # handle stop words
+    if stop_words is not None:
+        tokens = [w for w in tokens if w not in stop_words]
+
+    # handle token n-grams
+    min_n, max_n = (1, 1)  # self.ngram_range
+    if max_n != 1:
@glouppe
glouppe Jan 30, 2013 Member

I don't get it. max_n will always be 1 since you set it that way on the line before?

@cjauvin
cjauvin Jan 30, 2013 Contributor

As I say in the comment just above, this whole PR is absolutely not meant to be a complete solution. It's just a tentative proof of concept, studying the idea for a single case (I guess I should have said "unigrams" or "1-grams" instead of "word ngrams"), hence the hardcoding of certain parameters.

@jnothman
Member

A batch mechanism is now in joblib.Parallel. Should we be incorporating it in CountVectorizer?

@cjauvin
Contributor
cjauvin commented Aug 16, 2015

It's funny, because I had proposed a batching mechanism idea for Joblib more than 2 years ago:

cjauvin/joblib@04846a0

It was very sketchy, and I'm really not sure that what finally got implemented (which I wasn't aware of) is even related at all.

@jnothman
Member

The new batching involves dynamic sizing. Still, given a more recent PR, I suspect we're not going to get much benefit from batching CountVectorizer, but more benchmarks are needed.
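For reference, the joblib-side mechanism is exposed through the batch_size argument of Parallel: 'auto' (the default) resizes batches dynamically based on how long recent tasks took, while an integer fixes the number of tasks per dispatched batch. A minimal example (the numbers are arbitrary):

```python
from joblib import Parallel, delayed

# batch_size='auto' lets joblib grow or shrink the batches dynamically;
# an integer would fix the number of tasks dispatched per batch instead.
squares = Parallel(n_jobs=2, batch_size="auto")(
    delayed(pow)(i, 2) for i in range(1000))
```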
