Parallel CountVectorizer #1821

Closed
wants to merge 16 commits

Conversation

@lqdc

lqdc commented Mar 29, 2013

[benchmark figures: figure_1_mem, figure_1_transform, figure_1]
Slightly higher memory usage with multiprocessing, but much faster.

@lqdc


lqdc commented Mar 29, 2013

Permission denied... Do I remove multiprocessing tests from the nose testing?

@GaelVaroquaux

View changes

sklearn/feature_extraction/text.py
import numbers
from operator import itemgetter
import re
import unicodedata
import warnings
+import multiprocessing as mp

GaelVaroquaux Mar 29, 2013

Member

I would prefer it if you could avoid 'import ... as ...' for such a standard module: it helps readability IMHO.


@GaelVaroquaux


Member

GaelVaroquaux commented Mar 29, 2013

Permission denied... Do I remove multiprocessing tests from the nose testing?

Yes, we don't test parallel code in scikit-learn.

That said, so far we have also avoided, as much as possible, using different
code paths for parallel code in scikit-learn.

@GaelVaroquaux

View changes

sklearn/feature_extraction/text.py
@@ -592,7 +626,19 @@ def __init__(self, input='content', charset='utf-8',
         self.stop_words = stop_words
         self.max_df = max_df
         self.min_df = min_df
+        if n_jobs == -1:
+            self.n_jobs = mp.cpu_count()

GaelVaroquaux Mar 29, 2013

Member

The __init__ should not modify the arguments it is given; these modifications should happen in the methods that need them.
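(For illustration, a minimal sketch of the pattern suggested here; the _effective_n_jobs helper and the stripped-down class below are hypothetical, not code from this PR.)

    import multiprocessing

    def _effective_n_jobs(n_jobs):
        # Resolve n_jobs=-1 to the CPU count at call time, so the
        # constructor argument is stored unchanged.
        if n_jobs == -1:
            return multiprocessing.cpu_count()
        return n_jobs

    class CountVectorizer(object):
        def __init__(self, n_jobs=1):
            self.n_jobs = n_jobs  # stored as given, not modified

        def fit(self, raw_documents, y=None):
            n_jobs = _effective_n_jobs(self.n_jobs)
            # ... use n_jobs when creating the worker pool ...
            return self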


@lqdc

lqdc Mar 29, 2013

It is happening here too:

if vocabulary is not None:
    if not isinstance(vocabulary, Mapping):
        vocabulary = dict((t, i) for i, t in enumerate(vocabulary))
    if not vocabulary:
        raise ValueError("empty vocabulary passed to fit")
    self.fixed_vocabulary = True
    self.vocabulary_ = dict(vocabulary)

Should we change that?


@GaelVaroquaux

View changes

sklearn/feature_extraction/text.py
+        if len(raw_documents) == 0:
+            raise ValueError("No documents provided")
+        # tell python how to pickle a method
+        copy_reg.pickle(types.MethodType, _pickle_method, _unpickle_method)


GaelVaroquaux Mar 29, 2013

Member

I would very much like to avoid relying on such a trick, and to use functions rather than methods. That would greatly simplify the parallel code.

If I look at the two methods that you need to run in parallel (_count_new_vocab and _count_fixed_vocab), they use 'self' only once, and that is to build an analyzer. The analyzer could be built outside of the parallel code and passed to two functions containing pretty much the inner workings of _count_new_vocab and _count_fixed_vocab respectively. These functions wouldn't need to be methods, and the problem is solved: the design is simpler and doesn't require pickling tricks.
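(A rough sketch of that function-based design, for illustration; _count_chunk and _parallel_count are hypothetical names, and the sketch still assumes the analyzer shipped to the workers is picklable, which is where the discussion below runs into trouble.)

    from collections import Counter
    from multiprocessing import Pool

    def _count_chunk(args):
        # Module-level function, so multiprocessing can pickle it.
        # The analyzer is built once, outside the pool, and shipped
        # along with each chunk of documents.
        docs, analyze = args
        counts = Counter()
        for doc in docs:
            counts.update(analyze(doc))
        return counts

    def _parallel_count(raw_documents, analyze, n_jobs, chunk_size):
        chunks = [raw_documents[i:i + chunk_size]
                  for i in range(0, len(raw_documents), chunk_size)]
        pool = Pool(processes=n_jobs)
        try:
            return pool.map(_count_chunk, [(c, analyze) for c in chunks])
        finally:
            pool.close()
            pool.join()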


@lqdc

lqdc Mar 29, 2013

Alright, but then it's necessary to change chunk_docs to either yield the analyzer with each chunk or put all the chunks in a list. Putting them all in a list would take up a lot of RAM that Python wouldn't give back. I agree that it's ugly, so I'll try to change it.


@lqdc

lqdc Mar 29, 2013

So apparently writing a function like

def _run_pool(n_jobs, fixed_vocab, raw_documents, chunk_size, analyze, vocab):
    pool = Pool(processes=n_jobs)
    if not fixed_vocab:
        result = pool.map(_count_new_vocab, 
                          _chunk_docs(raw_documents, chunk_size, 
                          analyze, vocab=None))

and calling it from the class doesn't work either.


@lqdc

lqdc Mar 29, 2013

The other way to do it is to use Queues, but that adds a lot of complexity, and you have to worry about ordering and such.


@GaelVaroquaux


Member

GaelVaroquaux commented Mar 29, 2013

Given my remark about refactoring to avoid running parallel computation on methods, is there any reason that you are not using 'joblib.Parallel' for the parallel computing part? We try to rely on it, because it gives us an abstraction over multiprocessing that encompasses many useful patterns.

In particular, it caters for the specific failure that you are encountering on Travis (line 37 of joblib/parallel.py), namely that multiprocessing does not work on certain systems.
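(For comparison, a minimal sketch of what a joblib-based version of the same idea could look like; _count_chunk and _parallel_count are again hypothetical, and joblib pickles the function and its arguments too, so the same picklability constraints apply.)

    from collections import Counter
    from sklearn.externals.joblib import Parallel, delayed

    def _count_chunk(docs, analyze):
        # Must be a module-level function so it can be pickled.
        counts = Counter()
        for doc in docs:
            counts.update(analyze(doc))
        return counts

    def _parallel_count(chunks, analyze, n_jobs):
        # joblib falls back to sequential execution when
        # multiprocessing is unavailable (e.g. on some CI systems).
        return Parallel(n_jobs=n_jobs)(
            delayed(_count_chunk)(chunk, analyze) for chunk in chunks)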

@lqdc


lqdc commented Mar 29, 2013

I had a bad experience with joblib a while ago because it copied all the objects and took up a lot of RAM. Also, in this case the multiprocessing code is actually shorter than the joblib code. The other reason I didn't use it, quoting from the joblib site:

Under the hood, the Parallel object creates a multiprocessing pool that forks the Python interpreter in multiple processes to execute each of the items of the list.

So in other words it would still need access to multiprocessing, which causes the permission-denied error. But I can change it if you want it to be consistent with other things in the library.

@lqdc


lqdc commented Mar 29, 2013

So basically I did get the same errors with joblib: it still needs to pickle a method when calling mp.Pool/joblib.Parallel.

@lqdc


lqdc commented Mar 29, 2013

Basically, just calling the function that runs the pool doesn't seem to be OK either. This should theoretically work:
https://gist.github.com/lqdc/5269837

I think the only other way is to use Queues.

@GaelVaroquaux


Member

GaelVaroquaux commented Mar 29, 2013

Basically just calling the function that runs the pool seems to not be ok. This should theoretically work. https://gist.github.com/lqdc/5269837

I like this overall design. Is there a drawback to it?

Thanks!

@lqdc


lqdc commented Mar 29, 2013

The drawback is that it doesn't work haha.

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 342, in _handle_tasks
    put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
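
(This is the standard multiprocessing limitation: only functions that pickle by reference, i.e. ones importable at module level, can be sent to worker processes; lambdas and locally defined functions cannot. A small standalone reproducer, assuming Python 2 as in the traceback above:)

    import pickle

    def module_level(doc):
        return doc.split()

    pickle.dumps(module_level)  # fine: pickled by reference to its module

    analyze = lambda doc: doc.split()  # similar in spirit to an analyzer
    pickle.dumps(analyze)  # raises pickle.PicklingError: a lambda cannot
                           # be looked up by name in its module

Since build_analyzer() returns such a lambda, shipping the analyzer to the workers (whether through multiprocessing.Pool or joblib) most likely hits exactly this error.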
@GaelVaroquaux


Member

GaelVaroquaux commented Mar 29, 2013

The drawback is that it doesn't work haha.

Strange. I can't see why. Do you have a branch that I can check out, and a script to run to reproduce the error? I should really be working on other things, but I am curious.

@lqdc


lqdc commented Mar 29, 2013

https://github.com/lqdc/scikit-learn/tree/countvect_unpickleable

text.py is the one from the gist; sklearn/feature_extraction/tests/test_unpickleable.py is the script to test it out.

@lqdc


lqdc commented Apr 2, 2013

Also, my memory claims were a bit optimistic: the parallel version's memory usage may spike by another 50% at certain times. Any other comments?

@lqdc


lqdc commented Apr 17, 2013

[figure: countvect_sorting]

Updated the fit_transform graph with sorting information. Wherever it doesn't explicitly say sorted, the features aren't sorted.

@lqdc


lqdc commented Apr 17, 2013

[figure: ngram_12_transform]

Updated the transform graph.

@lqdc lqdc closed this Apr 18, 2013

@lqdc


lqdc commented Apr 18, 2013

The parallel code doesn't scale well with a large number of features right now, which is the only case where one would want to run it in parallel.
