[MRG] Release the GIL at tree building time #2528

Merged
merged 2 commits into scikit-learn:master from ogrisel:tree-nogil on Oct 20, 2013

8 participants

@ogrisel
scikit-learn member

Good news! Thanks to Gilles releasing the GIL on most splitter methods, we can now release the GIL for most of the tree construction as well.

I ran a benchmark and I get a 10x speedup on a 12-core machine with a Python thread pool (individual trees build in about 6s).

@pprett
scikit-learn member
@ogrisel
scikit-learn member

Here is the bench script I used (you need to pip install futures if running Python 2):

https://gist.github.com/ogrisel/7008622

and here is the output on my box:

$ python ~/bench_tree.py

Dataset statistics:
===================
number of features:       54
number of classes:        2
data type:                float32
number of train samples:  100000 (pos=36745, neg=63255, size=21MB)
number of test samples:   481012 (pos=175095, neg=305917, size=103MB)
Training one:
trained one tree in 5.83s
time to train one tree: 5.83s
Training 12 trees with 12 threads
trained one tree in 6.4s
trained one tree in 6.41s
trained one tree in 6.41s
trained one tree in 6.46s
trained one tree in 6.48s
trained one tree in 6.51s
trained one tree in 6.52s
trained one tree in 6.58s
trained one tree in 6.6s
trained one tree in 6.62s
trained one tree in 6.63s
trained one tree in 7.22s
time to train 12 trees in //: 7.23s
speedup: 9.68533627751x
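
For reference, the thread-pool part of the gist boils down to something like the following (a minimal sketch, not the exact script; make_classification stands in for the dataset used in the benchmark):

# Minimal sketch of the thread-pool benchmark; synthetic data stands in
# for the dataset used in the gist.
from concurrent.futures import ThreadPoolExecutor  # pip install futures on Python 2
from time import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100000, n_features=54, random_state=0)
X = X.astype(np.float32)

def train_one(seed):
    tic = time()
    tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
    print("trained one tree in %0.2fs" % (time() - tic))
    return tree

# The fit calls release the GIL, so the threads can actually run on separate cores.
n_trees = 12
tic = time()
with ThreadPoolExecutor(max_workers=n_trees) as executor:
    trees = list(executor.map(train_one, range(n_trees)))
print("time to train %d trees in //: %0.2fs" % (n_trees, time() - tic))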
@ogrisel
scikit-learn member

Tomorrow I will have a look at the other methods (prediction and such) and at making joblib support Python threads in addition to processes.

@glouppe
scikit-learn member

Wow.

@jaquesgrobler
scikit-learn member

@ogrisel wow that's a nice speedup!

@coveralls

Coverage Status

Coverage remained the same when pulling f859b7d on ogrisel:tree-nogil into b807036 on scikit-learn:master.

@mblondel
scikit-learn member

So now multiprocessing is not the only way to go! Can you summarize the pros and cons of multiprocessing vs. threads? Also, can you give the speed up before this PR?

@GaelVaroquaux
scikit-learn member

Very nice. Does this have an impact on non-threaded performance?

@pprett
scikit-learn member

@GaelVaroquaux this is just a proof of concept - I think Olivier just added the nogil markers to the necessary methods and put the thread pool in the gist.

@GaelVaroquaux
scikit-learn member
@ogrisel
scikit-learn member

Does this have an impact on non-threaded performance?

No, releasing the GIL should be a nanosecond-scale operation. It cannot be seen compared to building trees that take seconds or more.

@ogrisel
scikit-learn member

Can you summarize the pros and cons of multiprocessing vs. threads?

multiprocessing cons:

  • need to pickle and unpickle args and return values for the callable executed in parallel
  • will copy the input training data n_workers times

multithreading cons:

  • will only execute in parallel compiled code that releases the GIL (all other Python interpreted code, or compiled code that does not release the GIL, will run sequentially: threads will wait for each other and only one core will be used in GIL-protected code sections)

Some but not all of the memory copy and pickling overhead of multiprocessing can be optimized away by using memory mapping.

multiprocessing is still useful to parallelize code that makes many sub-calls to Python functions. For instance, general grid search is probably best addressed by multiprocessing (ideally with memmapping), while parallelism in highly specialized Cython code areas is probably best addressed by multithreading.
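
Concretely, the memmapping trick amounts to something like this (a rough sketch with illustrative paths and shapes, not what joblib does internally):

# Sketch: share the training data across worker processes via a memory map
# instead of pickling a copy into each worker. Path and shape are illustrative.
import numpy as np

# In the parent process: dump the array to disk once...
X = np.random.rand(100000, 54).astype(np.float32)
np.save('/tmp/X_train.npy', X)

# ...in each worker: re-open it as a read-only memory map, so the OS shares
# the pages between processes instead of duplicating the data per worker.
X_shared = np.load('/tmp/X_train.npy', mmap_mode='r')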

Also, can you give the speed up before this PR?

Before this PR, each tree was built sequentially, hence building 12 trees on a 12-core machine would take 12 times the single-tree build time (with 11 cores idle). In this case that means around 70s.

@pprett
scikit-learn member

Did you also bench against multiprocessing-based parallel tree building? I hardly used it in the past (before memory mapping) due to memory hogging and the high cost of pickling, etc. I never got such a nice linear speedup as in your benchmark.

@ogrisel
scikit-learn member

Did you also bench against multiprocessing-based parallel tree building? I hardly used it in the past (before memory mapping) due to memory hogging and the high cost of pickling, etc. I never got such a nice linear speedup as in your benchmark.

I will update the script to compare with multiprocessing and measure the pickling overhead.

@glouppe glouppe and 1 other commented on an outdated diff Oct 16, 2013
sklearn/tree/_tree.pyx
+
+ # Stack right child
+ stack[stack_n_values] = pos
+ stack[stack_n_values + 1] = end
+ stack[stack_n_values + 2] = depth + 1
+ stack[stack_n_values + 3] = node_id
+ stack[stack_n_values + 4] = 0
+ stack_n_values += 5
+
+ # Stack left child
+ stack[stack_n_values] = start
+ stack[stack_n_values + 1] = pos
+ stack[stack_n_values + 2] = depth + 1
+ stack[stack_n_values + 3] = node_id
+ stack[stack_n_values + 4] = 1
+ stack_n_values += 5
self._resize(self.node_count)
@glouppe
scikit-learn member
glouppe added a note Oct 16, 2013

If I am not wrong, this line above could also be put in the nogil section.

@ogrisel
scikit-learn member
ogrisel added a note Oct 16, 2013

Yes, I will update the PR. I was thinking of trying to make the whole method nogil and to wrap the leading numpy calls in a with gil block.

@ogrisel
scikit-learn member

I have updated the benchmark script to compare with joblib master, with and without memory mapping:

python bench_tree.py
Dataset statistics:
===================
number of features:       54
number of classes:        2
data type:                float32
number of train samples:  100000 (pos=36745, neg=63255, size=21MB)
number of test samples:   481012 (pos=175095, neg=305917, size=103MB)
Training one:
-> trained one tree in 5.69s
Training 4 trees with 4 threads
-> trained one tree in 11.4s
-> trained one tree in 11.4s
-> trained one tree in 11.4s
-> trained one tree in 11.4s
Time to train 4 trees in parallel: 11.4s
Speedup: 2.0x
Training 4 trees with 4 processes and memory mapping
-> trained one tree in 11.7s
-> trained one tree in 11.8s
-> trained one tree in 11.8s
-> trained one tree in 11.8s
Time to train 4 trees in parallel: 11.9s
Speedup: 1.9x
Training 4 trees with 4 processes without memory mapping
-> trained one tree in 12.7s
-> trained one tree in 12.6s
-> trained one tree in 12.6s
-> trained one tree in 12.5s
Time to train 4 trees in parallel: 13.0s
Speedup: 1.7x

This time it was run on my laptop, which has 2 hyperthreaded cores (hence 4 logical CPUs).

The speedup is near perfect for threading and memmapped joblib. The overhead of pickling and extra memory allocations is visible (even on such a small 21MB dataset) when disabling the memory mapping.

@ogrisel
scikit-learn member

The thread version is slightly faster than the multiproc + memmap version probably because pickling the resulting trees to send them from the worker back to the original process is not required in the threading case, while it still is for the multiprocessing version even with memmapping, as the data structure for the resulting trees cannot be allocated in the memmap area ahead of time by the original process.

@coveralls

Coverage Status

Coverage remained the same when pulling 1e3cadf on ogrisel:tree-nogil into b807036 on scikit-learn:master.

@GaelVaroquaux
scikit-learn member
@mblondel
scikit-learn member

@ogrisel Thanks, very interesting! What are the rules of thumb for wrapping a function or a portion of code with nogil?

Next we should try to add nogil to SGD (some functions are already annotated with it), train different classifiers on different splits using one thread per split and do averaging. Some algorithms in lightning would also be good candidates.

@ogrisel
scikit-learn member

Do you see a difference in memory consumption (I would expect one)?

Between threading and multiprocessing with memmap? The only overhead of multiprocessing + memmap is the allocation of the trees in the worker process, which is duplicated when pickling the trees to be sent back to the parent process before being collected in the worker process. But this is just a temporary duplication, and usually the trees are smaller than the original data (unless the data is completely random).

I will try to use memory_profiler, but there is currently a limitation with its timestamp feature and multiprocessing.

@ogrisel
scikit-learn member

Of course, in the multiprocessing-without-memmapping case, the data gets duplicated many times, both in the multiprocessing queue and in the worker subprocesses, thus exploding memory usage on non-toy data, but this is already well known.

@ogrisel
scikit-learn member

@ogrisel Thanks, very interesting! What are the rules of thumb for wrapping a function or a portion of code with nogil?

You just cannot allocate or manipulate any Python object (including exceptions) in those sections. Whenever you want to call a Python object from a nogil section (e.g. to call a callback such as a custom metric defined by a Python callable or a monitoring callback, or to raise an exception), you need to re-acquire the GIL just for that specific call with a with gil block, such as the one used in this PR to raise the MemoryError exception. Of course, that should be restricted to rare events, otherwise you will observe lock-contention-related performance degradation and the code will run no faster than sequentially.

See: http://docs.cython.org/src/userguide/external_C_code.html#acquiring-and-releasing-the-gil
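
In Cython terms the pattern looks roughly like this (a schematic sketch in the spirit of the _tree.pyx change, not the actual code):

# Schematic: the C-level work runs with the GIL released; the GIL is
# re-acquired only for the rare Python-level operation, here raising
# MemoryError on allocation failure.
from libc.stdlib cimport malloc, free

cdef int grow(double* data, int n) except -1 nogil:
    cdef double* buf = <double*> malloc(n * sizeof(double))
    if buf == NULL:
        with gil:
            raise MemoryError()
    # ... tight C loops over data / buf, no Python objects touched ...
    free(buf)
    return 0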

Next we should try to add nogil to SGD (some functions are already annotated with it), train different classifiers on different splits using one thread per split and do averaging. Some algorithms in lightning would also be good candidates.

Yes, even without averaging it would be useful for parallelizing the OVA multiclass pattern efficiently.

@pprett
scikit-learn member

Yes, even without averaging it would be useful for parallelizing the OVA multiclass pattern efficiently.

+1 that's a huge pain point

@mblondel
scikit-learn member

Do you plan to ship futures in sklearn.externals?

@ogrisel
scikit-learn member

There is no need to ship futures, I just used it as convenience for testing threading in this gist. I plan to add a thread mode for joblib instead. That way no code change will be required in sklearn to switch from multiprocessing based to threading based parallelism.
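
For illustration, switching to threads could then look like this (a sketch; the backend="threading" argument shown here is an option joblib grew later and did not exist at the time of this PR):

# Sketch of thread-based joblib usage; backend="threading" is the later
# joblib option, shown only for illustration.
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10000, n_features=54, random_state=0)

def train_one(seed):
    return DecisionTreeClassifier(random_state=seed).fit(X, y)

trees = Parallel(n_jobs=4, backend="threading")(
    delayed(train_one)(seed) for seed in range(4))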

@mblondel
scikit-learn member

Sounds good!

@larsmans
scikit-learn member

Nice work!

No, releasing the GIL should be a nanosecond-scale operation.

Maybe not nanoseconds, but cheap. The trick is in the Cython code generator: when a cdef nogil function is called and the GIL is off, it's just a call to a static C function with no interference from the CPython API.

@ogrisel
scikit-learn member

Actually I think there is no need to wait to merge this PR:

  • the GIL release does not impact the rest in any way,
  • the predict code would need a partial rewrite to be able to release the GIL there as well (let's do a dedicated PR for that),
  • the joblib thread mode will be prototyped in its own pull request upstream.
@larsmans
scikit-learn member

Go ahead!

@glouppe
scikit-learn member

Okay then. +1 for merge on my side.

@ogrisel ogrisel merged commit 668c197 into scikit-learn:master Oct 20, 2013

1 check passed

default: The Travis CI build passed
@ogrisel
scikit-learn member

Merged!
