Switch autograd to use a flexible pool of workers with device locks #18949

ezyang · 2019-04-05T14:14:50Z

Today, autograd allocates a fixed set of threads, one for CPU, and one per CUDA device, and processes work for all backwards invocations (even if they are happening in parallel, or reentrantly) on these threads. This maintains the "no-reentrant apply invariant" that a single function's apply never gets entered concurrently; this allows us to implement AccumulateGrad without locks.

The purpose of this issue is to propose a new design for autograd, based on a pool of threads, and a per-device set of locks which protect the work queue for backwards function on that device, and help us maintain the no-reentrant apply invariant.

We maintain a pool of unallocated threads, waiting for work to do. There is a lock per device.
When work for device X enters the autograd scheduler, it is placed on its respective work queue. If there is no worker for that device, we wake up a thread from the pool, which takes out the lock for the device and then begins processing work on the pool in a loop.
When there is no more work to be done, the thread gives up the lock and returns to the pool.
When a reentrant call to backwards occurs on a worker device (or really, any blocking operation), the worker thread gives up its lock and wakes up a thread from the unallocated pool to take it. It then does the blocking operation. After it finishes the blocking operation, it attempts to take back the lock (so that it can finish processing the backwards function).

This redesign would fix #6959, because reentrant calls wouldn't recursively process the worker queue; instead, such processing would happen in another thread. It also has good performance characteristics, as without reentrant backwards calls, you use exactly the same number of threads that you used previously. Unlike the generator solution proposed in #18568 (comment) it is backwards compatible with current autograd Function syntax.

From a scheduling perspective, because worker threads are greedy (they will do as much as work as possible before exiting), this design will favor executing all available work, before returning the lock to threads which are waiting to reacquire the lock to finish up reentrant backwards calls. You could solve this by adding a preemption mechanism, so that a thread that wants to return can preempt the existing worker.

cc @apaszke @shubhtuls @sublee @albanD

apaszke · 2019-04-07T17:42:34Z

Well, one also needs to carefully take the two overheads that make most changes to the engine make it perform significantly worse:

Power management

Taking a sleeping thread will likely assign it to a cold core and it will take a while for the clock speed to rise. This is already killing us even when we only use a single thread for backward and a single one for forward.

Lots of threads?

While you might avoid recursion errors in a single thread, wouldn't those cases simply allocate a bajillion threads, which doesn't seem very nice either?

Inter-thread communication

Apart from the clock speed issues, the overhead of switching the queue between threads seems to be as large as even 4us in some cases, which is pretty significant compared to the cost of many of our ops. Keep that in mind.

ezyang · 2019-04-08T01:44:51Z

Yes, I agree care is necessary. Ideally, we should end up with exactly the same performance characteristics as the prior design if we don't make reentrant calls to backwards (this might mean that threads, when idle, shouldn't give up the lock, but just keep the lock and keep the core warm, or whatever it is we do today. Similarly, there shouldn't be any queue switching because threads keep processing events on their own queue unless they get preempted, which should only happen when a reentrant call wants to return.)

Having a lot of sacrificial threads to deal with blocking calls isn't very nice, but it's a pretty common way to get around this kind of problem. GHC threads are implemented similarly, because sometimes you want to make FFI calls to blocking routines, but you don't want to lower the overall concurrency of your system. And in any case, for the examples quoted here, you wouldn't end up with a "bajillion" threads, you would just end up with an extra thread per reentrant call to backwards, which is the best you can hope to do without rewriting the backward implementation to be a coroutine with generators.

yf225 · 2019-07-24T22:26:55Z

@ezyang If I understand it correctly, is this going to fix #18333 and #17566, for running backward() on separate models in parallel?

ezyang · 2019-07-26T21:08:09Z

No it doesn't. Excerpts from Will Feng's message of 2019-07-24 15:28:03 -0700:

…

@ezyang If I understand it correctly, is this going to fix #18333 and #17566, for running backward() on separate models in parallel?

ezyang · 2020-01-22T22:12:12Z

malvika did this!

malvika2147 mentioned this issue Jun 18, 2019

Switch autograd to use a pool of workers for each device #21911

Closed

ezyang closed this as completed Jan 22, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch autograd to use a flexible pool of workers with device locks #18949

Switch autograd to use a flexible pool of workers with device locks #18949

ezyang commented Apr 5, 2019

apaszke commented Apr 7, 2019

ezyang commented Apr 8, 2019

yf225 commented Jul 24, 2019

ezyang commented Jul 26, 2019 via email

ezyang commented Jan 22, 2020

Switch autograd to use a flexible pool of workers with device locks #18949

Switch autograd to use a flexible pool of workers with device locks #18949

Comments

ezyang commented Apr 5, 2019

apaszke commented Apr 7, 2019

Power management

Lots of threads?

Inter-thread communication

ezyang commented Apr 8, 2019

yf225 commented Jul 24, 2019

ezyang commented Jul 26, 2019 via email

ezyang commented Jan 22, 2020