
Switch autograd to use a flexible pool of workers with device locks #18949

Closed
ezyang opened this issue Apr 5, 2019 · 5 comments
Labels
feature (A request for a proper, new feature) · hackamonth · low priority (We're unlikely to get around to doing this in the near future) · module: autograd (Related to torch.autograd, and the autograd engine in general) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@ezyang
Contributor

ezyang commented Apr 5, 2019

Today, autograd allocates a fixed set of threads, one for CPU, and one per CUDA device, and processes work for all backwards invocations (even if they are happening in parallel, or reentrantly) on these threads. This maintains the "no-reentrant apply invariant" that a single function's apply never gets entered concurrently; this allows us to implement AccumulateGrad without locks.
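
For concreteness, a toy rendition of that status quo might look like the following (just an illustration in Python, not the actual engine, which lives in C++; the names device_worker and NUM_DEVICES are made up for this sketch):

```python
# Toy sketch of the current scheme: one dedicated worker thread per device,
# each draining only its own ready queue, so a given function's apply() is
# never entered by two threads at once.
import queue
import threading

NUM_DEVICES = 3  # hypothetical: e.g. CPU plus two CUDA devices

def device_worker(ready_queue):
    """Runs forever, executing backward tasks for exactly one device."""
    while True:
        task = ready_queue.get()   # blocks until work arrives for this device
        task()                     # serialized per device, so no locks needed
                                   # in things like AccumulateGrad

ready_queues = [queue.Queue() for _ in range(NUM_DEVICES)]
for q in ready_queues:
    threading.Thread(target=device_worker, args=(q,), daemon=True).start()
```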

The purpose of this issue is to propose a new design for autograd, based on a pool of threads and a per-device set of locks which protect the work queue for backwards functions on that device, and help us maintain the no-reentrant apply invariant (a rough sketch follows the list below).

  1. We maintain a pool of unallocated threads, waiting for work to do. There is a lock per device.
  2. When work for device X enters the autograd scheduler, it is placed on that device's work queue. If there is no worker for that device, we wake up a thread from the pool, which takes the lock for the device and then begins processing work on the queue in a loop.
  3. When there is no more work to be done, the thread gives up the lock and returns to the pool.
  4. When a reentrant call to backwards occurs on a worker device (or really, any blocking operation), the worker thread gives up its lock and wakes up a thread from the unallocated pool to take it. It then does the blocking operation. After it finishes the blocking operation, it attempts to take back the lock (so that it can finish processing the backwards function).
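
For concreteness, here is a rough Python sketch of steps 1–4 (again purely illustrative, not the actual engine; DeviceState, WorkerPool, push_work, and reentrant_call are hypothetical names invented for this sketch):

```python
import threading
from collections import deque


class DeviceState:
    """Per-device state: the device lock (here a condition variable), the
    ready queue it protects, and whether some pool thread currently owns it."""
    def __init__(self):
        self.cv = threading.Condition()
        self.ready = deque()
        self.owned = False


class WorkerPool:
    def __init__(self, num_devices, num_threads):
        self.devices = [DeviceState() for _ in range(num_devices)]
        self.pool_cv = threading.Condition()
        self.pending = deque()          # devices that need a worker to take them
        for _ in range(num_threads):
            threading.Thread(target=self._worker_loop, daemon=True).start()

    # Step 2: enqueue work; wake an idle pool thread only if nobody owns the
    # device yet, so at most one thread ever drains a given device's queue.
    def push_work(self, device, task):
        d = self.devices[device]
        with d.cv:
            d.ready.append(task)
            if d.owned:
                return                  # the active worker will pick this up
            d.owned = True              # claim the device on behalf of the pool
        self._wake_worker(device)

    def _wake_worker(self, device):
        with self.pool_cv:
            self.pending.append(device)
            self.pool_cv.notify()       # step 1: wake one idle pool thread

    def _worker_loop(self):
        while True:
            with self.pool_cv:
                while not self.pending:
                    self.pool_cv.wait() # idle threads sit in the pool
                device = self.pending.popleft()
            self._drain(device)

    # Step 3: drain the device's queue; when it is empty, give up the device
    # and go back to the pool.
    def _drain(self, device):
        d = self.devices[device]
        while True:
            with d.cv:
                if not d.ready:
                    d.owned = False
                    d.cv.notify_all()   # let a reacquiring thread take over
                    return
                task = d.ready.popleft()
            task()                      # runs outside the lock, but only one
                                        # drainer exists per device, so apply()
                                        # is never entered reentrantly

    # Step 4: called from inside a running task that must block (e.g. a
    # reentrant backward): hand the device to another pool thread, run the
    # blocking call, then wait to take the device back before continuing.
    def reentrant_call(self, device, blocking_call):
        d = self.devices[device]
        self._wake_worker(device)       # someone else keeps the device busy
        blocking_call()                 # e.g. the nested backward()
        with d.cv:
            while d.owned:              # wait for the stand-in to finish
                d.cv.wait()
            d.owned = True              # we own the device again; our caller
                                        # resumes draining inside _drain()
```

Note that reentrant_call in this sketch simply waits for the stand-in drainer to run out of work before taking the device back, which is exactly the greedy-scheduling caveat discussed below.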

This redesign would fix #6959, because reentrant calls wouldn't recursively process the worker queue; instead, such processing would happen in another thread. It also has good performance characteristics: without reentrant backwards calls, you use exactly the same number of threads as before. Unlike the generator solution proposed in #18568 (comment), it is backwards compatible with the current autograd Function syntax.

From a scheduling perspective, because worker threads are greedy (they will do as much work as possible before exiting), this design will favor executing all available work before returning the lock to threads that are waiting to reacquire it to finish up reentrant backwards calls. You could solve this by adding a preemption mechanism, so that a thread that wants to return can preempt the existing worker (a toy version of such a handshake is sketched below).
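
As a toy illustration of what such a preemption handshake could look like (again not engine code; PreemptibleDrainer and its methods are invented for this sketch), the drainer checks a flag between tasks and yields early when a waiting thread asks for the queue back:

```python
import threading
from collections import deque


class PreemptibleDrainer:
    """The per-device drain loop from the sketch above, extended with a flag
    that lets a waiting thread preempt the greedy worker between tasks."""
    def __init__(self):
        self.cv = threading.Condition()
        self.tasks = deque()
        self.active = False             # is some thread currently draining?
        self.preempt_requested = False

    def push(self, task):
        with self.cv:
            self.tasks.append(task)

    def drain(self):
        """Run queued tasks until the queue is empty or a preemption arrives."""
        with self.cv:
            self.active = True
        while True:
            with self.cv:
                if self.preempt_requested or not self.tasks:
                    self.preempt_requested = False
                    self.active = False
                    self.cv.notify_all()    # wake anyone blocked in preempt()
                    return
                task = self.tasks.popleft()
            task()                          # run outside the lock

    def preempt(self):
        """Ask the current drainer (if any) to yield after its current task,
        and block until it has done so."""
        with self.cv:
            self.preempt_requested = True
            while self.active:
                self.cv.wait()
            self.preempt_requested = False  # no drainer was active, or it yielded
```

In the pool sketch above, a thread returning from a reentrant backwards would call preempt() before trying to take the device lock back, rather than waiting for the stand-in to empty the queue.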

cc @apaszke @shubhtuls @sublee @albanD

@ezyang added the feature, module: autograd, low priority, triaged, and hackamonth labels Apr 5, 2019
@apaszke
Contributor

apaszke commented Apr 7, 2019

Well, one also needs to carefully account for the overheads that make most changes to the engine perform significantly worse:

Power management

Waking a sleeping thread will likely assign it to a cold core, and it will take a while for the clock speed to rise. This is already killing us even when we only use a single thread for backward and a single one for forward.

Lots of threads?

While you might avoid recursion errors in a single thread, wouldn't those cases simply allocate a bajillion threads, which doesn't seem very nice either?

Inter-thread communication

Apart from the clock speed issues, the overhead of switching the queue between threads seems to be as large as 4µs in some cases, which is pretty significant compared to the cost of many of our ops. Keep that in mind.

@ezyang
Contributor Author

ezyang commented Apr 8, 2019

Yes, I agree care is necessary. Ideally, we should end up with exactly the same performance characteristics as the prior design when we don't make reentrant calls to backwards. This might mean that threads, when idle, shouldn't give up the lock, but instead keep it and keep the core warm, or whatever it is we do today. Similarly, there shouldn't be any queue switching, because threads keep processing events on their own queue unless they get preempted, which should only happen when a reentrant call wants to return.

Having a lot of sacrificial threads to deal with blocking calls isn't very nice, but it's a pretty common way to get around this kind of problem. GHC threads are implemented similarly, because sometimes you want to make FFI calls to blocking routines, but you don't want to lower the overall concurrency of your system. And in any case, for the examples quoted here, you wouldn't end up with a "bajillion" threads; you would just end up with an extra thread per reentrant call to backwards, which is the best you can hope to do without rewriting the backward implementation to be a coroutine with generators.

@yf225
Contributor

yf225 commented Jul 24, 2019

@ezyang If I understand it correctly, is this going to fix #18333 and #17566, for running backward() on separate models in parallel?

@ezyang
Contributor Author

ezyang commented Jul 26, 2019 via email

@ezyang
Contributor Author

ezyang commented Jan 22, 2020

Malvika did this!

@ezyang closed this as completed Jan 22, 2020