Switch autograd to use a flexible pool of workers with device locks #18949
Labels:
- feature: A request for a proper, new feature.
- hackamonth
- low priority: We're unlikely to get around to doing this in the near future.
- module: autograd: Related to torch.autograd, and the autograd engine in general.
- triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module.
Today, autograd allocates a fixed set of threads, one for the CPU and one per CUDA device, and processes work for all backward invocations (even if they are happening in parallel, or reentrantly) on these threads. This maintains the "no-reentrant apply" invariant: a single function's apply is never entered concurrently, which lets us implement AccumulateGrad without locks.
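For concreteness, here is a minimal example of what "reentrantly" means here: a custom Function whose backward re-enters the engine by running another backward pass (roughly what torch.utils.checkpoint does). The class name `ReentrantSquare` is made up for illustration; only the torch APIs it calls are real.

```python
import torch
from torch.autograd import Function

class ReentrantSquare(Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Recompute the forward under grad mode and backpropagate through it;
        # this nested .backward() call re-enters the autograd engine while the
        # outer backward for this function is still running.
        with torch.enable_grad():
            x_ = x.detach().requires_grad_()
            y = x_ * x_
            y.backward(grad_out)
        return x_.grad

x = torch.randn(3, requires_grad=True)
ReentrantSquare.apply(x).sum().backward()
print(x.grad)  # 2 * x
```

Under the current design, that nested backward is processed recursively on the same fixed worker thread that was already handling the outer backward.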
The purpose of this issue is to propose a new design for autograd based on a pool of worker threads and a per-device set of locks; each device's lock protects the work queue for backward functions on that device and helps us maintain the no-reentrant apply invariant.
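A minimal sketch of how that could look, written with plain Python threading rather than the real C++ engine; all names here (`Engine`, `push`, `_drain`, `num_devices`) are illustrative assumptions, not the actual implementation.

```python
import collections
import threading
from concurrent.futures import ThreadPoolExecutor

class Engine:
    """Sketch only: a shared worker pool plus one (lock, ready queue) pair per device."""

    def __init__(self, num_devices, num_workers=4):
        self.pool = ThreadPoolExecutor(max_workers=num_workers)
        self.queues = [collections.deque() for _ in range(num_devices)]
        self.locks = [threading.Lock() for _ in range(num_devices)]

    def push(self, device, task):
        # Enqueue a backward function for `device` and ask the pool (not the
        # calling thread) to drain that device's queue, so a reentrant
        # backward never recurses on the caller's stack.
        self.queues[device].append(task)
        self.pool.submit(self._drain, device)

    def _drain(self, device):
        lock, queue = self.locks[device], self.queues[device]
        while queue:
            # Only the thread holding a device's lock may pop from its queue
            # and apply functions, so no apply is ever entered concurrently.
            if not lock.acquire(blocking=False):
                return  # another worker owns this device; it will do the work
            try:
                while queue:
                    queue.popleft()()  # "apply" one ready function
            finally:
                lock.release()
```

A reentrant backward then simply calls `push` again: the nested work is picked up by an idle pool thread while the original thread blocks until its subgraph has finished, instead of recursing on its own stack.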
This redesign would fix #6959, because reentrant calls wouldn't recursively process the work queue on the calling thread; that processing would instead happen on another pool thread. It also has good performance characteristics: without reentrant backward calls, you use exactly the same number of threads as before. Unlike the generator solution proposed in #18568 (comment), it is backwards compatible with the current autograd Function syntax.
From a scheduling perspective, because worker threads are greedy (they do as much work as possible before exiting), this design favors executing all available work before returning the lock to threads that are waiting to reacquire it and finish up reentrant backward calls. You could address this by adding a preemption mechanism, so that a thread that wants to return can preempt the current worker.
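A rough sketch of one possible preemption mechanism, again with made-up names (`PreemptibleWorker`, `request_preemption`, `finish_up`): the waiting thread raises a per-device flag, and the greedy worker checks it between tasks and yields the lock early instead of draining the whole queue first.

```python
import threading

class PreemptibleWorker:
    def __init__(self):
        self.lock = threading.Lock()
        self.preempt_requested = threading.Event()

    def drain(self, queue):
        # Greedy worker: keeps applying ready functions, but checks for a
        # preemption request between tasks instead of draining unconditionally.
        with self.lock:
            while queue and not self.preempt_requested.is_set():
                queue.popleft()()

    def request_preemption(self, finish_up):
        # Called by a thread that is waiting to complete a reentrant backward.
        self.preempt_requested.set()
        with self.lock:              # blocks only until the worker yields
            self.preempt_requested.clear()
            finish_up()              # run the waiting thread's remaining work
```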
cc @apaszke @shubhtuls @sublee @albanD