Heuristic-based autograd execution order #4746
Conversation
pinging @apaszke about the failing JIT test: It seems that the arg order of an Add is swapped. Should I just fix the test? Or are there more things I need to worry about? The error is:
cc: @thatguymike
The trace is equivalent, so feel free to fix the test.
FWIW, here are the
Really nice patch. I'm only concerned about the scheduling strategy (I think it's different from the one we discussed).
```diff
 struct ReadyQueue {
-  std::deque<FunctionTask> queue;
+  std::priority_queue<FunctionTask, std::vector<FunctionTask>, CompareFunctionTaskTime> heap;
```
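For context, here is a minimal sketch of what the comparator might look like, assuming each `FunctionTask` records the creation-time sequence number of its `Function` (the `sequence_nr` field and the simplified task type are assumptions for illustration, not the engine's actual internals). Comparing with `<` turns `std::priority_queue` into a max-heap, so the most recently created `Function` is popped, and therefore executed, first:

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Simplified stand-in for the engine's task type (illustrative only).
struct FunctionTask {
  uint64_t sequence_nr;  // assumed: stamped when the Function was created
};

struct CompareFunctionTaskTime {
  bool operator()(const FunctionTask& a, const FunctionTask& b) const {
    // std::priority_queue pops the "largest" element first, so < yields a
    // max-heap: later-created Functions are scheduled before earlier ones.
    return a.sequence_nr < b.sequence_nr;
  }
};

using ReadyHeap = std::priority_queue<FunctionTask, std::vector<FunctionTask>,
                                      CompareFunctionTaskTime>;
```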
Looks great!
hi, @ssnl
tl;dr: This PR implements a heuristic-based autograd execution order, where the tasks for each thread are ordered by the time the `Function` is created: `Function`s created later are executed earlier.

The current breadth-first (BFS) order can cause huge memory usage in certain models. This was discovered using a real model that uses a `Linear` module multiple times in forward. After `Linear` is decomposed into `Transpose` + `Addmm` (#1935), BFS schedules all `TBackward` tasks to execute last, so a huge amount of memory stays occupied by intermediate results.

With help from @ezyang, @apaszke, @zdevito and @colesbury, I have benchmarked three autograd execution orders:
- **BFS** (the current scheme): within a thread, tasks are fetched from a FIFO queue. Sample diff: none.
- **DFS**: within a thread, tasks are fetched from a LIFO stack. Sample diff: ssnl@ac5a97d
- **HEAP** (heuristic-based): within a thread, tasks are fetched from a max-heap, ordered by the time each autograd `Function` is created; see the sketch after this list. Sample diff: ssnl@44d91b1
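For concreteness, here is a minimal sketch of how the three fetch disciplines differ. This is illustrative only: `FunctionTask` is reduced to an assumed `sequence_nr` field, and none of these helpers are the actual engine code.

```cpp
#include <cstdint>
#include <deque>
#include <queue>
#include <vector>

struct FunctionTask { uint64_t sequence_nr; };  // assumed creation-time stamp

// BFS: FIFO -- run the task that was enqueued earliest.
FunctionTask pop_bfs(std::deque<FunctionTask>& q) {
  FunctionTask t = q.front();
  q.pop_front();
  return t;
}

// DFS: LIFO -- run the task that was enqueued most recently.
FunctionTask pop_dfs(std::deque<FunctionTask>& q) {
  FunctionTask t = q.back();
  q.pop_back();
  return t;
}

// HEAP: run the task whose Function was *created* most recently,
// regardless of when the task itself was enqueued.
struct CompareFunctionTaskTime {
  bool operator()(const FunctionTask& a, const FunctionTask& b) const {
    return a.sequence_nr < b.sequence_nr;  // max-heap on creation time
  }
};
using Heap = std::priority_queue<FunctionTask, std::vector<FunctionTask>,
                                 CompareFunctionTaskTime>;
FunctionTask pop_heap_task(Heap& h) {
  FunctionTask t = h.top();
  h.pop();
  return t;
}
```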
The benchmark code applies the above diffs on master at 2dd7039, with code from #4511 (methods for checking CUDA memory usage) manually added.
The benchmarked tasks include: ImageNet on ResNet50, Open-NMT, word language model, CycleGAN, the mini-model with which we discovered the issue (*), and a model specifically crafted to make DFS and HEAP perform worse (**).
The benchmark details and results can be found here. Roughly speaking, the performance of the three approaches is similar, except on (*) and (**).
Based on the results and the discussions that followed, we think it is reasonable to switch to HEAP ordering, because:
1. The benchmark results show no substantial slowdown on common models.
2. All three orders have edge cases that make them perform badly, but the bad cases for HEAP and DFS require a particularly unfortunate way of building a multi-device graph, which should be very uncommon, whereas the bad case for BFS occurs in real models, especially ones with many `Linear` layers (`T` + `Addmm`). HEAP also generally performs slightly better than DFS in the benchmark results.
3. This is probably a weaker point: for BFS and DFS, the actual backward order depends on the order of operator arguments, while for HEAP it depends on the creation order of ops (see the sketch below), which I think is easier to reason about. Although we probably shouldn't encourage users to tune the ordering, it will help us fix some OOM models without releasing new binaries.
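To illustrate what "creation order" means here (a sketch under assumed names, not the actual `Function` constructor): each `Function` could stamp itself with a monotonically increasing counter when it is built during the forward pass, and the backward heap then orders tasks by that stamp.

```cpp
#include <atomic>
#include <cstdint>

// Assumed illustration: a process-wide monotonic counter stamped onto each
// Function at construction time during forward. The scheduler's max-heap
// orders backward tasks by this stamp, so Functions created later in
// forward run earlier in backward.
static std::atomic<uint64_t> next_sequence_nr{0};

struct Function {
  const uint64_t sequence_nr;
  Function() : sequence_nr(next_sequence_nr++) {}
};
```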
Furthermore, after benchmarking on a tiny CPU model, we don't see obvious overhead from maintaining the heap.