Heuristic-based autograd execution order #4746
Conversation
pinging @apaszke about the failing JIT test: It seems that the arg order of an Add is swapped. Should I just fix the test? Or are there more things I need to worry about? The error is:
cc: @thatguymike
The trace is equivalent, so feel free to fix the test.
FWIW, here are the
Really nice patch. I'm only concerned about the scheduling strategy (I think it's different from the one we discussed).
```diff
 struct ReadyQueue {
-  std::deque<FunctionTask> queue;
+  std::priority_queue<FunctionTask, std::vector<FunctionTask>, CompareFunctionTaskTime> heap;
```
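For context, here is a minimal sketch of what the comparator might look like, assuming each `FunctionTask` records the creation-time sequence number of its `Function` (the `sequence_nr` field and the simplified task type are assumptions for illustration, not the engine's actual internals). Comparing with `<` turns `std::priority_queue` into a max-heap, so the most recently created `Function` is popped, and therefore executed, first:

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Simplified stand-in for the engine's task type (illustrative only).
struct FunctionTask {
  uint64_t sequence_nr;  // assumed: stamped when the Function was created
};

struct CompareFunctionTaskTime {
  bool operator()(const FunctionTask& a, const FunctionTask& b) const {
    // std::priority_queue pops the "largest" element first, so < yields a
    // max-heap: later-created Functions are scheduled before earlier ones.
    return a.sequence_nr < b.sequence_nr;
  }
};

using ReadyHeap = std::priority_queue<FunctionTask, std::vector<FunctionTask>,
                                      CompareFunctionTaskTime>;
```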
Looks great!
hi, @ssnl
tl;dr: This PR implements a heuristic-based autograd execution order, where the tasks for each thread are ordered by the time the `Function` is created: `Function`s created later are executed earlier.

The current breadth-first (BFS) order can cause huge memory usage in certain models. This was discovered using a real model that uses a `Linear` module multiple times in forward. After `Linear` is decomposed into `Transpose` + `Addmm` (#1935), BFS schedules all `TBackward` tasks to execute last, so a huge amount of memory stays occupied by intermediate results.

With help from @ezyang, @apaszke, @zdevito and @colesbury, I have benchmarked three autograd execution orders:
- **BFS** (the current scheme): within a thread, tasks are fetched from a FIFO queue. Sample diff: none.
- **DFS**: within a thread, tasks are fetched from a LIFO stack. Sample diff: ssnl@ac5a97d
- **HEAP** (heuristic-based): within a thread, tasks are fetched from a max-heap, ordered by the time each autograd `Function` is created; see the sketch after this list. Sample diff: ssnl@44d91b1
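For concreteness, here is a minimal sketch of how the three fetch disciplines differ. This is illustrative only: `FunctionTask` is reduced to an assumed `sequence_nr` field, and none of these helpers are the actual engine code.

```cpp
#include <cstdint>
#include <deque>
#include <queue>
#include <vector>

struct FunctionTask { uint64_t sequence_nr; };  // assumed creation-time stamp

// BFS: FIFO -- run the task that was enqueued earliest.
FunctionTask pop_bfs(std::deque<FunctionTask>& q) {
  FunctionTask t = q.front();
  q.pop_front();
  return t;
}

// DFS: LIFO -- run the task that was enqueued most recently.
FunctionTask pop_dfs(std::deque<FunctionTask>& q) {
  FunctionTask t = q.back();
  q.pop_back();
  return t;
}

// HEAP: run the task whose Function was *created* most recently,
// regardless of when the task itself was enqueued.
struct CompareFunctionTaskTime {
  bool operator()(const FunctionTask& a, const FunctionTask& b) const {
    return a.sequence_nr < b.sequence_nr;  // max-heap on creation time
  }
};
using Heap = std::priority_queue<FunctionTask, std::vector<FunctionTask>,
                                 CompareFunctionTaskTime>;
FunctionTask pop_heap_task(Heap& h) {
  FunctionTask t = h.top();
  h.pop();
  return t;
}
```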
The benchmark code applies the above diffs on master at 2dd7039, with code from #4511 (methods for checking CUDA memory usage) manually added.
The benchmarked tasks include: ImageNet on ResNet50, Open-NMT, word language model, CycleGAN, the mini-model with which we discovered the issue (*), and a model specifically crafted to make DFS and HEAP perform worse (**).
The benchmark details and results can be found here. Roughly speaking, the performance of the three approaches is similar, except on (*) and (**).
Based on the results and the discussions that followed, we think it is reasonable to switch to HEAP ordering, because:
1. The benchmark results show no substantial slowdown on common models.
2. All three orders have edge cases that make them perform badly, but the bad cases for HEAP and DFS require a particularly unfortunate way of building a multi-device graph, which should be very uncommon, whereas the bad case for BFS occurs in real models, especially ones with many `Linear` layers (`T` + `Addmm`). HEAP also generally performs slightly better than DFS in the benchmark results.
3. This is probably a weaker point: for BFS and DFS, the actual backward order depends on the order of operator arguments, while for HEAP it depends on the creation order of ops (see the sketch below), which I think is easier to reason about. Although we probably shouldn't encourage users to tune the ordering, it will help us fix some OOM models without releasing new binaries.
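To illustrate what "creation order" means here (a sketch under assumed names, not the actual `Function` constructor): each `Function` could stamp itself with a monotonically increasing counter when it is built during the forward pass, and the backward heap then orders tasks by that stamp.

```cpp
#include <atomic>
#include <cstdint>

// Assumed illustration: a process-wide monotonic counter stamped onto each
// Function at construction time during forward. The scheduler's max-heap
// orders backward tasks by this stamp, so Functions created later in
// forward run earlier in backward.
static std::atomic<uint64_t> next_sequence_nr{0};

struct Function {
  const uint64_t sequence_nr;
  Function() : sequence_nr(next_sequence_nr++) {}
};
```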
Furthermore, after benchmarking on a tiny CPU model, we don't see obvious overhead from maintaining the heap.