
Heuristic-based autograd execution order #4746


Merged: 2 commits into pytorch:master from ssnl:heap_ on Jan 24, 2018

Conversation

@ssnl (Collaborator) commented on Jan 19, 2018

tl;dr: This PR implements a heuristic-based autograd execution order, where each thread's tasks are ordered by the time their Function was created; Functions created later are executed earlier.

The current breadth-first (BFS) order can cause huge memory usage in certain models. This was discovered with a real model that uses a Linear module multiple times in forward. After Linear is decomposed into Transpose+Addmm (#1935), the BFS order schedules all TBackward tasks last, so a huge amount of memory stays occupied by intermediate results.

With help from @ezyang , @apaszke , @zdevito and @colesbury , I benchmarked three autograd execution orders:

  1. BFS (the current scheme):
    Within a thread, tasks are fetched from a FIFO queue.
    Sample diff: None.

  2. DFS:
    Within a thread, tasks are fetched from a LIFO stack.
    Sample diff: ssnl@ac5a97d

  3. HEAP (heuristic-based):
    Within a thread, tasks are fetched from a max-heap, ordered by the time each autograd Function was created (see the sketch after this list).
    Sample diff: ssnl@44d91b1
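
For illustration only (this is not code from the PR): a minimal, self-contained C++ sketch contrasting the three fetch policies. The `Task` struct, its `seq_nr` field, and `CompareByCreationTime` are toy stand-ins for the engine's FunctionTask, its creation counter, and CompareFunctionTaskTime.

```cpp
#include <cstdint>
#include <deque>
#include <iostream>
#include <queue>
#include <vector>

struct Task {
  uint64_t seq_nr;  // assigned when the Function is created; larger == created later
};

// Comparator for a max-heap on seq_nr: later-created Functions are popped first.
struct CompareByCreationTime {
  bool operator()(const Task& a, const Task& b) const {
    return a.seq_nr < b.seq_nr;
  }
};

int main() {
  // Tasks become ready in this (arbitrary) order during backward.
  const std::vector<Task> ready = {{0}, {2}, {1}, {3}};

  // 1. BFS (current scheme): FIFO queue -> pops in arrival order: 0 2 1 3
  std::deque<Task> fifo(ready.begin(), ready.end());
  std::cout << "BFS : ";
  while (!fifo.empty()) { std::cout << fifo.front().seq_nr << ' '; fifo.pop_front(); }
  std::cout << '\n';

  // 2. DFS: LIFO stack -> pops in reverse arrival order: 3 1 2 0
  std::vector<Task> lifo(ready.begin(), ready.end());
  std::cout << "DFS : ";
  while (!lifo.empty()) { std::cout << lifo.back().seq_nr << ' '; lifo.pop_back(); }
  std::cout << '\n';

  // 3. HEAP: max-heap on creation time -> later-created first, regardless of
  //    arrival order: 3 2 1 0
  std::priority_queue<Task, std::vector<Task>, CompareByCreationTime> heap;
  for (const Task& t : ready) heap.push(t);
  std::cout << "HEAP: ";
  while (!heap.empty()) { std::cout << heap.top().seq_nr << ' '; heap.pop(); }
  std::cout << '\n';
  return 0;
}
```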

The benchmark code applies the above diffs to master at 2dd7039, with the code from #4511 (methods for checking CUDA memory usage) manually added.

The benchmarked tasks include: ImageNet on ResNet50, Open-NMT, word language model, CycleGAN, the mini-model with which we discovered the issue (*), and a model specifically crafted to make DFS and HEAP perform worse (**).

The benchmark details and results can be found here. Roughly speaking, the performance of the three approaches is similar, except on (*) and (**).

Based on these results and the discussions that followed, we think it is reasonable to switch to the HEAP ordering, because:

  1. The benchmark results show no substantial slowdown on common models.

  2. All three orders have edge cases that make them behave badly, but the bad cases for HEAP and DFS require a particularly unfortunate way of building a multi-device graph, which should be very uncommon, whereas the bad case for BFS is hit by real models, especially ones with many Linear layers (Transpose+Addmm). HEAP also performs slightly better than DFS in the benchmark results.

  3. This is probably a weaker point. For BFS and DFS, the actual backward order depends on the order of operator arguments; for HEAP, it instead depends on the creation order of the ops, which I think is easier to reason about. Although we probably shouldn't encourage users to tune the ordering, it will help us fix some OOM models without releasing new binaries.

Furthermore, after benchmarking on a tiny CPU model, we see no obvious overhead from maintaining the heap.
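
Not the benchmark used for the numbers above: a minimal micro-benchmark sketch for getting a rough feel for the per-task cost of maintaining a heap versus a FIFO deque. The `Task` type and the task count are illustrative assumptions.

```cpp
#include <chrono>
#include <cstdint>
#include <deque>
#include <iostream>
#include <queue>
#include <vector>

struct Task { uint64_t seq_nr; };
struct CompareByCreationTime {
  bool operator()(const Task& a, const Task& b) const { return a.seq_nr < b.seq_nr; }
};

// Times one push-everything-then-pop-everything pass.
template <typename Body>
double run_ms(Body&& body) {
  auto t0 = std::chrono::steady_clock::now();
  body();
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
  const uint64_t n = 1000000;  // illustrative task count

  double fifo_ms = run_ms([&] {
    std::deque<Task> q;
    for (uint64_t i = 0; i < n; ++i) q.push_back({i});  // amortized O(1) per push
    while (!q.empty()) q.pop_front();                    // O(1) per pop
  });

  double heap_ms = run_ms([&] {
    std::priority_queue<Task, std::vector<Task>, CompareByCreationTime> q;
    for (uint64_t i = 0; i < n; ++i) q.push({i});        // O(log n) per push
    while (!q.empty()) q.pop();                           // O(log n) per pop
  });

  std::cout << "FIFO deque: " << fifo_ms << " ms, max-heap: " << heap_ms << " ms\n";
  return 0;
}
```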

@pytorchbot (Collaborator) commented:

@ssnl, thanks for your PR! We identified @zdevito to be a potential reviewer.

@ssnl (Collaborator, Author) commented on Jan 19, 2018

Pinging @apaszke about the failing JIT test:

It seems that the argument order of an Add is swapped. Should I just fix the test, or is there anything else I need to worry about?

The error is:

20:06:26 OK
20:06:26 Running JIT tests
20:06:27 s...s..s...ss.s..s.....sx...x...F.s...........s...s.s..........
20:06:27 ======================================================================
20:06:27 FAIL: test_input_pruning (__main__.TestJit)
20:06:27 Check that stage 1 will return only one value
20:06:27 ----------------------------------------------------------------------
20:06:27 Traceback (most recent call last):
20:06:27   File "test_jit.py", line 1080, in test_input_pruning
20:06:27     self.assertExpected(str(fn.graph_for(x, y)))
20:06:27   File "/var/lib/jenkins/workspace/test/common.py", line 376, in assertExpected
20:06:27     self.assertMultiLineEqual(expected, s)
20:06:27 AssertionError: 'grap[304 chars]le(5, 5) = add[alpha={1}](%6, %5)\n  return (%2, %3, %7);\n}\n' != 'grap[304 chars]le(5, 5) = add[alpha={1}](%5, %6)\n  return (%2, %3, %7);\n}\n'
20:06:27   graph(%0 : Double(5, 5)
20:06:27         %1 : Double(5, 5)
20:06:27         -------- stage 1 --------
20:06:27         %4 : Double(5, 5)
20:06:27         %5 : Double(5, 5)) {
20:06:27     %2 : Double(5, 5) = mul(%0, %1)
20:06:27     %3 : Double(5, 5) = add[alpha={1}](%0, %1)
20:06:27     ---------------- stage 1 ----------------
20:06:27     %6 : Double(5, 5) = mul(%4, %1)
20:06:27 -   %7 : Double(5, 5) = add[alpha={1}](%6, %5)
20:06:27 ?                                        ----
20:06:27 +   %7 : Double(5, 5) = add[alpha={1}](%5, %6)
20:06:27 ?                                       ++++
20:06:27     return (%2, %3, %7);
20:06:27   }
20:06:27

@soumith (Member) commented on Jan 19, 2018

cc: @thatguymike

@apaszke (Contributor) commented on Jan 19, 2018

The trace is equivalent, so feel free to --accept it

@ssnl (Collaborator, Author) commented on Jan 19, 2018

FWIW, here are the short-perf-test-* outputs from pytorch-linux-xenial-cuda8-cudnn6-py3 CI:

| Task | z-value |
| --- | --- |
| test_cpu_speed_mini_sequence_labeler | -0.2878277307471087 |
| test_cpu_speed_mnist | -1.2850299502022648 |
| test_gpu_speed_mnist | -4.470857984650843 |
| test_gpu_speed_word_language_model | -0.721292663063553 |
| test_gpu_speed_cudnn_lstm | -0.454405894995399 |
| test_gpu_speed_lstm | 0.38482758620688173 |
| test_gpu_speed_mlstm | -1.5262515262515481 |

@apaszke (Contributor) left a review comment

Really nice patch. I'm only concerned about the scheduling strategy (I think it's different from the one we discussed).

struct ReadyQueue {
  std::deque<FunctionTask> queue;
  std::priority_queue<FunctionTask, std::vector<FunctionTask>, CompareFunctionTaskTime> heap;

@ezyang self-requested a review on January 24, 2018 02:56
@ezyang (Contributor) left a review comment

Looks great!

@soumith merged commit a14abc7 into pytorch:master on Jan 24, 2018
@ssnl deleted the heap_ branch on January 25, 2018 18:23
@knsong commented on Jan 16, 2020

Hi @ssnl, does the problem in https://discuss.pytorch.org/t/torch-autograd-function-overwrite/66656/5 have something to do with the autograd execution order?
