[WIP] Strawman/Tracker for fixing grad copy elision + post hooks (e.g. DDP) interaction #13337

mcarilli · 2018-10-30T21:25:10Z

@FDecaYed wrote some nice logic into accumulate_grad.cpp to elide unnecessary cloning or accumulation of incoming gradients. Under certain conditions, the incoming gradient can be stolen and used directly.

One of these conditions is that the incoming gradient's refcount must be 1. Nothing else can be referencing the data (which might happen, for instance, for c = b + a: the + operator's backward can simply pass c's incoming gradient tensor directly to both a and b's AccumulateGrad functions).

However, if a given param's AccumulateGrad function has post hooks (like, for example, the allreduce hooks employed by both apex and c10d DDP), the autograd engine stashes a copy of the incoming gradients, which increments the refcount, causing AccumulateGrad's hard check that the refcount is 1 to fail, and triggering an (in this case) unnecessary deep copy.

This PR updates the logic to account for the possible presence of post hooks on the AccumulateGrad function. I discussed it with @apaszke and we both agree it introduces a worrisome silent dependency between engine.cpp and accumulate_grad.cpp. I've commented the relevant pieces, but I'm open to suggestions as to how this check could be made more agnostic to the implementation of engine.cpp.

… interaction

albanD · 2018-11-02T20:41:12Z

Actually to be fair, the following code sample fails on master if the DDP-like hook is removed. So this maybe not be a blocker for this PR.
But the following code used to work and is not working anymore with this PR.

Code sample:

import torch

inp = torch.rand(1, 10, requires_grad=True)
dummy = torch.rand(1, 10) # Make sure we get contiguous gradients

# DPP like hook
def dpp_hook(*args):
    pass
tmp = inp.expand_as(inp)
tmp.grad_fn.next_functions[0][0].register_hook(dpp_hook)

# View your leaf tensor differently
b = inp.view(10)

# Hook on the result of this op
my_bad_ref = None
def other_hook(grad):
    global my_bad_ref
    my_bad_ref = grad
b.register_hook(other_hook)

# This first backward will make all refs the same from this PR
b.mul(dummy).sum().backward()
print(".grad data: ", inp.grad.data_ptr())
print("my_bad_ref data: ", my_bad_ref.data_ptr())

# Subsequent backward modify my awesome tensor
print(my_bad_ref)
inp.sum().backward()
print(my_bad_ref) # It changed here

With current master, the data_ptr() will not be the same and both prints at the end will print the same tensor.
With this change, all the data_ptr() are the same and the two prints at the end contain two different values.

mcarilli · 2018-11-12T18:15:55Z

b is not a leaf, so other_hook gets converted to a function pre hook on b's AccumulateGrad function, as you told me in pytorch slack. In current master (without my PR) if you remove ddp_hook on inp's AccumulateGrad function, the copy elision optimization takes effect as normal. The incoming gradient is simply handed down through other_hook, then presumably through view()'s backward, and eventually stashed (without any deep copies along the way) by inp's AccumulateGrad as inp.grad, and so you observe that the data_ptr()s are the same.

In your example, the only reason that restoring ddp_hook "fixes" your issue is because in current master ddp_hook disables copy elision, which causes a deep copy to take place, so pointers become different. My PR restores the copy elision, which re-exposes your issue. In this way, by restoring the copy-eliding behavior, my PR is not "breaking" anything: it's simply re-exposing the default behavior that was already present (which, due to your issue, may have been "broken" all along). This behavior should have been spotted and discussed when grad copy elision was originally merged.

If you guys are comfortable with the way the grad copy elision currently works in master, then my PR should be acceptable, since it simply restores the existing behavior for cases where a ddp-style hook is registered. If not, then we can take a harder look at grad copy elision, keeping in mind the big picture that ddp-style hooks in particular are NOT the fundamental issue here.

Summary: Rebased version of #13337. I don't think the lint errors in the original PR had to do with files I touched, so hopefully the rebase fixes them. Pull Request resolved: #14587 Differential Revision: D13277428 Pulled By: soumith fbshipit-source-id: f04c186b1dd4889b4250597eef87f9e9bf7b2426

Strawman/tracker for fixing grad copy elision + post hooks (e.g. DDP)…

de7d182

… interaction

mcarilli closed this Nov 30, 2018

mcarilli mentioned this pull request Nov 30, 2018

Accumulate grad fix #14587

Closed

ezyang added the open source label Jun 24, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[WIP] Strawman/Tracker for fixing grad copy elision + post hooks (e.g. DDP) interaction #13337

[WIP] Strawman/Tracker for fixing grad copy elision + post hooks (e.g. DDP) interaction #13337

Uh oh!

mcarilli commented Oct 30, 2018

Uh oh!

albanD commented Nov 2, 2018 •

edited

Loading

Uh oh!

mcarilli commented Nov 12, 2018 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[WIP] Strawman/Tracker for fixing grad copy elision + post hooks (e.g. DDP) interaction #13337

[WIP] Strawman/Tracker for fixing grad copy elision + post hooks (e.g. DDP) interaction #13337

Uh oh!

Conversation

mcarilli commented Oct 30, 2018

Uh oh!

albanD commented Nov 2, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mcarilli commented Nov 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

albanD commented Nov 2, 2018 •

edited

Loading

mcarilli commented Nov 12, 2018 •

edited

Loading