[CAG] Support for call_module at copy paste aot bwd graph #153827
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153827
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 274055f with merge base c45515c:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
Could you provide more details on how you're running into the issue? Is compiling via torch.compile(backend="hpu_backend") + the hpu device installing submodules on the AOT backward graph (via custom joint passes)? I don't have an HPU handy, but if you are able to, could you provide the TORCH_LOGS="aot_graphs" output obtained from running the script with the hpu backend?
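For reference, a minimal sketch of how that artifact logging can be enabled in-process (assuming a stock PyTorch 2.x install; nothing here is HPU-specific, and the tiny function is just a stand-in for the real repro):

```python
# Minimal sketch: enable aot_graphs artifact logging from Python.
# Setting TORCH_LOGS="aot_graphs" in the environment before launching
# the script has the same effect.
import torch

torch._logging.set_logs(aot_graphs=True)

@torch.compile(backend="inductor")  # swap in backend="hpu_backend" on an HPU machine
def fn(x):
    return (x * 1).sum()

fn(torch.ones(4, requires_grad=True)).backward()
# The AOT joint/forward/backward graphs are emitted to the log during compilation.
```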
Sure, the full logs with TORCH_LOGS="aot_graphs" are at the end (run with a 1-dim tensor, since with a scalar I'm getting an error - I haven't found the root cause of where that happens yet, but there's a brief description below the logs in case it matters). As one example, here is what's being copied within `copy_paste_aot_backward_graph`:
Regarding the problem with a scalar, there's such a graph being processed:
Where the output of
I think what's happening is that the HPU backend mutates the graph produced by AOTDispatch during lowering. And because there's no guarantee that the graph is still runnable after lowering, it errors out when traced (by Dynamo in this case). For regular backends, we send a copy of the graph for lowering and preserve the AOTDispatch traced graph: https://github.com/pytorch/pytorch/blob/72a3c8dfa8580f03c992ca06be89b92a6c163b0b/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py#L1218-L122. It is possible that the HPU backend implementation does not call their backend through this code path. The code change may get us past the initial error, but I don't think Compiled Autograd will work unless the HPU backend either preserves the AOTDispatch graph even after lowering, or guarantees that the graph is still executable.
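To illustrate the pattern described above (a sketch with made-up names, not the actual wrapper code in jit_compile_runtime_wrappers.py): the backend is handed a copy, so its in-place mutations cannot affect the graph that Compiled Autograd later copy-pastes.

```python
import copy
import torch.fx

def lower_preserving_traced_graph(gm: torch.fx.GraphModule, example_inputs, backend_compile):
    # The backend is free to mutate its copy in place (fusions, new submodules, ...);
    # `gm` itself stays the runnable AOTDispatch-traced graph.
    compiled_fn = backend_compile(copy.deepcopy(gm), example_inputs)
    return compiled_fn, gm
```

A backend integrated outside this code path would bypass the copy, which matches the behaviour described above.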
HPU and inductor backends don't go through that code path. With the proposed implementation of the copy for `call_module`
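For context, a hypothetical sketch (not the PR's actual diff, and with made-up helper names) of what copying a graph that contains `call_module` nodes involves: besides copying the nodes, the referenced submodules have to be installed on the destination module, otherwise the copied graph cannot run.

```python
import torch
import torch.fx

def copy_graph_with_call_modules(src: torch.fx.GraphModule) -> torch.fx.GraphModule:
    # Copy every node of the source graph into a fresh graph.
    new_graph = torch.fx.Graph()
    val_map = {}
    out = new_graph.graph_copy(src.graph, val_map)
    new_graph.output(out)

    # call_module nodes refer to submodules by qualified name; collect those
    # submodules so the new GraphModule can resolve the same targets.
    # (A real implementation would also carry over get_attr targets, i.e.
    # parameters/buffers; omitted here for brevity.)
    targets = {
        node.target: src.get_submodule(node.target)
        for node in new_graph.nodes
        if node.op == "call_module"
    }
    return torch.fx.GraphModule(targets, new_graph)
```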
That is only used by torch.export, I'd expect
I believe the issue is that the AOT backward modified by the HPU backend treats its input as a list, and does not expect it to be called directly by Compiled Autograd:
```
# Before HPU backend processing, expects graph input args[0] to be tensor
graph():
    %tangents_1 : [num_users=1] = placeholder[target=tangents_1]
    ...

# After HPU backend processing, expects graph input args[0] to be list
graph():
    %input_list : list [num_users=2] = placeholder[target=input_list]
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%input_list, 0), kwargs = {})
    ...
```
But the inputs to the AOT backward are created by running `call_aot_bwd_prologue`:
```
graph():
    ...
    # create the AOT backward's inputs
    %call_aot_bwd_prologue : [num_users=1] = call_function[target=torch._dynamo.compiled_autograd.call_aot_bwd_prologue](args = ((), [], %getitem_3), kwargs = {})
    %getitem_5 : [num_users=0] = call_function[target=operator.getitem](args = (%call_aot_bwd_prologue, 0), kwargs = {})
    # call the copy pasted AOT backward code
    # this line corresponds to: %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%input_list, 0), kwargs = {})
    # but getitem_5 is NOT a list
    %getitem_6 : [num_users=0] = call_function[target=operator.getitem](args = (%getitem_5, 0), kwargs = {})
```
To dig deeper, we would need to figure out how the HPU backend passes a list (and not just a single tensor) to their AOT backward. AOTAutograd and Compiled Autograd both need the prologue to be able to format the inputs properly.
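To make the mismatch concrete, here is a small standalone illustration (plain Python, not the actual generated graphs) of why the copied getitem only blows up in the 0-dim case:

```python
import torch

# Stand-in for the HPU-modified AOT backward: its single input is expected to
# be a list, and the first operation indexes into it (operator.getitem(inp, 0)).
def bwd_expecting_list(inp):
    return inp[0] * 1

tangent = torch.ones(())                  # 0-dim tensor, like the squeezed t1's gradient

print(bwd_expecting_list([tangent]))      # list input: getitem picks the tensor out, works

try:
    bwd_expecting_list(tangent)           # tensor passed directly, as the prologue does
except IndexError as e:
    print(e)                              # "invalid index of a 0-dim tensor ..."

# With a 1-dim tensor the same indexing silently succeeds (it selects element 0),
# which matches the report that only the scalar case errors out.
```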
Thanks for your comments - indeed the problem is that the HPU backend creates a list for the inputs; with that optimization disabled, scalars also work as expected, and the same holds for the copied graph used for compilation. I think this is how we'll handle the problem for now, since brief testing suggests it doesn't have much performance impact. But for the future, do you think this problem could somehow be handled on the PyTorch side?
@pytorchbot merge |
Pull workflow has not been scheduled for the PR yet. It could be because the author doesn't have permissions to run those workflows, or because skip-checks keywords were added to the PR/commits; aborting merge. Please get/give approval for the workflows and/or remove skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.
I think this is the better option for Compiled Autograd. Compiled Autograd will invoke the backend again on the backward graph traced at runtime; if backends can mutate the AOT backward that we copy-paste, then backends will need to handle already-lowered graphs.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…3827)

Support for `call_module` in `copy_paste_aot_backward_graph`, which was added recently with PT2.7.

The problem is observed with the HPU backend in the example repro below, due to the backend creating fused modules.

```
import torch

device = 'cpu'  # 'hpu'
backend = 'inductor'  # 'hpu_backend'

def fn(t1):
    t1 = t1 * 1
    t1_grad = torch.ones_like(t1, device=device)
    t1.backward(t1_grad, retain_graph=True)
    return t1

t1 = torch.ones(1, requires_grad=True, device=device)  # .squeeze()

compiled_fn = torch.compile(fn, backend=backend)
result = compiled_fn(t1)

with torch._dynamo.compiled_autograd._enable(torch.compile(backend=backend)):
    result_grad = torch.ones_like(result, device=device)
    result.backward(result_grad)

print(f'{result_grad=}')
print(f'{t1.grad=}')
```

With this change I'm getting the same results as on CPU; however, I'm facing the problem below when running with a scalar (the t1 tensor after squeeze):

`torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function getitem>(*(FakeTensor(..., device='hpu:0', size=()), 0), **{}): got IndexError('invalid index of a 0-dim tensor. Use tensor.item() in Python or tensor.item<T>() in C++ to convert a 0-dim tensor to a number')`

While on CPU there's the following warning and None is returned:

`repro.py:23: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at pytorch/build/aten/src/ATen/core/TensorBody.h:489.) print(f'{t1.grad=}') t1.grad=None`

Pull Request resolved: pytorch#153827
Approved by: https://github.com/xmfan
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @xmfan @jeromean @bsochack @sujoysaraswati