Debug issue with AOTAutograd for speech_transformer/hf_GPT2/hf_T5 #85
Comments
Repro for the bug
anijain2305 moved this from To do to In progress in PyTorch 1.13 Compiler Experimental SOTA (May Deadline) on Apr 15, 2022
@Chillee has been looking at this and has isolated the problem with an even smaller repro than the one above. Assigning this to him.
Resolved in pytorch/pytorch#75933
anijain2305 moved this from In progress to Done in PyTorch 1.13 Compiler Experimental SOTA (May Deadline) on May 9, 2022
The three models (speech_transformer, hf_GPT2, and hf_T5) fail with a similar error signature.
TorchDynamo finds static subgraphs and sends them to AOT Autograd. AOT Autograd generates the forward and backward graphs; its output is an autograd.Function (code). In the forward pass, AOT Autograd saves some tensors that the backward pass needs for gradient computation.
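The save/unpack contract that AOT Autograd relies on can be illustrated with a torch-free sketch. `FakeTensor`, `Context`, `save_for_backward`, and `saved_tensors` below are illustrative stand-ins modeled on the PyTorch names; everything here is plain Python:

```python
class FakeTensor:
    """Stand-in for torch.Tensor, just enough to show the contract."""
    def __init__(self, shape):
        self.shape = shape


class Context:
    """Mimics the ctx object passed to an autograd.Function."""
    def __init__(self):
        self._saved = ()

    def save_for_backward(self, *tensors):
        # Forward pass: stash the tensors needed for gradient computation.
        self._saved = tensors

    @property
    def saved_tensors(self):
        # Backward pass: every saved item must come back as a tensor.
        for item in self._saved:
            if not isinstance(item, FakeTensor):
                raise TypeError(f"expected a tensor, got {type(item).__name__}")
        return self._saved


ctx = Context()
ctx.save_for_backward(FakeTensor((2, 3)), FakeTensor((3,)))
a, b = ctx.saved_tensors
print(a.shape, b.shape)  # (2, 3) (3,)
```

The bug described in this issue is a violation of exactly this contract: an item retrieved from `saved_tensors` is no longer a tensor.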
The issue arises in the backward pass. When we read ctx.saved_tensors, one of the items in saved_tensors is no longer of Tensor type. This causes cryptic error messages like the one below, and the offending type changes from run to run: I have seen immutable_dict, tuple, and even weakref and builtin.

I looked further into the C++ side and started printing the types of the objects while saving the tensors at the end of the forward pass and while reading them back in the backward pass. I observed the weird behavior in this line: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/python_function.cpp#L834. It is reached in the backward pass, when we call ctx.saved_tensors.
When I print the unpacked_var, it is a tensor: it has a dim, I can print its shape, and so on. But Py_TYPE(value)->tp_name equals immutable_dict here. The unpack_fn is basically THPVariable_Wrap: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/python_function.cpp#L849.
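The unpack path can be sketched in plain Python. `CppVariable`, `PyTensor`, `wrap`, and `SavedVariable` below are illustrative names loosely modeled on the C++ code (with `wrap` playing the role of THPVariable_Wrap); the point is that the underlying data can be perfectly intact while the Python-level type of the returned object is what gets reported wrong:

```python
class CppVariable:
    """Stand-in for the C++-side autograd Variable."""
    def __init__(self, shape):
        self.shape = shape


class PyTensor:
    """Stand-in for the Python tensor wrapper that THPVariable_Wrap builds."""
    def __init__(self, var):
        self._var = var

    @property
    def shape(self):
        return self._var.shape


def wrap(var):
    # Analogue of THPVariable_Wrap: build the Python wrapper for a Variable.
    return PyTensor(var)


class SavedVariable:
    """Holds a Variable across the forward/backward boundary."""
    def __init__(self, var):
        self._var = var

    def unpack(self, unpack_fn=wrap):
        # In the bug report, the unpacked value behaves like a tensor
        # (dim and shape work), yet Py_TYPE(value)->tp_name is wrong.
        return unpack_fn(self._var)


saved = SavedVariable(CppVariable((4, 4)))
value = saved.unpack()
print(type(value).__name__, value.shape)  # PyTensor (4, 4)
```

In the healthy case sketched here, the type name of the unpacked value matches the wrapper class; the reported failure is the case where it does not.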
For completeness, images of the failure are attached.
Repro -
python torchbench.py --training --devices=cuda --accuracy-aot-nop --only=hf_GPT2
python torchbench.py --training --devices=cuda --accuracy-aot-nop --only=speech_transformer
python torchbench.py --training --devices=cuda --accuracy-aot-nop --only=hf_T5