[PT2][Optimus][Reliability] Fix a bug in gradients computation for runtime numeric check #118105
Summary:

We observed the following error when launching the e2e AFOC model test:

```
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
```

f524190245

Test Plan: `training_platform:c640e3f93574472da8894d9a0365f6a0` f524376722 P1086047304

Reviewed By: jackiexu1992

Differential Revision: D53011463
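For context, here is a minimal sketch of the failure mode and of the `retain_graph=True` workaround that the error message itself suggests. This is illustrative PyTorch code only, not the actual Optimus numeric-check pass; the variable names and the toy graph are made up for the example:

```python
import torch

# A numeric check that computes gradients a second time over a graph that
# has already been backwarded hits the RuntimeError above, because autograd
# frees the saved intermediate tensors after the first gradient pass.
x = torch.randn(4, requires_grad=True)
y = (x * x).sum()

# retain_graph=True keeps the saved intermediates alive so the graph can
# be differentiated again.
grad_first = torch.autograd.grad(y, x, retain_graph=True)[0]

# Without retain_graph=True on the first call, this second call would raise:
# "RuntimeError: Trying to backward through the graph a second time ..."
grad_second = torch.autograd.grad(y, x)[0]

assert torch.allclose(grad_first, grad_second)
```

When the check only needs tensor values rather than a differentiable path, another option is to run it on detached copies so no second backward through the training graph is required at all.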
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler