Initialize optimizer in dynamo to avoid graph break and tracing slowness #102640
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102640
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 7fbd56d.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Overall this approach seems reasonable to me. Made one small comment about correctness.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: New commits were pushed while merging. Please rerun the merge command. Details for Dev Infra team: raised by workflow job.

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 1 mandatory check(s) failed. The first few are: … Dig deeper by viewing the failures on hud.

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The optim benchmarks also started running into bugs after this change; see https://github.com/pytorch/benchmark/actions/runs/5167132765/jobs/9307817625. Can we revert or back out this change, add some tests to verify the bugs no longer exist, and then reland?
@pytorchbot revert -c nosigmal "latency increase and optim bugs"

❌ 🤖 pytorchbot command failed:
Try …

@pytorchbot revert -c signal "latency increase and optim bugs"

❌ 🤖 pytorchbot command failed:
Try …

see also pytorch/test-infra#4282

@pytorchbot revert -c nosignal -m “introduced dynamo optim flakiness and other latency issues”

❌ 🤖 pytorchbot command failed:
Try …

@pytorchbot revert -c nosignal "latency increase and optim bugs"

❌ 🤖 pytorchbot command failed:
Try …

@pytorchbot revert -c nosignal -m "latency increase and optim bugs"

@pytorchbot successfully started a revert job. Check the current status here.
Reverting PR 102640 failed. Reason: Command …
Details for Dev Infra team: raised by workflow job.

Sigh, I see backing out is not trivial because of #103121.
On calls to `_init_group`, rather than tracing through it, we extract the Python values from the arguments and call the initialization directly. This avoids tracing the function, which is very slow with large numbers of parameters, and also avoids graph-breaking on it. This is sound because, in the eager case, the state is only initialized once. Guards on the state and params are generated explicitly rather than via tracing the initialization; the sketch below illustrates the idea.
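To make this concrete, here is a minimal sketch of what "call the initialization eagerly instead of tracing it" amounts to, using SGD. The use of the private `_init_group` and its exact signature are illustrative only (each optimizer gathers different state, and the API may change); the real change hooks this call inside Dynamo's tracer rather than in user code.

```python
import torch

model = torch.nn.Linear(8, 8)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
model(torch.randn(2, 8)).sum().backward()

# Conceptually, when tracing reaches _init_group, recover the concrete
# optimizer/group objects from the traced arguments and run the
# initialization eagerly, outside the graph:
for group in opt.param_groups:
    params, grads, momentum_buffers = [], [], []
    # Private, SGD-specific signature; it mutates the three lists in place.
    opt._init_group(group, params, grads, momentum_buffers)

# Because eager optimizers only initialize state on the first step, running
# this once outside the trace matches eager semantics; guards on the state
# and params are then installed explicitly instead of being traced.
```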
Caveat: `_init_group` also gathers various state tensors into lists (by mutating list arguments) to pass to the functional optimizer implementation. These state tensors live on the optimizer itself, but we don't know exactly how the gathering is done or which tensors correspond to which attributes of the optimizer module (each optimizer has different state). To rectify this, we keep weak pointers to all of the tensors collected into the lists in globals (similar to how parameter keys are stored for dictionaries). These pointers are guaranteed to stay alive as long as the optimizer object is alive, provided its internal state is not interfered with, and they are guarded with weakref guards; a sketch of this follows below.

cc @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @ipiszy
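As promised above, a minimal sketch of the weakref-guard idea using plain `weakref`; `make_weakref_guard` is a hypothetical helper for illustration, not Dynamo's actual guard machinery.

```python
import gc
import weakref

import torch

def make_weakref_guard(tensors):
    # Keep weak references to the tensors _init_group gathered into its
    # lists; this does not keep the state alive, it only observes it.
    refs = [weakref.ref(t) for t in tensors]

    def guard_ok():
        # Passes only while every originally-gathered state tensor is still
        # alive.  If the optimizer re-creates its state, the old tensors are
        # collected, the weakrefs go dead, and the guard fails, forcing a
        # recompile instead of silently reusing stale graph inputs.
        return all(ref() is not None for ref in refs)

    return guard_ok

state_tensors = [torch.zeros(4), torch.zeros(4)]  # stand-ins for optimizer state
guard_ok = make_weakref_guard(state_tensors)
assert guard_ok()      # state intact: the cached graph may be reused

del state_tensors      # simulate the state being discarded or reinitialized
gc.collect()
assert not guard_ok()  # guard fails: recompile
```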