
Initialize optimizer in dynamo to avoid graph break and tracing slowness #102640

Closed
wants to merge 13 commits

Conversation

@mlazos (Contributor) commented May 31, 2023

On calls to `_init_group`, rather than tracing through it, extract the Python values from the arguments and call the initialization directly. This avoids tracing the function, which is very slow when there are many parameters, and also avoids graph breaking on it. This is sound here because, in the eager case, the state is only initialized once. Guards on the state and params are generated explicitly rather than by tracing the initialization.
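Roughly, the idea looks like the following minimal sketch. This is illustrative only: `OptimizerVariable`, `call_method`, and `install_state_guards` are hypothetical stand-ins, not the actual torch._dynamo internals.

```python
# Hedged sketch: run _init_group eagerly instead of tracing it.
# OptimizerVariable, call_method, and install_state_guards are
# illustrative names, not the real torch._dynamo API.
import torch

class OptimizerVariable:
    def __init__(self, value: torch.optim.Optimizer):
        self.value = value  # the real (eager) optimizer object

    def call_method(self, name, args, kwargs):
        if name == "_init_group":
            # Pull the concrete Python objects out of the traced
            # arguments and run the initialization once, outside the graph.
            real_args = [a.as_python_constant() for a in args]
            self.value._init_group(*real_args)
            # Guard explicitly on the now-initialized state and params
            # instead of deriving guards by tracing the init code.
            self.install_state_guards()
            return None  # nothing left to trace
        raise NotImplementedError(name)

    def install_state_guards(self):
        # Placeholder: real code would register guards on
        # self.value.state and self.value.param_groups.
        pass
```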

Caveats:
`_init_group` also gathers various state tensors into lists (by mutating its list arguments) to pass to the functional optimizer implementation. These state tensors live on the optimizer itself, but we don't know exactly how the gathering is done or which tensors correspond to which attributes of the optimizer module (each optimizer has different state). To handle this, we keep weakrefs to all of the tensors collected into the lists in globals (similar to how parameter keys are stored for dictionaries). These references are guaranteed to stay alive as long as the optimizer object is alive, provided its internal state is not interfered with, and they are guarded with weakref guards.
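A rough illustration of that bookkeeping follows; the names (`GLOBAL_STATE_PTRS`, `record_state_tensors`, `weakref_guard_holds`) are hypothetical, not the real guard machinery:

```python
# Hedged sketch: keep weakrefs to the state tensors _init_group gathered
# so they can be guarded; names here are illustrative only.
import weakref

GLOBAL_STATE_PTRS = {}  # stand-in for Dynamo's guarded globals

def record_state_tensors(optimizer_id, gathered_lists):
    # Store a weakref to every tensor collected into the mutated list
    # arguments; each stays alive as long as the optimizer holds it.
    refs = [weakref.ref(t) for lst in gathered_lists for t in lst]
    GLOBAL_STATE_PTRS[optimizer_id] = refs
    return refs

def weakref_guard_holds(optimizer_id):
    # A weakref guard passes only if no referent has been collected,
    # i.e. the optimizer's internal state was not interfered with.
    return all(r() is not None for r in GLOBAL_STATE_PTRS[optimizer_id])
```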

cc @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @ipiszy

@pytorch-bot (bot) commented May 31, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102640

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7fbd56d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@mlazos marked this pull request as ready for review May 31, 2023 19:54
@mlazos requested a review from @jansel May 31, 2023 19:54
@jansel (Contributor) left a comment


Overall this approach seems reasonable to me. Made one small comment about correctness.

Review thread on torch/_dynamo/variables/optimizer.py (outdated, resolved)
@mlazos changed the title from "[WIP] Initialize optimizer in dynamo to avoid graph break and tracing slowness" to "Initialize optimizer in dynamo to avoid graph break and tracing slowness" Jun 2, 2023
@mlazos (Contributor, Author) commented Jun 2, 2023

@pytorchbot merge

@pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) Jun 2, 2023
@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator):
Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

Details for Dev Infra team: raised by workflow job.

@mlazos (Contributor, Author) commented Jun 2, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@mlazos (Contributor, Author) commented Jun 3, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator):
Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@mlazos (Contributor, Author) commented Jun 3, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@eellison (Contributor) commented Jun 6, 2023

There was a 10s HF compilation latency increase between 8215468 and 87cbfe9. I think this is the only plausible culprit.

@janeyx99 (Contributor) commented Jun 8, 2023

The optim benchmarks also started running into bugs after this change; see https://github.com/pytorch/benchmark/actions/runs/5167132765/jobs/9307817625

Can we revert or back out this change, add some tests to verify the bugs no longer exist, and then reland?

@ezyang (Contributor) commented Jun 9, 2023

@pytorchbot revert -c nosigmal "latency increase and optim bugs"

@pytorch-bot (bot) commented Jun 9, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: argument -c/--classification: invalid choice: 'nosigmal' (choose from 'nosignal', 'ignoredsignal', 'landrace', 'weird', 'ghfirst')

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@ezyang (Contributor) commented Jun 9, 2023

@pytorchbot revert -c signal "latency increase and optim bugs"

@pytorch-bot (bot) commented Jun 9, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: argument -c/--classification: invalid choice: 'signal' (choose from 'nosignal', 'ignoredsignal', 'landrace', 'weird', 'ghfirst')

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@ezyang (Contributor) commented Jun 9, 2023

see also pytorch/test-infra#4282

@janeyx99 (Contributor) commented Jun 9, 2023

@pytorchbot revert -c nosignal -m “introduced dynamo optim flakiness and other latency issues”

@pytorch-bot (bot) commented Jun 9, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: dynamo optim flakiness and other latency issues”

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci} ...

Try @pytorchbot --help for more info.

@janeyx99 (Contributor) commented Jun 9, 2023

@pytorchbot revert -c nosignal "latency increase and optim bugs"

@pytorch-bot (bot) commented Jun 9, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -m/--message

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@janeyx99 (Contributor) commented Jun 9, 2023

@pytorchbot revert -c nosignal -m "latency increase and optim bugs"

@pytorchmergebot (Collaborator):
@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator):
Reverting PR 102640 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit c46af25bb3d4cd95485682ea5574dd47dab5dd90 returned non-zero exit code 1

Auto-merging torch/_dynamo/eval_frame.py
Auto-merging torch/_dynamo/utils.py
CONFLICT (modify/delete): torch/_dynamo/variables/optimizer.py deleted in parent of c46af25bb3d (Initialize optimizer in dynamo to avoid graph break and tracing slowness (#102640)) and modified in HEAD.  Version HEAD of torch/_dynamo/variables/optimizer.py left in tree.
error: could not revert c46af25bb3d... Initialize optimizer in dynamo to avoid graph break and tracing slowness (#102640)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
Details for Dev Infra team: raised by workflow job.

@janeyx99 (Contributor) commented Jun 9, 2023

Sigh, I see backing out is not trivial because of #103121.
