Initialize optimizer in dynamo to avoid graph break and tracing slowness #102640
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102640
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 7fbd56d.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Overall this approach seems reasonable to me. Made one small comment about correctness.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: New commits were pushed while merging. Please rerun the merge command. Details for Dev Infra team: raised by workflow job.

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 1 mandatory check(s) failed. The first few are: … Dig deeper by viewing the failures on hud.

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The optim benchmarks also started running into bugs after this change; see https://github.com/pytorch/benchmark/actions/runs/5167132765/jobs/9307817625. Can we revert or back out this change, add some tests to verify the bugs no longer exist, and then reland?
@pytorchbot revert -c nosigmal "latency increase and optim bugs"

❌ 🤖 pytorchbot command failed:
Try …

@pytorchbot revert -c signal "latency increase and optim bugs"

❌ 🤖 pytorchbot command failed:
Try …

see also pytorch/test-infra#4282

@pytorchbot revert -c nosignal -m “introduced dynamo optim flakiness and other latency issues”

❌ 🤖 pytorchbot command failed:
Try …

@pytorchbot revert -c nosignal "latency increase and optim bugs"

❌ 🤖 pytorchbot command failed:
Try …

@pytorchbot revert -c nosignal -m "latency increase and optim bugs"

@pytorchbot successfully started a revert job. Check the current status here.
Reverting PR 102640 failed. Reason: Command …
Details for Dev Infra team: raised by workflow job.

Sigh, I see backing out is not trivial because of #103121.
On calls to `_init_group`, rather than tracing through it, we extract the Python values from the arguments and call the initialization directly. This avoids tracing the function, which is very slow with large numbers of parameters, and also avoids graph-breaking on it. This is sound because, in the eager case, the state is only initialized once. Guards on the state and params are generated explicitly rather than via tracing the initialization; the sketch below illustrates the idea.
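To make this concrete, here is a minimal sketch of what "call the initialization eagerly instead of tracing it" amounts to, using SGD. The use of the private `_init_group` and its exact signature are illustrative only (each optimizer gathers different state, and the API may change); the real change hooks this call inside Dynamo's tracer rather than in user code.

```python
import torch

model = torch.nn.Linear(8, 8)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
model(torch.randn(2, 8)).sum().backward()

# Conceptually, when tracing reaches _init_group, recover the concrete
# optimizer/group objects from the traced arguments and run the
# initialization eagerly, outside the graph:
for group in opt.param_groups:
    params, grads, momentum_buffers = [], [], []
    # Private, SGD-specific signature; it mutates the three lists in place.
    opt._init_group(group, params, grads, momentum_buffers)

# Because eager optimizers only initialize state on the first step, running
# this once outside the trace matches eager semantics; guards on the state
# and params are then installed explicitly instead of being traced.
```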
Caveat: `_init_group` also gathers various state tensors into lists (by mutating list arguments) to pass to the functional optimizer implementation. These state tensors live on the optimizer itself, but we don't know exactly how the gathering is done or which tensors correspond to which attributes of the optimizer module (each optimizer has different state). To rectify this, we keep weak pointers to all of the tensors collected into the lists in globals (similar to how parameter keys are stored for dictionaries). These pointers are guaranteed to stay alive as long as the optimizer object is alive, provided its internal state is not interfered with, and they are guarded with weakref guards; a sketch of this follows below.

cc @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @ipiszy
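As promised above, a minimal sketch of the weakref-guard idea using plain `weakref`; `make_weakref_guard` is a hypothetical helper for illustration, not Dynamo's actual guard machinery.

```python
import gc
import weakref

import torch

def make_weakref_guard(tensors):
    # Keep weak references to the tensors _init_group gathered into its
    # lists; this does not keep the state alive, it only observes it.
    refs = [weakref.ref(t) for t in tensors]

    def guard_ok():
        # Passes only while every originally-gathered state tensor is still
        # alive.  If the optimizer re-creates its state, the old tensors are
        # collected, the weakrefs go dead, and the guard fails, forcing a
        # recompile instead of silently reusing stale graph inputs.
        return all(ref() is not None for ref in refs)

    return guard_ok

state_tensors = [torch.zeros(4), torch.zeros(4)]  # stand-ins for optimizer state
guard_ok = make_weakref_guard(state_tensors)
assert guard_ok()      # state intact: the cached graph may be reused

del state_tensors      # simulate the state being discarded or reinitialized
gc.collect()
assert not guard_ok()  # guard fails: recompile
```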