[AOTI] Support module buffer mutation #123164
Conversation
Summary: Fixes #120424. Module buffers may be mutated during a forward pass, so AOTI needs to allow buffer mutation. This will also be a necessary step if we want to extend AOTI to training. [ghstack-poisoned]
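A minimal eager-mode sketch of the pattern this PR enables (this is not the AOTI implementation itself; `StepCounter` and its names are hypothetical): a module whose forward pass mutates a registered buffer in place, which AOTI previously rejected because constants were treated as read-only.

```python
import torch

class StepCounter(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # A buffer is module state that is not a Parameter; before this PR,
        # AOTI treated such constants as read-only.
        self.register_buffer("steps", torch.zeros(1))

    def forward(self, x):
        self.steps += 1  # in-place buffer mutation during the forward pass
        return x + self.steps

m = StepCounter()
out = m(torch.zeros(3))
print(m.steps.item())  # the buffer advanced as a side effect of forward
```

Exporting a module like this through AOTI requires the compiled artifact to treat the buffer as writable state rather than a frozen constant.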
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/123164
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 unrelated failures) As of commit 4faec2d with merge base bcb6e5a:
BROKEN TRUNK - The following job failed but was present on the merge base: 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following job failed, but likely due to flakiness present on trunk, and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Reland #122824 because there was a merge conflict.
Stamping
LGTM (provided that Mac CI is green)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 2 checks: pull / linux-docs / build-docs-python-false, pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 3, 5, linux.4xlarge.nvidia.gpu). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: Command
Details for Dev Infra team: Raised by workflow job.
@pytorchbot merge -f "Trunk test passed previously; fixed a minor merge conflict"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@desertfire This is the second recent AOTI-related PR that has broken ROCm CI due to no signal; the other one was #122882. The only ROCm CI jobs that run by default on AOTI PRs are the ones for Line 318 in 9288b27.
It seems like it would add about an hour more to the inductor workflow test jobs for both CUDA and ROCm. Would that be acceptable? Testing this out in #123340.
@desertfire Please provide a forward fix for this issue; this is a failing test on trunk:
#123164 removed the code below (so that constants are not read-only) to support module buffer mutation: https://github.com/pytorch/pytorch/blob/a9a9ce6d9cf25f4fb87e1d74c79781dc404f0c59/torch/_inductor/codecache.py#L1685-L1691 However, this may cause relocation overflow when the `.data` section is large. Below is part of the output from `ld --verbose` (`GNU ld (GNU Binutils for Ubuntu) 2.38`): `.data` sits between `.text` and `.bss`, and when `.data` is too large, the relocation of `.text` against `.bss` may overflow during linking. Rename the section to `.ldata` (perhaps that is why `.lrodata` rather than `.rodata` was used previously) so that it no longer sits between the `.text` and `.bss` sections:
```
.text
.rodata
.data
.bss
.lrodata
.ldata
```
We met this issue when fixing #114450 and running the below models on CPU:
- AlbertForMaskedLM
- AlbertForQuestionAnswering
- BlenderbotForCausalLM
- DebertaV2ForMaskedLM
- DebertaV2ForQuestionAnswering
- XGLMForCausalLM

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
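A back-of-the-envelope sketch of the overflow being described: on x86-64, a PC-relative reference from `.text` into `.bss` is typically encoded with a 32-bit signed relocation (e.g. `R_X86_64_PC32`), so its reach is about ±2 GiB. The addresses and sizes below are hypothetical, chosen only to illustrate the failure mode and why moving the large section out from between `.text` and `.bss` fixes it.

```python
# Largest displacement a 32-bit signed relocation can encode (~2 GiB).
REL32_MAX = 2**31 - 1

text_end  = 0x400000            # hypothetical end of .text
data_size = 3 * 2**30           # a 3 GiB .data section holding model weights
bss_start = text_end + data_size  # .bss laid out right after .data

# With the huge .data in between, the .text -> .bss displacement exceeds
# what the 32-bit relocation can encode ("relocation truncated to fit").
displacement = bss_start - text_end
print(displacement > REL32_MAX)

# After renaming the section to .ldata (placed after .bss in the default
# linker script), only small sections separate .text and .bss, so the
# displacement fits comfortably within the 32-bit range.
bss_start_fixed = text_end + 0x1000
print(bss_start_fixed - text_end <= REL32_MAX)
```

The same reasoning explains why read-only constants were placed in `.lrodata` rather than `.rodata`: the `l`-prefixed sections sit past `.bss`, out of the way of short-reach relocations.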
AOTI changes have been breaking on trunk for ROCm because we do not have AOTI testing in the inductor/pull/trunk workflows for ROCm. This PR adds `test_aot_inductor` to the inductor workflow to catch such issues. More context here: #123164 (comment)
Runtime increase for the inductor workflow:
CUDA: PR corresponding to the base commit used for this PR: [100 mins](https://github.com/pytorch/pytorch/actions/runs/8545475047/job/23415210028?pr=123290). This PR: [183 mins](https://github.com/pytorch/pytorch/actions/runs/8562003098/job/23465530389?pr=123340).
ROCm: PR corresponding to the base commit used for this PR: [105 mins](https://github.com/pytorch/pytorch/actions/runs/8545475047/job/23416422145?pr=123290). This PR: [148 mins](https://github.com/pytorch/pytorch/actions/runs/8562003098/job/23466516866?pr=123340).
Pull Request resolved: #123340 Approved by: https://github.com/atalman, https://github.com/desertfire
Summary: Fixes pytorch#120424. Module buffers may be mutated during a forward pass, so AOTI needs to allow buffer mutation. This will also be a necessary step if we want to extend AOTI to training. Pull Request resolved: pytorch#123164 Approved by: https://github.com/digantdesai, https://github.com/malfet, https://github.com/chenyang78, https://github.com/khabinov