
[AOTI] Support module buffer mutation #123164

Closed
wants to merge 2 commits

Conversation

Contributor

@desertfire desertfire commented Apr 2, 2024

Stack from ghstack (oldest at bottom):

Summary: Fixes #120424. Because in a forward pass module buffers may be mutated, we need to allow that in AOTI. In addition, this will be a necessary step if we want to extend AOTI to training.
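
For context, a minimal sketch of the pattern this PR enables is shown below; the module is illustrative, and the entry point used (`torch._export.aot_compile`) reflects the AOTI API around the time of this PR:

```python
# Illustrative example: a module whose forward() mutates a registered buffer,
# compiled ahead of time with AOT Inductor. Prior to this change, AOTI
# rejected graphs containing such buffer mutations.
import torch


class Counter(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("calls", torch.zeros(1))

    def forward(self, x):
        self.calls.add_(1)  # in-place mutation of a module buffer
        return x + self.calls


model = Counter().eval()
example_inputs = (torch.randn(4),)

# aot_compile returns the path to the compiled shared library.
so_path = torch._export.aot_compile(model, example_inputs)
print(so_path)
```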

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @chauhang

pytorch-bot bot commented Apr 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/123164

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 4faec2d with merge base bcb6e5a:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job failed, but the failure was likely due to flakiness on trunk, and the job has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

desertfire added a commit that referenced this pull request Apr 2, 2024
@desertfire
Contributor Author

Reland #122824 because there was a merge conflict.

Contributor

@digantdesai digantdesai left a comment


Stamping

Contributor

@malfet malfet left a comment


LGTM (provided that Mac CI is green)

@malfet malfet added the ciflow/trunk (Trigger trunk jobs on your pull request) label Apr 2, 2024
@desertfire
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@desertfire
Contributor Author

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: pull / linux-docs / build-docs-python-false, pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 3, 5, linux.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch rebase origin/main returned non-zero exit code 1

Rebasing (1/1)
Auto-merging torch/_inductor/compile_fx.py
CONFLICT (content): Merge conflict in torch/_inductor/compile_fx.py
Auto-merging torch/_inductor/lowering.py
Auto-merging torch/_inductor/utils.py
error: could not apply 37dc4037b85... [AOTI] Support module buffer mutation (#123164)
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 37dc4037b85... [AOTI] Support module buffer mutation (#123164)
Details for Dev Infra team (raised by workflow job)

desertfire added a commit that referenced this pull request Apr 2, 2024
@desertfire
Contributor Author

@pytorchbot merge -f "Trunk test passed previously; fixed a minor merge conflict"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@jithunnair-amd
Collaborator

jithunnair-amd commented Apr 3, 2024

@desertfire This is the second recent AOTI-related PR that has broken ROCm CI due to lack of CI signal; the other one was #122882. Since the only ROCm CI jobs that run by default on AOTI PRs are the ones for ciflow/trunk (pre-merge checks) and ciflow/inductor (auto-labeled by the bot), and neither of those currently runs test_aot_inductor.py, there is no CI signal on the PR indicating a ROCm breakage (we can't run the entire list of test suites on PRs due to CI capacity constraints). To remedy this, I propose adding inductor/test_aot_inductor to the inductor workflow here:

python test/run_test.py --include inductor/test_torchinductor inductor/test_torchinductor_opinfo --verbose
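
For concreteness, the proposed invocation (tried out in #123340) would look roughly like:

python test/run_test.py --include inductor/test_torchinductor inductor/test_torchinductor_opinfo inductor/test_aot_inductor --verbose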

It seems like it would add about an hour more to the inductor workflow test jobs for both CUDA and ROCm. Would that be acceptable?

Testing this out in #123340

cc @huydhn @malfet @atalman

@atalman
Contributor

atalman commented Apr 3, 2024

@desertfire Please provide a forward fix for this issue; this test is failing on trunk:

2024-04-03T07:37:22.3706672Z =========================== short test summary info ============================
2024-04-03T07:37:22.3707311Z FAILED [6.3134s] inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_buffer_mutation_3_abi_compatible_cuda - AssertionError: False is not true
2024-04-03T07:37:22.3707315Z 
2024-04-03T07:37:22.3707505Z To execute this test, run the following from the base repo dir:
2024-04-03T07:37:22.3707879Z     PYTORCH_TEST_WITH_ROCM=1 python test_aot_inductor.py -k test_buffer_mutation_3_abi_compatible_cuda
2024-04-03T07:37:22.3707883Z 

chunyuan-w added a commit that referenced this pull request Apr 9, 2024
#123164 removed the below code (so that constants are not readonly) to support module buffer mutation:
https://github.com/pytorch/pytorch/blob/a9a9ce6d9cf25f4fb87e1d74c79781dc404f0c59/torch/_inductor/codecache.py#L1685-L1691

However, it may cause relocation overflow when the `.data` section is large.

Below is part of the output from `ld --verbose` (`GNU ld (GNU Binutils for Ubuntu) 2.38`). `.data` sits between `.text` and `.bss`. When `.data` is too large, relocations from `.text` against `.bss` may overflow during linking. Rename the section to `.ldata` (previously `.lrodata` was used) so that it no longer sits between the `.text` and `.bss` sections:

```
.text
.data
.bss
.lrodata
.ldata
```
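
For illustration, a hedged sketch of the section handling being discussed; the helper below is hypothetical and only loosely modeled on the logic in `torch/_inductor/codecache.py`:

```python
# Simplified, illustrative sketch (not the actual codecache.py code): embed
# serialized constants into an object file, then rename the section they live
# in with objcopy. The pre-#123164 behavior marked the section read-only
# (.lrodata); keeping the constants writable for buffer mutation and naming
# the section .ldata keeps the large blob out of the span between .text and
# .bss, avoiding the relocation overflow described above.
import subprocess


def rename_consts_section(consts_o: str, writable: bool) -> None:
    if writable:
        # Writable constants: place them in .ldata, which lands after .bss.
        rename = ".data=.ldata"
    else:
        # Read-only constants: the pre-#123164 naming and flags.
        rename = ".data=.lrodata,alloc,load,readonly,data,contents"
    # objcopy rewrites the object file in place when no output file is given.
    subprocess.check_call(["objcopy", "--rename-section", rename, consts_o])
```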

pytorchmergebot pushed a commit that referenced this pull request Apr 9, 2024
AOTI changes have been breaking ROCm on trunk because the inductor/pull/trunk workflows do not run AOTI tests for ROCm. This PR adds `test_aot_inductor` to the inductor workflow to catch such issues.

More context here: #123164 (comment)

Runtime increase for inductor workflow:
CUDA:
PR corresponding to base commit used for this PR: [100 mins](https://github.com/pytorch/pytorch/actions/runs/8545475047/job/23415210028?pr=123290)
This PR: [183 mins](https://github.com/pytorch/pytorch/actions/runs/8562003098/job/23465530389?pr=123340)

ROCm:
PR corresponding to base commit used for this PR: [105 mins](https://github.com/pytorch/pytorch/actions/runs/8545475047/job/23416422145?pr=123290)
This PR: [148 mins](https://github.com/pytorch/pytorch/actions/runs/8562003098/job/23466516866?pr=123340)
Pull Request resolved: #123340
Approved by: https://github.com/atalman, https://github.com/desertfire
pytorchmergebot pushed a commit that referenced this pull request Apr 11, 2024
#123164 removed the below code (so that constants are not readonly) to support module buffer mutation:
https://github.com/pytorch/pytorch/blob/a9a9ce6d9cf25f4fb87e1d74c79781dc404f0c59/torch/_inductor/codecache.py#L1685-L1691

However, it may cause relocation overflow when the `.data` section is large.

Below is part of the output from `ld --verbose` (`GNU ld (GNU Binutils for Ubuntu) 2.38`). `.data` sits between `.text` and `.bss`. When `.data` is too large, relocations from `.text` against `.bss` may overflow during linking. Rename the section to `.ldata` (perhaps that is why `.lrodata`, rather than `.rodata`, was used previously) so that it no longer sits between the `.text` and `.bss` sections:

```
.text
.rodata
.data
.bss
.lrodata
.ldata
```

We hit this issue when fixing #114450 and running the models below on CPU:
- AlbertForMaskedLM
- AlbertForQuestionAnswering
- BlenderbotForCausalLM
- DebertaV2ForMaskedLM
- DebertaV2ForQuestionAnswering
- XGLMForCausalLM

Pull Request resolved: #123639
Approved by: https://github.com/jgong5, https://github.com/desertfire
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024

Pull Request resolved: pytorch#123164
Approved by: https://github.com/digantdesai, https://github.com/malfet, https://github.com/chenyang78, https://github.com/khabinov
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
@github-actions github-actions bot deleted the gh/desertfire/358/head branch May 5, 2024 01:52