[inductor] avoid creating LoopBody twice #162101

shunting314 · 2025-09-03T21:34:59Z

Stack from ghstack (oldest at bottom):

Previously, merge_loops had to construct LoopBody twice to make sure the new LoopBody could use the same symbol prefix as before. This PR changes it to create LoopBody only once by allowing the new LoopBody to reuse the same symbol prefix.

It looks like it's OK for a sympy replacement to contain duplicate symbols, i.e. symbols that appear both as keys and inside the replacement values:

>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
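The behavior shown in the session above (replacement values are not substituted a second time) can be illustrated with a toy model. This is a minimal sketch in plain Python, not sympy's implementation: expressions are nested tuples, and the helper names `xreplace` and `sequential_replace` are assumptions made up for this illustration. No algebraic simplification is performed.

```python
# Toy model (NOT sympy's implementation): an expression is a nested tuple
# such as ("add", "x", "y"); bare strings are symbols, ints are constants.

def xreplace(expr, rule):
    """Replace symbols in a single pass; replacement results are not rescanned."""
    if isinstance(expr, str):
        # Look the symbol up once. Whatever comes back is final, even if it
        # contains symbols that are also keys of `rule`.
        return rule.get(expr, expr)
    if isinstance(expr, tuple):
        op, *args = expr
        return (op, *(xreplace(arg, rule) for arg in args))
    return expr  # integer constant


def sequential_replace(expr, rule):
    """Apply the rule entries one after another, for contrast."""
    for old, new in rule.items():
        expr = xreplace(expr, {old: new})
    return expr


expr = ("add", "x", "y")                               # x + y
rule = {"x": ("mul", "y", "y"), "y": ("add", "x", 1)}  # {x: y*y, y: x + 1}

single_pass = xreplace(expr, rule)        # models x + y**2 + 1
chained = sequential_replace(expr, rule)  # the y introduced for x is replaced again
```

In the single-pass version the `y` introduced by replacing `x` survives untouched, which is why reusing old symbol names in the replacement values does not cause clashes in this model. The chained version rewrites that `y` a second time and produces a different expression.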

UPDATE: added the same optimization to LoopBody.reorder_iter_loops.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

[ghstack-poisoned]

pytorch-bot · 2025-09-03T21:35:03Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162101


✅ You can merge normally! (5 Unrelated Failures)

As of commit 0e96e63 with merge base a6f9e0e:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

inductor / unit-test / inductor-test / test (inductor_distributed, 1, 1, linux.g5.12xlarge.nvidia.gpu) (gh) (similar failure)
'test/distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_hf_bert_ddp_aot_eager'
inductor / unit-test / inductor-test / test (inductor, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
'test/distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_hf_bert_ddp_aot_eager'

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

inductor / inductor-test / test (inductor_huggingface, 1, 1, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
Process completed with exit code 134.
inductor / inductor-test / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
'Test'
pull / linux-jammy-py3.13-clang12 / test (default, 2, 5, lf.linux.4xlarge) (gh) (trunk failure)
test_quantization.py::TestQuantizeFxOps::test_general_shape_ops

This comment was automatically generated by Dr. CI and updates every 15 minutes.


pytorchmergebot · 2025-09-19T17:40:36Z

Starting merge as part of PR stack under #162355

I see torch.compile spend 2% of its time in sympy_str when compiling the backward graph for MobileBertForQuestionAnswering. Most calls to sympy_str happen while extracting read/write dependencies, but in that path the result of sympy_str is simply discarded (correct me if I'm wrong). To keep things simple, I just removed those calls. If people think they may be useful for debugging, I can add a flag so that sympy_str is called only when the flag is explicitly set.

<img width="667" height="409" alt="Screenshot 2025-09-03 at 6 21 52 PM" src="https://github.com/user-attachments/assets/a5929473-873d-4540-8f1e-c29f92be7125" />

(scuba link: https://fburl.com/scuba/pyperf_experimental/on_demand/3k2rduh9 )

Pull Request resolved: #162126
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162101
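The alternative mentioned in that message, calling the formatter only when a flag is explicitly set, could look roughly like the sketch below. All names here (`expensive_format`, `record_dependency`, the `debug` parameter) are hypothetical stand-ins and are not Inductor's actual code; the counter exists only to make the skipped work visible.

```python
# Hypothetical names throughout; this is NOT Inductor's code.
calls = {"format": 0}


def expensive_format(terms):
    """Stand-in for a costly pretty-printer such as sympy_str."""
    calls["format"] += 1
    return " + ".join(sorted(terms))


def record_dependency(terms, deps, debug=False):
    """Record a dependency; build the debug string only when asked to."""
    label = expensive_format(terms) if debug else None
    deps.append((frozenset(terms), label))


deps = []
for terms in (["x0", "x1"], ["x2"], ["x1", "x3"]):
    record_dependency(terms, deps)                  # hot path: no formatting

record_dependency(["x4", "x0"], deps, debug=True)   # opt-in formatting
```

With the flag off, the formatter is never invoked on the hot path; only the single opt-in call pays for the string.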

The previous LOAF algorithm is not guaranteed to create more fusion opportunities even when loop reordering happens. I cannot find an example where LOAF reduces the amount of fusion, but here is an example where reordering loops does not add more fusions: https://github.com/pytorch/pytorch/blob/a1f7639922ee0470bd7109bab6fe62989cf5000d/test/inductor/test_loop_ordering.py#L612-L641

Move LOAF to a separate final round of fusion so that it is guaranteed not to reduce the amount of fusion. Hopefully this also helps compilation time, since LOAF kicks in when there are fewer nodes.

Pull Request resolved: #162355
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #162101, #162126


[inductor] avoid creating LoopBody twice in merge_loops

30fb89e

[ghstack-poisoned]

pytorch-bot bot added ciflow/inductor module: inductor labels Sep 3, 2025

This was referenced Sep 3, 2025

[inductor] turn on loaf (for oss) by default #162030

Closed

LOAF not for land hack #162102

Open

Update on "[inductor] avoid creating LoopBody twice in merge_loops"

0699cb1

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben [ghstack-poisoned]

shunting314 mentioned this pull request Sep 3, 2025

[ez][inductor] add a few outer dimension reduction cases for LOAF #162028

Closed

shunting314 requested review from eellison and jansel September 3, 2025 21:40

jansel approved these changes Sep 3, 2025


eellison approved these changes Sep 3, 2025


shunting314 mentioned this pull request Sep 4, 2025

[Inductor] don't call sympy_str when not needed #162126

Closed

shunting314 changed the title ~~[inductor] avoid creating LoopBody twice in merge_loops~~ [inductor] avoid creating LoopBody twice Sep 4, 2025

shunting314 mentioned this pull request Sep 4, 2025

[inductor] fix TemplateBuffer.extract_read_writes #162221

Closed

This was referenced Sep 5, 2025

[inductor] rename deps during refreshing #162303

Closed

[inductor] fuse for scalar shared data #162311

Closed

shunting314 added the topic: not user facing topic category label Sep 6, 2025

This was referenced Sep 6, 2025

[inductor] fix 3d tiled online softmax #162341

Closed

[Inductor] do loop reordering in a separate final round #162355

Closed

shunting314 added 3 commits September 7, 2025 23:24

pytorchmergebot closed this in 466122b Sep 19, 2025

pytorchmergebot added the Merged label Sep 19, 2025

github-actions bot deleted the gh/shunting314/216/head branch October 20, 2025 02:17
