Skip to content

Conversation

shunting314
Copy link
Contributor

@shunting314 shunting314 commented Sep 3, 2025

Stack from ghstack (oldest at bottom):

Previously in merge_loops, we have to construct LoopBody twice to make sure we can use the same symbol prefix as before. This PR change it to create LoopBody only once by allowing using the same symbol prefix for the new LoopBody.

In looks like it's ok to have duplicate symbols in sympy replacement:

>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x

UPDATE: add the same optimization for LoopBody.reorder_iter_loops

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

Copy link

pytorch-bot bot commented Sep 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162101

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (5 Unrelated Failures)

As of commit 0e96e63 with merge base a6f9e0e (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

[ghstack-poisoned]
Previously in merge_loops, we have to construct LoopBody twice to make sure we can use the same symbol prefix as before. This PR change it to create LoopBody only once by allowing using the same symbol prefix for the new LoopBody.

In looks like it's ok to have duplicate symbols in sympy replacement:
```
>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
```


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

[ghstack-poisoned]
@shunting314 shunting314 changed the title [inductor] avoid creating LoopBody twice in merge_loops [inductor] avoid creating LoopBody twice Sep 4, 2025
Previously in merge_loops, we have to construct LoopBody twice to make sure we can use the same symbol prefix as before. This PR change it to create LoopBody only once by allowing using the same symbol prefix for the new LoopBody.

In looks like it's ok to have duplicate symbols in sympy replacement:
```
>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
```

UPDATE: add the same optimization for LoopBody.reorder_iter_loops


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

[ghstack-poisoned]
Previously in merge_loops, we have to construct LoopBody twice to make sure we can use the same symbol prefix as before. This PR change it to create LoopBody only once by allowing using the same symbol prefix for the new LoopBody.

In looks like it's ok to have duplicate symbols in sympy replacement:
```
>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
```

UPDATE: add the same optimization for LoopBody.reorder_iter_loops


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

[ghstack-poisoned]
@shunting314 shunting314 added the topic: not user facing topic category label Sep 6, 2025
Previously in merge_loops, we have to construct LoopBody twice to make sure we can use the same symbol prefix as before. This PR change it to create LoopBody only once by allowing using the same symbol prefix for the new LoopBody.

In looks like it's ok to have duplicate symbols in sympy replacement:
```
>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
```

UPDATE: add the same optimization for LoopBody.reorder_iter_loops


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

[ghstack-poisoned]
Previously in merge_loops, we have to construct LoopBody twice to make sure we can use the same symbol prefix as before. This PR change it to create LoopBody only once by allowing using the same symbol prefix for the new LoopBody.

In looks like it's ok to have duplicate symbols in sympy replacement:
```
>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
```

UPDATE: add the same optimization for LoopBody.reorder_iter_loops


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

[ghstack-poisoned]
Previously in merge_loops, we have to construct LoopBody twice to make sure we can use the same symbol prefix as before. This PR change it to create LoopBody only once by allowing using the same symbol prefix for the new LoopBody.

In looks like it's ok to have duplicate symbols in sympy replacement:
```
>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
```

UPDATE: add the same optimization for LoopBody.reorder_iter_loops


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

[ghstack-poisoned]
Previously in merge_loops, we have to construct LoopBody twice to make sure we can use the same symbol prefix as before. This PR change it to create LoopBody only once by allowing using the same symbol prefix for the new LoopBody.

In looks like it's ok to have duplicate symbols in sympy replacement:
```
>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
```

UPDATE: add the same optimization for LoopBody.reorder_iter_loops


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

[ghstack-poisoned]
@pytorchmergebot
Copy link
Collaborator

Starting merge as part of PR stack under #162355

pytorchmergebot pushed a commit that referenced this pull request Sep 19, 2025
I see torch.compile spend 2% of time on sympy_str when compiling the bwd graph for MobileBertForQuestionAnswering.  Most time sympy_str is called when extracting read/write dependencies. But when we extracting read/writer deps, the result of sympy_str is just discarded (correct me if I'm wrong). To make things simple, I just remove those calls. But if people think it may be useful for debugging, I can add a flag to only call sympy_str when it's explicitly set.

<img width="667" height="409" alt="Screenshot 2025-09-03 at 6 21 52 PM" src="https://github.com/user-attachments/assets/a5929473-873d-4540-8f1e-c29f92be7125" />

(scuba link: https://fburl.com/scuba/pyperf_experimental/on_demand/3k2rduh9 )

Pull Request resolved: #162126
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162101
pytorchmergebot pushed a commit that referenced this pull request Sep 19, 2025
Previous LOAF after fusion algorithm is not guaranteed to create more fusion opportunities even if loop reordering happens. I can not find an example that LOAF reduce the amount of fusion, but here is an example that reordering loops does not add more fusions:

https://github.com/pytorch/pytorch/blob/a1f7639922ee0470bd7109bab6fe62989cf5000d/test/inductor/test_loop_ordering.py#L612-L641

Move LOAF to a separate final round of fusion so that we are guaranteed to not reducing the amount of fusions. Hopefully this also helps compilation time since LOAF kicks in when there are less nodes.

Pull Request resolved: #162355
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #162101, #162126
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
Previously in merge_loops, we have to construct LoopBody twice to make sure we can use the same symbol prefix as before. This PR change it to create LoopBody only once by allowing using the same symbol prefix for the new LoopBody.

In looks like it's ok to have duplicate symbols in sympy replacement:
```
>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
```

UPDATE: add the same optimization for LoopBody.reorder_iter_loops

Pull Request resolved: pytorch#162101
Approved by: https://github.com/jansel, https://github.com/eellison
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
I see torch.compile spend 2% of time on sympy_str when compiling the bwd graph for MobileBertForQuestionAnswering.  Most time sympy_str is called when extracting read/write dependencies. But when we extracting read/writer deps, the result of sympy_str is just discarded (correct me if I'm wrong). To make things simple, I just remove those calls. But if people think it may be useful for debugging, I can add a flag to only call sympy_str when it's explicitly set.

<img width="667" height="409" alt="Screenshot 2025-09-03 at 6 21 52 PM" src="https://github.com/user-attachments/assets/a5929473-873d-4540-8f1e-c29f92be7125" />

(scuba link: https://fburl.com/scuba/pyperf_experimental/on_demand/3k2rduh9 )

Pull Request resolved: pytorch#162126
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: pytorch#162101
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
Previous LOAF after fusion algorithm is not guaranteed to create more fusion opportunities even if loop reordering happens. I can not find an example that LOAF reduce the amount of fusion, but here is an example that reordering loops does not add more fusions:

https://github.com/pytorch/pytorch/blob/a1f7639922ee0470bd7109bab6fe62989cf5000d/test/inductor/test_loop_ordering.py#L612-L641

Move LOAF to a separate final round of fusion so that we are guaranteed to not reducing the amount of fusions. Hopefully this also helps compilation time since LOAF kicks in when there are less nodes.

Pull Request resolved: pytorch#162355
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: pytorch#162101, pytorch#162126
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
Previously in merge_loops, we have to construct LoopBody twice to make sure we can use the same symbol prefix as before. This PR change it to create LoopBody only once by allowing using the same symbol prefix for the new LoopBody.

In looks like it's ok to have duplicate symbols in sympy replacement:
```
>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
```

UPDATE: add the same optimization for LoopBody.reorder_iter_loops

Pull Request resolved: pytorch#162101
Approved by: https://github.com/jansel, https://github.com/eellison
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
I see torch.compile spend 2% of time on sympy_str when compiling the bwd graph for MobileBertForQuestionAnswering.  Most time sympy_str is called when extracting read/write dependencies. But when we extracting read/writer deps, the result of sympy_str is just discarded (correct me if I'm wrong). To make things simple, I just remove those calls. But if people think it may be useful for debugging, I can add a flag to only call sympy_str when it's explicitly set.

<img width="667" height="409" alt="Screenshot 2025-09-03 at 6 21 52 PM" src="https://github.com/user-attachments/assets/a5929473-873d-4540-8f1e-c29f92be7125" />

(scuba link: https://fburl.com/scuba/pyperf_experimental/on_demand/3k2rduh9 )

Pull Request resolved: pytorch#162126
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: pytorch#162101
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
Previous LOAF after fusion algorithm is not guaranteed to create more fusion opportunities even if loop reordering happens. I can not find an example that LOAF reduce the amount of fusion, but here is an example that reordering loops does not add more fusions:

https://github.com/pytorch/pytorch/blob/a1f7639922ee0470bd7109bab6fe62989cf5000d/test/inductor/test_loop_ordering.py#L612-L641

Move LOAF to a separate final round of fusion so that we are guaranteed to not reducing the amount of fusions. Hopefully this also helps compilation time since LOAF kicks in when there are less nodes.

Pull Request resolved: pytorch#162355
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: pytorch#162101, pytorch#162126
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
Previously in merge_loops, we have to construct LoopBody twice to make sure we can use the same symbol prefix as before. This PR change it to create LoopBody only once by allowing using the same symbol prefix for the new LoopBody.

In looks like it's ok to have duplicate symbols in sympy replacement:
```
>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
```

UPDATE: add the same optimization for LoopBody.reorder_iter_loops

Pull Request resolved: pytorch#162101
Approved by: https://github.com/jansel, https://github.com/eellison
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
I see torch.compile spend 2% of time on sympy_str when compiling the bwd graph for MobileBertForQuestionAnswering.  Most time sympy_str is called when extracting read/write dependencies. But when we extracting read/writer deps, the result of sympy_str is just discarded (correct me if I'm wrong). To make things simple, I just remove those calls. But if people think it may be useful for debugging, I can add a flag to only call sympy_str when it's explicitly set.

<img width="667" height="409" alt="Screenshot 2025-09-03 at 6 21 52 PM" src="https://github.com/user-attachments/assets/a5929473-873d-4540-8f1e-c29f92be7125" />

(scuba link: https://fburl.com/scuba/pyperf_experimental/on_demand/3k2rduh9 )

Pull Request resolved: pytorch#162126
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: pytorch#162101
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
Previous LOAF after fusion algorithm is not guaranteed to create more fusion opportunities even if loop reordering happens. I can not find an example that LOAF reduce the amount of fusion, but here is an example that reordering loops does not add more fusions:

https://github.com/pytorch/pytorch/blob/a1f7639922ee0470bd7109bab6fe62989cf5000d/test/inductor/test_loop_ordering.py#L612-L641

Move LOAF to a separate final round of fusion so that we are guaranteed to not reducing the amount of fusions. Hopefully this also helps compilation time since LOAF kicks in when there are less nodes.

Pull Request resolved: pytorch#162355
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: pytorch#162101, pytorch#162126
@github-actions github-actions bot deleted the gh/shunting314/216/head branch October 20, 2025 02:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants