Optimize reduction + amax fusion #111122

ipiszy · 2023-10-12T06:24:28Z

This PR optimizes cases like layer_norm + fp8 quant (which includes amax and fp8 quant) fusion when amax is split into multiple reduction kernels.

Benchmark:

python test/inductor/test_fp8.py -k test_layernorm_fp8_quant_benchmark

Before this PR:
Config: float8_dtype=torch.float8_e5m2, shape=(4, 2048, 4096). 
Benchmark results: Inductor: 0.13262102689486555ms, Eager: 0.8211962616822429ms, LN only Inductor: 0.09606276150627614ms.

After this PR:
Config: float8_dtype=torch.float8_e5m2, shape=(4, 2048, 4096). 
Benchmark results: Inductor: 0.08281274131274131ms, Eager: 0.8217452830188678ms, LN only Inductor: 0.09586902286902287ms.
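
As a quick reading aid, the ratios implied by the two runs above can be worked out directly. This is only arithmetic on the reported timings; the dictionary and variable names are illustrative and not part of the benchmark script.

```python
# Timings in milliseconds, copied from the benchmark output above.
before = {
    "inductor": 0.13262102689486555,
    "eager": 0.8211962616822429,
    "ln_only": 0.09606276150627614,
}
after = {
    "inductor": 0.08281274131274131,
    "eager": 0.8217452830188678,
    "ln_only": 0.09586902286902287,
}

# How much faster the fused LN + fp8 quant kernel became with this PR.
speedup_vs_before = before["inductor"] / after["inductor"]
# How the fused Inductor kernel compares with eager mode after this PR.
speedup_vs_eager = after["eager"] / after["inductor"]
# Fused LN + fp8 quant time relative to LN alone (below 1.0 means faster).
fused_vs_ln_only = after["inductor"] / after["ln_only"]

print(f"Inductor speedup from this PR: {speedup_vs_before:.2f}x")
print(f"Inductor vs eager after this PR: {speedup_vs_eager:.2f}x")
print(f"LN + fp8 quant time / LN-only time: {fused_vs_ln_only:.2f}")
```

In round numbers this is about a 1.6x improvement over the previous Inductor result, and the fused kernel takes roughly 86% of the LN-only time.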

LN + fp8 quant is even faster than LN itself. The reason could be that LN + fp8 outputs fp8 while LN outputs fp16.
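
A back-of-the-envelope memory-traffic estimate is consistent with that explanation. The sketch below assumes the kernels are memory-bound, that the input is fp16, and that each kernel reads its input once and writes its output once; it ignores the small weight, bias, scale, and amax tensors. These are assumptions for illustration, not measurements.

```python
from math import prod

shape = (4, 2048, 4096)  # benchmark shape from the config above
numel = prod(shape)

FP16_BYTES = 2
FP8_BYTES = 1

# Assumption: one fp16 read of the input plus one write of the output.
ln_only_bytes = numel * (FP16_BYTES + FP16_BYTES)  # fp16 in, fp16 out
ln_fp8_bytes = numel * (FP16_BYTES + FP8_BYTES)    # fp16 in, fp8 out

traffic_ratio = ln_fp8_bytes / ln_only_bytes

print(f"LN only:        {ln_only_bytes / 2**20:.0f} MiB moved")
print(f"LN + fp8 quant: {ln_fp8_bytes / 2**20:.0f} MiB moved")
print(f"Traffic ratio:  {traffic_ratio:.2f}")
```

Under these assumptions the fused kernel moves three quarters of the bytes that LN alone does, so it is plausible for it to finish sooner even though it does more arithmetic.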

From Inductor nightly benchmark test:
There are perf differences in the cuda_graph / cuda_graph_dynamic / default runs, but no difference in inductor_max_autotune. So it seems to me that the perf differences are most likely fluctuations.

Stack from ghstack (oldest at bottom):

-> Optimize reduction + amax fusion #111122

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler

[ghstack-poisoned]

pytorch-bot · 2023-10-12T06:24:32Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111122

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 2da5e75 with merge base 547a116:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu, unstable) (gh)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: ade61469e2eebcb6494ba6bae88e474bc94f87cf Pull Request resolved: #111122

ipiszy · 2023-10-12T06:31:43Z

torch/_inductor/dependencies.py

+        # Input node has already been realized. Return its size and reduction_size.
+        return input_node.get_size(), input_node.get_reduction_size()
+
+    # This is one issue: what if there are permutations between the input node and its dependent realized nodes?


@jansel Do you have any suggestions for this?

jansel · 2023-10-12

In addition to permutations, there are views that change the number of dimensions.

Is it OK if this function is approximate, or are there correctness issues if it is wrong?

ipiszy · 2023-10-13

Using the reduction sizes from dependent nodes gives these nodes a better chance of being fused.

For example, the current case is:

x1 = layer_norm(x0)
x2 = amax(x1)
x3 = to_fp8(x1)

Inductor generates these nodes:

n0 = WelfordReduction()
n1 = WelfordReduction()
n2 = WelfordReduction()
n3 = Pointwise()
n4 = Reduction()
n5 = Pointwise()

Currently n0, n1, n2, n3, n5 are fused together, and n3, n4 are fused together.
I'd like to make the first-level reduction ranges of n4 the same as those of n0 / n1 / n2, so that n0, n1, n2, n3, the first level of n4, and n5 can all be fused together.

So it seems to me that we cannot use approximate values here for the n4 reduction sizes.
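
To make the idea concrete, here is a minimal pure-Python sketch of a two-level amax whose first level iterates over the same range as the layer-norm reduction (the hidden dimension), so both can share one loop. It is an illustration of the n0..n5 example above, not Inductor's generated code; the function names are hypothetical and the layer norm omits the affine parameters for brevity.

```python
import random


def layer_norm_row(row, eps=1e-5):
    """Normalize one row over the hidden dimension (no weight or bias)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    inv_std = (var + eps) ** -0.5
    return [(v - mean) * inv_std for v in row]


def fused_ln_and_first_level_amax(x):
    """One pass over the rows: layer norm plus a per-row partial amax.

    Both reductions run over the same range (the hidden dimension),
    which is the property that allows them to live in a single loop.
    """
    normalized, partial_amax = [], []
    for row in x:
        y = layer_norm_row(row)
        normalized.append(y)
        partial_amax.append(max(abs(v) for v in y))
    return normalized, partial_amax


def second_level_amax(partial_amax):
    """Second-level reduction: combine the per-row partial results."""
    return max(partial_amax)


random.seed(0)
x = [[random.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(4)]

normalized, partial = fused_ln_and_first_level_amax(x)
amax = second_level_amax(partial)

# Reference: a single global reduction over the normalized values.
reference = max(abs(v) for row in normalized for v in row)
print(amax == reference)
```

If the first level instead used a different split of the elements, its loop would no longer line up with the layer-norm loop, which is why approximate reduction sizes would defeat the fusion.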

vadimkantorov · 2023-10-12T10:16:51Z

test/inductor/test_fp8.py

+        batch_size, sequence_length, hidden_size = shape
+
+        def amax_fp8(x: Tensor, scale: Tensor):
+            y = torch.max(torch.abs(x))


Should this use torch.amax instead of the older torch.max? If max is not intentional, I think using amax to mean "return the values without indices" is clearer.


jansel · 2023-10-17T21:27:44Z

torch/_inductor/dependencies.py

+
+    from .ir import ComputedBuffer, Loops
+
+    if not isinstance(input_node.data.data, Loops):


I think we need some checks to ensure .data and .data.data exist. There are some cases like views that result in different nesting.

Yeah sure. I added some checks in the callsite, let me also add checks here for safety.
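
One way such a check could look is to unwrap the nesting step by step instead of assuming both `.data` levels exist. The classes below are simplified stand-ins written for this sketch (only the name `Loops` is borrowed from the snippet above), and `unwrap_loops` is a hypothetical helper, not a function in `torch/_inductor`.

```python
from typing import Optional


class Loops:
    """Stand-in for a loop-based IR node."""

    def __init__(self, ranges):
        self.ranges = ranges


class Wrapper:
    """Stand-in for any IR node that wraps another node in `.data`."""

    def __init__(self, data):
        self.data = data


class Leaf:
    """Stand-in for a node that has no `.data` attribute at all."""


def unwrap_loops(node: object) -> Optional[Loops]:
    """Return node.data.data if both levels exist and it is a Loops, else None."""
    outer = getattr(node, "data", None)
    inner = getattr(outer, "data", None)
    return inner if isinstance(inner, Loops) else None


well_nested = Wrapper(Wrapper(Loops([4, 2048])))
shallow = Wrapper(Leaf())          # only one level of nesting
wrong_type = Wrapper(Wrapper(Leaf()))  # two levels, but not a Loops

print(unwrap_loops(well_nested) is not None)
print(unwrap_loops(shallow) is None)
print(unwrap_loops(wrong_type) is None)
```

Returning `None` lets the caller fall back to the unoptimized path whenever the nesting is not the expected one.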

jansel · 2023-10-17T21:28:53Z

torch/_inductor/dependencies.py

+        if hasattr(input_node, "get_size") and hasattr(
+            input_node, "get_reduction_size"
+        ):


Adding a method would be cleaner than these hasattr checks.
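
A sketch of what "adding a method" could mean: give the base class a method with a safe default and override it where the information exists, so callers never probe with `hasattr`. All class and method names here are hypothetical and chosen for illustration; they are not the names used in Inductor.

```python
from typing import List, Optional, Tuple

Sizes = Tuple[List[int], List[int]]


class BaseNode:
    def sizes_for_fusion(self) -> Optional[Sizes]:
        """Default: this node has no (size, reduction_size) pair to offer."""
        return None


class ReductionNode(BaseNode):
    def __init__(self, size: List[int], reduction_size: List[int]):
        self.size = size
        self.reduction_size = reduction_size

    def sizes_for_fusion(self) -> Optional[Sizes]:
        return self.size, self.reduction_size


class ViewNode(BaseNode):
    """Inherits the default and therefore opts out."""


def describe(node: BaseNode) -> str:
    sizes = node.sizes_for_fusion()  # no hasattr probing at the call site
    if sizes is None:
        return "no sizes available"
    size, reduction_size = sizes
    return f"size={size}, reduction_size={reduction_size}"


print(describe(ReductionNode([8192], [4096])))
print(describe(ViewNode()))
```

The call site then has a single code path, and new node types opt in by overriding one method.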


ipiszy

Thanks @jansel !

ipiszy · 2023-10-19T07:04:23Z

torch/_inductor/dependencies.py

+
+    from .ir import ComputedBuffer, Loops
+
+    if not isinstance(input_node.data.data, Loops):


Yeah sure. I added some checks in the callsite, let me also add checks here for safety.

ipiszy · 2023-10-19T18:20:11Z

@pytorchbot merge

pytorchmergebot · 2023-10-19T18:22:32Z

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team

Raised by workflow job

ipiszy · 2023-10-19T20:13:55Z

@pytorchbot label "topic: not user facing"

ipiszy · 2023-10-19T20:14:16Z

@pytorchbot merge

pytorchmergebot · 2023-10-19T20:16:10Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

In #111122, an optimization was introduced for the reduction + pointwise + multi-level reduction pattern. It makes the first-level reduction ranges of the multi-level reduction the same as the ranges of the preceding reduction, so that Inductor has a better chance of fusing the first reduction with the first-level reduction of the multi-level reduction kernel. There is a corner case when the multi-level reduction kernel has `keepdim=True`: the ranges of the multi-level reduction kernel are then not empty, and the dim info is needed to create the inner loader of the first-level reduction kernel. To keep the logic simple, the optimization is disabled for now when `keepdim=True`. Differential Revision: [D50544876](https://our.internmc.facebook.com/intern/diff/D50544876) [ghstack-poisoned]
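
The `keepdim=True` corner case comes down to the output ranges of the reduction. The sketch below computes them with the usual shape rule for reductions and applies a guard that only takes the optimized path when they are empty. Both function names are hypothetical, and the guard illustrates the idea rather than reproducing the actual check in the fix.

```python
from typing import List, Sequence


def reduction_output_ranges(
    shape: Sequence[int], dims: Sequence[int], keepdim: bool
) -> List[int]:
    """Output ranges after reducing `dims` of `shape`."""
    reduced = {d % len(shape) for d in dims}
    if keepdim:
        # Reduced dimensions are kept with size 1.
        return [1 if i in reduced else s for i, s in enumerate(shape)]
    # Reduced dimensions are dropped.
    return [s for i, s in enumerate(shape) if i not in reduced]


def may_reuse_producer_ranges(output_ranges: Sequence[int]) -> bool:
    """Hypothetical guard: only optimize when the output ranges are empty."""
    return len(output_ranges) == 0


shape = (4, 2048, 4096)
all_dims = list(range(len(shape)))

without_keepdim = reduction_output_ranges(shape, all_dims, keepdim=False)
with_keepdim = reduction_output_ranges(shape, all_dims, keepdim=True)

print(without_keepdim, may_reuse_producer_ranges(without_keepdim))
print(with_keepdim, may_reuse_producer_ranges(with_keepdim))
```

With `keepdim=False` a full reduction leaves no output ranges, so the simple path applies; with `keepdim=True` the size-1 dimensions remain and the guard declines.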


mlazos · 2023-10-24T04:42:08Z

@ipiszy This PR caused a significant regression in TIMM dm_nfnet_f0

repro command:

python timm_models.py --training --amp --performance --only=dm_nfnet_f0 --inductor

Can you take a look?

cc @eellison

In #111122, an optimization was introduced for the reduction + pointwise + multi-level reduction pattern. It makes the first-level reduction ranges of the multi-level reduction the same as the ranges of the preceding reduction, so that Inductor has a better chance of fusing the first reduction with the first-level reduction of the multi-level reduction kernel. There is a corner case when the multi-level reduction kernel has `keepdim=True`: the ranges of the multi-level reduction kernel are then not empty, and the dim info is needed to create the inner loader of the first-level reduction kernel. To keep the logic simple, the optimization is disabled for now when `keepdim=True`. Differential Revision: [D50544876](https://our.internmc.facebook.com/intern/diff/D50544876) Pull Request resolved: #111781 Approved by: https://github.com/malfet, https://github.com/jansel


In #111122, an optimization was introduced for reduction + pointwise + multi-level reduction fusion. The main idea is to have the first-level reduction of the multi-level reduction reuse the reduction sizes of the first reduction kernel, so that the first reduction kernel and the first-level reduction of the multi-level reduction kernel have a better chance of being fused. However, it introduced a bug for the pattern pointwise + multi-level reduction, where the first-level reduction kernel wrongly reuses the reduction ranges (which are []) of the preceding pointwise kernel. This PR fixes this issue.

Test plan: `python timm_models.py --training --amp --performance --only=dm_nfnet_f0 --inductor`
Results before this PR: 0.869x
Results after this PR: 1.232x

Benchmark results: ![Screenshot 2023-10-30 at 2 30 10 PM](https://github.com/pytorch/pytorch/assets/10527447/c7b241c0-92a4-49ff-96fb-2805c8fcc45a) ![Screenshot 2023-10-30 at 3 10 06 PM](https://github.com/pytorch/pytorch/assets/10527447/608d26ea-dcc5-4f2a-8700-4a928701392b)

Pull Request resolved: #112297 Approved by: https://github.com/jansel
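
The failure mode described here can be pictured with a small sketch: a pointwise producer has empty reduction ranges, so blindly reusing them gives the first-level reduction nothing to reduce over. `FakeNode` and `first_level_reduction_ranges` are hypothetical names used only for this illustration; they do not reproduce the actual change in #112297.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeNode:
    kind: str                       # "reduction" or "pointwise"
    ranges: List[int]
    reduction_ranges: List[int] = field(default_factory=list)


def first_level_reduction_ranges(
    producer: FakeNode, default: List[int]
) -> List[int]:
    """Reuse the producer's reduction ranges only if it really is a reduction.

    A pointwise producer has reduction_ranges == [], and reusing that
    would be wrong, so fall back to the default split instead.
    """
    if producer.kind == "reduction" and producer.reduction_ranges:
        return list(producer.reduction_ranges)
    return list(default)


layer_norm_like = FakeNode("reduction", ranges=[8192], reduction_ranges=[4096])
pointwise_only = FakeNode("pointwise", ranges=[8192, 4096])

default_split = [512]  # placeholder for whatever the normal heuristic picks

print(first_level_reduction_ranges(layer_norm_like, default_split))
print(first_level_reduction_ranges(pointwise_only, default_split))
```

The reduction producer's ranges are reused, while the pointwise producer falls back to the default split.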

ipiszy · 2023-11-01T18:16:19Z

> @ipiszy This PR caused a significant regression in TIMM dm_nfnet_f0
>
> repro command:
>
> python timm_models.py --training --amp --performance --only=dm_nfnet_f0 --inductor
>
> Can you take a look?
>
> cc @eellison

FYI this is fixed by #112297.


Optimize reduction + amax fusion

bdb09dd

[ghstack-poisoned]

ipiszy added a commit that referenced this pull request Oct 12, 2023

Optimize reduction + amax fusion

e6a8342

ghstack-source-id: ade61469e2eebcb6494ba6bae88e474bc94f87cf Pull Request resolved: #111122

github-actions bot added module: inductor ciflow/inductor labels Oct 12, 2023

ipiszy commented Oct 12, 2023


ipiszy requested review from jansel and drisspg October 12, 2023 06:32

vadimkantorov reviewed Oct 12, 2023


ipiszy added a commit that referenced this pull request Oct 13, 2023

Optimize reduction + amax fusion

273d98b

ghstack-source-id: b7c11cfb4c03156c3eb9e0f1198e34321f080910 Pull Request resolved: #111122

ipiszy added a commit that referenced this pull request Oct 15, 2023

Optimize reduction + amax fusion

2963f25

ghstack-source-id: bfc21e5da21b4fce616d80c9ce195a4673d27e79 Pull Request resolved: #111122

ipiszy added a commit that referenced this pull request Oct 16, 2023

Optimize reduction + amax fusion

63c8990

ghstack-source-id: f9abe7a11dafa5dd3095cd2d46029c029271424d Pull Request resolved: #111122

ipiszy added a commit that referenced this pull request Oct 17, 2023

Optimize reduction + amax fusion

696b6b9

ghstack-source-id: 8ea42b1fe47bc07024af40b92d6062b31b2b3834 Pull Request resolved: #111122

jansel approved these changes Oct 17, 2023


ipiszy added a commit that referenced this pull request Oct 19, 2023

Optimize reduction + amax fusion

a757519

ghstack-source-id: 8f2ec448b4f1ff768402b93b15e87e7140fe9d22 Pull Request resolved: #111122

ipiszy commented Oct 19, 2023


pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 19, 2023

pytorchmergebot added the merging label Oct 19, 2023

pytorchmergebot removed the merging label Oct 19, 2023

pytorch-bot bot added the topic: not user facing topic category label Oct 19, 2023

pytorchmergebot added the merging label Oct 19, 2023

pytorchmergebot added Merged and removed merging labels Oct 19, 2023

pytorchmergebot closed this in dc31dbb Oct 19, 2023

ipiszy mentioned this pull request Oct 23, 2023

Fix reduction + () + multi-level reduction optimization #111781

Closed

facebook-github-bot deleted the gh/ipiszy@gmail.com/11/head branch October 23, 2023 14:24

ipiszy mentioned this pull request Oct 23, 2023

Fix reduction + () + multi-level reduction optimization (#111781) #111839

Closed

ipiszy mentioned this pull request Oct 27, 2023

Fix regression from pointwise + multi-level reduction fusion #112297

Closed

lw mentioned this pull request Apr 24, 2024

[Inductor] Support fusion of chained reductions even if keepdims=True #124843

Closed
