[inductor][cpp] epilogue support for gemm template #126019
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126019
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (4 Unrelated Failures) As of commit b30c694 with merge base 7a506dd:
- BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
- UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 5c5aa78127c399dc804cc7f768fe038cbf05a7e4 Pull Request resolved: #126019
ghstack-source-id: 58c56a7ef3271a127573415e5391b8f1ac5d1875 Pull Request resolved: #126019
As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result. cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
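To make the retracing idea concrete, here is a minimal sketch (all names are illustrative, not the actual inductor API): the epilogue was originally expressed over the full output tensor, and the template re-executes it with sub-slice ranges and offsets so the generated loop nest only touches the tile the micro-gemm just produced.

```python
# Toy model of retracing an epilogue over a sub-slice; the epilogue
# body was defined against the full output, and we re-run it with the
# tile's ranges and offsets. Names here are illustrative only.
def retrace_epilogue(epilogue_fn, tile_offset, tile_ranges, acc, out):
    rows, cols = tile_ranges
    for i in range(rows):
        for j in range(cols):
            # Map tile-local indices back to global coordinates so
            # loads from other (full-sized) inputs still line up.
            gi, gj = tile_offset[0] + i, tile_offset[1] + j
            out[gi][gj] = epilogue_fn(acc[i][j], gi, gj)

# Example epilogue: bias add + ReLU applied to a 2x2 tile at (2, 0).
out = [[0.0] * 4 for _ in range(4)]
bias = [0.5, -0.5, 0.25, 0.0]
acc = [[1.0, -2.0], [3.0, -4.0]]
retrace_epilogue(lambda v, gi, gj: max(v + bias[gj], 0.0),
                 (2, 0), (2, 2), acc, out)
print(out)
```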
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…ue fusion (#126068) As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR. Pull Request resolved: #126068 Approved by: https://github.com/jansel ghstack dependencies: #126019
…ue fusion (#126068) As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR. Pull Request resolved: #126068 Approved by: https://github.com/jansel ghstack dependencies: #124021, #126019
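As a hedged illustration of the micro-gemm contract described here (numpy stands in for the generated C++; this is not the template's actual code): low-precision inputs are upconverted on load, all accumulation stays in fp32, and a single downcast happens on store.

```python
import numpy as np

# Sketch of "fused type casting + fp32 compute": fp16 inputs,
# fp32 accumulation, one downcast at the end. Illustrative only;
# the real micro-gemm is generated C++.
def micro_gemm_fp16(a_fp16, b_fp16):
    acc = np.zeros((a_fp16.shape[0], b_fp16.shape[1]), dtype=np.float32)
    for k in range(a_fp16.shape[1]):
        # Upconvert on load; multiply-accumulate stays in fp32 so
        # rounding error does not compound across the K dimension.
        acc += (a_fp16[:, k:k+1].astype(np.float32)
                @ b_fp16[k:k+1, :].astype(np.float32))
    return acc.astype(np.float16)  # single downcast on store

a = np.random.rand(4, 8).astype(np.float16)
b = np.random.rand(8, 4).astype(np.float16)
print(micro_gemm_fp16(a, b))
```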
…(#126545) As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows (see the sketch after this list):
1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies`, which is used to generate the epilogue loops.
3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular for the reuse of output buffers of GEMM, template, and epilogues. This is not correct, since the output buffer is an "output", not an "in-place" buffer, of the template kernel itself. Now we use a dedicated "aliases" dict to manage such buffer reuses, and the intermediate aliasing buffers are removed after codegen.
4. Add a `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops work on smaller local buffers for better data locality.
Pull Request resolved: #126545 Approved by: https://github.com/jansel ghstack dependencies: #124021, #126019, #126068
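A rough sketch of the `localize_buffer` idea from point 4 (the real `LocalBufferScope` rewrites inductor IR nodes, not source strings; this only illustrates the substitution): accesses to the global output buffer inside the fused epilogue loop are redirected to a small tile-sized scratch buffer.

```python
# Toy version of localize_buffer: rewrite accesses to a global buffer
# so the fused epilogue loop reads and writes a tile-sized local one.
# Illustrative only; the real mechanism operates on IR, not strings.
class LocalBufferScopeSketch:
    def __init__(self):
        self.aliases = {}  # global buffer name -> local replacement

    def localize_buffer(self, global_buf, local_buf):
        self.aliases[global_buf] = local_buf

    def rewrite(self, body):
        for global_buf, local_buf in self.aliases.items():
            body = body.replace(global_buf, local_buf)
        return body

scope = LocalBufferScopeSketch()
scope.localize_buffer("Y_global", "Y_tile")
print(scope.rewrite("acc = Y_global[i][j]; Y_global[i][j] = relu(acc)"))
# -> acc = Y_tile[i][j]; Y_tile[i][j] = relu(acc)
```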
This reverts commit 56c412d. Reverted #126019 on behalf of https://github.com/DanilBaibak due to breaking the internal build ([comment](#124021 (comment)))
@jgong5 your PR has been successfully reverted.
```python
# Excerpt from the PR diff: build the C++ signature for the generated
# template kernel, and the placeholder it is substituted into.
cpp_argdefs, _, _ = self.args.cpp_argdefs()
return f"void {self.kernel_name}({', '.join(cpp_argdefs)})"

placeholder = "<DEFINE_KERNEL>"
```
btw, rename this to `<DEF_KERNEL>`, or it's going to merge conflict with #127144
Thanks for the reminder. Fixed.
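For context on how such a placeholder is typically consumed (a guess at the mechanism; the template machinery in the PR may differ): the kernel body is rendered with the sentinel where the signature belongs, and the sentinel is replaced once the final argument list is known.

```python
# Illustrative only: render the kernel body around a sentinel, then
# patch in the final C++ signature once all arguments are collected.
placeholder = "<DEF_KERNEL>"

def finalize(body_src, kernel_name, cpp_argdefs):
    signature = f"void {kernel_name}({', '.join(cpp_argdefs)})"
    return body_src.replace(placeholder, signature)

body = placeholder + " {\n    // gemm + fused epilogue ...\n}"
print(finalize(body, "cpp_fused_gemm", ["const float* X", "float* Y"]))
# -> void cpp_fused_gemm(const float* X, float* Y) { ... }
```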
As part of pytorch#125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result. Pull Request resolved: pytorch#126019 Approved by: https://github.com/jansel ghstack dependencies: pytorch#124021
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
ghstack-source-id: 26e170c08eb2d226dcefcc77be2669ebff9eb9ee Pull Request resolved: #126019
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.

cc @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang
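As a usage-level sketch of what this stack enables (the config knobs are assumptions based on inductor's max-autotune options; exact names and values may differ): compiling a linear + pointwise module with the C++ gemm backend enabled lets inductor pick the gemm template and fuse the ReLU epilogue into it instead of running it as a separate kernel.

```python
import torch

# Assumed inductor knobs: enable max-autotune and allow the C++ gemm
# template backend so the ReLU epilogue can be fused into the gemm.
torch._inductor.config.max_autotune = True
torch._inductor.config.max_autotune_gemm_backends = "CPP,ATEN"

mod = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval()
compiled = torch.compile(mod)

with torch.no_grad():
    print(compiled(torch.randn(8, 64)))
```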