Precompile triton templates #121998
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121998
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (3 Unrelated Failures) As of commit f4cbd2d with merge base 5891c5b. FLAKY: the following jobs failed but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Before this PR we were not precompiling triton templates in parallel. Compilation happened as part of benchmarking, so it was not parallelized. Triton benchmarking templates were emitted as:

```
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

In order to precompile we need to give the full kernel specification, as we do when we emit the template in the final output code generation:

```
@triton_heuristics.template(
    num_stages=3,
    num_warps=8,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]},
    inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': 'cdeecfeccd31ad7810f96b5752194b1c2406d0a81e39a6ca09c8ee150baae183'},
)
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```
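For intuition, here is a minimal sketch of the parallel precompilation this enables; the `choices` list and its `precompile()` method are hypothetical stand-ins, not Inductor's actual API:

```
# Hypothetical sketch: compile every candidate template before benchmarking.
# Because a kernel emitted with the full @triton_heuristics.template(...)
# header already carries its signature and configs, compiling it needs no
# example inputs, so the work can be farmed out to worker threads.
from concurrent.futures import ThreadPoolExecutor

def parallel_precompile(choices, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(choice.precompile) for choice in choices]
        for future in futures:
            future.result()  # surface any compilation errors eagerly
```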
Trying to understand more: why is the full kernel specification required if we want to precompile in parallel?
  kernel_name=kernel_name,
  output_node=fake_out,
- use_jit=True,
+ use_jit=False,
After this, are there any remaining cases that set `use_jit` to True? If not, we can probably remove this argument.
@shunting314 in order to compile with JIT you need to actually invoke it. There is a "warmup_only" arg in triton we could use, however we'd still need to provide real tensor inputs. In the case that you have arbitrarily many mms you want to compile in parallel, you would potentially be allocating the arguments for all of them at once, which is not feasible for memory. We already generate the full argument specification when we codegen anyway, so it's better to reuse this than the alternative.
I see. You actually want the precompile ability of `triton_heuristics.template`.
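To make the memory point above concrete, here is a rough sketch contrasting the two approaches; the kernel objects, `problem_sizes`, and the `precompile()` method are illustrative assumptions, not the real Inductor/Triton API:

```
import torch

def warmup_via_jit(mm_kernels, problem_sizes, device="cuda"):
    # A bare @triton.jit kernel is only compiled when it is launched, so
    # warming it up requires real A/B/out buffers per candidate mm; with
    # arbitrarily many candidates this does not fit in GPU memory.
    for kernel, (m, k, n) in zip(mm_kernels, problem_sizes):
        a = torch.randn(m, k, device=device)
        b = torch.randn(k, n, device=device)
        out = torch.empty(m, n, device=device)
        kernel[(1,)](a, b, out)  # first launch triggers compilation

def precompile_from_spec(mm_kernels):
    # A kernel emitted with the full triton_heuristics.template(...) header
    # records its signature and configs, so it can be compiled without
    # allocating any tensors at all.
    for kernel in mm_kernels:
        kernel.precompile()  # hypothetical: compile from recorded metadata
```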
@pytorchbot revert -m 'Sorry for reverting your change but it is causing all ROCm trunk job to fail https://hud.pytorch.org/pytorch/pytorch/commit/b8df2f0ca530ebe01fa079c891c170a1f4b22823' -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit b8df2f0. Reverted #121998 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is causing all ROCm trunk job to fail https://hud.pytorch.org/pytorch/pytorch/commit/b8df2f0ca530ebe01fa079c891c170a1f4b22823
@eellison your PR has been successfully reverted.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary: We are reverting #121998 because the change plus `search-autotune-cache` led to a significant compilation time increase, causing the stuck job detector to trigger and then kill the training job.
Test Plan: CI tests
Reviewed By: nmacchioni
Differential Revision: D55712203
Pull Request resolved: #123305
Approved by: https://github.com/eellison, https://github.com/nmacchioni, https://github.com/xw285cornell
Before this PR we were not precompiling triton templates in parallel; compilation would occur during benchmarking. In order to precompile we need to give the full kernel specification (the `@triton_heuristics.template(...)` header shown in the description above), as we do when we emit the template in the final output code generation.
Pull Request resolved: #121998
Approved by: https://github.com/jansel
ghstack dependencies: #121996, #120275, #121997
Stack from ghstack (oldest at bottom):
Before this PR we were not precompiling triton templates in parallel; compilation would occur during benchmarking. Triton benchmarking templates were emitted with only a bare `@triton.jit` decorator. In order to precompile we need to give the full kernel specification (the `@triton_heuristics.template(...)` header shown above), as we do when we emit the template in the final output code generation.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang