Re-land precompile triton templates #124030
Conversation
This reverts commit e0c9764. [ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124030
Note: Links to docs will display an error until the docs builds have completed. ✅ No failures as of commit f1e2cf9 with merge base bbb6e36. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit. [ghstack-poisoned]
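The guard can be pictured with a minimal sketch (the names here are illustrative stand-ins, not Inductor's actual internals): consult the autotuning cache first, and only launch background precompilation on a miss.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional

# Hypothetical stand-ins for Inductor's autotune cache and compile entry point.
_autotune_cache: dict = {}
_executor = ThreadPoolExecutor(max_workers=4)

def maybe_precompile(key: str, compile_fn: Callable[[], object]) -> Optional[Future]:
    """Start background precompilation only when the cache has no entry.

    Returns None on a cache hit, so callers skip the compile entirely.
    """
    if key in _autotune_cache:  # cache hit: reuse the stored choice
        return None
    return _executor.submit(compile_fn)  # cache miss: precompile in the background
```

A test in this spirit would assert that the compile function is never invoked when the key is already cached.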
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Two changes:
- In epilogue benchmark fusion, only take the top 6 choices; in HF, choices ranked below that were essentially never selected.
- Share a single precompilation function among matmuls with the same key (see the sketch below).
Pull Request resolved: #122642
Approved by: https://github.com/shunting314
ghstack dependencies: #124030
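Sharing one precompilation across matmuls with the same key amounts to memoizing the in-flight compile. A rough sketch, with assumed names rather than Inductor's real API:

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable

_executor = ThreadPoolExecutor(max_workers=4)
_precompile_futures: dict = {}  # key -> shared in-flight compile

def shared_precompile(key: str, compile_fn: Callable[[], object]) -> Future:
    """The first matmul with a given key starts the compile; later matmuls
    with the same key wait on the same future instead of recompiling."""
    future = _precompile_futures.get(key)
    if future is None:
        future = _executor.submit(compile_fn)
        _precompile_futures[key] = future
    return future
```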
@pytorchbot revert -m "Broken trunk" -c ignoredsignal
Pull Request resolved: pytorch#124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
This reverts commit 050051f. Reverted pytorch#122642 on behalf of https://github.com/DanilBaibak due to Broken trunk.
This reverts commit d68196e. Reverted pytorch#124030 on behalf of https://github.com/DanilBaibak due to Broken trunk.
This reverts commit 030bb13. Reverted pytorch#124030 on behalf of https://github.com/DanilBaibak due to Broken trunk.
Two small fixes:
- Preserve RNG state around compile_fx_inner (a sketch follows below).
- Now that we precompile in the background while lowering multiple templates in parallel, we can no longer allocate benchmark inputs at the start of the function, since multiple sets of inputs would be live at once. Instead, allocate them when needed.
Pull Request resolved: pytorch#123229
Approved by: https://github.com/shunting314
ghstack dependencies: pytorch#124030, pytorch#122642
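A minimal sketch of the RNG fix, using only the public torch RNG-state APIs (the wrapper name and the commented call site are illustrative):

```python
import contextlib
import torch

@contextlib.contextmanager
def preserve_rng_state():
    """Snapshot RNG state before compilation and restore it afterwards, so
    benchmarking inside compile does not perturb the caller's random stream."""
    cpu_state = torch.get_rng_state()
    cuda_states = torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None
    try:
        yield
    finally:
        torch.set_rng_state(cpu_state)
        if cuda_states is not None:
            torch.cuda.set_rng_state_all(cuda_states)

# Illustrative call site:
# with preserve_rng_state():
#     compiled_fn = compile_fx_inner(gm, example_inputs)
```

The second fix follows the same spirit: each template's benchmark inputs are materialized only when that template is actually benchmarked, so parallel precompilation never holds several full input sets alive at once.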
Two changes:
- Make the flag for multi-template buffers independent from benchmark fusion. While benchmark fusion can be useful, its compile-time/performance trade-offs differ from those of templates alone, which we'd like to enable by default.
- Don't do MultiTemplateBuffers/benchmark fusion for templates that have custom input gen fns (which currently exist only internally). Threading the custom input gen fns through benchmark fusion is not yet implemented.
Pull Request resolved: pytorch#122825
Approved by: https://github.com/shunting314
ghstack dependencies: pytorch#124030, pytorch#122642, pytorch#123229
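Schematically, the decoupling looks like the sketch below; the flag and attribute names are hypothetical placeholders, not the actual torch._inductor.config entries.

```python
# Hypothetical, independently-gated flags.
benchmark_multi_templates = True    # multi-template buffers: cheap enough to default on
benchmark_epilogue_fusion = False   # benchmark fusion: opt-in, heavier compile-time cost

def should_use_multi_template(template) -> bool:
    # Templates with custom input gen fns are excluded: threading those
    # generators through benchmark fusion is not yet implemented.
    if getattr(template, "custom_input_gen_fn", None) is not None:
        return False
    return benchmark_multi_templates
```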
Biggest movement is 4% HF inference, 9% TIMM inference. Note that this is max-autotune mode, so we are more tolerant of compile-time increases. We could improve compilation time by limiting:

```
# Take how many of the top triton kernels to benchmark epilogue
max_epilogue_benchmarked_choices = 3
```

There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. Turning off epilogue fusion fixes the accuracy. I bisected the failure to an epilogue; however, comparing the results of that epilogue with the corresponding separate kernels, the outputs are equivalent.

Inference: [dashboard screenshot: https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c]
Training: [dashboard screenshot: https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788]

Pull Request resolved: #124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229, #122825
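The quoted knob bounds autotuning work roughly as in this sketch; apart from `max_epilogue_benchmarked_choices`, the names are placeholders:

```python
# Rank template choices by standalone runtime, then benchmark fused
# epilogues only for the top few, capping compile-time cost.
max_epilogue_benchmarked_choices = 3

def epilogue_candidates(timings: dict) -> list:
    """timings maps each template choice to its measured runtime in ms."""
    ranked = sorted(timings, key=timings.__getitem__)  # fastest first
    return ranked[:max_epilogue_benchmarked_choices]
```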
Stack from ghstack (oldest at bottom):
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang