
[inductor][Autotune] Add matrix_instr_nonkdim to triton_meta #122852

Closed
wants to merge 1 commit

Conversation

@htyu (Contributor) commented Mar 28, 2024

Summary: Previous work https://github.com/pytorch/pytorch/pull/120742 to enable `matrix_instr_nonkdim` only dealt with the autotuner benchmarking; it failed to enable the parameter in the Triton meta for real runs. `matrix_instr_nonkdim` needs to be visible to the compiler driver so it can set up the optimization pipeline, which makes it unlike other kernel parameters such as `BLOCK_N` that can simply be set inside the kernel itself.
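For context, the same knob is handled analogously in hand-written Triton kernels: `matrix_instr_nonkdim` is placed in the autotuner config next to the block sizes so that the compiler driver, not the kernel body, consumes it when lowering `tl.dot` to MFMA instructions. The sketch below is illustrative only and not part of this change; it assumes a ROCm/AMD Triton build that accepts `matrix_instr_nonkdim` as a compile option, and the kernel name, block sizes, and divisibility assumptions are made up for brevity.

```python
# Illustrative sketch (not part of this PR): assumes a ROCm/AMD Triton build that
# recognizes `matrix_instr_nonkdim` as a compiler option rather than a kernel arg.
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        # matrix_instr_nonkdim lives in the config kwargs next to the block sizes,
        # so the compiler driver can see it while setting up the MFMA lowering.
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32,
                       "matrix_instr_nonkdim": 16}, num_warps=4, num_stages=1),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64, "BLOCK_K": 32,
                       "matrix_instr_nonkdim": 32}, num_warps=8, num_stages=1),
    ],
    key=["M", "N", "K"],
)
@triton.jit
def mm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
              BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Row-major C = A @ B; assumes M, N, K are divisible by the block sizes.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        offs_k = k + tl.arange(0, BLOCK_K)
        a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
        b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
        acc += tl.dot(a, b)  # tl.dot is what gets lowered to MFMA on AMD GPUs
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc)
```

The point of this PR is that Inductor-generated template kernels now carry the same key through `triton_meta`, so the real (non-benchmark) compilation sees it, as shown in the Test Plan output below.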

Test Plan:
P1201466917

triton_heuristics.template(
    num_stages=1,
    num_warps=4,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())], 'matrix_instr_nonkdim': 16},
    inductor_meta={'kernel_name': 'triton_tem_fused_mm_0', 'backend_hash': None},
)

Perf:
Before: 1.693 ms, 0.134 GB, 79.28 GB/s
After: 1.577 ms, 0.134 GB, 85.12 GB/s
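(For reference, the bandwidth numbers are consistent with the reported memory traffic divided by kernel time: 0.134 GB / 1.693 ms ≈ 79 GB/s before and 0.134 GB / 1.577 ms ≈ 85 GB/s after, i.e. roughly a 7% speedup.)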

Differential Revision: D55456401

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

pytorch-bot bot commented Mar 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122852

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 6564757 with merge base 3e7fd45:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D55456401

htyu added a commit to htyu/pytorch that referenced this pull request Mar 28, 2024

@htyu requested a review from jansel March 28, 2024 00:49
htyu added a commit to htyu/pytorch that referenced this pull request Mar 28, 2024

@xw285cornell (Contributor) left a comment

LG! Can we add perf impact in the summary?

pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) Mar 28, 2024
@htyu added the topic: not user facing label (topic category) Mar 28, 2024
@htyu (Contributor, Author) commented Mar 28, 2024

@pytorchbot merge -f "failure unrelated"

@pytorchmergebot (Collaborator)

The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorch deleted a comment from pytorch-bot bot Mar 28, 2024
@pytorch deleted a comment from pytorchmergebot Mar 28, 2024
htyu added a commit that referenced this pull request Mar 28, 2024
jerrymannil pushed a commit to ROCm/pytorch that referenced this pull request Apr 16, 2024
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024