
[inductor][Autotune] Add matrix_instr_nonkdim to triton_meta #122852

Closed
wants to merge 1 commit

Conversation

@htyu (Contributor) commented Mar 28, 2024

Summary: Previous work https://github.com/pytorch/pytorch/pull/120742 to enable `matrix_instr_nonkdim` only dealt with the autotuner benchmarking; it failed to enable the parameter in the Triton meta for real runs. `matrix_instr_nonkdim` needs to be visible to the compiler driver so it can set up the optimization pipeline, which makes it unlike other kernel parameters such as `BLOCK_N` that can simply be set inside the kernel itself.
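For context, the same knob is handled analogously in hand-written Triton kernels: `matrix_instr_nonkdim` is placed in the autotuner config next to the block sizes so that the compiler driver, not the kernel body, consumes it when lowering `tl.dot` to MFMA instructions. The sketch below is illustrative only and not part of this change; it assumes a ROCm/AMD Triton build that accepts `matrix_instr_nonkdim` as a compile option, and the kernel name, block sizes, and divisibility assumptions are made up for brevity.

```python
# Illustrative sketch (not part of this PR): assumes a ROCm/AMD Triton build that
# recognizes `matrix_instr_nonkdim` as a compiler option rather than a kernel arg.
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        # matrix_instr_nonkdim lives in the config kwargs next to the block sizes,
        # so the compiler driver can see it while setting up the MFMA lowering.
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32,
                       "matrix_instr_nonkdim": 16}, num_warps=4, num_stages=1),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64, "BLOCK_K": 32,
                       "matrix_instr_nonkdim": 32}, num_warps=8, num_stages=1),
    ],
    key=["M", "N", "K"],
)
@triton.jit
def mm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
              BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Row-major C = A @ B; assumes M, N, K are divisible by the block sizes.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        offs_k = k + tl.arange(0, BLOCK_K)
        a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
        b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
        acc += tl.dot(a, b)  # tl.dot is what gets lowered to MFMA on AMD GPUs
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc)
```

The point of this PR is that Inductor-generated template kernels now carry the same key through `triton_meta`, so the real (non-benchmark) compilation sees it, as shown in the Test Plan output below.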

Test Plan:
P1201466917

triton_heuristics.template(
    num_stages=1,
    num_warps=4,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())], 'matrix_instr_nonkdim': 16},
    inductor_meta={'kernel_name': 'triton_tem_fused_mm_0', 'backend_hash': None},
)

Perf:
Before: 1.693 ms, 0.134 GB, 79.28 GB/s
After: 1.577 ms, 0.134 GB, 85.12 GB/s
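(For reference, the bandwidth numbers are consistent with the reported memory traffic divided by kernel time: 0.134 GB / 1.693 ms ≈ 79 GB/s before and 0.134 GB / 1.577 ms ≈ 85 GB/s after, i.e. roughly a 7% speedup.)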

Differential Revision: D55456401

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

pytorch-bot bot commented Mar 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122852

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 6564757 with merge base 3e7fd45:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D55456401

htyu added a commit to htyu/pytorch that referenced this pull request Mar 28, 2024

@htyu requested a review from jansel March 28, 2024 00:49
htyu added a commit to htyu/pytorch that referenced this pull request Mar 28, 2024

@xw285cornell (Contributor) left a comment

LG! Can we add perf impact in the summary?

pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) Mar 28, 2024
@htyu added the topic: not user facing label (topic category) Mar 28, 2024
@htyu (Contributor, Author) commented Mar 28, 2024

@pytorchbot merge -f "failure unrelated"

@pytorchmergebot (Collaborator)

The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorch deleted a comment from pytorch-bot bot Mar 28, 2024
@pytorch deleted a comment from pytorchmergebot Mar 28, 2024
htyu added a commit that referenced this pull request Mar 28, 2024
jerrymannil pushed a commit to ROCm/pytorch that referenced this pull request Apr 16, 2024
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024