[Inductor] Further tune block size for templated attention on H100 #125286

yanboliang · 2024-05-01T00:54:28Z

Ran a script that enumerates candidate block sizes to find the best default block size for templated attention.
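The enumeration script itself is not part of this description. As a rough sketch of the general idea only, assuming a plain grid search over hypothetical candidate values and a caller-supplied timing function (`enumerate_configs`, `pick_best`, `time_fn`, and the candidate lists are illustrative names and values, not taken from the PR):

```python
import itertools
from typing import Callable, Dict, Iterator, Tuple

# (BLOCK_M, BLOCK_N, num_warps, num_stages); this field layout is an assumption.
Config = Tuple[int, int, int, int]


def enumerate_configs(
    block_ms=(16, 32, 64, 128),
    block_ns=(16, 32, 64, 128),
    num_warps=(4, 8),
    num_stages=(1, 2, 3),
) -> Iterator[Config]:
    """Return an iterator over every combination of the hypothetical candidates."""
    return itertools.product(block_ms, block_ns, num_warps, num_stages)


def pick_best(
    configs: Iterator[Config], time_fn: Callable[[Config], float]
) -> Tuple[Config, Dict[Config, float]]:
    """Time each config with time_fn; return the fastest config and all timings."""
    timings = {cfg: time_fn(cfg) for cfg in configs}
    best = min(timings, key=timings.get)
    return best, timings


if __name__ == "__main__":
    # Stand-in cost model so the sketch runs without a GPU. A real run would
    # replace this with a function that benchmarks the kernel for each config.
    def fake_time(cfg: Config) -> float:
        block_m, block_n, warps, stages = cfg
        return abs(block_m - 64) + abs(block_n - 32) + warps + stages

    best, timings = pick_best(enumerate_configs(), fake_time)
    print(f"best config: {best} out of {len(timings)} candidates")
```

Keeping the timing function as a parameter means the same search loop can be reused per dtype and per GPU, which is how separate defaults for A100 and H100 could be derived.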

A100: no change (see the numbers in #125139).
H100: results below, by dtype.

torch.bfloat16

Before:

| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     1.103 |              |             |             |             |            |               |                |
| Max     |     1.322 |            8 |          16 |         512 |         512 |         64 | noop          | torch.bfloat16 |
| Min     |     0.829 |            1 |          16 |        1024 |        1024 |        128 | relative_bias | torch.bfloat16 |

After:

| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     1.137 |              |             |             |             |            |               |                |
| Max     |     1.442 |            1 |          16 |         512 |         512 |        128 | relative_bias | torch.bfloat16 |
| Min     |     0.913 |            1 |          16 |        1024 |        1024 |         64 | head_bias     | torch.bfloat16 |

torch.float32

Before:

| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype         |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|---------------|
| Average |     2.269 |              |             |             |             |            |               |               |
| Max     |     3.740 |           16 |          16 |        1024 |        1024 |         64 | noop          | torch.float32 |
| Min     |     0.761 |            1 |          16 |         512 |         512 |        128 | relative_bias | torch.float32 |

After:

| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype         |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|---------------|
| Average |     2.489 |              |             |             |             |            |             |               |
| Max     |     3.755 |           16 |          16 |        4096 |        4096 |         64 | noop        | torch.float32 |
| Min     |     1.609 |            1 |          16 |         512 |         512 |         64 | head_bias   | torch.float32 |
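To make the before/after tables easier to compare, the relative change of each summary row can be computed directly from the numbers above. This is a small worked example, not part of the PR; `relative_change` is an illustrative helper. Note that the Max and Min rows may refer to different configurations before and after, so this compares summary statistics rather than individual configurations.

```python
# Summary speedups copied from the tables above: (before, after).
summary = {
    "torch.bfloat16": {
        "Average": (1.103, 1.137),
        "Max": (1.322, 1.442),
        "Min": (0.829, 0.913),
    },
    "torch.float32": {
        "Average": (2.269, 2.489),
        "Max": (3.740, 3.755),
        "Min": (0.761, 1.609),
    },
}


def relative_change(before: float, after: float) -> float:
    """Percentage change from before to after."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for dtype, rows in summary.items():
        for name, (before, after) in rows.items():
            change = relative_change(before, after)
            print(f"{dtype:15s} {name:8s} {before:.3f} -> {after:.3f} ({change:+.1f}%)")
```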

cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire

pytorch-bot · 2024-05-01T00:54:31Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125286

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 93fb304 with merge base 4d5f807:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Chillee

needs a rebase

Chillee · 2024-05-01T02:36:48Z

torch/_inductor/kernel/templated_attention.py

Should probably check if it exactly matches
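The diff context for this comment is not shown here, so what "it" refers to is a guess. If the comment is about how the H100 defaults are selected, one possible reading is that the hardware check should match the device exactly instead of matching anything at or above it. The sketch below illustrates that difference with plain capability tuples; the helper names are hypothetical and the code is not from the PR. In real code the tuple could come from `torch.cuda.get_device_capability()`.

```python
from typing import Tuple

# H100 reports CUDA compute capability 9.0; A100 reports 8.0.
H100_CAPABILITY = (9, 0)


def is_h100_exact(capability: Tuple[int, int]) -> bool:
    """True only when the capability is exactly that of H100."""
    return capability == H100_CAPABILITY


def is_h100_or_newer(capability: Tuple[int, int]) -> bool:
    """True for H100 and for any device with a higher capability."""
    return capability >= H100_CAPABILITY


if __name__ == "__main__":
    # (10, 0) stands in for a later architecture whose best block size
    # has not been measured.
    for cap in [(8, 0), (9, 0), (10, 0)]:
        print(cap, "exact:", is_h100_exact(cap), "at least:", is_h100_or_newer(cap))
```

With the exact comparison, defaults tuned on H100 are not silently applied to hardware they were never benchmarked on.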

yanboliang · 2024-05-01T05:17:08Z

@pytorchbot merge

pytorchmergebot · 2024-05-01T05:18:56Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

…ytorch#125286)

Run a script to enumerate and get the best default block size for templated attention.

A100 -> no change, check numbers at pytorch#125139
H100

## torch.bfloat16

Before:

```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     1.103 |              |             |             |             |            |               |                |
| Max     |     1.322 |            8 |          16 |         512 |         512 |         64 | noop          | torch.bfloat16 |
| Min     |     0.829 |            1 |          16 |        1024 |        1024 |        128 | relative_bias | torch.bfloat16 |
```

After:

```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     1.137 |              |             |             |             |            |               |                |
| Max     |     1.442 |            1 |          16 |         512 |         512 |        128 | relative_bias | torch.bfloat16 |
| Min     |     0.913 |            1 |          16 |        1024 |        1024 |         64 | head_bias     | torch.bfloat16 |
```

## torch.float32

Before:

```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype         |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|---------------|
| Average |     2.269 |              |             |             |             |            |               |               |
| Max     |     3.740 |           16 |          16 |        1024 |        1024 |         64 | noop          | torch.float32 |
| Min     |     0.761 |            1 |          16 |         512 |         512 |        128 | relative_bias | torch.float32 |
```

After:

```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype         |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|---------------|
| Average |     2.489 |              |             |             |             |            |             |               |
| Max     |     3.755 |           16 |          16 |        4096 |        4096 |         64 | noop        | torch.float32 |
| Min     |     1.609 |            1 |          16 |         512 |         512 |         64 | head_bias   | torch.float32 |
```

Pull Request resolved: pytorch#125286
Approved by: https://github.com/Chillee

pytorch-bot bot added ciflow/inductor module: inductor oncall: pt2 labels May 1, 2024

yanboliang added the topic: not user facing topic category label May 1, 2024

yanboliang requested review from Chillee and drisspg May 1, 2024 01:00

Chillee approved these changes May 1, 2024


yanboliang added 2 commits April 30, 2024 20:36

Further tune block size for templated attention on H100

7814023

Fix lint

93fb304

yanboliang force-pushed the block-default branch from f56eed6 to 93fb304 Compare May 1, 2024 03:37

yanboliang added the ciflow/trunk Trigger trunk jobs on your pull request label May 1, 2024

pytorchmergebot added the merging label May 1, 2024

pytorchmergebot added the Merged label May 1, 2024

pytorchmergebot closed this in aead440 May 1, 2024

pytorchmergebot removed the merging label May 1, 2024

yanboliang deleted the block-default branch May 1, 2024 16:57
