Conversation

@yanboliang (Contributor) commented on May 1, 2024

Ran a script to enumerate candidate block sizes and pick the best default block size for templated attention (a rough sketch of the enumeration approach follows the result tables below).

A100: no change; see the numbers in #125139.
H100:

## torch.bfloat16

Before:

| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     1.103 |              |             |             |             |            |               |                |
| Max     |     1.322 |            8 |          16 |         512 |         512 |         64 | noop          | torch.bfloat16 |
| Min     |     0.829 |            1 |          16 |        1024 |        1024 |        128 | relative_bias | torch.bfloat16 |

After:

| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     1.137 |              |             |             |             |            |               |                |
| Max     |     1.442 |            1 |          16 |         512 |         512 |        128 | relative_bias | torch.bfloat16 |
| Min     |     0.913 |            1 |          16 |        1024 |        1024 |         64 | head_bias     | torch.bfloat16 |

## torch.float32

Before:

| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype         |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|---------------|
| Average |     2.269 |              |             |             |             |            |               |               |
| Max     |     3.740 |           16 |          16 |        1024 |        1024 |         64 | noop          | torch.float32 |
| Min     |     0.761 |            1 |          16 |         512 |         512 |        128 | relative_bias | torch.float32 |

After:

| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype         |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|---------------|
| Average |     2.489 |              |             |             |             |            |             |               |
| Max     |     3.755 |           16 |          16 |        4096 |        4096 |         64 | noop        | torch.float32 |
| Min     |     1.609 |            1 |          16 |         512 |         512 |         64 | head_bias   | torch.float32 |
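
The sweep script itself isn't part of this description. As a rough sketch of the approach, the snippet below enumerates (BLOCK_M, BLOCK_N) candidates and keeps the fastest one using CUDA-event timing; `run_attention`, the candidate values, and the timing parameters are all illustrative assumptions, not the actual script used to produce these numbers.

```python
import itertools
import torch

def benchmark_ms(fn, warmup=10, iters=50):
    # Average milliseconds per call, measured with CUDA events.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def sweep_block_sizes(run_attention, candidates=(16, 32, 64, 128)):
    # run_attention is a hypothetical callable that launches the templated
    # attention kernel with the given tile sizes.
    best = None
    for block_m, block_n in itertools.product(candidates, candidates):
        ms = benchmark_ms(lambda: run_attention(block_m=block_m, block_n=block_n))
        if best is None or ms < best[1]:
            best = ((block_m, block_n), ms)
    return best  # ((BLOCK_M, BLOCK_N), avg ms) of the fastest configuration
```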

cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire

pytorch-bot (bot) commented on May 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125286

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 93fb304 with merge base 4d5f807:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Chillee (Collaborator) left a comment:


needs a rebase

A Collaborator left a comment:

Should probably check if it exactly matches
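
If this refers to applying the tuned defaults only when the configuration exactly matches a benchmarked one, a minimal sketch of such a check might look like the following; the table keys, block-size values, and fallback are purely illustrative and not the PR's actual code.

```python
import torch

# Hypothetical lookup of tuned (BLOCK_M, BLOCK_N) defaults keyed by (dtype, head_dim).
# Every entry here is a placeholder for illustration only.
_TUNED_DEFAULTS = {
    (torch.bfloat16, 64): (128, 64),
    (torch.bfloat16, 128): (64, 64),
    (torch.float32, 64): (64, 32),
    (torch.float32, 128): (32, 32),
}

def default_block_sizes(dtype, head_dim, fallback=(64, 32)):
    # Apply tuned sizes only on an exact (dtype, head_dim) match;
    # anything else falls back to a conservative default.
    return _TUNED_DEFAULTS.get((dtype, head_dim), fallback)
```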

@yanboliang added the ciflow/trunk (Trigger trunk jobs on your pull request) label on May 1, 2024
@yanboliang (Contributor, Author) commented:
@pytorchbot merge

@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@yanboliang yanboliang deleted the block-default branch May 1, 2024 16:57
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
Pull Request resolved: pytorch#125286
Approved by: https://github.com/Chillee
