[SDPA] implement ordering #91362
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91362
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 Failure
As of commit 6b22fa1:
FLAKY - The following jobs failed but were likely due to flakiness present on master:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased b281936 to 44f8382
Profiling the composite MHA with head_dim = 128:
Time for flash_attention:
Time for efficient_attention:

In an extreme case with:
Flash time:
Efficient-attention time:
namespace sdp {

constexpr int32_t num_backends = 3;
This seems like a disturbingly generic name even if it's within the sdp namespace.
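For context, the constant is sized to the backend enum it indexes. A minimal sketch of that pairing, assuming an SDPBackend enum with flash, memory-efficient, and math fallback entries (the enumerator names here are an assumption, not copied from the PR):

```cpp
namespace sdp {

// Assumed enumerators; the real header may name or order them differently.
enum class SDPBackend { flash_attention, efficient_attention, math };

// One slot per backend the dispatcher can order.
constexpr int32_t num_backends = 3;

} // namespace sdp
```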
@@ -29,6 +30,46 @@ struct sdp_params {
  bool is_causal;
};

inline std::array<SDPBackend, num_backends> priority_order(sdp_params params) {
Another way to implement this (and I think it's kind of what this is) is to modify use_flash_attention and use_mem_efficient_attention to return an integer, or to create new functions that return integers. These integers would then be the estimated number of operations performed by the respective fused kernel, similar to estimate_matmul_time. You then pick the one that returns the lowest number of operations, and if the number of operations is negative, the kernel doesn't apply.
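A rough sketch of that alternative, with stand-in types and purely illustrative cost formulas; the estimator names, the sdp_params fields, and the cutoffs below are assumptions for the sake of the example, not the PR's actual API:

```cpp
#include <array>
#include <cstdint>
#include <limits>
#include <utility>

// Stand-ins for the types in the PR; the real sdp_params also carries the
// query/key/value tensors, dtype, dropout, etc.
enum class SDPBackend { flash_attention, efficient_attention, math };

struct sdp_params {
  int64_t head_dim;
  bool is_causal;
};

// Hypothetical cost functions: estimated operation count for the fused
// kernel, or a negative value when the kernel does not apply to the inputs.
// The formulas below are placeholders, not measured models.
inline int64_t estimate_flash_attention_ops(const sdp_params& p) {
  if (p.head_dim > 128) {
    return -1;  // pretend the kernel cannot handle this head_dim
  }
  return p.head_dim * 2;  // toy estimate
}

inline int64_t estimate_mem_efficient_ops(const sdp_params& p) {
  return p.head_dim * 3;  // toy estimate
}

// Pick the applicable backend with the lowest estimated cost; fall back to
// the math implementation when no fused kernel applies.
inline SDPBackend pick_cheapest_backend(const sdp_params& p) {
  const std::array<std::pair<SDPBackend, int64_t>, 2> fused = {{
      {SDPBackend::flash_attention, estimate_flash_attention_ops(p)},
      {SDPBackend::efficient_attention, estimate_mem_efficient_ops(p)},
  }};
  SDPBackend best = SDPBackend::math;
  int64_t best_ops = std::numeric_limits<int64_t>::max();
  for (const auto& [backend, ops] : fused) {
    if (ops >= 0 && ops < best_ops) {
      best = backend;
      best_ops = ops;
    }
  }
  return best;
}
```

The appeal of this shape is that applicability and ranking collapse into one signal: a negative estimate means the kernel does not apply, and otherwise the smallest estimate wins.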
Seems fine for now, but I think generalizing this to a cost model would be more future-proof.
@pytorchbot merge -l
Merge started
Your change will be merged once all checks on your PR pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed
Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase by leaving the following comment on this PR: @pytorchbot rebase
Details for Dev Infra team: raised by workflow job.
@pytorchbot rebase
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed
Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase by leaving the following comment on this PR: @pytorchbot rebase
Details for Dev Infra team: raised by workflow job.
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased 44f8382 to 6b22fa1
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed
Reason: 2 additional jobs have failed; the first few of them are: trunk, trunk / linux-focal-rocm5.3-py3.8 / test (default, 2, 2, linux.rocm.gpu).
Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -f "unrelated failure"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary
In some cases, depending on the input, flash attention is not the fastest fused kernel and memory-efficient attention is better. This PR implements a simple heuristic function for deciding the ordering of the kernel functions, based on the xformers dispatch function found here: https://github.com/fairinternal/xformers/blob/15bff4986c3a4376176a4e6fa3dc0f2a120fa0bb/xformers/ops/fmha/dispatch.py#L13
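A minimal sketch of the shape of such a heuristic, with stand-in types; the head_dim cutoff below is only an illustrative placeholder motivated by the profiling comment above, not the exact rule this PR merges:

```cpp
#include <array>
#include <cstdint>

// Stand-ins for the declarations added by this PR.
enum class SDPBackend { flash_attention, efficient_attention, math };
constexpr int32_t num_backends = 3;

struct sdp_params {
  int64_t head_dim;
  bool is_causal;
};

// Return the backends in the order they should be tried; the math fallback
// always comes last. The head_dim >= 128 condition is a placeholder for the
// cases where memory-efficient attention was measured to beat flash attention.
inline std::array<SDPBackend, num_backends> priority_order(sdp_params params) {
  if (params.head_dim >= 128) {
    return {SDPBackend::efficient_attention,
            SDPBackend::flash_attention,
            SDPBackend::math};
  }
  return {SDPBackend::flash_attention,
          SDPBackend::efficient_attention,
          SDPBackend::math};
}
```

The dispatcher can then walk the returned order and run the first backend whose applicability checks (use_flash_attention / use_mem_efficient_attention) accept the inputs.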
cc @ngimel