[WIP] enable cuda graphs support for flash attention with dropout #100196
Conversation
🔗 Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/100196
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures as of commit 5c0999f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
aten/src/ATen/native/transformers/cuda/flash_attn/fmha_fprop_kernel_1xN.h
@@ -710,8 +710,7 @@ _scaled_dot_product_flash_attention_nestedtensor_cuda(
       max_seqlen_batch_kv,
       output_shape) = sdpa_nested_preprocessing(query, key, value);

-  Tensor attention, log_sumexp, debug_attn_mask;
-  int64_t philox_seed{0}, philox_offset{0};
+  Tensor attention, log_sumexp, debug_attn_mask, philox_seed, philox_offset;
   std::tie(attention, log_sumexp, philox_seed, philox_offset, debug_attn_mask) =
nit: another great structured bindings location
Yep, if it doesn't break internal builds, I'll put it in more places.
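The "structured bindings" nit above refers to replacing the declare-then-`std::tie` pattern in the hunk with a C++17 structured binding. A minimal standalone sketch of the two styles (`fake_attention_forward` and the variable names are illustrative stand-ins, not the ATen code):

```cpp
#include <cassert>
#include <tuple>

// Hypothetical stand-in for a kernel entry point returning several results.
std::tuple<int, int, int> fake_attention_forward() {
  return {1, 2, 3};
}

// Pre-C++17 style: declare the variables first, then unpack with std::tie.
int sum_with_tie() {
  int attention, log_sumexp, philox_seed;
  std::tie(attention, log_sumexp, philox_seed) = fake_attention_forward();
  return attention + log_sumexp + philox_seed;
}

// C++17 structured bindings: declaration and unpacking in one line,
// with no need for default-constructible variables.
int sum_with_bindings() {
  auto [attention, log_sumexp, philox_seed] = fake_attention_forward();
  return attention + log_sumexp + philox_seed;
}
```

Structured bindings also avoid the default construction that `std::tie` requires, which is why they pair well with types like `Tensor`.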
@@ -826,7 +825,7 @@ std::tuple<Tensor, Tensor, int64_t, int64_t, Tensor> _flash_attention_forward(
   at::Tensor output = at::empty_like(query);

   Tensor logsumexp, debug_attn_mask;
nit: could be same line
@@ -34,6 +34,8 @@
 #include <c10/cuda/CUDAGuard.h>
 #include <ATen/NativeFunctions.h>
 #include <ATen/cuda/CUDAGraphsUtils.cuh>
+#include <ATen/ops/scalar_tensor.h>
+#include <iostream>
🐛
      "scaled_dot_product_flash_attention does not support dropout with cuda graph capture mode enabled");
  at::Tensor seed_t, offset_t;
  if (is_dropout) {
    // See Note [Acquire lock when using random generators]
    std::lock_guard<std::mutex> lock(gen->mutex_);
    // generator_state = at::Tensor::wrap_tensor_impl(gen->get_state());
nit: this comment can go and was added by me when attempting to do this the first go around
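The "Note [Acquire lock when using random generators]" referenced in the hunk is about holding the generator's mutex while reserving a range of Philox offsets, so concurrent launches never hand out overlapping ranges. A toy standalone illustration of the pattern (this is not the `CUDAGeneratorImpl` API, just the locking discipline):

```cpp
#include <cassert>
#include <mutex>

// Toy generator: each kernel launch reserves `n` Philox offsets.
struct ToyGenerator {
  std::mutex mutex_;
  long offset_ = 0;
  // Returns the current offset and advances it; callers must hold mutex_.
  long reserve(long n) {
    long cur = offset_;
    offset_ += n;
    return cur;
  }
};

long launch(ToyGenerator& gen, long needed) {
  // Acquire the lock for the whole reserve, mirroring the note:
  // without it, two threads could read the same offset before either
  // advances it, producing overlapping (correlated) random streams.
  std::lock_guard<std::mutex> lock(gen.mutex_);
  return gen.reserve(needed);
}
```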
@@ -14030,12 +14030,13 @@
 - func: _scaled_dot_product_attention_math(Tensor query, Tensor key, Tensor value, Tensor? attn_mask=None, float dropout_p=0.0, bool is_causal=False, Tensor? dropout_mask=None, *, float? scale=None) -> (Tensor, Tensor)
   variants: function

-- func: _scaled_dot_product_flash_attention(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor ouput, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, int max_q, int max_k, int philox_seed, int philox_offset, Tensor debug_attn_mask)
+- func: _scaled_dot_product_flash_attention(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor ouput, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, int max_q, int max_k, Tensor philox_seed, Tensor philox_offset, Tensor debug_attn_mask)
Not a comment on the int->Tensor refactor, but in general we could just use generator.get_state, which is a single Tensor holding (seed, offset), instead of having 2 separate tensors.
Right, but we cannot return a Generator object from native functions, so we have to make do with tensors.
I've checked that calls to sdpa are serialized as scaled_dot_product_attention, not the underscore-prefixed APIs that I'm changing, so we can try landing as is; if something breaks, we can go with plan B of leaving the old functions as is and adding new ones.
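The motivation for the `int philox_seed, int philox_offset` -> `Tensor philox_seed, Tensor philox_offset` signature change in the hunk above is CUDA graph capture semantics: a scalar kernel argument is baked into the graph at capture time, while a device-memory (tensor) argument is read freshly on every replay, so the dropout pattern can vary between replays. A rough standalone analogy in plain C++ (no CUDA; the "captured graph" is a recorded callable, and the names are illustrative):

```cpp
#include <functional>

// Analogy for CUDA graph capture: work is recorded once, then replayed.
// A scalar argument (like an int64_t philox_seed) is captured by value and
// frozen into the graph; a pointer argument (like a Tensor's data pointer)
// is dereferenced at replay time, so updates to the backing memory are seen.
std::function<long()> capture_by_value(long seed) {
  return [seed]() { return seed; };   // frozen at capture time
}

std::function<long()> capture_by_pointer(const long* seed) {
  return [seed]() { return *seed; };  // read at replay time
}
```

With seed/offset held in tensors, the RNG state can be advanced between graph replays without re-capturing, which is what dropout under graph capture needs.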
🚀
@pytorchbot merge -f
❌ 🤖 pytorchbot command failed.
@pytorchbot merge -f "test failures unrelated"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "broke no ops build https://hud.pytorch.org/pytorch/pytorch/commit/32615618e439ce84d9365bd0d8892e34fcbe8add https://github.com/pytorch/pytorch/actions/runs/4866578063/jobs/8678258318" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
@ngimel your PR has been successfully reverted.
Revert "[WIP] enable cuda graphs support for flash attention with dropout (#100196)". This reverts commit 3261561. Reverted #100196 on behalf of https://github.com/clee2000 due to broke no ops build https://hud.pytorch.org/pytorch/pytorch/commit/32615618e439ce84d9365bd0d8892e34fcbe8add https://github.com/pytorch/pytorch/actions/runs/4866578063/jobs/8678258318
Force-pushed from 22863d9 to 5c0999f (Compare)
@pytorchbot merge -f "test failures unrelated"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
enable cuda graphs support for flash attention with dropout (pytorch#100196). Fixes pytorch#99905. Pull Request resolved: pytorch#100196. Approved by: https://github.com/drisspg
# Summary

Since the initial upstream of memory-efficient attention from xformers (#86157), significant updates have been made to the kernel, including increased performance, bug fixes, and added functionality. This PR upstreams the latest version of this kernel as of version 0.0.20, commit [6425fd0cacb1a6579aa2f0c4a570b737cb10e9c3](facebookresearch/xformers@6425fd0).

## Future

Although this version of the kernel has support for dropout and arbitrary attention bias, I did not add this support to SDPA yet, and left the guards in sdp_utils. Those will follow in subsequent PRs, in order to reduce the scope creep of these substantial changes and ensure that nothing is broken.

## Specific Changes

### Minor Changes
* The build system work was done in the previous PR, so no changes were needed to CMake 🤞
* Added the new files and re-arranged/created the folder structure
* Updated include paths
* Switched from xformers-specific functions: `XFORMERS_CHECK -> TORCH_CHECK`
* Changes to xformers-specific macros
* Updated `generate_kernels.py` to account for the PyTorch file structure; also added an arg parser so it could be run on a test dir before creating the files in place

### Bigger Changes
* Previous kernel changes removed the chunk optimization: see discussion in #96880
* Increased the number of CUDA kernels, potentially affecting the cuda_lib size
* Preemptively changed the dtypes of seed and offset in order to allow for CUDA graphs (#100196); this is not finished
* Made very BC-breaking changes to the at::_efficient_attention_forward and at::_efficient_attention_backward function signatures
  * I made these changes in part to enable #100196 to land

### Due Diligence Checks
* cuda_lib size:
  * Before: 496 MiB
  * After: 496 MiB
* Performance sweep:
  * Sweeping over 576 configs for forward-only inference, the geomean speedup was 0.98x, with a min speedup of 0.84x and a max of 1.2x
  * For forward+backward on 270 configs (to reduce memory), the geomean speedup was 1.02x, with a min speedup of 1.02x and a max of 1.35x

Pull Request resolved: #100583
Approved by: https://github.com/cpuhrsch
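The `XFORMERS_CHECK -> TORCH_CHECK` swap mentioned above replaces xformers' assertion macro with ATen's throwing check, whose failures surface to Python callers as catchable errors rather than aborts. A simplified standalone stand-in for the pattern (`MY_TORCH_CHECK` and `validate_head_dim` are illustrative, not the real `TORCH_CHECK` from c10):

```cpp
#include <stdexcept>

// Simplified stand-in for a TORCH_CHECK-style macro: throw on failure
// instead of aborting, so callers can catch and report the error.
#define MY_TORCH_CHECK(cond, msg) \
  do {                            \
    if (!(cond)) {                \
      throw std::runtime_error(msg); \
    }                             \
  } while (0)

// Illustrative validation in the style of the kernel's input checks.
bool validate_head_dim(int d) {
  MY_TORCH_CHECK(d % 8 == 0, "head dim must be a multiple of 8");
  return true;
}
```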
Fixes #99905
cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire