[WIP] enable cuda graphs support for flash attention with dropout #100196
Conversation
🔗 Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/100196
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures as of commit 5c0999f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
aten/src/ATen/native/transformers/cuda/flash_attn/fmha_fprop_kernel_1xN.h
@@ -710,8 +710,7 @@ _scaled_dot_product_flash_attention_nestedtensor_cuda(
       max_seqlen_batch_kv,
       output_shape) = sdpa_nested_preprocessing(query, key, value);

-  Tensor attention, log_sumexp, debug_attn_mask;
-  int64_t philox_seed{0}, philox_offset{0};
+  Tensor attention, log_sumexp, debug_attn_mask, philox_seed, philox_offset;
   std::tie(attention, log_sumexp, philox_seed, philox_offset, debug_attn_mask) =
nit: another great structured bindings location
Yep, if it doesn't break internal builds, I'll put it in more places.
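The "structured bindings" nit above refers to replacing the declare-then-`std::tie` pattern in the hunk with a C++17 structured binding. A minimal standalone sketch of the two styles (`fake_attention_forward` and the variable names are illustrative stand-ins, not the ATen code):

```cpp
#include <cassert>
#include <tuple>

// Hypothetical stand-in for a kernel entry point returning several results.
std::tuple<int, int, int> fake_attention_forward() {
  return {1, 2, 3};
}

// Pre-C++17 style: declare the variables first, then unpack with std::tie.
int sum_with_tie() {
  int attention, log_sumexp, philox_seed;
  std::tie(attention, log_sumexp, philox_seed) = fake_attention_forward();
  return attention + log_sumexp + philox_seed;
}

// C++17 structured bindings: declaration and unpacking in one line,
// with no need for default-constructible variables.
int sum_with_bindings() {
  auto [attention, log_sumexp, philox_seed] = fake_attention_forward();
  return attention + log_sumexp + philox_seed;
}
```

Structured bindings also avoid the default construction that `std::tie` requires, which is why they pair well with types like `Tensor`.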
@@ -826,7 +825,7 @@ std::tuple<Tensor, Tensor, int64_t, int64_t, Tensor> _flash_attention_forward(
   at::Tensor output = at::empty_like(query);

   Tensor logsumexp, debug_attn_mask;
nit: could be same line
@@ -34,6 +34,8 @@
 #include <c10/cuda/CUDAGuard.h>
 #include <ATen/NativeFunctions.h>
 #include <ATen/cuda/CUDAGraphsUtils.cuh>
+#include <ATen/ops/scalar_tensor.h>
+#include <iostream>
🐛
      "scaled_dot_product_flash_attention does not support dropout with cuda graph capture mode enabled");
  at::Tensor seed_t, offset_t;
  if (is_dropout) {
    // See Note [Acquire lock when using random generators]
    std::lock_guard<std::mutex> lock(gen->mutex_);
    // generator_state = at::Tensor::wrap_tensor_impl(gen->get_state());
nit: this comment can go and was added by me when attempting to do this the first go around
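The "Note [Acquire lock when using random generators]" referenced in the hunk is about holding the generator's mutex while reserving a range of Philox offsets, so concurrent launches never hand out overlapping ranges. A toy standalone illustration of the pattern (this is not the `CUDAGeneratorImpl` API, just the locking discipline):

```cpp
#include <cassert>
#include <mutex>

// Toy generator: each kernel launch reserves `n` Philox offsets.
struct ToyGenerator {
  std::mutex mutex_;
  long offset_ = 0;
  // Returns the current offset and advances it; callers must hold mutex_.
  long reserve(long n) {
    long cur = offset_;
    offset_ += n;
    return cur;
  }
};

long launch(ToyGenerator& gen, long needed) {
  // Acquire the lock for the whole reserve, mirroring the note:
  // without it, two threads could read the same offset before either
  // advances it, producing overlapping (correlated) random streams.
  std::lock_guard<std::mutex> lock(gen.mutex_);
  return gen.reserve(needed);
}
```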
@@ -14030,12 +14030,13 @@
 - func: _scaled_dot_product_attention_math(Tensor query, Tensor key, Tensor value, Tensor? attn_mask=None, float dropout_p=0.0, bool is_causal=False, Tensor? dropout_mask=None, *, float? scale=None) -> (Tensor, Tensor)
   variants: function

-- func: _scaled_dot_product_flash_attention(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor ouput, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, int max_q, int max_k, int philox_seed, int philox_offset, Tensor debug_attn_mask)
+- func: _scaled_dot_product_flash_attention(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor ouput, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, int max_q, int max_k, Tensor philox_seed, Tensor philox_offset, Tensor debug_attn_mask)
Not a comment on the int->Tensor refactor, but in general we could just use generator.get_state, which is a single Tensor holding (seed, offset), instead of having 2 separate tensors.
Right, but we cannot return a Generator object from native functions, so we have to make do with tensors.
I've checked that calls to sdpa are serialized as scaled_dot_product_attention, not the underscore-prefixed APIs that I'm changing, so we can try landing as is; if something breaks, we can go with plan B of leaving the old functions as is and adding new ones.
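The motivation for the `int philox_seed, int philox_offset` -> `Tensor philox_seed, Tensor philox_offset` signature change in the hunk above is CUDA graph capture semantics: a scalar kernel argument is baked into the graph at capture time, while a device-memory (tensor) argument is read freshly on every replay, so the dropout pattern can vary between replays. A rough standalone analogy in plain C++ (no CUDA; the "captured graph" is a recorded callable, and the names are illustrative):

```cpp
#include <functional>

// Analogy for CUDA graph capture: work is recorded once, then replayed.
// A scalar argument (like an int64_t philox_seed) is captured by value and
// frozen into the graph; a pointer argument (like a Tensor's data pointer)
// is dereferenced at replay time, so updates to the backing memory are seen.
std::function<long()> capture_by_value(long seed) {
  return [seed]() { return seed; };   // frozen at capture time
}

std::function<long()> capture_by_pointer(const long* seed) {
  return [seed]() { return *seed; };  // read at replay time
}
```

With seed/offset held in tensors, the RNG state can be advanced between graph replays without re-capturing, which is what dropout under graph capture needs.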
🚀
@pytorchbot merge -f
❌ 🤖 pytorchbot command failed.
@pytorchbot merge -f "test failures unrelated"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "broke no ops build https://hud.pytorch.org/pytorch/pytorch/commit/32615618e439ce84d9365bd0d8892e34fcbe8add https://github.com/pytorch/pytorch/actions/runs/4866578063/jobs/8678258318" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
@ngimel your PR has been successfully reverted.
Revert "[WIP] enable cuda graphs support for flash attention with dropout (#100196)". This reverts commit 3261561. Reverted #100196 on behalf of https://github.com/clee2000 due to broke no ops build https://hud.pytorch.org/pytorch/pytorch/commit/32615618e439ce84d9365bd0d8892e34fcbe8add https://github.com/pytorch/pytorch/actions/runs/4866578063/jobs/8678258318
Force-pushed from 22863d9 to 5c0999f (Compare)
@pytorchbot merge -f "test failures unrelated"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
enable cuda graphs support for flash attention with dropout (pytorch#100196). Fixes pytorch#99905. Pull Request resolved: pytorch#100196. Approved by: https://github.com/drisspg
# Summary

Since the initial upstream of memory-efficient attention from xformers (#86157), significant updates have been made to the kernel, including increased performance, bug fixes, and added functionality. This PR upstreams the latest version of this kernel as of version 0.0.20, commit [6425fd0cacb1a6579aa2f0c4a570b737cb10e9c3](facebookresearch/xformers@6425fd0).

## Future

Although this version of the kernel has support for dropout and arbitrary attention bias, I did not add this support to SDPA yet, and left the guards in sdp_utils. Those will follow in subsequent PRs, in order to reduce the scope creep of these substantial changes and ensure that nothing is broken.

## Specific Changes

### Minor Changes
* The build system work was done in the previous PR, so no changes were needed to CMake 🤞
* Added the new files and re-arranged/created the folder structure
* Updated include paths
* Switched from xformers-specific functions: `XFORMERS_CHECK -> TORCH_CHECK`
* Changes to xformers-specific macros
* Updated `generate_kernels.py` to account for the PyTorch file structure; also added an arg parser so it could be run on a test dir before creating the files in place

### Bigger Changes
* Previous kernel changes removed the chunk optimization: see discussion in #96880
* Increased the number of CUDA kernels, potentially affecting the cuda_lib size
* Preemptively changed the dtypes of seed and offset in order to allow for CUDA graphs (#100196); this is not finished
* Made very BC-breaking changes to the at::_efficient_attention_forward and at::_efficient_attention_backward function signatures
  * I made these changes in part to enable #100196 to land

### Due Diligence Checks
* cuda_lib size:
  * Before: 496 MiB
  * After: 496 MiB
* Performance sweep:
  * Sweeping over 576 configs for forward-only inference, the geomean speedup was 0.98x, with a min speedup of 0.84x and a max of 1.2x
  * For forward+backward on 270 configs (to reduce memory), the geomean speedup was 1.02x, with a min speedup of 1.02x and a max of 1.35x

Pull Request resolved: #100583
Approved by: https://github.com/cpuhrsch
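The `XFORMERS_CHECK -> TORCH_CHECK` swap mentioned above replaces xformers' assertion macro with ATen's throwing check, whose failures surface to Python callers as catchable errors rather than aborts. A simplified standalone stand-in for the pattern (`MY_TORCH_CHECK` and `validate_head_dim` are illustrative, not the real `TORCH_CHECK` from c10):

```cpp
#include <stdexcept>

// Simplified stand-in for a TORCH_CHECK-style macro: throw on failure
// instead of aborting, so callers can catch and report the error.
#define MY_TORCH_CHECK(cond, msg) \
  do {                            \
    if (!(cond)) {                \
      throw std::runtime_error(msg); \
    }                             \
  } while (0)

// Illustrative validation in the style of the kernel's input checks.
bool validate_head_dim(int d) {
  MY_TORCH_CHECK(d % 8 == 0, "head dim must be a multiple of 8");
  return true;
}
```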
Fixes #99905
cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire