
[WIP] enable cuda graphs support for flash attention with dropout #100196

Closed
wants to merge 12 commits

Conversation

@ngimel ngimel (Collaborator) commented Apr 27, 2023

@pytorch-bot pytorch-bot bot commented Apr 27, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100196

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit 5c0999f:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ngimel ngimel marked this pull request as draft April 27, 2023 19:48
@github-actions

This PR needs a label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@@ -710,8 +710,7 @@ _scaled_dot_product_flash_attention_nestedtensor_cuda(
       max_seqlen_batch_kv,
       output_shape) = sdpa_nested_preprocessing(query, key, value);

-  Tensor attention, log_sumexp, debug_attn_mask;
-  int64_t philox_seed{0}, philox_offset{0};
+  Tensor attention, log_sumexp, debug_attn_mask, philox_seed, philox_offset;
   std::tie(attention, log_sumexp, philox_seed, philox_offset, debug_attn_mask) =
Contributor:

nit: another great structured bindings location

Collaborator Author (@ngimel):

Yep, if it doesn't break internal builds, I'll put it in more places.
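
For reference, a minimal self-contained sketch of the structured-bindings pattern being suggested above. `run_attention` and its element types are hypothetical stand-ins for the call in the hunk, not PyTorch code:

```cpp
#include <string>
#include <tuple>

// Hypothetical stand-in for the flash-attention call in the hunk above.
std::tuple<std::string, std::string, long, long, std::string> run_attention() {
  return {"attention", "log_sumexp", 42, 7, "debug_attn_mask"};
}

int main() {
  // Current pattern: declare the outputs up front, then assign through std::tie.
  std::string attention, log_sumexp, debug_attn_mask;
  long philox_seed = 0, philox_offset = 0;
  std::tie(attention, log_sumexp, philox_seed, philox_offset, debug_attn_mask) =
      run_attention();

  // Structured bindings (C++17): declare and unpack in a single statement.
  auto [attn, lse, seed, offset, debug_mask] = run_attention();
  (void)attn; (void)lse; (void)seed; (void)offset; (void)debug_mask;
}
```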

@@ -826,7 +825,7 @@ std::tuple<Tensor, Tensor, int64_t, int64_t, Tensor> _flash_attention_forward(
at::Tensor output = at::empty_like(query);

Tensor logsumexp, debug_attn_mask;
Contributor:

nit: could be same line

@@ -34,6 +34,8 @@
 #include <c10/cuda/CUDAGuard.h>
 #include <ATen/NativeFunctions.h>
 #include <ATen/cuda/CUDAGraphsUtils.cuh>
+#include <ATen/ops/scalar_tensor.h>
+#include <iostream>
Contributor:

🐛

"scaled_dot_product_flash_attention does not support dropout with cuda graph capture mode enabled");
at::Tensor seed_t, offset_t;
if (is_dropout) {

// See Note [Acquire lock when using random generators]
std::lock_guard<std::mutex> lock(gen->mutex_);
// generator_state = at::Tensor::wrap_tensor_impl(gen -> get_state());
Contributor @drisspg commented Apr 28, 2023:

nit: this comment can go; it was added by me when attempting to do this the first go-around.
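
For context, a hedged sketch of the generator-locking pattern the hunk above follows when dropout is enabled. This is illustrative only, assuming an ATen/CUDA build; the helper name, the increment argument, and the exact include paths are approximations, not the PR's code:

```cpp
#include <cstdint>
#include <mutex>

#include <ATen/Utils.h>                    // at::check_generator
#include <ATen/cuda/CUDAGeneratorImpl.h>   // at::CUDAGeneratorImpl (older trees: <ATen/CUDAGeneratorImpl.h>)
#include <ATen/cuda/CUDAGraphsUtils.cuh>   // at::PhiloxCudaState and graph-capture helpers

// Illustrative helper, not the PR's code: advance the CUDA generator's philox
// counter and hand back a capture-safe RNG state for the dropout kernel.
at::PhiloxCudaState get_dropout_rng_state(const at::Generator& generator,
                                          uint64_t counter_increment) {
  auto* gen = at::check_generator<at::CUDAGeneratorImpl>(generator);
  // See Note [Acquire lock when using random generators]: the philox counter
  // is shared mutable state, so advancing it must happen under the mutex.
  std::lock_guard<std::mutex> lock(gen->mutex_);
  // During CUDA graph capture this returns pointers into the generator's
  // device-side seed/offset rather than baked-in values, which is what allows
  // a captured dropout kernel to be replayed with fresh randomness.
  return gen->philox_cuda_state(counter_increment);
}
```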

@@ -14030,12 +14030,13 @@
 - func: _scaled_dot_product_attention_math(Tensor query, Tensor key, Tensor value, Tensor? attn_mask=None, float dropout_p=0.0, bool is_causal=False, Tensor? dropout_mask=None, *, float? scale=None) -> (Tensor, Tensor)
   variants: function

-- func: _scaled_dot_product_flash_attention(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor ouput, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, int max_q, int max_k, int philox_seed, int philox_offset, Tensor debug_attn_mask)
+- func: _scaled_dot_product_flash_attention(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor ouput, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, int max_q, int max_k, Tensor philox_seed, Tensor philox_offset, Tensor debug_attn_mask)
Contributor:

Not a comment on the int -> Tensor refactor, but in general we could just use generator.get_state, which is a Tensor of (seed, offset), instead of having 2 separate tensors.

Collaborator Author (@ngimel):

Right, but we cannot return a generator object from native functions, so we have to make do with tensors.
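
A rough sketch of what "making do with tensors" can look like, assuming an ATen build; `wrap_philox_state` is a hypothetical helper and not the PR's actual code. Wrapping the seed/offset as scalar Long tensors lines up with the `#include <ATen/ops/scalar_tensor.h>` added earlier in this diff:

```cpp
#include <cstdint>
#include <utility>

#include <ATen/ATen.h>  // at::Tensor, at::scalar_tensor, at::dtype

// Hypothetical helper, not the PR's code: wrap the philox seed/offset as
// scalar Long tensors so they can cross the native-function boundary
// (a Generator itself cannot be returned from a native function).
std::pair<at::Tensor, at::Tensor> wrap_philox_state(uint64_t seed, uint64_t offset) {
  auto opts = at::dtype(at::kLong);
  at::Tensor seed_t   = at::scalar_tensor(static_cast<int64_t>(seed), opts);
  at::Tensor offset_t = at::scalar_tensor(static_cast<int64_t>(offset), opts);
  return {seed_t, offset_t};
}
```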

@ngimel ngimel (Collaborator Author) left a comment:

I've checked that calls to sdpa are serialized as scaled_dot_product_attention, not the underscore-prefixed APIs I'm changing, so we can try landing this as is. If something breaks, we can go with plan B: leave the old functions as they are and add new ones.

@drisspg drisspg (Contributor) left a comment:

🚀

@ngimel ngimel (Collaborator Author) commented May 2, 2023

@pytorchbot merge -f

@pytorch-bot pytorch-bot bot commented May 2, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot merge: error: argument -f/--force: expected one argument

usage: @pytorchbot merge [-f MESSAGE | -ic] [-r [{viable/strict,main}]]

Try @pytorchbot --help for more info.

@ngimel ngimel (Collaborator Author) commented May 2, 2023

@pytorchbot merge -f "test failures unrelated"

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@clee2000 clee2000 (Contributor) commented May 3, 2023

@pytorchbot revert -m "broke no ops build https://hud.pytorch.org/pytorch/pytorch/commit/32615618e439ce84d9365bd0d8892e34fcbe8add https://github.com/pytorch/pytorch/actions/runs/4866578063/jobs/8678258318" -c nosignal

@pytorchmergebot (Collaborator):

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator):

@ngimel your PR has been successfully reverted.

@ngimel ngimel marked this pull request as ready for review May 5, 2023 22:40
@ngimel ngimel requested a review from mruberry as a code owner May 5, 2023 22:40
@ngimel ngimel added the ciflow/trunk label (Trigger trunk jobs on your pull request) May 7, 2023
@ngimel ngimel (Collaborator Author) commented May 8, 2023

@pytorchbot merge -f "test failures unrelated"

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

kiersten-stokes pushed a commit to kiersten-stokes/pytorch that referenced this pull request May 8, 2023
pytorchmergebot pushed a commit that referenced this pull request May 18, 2023
# Summary
Since the initial upstream of memory-efficient attention from xformers (#86157), significant updates have been made to the kernel, including increased performance, bug fixes, and added functionality. This PR upstreams the latest version of this kernel as of xformers version 0.0.20, commit [6425fd0cacb1a6579aa2f0c4a570b737cb10e9c3](facebookresearch/xformers@6425fd0).

## Future
Although this version of the kernel has support for dropout and arbitrary attention bias, I did not add that support to SDPA yet and left the guards in sdp_utils. Those will come in follow-up PRs, in order to reduce the scope creep of these substantial changes and ensure that nothing is broken.

## Specific Changes
### Minor Changes
* The build system work was done in the previous PR, so no changes were needed to CMake 🤞
* Adding the new files and re-arranging/creating the folder structure
* Updating include paths
* Switching from xformers-specific functions: `XFORMERS_CHECK -> TORCH_CHECK`
* Changes to xformers-specific macros
* Updated `generate_kernels.py` to account for the PyTorch file structure, and added an argparse interface so it can be run against a test dir before creating the files in place.

### Bigger Changes
* Previous kernel changes: "Removed the chunk optimization"; see the discussion here: #96880
* Increased the number of CUDA kernels, potentially affecting the cuda_lib size.
* Preemptively changed the dtypes of seed and offset to allow for CUDA graphs (#100196); this is not finished.
* Made very BC-breaking changes to the at::_efficient_attention_forward and at::_efficient_attention_backward function signatures.
    * I made these changes in part so that this PR could land: #100196

### Due Diligence Checks:
* CUDA_lib size:
    * Before: 496 MiB
    * After: 496 MiB
* Performance Sweep:
    * I swept over 576 configs for forward-only inference; the geomean speedup was 0.98x, with a min speedup of 0.84x and a max speedup of 1.2x.
    * For forward+backward, running on 270 configs (to reduce memory), the geomean speedup was 1.02x, with a min speedup of 1.02x and a max speedup of 1.35x.

Pull Request resolved: #100583
Approved by: https://github.com/cpuhrsch
jcaip pushed a commit that referenced this pull request May 23, 2023

Successfully merging this pull request may close these issues.

Dropout support for SDPA with torch.compile
5 participants