[WIP] [ATen] Add native_multi_head_self_attention CPU + GPU implementation #70649
Conversation
CI Flow Status
⚛️ CI Flow Ruleset - Version:
You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun ("ciflow/default" will always be added automatically)
@pytorchbot ciflow rerun
# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow
For more information, please take a look at the CI Flow Wiki.
💊 CI failures summary and remediations
As of commit 45ed5f2 (more details on the Dr. CI page):
💚 💚 Looks good so far! There are no failures yet. 💚 💚
This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.
This pull request was exported from Phabricator. Differential Revision: D31829981
Hi @zrphercule. It looks like this PR adds a new public torch API. It should either be documented as a public API or made private. Can you do one of those and cherry-pick the change into the 1.11 release branch?
Hi @bdhirsh, I think it is better to make this API private for now, since we are still working on some follow-up diffs for it. I will make a PR today, thanks!
Summary:
As described in https://fb.quip.com/oxpiA1uDBjgP
This implements the first parts of the RFC and is a rough draft showing the approach. The idea is that, for the first cut, we maintain very close (in this diff, I believe identical) numerical equivalence to the existing nn.MHA implementation, which is what this diff attempts to do. Once we have a working and adopted native self-attention implementation, subsequent diffs can explore alternative implementations.
The current implementation is similar to existing dedicated implementations such as LightSeq/FasterTransformer/DeepSpeed, and for MHA it is between 1.2x and 2x faster than the existing implementation on both CPUs and GPUs, depending on the setting. It makes some approximations/restrictions (e.g., it doesn't handle masking in the masked softmax), but these shouldn't materially impact performance.
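For illustration, here is a minimal pure-PyTorch sketch of the restricted case this targets (batch-first self-attention with a packed QKV projection and no masking), checked for numerical equivalence against nn.MultiheadAttention. This is only a reference for the computation a fused kernel would perform, not the native kernel itself; the shapes and tolerances below are arbitrary placeholders.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, S, E, H = 2, 8, 16, 4          # batch, sequence length, embed dim, num heads
x = torch.randn(B, S, E)

mha = torch.nn.MultiheadAttention(E, H, batch_first=True).eval()

with torch.no_grad():
    # Packed QKV projection, reusing nn.MHA's in_proj_weight / in_proj_bias.
    qkv = F.linear(x, mha.in_proj_weight, mha.in_proj_bias)      # (B, S, 3E)
    q, k, v = qkv.chunk(3, dim=-1)

    # Split heads: (B, S, E) -> (B, H, S, E // H).
    def split_heads(t):
        return t.view(B, S, H, E // H).transpose(1, 2)
    q, k, v = (split_heads(t) for t in (q, k, v))

    # Scaled dot-product attention per head (no mask in this restricted case).
    attn = torch.softmax(q @ k.transpose(-2, -1) / (E // H) ** 0.5, dim=-1)
    ctx = (attn @ v).transpose(1, 2).reshape(B, S, E)

    # Output projection.
    ref = F.linear(ctx, mha.out_proj.weight, mha.out_proj.bias)

    out, _ = mha(x, x, x, need_weights=False)

torch.testing.assert_close(out, ref, rtol=1e-5, atol=1e-5)
```

The native op described here would be expected to produce the same result as `ref`/`out` above, but as a single fused kernel on CPU or GPU.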
This does the first few items:
* Add native_multi_head_attention(...) and native_multi_head_attention_backward(..) to native_functions.yaml
* Implement native_multi_head_attention(..) on GPU, extracting bits and pieces out of LS/DS/FT as appropriate
* Implement native_multi_head_attention(..) on CPU
The backward implementation is still WIP, but the idea would be to:
* Hook these up in derivatives.yaml
* Implement native_multi_head_attention_backward(..) on GPU, extracting bits and pieces out of LS/DS (not FT, since it's inference-only)
* Implement native_multi_head_attention_backward(..) on CPU
* In torch.nn.functional.multi_head_attention_forward (pytorch/torch/nn/functional.py, line 4953 at commit 23321ba; see https://github.com/pytorch/pytorch/blob/23321ba7a3b634ee734455aab4a984689802cad0/torch/nn/functional.py#L4953), add conditionals to check whether we are being called in a BERT/ViT-style encoder fashion, and invoke the native function directly. A sketch of this dispatch is shown below.
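As a rough illustration of that last item, the dispatch could look something like the sketch below. The fast-path conditions and the `fused_self_attention` callable are assumptions for illustration; the actual native op name, signature, and gating logic are not specified in this PR.

```python
from typing import Callable, Optional

import torch
import torch.nn.functional as F


def mha_forward_with_fast_path(
    query: torch.Tensor,            # (L, N, E), as F.multi_head_attention_forward expects
    key: torch.Tensor,
    value: torch.Tensor,
    *,
    num_heads: int,
    in_proj_weight: torch.Tensor,
    in_proj_bias: torch.Tensor,
    out_proj_weight: torch.Tensor,
    out_proj_bias: torch.Tensor,
    attn_mask: Optional[torch.Tensor] = None,
    key_padding_mask: Optional[torch.Tensor] = None,
    need_weights: bool = True,
    fused_self_attention: Optional[Callable] = None,  # hypothetical stand-in for the native op
):
    embed_dim = query.shape[-1]

    # BERT/ViT-style encoder self-attention: q, k, v are the same tensor,
    # there is no masking, attention weights are not needed, and (since the
    # backward is still WIP) we are not recording gradients.
    use_fast_path = (
        fused_self_attention is not None
        and query is key and key is value
        and attn_mask is None and key_padding_mask is None
        and not need_weights
        and not torch.is_grad_enabled()
    )
    if use_fast_path:
        out = fused_self_attention(
            query, in_proj_weight, in_proj_bias,
            out_proj_weight, out_proj_bias, num_heads,
        )
        return out, None

    # Otherwise, fall back to the existing generic implementation.
    return F.multi_head_attention_forward(
        query, key, value, embed_dim, num_heads,
        in_proj_weight, in_proj_bias,
        None, None, False, 0.0,              # bias_k, bias_v, add_zero_attn, dropout_p
        out_proj_weight, out_proj_bias,
        key_padding_mask=key_padding_mask,
        need_weights=need_weights,
        attn_mask=attn_mask,
    )
```

With `fused_self_attention=None` this degenerates to the existing behavior, which would keep the change safe to land before the native kernels are wired up everywhere.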
Test Plan: TODO
Differential Revision: D31829981
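Since the test plan is still TODO, one possible starting point is a small timing harness like the sketch below, which measures the existing nn.MHA path on CPU (and GPU, if available); a candidate fused implementation could be timed the same way by passing a different callable to `time_mha`. The shapes are arbitrary placeholders, not benchmark settings from this PR.

```python
import torch
from torch.utils import benchmark


def time_mha(fn, *args, label, device):
    # Time a callable with torch.utils.benchmark for stable, CUDA-aware measurements.
    timer = benchmark.Timer(
        stmt="fn(*args)",
        globals={"fn": fn, "args": args},
        label=label,
        sub_label=device,
    )
    return timer.blocked_autorange(min_run_time=1.0)


def run(device="cpu", B=64, S=128, E=768, H=12):
    x = torch.randn(B, S, E, device=device)
    mha = torch.nn.MultiheadAttention(E, H, batch_first=True).to(device).eval()
    with torch.no_grad():
        print(time_mha(lambda q: mha(q, q, q, need_weights=False),
                       x, label="encoder self-attention", device=device))


run("cpu")
if torch.cuda.is_available():
    run("cuda")
```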