Support src_mask and src_key_padding_mask for Better Transformer #88488
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88488
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 4366984: This comment was automatically generated by Dr. CI and updates every 15 minutes.
@sgrigory has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
3 similar comments
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 additional jobs have failed, first few of them are: trunk, trunk / cuda11.6-py3.10-gcc7-sm86 / test (default, 2, 4, linux.g5.4xlarge.nvidia.gpu). Details for Dev Infra team: raised by workflow job.
@malfet The merge above failed, but the errors seem to be unrelated to the PR:
Could you have a look and say whether those are indeed known infra failures?
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here.
Successfully rebased f327c24 to 4366984.
@sgrigory has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Support src_mask and src_key_padding_mask for Better Transformer (pytorch#88488)
Pull Request resolved: pytorch#88488
Approved by: https://github.com/mikekgfb
Fixes T135842750 (follow-up for #87377)

## Description

At present, having both `src_key_padding_mask` and `src_mask` at the same time is not supported on the fastpath in Transformer and Multi-Head Attention.

This PR enables using both masks on the fastpath on CPU and GPU: if both masks are passed, we merge them into a 4D mask in Python (see the sketch after the list below) and change the mask type to 2 before passing it downstream. Downstream processing in native code is not changed, as it already supports a 4D mask. Indeed, it is done depending on the device:
- on CUDA, by `SoftMax.cu::masked_softmax_cuda`. When the mask type is 2, it calls either `dispatch_softmax_forward` -> `softmax_warp_forward` or `at::softmax` (depending on the input size). In both cases a 4D mask is supported.
- on CPU, by `SoftMax.cpp::masked_softmax_cpp`. It calls `hosted_softmax`, which supports a 4D mask.
## Tests

- Extended `test_mask_check_fastpath` to check that the fast path is indeed taken in Transformer when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path_mock` to check that the fast path is taken in MHA when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path` to check that the fast and slow paths give the same result when two masks are passed in MHA (a sketch of such a check follows this list)
- `test_masked_softmax_mask_types` now covers mask type 2
- `test_transformerencoderlayer_fast_path` (CPU smoke test) is expanded to the case where both masks are provided simultaneously
- `test_masked_softmax_devices_parity` checks that mask type 2 is accepted by the CPU and CUDA paths
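As a rough illustration of the fast/slow parity check mentioned above, the snippet below compares `nn.MultiheadAttention` outputs with both masks in training mode (regular path) and in eval mode with grad disabled (where the fastpath can be taken). It is a sketch, not the actual test code: the exact fastpath gating conditions and the tolerance used here are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_heads, N, L = 16, 4, 2, 5
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
x = torch.randn(N, L, embed_dim)

# Causal attention mask plus padding on the last position of the first sequence
attn_mask = torch.triu(torch.ones(L, L), diagonal=1).bool()
key_padding_mask = torch.zeros(N, L, dtype=torch.bool)
key_padding_mask[0, -1] = True

with torch.no_grad():
    # Reference (slow) path: training mode keeps the module off the fastpath
    mha.train()
    ref, _ = mha(x, x, x, attn_mask=attn_mask,
                 key_padding_mask=key_padding_mask, need_weights=False)

    # Fast path candidate: eval mode with grad disabled (assumed fastpath conditions)
    mha.eval()
    fast, _ = mha(x, x, x, attn_mask=attn_mask,
                  key_padding_mask=key_padding_mask, need_weights=False)

print(torch.allclose(ref, fast, atol=1e-5))  # expected: True
```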