Enable src_mask in fast path of TransformerEncoderLayer #87377
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87377
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures, 2 Pending as of commit 5c70c50.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
/easycla
@pytorchbot rebase
You don't have permissions to rebase this PR, only people with write permissions may rebase PRs.
@sgrigory has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
-        elif src.is_nested and src_key_padding_mask is not None:
-            why_not_sparsity_fast_path = "src_key_padding_mask is not supported with NestedTensor input for fastpath"
+        elif src.is_nested and (src_key_padding_mask is not None or src_mask is not None):
+            why_not_sparsity_fast_path = "src_key_padding_mask and src_mask are not supported with NestedTensor input"
Is the message clear? I.e., that either one of the masks causes this to fail. I am hesitant to put too much time into error-message engineering, but helping users get to the fast path is important.
I think it can be made clearer, e.g. "supplying both src_key_padding_mask and src_mask is not supported with NestedTensor input" or "with NestedTensor input, only one of src_key_padding_mask and src_mask can be provided". Since in the next PR I'll be working on making 4D inputs possible from the Python level and modifying this code, I can adjust this message there as well.
@pytorchbot merge
Merge failed. Reason: Approval needed from one of the following (Rule 'superuser'). Details for Dev Infra team: raised by workflow job.
Symbolic review to address this comment: #87377 (comment)
But ideally we should not require this. Michael has already reviewed, and that should be all that is needed.
A semicolon after a closing curly bracket is only needed after class definitions (or if one is assigning an anonymous lambda to a variable).
@sgrigory has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed from ebd3a24 to 12979e8
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
Successfully rebased
Force-pushed from 12979e8 to 5c70c50
@sgrigory has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Enable src_mask in fast path of TransformerEncoderLayer (#87377)

## Issues

Fixes pytorch#81129 (comment)

## Description

Passing a 2D attention mask `src_mask` into the fast path of `TransformerEncoderLayer` on CPU was causing an error, so it was disabled in pytorch#81277. This PR rolls that change back, enabling `src_mask` on the fast path:

- Either an attention mask `src_mask` of shape `(L, L)` or a padding mask `src_key_padding_mask` of shape `(B, L)` is now allowed on the CPU fast path. If softmax is applied along the last dimension (as in multi-head attention), these masks are processed without expanding them to 4D: when iterating through the input, `Softmax.cpp::host_softmax` converts the index to match the mask dimensions, depending on the mask type.
- If softmax is applied along a dimension other than the last, `Softmax.cpp::masked_softmax_cpu` expands the masks to 4D, converting them to `mask_type=2`. Theoretically one could also add special optimized cases for `dim=0, 1, 2` and process them without mask expansion, but I don't know how often that is used.

## Tests

- `test_transformerencoderlayer_fast_path` is extended to cover both the attention mask and the padding mask
- `test_masked_softmax_mask_types_0_1` is added to ensure that results from CPU softmax with attention and padding masks match the explicit slow calculation
- `test_masked_softmax_devices_parity` is added to ensure that results from masked softmax on CPU and CUDA match

## Note

I had to replace `float` with `torch.get_default_dtype()` in a couple of tests for the following reason:

- `test_nn.py` [sets the default type to `torch.double`](https://github.com/pytorch/pytorch/blob/master/test/test_nn.py#L24-L26)
- If I execute `test_nn.py` and `test_transformers.py` in one `pytest` run, this default still holds for the transformer tests
- Some tests in `test_transformers.py` that previously followed the slow path now take the fast path, and the hard-coded `float` started clashing with the default `double`

Let me know if there is a better way around it - or maybe I'm not supposed to run tests with `pytest` like this.

Pull Request resolved: pytorch#87377
Approved by: https://github.com/mikekgfb, https://github.com/weiwangmeta, https://github.com/malfet
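For context, here is a minimal sketch (not taken from this PR's tests) of calling `TransformerEncoderLayer` with either mask in inference mode; the model size, mask shapes, and values below are illustrative assumptions:

```python
import torch
import torch.nn as nn

B, L, E, H = 2, 5, 8, 2  # batch size, sequence length, embedding dim, heads (illustrative)

layer = nn.TransformerEncoderLayer(d_model=E, nhead=H, batch_first=True)
layer.eval()  # the fast path is only considered in inference mode

src = torch.randn(B, L, E)

# Attention mask of shape (L, L): True marks query/key pairs that must not attend
src_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)

# Padding mask of shape (B, L): True marks padded positions
src_key_padding_mask = torch.zeros(B, L, dtype=torch.bool)
src_key_padding_mask[1, -2:] = True

with torch.inference_mode():
    out_with_attn_mask = layer(src, src_mask=src_mask)
    out_with_pad_mask = layer(src, src_key_padding_mask=src_key_padding_mask)
```

Whether the fast path is actually taken for a given call still depends on the other eligibility checks (e.g. no autograd, supported dtypes), not only on which mask is supplied.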
Fixes T135842750 (follow-up for #87377)

## Description

At present, having both `src_key_padding_mask` and `src_mask` at the same time is not supported on the fast path in Transformer and Multi-Head Attention. This PR enables using both masks on the fast path on CPU and GPU: if both masks are passed, we merge them into a 4D mask in Python and change the mask type to 2 before passing it downstream. Downstream processing in native code is not changed, as it already supports a 4D mask. It is handled depending on the device:

- on CUDA, by `SoftMax.cu::masked_softmax_cuda`. When the mask type is 2, it calls either `dispatch_softmax_forward` -> `softmax_warp_forward` or `at::softmax` (depending on the input size). In both cases a 4D mask is supported.
- on CPU, by `SoftMax.cpp::masked_softmax_cpp`. It calls `hosted_softmax`, which supports a 4D mask.

## Tests

- Extended `test_mask_check_fastpath` to check that the fast path is indeed taken in Transformer when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path_mock` to check that the fast path is taken in MHA when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path` to check that the fast and slow paths give the same result when two masks are passed in MHA
- `test_masked_softmax_mask_types` now covers mask type 2
- `test_transformerencoderlayer_fast_path` (CPU smoke test) is expanded to the case where both masks are provided simultaneously
- `test_masked_softmax_devices_parity` checks that mask type 2 is accepted by the CPU and CUDA paths

Pull Request resolved: #88488
Approved by: https://github.com/mikekgfb
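As a rough sketch of the merging idea described above (not the PR's actual implementation; all names and sizes here are illustrative), two boolean masks can be broadcast and combined into a single 4D mask of shape `(B, num_heads, L, L)`:

```python
import torch

B, H, L = 2, 4, 5  # batch size, number of heads, sequence length (illustrative)

attn_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)  # (L, L)
key_padding_mask = torch.zeros(B, L, dtype=torch.bool)                  # (B, L)
key_padding_mask[1, -2:] = True

# Broadcast both masks to (B, H, L, L): a key position is masked out if either
# the attention mask or the padding mask marks it as masked.
merged_mask = attn_mask.view(1, 1, L, L) | key_padding_mask.view(B, 1, 1, L)
merged_mask = merged_mask.expand(B, H, L, L)  # 4D mask, i.e. "mask type 2"
```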