
Enable src_mask in fast path of TransformerEncoderLayer #87377

Closed

Conversation


@sgrigory sgrigory commented Oct 20, 2022

Issues

Fixes #81129 (comment)

Description

Passing a 2D attention mask src_mask into the fast path of TransformerEncoderLayer on CPU caused an error, so it was disabled in #81277. This PR rolls back that restriction and enables src_mask on the fast path (a usage sketch follows the list below):

  • Either an attention mask src_mask of shape (L, L) or a padding mask src_key_padding_mask of shape (B, L) is now allowed on the CPU fast path. If softmax is applied along the last dimension (as in multi-head attention), these masks are processed without expanding them to 4D. Instead, when iterating through the input, Softmax.cpp::host_softmax converts the index to match the mask dimensions, depending on the mask type.
  • If softmax is applied along a dimension other than the last, Softmax.cpp::masked_softmax_cpu expands the masks to 4D, converting them to mask_type=2. In theory one could also add special optimized cases for dim=0, 1, 2 and process them without mask expansion, but I don't know how often that is used.
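
As a quick illustration of the behavior this enables, here is a minimal sketch (not taken from the PR or its tests; the shapes, hyperparameters, and all-False boolean masks are illustrative assumptions):

```python
import torch
import torch.nn as nn

B, L, E = 2, 5, 8                                   # batch, sequence length, embedding dim
layer = nn.TransformerEncoderLayer(d_model=E, nhead=2, batch_first=True)
layer.eval()                                        # the fast path is only considered in inference mode

src = torch.rand(B, L, E)
src_mask = torch.zeros(L, L, dtype=torch.bool)              # (L, L) attention mask, True = masked out
src_key_padding_mask = torch.zeros(B, L, dtype=torch.bool)  # (B, L) padding mask, True = padded position

with torch.no_grad():
    # After this PR, either mask alone is accepted on the CPU fast path.
    # Whether the fast path is actually taken also depends on further conditions
    # (batch_first layout, supported dtypes, no autograd, etc.).
    out_attn = layer(src, src_mask=src_mask)
    out_pad = layer(src, src_key_padding_mask=src_key_padding_mask)
```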

Tests:

  • test_transformerencoderlayer_fast_path is extended to cover both attention mask and padding mask
  • test_masked_softmax_mask_types_0_1 is added to ensure that results from CPU softmax with attention and padding masks match the explicit slow calculation (a sketch of this reference computation follows the list)
  • test_masked_softmax_devices_parity is added to ensure results from masked softmax on CPU and CUDA match
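
For reference, this is the kind of "explicit slow calculation" meant above, written as a minimal sketch with public ops only (the convention True = masked out and the shapes are my assumptions; the real tests exercise the internal masked-softmax kernel):

```python
import torch

def slow_masked_softmax(scores, mask):
    # Mask out positions by filling them with -inf before a regular softmax
    # along the last dimension.
    return torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

B, H, L = 2, 4, 6
scores = torch.randn(B, H, L, L)

attn_mask = torch.rand(L, L) > 0.5       # mask_type=0: (L, L) attention mask
pad_mask = torch.rand(B, L) > 0.5        # mask_type=1: (B, L) padding mask

# Broadcast each 2D mask against the 4D score tensor before masking.
# (A row that ends up fully masked would produce NaNs in this naive version.)
out_attn = slow_masked_softmax(scores, attn_mask.view(1, 1, L, L))
out_pad = slow_masked_softmax(scores, pad_mask.view(B, 1, 1, L))
```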

Note

I had to replace float with torch.get_default_dtype() in a couple of tests for the following reason:

  • test_nn.py sets the default dtype to torch.double
  • If I execute test_nn.py and test_transformers.py in one pytest run, this default still holds for the transformer tests
  • Some tests in test_transformers.py that previously took the slow path now take the fast path, and the hard-coded float started clashing with the default double (see the sketch below)
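
A minimal illustration of the clash and of the fix (not the actual test code):

```python
import torch

torch.set_default_dtype(torch.double)   # what test_nn.py effectively does for the whole pytest session

x_hardcoded = torch.randn(2, 3, dtype=torch.float)               # stays float32 and can clash downstream
x_default = torch.randn(2, 3, dtype=torch.get_default_dtype())   # follows the session default

print(x_hardcoded.dtype, x_default.dtype)                        # torch.float32 torch.float64
```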

Let me know if there is a better way around this, or maybe I'm not supposed to run tests with pytest like this.


pytorch-bot bot commented Oct 20, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87377

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures, 2 Pending

As of commit 5c70c50:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


linux-foundation-easycla bot commented Oct 20, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@pytorch-bot pytorch-bot bot added the release notes: nn release notes category label Oct 20, 2022
@mikekgfb (Contributor)

@drisspg

@sgrigory (Contributor Author)

/easycla

@sgrigory (Contributor Author)

@pytorchbot rebase


pytorch-bot bot commented Oct 24, 2022

You don't have permissions to rebase this PR, only people with write permissions may rebase PRs.

@sgrigory sgrigory marked this pull request as ready for review October 24, 2022 11:57
@sgrigory sgrigory changed the title [WIP] Enable src_mask in fast path of TransformerEncoderLayer Enable src_mask in fast path of TransformerEncoderLayer Oct 24, 2022
@facebook-github-bot (Contributor)

@sgrigory has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@sgrigory sgrigory marked this pull request as draft October 24, 2022 19:58
@facebook-github-bot (Contributor)

@sgrigory has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@sgrigory sgrigory marked this pull request as ready for review October 25, 2022 12:48
@facebook-github-bot (Contributor)

@sgrigory has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

elif src.is_nested and src_key_padding_mask is not None:
why_not_sparsity_fast_path = "src_key_padding_mask is not supported with NestedTensor input for fastpath"
elif src.is_nested and (src_key_padding_mask is not None or src_mask is not None):
why_not_sparsity_fast_path = "src_key_padding_mask and src_mask are not supported with NestedTensor input"
Contributor

Is the message clear? i.e., that either one causes this to fail. I am hesitant to put too much time into error message engineering, but helping users get to the fastpath is important.

Contributor Author

I think it can be made clearer, e.g. "supplying both src_key_padding_mask and src_mask is not supported with NestedTensor input" or "with NestedTensor input, only one of src_key_padding_mask and src_mask can be provided". Since in the next PR I'll be working on making 4D inputs possible from the Python level and modifying this code, I can adjust this message as well.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 26, 2022
@mikekgfb (Contributor)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge failed

Reason: Approval needed from one of the following (Rule 'superuser'):
EscapeZero, weiwangmeta, suphoff, ymao1993, xcheng16, ...

Details for Dev Infra team (raised by workflow job)

@weiwangmeta (Contributor) left a comment

Symbolic review to address this comment: #87377 (comment).
But ideally we should not require this: Michael has already reviewed, and that should be all that is needed.

@malfet (Contributor) left a comment

A semicolon after a closing curly bracket is only needed after class definitions (or if one is assigning an anonymous lambda to a variable).

Review comments on aten/src/ATen/native/SoftMax.cpp (outdated, resolved)
@facebook-github-bot (Contributor)

@sgrigory has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@sgrigory sgrigory force-pushed the enable-src-mask-bettertransformer branch from ebd3a24 to 12979e8 Compare October 28, 2022 12:29
@sgrigory (Contributor Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased enable-src-mask-bettertransformer onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout enable-src-mask-bettertransformer && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the enable-src-mask-bettertransformer branch from 12979e8 to 5c70c50 Compare October 28, 2022 13:18
@facebook-github-bot (Contributor)

@sgrigory has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@facebook-github-bot (Contributor)

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)


@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Nov 5, 2022

Pull Request resolved: pytorch#87377
Approved by: https://github.com/mikekgfb, https://github.com/weiwangmeta, https://github.com/malfet
pytorchmergebot pushed a commit that referenced this pull request Nov 10, 2022

Fixes T135842750 (follow-up for #87377)

## Description

At present, having both `src_key_padding_mask` and `src_mask` at the same time is not supported on the fastpath in Transformer and Multi-Head Attention.

This PR enables using both masks on the fastpath on CPU and GPU: if both masks are passed, we merge them into a 4D mask in Python and change the mask type to 2 before passing it downstream (a sketch of the merging follows the list below).

Downstream processing in native code is not changed, as it already supports 4D mask. Indeed, it is done depending on the device:
- on CUDA, by `SoftMax.cu::masked_softmax_cuda`. When the mask type is 2, it calls either `dispatch_softmax_forward` -> `softmax_warp_forward` or `at::softmax` (depending on the input size). In both cases a 4D mask is supported.
- on CPU, by `SoftMax.cpp::masked_softmax_cpu`. It calls `host_softmax`, which supports a 4D mask.
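
A minimal sketch of the merging idea, assuming boolean masks where True means a masked-out position (the actual PR code may differ, e.g. in handling additive float masks or broadcasting details):

```python
import torch

B, num_heads, L = 2, 4, 6
attn_mask = torch.rand(L, L) > 0.5             # (L, L) attention mask
key_padding_mask = torch.rand(B, L) > 0.5      # (B, L) key padding mask

# Broadcast both masks to 4D and combine; a position is masked if either mask masks it.
merged = attn_mask.view(1, 1, L, L) | key_padding_mask.view(B, 1, 1, L)   # (B, 1, L, L)
merged_4d = merged.expand(B, num_heads, L, L)                             # handed downstream as mask_type=2
```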

## Tests
- Extended `test_mask_check_fastpath` to check that fast path is indeed taken in Transformer when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path_mock` to check that fast path is taken in MHA when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path` to check that fast and slow paths give the same result when two masks are passed in MHA
- `test_masked_softmax_mask_types` now covers mask type 2
- `test_transformerencoderlayer_fast_path` (CPU smoke test) is expanded to the case of both masks provided simultaneously
- `test_masked_softmax_devices_parity` checks that mask type 2 is accepted by CPU and CUDA paths

Pull Request resolved: #88488
Approved by: https://github.com/mikekgfb
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Labels
ciflow/trunk (Trigger trunk jobs on your pull request), Merged, release notes: nn (release notes category)
Development

Successfully merging this pull request may close these issues.

Transformer and CPU path with src_mask raises error with torch 1.12
6 participants