
Conversation

sgrigory
Contributor

@sgrigory sgrigory commented Nov 4, 2022

Fixes T135842750 (follow-up for #87377)

Description

At present, having both `src_key_padding_mask` and `src_mask` at the same time is not supported on the fastpath in Transformer and Multi-Head Attention.

This PR enables using both masks on the fastpath on CPU and GPU: if both masks are passed, we merge them into a 4D mask in Python and change mask type to 2 before passing downstream.
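For illustration, here is a minimal sketch of that merging step for self-attention, assuming boolean masks. The function `merge_masks` and its exact shapes are made up for this example and are not the PR's literal code:

```python
import torch

def merge_masks(src_mask, src_key_padding_mask, num_heads):
    # src_mask: (L, L) bool, True = attention from query i to key j is blocked
    # src_key_padding_mask: (N, L) bool, True = key position is padding
    N, L = src_key_padding_mask.shape
    # Broadcast the attention mask over batch and heads: (L, L) -> (N, H, L, L)
    attn_4d = src_mask.view(1, 1, L, L).expand(N, num_heads, L, L)
    # Broadcast the padding mask over heads and queries: (N, L) -> (N, H, L, L)
    pad_4d = src_key_padding_mask.view(N, 1, 1, L).expand(N, num_heads, L, L)
    # A position is masked out if either mask blocks it; the combined 4D
    # tensor is what gets passed downstream with mask type 2
    return attn_4d.logical_or(pad_4d)

# Toy example: batch 2, 4 heads, sequence length 3
src_mask = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)  # causal
src_key_padding_mask = torch.tensor([[False, False, True],
                                     [False, True, True]])
print(merge_masks(src_mask, src_key_padding_mask, num_heads=4).shape)
# torch.Size([2, 4, 3, 3])
```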

Downstream processing in native code is unchanged, since it already supports a 4D mask. The processing is device-dependent (a sketch of the equivalent computation follows this list):

  • on CUDA, by `SoftMax.cu::masked_softmax_cuda`. When the mask type is 2, it calls either `dispatch_softmax_forward` -> `softmax_warp_forward` or `at::softmax` (depending on the input size); both support a 4D mask.
  • on CPU, by `SoftMax.cpp::masked_softmax_cpu`, which calls `host_softmax`; it also supports a 4D mask.
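For reference, a minimal sketch of what a masked softmax computes given a 4D boolean mask — equivalent math only, not the kernel code; `masked_softmax_reference` is a name made up for this example:

```python
import torch

def masked_softmax_reference(scores, mask_4d):
    # scores:  (N, H, L, L) raw attention scores
    # mask_4d: (N, H, L, L) bool, True = masked out
    # Note: a row with every position masked yields NaN in this naive version
    return torch.softmax(scores.masked_fill(mask_4d, float("-inf")), dim=-1)
```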

Tests

  • Extended `test_mask_check_fastpath` to check that the fast path is indeed taken in Transformer when two masks are passed
  • Added `test_multihead_self_attn_two_masks_fast_path_mock` to check that the fast path is taken in MHA when two masks are passed
  • Added `test_multihead_self_attn_two_masks_fast_path` to check that the fast and slow paths give the same result when two masks are passed in MHA
  • `test_masked_softmax_mask_types` now covers mask type 2
  • `test_transformerencoderlayer_fast_path` (CPU smoke test) is expanded to the case where both masks are provided simultaneously
  • `test_masked_softmax_devices_parity` checks that mask type 2 is accepted by the CPU and CUDA paths
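For context, a minimal usage sketch of the case this PR enables; the hyperparameters here are illustrative, and note that the fastpath additionally requires eval/inference mode among other conditions:

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
layer.eval()  # the fastpath is only taken during inference

src = torch.rand(2, 5, 64)                                              # (N, L, E)
src_mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)   # causal
src_key_padding_mask = torch.tensor([[False, False, False, False, False],
                                     [False, False, False, True, True]])

with torch.inference_mode():
    out = layer(src, src_mask=src_mask,
                src_key_padding_mask=src_key_padding_mask)
print(out.shape)  # torch.Size([2, 5, 64])
```

Before this change, passing both masks together forced the slow path; with it, the pair is merged into a single 4D mask (mask type 2) and execution stays on the fastpath.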

@pytorch-bot pytorch-bot bot added the release notes: nn release notes category label Nov 4, 2022
@pytorch-bot

pytorch-bot bot commented Nov 4, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88488

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4366984:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

@sgrigory has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

3 similar comments

@sgrigory sgrigory changed the title from "[WIP] Support src_mask and src_key_padding_mask for Better Transformer" to "Support src_mask and src_key_padding_mask for Better Transformer" Nov 8, 2022
@sgrigory sgrigory marked this pull request as ready for review November 8, 2022 17:09
@sgrigory sgrigory requested a review from mikekgfb November 8, 2022 17:37
@mikekgfb
Contributor

mikekgfb commented Nov 8, 2022

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 8, 2022
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 additional jobs have failed, first few of them are: trunk, trunk / cuda11.6-py3.10-gcc7-sm86 / test (default, 2, 4, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team: raised by workflow job

@sgrigory
Contributor Author

sgrigory commented Nov 9, 2022

@malfet The merge above failed, but the errors seem to be unrelated to the PR:

  • "ERROR ENCOUNTERED WHEN UPLOADING TO SCRIBE",
  • "KeyError: 'jobs'" in "Get workflow job id"
  • NVIDIA kernel loading error.

Could you take a look and confirm whether these are indeed known infra failures?

@sgrigory
Contributor Author

sgrigory commented Nov 9, 2022

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased support-two-masks-better-transformer onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout support-two-masks-better-transformer && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the support-two-masks-better-transformer branch from f327c24 to 4366984 on November 9, 2022 08:08
@facebook-github-bot
Contributor

@sgrigory has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@sgrigory
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Support src_mask and src_key_padding_mask for Better Transformer (pytorch#88488)

Pull Request resolved: pytorch#88488
Approved by: https://github.com/mikekgfb