[dtensor] run transformer sdpa in dtensor #122997
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122997
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit b336a91 with merge base a3d97f6. This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 661b23ba34e8a05ca8eb3057fe04b82832ad64d1 Pull Request resolved: #122997
lgtm
ghstack-source-id: 9a2cac1eaba0211bd115a00db2aa5eded8fae36a Pull Request resolved: #122997
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
Now that efficient attention is supported in dtensor, we can modify the transformer test to use dtensor in SDPA and get rid of the manual num_head adjustments.
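For context, here is a minimal sketch (not the actual test diff) of what calling SDPA directly on DTensors looks like, assuming a 1-D device mesh with q/k/v sharded on the head dimension; the shapes, dtype, and the `sharded_qkv` helper are made up for illustration:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh

# Assumes the default process group is already initialized (e.g. by the test
# harness) and CUDA is available; shapes/dtype are illustrative only.
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

def sharded_qkv(shape):
    # Shard on the head dimension (dim 1), mirroring head-parallel attention.
    return distribute_tensor(
        torch.randn(*shape, dtype=torch.bfloat16, device="cuda"), mesh, [Shard(1)]
    )

# (batch, num_heads, seq_len, head_dim)
q, k, v = (sharded_qkv((8, 16, 128, 64)) for _ in range(3))

# With efficient attention supported in DTensor, SDPA can be called on the
# DTensors directly -- no manual num_heads // tp_degree bookkeeping.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```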
Caveat: efficient attention is supported only with bf16/fp32 (not fp64) and has other constraints. If any of the constraints is not satisfied, SDPA falls back to the math-decomposed attention, which breaks because it does not fully work with DTensor (it creates a plain `torch.Tensor` mask in the middle). I considered adding checks like in P1202254918, but they would need to be added everywhere this Transformer is used. Is that necessary if the current CI machines can run efficient attention? (A sketch of such a guard follows the file list below.)
Test files containing this Transformer:
- `test/distributed/tensor/parallel/test_tp_examples.py`
- `test/distributed/_composable/fsdp/test_fully_shard_training.py`
- `test/distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py`
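As an illustration of the kind of guard discussed above (P1202254918 itself is not reproduced here), one option would be to restrict SDPA to the mem-efficient backend so unsupported inputs fail loudly instead of silently falling back to the math path:

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only, not the actual check from P1202254918.
def sdpa_efficient_only(q, k, v):
    # Allow only the mem-efficient backend; if it cannot handle these inputs
    # (e.g. fp64), SDPA raises "No available kernel" instead of silently
    # decomposing into the math path that breaks with DTensor.
    with torch.backends.cuda.sdp_kernel(
        enable_flash=False, enable_math=False, enable_mem_efficient=True
    ):
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```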
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @wconstab @yf225 @chauhang @d4l3k @rohan-varma