[dtensor] switch softmax forward ops to OpStrategy #117723

tianyu-l · 2024-01-18T02:09:49Z

Stack from ghstack (oldest at bottom):

-> [dtensor] switch softmax forward ops to OpStrategy #117723

Summary
This PR switches the softmax and log_softmax ops to use OpStrategy instead of rules. It also adds support for the case where the softmax dimension is sharded: the input is replicated before the computation.
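
To illustrate why a sharded softmax dimension needs replication first, here is a minimal pure-Python sketch. It does not use DTensor; the helpers `shard` and `all_gather` are hypothetical stand-ins that simulate ranks with plain lists, and it is not the code in this PR.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a flat list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def shard(xs, num_ranks):
    """Hypothetical helper: split xs into contiguous, equally sized chunks."""
    size = len(xs) // num_ranks
    return [xs[i * size:(i + 1) * size] for i in range(num_ranks)]


def all_gather(shards):
    """Hypothetical helper: simulated all-gather, every rank gets the full row."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]


row = [1.0, 2.0, 3.0, 4.0]
shards = shard(row, num_ranks=2)

# Naive approach: each rank normalizes only over its own shard, so every
# shard sums to 1 and the concatenated result is not a valid softmax.
naive = [y for s in shards for y in softmax(s)]

# Replicate first, then compute: every rank sees the whole softmax dimension.
replicated = all_gather(shards)
per_rank = [softmax(r) for r in replicated]
```

The naive result sums to the number of shards rather than to 1, while every rank in `per_rank` holds the same correct softmax of the full row.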

Test
python test/distributed/_tensor/test_math_ops.py -k test_softmax_fwd
python test/distributed/_tensor/test_math_ops.py -k test_softmax_with_bwd

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @wconstab @yf225

[ghstack-poisoned]

pytorch-bot · 2024-01-18T02:09:52Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/117723

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit 7cfb395 with merge base f316c35:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 4, 5, linux.g5.4xlarge.nvidia.gpu) (gh)
dynamo/test_ctx_manager.py::CtxManagerTests::test_cuda_stream_across_graph_break
pull / linux-focal-py3.11-clang10 / test (dynamo, 4, 7, linux.2xlarge) (gh)
test_weak.py::WeakTest::test_make_weak_keyed_dict_from_weak_keyed_dict
pull / linux-focal-py3.8-clang10 / test (dynamo, 7, 7, linux.2xlarge) (gh)
test_weak.py::WeakTest::test_make_weak_keyed_dict_from_weak_keyed_dict

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

pull / linux-focal-py3_8-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable) (gh)
Process completed with exit code 128.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

XilunWu

overall LGTM! Some suggestions. Feel free to address them in another PR or in this PR.

torch/distributed/_tensor/ops/math_ops.py

XilunWu · 2024-01-18T06:30:56Z

test/distributed/_tensor/test_math_ops.py

+            dist_y = dist_softmax.sum()
            if dims[softmax_dim] == dims[shard_dim]:
-                with self.assertRaisesRegex(
-                    Exception, "Cannot run .* on sharding dimension!$"
-                ):
-                    dist_softmax = dist_x.softmax(dim=softmax_dim)
+                self.assertTrue(dist_y.placements[0].is_replicate())


Shall we put `dist_y = dist_softmax.sum()` below the `assertTrue` (lines 114-117)?

XilunWu · 2024-01-18T06:33:53Z

test/distributed/_tensor/test_math_ops.py

-                dist_y.backward()
-                self.assertIsNotNone(dist_x.grad)
-                self.assertEqual(dist_x.grad.full_tensor(), x.grad)
+            self.assertEqual(dist_y.to_local(), local_y)


I suggest we also check `dist_y.grad`'s sharding placements before redistributing.

tianyu-l · 2024-01-19

Hmm, there seems to be no `.grad` at this point. Did you mean `dist_y`'s sharding placements? If so, I agree!

XilunWu · 2024-01-22T20:48:29Z

I think the PR is good to merge.

tianyu-l · 2024-01-22T21:24:13Z

@pytorchbot merge

pytorchmergebot · 2024-01-22T21:26:06Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

As titled. This is a followup to PR #117723 on softmax forward ops. Pull Request resolved: #119255 Approved by: https://github.com/XilunWu, https://github.com/wanchaol

Loss parallel is the last piece of sequence parallelism to enable. It enables efficient distributed cross-entropy computation when the input is sharded on the class dimension (in a classification problem with many classes). It is implemented as a context manager, `loss_parallel`; once it is enabled, users can call `torch.nn.functional.cross_entropy` or `torch.nn.CrossEntropyLoss` directly without modifying other parts of their code.

Here is the rationale for these op replacements:

1. `nn.functional.cross_entropy` is the method OSS users commonly use for things like transformer training. To avoid changing user code, we want users who already call this function to keep using it for loss calculation.
2. `nn.functional.cross_entropy` boils down to `aten.log_softmax` and `aten.nll_loss_forward/backward`, and DTensor already supports those ops (#117723 #119255 #118917 #119256). They compute with the input *replicated* on the class dimension.
3. However, when the input of this loss calculation is **sharded on the class dimension**, running the sharded computation efficiently requires multiple all-reduce collectives **in the middle of** both `aten.log_softmax` and `aten.nll_loss_forward`. That is not possible if we only override these two ops, so we need a way to **decompose** them into smaller ops so that collectives can run between those smaller ops.
4. We explored the existing decompositions (#118950). They seem to work, except that `log_softmax_backward` and `nll_loss_backward` combined in aten are implemented in an inefficient way, which would trigger an additional expensive collective. Similar issues were also reported recently in #119261.
5. Therefore, we currently do our own decomposition inside a context manager, specifically for sequence parallelism.

Once we have a better decomposition in core, we can possibly use that instead of reinventing the wheel here.

Pull Request resolved: #119877
Approved by: https://github.com/wanchaol
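
The point about collectives running in the middle of the ops can be sketched in pure Python. This is a simulation under stated assumptions, not the `loss_parallel` implementation: ranks are modeled as lists, `all_reduce` is a hypothetical stand-in for a real collective, and the class dimension is assumed to divide evenly across ranks.

```python
import math


def all_reduce(values, op):
    """Hypothetical stand-in: combine one value per rank, return it to all ranks."""
    result = op(values)
    return [result for _ in values]


def sharded_log_softmax(shards):
    """log_softmax over a class dimension that is split across ranks."""
    # Collective 1: local max per rank, then all-reduce(max).
    global_max = all_reduce([max(s) for s in shards], max)
    # Collective 2: local sum of exponentials, then all-reduce(sum).
    local_sum = [sum(math.exp(x - m) for x in s) for s, m in zip(shards, global_max)]
    global_sum = all_reduce(local_sum, sum)
    # The final step is purely local; the output stays sharded on the class dim.
    return [
        [x - m - math.log(t) for x in s]
        for s, m, t in zip(shards, global_max, global_sum)
    ]


def sharded_nll(log_prob_shards, target):
    """Negative log-likelihood of one target class over sharded log-probs."""
    size = len(log_prob_shards[0])
    local = []
    for rank, s in enumerate(log_prob_shards):
        lo = rank * size
        # Only the rank that owns the target class contributes a nonzero term.
        local.append(-s[target - lo] if lo <= target < lo + size else 0.0)
    # Collective 3: all-reduce(sum) so every rank holds the loss.
    return all_reduce(local, sum)[0]


logits = [2.0, -1.0, 0.5, 3.0, 0.0, 1.5]
shards = [logits[0:2], logits[2:4], logits[4:6]]  # 3 simulated ranks
target = 3

loss = sharded_nll(sharded_log_softmax(shards), target)

# Single-device reference for comparison.
m = max(logits)
lse = m + math.log(sum(math.exp(x - m) for x in logits))
reference_loss = -(logits[target] - lse)
```

In this sketch only scalars per row cross rank boundaries (a max, a sum, and the target's log-probability), whereas replicating the input would move the whole class dimension.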

[dtensor] switch softmax forward ops to OpStrategy

89cb085

[ghstack-poisoned]

tianyu-l added a commit that referenced this pull request Jan 18, 2024

[dtensor] switch softmax forward ops to OpStrategy

fcd816d

ghstack-source-id: 12bb9a4383b5946dcf10402fe6162722305eea87 Pull Request resolved: #117723

github-actions bot added the oncall: distributed and ciflow/inductor labels Jan 18, 2024

tianyu-l added the ciflow/trunk and release notes: distributed (dtensor) labels Jan 18, 2024

tianyu-l requested review from wanchaol and XilunWu January 18, 2024 02:11

tianyu-l removed the oncall: distributed label Jan 18, 2024

XilunWu approved these changes Jan 18, 2024

tianyu-l added a commit that referenced this pull request Jan 19, 2024

[dtensor] switch softmax forward ops to OpStrategy

383b9d5

ghstack-source-id: 5b2ff590f96a8a2ed12022935289b1574d986e7e Pull Request resolved: #117723

github-actions bot added the oncall: distributed label Jan 19, 2024

pytorchmergebot added the merging label Jan 22, 2024

pytorchmergebot added the Merged label Jan 22, 2024

pytorchmergebot closed this in 86e8551 Jan 22, 2024

pytorchmergebot removed the merging label Jan 22, 2024

XilunWu mentioned this pull request Jan 22, 2024

[DTensor][BE] rename PlacementStrategy.output_spec to output_specs since now we support a tuple of DTensorSpec as output #116437

Closed

facebook-github-bot deleted the gh/tianyu-l/3/head branch January 26, 2024 15:22

tianyu-l mentioned this pull request Feb 6, 2024

[dtensor] switch softmax backward ops to OpStrategy #119255

Closed

tianyu-l mentioned this pull request Feb 14, 2024

[dtensor] add support for loss parallel #119877

Closed
