
Conversation

d4l3k (Member) commented Mar 11, 2024

This adds some basic comm tests to test_tp_examples, validating that the expected distributed calls are made during test_transformer_training.

Fixes #121649

Test plan:

pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training
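
For reference, a minimal sketch of how such comm counts can be collected with DTensor's CommDebugMode (assuming the utility lives in torch.distributed._tensor.debug as on this branch; model_tp and inp are placeholders for the test's parallelized model and input):

import torch
from torch.distributed._tensor.debug import CommDebugMode

# functional collective ops are the keys returned by get_comm_counts()
c10d_functional = torch.ops.c10d_functional

comm_mode = CommDebugMode()
with comm_mode:
    # one illustrative forward/backward step on the tensor-parallel model
    model_tp(inp).sum().backward()

counts = comm_mode.get_comm_counts()
# e.g. self.assertEqual(counts[c10d_functional.all_reduce], expected_all_reduces)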

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

d4l3k requested a review from wanchaol March 11, 2024 20:59
pytorch-bot added the topic: not user facing label Mar 11, 2024

pytorch-bot commented Mar 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121669

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 6035e67 with merge base 443444d:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

d4l3k requested a review from kurman March 11, 2024 20:59
github-actions bot added the oncall: distributed label Mar 11, 2024

awgu (Collaborator) commented Mar 11, 2024

Oof, this will conflict with #121660 -- we want to have a canonical TP/SP sharding for the Transformer so we can use that for FSDP + TP/SP unit tests.
cc: @wanchaol

Should we just duplicate the TP/SP sharding logic since it seems we need to insert inline tests here?

d4l3k (Member, Author) commented Mar 11, 2024

@awgu I can just move the tests to wrap the whole parallelize method, since that shouldn't be making any network calls. Not a problem. Do you know when that's landing?

d4l3k requested a review from awgu March 11, 2024 21:09

wanchaol (Collaborator) left a comment

Nice and fast changes! I have some comments around the sharding init comm tracking.

self._check_module(model, model_tp)
if is_seq_parallel:
    self.assertDictEqual(comm_mode.get_comm_counts(), {
        c10d_functional.all_reduce: 30,

Collaborator:

Wat, I am shocked that there are 30 all-reduces in the optimizer step 😮 I was expecting around 5 all-reduces, hmm.

Could you pass foreach=True to the Adam optimizer and see if that reduces the number of all-reduces?
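
(A minimal illustration of that suggestion; model_tp and the learning rate here are placeholders:)

import torch

# hypothetical: switch Adam to its multi-tensor (foreach) implementation
optim_tp = torch.optim.Adam(model_tp.parameters(), lr=1e-3, foreach=True)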

Collaborator:

I think foreach uses a single kernel, but the tensors are still kept separate, so the number of all-reduces would stay the same.

tianyu-l (Contributor) commented Mar 12, 2024

It seems each LayerNorm would incur 2 all-reduces (weight & bias) in the backward pass. With the default two-layer Transformer, we have two attention_norm, two ffn_norm, and one final norm -- adding up to 10 all-reduces already?

In general, I wonder if we should include a breakdown of the collectives in the test -- ideally it should be a function of the number of TransformerBlocks (and other relevant configs), so even if we modify the model architecture, the test could still pass.
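
A hypothetical sketch of such a breakdown, deriving the expected backward-pass all-reduce count from the layer count instead of hard-coding it (the function name is made up; the factor of 2 per LayerNorm follows the reasoning above):

def expected_layernorm_all_reduces(n_layers: int) -> int:
    # each TransformerBlock has an attention_norm and an ffn_norm, plus one
    # final norm; under sequence parallelism each nn.LayerNorm contributes
    # 2 all-reduces in backward (one for its weight, one for its bias)
    num_norms = 2 * n_layers + 1
    return 2 * num_norms

assert expected_layernorm_all_reduces(2) == 10  # matches the estimate above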

Member Author:

Traceback (most recent call last):
  File "/home/tristanr/pytorch/torch/testing/_internal/common_distributed.py", line 540, in wrapper
    self._join_processes(fn)
  File "/home/tristanr/pytorch/torch/testing/_internal/common_distributed.py", line 759, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/home/tristanr/pytorch/torch/testing/_internal/common_distributed.py", line 809, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/home/tristanr/pytorch/torch/testing/_internal/common_distributed.py", line 656, in run_test
    getattr(self, test_name)()
  File "/home/tristanr/pytorch/torch/testing/_internal/common_distributed.py", line 542, in wrapper
    fn()
  File "/home/tristanr/pytorch/torch/testing/_internal/common_utils.py", line 2739, in wrapper
    method(*args, **kwargs)
  File "/home/tristanr/pytorch/torch/testing/_internal/common_utils.py", line 439, in instantiated_test
    test(self, **param_kwargs)
  File "/home/tristanr/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 374, in wrapper
    func(self, *args, **kwargs)  # type: ignore[misc]
  File "/home/tristanr/pytorch/torch/testing/_internal/common_distributed.py", line 181, in wrapper
    return func(*args, **kwargs)
  File "/home/tristanr/pytorch/test/distributed/tensor/parallel/test_tp_examples.py", line 247, in test_transformer_training
    optim_tp.step()
  File "/home/tristanr/pytorch/torch/optim/optimizer.py", line 391, in wrapper
    out = func(*args, **kwargs)
  File "/home/tristanr/pytorch/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/tristanr/pytorch/torch/optim/adam.py", line 168, in step
    adam(
  File "/home/tristanr/pytorch/torch/optim/adam.py", line 318, in adam
    func(params,
  File "/home/tristanr/pytorch/torch/optim/adam.py", line 522, in _multi_tensor_adam
    torch._foreach_lerp_(device_exp_avgs, device_grads, 1 - beta1)
  File "/home/tristanr/pytorch/torch/distributed/_tensor/api.py", line 279, in __torch_dispatch__
    return DTensor._op_dispatcher.dispatch(
  File "/home/tristanr/pytorch/torch/distributed/_tensor/dispatch.py", line 111, in dispatch
    op_info = self.unwrap_to_op_info(op_call, args, kwargs)
  File "/home/tristanr/pytorch/torch/distributed/_tensor/dispatch.py", line 314, in unwrap_to_op_info
    raise RuntimeError(
RuntimeError: aten._foreach_lerp_.Scalar: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!

Seems to error out with foreach=True

Collaborator:

If you want to bypass that error:

from torch.distributed._tensor.experimental import implicit_replication

# implicitly treat plain torch.Tensors in the op as replicated DTensors,
# which avoids the mixed torch.Tensor/DTensor dispatch error above
with implicit_replication():
    optim.step()

cc: @wanchaol

Collaborator:

Ohh I see, this is nn.LayerNorm, not RMSNorm, which has 2 all-reduces per layer norm -- this makes sense then.
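
(A quick way to see where the factor of 2 comes from, assuming the test model uses plain nn.LayerNorm with its default elementwise_affine=True:)

import torch.nn as nn

ln = nn.LayerNorm(16)
# prints ['weight', 'bias'] -- one gradient all-reduce each under sequence parallelism
print([name for name, _ in ln.named_parameters()])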

d4l3k force-pushed the tristanr/test_tp_examples branch from 5114cc4 to 6035e67 on March 14, 2024 17:24
d4l3k requested a review from wanchaol March 14, 2024 20:30

wanchaol (Collaborator) left a comment

This LGTM! Thanks for the fast PR!

d4l3k (Member, Author) commented Mar 15, 2024

@pytorchbot merge

pytorch-bot added the ciflow/trunk label Mar 15, 2024

pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

d4l3k deleted the tristanr/test_tp_examples branch March 15, 2024 18:17

Labels: ciflow/trunk, Merged, oncall: distributed, topic: not user facing

Projects: None yet

Development

Successfully merging this pull request may close these issues:

guard on tensor parallelism test examples

6 participants