[DTensor] Turn on foreach implementation of optimizer for DTensor by default #123394
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/123394
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 unrelated failures) As of commit f907706 with merge base 1a28f73.
FLAKY - The following jobs failed but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -23,6 +25,11 @@
]

# Append DTensor to the list of supported types for foreach implementation of optimizer
# so that we will try to use foreach over the for-loop implementation on CUDA.
_foreach_supported_types.append(DTensor)
Python question: are we guaranteed that the __init__.py code will only ever run once? Do we need a check like:

if DTensor not in _foreach_supported_types:
    _foreach_supported_types.append(DTensor)
Good point. I will just add it as a safety check.
It should only be imported once from the Python importing perspective, but a guard is safer, yes.
Or change `_foreach_supported_types` to a set. :)
cc: @janeyx99
Feel free to make it a set if it's easier haha
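For illustration, a minimal sketch of the guarded registration discussed above; the import path for `_foreach_supported_types` and the surrounding module are assumptions, not necessarily the exact code in this PR:

```python
# Hedged sketch of the guarded registration; the import location of
# _foreach_supported_types is an assumption, not confirmed by the PR diff.
from torch.distributed._tensor import DTensor
from torch.utils._foreach_utils import _foreach_supported_types

# Guard so that re-running this module (e.g. via importlib.reload or an
# unusual import path) cannot append DTensor twice.
if DTensor not in _foreach_supported_types:
    _foreach_supported_types.append(DTensor)
```

If the list were changed to a set, the membership check would come for free, at the cost of losing ordering.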
Force-pushed from 51774c9 to 9a107df.
@pytorchmergebot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
Force-pushed from 9a107df to 76bbf9c.
@with_comms
def test_adam_1d_sharding(self):
    mesh = DeviceMesh(self.device_type, list(range(self.world_size)))

    # TODO: add fused_adam support
    adam_configs = [
        {"lr": 0.1},
        {"lr": 0.1, "foreach": False},
Wondering why we turn foreach to False for all these tests? IIUC, even if we put DTensor into _foreach_supported_types, passing foreach=False manually to the optimizer would still disable the foreach path.
I am just adjusting each config based on whether it originally had "foreach": True. If a config had "foreach": True, I remove it, since foreach is now on by default. For configs that didn't have "foreach": True, I change them to "foreach": False. So we still have tests covering both the foreach and the for-loop implementations, as sketched below.
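A small before/after sketch of the kind of config list this produces (illustrative values only, not the exact test file):

```python
# Before this PR (illustrative): foreach had to be requested explicitly.
adam_configs_before = [
    {"lr": 0.1},                   # for-loop implementation (old default for DTensor)
    {"lr": 0.1, "foreach": True},  # foreach implementation, opt-in
]

# After this PR: foreach is the default for DTensor params, so the explicit
# True entry is dropped and an explicit False entry keeps the for-loop
# implementation covered.
adam_configs_after = [
    {"lr": 0.1},                    # foreach implementation (new default)
    {"lr": 0.1, "foreach": False},  # for-loop implementation
]
```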
Ohh I see, makes sense.
LGTM. For grad norm clipping, I guess it would happen in follow-up PRs?
Force-pushed from 76bbf9c to 0dfb5c0.
@pytorchmergebot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
Force-pushed from 0dfb5c0 to 199d77a.
@pytorchmergebot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
Force-pushed from 199d77a to 240236d.
@@ -302,6 +302,7 @@ def unwrap_to_op_info(
        args_schema.append(arg._spec)
        local_args.append(arg._local_tensor)
        if mesh is not None:
            print(f"{mesh=}, {arg.device_mesh=}")
I tried repro'ing the test_dtensor_compile failure locally. It's actually coming from here: the test passes when I remove the print statement.
I could mostly tell by looking at the stack trace from the FakeTensor erroring, and seeing that it comes from:
(1) this code printing arg.device_mesh,
(2) DeviceMesh being a tensor, so printing it calls tensor.tolist(),
(3) printing a tensor not being a very trace-friendly operation, which is why you get a kind-of-obscure error.
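As an aside, if a debug print were ever needed on a code path that can be traced, one hedged option (not from this PR) is to skip it while compiling; `torch.compiler.is_compiling()` exists in recent PyTorch versions, but the helper below is hypothetical:

```python
import torch

# Hypothetical debug helper: only print when we are not being traced by
# torch.compile, so the FakeTensor/tracing path never hits tolist().
def _debug_print_mesh(mesh, arg):
    if not torch.compiler.is_compiling():
        print(f"{mesh=}, {arg.device_mesh=}")
```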
Force-pushed from d8643a1 to a89d0f0.
Force-pushed from a89d0f0 to f907706.
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #121799. We fix the DeviceMesh hash so that two meshes are considered equal if they have the same mesh and the same parent_mesh. Examples can be found here: #121799. Also needed to unblock #123394. Pull Request resolved: #123572 Approved by: https://github.com/xunnanxu, https://github.com/wanchaol, https://github.com/yoyoyocmu
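For intuition, a rough sketch of the equality semantics described in that fix; this is not the actual DeviceMesh implementation, just an assumed illustration of "equal iff same mesh and same parent mesh":

```python
import torch

class _MeshLike:
    """Toy stand-in for DeviceMesh to illustrate the described __eq__/__hash__."""

    def __init__(self, mesh: torch.Tensor, parent=None):
        self.mesh = mesh          # tensor of ranks
        self._parent = parent     # parent mesh object, or None

    def __eq__(self, other):
        if not isinstance(other, _MeshLike):
            return NotImplemented
        # Equal only when both the mesh contents and the parent mesh match.
        return torch.equal(self.mesh, other.mesh) and self._parent is other._parent

    def __hash__(self):
        # Hash derived from the same fields used by __eq__.
        return hash((tuple(self.mesh.flatten().tolist()), id(self._parent)))
```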
[DTensor] Turn on foreach implementation of optimizer for DTensor by default (pytorch#123394). Append DTensor to the optimizer `_foreach_supported_types` and turn on the foreach implementation of the optimizer for DTensor if not specified by the user. Pull Request resolved: pytorch#123394 Approved by: https://github.com/wanchaol
Append DTensor to the optimizer `_foreach_supported_types` and turn on the foreach implementation of the optimizer for DTensor if not specified by the user.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @tianyu-l @wconstab @yf225 @chauhang @d4l3k @msaroufim @rohan-varma
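To illustrate the user-facing effect, a minimal sketch; it assumes a distributed run already launched (e.g. via torchrun) with a CUDA backend, and the shapes/placements are illustrative only:

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

# Assumes torch.distributed is already initialized (e.g. launched with torchrun).
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

# Shard a parameter across the mesh; each rank holds a slice along dim 0.
param = torch.nn.Parameter(distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)]))

# With this change, Adam uses the foreach implementation for DTensor
# parameters by default; pass foreach=False to force the for-loop path.
opt = torch.optim.Adam([param], lr=0.1)
opt_forloop = torch.optim.Adam([param], lr=0.1, foreach=False)
```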