[DCP] Enable nD device_mesh resharding DTensor in DCP and add associated tests #106230

wz337 · 2023-07-28T21:41:04Z

This PR:
1. Drops the assert that required a 1D DeviceMesh, so that a DTensor on an nD DeviceMesh is allowed when creating write_item.
2. Adds tests for both placement changes and mesh changes, in both 1D and 2D scenarios.

cc. @kumpera @wanchaol @fegin

pytorch-bot · 2023-07-28T21:41:06Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/106230


✅ No Failures

As of commit e5738e2 with merge base 3fe8417:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

fix placements change failure add resharding tests

kumpera · 2023-09-11T21:22:00Z

test/distributed/checkpoint/test_dtensor_resharding.py

+    [Replicate(), Replicate()],
+    [Replicate(), Shard(0)],
+    [Shard(0), Replicate()],
+    [Shard(0), Shard(0)],


Maybe consider adding sharding on non-zero dim?

Thanks for reviewing. I will add more test cases. Sharding on a different dim in the placements should be no problem. The previous bug was in the offset computation: when sharding on one dimension twice, the DTensor util only gives the local shard offset instead of the global offset.

Regardless, will definitely add more test cases for this.
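To make the local-versus-global offset point concrete, here is a minimal pure-Python sketch. It is not the DTensor utility itself: `global_shard_offsets` and `shard_size_and_offset` are hypothetical helpers, placements are modeled as `None` (replicate) or an integer tensor dim (shard), and chunking is assumed to follow `torch.chunk`-style ceil-sized splits applied in mesh-dimension order.

```python
from typing import Optional, Sequence, Tuple


def shard_size_and_offset(dim_size: int, num_chunks: int, idx: int) -> Tuple[int, int]:
    """Size and start offset of chunk `idx` when `dim_size` is split into
    `num_chunks` ceil-sized chunks (assumed torch.chunk-like behavior)."""
    chunk = -(-dim_size // num_chunks)  # ceiling division
    start = min(idx * chunk, dim_size)
    end = min(start + chunk, dim_size)
    return end - start, start


def global_shard_offsets(
    global_shape: Sequence[int],
    mesh_shape: Sequence[int],
    placements: Sequence[Optional[int]],  # None = replicate, int = shard on that dim
    coord: Sequence[int],                 # this rank's coordinate in the mesh
) -> Tuple[Tuple[int, ...], Tuple[int, ...]]:
    """Return (local_shape, global_offsets) for one rank."""
    local_shape = list(global_shape)
    offsets = [0] * len(global_shape)
    for mesh_dim, shard_dim in enumerate(placements):
        if shard_dim is None:
            continue  # replicated along this mesh dim: shape and offset unchanged
        size, local_off = shard_size_and_offset(
            local_shape[shard_dim], mesh_shape[mesh_dim], coord[mesh_dim]
        )
        # Accumulate: each step's offset is relative to the shard produced by
        # the previous step. Keeping only the last `local_off` would give the
        # local offset rather than the global one.
        offsets[shard_dim] += local_off
        local_shape[shard_dim] = size
    return tuple(local_shape), tuple(offsets)


# A 4x4 tensor on a 2x2 mesh, sharded on dim 0 along both mesh dims.
shape, offsets = global_shard_offsets((4, 4), (2, 2), (0, 0), coord=(1, 1))
print(shape, offsets)
```

In this model, the rank at mesh coordinate (1, 1) sits in the second half of the rows and then in the second half of that half, so its global row offset is the sum of both steps, not just the last one.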

kumpera

LGTM.

wz337 · 2023-09-11T21:59:05Z

@pytorchmergebot merge

pytorchmergebot · 2023-09-11T22:00:41Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


fduwjj · 2023-09-11T23:20:06Z

test/distributed/checkpoint/test_dtensor_resharding.py

+for p1 in TWO_D_PLACEMENTS:
+    for p2 in TWO_D_PLACEMENTS:


nit: you might want to use itertools.product instead?
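A minimal sketch of that suggestion, assuming the nested loops only need every ordered (p1, p2) pair. The placement objects are replaced by string stand-ins so the snippet runs without torch; in the test file they would be the `Replicate()` / `Shard(0)` instances from `TWO_D_PLACEMENTS`.

```python
from itertools import product

# String stand-ins for the placement objects used in the test file.
TWO_D_PLACEMENTS = [
    ["Replicate()", "Replicate()"],
    ["Replicate()", "Shard(0)"],
    ["Shard(0)", "Replicate()"],
    ["Shard(0)", "Shard(0)"],
]

# Nested-loop form, as in the diff under review.
nested_pairs = [(p1, p2) for p1 in TWO_D_PLACEMENTS for p2 in TWO_D_PLACEMENTS]

# Equivalent single loop: product yields the same pairs in the same order.
product_pairs = list(product(TWO_D_PLACEMENTS, repeat=2))

for p1, p2 in product(TWO_D_PLACEMENTS, repeat=2):
    pass  # save with placements p1, load with placements p2
```

Both forms visit the same pairs in the same order; `product` just removes one level of indentation.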

fduwjj · 2023-09-11T23:21:10Z

test/distributed/checkpoint/test_dtensor_resharding.py

+            global_tensor = torch.arange(16, dtype=torch.float).view(4, 4)
+            mesh_shape = (self.world_size,)
+            mesh_1d = init_device_mesh(self.device_type, mesh_shape)
+            dtensor = distribute_tensor(
+                global_tensor, mesh_1d, placements=placements_1d
+            )
+            state_dict_to_save = {"dtensor": dtensor}
+
+            dist_cp.save_state_dict(
+                state_dict=state_dict_to_save,
+                storage_writer=dist_cp.FileSystemWriter(path=CHECKPOINT_DIR),
+                planner=dist_cp.DefaultSavePlanner(),
+            )


A lot of the code here seems to be duplicated. Maybe you want to consolidate it a little bit?

fegin · 2023-09-12T00:12:12Z

LGTM. @fduwjj's comment is legit; it can be addressed in a separate PR.

wz337 added module: distributed_checkpoint topic: not user facing topic category labels Jul 28, 2023

wz337 force-pushed the test_2d_dtensor_checkpoint branch from fe12fb1 to 73d40f9 Compare July 28, 2023 22:17

wz337 force-pushed the test_2d_dtensor_checkpoint branch from 73d40f9 to 100c39d Compare August 16, 2023 22:25

wz337 force-pushed the test_2d_dtensor_checkpoint branch 7 times, most recently from 2db9552 to dcaf38e Compare September 8, 2023 03:40

enable nD device_mesh resharding in DCP

e5738e2

fix placements change failure add resharding tests

wz337 force-pushed the test_2d_dtensor_checkpoint branch from dcaf38e to e5738e2 Compare September 8, 2023 04:14

wz337 changed the title ~~[DCP] Enable nD device_mesh resharding in DCP and add associated tests~~ [DCP] Enable nD device_mesh resharding DTensor in DCP and add associated tests Sep 8, 2023

wz337 marked this pull request as ready for review September 8, 2023 05:33

wz337 requested review from H-Huang, awgu, d4l3k, fduwjj, fegin, kiukchung, kwen2501, mrshenli, rohan-varma, wanchaol and zhaojuanmao as code owners September 8, 2023 05:33

kumpera reviewed Sep 11, 2023


kumpera approved these changes Sep 11, 2023


pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 11, 2023

pytorchmergebot added the merging label Sep 11, 2023

fduwjj reviewed Sep 11, 2023


pytorchmergebot added Merged and removed merging labels Sep 12, 2023

pytorchmergebot closed this in b6f9d4d Sep 12, 2023
