
Conversation

@wz337 (Contributor) commented Jul 28, 2023

This PR:
1. Drops the assert for the 1D DeviceMesh check so that a DTensor on an nD DeviceMesh is allowed when creating write_item (see the sketch below).
2. Adds tests covering both placement changes and mesh changes, in both 1D and 2D scenarios.

cc @kumpera @wanchaol @fegin
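For context, the flow this enables looks roughly like the following. This is a minimal sketch, assuming a 4-GPU run with torch.distributed already initialized (e.g. via torchrun) and a hypothetical CHECKPOINT_DIR; it mirrors the shape of the tests added in this PR rather than quoting them:

import torch
import torch.distributed.checkpoint as dist_cp
from torch.distributed._tensor import Replicate, Shard, distribute_tensor, init_device_mesh

CHECKPOINT_DIR = "/tmp/dtensor_ckpt"  # hypothetical path

# Save: a DTensor placed on a 2D (2 x 2) DeviceMesh.
mesh_2d = init_device_mesh("cuda", (2, 2))
global_tensor = torch.arange(16, dtype=torch.float).view(4, 4)
dtensor = distribute_tensor(global_tensor, mesh_2d, placements=[Shard(0), Replicate()])
dist_cp.save_state_dict(
    state_dict={"dtensor": dtensor},
    storage_writer=dist_cp.FileSystemWriter(path=CHECKPOINT_DIR),
    planner=dist_cp.DefaultSavePlanner(),
)

# Load: reshard into different placements on the same mesh.
dtensor = distribute_tensor(global_tensor, mesh_2d, placements=[Replicate(), Shard(0)])
state_dict_to_load = {"dtensor": dtensor}
dist_cp.load_state_dict(
    state_dict=state_dict_to_load,
    storage_reader=dist_cp.FileSystemReader(CHECKPOINT_DIR),
    planner=dist_cp.DefaultLoadPlanner(),
)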

@pytorch-bot (bot) commented Jul 28, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/106230

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e5738e2 with merge base 3fe8417:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wz337 force-pushed the test_2d_dtensor_checkpoint branch from fe12fb1 to 73d40f9 on July 28, 2023 22:17
@wz337 force-pushed the test_2d_dtensor_checkpoint branch from 73d40f9 to 100c39d on August 16, 2023 22:25
@wz337 force-pushed the test_2d_dtensor_checkpoint branch 7 times, most recently from 2db9552 to dcaf38e, on September 8, 2023 03:40
fix placements change failure
add resharding tests
@wz337 force-pushed the test_2d_dtensor_checkpoint branch from dcaf38e to e5738e2 on September 8, 2023 04:14
@wz337 changed the title from "[DCP] Enable nD device_mesh resharding in DCP and add associated tests" to "[DCP] Enable nD device_mesh resharding DTensor in DCP and add associated tests" on Sep 8, 2023
@wz337 marked this pull request as ready for review on September 8, 2023 05:33
[Replicate(), Replicate()],
[Replicate(), Shard(0)],
[Shard(0), Replicate()],
[Shard(0), Shard(0)],
Review comment (Contributor):
Maybe consider adding sharding on a non-zero dim?

@wz337 (Author) replied Sep 11, 2023:

Thanks for reviewing. Will add more test cases. Sharding on a different dim in the placements should be no problem. The previous bug in the offset computation came from sharding on one dimension twice: the DTensor util only gives the local shard offset instead of the global offset.

Regardless, will definitely add more test cases for this.
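To illustrate the offset issue, here is a hypothetical worked example (not code from this PR) for a 4x4 tensor on a 2x2 mesh with placements [Shard(0), Shard(0)], where rank mesh coordinates are (i, j):

def global_row_offset(i, j, rows=4, mesh_dims=(2, 2)):
    # First Shard(0): mesh dim 0 splits the 4 rows into 2-row chunks.
    outer = i * (rows // mesh_dims[0])
    # Second Shard(0): mesh dim 1 splits each 2-row chunk into 1-row chunks.
    inner = j * (rows // (mesh_dims[0] * mesh_dims[1]))
    # The global offset must compose both; returning only `inner` (the
    # local shard offset) is exactly the bug described above.
    return outer + inner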

@kumpera (Contributor) left a comment:

LGTM.

@wz337 (Author) commented Sep 11, 2023

@pytorchmergebot merge

@pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Sep 11, 2023
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Comment on lines +42 to +43:
for p1 in TWO_D_PLACEMENTS:
    for p2 in TWO_D_PLACEMENTS:
Review comment (Contributor):

nit: you might want to use itertools.product instead?
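For reference, that refactor would look something like this (a sketch, assuming the loop body stays unchanged):

import itertools

# Equivalent to the nested for-loops over TWO_D_PLACEMENTS above:
for p1, p2 in itertools.product(TWO_D_PLACEMENTS, repeat=2):
    ...  # loop body unchanged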

Comment on lines +161 to +173:
global_tensor = torch.arange(16, dtype=torch.float).view(4, 4)
mesh_shape = (self.world_size,)
mesh_1d = init_device_mesh(self.device_type, mesh_shape)
dtensor = distribute_tensor(
global_tensor, mesh_1d, placements=placements_1d
)
state_dict_to_save = {"dtensor": dtensor}

dist_cp.save_state_dict(
state_dict=state_dict_to_save,
storage_writer=dist_cp.FileSystemWriter(path=CHECKPOINT_DIR),
planner=dist_cp.DefaultSavePlanner(),
)
Review comment (Contributor):

A lot of the code here seems to be duplicated. Maybe you want to consolidate it a little bit?
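One possible shape for that consolidation, as a sketch only (the helper name and signature are hypothetical, not from this PR):

def _create_and_save_dtensor(self, global_tensor, mesh, placements):
    # Distribute the tensor, then write it out with the default save planner.
    dtensor = distribute_tensor(global_tensor, mesh, placements=placements)
    state_dict_to_save = {"dtensor": dtensor}
    dist_cp.save_state_dict(
        state_dict=state_dict_to_save,
        storage_writer=dist_cp.FileSystemWriter(path=CHECKPOINT_DIR),
        planner=dist_cp.DefaultSavePlanner(),
    )
    return state_dict_to_save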

@fegin (Contributor) commented Sep 12, 2023

LGTM. @fduwjj's comment is legit; it can be improved in a separate PR.


Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, topic: not user facing