[Checkpoint][2D][2/N] Add traverse for distributed checkpoint to core distributed #89398

Closed · wants to merge 9 commits

Conversation
@wz337 (Contributor) commented on Nov 21, 2022:

This PR moves traverse and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.

It is used when flattening nested dicts and flattening sharded tensors.

Docstring and comments will be added in the following PRs.

Test:

python3 test/distributed/_tensor/parallel/test_2d_parallel.py

and CI
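
For readers new to the utility, here is a minimal, hypothetical sketch of the kind of traversal described above: walk a nested state dict, invoke a visitor on every leaf, and build a flat path-to-value mapping. The names `traverse_state_dict` and `flatten_state_dict` and their exact behavior are assumptions for illustration, not the actual torch.distributed.checkpoint API; the module in this PR also has to decide how ShardedTensor and DTensor values are treated (see the review comments below), which is omitted here.

```python
# Hypothetical sketch only; function names and behavior are illustrative,
# not the actual torch.distributed.checkpoint implementation.
from collections.abc import Mapping
from typing import Callable, Dict, Tuple, Union

PATH_ITEM = Union[str, int]
OBJ_PATH = Tuple[PATH_ITEM, ...]
STATE_DICT_ITEM = object


def traverse_state_dict(
    state_dict: Mapping,
    visitor: Callable[[OBJ_PATH, STATE_DICT_ITEM], None],
) -> None:
    """Call ``visitor(path, value)`` for every leaf of a nested state dict."""

    def _traverse(path: OBJ_PATH, value: STATE_DICT_ITEM) -> None:
        if isinstance(value, Mapping):
            for key, child in value.items():
                _traverse(path + (key,), child)
        elif isinstance(value, (list, tuple)):
            for idx, child in enumerate(value):
                _traverse(path + (idx,), child)
        else:
            visitor(path, value)

    _traverse((), state_dict)


def flatten_state_dict(state_dict: Mapping) -> Dict[str, STATE_DICT_ITEM]:
    """Flatten a nested state dict into a {dotted.path: value} mapping."""
    flattened: Dict[str, STATE_DICT_ITEM] = {}

    def _collect(path: OBJ_PATH, value: STATE_DICT_ITEM) -> None:
        flattened[".".join(str(p) for p in path)] = value

    traverse_state_dict(state_dict, _collect)
    return flattened


if __name__ == "__main__":
    nested = {"model": {"layer1": {"weight": [1.0, 2.0]}}, "step": 3}
    print(flatten_state_dict(nested))
    # {'model.layer1.weight.0': 1.0, 'model.layer1.weight.1': 2.0, 'step': 3}
```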

pytorch-bot bot commented Nov 21, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89398

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9182656:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wz337 changed the title from "add_traverse" to "[Checkpoint][2D][1/N] Add traverse for distributed checkpoint" (Nov 21, 2022)
wz337 changed the title from "[Checkpoint][2D][1/N] Add traverse for distributed checkpoint" to "[Checkpoint][2D][1/N] Move traverse for distributed checkpoint to core distributed" (Nov 21, 2022)
wz337 changed the title from "[Checkpoint][2D][1/N] Move traverse for distributed checkpoint to core distributed" to "[Checkpoint][2D][2/N] Move traverse for distributed checkpoint to core distributed" (Nov 21, 2022)
wz337 changed the title from "[Checkpoint][2D][2/N] Move traverse for distributed checkpoint to core distributed" to "[Checkpoint][2D][2/N] Add traverse for distributed checkpoint to core distributed" (Nov 21, 2022)
wz337 requested a review from wanchaol (Nov 21, 2022, 16:05)
wz337 marked this pull request as ready for review (Nov 21, 2022, 16:05)
```python
    data,
)
self.assertEqual(
    data[
```

Contributor:

Is there a reason why all these asserts are formatted this way? Can we shorten them to fewer lines?

wz337 (Contributor Author):

Will reformat and clean this up. I am taking this directly from Tau. I remember Rodrigo mentioned that he purposely formatted it this way to pass CI. Not sure about the details. lol
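
As a purely hypothetical illustration of the reviewer's point (the real fixture data is not shown in the excerpt above), the wrapped assertion style can usually be collapsed once the expression fits the formatter's line limit:

```python
import unittest


class _FormatExample(unittest.TestCase):
    # Hypothetical fixture; not the actual test data from this PR.
    data = {"model.weight": [1.0, 2.0]}

    def test_wrapped(self) -> None:
        # Before: the call is wrapped across several lines to satisfy the linter.
        self.assertEqual(
            self.data[
                "model.weight"
            ],
            [1.0, 2.0],
        )

    def test_compact(self) -> None:
        # After: the same check collapsed onto one line within the length limit.
        self.assertEqual(self.data["model.weight"], [1.0, 2.0])


if __name__ == "__main__":
    unittest.main()
```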

```python
    STATE_DICT_TYPE,
)
from torch.distributed._shard.sharded_tensor.api import ShardedTensor
from torch.distributed._tensor import DTensor as DT
```

Contributor:

nit: let's just call it DTensor? I feel DT is a bit hard to read.
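
For illustration, the suggestion amounts to importing the class under its full name; `_is_sharded_value` below is a hypothetical helper, added only to show how the unabbreviated name reads in an isinstance check:

```python
# Illustrative only; `_is_sharded_value` is a hypothetical helper, not part of the PR.
from torch.distributed._shard.sharded_tensor.api import ShardedTensor
from torch.distributed._tensor import DTensor  # instead of `import DTensor as DT`


def _is_sharded_value(value: object) -> bool:
    # The full class name makes the intent of the check obvious at the call site.
    return isinstance(value, (ShardedTensor, DTensor))
```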

```python
OBJ_PATH = Tuple[PATH_ITEM, ...]
T = TypeVar("T")

STATE_DICT_ITEM = object
```

Contributor:

Why do we make this a type alias for object?

wz337 (Contributor Author):

My gut feeling is that this is for readability, since we are traversing a state dict here.
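
A small illustration of that readability argument: with the alias, visitor signatures say what they operate on, even though STATE_DICT_ITEM is just `object` at runtime. The `VisitorFn` alias and `_example_visitor` below are assumptions for the sketch, not names from the module.

```python
from typing import Callable, Tuple, Union

PATH_ITEM = Union[str, int]
OBJ_PATH = Tuple[PATH_ITEM, ...]
# At runtime this is just `object`, but in signatures it documents that the
# callback receives entries of a state dict rather than arbitrary values.
STATE_DICT_ITEM = object

VisitorFn = Callable[[OBJ_PATH, STATE_DICT_ITEM], None]


def _example_visitor(path: OBJ_PATH, value: STATE_DICT_ITEM) -> None:
    # The annotations carry intent only; they add no runtime checking.
    print(path, type(value).__name__)
```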

```python
_print_nested(
    value._local_tensor,
    f"{padding}\t",
    "(offset ???) ",
```

Contributor:

Is this `???` intentional?

wz337 (Contributor Author):

I think he was trying to print the offset for a given _local_tensor here, but we don't have an API for it. I am removing this for now. To my knowledge, we don't have anything for this yet, right?

Contributor:

Do you mean a nested ShardedTensor + DTensor? I wasn't aware of that. Could you run the 2-D tests to make sure removing it does not break anything? Thanks!

wz337 (Contributor Author):

Added a TODO here to revisit this. Removing it for now as it doesn't break the test_2d_parallel.py tests.

@wanchaol (Contributor) left a review:

lgtm, thanks for fixing the lint. Have one more comment.


```python
def _print_nested(
    value: STATE_DICT_ITEM,
    padding: str = "",
```

Contributor:

What does padding mean here? I don't see it being used anywhere; shall we remove this arg?

wz337 (Contributor Author):

Seems redundant. Removing it as well. Thanks for pointing it out!
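
For context, here is a hypothetical version of such a nested pretty-printer that does thread an indentation prefix through its recursion; in the PR the argument was simply never used, so dropping it is the right call.

```python
# Hypothetical sketch; not the code from this PR, where the arg was unused.
from collections.abc import Mapping


def _print_nested(value: object, prefix: str = "") -> None:
    """Print a nested container, indenting one tab per nesting level."""
    if isinstance(value, Mapping):
        for key, child in value.items():
            print(f"{prefix}{key}:")
            _print_nested(child, prefix + "\t")
    elif isinstance(value, (list, tuple)):
        for idx, child in enumerate(value):
            print(f"{prefix}[{idx}]:")
            _print_nested(child, prefix + "\t")
    else:
        print(f"{prefix}{value!r}")


if __name__ == "__main__":
    _print_nested({"model": {"weight": [1.0, 2.0]}, "step": 3})
```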

@wz337 (Contributor Author) commented on Nov 22, 2022:

@pytorchmergebot merge

pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Nov 22, 2022.
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
… distributed (pytorch#89398)

This PR moves traverse and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.

This is used when flattening nested dicts and flattening sharded tensors.

Docstring and comments will be added in the following PRs.

Test:
```
python3 test/distributed/_tensor/parallel/test_2d_parallel.py
```
and CI
Pull Request resolved: pytorch#89398
Approved by: https://github.com/wanchaol

Labels: ciflow/trunk (Trigger trunk jobs on your pull request), Merged

3 participants