Get rid of dim_groups attribute from DeviceMesh #103105
Conversation
This PR gets rid of the dim_groups attribute from DeviceMesh. The main motivation is that c10d should store the process groups during their creation, not DeviceMesh; DeviceMesh should just handle ranks correctly. This could enable DTensor to become picklable (torch.save/load could be possible), which I will try in the next PR. [ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/103105
Note: links to docs will display an error until the docs builds have completed. ✅ No failures as of commit 5de352e. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Yeah, we're starting to make DeviceMesh picklable, thanks for the work!! Some suggestions (I could be wrong). Let me know what you think ;-)
So basically, if we store a ProcessGroup as an attribute it makes the object unpicklable, but now what we store is List[Tuple[str, List[int]]], so it works. Is this understanding correct?
@fduwjj That is correct. For some reason
Yep, it's impossible to pickle a process group. The previous ShardedTensor approach was fragile and not good for state_dict, as it needed to add a state_dict hook just for the sake of the process group.
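The distinction discussed above can be shown with a minimal stand-alone sketch (not PyTorch's actual code; a `threading.Lock` stands in for a live ProcessGroup handle, and the class and attribute names here are illustrative): an object holding a live handle fails to pickle, while one holding only `List[Tuple[str, List[int]]]` metadata round-trips fine.

```python
import pickle
import threading

class MeshWithHandle:
    """Stores a live, unpicklable handle (stand-in for a ProcessGroup)."""
    def __init__(self, ranks):
        self.ranks = ranks
        self.group = threading.Lock()  # like a ProcessGroup: cannot be pickled

class MeshWithMetadata:
    """Stores only metadata, List[Tuple[str, List[int]]], as in the PR."""
    def __init__(self, dim_group_infos):
        self.dim_group_infos = dim_group_infos  # e.g. [("nccl", [0, 1])]

# Pickling the handle-holding object raises TypeError.
try:
    pickle.dumps(MeshWithHandle([0, 1]))
    handle_ok = True
except TypeError:
    handle_ok = False

# The metadata-only object survives a pickle round trip.
meta = MeshWithMetadata([("nccl", [0, 1]), ("nccl", [2, 3])])
restored = pickle.loads(pickle.dumps(meta))
```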
LGTM!
LGTM, hopefully this will not add too much perf overhead.
Hmmm, what do you mean by perf overhead here? I don't think this would have any perf implications.
Stack from ghstack (oldest at bottom):
This PR gets rid of the dim_groups attribute from DeviceMesh. The main
motivation is that c10d should store the process groups during their
creation, not DeviceMesh; DeviceMesh should just handle ranks correctly.
This could enable DTensor to become picklable (torch.save/load could be
possible), which I will try in the next PR.
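The idea of "let c10d store the process groups, the mesh stores only ranks" can be sketched as follows. This is not the PR's actual implementation: `_GROUP_REGISTRY`, `register_group`, and `DeviceMeshSketch` are hypothetical names, with a plain dict standing in for c10d's internal process-group store. The mesh pickles freely because it holds only tags and ranks, and it re-resolves the live group from the registry on demand after loading.

```python
import pickle

# Hypothetical global registry standing in for c10d's process-group store.
_GROUP_REGISTRY = {}

def register_group(tag, group):
    """Record a live group under a string tag (as c10d would at creation time)."""
    _GROUP_REGISTRY[tag] = group

class DeviceMeshSketch:
    """Holds only picklable metadata: [(tag, ranks), ...] per mesh dimension."""
    def __init__(self, dim_group_infos):
        self.dim_group_infos = dim_group_infos

    def get_dim_group(self, dim):
        # Resolve the live group lazily instead of storing it as an attribute.
        tag, _ranks = self.dim_group_infos[dim]
        return _GROUP_REGISTRY[tag]

# A sentinel object plays the role of a live ProcessGroup.
fake_group = object()
register_group("mesh_dim_0", fake_group)

mesh = DeviceMeshSketch([("mesh_dim_0", [0, 1])])
restored = pickle.loads(pickle.dumps(mesh))  # round-trips: only metadata is stored
```

After unpickling, `restored.get_dim_group(0)` returns the same live group, since the registry, not the mesh, owns it.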