
Get rid of dim_groups attribute from DeviceMesh #103105

Closed · wants to merge 8 commits

Conversation

@wanchaol (Contributor) commented Jun 6, 2023

Stack from ghstack (oldest at bottom):

This PR gets rid of the dim_groups attribute from DeviceMesh. The main motivation is that c10d should store the process groups created for the mesh, rather than DeviceMesh; DeviceMesh should just handle ranks correctly.

This could make DTensor picklable! (torch.save/torch.load could become possible.) I will give it a try in the next PR.
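
For reference, a minimal sketch of the idea (not the PR's actual code; `_GROUP_REGISTRY`, `_create_group`, and `_find_group` below are hypothetical stand-ins for c10d's internal group bookkeeping):

```python
import torch.distributed as dist

# Hypothetical stand-in for c10d's internal group registry.
_GROUP_REGISTRY = {}

def _create_group(ranks):
    pg = dist.new_group(ranks=ranks)   # c10d creates and owns the group
    _GROUP_REGISTRY[tuple(ranks)] = pg
    return pg

def _find_group(ranks):
    return _GROUP_REGISTRY[tuple(ranks)]

class DeviceMesh:
    def __init__(self, device_type, dim_rank_lists):
        self.device_type = device_type
        # Only plain rank lists are stored on the mesh (one per mesh
        # dimension); the unpicklable ProcessGroup objects live in c10d.
        self._dim_rank_lists = dim_rank_lists
        for ranks in self._dim_rank_lists:
            _create_group(ranks)

    def get_dim_group(self, dim):
        # Resolve the ProcessGroup from c10d on demand instead of
        # caching it in a dim_groups attribute.
        return _find_group(self._dim_rank_lists[dim])
```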

@pytorch-bot bot commented Jun 6, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/103105

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5de352e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@XilunWu (Contributor) left a comment

Yeah, we're starting to make DeviceMesh picklable, thanks for the work!! Some suggestions (I could be wrong). Let me know what you think ;-)

Review comments on torch/distributed/_tensor/device_mesh.py, torch/distributed/tensor/parallel/_utils.py, and torch/distributed/tensor/parallel/fsdp.py (all resolved).
@wanchaol requested a review from XilunWu · June 7, 2023 22:50
@fduwjj (Contributor) left a comment

So basically, if we store the PG as an attribute it makes the mesh unpicklable, but now what we store is List[Tuple[str, List[int]]], so it works. Is this understanding correct?

@XilunWu (Contributor) commented Jun 8, 2023

@fduwjj That is correct. For some reason ProcessGroup is not a picklable object.
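
To make the constraint concrete, a small illustration (assuming an already-initialized multi-rank job; the exact error type and message may vary by backend and version):

```python
import pickle
import torch.distributed as dist

# Assumes e.g. dist.init_process_group("gloo", rank=..., world_size=2)
# has already run on every rank.
pg = dist.new_group(ranks=[0, 1])

# pickle.dumps(pg)  # raises: ProcessGroup objects cannot be pickled

# The plain metadata DeviceMesh stores instead round-trips fine.
dim_group_infos = [("gloo", [0, 1])]   # List[Tuple[str, List[int]]]
restored = pickle.loads(pickle.dumps(dim_group_infos))
assert restored == dim_group_infos
```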

@wanchaol (Contributor, Author) commented Jun 8, 2023

> So basically, if we store the PG as an attribute it makes the mesh unpicklable, but now what we store is List[Tuple[str, List[int]]], so it works. Is this understanding correct?

Yep, it's impossible to pickle a process group. The previous ShardedTensor approach was fragile and not good for state_dict, since it had to add a state_dict hook just for the sake of the process group.
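
For contrast, a hedged sketch of the kind of workaround that caching a ProcessGroup forces (hypothetical code, not the actual ShardedTensor implementation): the group has to be dropped on pickle and re-resolved on unpickle, which is exactly the bookkeeping the rank-list approach avoids.

```python
import torch.distributed as dist

class GroupCachingMesh:
    """Hypothetical example of the fragile pattern: caching the
    ProcessGroup forces custom pickle/state_dict handling."""

    def __init__(self, ranks):
        self.ranks = ranks
        self.pg = dist.new_group(ranks=ranks)  # unpicklable attribute

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["pg"]          # must drop the group before pickling
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Must re-resolve the group after unpickling; getting this right
        # across ranks and backends is what made the old approach fragile.
        self.pg = dist.new_group(ranks=self.ranks)
```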

@XilunWu (Contributor) left a comment

LGTM!

@fduwjj (Contributor) left a comment

LGTM, hopefully this will not add too much perf overhead.

@wanchaol (Contributor, Author) commented Jun 8, 2023

> LGTM, hopefully this will not add too much perf overhead.

Hmm, what do you mean by perf overhead here? I don't think this would have any perf implications.

@wanchaol added the release notes: distributed (dtensor) label · June 9, 2023
@facebook-github-bot deleted the gh/wanchaol/323/head branch · June 12, 2023 14:16