fsdp support create hybrid-sharded process group for custom backend #100622
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100622
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 2c13398.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
The failed tests in CI seem to be unrelated to this modification:
bash: ./.ci/pytorch/functorch_doc_push_script.sh: No such file or directory
Error: Process completed with exit code 127.
This looks good to me. Thanks!
@pytorchbot merge -s
@pytorchbot merge -r
@pytorchbot successfully started a rebase job. Check the current status here.
Successfully rebased 4d42667 to ac21ab1.
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@medivh-xp I am a bit swamped this week. I will try to find some time to understand the issue when I can.
@pytorchbot drci
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased f438c1c to 10f55ac.
Hi @awgu, I moved the initialization of the device handle to the beginning of FSDP initialization (after deciding the ignored params, since the type of the device handle depends on the managed params). I would appreciate a glance when you have some free time.
From an initial look, my main concern is that we may want to do a broader refactor for this. Otherwise, there is duplicate logic between the new `_init_device_handle()` and the existing functions like `_check_single_device_module()` and `_get_compute_device()`.

It seems to me that `_init_device_handle()` moves the logic from `_check_single_device_module()` and `_get_compute_device()` to be earlier. We should de-duplicate those latter ones if we want to go for this.

- Is it true that the `determined_device` in `_init_device_handle()` is always the return value of `_get_compute_device()`?
- Is the single-device-module check in `_init_device_handle()` the same as `_check_single_device_module()`?
If all the parameters of a module are located on the same device, it seems possible to determine the compute device in advance. However, I am a bit confused, because if some parameters of a sub-module are located on `cuda:0` while others are on `cuda:1`, clever auto-wrap logic could make the compute devices of the two modules different (when `device_id` is not specified).

For now, `auto_wrap` (for child modules) depends on the creation of the PG, and the creation of the PG depends on obtaining the number of devices. Therefore, the device type needs to be determined before creating the PG, so that the device count can be obtained through `device_handle.device_count()` (currently obtained through `torch.cuda.device_count()`). This PR therefore determines the device type before initializing the PG (the specific device cannot be determined at that point, since `auto_wrap` has not yet been executed). It seems impossible to determine the compute device before auto-wrapping, as it is unclear which parameters are being managed and whether they may include meta-type parameters.
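To make the ordering concrete, here is a minimal sketch, not the actual FSDP implementation, of how the device *type* might be inferred from the managed parameters before process-group creation. The function name `_init_device_handle_sketch`, its arguments, and the fallback to CUDA are illustrative assumptions:

```python
import torch
import torch.nn as nn

def _init_device_handle_sketch(module: nn.Module, ignored_params: set):
    # Collect the device types of the managed (non-ignored, non-meta) params.
    device_types = {
        p.device.type
        for p in module.parameters()
        if p not in ignored_params and p.device.type != "meta"
    }
    # Assumption: managed params of a module share a single device type.
    assert len(device_types) <= 1, (
        f"Expected a single device type, got {device_types}"
    )
    device_type = device_types.pop() if device_types else "cuda"
    # Return the torch module acting as the device handle, e.g. torch.cuda.
    # A custom backend would expose its own handle here.
    return getattr(torch, device_type)
```

With the handle resolved this early, the PG-creation step can ask it for `device_count()` without assuming CUDA.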
This sounds good to me!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
FSDP creates communication groups for intra-node communication through `dist.new_subgroups`. Previously, `dist.new_subgroups` only supported creation based on the number of CUDA devices. However, issue #99706 removed the availability check for CUDA devices, allowing custom backends to create groups based on the number of custom devices per node.

This PR allows FSDP to explicitly pass the number of devices per node when creating communication groups for intra-node communication, instead of defaulting to the number of CUDA devices.
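As a rough illustration of the change described above (the exact FSDP call site may differ), the intra-node shard group can be sized from the resolved device handle's device count; `device_handle` here is a hypothetical stand-in for whatever handle FSDP resolved:

```python
import torch
import torch.distributed as dist

# Sketch: size the intra-node subgroup from the resolved device handle
# rather than hard-coding torch.cuda.device_count(). For a custom backend,
# device_handle would be that backend's torch module instead of torch.cuda.
device_handle = torch.cuda  # placeholder; assumed resolved earlier by FSDP
intra_node_group, _ = dist.new_subgroups(group_size=device_handle.device_count())
```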