
fsdp support create hybrid-sharded process group for custom backend #100622

Closed
wants to merge 7 commits

Conversation

@medivh-xp (Contributor) commented May 4, 2023

FSDP creates process groups for intra-node communication through dist.new_subgroups. Previously, dist.new_subgroups only supported creating subgroups based on the number of CUDA devices. However, #99706 removed the availability check for CUDA devices, allowing custom backends to create subgroups based on the number of custom devices per node.

This PR allows FSDP to explicitly pass the per-node device count when creating the intra-node process group, instead of defaulting to the number of CUDA devices.
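
As a rough illustration of the intended usage (not the exact code in this PR; `device_handle` here is a stand-in for whatever object exposes the backend's device count):

```python
import torch.distributed as dist

def create_intra_node_group(device_handle):
    # Pass the per-node device count of the current backend explicitly,
    # instead of letting new_subgroups() fall back to torch.cuda.device_count().
    num_devices_per_node = device_handle.device_count()
    # new_subgroups() returns (this rank's subgroup, list of all subgroups).
    intra_node_group, _ = dist.new_subgroups(group_size=num_devices_per_node)
    return intra_node_group
```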

@pytorch-bot (bot) commented May 4, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100622

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2c13398:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the release notes: distributed (fsdp) label May 4, 2023
@medivh-xp changed the title from "fsdp support custom cuda-like device" to "fsdp hybrid shard support custom cuda-like device" May 4, 2023
@medivh-xp changed the title from "fsdp hybrid shard support custom cuda-like device" to "fsdp support create hybrid-sharded group for custom backend" May 15, 2023
@medivh-xp marked this pull request as ready for review May 15, 2023 06:02
@medivh-xp changed the title from "fsdp support create hybrid-sharded group for custom backend" to "fsdp support create hybrid-sharded process group for custom backend" May 15, 2023
@medivh-xp (Contributor, Author)

The failed CI tests seem to be unrelated to this change:

bash: ./.ci/pytorch/functorch_doc_push_script.sh: No such file or directory
Error: Process completed with exit code 127.

@zou3519 added the triaged label May 15, 2023
@awgu (Contributor) left a comment

This looks good to me. Thanks!

@awgu (Contributor) commented May 15, 2023

@pytorchbot merge -s

@awgu (Contributor) commented May 15, 2023

@pytorchbot merge -r

@pytorch-bot added the ciflow/trunk label May 15, 2023
@pytorch pytorch deleted a comment from pytorch-bot bot May 15, 2023
@pytorchmergebot (Collaborator)

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased fsdp_head onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fsdp_head && git pull --rebase)

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@awgu (Contributor) commented May 16, 2023

@medivh-xp I'm a bit swamped this week. I will try to find some time to understand the issue when I can.

@medivh-xp (Contributor, Author)

@pytorchbot drci

@medivh-xp (Contributor, Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased fsdp_head onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fsdp_head && git pull --rebase)

@medivh-xp (Contributor, Author)

> @medivh-xp I'm a bit swamped this week. I will try to find some time to understand the issue when I can.

Hi @awgu, I moved the initialization of the device handle to the beginning of FSDP (after determining ignored params, since the type of the device handle depends on the managed params). I would appreciate a look when you have some free time.

@awgu (Contributor) left a comment

From an initial look, my main concern is that we may want to do a broader refactor for this. Otherwise, there is duplicate logic between the new _init_device_handle() and the existing functions like _check_single_device_module() and _get_compute_device().

It seems to me that _init_device_handle() moves the logic from _check_single_device_module() and _get_compute_device() to be earlier. We should de-duplicate those latter ones if we want to go for this.

  • Is it true that the determined_device in _init_device_handle() is always the return value of _get_compute_device()?
  • Is the single-device-module check in _init_device_handle() the same as _check_single_device_module()?

@medivh-xp (Contributor, Author) commented May 18, 2023

> From an initial look, my main concern is that we may want to do a broader refactor for this. Otherwise, there is duplicate logic between the new _init_device_handle() and the existing functions like _check_single_device_module() and _get_compute_device().
>
> It seems to me that _init_device_handle() moves the logic from _check_single_device_module() and _get_compute_device() to be earlier. We should de-duplicate those latter ones if we want to go for this.
>
>   • Is it true that the determined_device in _init_device_handle() is always the return value of _get_compute_device()?
>   • Is the single-device-module check in _init_device_handle() the same as _check_single_device_module()?

If all the parameters of a module are located on the same device, it seems possible to determine the compute device in advance. However, I am a bit confused: if some parameters of one sub-module are on cuda:0 while those of another are on cuda:1, clever auto-wrap logic could give the two wrapped modules different compute devices (when device_id is not specified).

For now, auto_wrap (for child modules) depends on process group creation, and process group creation depends on knowing the number of devices per node. The device type therefore needs to be determined before creating the process group, so that the device count can be obtained through device_handle.device_count() (currently it is obtained through torch.cuda.device_count()). That is why this PR determines the device type before initializing the process group; the specific compute device cannot be determined at that point, since auto_wrap has not yet been executed.
By the time _check_single_device_module and _get_compute_device are called, auto_wrap has already completed and the managed parameters have been determined and moved to the correct device, so those functions can verify that the model is on a single device and determine the specific compute device. That cannot be achieved at the beginning of FSDP initialization.

It seems impossible to determine the compute device before auto wrapping, since it is unclear which parameters will be managed and whether they may include meta-device parameters.
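
To illustrate the ordering constraint described above, here is a hypothetical sketch of inferring only the device type from the managed (non-ignored) parameters before the process group exists; the helper name and the CUDA fallback are assumptions for illustration, not the exact implementation in this PR:

```python
import torch
import torch.nn as nn

def infer_device_type(module: nn.Module, ignored_params=frozenset()) -> str:
    # Collect the device types of managed (non-ignored) parameters; CPU and
    # meta parameters do not pin down a backend, so they are skipped.
    device_types = {
        p.device.type
        for p in module.parameters()
        if p not in ignored_params and p.device.type not in ("cpu", "meta")
    }
    if len(device_types) > 1:
        raise RuntimeError(
            f"Managed parameters span multiple device types: {sorted(device_types)}"
        )
    # Assumed fallback: default to CUDA when nothing decides the type
    # (e.g. all managed parameters are on CPU or meta).
    return device_types.pop() if device_types else "cuda"

# Only the device type is known at this point; the per-node device count can
# then be queried (torch.cuda.device_count() for CUDA, or the custom backend's
# equivalent) to create the hybrid-sharded process groups. The specific
# compute device is resolved later, after auto_wrap has completed.
```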

@awgu (Contributor) left a comment

This sounds good to me!

Review comment on torch/distributed/fsdp/_init_utils.py (outdated, resolved)
@medivh-xp (Contributor, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels
ciflow/trunk, Merged, open source, release notes: distributed (fsdp), triaged