fsdp support create hybrid-sharded process group for custom backend #100622
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100622
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 2c13398.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
The failed tests in CI seem to be unrelated to this modification:
bash: ./.ci/pytorch/functorch_doc_push_script.sh: No such file or directory
Error: Process completed with exit code 127.
This looks good to me. Thanks!
@pytorchbot merge -s
@pytorchbot merge -r
@pytorchbot successfully started a rebase job. Check the current status here.
Successfully rebased 4d42667 to ac21ab1.
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@medivh-xp I am a bit swamped this week. I will try to find some time to understand the issue when I can.
@pytorchbot drci
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased f438c1c to 10f55ac.
Hi @awgu, I moved the initialization of the device handle to the beginning of FSDP initialization (after deciding the ignored params, since the type of the device handle depends on the managed params). I would appreciate a glance when you have some free time.
From an initial look, my main concern is that we may want to do a broader refactor for this. Otherwise, there is duplicate logic between the new `_init_device_handle()` and the existing functions like `_check_single_device_module()` and `_get_compute_device()`.

It seems to me that `_init_device_handle()` moves the logic from `_check_single_device_module()` and `_get_compute_device()` to be earlier. We should de-duplicate those latter ones if we want to go for this.

- Is it true that the `determined_device` in `_init_device_handle()` is always the return value of `_get_compute_device()`?
- Is the single-device-module check in `_init_device_handle()` the same as `_check_single_device_module()`?
If all the parameters of a module are located on the same device, it seems possible to determine the compute device in advance. However, I am a bit confused, because if some parameters of a sub-module are located on `cuda:0` while others are on `cuda:1`, clever auto-wrap logic could make the compute devices of the two modules different (when `device_id` is not specified).

For now, `auto_wrap` (for child modules) depends on the creation of the PG, and the creation of the PG depends on obtaining the number of devices. Therefore, the device type needs to be determined before creating the PG, so that the device count can be obtained through `device_handle.device_count()` (currently obtained through `torch.cuda.device_count()`). This PR therefore determines the device type before initializing the PG (the specific device cannot be determined at that point, since `auto_wrap` has not yet been executed). It seems impossible to determine the compute device before auto-wrapping, as it is unclear which parameters are being managed and whether they may include meta-type parameters.
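To make the ordering concrete, here is a minimal sketch, not the actual FSDP implementation, of how the device *type* might be inferred from the managed parameters before process-group creation. The function name `_init_device_handle_sketch`, its arguments, and the fallback to CUDA are illustrative assumptions:

```python
import torch
import torch.nn as nn

def _init_device_handle_sketch(module: nn.Module, ignored_params: set):
    # Collect the device types of the managed (non-ignored, non-meta) params.
    device_types = {
        p.device.type
        for p in module.parameters()
        if p not in ignored_params and p.device.type != "meta"
    }
    # Assumption: managed params of a module share a single device type.
    assert len(device_types) <= 1, (
        f"Expected a single device type, got {device_types}"
    )
    device_type = device_types.pop() if device_types else "cuda"
    # Return the torch module acting as the device handle, e.g. torch.cuda.
    # A custom backend would expose its own handle here.
    return getattr(torch, device_type)
```

With the handle resolved this early, the PG-creation step can ask it for `device_count()` without assuming CUDA.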
This sounds good to me!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
FSDP creates communication groups for intra-node communication through `dist.new_subgroups`. Previously, `dist.new_subgroups` only supported creation based on the number of CUDA devices. However, issue #99706 removed the availability check for CUDA devices, allowing custom backends to create groups based on the number of custom devices per node.

This PR allows FSDP to explicitly pass the number of devices per node when creating communication groups for intra-node communication, instead of defaulting to the number of CUDA devices.
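As a rough illustration of the change described above (the exact FSDP call site may differ), the intra-node shard group can be sized from the resolved device handle's device count; `device_handle` here is a hypothetical stand-in for whatever handle FSDP resolved:

```python
import torch
import torch.distributed as dist

# Sketch: size the intra-node subgroup from the resolved device handle
# rather than hard-coding torch.cuda.device_count(). For a custom backend,
# device_handle would be that backend's torch module instead of torch.cuda.
device_handle = torch.cuda  # placeholder; assumed resolved earlier by FSDP
intra_node_group, _ = dist.new_subgroups(group_size=device_handle.device_count())
```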