[FSDP] Enable mixed hybrid/non-hybrid sharding strategies #90846
Conversation
🔗 Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/90846. Note: links to docs will display an error until the doc builds have completed. ✅ No failures as of commit c814a10. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
For hybrid sharding strategies, we only need to enforce the same process groups across the instances that use a hybrid sharding strategy, not across all instances. We can even mix and match the two different hybrid sharding strategies. This PR relaxes the validation to support this.
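The relaxed check can be illustrated with a small self-contained sketch. This is not the actual FSDP implementation: the `ShardingStrategy` stand-in, `validate_hybrid_pgs`, and the tuple-of-groups representation are all hypothetical, chosen only to show the idea that non-hybrid instances are exempt from the process-group check while the two hybrid strategies may be mixed.

```python
from enum import Enum, auto

class ShardingStrategy(Enum):
    """Hypothetical stand-in for torch.distributed.fsdp.ShardingStrategy."""
    FULL_SHARD = auto()
    SHARD_GRAD_OP = auto()
    NO_SHARD = auto()
    HYBRID_SHARD = auto()
    _HYBRID_SHARD_ZERO2 = auto()

# Both hybrid strategies count as "hybrid" for process-group validation
HYBRID_STRATEGIES = {
    ShardingStrategy.HYBRID_SHARD,
    ShardingStrategy._HYBRID_SHARD_ZERO2,
}

def validate_hybrid_pgs(instances):
    """Require that every *hybrid* instance uses the same
    (intra-node, inter-node) process-group pair. Non-hybrid
    instances are skipped, and the two hybrid strategies may mix."""
    hybrid = [pgs for strategy, pgs in instances if strategy in HYBRID_STRATEGIES]
    for pgs in hybrid[1:]:
        if pgs != hybrid[0]:
            raise ValueError(
                "All FSDP instances using a hybrid sharding strategy "
                "must share the same process groups"
            )

# Mixing the two hybrid strategies is fine as long as the groups match,
# and non-hybrid instances are not constrained at all:
validate_hybrid_pgs([
    (ShardingStrategy.HYBRID_SHARD, ("intra_pg", "inter_pg")),
    (ShardingStrategy._HYBRID_SHARD_ZERO2, ("intra_pg", "inter_pg")),
    (ShardingStrategy.FULL_SHARD, None),  # skipped by the check
])
```

Under the stricter pre-PR validation, the sketch would instead compare process groups across every instance, which is what this change relaxes.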
Apologies for the preemptive stamp - coming back for a full review :)
import sys
from collections import Counter
Could we have the formatting changes separated out into a different PR? Mixing formatting changes with logical changes makes it harder for reviewers to identify the critical parts of the PR to review.
Sorry about that. I separated it out. Eventually, we should try to have all developers use ufmt so that we can run ufmt on files right before pushing. Converging on a single formatter removes the uncertainty about how code should be formatted, and we chose ufmt since PyTorch recommends it.
Curious about how we will encourage this practice. lintrunner is today's automated tool that can run pre-commit and is enforced by PyTorch CI. Should we work with dev infra / CI folks to add ufmt to lintrunner? Without automation, enforcement of a linting standard is prone to breaking.
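For concreteness, lintrunner is driven by `[[linter]]` entries in a `.lintrunner.toml` file, so adding ufmt would mean adding an entry along these lines. This is an illustrative sketch only: the adapter script path, include patterns, and placeholder syntax shown here are assumptions, not PyTorch's actual configuration.

```toml
[[linter]]
code = 'UFMT'
include_patterns = [
    # hypothetical: limit to FSDP files at first
    'torch/distributed/fsdp/**/*.py',
]
command = [
    'python3',
    # hypothetical adapter script wrapping `ufmt`
    'tools/linter/adapters/ufmt_linter.py',
    '--',
    '@{{PATHSFILE}}',
]
is_formatter = true
```

With an entry like this, `lintrunner -a` could apply the formatting automatically, which addresses the enforcement concern above.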
ghstack-source-id: 54419fe Pull Request resolved: pytorch#90846
I will fix the test failures tomorrow.
ghstack-source-id: b14bebe Pull Request resolved: pytorch#90846
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA: 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack:
- #90874 [FSDP][5/N] Add manual "wrapping" support for `fully_shard`
- #90862 [FSDP][3/N] Move `fsdp_modules(root_only=True)` -> `_get_fsdp_root_states()`
- #90861 [FSDP][2/N] Move `fsdp_modules(root_only=False)` -> `_get_fsdp_states()`
- #90864 [FSDP][Easy] Rename `entry` -> `fsdp_module` to be more descriptive
- #90860 [FSDP][1/N] Add `_get_fsdp_states()`
- #90859 [FSDP][Easy] Use `run_subtests` for hybrid shard test
- #90840 [FSDP][BE] Remove `_module_to_handles`, `HandleConfig`; use term "fqn"; clarify docs

In the context of hybrid sharding strategies, we only need to enforce the same process groups among the instances using a hybrid sharding strategy, not all instances. We can even mix and match the two different hybrid sharding strategies. This PR relaxes the validation to support this.