[FSDP][1/N] Move wrapper `ModuleWrapPolicy` to new path #104346

awgu · 2023-06-28T14:41:38Z

Stack from ghstack (oldest at bottom):

This PR is the first in a series refactoring auto wrapping; it affects only `ModuleWrapPolicy` for the wrapper `FullyShardedDataParallel`. The end goal is to improve the auto wrapping infra to support:

- Checking valid frozen parameters (uniform frozenness per FSDP instance)
- Checking valid shared parameters (shared parameters assigned to their lowest-common-ancestor module or higher)
- Writing auto wrapping policies that may take multiple passes over the module tree
- Specifying different FSDP kwargs per FSDP instance (instead of enforcing the same for all FSDP instances constructed via an auto wrap policy)

The way I envision achieving this is to decouple the actual "wrapping" (which is `_post_order_apply()` in this PR) from constructing the wrapping targets and kwargs (which is `target_module_to_kwargs` in this PR). That way, a policy reduces to just constructing the `target_module_to_kwargs` mapping.
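As a rough illustration of that split, here is a minimal pure-Python sketch. It is not the code in this PR: `ToyModule`, `Block`, `Wrapped`, `build_target_module_to_kwargs()`, and the `sharding` kwarg are hypothetical stand-ins, and `post_order_apply()` only mirrors the idea behind `_post_order_apply()` (children are handled before their parent), not its actual signature.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Set


class ToyModule:
    """Hypothetical stand-in for nn.Module: a named node with ordered children."""

    def __init__(self, name: str, children: Optional[List["ToyModule"]] = None):
        self.name = name
        self.children = list(children or [])


class Block(ToyModule):
    """Hypothetical module class that the policy targets."""


class Wrapped(ToyModule):
    """Hypothetical stand-in for a wrapper placed around one module."""

    def __init__(self, inner: ToyModule, **kwargs: Any):
        super().__init__(f"Wrapped({inner.name})", [inner])
        self.kwargs = kwargs


def build_target_module_to_kwargs(
    root: ToyModule, module_classes: Iterable[type], root_kwargs: Dict[str, Any]
) -> Dict[ToyModule, Dict[str, Any]]:
    """Pass 1 (the policy): decide which modules to wrap and with which kwargs."""
    classes = tuple(module_classes)
    targets: Dict[ToyModule, Dict[str, Any]] = {}
    stack = [root]
    while stack:
        module = stack.pop()
        if isinstance(module, classes):
            targets[module] = dict(root_kwargs)
        stack.extend(module.children)
    return targets


def post_order_apply(
    root: ToyModule, fn_to_apply: Callable[[ToyModule], Optional[ToyModule]]
) -> None:
    """Pass 2 (the wrapping): visit children before parents and swap in
    whatever fn_to_apply returns. The root itself is not replaced here."""
    visited: Set[ToyModule] = set()

    def visit(module: ToyModule) -> None:
        visited.add(module)
        for i, child in enumerate(module.children):
            if child in visited:
                continue
            visit(child)
            replacement = fn_to_apply(child)
            if replacement is not None:
                module.children[i] = replacement

    visit(root)


root = ToyModule("root", [Block("outer", [Block("inner")]), ToyModule("head")])
targets = build_target_module_to_kwargs(root, {Block}, {"sharding": "full"})
order: List[str] = []


def wrap_if_target(module: ToyModule) -> Optional[ToyModule]:
    if module in targets:
        order.append(module.name)
        return Wrapped(module, **targets[module])
    return None


post_order_apply(root, wrap_if_target)
```

In this sketch the wrapping pass knows nothing about why a module was chosen; it only consults the mapping, which is what would let a policy take several passes over the tree before any wrapping happens.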

I do not personally recommend the size-based policy, but if we wanted to implement that under this new organization, the tracking of wrapped/nonwrapped numel should be done in the pass over the module tree prior to the actual "wrapping". This modularization keeps the actual "wrapping" part simple.
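To make the size-based case concrete, the following is a simplified sketch of a pre-pass that tracks non-wrapped numel and only produces the set of targets. It is an assumption-laden toy, not the existing size-based policy: `SizedModule`, `select_targets_by_size()`, and `min_numel` are hypothetical names, and modules are identified by name for brevity.

```python
from typing import List, Optional, Set


class SizedModule:
    """Hypothetical stand-in for nn.Module that only tracks parameter counts."""

    def __init__(
        self,
        name: str,
        own_numel: int,
        children: Optional[List["SizedModule"]] = None,
    ):
        self.name = name
        self.own_numel = own_numel  # numel of parameters owned directly
        self.children = list(children or [])


def select_targets_by_size(root: SizedModule, min_numel: int) -> Set[str]:
    """Return names of modules whose not-yet-wrapped numel reaches min_numel."""
    targets: Set[str] = set()

    def nonwrapped_numel(module: SizedModule) -> int:
        # Children first, so a parent only counts what its subtree left unwrapped.
        total = module.own_numel + sum(nonwrapped_numel(c) for c in module.children)
        if total >= min_numel:
            targets.add(module.name)
            return 0  # this subtree is now fully accounted for by a wrap
        return total

    nonwrapped_numel(root)
    return targets


tree = SizedModule(
    "root",
    10,
    [
        SizedModule("encoder", 0, [SizedModule("enc.0", 600), SizedModule("enc.1", 300)]),
        SizedModule("head", 250),
    ],
)
size_targets = select_targets_by_size(tree, min_numel=500)
```

All numel bookkeeping lives in this pass, so the later wrapping step would not need to carry any size state.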

The change to how `old_dtype` is handled is mainly to avoid keeping a reference to the `_override_module_mixed_precision()` function closure in each hook and to allow the function to take in all module classes at once and return which ones actually got overridden for the downstream error message. (We can directly store the global state as a mapping.)

To-do in follow-ups (not in order):

- Add frozen parameter check before `_post_order_apply()`
- Add shared parameter check before `_post_order_apply()`
- Expose wrapping policy that allows per module / per module class kwarg customization (where any unspecified kwarg adopts the root's kwarg)
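The last follow-up item is not implemented in this PR. One way to read "any unspecified kwarg adopts the root's kwarg" is a plain dictionary merge, sketched below. `resolve_kwargs()`, `class_to_overrides`, and the `Attention` / `MLP` classes are hypothetical; the kwarg names are borrowed from `FullyShardedDataParallel` purely for illustration.

```python
from typing import Any, Dict


class Attention:
    """Hypothetical module class with no overrides."""


class MLP:
    """Hypothetical module class with one overridden kwarg."""


def resolve_kwargs(
    module: object,
    root_kwargs: Dict[str, Any],
    class_to_overrides: Dict[type, Dict[str, Any]],
) -> Dict[str, Any]:
    """Start from the root's kwargs and let per-class overrides win."""
    overrides = class_to_overrides.get(type(module), {})
    return {**root_kwargs, **overrides}


root_kwargs = {"use_orig_params": True, "limit_all_gathers": True}
class_to_overrides = {MLP: {"limit_all_gathers": False}}

attn_kwargs = resolve_kwargs(Attention(), root_kwargs, class_to_overrides)
mlp_kwargs = resolve_kwargs(MLP(), root_kwargs, class_to_overrides)
```

A mapping built this way has the same shape as `target_module_to_kwargs`, so the wrapping pass would not need to change.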

[ghstack-poisoned]

pytorch-bot · 2023-06-28T14:41:44Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/104346


✅ No Failures

As of commit ba95ffc:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: eafbb91bf2339aeea230b92a0356e52003c5ce7a Pull Request resolved: #104346

[ghstack-poisoned]

ghstack-source-id: 3c7ff44f030362717f1ae2fbc2b4cd9014ebedc6 Pull Request resolved: #104346

The change to how `old_dtype` is handled is mainly to avoid keeping a reference to the `_override_module_mixed_precision()` function closure in each hook. We can directly store the global state as a mapping.

To-do in follow-ups (not in order):

- Add frozen parameter check before `_post_order_apply()`
- Add shared parameter check before `_post_order_apply()`
- Expose wrapping policy that allows per module / per module class kwarg customization (where any unspecified kwarg adopts the root's kwarg)
- Refactor `fully_shard()` auto wrap to unify with `FullyShardedDataParallel` auto wrap, where the only difference should be `fn_to_apply` in `_post_order_apply()`
  - This means that for `fully_shard()`'s auto wrap, it will call `fully_shard()` on the target submodules, constructing a new `_FSDPState` object for each just like for the wrapper path.
  - This change prohibits extensions like non-module-aligned wrapping, but it allows for unifying the code paths to decrease the likelihood of bugs. I do not foresee us pursuing non-module-aligned wrapping in the near term.
  - After this change, we can then revisit the `ignored_states` with auto wrapping fix and land that without changing `_unshard_params()`.

[ghstack-poisoned]


rohan-varma

LGTM

rohan-varma · 2023-07-06T22:22:06Z

torch/distributed/fsdp/_utils.py

+                # NOTE: If the forward did not have any floating-point tensors,
+                # then the dtype will not be set for this module, and we do not
+                # upcast the dtype.
+                if module in _MODULE_TO_INP_DTYPE:


So `_MODULE_TO_INP_DTYPE` generalizes `old_dtype` to be on a per-module basis?

torch/distributed/fsdp/wrap.py

wanchaol

Meta comment on this PR: is this really a "move wrapper policy to the new path"? The PR added a lot of new logic, e.g. the new `_post_order_apply()`. Maybe the PR name should be more descriptive?

awgu · 2023-07-06T23:06:39Z

> Meta comment on this PR: is this really a "move wrapper policy to the new path"? The PR added a lot of new logic, e.g. the new `_post_order_apply()`. Maybe the PR name should be more descriptive?

Good point! I wonder how we can fit more info given the title character limit though :/

torch/distributed/fsdp/wrap.py


ghstack-source-id: 158dc95b1093450c647296ee05ef13192ed67fc9 Pull Request resolved: #104346

[FSDP][1/N] Move wrapper ModuleWrapPolicy to new path

48ffd0d

[ghstack-poisoned]

pytorch-bot bot added release notes: distributed (fsdp) release notes category labels Jun 28, 2023

awgu added a commit that referenced this pull request Jun 28, 2023

[FSDP][1/N] Move wrapper ModuleWrapPolicy to new path

8300f53

ghstack-source-id: eafbb91bf2339aeea230b92a0356e52003c5ce7a Pull Request resolved: #104346

Update on "[FSDP][1/N] Move wrapper ModuleWrapPolicy to new path"

a8fb118

[ghstack-poisoned]

awgu added a commit that referenced this pull request Jun 28, 2023

[FSDP][1/N] Move wrapper ModuleWrapPolicy to new path

ef159b8

ghstack-source-id: 3c7ff44f030362717f1ae2fbc2b4cd9014ebedc6 Pull Request resolved: #104346

awgu added a commit that referenced this pull request Jun 28, 2023

[FSDP][1/N] Move wrapper ModuleWrapPolicy to new path

645bdb8

ghstack-source-id: 892168e233a9e74b7750227c086cb4689ab23572 Pull Request resolved: #104346

This was referenced Jun 28, 2023

[FSDP] Annotate modules for fully_shard #104363

Closed

[FSDP][2/N][Easy] Prepare _auto_wrap for fully_shard #104407

Closed

[FSDP][3/N] Unify fully_shard auto wrap #104408

Closed

[FSDP][4/N] Remove _get_fully_sharded_module_to_states #104409

Closed

awgu added a commit to awgu/pytorch that referenced this pull request Jun 29, 2023

[FSDP][1/N] Move wrapper ModuleWrapPolicy to new path

74b6656

ghstack-source-id: 892168e233a9e74b7750227c086cb4689ab23572 Pull Request resolved: pytorch#104346

awgu added a commit to awgu/pytorch that referenced this pull request Jun 29, 2023

[FSDP][1/N] Move wrapper ModuleWrapPolicy to new path

48a85e8

ghstack-source-id: 304383832c2b6bc8eafbfc7730d8198365049780 Pull Request resolved: pytorch#104346

This was referenced Jun 29, 2023

[FSDP][5/N] Unblock ignored_states + auto wrap (for now) #104418

Closed

[FSDP][6/N] Check valid param freezing for ModuleWrapPolicy #104427

Closed

awgu added the topic: not user facing topic category label Jun 29, 2023

awgu added a commit to awgu/pytorch that referenced this pull request Jun 29, 2023

[FSDP][1/N] Move wrapper ModuleWrapPolicy to new path

8836487

ghstack-source-id: b62d622a3c0745bc3149202a81ac730a7aefe995 Pull Request resolved: pytorch#104346

awgu added a commit to awgu/pytorch that referenced this pull request Jun 30, 2023

[FSDP][1/N] Move wrapper ModuleWrapPolicy to new path

dc2332f

ghstack-source-id: b62d622a3c0745bc3149202a81ac730a7aefe995 Pull Request resolved: pytorch#104346

awgu added a commit to awgu/pytorch that referenced this pull request Jun 30, 2023

[FSDP][1/N] Move wrapper ModuleWrapPolicy to new path

a296f1c

ghstack-source-id: 6600d7ad0d44834537abefee871178fd3cdd6ff7 Pull Request resolved: pytorch#104346

awgu added a commit to awgu/pytorch that referenced this pull request Jul 5, 2023

[FSDP][1/N] Move wrapper ModuleWrapPolicy to new path

600332f

ghstack-source-id: 6600d7ad0d44834537abefee871178fd3cdd6ff7 Pull Request resolved: pytorch#104346

awgu added a commit to awgu/pytorch that referenced this pull request Jul 5, 2023

[FSDP][1/N] Move wrapper ModuleWrapPolicy to new path

7150aca

ghstack-source-id: 6600d7ad0d44834537abefee871178fd3cdd6ff7 Pull Request resolved: pytorch#104346

awgu marked this pull request as ready for review July 5, 2023 15:35

awgu requested review from mrshenli and zhaojuanmao as code owners July 5, 2023 15:35

awgu requested review from rohan-varma, H-Huang, kwen2501, wanchaol, fegin, fduwjj, kiukchung and d4l3k as code owners July 5, 2023 15:35

This was referenced Jul 6, 2023

SetVariable in dynamo #103205

Closed

[WIP] Living branch / PR for FSDP development #103711

Closed

Migrate tuple(handle) -> handle #104488

Closed

rohan-varma approved these changes Jul 6, 2023

wanchaol reviewed Jul 6, 2023

fegin reviewed Jul 7, 2023

fegin reviewed Jul 7, 2023

fegin approved these changes Jul 7, 2023

awgu added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 7, 2023

awgu added 2 commits July 7, 2023 14:42

pytorchmergebot added the Merged label Jul 8, 2023

pytorchmergebot closed this in d58f75b Jul 8, 2023

facebook-github-bot deleted the gh/awgu/409/head branch July 11, 2023 14:16

voznesenskym pushed a commit that referenced this pull request Jul 19, 2023

[FSDP][1/N] Move wrapper ModuleWrapPolicy to new path

8974b9b

ghstack-source-id: 158dc95b1093450c647296ee05ef13192ed67fc9 Pull Request resolved: #104346

voznesenskym pushed a commit that referenced this pull request Jul 21, 2023

[FSDP][1/N] Move wrapper ModuleWrapPolicy to new path

6d580aa

ghstack-source-id: 158dc95b1093450c647296ee05ef13192ed67fc9 Pull Request resolved: #104346

voznesenskym pushed a commit that referenced this pull request Aug 7, 2023

[FSDP][1/N] Move wrapper ModuleWrapPolicy to new path

7995e77

ghstack-source-id: 158dc95b1093450c647296ee05ef13192ed67fc9 Pull Request resolved: #104346
