[FSDP] Fix wrapped module changing after ctor #87837

awgu · 2022-10-27T02:19:40Z

Stack from ghstack:

[FSDP] Fix wrapped module changing after ctor #87837
[FSDP] ufmt FSDP test #87812
[FSDP] ufmt /fsdp #87811

Recently, I retired `FlattenParamsWrapper`, which means that FSDP now registers its `FlatParameter` on the wrapped module instead of on the `FlattenParamsWrapper` instance. This is only relevant for `use_orig_params=False`.

If the user changes an FSDP instance's wrapped module after the FSDP constructor has run, then the `FlatParameter` is no longer registered on the wrapped module. This can cause issues for full state dict, which checks whether the `FlatParameter` is currently registered as an early-return condition for `rank0_only=True`.

The solution in this PR is to re-establish the wrapped module in `_lazy_init()`, de-registering the `FlatParameter` from the old wrapped module and re-registering it on the new one. The assumption is that the user does not modify the module structure once `_lazy_init()` has run.

Direct access to the private attribute `_parameters` of `nn.Module` is not ideal, but we already rely on it for the dynamic `FlatParameter` registration. The tradeoff is whether we want an additional `nn.Module` wrapper (`FlattenParamsWrapper`) and use `delattr` plus a singleton list to do the dynamic registration, or whether we want to access `_parameters`. If this becomes a problem, we can work with the Core team on a solution.
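To make the registration problem and the fix concrete, here is a minimal, torch-free sketch. It is not the FSDP implementation: `ToyModule` and `ToyFSDP` are hypothetical stand-ins that model only a `_parameters` registry, and `_registered_on`, `_is_initialized`, and `_flat_param_is_registered()` are illustrative names that do not come from this PR. Only `_lazy_init()`, `_parameters`, and the `_flat_param` key are taken from the discussion.

```python
from typing import Dict, Optional

FLAT_PARAM = "_flat_param"  # key name as it appears in the error message in this thread


class ToyModule:
    """Stand-in for nn.Module that models only the `_parameters` registry."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._parameters: Dict[str, object] = {}


class ToyFSDP:
    """Stand-in for an FSDP instance with use_orig_params=False (assumption)."""

    def __init__(self, module: ToyModule) -> None:
        self.module = module                # the user may reassign this later
        self._flat_param = object()         # placeholder for the FlatParameter
        self._registered_on: Optional[ToyModule] = None  # hypothetical bookkeeping
        self._is_initialized = False
        self._register_flat_param()

    def _register_flat_param(self) -> None:
        # Dynamic registration through the private `_parameters` dict.
        self.module._parameters[FLAT_PARAM] = self._flat_param
        self._registered_on = self.module

    def _flat_param_is_registered(self) -> bool:
        # Models the check that full state dict uses as an early-return condition.
        return self.module._parameters.get(FLAT_PARAM) is self._flat_param

    def _lazy_init(self) -> None:
        if self._is_initialized:
            return
        old = self._registered_on
        if old is not None and old is not self.module:
            # De-register from the old wrapped module, re-register on the new one.
            old._parameters.pop(FLAT_PARAM, None)
            self._register_flat_param()
        self._is_initialized = True


inner = ToyModule("inner")
fsdp = ToyFSDP(inner)

# The user swaps the wrapped module after the constructor has run.
wrapper = ToyModule("wrapper")
fsdp.module = wrapper
stale = fsdp._flat_param_is_registered()   # registration still points at `inner`

fsdp._lazy_init()
fixed = fsdp._flat_param_is_registered()   # registration now follows `fsdp.module`
```

In this toy model, the check fails between the module swap and `_lazy_init()`, which is the window in which the full state dict early return would misfire; after `_lazy_init()` the registry entry has moved to the new wrapped module.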

Differential Revision: D40799962

[ghstack-poisoned]

pytorch-bot · 2022-10-27T02:19:43Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87837


❌ 1 Failures, 1 Pending

As of commit 3ac9c0d:

The following jobs have failed:

linux-bionic-cuda11.7-py3.10-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu)

This comment was automatically generated by Dr. CI and updates every 15 minutes.


test/distributed/fsdp/test_fsdp_misc.py

fegin · 2022-10-27T04:06:11Z

test/distributed/fsdp/test_fsdp_state_dict.py

@@ -220,9 +234,10 @@ def _validate_state_dict_contents(

    @skip_if_lt_x_gpu(2)
    @parametrize("state_dict_type", _UNFLATTENED_STATE_DICT_IMPLS)
-    @parametrize("checkpoint_wrap", ["first", "second", "both"])
+    @parametrize("checkpoint_wrap", ["source", "dest", "both", "source_after_wrap"])


Curious: if we have `both_after_wrap` with `rank0_only_and_offload` set to False, would it be possible to reproduce the stuck-loading error?

"both_after_wrap" with rank0_only_and_offload=False produces a different error:

RuntimeError: Error(s) in loading state_dict for FullyShardedDataParallel: While copying the parameter named "_fsdp_wrapped_module.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param", whose dimensions in the model are torch.Size([100]) and whose dimensions in the checkpoint are torch.Size([100]), an exception occurred : ('CUDA error: invalid argument\nCUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

However, since it is still an error, I added the "both_after_wrap" option to the unit test. We still need to figure out how to reproduce the load error.

torch/distributed/fsdp/fully_sharded_data_parallel.py


zhaojuanmao · 2022-10-27T22:35:54Z

Once we decide to move the checkpoint wrapper to a hook-based implementation, we may want to reconsider whether to support this use case at all, and maybe just error out and say that changing the module structure after FSDP wrapping is not supported.
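For comparison, the "just error out" alternative could look roughly like the sketch below. This is only an illustration of the suggestion, not code from this PR or from FSDP: `ToyFSDP` and `_module_at_ctor` are hypothetical names, and the check is reduced to an identity comparison on the wrapped module.

```python
class ToyFSDP:
    """Hypothetical stand-in that rejects module changes made after wrapping."""

    def __init__(self, module: object) -> None:
        self.module = module
        self._module_at_ctor = module  # hypothetical: remember what was wrapped

    def _lazy_init(self) -> None:
        # Instead of re-registering, refuse to continue if the module was swapped.
        if self.module is not self._module_at_ctor:
            raise RuntimeError(
                "Changing the module structure after FSDP wrapping is not supported"
            )


fsdp = ToyFSDP(object())
fsdp.module = object()  # the user swaps the wrapped module after construction

try:
    fsdp._lazy_init()
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
```

The tradeoff is strictness versus convenience: this version surfaces the unsupported pattern immediately, whereas the approach in this PR silently repairs the registration.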

awgu · 2022-10-27T23:20:32Z

@pytorchbot merge

pytorchmergebot · 2022-10-27T23:37:38Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


github-actions · 2022-10-28T00:43:50Z

Hey @awgu.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.


awgu · 2022-10-28T14:20:43Z

@awgu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


[FSDP] Fix wrapped module changing after ctor

95202fd

[ghstack-poisoned]

awgu requested review from mrshenli, pritamdamania87, zhaojuanmao, rohan-varma, H-Huang and kwen2501 as code owners October 27, 2022 02:19

awgu mentioned this pull request Oct 27, 2022

[FSDP] ufmt /fsdp #87811

Closed

pytorch-bot bot added the release notes: distributed (fsdp) release notes category label Oct 27, 2022

awgu mentioned this pull request Oct 27, 2022

[FSDP] ufmt FSDP test #87812

Closed

awgu added a commit that referenced this pull request Oct 27, 2022

[FSDP] Fix wrapped module changing after ctor

12ad526

ghstack-source-id: 2346ee6752e3b1930356e6462adda990aebb0636 Pull Request resolved: #87837

Update on "[FSDP] Fix wrapped module changing after ctor"

1ece315

[ghstack-poisoned]

awgu added a commit that referenced this pull request Oct 27, 2022

[FSDP] Fix wrapped module changing after ctor

d462ecb

ghstack-source-id: 4595d556ad345c213c38e92c2dbc8a2d2a0c6bf5 Pull Request resolved: #87837

awgu added a commit that referenced this pull request Oct 27, 2022

[FSDP] Fix wrapped module changing after ctor

fd54b48

ghstack-source-id: 963850d8bc3bcf4a8eeff07594b84f6096c992a7 Pull Request resolved: #87837

fegin reviewed Oct 27, 2022


awgu added a commit that referenced this pull request Oct 27, 2022

[FSDP] Fix wrapped module changing after ctor

4bbb447

ghstack-source-id: 44cfbc7cd4c37babdedc151567f91630d55ab903 Pull Request resolved: #87837

awgu requested a review from fegin October 27, 2022 13:56

awgu added a commit that referenced this pull request Oct 27, 2022

[FSDP] Fix wrapped module changing after ctor

2178e28

ghstack-source-id: 6f069b9970df34a028a13bfdbe906f58f5292036 Pull Request resolved: #87837

awgu added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 27, 2022

awgu added a commit that referenced this pull request Oct 27, 2022

[FSDP] Fix wrapped module changing after ctor

4e94e7a

ghstack-source-id: 971ee81a4fba4ffc12cf41d0017932ec1e4c460b Pull Request resolved: #87837

zhaojuanmao approved these changes Oct 27, 2022


pytorchmergebot added the Merged label Oct 28, 2022

pytorchmergebot closed this in 9225f26 Oct 28, 2022

awgu added the topic: improvements topic category label Oct 28, 2022

This was referenced Oct 28, 2022

[FSDP] New fix for composing with other module wrappers #87950

Closed

[AC] Add trailing "." to _CHECKPOINT_PREFIX like FSDP #87951

Closed

facebook-github-bot deleted the gh/awgu/143/head branch June 8, 2023 15:23
