
[FSDP][optim_state_dict][3/N] Support use_orig_param optim_state_dict (non-broadcast version) #89900

Closed
wants to merge 13 commits

Conversation

@fegin (Contributor) commented Nov 30, 2022

Stack from ghstack (oldest at bottom):

What:
This PR adds optim state_dict support for `use_orig_params` with `rank0_only=False`; `rank0_only` support will be added in a follow-up PR. The design of this PR focuses on simplicity and may not have good performance, especially for optim state_dict loading. Since optim state_dict loading is called only once at the beginning of training, performance is not the major concern.
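For context, a minimal usage sketch of what this feature enables (the toy model, optimizer, and training step are illustrative; `FSDP.optim_state_dict` is the API this stack is building toward, so treat the exact call as an assumption):

```python
# Sketch: gathering an optimizer state dict with use_orig_params=True and
# rank0_only=False semantics (every rank receives the full, unflattened dict).
# Assumes torch.distributed is already initialized, e.g. via torchrun.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(nn.Linear(8, 8).cuda(), use_orig_params=True)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

# One step so the optimizer actually has state (exp_avg, exp_avg_sq, step).
model(torch.randn(4, 8, device="cuda")).sum().backward()
optim.step()

# Every rank gets the consolidated optimizer state keyed by original FQNs.
osd = FSDP.optim_state_dict(model, optim)
```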

@pytorch-bot (bot) commented Nov 30, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89900

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failure

As of commit dc15a9c:

The following jobs failed, but the failures were likely due to broken trunk (merge base 41c3b41):

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot (bot) added the release notes: distributed (fsdp) label Nov 30, 2022
fegin added a commit that referenced this pull request Nov 30, 2022
… (non-broadcast version)

ghstack-source-id: abf34819697920be5c92caac648f28c71368b499
Pull Request resolved: #89900
fegin added a commit that referenced this pull request Nov 30, 2022
… (non-broadcast version)

ghstack-source-id: b419bba2391fc3d589c69fa16f67ac95c846c450
Pull Request resolved: #89900
fegin added a commit that referenced this pull request Nov 30, 2022
… (non-broadcast version)

ghstack-source-id: 5b1738e2fe4078281c7820cffad0ffd66c8a4988
Pull Request resolved: #89900
fegin added a commit that referenced this pull request Nov 30, 2022
… (non-broadcast version)

ghstack-source-id: b419bba2391fc3d589c69fa16f67ac95c846c450
Pull Request resolved: #89900
fegin added a commit that referenced this pull request Nov 30, 2022
… (non-broadcast version)

ghstack-source-id: 28a4888d06fb4d378e4403264fe6bfcc0ea9a8c9
Pull Request resolved: #89900
@awgu (Contributor) left a comment


Big thanks for working on this and getting through the crazy Cartesian product of code paths!

I made an initial pass and will continue to revisit.

FullyShardedDataParallel._warn_optim_input(optim_input)
using_optim_input = FullyShardedDataParallel._is_using_optim_input(
    optim_input,
    optim,
)
use_orig_params: bool = False
Contributor

Note to self: It looks like we assume that use_orig_params is uniform for a given FSDP tree. We should check for that in _lazy_init() and raise an error as needed.

Member

Agreed, I think we should just enforce this; I can't see a use case where we'd want to support mixing and matching.

Contributor

cc: @rohan-varma We were talking about which FSDP constructor args we assume to be uniform. This is one place evidencing that use_orig_params must be uniform (even if we already intuitively assumed that).
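For illustration, a minimal sketch of the kind of uniformity check being proposed here (a hypothetical standalone helper relying on the private `_use_orig_params` attribute, not the actual `_lazy_init()` code):

```python
# Hypothetical helper: raise if FSDP instances in one module tree disagree on
# use_orig_params. The real check would live inside FSDP's lazy initialization.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def check_uniform_use_orig_params(root_module) -> None:
    settings = {
        fsdp_module._use_orig_params  # private attribute; name is an assumption
        for fsdp_module in FSDP.fsdp_modules(root_module)
    }
    if len(settings) > 1:
        raise ValueError(
            "use_orig_params must be uniform across all FSDP instances in the "
            f"module tree, but found both settings: {sorted(settings)}"
        )
```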

test/distributed/fsdp/test_fsdp_optim_state.py (outdated review thread, resolved)
torch/distributed/fsdp/_optim_utils.py (outdated review thread, resolved)
value = orig_state[state_name]
if not isinstance(value, list) or not torch.is_tensor(value[0]):
    continue
value = torch.concat(value)[: flat_param._numels[param_idx]].reshape(
Contributor

Conceptual: In what case would torch.concat(value) have more numel than flat_param._numels[param_idx] (meaning that we are truncating)?

Contributor Author

Excellent point. The padding shouldn't be visible here. However, I'm not sure that will always hold, since we were discussing the cost of F.pad and may change the implementation. This is just a safety guard. If you believe padding will never be seen, I can remove it.

Contributor

Ah, good catch. Let us keep the trimming logic here. The safeguard is good.
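A self-contained toy illustration of the trimming discussed in this thread (shard sizes and shapes are made up; `flat_param._numels[param_idx]` is reduced to a plain `numel` variable):

```python
# Toy example: per-rank shards of an original parameter's optimizer state may
# carry trailing padding that made the flat parameter divisible by world_size.
# Concatenating and slicing to the true numel drops that padding before reshape.
import torch

orig_shape = (3, 3)   # original parameter shape
numel = 9             # stands in for flat_param._numels[param_idx]

# Two ranks, 5 elements each: 9 real values plus 1 padding element at the end.
shards = [
    torch.arange(0, 5, dtype=torch.float32),
    torch.arange(5, 10, dtype=torch.float32),
]

full = torch.concat(shards)                   # length 10 (includes padding)
restored = full[:numel].reshape(orig_shape)   # padding trimmed, shape restored
assert restored.shape == torch.Size(orig_shape)
```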

torch/distributed/fsdp/_optim_utils.py (outdated review thread, resolved)
param_id_to_param: List[nn.Parameter],
param_to_fqns: Dict[nn.Parameter, List[str]],
fqn_to_fsdp_param_info: Dict[str, FSDPParamInfo],
merge_key: bool = False,
Contributor

Do we need merge_key=True for use_orig_params=True to handle the fact that ranks who do not have any part of an original parameter will not have any optimizer state for that parameter?

Then, we also have merge_key=False to preserve the existing behavior for use_orig_params=False?

Contributor Author

That's correct.
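A small sketch of the key merging in question (purely local stand-in data; in the PR the per-rank dicts come from the collective gather of each rank's local optimizer state):

```python
# With use_orig_params=True, a rank that holds no shard of a parameter also has
# no optimizer state for it, so the set of state keys must be merged (union)
# across ranks instead of assuming every rank sees every key.
from typing import Any, Dict, List


def merge_state_keys(per_rank_states: List[Dict[str, Any]]) -> List[str]:
    """Union of optimizer-state keys seen by any rank."""
    keys: set = set()
    for rank_state in per_rank_states:
        keys.update(rank_state.keys())
    return sorted(keys)


# Rank 0 only shards "net.weight"; rank 1 only shards "net.bias".
gathered = [
    {"net.weight": {"exp_avg": "..."}},
    {"net.bias": {"exp_avg": "..."}},
]
assert merge_state_keys(gathered) == ["net.bias", "net.weight"]
```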

@fegin requested a review from wanchaol as a code owner November 30, 2022 22:56
fegin added a commit that referenced this pull request Nov 30, 2022
… (non-broadcast version)

ghstack-source-id: 6ccdeaf17cba17bc5f7372d04081e8c26ee954e0
Pull Request resolved: #89900
@awgu (Contributor) left a comment


This looks good to me! Hopefully, you can get another review from someone else as well, but it is not strictly necessary.

I left some minor comments.

torch/distributed/fsdp/_optim_utils.py (outdated review thread, resolved)
torch/distributed/fsdp/_optim_utils.py (review thread, resolved)
torch/distributed/fsdp/_optim_utils.py (outdated review thread, resolved)


torch/distributed/fsdp/fully_sharded_data_parallel.py (outdated review thread, resolved)
torch/distributed/fsdp/fully_sharded_data_parallel.py (outdated review thread, resolved)


def _shard_orig_param_state(
    fqn: str,
Contributor

nit: For this function and _gather_orig_param_state, can we specify if fqn is the local FQN (as stored in flat_param._fqns) or the global FQN (which may require prepending a prefix starting from the local FSDP root).

Ugh, we might need to clarify our terminology at some point and just write it down somewhere.

object_list: List[Dict[str, Any]] = [
    {} for _ in range(cast(int, fsdp_state.world_size))
]
dist.all_gather_object(object_list, state_objects)
Contributor

I forget; for the use_orig_params=False case, what collective do we use? all_gather_into_tensor?

Contributor Author

That's correct.
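For reference, a sketch contrasting the two collectives mentioned in this thread (assumes an initialized NCCL process group; tensor sizes are illustrative):

```python
# all_gather_object (used here for use_orig_params=True) moves arbitrary
# picklable Python objects; all_gather_into_tensor (the use_orig_params=False
# path) moves equally sized tensor shards into one flat output tensor.
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
shard_numel = 4

# Object-based gather: each rank contributes a dict of optimizer state objects.
local_state = {"step": 10, "exp_avg": torch.randn(shard_numel)}
object_list = [None for _ in range(world_size)]
dist.all_gather_object(object_list, local_state)

# Tensor-based gather: each rank contributes a fixed-size CUDA shard.
local_shard = torch.randn(shard_numel, device="cuda")
output = torch.empty(world_size * shard_numel, device="cuda")
dist.all_gather_into_tensor(output, local_shard)
```

The object collective is simpler (no padding or dtype bookkeeping) but goes through pickling, which matches the PR's note that simplicity, not performance, is the goal here.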

@rohan-varma (Member) left a comment


LGTM, will let @awgu accept it as well, thanks!

self._test_load_optim_state(
    _ModelClass.NESTED,
    use_multiple_param_groups=False,
    halve_world_size=False,
Member

noob q: are we interested in testing other configs such as halve_world_size=True or any of the other flags?

Contributor Author

More tests will be introduced after we make FSDP.optim_state_dict support all use cases.

is_fsdp_managed = isinstance(param, FlatParameter)
if is_fsdp_managed:
    assert fqns[0] in fqn_to_fsdp_param_info
is_fsdp_managed = fqns[0] in fqn_to_fsdp_param_info
Member

If param is not a FlatParameter, can this ever be True? If not, can we just omit this line, because if it is a FlatParameter, we've already asserted on this being true above?

Contributor Author

Yes, for the use_orig_params case this line is required.
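A restated version of the snippet above, as a small standalone helper, to make the two cases explicit (the import path for FlatParameter follows the layout at the time of this PR and may differ in later releases):

```python
# Equivalent logic to the lines under discussion, written out per case.
from typing import Dict, List

import torch.nn as nn
from torch.distributed.fsdp.flat_param import FlatParameter


def is_param_fsdp_managed(
    param: nn.Parameter,
    fqns: List[str],
    fqn_to_fsdp_param_info: Dict[str, object],
) -> bool:
    if isinstance(param, FlatParameter):
        # use_orig_params=False: the optimizer holds FlatParameters, which must
        # always be registered in the FQN map.
        assert fqns[0] in fqn_to_fsdp_param_info
        return True
    # use_orig_params=True: the optimizer holds original nn.Parameters, so the
    # FQN lookup is the only way to tell whether FSDP manages this parameter.
    return fqns[0] in fqn_to_fsdp_param_info
```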

) -> Dict[str, Any]:
    """
    Gather the optimizer state for the original parameter with the name ``fqn``.
    This API should only be used when ``use_orig_params`` is True.
Member

would it be valuable to check this with an assert?

    torch.cuda.device_count(),
    cast(dist.ProcessGroup, fsdp_state.process_group),
)
value = value.cpu()
Member

curious: do we always CPU offload?

Contributor Author

Yes, we do in the original code path. This is an inconsistency between state_dict and optim_state_dict. I had a PR to fix this, but it never landed. I will reintroduce it later when I complete optim_state_dict().


you know why this API exists and how this API works.

Returns the optimizer state. The state will be sharded or consolidated
based on ``state_dict_type`` set by :meth:`set_state_dict_type` or
Member

do we couple model and optimizer state dict types for now, and is there interest in changing this?

Contributor Author

We have not enforced this yet, but it will be enforced when optim_state_dict supports all use cases.
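For reference, a sketch of the intended coupling, where a single `set_state_dict_type` call governs both the model and the optimizer state dict (the optimizer side following this setting is the end state described above, so treat it as an assumption rather than what this PR already enforces):

```python
# Sketch: one state_dict_type setting for both model and optimizer state dicts.
# Assumes an initialized process group and CUDA devices.
import torch
import torch.nn as nn
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

model = FSDP(nn.Linear(8, 8).cuda(), use_orig_params=True)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

FSDP.set_state_dict_type(
    model,
    StateDictType.FULL_STATE_DICT,
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

model_sd = model.state_dict()                    # consolidated per FULL_STATE_DICT
optim_sd = FSDP.optim_state_dict(model, optim)   # intended to follow the same type
```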

@fegin (Contributor Author) commented Dec 13, 2022

@pytorchbot merge

@pytorch-bot (bot) added the ciflow/trunk label Dec 13, 2022
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: The following mandatory check(s) failed (Rule Distributed):

Dig deeper by viewing the failures on hud


fegin added a commit that referenced this pull request Dec 13, 2022
… (non-broadcast version)

ghstack-source-id: 43777a0dd75d4d26dabb67552dabb1b12c5f03aa
Pull Request resolved: #89900
@fegin added the with-ssh label Dec 13, 2022
@fegin (Contributor Author) commented Dec 13, 2022

@pytorchbot merge -f "The failing test is unrelated."

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@facebook-github-bot deleted the gh/fegin/49/head branch June 8, 2023 17:16