[FSDP][optim_state_dict][6/N] Refactor the optim_state_dict APIs to support hooks #90798
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/90798
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 Failure. As of commit c08d03d, NEW FAILURES - the following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
LGTM, thanks!
@@ -1204,6 +1204,7 @@ def _unflatten_process_groups(
 def _optim_state_dict(
     model: torch.nn.Module,
     optim: torch.optim.Optimizer,
+    optim_state_dict: Dict[str, Any],
noob q: do we expect this to be the vanilla state_dict from `optim.state_dict()` or `named_optim.state_dict()`?
Both are accepted.
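For illustration, a minimal sketch of the two shapes being discussed; the FQN-keyed layout shown for the NamedOptimizer case is an assumption, not the exact format:

```python
import torch

# Minimal sketch contrasting the two optimizer state_dict shapes said to be
# accepted; the FQN-keyed layout below is illustrative, not the exact
# NamedOptimizer format.
model = torch.nn.Linear(4, 4)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
model(torch.randn(2, 4)).sum().backward()
optim.step()

# 1) Vanilla optimizer state_dict: state is keyed by integer parameter ids.
vanilla_osd = optim.state_dict()
print(list(vanilla_osd["state"].keys()))  # e.g. [0, 1]

# 2) NamedOptimizer-style state_dict (assumed layout): state keyed by
#    parameter FQNs such as "weight" / "bias" instead of integer ids.
param_fqns = [name for name, _ in model.named_parameters()]
named_osd = {
    "state": {param_fqns[idx]: s for idx, s in vanilla_osd["state"].items()},
    "param_groups": vanilla_osd["param_groups"],
}
print(list(named_osd["state"].keys()))  # e.g. ['weight', 'bias']
```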
The internal API that is used by all the optim_state_dict implementations.
"""
if full_state_dict:
    FullyShardedDataParallel._raise_on_use_orig_params_optim_checkpoint(
nit: to avoid confusion, might be worth adding a comment here or in the doc of this function to clarify the existing surfaces for which optim state checkpointing works (i.e. the product of use_orig_params, rank0_only, sharded checkpoint, etc.).
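For example, a docstring note along these lines could spell that out; the parameter list and the wording of the support matrix below are placeholders, not the real implementation:

```python
# Hypothetical docstring sketch; the signature and the exact support matrix
# are placeholders to illustrate the suggestion above.
def _optim_state_dict_impl(model, optim, optim_state_dict, full_state_dict, rank0_only):
    """The internal API that is used by all the optim_state_dict implementations.

    Note (illustrative): which optimizer-state checkpointing surfaces are
    supported depends on the combination of ``use_orig_params``,
    ``rank0_only``, and full vs. sharded state dicts; not every combination
    is expected to work, and unsupported combinations should raise.
    """
    raise NotImplementedError("docstring sketch only")
```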
use_orig_params = False
for module in FullyShardedDataParallel.fsdp_modules(model):
    use_orig_params = module._use_orig_params
are we concerned about potential inconsistency here? should we check to ensure the setting is the same for all modules?
Sure, can error out if that's not true.
I changed FSDP to enforce the same `use_orig_params` for all modules in the same tree in #90871.
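For reference, a minimal sketch of the consistency check being discussed, assuming `fsdp_modules` returns every FSDP-wrapped submodule (error message illustrative):

```python
from torch.distributed.fsdp import FullyShardedDataParallel


def _check_use_orig_params_consistent(model):
    # Collect the _use_orig_params setting from every FSDP module in the tree
    # and error out if they disagree (per the discussion above, FSDP now
    # enforces this at construction in #90871).
    settings = {
        module._use_orig_params
        for module in FullyShardedDataParallel.fsdp_modules(model)
    }
    if len(settings) > 1:
        raise ValueError(
            "use_orig_params differs across FSDP modules in the same tree"
        )
    return settings.pop() if settings else False
```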
use_orig_params = False
for module in FullyShardedDataParallel.fsdp_modules(model):
    use_orig_params = module._use_orig_params
    break
looks like we just take the setting of the first module - should we just do `use_orig = next(FSDP.fsdp_modules(model)).use_orig_params`?
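One caveat with that one-liner (assuming `fsdp_modules` returns a plain list): `next()` needs an `iter()` wrapper, and the no-FSDP-module case needs a fallback, e.g.:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def _first_use_orig_params(model):
    # fsdp_modules(model) returns a list, so index it (or wrap it in iter()
    # before calling next()); fall back to False if there are no FSDP modules.
    fsdp_mods = FSDP.fsdp_modules(model)
    return fsdp_mods[0]._use_orig_params if fsdp_mods else False
```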
)

@staticmethod
def _optim_state_dict_to_load_impl(
nit: just for consistency, it might be good to add a docstring here similar to the above API's.
    True,
    use_orig_params,
)
if is_named_optimizer:
might be useful to have a small comment here saying that `NamedOptimizer` expects the keys to be FQNs, unlike regular optimizers.
Will add this in the next PR.
Nice!
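For illustration, the kind of inline note being asked for might read like this; the surrounding names mirror the diff above and the function is a sketch, not the real implementation:

```python
# Sketch only: illustrates where the suggested comment would go.
def _optim_state_dict_to_load_sketch(optim_state_dict, is_named_optimizer):
    if is_named_optimizer:
        # NamedOptimizer expects the state dict keys to be parameter FQNs,
        # unlike regular optimizers, which key state by integer parameter ids.
        return optim_state_dict
    return optim_state_dict
```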
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: Command
Details for Dev Infra team: Raised by workflow job
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 additional jobs have failed; the first few of them are: trunk, trunk / cuda11.6-py3.10-gcc7-sm86 / test (default, 3, 4, linux.g5.4xlarge.nvidia.gpu).
Details for Dev Infra team: Raised by workflow job
@pytorchbot merge -f "The failing test is not related"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
[FSDP][optim_state_dict][6/N] Refactor the optim_state_dict APIs to support hooks (pytorch#90798) **What does this PR do?** This PR splits the FSDP optim_state_dict APIs into common implementation parts that are shared across the different frontend APIs (we have many now and will consolidate them gradually). This PR also adds `_optim_state_dict_post_hook` and `_load_optim_state_dict_pre_hook` for the integration with `NamedOptimizer`. Pull Request resolved: pytorch#90798 Approved by: https://github.com/rohan-varma, https://github.com/awgu
Stack from ghstack (oldest at bottom):
What does this PR do?
This PR splits the FSDP optim_state_dict APIs into common implementation parts that are shared across the different frontend APIs (we have many now and will consolidate them gradually). This PR also adds `_optim_state_dict_post_hook` and `_load_optim_state_dict_pre_hook` for the integration with `NamedOptimizer`.
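To make the hook-based integration concrete, here is a rough, hypothetical sketch of how a NamedOptimizer-style wrapper might call such hooks. The hook names come from this PR, but their signatures and the wiring shown are assumptions, not the actual implementation:

```python
from typing import Any, Dict

import torch


def _optim_state_dict_post_hook(model, optim, osd):
    # Placeholder with an assumed signature; the real hook is added to FSDP
    # by this PR and would reshard / rename the state as needed.
    return osd


def _load_optim_state_dict_pre_hook(model, optim, osd):
    # Placeholder with an assumed signature; the real hook is added to FSDP
    # by this PR and would convert the incoming state dict into the form the
    # wrapped optimizer expects.
    return osd


class NamedOptimizerSketch:
    """Hypothetical wrapper illustrating the hook integration; not the real
    NamedOptimizer."""

    def __init__(self, model: torch.nn.Module, optim: torch.optim.Optimizer):
        self.model = model
        self.optim = optim

    def state_dict(self) -> Dict[str, Any]:
        # Assumed flow: produce the plain optimizer state dict, then let the
        # post-hook transform it (e.g. into an FQN-keyed, FSDP-aware form).
        return _optim_state_dict_post_hook(
            self.model, self.optim, self.optim.state_dict()
        )

    def load_state_dict(self, osd: Dict[str, Any]) -> None:
        # Assumed flow: let the pre-hook convert the incoming state dict
        # before handing it to the wrapped optimizer.
        self.optim.load_state_dict(
            _load_optim_state_dict_pre_hook(self.model, self.optim, osd)
        )
```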