[FSDP][optim_state_dict] Let optim_state_dict ignore the non-FSDP managed parameters that do not reside on the rank #94129
Conversation
…aged parameters that do not reside on the rank

When FSDP is used with other parallelism (e.g., TorchRec), some parameters that are not managed by FSDP may not reside on all the ranks (TorchRec is model parallelism). When `use_orig_params=True`, FSDP will synchronize the FQNs among ranks. As a result, a rank may get FQNs that it does not actually own. If an FQN belongs to a TorchRec-managed parameter, FSDP has to ignore the parameter state; otherwise FSDP does not know how to store the state. This PR adds the logic to ignore parameters that are not managed by FSDP and do not reside on the rank.

Differential Revision: [D42982778](https://our.internmc.facebook.com/intern/diff/D42982778/)

[ghstack-poisoned]
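For context, here is a rough sketch of the mixed-parallelism setup this change targets. The module names (`MixedParallelModel`, `sparse`, `dense`) are illustrative only, the snippet assumes an initialized process group and an available GPU, and it is not the test added by this PR.

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group(...) has already been called.
class MixedParallelModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for a TorchRec (model-parallel) table: not managed by FSDP,
        # and in a real setup its shard may not reside on every rank.
        self.sparse = nn.EmbeddingBag(100, 8)
        # Dense part wrapped with FSDP; use_orig_params=True keeps the original FQNs.
        self.dense = FSDP(nn.Linear(8, 8), use_orig_params=True)

    def forward(self, indices):
        return self.dense(self.sparse(indices))

model = MixedParallelModel().cuda()
optim = torch.optim.Adam(model.parameters(), lr=1e-2)
# ... train one step ...
# With this change, a rank that receives a synchronized FQN for a non-FSDP
# parameter it does not own skips that entry instead of erroring here:
osd = FSDP.optim_state_dict(model, optim)
```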
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/94129
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 Failure as of commit 4f58406. This comment was automatically generated by Dr. CI and updates every 15 minutes.
LGTM
dist.all_reduce(sparse)
return self.dense(sparse)

models = [FakeMPModel().cuda(), FakeMPModel().cuda()]
Do we need to test wrapping the entire FakeModel with FSDP and telling FSDP to ignore the sparse0 / sparse1? Wondering which approach most accurately simulates the FSDP / TRec integration case.
afaik, wrapping the model this way -- the root module not being an FSDP module but having model parallelism -- is closer to the real use case.
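To illustrate the alternative being discussed, wrapping the whole model and passing the sparse submodules via `ignored_modules` would look roughly like this (a sketch only; `FakeMPModel` here is a minimal stand-in, not the test's actual class, and an initialized process group is assumed):

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Minimal stand-in for the test's FakeMPModel.
class FakeMPModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.sparse0 = nn.Linear(8, 8)  # stand-ins for TorchRec-managed parameters
        self.sparse1 = nn.Linear(8, 8)
        self.dense = nn.Linear(8, 8)

    def forward(self, x):
        return self.dense(self.sparse0(x) + self.sparse1(x))

model = FakeMPModel().cuda()
fsdp_model = FSDP(
    model,
    ignored_modules=[model.sparse0, model.sparse1],  # FSDP will not shard or track these
    use_orig_params=True,
)
```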
# Train one batch and see if optim_state_dict are the same.
batch = torch.rand(5, 8)
batch = torch.rand(5, 8).cuda()
nit: `torch.rand(5, 8, device="cuda")` creates the tensor directly on the device.
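As a concrete illustration of the nit (assuming a literal "cuda" device string rather than a variable):

```python
import torch

batch = torch.rand(5, 8).cuda()          # allocates on CPU, then copies to the GPU
batch = torch.rand(5, 8, device="cuda")  # allocates directly on the GPU
```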
…aged parameters that do not reside on the rank

Pull Request resolved: #94129

ghstack-source-id: 179458431
Differential Revision: [D42982778](https://our.internmc.facebook.com/intern/diff/D42982778/)
@pytorchbot merge -f "The failing test is not related"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
When FSDP is used with other parallelism (e.g., TorchRec), some parameters that are not managed by FSDP may not reside on all the ranks (TorchRec is model parallelism). When `use_orig_params=True`, FSDP will synchronize the FQNs among ranks. As a result, a rank may get FQNs that it does not actually own. If an FQN belongs to a TorchRec-managed parameter, FSDP has to ignore the parameter state; otherwise FSDP does not know how to store the state.

This PR adds the logic to ignore parameters that are not managed by FSDP and do not reside on the rank.
Differential Revision: D42982778
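A conceptual sketch of the ignore rule described above; this is a hypothetical standalone helper for illustration only, not FSDP's actual internal code:

```python
from typing import Any, Dict, Set

def collect_optim_state(
    synced_fqns: Set[str],        # FQNs synchronized across ranks (use_orig_params=True)
    fsdp_managed_fqns: Set[str],  # FQNs of parameters FSDP manages on this rank
    local_fqns: Set[str],         # FQNs of parameters that actually reside on this rank
    optim_state: Dict[str, Any],  # this rank's per-FQN optimizer state
) -> Dict[str, Any]:
    result = {}
    for fqn in synced_fqns:
        if fqn in fsdp_managed_fqns or fqn in local_fqns:
            result[fqn] = optim_state.get(fqn, {})
        else:
            # A non-FSDP-managed parameter that does not reside on this rank
            # (e.g., a TorchRec shard owned by another rank): ignore it rather
            # than failing, which is the behavior this PR adds.
            continue
    return result
```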