
[FSDP][optim_state_dict] Let optim_state_dict ignore the non-FSDP managed parameters that do not reside on the rank #94129

Closed
wants to merge 3 commits

Conversation

fegin (Contributor) commented Feb 4, 2023

Stack from ghstack (oldest at bottom):

When FSDP is used together with other parallelism (e.g., TorchRec), some parameters that are not managed by FSDP may not reside on all ranks (TorchRec uses model parallelism). When `use_orig_params=True`, FSDP synchronizes the FQNs among ranks. As a result, a rank may receive FQNs for parameters that it does not actually own. If such an FQN belongs to a TorchRec-managed parameter, FSDP has to ignore that parameter's state; otherwise, FSDP does not know how to store the state.

This PR adds the logic to ignore parameters that are not managed by FSDP and do not reside on the rank.

Differential Revision: [D42982778](https://our.internmc.facebook.com/intern/diff/D42982778/)
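To make the scenario concrete, here is a minimal sketch. The module layout, layer types, and sizes are assumptions for illustration, not code from this PR; only `use_orig_params=True` and `FSDP.optim_state_dict` come from the description and title. It assumes a process group is already initialized.

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class MixedParallelModel(nn.Module):  # hypothetical name
    """Root module is NOT FSDP-wrapped: `sparse` stands in for a model-parallel
    part (e.g., a TorchRec-sharded table that lives only on some ranks), while
    `dense` is the FSDP-managed part with use_orig_params=True."""

    def __init__(self):
        super().__init__()
        self.sparse = nn.Linear(8, 8)  # placeholder for the model-parallel part
        self.dense = FSDP(nn.Linear(8, 8), use_orig_params=True)

    def forward(self, x):
        return self.dense(self.sparse(x))


model = MixedParallelModel().cuda()
optim = torch.optim.Adam(model.parameters(), lr=1e-2)

optim.zero_grad()
model(torch.rand(5, 8, device="cuda")).sum().backward()
optim.step()

# With this change, FQNs that are neither FSDP-managed nor present on this rank
# (e.g., a sharded table owned by another rank) are skipped instead of causing
# optim_state_dict() to fail because it cannot tell how to store their state.
osd = FSDP.optim_state_dict(model, optim)
```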

pytorch-bot bot commented Feb 4, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/94129

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failure

As of commit 4f58406:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

rohan-varma (Member) left a comment

LGTM

```python
dist.all_reduce(sparse)
return self.dense(sparse)

models = [FakeMPModel().cuda(), FakeMPModel().cuda()]
```
Member
Do we need to test wrapping the entire FakeModel with FSDP and telling FSDP to ignore the sparse0 / sparse1? Wondering which approach most accurately simulates the FSDP / TRec integration case.

fegin (Contributor, Author)

AFAIK, this way of wrapping the model, where the root module is not an FSDP module but uses model parallelism, is closer to the real use case.
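For context, a rough reconstruction of the two options being discussed. The layer types, sizes, and rank-based branching are assumptions; only the `sparse0`/`sparse1`/`dense` names and the `all_reduce` in `forward()` come from the snippets above, and `ignored_modules` is just one way to express the alternative the reviewer raised.

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class FakeMPModel(nn.Module):
    """Sketch of the test model: the sparse parts simulate model parallelism
    (each rank owns different parameters), the dense part is data-parallel."""

    def __init__(self):
        super().__init__()
        if dist.get_rank() == 0:
            self.sparse0 = nn.Linear(8, 8)
        else:
            self.sparse1 = nn.Linear(8, 8)
        self.dense = nn.Linear(8, 8)

    def forward(self, x):
        sparse = self.sparse0(x) if dist.get_rank() == 0 else self.sparse1(x)
        dist.all_reduce(sparse)
        return self.dense(sparse)


# Approach used in the test (per the reply above): keep the root module a plain
# nn.Module and wrap only the dense submodule with FSDP.
model = FakeMPModel().cuda()
model.dense = FSDP(model.dense, use_orig_params=True)

# Alternative raised in the review: wrap the whole model with FSDP and tell it
# to ignore the rank-local sparse submodules.
alt = FakeMPModel().cuda()
sparse_mods = [m for n, m in alt.named_modules() if n.startswith("sparse")]
alt = FSDP(alt, use_orig_params=True, ignored_modules=sparse_mods)
```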


```diff
 # Train one batch and see if optim_state_dict are the same.
-batch = torch.rand(5, 8)
+batch = torch.rand(5, 8).cuda()
```
Member
nit: `torch.rand(5, 8, device="cuda")` creates the tensor directly on the device.
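For reference, the form suggested by the nit (a one-line sketch):

```python
batch = torch.rand(5, 8, device="cuda")  # allocate directly on the GPU instead of creating on CPU and copying
```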

fegin added a commit that referenced this pull request Feb 6, 2023
[FSDP][optim_state_dict] Let optim_state_dict ignore the non-FSDP managed parameters that do not reside on the rank

Pull Request resolved: #94129

fegin (Contributor, Author) commented Feb 7, 2023

@pytorchbot merge -f "The failing test is not related"

pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

Labels: ciflow/trunk, Merged, release notes: distributed (fsdp)

3 participants