[FSDP][optim_state_dict] Ensure correct devices for tensors when doing all_gather #92992

Closed
wants to merge 3 commits into from

Conversation

fegin
Contributor

@fegin fegin commented Jan 25, 2023

Stack from ghstack (oldest at bottom):

When doing `_all_gather_optim_state`, we need to ensure that `step` tensors are on CPU and all other tensors are on GPU. This PR adds the logic to enforce this placement.
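
The placement rule described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: `place_optim_state` and `FakeTensor` are hypothetical names, and `FakeTensor` is a stand-in that only tracks a device string so the sketch runs without PyTorch or a GPU. With real tensors, `torch.Tensor.to` would play the role of `FakeTensor.to`.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict


@dataclass(frozen=True)
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor; it only records a device string."""
    name: str
    device: str

    def to(self, device: str) -> "FakeTensor":
        # Same call shape as torch.Tensor.to(device): returns a copy on `device`.
        return replace(self, device=device)


def place_optim_state(state: Dict[str, Any], gpu_device: str = "cuda:0") -> Dict[str, Any]:
    """Return a copy of `state` with `step` on CPU and every other tensor on `gpu_device`."""
    placed: Dict[str, Any] = {}
    for key, value in state.items():
        if not hasattr(value, "to"):
            # Non-tensor entries (e.g. plain Python numbers) are passed through unchanged.
            placed[key] = value
        elif key == "step":
            # The step counter is kept on CPU.
            placed[key] = value.to("cpu")
        else:
            # All other tensor state goes to the GPU before gathering.
            placed[key] = value.to(gpu_device)
    return placed


# Example optimizer state for one parameter, with tensors on mixed devices.
state = {
    "step": FakeTensor("step", "cuda:0"),
    "exp_avg": FakeTensor("exp_avg", "cpu"),
    "exp_avg_sq": FakeTensor("exp_avg_sq", "cuda:0"),
    "lr": 0.001,
}
placed = place_optim_state(state)
```

After the call, `placed["step"]` reports the `cpu` device and the two moment entries report `cuda:0`, while the input `state` is left unmodified.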

@pytorch-bot

pytorch-bot bot commented Jan 25, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92992

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 90022ac:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (fsdp) release notes category label Jan 25, 2023
fegin added a commit that referenced this pull request Jan 25, 2023
…g all_gather

ghstack-source-id: 3bb40dbbac0df7a2772e96ee5c1005ce227bf12b
Pull Request resolved: #92992
Contributor

@fduwjj fduwjj left a comment


lgtm thanks!

fegin added a commit that referenced this pull request Jan 25, 2023
…g all_gather

ghstack-source-id: f9f72d8fbbce3a7a90b2c971cca760c0b47bf26c
Pull Request resolved: #92992
@fegin fegin added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 26, 2023
fegin added a commit that referenced this pull request Jan 26, 2023
…g all_gather

ghstack-source-id: b6c2b50393ff9935614b84df33c388847458463a
Pull Request resolved: #92992
@fegin
Contributor Author

fegin commented Jan 27, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@facebook-github-bot facebook-github-bot deleted the gh/fegin/64/head branch June 8, 2023 17:17
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: distributed (fsdp) release notes category