Make object-related collective functions accept a device parameter #98130

Closed
wants to merge 1 commit into from

shaoyf42
Contributor

@shaoyf42 shaoyf42 commented Apr 1, 2023

Make object-related collective functions accept a device parameter so that they can support custom devices. Fixes #97938.
The next step will be to extend _get_pg_device to support custom devices and backends.
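
To illustrate the intent, here is a minimal sketch in plain Python; it is not the code in this PR. It assumes the existing default stages object collectives on "cuda" for the NCCL backend and on "cpu" otherwise. The function name resolve_device, the backend table, and the string device names are hypothetical and exist only for illustration.

```python
from typing import Optional

# Hypothetical table of backend name -> default device type.
# Assumption: NCCL defaults to "cuda"; anything else falls back to "cpu".
DEFAULT_DEVICE_FOR_BACKEND = {
    "nccl": "cuda",
    "gloo": "cpu",
}


def resolve_device(backend: str, device: Optional[str] = None) -> str:
    """Pick the device on which an object collective would stage its pickled bytes.

    An explicitly passed device always wins; otherwise the backend decides.
    """
    if device is not None:
        # The caller (for example a custom backend) chose the device.
        return device
    # Unknown backends fall back to "cpu" in this sketch.
    return DEFAULT_DEVICE_FOR_BACKEND.get(backend.lower(), "cpu")


if __name__ == "__main__":
    print(resolve_device("nccl"))
    print(resolve_device("my_backend"))
    print(resolve_device("my_backend", device="my_device"))
```

With an explicit device argument, a custom backend would not depend on the backend-based default, which is what the follow-up change to _get_pg_device would otherwise have to cover.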

@pytorch-bot

pytorch-bot bot commented Apr 1, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/98130

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 83ee045:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Apr 1, 2023
@shaoyf42
Contributor Author

shaoyf42 commented Apr 4, 2023

@H-Huang could you take a look

Contributor

@ezyang ezyang left a comment

Sure why not

@ezyang
Contributor

ezyang commented Apr 5, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 5, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

The merge job was canceled. If you believe this is a mistake, you can re-trigger it through pytorch-bot.

@ezyang
Contributor

ezyang commented Apr 5, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase and merge by leaving the following comment on this PR:
@pytorchbot merge -r
Or just rebase by leaving a @pytorchbot rebase comment.

@ezyang
Contributor

ezyang commented Apr 5, 2023

@pytorchbot merge -r

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased dev_c10d onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout dev_c10d && git pull --rebase)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot
Collaborator

The merge job was canceled. If you believe this is a mistake, you can re-trigger it through pytorch-bot.

Contributor

@kwen2501 kwen2501 left a comment

Putting a temporary hold here: as discussed in #97938 (comment), we may want to move all **_object collectives to CPU only. (In that case, custom backends would not need to implement them.)

@zou3519 zou3519 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Apr 10, 2023
@kwen2501
Contributor

FYI, a similar attempt:
Figure out device to use for object collectives #100238

@shaoyf42
Contributor Author

Closing, as the rebased version #100954 has been merged to main.
A thousand thanks to @huihoaan and @kwen2501!

@shaoyf42 shaoyf42 closed this May 11, 2023
Labels
ciflow/trunk - Trigger trunk jobs on your pull request
open source
release notes: distributed (c10d) - release notes category
triaged - This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Development

Successfully merging this pull request may close these issues.

Support custom device and custom backend for Dispatching PyTorch Distributed Collectives
6 participants