Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[8/N] [Dispatchable Collectives] Update allgather with CPU / CUDA implementations  #84423

Closed
wants to merge 15 commits into from

Conversation

H-Huang
Copy link
Member

@H-Huang H-Huang commented Sep 1, 2022

Stack from ghstack:

Changes

  • Updates for the allgather collective

Context

#86225

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Sep 1, 2022

🔗 Helpful links

❌ 2 New Failures

As of commit 36137b9 (more details on the Dr. CI page):

Expand to see more
  • 2/2 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build Lint / lintrunner (1/1)

Step: "Run lintrunner on all files" (full log | diagnosis details)

2022-09-01T23:26:01.6614738Z ##[error]Process completed with exit code 1.
2022-09-01T23:26:01.6595624Z �[0m    �[2m155  154�[0m |�[1m�[2m �[0m�[2m    const c10::intrusive_ptr<ProcessGroup>& process_group,
2022-09-01T23:26:01.6596374Z �[0m    �[2m155     �[0m |�[31m�[1m-�[0m�[31m    const �[0m�[31m�[40m�[4mstd::vector<std::vector<at::Tensor>>�[0m�[31m �[0m�[31m�[40m�[4m&output_tensors,�[0m�[31m
2022-09-01T23:26:01.6597142Z �[0m    �[2m     155�[0m |�[32m�[1m+�[0m�[32m    const �[0m�[32m�[40m�[4mstd::vector<std::vector<at::Tensor>>&�[0m�[32m �[0m�[32m�[40m�[4moutput_tensors,�[0m�[32m
2022-09-01T23:26:01.6597759Z �[0m    �[2m157  156�[0m |�[1m�[2m �[0m�[2m    const at::TensorList& input_tensors,
2022-09-01T23:26:01.6598249Z �[0m    �[2m158  157�[0m |�[1m�[2m �[0m�[2m    const AllgatherOptions& opts) {
2022-09-01T23:26:01.6598815Z �[0m    �[2m159  158�[0m |�[1m�[2m �[0m�[2m  static auto op = c10::Dispatcher::singleton()
2022-09-01T23:26:01.6599156Z �[0m
2022-09-01T23:26:01.6599297Z 
2022-09-01T23:26:01.6599598Z �[1m�[36mYou can reproduce these results locally by using `lintrunner`.�[0m
2022-09-01T23:26:01.6600223Z �[1m�[36mSee https://github.com/pytorch/pytorch/wiki/lintrunner for setup instructions.�[0m
2022-09-01T23:26:01.6614738Z ##[error]Process completed with exit code 1.
2022-09-01T23:26:01.6655432Z ##[group]Run # Use jq to massage the JSON lint output into GitHub Actions workflow commands.
2022-09-01T23:26:01.6655850Z �[36;1m# Use jq to massage the JSON lint output into GitHub Actions workflow commands.�[0m
2022-09-01T23:26:01.6656136Z �[36;1mjq --raw-output \�[0m
2022-09-01T23:26:01.6656536Z �[36;1m  '"::\(if .severity == "advice" or .severity == "disabled" then "warning" else .severity end) file=\(.path),line=\(.line),col=\(.char),title=\(.code) \(.name)::" + (.description | gsub("\\n"; "%0A"))' \�[0m
2022-09-01T23:26:01.6657030Z �[36;1m  lint.json�[0m
2022-09-01T23:26:01.6703096Z shell: /usr/bin/bash -e {0}
2022-09-01T23:26:01.6703298Z env:
2022-09-01T23:26:01.6703546Z   pythonLocation: /opt/hostedtoolcache/Python/3.8.13/x64
2022-09-01T23:26:01.6703854Z   LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.8.13/x64/lib
2022-09-01T23:26:01.6704094Z ##[endgroup]

🕵️‍♀️ 1 failure not recognized by patterns:

The following CI failures may be due to changes from the PR
Job Step
CircleCI Checks build Unknown

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Sep 1, 2022
… / CUDA implementations "

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Sep 1, 2022
…lementations

ghstack-source-id: d273b2420b1d6555afa65d8228d251b8b1fcc8e9
Pull Request resolved: #84423
… / CUDA implementations "

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Sep 7, 2022
…lementations

ghstack-source-id: 586c29cbd7ef90b1947fc82e8861165581dccf31
Pull Request resolved: #84423
… / CUDA implementations "

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Sep 13, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/84423

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 02a23f9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

… / CUDA implementations "

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Sep 13, 2022
…lementations

ghstack-source-id: 6ac06e20eb678e0f57a2200dee211bca8ad2d3fa
Pull Request resolved: #84423
… / CUDA implementations "

[ghstack-poisoned]
… / CUDA implementations "

[ghstack-poisoned]
… / CUDA implementations "

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Sep 30, 2022
…lementations

ghstack-source-id: 3c5e30ad10f79b78d779bd79142da3a242e85f3b
Pull Request resolved: #84423
Copy link
Contributor

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks!

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 30, 2022
… / CUDA implementations "

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Oct 3, 2022
…lementations

ghstack-source-id: e8cb29a61246fcb62a568ff6cb9724c0e34a2827
Pull Request resolved: #84423
… / CUDA implementations "

[ghstack-poisoned]
@H-Huang
Copy link
Member Author

H-Huang commented Oct 3, 2022

@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details.

This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.

… / CUDA implementations "


### Changes
- Updates for the allgather collective

### Context
#86225

Differential Revision: [D40044549](https://our.internmc.facebook.com/intern/diff/D40044549)

[ghstack-poisoned]
@H-Huang
Copy link
Member Author

H-Huang commented Oct 6, 2022

@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

… / CUDA implementations "


### Changes
- Updates for the allgather collective

### Context
#86225

Differential Revision: [D40044549](https://our.internmc.facebook.com/intern/diff/D40044549)

[ghstack-poisoned]
… / CUDA implementations "


### Changes
- Updates for the allgather collective

### Context
#86225

Differential Revision: [D40044549](https://our.internmc.facebook.com/intern/diff/D40044549)

[ghstack-poisoned]
@H-Huang
Copy link
Member Author

H-Huang commented Oct 7, 2022

@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@H-Huang
Copy link
Member Author

H-Huang commented Oct 10, 2022

@pytorchbot merge -f "Ran ciflow/trunk and all tests pass, using force to skip meta internal-only check since the files are the same as internal"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator

Details for Dev Infra team Raised by workflow job

@H-Huang
Copy link
Member Author

H-Huang commented Oct 10, 2022

@pytorchbot merge -f "Ran ciflow/trunk and all tests pass, using force to skip meta internal-only check since the files are the same as internal"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@github-actions
Copy link

Hey @H-Huang.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@facebook-github-bot facebook-github-bot deleted the gh/H-Huang/80/head branch June 8, 2023 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants