Skip to content

Conversation

rohan-varma
Copy link
Contributor

@rohan-varma rohan-varma commented Mar 30, 2021

Stack from ghstack:

Actual proposed fix is in
#53934, in the meantime, would be useful
to include this LOG when barrier does not know what devices to use, and suggest
the workaround of passing in device_ids into barrier().

Differential Revision: D27444917

Actual proposed fix is in
#53934, in the meantime, would be useful
to include this LOG when barrier does not know what devices to use, and suggest
the workaround of passing in device_ids into barrier().

Differential Revision: [D27444917](https://our.internmc.facebook.com/intern/diff/D27444917/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Mar 30, 2021

💊 CI failures summary and remediations

As of commit 532b673 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot facebook-github-bot added oncall: distributed Add this issue/PR to distributed oncall triage queue cla signed labels Mar 30, 2021
rohan-varma added a commit that referenced this pull request Mar 30, 2021
Actual proposed fix is in
#53934, in the meantime, would be useful
to include this LOG when barrier does not know what devices to use, and suggest
the workaround of passing in device_ids into barrier().

Differential Revision: [D27444917](https://our.internmc.facebook.com/intern/diff/D27444917/)

ghstack-source-id: 125304647
Pull Request resolved: #54991
@rohan-varma rohan-varma changed the title [NCCL] Log when barrier guesses device to use [NCCL] warn when barrier guesses device to use Mar 30, 2021
" using best-guess GPU ",
deviceIdx,
" to perform barrier as devices used by this process are currently unknown. ",
"Specify device_ids in barrier() to force use of a particular device."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to say "To avoid any potential hang due to a wrong guess, please specify ...". This can improve the debuggability by mentioning the potential outcome if the warning here is ignored.

Copy link
Contributor

@wayi1 wayi1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for improving the debuggability!

" to perform barrier as devices used by this process are currently unknown. ",
"Specify device_ids in barrier() to force use of a particular device."
);
LOG(WARNING) << bestGuessWarning;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Since bestGuessWarning wont' be used anywhere else, can just log everything directly after LOG(WARNING) and no need to construct a separate string.

Actual proposed fix is in
#53934, in the meantime, would be useful
to include this LOG when barrier does not know what devices to use, and suggest
the workaround of passing in device_ids into barrier().

Differential Revision: [D27444917](https://our.internmc.facebook.com/intern/diff/D27444917/)

[ghstack-poisoned]
Actual proposed fix is in
#53934, in the meantime, would be useful
to include this LOG when barrier does not know what devices to use, and suggest
the workaround of passing in device_ids into barrier().

Differential Revision: [D27444917](https://our.internmc.facebook.com/intern/diff/D27444917/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Apr 12, 2021
Pull Request resolved: #54991

Actual proposed fix is in
#53934, in the meantime, would be useful
to include this LOG when barrier does not know what devices to use, and suggest
the workaround of passing in device_ids into barrier().
ghstack-source-id: 1263518

Differential Revision: [D27444917](https://our.internmc.facebook.com/intern/diff/D27444917/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 657b66e.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/275/head branch April 17, 2021 14:17
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Pull Request resolved: pytorch#54991

Actual proposed fix is in
pytorch#53934, in the meantime, would be useful
to include this LOG when barrier does not know what devices to use, and suggest
the workaround of passing in device_ids into barrier().
ghstack-source-id: 126351889

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27444917

fbshipit-source-id: 0f269c5a7732e5be6e51adfca7ef70d04ffd71d3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants