Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix distributed documentation for asynchronous collective Work objects #45709

Closed
wants to merge 4 commits into from

Conversation

rohan-varma
Copy link
Member

@rohan-varma rohan-varma commented Oct 2, 2020

Closes #42247. Clarifies some documentation related to Work object semantics (outputs of async collective functions). Clarifies the difference between CPU operations and CUDA operations (on Gloo or NCCL backend), and provides an example where the difference in CUDA operation's wait() semantics is necessary to understand for correct code.
sync

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 2, 2020
@codecov
Copy link

codecov bot commented Oct 2, 2020

Codecov Report

Merging #45709 into master will increase coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #45709   +/-   ##
=======================================
  Coverage   68.25%   68.25%           
=======================================
  Files         410      410           
  Lines       53246    53246           
=======================================
+ Hits        36343    36344    +1     
+ Misses      16903    16902    -1     
Impacted Files Coverage Δ
torch/testing/_internal/expecttest.py 78.57% <0.00%> (+1.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c8d76ff...258d7cf. Read the comment docs.

Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing!

return anything

asynchronous operation - when ``async_op`` is set to True. The collective operation function
Every collective operation function supports the following two kinds of operations, depending on the setting of the ``async_op`` flag passed into the collective:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's break this into shorter lines

further function calls utilizing the output of the collective call will behave as expected. For CUDA collectives,
function calls utilizing the output on the same CUDA stream will behave as expected. Users must take care of
synchronization under the scenario of running under different streams. For details on CUDA semantics such as stream
synchronization, see `cuda semantics <https://pytorch.org/docs/stable/autograd.html#profiler>`__.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cuda semantics points to profiler, is this intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch! It should actually point to https://pytorch.org/docs/stable/notes/cuda.html

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@rohan-varma merged this pull request in 154347d.

@facebook-github-bot facebook-github-bot deleted the work_docs_fix branch January 27, 2021 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix distributed documentation for asynchronous collective Work objects
3 participants