Enhance new_group doc to mention using NCCL concurrently.
Using NCCL communicators concurrently is not safe, and this is documented in the
NCCL docs.

However, this is not documented in PyTorch, so we should add documentation for
ProcessGroupNCCL to make users aware of this limitation.

Differential Revision: [D25351778](https://our.internmc.facebook.com/intern/diff/D25351778/)

ghstack-source-id: 117932333
Pull Request resolved: #48872
pritamdamania committed Dec 5, 2020
1 parent 0e4f9a7 commit c0ea183
Showing 1 changed file with 6 additions and 0 deletions.
6 changes: 6 additions & 0 deletions torch/distributed/distributed_c10d.py
@@ -2213,6 +2213,12 @@ def new_group(ranks=None, timeout=default_pg_timeout, backend=None):
if they are not going to be members of the group. Additionally, groups
should be created in the same order in all processes.

.. warning::
    Using multiple process groups with the ``NCCL`` backend concurrently
    is not safe and the user should perform explicit synchronization in
    their application to ensure only one process group is used at a time.
    See `Using multiple NCCL communicators concurrently <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#using-multiple-nccl-communicators-concurrently>`_ for more details.

Arguments:
ranks (list[int]): List of ranks of group members. If ``None``, will be
set to all ranks. Default is ``None``.
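For context on the new warning, here is a minimal sketch (not part of this commit) of the serialization pattern it asks for. It assumes a 4-process job where `MASTER_ADDR`/`MASTER_PORT` are already set in the environment and each rank owns one GPU; the group sizes and ranks are illustrative.

```python
# Hypothetical example, not from the patch: two NCCL process groups are
# created, but their collectives are serialized so that only one group's
# work is in flight on the device at any time.
import torch
import torch.distributed as dist


def run(rank, world_size=4):
    # Assumes MASTER_ADDR / MASTER_PORT are set in the environment.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # All ranks must call new_group(), even ranks that are not members.
    group_a = dist.new_group(ranks=[0, 1])
    group_b = dist.new_group(ranks=[2, 3])

    t = torch.ones(1, device="cuda")

    if rank in (0, 1):
        dist.all_reduce(t, group=group_a)
    # Explicit synchronization: ensure group_a's collective has finished
    # executing on the device before any collective on another group starts.
    torch.cuda.synchronize()
    dist.barrier()  # collective on the default group, also used serially

    if rank in (2, 3):
        dist.all_reduce(t, group=group_b)
    torch.cuda.synchronize()

    dist.destroy_process_group()
```

The point matches the NCCL guidance linked in the warning: collectives from different process groups are never enqueued concurrently, and device-side completion is enforced before switching groups.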
