From 0353911d0366610221cb00c876a8a68babdb94cb Mon Sep 17 00:00:00 2001
From: Omkar Salpekar
Date: Tue, 10 Nov 2020 12:28:48 -0800
Subject: [PATCH] [Docs] Minor doc fixes for init_process_group

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47644

Minor update to the init_process_group docs.

ghstack-source-id: 116337449
Differential Revision: [D24633432](https://our.internmc.facebook.com/intern/diff/D24633432/)
---
 torch/distributed/distributed_c10d.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/torch/distributed/distributed_c10d.py b/torch/distributed/distributed_c10d.py
index 6f07c9412d5d..d97fa774ef30 100644
--- a/torch/distributed/distributed_c10d.py
+++ b/torch/distributed/distributed_c10d.py
@@ -401,11 +401,11 @@ def init_process_group(backend,
         asynchronously and the process will crash. ``NCCL_BLOCKING_WAIT``
         will provide errors to the user which can be caught and handled,
         but due to its blocking nature, it has a performance overhead. On
-        the other hand, ``NCCL_ASYNC_ERROR_HANDLING`` has little
+        the other hand, ``NCCL_ASYNC_ERROR_HANDLING`` has very little
         performance overhead, but crashes the process on errors. This is
         done since CUDA execution is async and it is no longer safe to
         continue executing user code since failed async NCCL operations
-        might result in subsequent CUDA operations to run on corrupted
+        might result in subsequent CUDA operations running on corrupted
         data. Only one of these two environment variables should be set.
     group_name (str, optional, deprecated): Group name.
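The docstring being edited says that only one of the two NCCL error-handling environment variables should be set before `init_process_group`. A minimal sketch of that choice is below; `configure_nccl_error_handling` is a hypothetical helper (not part of torch) used purely to illustrate the trade-off the docs describe, and the actual `torch.distributed.init_process_group` call is only indicated in a comment since it requires a real rendezvous setup.

```python
import os

def configure_nccl_error_handling(blocking: bool) -> dict:
    """Return the single NCCL error-handling env var to set.

    Hypothetical helper: the docs state these two variables are
    mutually exclusive, so exactly one is returned.
    """
    if blocking:
        # NCCL_BLOCKING_WAIT surfaces errors as catchable exceptions,
        # at the cost of blocking-wait overhead on NCCL operations.
        return {"NCCL_BLOCKING_WAIT": "1"}
    # NCCL_ASYNC_ERROR_HANDLING has very little overhead but crashes
    # the process on errors, since continuing after a failed async
    # NCCL op could run subsequent CUDA ops on corrupted data.
    return {"NCCL_ASYNC_ERROR_HANDLING": "1"}

# Set the chosen variable before initializing the process group.
os.environ.update(configure_nccl_error_handling(blocking=False))
# torch.distributed.init_process_group("nccl", ...) would follow here
# in a real training script, after the env var is in place.
```

Setting the variable in the parent process before spawning workers (rather than inside each worker) is the safer pattern, since NCCL reads it at communicator creation time.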