[C10D] Document destroy_process_group usage (#122358)
This API was not documented. It has already been a source of confusion,
and documenting it has recently become more urgent because improper destruction
can lead to hangs, due to ncclCommAbort's requirement of being called collectively.
[screenshot attached: https://github.com/pytorch/pytorch/assets/4984825/9e16342d-1108-4d7d-95c8-b8753661b8e9]

Fixes #48203
Pull Request resolved: #122358
Approved by: https://github.com/shuqiangzhang
wconstab authored and pytorchmergebot committed May 9, 2024
1 parent 257d40b commit 26b942c
Showing 1 changed file with 30 additions and 0 deletions.
30 changes: 30 additions & 0 deletions docs/source/distributed.rst
@@ -293,6 +293,36 @@ check whether the process group has already been initialized use :func:`torch.di

.. autofunction:: get_world_size

Shutdown
--------

It is important to clean up resources on exit by calling :func:`destroy_process_group`.

The simplest pattern is to destroy every process group and backend by calling
:func:`destroy_process_group` with the default value of ``None`` for the ``group`` argument, at a
point in the training script where communication is no longer needed, usually near the end of
``main()``. The call should be made once per trainer process, not at the outer
process-launcher level.
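
A minimal sketch of this pattern (assuming the script is launched with ``torchrun``
and uses the NCCL backend; adjust as needed for your launcher and backend) might look like::

    import os

    import torch
    import torch.distributed as dist

    def main():
        # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each trainer process.
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

        # ... training loop using collectives, DDP, etc. ...

        # Destroy every process group and backend once communication is no longer
        # needed; called once per trainer process, not in the launcher.
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()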

If :func:`destroy_process_group` is not called by all ranks in a process group within the
timeout duration, hangs on exit are possible, especially when the application has multiple
process groups, e.g. for N-D parallelism. This is because ``ProcessGroupNCCL``'s destructor
calls ``ncclCommAbort``, which must be called collectively, but the order in which Python's
garbage collector invokes the destructors is not deterministic. Calling
:func:`destroy_process_group` helps by ensuring ``ncclCommAbort`` is called in a consistent
order across ranks, and avoids calling ``ncclCommAbort`` during ``ProcessGroupNCCL``'s destructor.

Reinitialization
^^^^^^^^^^^^^^^^

:func:`destroy_process_group` can also be used to destroy individual process groups. One use
case is fault-tolerant training, where a process group may be destroyed and then a new one
initialized during runtime. In this case, it is critical to synchronize the trainer processes
using some means other than ``torch.distributed`` primitives *after* calling destroy and
before subsequently initializing. This behavior is currently unsupported/untested, due to
the difficulty of achieving this synchronization, and is considered a known issue. Please file
a GitHub issue or RFC if this use case is blocking you.
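
An illustrative sketch of that pattern (again, currently unsupported/untested) is shown below;
``wait_for_all_trainers`` is a hypothetical placeholder for a synchronization mechanism outside
of ``torch.distributed``, such as a shared filesystem marker or an external coordination
service::

    import torch.distributed as dist

    def wait_for_all_trainers():
        # Hypothetical placeholder: synchronize trainer processes WITHOUT using
        # torch.distributed primitives (e.g. via the filesystem or an external
        # coordination service).
        pass

    # Assume the default process group is already initialized and a subgroup
    # was created for a subset of ranks.
    subgroup = dist.new_group(ranks=[0, 1])

    # ... the subgroup needs to be torn down, e.g. as part of fault recovery ...

    # Destroy only this subgroup; the default process group is left intact.
    dist.destroy_process_group(subgroup)

    # Synchronize trainers out-of-band before creating a replacement group.
    wait_for_all_trainers()

    new_subgroup = dist.new_group(ranks=[0, 1])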

--------------------------------------------------------------------------------

Distributed Key-Value Store
