Distribute GPUs in round robin mode for distributed_test #46389

Closed

Commits on Oct 15, 2020

  1. Distribute GPUs in round robin mode for distributed_test

    The ProcessGroupNCCL::barrier implementation assumes that, when
    one GPU per rank is used, the GPU index equals the rank. Due to
    NCCL communicator reuse, rank 0 then keeps using the (effectively
    temporary) communicator created by barrier, while the other
    processes may be on different GPUs; they therefore try to create
    a new communicator and wait for rank 0 until it creates a new
    (potentially unrelated) one. The test now assigns GPUs to ranks
    in round-robin order (see the sketch after this commit entry).
    
    See pytorch#46248 for details
    Flamefire committed Oct 15, 2020
    Commit 4ee880e
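
For context, a round-robin GPU assignment in a PyTorch distributed test typically looks like the sketch below. This is only an illustration under stated assumptions: the helper name `init_rank` and the rendezvous address are made up here and are not the actual `distributed_test` code changed by this commit.

```python
# Minimal sketch of a round-robin GPU assignment for a multi-process test.
# The helper name and rendezvous address are illustrative only.
import torch
import torch.distributed as dist


def init_rank(rank: int, world_size: int) -> None:
    # Choose the device round robin instead of leaving it implicit,
    # so each rank ends up on a deterministic GPU (rank % device_count).
    device_id = rank % torch.cuda.device_count()
    torch.cuda.set_device(device_id)

    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",  # illustrative rendezvous address
        rank=rank,
        world_size=world_size,
    )

    # With one process per GPU, rank i maps to GPU i, which matches the
    # "GPU index equals rank" assumption made by ProcessGroupNCCL::barrier.
    dist.barrier()
    dist.destroy_process_group()
```

Each test process would call `init_rank(rank, world_size)` (for example via `torch.multiprocessing.spawn`), so that rank i deterministically uses GPU i % device_count.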