[NVIDIA TF] CollectiveAllToAllV2 + NCCL all-to-all fixes #60001

trevor-m · 2023-03-15T20:33:44Z

This PR fixes a few small bugs with CollectiveAllToAllV2 and NCCL all-to-all:

NCCL does not support in-place all-to-all, so CollectiveAllToAllV2 no longer reuses the input tensor as its output. We could potentially keep forward_input_or_allocate_output and allocate a temporary buffer inside the NCCL backend only when the input is reused, which would still allow other backends to perform an in-place all-to-all. It looks like the ring algorithm uses temporary output buffers anyway.
Update the calls to ncclSend and ncclRecv to follow the device mapping by using global_rank. The TF runtime can sometimes reorder the GPUs to optimize the ring algorithms. This PR makes the NCCL all-to-all follow that ordering; the ring algorithm is also affected and gives incorrect results.
VLOG outputs in nccl_manager.cc could cause a crash because they interpreted the data buffer as a string.
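
To illustrate the first point, here is a minimal sketch of why an all-to-all whose output aliases its input can give wrong results. It is a sequential, plain-Python simulation and does not reflect how NCCL actually schedules its sends and receives; the function name `all_to_all` and the string payloads are hypothetical, chosen only for this illustration.

```python
# Sequential simulation of an all-to-all exchange (illustration only, no NCCL calls).
# Each of the n ranks holds n chunks; after the exchange, slot p on rank r
# should hold the chunk that rank p originally stored in slot r.

def all_to_all(send_bufs, recv_bufs):
    """Copy chunk r of every peer p into slot p of rank r's receive buffer."""
    n = len(send_bufs)
    for r in range(n):        # receiving rank
        for p in range(n):    # peer that owns the chunk destined for rank r
            recv_bufs[r][p] = send_bufs[p][r]
    return recv_bufs

n = 3

def make_inputs():
    # Label each chunk with its owning rank and slot so overwrites are visible.
    return [[f"r{r}c{c}" for c in range(n)] for r in range(n)]

# What a correct exchange produces: rank r, slot p holds rank p's chunk r.
expected = [[f"r{p}c{r}" for p in range(n)] for r in range(n)]

# Out-of-place: separate output buffers, inputs are never overwritten.
inputs = make_inputs()
outputs = [[None] * n for _ in range(n)]
out_of_place = all_to_all(inputs, outputs)

# In-place: the output aliases the input, so a chunk can be overwritten
# before the peer that needs it has read it.
aliased = make_inputs()
in_place = all_to_all(aliased, aliased)

print("out-of-place correct:", out_of_place == expected)
print("in-place correct:    ", in_place == expected)
```

In this simulation the out-of-place exchange matches the expected layout, while the aliased one does not, because rank 0 overwrites a slot that rank 1 later reads. Allocating a separate output buffer, as this PR does, avoids that class of problem regardless of the order in which the transfers run.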

trevor-m · 2023-03-15T20:34:03Z

cc @rainwoodman

chsigg

Thank you!

trevor-m added 2 commits March 15, 2023 13:19

DOn't reuse input buffer for nccl all-to-all, in-place is not supported

e66b910

Use global_rank to refer to GPU incase order is changed by TF runtime

40d7c12

trevor-m requested a review from chsigg as a code owner March 15, 2023 20:33

google-ml-butler bot added the awaiting review and size:S labels Mar 15, 2023

google-ml-butler bot assigned gbaned Mar 15, 2023

github-actions bot added the kokoro:force-run label Mar 15, 2023

github-actions bot assigned bfontain and rohan100jain Mar 15, 2023

kokoro-team removed the kokoro:force-run label Mar 15, 2023

gbaned added the comp:core label Mar 16, 2023

gbaned requested a review from rainwoodman March 16, 2023 13:53

Don't reuse input buffer in collective_nccl_test either

101c6f9

github-actions bot added the kokoro:force-run label Mar 16, 2023

kokoro-team removed the kokoro:force-run label Mar 16, 2023

chsigg approved these changes Mar 16, 2023

google-ml-butler bot added the kokoro:force-run and ready to pull labels Mar 16, 2023

kokoro-team removed the kokoro:force-run label Mar 16, 2023

copybara-service bot merged commit b0d0da9 into tensorflow:master Mar 17, 2023

This was referenced Oct 27, 2023

Keras.fit stuck/error in TensorFlow 2.13/2.14 (TPU is fine, inference on GPU is fine, 2.11 GPU is fine) #62234

Closed

TensorFlow 2.13 distributed training fail #61314

Closed
