Skip to content

Conversation

@trevor-m
Copy link
Contributor

This PR fixes a few small bugs with CollectiveAllToAllV2 and NCCL all-to-all:

  • NCCL does not support in-place all-to-all, so don't reuse the input tensor in CollectiveAllToAllV2. We could potentially keep forward_input_or_allocate_output and allocate a temporary buffer inside the NCCL backend only when the input is reused. This would allow other backends to still perform an in-place all-to-all. It looks like the ring algorithm uses temporary output buffers anyway.
  • Update calls to ncclSend and ncclRecv to follow the mapping by using global_rank. The TF runtime can sometimes reorder the GPUs to optimize the ring algorithms. While this PR fixes nccl all-to-all to follow the ordering, the ring algorithm is also affected and gives incorrect results.
  • VLOG outputs in nccl_manager.cc would cause a crash due to interpretting the data buffer as a string.

@trevor-m trevor-m requested a review from chsigg as a code owner March 15, 2023 20:33
@google-ml-butler google-ml-butler bot added awaiting review Pull request awaiting review size:S CL Change Size: Small labels Mar 15, 2023
@trevor-m
Copy link
Contributor Author

cc @rainwoodman

@github-actions github-actions bot added the kokoro:force-run Tests on submitted change label Mar 15, 2023
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 15, 2023
@gbaned gbaned added the comp:core issues related to core part of tensorflow label Mar 16, 2023
@gbaned gbaned requested a review from rainwoodman March 16, 2023 13:53
@github-actions github-actions bot added the kokoro:force-run Tests on submitted change label Mar 16, 2023
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 16, 2023
Copy link
Contributor

@chsigg chsigg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Mar 16, 2023
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 16, 2023
@copybara-service copybara-service bot merged commit b0d0da9 into tensorflow:master Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting review Pull request awaiting review comp:core issues related to core part of tensorflow ready to pull PR ready for merge process size:S CL Change Size: Small

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants