[ProcessGroupNCCL] Avoid recording stream for synchronous ops #111431
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111431
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit 4113d9c with merge base 6f06832.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Change sounds good once CI is happy!
@@ -25,15 +25,15 @@ TORCH_LIBRARY(c10d, m) {
   m.def(
       "allgather_(Tensor[][] output_tensors, Tensor[] input_tensors, __torch__.torch.classes.c10d.ProcessGroup process_group, int timeout) -> (Tensor[][], __torch__.torch.classes.c10d.Work)");
   m.def(
-      "_allgather_base_(Tensor output_tensor, Tensor input_tensor, __torch__.torch.classes.c10d.ProcessGroup process_group) -> (Tensor, __torch__.torch.classes.c10d.Work)");
+      "_allgather_base_(Tensor output_tensor, Tensor input_tensor, __torch__.torch.classes.c10d.ProcessGroup process_group, bool asyncOp, int timeout) -> (Tensor, __torch__.torch.classes.c10d.Work)");
Is the timeout arg new here?
It is new for this op. However, it exists for every other op, so I added it here to stay consistent.
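For context, a brief user-level sketch (an assumption about typical usage, not part of the PR) of the two collectives whose dispatcher schemas are extended here; `dist.all_gather_into_tensor` and `dist.reduce_scatter_tensor` are the public wrappers that eventually reach `_allgather_base_` and `_reduce_scatter_base_`:

```python
import torch
import torch.distributed as dist

# Assumes launch via torchrun with one GPU per rank.
dist.init_process_group("nccl")
rank, world_size = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)

inp = torch.full((4,), float(rank), device="cuda")
out = torch.empty(4 * world_size, device="cuda")

# Synchronous form (async_op=False is the Python default): this is the case
# the PR targets, where the NCCL stream joins back to the current stream
# right after the op.
dist.all_gather_into_tensor(out, inp)

# reduce_scatter_tensor is the other collective covered by this change.
rs_out = torch.empty(4, device="cuda")
rs_in = torch.zeros(4 * world_size, device="cuda")
dist.reduce_scatter_tensor(rs_out, rs_in)

dist.destroy_process_group()
```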
Looks good. I have a few comments; feel free to address them at your convenience.
   m.def(
       "allgather_coalesced_(Tensor[][] output_lists, Tensor[] input_list, __torch__.torch.classes.c10d.ProcessGroup process_group) -> __torch__.torch.classes.c10d.Work");
   m.def(
       "allgather_into_tensor_coalesced_(Tensor[] outputs, Tensor[] inputs, __torch__.torch.classes.c10d.ProcessGroup process_group) -> __torch__.torch.classes.c10d.Work");
   m.def(
       "reduce_scatter_(Tensor[] output_tensors, Tensor[][] input_tensors, __torch__.torch.classes.c10d.ProcessGroup process_group, __torch__.torch.classes.c10d.ReduceOp reduce_op, int timeout) -> (Tensor[], __torch__.torch.classes.c10d.Work)");
   m.def(
-      "_reduce_scatter_base_(Tensor output_tensor, Tensor input_tensor, __torch__.torch.classes.c10d.ProcessGroup process_group, __torch__.torch.classes.c10d.ReduceOp reduce_op, int timeout) -> (Tensor, __torch__.torch.classes.c10d.Work)");
+      "_reduce_scatter_base_(Tensor output_tensor, Tensor input_tensor, __torch__.torch.classes.c10d.ProcessGroup process_group, __torch__.torch.classes.c10d.ReduceOp reduce_op, bool asyncOp, int timeout) -> (Tensor, __torch__.torch.classes.c10d.Work)");
pytorch/test/forward_backward_compatibility/check_forward_backward_compatibility.py, line 42 in 2dc1726:

ALLOW_LIST = [

ALLOW_LIST will need to be updated with `_reduce_scatter_base_` and `_all_gather_base_` to fix the backward compat complaint in CI.
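For readers unfamiliar with that file, a hedged sketch of what such allow-list entries usually look like; the entry format (a schema-name prefix paired with an expiry date) and the dates below are assumptions for illustration, not the actual lines added in this PR:

```python
# test/forward_backward_compatibility/check_forward_backward_compatibility.py (sketch)
import datetime

ALLOW_LIST = [
    # ... existing entries ...
    ("c10d::_allgather_base_", datetime.date(2023, 12, 31)),       # assumed format/date
    ("c10d::_reduce_scatter_base_", datetime.date(2023, 12, 31)),  # assumed format/date
]
```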
@@ -137,6 +137,7 @@ struct ReduceOptions {

 struct AllgatherOptions {
   std::chrono::milliseconds timeout = kUnsetTimeout;
+  bool asyncOp = true;
Should these be `true` by default? Our current collectives have `async = false` by default, right?
For the Python APIs, yes, `async = false` is the default. But for the C++ APIs, i.e. the APIs defined by the `Backend` class, `async = true` is the default behavior.
Here, `struct AllgatherOptions` is mainly for passing the async option into the C++ implementations, so I chose `true` as the default behavior.
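To make the default discussion concrete, a small Python-side sketch (assuming an initialized NCCL group): the Python wrapper defaults to synchronous behavior and returns `None`, while passing `async_op=True` surfaces the `Work` handle that the C++ `Backend` API always returns:

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has already been called on every rank
# and the current device has been set for this rank.
world_size = dist.get_world_size()
out = torch.empty(1024 * world_size, device="cuda")
inp = torch.empty(1024, device="cuda")

ret = dist.all_gather_into_tensor(out, inp)           # async_op=False by default
assert ret is None                                    # synchronous: nothing to wait on

work = dist.all_gather_into_tensor(out, inp, async_op=True)
work.wait()                                           # caller waits explicitly
```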
@pytorchbot merge -f "The failure in test_python_ref_executor__refs_sinc_executor_aten_cuda_complex128 does not seem related"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Even after PR #111431, the `collective(...)` function still uses the member variable `avoidRecordStreams_` internally and does not respect each collective call's preference, since `avoidRecordStreams_` is controlled only by the environment variable. As a fix, we pass `avoidRecordStreams` into the `collective()` function. Pull Request resolved: #112195 Approved by: https://github.com/awgu
…ytorch#112896) Summary: Follows PR pytorch#111431; saves memory for DTensor init. Test Plan: Sandcastle. Reviewed By: wanchaol. Differential Revision: D50985365. Pull Request resolved: pytorch#112896 Approved by: https://github.com/wanchaol
…h#111431) For synchronous ops (i.e. `asyncOp = False`), we don't want to record streams because we know that the NCCL stream will join back to the "current" stream right after this op. So we might just as well keep the stream ownership of the input/output tensors unchanged. The benefit is that the allocation/free of the tensors looks deterministic to the "current" stream, so the caching allocator can reuse the memory pool for this stream in a clever way. To prevent the input/output tensors from being recycled by Python, we rely on the stashing mechanism in ProcessGroupNCCL (which can also be turned on by setting `TORCH_NCCL_AVOID_RECORD_STREAMS=1`). This mechanism change is for libraries like FSDP, which use `all_gather_into_tensor` and `reduce_scatter_tensor` in a synchronous way and cannot set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` for their users. Therefore, this change is limited to these two collectives for now. Cc: @awgu @janeyx99 @albanD Pull Request resolved: pytorch#111431 Approved by: https://github.com/H-Huang
For synchronous ops (i.e. `asyncOp = False`), we don't want to record streams because we know that the NCCL stream will join back to the "current" stream right after this op. So we might just as well keep the stream ownership of the input/output tensors unchanged (i.e. not telling the caching allocator there is a stream-to-stream hand-off). The benefit is that the allocation/free of the tensors looks deterministic to the "current" stream, so the caching allocator can reuse the memory pool for this stream in a clever way.

To prevent the input/output tensors from being recycled by Python, we rely on the stashing mechanism in ProcessGroupNCCL (which can also be turned on by setting `TORCH_NCCL_AVOID_RECORD_STREAMS=1`).

This mechanism change is for libraries like FSDP, which use `all_gather_into_tensor` and `reduce_scatter_tensor` in a synchronous way and cannot set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` for their users. Therefore, this change is limited to these two collectives for now.

Cc: @awgu @janeyx99 @albanD
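To illustrate the memory behavior the description refers to, here is a hedged, standalone sketch of the two hand-off strategies using plain CUDA stream APIs. This is an illustration of the general technique, not the ProcessGroupNCCL internals; the side stream below stands in for the NCCL stream, and the in-place add stands in for a collective.

```python
import torch

side_stream = torch.cuda.Stream()  # stands in for the NCCL stream

def handoff_with_record_stream(buf: torch.Tensor) -> None:
    # Strategy 1: tell the caching allocator about the cross-stream use.
    # The allocator defers reuse of buf's memory until side_stream catches up,
    # which can make allocation patterns look non-deterministic to the current stream.
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        buf.add_(1)                       # "collective" running on the side stream
        buf.record_stream(side_stream)    # mark the cross-stream usage

def handoff_with_sync_join(buf: torch.Tensor) -> None:
    # Strategy 2 (what the PR prefers for synchronous ops): run on the side stream,
    # then join it back to the current stream right away. Ownership of buf stays
    # with the current stream, so its eventual free looks deterministic to the
    # allocator. ProcessGroupNCCL additionally stashes a reference to the tensors
    # so Python cannot recycle them before the side stream is done.
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        buf.add_(1)                       # "collective" running on the side stream
    torch.cuda.current_stream().wait_stream(side_stream)  # join back immediately

if torch.cuda.is_available():
    x = torch.zeros(1 << 20, device="cuda")
    handoff_with_record_stream(x)
    handoff_with_sync_join(x)
    torch.cuda.synchronize()
```

The second pattern corresponds to what this PR adopts for `all_gather_into_tensor` and `reduce_scatter_tensor` when `asyncOp = False`, combined with ProcessGroupNCCL's stashing of tensor references.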