Switch PG::Work to Future in default_comm_hooks.cpp #59398
Conversation
[ghstack-poisoned]
💊 CI failures summary and remediations

As of commit ce80301 (more details on the Dr. CI page):

🕵️ 4 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:

| Job | Step | Action |
|---|---|---|
|  | Download PyTorch Test Reports | 🔁 rerun |

This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt-out of these comments for your Pull Requests. Please report bugs/suggestions to the (internal) Dr. CI Users group.
@agolynski has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Thanks for the fix!
```diff
- auto allreduce_work = state_->allreduce(tensors);
+ auto allreduce_fut = state_->allreduce(tensors)->getFuture();

  // FIXME Access the result through the Future passed as argument, instead of
```
Please remove the FIXME comment.
Done
```diff
  // capturing the Work.
- auto div_by_process_group_size = [allreduce_work,
+ auto div_by_process_group_size = [allreduce_fut,
                                    this](c10::ivalue::Future& /* unused */) {
-   auto tensor = allreduce_work->result()[0] / state_->getSize();
+   auto result = allreduce_fut->value();
+   TORCH_INTERNAL_ASSERT(result.isTensorList(),
+       "ProcessGroup::allreduce should return TensorList");
+   auto tensor = result.toTensorVector()[0] / state_->getSize();
    return c10::IValue(tensor);
  };

- auto fut = allreduce_work->getFuture();
- return fut->then(div_by_process_group_size, fut->elementType());
+ return allreduce_fut->then(div_by_process_group_size, allreduce_fut->elementType());
```
This code introduces a (potential) memory leak: by adding a callback, you end up storing on `allreduce_fut` a lambda whose closure contains an owning pointer to `allreduce_fut` itself. This creates a reference cycle, which means that if for some reason the future is never completed it will never be "garbage collected", because its refcount will never reach 0.

This is precisely the problem that was being addressed in the FIXME that you removed: instead of capturing `allreduce_fut`, we should use the argument that is being passed to the lambda, which also points to `allreduce_fut` but doesn't cause the reference cycle.
In other words, please do this:

```cpp
auto div_by_process_group_size = [this](c10::ivalue::Future& allreduce_fut) {
  auto result = allreduce_fut.value();
  ...
};
return allreduce_fut->then(div_by_process_group_size, ...);
```
The same applies to the other FIXME just below.
To go the extra mile, it might also be good to avoid capturing `this` in the lambda, because it boils down to capturing a raw pointer, which means that there is no guarantee that the pointed-to object will still be alive once the callback fires.

The only reason for capturing `this` is to access `state_->getSize()` (I believe), hence what we could do instead is capture that value directly:

```cpp
auto div_by_process_group_size = [size{state_->getSize()}](c10::ivalue::Future& allreduce_fut) { ... };
```
> The only reason for capturing `this` is to access `state_->getSize()` (I believe), hence what we could do instead is capture that value directly:

thanks, capturing `this` here is not a good idea indeed; this can easily lead to a crash when the Future is not immediately returned (e.g. GLOO and MPI backends)
> This code introduces a (potential) memory leak … instead of capturing `allreduce_fut`, we should use the argument that is being passed to the lambda … The same applies to the other FIXME just below.
Done.
Talked offline: this is a defensive measure. If the Future is never completed, behavior is typically undefined for anything depending on its result, and a memory leak would be the least of our problems in that case.
Differential Revision: [D28876182](https://our.internmc.facebook.com/intern/diff/D28876182) [ghstack-poisoned]
@agolynski has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Looks good, thanks!
@agolynski has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@agolynski merged this pull request in 1183fa3.
Summary: Pull Request resolved: pytorch#59398

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28876182

Pulled By: agolynski

fbshipit-source-id: 9d8f09ffa2f40bb0fb25c626b52678a1597a797e
Stack from ghstack:
Differential Revision: D28876182