
Conversation


@agolynski agolynski commented Jun 3, 2021

Stack from ghstack:

Differential Revision: D28876182

facebook-github-bot added the oncall: distributed and cla signed labels on Jun 3, 2021

facebook-github-bot commented Jun 3, 2021

💊 CI failures summary and remediations

As of commit ce80301 (more details on the Dr. CI page):


  • 6/6 failures possibly* introduced in this PR
    • 1/6 non-scanned failure(s)

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/4)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .github/workflows/build_linux_wheels.yml
Auto-merging .github/workflows/build_linux_wheels.yml
CONFLICT (add/add): Merge conflict in .github/workflows/build_linux_libtorch.yml
Auto-merging .github/workflows/build_linux_libtorch.yml
CONFLICT (add/add): Merge conflict in .github/workflows/build_linux_conda.yml
Auto-merging .github/workflows/build_linux_conda.yml
CONFLICT (add/add): Merge conflict in .github/templates/linux_ci_workflow.yml.j2
Auto-merging .github/templates/linux_ci_workflow.yml.j2
CONFLICT (add/add): Merge conflict in .circleci/scripts/binary_populate_env.sh
Auto-merging .circleci/scripts/binary_populate_env.sh
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_py3_clang7_onnx_ort_test2 (2/4)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 04 21:51:28 ../../../../opt/conda/lib/pytho..._test.py::TestCaffe2Basic::test_cast FAILED [ 22%]
Jun 04 21:51:27 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/models/seq2seq/seq2seq_beam_search_test.py::Seq2SeqBeamSearchTest::test_2layer_attention PASSED [ 22%]
Jun 04 21:51:27 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/models/seq2seq/seq2seq_beam_search_test.py::Seq2SeqBeamSearchTest::test_attention PASSED [ 22%]
Jun 04 21:51:28 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/models/seq2seq/seq2seq_beam_search_test.py::Seq2SeqBeamSearchTest::test_multi_decoder PASSED [ 22%]
Jun 04 21:51:28 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/models/seq2seq/seq2seq_model_helper_test.py::Seq2SeqModelHelperTest::testAddParam PASSED [ 22%]
Jun 04 21:51:28 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/models/seq2seq/seq2seq_model_helper_test.py::Seq2SeqModelHelperTest::testConstuctor PASSED [ 22%]
Jun 04 21:51:28 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/models/seq2seq/seq2seq_model_helper_test.py::Seq2SeqModelHelperTest::testGetAllParams PASSED [ 22%]
Jun 04 21:51:28 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/models/seq2seq/seq2seq_model_helper_test.py::Seq2SeqModelHelperTest::testGetNonTrainableParams PASSED [ 22%]
Jun 04 21:51:28 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/onnx/test_onnxifi.py::OnnxifiTest::test_conv_graph SKIPPED [ 22%]
Jun 04 21:51:28 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/onnx/test_onnxifi.py::OnnxifiTest::test_relu_graph SKIPPED [ 22%]
Jun 04 21:51:28 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/onnx/test_onnxifi.py::OnnxifiTransformTest::test_resnet50_core SKIPPED [ 22%]
Jun 04 21:51:28 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/onnx/tests/c2_ref_test.py::TestCaffe2Basic::test_cast FAILED [ 22%]
Jun 04 21:51:28 
Jun 04 21:51:28 =================================== FAILURES ===================================
Jun 04 21:51:28 __________________________ TestCaffe2Basic.test_cast ___________________________
Jun 04 21:51:28 
Jun 04 21:51:28 self = <caffe2.python.onnx.tests.c2_ref_test.TestCaffe2Basic testMethod=test_cast>
Jun 04 21:51:28 
Jun 04 21:51:28     def test_cast(self):
Jun 04 21:51:28         X = np.random.randn(1, 2, 3).astype(np.float32)
Jun 04 21:51:28     
Jun 04 21:51:28         for to_type in ['INT8', caffe2_pb2.TensorProto.INT8,

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (3/4)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .github/workflows/build_linux_wheels.yml
Auto-merging .github/workflows/build_linux_wheels.yml
CONFLICT (add/add): Merge conflict in .github/workflows/build_linux_libtorch.yml
Auto-merging .github/workflows/build_linux_libtorch.yml
CONFLICT (add/add): Merge conflict in .github/workflows/build_linux_conda.yml
Auto-merging .github/workflows/build_linux_conda.yml
CONFLICT (add/add): Merge conflict in .github/templates/linux_ci_workflow.yml.j2
Auto-merging .github/templates/linux_ci_workflow.yml.j2
CONFLICT (add/add): Merge conflict in .circleci/scripts/binary_populate_env.sh
Auto-merging .circleci/scripts/binary_populate_env.sh
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_py3_clang7_onnx_ort_test1 (4/4)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 04 21:51:58 ../../../../opt/conda/lib/pytho..._test.py::TestCaffe2Basic::test_cast FAILED [ 22%]
Jun 04 21:51:57 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/models/seq2seq/seq2seq_beam_search_test.py::Seq2SeqBeamSearchTest::test_2layer_attention PASSED [ 22%]
Jun 04 21:51:57 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/models/seq2seq/seq2seq_beam_search_test.py::Seq2SeqBeamSearchTest::test_attention PASSED [ 22%]
Jun 04 21:51:57 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/models/seq2seq/seq2seq_beam_search_test.py::Seq2SeqBeamSearchTest::test_multi_decoder PASSED [ 22%]
Jun 04 21:51:57 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/models/seq2seq/seq2seq_model_helper_test.py::Seq2SeqModelHelperTest::testAddParam PASSED [ 22%]
Jun 04 21:51:57 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/models/seq2seq/seq2seq_model_helper_test.py::Seq2SeqModelHelperTest::testConstuctor PASSED [ 22%]
Jun 04 21:51:57 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/models/seq2seq/seq2seq_model_helper_test.py::Seq2SeqModelHelperTest::testGetAllParams PASSED [ 22%]
Jun 04 21:51:57 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/models/seq2seq/seq2seq_model_helper_test.py::Seq2SeqModelHelperTest::testGetNonTrainableParams PASSED [ 22%]
Jun 04 21:51:57 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/onnx/test_onnxifi.py::OnnxifiTest::test_conv_graph SKIPPED [ 22%]
Jun 04 21:51:57 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/onnx/test_onnxifi.py::OnnxifiTest::test_relu_graph SKIPPED [ 22%]
Jun 04 21:51:57 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/onnx/test_onnxifi.py::OnnxifiTransformTest::test_resnet50_core SKIPPED [ 22%]
Jun 04 21:51:58 ../../../../opt/conda/lib/python3.6/site-packages/caffe2/python/onnx/tests/c2_ref_test.py::TestCaffe2Basic::test_cast FAILED [ 22%]
Jun 04 21:51:58 
Jun 04 21:51:58 =================================== FAILURES ===================================
Jun 04 21:51:58 __________________________ TestCaffe2Basic.test_cast ___________________________
Jun 04 21:51:58 
Jun 04 21:51:58 self = <caffe2.python.onnx.tests.c2_ref_test.TestCaffe2Basic testMethod=test_cast>
Jun 04 21:51:58 
Jun 04 21:51:58     def test_cast(self):
Jun 04 21:51:58         X = np.random.randn(1, 2, 3).astype(np.float32)
Jun 04 21:51:58     
Jun 04 21:51:58         for to_type in ['INT8', caffe2_pb2.TensorProto.INT8,

1 failure not recognized by patterns:

Job: GitHub Actions Windows CI (pytorch-win-vs2019-cpu-py3) / render_test_results
Step: Download PyTorch Test Reports
Action: 🔁 rerun

This comment was automatically generated by Dr. CI.

agolynski added a commit that referenced this pull request Jun 3, 2021
ghstack-source-id: d406dc4
Pull Request resolved: #59398
@agolynski agolynski requested review from lw and wayi1 June 3, 2021 18:21

@agolynski has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@wayi1 (Contributor) left a comment

Thanks for the fix!

-  auto allreduce_work = state_->allreduce(tensors);
+  auto allreduce_fut = state_->allreduce(tensors)->getFuture();

   // FIXME Access the result through the Future passed as argument, instead of
Contributor

Please remove the FIXME comment.

Contributor Author

Done

Comment on lines 15 to 24
   // capturing the Work.
-  auto div_by_process_group_size = [allreduce_work,
+  auto div_by_process_group_size = [allreduce_fut,
       this](c10::ivalue::Future& /* unused */) {
-    auto tensor = allreduce_work->result()[0] / state_->getSize();
+    auto result = allreduce_fut->value();
+    TORCH_INTERNAL_ASSERT(result.isTensorList(),
+        "ProcessGroup::allreduce should return TensorList");
+    auto tensor = result.toTensorVector()[0] / state_->getSize();
     return c10::IValue(tensor);
   };

-  auto fut = allreduce_work->getFuture();
-  return fut->then(div_by_process_group_size, fut->elementType());
+  return allreduce_fut->then(div_by_process_group_size, allreduce_fut->elementType());
Contributor

This code introduces a (potential) memory leak: by adding a callback, you end up storing on allreduce_fut a lambda whose closure contains an owning pointer to allreduce_fut itself. This creates a reference cycle, which means that if for some reason the future is never completed it will never be "garbage collected" because its refcount will never reach 0.

This is precisely the problem that was being addressed in the FIXME that you removed: instead of capturing allreduce_fut, we should use the argument that is being passed to the lambda, which also points to allreduce_fut, but which doesn't cause the reference cycle.

In other words, please do this:

  auto div_by_process_group_size = [this](c10::ivalue::Future& allreduce_fut) {
    auto result = allreduce_fut.value();
    ...
  };
  return allreduce_fut->then(div_by_process_group_size, ...);

The same applies to the other FIXME just below.
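
A toy illustration of the cycle described above, with std::shared_ptr and std::function standing in for the actual c10::intrusive_ptr<Future> and callback machinery (a self-contained sketch for illustration, not code from this PR):

  // Toy sketch: the "future" stores a callback whose closure owns the future
  // itself, so its reference count can never drop to zero.
  #include <functional>
  #include <memory>

  struct ToyFuture {
    std::function<void()> callback; // stand-in for a then() callback
  };

  int main() {
    auto fut = std::make_shared<ToyFuture>();
    // Capturing fut by value stores an owning pointer to fut inside fut itself.
    fut->callback = [fut] { /* would read fut's value here */ };
    // If the future never completes and the callback is never cleared,
    // the refcount never reaches zero and the ToyFuture leaks.
    // Taking the future as the callback's parameter instead (as suggested
    // above) keeps the closure from owning anything.
  }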

Contributor

To go the extra mile, it might also be good to avoid capturing this in the lambda, because it boils down to capturing a raw pointer, which means that there is no guarantee that the pointed-to object will still be alive once the callback fires.

The only reason for capturing this is to access state_->getSize() (I believe), hence what we could do instead is capturing that directly by value:

auto div_by_process_group_size = [size{state_->getSize()}](c10::ivalue::Future& allreduce_fut) { ... };

Contributor Author

> The only reason for capturing this is to access state_->getSize() (I believe), hence what we could do instead is capturing that directly by value:

Thanks, capturing 'this' here is indeed not a good idea; it can easily lead to a crash when the Future is not completed immediately (e.g. with the GLOO and MPI backends).

Contributor Author

> This code introduces a (potential) memory leak: by adding a callback, you end up storing on allreduce_fut a lambda whose closure contains an owning pointer to allreduce_fut itself. This creates a reference cycle, which means that if for some reason the future is never completed it will never be "garbage collected" because its refcount will never reach 0.
>
> This is precisely the problem that was being addressed in the FIXME that you removed: instead of capturing allreduce_fut, we should use the argument that is being passed to the lambda, which also points to allreduce_fut, but which doesn't cause the reference cycle.
>
> In other words, please do this:
>
>   auto div_by_process_group_size = [this](c10::ivalue::Future& allreduce_fut) {
>     auto result = allreduce_fut.value();
>     ...
>   };
>   return allreduce_fut->then(div_by_process_group_size, ...);
>
> The same applies to the other FIXME just below.

Done.

Talked offline: this is a defensive measure; if the Future never completes, behavior is typically undefined anyway once we depend on its result, and a memory leak would be the least of our problems in that case.
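
For reference, a minimal sketch of the hook body with both review suggestions applied, reusing only the calls that appear in the diff above (state_ and tensors come from the surrounding code; the exact merged code may differ):

  auto allreduce_fut = state_->allreduce(tensors)->getFuture();

  // Capture only the group size by value: no `this`, no owning pointer to the future.
  auto div_by_process_group_size =
      [size{state_->getSize()}](c10::ivalue::Future& fut) {
        // Read the result from the future passed in as the argument,
        // which avoids the reference cycle discussed above.
        auto result = fut.value();
        TORCH_INTERNAL_ASSERT(result.isTensorList(),
            "ProcessGroup::allreduce should return TensorList");
        auto tensor = result.toTensorVector()[0] / size;
        return c10::IValue(tensor);
      };

  return allreduce_fut->then(div_by_process_group_size, allreduce_fut->elementType());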

agolynski added a commit that referenced this pull request Jun 4, 2021
ghstack-source-id: 25a5a34
Pull Request resolved: #59398
@agolynski agolynski marked this pull request as ready for review June 4, 2021 19:27

@agolynski has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@lw (Contributor) left a comment

Looks good, thanks!

agolynski added a commit that referenced this pull request Jun 4, 2021
ghstack-source-id: ee13141
Pull Request resolved: #59398

@agolynski has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@agolynski merged this pull request in 1183fa3.

@facebook-github-bot facebook-github-bot deleted the gh/agolynski/23/head branch June 8, 2021 14:17
deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Summary: Pull Request resolved: pytorch#59398

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28876182

Pulled By: agolynski

fbshipit-source-id: 9d8f09ffa2f40bb0fb25c626b52678a1597a797e

Labels: cla signed, Merged, oncall: distributed
