Use std::shared_ptr for DistAutogradContext. #29770
Conversation
We were passing around const and non-const references for DistAutogradContext from DistAutogradContainer. This wasn't safe since the context could be deleted from the container and a thread might still be using the reference. This usually would happen when a backward pass fails on the node driving the backward pass (resulting in delete context messages being sent to all nodes) but other nodes are still executing code related to that autograd context. This was also the reason why `test_backward_autograd_engine_error` was flaky. Using a std::shared_ptr everywhere ensures we're safe and never crash. Closes #28928 Closes #26922 Differential Revision: [D18494814](https://our.internmc.facebook.com/intern/diff/D18494814/) [ghstack-poisoned]
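To make the failure mode described above concrete, here is a minimal sketch (not the actual PyTorch API; the `Container` class and its method names are invented for illustration) of why handing out references from the container is unsafe, while handing out a `std::shared_ptr` keeps the context alive for as long as any caller is still using it:

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <unordered_map>

class DistAutogradContext {};
using ContextPtr = std::shared_ptr<DistAutogradContext>;

class Container {
 public:
  // Unsafe: the returned reference dangles if releaseContext() erases the
  // entry on another thread while the caller is still using it.
  DistAutogradContext& retrieveContextRef(int64_t id) {
    std::lock_guard<std::mutex> guard(lock_);
    return *contexts_.at(id);
  }

  // Safe: the caller shares ownership, so erasing the entry from the map
  // cannot destroy the context while someone is still holding the pointer.
  ContextPtr retrieveContext(int64_t id) {
    std::lock_guard<std::mutex> guard(lock_);
    auto it = contexts_.find(id);
    if (it == contexts_.end()) {
      throw std::runtime_error("Could not find autograd context with id");
    }
    return it->second;
  }

  // Dropping the map entry only releases the container's ownership share.
  void releaseContext(int64_t id) {
    std::lock_guard<std::mutex> guard(lock_);
    contexts_.erase(id);
  }

 private:
  std::mutex lock_;
  std::unordered_map<int64_t, ContextPtr> contexts_;
};
```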
t.start()

with self.assertRaisesRegex(RuntimeError, "Could not find autograd context with id"):
    dist_autograd.backward([t1.sum()])
I'm not 100% clear on what behavior is being tested here and what should be different from before this PR. Does this result in the backward call throwing an exception (if so, what exactly is the exception that should be thrown)?
The exception being thrown is "Could not find autograd context id for". This happens because the autograd context is cleaned up while some thread on some node is still looking for that autograd context.
Before this PR, this test would cause the process to crash, since a thread would be using a reference to DistAutogradContext but our cleanup logic would delete the context from the container.
Does this assume that the thread cleared the context before the backward uses it? How do we guarantee that?
test/dist_autograd_test.py (outdated)
# ensures we simulate a case where we clean up the context while the
# backward pass is running.
while not DistAutogradTest._my_backward_func_executed:
    time.sleep(0.1)
This assumes the backward pass takes longer than 0.1s. Potential flakiness.
class DistAutogradContext;
using ContextPtr = std::shared_ptr<DistAutogradContext>;
Defined twice? Also in dist_autograd_container.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing, @pritamdamania87. One more pedantic concern.
test/dist_autograd_test.py (outdated)
# backward pass is running.
with DistAutogradTest._my_backward_func_executed:
    DistAutogradTest._my_backward_func_executed.wait()
dist_autograd._release_context(context._context_id())
Is it possible that this executes after the backwards pass has completed? It's a long shot, seeing as this thread would have to be stalled for at least another 50 autograd steps, but calling it out for the sake of robustness.
Would it be possible to simply call `release_context` from the `backward` function itself?
Otherwise, adding another condition to mark completion of the release would make this always work. In the current setting, you could insert a sleep before the `release_context` and make the test fail, which means that unfortunate thread scheduling could in theory make the test fail.
@@ -67,7 +67,7 @@ class TORCH_API DistEngine {
   // We also determine all leaf nodes(functions) in the graph and accumulate
   // them in outputEdges.
   void computeDependencies(
-      DistAutogradContext& context,
+      ContextPtr context,
This can be a reference to the `shared_ptr`, as the call site always holds a copy of the `shared_ptr`?
You can use a const ref. A const ref to a `shared_ptr` doesn't mean the wrapped object is const.
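A minimal sketch of that point (the `Widget` type and `touch` method are made up for illustration):

```cpp
#include <memory>

struct Widget {
  int value = 0;
  void touch() { ++value; }  // non-const member function
};

// Passing `const std::shared_ptr<Widget>&` avoids copying the shared_ptr
// (no refcount bump), but the Widget it points to is still mutable.
void useWidget(const std::shared_ptr<Widget>& w) {
  w->touch();      // OK: the element type is Widget, not const Widget
  // w.reset();    // error: the shared_ptr object itself is const
}

int main() {
  auto w = std::make_shared<Widget>();
  useWidget(w);    // w->value is now 1
  return 0;
}
```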
@pietern @mrshenli Made a few changes to ensure there is no race in the unit test. This was a little tricky since …
test/dist_autograd_test.py (outdated)
def backward(ctx, input):
    assert(DistAutogradTest._test_clean_context_backward_context_id is not None)

    # Release the context to simulate error (use barrier before releasing context to ensure all nodes execute the backward function).
comment line too long
What is the convention for python line length in OSS? The python linter isn't complaining and I recall internally the line length is 150 chars.
I am not aware of any convention here; I personally use 100. Feel free to ignore this comment though if none of the linters complain.
test/dist_autograd_test.py (outdated)
# Send the context id to all nodes.
for i in range(0, self.world_size):
    if i != self.rank:
        rpc.rpc_sync("worker{}".format(i), _set_rpc_done, args=(context_id, 1))
Why is the rank distance always 1 for `_set_rpc_done` here?
Didn't matter for this test case, will fix it though.
test/dist_autograd_test.py (outdated)
for i in range(0, 100):
    dst = self._next_rank()
    t1 = rpc.rpc_sync("worker{}".format(dst), torch.add, args=(t1, t1))
    if i == 99:
Maybe just do this outside of the loop instead of having an if clause here?
# Release the context to simulate error (use barrier before releasing context to ensure all nodes execute the backward function).
dist.barrier()
dist_autograd._release_context(DistAutogradTest._test_clean_context_backward_context_id)
Is this blocking? It has to be to guarantee correctness, right?
This is not blocking; it releases the local context and then just sends async RPCs to release the other contexts. The method below, `_all_contexts_cleaned_up`, is blocking and ensures that contexts are cleaned up on all nodes.
The changes LGTM!
dist_autograd.backward([t1.sum()])

# HACK: Killing workers since otherwise the autograd engine gets stuck on
# other nodes. The proper fix would be addressing:
Let's be more specific on why it might get stuck on other nodes.
Would I be correct if I assume it gets stuck because the crashing backward destroyed the context on this node, and hence the `next_rank` won't be able to clear the context when exiting the scope?
@@ -20,7 +21,7 @@ class TORCH_API RecvRpcBackward : public torch::autograd::Node {
  public:
   explicit RecvRpcBackward(
       const AutogradMetadata& autogradMetadata,
-      DistAutogradContext& autogradContext,
+      std::shared_ptr<DistAutogradContext> autogradContext,
Why is this not using `ContextPtr`?
That would create a circular dependency; that's why we need a forward declaration for `DistAutogradContext`.
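A minimal sketch of the pattern being described (the file, namespace, and class names here are invented for illustration, not the actual PyTorch headers): a forward declaration plus a `std::shared_ptr` member lets the header avoid including the context header, which breaks the include cycle.

```cpp
// recv_backward_sketch.h -- hypothetical header, not the real one.
#pragma once
#include <memory>

namespace example {

// Forward declaration: the full definition of DistAutogradContext lives in
// another header that (directly or indirectly) depends on this one, so
// including it here would create a circular dependency.
class DistAutogradContext;

class RecvBackwardSketch {
 public:
  explicit RecvBackwardSketch(std::shared_ptr<DistAutogradContext> ctx)
      : autogradContext_(std::move(ctx)) {}

 private:
  // std::shared_ptr can be declared, moved, and destroyed with an incomplete
  // element type; only code that dereferences it (in the .cpp file) needs
  // the complete definition of DistAutogradContext.
  std::shared_ptr<DistAutogradContext> autogradContext_;
};

} // namespace example
```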
@@ -30,7 +31,7 @@ class TORCH_API RecvRpcBackward : public torch::autograd::Node {
   const AutogradMetadata autogradMetadata_;

   // Hold a reference to the autograd context.
-  DistAutogradContext& autogradContext_;
+  std::shared_ptr<DistAutogradContext> autogradContext_;
same here
LGTM!
New commit triggers build failures. Will drop "request changes" when build and test pass.
Tests look OK now. Created #30110 to track the failed test, which is irrelevant to this PR.
This pull request has been merged in 63c957c.
Stack from ghstack:
We were passing around const and non-const references for
DistAutogradContext from DistAutogradContainer. This wasn't safe since the
context could be deleted from the container and a thread might still be using
the reference. This usually would happen when a backward pass fails on the node
driving the backward pass (resulting in delete context messages being sent to
all nodes) but other nodes are still executing code related to that autograd
context.
This was also the reason why `test_backward_autograd_engine_error` was flaky. Using a std::shared_ptr everywhere ensures we're safe and never crash.
Closes #28928
Closes #26922
Differential Revision: D18494814