
Add autograd hook for python rpc call #27576


Closed
wants to merge 10 commits

Conversation

zhaojuanmao
Contributor

@zhaojuanmao zhaojuanmao commented Oct 8, 2019

Stack from ghstack:

  1. Currently, even when tensors do not require grads and no grad functions are attached, an RPC is still sent with autograd meta as long as the autograd context is valid. This is not ideal.

This diff makes changes so that an RPC with autograd meta is sent only if the autograd context is valid and the tensors require grads (see the Python sketch below).

  2. Meanwhile, create a utility to attach autograd info and functions as needed.

  3. Add autograd send/recv functions for python rpc call.

  4. Make changes to support nested python rpc calls.

  5. Disallow nested dist autograd context (was landed in Distributed Autograd - FAST mode backward pass implementation. #27022).

Differential Revision: D17819153
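
For illustration, the condition described in point 1 boils down to the following check. This is a hypothetical pure-Python mirror of the logic; the real check lives in the C++ send path of this diff.

import torch

# Hypothetical helper: autograd metadata is attached to an outgoing RPC only
# when a dist autograd context is valid AND at least one tensor requires grad.
def needs_autograd_metadata(tensors, has_valid_context):
    return has_valid_context and any(t.requires_grad for t in tensors)

# Plain tensors never trigger autograd metadata, even inside a valid context.
assert not needs_autograd_metadata([torch.ones(2), torch.zeros(2)], has_valid_context=True)
assert needs_autograd_metadata([torch.ones(2, requires_grad=True)], has_valid_context=True)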

1. add autograd send/recv functions for python rpc call
2. make changes to support nested python rpc calls
3. disallow nested dist autograd context

Differential Revision: [D17819153](https://our.internmc.facebook.com/intern/diff/D17819153/)

[ghstack-poisoned]
@pytorchbot pytorchbot added the oncall: distributed and module: pybind labels Oct 8, 2019
zhaojuanmao added a commit that referenced this pull request Oct 8, 2019
Pull Request resolved: #27576

1. add autograd send/recv functions for python rpc call
2. make changes to support nested python rpc calls
3. disallow nested dist autograd context
ghstack-source-id: 91555159

Differential Revision: [D17819153](https://our.internmc.facebook.com/intern/diff/D17819153/)
@@ -52,6 +59,33 @@ DistAutogradContext* addRecvRpcBackward(
return nullptr;
}

std::shared_ptr<FutureMessage> sendMessageWithAutograd(
Member

it looks like this function doesn't always send a message with autograd (if there is no valid context). If we want to name this function sendMessageWithAutograd, should we refactor to remove this check and do the check before invoking sendMessageWithAutograd?

Contributor

@pritamdamania87 pritamdamania87 left a comment

Overall changes look great! I have mostly comments around additional tests and some code structure.


# Get send function.
self._verify_send_recv_functions_in_client(context_id, t1, t2, ret)
Contributor

We should call this verify_current_rank_context

# Now verify the autograd graph.
ctx = dist_autograd._retrieve_context(prev_rank_context_id)

self._verify_send_recv_functions_in_tensor_run(ctx)
Contributor

This should be called verify_prev_rank_context

Contributor Author

But it is not prev_rank; for a nested call, it is prev_prev_rank. Let me think about the naming more.

self.assertEqual(t2, next_funcs[1][0].variable)
self.assertEqual(0, next_funcs[1][1])
@dist_init
def test_autograd_functions_for_python_nested_call(self):
Contributor

Can we add a test where the nested RPC calls itself? Basically A->B->A?

Contributor

Also, can we add a test for more than 1 layer of nested calls? (could extend this test itself too). Something like A->B->C->D.

Contributor Author

will add them

with self.assertRaises(RuntimeError):
with dist_autograd.context() as context_id_1:
with dist_autograd.context() as context_id_2:
a = 1
Contributor

nit: you can use pass if you don't want to do anything in a block.

@@ -85,6 +85,9 @@ DistAutogradContext& DistAutogradContainer::getOrCreateContext(

const DistAutogradContext& DistAutogradContainer::newContext() {
std::lock_guard<std::mutex> guard(autograd_context_lock_);
TORCH_CHECK(
!hasValidContext(),
"Next context can be created only when there is no valid context.");
Contributor

nit: New instead of Next

@@ -146,6 +149,10 @@ int64_t DistAutogradContainer::getMaxId() {
return max_id_;
}

void DistAutogradContainer::set_current_context_id(int64_t context_id) {
Contributor

Should we add a TORCH_INTERNAL_ASSERT here such that the current_context_id_ isn't set already?

Comment on lines 154 to 159
// After processRpc() is done, clean up current_context_id_ to be -1,
// for a recv thread, current_context_id_ should always be invalid after
// processRpc() is done.
if (autogradContext != nullptr) {
autogradContainer.set_current_context_id(-1);
}
Contributor

We should clear the context id in request_callback.cpp in the operator() function (since we're guaranteed every RPC would go through here). We can do something like this:

Message RequestCallback::operator()(Message& request) const {
  ClearAutogradContextGuard guard;
  try {
    return processMessage(request);
  } catch (std::exception& e) {
    LOG(ERROR) << "Received error while processing request type "
               << request.type() << ": " << e.what();
    return createException(request, e);
  }
}

Basically in the destructor of ClearAutogradContextGuard we set the current context id to -1. Also, we shouldn't pass -1 here like this. The value -1 is an internal detail of the DistAutogradContext class. We should add a method called clearCurrentContext() which sets the value to -1.

// Wrap the response with autograd, need a new autograd message id for
// each send/recv pair.
auto& autogradContainer = DistAutogradContainer::getInstance();
AutogradMetadata responseAutogradMetadata(
Contributor

We should use sendMessageWithAutograd here.

// passed in the chain calls.
auto& autogradContainer = DistAutogradContainer::getInstance();
if (autogradContext != nullptr) {
autogradContainer.set_current_context_id(autogradContext->context_id());
Contributor

What prevents another thread setting the context id to a different value immediately after this line?

Contributor

current_context_id is a thread local variable.
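
A rough Python analogue of the thread-local behavior being described (a sketch, not the actual DistAutogradContainer API): each recv thread sees its own slot, so another thread setting a different context id cannot clobber this one.

import threading

_tls = threading.local()  # per-thread storage, analogous to thread_local in C++

def set_current_context_id(context_id):
    _tls.context_id = context_id

def current_context_id():
    # -1 mirrors the "no valid context" sentinel discussed above
    return getattr(_tls, "context_id", -1)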

// rpc call in python rpc call, original context_id from client can be
// passed in the chain calls.
auto& autogradContainer = DistAutogradContainer::getInstance();
if (autogradContext != nullptr) {
Contributor

is it expected to see MESSAGE_WITH_AUTOGRAD_REQ if there is no valid context?

Contributor Author

autogradContext == nullptr here means either that there is no valid context or that the tensors do not require grads


# Wait for the prev rank to be done with rpc.
while not prev_rank_rpc_done:
time.sleep(0.1)
Contributor

Shall we add a timeout for this?
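
One possible shape for such a wait-with-timeout, as a minimal sketch (not part of this PR): poll a predicate and fail loudly instead of spinning forever.

import time

def wait_until(predicate, timeout=10.0, interval=0.1):
    deadline = time.time() + timeout
    while not predicate():
        if time.time() >= deadline:
            raise RuntimeError("timed out waiting for condition")
        time.sleep(interval)

# usage in the test would look roughly like:
# wait_until(lambda: prev_rank_rpc_done)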

self.assertEqual(1, len(recv_functions))
self.assertEqual(ret.grad_fn, list(recv_functions.values())[0])

# Host receives tensors and actually runs tensor operations, return tensor
Contributor

Could you please elaborate a bit more in the comments? What does _in_tensor_run mean? Isn't this function verifying the autograd graph structure?

Contributor Author

yes, the naming is bad, I will change it

next_funcs = list(send_functions.values())[0].next_functions
self.assertEqual(1, len(next_funcs))
add_backward_fn = next_funcs[0][0]
self.assertEqual("AddBackward0", add_backward_fn.name())
Contributor

This assumes a very specific autograd graph structure, but the name of this function seems to suggest it could work for any ctx. Let's add more descriptions in the function comments to describe what is the expected structure, and what functions are called in the forward pass.

next_funcs = list(send_functions.values())[1].next_functions
self.assertEqual(1, len(next_funcs))
self.assertEqual(
"torch::distributed::autograd::RecvRpcBackward", next_funcs[0][0].name()
Contributor

(This does not need to be done in this PR)

It seems autograd function checks are scattered in several places, making it a little difficult to track. Is it possible to implement one general autograd function checking utility method which takes a ctx and a nested list/tuple/dict (or we could just add our own graph structure) expected_graph? Then a test could do the forward pass, construct expected_graph right next to the forward pass (so that it is much easier to verify they match), and then call the generic autograd function checking method to verify correctness.
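
A minimal sketch of what such a checker could look like; the helper name and the (name_substring, [children]) encoding of expected_graph are hypothetical, and the distributed Send/Recv nodes would be matched by the same name substrings used elsewhere in these tests.

import torch

def verify_graph(grad_fn, expected):
    # `expected` is a nested (name_substring, [children]) structure; a node
    # matches if name_substring appears in grad_fn.name(); None entries in
    # next_functions are skipped.
    name, children = expected
    assert grad_fn is not None and name in grad_fn.name(), grad_fn
    next_fns = [fn for fn, _ in grad_fn.next_functions if fn is not None]
    assert len(next_fns) == len(children), next_fns
    for fn, child in zip(next_fns, children):
        verify_graph(fn, child)

# Local (non-RPC) usage example:
t1 = torch.rand(3, requires_grad=True)
t2 = torch.rand(3, requires_grad=True)
loss = torch.add(t1, t2).sum()
verify_graph(
    loss.grad_fn,
    ("SumBackward", [("AddBackward", [("AccumulateGrad", []), ("AccumulateGrad", [])])]),
)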


@dist_init
def test_nested_contex(self):
with self.assertRaises(RuntimeError):
Contributor

Will the RuntimeError also come with a string message? Let's also verify it is the exact RuntimeError we are expecting.

Q: is this just a temporary limitation that we do not support nested context or will it always be in that way?

Contributor Author

It is always like this; nested contexts do not work right now, as current_context_id will be cleared when the inner context exits. I'm just adding checks to make it more explicit.

@@ -146,6 +149,10 @@ int64_t DistAutogradContainer::getMaxId() {
return max_id_;
}

void DistAutogradContainer::set_current_context_id(int64_t context_id) {
Contributor

nit: Use camelCase everywhere.

pass

# Now verify the autograd graph.
ctx = dist_autograd._retrieve_context(prev_rank_context_id)
Member

This is not going to succeed with the changes in #27951, because it is possible that the prev_rank (i.e. the rank with context_id prev_rank_context_id) would have exited the autograd context manager in their thread, and thus destroyed the autograd context. Instead of using time.sleep() as above, I think we need to do something like rpc.sync_rpc(), though @pritamdamania87 mentioned that this is going away. Is there any other way to wait for all outstanding RPCs to complete?

Contributor Author

Thanks for the heads up. If your diff lands first, I will rebase on it; otherwise I will leave this implementation as it is for now, and your diff can rebase on mine and add the change properly. Does that sound good to you?

Member

Sounds good!

@@ -160,3 +268,11 @@ def test_rpc_complex_args(self):
self.assertEqual(tensors[i], next_funcs[i][0].variable)
else:
self.assertIsNone(next_funcs[i][0])

@dist_init
def test_nested_contex(self):
Member

nit: test_nested_context

zhaojuanmao added a commit that referenced this pull request Oct 16, 2019
Pull Request resolved: #27576

1. add autograd send/recv functions for python rpc call
2. make changes to support nested python rpc calls
3. disallow nested dist autograd context
ghstack-source-id: 92041048

Differential Revision: [D17819153](https://our.internmc.facebook.com/intern/diff/D17819153/)
zhaojuanmao added a commit that referenced this pull request Oct 17, 2019
Pull Request resolved: #27576

1. add autograd send/recv functions for python rpc call
2. make changes to support nested python rpc calls
3. disallow nested dist autograd context
ghstack-source-id: 92067068

Differential Revision: [D17819153](https://our.internmc.facebook.com/intern/diff/D17819153/)
zhaojuanmao added a commit that referenced this pull request Oct 17, 2019
Pull Request resolved: #27576

1. currently, even when tensors do not require grads and no grad functions are attached,
an rpc is still sent with autograd meta as long as the autograd context is valid. This is not ideal.
This diff makes changes to make sure an rpc with autograd meta is sent only if the autograd context is valid and the tensors require grads

2. meanwhile create a utility to attach autograd info and functions as needed

3. add autograd send/recv functions for python rpc call

4. make changes to support nested python rpc calls

5. disallow nested dist autograd context (was landed in #27022)
ghstack-source-id: 92090804

Differential Revision: [D17819153](https://our.internmc.facebook.com/intern/diff/D17819153/)
@zhaojuanmao
Contributor Author

Did more refactoring, addressed comments, and added more unit tests. The summary is also updated.

It is ready for another round of review now.

One pending issue: although local tests pass, CI tests still have a linker issue. Looking into it.

zhaojuanmao added a commit that referenced this pull request Oct 17, 2019
Pull Request resolved: #27576

1. currently, even when tensors do not require grads and no grad functions are attached,
an rpc is still sent with autograd meta as long as the autograd context is valid. This is not ideal.
This diff makes changes to make sure an rpc with autograd meta is sent only if the autograd context is valid and the tensors require grads

2. meanwhile create a utility to attach autograd info and functions as needed

3. add autograd send/recv functions for python rpc call

4. make changes to support nested python rpc calls

5. disallow nested dist autograd context (was landed in #27022)
ghstack-source-id: 92123039

Differential Revision: [D17819153](https://our.internmc.facebook.com/intern/diff/D17819153/)
@zhaojuanmao
Contributor Author

Added TORCH_API to the newly added functions to fix the linker issue in the CI tests.

# respectively.
# [RankDistance.PREV_PREV] represents for prev of prev rank.
# [RankDistance.PREV_PREV_PREV] represents for prev of prev of prev rank.
class RankDistance(IntEnum):
Contributor

Why don't we just use integers for the distance instead of an enum like this? That way we can extend this to multiple nested levels later on easily.

Contributor Author

I thought an enum is more descriptive, but yes, I can change it back to integers so that more levels can be added easily.
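
For illustration, the module-level bookkeeping with plain integer distances might look like this (a sketch; rpc_done and ctx_ids mirror the names that show up later in this review):

MAX_NESTED_HOPS = 3
rpc_done = [False] * (MAX_NESTED_HOPS + 1)  # rpc_done[d]: rpc finished on the rank d hops back
ctx_ids = [-1] * (MAX_NESTED_HOPS + 1)      # ctx_ids[d]: context id seen on the rank d hops back

def _set_rpc_done(ctx_id, rank_distance):
    rpc_done[rank_distance] = True
    ctx_ids[rank_distance] = ctx_id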

return torch.add(t1, t2)


def my_py_nested_call(t1, t2, dst, world_size, ttl):
Contributor

nit: instead of ttl use hops

rpc::MessageType msgType);

// Send message after autograd checking
TORCH_API std::shared_ptr<torch::distributed::rpc::FutureMessage> sendMessage(
Contributor

nit: call this sendMessageWithAutograd since we might attach autograd information here.

// and attach autograd function for each type of rpc call if it has valid
// context and tensors require grads, in this case, return RpcWithAutograd
// message; otherwise return original rpc message.
TORCH_API rpc::Message getMessageWithAutogradCheck(
Contributor

nit: remove the Check at the end.

@@ -20,9 +20,27 @@ Message createException(const Message& request, const std::exception& e) {
request.id());
}

struct ClearAutogradContextGuard {
ClearAutogradContextGuard() {
clear();
Contributor

Do we need to do this in the constructor? Destructor is enough right?

std::unique_ptr<RpcCommandBase> rpcCommand) {
Message getMessageWithAutogradCheck(
const rpc::worker_id_t dstId,
torch::distributed::rpc::Message&& wrappedRpcMsg,
Contributor

This method should take the wrappedRpc and not wrappedMsg

std::shared_ptr<FutureMessage> sendMessage(
RpcAgent& agent,
const WorkerInfo& dst,
torch::distributed::rpc::Message&& wrappedRpcMsg,
Contributor

Can we pass in wrappedRpc in both sendMessage() and getMessageWithAutogradCheck with above and inside getMessageWithAutogradCheck call toMessage on the rpc? That seems cleaner from an API standpoint.

Contributor Author

In request_callback_impl.cpp, getMessageWithAutogradCheck is called, and there we can only pass the wrappedRpcResponse message to it.

RpcAgent& agent,
const WorkerInfo& dst,
torch::distributed::rpc::Message&& wrappedRpcMsg,
MessageType msgType) {
Contributor

Why do we have msgType here? Isn't it always FORWARD_AUTOGRAD_REQ?

self._verify_graph_for_rpc_call_exec(list(send_functions.values())[0])

# Rank0->Rank1->Rank0
@dist_init
Contributor

Can we also add a test where none of the tensors require grad and verify that we don't attach send/recv functions anywhere?

Contributor Author

Yes, and we need to verify that no prev context is passed over when there are no tensors requiring grads.

MessageType msgType) {
auto& autogradContainer = DistAutogradContainer::getInstance();

if (!autogradContainer.hasValidContext() ||
Contributor

nit: Add a small comment here explaining why we do this.

@zhaojuanmao
Contributor Author

addressed comments

zhaojuanmao added a commit that referenced this pull request Oct 17, 2019
Pull Request resolved: #27576

1. currently, even when tensors do not require grads and no grad functions are attached,
an rpc is still sent with autograd meta as long as the autograd context is valid. This is not ideal.
This diff makes changes to make sure an rpc with autograd meta is sent only if the autograd context is valid and the tensors require grads

2. meanwhile create a utility to attach autograd info and functions as needed

3. add autograd send/recv functions for python rpc call

4. make changes to support nested python rpc calls

5. disallow nested dist autograd context (was landed in #27022)
ghstack-source-id: 92154535

Differential Revision: [D17819153](https://our.internmc.facebook.com/intern/diff/D17819153/)
Contributor

@mrshenli mrshenli left a comment

Looks good from my side; all my comments are nits or follow-ups.

prev_rank_rpc_done = False
prev_rank_context_id = 0
# Right now we test up to 3-layer nested rpc calls.
# rpc_done[1] and ctx_ids[1] represent rpc is done in prev rank, and context id
Contributor

rpc_done[0] is the current rank?

self.assertEqual(ret.grad_fn, recv_function)

# For a context passed from previous nested chain calls, this rank
# recevied two tensors t1 and t2, execute torch.add(t1, t2) and send result
Contributor

receives, executes, sends

self.assertEqual(next_funcs[0][0], next_funcs[1][0])

# For a context passed from previous nested chain calls, this rank
# recevied two tensors t1 and t2, forwarding t1 and t2 tensors using
Contributor

recevie -> receive

receives, and forwards

# nested rpc call to next dst. In return route, receive result tensor t3
# from next dst and forwarding t3 back to previous calls.
# For this context in this rank, it expects graph like this:
# send and recv functions while recevive and forward t1 and t2:
Contributor

recevive -> receive

while recevive and forward t1 and t2 -> for receving and forwarding t1 and t2?

# rpcSendBackward
# / \
# t1.recvRpcBackward t2.recvRpcBackward
# send and recv functions while receive and forward t3:
Contributor

ditto

def my_py_nested_call(t1, t2, dst, world_size, hops):
next_dst = (dst + 1) % world_size
if hops > 0:
return rpc.rpc_sync("worker{}".format(next_dst), my_py_nested_call,
Contributor

(Does not need to be in this PR). Let's also test async rpc calls, e.g., making multiple async calls, collect all futures in a list, and wait on all list in the end.
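
A rough fragment of the suggested async variant (assuming it runs inside a @dist_init test where the RPC agent is initialized, and that rpc_async is available alongside the rpc_sync used above):

import torch
import torch.distributed.rpc as rpc

def _async_add_many(dst_rank, n=4):
    t1 = torch.ones(3, 3, requires_grad=True)
    t2 = torch.ones(3, 3, requires_grad=True)
    # fire off several async calls and keep the futures in a list...
    futs = [
        rpc.rpc_async("worker{}".format(dst_rank), torch.add, args=(t1, t2))
        for _ in range(n)
    ]
    # ...then wait on all of them at the end, as suggested
    return [fut.wait() for fut in futs]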

} // anonymous namespace

Message RequestCallback::operator()(Message& request) const {
// For a rev thread, current context id should be invalid outside
Contributor

rev -> recv

} // anonymous namespace

Message RequestCallback::operator()(Message& request) const {
// For a rev thread, current context id should be invalid outside
// processMessage().
ClearAutogradContextGuard guard;
Contributor

(does not need to be done in this PR) It looks a little weird that we have a guard that only clears the context but does not set it. I wonder if it would be better to let the guard govern both setting and clearing the context?

Contributor Author

@mrshenli it is possibly hard to set the context here, as we can only set it after addRecvBackward() in FORWARD_AUTOGRAD_REQ

Contributor

@mrshenli mrshenli left a comment

Test failure looks unrelated:

02:16:34 Failed to reproduce exception. Expected: 
02:16:34 Traceback (most recent call last):
02:16:34   File "/var/lib/jenkins/.local/lib/python2.7/site-packages/hypothesis/core.py", line 669, in evaluate_test_data
02:16:34     result = self.execute(data, collect=True)
02:16:34   File "/var/lib/jenkins/.local/lib/python2.7/site-packages/hypothesis/core.py", line 584, in execute
02:16:34     result = self.test_runner(data, run)
02:16:34   File "/var/lib/jenkins/.local/lib/python2.7/site-packages/hypothesis/executors.py", line 58, in default_new_style_executor
02:16:34     return function(data)
02:16:34   File "/var/lib/jenkins/.local/lib/python2.7/site-packages/hypothesis/core.py", line 580, in run
02:16:34     return test(*args, **kwargs)
02:16:34   File "/var/lib/jenkins/.local/lib/python2.7/site-packages/caffe2/python/operator_test/unique_ops_test.py", line 39, in test_unique_op
02:16:34     X=hu.tensor1d(
02:16:34   File "/var/lib/jenkins/.local/lib/python2.7/site-packages/hypothesis/core.py", line 524, in test
02:16:34     result = self.test(*args, **kwargs)
02:16:34   File "/var/lib/jenkins/.local/lib/python2.7/site-packages/caffe2/python/operator_test/unique_ops_test.py", line 62, in test_unique_op
02:16:34     outputs_to_check=[0, 1] if return_remapping else [0]
02:16:34   File "/var/lib/jenkins/.local/lib/python2.7/site-packages/caffe2/python/hypothesis_test_util.py", line 417, in assertDeviceChecks
02:16:34     dc.CheckSimple(op, inputs, outputs_to_check, input_device_options)
02:16:34   File "/usr/lib64/python2.7/unittest/case.py", line 462, in assertTrue
02:16:34     raise self.failureException(msg)
02:16:34 AssertionError: False is not true

@@ -62,12 +62,15 @@ DistAutogradContext* addRecvRpcBackward(
return &autogradContext;
}

Message getMessageWithAutogradCheck(
Message getMessageWithAutograd(
const rpc::worker_id_t dstId,
torch::distributed::rpc::Message&& wrappedRpcMsg,
MessageType msgType) {
Contributor

Why do we pass msgType here? We just need to directly set FORWARD_AUTOGRAD_REQ on line 87

Contributor Author

@zhaojuanmao zhaojuanmao Oct 18, 2019

because it could be passed with FORWARD_AUTOGRAD_REQ or FORWARD_AUTOGRAD_RESP

@zhaojuanmao
Contributor Author

Yeah, the test failures are not relevant; we skipped ROCm tests for distributed as well.

# prev context id is not passed over as tensors do not require grads
with self.assertRaises(RuntimeError):
ctx = dist_autograd._retrieve_context(ctx_ids[1])

Member

I think we need a dist.barrier() here, since this test is modified to test for state on a process that is set by an RPC from another process (previously this was just testing local state). Without dist.barrier(), for example, worker 0 could run the 2nd portion of this test (where the tensors do require gradients), and create the context on worker 1. Then worker 1 can run in the 1st portion of the test, and the assert would fail.

Contributor Author

Ah, my bad, I forgot to call "self._check_rpc_done(1)" before calling _retrieve_context(ctx_ids[1]).

@facebook-github-bot
Contributor

This pull request has been merged in 56c4215.

@facebook-github-bot facebook-github-bot deleted the gh/zhaojuanmao/10/head branch October 28, 2019 22:23
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Summary:
Pull Request resolved: pytorch#27576

1. currently, even when tensors do not require grads and no grad functions are attached,
an rpc is still sent with autograd meta as long as the autograd context is valid. This is not ideal.
This diff makes changes to make sure an rpc with autograd meta is sent only if the autograd context is valid and the tensors require grads

2. meanwhile create a utility to attach autograd info and functions as needed

3. add autograd send/recv functions for python rpc call

4. make changes to support nested python rpc calls

5. disallow nested dist autograd context (was landed in pytorch#27022)
ghstack-source-id: 92154535

Test Plan: unit tests

Differential Revision: D17819153

fbshipit-source-id: 37d8a85855bf591f2f2da48d475a06e870a30ea1