From a0e876c923bd0d6ef635bb72e05f05f04c25654e Mon Sep 17 00:00:00 2001 From: Howard Huang Date: Wed, 11 Nov 2020 15:47:16 -0800 Subject: [PATCH 1/2] small doc changes for Remote Reference Protocol --- docs/source/rpc/rref.rst | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/source/rpc/rref.rst b/docs/source/rpc/rref.rst index 822f10a32f8c..9fcccf8e15d9 100644 --- a/docs/source/rpc/rref.rst +++ b/docs/source/rpc/rref.rst @@ -13,7 +13,7 @@ Background ^^^^^^^^^^ RRef stands for Remote REFerence. It is a reference of an object which is -located on the local or a remote worker, and transparently handles reference +located on the local or remote worker, and transparently handles reference counting under the hood. Conceptually, it can be considered as a distributed shared pointer. Applications can create an RRef by calling :meth:`~torch.distributed.rpc.remote`. Each RRef is owned by the callee worker @@ -42,9 +42,9 @@ Assumptions RRef protocol is designed with the following assumptions. - **Transient Network Failures**: The RRef design handles transient - network failures by retrying messages. Node crashes or permanent network - partition is beyond the scope. When those incidents occur, the application - may take down all workers, revert to the previous checkpoint, and resume + network failures by retrying messages. It cannot handle node crashes or + permanent network partitions. When those incidents occur, the application + should take down all workers, revert to the previous checkpoint, and resume training. - **Non-idempotent UDFs**: We assume the user functions (UDF) provided to :meth:`~torch.distributed.rpc.rpc_sync`, @@ -91,8 +91,8 @@ guarantee: As messages might come delayed or out-of-order, we need one more guarantee to make sure the delete message is not processed too soon. If A sends a message to -B that involves an RRef, we call the RRef on A the parent RRef and the RRef on B -the child RRef. 
+B that involves an RRef, we refer to the RRef on A as the parent RRef and the +RRef on B as the child RRef. **G2. Parent RRef will NOT be deleted until the child RRef is confirmed by the owner.** @@ -125,7 +125,7 @@ possible that the child ``UserRRef`` may be deleted before the owner knows its parent ``UserRRef``. Consider the following example, where the ``OwnerRRef`` forks to A, then A forks -to Y, and Y forks to Z.: +to Y, and Y forks to Z: .. code:: From 25602e0bceb7acbfde06dc042526798173c65645 Mon Sep 17 00:00:00 2001 From: Howard Huang Date: Tue, 17 Nov 2020 13:16:09 -0800 Subject: [PATCH 2/2] change wording for clarity --- docs/source/rpc/distributed_autograd.rst | 2 +- docs/source/rpc/rref.rst | 12 ++++++------ 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/source/rpc/distributed_autograd.rst b/docs/source/rpc/distributed_autograd.rst index 4a4f7855a733..61af22b9486f 100644 --- a/docs/source/rpc/distributed_autograd.rst +++ b/docs/source/rpc/distributed_autograd.rst @@ -270,7 +270,7 @@ As an example the complete code with distributed autograd would be as follows: # Retrieve the gradients from the context. dist_autograd.get_gradients(context_id) -The distributed autograd graph with dependencies would be as follows: +The distributed autograd graph with dependencies would be as follows (``t5.sum()`` excluded for simplicity): .. image:: ../_static/img/distributed_autograd/distributed_dependencies_computed.png diff --git a/docs/source/rpc/rref.rst b/docs/source/rpc/rref.rst index 9fcccf8e15d9..3d5197111038 100644 --- a/docs/source/rpc/rref.rst +++ b/docs/source/rpc/rref.rst @@ -87,7 +87,7 @@ The only requirement is that any ``UserRRef`` must notify the owner upon destruction. Hence, we need the first guarantee: -**G1. The owner will be notified when any ``UserRRef`` is deleted.** +**G1. 
The owner will be notified when any UserRRef is deleted.** As messages might come delayed or out-of-order, we need one more guarantee to make sure the delete message is not processed too soon. If A sends a message to @@ -132,12 +132,12 @@ to Y, and Y forks to Z: OwnerRRef -> A -> Y -> Z If all of Z's messages, including the delete message, are processed by the -owner before all messages from Y, the owner will learn Z's deletion before -knowing Y. Nevertheless, this does not cause any problem. Because, at least -one of Y's ancestors will be alive (in this case, A) and it will +owner before Y's messages, the owner will learn of Z's deletion before +knowing Y exists. Nevertheless, this does not cause any problem, because at least +one of Y's ancestors will be alive (A) and it will prevent the owner from deleting the ``OwnerRRef``. More specifically, if the -owner does not know Y, A cannot be deleted due to **G2**, and the owner knows A -as the owner is A's parent. +owner does not know Y, A cannot be deleted due to **G2**, and the owner knows A +since it is A's parent. Things get a little trickier if the RRef is created on a user: