Implement timeout support for RRefs #38590

rohan-varma · 2020-05-16T01:33:20Z

Stack from ghstack:

Implement timeout support for RRefs #38590 Implement timeout support for RRefs

This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works:

Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call).
Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an std::terminate. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an RPCErrorType enum and defined RRef error handlers to handle the RPCErrorrTypes (currently just timeout and unknown)
A timeout parameter is added to to_here() which gives the user control over the max amount of time it can block for.
Before blocking, to_here checks if the RRef creation on the remote node has timed out, and if so, throws with that information.
ctx.prepareChildForFork() which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the rpc.remote() call had timed out, and if so, raises that error to the user.
Tests are added, primarily via delay injection.

Differential Revision: D21588165

NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on Phabricator!

This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)! [ghstack-poisoned]

This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)! ghstack-source-id: 104229474 Pull Request resolved: #38590

dr-ci · 2020-05-16T01:53:04Z

💊 CI failures summary and remediations

As of commit a72eeed (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI (expand for details).

Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 51 times.

This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)! [ghstack-poisoned]

Pull Request resolved: #38590 This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. ghstack-source-id: 104236695 Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)!

This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - Before blocking, `to_here` checks if the RRef creation on the remote node has timed out, and if so, throws with that information. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)! [ghstack-poisoned]

Pull Request resolved: #38590 This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. ghstack-source-id: 104259633 Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)!

jjlilley · 2020-05-18T15:34:50Z

torch/csrc/distributed/rpc/message.h

@@ -8,6 +8,9 @@ namespace torch {
 namespace distributed {
 namespace rpc {

+// An enum denoting common RPC errors to allow specific error handling for them.
+enum RPCErrorType { TIMEOUT = 0, UNKNOWN_ERROR = 10 };


Maybe it might make more sense for unknown error to be 0?

jjlilley · 2020-05-18T15:43:57Z

torch/csrc/distributed/rpc/utils.cpp

+      RpcAgent::getCurrentRpcAgent()->getTimeoutErrorDescription();
+  timeoutErrors.emplace_back("failed intentionally");
+
+  auto err = std::string(fm.error()->what());


I like that we're trying to distinguish between failure causes here.
Though scanning strings for a vector of possible error cases feels like it might be a bit cumbersome, and unnecessarily widen the RpcAgent api.

As a potential [half-baked] alternate suggestion, if we need the scan strings from the future error, maybe we can establish a convention where we tag the error string?

e.g. maybe something like "RpcErr:code"?
"RpcErr:1 RPC ran for more than {} milliseconds and timed out"
"Host {} is down (RpcErr:2)"
"Host {} is overloaded (RpcErr:3)" ?
"RRef doesn't exist (RpcErr:4)"
etc.

(The tag needn't be at the beginning, we can just strstr for the first occuring place, and if there's a parsable int, use that. Otherwise, categorize as unknown. And the logic can be put into a virtual RpcAgent.cpp method)

Sure, this sounds reasonable. I think we can take the error type from the enum when we are handling the error (since we know what the error is) and then format that into the error message string. Then in this function, instead of parsing for substrings, we could instead parse out the error code as you suggested.

@jjlilley I updated the diff so that common errors (when we can detect them in the RPC agents) are tagged with the function makeRPCError, and this parsing logic is updated to attempt to get the tagged error from the error message.

This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - Before blocking, `to_here` checks if the RRef creation on the remote node has timed out, and if so, throws with that information. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)! [ghstack-poisoned]

Pull Request resolved: #38590 This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. ghstack-source-id: 104410529 Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)!

This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - Before blocking, `to_here` checks if the RRef creation on the remote node has timed out, and if so, throws with that information. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)! [ghstack-poisoned]

Pull Request resolved: #38590 This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. ghstack-source-id: 104456888 Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)!

jjlilley · 2020-05-20T19:34:32Z

torch/csrc/distributed/rpc/utils.cpp

+}
+
+std::string makeRPCError(std::string rpcErrorStr, RPCErrorType errorType) {
+  return fmt::format(


interesting, is it now legit to use libfmt in pytorch code?

I think since #37241, it is supported in libtorch

I think since #37241, it is supported in libtorch

awesome, good to know

jjlilley · 2020-05-20T19:35:25Z

torch/csrc/distributed/rpc/utils.cpp

+  if (pos != std::string::npos) {
+    // Parse the RPCErrorType.
+    auto errIdx = pos + torch::distributed::rpc::kRPCErrorPrefix.size() + 1;
+    auto errStr = err.substr(errIdx, err.find(":", errIdx) - errIdx);


what if err.find(':", errIdx) is npos?

I assumed that all users of this API would pass in an error created with the other function makeRPCErrorType (so the find() would succeed), although this of course doesn't have to be true. In that case, I think we can return unknown_errror.

torch/csrc/distributed/rpc/rref_impl.h

torch/csrc/distributed/rpc/message.h

torch/csrc/distributed/rpc/process_group_agent.h

jjlilley · 2020-05-20T19:46:18Z

torch/csrc/distributed/rpc/rref_impl.cpp


+  // TODO: we should ideally be able to interrupt this blocking wait if we check
+  // isTimedOut() and it is true.


isn't the wait() going to be interrupted by the futureResponse being satisfied? or is the issue that the rref_context needs to handleError, and we simply haven't had the timedout set yet?

(we could probably interpret any thrown exception here if it made sense?)

Yes, the issue could be that we haven't had the timeout set yet. The specific case that I was thinking of is something like,

Say rpc.remote() call is completely dropped on the receiver, (or takes a long time to process)

Sender, after rpc.remote(), calls rref.to_here(), thus blocking

We are notified that (1) failed (via a timeout), and mark that future with an error and call setTimedOut() on the rref.

However since in (2) we have already started blocking on a (different) future, we should ideally be able to interrupt this with the information we have from (3)

Although, this isn't too big of a deal since the blocking future in this call would eventually time out as well.

This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - Before blocking, `to_here` checks if the RRef creation on the remote node has timed out, and if so, throws with that information. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)! [ghstack-poisoned]

Pull Request resolved: #38590 This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. ghstack-source-id: 104561702 Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)!

mrshenli · 2020-05-23T15:46:56Z

torch/csrc/distributed/rpc/init.cpp

@@ -203,12 +203,22 @@ PyObject* rpc_init(PyObject* /* unused */) {
              )")
          .def(
              "to_here",
-              &PyRRef::toHere,
+              [](PyRRef& self, float rpcTimeoutSeconds) {


Any reason we need this lambda wrapper instead of directly using &PyRRef::toHere?

Unneeded, removed

mrshenli · 2020-05-23T15:47:12Z

torch/csrc/distributed/rpc/init.cpp

              py::call_guard<py::gil_scoped_release>(),
              R"(
                  Blocking call that copies the value of the RRef from the owner
                  to the local node and returns it. If the current node is the
                  owner, returns a reference to the local value.
+
+                  Arguments:
+                        timeout (Optional, float): Timeout for to_here. If the


mrshenli · 2020-05-23T15:48:36Z

torch/csrc/distributed/rpc/message.h

@@ -20,7 +28,7 @@ enum MessageType {
  // messages for dist.remote on builtin operators and Python UDF
  SCRIPT_REMOTE_CALL = 4, // A remote call on a builtin operator
  PYTHON_REMOTE_CALL = 5, // A remote call on a Python UDF
-  REMOTE_RET = 6, // A remote call on a Python UDF
+  REMOTE_RET = 6, // Response for remote calls for UDF, builtin, or script


Thanks for fixing this!

mrshenli · 2020-05-23T15:57:29Z

torch/csrc/distributed/rpc/utils.cpp

@@ -9,17 +9,54 @@
 #include <torch/csrc/distributed/rpc/python_call.h>
 #include <torch/csrc/distributed/rpc/python_remote_call.h>
 #include <torch/csrc/distributed/rpc/python_resp.h>
+#include <torch/csrc/distributed/rpc/rpc_agent.h>


it looks a little weird to have utils depends on rpc_agent here. If kRPCErrorPrefix is the only reason for this, does it make sense to define kRPCErrorPrefix in this file?

mrshenli · 2020-05-23T16:00:22Z

torch/distributed/rpc/api.py

+                                   ``to`` is not successfully processed on this
+                                   worker within this timeout, then the next time
+                                   there is an attempt to use the RRef (such as
+                                   to_here()), a timeout will be raised indicating


double quote

mrshenli · 2020-05-23T16:00:33Z

torch/distributed/rpc/api.py

+                                   infinite timeout, i.e. a timeout error will
+                                   never be raised. If not provided, the default
+                                   value set during initialization or with
+                                   `_set_rpc_timeout` is used.


mrshenli · 2020-05-23T16:05:12Z

torch/csrc/distributed/rpc/rref_context.cpp

  }
  RRefContext::getInstance().delPendingUser(expectedForkId);
-  // Potentially propagate to the userRRef?
-  RRefContext::handleException(futureMessage);
 }

 c10::intrusive_ptr<RRef> finishCreatingOwnerRRef(


Do we also need to check timeout error here, in case the remote is sent to self?

mrshenli · 2020-05-23T16:08:14Z

torch/testing/_internal/distributed/rpc/rpc_test.py

+        if self.rank != 0:
+            return
+        # test the case where rpc.remote() message creation is completely dropped.
+        dst_rank = (self.rank + 1) % self.world_size


Let's also test for sending to self.

Good catch! The next version of this diff will have thes updates.

This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - Before blocking, `to_here` checks if the RRef creation on the remote node has timed out, and if so, throws with that information. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)! [ghstack-poisoned]

Pull Request resolved: #38590 This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. ghstack-source-id: 104961930 Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)!

jjlilley · 2020-06-01T16:09:05Z

torch/csrc/distributed/rpc/rref_context.cpp

+    for (auto& rrefId : unforkedOwners) {
+      LOG(INFO) << "Removing unforked OwnerRRef with RRefId: " << rrefId;
+      auto iter = owners_.find(rrefId);
+      owners_.erase(iter);


What's our protocol for potentially deleting the intrusive_ptr in owners_ with the lock held?
In delForkOfOwner() we maintain the intrusive pointer until after the lock is dropped, perhaps to avoid excess work with the lock held.

(There's a separate comment about whether an O(owners_) operation is needed here, but I guess for delAllUsers, there probably should be.

@jjlilley , it seems like the only place where the deletedRRef is used is to explicitly acquire GIL and reset it if it holds a py object. I'm assuming this is to properly cleanup the python objects, so maybe we need to do something similar here. @mrshenli Do you have any thoughts on this?

The O(owners_) point is true, though this function only runs once during RPC shutdown (and also has a couple blocking waits and RPC calls), although, if we are concerned about the performance, there may be some things we can do to make this portion function non-blocking .

jjlilley · 2020-06-01T16:13:49Z

torch/csrc/distributed/rpc/testing/faulty_process_group_agent.cpp

+  float msgDelay = getDelayForMessage(message.type());
+  if (msgDelay != 0) {
+    // Sleep for the specified delay for the message.
+    std::this_thread::sleep_for(std::chrono::milliseconds(


Just a thought: Maybe to make the sleep action to be slightly more realistic (i.e. only affect the specified message, vs pausing everything behind it as well), it might be worth spawning a detached thread?, e.g.

std::thread thr([msgDelay, message, this]() { std::this_thread::sleep_for(std::chrono::milliseconds(...); ProcessGroupAgent::sendToSelf(std::move(message)); }); thr.detach();

The detached thread is sketchy in prod code, seems reasonable for test cases

Great idea! This will definitely make the delays more realistic.

I will plan to do this in a follow up PR.

This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - Before blocking, `to_here` checks if the RRef creation on the remote node has timed out, and if so, throws with that information. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)! [ghstack-poisoned]

Pull Request resolved: #38590 This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. ghstack-source-id: 105232837 Differential Revision: [D21588165](https://our.internmc.facebook.com/intern/diff/D21588165/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21588165/)!

facebook-github-bot · 2020-06-04T12:09:56Z

This pull request has been merged in 8b2bb02.

* [ONNX] Bump up ONNX submodule to a82c6a7010e2e332d8f74ad5b0c726fd47c85376 (#39372) Summary: Pull Request resolved: #39372 we only bump the submodule in oss to unblock some works Test Plan: ci Reviewed By: hl475 Differential Revision: D21830800 fbshipit-source-id: fb4a716992efcd71926f7bba24a7c24422c17e38 * Add rpc.functions.async_execution decorator for rpc_sync/rpc_async (#39216) Summary: Pull Request resolved: #39216 The `rpc.functions.async_execution` decorator specifies that the wrapped function is guaranteed to return a `torch.futures.Future`. The decorator adds a `_wrapped_async_rpc_function` attribute to the wrapper function. The caller retrieves this information and then sets `isAsyncFunction` argument accordingly which is later added to PythonCall RPC message as a field. On the callee side, if the PythonCall carries an asynchronous function, it will cast the function's return value to a jit::PythonFutureWrapper object, and then install response creation and communication as a callback on the that jit::PythonFutureWrapper. For applications, this feature is useful when a function needs to wait for IO or additional singaling. In those cases, marking the user function as `rpc.functions.async_execution` will prevent it from blocking one thread on callee for too long. Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D21779962 fbshipit-source-id: 6b6aa698bf6f91dad6ed2a7ee433df429b59e941 * Fix index overflow in ConvTranspose3d [attempt 2] (#39198) Summary: Fixes #32866, resubmit of #38970 The memory error in the issue is caused by int overflowing in col2vol. This version using mixed 32-bit and 64-bit indexing calculation lifts the maximum indexing possible without compromising the performance of ConvTranspose3d. vs 20-30% regression with pure 64-bit indexing. This requires that input.numel() <= UINT_MAX, and channels * kernel.numel() <= UINT_MAX otherwise it raises an error. Previously, the code would crash or give incorrect results unless input.numel() * kernel.numel() <= INT_MAX. Note that the test is a minimised reproducer for the issue. Pull Request resolved: #39198 Differential Revision: D21817836 Pulled By: ezyang fbshipit-source-id: b9adfe9f9dd00f04435be132966b33ac6b9efbef * [TensorPipe] Re-enable RPC tests (#39406) Summary: Pull Request resolved: #39406 For now, just the RPC test (no dist autograd or dist optimizer). I removed the skipping decorator from all the tests except those that explicitly use the ProcessGroup options. Includes #39027. ghstack-source-id: 105159974 Test Plan: Ran the tests several hundred times, in various build modes. Saw some flakes, but at a rate of about 0.1% Differential Revision: D21716069 fbshipit-source-id: 9d2a99e112049a63745772c18e7a58266ed8e74e * Added FPGA DispatchKey, DeviceType, Backend (#38938) Summary: ezyang, I have added the changes to DispatchKey, DeviceType, Backend to support the out-of-tree FPGA. cc. tataetae Pull Request resolved: #38938 Differential Revision: D21748955 Pulled By: ezyang fbshipit-source-id: fe76d9730818205961430d2a0e00727b5c547b32 * Leak safety in RReLU (#39347) Summary: Fixes gh-38966 If `THCTensor_(resizeAs)` fails to allocate, then these `free`s will never be reached. So, instead I use a wrapped tensor to do cleanup automatically. Pull Request resolved: #39347 Differential Revision: D21838933 Pulled By: ezyang fbshipit-source-id: 8c74ecdd720d6712a33ddef6126ea545761a269b * [TensorPipe] Re-enable dist autograd tests (#39440) Summary: Pull Request resolved: #39440 After the RPC tests, re-enable the second test suite: dist autograd. ghstack-source-id: 105165393 Test Plan: Ran the tests, several times each, in different build configs. Differential Revision: D21858974 fbshipit-source-id: 409377d564c36fecae51b9e4c776d94187b434a2 * [TensorPipe] Re-enable dist optimizer tests (#39441) Summary: Pull Request resolved: #39441 This is the last test suite to be enabled for TensorPipe. ghstack-source-id: 105166757 Test Plan: Ran the tests, hundreds of times each, in different build modes. Differential Revision: D21858975 fbshipit-source-id: ee0a7e64b77b4b1974f031207031cc14afb3a8c2 * `as_strided` : add size and stride length check (#39301) Summary: Fixes #39281 Pull Request resolved: #39301 Differential Revision: D21849082 Pulled By: gchanan fbshipit-source-id: 5d30ef10767c4d35c6cb59c5e6a9acbfe0270a40 * add_observer: respect device affinity for ReLU (#39337) Summary: Pull Request resolved: #39337 In #39031 we made fake quantize respect device affinity of the original module. However, that PR only handled modules with parameters or buffers, and did not work properly for `ReLU`. Fixing the logic to also work for `ReLU` by passing the parent's device when adding observers. Test Plan: ``` python test/test_quantization.py TestDistributed.test_device_affinity ``` Imported from OSS Differential Revision: D21821243 fbshipit-source-id: cc6abda3694b80ce8ba0440dc6c1b5b58f3c0066 * Fix `torch.backends.cudnn` mypy error (#38947) Summary: Fix #38410 ![image](https://user-images.githubusercontent.com/6421097/82724121-74b26880-9c99-11ea-9b63-e92de2dccdf2.png) Pull Request resolved: #38947 Differential Revision: D21765290 Pulled By: ezyang fbshipit-source-id: 5d2b25f039a653c609d60cdaac4a7ac5812ae291 * .github: Add initial target specifier config (#39378) Summary: Pull Request resolved: #39378 Will initially only contain a label to trigger builds for binary tests Signed-off-by: Eli Uriegas <eliuriegas@fb.com> Test Plan: Imported from OSS Differential Revision: D21864091 Pulled By: seemethere fbshipit-source-id: f69467ccc797b6b320dc8b7f2d50a8601c172a1f * [torch] remove integer conversion resulted in a change of sign warning (#38968) Summary: Pull Request resolved: #38968 As title Reviewed By: glaringlee Differential Revision: D21711684 fbshipit-source-id: c340360b29849fe9ab0e7be376918c92ba3629be * Replace torch.allClose with self.assertEqual (#39424) Summary: Pull Request resolved: #39424 Reviewed By: Krovatkin Differential Revision: D21854870 Pulled By: ailzhang fbshipit-source-id: eb68f1775596e4c963169033444d6d6f4f818d4f * Do not call optimizations within freezing API (#38499) Summary: This patch removes call to run optimizations within freezing API. Only dead code elimination is invoked to clean up the frozen module. Pull Request resolved: #38499 Reviewed By: eellison Differential Revision: D21579607 Pulled By: bzinodev fbshipit-source-id: a6231754fea89296a3dcf07b5e37a1c43cb8d5dd * Update key_padding_mask arg docs in MHA module (#39321) Summary: Pull Request resolved: #39321 Reviewed By: zhangguanheng66 Differential Revision: D21825488 Pulled By: Nayef211 fbshipit-source-id: 41ee09e683c4ae838cfd488a342088d591e806e4 * [TensorExpr] Fix two bugs in Rfactor (#39268) Summary: The two bugs were: * Non-reduction axes were not added when inserting the new ReduceOp, meaning if a reduction with non-reduce axes was rfactored we'd produce bad outputs. There were no tests of Rfactor with non-reduce axis so I modified a test to do this. * The new statements were always prepended to the block, meaning writes to a buffer could be reordered after the usage of that buffer. This mostly happened in the case where we rfactor a previously rfactored reduction. There was a test of this, but since it only tested rfactoring the outer reduction axis there was never any other statements at the insertion point (the tests of the insertion point argument also do this). I added a new test which covers various rfactor-axis cases. Also cleaned up tests, removed some helper code we don't need etc. Pull Request resolved: #39268 Differential Revision: D21864489 Pulled By: nickgg fbshipit-source-id: d314d20997a8472ec96b72f7a9068d6da6d2399c * Revert D21579607: [pytorch][PR] Do not call optimizations within freezing API Test Plan: revert-hammer Differential Revision: D21579607 Original commit changeset: a6231754fea8 fbshipit-source-id: 277011605eedee1c3b44fbaf877233b239adf56b * [TensorExpr] some cleanups / fixes for LoopOptions (#39408) Summary: Mainly, fix a bug in the HashProvider where it would not include LoopOptions in the hash, meaning two loops would be seen as identical even if they were bound to different thread/block axes. Also added symbolic names for the different axis options. Pull Request resolved: #39408 Differential Revision: D21864494 Pulled By: nickgg fbshipit-source-id: 9c28729984e7a3375e026c78294c9f75b9015123 * Adds TestCase.compare_with_numpy (#39179) Summary: Cut from #38994. This is a helper function for comparing torch and NumPy behavior. It updates the existing and increasingly popular _np_compare function and moves it to be a method on TestCase. Pull Request resolved: #39179 Differential Revision: D21855082 Pulled By: mruberry fbshipit-source-id: edca3b78ae392d32243b02bf61960898b6ba590f * Enable some test cases in `test_memory_format_operators` (#38648) Summary: Re-enable some test cases in `test_memory_format_operators` since their corresponding issue has been fixed. Pull Request resolved: #38648 Differential Revision: D21689085 Pulled By: VitalyFedyunin fbshipit-source-id: 0aa09e0bf31ba98c8ad0191ac3afd31dda0f1d42 * Retry/skip test on URLError rather than on HTTPError (#39477) Summary: `HTTPError` are raised when server is overloaded, while `URLError` is raised when network is not available And since `HTTPError` is an extension of `URLError`, `URLError` should catch both exceptions Pull Request resolved: #39477 Differential Revision: D21873560 Pulled By: malfet fbshipit-source-id: 11806671b768705465f562087521ad4887fd20f7 * LayerNorm Fake FP16 Op debug (#39476) Summary: LayerNorm Fake FP16 Op debug. still seeing output mismatches. Pull Request resolved: #39476 Differential Revision: D21871748 Pulled By: hyuen fbshipit-source-id: ab308e3acff9ce21de41b0f006cbee767983f8e4 * Selective build on Training, query based. (#39452) Summary: Pull Request resolved: #39452 Selective build works on training. * VariableType_?.cpp are now selectively generated based on the operator list. * Add a flag in pt_operator_library, "train". If it's True, an extra flag of "pt_train_operator_library" will be added to the labels. A query for "pt_train_operator_library" will be done to aggregate the training operators. With this flag we limit the generated VariableType to used training operators only, to conserve the code size. The models for inference only have train = False by default. * For testing purpose, caffe2/fb/pytorch_trainer is created. It's based on full jit but the operators are selectively built. * smartkeyboard_debug_model is used for test. Since the static code analysis is not applied for VariableType yet, the operators are manually added based on debugging error messages. * At build stage, make selective build optional for training code-gen library. The reason is that to make fb4a built, the generated VariableType.cpp needs to depend on torch_mobile_train. Torch_mobile_train is not needed for apps with inference only. In those cases training can be turned off to remove the dependency on torch_mobile_train to save size. It can also be used as a switch to check size regression introduced by training. ghstack-source-id: 105190037 (Note: this ignores all push blocking failures!) Test Plan: Training: ``` buck run -c pt.build_from_deps_query=1 -c pt.selective_build=0 -c pt.static_dispatch=0 xplat/caffe2/fb/pytorch_trainer:trainer ~/models/papaya/keyboard/smartkeyboard_debug_model.pt ``` Inference, with and without the new query-based feature: ``` buck run -c pt.build_from_deps_query=1 -c pt.selective_build=0 -c pt.static_dispatch=0 xplat/caffe2/fb/lite_predictor:lite_predictor_bi -- --model=/home/myuan/models/pytext/BI/bi_pytext_0512.bc --input_dims "1,4" --input_type int64 --pytext_len=4 ``` ``` buck run xplat/caffe2/fb/lite_predictor:lite_predictor_bi -- --model=/home/myuan/models/pytext/BI/bi_pytext_0512.bc --input_dims "1,4" --input_type int64 --pytext_len=4 ``` Reviewed By: ljk53 Differential Revision: D21459302 fbshipit-source-id: df71a46d74f8c7448cbf51990804104f1384594f * Remove copy_imag and copy_real methods (#39065) Summary: Pull Request resolved: #39065 Test Plan: Imported from OSS Differential Revision: D21803939 Pulled By: anjali411 fbshipit-source-id: c7313c527eb6b54d49ef46aa0a839a3418fa8d7e * Enabling lite interpreter in torch python API (#39181) Summary: Pull Request resolved: #39181 Create a python binding classes torch._C. LiteScriptModule for mobile::module, a python class called LiteScriptModule is created which wrap torch._C. LiteScriptModule. Python class LiteScriptModule contains preliminary functions including forward, run_method and __call__. Create a python api "load_for_lite_interpreter" under torch.jit.mobile where takes pre-saved mobile module in a file-like object as input and returns python class LiteScriptModule. Add a python binding method "_save_to_buffer_for_mobile" under ScriptModule, and python method "_save_to_buffer_for_lite_interpreter" under RecursiveScriptModule which saves mobile module into buffer instead of file. ghstack-source-id: 105215736 Test Plan: buck test caffe2/test:mobile Differential Revision: D21757474 fbshipit-source-id: 758b87497d65c4686459a567d41887c7a577aa4c * [caffe2] compute r_correction only for radam to avoid sqrt(negative) (#39393) Summary: Pull Request resolved: #39393 Computing r_correction should be done only for radam . Otherwise can generate floating-point exceptions. Test Plan: buck test caffe2/caffe2/python/operator_test:adam_test -- test_sparse_adam with --caffe2_operator_throw_if_fp_exceptions=1 gflags option Differential Revision: D21834296 fbshipit-source-id: a9e6a93451423e76a99f6591d21cb65d4374b008 * Device and torch._C function cleanup (#38173) Summary: Pull Request resolved: #38173 - Introduce torch.types.Device representing all "device-like" types - Stubbed torch.device.__reduce__ - Stubbed all torch._C functions comprehensively - Deleted _safe_call which is unused throughout the codebase Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Differential Revision: D21497399 Pulled By: ezyang fbshipit-source-id: 1f534442b0ec9a70d556545d072f2c06a08b9d15 * move to.prim_dtype to lite interpreter (#39456) Summary: Pull Request resolved: #39456 Move aten::to.prim_dtype from full jit to lite interpreter Test Plan: verify TTS model can be used Reviewed By: iseeyuan Differential Revision: D21856104 fbshipit-source-id: 774981a5c04798e3a87cf7d6e6682f35e604944e * [ONNX]Enable tests in test_operators.py (#39431) Summary: Enable Dropout and SoftmaxCrossEntropy tests in test_operators.py Pull Request resolved: #39431 Reviewed By: hl475 Differential Revision: D21877501 Pulled By: houseroad fbshipit-source-id: 1e9b1e5cf80dc1843bdbde2662f3339e357c6654 * fix internal targets for layernorm (#39501) Summary: Pull Request resolved: #39501 fix internal targets, and disable the test until it is fixed Test Plan: built and ran the test, but venkat has to get access to nnpi before fine tuning the last few pieces. Currently getting around 1e-5 relative error Reviewed By: yinghai Differential Revision: D21875657 fbshipit-source-id: 3ae762093084fa65b9aeedaef1b2ca1b1e13b587 * [ONNX] Update pytoch/onnx doc (#39480) Summary: Updated dos for operator_export_types and recently added op symbolics. Pull Request resolved: #39480 Reviewed By: hl475 Differential Revision: D21877364 Pulled By: houseroad fbshipit-source-id: 9831fe5776629da897db6d7943f830528cb916d2 * Implement rad2deg, deg2rad (#38852) Summary: Resolves #38372. cc mruberry Pull Request resolved: #38852 Differential Revision: D21868935 Pulled By: mruberry fbshipit-source-id: ae6ded11b743c9d1cdc032984b4abe0a115290d6 * Add rpc.async_function decorator for TorchScript functions (#39267) Summary: Pull Request resolved: #39267 When combined with `torch.jit.script`, the order of decorators matter. `rpc.functions.async_execution` must be the outmost one. The `async_execution` decorator will store the TorchScript function in attribute `_wrapped_async_rpc_function` on the wrapper function, and pass this wrapped TorchScript function (i.e., an instance of `torch.jit.ScriptFunction`) to RPC. The caller will mark the ScriptCall with `isAsyncExecution=true`, and the callee will extract the returned `Future` in C++ and install subsequent processing as a callback to that `Future`. Test Plan: Imported from OSS Differential Revision: D21792688 fbshipit-source-id: de095eb148d21e9114a478e9e6047c707d34fd07 * Revert D21652452: [pytorch][PR] Fix for num_threads==1 in OpenMP "parallel for" Test Plan: revert-hammer Differential Revision: D21652452 Original commit changeset: 2cda7777c0ea fbshipit-source-id: fdd9a0346ce32a962766f57e13357dd2bc60d8b8 * Optimize GroupNorm on CPU (#28203) Summary: Pull Request resolved: #28203 Optimize GroupNorm on CPU ghstack-source-id: 105149765 Test Plan: buck test mode/dev-nosan caffe2/test:nn -- "GroupNorm" Reviewed By: houseroad Differential Revision: D17901506 fbshipit-source-id: 5eb22ad0e8a9ab2533282b967b2818f690e48865 * [pytorch] move tracing logic to a separate dispatch backend (#38467) Summary: This PR moves tracing logic out of the generated VariableType kernels, to associate it with a new dedicated dispatch key Tracer. It also toggles the dispatch key set at various places to keep the semantics unchanged - see the inline [Tracing Mode Switches] note. Sample generated code: ``` Tensor & __ilshift___Tensor(Tensor & self, const Tensor & other) { #if !defined(PYTORCH_DISABLE_TRACING) torch::jit::Node* node = nullptr; std::shared_ptr<jit::tracer::TracingState> tracer_state; if (jit::tracer::isTracing()) { tracer_state = jit::tracer::getTracingState(); at::Symbol op_name; op_name = jit::Symbol::fromQualString("aten::__ilshift__"); node = tracer_state->graph->create(op_name, /*num_outputs=*/0); jit::tracer::recordSourceLocation(node); jit::tracer::addInputs(node, "self", self); jit::tracer::addInputs(node, "other", other); tracer_state->graph->insertNode(node); jit::tracer::setTracingState(nullptr); } #endif static auto op = c10::Dispatcher::singleton().findSchemaOrThrow("aten::__ilshift__", "Tensor"); c10::Dispatcher::singleton().redispatch<Tensor &, Tensor &, const Tensor &>(op, c10::DispatchKey::Tracer, self, other); #if !defined(PYTORCH_DISABLE_TRACING) if (tracer_state) { jit::tracer::setTracingState(std::move(tracer_state)); jit::tracer::addOutput(node, self); } #endif return self; } ``` Pull Request resolved: #38467 ghstack-source-id: 105215150 Test Plan: CI Differential Revision: D21570684 fbshipit-source-id: 1a96761830307f9a934f38bfb9fe8b5b1763e0e0 * Implement timeout support for RRefs (#38590) Summary: Pull Request resolved: #38590 This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. ghstack-source-id: 105232837 Test Plan: CI Differential Revision: D21588165 fbshipit-source-id: c9f9e8aa3521012ea1de3e0f152a41afdf8b23f3 * Fix lint (#39517) Summary: Fixes lint. Pull Request resolved: #39517 Reviewed By: lw Differential Revision: D21881495 Pulled By: mruberry fbshipit-source-id: 43b06466d9311d16b0d78d58ed124c1f01807443 * Clean up thrust::complex from rsqrt (#39294) Summary: Pull Request resolved: #39294 Differential Revision: D21818288 Pulled By: anjali411 fbshipit-source-id: ee7758872700a93713ab66565e2a7a9e8a088a94 * Add SymbolicShape and replace all uses of VaryingShape<ShapeSymbol> with it (#38544) Summary: Adding a SymbolicShape class to represent a generic tensor shape with ShapeSymbols. Its core data structure is c10::optional<std::vector<ShapeSymbol>>. If has_value() == false, it represents an unranked tensor shape. At any dimension ShapeSymbol can contain dynamic size, checkable with ShapeSymbol::IsStatic method. SymbolicShape now replaces all uses of VaryingShape<ShapeSymbol>, ie c10::optional<std::vector<c10::optional<ShapeSymbol>>>. The inner c10::optional wrapper around ShapeSymbol used to indicate dynamic shape, which overlaps with part of ShapeSymbol's representation. Pull Request resolved: #38544 Reviewed By: ZolotukhinM Differential Revision: D21693984 Pulled By: gmagogsfm fbshipit-source-id: 6e633e4f36cf570d6fb34ac15d00ec1fb2054a09 * fix build table for ppc64le (#39475) Summary: This corrects the build info for ppc64le in the main README. I am opening this PR before renaming the build job. (So, the "live" master README has the correct "live" link and the PR does not.) Immediately after submitting the PR, I will correct the name of the build job. This will make the new PR link correct, and the current "master" link will briefly appear broken until this PR gets merged. Pull Request resolved: #39475 Differential Revision: D21883184 Pulled By: malfet fbshipit-source-id: 148353b632448c98e5aff560d31642328afe7963 * Remove cuda init patch (#39222) Summary: The below lines have been removed from `torch/cuda/__init__.py` anyway: ``` _cudart = _load_cudart() _cudart.cudaGetErrorName.restype = ctypes.c_char_p _cudart.cudaGetErrorString.restype = ctypes.c_char_p ``` Pull Request resolved: #39222 Differential Revision: D21864397 Pulled By: yns88 fbshipit-source-id: 941b13f92192f930e1dfa4b385e1aec2e321e75f * Fix lint (#39527) Summary: Pull Request resolved: #39527 Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Differential Revision: D21884798 Pulled By: ezyang fbshipit-source-id: a130bfd4cc122ea1d45e7db7303bf44e04f08703 * Restore docs coverage test via sphinx (#39331) Summary: Pull Request resolved: #39331 Fixes gh-37590 Adds an extra `make coverage` to document building, which uses the built-in facility in sphinx to check docstring coverage. Also fixes a failure to import `torch/jit/supported_ops.py` which broke the [Torchscript Builtins](https://pytorch.org/docs/stable/jit_builtin_functions.html) page. This also adds the required `SPHINXOPTS` to turn warnings into error, but this is commented out. Note that since documentation of `torchvision` is merged in here, failures there would cause failures here if this is made active. Some thought might be needed about pinning the torchvision version merged into documentation. The first commit should fail, since the "ScriptModule" class is commented out. I did that in order to check that a CI failure is properly reported. Pull Request resolved: #38244 Differential Revision: D21640589 Pulled By: ezyang fbshipit-source-id: 1e240d81669b5f21404d596de4a27d192dc9fd8a * Add arcosh, arcsinh and arctanh to unary ops (#38388) Summary: This PR aims to add `arcosh`, `arcsinh` and `arctanh` support. Please see issue #38349 for more details. **TODOs:** * [x] Add test cases for `arcosh`, `arcsinh` and `arctanh`. (need help) * [x] Overload ops if `std::op` does not work with `thrust::complex` types (like for `sinh`, `cosh`). Note: `std::acosh, std::asinh, std::atanh` do not support `thrust::complex` types. Added support for complex types for these 3 ops (`arccosh, arcsinh, arctanh`) cc: mruberry Pull Request resolved: #38388 Differential Revision: D21882055 Pulled By: mruberry fbshipit-source-id: d334590b47c5a89e491a002c3e41e6ffa89000e3 * Move some of the definitions in LegacyNNDefinitions.cpp closer to sites (#37531) Summary: Pull Request resolved: #37531 All of these definitions are no longer "legacy" as their CPU implementations have been ported to ATen. There are probably some layers of indirection that could be reduced here, but for now just do a minor but unlikely to break things cleanup. The last thing in LegacyNNDefinitions truly is still in THCUNN and can't be removed. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Differential Revision: D21310913 Pulled By: ezyang fbshipit-source-id: 1ff4ff16abddf13f8d583df990242ac4b0461915 * Selective enabling of xnnpack based max_pool2d in ceil_mode. (#39447) Summary: max_pool2d with ceil_mode calculates output size a little differently than what we get with xnnpack max_pool2d. Thus when ceil_mode=True, we disable this path. However if we get the same output size with ceil_mode and without ceil_mode, we should use xnnpack based max_pool2d. Pull Request resolved: #39447 Test Plan: CI Differential Revision: D21873334 Pulled By: kimishpatel fbshipit-source-id: b84abed1505e36e492cc87e7d40664ac63964909 * .circleci: Move binary builds into their own workflow (#39379) Summary: Pull Request resolved: #39379 Moves binary builds into their own workflow and adds the ability to target specification on them. This allows you to run the binary build workflow on a pull request without the need to modify any configuration at all. Some notes about this implementation: * Upload jobs are still restricted to only the nightly branches and RC tags * Parameters for circleci are currently defined in .circleci/verbatim-sources/header-section.yml * Target specification configuration is currently located at .github/pytorch-circleci-labels.yml Signed-off-by: Eli Uriegas <eliuriegas@fb.com> Test Plan: Imported from OSS Differential Revision: D21886341 Pulled By: seemethere fbshipit-source-id: 146ef5df2fea208d33e97862d52c170bf001bc98 * [JIT] fix broadcasting lists of ints (#39481) Summary: Previously, on conversion from python -> c++ it was casted to double list through bad copy pasta. It's pretty unusual for someone to script a broadcasting list function directly since it's an internal api, so it was unlikely to affect anyone. Fix for #39450 Pull Request resolved: #39481 Reviewed By: jamesr66a Differential Revision: D21870557 Pulled By: eellison fbshipit-source-id: e704e5e87d2702a270b7d65c4df444246a134480 * Fix overflow issues when unpacking large numbers (#39140) Summary: Resolve #33111 relax the overflow and precision lost checks when unpacking doubles. Signed-off-by: Xiong Wei <xiongw.fnst@cn.fujitsu.com> Pull Request resolved: #39140 Differential Revision: D21885217 Pulled By: ezyang fbshipit-source-id: e2bbe90d719443ea2e1c6b7b2c637f9a943fa5c0 * Upgrade lint. (#39483) Summary: Pull Request resolved: #39483 I fixed all of the new errors that occurred because of the upgrade. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Differential Revision: D21884575 Pulled By: ezyang fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685 * [ONNX] Fix type casting for reduce ops (#38829) Summary: Fix type casting for reduce ops in ONNX exporter. PyTorch promotes dtypes bool and all integer types to long for these ops. This fix only covers traced modules where dtype is present Pull Request resolved: #38829 Reviewed By: hl475 Differential Revision: D21833533 Pulled By: houseroad fbshipit-source-id: 00d9ff692cc0b09d6ca169f6c63913f04b56f182 * deepcopy() of Objects should call __g/setstate__ (#39500) Summary: Pull Request resolved: #39500 Test Plan: Imported from OSS Reviewed By: suo Differential Revision: D21875091 Pulled By: jamesr66a fbshipit-source-id: 105875dd220a91bc4fcb8fcfb77fab8b626eb6cb * Add/fix typing annotations to some functions (#39075) Summary: Add missing typing imports to some jit tests Add typing annotations to `torch.testing._compare_scalars_internal` and `torch.testing._internal.assertTrue` Pull Request resolved: #39075 Differential Revision: D21882468 Pulled By: malfet fbshipit-source-id: dd9858eb8e11a38411544cc64daf36fced807d76 * Reduce time spent per guard by comparing TensorType with Tensor (#39098) Summary: It mainly reduces the time spent on allocating new TensorType object for Tensor, but comparing them directly. benchmark result before and after this PR: https://gist.github.com/ailzhang/db44d0a1911cae62e0bb794bff33f40a Pull Request resolved: #39098 Differential Revision: D21786678 Pulled By: ailzhang fbshipit-source-id: 2f61f0ac1dc8c529c45bef4e149be431ff1608b0 * Do not raise decorator (#39532) Summary: s/raise unittest.skip/raise unittest.SkipTest/ As `unittest.skip` is a decorator while `unittest.SkipTest` is an exception Pull Request resolved: #39532 Differential Revision: D21889152 Pulled By: malfet fbshipit-source-id: 27a03dbf065a1e2712a63c6c27e156bd13edbbdf * misc updates to fake fp16 tests (#39405) Summary: Misc updates to the fake FP16 tests. 1. seeding numpy with a random seed 2. test base class changed from unittest.TestCase=>serial.SerializedTestCase 3. Removed the hypothesis_test_util import Reviewer: Hector Yuen Pull Request resolved: #39405 Test Plan: Fake FP16 test Differential Revision: D21890212 Pulled By: hyuen fbshipit-source-id: 25e7e17f118655f32cdd06ea9db3cdac5277e649 * Remove useless copy on zip file load (#36362) Summary: Instead of copying to a buffer, then setting a tensor's storage with that buffer, create a storage directly from the file Pull Request resolved: #36362 Pulled By: driazati Differential Revision: D21889537 fbshipit-source-id: edbd430073c2bbf52332fe7b3b2590e7d936dedf * [quant][graphmode] Test for another type of ops in insert_observer for if (#39380) Summary: Pull Request resolved: #39380 Test for inserting observers for if statement for ops that propagate quantization parameters Test Plan: Imported from OSS Differential Revision: D21832477 fbshipit-source-id: 6e0b4ce4a89f847af161bb22338525802adb8b41 * try always inline * fix Co-authored-by: Lu Fang <lufang@fb.com> Co-authored-by: Shen Li <shenli@fb.com> Co-authored-by: Peter Bell <peterbell10@live.co.uk> Co-authored-by: Luca Wehrstedt <lcw@fb.com> Co-authored-by: Dylan Bespalko <dylanbespalko@me.com> Co-authored-by: kshitij12345 <kshitijkalambarkar@gmail.com> Co-authored-by: Vasiliy Kuznetsov <vasiliy@fb.com> Co-authored-by: Shawn Zhong <github@shawnzhong.com> Co-authored-by: Eli Uriegas <eliuriegas@fb.com> Co-authored-by: Jongsoo Park <jongsoo@fb.com> Co-authored-by: JackCaoG <jackcao@google.com> Co-authored-by: Zino Benaissa <zinob@fb.com> Co-authored-by: Christian Puhrsch <cpuhrsch@fb.com> Co-authored-by: Nick Gibson <nickg@fb.com> Co-authored-by: Sebastian Messmer <messmer@fb.com> Co-authored-by: Mike Ruberry <mruberry@devfair044.maas> Co-authored-by: Nikita Shulga <nikita.shulga@oculus.com> Co-authored-by: Venkata Chintapalli <venkatc77@gmail.com> Co-authored-by: Martin Yuan <myuan@fb.com> Co-authored-by: anjali411 <chourdiaanjali123@gmail.com> Co-authored-by: Xingying Cheng <xcheng16@fb.com> Co-authored-by: Edward Yang <ezyang@fb.com> Co-authored-by: Linbin Yu <linbin@fb.com> Co-authored-by: Ksenija Stanojevic <ksenija.stanojevic@gmail.com> Co-authored-by: Hector Yuen <hyz@fb.com> Co-authored-by: neginraoof <neginmr@utexas.edu> Co-authored-by: Aayush Naik <aayushnaik17@gmail.com> Co-authored-by: Oguz Ulgen <oulgen@fb.com> Co-authored-by: Xiaomeng Yang <yangxm@fb.com> Co-authored-by: Jiakai Liu <liujiakai@fb.com> Co-authored-by: Rohan Varma <rvarm1@fb.com> Co-authored-by: Yanan Cao <gmagogsfm@gmail.com> Co-authored-by: David Clissold <cliss@us.ibm.com> Co-authored-by: Jithun Nair <jithun.nair@amd.com> Co-authored-by: mattip <matti.picus@gmail.com> Co-authored-by: krshrimali <kushashwaravishrimali@gmail.com> Co-authored-by: Kimish Patel <kimishpatel@fb.com> Co-authored-by: Elias Ellison <eellison@fb.com> Co-authored-by: Xiong Wei <xiongw.fnst@cn.fujitsu.com> Co-authored-by: James Reed <jamesreed@fb.com> Co-authored-by: Nikita Shulga <nshulga@fb.com> Co-authored-by: Ailing Zhang <ailzhang@fb.com> Co-authored-by: davidriazati <davidriazati@fb.com> Co-authored-by: Jerry Zhang <jerryzh@fb.com>

This was added as a part of #38590 but we can use default arguments here. We use fmt:;format to bind the default value to the rpc timeout at runtime. Differential Revision: [D21912719](https://our.internmc.facebook.com/intern/diff/D21912719/) [ghstack-poisoned]

This was added as a part of #38590 but we can use default arguments here. We use fmt:;format to bind the default value to the rpc timeout at runtime. Differential Revision: [D21912719](https://our.internmc.facebook.com/intern/diff/D21912719/) ghstack-source-id: 105370320 Pull Request resolved: #39602