From 1f661b17e4e8f35d73f51bbb7cc1fd61069b6fa6 Mon Sep 17 00:00:00 2001 From: Haoyu Zhang Date: Thu, 18 Feb 2021 17:04:45 -0800 Subject: [PATCH 1/5] Add draft RFC for setting GRPC_FAIL_FAST to use_caller by default --- rfcs/20210218-grpc-fail-fast-use-caller.md | 90 ++++++++++++++++++++++ 1 file changed, 90 insertions(+) create mode 100644 rfcs/20210218-grpc-fail-fast-use-caller.md diff --git a/rfcs/20210218-grpc-fail-fast-use-caller.md b/rfcs/20210218-grpc-fail-fast-use-caller.md new file mode 100644 index 000000000..1c5467824 --- /dev/null +++ b/rfcs/20210218-grpc-fail-fast-use-caller.md @@ -0,0 +1,90 @@ +# GRPC Fail Fast By Default + +| Status | Proposed | +| :------------ | :------------------------------------------------------ | +| **RFC #** | [NNN](https://github.com/tensorflow/community/pull/NNN) | +| **Author(s)** | Haoyu Zhang (haoyuzhang@google.com) | +| **Sponsor** | Bramandia Ramadhana (bramandia@google.com) | +| **Updated** | 2021-02-18 | + +## Objective + +We propose to set the default value of the `GRPC_FAIL_FAST` environment variable +to `use_caller`. This change prevents TensorFlow distributed jobs from hanging +indefinitely due to task failures, and allows users and TF libraries (e.g., +distribution strategies) to handle the connection errors for better failure and +preemption recovery. + +## Background + +`GRPC_FAIL_FAST` is a TensorFlow distributed runtime environment variable that +controls the behavior of RPC requests when observing a network disconnection +with remote servers. It can be configured to the following values: + +* `true`, which immediately reports an `UnavailableError` when there is a + connection issue for all RPCs, regardless of the per-RPC configurations; +* `false`, which will (in most cases) hang until successfully connected to the + remote server for all RPCs (see + [gRPC `wait_for_ready`](https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md)), + regardless of the per-RPC configurations; +* `use_caller`, which is `true` for RPCs used in distributed execution (such + as `RecvTensor`, `RunComponentFunction`), and `false` for RPCs in + initializing remote execution environments (e.g., `GetStatus`). + +The default value of `GRPC_FAIL_FAST` is set to `false` currently. As a result, +users need to +[manually configure this environment variable](https://github.com/tensorflow/tensorflow/blob/1178262a2a55fa634a2390291fc633c515e28884/tensorflow/python/distribute/parameter_server_strategy_v2.py#L106) +to receive reasonable exceptions when workers fail / get preempted; otherwise +the cluster will hang and cannot recover from failures. + +## Proposed Change + +We propose to set the default value of `GRPC_FAIL_FAST` to `use_caller`. By +doing so, the runtime reports errors quickly to detect remote server failures +during execution, while still allowing the client to start early and wait for +remote servers to establish initial connections. This should be the desired +behavior for most use cases. + +In the context of TensorFlow 2, the default behavior of the following RPCs used +for distributed execution will be changed from hanging on failures (current +behavior) to immediately reporting failures (after the change): + +* `EagerService.CreateContext` +* `EagerService.UpdateContext` +* `EagerService.WaitQueueDone` +* `EagerService.KeepAlive` +* `EagerService.Enqueue` +* `EagerService.RunComponentFunction` +* `WorkerService.RecvTensor` +* `WorkerService.RecvBuf` + +The default behavior of the following RPC will not change: it will still hang if +the remote task cannot be reached. + +* `WorkerService.GetStatus` + +The `GetStatus` RPC is typically the first RPC sent from the client to +initialize a distributed execution environment, in both the single- and the +multi-client mode. The underlying implementation uses GRPC's +[`wait_for_ready`](https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md) +flag, which allows the client to start before the remote server in the +deployment. + +## User Impact + +Most users should see the new default as expected behaviors in distributed +execution. Users can take advantage of the built-in fault tolerance support in +`ParameterServerStrategy` without having to make changes to the environment +variable configurations. In other setups, exceptions will be raised to the model +training loop code, where users can catch and handle these errors with custom +logic. + +Certain users might receive "false alarms" if there are transient connection +errors to the remote servers. We expect this to happen very rarely since GRPC +(built on top of HTTP and TCP protocols) should already handle packet drops and +network flakinesses in most cases, and only report errors when there are real +network or server failures. However, if this does happen to some users, please +set `GRPC_FAIL_FAST=false` to override the default value and revert to the +previous behavior. Please also file an issue to inform the TensorFlow Runtime +team. + From 14c00e2ebf1c9d6067b862ba9b5551a19b725038 Mon Sep 17 00:00:00 2001 From: Haoyu Zhang Date: Thu, 18 Feb 2021 17:08:10 -0800 Subject: [PATCH 2/5] Update RFC number --- rfcs/20210218-grpc-fail-fast-use-caller.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/20210218-grpc-fail-fast-use-caller.md b/rfcs/20210218-grpc-fail-fast-use-caller.md index 1c5467824..206759e4a 100644 --- a/rfcs/20210218-grpc-fail-fast-use-caller.md +++ b/rfcs/20210218-grpc-fail-fast-use-caller.md @@ -2,7 +2,7 @@ | Status | Proposed | | :------------ | :------------------------------------------------------ | -| **RFC #** | [NNN](https://github.com/tensorflow/community/pull/NNN) | +| **RFC #** | [355](https://github.com/tensorflow/community/pull/355) | | **Author(s)** | Haoyu Zhang (haoyuzhang@google.com) | | **Sponsor** | Bramandia Ramadhana (bramandia@google.com) | | **Updated** | 2021-02-18 | From 9aa833016b012e8d2e14241732ff9d0df773e6f9 Mon Sep 17 00:00:00 2001 From: Haoyu Zhang Date: Thu, 4 Mar 2021 11:24:48 -0800 Subject: [PATCH 3/5] Added more clarification on proposed changes and user impact --- rfcs/20210218-grpc-fail-fast-use-caller.md | 23 ++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/rfcs/20210218-grpc-fail-fast-use-caller.md b/rfcs/20210218-grpc-fail-fast-use-caller.md index 206759e4a..9b087b857 100644 --- a/rfcs/20210218-grpc-fail-fast-use-caller.md +++ b/rfcs/20210218-grpc-fail-fast-use-caller.md @@ -5,7 +5,7 @@ | **RFC #** | [355](https://github.com/tensorflow/community/pull/355) | | **Author(s)** | Haoyu Zhang (haoyuzhang@google.com) | | **Sponsor** | Bramandia Ramadhana (bramandia@google.com) | -| **Updated** | 2021-02-18 | +| **Updated** | 2021-03-04 | ## Objective @@ -23,16 +23,18 @@ with remote servers. It can be configured to the following values: * `true`, which immediately reports an `UnavailableError` when there is a connection issue for all RPCs, regardless of the per-RPC configurations; -* `false`, which will (in most cases) hang until successfully connected to the - remote server for all RPCs (see +* `false`, which blocks and waits until successfully connected to the remote + server (see [gRPC `wait_for_ready`](https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md)), regardless of the per-RPC configurations; -* `use_caller`, which is `true` for RPCs used in distributed execution (such - as `RecvTensor`, `RunComponentFunction`), and `false` for RPCs in +* `use_caller`, which allows customization per RPC basis; in the current + implementation, `true` is used for RPCs used in distributed execution (such + as `RecvTensor`, `RunComponentFunction`), and `false` is used for RPCs in initializing remote execution environments (e.g., `GetStatus`). -The default value of `GRPC_FAIL_FAST` is set to `false` currently. As a result, -users need to +The default value of `GRPC_FAIL_FAST` is currently set to `false`. One of the +consequences is that users and/or high-level distribute libraries (such as +`ParameterServerStrategy`) need to [manually configure this environment variable](https://github.com/tensorflow/tensorflow/blob/1178262a2a55fa634a2390291fc633c515e28884/tensorflow/python/distribute/parameter_server_strategy_v2.py#L106) to receive reasonable exceptions when workers fail / get preempted; otherwise the cluster will hang and cannot recover from failures. @@ -72,12 +74,17 @@ deployment. ## User Impact +When this change is made to the codebase, subsequent TensorFlow 2 releases will +have this new default behavior. TensorFlow 1.x users who use the stable releases +(e.g., TensorFlow 1.15 or earlier) should not be affected by this change. Users +who build TensorFlow directly from source at the head will also be affected. + Most users should see the new default as expected behaviors in distributed execution. Users can take advantage of the built-in fault tolerance support in `ParameterServerStrategy` without having to make changes to the environment variable configurations. In other setups, exceptions will be raised to the model training loop code, where users can catch and handle these errors with custom -logic. +logic instead of hanging indefinitely. Certain users might receive "false alarms" if there are transient connection errors to the remote servers. We expect this to happen very rarely since GRPC From d1113a47d8d07575b8829888bb07b1c92b05e66c Mon Sep 17 00:00:00 2001 From: Haoyu Zhang Date: Fri, 19 Mar 2021 12:19:26 -0700 Subject: [PATCH 4/5] Change status of RFC to Approved. The code change is submitted in https://github.com/tensorflow/tensorflow/commit/c6708edd1353ef4fee05d24d1da425a465225e7a --- rfcs/20210218-grpc-fail-fast-use-caller.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/20210218-grpc-fail-fast-use-caller.md b/rfcs/20210218-grpc-fail-fast-use-caller.md index 9b087b857..8fce0b809 100644 --- a/rfcs/20210218-grpc-fail-fast-use-caller.md +++ b/rfcs/20210218-grpc-fail-fast-use-caller.md @@ -1,6 +1,6 @@ # GRPC Fail Fast By Default -| Status | Proposed | +| Status | Approved | | :------------ | :------------------------------------------------------ | | **RFC #** | [355](https://github.com/tensorflow/community/pull/355) | | **Author(s)** | Haoyu Zhang (haoyuzhang@google.com) | From 4ffddb416731fd9eaa24ec2d3105909d5c6385ec Mon Sep 17 00:00:00 2001 From: ematejska Date: Mon, 22 Mar 2021 10:59:34 -0700 Subject: [PATCH 5/5] Update 20210218-grpc-fail-fast-use-caller.md --- rfcs/20210218-grpc-fail-fast-use-caller.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/20210218-grpc-fail-fast-use-caller.md b/rfcs/20210218-grpc-fail-fast-use-caller.md index 8fce0b809..635e31e71 100644 --- a/rfcs/20210218-grpc-fail-fast-use-caller.md +++ b/rfcs/20210218-grpc-fail-fast-use-caller.md @@ -1,6 +1,6 @@ # GRPC Fail Fast By Default -| Status | Approved | +| Status | Accepted | | :------------ | :------------------------------------------------------ | | **RFC #** | [355](https://github.com/tensorflow/community/pull/355) | | **Author(s)** | Haoyu Zhang (haoyuzhang@google.com) |