# GRPC Fail Fast By Default

| Status | Accepted |
| :------------ | :------------------------------------------------------ |
| **RFC #** | [355](https://github.com/tensorflow/community/pull/355) |
| **Author(s)** | Haoyu Zhang (haoyuzhang@google.com) |
| **Sponsor** | Bramandia Ramadhana (bramandia@google.com) |
| **Updated** | 2021-03-04 |

## Objective

We propose to set the default value of the `GRPC_FAIL_FAST` environment variable
to `use_caller`. This change prevents TensorFlow distributed jobs from hanging
indefinitely due to task failures, and allows users and TF libraries (e.g.,
distribution strategies) to handle connection errors for better failure and
preemption recovery.

## Background

`GRPC_FAIL_FAST` is a TensorFlow distributed runtime environment variable that
controls how RPC requests behave when a network disconnection from a remote
server is observed. It can be set to one of the following values (a short
configuration sketch follows the list):

*   `true`, which makes every RPC immediately report an `UnavailableError`
    when there is a connection issue, regardless of the per-RPC configuration;
*   `false`, which makes every RPC block and wait until the remote server is
    successfully connected (see
    [gRPC `wait_for_ready`](https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md)),
    regardless of the per-RPC configuration;
*   `use_caller`, which allows per-RPC customization; in the current
    implementation, `true` is used for RPCs issued during distributed execution
    (such as `RecvTensor` and `RunComponentFunction`), and `false` is used for
    RPCs that initialize the remote execution environment (e.g., `GetStatus`).
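
As an illustration, below is a minimal sketch of configuring this variable from
Python. It assumes the variable is set before the TensorFlow runtime creates
any gRPC channels (i.e., before the cluster is started or connected to); the
exact initialization point may vary by setup.

```python
import os

# Sketch only: set the desired fail-fast behavior before TensorFlow creates
# its gRPC channels. Valid values are "true", "false", and "use_caller".
os.environ["GRPC_FAIL_FAST"] = "use_caller"

import tensorflow as tf  # imported after the environment variable is set
```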

The default value of `GRPC_FAIL_FAST` is currently `false`. One consequence is
that users and/or high-level distribution libraries (such as
`ParameterServerStrategy`) need to
[manually configure this environment variable](https://github.com/tensorflow/tensorflow/blob/1178262a2a55fa634a2390291fc633c515e28884/tensorflow/python/distribute/parameter_server_strategy_v2.py#L106)
to receive reasonable exceptions when workers fail or get preempted; otherwise
the cluster hangs and cannot recover from failures.
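
For reference, the library-side workaround linked above is roughly of the
following form (a sketch; the exact code in `parameter_server_strategy_v2.py`
may differ):

```python
import os

# Enable the fail-fast behavior needed for fault tolerance, without clobbering
# a value the user may have already configured.
os.environ.setdefault("GRPC_FAIL_FAST", "use_caller")
```

With the proposed default, such library-side configuration would no longer be
necessary.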

## Proposed Change

We propose to set the default value of `GRPC_FAIL_FAST` to `use_caller`. With
this default, the runtime quickly reports remote server failures during
execution, while still allowing the client to start early and wait for remote
servers when establishing the initial connections. This should be the desired
behavior for most use cases.

In the context of TensorFlow 2, the default behavior of the following RPCs used
for distributed execution will be changed from hanging on failures (current
behavior) to immediately reporting failures (after the change):

* `EagerService.CreateContext`
* `EagerService.UpdateContext`
* `EagerService.WaitQueueDone`
* `EagerService.KeepAlive`
* `EagerService.Enqueue`
* `EagerService.RunComponentFunction`
* `WorkerService.RecvTensor`
* `WorkerService.RecvBuf`

The default behavior of the following RPC will not change: it will still hang if
the remote task cannot be reached.

* `WorkerService.GetStatus`

The `GetStatus` RPC is typically the first RPC sent from the client to
initialize a distributed execution environment, in both single- and
multi-client modes. The underlying implementation uses gRPC's
[`wait_for_ready`](https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md)
flag, which allows the client to start before the remote servers in the
deployment.
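
To illustrate, below is a sketch of a typical client start-up; the cluster
addresses and job layout are placeholders. Because `GetStatus` keeps the
`wait_for_ready` behavior, the connection call blocks until the remote tasks
come up, so the client can be launched before the workers.

```python
import tensorflow as tf

# Placeholder addresses for a hypothetical two-worker deployment.
cluster = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Initializing the remote execution environment sends GetStatus to each task;
# since GetStatus still uses wait_for_ready, this call waits for the workers
# to become reachable instead of failing immediately.
tf.config.experimental_connect_to_cluster(cluster)
```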

## User Impact

Once this change lands in the codebase, subsequent TensorFlow 2 releases will
have the new default behavior, and users who build TensorFlow directly from
source at HEAD will also be affected. TensorFlow 1.x users on stable releases
(e.g., TensorFlow 1.15 or earlier) should not be affected by this change.

Most users should find the new default to be the expected behavior in
distributed execution. Users can take advantage of the built-in fault tolerance
support in `ParameterServerStrategy` without having to change the environment
variable configuration. In other setups, exceptions will be raised in the model
training loop code, where users can catch and handle these errors with custom
logic instead of hanging indefinitely.
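
As a sketch of such custom handling (not a prescribed API), a training loop
could catch the error and run user-defined recovery logic; `train_step`,
`iterator`, `NUM_STEPS`, and `wait_for_cluster_recovery` below are hypothetical
user-defined pieces:

```python
import tensorflow as tf

for step in range(NUM_STEPS):
    try:
        loss = train_step(next(iterator))
    except tf.errors.UnavailableError as e:
        # With the new default, a failed or preempted task surfaces here as an
        # error instead of hanging the program indefinitely.
        wait_for_cluster_recovery(e)  # hypothetical custom recovery logic
```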

Some users might see "false alarms" if there are transient connection errors to
the remote servers. We expect this to happen very rarely, since gRPC (built on
top of HTTP/2 and TCP) should already handle packet drops and network flakiness
in most cases, and should only report errors when there are real network or
server failures. However, if this does happen, please set `GRPC_FAIL_FAST=false`
to override the default value and revert to the previous behavior, and please
also file an issue to inform the TensorFlow Runtime team.