data: check liveness before blessing data server #4851

wchargin · 2021-04-09T04:02:19Z

Summary:
Issue #4844 shows that there are circumstances when communication
between TensorBoard and the local data server process cannot be
established. As of this patch, we send a trivial RPC to the data server
and wait for its response before committing to use the new loading
paths. This adds about 1–5ms total time to the happy path on my machine,
with a worst case penalty of 5s due to the timeout. If the server is not
reachable, we print a warning and fall back to the legacy paths.
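
The probe-then-fall-back flow described above can be sketched as follows. This is a minimal illustration, not the code from this patch: `select_loading_path`, `LIVENESS_TIMEOUT_SECONDS`, and the `ping` callable are hypothetical names, and the trivial RPC is modeled as any callable that raises on failure. In a gRPC setting, `ping` would presumably wrap some cheap stub call with its `timeout=` argument set.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)

# Assumption: mirrors the 5s worst-case penalty mentioned in the summary.
LIVENESS_TIMEOUT_SECONDS = 5.0


def select_loading_path(
    ping: Callable[[float], None],
    timeout: float = LIVENESS_TIMEOUT_SECONDS,
) -> str:
    """Return "data_server" if the liveness probe succeeds, else "legacy".

    `ping` is a hypothetical stand-in for the trivial RPC: it receives the
    timeout in seconds and must raise if the server is unreachable or slow.
    """
    try:
        ping(timeout)
    except Exception as e:
        # Warn and fall back rather than failing startup.
        logger.warning(
            "Could not reach data server; falling back to legacy paths: %s", e
        )
        return "legacy"
    return "data_server"


def healthy_ping(timeout: float) -> None:
    """Simulated probe against a responsive server: returns immediately."""


def unreachable_ping(timeout: float) -> None:
    """Simulated probe against a server that cannot be reached."""
    raise ConnectionError("failed to connect")
```

With these stand-ins, `select_loading_path(healthy_ping)` selects the new loading path, while `select_loading_path(unreachable_ping)` logs a warning and selects the legacy path. Catching a broad exception here is a simplification; real code would more likely catch only the RPC library's error type.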

Test Plan:

- Test with normal working TensorBoard.
- Simulate failed data server connection by changing the definition of
  `addr` to `"localhost:%d" % (port + 777)` on line 175. This should
  print the `UNAVAILABLE`/“failed to connect” message from #4844.
- Simulate slow data server by adding the following to `cli.rs`:
  ```rust
  tokio::time::sleep(Duration::from_secs(3)).await;
  ```
  Add this right before the `Server::builder().(...).await?` call at
  the end of `main`: i.e., after we write the port file, but before we
  actually respond to requests. Note that TensorBoard still works with
  a 3-second delay, but that it actually delays printing the startup
  message for those 3 seconds as it determines which data provider to
  use.
- Simulate extra-slow data provider as above but waiting 6 seconds,
  and note that TensorBoard prints a `DEADLINE_EXCEEDED` error, falls
  back to the legacy paths, and shows valid data.

wchargin-branch: data-liveness-check

wchargin-source: 7fd916ebcf426d92babaeb57c38789d47446df6e

wchargin · 2021-04-09T04:02:42Z

If all looks good, I’d like to cherry-pick this into 2.5.0.

stephanwlee · 2021-04-09T16:46:21Z

Unusual merge: for sake of expediency for 2.5.0 release, I am merging it on behalf of wchargin.


wchargin added the `theme:usability` (Areas to reduce confusion and frustration) and `core:rustboard` (//tensorboard/data/server/...) labels Apr 9, 2021

wchargin requested a review from stephanwlee April 9, 2021 04:02

google-cla bot added the cla: yes label Apr 9, 2021

stephanwlee approved these changes Apr 9, 2021

stephanwlee merged commit 876134e into master Apr 9, 2021

stephanwlee deleted the wchargin-data-liveness-check branch April 9, 2021 16:46

stephanwlee mentioned this pull request Apr 9, 2021

TensorBoard 2.5.0 #4854

Closed

wchargin mentioned this pull request Apr 9, 2021

[2.5] Failed to pick subchannel #4844

Closed

stephanwlee mentioned this pull request Apr 9, 2021

TensorBoard 2.5 #4855

Merged
