Skip to content

Conversation

@wchargin
Copy link
Contributor

@wchargin wchargin commented Apr 9, 2021

Summary:
Issue #4844 shows that there are circumstances when communication
between TensorBoard and the local data server process cannot be
established. As of this patch, we send a trivial RPC to the data server
and wait for its response before committing to use the new loading
paths. This adds about 1–5ms total time to the happy path on my machine,
with a worst case penalty of 5s due to the timeout. If the server is not
reachable, we print a warning and fall back to the legacy paths.

Test Plan:

  • Test with normal working TensorBoard.

  • Simulate failed data server connection by changing the definition of
    addr to "localhost:%d" % (port + 777) on line 175. This should
    print the UNAVAILABLE/“failed to connect” message from [2.5] Failed to pick subchannel #4844.

  • Simulate slow data server by adding the following to cli.rs:

    tokio::time::sleep(Duration::from_secs(3)).await;

    Add this right before the Server::builder().(...).await? call at
    the end of main: i.e., after we write the port file, but before we
    actually respond to requests. Note that TensorBoard still works with
    a 3-second delay, but that it actually delays printing the startup
    message for those 3 seconds as it determines which data provider to
    use.

  • Simulate extra-slow data provider as above but waiting 6 seconds,
    and note that TensorBoard prints a DEADLINE_EXCEEDED error, falls
    back to the legacy paths, and shows valid data.

wchargin-branch: data-liveness-check

Summary:
Issue #4844 shows that there are circumstances when communication
between TensorBoard and the local data server process cannot be
established. As of this patch, we send a trivial RPC to the data server
and wait for its response before committing to use the new loading
paths. This adds about 1–5ms total time to the happy path on my machine,
with a worst case penalty of 5s due to the timeout. If the server is not
reachable, we print a warning and fall back to the legacy paths.

Test Plan:

  - Test with normal working TensorBoard.

  - Simulate failed data server connection by changing the definition of
    `addr` to `"localhost:%d" % (port + 777)` on line 175. This should
    print the `UNAVAILABLE`/“failed to connect” message from #4844.

  - Simulate slow data server by adding the following to `cli.rs`:

    ```rust
    tokio::time::sleep(Duration::from_secs(3)).await;
    ```

    Add this right before the `Server::builder().(...).await?` call at
    the end of `main`: i.e., after we write the port file, but before we
    actually respond to requests. Note that TensorBoard still works with
    a 3-second delay, but that it actually delays printing the startup
    message for those 3 seconds as it determines which data provider to
    use.

  - Simulate extra-slow data provider as above but waiting 6 seconds,
    and note that TensorBoard prints a `DEADLINE_EXCEEDED` error, falls
    back to the legacy paths, and shows valid data.

wchargin-branch: data-liveness-check
wchargin-source: 7fd916ebcf426d92babaeb57c38789d47446df6e
@wchargin wchargin added theme:usability Areas to reduce confusion and frustration. core:rustboard //tensorboard/data/server/... labels Apr 9, 2021
@wchargin wchargin requested a review from stephanwlee April 9, 2021 04:02
@google-cla google-cla bot added the cla: yes label Apr 9, 2021
@wchargin
Copy link
Contributor Author

wchargin commented Apr 9, 2021

If all looks good, I’d like to cherry-pick this into 2.5.0.

@stephanwlee
Copy link
Contributor

Unusual merge: for sake of expediency for 2.5.0 release, I am merging it on behalf of wchargin.

@stephanwlee stephanwlee merged commit 876134e into master Apr 9, 2021
@stephanwlee stephanwlee deleted the wchargin-data-liveness-check branch April 9, 2021 16:46
@stephanwlee stephanwlee mentioned this pull request Apr 9, 2021
stephanwlee pushed a commit to stephanwlee/tensorboard that referenced this pull request Apr 9, 2021
Summary:
Issue tensorflow#4844 shows that there are circumstances when communication
between TensorBoard and the local data server process cannot be
established. As of this patch, we send a trivial RPC to the data server
and wait for its response before committing to use the new loading
paths. This adds about 1–5ms total time to the happy path on my machine,
with a worst case penalty of 5s due to the timeout. If the server is not
reachable, we print a warning and fall back to the legacy paths.

Test Plan:

  - Test with normal working TensorBoard.

  - Simulate failed data server connection by changing the definition of
    `addr` to `"localhost:%d" % (port + 777)` on line 175. This should
    print the `UNAVAILABLE`/“failed to connect” message from tensorflow#4844.

  - Simulate slow data server by adding the following to `cli.rs`:

    ```rust
    tokio::time::sleep(Duration::from_secs(3)).await;
    ```

    Add this right before the `Server::builder().(...).await?` call at
    the end of `main`: i.e., after we write the port file, but before we
    actually respond to requests. Note that TensorBoard still works with
    a 3-second delay, but that it actually delays printing the startup
    message for those 3 seconds as it determines which data provider to
    use.

  - Simulate extra-slow data provider as above but waiting 6 seconds,
    and note that TensorBoard prints a `DEADLINE_EXCEEDED` error, falls
    back to the legacy paths, and shows valid data.

wchargin-branch: data-liveness-check
wchargin-source: 7fd916ebcf426d92babaeb57c38789d47446df6e
@stephanwlee stephanwlee mentioned this pull request Apr 9, 2021
stephanwlee pushed a commit that referenced this pull request Apr 15, 2021
Summary:
Issue #4844 shows that there are circumstances when communication
between TensorBoard and the local data server process cannot be
established. As of this patch, we send a trivial RPC to the data server
and wait for its response before committing to use the new loading
paths. This adds about 1–5ms total time to the happy path on my machine,
with a worst case penalty of 5s due to the timeout. If the server is not
reachable, we print a warning and fall back to the legacy paths.

Test Plan:

  - Test with normal working TensorBoard.

  - Simulate failed data server connection by changing the definition of
    `addr` to `"localhost:%d" % (port + 777)` on line 175. This should
    print the `UNAVAILABLE`/“failed to connect” message from #4844.

  - Simulate slow data server by adding the following to `cli.rs`:

    ```rust
    tokio::time::sleep(Duration::from_secs(3)).await;
    ```

    Add this right before the `Server::builder().(...).await?` call at
    the end of `main`: i.e., after we write the port file, but before we
    actually respond to requests. Note that TensorBoard still works with
    a 3-second delay, but that it actually delays printing the startup
    message for those 3 seconds as it determines which data provider to
    use.

  - Simulate extra-slow data provider as above but waiting 6 seconds,
    and note that TensorBoard prints a `DEADLINE_EXCEEDED` error, falls
    back to the legacy paths, and shows valid data.

wchargin-branch: data-liveness-check
wchargin-source: 7fd916ebcf426d92babaeb57c38789d47446df6e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes core:rustboard //tensorboard/data/server/... theme:usability Areas to reduce confusion and frustration.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants