data: check liveness before blessing data server #4851
Merged
Summary:
Issue #4844 shows that there are circumstances when communication between TensorBoard and the local data server process cannot be established. As of this patch, we send a trivial RPC to the data server and wait for its response before committing to use the new loading paths. This adds about 1–5ms total time to the happy path on my machine, with a worst case penalty of 5s due to the timeout. If the server is not reachable, we print a warning and fall back to the legacy paths.

Test Plan:

- Test with normal working TensorBoard.
- Simulate failed data server connection by changing the definition of `addr` to `"localhost:%d" % (port + 777)` on line 175. This should print the `UNAVAILABLE`/“failed to connect” message from #4844.
- Simulate slow data server by adding the following to `cli.rs`:

  ```rust
  tokio::time::sleep(Duration::from_secs(3)).await;
  ```

  Add this right before the `Server::builder().(...).await?` call at the end of `main`: i.e., after we write the port file, but before we actually respond to requests. Note that TensorBoard still works with a 3-second delay, but that it actually delays printing the startup message for those 3 seconds as it determines which data provider to use.
- Simulate extra-slow data provider as above but waiting 6 seconds, and note that TensorBoard prints a `DEADLINE_EXCEEDED` error, falls back to the legacy paths, and shows valid data.

wchargin-branch: data-liveness-check
wchargin-source: 7fd916ebcf426d92babaeb57c38789d47446df6e
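The check-then-fall-back behavior described in the summary can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the gRPC code from this patch: `ping` is a stand-in for the trivial RPC, and `is_alive`, `choose_load_path`, and `make_slow_ping` are hypothetical names introduced here. Only the 5-second default timeout is taken from the description.

```python
import logging
import threading
import time

logger = logging.getLogger("liveness_sketch")


def is_alive(ping, timeout_secs=5.0):
    """Return True if `ping()` finishes without raising within `timeout_secs`.

    `ping` is a hypothetical stand-in for the trivial RPC to the data server.
    """
    outcome = {}

    def run():
        try:
            ping()
            outcome["ok"] = True
        except Exception as e:  # e.g. a connection failure
            outcome["error"] = e

    # Daemon thread: a ping that never returns cannot block interpreter exit.
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_secs)

    if worker.is_alive():
        # Analogous to the deadline being exceeded.
        logger.warning("Data server gave no response within %.1fs", timeout_secs)
        return False
    if "error" in outcome:
        # Analogous to the server being unavailable.
        logger.warning("Data server unreachable: %s", outcome["error"])
        return False
    return True


def choose_load_path(ping, timeout_secs=5.0):
    """Pick the new loading path only if the liveness check succeeds."""
    return "data_server" if is_alive(ping, timeout_secs) else "legacy"


def make_slow_ping(delay_secs):
    """Build a ping that simulates a server taking `delay_secs` to respond."""
    def ping():
        time.sleep(delay_secs)
    return ping
```

With the default timeout, `choose_load_path(make_slow_ping(3))` returns `"data_server"` after roughly three seconds, while `choose_load_path(make_slow_ping(6))` gives up after about five seconds and returns `"legacy"`, mirroring the last two steps of the test plan.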
If all looks good, I’d like to cherry-pick this into 2.5.0.
stephanwlee approved these changes on Apr 9, 2021
Unusual merge: for the sake of expediency for the 2.5.0 release, I am merging it on behalf of wchargin.
stephanwlee pushed a commit to stephanwlee/tensorboard that referenced this pull request on Apr 9, 2021
stephanwlee pushed a commit that referenced this pull request on Apr 15, 2021
Labels
cla: yes
core:rustboard
//tensorboard/data/server/...
theme:usability
Areas to reduce confusion and frustration.