Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scalars: multiplex data fetches within a tag #4050

Merged
merged 7 commits into from
Aug 20, 2020

Conversation

wchargin
Copy link
Contributor

@wchargin wchargin commented Aug 18, 2020

Summary:
As of this patch, a tf-scalar-card can make just one network request
to fetch its data, instead of one request per time series (i.e., one
request per run, since each scalar chart renders a single tag). This
reduces network overhead, improves throughput due to higher request
concurrency, and offers the opportunity for data providers to more
efficiently request the data in batch.

This is implemented with a new POST route /scalars_multirun, since the
list of run names may be long. The frontend is configured to batch
requests to at most 64 runs at once, so the multiplexing is bounded.

This only affects the scalars plugin. Other scalar chart sources, like
PR curves, custom scalars, and the hparams metrics views, are not
affected.

Supersedes #3835, with the same idea and same general backend approach,
but using the frontend APIs enabled by #4045.

Test Plan:
On the hparams demo with four charts showing, each fetching 50 runs, we
now make only four requests as opposed to 200. On a Google-internal
networked data provider, this improves end-to-end time (measured from
“first spinner appears” to “last spinner disappears”) by about 50%, from
22±1 seconds to 11±1 seconds. (Before this patch, the network requests
were being queued all the way to the end of the test period.)

Changing the batch size to 4 and then running on a dataset with 14 runs
shows that the requests are properly batched, including the last one
with just 2 runs.

Testing hparams, custom scalars, and PR curves shows that they continue
to work, even when multiple time series are requested.

wchargin-branch: scalars-mux-runs

Summary:
The data loader behavior maintains a fine-grained cache of key-value
pairs. This is useful because when the set of requested keys expands,
only the new keys need to be fetched. But, until now, the behavior has
been hard-coded to fire a separate request for each key-value pair.
Clients can have either fine-grained cache invalidation or efficient
batched requests, but not both at the same time. This patch enriches the
behavior to support just that.

In a future change, the scalars dashboard will take advantage of this to
batch requests for multiple runs and a single tag.

In doing so, we need to shuffle around the API a bit. Instead of asking
clients to provide `getDataLoadUrl: (Item) => string` plus a separate
function `requestData: (url: string) => Promise<Data>` (where “`Item`”
and “`Data`” are the keys and values, respectively), clients now provide
a single function `requestData` that takes raw `Item`s (now plural),
performs the request(s), and returns the data. The function provides a
stream of key-value pairs, which is represented in callback style for
convenience. (We don’t want to drag Observable into this.)

The purpose of this approach, as opposed to a perhaps more natural
approach that simply adapts `getDataLoadUrl` to return some kind of
request struct with a callback to map a response into key-value pairs,
is to accommodate the variety of existing clients. The structures get
pretty wild: e.g., `tf-line-chart-data-loader` mixes in the behavior but
doesn’t actually provide the needed properties; they’re provided instead
by `tf-scalar-card`, but then `tf-hparams-session-group-details` further
overrides some of those properties of `tf-scalar-card` with an entirely
different backend. It’s a bit wordier for clients, but at least there
are fewer moving pieces to keep track of.

Test Plan:
The scalars, custom scalars, distributions, histograms, and hparams
dashboards all work. The fine-grained invalidation on the scalars
dashboard works: e.g., set the tag filter to `mnist` and then to
`mnist|hparams`, and watch only the hparams demo data load; then, set it
to `hparams` and watch the MNIST charts disappear without any repaints
to the hparams demo charts. The post-load callback properly causes
scalar charts’ domains to adjust. The refresh button in the dashboard UI
properly invalidates and re-fetches data.

(Make sure to run with `TB_POLYMER3=1` until that’s the default.)

wchargin-branch: dlb-batch-finegrained
wchargin-source: 14ec8abde36c4563a4922209d361fc5bd16b6061
Summary:
DO NOT SUBMIT until corresponding internal change is submitted.

Test Plan:
Manually tested; to be documented.

wchargin-branch: scalars-mux-runs
wchargin-source: 9459c8cc6bc0b041016dfc689306639a04080217
@wchargin wchargin added plugin:scalars theme:performance Performance, scalability, large data sizes, slowness, etc. labels Aug 18, 2020
@wchargin
Copy link
Contributor Author

DO NOT SUBMIT until corresponding internal change is submitted.

No internal change actually needed. (The relevant internal ML dashboard
is isolated from these changes due to the Polymer 3 migration.) I still
do want to better document and profile this, though, so leaving as draft
for the moment.

@wchargin
Copy link
Contributor Author

With a networked data provider (Google-internally) on a dataset with
four scalar charts each with 50 runs, triggering load by using “Toggle
All Runs” to turn all runs from off to on (on clean load cache),
timing from “first spinner appears” to “last spinner disappears”,
I measure an improvement from 22±1 seconds to 11±1 seconds: about 50%.

Base automatically changed from wchargin-dlb-batch-finegrained to master August 19, 2020 17:18
wchargin-branch: scalars-mux-runs
wchargin-source: 2e1bd5c8cd3393ac0a79c64b0c87ea43bf346b8a

# Conflicts:
#	tensorboard/plugins/scalar/polymer3/tf_scalar_dashboard/tf-scalar-card.ts
wchargin-branch: scalars-mux-runs
wchargin-source: 2e1bd5c8cd3393ac0a79c64b0c87ea43bf346b8a
wchargin-branch: scalars-mux-runs
wchargin-source: 0484e726315d46c282c3d61994ffb9e4dc07c20b
@wchargin wchargin closed this Aug 19, 2020
@wchargin wchargin reopened this Aug 19, 2020
wchargin-branch: scalars-mux-runs
wchargin-source: f183c2962c3d464b6ac903b63d833f0c926fe162
@wchargin wchargin closed this Aug 19, 2020
@wchargin wchargin reopened this Aug 19, 2020
Comment on lines +67 to +68
Accepts form-encoded POST data with a (required) singleton key `tag` and a
repeated key `runs`. Returns a JSON object mapping run names to arrays of the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the contract if I violate this constraint? (0 or 2+ tag)? Unlike the zero length runs case, it looks like we are throwing 400.

Also, do you think it is a good idea to perhaps write an example of the request?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zero-length runs is fine; it just returns an empty object. Missing
tag is a 400. Duplicate tags will silently pick one of them
(probably last?). I can make that an error, though; probably a good
idea. Thanks.

Also, do you think it is a good idea to perhaps write an example of
the request?

Yep. Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I expected to see FormData but this works too :)

return this.requestManager.request(url, {tag, runs}).then((allData) => {
for (const run of runs) {
const item = {tag, run};
if (Object.prototype.hasOwnProperty.call(allData, run)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why we have to use Object.prototype.hasOwnProperty? (i.e., why not allData.hasOwnProperty?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have a run called hasOwnProperty, then that would be a type
error.

> ({train: 1}).hasOwnProperty("train")
true
> ({train: 1, hasOwnProperty: 2}).hasOwnProperty("train")
Thrown:
TypeError: {(intermediate value)(intermediate value)}.hasOwnProperty is not a function
> Object.prototype.hasOwnProperty.call({train: 1, hasOwnProperty: 2}, "train")
true

Yeah.

I prefer to write this kind of code defensively—even if it seems
unlikely, it sometimes happens (e.g.: #1283).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One day https://github.com/hasOwnProperty will show up in a JSON
object mapping GitHub login names to database IDs, and someone will have
a fun time…

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks.

wchargin-branch: scalars-mux-runs
wchargin-source: a4ca83ea5bf6b665052c076df835592105e5eefb
Copy link
Contributor Author

@wchargin wchargin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

return this.requestManager.request(url, {tag, runs}).then((allData) => {
for (const run of runs) {
const item = {tag, run};
if (Object.prototype.hasOwnProperty.call(allData, run)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have a run called hasOwnProperty, then that would be a type
error.

> ({train: 1}).hasOwnProperty("train")
true
> ({train: 1, hasOwnProperty: 2}).hasOwnProperty("train")
Thrown:
TypeError: {(intermediate value)(intermediate value)}.hasOwnProperty is not a function
> Object.prototype.hasOwnProperty.call({train: 1, hasOwnProperty: 2}, "train")
true

Yeah.

I prefer to write this kind of code defensively—even if it seems
unlikely, it sometimes happens (e.g.: #1283).

Comment on lines +67 to +68
Accepts form-encoded POST data with a (required) singleton key `tag` and a
repeated key `runs`. Returns a JSON object mapping run names to arrays of the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zero-length runs is fine; it just returns an empty object. Missing
tag is a 400. Duplicate tags will silently pick one of them
(probably last?). I can make that an error, though; probably a good
idea. Thanks.

Also, do you think it is a good idea to perhaps write an example of
the request?

Yep. Done.

return this.requestManager.request(url, {tag, runs}).then((allData) => {
for (const run of runs) {
const item = {tag, run};
if (Object.prototype.hasOwnProperty.call(allData, run)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks.

Comment on lines +67 to +68
Accepts form-encoded POST data with a (required) singleton key `tag` and a
repeated key `runs`. Returns a JSON object mapping run names to arrays of the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I expected to see FormData but this works too :)

@wchargin wchargin merged commit 0bf9c12 into master Aug 20, 2020
@wchargin wchargin deleted the wchargin-scalars-mux-runs branch August 20, 2020 22:05
wchargin added a commit that referenced this pull request Sep 23, 2020
Summary:
In #4050, we added multiplexing to scalar chart fetches for performance.
This works nicely in local TensorBoard and public Colab, but the POST
request machinery isn’t yet supported in the Google-internal version of
Colab, so #4050 caused a regression there. This patch conditionally
rolls back the change for Colab only. This includes rolling it back for
public Colab, where it worked fine. We hope that this isn’t too big of a
problem, since we expect that most Colab users are inspecting datasets
with few runs (generated from within Colab) rather than massive
hyperparameter sweeps. We also hope that we can simply revert this patch
once the Google-internal behavior is fixed.

Jupyter environments are unaffected (and still work).

We used to have a global `window.TENSORBOARD_ENV` that listed whether we
were in Colab, but we removed that in #2798 because we thought that it
was no longer necessary. Since in that PR we switched to create an
iframe rather than manually linking and loading the HTML (much nicer, to
be sure), we now plumb this information via a query parameter. This also
has the advantage that it’s easy to test the Colab codepaths by simply
adding that query parameter to a normal TensorBoard instance.

Test Plan:
First, test that normal TensorBoard (`bazel run //tensorboard`) still
works with multiplexing, and that adding `?tensorboardColab=true` causes
it to send multiple GET requests instead. Then, build the Pip package
(`bazel run //tensorboard/pip_package:extract_pip_packages`), upload it
to public Colab, and install it into the runtime. Verify that the scalar
charts still work there, albeit without multiplexing. Finally,
cherry-pick these changes into google3 via a test sync and follow the
test plan at <http://cl/333398676> to verify the fix.

wchargin-branch: scalars-nomux-colab
wchargin-source: 453503f1dea196985e889be483c2a7675cc87aa1
wchargin added a commit that referenced this pull request Sep 24, 2020
Summary:
In #4050, we added multiplexing to scalar chart fetches for performance.
This works nicely in local TensorBoard and public Colab, but the POST
request machinery isn’t yet supported in the Google-internal version of
Colab, so #4050 caused a regression there. This patch conditionally
rolls back the change for Colab only. This includes rolling it back for
public Colab, where it worked fine. We hope that this isn’t too big of a
problem, since we expect that most Colab users are inspecting datasets
with few runs (generated from within Colab) rather than massive
hyperparameter sweeps. We also hope that we can simply revert this patch
once the Google-internal behavior is fixed.

Jupyter environments are unaffected (and still work).

We used to have a global `window.TENSORBOARD_ENV` that listed whether we
were in Colab, but we removed that in #2798 because we thought that it
was no longer necessary. Since in that PR we switched to create an
iframe rather than manually linking and loading the HTML (much nicer, to
be sure), we now plumb this information via a query parameter. This also
has the advantage that it’s easy to test the Colab codepaths by simply
adding that query parameter to a normal TensorBoard instance.

Fixes #4174.

Test Plan:
First, test that normal TensorBoard (`bazel run //tensorboard`) still
works with multiplexing, and that adding `?tensorboardColab=true` causes
it to send multiple GET requests instead. Then, build the Pip package
(`bazel run //tensorboard/pip_package:extract_pip_packages`), upload it
to public Colab, and install it into the runtime. Verify that the scalar
charts still work there, albeit without multiplexing. Finally,
cherry-pick these changes into google3 via a test sync and follow the
test plan at <http://cl/333398676> to verify the fix.

wchargin-branch: scalars-nomux-colab
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes plugin:scalars theme:performance Performance, scalability, large data sizes, slowness, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants