Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

projector: fix RunPaths result aliasing #4479

Merged
merged 1 commit into from
Dec 17, 2020

Conversation

wchargin
Copy link
Contributor

Summary:
The projector uses multiplexer.RunPaths() to find each run’s logdir on
disk, but it’s not clear from the code how this stays up to date as more
runs are loaded. The answer is insidious: RunPaths returns the raw
underlying dictionary, not a shallow copy of it, so changes to the
internal structures of the multiplexer are immediately reflected in the
projector plugin. This is a bit terrifying, since it means both that the
code is wildly non-threadsafe and that it’s hard to tell when the data
has actually changed. (The terror is somewhat abated by remembering that
concurrency safety around dicts in Python is all just pretend, anyway.)

This patch modifies the projector plugin to always shallow-copy the
result of RunPaths, and makes the necessary follow-up changes to
observe updates to the data. The result is that RunPaths is called
more times, up to once on each HTTP handler or is_active call. This is
fine: it’s expected that plugin routes may hit the multiplexer or data
provider, and since the projector plugin only really works locally, this
should be fast enough to not matter.

In doing so, we also improve the update semantics, since we can test
precisely for cache invalidation rather than just looking at the count
of runs. There are almost certainly still observable concurrency issues;
this patch doesn’t try to fix them all.

The eventual goal of this series of patches is to remove the dependency
on the multiplexer entirely, so the call to RunPaths will become a
call to a data provider’s list_runs and merging in flags.logdir.

Test Plan:
The projector plugin still works in the normal case (shows embeddings
with sprites, etc.). Edge cases to check:

  • Point TensorBoard at a logdir containing only non-projector data.
    Launch TensorBoard and wait a bit for the projector to finish
    determining that it has no data. Then add one projector run to the
    logdir and refresh TensorBoard (but do not restart the server). Note
    that the projector plugin properly picks up the new run. Repeat, and
    note that it picks up the newer run, too (i.e., the cache can be
    updated even when it’s non-empty).

  • Point TensorBoard at a logdir where the projector_config.pbtxt is
    inside a logdir’s plugin assets directory:

    logs/run/plugins/org_tensorflow_tensorboard_projector/projector_config.pbtxt
    

    Note that the config is still resolved properly. You can tell that
    it’s using the plugin assets because if you comment out the
    _append_plugin_asset_directories call, the embeddings will still
    be displayed but will lack sprite information.

wchargin-branch: projector-fix-aliasing

Summary:
The projector uses `multiplexer.RunPaths()` to find each run’s logdir on
disk, but it’s not clear from the code how this stays up to date as more
runs are loaded. The answer is insidious: `RunPaths` returns the raw
underlying dictionary, not a shallow copy of it, so changes to the
internal structures of the multiplexer are immediately reflected in the
projector plugin. This is a bit terrifying, since it means both that the
code is wildly non-threadsafe and that it’s hard to tell when the data
has actually changed. (The terror is somewhat abated by remembering that
concurrency safety around dicts in Python is all just pretend, anyway.)

This patch modifies the projector plugin to always shallow-copy the
result of `RunPaths`, and makes the necessary follow-up changes to
observe updates to the data. The result is that `RunPaths` is called
more times, up to once on each HTTP handler or `is_active` call. This is
fine: it’s expected that plugin routes may hit the multiplexer or data
provider, and since the projector plugin only really works locally, this
should be fast enough to not matter.

In doing so, we also improve the update semantics, since we can test
precisely for cache invalidation rather than just looking at the count
of runs. There are almost certainly still observable concurrency issues;
this patch doesn’t try to fix them all.

The eventual goal of this series of patches is to remove the dependency
on the multiplexer entirely, so the call to `RunPaths` will become a
call to a data provider’s `list_runs` and merging in `flags.logdir`.

Test Plan:
The projector plugin still works in the normal case (shows embeddings
with sprites, etc.). Edge cases to check:

  - Point TensorBoard at a logdir containing only non-projector data.
    Launch TensorBoard and wait a bit for the projector to finish
    determining that it has no data. Then add one projector run to the
    logdir and refresh TensorBoard (but do not restart the server). Note
    that the projector plugin properly picks up the new run. Repeat, and
    note that it picks up the newer run, too (i.e., the cache can be
    updated even when it’s non-empty).

  - Point TensorBoard at a logdir where the `projector_config.pbtxt` is
    inside a logdir’s plugin assets directory:

    ```
    logs/run/plugins/org_tensorflow_tensorboard_projector/projector_config.pbtxt
    ```

    Note that the config is still resolved properly. You can tell that
    it’s using the plugin assets because if you comment out the
    `_append_plugin_asset_directories` call, the embeddings will still
    be displayed but will lack sprite information.

wchargin-branch: projector-fix-aliasing
wchargin-source: 30ba6c049c5509ee4625895f609ce162da3b2eb9
@wchargin wchargin added plugin:projector core:rustboard //tensorboard/data/server/... labels Dec 15, 2020
@google-cla google-cla bot added the cla: yes label Dec 15, 2020
Copy link
Contributor

@psybuzz psybuzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manually checked the edge cases in the description and confirmed things work.

Point TensorBoard at a logdir where the projector_config.pbtxt is
inside a logdir’s plugin assets directory: logs/run/plugins/org_tensorflow_tensorboard_projector/projector_config.pbtxt

I got hung up for a bit on this, because after I created that directory structure, TB wasn't detecting the config for some reason. Adding an empty file named with the substring "tfevents" in that directory made things work, presumably because it forced the multiplexer to recognize that directory.

else `False`.
"""
if self.multiplexer:
run_paths = dict(self.multiplexer.RunPaths())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other multiplexer methods including PluginAssets(), ActivePlugins(), Runs() return copies of the underlying value. Should we instead make the multiplexer responsible for creating a copy of the run paths?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, yes, but we can only do that once we’ve ensured that it’s
safe. If we’d done so before this commit, the projector plugin would
have broken. So before making that change, I’d like to do at least a
cursory audit, and I’d like that to not be tied to this commit.

Copy link
Contributor Author

@wchargin wchargin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point TensorBoard at a logdir where the projector_config.pbtxt is
inside a logdir’s plugin assets directory: logs/run/plugins/org_tensorflow_tensorboard_projector/projector_config.pbtxt

I got hung up for a bit on this, because after I created that directory structure, TB wasn't detecting the config for some reason. Adding an empty file named with the substring "tfevents" in that directory made things work, presumably because it forced the multiplexer to recognize that directory.

Hmm—you only need an event file at the run directory, right?

$ find /tmp/proj/mnist/lr*
/tmp/proj/mnist1/lr_1E-03,conv=1,fc=2
/tmp/proj/mnist1/lr_1E-03,conv=1,fc=2/events.out.tfevents.1563406327.HOSTNAME
/tmp/proj/mnist1/lr_1E-03,conv=1,fc=2/plugins
/tmp/proj/mnist1/lr_1E-03,conv=1,fc=2/plugins/org_tensorflow_tensorboard_projector
/tmp/proj/mnist1/lr_1E-03,conv=1,fc=2/plugins/org_tensorflow_tensorboard_projector/projector_config.pbtxt
$ bazel run //tensorboard -- --logdir /tmp/proj  # works

You specifically should not need it for the plugin assets directory
itself: i.e., projector_config.pbtxt does not need to have any event
files as siblings.

else `False`.
"""
if self.multiplexer:
run_paths = dict(self.multiplexer.RunPaths())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, yes, but we can only do that once we’ve ensured that it’s
safe. If we’d done so before this commit, the projector plugin would
have broken. So before making that change, I’d like to do at least a
cursory audit, and I’d like that to not be tied to this commit.

@wchargin
Copy link
Contributor Author

Also, if you do need an event file in the plugin assets directory for
some reason (I can’t reproduce this), could you please check whether you
observe the same behavior against head?

@psybuzz
Copy link
Contributor

psybuzz commented Dec 17, 2020

You specifically should not need it for the plugin assets directory
itself: i.e., projector_config.pbtxt does not need to have any event
files as siblings.

Just double checked and can confirm it's as you say. In a scenario with

  • run dir having a tfevents file
  • plugin assets dir has only projector_config.pbtxt (no siblings)
    the Projector still properly picks up the config file.

@wchargin
Copy link
Contributor Author

Okay; great. Thanks.

@wchargin wchargin merged commit 638014e into master Dec 17, 2020
@wchargin wchargin deleted the wchargin-projector-fix-aliasing branch December 17, 2020 20:08
wchargin added a commit that referenced this pull request Dec 22, 2020
Summary:
Some Google-internal, non-TensorBoard code depends on the projector and
monkey patches out its internals in a way that is not compatible with
the changes in #4479. This is clearly not ideal and should probably be
fixed downstream, but pending investigation we revert #4479 to unbreak.

This reverts commit 638014e.

Test Plan:
Test sync at this tree passes all tests.

wchargin-branch: projector-revert-4479
wchargin-source: b0d20a84c557976e8c84c061a5a9dec742c11e7c
wchargin added a commit that referenced this pull request Dec 22, 2020
Summary:
Some Google-internal, non-TensorBoard code depends on the projector and
monkey patches out its internals in a way that is not compatible with
the changes in #4479. This is clearly not ideal and should probably be
fixed downstream, but pending investigation we revert #4479 to unbreak.

This reverts commit 638014e.

Test Plan:
Test sync at this tree passes all tests.

wchargin-branch: projector-revert-4479
wchargin added a commit that referenced this pull request Dec 23, 2020
Summary:
This rolls forward #4479 (see that commit for context) with changes due
to Google-internal, non-TensorBoard code that monkey patches some of the
projector internals:

  - The `_run_paths` field is now updated only during `_update_configs`,
    to ensure that each change to `_run_paths` gives `_configs` the
    opportunity to update.
  - Configs/run paths are no longer updated in `__init__`, since that
    may be expensive and involve file access (even with a multiplexer).
  - The `_run_paths` field is now always a `dict` rather than `None`,
    fixing an existing bug wherein accessing `configs` on a plugin with
    no multiplexer would raise an `AttributeError`.

This reverts commit 4c9a699, and
therefore reinstates commit 638014e.

Test Plan:
Test plan from #4479 still passes, and a test sync now passes as well.
Also tested pointing the projector at a logdir with no runs but with
checkpoint data at root, as fixed in #3694; this worked in the original
version of this change (#4479), but it works now, too.

wchargin-branch: projector-fix-aliasing
wchargin-source: 0cb7e9c3917d5dc9f00616e35aa5bf756a51dea0
wchargin added a commit that referenced this pull request Jan 5, 2021
Summary:
This rolls forward #4479 (see that commit for context) with changes due
to Google-internal, non-TensorBoard code that monkey patches some of the
projector internals:

  - The `_run_paths` field is now updated only during `_update_configs`,
    to ensure that each change to `_run_paths` gives `_configs` the
    opportunity to update.
  - Configs/run paths are no longer updated in `__init__`, since that
    may be expensive and involve file access (even with a multiplexer).
  - The `_run_paths` field is now always a `dict` rather than `None`,
    fixing an existing bug wherein accessing `configs` on a plugin with
    no multiplexer would raise an `AttributeError`.

This reverts commit 4c9a699, and
therefore reinstates commit 638014e.

Test Plan:
Test plan from #4479 still passes, and a test sync now passes as well.
Also tested pointing the projector at a logdir with no runs but with
checkpoint data at root, as fixed in #3694; this worked in the original
version of this change (#4479), but it works now, too.

Test sync: <http://cl/350223680>.

wchargin-branch: projector-fix-aliasing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants