-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
projector: fix RunPaths
result aliasing
#4479
Conversation
Summary: The projector uses `multiplexer.RunPaths()` to find each run’s logdir on disk, but it’s not clear from the code how this stays up to date as more runs are loaded. The answer is insidious: `RunPaths` returns the raw underlying dictionary, not a shallow copy of it, so changes to the internal structures of the multiplexer are immediately reflected in the projector plugin. This is a bit terrifying, since it means both that the code is wildly non-threadsafe and that it’s hard to tell when the data has actually changed. (The terror is somewhat abated by remembering that concurrency safety around dicts in Python is all just pretend, anyway.) This patch modifies the projector plugin to always shallow-copy the result of `RunPaths`, and makes the necessary follow-up changes to observe updates to the data. The result is that `RunPaths` is called more times, up to once on each HTTP handler or `is_active` call. This is fine: it’s expected that plugin routes may hit the multiplexer or data provider, and since the projector plugin only really works locally, this should be fast enough to not matter. In doing so, we also improve the update semantics, since we can test precisely for cache invalidation rather than just looking at the count of runs. There are almost certainly still observable concurrency issues; this patch doesn’t try to fix them all. The eventual goal of this series of patches is to remove the dependency on the multiplexer entirely, so the call to `RunPaths` will become a call to a data provider’s `list_runs` and merging in `flags.logdir`. Test Plan: The projector plugin still works in the normal case (shows embeddings with sprites, etc.). Edge cases to check: - Point TensorBoard at a logdir containing only non-projector data. Launch TensorBoard and wait a bit for the projector to finish determining that it has no data. Then add one projector run to the logdir and refresh TensorBoard (but do not restart the server). Note that the projector plugin properly picks up the new run. Repeat, and note that it picks up the newer run, too (i.e., the cache can be updated even when it’s non-empty). - Point TensorBoard at a logdir where the `projector_config.pbtxt` is inside a logdir’s plugin assets directory: ``` logs/run/plugins/org_tensorflow_tensorboard_projector/projector_config.pbtxt ``` Note that the config is still resolved properly. You can tell that it’s using the plugin assets because if you comment out the `_append_plugin_asset_directories` call, the embeddings will still be displayed but will lack sprite information. wchargin-branch: projector-fix-aliasing wchargin-source: 30ba6c049c5509ee4625895f609ce162da3b2eb9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manually checked the edge cases in the description and confirmed things work.
Point TensorBoard at a logdir where the projector_config.pbtxt is
inside a logdir’s plugin assets directory: logs/run/plugins/org_tensorflow_tensorboard_projector/projector_config.pbtxt
I got hung up for a bit on this, because after I created that directory structure, TB wasn't detecting the config for some reason. Adding an empty file named with the substring "tfevents" in that directory made things work, presumably because it forced the multiplexer to recognize that directory.
else `False`. | ||
""" | ||
if self.multiplexer: | ||
run_paths = dict(self.multiplexer.RunPaths()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other multiplexer methods including PluginAssets()
, ActivePlugins()
, Runs()
return copies of the underlying value. Should we instead make the multiplexer responsible for creating a copy of the run paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, yes, but we can only do that once we’ve ensured that it’s
safe. If we’d done so before this commit, the projector plugin would
have broken. So before making that change, I’d like to do at least a
cursory audit, and I’d like that to not be tied to this commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Point TensorBoard at a logdir where the projector_config.pbtxt is
inside a logdir’s plugin assets directory: logs/run/plugins/org_tensorflow_tensorboard_projector/projector_config.pbtxtI got hung up for a bit on this, because after I created that directory structure, TB wasn't detecting the config for some reason. Adding an empty file named with the substring "tfevents" in that directory made things work, presumably because it forced the multiplexer to recognize that directory.
Hmm—you only need an event file at the run directory, right?
$ find /tmp/proj/mnist/lr*
/tmp/proj/mnist1/lr_1E-03,conv=1,fc=2
/tmp/proj/mnist1/lr_1E-03,conv=1,fc=2/events.out.tfevents.1563406327.HOSTNAME
/tmp/proj/mnist1/lr_1E-03,conv=1,fc=2/plugins
/tmp/proj/mnist1/lr_1E-03,conv=1,fc=2/plugins/org_tensorflow_tensorboard_projector
/tmp/proj/mnist1/lr_1E-03,conv=1,fc=2/plugins/org_tensorflow_tensorboard_projector/projector_config.pbtxt
$ bazel run //tensorboard -- --logdir /tmp/proj # works
You specifically should not need it for the plugin assets directory
itself: i.e., projector_config.pbtxt
does not need to have any event
files as siblings.
else `False`. | ||
""" | ||
if self.multiplexer: | ||
run_paths = dict(self.multiplexer.RunPaths()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, yes, but we can only do that once we’ve ensured that it’s
safe. If we’d done so before this commit, the projector plugin would
have broken. So before making that change, I’d like to do at least a
cursory audit, and I’d like that to not be tied to this commit.
Also, if you do need an event file in the plugin assets directory for |
Just double checked and can confirm it's as you say. In a scenario with
|
Okay; great. Thanks. |
Summary: Some Google-internal, non-TensorBoard code depends on the projector and monkey patches out its internals in a way that is not compatible with the changes in #4479. This is clearly not ideal and should probably be fixed downstream, but pending investigation we revert #4479 to unbreak. This reverts commit 638014e. Test Plan: Test sync at this tree passes all tests. wchargin-branch: projector-revert-4479 wchargin-source: b0d20a84c557976e8c84c061a5a9dec742c11e7c
Summary: Some Google-internal, non-TensorBoard code depends on the projector and monkey patches out its internals in a way that is not compatible with the changes in #4479. This is clearly not ideal and should probably be fixed downstream, but pending investigation we revert #4479 to unbreak. This reverts commit 638014e. Test Plan: Test sync at this tree passes all tests. wchargin-branch: projector-revert-4479
Summary: This rolls forward #4479 (see that commit for context) with changes due to Google-internal, non-TensorBoard code that monkey patches some of the projector internals: - The `_run_paths` field is now updated only during `_update_configs`, to ensure that each change to `_run_paths` gives `_configs` the opportunity to update. - Configs/run paths are no longer updated in `__init__`, since that may be expensive and involve file access (even with a multiplexer). - The `_run_paths` field is now always a `dict` rather than `None`, fixing an existing bug wherein accessing `configs` on a plugin with no multiplexer would raise an `AttributeError`. This reverts commit 4c9a699, and therefore reinstates commit 638014e. Test Plan: Test plan from #4479 still passes, and a test sync now passes as well. Also tested pointing the projector at a logdir with no runs but with checkpoint data at root, as fixed in #3694; this worked in the original version of this change (#4479), but it works now, too. wchargin-branch: projector-fix-aliasing wchargin-source: 0cb7e9c3917d5dc9f00616e35aa5bf756a51dea0
Summary: This rolls forward #4479 (see that commit for context) with changes due to Google-internal, non-TensorBoard code that monkey patches some of the projector internals: - The `_run_paths` field is now updated only during `_update_configs`, to ensure that each change to `_run_paths` gives `_configs` the opportunity to update. - Configs/run paths are no longer updated in `__init__`, since that may be expensive and involve file access (even with a multiplexer). - The `_run_paths` field is now always a `dict` rather than `None`, fixing an existing bug wherein accessing `configs` on a plugin with no multiplexer would raise an `AttributeError`. This reverts commit 4c9a699, and therefore reinstates commit 638014e. Test Plan: Test plan from #4479 still passes, and a test sync now passes as well. Also tested pointing the projector at a logdir with no runs but with checkpoint data at root, as fixed in #3694; this worked in the original version of this change (#4479), but it works now, too. Test sync: <http://cl/350223680>. wchargin-branch: projector-fix-aliasing
Summary:
The projector uses
multiplexer.RunPaths()
to find each run’s logdir ondisk, but it’s not clear from the code how this stays up to date as more
runs are loaded. The answer is insidious:
RunPaths
returns the rawunderlying dictionary, not a shallow copy of it, so changes to the
internal structures of the multiplexer are immediately reflected in the
projector plugin. This is a bit terrifying, since it means both that the
code is wildly non-threadsafe and that it’s hard to tell when the data
has actually changed. (The terror is somewhat abated by remembering that
concurrency safety around dicts in Python is all just pretend, anyway.)
This patch modifies the projector plugin to always shallow-copy the
result of
RunPaths
, and makes the necessary follow-up changes toobserve updates to the data. The result is that
RunPaths
is calledmore times, up to once on each HTTP handler or
is_active
call. This isfine: it’s expected that plugin routes may hit the multiplexer or data
provider, and since the projector plugin only really works locally, this
should be fast enough to not matter.
In doing so, we also improve the update semantics, since we can test
precisely for cache invalidation rather than just looking at the count
of runs. There are almost certainly still observable concurrency issues;
this patch doesn’t try to fix them all.
The eventual goal of this series of patches is to remove the dependency
on the multiplexer entirely, so the call to
RunPaths
will become acall to a data provider’s
list_runs
and merging inflags.logdir
.Test Plan:
The projector plugin still works in the normal case (shows embeddings
with sprites, etc.). Edge cases to check:
Point TensorBoard at a logdir containing only non-projector data.
Launch TensorBoard and wait a bit for the projector to finish
determining that it has no data. Then add one projector run to the
logdir and refresh TensorBoard (but do not restart the server). Note
that the projector plugin properly picks up the new run. Repeat, and
note that it picks up the newer run, too (i.e., the cache can be
updated even when it’s non-empty).
Point TensorBoard at a logdir where the
projector_config.pbtxt
isinside a logdir’s plugin assets directory:
Note that the config is still resolved properly. You can tell that
it’s using the plugin assets because if you comment out the
_append_plugin_asset_directories
call, the embeddings will stillbe displayed but will lack sprite information.
wchargin-branch: projector-fix-aliasing