
[RFC] Extending the Tensorboard Notebook Display API #3752

Open
jerrylian-db opened this issue Jun 23, 2020 · 5 comments

@jerrylian-db
Contributor

Three possible enhancements are proposed below to improve the TensorBoard experience in various notebook environments. Please add your comments or concerns.

Custom Display API

A recent pull request made it possible for users to run TensorBoard in environments where the Jupyter host machine is not directly exposed to the notebook user. We are proposing to make the notebook display API more flexible.

The proposed enhancement is an API that lets users directly set an inline display function that works in their environment, giving TensorBoard more flexibility across notebook environments.

Here is an example use case of this API.

# In the notebook Python shell init script
def my_custom_display(port, height, display_handle):
    del display_handle  # unused in this environment
    html_template = ...  # build the TensorBoard iframe HTML for the given port/height
    display_function(html_template)  # this function is specific to the cloud environment

# Import hook
import wrapt

@wrapt.when_imported("tensorboard.notebook")
def hook_tensorboard(module):
    module.register_custom_display(my_custom_display)

# In notebook cells
%load_ext tensorboard

%tensorboard --logdir logs
# Uses the custom display function, which renders the TensorBoard UI inline

Flag to prevent process reuse

Process reuse is a convenient feature when a single user runs TensorBoard on their personal machine. However, in multi-tenant environments where users share the same machine or container, reusing processes can lead to mutual interference. One possible scenario: user A starts a process on a log directory, and user B starts TensorBoard on the same directory, thereby reusing user A's process. User B then kills the TensorBoard process, and user A's TensorBoard UI becomes stale (to user A's surprise).

By adding a flag to TensorBoard that prevents process reuse, e.g. --new_process, users in a multi-tenant environment can set that flag by default to prevent interfering with each other.

Increasing the limit for the number of concurrent processes

Currently, there is a default, hard-coded limit of 10 concurrent TensorBoard processes in the port selection logic. For some environments, e.g. multi-tenant environments with powerful computing resources, this is too restrictive. We propose increasing this default limit to 100.
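For reference, the port-selection behavior under discussion can be sketched roughly as follows. This is an illustrative stand-in, not TensorBoard's actual code; the function name and details are assumptions:

```python
import socket

def find_available_port(base_port=6006, max_attempts=100):
    """Return the first free port in [base_port, base_port + max_attempts).

    max_attempts caps the scan; the proposal is to raise this cap from
    10 to 100 so that more concurrent instances can start with defaults.
    """
    for offset in range(max_attempts):
        port = base_port + offset
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", port))
        except OSError:
            continue  # port already in use; try the next one
        else:
            return port
        finally:
            s.close()
    raise RuntimeError(
        "no free port in range %d-%d"
        % (base_port, base_port + max_attempts - 1)
    )
```

With a limit of 10, the eleventh concurrent instance on a shared machine fails; raising the cap to 100 simply widens this scan range.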

@wchargin
Contributor

Hi @jerrylian-db! Thanks for reaching out. A couple initial comments
inline:

@wrapt.when_imported("tensorboard.notebook")

With the understanding that this is just an example, in most cases we
would probably recommend just importing tensorboard.notebook and
registering the hook directly. The module is lightweight and has few
dependencies (in particular, no TensorFlow import).

def hook_tensorboard(module):
    module.register_custom_display(my_custom_display)

I don’t understand from this example what the contract of the API is.
How is the notebook module supposed to know when to use this custom
display function? e.g., for Colab it tests whether google.colab can be
imported, but I don’t see any place where such a check would fit in
here.

If the idea is that this is a hard override that will affect all
further calls to %tensorboard (or notebook.start) within this
session, we can maybe work with that, though register_custom_display
is then a bit of a misleading name. But I also wonder whether in that
case it would be easier for your code to just provide its own IPython
extension and/or magic rather than customizing ours?

user A starts a process on a log directory and user B starts
tensorboard on the same directory, hence reusing user A's process.
User B kills the Tensorboard process and user A's Tensorboard UI
becomes stale (to user A’s surprise).

Fair point. Related: User A points TensorBoard at a log directory to
which they don’t have read access; when User B points TensorBoard at the
same directory, it reuses the instance and can’t read the data, even if
User B does have sufficient privileges.

Let’s include the current user name/ID in the cache key to fix this.
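A user-aware cache key could look roughly like this. This is a minimal sketch; the function name, payload fields, and hashing scheme are illustrative assumptions, not TensorBoard's actual `manager` implementation:

```python
import hashlib
import json
import os

def cache_key(working_directory, arguments):
    """Compute an instance-reuse cache key that includes the current user.

    Hashing the user name (or UID) alongside the working directory and
    arguments means two users pointing at the same logdir get separate
    TensorBoard instances instead of sharing one.
    """
    try:
        import getpass
        user = getpass.getuser()
    except Exception:
        user = "unknown"  # no resolvable user; fall back to a sentinel
    payload = {
        "working_directory": working_directory,
        "arguments": list(arguments),
        "user": user,  # new ingredient: per-user isolation
    }
    raw = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Identical invocations by the same user still hash to the same key (so reuse keeps working for the single-user case), while any change in user, directory, or arguments yields a distinct key.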

By adding a flag to Tensorboard that prevents process re-use,
e.g. --new_process

It’s not quite that simple. Launching tensorboard(1) always spawns a
new TensorBoard instance; it’s the %tensorboard magic that sometimes
decides to reuse an existing instance. But the %tensorboard magic does
not parse any flags; it forwards them verbatim to the new TensorBoard
process. This is intentional, and we’d prefer to keep it that way.
(Parsing flags would require discovering and executing arbitrary plugin
code in the host IPython environment, would introduce a pin between the
library version of %tensorboard and the underlying binary version, and
so on.)

Currently, there is a default and hard-coded limit of 10 concurrent
Tensorboard processes in the port selection logic

To be clear, this is just the port-scanning logic. You can always pass
a fixed port, and you can also pass --port 0 to automatically select a
port. But increasing the scanning limit is probably fine with me, too.

@jerrylian-db
Contributor Author

Hey @wchargin. Thank you for your feedback!

First, I apologize for the confusing example above. The way we envision the API working is for users to directly call it on the notebook module to register a custom notebook display function. A clearer example:

from tensorboard import notebook
notebook.register_custom_display(custom_display)

The example we first provided does the above whenever the notebook module is imported, covering more usage scenarios.

How is the notebook module supposed to know when to use this custom
display function?

As you can see in https://github.com/tensorflow/tensorboard/pull/3770/files#diff-239d7f35cc0b4e0b6540bc796048880dR57, inside the notebook module, we define a global _custom_display variable as None. The new register_custom_display API sets that variable to a custom display function. The notebook module uses this function when the global variable is not None. All subsequent calls to %tensorboard and notebook.start will use this custom display function.

In general, we hope to provide a general way of displaying TensorBoard in notebook environments that are not IPython-based. This API gives us a stable way of doing so.

On process reuse, thanks for pointing out that it is the manager, not the binary, that reuses TensorBoard processes. I'd also like to point out that in some multi-tenant scenarios, users might share the same user ID. I'm wondering if we can make the manager module look for an environment variable, perhaps called TENSORBOARD_NO_PROCESS_REUSE, and skip process reuse whenever it is defined.

Thanks for the clarification on the port-scanning logic. Yes, we would like to increase the scanning limit to 100. Most users simply copy and paste examples from the web and expect them to work in multi-tenant environments; requiring additional arguments adds friction.

wchargin pushed a commit that referenced this issue Jul 1, 2020
Increase the port scanning limit from 10 to 100 so that more users can
start TensorBoard processes with default parameters in multi-tenant
notebook environments.

See #3752 for more
context.
@jerrylian-db
Contributor Author

Another thought on process reuse. What are your thoughts on adding an argument to TensorBoard like --key? It would have no effect on the TensorBoard server but would be added to the cache_key, allowing users in multi-tenant environments to create a process that would not be reused unless the exact same key is set.

@wchargin
Contributor

My high-level concern is the following. The tensorboard.notebook
module currently exposes an API that is public, but “closed” in the
sense that it covers a fixed scope. Adding extension points to this API
is something that we can do, but it requires care to ensure that we not
lock ourselves into something that we regret later. (We have a fair
amount of this regret re: some of our older APIs from when we were both
less careful with design and more lax with what was public, and it’s an
ongoing burden.)

The approach proposed in the RFC and PR doesn’t seem quite there yet to
me. Here are some examples of what I mean:

  • If the proposed new API is appropriate for general use, then it
    should be possible to express the three existing contexts (CLI, bare
    IPython, and Colab) in terms of the new API. But as written, the
    “custom” context is just tacked on as a fourth exclusive case.

  • The RFC says that a goal is to avoid a dependency on IPython. But
    the callback’s third argument is an IPython display handle.

  • As mentioned previously, I’m not excited about implementing this by
    adding global mutable state to the module context. This is something
    that we’re actually fairly good at avoiding, and in the few cases
    where we’ve had global state, we’ve been burnt in entirely
    predictable ways.

  • The existing display callbacks that we have (_display_colab, etc.)
    take a hodgepodge of host, port, and display handle arguments. The
    understanding here was that these were sufficient for what we need
    to do right now, and if we need to restructure that or add more
    metadata then we have that flexibility since it’s internal. To
    expose this signature publicly, the bar needs to be higher.

  • Controlling the process-spawning behavior of the (private)
    tensorboard.manager API via an environment variable also doesn’t
    have much precedent, so I’d like to see if there’s something more
    standard that we can do. (The existing TENSORBOARD_BINARY env var
    is more like a PATH, so it doesn’t really feel like the same thing
    to me.)

I recognize that these aren’t factual questions with clear correct
answers, but part of my job as a maintainer here is to object to
features :-). If you’re willing to take on the design work here more
holistically, I’d be happy to advise. Specifically, I’d like to see:

  • a model/ontology for what kinds of contexts we’re trying to support,
    and how this covers the current use cases and the use cases that you
    plan to add; and
  • consideration of whether it makes more sense for the functionality
    that you need to be a separate wrapper (separate extension, magics,
    or some combination) that delegates to TensorBoard APIs rather than
    extending them directly, so that you have more control over the
    trajectory.

What do you think?

@wchargin
Contributor

(re: a new --key argument to be used in the cache key: see comments
above about not parsing flags.)
