
TensorBoard Master's capstone project #130

Closed
chrisranderson opened this issue Jun 22, 2017 · 21 comments

@chrisranderson
Contributor

I have about 120-150 hours to work on a project for school, and I was thinking about doing a visualization project in TensorBoard, and I'd like it to be usable by lots of people. Here's my idea:

Users select from one of the following to view:

  • parameter values (normalized on a per-layer basis)
  • gradient magnitudes
  • activations
  • variance over time for the above
  • maybe other stuff like highlighting receptive fields on hover, drawing on image input and seeing how activations change

These would just be pulled out of the network, reshaped, and visualized like so (and maybe a separate area for 1st conv layer filters) (video here: https://www.youtube.com/watch?v=gjXmacaxlYI):

I have some questions:

  1. Has this been done before in a way that is meant for lots of people to use? I couldn't find anything from a bit of Googling around, just people doing one-off things for their projects.
  2. Can this be visualized as the model is running, as fast as every iteration without drastic slowdowns? For instance, using a 1 million parameter model where the parameters are reshaped into a 1000x1000 square.
  3. Do you have any suggestions of features that you would find useful that I could add? Is there some other visualization tool that I should be building instead?
  4. Should I just build a GUI in Python or a standalone JS app instead of integrating with TensorBoard?
@teamdandelion
Contributor

Hi @chrisranderson,

This looks really cool! I'd love to support such an ambitious and interesting project if it's technically feasible.

Right now, every TensorBoard plugin gets its data via the summary system, i.e., from event files that the summary.FileWriter writes to disk. It's purely one-way communication, and it has high latency, because TensorBoard ingests data from the event logs at most every 5 seconds. Also, everything written there is by default persisted forever on disk, so using it for high-throughput communication would quickly saturate the disk. So the summary system, as presently written, is inappropriate for any real-time streaming application.

@jart is working on revamping the summary system to use sqlite and to support data streaming, so I'll let her chime in on whether she thinks the new summary system would be a good fit for your application.

@caisq has worked on establishing direct 2-way grpc communication between TensorFlow and TensorBoard. It would be ideal if we could leverage his work, but we've had difficulty open-sourcing it due to issues with some dependencies.

I think it would also be feasible for you to develop your own system, specific to this plugin, for getting data from TensorFlow to TensorBoard and setting up two-way communication. Something like the following:

Let's suppose your plugin is called the RealTimeParameterVisualizer (maybe we'd come up with something catchier later 😛). Then you create a class tensorboard.plugins.real_time_parameter.ParameterWriter. The ParameterWriter is instantiated with a pointer to the logdir. And the user modifies their training code so that every step, it offers the model weights and the gradients to the ParameterWriter.

On instantiation, the ParameterWriter makes a directory within the logdir like logdir/plugins/real_time_parameter which contains a file called mode, which will be used by TensorBoard to communicate to the ParameterWriter.

On the TensorBoard backend side, you create tensorboard.plugins.real_time_parameter.RealTimeParameterPlugin. The RealTimeParameterPlugin is given the logdir by TensorBoard framework, and it takes responsibility for writing the mode. So based on user interaction it can change the mode from "off" to "parameter values", "gradient values", "gradient variance", etc.

The ParameterWriter polls to see when the mode changes from "off". When the mode is not "off", it begins dumping compressed parameter data to the filesystem for the frontend to visualize. We can think of an appropriate way to bound the amount of disk space used, e.g. by having the ParameterWriter write to a new file every minute and making it responsible for deleting (or downsampling) data older than 10 minutes.
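A minimal sketch of this mode-file protocol in plain Python. The class name ParameterWriter, the plugins/real_time_parameter directory, and the mode file all follow the hypothetical naming above, and the age-based cleanup is just one possible way to bound disk usage:

```python
import os
import time


class ParameterWriter:
    """Writes parameter frames under logdir/plugins/real_time_parameter,
    polling a 'mode' file that the TensorBoard backend controls."""

    def __init__(self, logdir, max_age_seconds=600):
        self.plugin_dir = os.path.join(logdir, "plugins", "real_time_parameter")
        os.makedirs(self.plugin_dir, exist_ok=True)
        self.mode_path = os.path.join(self.plugin_dir, "mode")
        self.max_age_seconds = max_age_seconds
        # Default to "off" until the backend flips the mode.
        if not os.path.exists(self.mode_path):
            with open(self.mode_path, "w") as f:
                f.write("off")

    def current_mode(self):
        with open(self.mode_path) as f:
            return f.read().strip()

    def offer(self, frame_bytes):
        """Called every training step; writes only when the mode is not off."""
        if self.current_mode() == "off":
            return
        path = os.path.join(self.plugin_dir, "frame-%d.bin" % time.time_ns())
        with open(path, "wb") as f:
            f.write(frame_bytes)
        self._delete_stale_frames()

    def _delete_stale_frames(self):
        """Bound disk usage by deleting frames older than max_age_seconds."""
        cutoff = time.time() - self.max_age_seconds
        for name in os.listdir(self.plugin_dir):
            if name.startswith("frame-"):
                path = os.path.join(self.plugin_dir, name)
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)
```

The backend-side RealTimeParameterPlugin would then just rewrite the mode file ("off", "parameter values", ...) in response to user interaction, and the writer picks the change up on its next step.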

In the example you gave of the million parameter model, if we want to show 16-value greyscale for each parameter at 30 FPS, that would be (10^6 values * 0.5 bytes/value * 30 per second) = 15MB/s which seems reasonable for writing/reading to disk, and processing without too much latency. Or, if we were willing to have 1 update per second, then it would be just 500KB/sec.
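Spelling out the arithmetic above:

```python
# Back-of-envelope bandwidth for streaming a 1M-parameter model at
# 16 grey levels per parameter (4 bits = 0.5 bytes per value).
params = 10**6
bytes_per_value = 0.5
frame_bytes = params * bytes_per_value  # 500 KB per frame

at_30_fps = frame_bytes * 30  # bytes/second at 30 FPS
at_1_fps = frame_bytes * 1    # bytes/second at 1 update per second

print(at_30_fps / 1e6)  # 15.0 (MB/s)
print(at_1_fps / 1e3)   # 500.0 (KB/s)
```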

@jart / @wchargin Please share your thoughts too.

Now to answer your questions:

  1. Has this been done before in a way that is meant for lots of people to use? I couldn't find anything from a bit of Googling around, just people doing one-off things for their projects.

I'm not aware of any widely accessible version of this. I think it would be novel work, and quite valuable to the community.

  2. Can this be visualized as the model is running, as fast as every iteration without drastic slowdowns? For instance, using a 1 million parameter model where the parameters are reshaped into a 1000x1000 square.

See discussion above. I think we could accomplish something that feels fast to the user, and is near-real-time.

  3. Do you have any suggestions of features that you would find useful that I could add? Is there some other visualization tool that I should be building instead?

This seems like an interesting project to me. It could be made technically simpler by taking away the realtime component, and settling for getting data ~once per minute. But, you could focus more on building UI interaction and visualizations to really dive into the data and find ways to interpret the weights in context. E.g. serializing the weights and activations for k training examples, and looking at how the patterns of activations are different for different examples.

  4. Should I just build a GUI in Python or a standalone JS app instead of integrating with TensorBoard?

You could build a GUI on your own, and it will be easier for you to develop since you'll have control over everything, and won't be limited by TensorBoard's assumptions. However, convincing people to discover and use a new tool is always an uphill battle. I guess that if this is integrated into mainline TensorBoard, the usage will be several orders of magnitude higher than if you make a purely standalone tool.

Also tagging @colah and @shancarter as they may have thoughts to offer.

@jart
Contributor

jart commented Jun 22, 2017

One of the things on my bucket list has been to develop some type of visualization, where we encode data in real time using ffmpeg and stream it to the browser in a video tag. So I would be interested in supporting something like this.

@chrisranderson
Contributor Author

chrisranderson commented Jun 22, 2017

Wow, awesome. I was a bit worried that the reply would be like "this should be on the google group instead" or some other dismissal. :) So, when you say you'd like to support the project, what does that mean? I hack on it for a few days, and when I get stuck I can ask you for help?

If I can have a hand here and there, I'd like to try doing this in TensorBoard. I have a timeline I need to stick to - I start on the project June 26th, and finish by August 14th, so I'll start on Monday. Is this a project that could get merged into the repo?

Also, for first steps, I think I'll figure out how grpc works. Closest thing I've used is ZMQ (maybe not close at all? I'm pretty ignorant here). I guess I'll figure out how to send images from a Python script to... Node or something? I've done a decent amount of JS, but I'm really cloudy on how I'll talk to TensorBoard.

Thanks for your responses!

@jart
Contributor

jart commented Jun 23, 2017

Based on our experience, gRPC isn't quite ready yet. I also have a lot of respect for ZeroMQ, but I'm not sure if we need it. We can probably just stream protobufs over a socket using writeDelimitedTo() and a sentinel message on close (to avoid weird TCP edge cases.)
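The "protobufs over a socket with a sentinel" idea can be sketched without gRPC or ZeroMQ. Note that writeDelimitedTo() is the Java protobuf API (it prefixes each message with a varint length); the sketch below uses a fixed 4-byte length prefix for simplicity and raw bytes standing in for serialized protobuf messages:

```python
import struct
from io import BytesIO

SENTINEL = b""  # zero-length message marks a clean end-of-stream


def write_delimited(stream, payload):
    """Write one message with a 4-byte big-endian length prefix."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_delimited(stream):
    """Yield messages until the zero-length sentinel (or truncation)."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # truncated stream: no sentinel seen, a TCP edge case
        (length,) = struct.unpack(">I", header)
        if length == 0:
            return  # clean shutdown signalled by the sentinel
        yield stream.read(length)


# Round-trip through an in-memory buffer standing in for a socket.
buf = BytesIO()
for msg in (b"weights-step-1", b"weights-step-2"):
    write_delimited(buf, msg)
write_delimited(buf, SENTINEL)
buf.seek(0)
messages = list(read_delimited(buf))
```

The sentinel lets the reader distinguish a deliberate close from a connection dropped mid-message, which is the TCP edge case mentioned above.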

"Support" means we can put the time aside to participate in the development process with you, by offering code reviews, answering questions, and making any framework changes you might need. This works best when there's a tight feedback cycle. For example, we like to see lots of small pull requests, rather than a one big code dump.

I would recommend checking out web_library_example. It's an example of how to do TensorBoard development in a separate repository, without forking the codebase. You basically need a BUILD and WORKSPACE file to get started.

@caisq
Contributor

caisq commented Jun 23, 2017

@chrisranderson I think the project you described is very interesting and can be very useful for a lot of people. It will benefit model interpretation, understanding, and debugging, which is getting more and more important as new types of DL models get invented every week. TensorFlow has TensorBoard and TFDBG, both of which have limitations. For example, TFDBG allows you to see all the intermediate tensor values during runtime, but all it currently has is a text-based interface in the shell, which is not ideal for visualizing the graph structures in TensorFlow models. TensorBoard has great graph visualization, but its connection with the TensorFlow runtime is not real-time. A visual debugger for TensorFlow in TensorBoard would be a great feature. Just imagine what you could see and do if you could "step" through nodes of a graph and visualize its output tensor as a table, a curve, an image, or a video. You could also modify the tensor value before continuing further on the graph...

TFDBG already has a protocol for real-time streaming of data from the TF runtime. But as @dandelionmane and @jart pointed out, due to some yet-unfulfilled feature requests in the gRPC library, these are not fully functional in open-source TensorFlow yet. I can check with the gRPC team on their timeline for fulfilling the feature request, which mainly has to do with implementing a py_grpc_library Bazel genrule. Even if their timeline is too far in the future, we can find a way to bypass the missing feature and do it the same way tensorflow/core/distributed_runtime does, i.e., implement the server in C++. The part we would have to work out ourselves is SWIG-wrapping it so that it can be used in Python, as a TensorBoard plugin. The C++ libraries of the aforementioned protocol are not fully open-source yet, but I can easily make them open-source soon.

I'll think twice before implementing the protocol again in another framework, as it may cause unnecessary duplicate work and confusion to clients.

@caisq
Contributor

caisq commented Jun 23, 2017

cc @chihuahua

@teamdandelion
Contributor

@chrisranderson As Justine (@jart) said, we're happy to support you by doing code reviews, answering questions, and making upstream changes if you need them. I think per Justine's suggestion, you should make a new repository for the plugin and use bazel rules to depend on it - forking web_library_example is a good starting point. We can also set up a video call so you can ask us questions, if you want.

The goal for the project will be to get your plugin to a point where we are comfortable absorbing it from you into tensorboard/plugins as an officially supported plugin. Hopefully we'll reach that by August 14 😄

As you can see from the back-and-forth on this thread, there are a lot of different opinions on how to do the communication between TensorFlow and TensorBoard. Personally, I would advocate for something that is simple (not too many new dependencies) and likely to work in different platforms and environments, like writing/reading to disk.

Eventually (once gRPC is ready) we will probably want to consolidate everything to use the same implementation as TFDBG. So I think my 2c would be either:

  1. write something simple and expedient (e.g. writing/reading from disk), with a reasonable interface, so we can later replace it with gRPC when that is ready
  2. work with @caisq to get something like what the debugger or distributed runtime does now (I am just scared that SWIG-wrapping etc will be a rabbit hole that distracts from actually getting the plugin to work)

@caisq
Contributor

caisq commented Jun 23, 2017

+1 what @dandelionmane said. I think it's a good idea to build a simple communication channel between TF runtime and TensorBoard that can be easily replaced with grpc once its py_grpc_library genrule is ready.

I will be happy to provide the kind of support that @dandelionmane mentioned as well. I can also keep you abreast of any potentially relevant changes in TFDBG.

@caisq
Contributor

caisq commented Jun 23, 2017

@chrisranderson forgot to mention in the previous post: the file write-read option @dandelionmane mentioned is a good candidate for the kind of simple communication channel described above. TFDBG can write tensorflow.Event protobuf files to disk using its file:// debug URLs. This unit test is a good place to start reading about it:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/debug/lib/session_debug_testlib.py

TFDBG also has modules for reading such files and their directory structures. See also the test above, in addition to the API doc at:
https://www.tensorflow.org/api_docs/python/tfdbg/DebugDumpDir
https://www.tensorflow.org/api_docs/python/tfdbg/DebugTensorDatum

@chrisranderson
Contributor Author

chrisranderson commented Jun 23, 2017

Okay, based on what I've read, my overall plan (which is very, very hazy) is:

  • fork web_library_example which will be the main repo for this project, and basically read through everything there and try to understand what's going on.
  • look through other TensorBoard plugins and TensorBoard itself to figure out how to write something that reads and writes files. On the client side, I have basically no idea what I'm doing; I've got some JS and web development background, but that's about it.
  • read up on how to write my own writer to save tensors to disk. That might consist of looking through TensorBoard code to see how others do it, and just mimicking patterns I find there. I'm not super worried about accomplishing this.
  • Eventually hit the point where I can write an image to disk "server side" and display it in a browser.
  • Produce a list of concrete questions once I have better ones than "I have no idea what I'm doing please help" :)
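The "write an image to disk server side and display it in a browser" milestone could start as small as a stdlib HTTP handler that serves the newest dumped frame for a polling <img> tag. This is only a sketch; FRAME_DIR and the file naming are made up for illustration, and a real plugin would register a route with TensorBoard's backend instead:

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

FRAME_DIR = "/tmp/plugin_frames"  # hypothetical frame dump directory


class LatestFrameHandler(BaseHTTPRequestHandler):
    """Serves the lexicographically newest file in FRAME_DIR as a PNG."""

    def do_GET(self):
        frames = sorted(os.listdir(FRAME_DIR))
        if not frames:
            self.send_error(404, "no frames yet")
            return
        with open(os.path.join(FRAME_DIR, frames[-1]), "rb") as f:
            body = f.read()
        self.send_response(200)
        self.send_header("Content-Type", "image/png")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To run standalone (blocks forever):
# HTTPServer(("localhost", 8000), LatestFrameHandler).serve_forever()
```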

Would you like to continue communication here, or should I start making issues on my forked repo? I've never really done anything this serious on GitHub before. Is it preferred to use issues for task management, opening an issue for everything I'm working on, with every commit corresponding to an issue?

Thanks again! I'm excited and nervous to get going. I'll start Monday.

@teamdandelion
Contributor

That sounds like a reasonable plan. You'll want to poke around the Bazel docs to understand the web_library_example.

For communication, if you link your forked repo, I'll watch it and respond to issues that you post there. I think that may be cleaner than using this thread for everything. If you have trouble getting a hold of us, poke us here.

@jart
Contributor

jart commented Jun 26, 2017

I can follow it too. If you post an issue in your new repository every time you have a question, then it can sort of become like a Stack Overflow for how to extend TensorBoard with Bazel. But in all fairness, the questions might get better search rankings if they are posted either here or on Stack Overflow under the TensorBoard tag. What do you think @dandelionmane? I'm leaning towards the latter.

@chrisranderson
Contributor Author

Here is the repo, and here is my first set of questions: chrisranderson/beholder#1.

I wouldn't mind writing up some type of guide or blog post after this is all done about writing a plugin for TensorBoard - maybe you all could take it, edit it to death, and post it somewhere?

@teamdandelion
Contributor

That would be so great - you are gonna be the first external contributor to write a TB plugin, and a write-up on how it's done would make it a lot easier for other people to follow in your footsteps.

@chrisranderson
Contributor Author

I presented on my project today to the CS department, and passed! :) I guess I can close this issue now.

If anyone is interested in the future of this project, you can find a discussion here: chrisranderson/beholder#33

Thank you for your help!

@wchargin
Contributor

Whoa—congratulations!! 🎉 🎉 🎉

@jart
Contributor

jart commented Aug 16, 2017

Congrats!

@chihuahua
Member

Well deserved!

@caisq
Contributor

caisq commented Aug 16, 2017

Congrats!

@luchensk

@caisq Based on your comment about gRPC, just to make sure: is gRPC now ready for setting up two-way communication between the TF debugger and the TB debugger?
I noticed that some debugger code already exists in both the TF and TB repos.
Thanks.

@ljh9961

ljh9961 commented Apr 25, 2018

GOOD!
