
Dealing with many large event logs #1002

Open
oweissbarth opened this issue Feb 25, 2018 · 14 comments
Labels
core:backend, theme:performance, type:feature

Comments

@oweissbarth

I am training a CNN and I use TensorBoard to visualize the training process and results.
Because I create lots of image summaries while training, the event log files often reach about 7GB each. When I point TensorBoard at my runs directory, it seems to load all runs into memory even though none are activated in the UI. The log files in the runs directory total about 100GB, so loading everything into main memory (32GB on my system) doesn't work. Is there a way to load log files only once their runs are activated (on demand)? Am I missing something?
Thank you in advance.

@nfelt
Collaborator

nfelt commented Feb 26, 2018

TensorBoard currently loads all runs into memory even if they aren't initially being displayed, so that when you select or deselect runs it can start showing that data in the UI immediately rather than having to crawl through the files at that point.

I'm a bit surprised, though, that you are running into memory issues, because TensorBoard should keep only a fixed-size sample of the loaded data in memory, not the full contents of the original log directory. In the case of images, it should keep only 10 images per tag and per run:
https://github.com/tensorflow/tensorboard/blob/1.6.0/tensorboard/backend/application.py#L59 However, if you have a large number of unique tag+run combinations, each with only a few images, that makes the sampling much less effective.
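
For illustration only, here's a minimal Python sketch of the kind of reservoir sampling involved (not TensorBoard's actual implementation; the function name and the k=10 default are just for this example):

```python
import random

def reservoir_sample(stream, k=10, seed=0):
    """Keep a uniform random sample of at most k items from a stream."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if len(sample) < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # replace with decreasing probability
            if j < k:
                sample[j] = item
    return sample

# Even for a run with millions of image events, only k items per tag are
# retained, so memory stays bounded regardless of event file size on disk.
```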

The other thing is that with 100GB of logs in the directory, TensorBoard will just take a very long time to load them (even if they fit in memory), since there's only a single thread and that's just a lot of data to process. I agree it'd be useful if TensorBoard were smarter about prioritizing which runs to load, but for now a workaround is to create a new log directory containing a symlink for each run you want to show. E.g. if you have a log directory with runs like logdir/run1, logdir/run2, ..., logdir/run100 and you just want to show runs 1 and 100, you could create a new directory logdir-links containing symlinks to logdir/run1 and logdir/run100 (sketched below).
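
A minimal Python sketch of that symlink workaround (the paths and run names are hypothetical; plain ln -s from a shell works just as well):

```python
import os

logdir = "/path/to/logdir"         # hypothetical: the full log directory
linkdir = "/path/to/logdir-links"  # hypothetical: filtered view for TensorBoard
runs_to_show = ["run1", "run100"]

os.makedirs(linkdir, exist_ok=True)
for run in runs_to_show:
    target = os.path.join(logdir, run)
    link = os.path.join(linkdir, run)
    if not os.path.islink(link):
        os.symlink(target, link)

# Then point TensorBoard at the smaller directory:
#   tensorboard --logdir /path/to/logdir-links
```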

Also, we're working on adding support for a SQLite DB backend, which should open up possibilities for loading run data only when needed and avoiding so much memory consumption, but it will still be a little while before that's ready for general use.

@gweidner

Thank you @nfelt for the information. For the experimental SQLite DB backend that you mentioned, is there an example showing how to create and load the DB given large event files?
I'm not sure whether the loader tool should be used or whether a different mechanism is available.
I noticed the histogram and scalar plugins have recently been updated to support DB mode, and I would like to try that out.

@nfelt
Collaborator

nfelt commented Feb 28, 2018

@gweidner We don't really have an example yet, but it should be possible to populate a SQLite DB either by using the loader.cc tool (built from TF source) or by writing TF code that uses tf.contrib.summary.create_db_writer(). Fair warning: it is still quite experimental, with the DB schema subject to change, and as far as I know we haven't yet done much testing with large event files, but if you're interested please do try it out and let us know how it goes.
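
For example, a rough sketch of the second option, assuming a TF 1.x version where tf.enable_eager_execution() is available (the database path, experiment name, and run name below are placeholders):

```python
import tensorflow as tf  # TF 1.x, where tf.contrib.summary is available

tf.enable_eager_execution()

writer = tf.contrib.summary.create_db_writer(
    "/tmp/tensorboard.sqlite",        # placeholder path for the SQLite DB
    experiment_name="my_experiment",  # placeholder experiment name
    run_name="run1")                  # placeholder run name

with writer.as_default(), tf.contrib.summary.always_record_summaries():
    for step in range(100):
        # Write scalar summaries directly into the DB instead of event files.
        tf.contrib.summary.scalar("loss", 1.0 / (step + 1), step=step)
```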

@bhack

bhack commented Aug 28, 2018

Any news?

@amj

amj commented Sep 15, 2018

+1. Is this still experimental?

@zishanahmed08

Any updates on the ETA?

@dimart

dimart commented Oct 9, 2019

@jart From the PRs I get the feeling that you're working on adding support for a SQLite DB backend.
Is there a way to help? Does the TensorBoard team share its progress / current tasks / etc. somewhere publicly? 🤔

@nav13n

nav13n commented Oct 16, 2019

It seems like a very useful feature set. Is there any way we can help expedite this with contributions?

@nfelt added the theme:performance and type:feature labels and removed the type:support label on Dec 17, 2019
@nfelt
Collaborator

nfelt commented Dec 17, 2019

Hi folks, thanks for your continued interest and sorry there hasn't been much news. The SQLite DB backend work ran into difficulties and has been on hold for a while, but we're still very much aware that working with large logdirs is a pain point and we're working on more flexible modes of run selection and data management to address this.

@Strateus

Any updates?

@zishanahmed08

I found a workaround in my case which might be useful for others: switch from Chrome/Chromium to Mozilla Firefox.

@jayleicn

jayleicn commented Dec 21, 2020

> I found a workaround in my case which might be useful for others: switch from Chrome/Chromium to Mozilla Firefox.

To add my data point: on a 2018 13" MacBook Pro, loading TensorBoard from a remote server with many large event logs (a few hundred MB in total), Chrome would hang for a while every time I clicked on the TensorBoard page. Then I followed this suggestion and switched to Safari -- woo! It is just amazing and so responsive!

@MaLiN2223

@nfelt is there any progress on this issue?

@LarsDu

LarsDu commented May 8, 2023

Any updates on this issue? SQLite backend, anybody?
MLflow integration, perhaps???
