
Dealing with many large event logs #1002

Open
oweissbarth opened this issue Feb 25, 2018 · 14 comments
Labels
core:backend, theme:performance, type:feature

Comments

@oweissbarth

I am training a CNN and I use TensorBoard to visualize the training process and results.
Because I create lots of image summaries while training, the event log files often reach about 7GB each. When I point TensorBoard at my runs directory, it seems to load all runs into memory even though none are activated in the UI. The log files in the runs directory total about 100GB, so loading everything into main memory (32GB on my system) doesn't work. Is there a way to load log files only once their runs are activated (on demand)? Am I missing something?
Thank you in advance.

@nfelt
Collaborator

nfelt commented Feb 26, 2018

TensorBoard currently loads all runs into memory even if they aren't initially being displayed, so that when you select or deselect runs it can start showing that data in the UI immediately rather than having to crawl through the files at that point.

I'm a bit surprised, though, that you are running into memory issues, because TensorBoard should keep only a fixed-size sample of the loaded data in memory, not the full contents of the original log directory. In the case of images, it should keep only 10 images per tag and per run:
https://github.com/tensorflow/tensorboard/blob/1.6.0/tensorboard/backend/application.py#L59 However, if you have a large number of unique tag+run combinations, each with only a few images, that makes the sampling much less effective.
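
For illustration only, here's a minimal Python sketch of the kind of reservoir sampling involved (not TensorBoard's actual implementation; the function name and the k=10 default are just for this example):

```python
import random

def reservoir_sample(stream, k=10, seed=0):
    """Keep a uniform random sample of at most k items from a stream."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if len(sample) < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # replace with decreasing probability
            if j < k:
                sample[j] = item
    return sample

# Even for a run with millions of image events, only k items per tag are
# retained, so memory stays bounded regardless of event file size on disk.
```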

The other thing is that with 100GB of logs in the directory, TensorBoard will just take a very long time to load them (even if they fit in memory), since there's only a single thread and that's just a lot of data to process. I agree it'd be useful if TensorBoard were smarter about prioritizing which runs to load, but for now a workaround is to create a new log directory containing a symlink for each run you want to show. E.g. if you have a log directory with runs like logdir/run1, logdir/run2, ..., logdir/run100 and you just want to show runs 1 and 100, you could create a new directory logdir-links containing symlinks to logdir/run1 and logdir/run100 (sketched below).
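
A minimal Python sketch of that symlink workaround (the paths and run names are hypothetical; plain ln -s from a shell works just as well):

```python
import os

logdir = "/path/to/logdir"         # hypothetical: the full log directory
linkdir = "/path/to/logdir-links"  # hypothetical: filtered view for TensorBoard
runs_to_show = ["run1", "run100"]

os.makedirs(linkdir, exist_ok=True)
for run in runs_to_show:
    target = os.path.join(logdir, run)
    link = os.path.join(linkdir, run)
    if not os.path.islink(link):
        os.symlink(target, link)

# Then point TensorBoard at the smaller directory:
#   tensorboard --logdir /path/to/logdir-links
```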

Also, we're working on adding support for a SQLite DB backend, which should open up possibilities for loading run data only when needed and avoiding so much memory consumption, but it will still be a little while before that's ready for general use.

@gweidner

Thank you @nfelt for the information. For the experimental SQLite DB backend that you mentioned, is there an example showing how to create and load the DB given large event files?
I'm not sure whether the loader tool should be used or whether a different mechanism is available.
I noticed the histogram and scalar plugins have recently been updated to support DB mode, and I would like to try that out.

@nfelt
Collaborator

nfelt commented Feb 28, 2018

@gweidner We don't really have an example yet, but it should be possible to populate a SQLite DB either by using the loader.cc tool (built from TF source) or by writing TF code that uses tf.contrib.summary.create_db_writer(). Fair warning: it is still quite experimental, with the DB schema subject to change, and as far as I know we haven't yet done much testing with large event files, but if you're interested please do try it out and let us know how it goes.
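
For example, a rough sketch of the second option, assuming a TF 1.x version where tf.enable_eager_execution() is available (the database path, experiment name, and run name below are placeholders):

```python
import tensorflow as tf  # TF 1.x, where tf.contrib.summary is available

tf.enable_eager_execution()

writer = tf.contrib.summary.create_db_writer(
    "/tmp/tensorboard.sqlite",        # placeholder path for the SQLite DB
    experiment_name="my_experiment",  # placeholder experiment name
    run_name="run1")                  # placeholder run name

with writer.as_default(), tf.contrib.summary.always_record_summaries():
    for step in range(100):
        # Write scalar summaries directly into the DB instead of event files.
        tf.contrib.summary.scalar("loss", 1.0 / (step + 1), step=step)
```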

@bhack

bhack commented Aug 28, 2018

Any news?

@amj

amj commented Sep 15, 2018

+1. Is this still experimental?

@zishanahmed08

Any updates on the ETA?

@dimart

dimart commented Oct 9, 2019

@jart From the PRs I get the feeling that you're working on adding support for a SQLite DB backend.
Is there a way to help? Does the TensorBoard team share its progress / current tasks / etc. somewhere publicly? 🤔

@nav13n

nav13n commented Oct 16, 2019

It seems like a very useful feature set. Is there any way we can help expedite this with contributions?

@nfelt added the theme:performance and type:feature labels and removed the type:support label on Dec 17, 2019
@nfelt
Collaborator

nfelt commented Dec 17, 2019

Hi folks, thanks for your continued interest and sorry there hasn't been much news. The SQLite DB backend work ran into difficulties and has been on hold for a while, but we're still very much aware that working with large logdirs is a pain point and we're working on more flexible modes of run selection and data management to address this.

@Strateus

Any updates?

@zishanahmed08

I found a workaround in my case which might be useful for others: switch from Chrome/Chromium to Mozilla Firefox.

@jayleicn

jayleicn commented Dec 21, 2020

> I found a workaround in my case which might be useful for others: switch from Chrome/Chromium to Mozilla Firefox.

To add my data point: on a 2018 13" MacBook Pro, loading TensorBoard from a remote server with many large event logs (a few hundred MB in total), Chrome would hang for a while every time I clicked on the TensorBoard page. Then I followed this suggestion and switched to Safari -- woo! It is just amazing and so responsive!

@MaLiN2223

@nfelt is there any progress on this issue?

@LarsDu

LarsDu commented May 8, 2023

Any updates on this issue? SQLite backend, anybody?
MLflow integration, perhaps???
