Feature Request: Support for 1000s of logs #1013

Open
ahundt opened this issue Mar 2, 2018 · 5 comments
Labels
theme:performance (Performance, scalability, large data sizes, slowness, etc.), type:feature

Comments

@ahundt

ahundt commented Mar 2, 2018

I'm really enjoying TensorBoard; it makes my life a lot easier and gives me a good idea of what's happening as training progresses.

Unfortunately, I've found one area of weakness that shows up when many runs accumulate: TensorBoard starts to become very sluggish. Here is a video from a 22-core Xeon workstation with 48 GB of RAM and about 50% of its resources free, trying to display ~250 runs in TensorBoard, each with a single time step (epoch):

[video: TensorBoard stalling out on 250 runs]

What would help:

  • A fix for the performance issues when there are 1000s of logs
  • A "view summary" button on the chart with a non-hover version of the hover data, so we can see all of it
    • A way to view, sort, copy, and save the whole list of results that shows up when you hover
  • A way to write and view dictionaries of data mapping from strings to a mix of integers, strings, and floats, associated with each run (a sketch of one possible workaround follows this list)
    • In my case it would be hyperparameters; others would likely have their own uses.
    • Perhaps this last item already exists and I missed it?
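
A minimal sketch of one possible workaround for that last item, assuming the TF 2.x tf.summary API (the directory, tag, keys, and values below are illustrative, not an established convention): serialize the per-run dictionary as JSON and log it as a text summary, so it appears in the Text dashboard next to the run.

```python
import json
import tensorflow as tf

# Hypothetical per-run dictionary mapping strings to ints, floats, and strings.
hparams = {"optimizer": "adam", "learning_rate": 0.001, "batch_size": 64}

# One writer per run directory; the JSON blob shows up in the Text dashboard.
writer = tf.summary.create_file_writer("logs/trial_0000")
with writer.as_default():
    tf.summary.text("hparams", json.dumps(hparams, indent=2), step=0)
writer.close()
```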

Thanks for your consideration!

@nfelt
Collaborator

nfelt commented Mar 2, 2018

Thanks for the feedback!

Re: support for 1000s of runs - this is an area where we could improve, though it's a little challenging given how things are currently structured, and it is a bit of an outlier among current usage. Are there particular actions or interactions you noticed being sluggish? E.g. reload vs hovering over the chart vs changing which runs are selected, etc.?

Also, when you say that you have a single time step per run, does that mean you're really trying to create something more like a scatter plot? We could open a feature request for that, and having a scatter plot might be a quicker route to good performance than reworking the scalar dashboard to handle this case well.
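
As a point of reference, the workflow being described looks roughly like this, a minimal sketch assuming the TF 2.x tf.summary API (directory and tag names are illustrative): many run directories, each containing a single scalar step.

```python
import tensorflow as tf

# Stand-in results; in practice there would be hundreds of per-trial values.
final_losses = [0.93, 0.71, 0.42]

# One run directory per trial, each containing a single time step (epoch).
for trial_id, loss in enumerate(final_losses):
    writer = tf.summary.create_file_writer(f"logs/trial_{trial_id:04d}")
    with writer.as_default():
        tf.summary.scalar("val_loss", loss, step=0)
    writer.close()
```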

Re: the hover chart, I agree that it would be useful to have a way to "pin" it so that you can copy-paste the data (and view it if it's too large to fit on screen). Would you mind opening a separate FR for that? I think it's useful functionality even with just a few runs so I'd like to track that separately.

Re: hyperparameters, that's definitely a feature request we've gotten before (#46) and we'll hopefully have better support for that soon.

@ahundt
Author

ahundt commented Mar 5, 2018

  1. Reload is definitely sluggish. Sometimes it can take the entire 30s reload period.
  2. Sometimes hyperparameter settings cause out-of-memory errors, which I catch and handle properly, but then TensorBoard seems to struggle with the missing data.
  3. TensorBoard appears to expand to consume all the memory on my machine in this kind of situation. I frequently have to kill and restart it so I don't run out.
  4. Additionally, the hover list only shows ~10-20 runs, so I end up zooming in on only the very best runs and taking a screenshot to figure out where to find them on disk.

Re: support for 1000s of runs - this is an area where we could improve, though it's a little challenging given how things are currently structured, and it is a bit of an outlier among current usage.

I agree it is an outlier at the moment, but even without hyperopt the number of runs often adds up quickly. To my knowledge, hyperparameter optimization is also considered best practice, so I may end up being an early adopter of a growing population. I also know of others at my university who moved away from TensorBoard due to the same limitations.

Also, when you say that you have a single time step per run, does that mean you're really trying to create something more like a scatter plot? We could open a feature request for that, and having a scatter plot might be a quicker route to good performance than reworking the scalar dashboard to handle this case well.

There is no major problem with the plot as it is now; I found the relative and wall-time plots worked well enough.

@nfelt
Collaborator

nfelt commented Mar 7, 2018

Okay, thanks for the details and for filing that separate feature request! Re: item 2, what do you mean exactly by "struggle"? If you mean it doesn't render correctly, a screenshot would be great. Re: item 3, we have gotten some reports of memory leaks or high memory consumption, e.g. #766, though so far I believe we haven't had enough time to thoroughly investigate how that's happening.

As you saw, we are working on providing better support for hyperparameter optimization experiments, and I agree that part of that should be supporting on the order of 1000 runs with much less of a performance hit than we have today.

@ahundt
Author

ahundt commented Mar 11, 2018

Re: item 2, what do you mean exactly by "struggle"?

It prints lots of stack traces about uncaught exceptions and missing data. Sorry, I don't have one on hand at the moment; I'll try to add one here when I come across it.

@bileschi added the theme:performance label Jan 2, 2020
@phemmer

phemmer commented Jan 25, 2020

Re: support for 1000s of runs - this is an area where we could improve, though it's a little challenging given how things are currently structured, and it is a bit of an outlier among current usage.

I don't know that this is much of an outlier anymore, especially now that the hparams plugin is natively integrated.

I myself am running into difficulties with large numbers of runs (a few hundred in my case) while performing hyperparameter searches. In my experience, the hparams plugin seems to just freak out and stop showing anything after 150 runs or so, and the scalars screen seems to break after around 200 runs. While TensorBoard does suck down a ton of memory (~25 GB), I've got plenty to spare (my workstation has 64 GB); the main issue seems to be that it starts pegging the CPU at 100% (just one core). Smells like it can't keep up with the incoming logs.
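
For reference, the per-trial logging pattern the hparams plugin expects looks roughly like this, a minimal sketch assuming the TF 2.x tf.summary API (the hyperparameter names, ranges, directories, and metric values are illustrative, not from this thread):

```python
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

HP_LR = hp.HParam("learning_rate", hp.RealInterval(1e-4, 1e-1))
HP_BATCH = hp.HParam("batch_size", hp.Discrete([32, 64, 128]))

# Declare the search space and metrics once, at the top-level log directory.
with tf.summary.create_file_writer("logs/search").as_default():
    hp.hparams_config(
        hparams=[HP_LR, HP_BATCH],
        metrics=[hp.Metric("accuracy", display_name="Accuracy")],
    )

# One run directory per trial; each records its hyperparameters and metric.
trials = [({HP_LR: 0.01, HP_BATCH: 64}, 0.83),
          ({HP_LR: 0.001, HP_BATCH: 32}, 0.79)]
for i, (hparams, accuracy) in enumerate(trials):
    with tf.summary.create_file_writer(f"logs/search/trial_{i:04d}").as_default():
        hp.hparams(hparams)  # records this trial's hyperparameter values
        tf.summary.scalar("accuracy", accuracy, step=1)
```

With a few hundred trials this produces a few hundred run directories, which is the scale at which the slowdowns described in this thread start to appear.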
