Feature Request: Support for 1000s of logs #1013

Open
ahundt opened this issue Mar 2, 2018 · 5 comments
Labels
theme:performance (Performance, scalability, large data sizes, slowness, etc.), type:feature

Comments

@ahundt

ahundt commented Mar 2, 2018

I'm really enjoying TensorBoard; it makes my life a lot easier and gives me a good idea of what's happening as training progresses.

Unfortunately, I've found one area of weakness that shows up when many runs accumulate: TensorBoard starts to become very sluggish. Here is a video from a 22-core Xeon workstation with 48 GB of RAM and about 50% of its resources free, trying to display ~250 runs in TensorBoard, each with a single time step (epoch):

[video: TensorBoard stalling out on 250 runs]

What would help:

  • A fix for the performance issues when there are 1000s of logs
  • A "view summary" button on the chart with a non-hover version of the hover data, so we can see all of it
    • A way to view, sort, copy, and save the whole list of results that shows up when you hover
  • A way to write and view dictionaries of data mapping from strings to a mix of integers, strings, and floats, associated with each run (a sketch of one possible workaround follows this list)
    • In my case it would be hyperparameters; others would likely have their own uses.
    • Perhaps this last item already exists and I missed it?
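
A minimal sketch of one possible workaround for that last item, assuming the TF 2.x tf.summary API (the directory, tag, keys, and values below are illustrative, not an established convention): serialize the per-run dictionary as JSON and log it as a text summary, so it appears in the Text dashboard next to the run.

```python
import json
import tensorflow as tf

# Hypothetical per-run dictionary mapping strings to ints, floats, and strings.
hparams = {"optimizer": "adam", "learning_rate": 0.001, "batch_size": 64}

# One writer per run directory; the JSON blob shows up in the Text dashboard.
writer = tf.summary.create_file_writer("logs/trial_0000")
with writer.as_default():
    tf.summary.text("hparams", json.dumps(hparams, indent=2), step=0)
writer.close()
```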

Thanks for your consideration!

@nfelt
Collaborator

nfelt commented Mar 2, 2018

Thanks for the feedback!

Re: support for 1000s of runs - this is an area where we could improve, though it's a little challenging given how things are currently structured, and it is a bit of an outlier among current usage. Are there particular actions or interactions you noticed being sluggish? E.g. reload vs hovering over the chart vs changing which runs are selected, etc.?

Also, when you say that you have a single time step per run, does that mean you're really trying to create something more like a scatter plot? We could open a feature request for that, and having a scatter plot might be a quicker route to good performance than reworking the scalar dashboard to handle this case well.
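
As a point of reference, the workflow being described looks roughly like this, a minimal sketch assuming the TF 2.x tf.summary API (directory and tag names are illustrative): many run directories, each containing a single scalar step.

```python
import tensorflow as tf

# Stand-in results; in practice there would be hundreds of per-trial values.
final_losses = [0.93, 0.71, 0.42]

# One run directory per trial, each containing a single time step (epoch).
for trial_id, loss in enumerate(final_losses):
    writer = tf.summary.create_file_writer(f"logs/trial_{trial_id:04d}")
    with writer.as_default():
        tf.summary.scalar("val_loss", loss, step=0)
    writer.close()
```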

Re: the hover chart, I agree that it would be useful to have a way to "pin" it so that you can copy-paste the data (and view it if it's too large to fit on screen). Would you mind opening a separate FR for that? I think it's useful functionality even with just a few runs so I'd like to track that separately.

Re: hyperparameters, that's definitely a feature request we've gotten before (#46) and we'll hopefully have better support for that soon.

@ahundt
Author

ahundt commented Mar 5, 2018

  1. Reload is definitely sluggish. Sometimes it can take the entire 30s reload period.
  2. Sometimes hyperparameter settings cause out-of-memory errors, which I catch and handle properly, but then TensorBoard seems to struggle with the missing data.
  3. TensorBoard appears to expand to consume all the memory on my machine in this kind of situation. I frequently have to kill and restart it so I don't run out.
  4. Additionally, the hover list only shows ~10-20 runs, so I end up zooming in on only the very best runs and taking a screenshot to figure out where to find them on disk.

Re: support for 1000s of runs - this is an area where we could improve, though it's a little challenging given how things are currently structured, and it is a bit of an outlier among current usage.

I agree it is an outlier at the moment, but even without hyperopt the number of runs often adds up quickly. To my knowledge, hyperparameter optimization is also considered best practice, so I may end up being an early adopter of a growing population. I also know of others at my university who moved away from TensorBoard due to the same limitations.

Also, when you say that you have a single time step per run, does that mean you're really trying to create something more like a scatter plot? We could open a feature request for that, and having a scatter plot might be a quicker route to good performance than reworking the scalar dashboard to handle this case well.

There is no major problem with the plot as it is now; I found the relative and wall-time plots worked well enough.

@nfelt
Collaborator

nfelt commented Mar 7, 2018

Okay, thanks for the details and for filing that separate feature request! Re: item 2, what do you mean exactly by "struggle"? If you mean it doesn't render correctly, a screenshot would be great. Re: item 3, we have gotten some reports of memory leaks or high memory consumption, e.g. #766, though so far I believe we haven't had enough time to thoroughly investigate how that's happening.

As you saw, we are working on providing better support for hyperparameter optimization experiments, and I agree that part of that should be supporting on the order of 1000 runs with much less of a performance hit than we have today.

@ahundt
Author

ahundt commented Mar 11, 2018

Re: item 2, what do you mean exactly by "struggle"?

It prints lots of stack traces about uncaught exceptions and missing data. Sorry, I don't have one on hand at the moment; I'll try to add one here when I come across it.

@bileschi added the theme:performance label Jan 2, 2020
@phemmer

phemmer commented Jan 25, 2020

Re: support for 1000s of runs - this is an area where we could improve, though it's a little challenging given how things are currently structured, and it is a bit of an outlier among current usage.

I don't know that this is much of an outlier anymore, especially now that the hparams plugin is natively integrated.

I myself am running into difficulties with large numbers of runs (a few hundred in my case) while performing hyperparameter searches. In my experience, the hparams plugin seems to just freak out and stop showing anything after 150 runs or so, and the scalars screen seems to break after around 200 runs. While TensorBoard does suck down a ton of memory (~25 GB), I've got plenty to spare (my workstation has 64 GB); the main issue seems to be that it starts pegging the CPU at 100% (just one core). Smells like it can't keep up with the incoming logs.
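
For reference, the per-trial logging pattern the hparams plugin expects looks roughly like this, a minimal sketch assuming the TF 2.x tf.summary API (the hyperparameter names, ranges, directories, and metric values are illustrative, not from this thread):

```python
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

HP_LR = hp.HParam("learning_rate", hp.RealInterval(1e-4, 1e-1))
HP_BATCH = hp.HParam("batch_size", hp.Discrete([32, 64, 128]))

# Declare the search space and metrics once, at the top-level log directory.
with tf.summary.create_file_writer("logs/search").as_default():
    hp.hparams_config(
        hparams=[HP_LR, HP_BATCH],
        metrics=[hp.Metric("accuracy", display_name="Accuracy")],
    )

# One run directory per trial; each records its hyperparameters and metric.
trials = [({HP_LR: 0.01, HP_BATCH: 64}, 0.83),
          ({HP_LR: 0.001, HP_BATCH: 32}, 0.79)]
for i, (hparams, accuracy) in enumerate(trials):
    with tf.summary.create_file_writer(f"logs/search/trial_{i:04d}").as_default():
        hp.hparams(hparams)  # records this trial's hyperparameter values
        tf.summary.scalar("accuracy", accuracy, step=1)
```

With a few hundred trials this produces a few hundred run directories, which is the scale at which the slowdowns described in this thread start to appear.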
