
SummaryWriter deletes data automatically when there are too many #46907

Open
ChenDRAG opened this issue Oct 27, 2020 · 5 comments
Labels
module: tensorboard · oncall: visualization (Related to visualization in PyTorch, e.g., tensorboard)


@ChenDRAG

❓ Questions and Help

I use SummaryWriter from torch.utils.tensorboard to record data every epoch (1k global timesteps). At first SummaryWriter worked fine, but I found that after around 300 epochs (event file about 1 MB) the summary starts to lose data. (For example, when you look at the TensorBoard visualizer at around 30 epochs, all data are kept and the steps go 1, 2, 3, 4, 5, 6, 7, ...; when you look at TensorBoard again at around 300 epochs, the data becomes 1, 5, 9, 12, 13, 15, ...)

Is this normal? Any idea what might have caused it?

I use Python 3.8.3, tensorboard 2.3.0, torch 1.4.0.
I use from torch.utils.tensorboard import SummaryWriter in my code,
and run tensorboard --logdir=log in the server's bash.
I use an ssh tunnel to see the results on my PC.
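
A minimal sketch of this kind of per-epoch scalar logging (the tag and values are hypothetical placeholders, not the actual training code; the 'log' directory matches the --logdir above):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('log')
for epoch in range(300):
    # hypothetical metric standing in for whatever is recorded each epoch
    loss = 1.0 / (epoch + 1)
    writer.add_scalar('train/loss', loss, epoch)  # one point per epoch
writer.close()  # flush remaining events to the event file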

ChenDRAG changed the title from "SummaryWriter delete data automatically when there are too many" to "SummaryWriter deletes data automatically when there are too many" on Oct 27, 2020
mrshenli added the module: tensorboard, triaged, and oncall: visualization labels and removed the triaged label on Oct 27, 2020
@mrshenli
Contributor

cc @orionr

@orionr
Contributor

orionr commented Oct 27, 2020

I'm surprised by that, but there might be some bad buffering going on. cc @cryptopic, @nataliakliushkina, @edward-io to help investigate.

@annisat

annisat commented Apr 5, 2022

I encountered the same (or a similar?) problem with only 20 epochs, but with per-epoch image records. I simplified my original code into a minimal example that reproduces the error. Please see whether they are related, or whether I'm making a mistake somewhere.

from torch.utils.tensorboard import SummaryWriter
import torch

writer = SummaryWriter('test')
for epoch in range(20):
    # one random single-channel 64x64 image (CHW) per epoch, all under the same tag
    data = torch.rand((1, 64, 64))
    writer.add_image("test", data, epoch)

After this, open tensorboard --logdir test. It ends up showing only epoch = 1, 2, 3, 4, 8, 9, 10, 11, 13, 19.
The records only go missing once higher epochs are reached. For example, the image for epoch 5 is still there when training is around epoch 6, 7, or 8, but data for some earlier epochs starts to disappear when reaching, say, epoch 12 or 13.

Part of my pip freeze for your reference:

tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
torch==1.11.0+cu113

I didn't originally have the full tensorflow package installed. I did install it to check whether that was the source of the problem, but it didn't help.

@orionr
Contributor

orionr commented Apr 5, 2022

@annisat did you flush your logs? You can test this by adding writer.flush() or writer.close() to the end of your test code. You can also use it as a context manager I believe, which will flush and close automatically. @Reubend FYI to you and the team as well.
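
A sketch of both variants, based on the minimal example above (the second log directory name is just a placeholder):

from torch.utils.tensorboard import SummaryWriter
import torch

# Option 1: close (which also flushes) explicitly after logging
writer = SummaryWriter('test')
for epoch in range(20):
    writer.add_image("test", torch.rand((1, 64, 64)), epoch)
writer.close()  # flushes pending events and closes the event file

# Option 2: context manager, which closes (and therefore flushes) on exit
with SummaryWriter('test_ctx') as writer:
    for epoch in range(20):
        writer.add_image("test", torch.rand((1, 64, 64)), epoch)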

@annisat

annisat commented Apr 5, 2022

I think I solved this issue (at least mine). It is not that torch's SummaryWriter isn't writing the records; it's that tensorboard isn't displaying them.

I suspect it is a display issue based on three observations: 1) the images were there until a certain epoch; e.g., the images of epoch 13 disappeared after those of epoch 15 were recorded (not the exact numbers, just an example of what I observed); 2) the size of the log file (using writer.flush() with every record, all in one file) is proportional to the number of epochs; 3) when using the context manager, a log file for each epoch is present in the folder.

Then I found this flag in tensorboard:

--samples_per_plugin SAMPLES_PER_PLUGIN
        An optional comma separated list of plugin_name=num_samples pairs to explicitly specify how
        many samples to keep per tag for that plugin. For unspecified plugins, TensorBoard randomly
        downsamples logged summaries to reasonable values to prevent out-of-memory errors for long
        running jobs. This flag allows fine control over that downsampling. Note that 0 means keep all
        samples of that type. For instance "scalars=500,images=0" keeps 500 scalars and all images.
        Most users should not need to set this flag.

The documentation itself is not accurate: in my case, setting images=0 gave me literally 0 images. I used tensorboard --logdir test --samples_per_plugin "images=50" and now it shows all my images.
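
The same flag should also cover the scalar downsampling from the original report; for example (the log directory and sample count here are placeholders, not from the reporter), tensorboard --logdir=log --samples_per_plugin "scalars=10000" keeps up to 10000 scalar points per tag instead of letting TensorBoard downsample them.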
