
SummaryWriter deletes data automatically when there are too many #46907

Open
ChenDRAG opened this issue Oct 27, 2020 · 5 comments
Labels
module: tensorboard · oncall: visualization (Related to visualization in PyTorch, e.g., tensorboard)


@ChenDRAG

❓ Questions and Help

I use SummaryWriter from torch.utils.tensorboard to record data every epoch (1k global timesteps). At first SummaryWriter worked fine, but I found that after around 300 epochs (event file about 1 MB) the summary starts to lose data. (For example, when you look at the TensorBoard visualizer at around 30 epochs, all data are kept and the steps go 1, 2, 3, 4, 5, 6, 7, ...; when you look at TensorBoard again at around 300 epochs, the data becomes 1, 5, 9, 12, 13, 15, ...)

Is this normal? Any idea what might have caused it?

I use Python 3.8.3, tensorboard 2.3.0, torch 1.4.0.
I use from torch.utils.tensorboard import SummaryWriter in my code,
and run tensorboard --logdir=log in the server's bash.
I use an ssh tunnel to see the results on my PC.
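
A minimal sketch of this kind of per-epoch scalar logging (the tag and values are hypothetical placeholders, not the actual training code; the 'log' directory matches the --logdir above):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('log')
for epoch in range(300):
    # hypothetical metric standing in for whatever is recorded each epoch
    loss = 1.0 / (epoch + 1)
    writer.add_scalar('train/loss', loss, epoch)  # one point per epoch
writer.close()  # flush remaining events to the event file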

ChenDRAG changed the title from "SummaryWriter delete data automatically when there are too many" to "SummaryWriter deletes data automatically when there are too many" on Oct 27, 2020
mrshenli added the module: tensorboard, triaged, and oncall: visualization labels and removed the triaged label on Oct 27, 2020
@mrshenli
Contributor

cc @orionr

@orionr
Contributor

orionr commented Oct 27, 2020

I'm surprised by that, but there might be some bad buffering going on. cc @cryptopic, @nataliakliushkina, @edward-io to help investigate.

@annisat

annisat commented Apr 5, 2022

I encountered the same (or a similar?) problem with only 20 epochs, but with per-epoch image records. I simplified my original code into a minimal example that reproduces the error. Please see whether they are related, or whether I'm making a mistake somewhere.

from torch.utils.tensorboard import SummaryWriter
import torch

writer = SummaryWriter('test')
for epoch in range(20):
    # one random single-channel 64x64 image (CHW) per epoch, all under the same tag
    data = torch.rand((1, 64, 64))
    writer.add_image("test", data, epoch)

After this, open tensorboard --logdir test. It ends up showing only epoch = 1, 2, 3, 4, 8, 9, 10, 11, 13, 19.
The records only go missing once higher epochs are reached. For example, the image for epoch 5 is still there when training is around epoch 6, 7, or 8, but data for some earlier epochs starts to disappear when reaching, say, epoch 12 or 13.

Part of my pip freeze for your reference:

tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
torch==1.11.0+cu113

I didn't originally have the full tensorflow package installed. I did install it to check whether that was the source of the problem, but it didn't help.

@orionr
Contributor

orionr commented Apr 5, 2022

@annisat did you flush your logs? You can test this by adding writer.flush() or writer.close() to the end of your test code. You can also use it as a context manager I believe, which will flush and close automatically. @Reubend FYI to you and the team as well.
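
A sketch of both variants, based on the minimal example above (the second log directory name is just a placeholder):

from torch.utils.tensorboard import SummaryWriter
import torch

# Option 1: close (which also flushes) explicitly after logging
writer = SummaryWriter('test')
for epoch in range(20):
    writer.add_image("test", torch.rand((1, 64, 64)), epoch)
writer.close()  # flushes pending events and closes the event file

# Option 2: context manager, which closes (and therefore flushes) on exit
with SummaryWriter('test_ctx') as writer:
    for epoch in range(20):
        writer.add_image("test", torch.rand((1, 64, 64)), epoch)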

@annisat

annisat commented Apr 5, 2022

I think I solved this issue (at least mine). It is not that torch's SummaryWriter isn't writing the records; it's that tensorboard isn't displaying them.

I suspect it is a display issue based on three observations: 1) the images were there until a certain epoch; e.g., the images of epoch 13 disappeared after those of epoch 15 were recorded (not the exact numbers, just an example of what I observed); 2) the size of the log file (using writer.flush() with every record, all in one file) is proportional to the number of epochs; 3) when using the context manager, a log file for each epoch is present in the folder.

Then I found this flag in tensorboard:

--samples_per_plugin SAMPLES_PER_PLUGIN
        An optional comma separated list of plugin_name=num_samples pairs to explicitly specify how
        many samples to keep per tag for that plugin. For unspecified plugins, TensorBoard randomly
        downsamples logged summaries to reasonable values to prevent out-of-memory errors for long
        running jobs. This flag allows fine control over that downsampling. Note that 0 means keep all
        samples of that type. For instance "scalars=500,images=0" keeps 500 scalars and all images.
        Most users should not need to set this flag.

The documentation itself is not accurate: in my case, setting images=0 gave me literally 0 images. I used tensorboard --logdir test --samples_per_plugin "images=50" and now it shows all my images.
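
The same flag should also cover the scalar downsampling from the original report; for example (the log directory and sample count here are placeholders, not from the reporter), tensorboard --logdir=log --samples_per_plugin "scalars=10000" keeps up to 10000 scalar points per tag instead of letting TensorBoard downsample them.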
