
Tensorboard cannot load more than two event file in logdir #9512

Closed
Misairu-G opened this issue Apr 28, 2017 · 27 comments
Labels
comp:tensorboard Tensorboard related issues stat:awaiting tensorflower Status - Awaiting response from tensorflower type:bug Bug

Comments

@Misairu-G

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    Yes. A custom network structure and data pre-processing for my own task and dataset, modified from the current single-GPU CIFAR-10 tutorial (which uses the monitored session).

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Windows 10 Pro 1703

  • TensorFlow installed from (source or binary):
    Binary, installed locally with pip install .\xxx.whl in my miniconda environment. For that environment, pip freeze gives the following:

appdirs==1.4.3
bleach==1.5.0
cycler==0.10.0
html5lib==0.9999999
Markdown==2.2.0
matplotlib==2.0.0
numpy==1.12.1
olefile==0.44
packaging==16.8
Pillow==4.1.0
protobuf==3.2.0
pyparsing==2.2.0
python-dateutil==2.6.0
pytz==2017.2
six==1.10.0
tensorflow-gpu==1.1.0rc2
Werkzeug==0.12.1
  • TensorFlow version (use command below):
    Nightly build #149 (GPU Version), 1.1.0-rc2

  • CUDA/cuDNN version:
    CUDA 8.0, cuDNN 5.1

  • GPU model and memory:
    Quadro M1200, 4GB, WDDM mode

Describe the problem

When I restart training for the third time (due to some hyperparameter adjustments), TensorBoard cannot load the new event file. It only loads the first two event files, and after that the scalars stop refreshing.

The PowerShell console gave the following output:

[tensor] PS D:\Workspace\ConsorFlow> tensorboard.exe --logdir '../input_data/lpr_train_exp_01'
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
Starting TensorBoard b'52' at http://DESKTOP-P7T44AT:6006
(Press CTRL+C to quit)
WARNING:tensorflow:Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
ERROR:tensorflow:Unable to get size of D:\Workspace\input_data\lpr_train_exp_01\events.out.tfevents.1493274079.DESKTOP-P7T44AT: D:\Workspace\input_data\lpr_train_exp_01\events.out.tfevents.1493274079.DESKTOP-P7T44AT
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
WARNING:tensorflow:Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
WARNING:tensorflow:Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
WARNING:tensorflow:Detected out of order event.step likely caused by a TensorFlow restart. Purging expired events from Tensorboard display between the previous step: -1 (timestamp: -1) and current step: 17454 (timestamp: 1493310366.6493406). Removing 174 scalars, 76 histograms, 76 compressed histograms, 451 images, and 0 audio.

The 'current step' 17454 in the output is the first step in my second restart.

Information about event files:
1st: events.out.tfevents.1493274079
2nd: events.out.tfevents.1493310339
3rd: events.out.tfevents.1493352650

About this problem on Ubuntu:
I switched to Windows only a few days ago; this problem did not exist on Ubuntu (at least 14.04). I was using the exact same script, but with TensorFlow 1.0.1 (GPU, not a nightly version), installed following the official instructions.

On Windows I ran into #7500, which left me no choice but to install a nightly build.

@girving
Contributor

girving commented Apr 28, 2017

@Cooper-Yang: Can you clarify? The last sentence makes it sound like the problem is fixed in nightly builds.

@girving girving added stat:awaiting tensorflower Status - Awaiting response from tensorflower type:bug Bug stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Apr 28, 2017
@Misairu-G
Author

@girving Sure. I chose a nightly build because I encountered #7500 (OpKernel "bla bla bla" for unknown op: bla bla bla) while using TensorFlow, and a nightly build was suggested as the fix. So I did.

And then, I met this problem while using Tensorboard.

Update: After some experimenting, this problem exists in every version I tried, including the current release 1.1.0 and the nightly builds 1.1.0-rc1 and 1.1.0-rc2.

@girving
Contributor

girving commented Apr 28, 2017

@dandelionmane Any ideas about this TensorBoard+Windows problem?

@girving girving added stat:awaiting tensorflower Status - Awaiting response from tensorflower and removed stat:awaiting response Status - Awaiting response from author labels Apr 28, 2017
@Misairu-G
Author

Update: after a fresh install of Ubuntu 16.04 and TensorFlow, it turns out that TensorBoard on Ubuntu can load the same logdir without any problem. The console output shows the following:

(tensor) coopery@WorkStation-CY:/media/coopery/X1/Workspace/input_data$ tensorboard --host 127.0.0.1 --logdir ./lpr_train_exp_01
Starting TensorBoard b'47' at http://127.0.0.1:6006
(Press CTRL+C to quit)
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
WARNING:tensorflow:Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
WARNING:tensorflow:path ../external/data/plugin/text/runs not found, sending 404
WARNING:tensorflow:path ../external/data/plugin/text/runs not found, sending 404
WARNING:tensorflow:path ../external/data/plugin/text/runs not found, sending 404
WARNING:tensorflow:path ../external/data/plugin/text/runs not found, sending 404
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
WARNING:tensorflow:Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
WARNING:tensorflow:Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
WARNING:tensorflow:Detected out of order event.step likely caused by a TensorFlow restart. Purging expired events from Tensorboard display between the previous step: -1 (timestamp: -1) and current step: 17454 (timestamp: 1493310366.6493406). Removing 174 scalars, 76 histograms, 76 compressed histograms, 451 images, and 0 audio.
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
WARNING:tensorflow:Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
WARNING:tensorflow:Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
WARNING:tensorflow:Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
WARNING:tensorflow:Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
WARNING:tensorflow:Detected out of order event.step likely caused by a TensorFlow restart. Purging expired events from Tensorboard display between the previous step: -1 (timestamp: -1) and current step: 64632 (timestamp: 1493381130.1441092). Removing 116 scalars, 76 histograms, 0 compressed histograms, 451 images, and 0 audio.

@teamdandelion teamdandelion added the comp:tensorboard Tensorboard related issues label Jun 16, 2017
@teamdandelion
Contributor

This is a known issue: TensorBoard doesn't like it when you write multiple event files from separate runs into the same directory. It goes away if you use a new subdirectory for every run (new hyperparameters = new subdirectory).
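A minimal sketch of that workaround (the helper name new_run_dir is illustrative, not part of any TensorBoard API): give every training run its own timestamped subdirectory under the log root, and point the writer or callback there.

```python
import os
from datetime import datetime

def new_run_dir(log_root):
    """Create and return a unique, timestamped subdirectory for one run."""
    stamp = datetime.now().strftime("run-%Y%m%d-%H%M%S-%f")
    run_dir = os.path.join(log_root, stamp)
    n = 0
    while os.path.exists(run_dir):  # guard against coarse clock resolution
        n += 1
        run_dir = os.path.join(log_root, f"{stamp}-{n}")
    os.makedirs(run_dir)
    return run_dir

# Each (re)start of training gets its own directory, e.g. (TF 1.x):
#   writer = tf.summary.FileWriter(new_run_dir("logs"))
# and `tensorboard --logdir logs` then shows each run separately.
```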

@mohamed-ezz

mohamed-ezz commented Dec 28, 2017

@dandelionmane any plans to fix this?

With this issue, one must maintain a single writer per run, which isn't possible when, for example, using Keras's TensorBoard callback along with other custom image/audio tensorboard callbacks. The other option is to have each writer in a separate directory but then Tensorboard will show them as separate runs.

@gar1t

gar1t commented Mar 10, 2018

This issue comes up with estimators as well, which write to an eval subdirectory of model_dir.

I think this should be reopened :)

@tpet

tpet commented Mar 11, 2018

If possible, it would also be good to know where this happened (which run directory, etc.) and exactly what data triggered the warning.

@gar1t

gar1t commented Mar 11, 2018

The sample project below can be used to reproduce the warnings. It's an implementation of the model from Getting Started with TensorFlow and uses tf.estimator.DNNClassifier.

https://github.com/guildai/examples/tree/master/iris

Steps:

git clone https://github.com/guildai/examples.git /tmp/tb-issue
cd /tmp/tb-issue/iris
python train.py
tensorboard --logdir model

@JRMeyer
Contributor

JRMeyer commented May 22, 2018

I'm also getting this issue with tf.estimator.DNNClassifier...

any news?

@boulderZ

I see the same problem; why was this closed?

@meproyousuck

I have the same problem with tf.estimator.DNNRegressor.

@pranav-vempati

I'm encountering the same problem when employing the TensorBoard callback for training a Keras model. Is there a workaround for this issue that doesn't involve creating a separate subdirectory for logging event files generated by each run?

mdangschat added a commit to mdangschat/ctc-asr that referenced this issue Oct 29, 2018
@mjasonong

This is a known issue, TensorBoard doesn't like it when you write multiple event files from separate runs in the same directory. It will be fixed if you use a new subdirectory for every run (new hyperparameters = new subdirectory).

Hi, can someone please elaborate on how to do this? Thanks.

@simonsays1980

I am also having this problem when using tf.estimator.Estimator with a tf.estimator.RunConfig that saves checkpoints. There was already an issue on this, closed without a solution: #17272

oscmansan added a commit to nokutu/m3-project that referenced this issue Jan 22, 2019
@datianshi21

Same question

@ywdong

ywdong commented Jun 13, 2019

When using tf.estimator.train_and_evaluate(...) with a tf.estimator.Estimator, I have the same problem. Any ideas?

@xxllp

xxllp commented Jul 2, 2019

too bad ~~

@datianshi21

When using tf.estimator.train_and_evaluate(...) with a tf.estimator.Estimator, I have the same problem. Any ideas?

Two possible reasons trigger that:
Reason 1: you haven't cleaned the model directory before training a new model.
Solution: just clean the model directory before you train another model.
Reason 2: you are using a model like WDL that is made up of more than one graph.
Solution: in this situation, you cannot use TensorBoard's graph visualization.


@zyuanbing

Using torchvision 1.3.0.dev20190924, tensorboard works now.

@mauricioarmani

I had the same problem using tensorboardX for PyTorch. I noticed that the code was writing two logs when it started: I was instantiating the summary writer in main.py and writing from train.py. Instantiating the writer in train.py instead solved it.
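The fix described above amounts to keeping exactly one writer per run. One way to sketch that pattern (the SummaryWriter class here is a self-contained stand-in for tensorboardX's, and get_writer is a hypothetical helper, not a library function):

```python
from functools import lru_cache

class SummaryWriter:
    """Placeholder for tensorboardX.SummaryWriter, so the sketch runs standalone."""
    def __init__(self, logdir):
        self.logdir = logdir

@lru_cache(maxsize=None)
def get_writer(logdir="runs/current"):
    # main.py, train.py and any callbacks all call get_writer() and receive
    # the SAME instance, so only one event file is opened for the run.
    return SummaryWriter(logdir)
```

With a single shared writer, main.py and train.py no longer each open their own event file in the same directory.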

@Danfoa

Danfoa commented Feb 12, 2020

As mentioned by @mohamed-ezz:

With this issue, one must maintain a single writer per run, which isn't possible when, for example, using Keras's TensorBoard callback along with other custom image/audio tensorboard callbacks. The other option is to have each writer in a separate directory but then Tensorboard will show them as separate runs.

In order to properly log runs with custom callbacks, this needs to be addressed. I believe this should be reopened, or a new issue needs to be created.

@andics

andics commented Aug 11, 2020

Same issue... I'm very surprised that such a trivial problem has not been addressed yet.
It makes the most sense to me to use the same directory for the same experiment. I'm thinking about a script that merges the tensorboard event files in one directory so that tensorboard doesn't have a problem.

@andics

andics commented Aug 11, 2020

Edit:
I found a solution. When starting your tensorboard server, add the following flag:
--purge_orphaned_data false

Hope that helps!

@TomMeowMeow

TomMeowMeow commented Feb 8, 2021

For anyone looking for a naive solution: after every run, just remove the log folder or the log file itself, with something like "rm -rf tf_logs". That way tf_logs is created fresh each time, and you remove it after reviewing. Since we have to call tensorboard from the terminal anyway, I think for now this would hold.
Sample commands:
tensorboard --logdir tf_logs/
rm -rf tf_logs

@ecorreig

ecorreig commented Sep 1, 2023

If someone as [insert deprecating word here] as me is here, this might be caused by adding

if epoch == 0:
    tensorboard_writer.add_graph(model, data[0])

to the wrong loop; probably in a batch loop inside the epochs loop.
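To make that failure mode concrete, here is a stripped-down loop with a stub writer that just counts calls (no real TensorBoard involved); the point is that the guarded add_graph belongs in the epoch loop, where it fires once per run rather than once per batch:

```python
class StubWriter:
    """Counts add_graph calls in place of a real summary writer."""
    def __init__(self):
        self.graph_writes = 0

    def add_graph(self, model, example):
        self.graph_writes += 1

def train(writer, epochs=3, batches=5):
    for epoch in range(epochs):
        # Correct placement: inside the epoch loop, guarded so it runs once.
        if epoch == 0:
            writer.add_graph("model", "first_batch")
        for batch in range(batches):
            pass  # training step; calling add_graph here instead would
                  # write the graph on every batch, triggering the warnings
```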
