
Using tf.estimator.Estimator with save_checkpoints_steps leads to TensorBoard warnings #17272

Closed
voegtlel opened this issue Feb 26, 2018 · 3 comments


@voegtlel
Contributor

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux 4.4.0-104-generic
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v1.4.0-rc1-11-g130a514 1.4.0
  • Python version: 3.6.4
  • CUDA/cuDNN version: 8.0.61
  • GPU model and memory: GeForce GTX TITAN X

Describe the problem

When using tf.estimator.train_and_evaluate(...) with a tf.estimator.Estimator configured via tf.contrib.learn.RunConfig(save_checkpoints_steps=10, ...), a CheckpointSaverHook is created automatically. This CheckpointSaverHook saves the graph and graph_def to the summary writer every time it is triggered (see CheckpointSaverHook.before_run).

Basic code example:

estimator = tf.estimator.Estimator(
    model_fn, model_dir=model_dir, params=params,
    config=tf.estimator.RunConfig(
        save_checkpoints_steps=100,
        save_summary_steps=100
    )
)
train_spec = tf.estimator.TrainSpec(train_fn)
eval_spec = tf.estimator.EvalSpec(eval_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
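
For completeness, a minimal model_fn/input_fn pair along these lines is enough to make the snippet above self-contained. This is only an illustrative sketch; the toy regression model and data, as well as the names model_fn, input_fn, train_fn and eval_fn, are placeholders, and model_dir/params are whatever you normally pass.

import numpy as np
import tensorflow as tf

def model_fn(features, labels, mode, params):
    # Trivial linear regression, just enough to drive train_and_evaluate.
    predictions = tf.layers.dense(features["x"], 1)
    loss = tf.losses.mean_squared_error(labels, predictions)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(
        mode, predictions=predictions, loss=loss, train_op=train_op)

def input_fn():
    # Random toy data; the exact values are irrelevant for reproducing the issue.
    x = np.random.rand(256, 4).astype(np.float32)
    y = np.sum(x, axis=1, keepdims=True)
    dataset = tf.data.Dataset.from_tensor_slices(({"x": x}, y)).repeat().batch(32)
    return dataset.make_one_shot_iterator().get_next()

train_fn = input_fn
eval_fn = input_fn

With a setup like this and a TrainSpec capped via max_steps, several checkpoints are written per run, which is what produces multiple graph events in the summary.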

When starting TensorBoard on the written summaries, it outputs hundreds of warnings because of the multiple graph defs in the summary, which I suspect slows startup down considerably:

W0117 18:47:30.278879 Reloader tf_logging.py:86] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
W0117 18:47:30.279753 Reloader tf_logging.py:86] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.

I see there might be issues when using multiple graphs, but for a single graph this seems impractical.

Related stack overflow discussion: https://stackoverflow.com/questions/48316888
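
One way to confirm that the events file really does contain repeated graph/metagraph records is to scan it directly. A rough sketch (the model_dir path below is a placeholder for the run directory):

import glob
import os
import tensorflow as tf

model_dir = "/tmp/estimator_model_dir"  # placeholder: point this at the run directory

graph_events = 0
metagraph_events = 0
for events_file in glob.glob(os.path.join(model_dir, "events.out.tfevents*")):
    for event in tf.train.summary_iterator(events_file):
        # Each Event proto carries at most one of these fields.
        if event.HasField("graph_def"):
            graph_events += 1
        if event.HasField("meta_graph_def"):
            metagraph_events += 1

print("graph events:", graph_events, "metagraph events:", metagraph_events)

If both counters are greater than one for a single run, TensorBoard emits the warnings above when it reloads the directory.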

@angerson
Contributor

I don't think there's enough information here to identify a TF bug. Consider asking another question on Stack Overflow with the tensorboard tag.

@ricvo

ricvo commented Oct 11, 2018

I am having the same issue. Is there a known workaround, or an explanation of what causes this? (One possible workaround is sketched after the reproduction code below.)
tf version == 1.11.0

I tried all three MonitoredSession variants that I know of and the result is always the same, which is why I think the problem lies in the CheckpointSaverHook.
It looks as if several processes might be writing the graph to the events file. I am not sure how many processes are launched under the hood, but I thought only the chief should write the tf.summaries; isn't that the default behaviour, or what is the suspected issue here?

In all three cases below the model is saved and, I assume, the graph is written to a TensorBoard event alongside it.
When I read the corresponding folder with TensorBoard I get:

W1011 18:27:17.685001 Reloader tf_logging.py:120] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
W1011 18:27:17.685001 140617855026944 tf_logging.py:120] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
W1011 18:27:17.920676 Reloader tf_logging.py:120] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
W1011 18:27:17.920676 140617855026944 tf_logging.py:120] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.

Simple code reproducing the issue:

import argparse

import tensorflow as tf

# Assumed setup (not shown in the original snippet): command-line flag,
# session config, a global step and a trivial training op.
parser = argparse.ArgumentParser()
parser.add_argument("--session", choices=["mts", "ms", "sms"], default="mts")
args = parser.parse_args()

sess_config = tf.ConfigProto(allow_soft_placement=True)
global_step = tf.train.get_or_create_global_step()
loss = tf.reduce_sum(tf.square(tf.get_variable("w", shape=[10])))
training_op = tf.train.GradientDescentOptimizer(0.1).minimize(
    loss, global_step=global_step)

hooks = [
    tf.train.StopAtStepHook(last_step=300)
]

sessioncode = args.session
SAVE_STEPS = 50

if sessioncode == 'mts':
    print("I am using a MonitoredTrainingSession!!")
    basedir = "tempMonitoredTrainingSession"
    checkpoint_dir = basedir

    sess = tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                             save_checkpoint_steps=SAVE_STEPS,
                                             hooks=hooks,
                                             config=sess_config)

elif sessioncode == 'ms':
    print("I am using a MonitoredSession!!")
    basedir = "tempMonitoredSession"
    checkpoint_dir = basedir
    hooks.append(tf.train.CheckpointSaverHook(checkpoint_dir, save_steps=SAVE_STEPS))

    chiefsess_creator = tf.train.ChiefSessionCreator(config=sess_config,
                                                     checkpoint_dir=checkpoint_dir)
    sess = tf.train.MonitoredSession(session_creator=chiefsess_creator, hooks=hooks)

elif sessioncode == 'sms':
    print("I am using a SingularMonitoredSession!!")
    basedir = "tempSingularMonitoredSession"
    checkpoint_dir = basedir
    hooks.append(tf.train.CheckpointSaverHook(checkpoint_dir, save_steps=SAVE_STEPS))

    sess = tf.train.SingularMonitoredSession(checkpoint_dir=checkpoint_dir,
                                             hooks=hooks,
                                             config=sess_config)

else:
    raise ValueError("the session code passed is not supported!")

with sess:
    while not sess.should_stop():
        _, g_step = sess.run([training_op, global_step])
        print("global step: ", g_step)

@shawnthu

I have the same issue when using tf.estimator, running the simplest iris code.
