
Using tf.estimator.Estimator with save_checkpoints_steps leads to TensorBoard warnings #17272

Closed
voegtlel opened this issue Feb 26, 2018 · 3 comments


@voegtlel
Contributor

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux 4.4.0-104-generic
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v1.4.0-rc1-11-g130a514 1.4.0
  • Python version: 3.6.4
  • CUDA/cuDNN version: 8.0.61
  • GPU model and memory: GeForce GTX TITAN X

Describe the problem

When using tf.estimator.train_and_evaluate(...) with a tf.estimator.Estimator configured via tf.contrib.learn.RunConfig(save_checkpoints_steps=10, ...), a CheckpointSaverHook is created automatically. This CheckpointSaverHook saves the graph and graph_def to the summary writer every time it is triggered (see CheckpointSaverHook.before_run).

Basic code example:

estimator = tf.estimator.Estimator(
    model_fn, model_dir=model_dir, params=params,
    config=tf.estimator.RunConfig(
        save_checkpoints_steps=100,
        save_summary_steps=100
    )
)
train_spec = tf.estimator.TrainSpec(train_fn)
eval_spec = tf.estimator.EvalSpec(eval_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
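
For completeness, a minimal model_fn/input_fn pair along these lines is enough to make the snippet above self-contained. This is only an illustrative sketch; the toy regression model and data, as well as the names model_fn, input_fn, train_fn and eval_fn, are placeholders, and model_dir/params are whatever you normally pass.

import numpy as np
import tensorflow as tf

def model_fn(features, labels, mode, params):
    # Trivial linear regression, just enough to drive train_and_evaluate.
    predictions = tf.layers.dense(features["x"], 1)
    loss = tf.losses.mean_squared_error(labels, predictions)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(
        mode, predictions=predictions, loss=loss, train_op=train_op)

def input_fn():
    # Random toy data; the exact values are irrelevant for reproducing the issue.
    x = np.random.rand(256, 4).astype(np.float32)
    y = np.sum(x, axis=1, keepdims=True)
    dataset = tf.data.Dataset.from_tensor_slices(({"x": x}, y)).repeat().batch(32)
    return dataset.make_one_shot_iterator().get_next()

train_fn = input_fn
eval_fn = input_fn

With a setup like this and a TrainSpec capped via max_steps, several checkpoints are written per run, which is what produces multiple graph events in the summary.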

When starting TensorBoard on the written summaries, it outputs hundreds of warnings because of the multiple graph defs in the summary, which I suspect slows startup down considerably:

W0117 18:47:30.278879 Reloader tf_logging.py:86] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
W0117 18:47:30.279753 Reloader tf_logging.py:86] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.

I see there might be issues when using multiple graphs, but for a single graph this seems impractical.

Related stack overflow discussion: https://stackoverflow.com/questions/48316888
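
One way to confirm that the events file really does contain repeated graph/metagraph records is to scan it directly. A rough sketch (the model_dir path below is a placeholder for the run directory):

import glob
import os
import tensorflow as tf

model_dir = "/tmp/estimator_model_dir"  # placeholder: point this at the run directory

graph_events = 0
metagraph_events = 0
for events_file in glob.glob(os.path.join(model_dir, "events.out.tfevents*")):
    for event in tf.train.summary_iterator(events_file):
        # Each Event proto carries at most one of these fields.
        if event.HasField("graph_def"):
            graph_events += 1
        if event.HasField("meta_graph_def"):
            metagraph_events += 1

print("graph events:", graph_events, "metagraph events:", metagraph_events)

If both counters are greater than one for a single run, TensorBoard emits the warnings above when it reloads the directory.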

@angerson
Contributor

I don't think there's enough information here to identify a TF bug. Consider asking another question on Stack Overflow with the tensorboard tag.

@ricvo

ricvo commented Oct 11, 2018

I am having the same issue. Is there a known workaround, or an explanation of what causes this? (One possible workaround is sketched after the reproduction code below.)
tf version == 1.11.0

I tried all three MonitoredSession variants that I know of and the result is always the same, which is why I think the problem lies in the CheckpointSaverHook.
It looks as if several processes might be writing the graph to the events file. I am not sure how many processes are launched under the hood, but I thought only the chief should write the tf.summaries; isn't that the default behaviour, or what is the suspected issue here?

In all three cases below the model is saved and, I assume, the graph is written to a TensorBoard event alongside it.
When I read the corresponding folder with TensorBoard I get:

W1011 18:27:17.685001 Reloader tf_logging.py:120] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
W1011 18:27:17.685001 140617855026944 tf_logging.py:120] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
W1011 18:27:17.920676 Reloader tf_logging.py:120] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
W1011 18:27:17.920676 140617855026944 tf_logging.py:120] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.

Simple code reproducing the issue:

import argparse

import tensorflow as tf

# Assumed setup (not shown in the original snippet): command-line flag,
# session config, a global step and a trivial training op.
parser = argparse.ArgumentParser()
parser.add_argument("--session", choices=["mts", "ms", "sms"], default="mts")
args = parser.parse_args()

sess_config = tf.ConfigProto(allow_soft_placement=True)
global_step = tf.train.get_or_create_global_step()
loss = tf.reduce_sum(tf.square(tf.get_variable("w", shape=[10])))
training_op = tf.train.GradientDescentOptimizer(0.1).minimize(
    loss, global_step=global_step)

hooks = [
    tf.train.StopAtStepHook(last_step=300)
]

sessioncode = args.session
SAVE_STEPS = 50

if sessioncode == 'mts':
    print("I am using a MonitoredTrainingSession!!")
    basedir = "tempMonitoredTrainingSession"
    checkpoint_dir = basedir

    sess = tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                             save_checkpoint_steps=SAVE_STEPS,
                                             hooks=hooks,
                                             config=sess_config)

elif sessioncode == 'ms':
    print("I am using a MonitoredSession!!")
    basedir = "tempMonitoredSession"
    checkpoint_dir = basedir
    hooks.append(tf.train.CheckpointSaverHook(checkpoint_dir, save_steps=SAVE_STEPS))

    chiefsess_creator = tf.train.ChiefSessionCreator(config=sess_config,
                                                     checkpoint_dir=checkpoint_dir)
    sess = tf.train.MonitoredSession(session_creator=chiefsess_creator, hooks=hooks)

elif sessioncode == 'sms':
    print("I am using a SingularMonitoredSession!!")
    basedir = "tempSingularMonitoredSession"
    checkpoint_dir = basedir
    hooks.append(tf.train.CheckpointSaverHook(checkpoint_dir, save_steps=SAVE_STEPS))

    sess = tf.train.SingularMonitoredSession(checkpoint_dir=checkpoint_dir,
                                             hooks=hooks,
                                             config=sess_config)

else:
    raise ValueError("the session code passed is not supported!")

with sess:
    while not sess.should_stop():
        _, g_step = sess.run([training_op, global_step])
        print("global step: ", g_step)

@shawnthu

I have the same issue when using tf.estimator, running the simplest iris code.
