Document that tf.train.Supervisor is deprecated #6263

taochenshh · 2016-12-12T08:06:22Z

I am using CUDA 8.0, cuDNN 5.1, Ubuntu 16.04, a Titan X GPU, and TensorFlow r0.12.
I ran into a problem when using tf.train.Supervisor in distributed training. A simplified version of my code is shown below:

from __future__ import print_function
import tensorflow as tf

server = tf.train.Server.create_local_server()
logs_path = "mnist/logs"


global_step = tf.get_variable('global_step', [],
                              initializer=tf.constant_initializer(0),
                              trainable=False)
with tf.name_scope("weights"):
    W1 = tf.Variable(tf.random_normal([784, 100]))
    W2 = tf.Variable(tf.random_normal([784, 100]))

init_op = tf.global_variables_initializer()
print("Variables initialized ...")
sv = tf.train.Supervisor(is_chief=True,
                         logdir=logs_path,
                         global_step=global_step,
                         init_op=init_op,
                         save_model_secs=600)
with sv.managed_session(server.target) as sess:
    while not sv.should_stop():
        print('==============')
sv.stop()

The problem is that if I set logdir explicitly in tf.train.Supervisor, the code above fails with an error like this: NotFoundError (see above for traceback): Key weights/Variable not found in checkpoint. But if I comment out the lines that define W1 and W2, the code works. So I assume there might be an issue with saving and restoring checkpoint files in tf.train.Supervisor, or maybe I am not using tf.train.Supervisor correctly.


shiyemin · 2016-12-12T08:55:41Z

tf.Variable adds the variable to GLOBAL_VARIABLES but not to MODEL_VARIABLES. It seems that Supervisor only saves MODEL_VARIABLES. I fixed this by calling tf.contrib.framework.add_model_variable for all variables defined with tf.Variable.
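The collection behaviour described here can be checked directly. The snippet below is a minimal sketch that only inspects the graph collections; it does not confirm which collection Supervisor actually saves. The `tensorflow.compat.v1` import is an assumption made so the snippet also runs on newer installs; on r0.12 a plain `import tensorflow as tf` would be used instead.

```python
import tensorflow.compat.v1 as tf  # assumption: TF 1.x-style graph API via compat.v1

tf.disable_v2_behavior()
tf.reset_default_graph()

with tf.name_scope("weights"):
    W1 = tf.Variable(tf.random_normal([784, 100]))

# Names of the variables registered in each collection.
global_names = [v.name for v in tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)]
model_names = [v.name for v in tf.get_collection(tf.GraphKeys.MODEL_VARIABLES)]

print("GLOBAL_VARIABLES:", global_names)
print("MODEL_VARIABLES:", model_names)
```

The variable name listed under GLOBAL_VARIABLES corresponds to the `weights/Variable` key that the error message says is missing from the checkpoint.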

taochenshh · 2016-12-12T09:34:03Z

@shiyemin, do you mean adding the following lines (the loop over tf.global_variables())?

from __future__ import print_function
import tensorflow as tf

server = tf.train.Server.create_local_server()
logs_path = "mnist/logs"


global_step = tf.get_variable('global_step', [],
                              initializer=tf.constant_initializer(0),
                              trainable=False)
with tf.name_scope("weights"):
    W1 = tf.Variable(tf.random_normal([784, 100]))
    W2 = tf.Variable(tf.random_normal([784, 100]))

for variable in tf.global_variables():
    tf.contrib.framework.add_model_variable(variable)
init_op = tf.global_variables_initializer()
print("Variables initialized ...")
sv = tf.train.Supervisor(is_chief=True,
                         logdir=logs_path,
                         global_step=global_step,
                         init_op=init_op,
                         save_model_secs=600)
with sv.managed_session(server.target) as sess:
    while not sv.should_stop():
        print('==============')
sv.stop()

I tried this, still no success.

hannes-brt · 2017-01-14T21:24:15Z

I have the same issue - using tf.contrib.framework.add_model_variable(variable) doesn't help.

kchen92 · 2017-01-30T22:48:17Z

I encountered a similar error using Tensorflow r0.12 and r1.0 as well. For me, the code breaks when I start a new session using with sv.managed_session(...) as sess. I resolved it by clearing my logs/events in my log directory...
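A minimal sketch of that workaround, assuming the error comes from a checkpoint left in the log directory by an earlier run with a different graph. The helper name `fresh_logdir` is hypothetical and not part of TensorFlow. Note that it deletes everything under the given path, including checkpoints you may want to keep.

```python
import os
import shutil


def fresh_logdir(path):
    """Delete any previous checkpoints and event files under path, then recreate it."""
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path)
    return path


# Call this before constructing the Supervisor so nothing stale gets restored.
logs_path = fresh_logdir("mnist/logs")
```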

hannes-brt · 2017-01-31T19:48:14Z

I actually store the graph before I initialize a session. I tried with and without a session present and it doesn't make a difference.


--Hannes


girving · 2017-06-16T16:28:17Z

tf.train.Supervisor is deprecated, please use tf.train.MonitoredSession instead. Assigning to @martinwicke since we should mark it deprecated in the docs.
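For the snippet at the top of this thread, the migration might look roughly like the sketch below. It uses tf.train.MonitoredTrainingSession (the name given in the commit that later closed this issue) rather than constructing a MonitoredSession by hand. Assumptions: the `tensorflow.compat.v1` import is there so the sketch runs on newer installs, the function name `run_training` is hypothetical, and the `assign_add` on the global step is only a stand-in for a real training op.

```python
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF 1.x-style graph API via compat.v1

tf.disable_v2_behavior()


def run_training(logdir, num_steps=5):
    """Advance global_step until it reaches num_steps; return the last value seen."""
    tf.reset_default_graph()
    global_step = tf.train.get_or_create_global_step()
    with tf.name_scope("weights"):
        W1 = tf.Variable(tf.random_normal([784, 100]))
        W2 = tf.Variable(tf.random_normal([784, 100]))

    # Stand-in for a real train op: it only increments the step counter.
    train_op = tf.assign_add(global_step, 1)

    # The hook requests a stop once global_step reaches num_steps.
    hooks = [tf.train.StopAtStepHook(last_step=num_steps)]

    # Variable initialization, checkpoint restore and periodic saving are
    # handled by the session itself; no explicit init_op is passed.
    with tf.train.MonitoredTrainingSession(
            is_chief=True,
            checkpoint_dir=logdir,
            hooks=hooks,
            save_checkpoint_secs=600) as sess:
        step = 0
        while not sess.should_stop():
            step = sess.run(train_op)
    return step


if __name__ == "__main__":
    print(run_training(tempfile.mkdtemp()))
```

For a distributed run, the `master` argument would take the place of the `server.target` that the original code passed to `sv.managed_session`.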

dsalaj · 2017-09-13T10:21:00Z

For anyone looking for more information about the deprecation of tf.train.Supervisor and the upcoming update to the guide, there are some newer comments here: #6604

TuranTimur · 2017-10-31T12:12:07Z

Can anyone update this example with tf.train.MonitoredSession?

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py

TuranTimur · 2017-11-02T09:41:09Z

Can anyone update the document below with a sentence that clearly states that Supervisor is deprecated? Please keep the TensorFlow community united with a clear vision for its API usage.

https://www.tensorflow.org/api_docs/python/tf/train/Supervisor

Fixes tensorflow#6263. PiperOrigin-RevId: 177230053

andydavis1 assigned sherrym Dec 13, 2016

andydavis1 added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 13, 2016

yolanda93 mentioned this issue Mar 29, 2017

Not found error : Key tower/fully_connected/weights not found in checkpoint #8798

Closed

girving assigned martinwicke and unassigned sherrym Jun 16, 2017

girving changed the title ~~Key weights/Variable not found in checkpoint Error when using tf.train.Supervisor~~ Document that tf.train.Supervisor is deprecated Jun 16, 2017

qtdaniel mentioned this issue Nov 2, 2017

Clarifications/suggestions for models/tutorials/rnn/ptb tensorflow/models#2505

Closed

TuranTimur mentioned this issue Nov 2, 2017

examples with deprecated API tensorflow/models#2681

Closed

AlexKuhnle mentioned this issue Nov 3, 2017

[DNM] Distributed fixes tensorforce/tensorforce#148

Closed

sb2nov pushed a commit to sb2nov/tensorflow that referenced this issue Dec 1, 2017

Mark Supervisor deprecated. Please use MonitoredTrainingSession instead.

6295875

Fixes tensorflow#6263. PiperOrigin-RevId: 177230053

sb2nov closed this as completed in b8969d1 Dec 2, 2017

chahna107 mentioned this issue Jan 2, 2018

Feature Request: tf.train.MonitoredTrainingSession implementation for slim.learning.train #15793

Closed
