
Document that tf.train.Supervisor is deprecated #6263

Closed
taochenshh opened this issue Dec 12, 2016 · 9 comments
Labels
stat:awaiting tensorflower Status - Awaiting response from tensorflower

Comments

@taochenshh

taochenshh commented Dec 12, 2016

I am using CUDA 8.0, cuDNN 5.1, Ubuntu 16.04, a Titan X GPU, and TensorFlow r0.12.
I ran into a problem when using tf.train.Supervisor in distributed training. A simplified version of my code is shown below:

from __future__ import print_function
import tensorflow as tf

server = tf.train.Server.create_local_server()
logs_path = "mnist/logs"


global_step = tf.get_variable('global_step', [],
                              initializer=tf.constant_initializer(0),
                              trainable=False)
with tf.name_scope("weights"):
    W1 = tf.Variable(tf.random_normal([784, 100]))
    W2 = tf.Variable(tf.random_normal([784, 100]))

init_op = tf.global_variables_initializer()
print("Variables initialized ...")
sv = tf.train.Supervisor(is_chief=True,
                         logdir=logs_path,
                         global_step=global_step,
                         init_op=init_op,
                         save_model_secs=600)
with sv.managed_session(server.target) as sess:
    while not sv.should_stop():
        print('==============')
sv.stop()

The problem is that if I set logdir explicitly in tf.train.Supervisor, the code above fails with an error like this: NotFoundError (see above for traceback): Key weights/Variable not found in checkpoint. But if I comment out the lines defining W1 and W2, the code works. So I assume there might be an issue with saving and restoring checkpoint files in tf.train.Supervisor, or maybe I am not using tf.train.Supervisor correctly.

@shiyemin

tf.Variable adds the variable to GLOBAL_VARIABLES but not to MODEL_VARIABLES. It seems that Supervisor only saves MODEL_VARIABLES. I fixed this by calling tf.contrib.framework.add_model_variable for all variables defined with tf.Variable.
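
For readers who want to check the collection membership described above, here is a minimal sketch. It assumes TensorFlow 2.x with the tf.compat.v1 API rather than the r0.12 release used in the report, and the helper name collection_names is hypothetical. Because tf.contrib is not available in that setup, tf.add_to_collection with tf.GraphKeys.MODEL_VARIABLES stands in for tf.contrib.framework.add_model_variable. Note that a default tf.train.Saver() saves everything in GLOBAL_VARIABLES, so this membership may not be what decides which variables end up in the checkpoint.

```python
import tensorflow.compat.v1 as tf  # assumption: TF 2.x exposing the v1 API

tf.disable_v2_behavior()  # graph mode with TF 1.x-style variables


def collection_names(add_to_model_variables=False):
    """Build the two weights from the report and list collections by name.

    Hypothetical helper, for illustration only.
    """
    with tf.Graph().as_default():
        with tf.name_scope("weights"):
            tf.Variable(tf.random_normal([784, 100]))
            tf.Variable(tf.random_normal([784, 100]))
        if add_to_model_variables:
            # Stand-in for tf.contrib.framework.add_model_variable.
            for v in tf.global_variables():
                tf.add_to_collection(tf.GraphKeys.MODEL_VARIABLES, v)
        global_names = [v.name for v in tf.global_variables()]
        model_names = [v.name for v in tf.model_variables()]
    return global_names, model_names


if __name__ == "__main__":
    print(collection_names())
    print(collection_names(add_to_model_variables=True))
```

The first variable is named weights/Variable, which matches the key in the NotFoundError message.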

@taochenshh
Author

@shiyemin, do you mean adding the following lines?

from __future__ import print_function
import tensorflow as tf

server = tf.train.Server.create_local_server()
logs_path = "mnist/logs"


global_step = tf.get_variable('global_step', [],
                              initializer=tf.constant_initializer(0),
                              trainable=False)
with tf.name_scope("weights"):
    W1 = tf.Variable(tf.random_normal([784, 100]))
    W2 = tf.Variable(tf.random_normal([784, 100]))

for variable in tf.global_variables():
    tf.contrib.framework.add_model_variable(variable)
init_op = tf.global_variables_initializer()
print("Variables initialized ...")
sv = tf.train.Supervisor(is_chief=True,
                         logdir=logs_path,
                         global_step=global_step,
                         init_op=init_op,
                         save_model_secs=600)
with sv.managed_session(server.target) as sess:
    while not sv.should_stop():
        print('==============')
sv.stop()

I tried this, but still no success.

@andydavis1 andydavis1 added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 13, 2016
@hannes-brt

I have the same issue - using tf.contrib.framework.add_model_variable(variable) doesn't help.

@kchen92
Contributor

kchen92 commented Jan 30, 2017

I encountered a similar error with TensorFlow r0.12 and r1.0 as well. For me, the code breaks when I start a new session using with sv.managed_session(...) as sess. I resolved it by clearing the logs/events in my log directory...
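
A likely explanation for why clearing the log directory helps: when logdir already holds a checkpoint, the Supervisor tries to restore from it, and a checkpoint written by an earlier run whose graph did not define the weights has no weights/Variable key. The sketch below empties the directory before training starts; it uses only the Python standard library, and the helper name reset_logdir is hypothetical. It deletes every file under the given path, so it only suits runs that are meant to start from scratch.

```python
import os
import shutil
import tempfile


def reset_logdir(logdir):
    """Delete logdir and its contents if present, then recreate it empty.

    Hypothetical helper, for illustration only.
    """
    if os.path.isdir(logdir):
        shutil.rmtree(logdir)  # drops stale checkpoints and event files
    os.makedirs(logdir)
    return logdir


if __name__ == "__main__":
    # Demo on a throwaway directory rather than a real log directory.
    demo_dir = os.path.join(tempfile.mkdtemp(), "mnist", "logs")
    os.makedirs(demo_dir)
    open(os.path.join(demo_dir, "checkpoint"), "w").close()
    reset_logdir(demo_dir)
    print(os.listdir(demo_dir))
```

Calling reset_logdir(logs_path) before constructing the Supervisor would give the run an empty directory to write to.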

@hannes-brt

hannes-brt commented Jan 31, 2017 via email

@girving
Contributor

girving commented Jun 16, 2017

tf.train.Supervisor is deprecated; please use tf.train.MonitoredSession instead. Assigning to @martinwicke, since we should mark it as deprecated in the docs.
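
As a rough illustration of that migration, the script from the original report could be rewritten along the following lines. This is a sketch, not an official port: it assumes TensorFlow 2.x with the tf.compat.v1 API, it uses tf.train.MonitoredTrainingSession (a convenience function that returns a MonitoredSession), it replaces the endless print loop with a placeholder train op that only increments the global step, and it runs in-process instead of passing server.target as master. The function name train and the num_steps parameter are hypothetical.

```python
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF 2.x exposing the v1 API

tf.disable_v2_behavior()  # graph mode with TF 1.x-style variables


def train(num_steps, checkpoint_dir):
    """Run a placeholder training loop and return the final global step."""
    with tf.Graph().as_default():
        global_step = tf.train.get_or_create_global_step()
        with tf.name_scope("weights"):
            tf.Variable(tf.random_normal([784, 100]))
            tf.Variable(tf.random_normal([784, 100]))
        # Placeholder for a real optimizer step: only bumps global_step.
        train_op = tf.assign_add(global_step, 1)
        # The hook asks the session to stop once global_step reaches num_steps.
        hooks = [tf.train.StopAtStepHook(last_step=num_steps)]
        step = 0
        with tf.train.MonitoredTrainingSession(
                is_chief=True,
                checkpoint_dir=checkpoint_dir,  # plays the role of logdir
                hooks=hooks,
                save_checkpoint_secs=600) as sess:
            while not sess.should_stop():
                step = int(sess.run(train_op))
    return step


if __name__ == "__main__":
    print(train(num_steps=5, checkpoint_dir=tempfile.mkdtemp()))
```

Initialization and checkpoint restore are handled by the session itself, so there is no explicit init_op and no sv.stop() call. Like the Supervisor, it restores from an existing checkpoint in checkpoint_dir, so a directory left over from a run with a different graph can still cause a restore error.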

@girving girving assigned martinwicke and unassigned sherrym Jun 16, 2017
@girving girving changed the title Key weights/Variable not found in checkpoint Error when using tf.train.Supervisor Document that tf.train.Supervisor is deprecated Jun 16, 2017
@dsalaj
Contributor

dsalaj commented Sep 13, 2017

For anyone looking for more information about the deprecation of `tf.train.Supervisor` and the upcoming update to the guide, there are some newer comments here: #6604

@TuranTimur

Can anyone update this example to use tf.train.MonitoredSession?

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py

@TuranTimur

Can anyone update the document below with a sentence that clearly states that Supervisor is deprecated? Please keep the TensorFlow community united with a clear vision of its API usage.

https://www.tensorflow.org/api_docs/python/tf/train/Supervisor
