Issues when using Queues + tf.train.Server #9136
Comments
Thanks for the detailed report and stacktraces, this helps a lot and is much appreciated. @mrry pointed out that we might have a bug when graphs are extended in a distributed session while some operations (in this case the enqueue operation) are in progress (see the TODO).

@suharshs: would you have the bandwidth to look into that TODO?

@jdonier: in the meantime, a workaround for you would be to ensure that the graph isn't modified after the queue runners are started. For example, your snippet above could be rewritten as:

```python
import tensorflow as tf
import time

cluster = tf.train.ClusterSpec({"cpu1": ['localhost:2222']})
server = tf.train.Server(cluster, job_name="cpu1", task_index=0)

with tf.Graph().as_default() as graph:
    # Queue
    input_queue = tf.train.input_producer(tf.constant([0.], dtype=tf.float32))
    # Useless variable
    variable = tf.Variable(1., dtype=tf.float32, trainable=False, name="variable")
    # Session and queue runners
    session = tf.Session(target=server.target)
    session.run(tf.global_variables_initializer())
    # CHANGE FROM PREVIOUS SNIPPET: create the assign operations *before*
    # starting the queue runners, so the graph is not extended afterwards
    assign2 = tf.assign(variable, 2)
    assign3 = tf.assign(variable, 3)
    tf.train.start_queue_runners(session)
    print(session.run(variable))
    print(session.run(assign2))
    # Freely sleep
    time.sleep(1)
    print(session.run(variable))
    print(session.run(assign3))
```
@asimshankar the reason why I want to modify the graph after the queue runners are started is to change some parameters during or after training (e.g. the learning rate during training, or dropout rates between training and inference), so this needs to be done after the queue runners have been started. I guess I could define them as placeholders, but it's a bit weird to have to feed these values for every computation... About the problem with model saving: I was creating a
I have a change coming soon that should fix this. (Thanks @mrry and @asimshankar for flagging!)
This problem has been reproduced on both Linux and various Mac OS machines.
Describe the problem clearly
We seem to experience issues when using queues together with tf.train.Server. When executed in a simple Python 3.5.3 console, the following script prints its first outputs and then hangs forever. The problem vanishes when either commenting out the input_queue = ... line, or when writing session = tf.Session() instead of passing server.target.

The problem seems to happen not only with variable assignments, but also when saving the model using tf.train.Saver().save(session, 'my_model'), for instance (and possibly other operations). Note that reading a variable works fine.

In the example script, the time.sleep command simulates a pause between creating the session and running it to set a variable. The same effect is achieved, for example, when splitting the session creation and the running code across two Jupyter notebook cells. When executing the whole code in one cell, it works fine.

Source Code / Logs
The source code to reproduce the problem is displayed above. I have attached a traceback using gdb, which shows that the program is hanging while trying to acquire a lock.
tf-issue-gdb-bt.txt
tf-issue-gdb-stack-threads.txt