
Issues when using Queues + tf.train.Server #9136

Closed
jdonier opened this issue Apr 11, 2017 · 3 comments
Labels
stat:awaiting tensorflower (Status - Awaiting response from tensorflower), type:bug (Bug)

Comments


jdonier commented Apr 11, 2017

NOTE: Issues that are not bugs or feature requests will be closed. Please ask usage questions on StackOverflow.

You must complete this information or else your issue will be closed

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow)?: Yes
  • TensorFlow installed from (source or binary)?: binary
  • TensorFlow version: 1.0.0 CPU / 1.0.1 (CPU and GPU enabled) / 1.1.0rc1 (CPU)
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: N/A
  • GPU Model and Memory: N/A
  • Exact command to reproduce: see below.

This problem has been reproduced on both Linux and various Mac OS machines.

Describe the problem clearly

We are running into issues when using queues together with tf.train.Server. When executed in a plain Python 3.5.3 console, the following script hangs:

import tensorflow as tf
import time

cluster = tf.train.ClusterSpec({"cpu1" : ['localhost:2222']})
server = tf.train.Server(cluster, job_name="cpu1", task_index=0)

with tf.Graph().as_default() as graph:
    # Queue
    input_queue = tf.train.input_producer(tf.constant([0.], dtype=tf.float32))

    # Useless variable
    variable = tf.Variable(1., dtype=tf.float32, trainable=False, name="variable")

    # Session and queue runners
    session = tf.Session(target=server.target)
    session.run(tf.global_variables_initializer())
    tf.train.start_queue_runners(session)

print(session.run(variable))  # this works
print(session.run(tf.assign(variable, 2)))  # this also works, but only if called directly

# any pause between creating and running the session breaks it
time.sleep(1)

print(session.run(variable))  # retrieving a variable still works, but...
print(session.run(tf.assign(variable, 3)))  # ... assigning a variable will make the program hang.

It outputs:

1
2
2

and then hangs forever. The problem vanishes when either commenting out the input_queue = ... line, or when writing session = tf.Session() instead of passing server.target.

The problem seems to happen not only with variable assignments, but also when saving the model, e.g. using tf.train.Saver().save(session, 'my_model') (and possibly with other operations). Note that reading a variable works fine.

In the example script, the time.sleep call simulates a pause between creating the session and running it to assign a variable. The same effect occurs, for example, when splitting the session-creation and session-running code across two Jupyter notebook cells. When the whole script is executed in a single cell, it works fine.

Source Code / Logs

The source code to reproduce the problem is displayed above. I have attached a traceback using gdb, which shows that the program is hanging while trying to acquire a lock.

tf-issue-gdb-bt.txt
tf-issue-gdb-stack-threads.txt


asimshankar commented Apr 14, 2017

Thanks for the detailed report and stack traces; this helps a lot and is much appreciated.

@mrry pointed out that we might have a bug when graphs are extended in a distributed session while some operations (in this case the enqueue operation) are in progress (See master_session.cc:1038 - that code predates the queue runners).

@suharshs : Would you have the bandwidth to look into that TODO?

@jdonier : In the meantime, a workaround for you would be to ensure that the graph isn't modified after the queue runners are started. For example, your snippet above could be rewritten as:

import tensorflow as tf
import time

cluster = tf.train.ClusterSpec({"cpu1" : ['localhost:2222']})
server = tf.train.Server(cluster, job_name="cpu1", task_index=0)

with tf.Graph().as_default() as graph:
    # Queue
    input_queue = tf.train.input_producer(tf.constant([0.], dtype=tf.float32))

    # Useless variable
    variable = tf.Variable(1., dtype=tf.float32, trainable=False, name="variable")

    # Session and queue runners
    session = tf.Session(target=server.target)
    session.run(tf.global_variables_initializer())

    # CHANGE FROM PREVIOUS SNIPPET: Assign operations
    assign2 = tf.assign(variable, 2)
    assign3 = tf.assign(variable, 3)

    tf.train.start_queue_runners(session)

print(session.run(variable))
print(session.run(assign2))

# Freely sleep
time.sleep(1)

print(session.run(variable))
print(session.run(assign3))

FYI @jhseu @saeta who might like to know about this too.

asimshankar added the type:bug and stat:awaiting tensorflower labels on Apr 14, 2017

jdonier commented Apr 14, 2017

@asimshankar The reason I want to modify the graph after the queue runners are started is to change some parameters during or after training (e.g. the learning rate during training, or dropout rates between training and inference), so this has to happen after the queue runners have been started. I guess I could define them as placeholders, but it is a bit awkward to have to feed these values for every computation...
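
One possible way to reconcile this with the workaround above (a rough sketch only; the names learning_rate, new_lr and set_lr are just illustrative): keep the parameter in a non-trainable variable and build a single placeholder-fed assign op before the queue runners are started, so the graph is never extended afterwards and the value only has to be fed when it actually changes.

import tensorflow as tf

with tf.Graph().as_default():
    # The hyperparameter lives in a non-trainable variable instead of a graph constant
    learning_rate = tf.Variable(0.1, dtype=tf.float32, trainable=False, name="learning_rate")

    # Build the update op once, before the queue runners start; the new value
    # is fed through a placeholder, so the graph never needs to grow later
    new_lr = tf.placeholder(tf.float32, shape=[], name="new_lr")
    set_lr = tf.assign(learning_rate, new_lr)

    session = tf.Session()
    session.run(tf.global_variables_initializer())
    # (queue runners, if any, would be started here)

# Later, during or after training, change the value without touching the graph:
session.run(set_lr, feed_dict={new_lr: 0.01})
print(session.run(learning_rate))  # -> 0.01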

About the problem with model saving: I was creating a tf.train.Saver() at saving time, which was causing the problem, consistent with your explanation. It all works fine if I define it when I create the graph -- so thanks a lot!
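
A rough sketch of the pattern that works here, adapted from the snippets above: the tf.train.Saver is constructed while the graph is being built, before the queue runners start, and only saver.save() is called afterwards.

import tensorflow as tf

cluster = tf.train.ClusterSpec({"cpu1" : ['localhost:2222']})
server = tf.train.Server(cluster, job_name="cpu1", task_index=0)

with tf.Graph().as_default():
    input_queue = tf.train.input_producer(tf.constant([0.], dtype=tf.float32))
    variable = tf.Variable(1., dtype=tf.float32, trainable=False, name="variable")

    # Create the Saver while building the graph, so that saving later does not
    # add new ops to the already-running distributed session
    saver = tf.train.Saver()

    session = tf.Session(target=server.target)
    session.run(tf.global_variables_initializer())
    tf.train.start_queue_runners(session)

# Saving no longer extends the graph, so it does not hang
saver.save(session, 'my_model')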


saeta commented Apr 18, 2017

I have a change coming soon that should fix this. (Thanks @mrry and @asimshankar for flagging!)

caisq closed this as completed in 1f210ad on Apr 20, 2017