Tensorflow freezes during execution of session.run() #7573

Closed
javdrher opened this issue Feb 16, 2017 · 5 comments
Labels
stat:awaiting response (Status - Awaiting response from author), type:support (Support issues)

Comments

@javdrher

We are currently experiencing an issue with one of our implementations. During the execution of session.run(), the process suddenly freezes. It does not crash, but it is unresponsive to Ctrl+C. It stops consuming CPU and makes no further progress. This occurs on CPU; we haven't tested on GPU. The issue seems to be closely related to #2788.

We ran the script on a machine running an up-to-date Ubuntu 16.04 with 128 GB of RAM and 2 x Xeon CPU E5-2640 v4 processors. The issue first occurred with TensorFlow 0.12.1 installed through Anaconda. We then reproduced it using a build of the master branch, without any CUDA support, using the system libraries rather than those shipped with Anaconda.

The build of the master branch shows:
print(tensorflow.__version__)
1.0.0-rc2

$ git rev-parse HEAD
1a0742f

$ bazel version
Build label: 0.4.4
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Wed Feb 1 18:54:21 2017 (1485975261)
Build timestamp: 1485975261
Build timestamp as int: 1485975261

Attached is the output of running thread apply all bt 10 in gdb. All threads appear to be waiting.
We are running a locally implemented model built on GPflow (not one of that library's default models). Unfortunately, the freeze occurs on a confidential data set. If required, we can look into providing an MWE, but I cannot guarantee that we can reproduce this easily under different circumstances.
gdb.txt

@aselle
Contributor

aselle commented Feb 16, 2017

Without a minimal reproducible test case, we will be unlikely to be able to help you further. However, the most common cause of this is that your run() call is blocking on queues.

See e.g.
#2788
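
The failure mode described above, a consumer blocking on a queue that nothing is filling, can be sketched without TensorFlow. The block below is an analogy using Python's standard-library queue module, not TensorFlow's queues; the helper names dequeue_with_timeout and start_producer are hypothetical. In TensorFlow 1.x the corresponding steps would presumably be starting the queue runners with tf.train.start_queue_runners and passing tf.RunOptions(timeout_in_ms=...) to session.run(), but that mapping is an assumption about the reporter's setup.

```python
import queue
import threading


def dequeue_with_timeout(q, timeout_s):
    """Take one item from q, or return None if nothing arrives in time.

    A plain q.get() with no timeout would block forever on an empty
    queue that no producer is filling, which looks like a frozen process.
    """
    try:
        return q.get(timeout=timeout_s)
    except queue.Empty:
        return None


def start_producer(q, items):
    """Start a background thread that feeds items into q (hypothetical helper)."""
    def feed():
        for item in items:
            q.put(item)

    thread = threading.Thread(target=feed, daemon=True)
    thread.start()
    return thread
```

With a timeout in place, a forgotten producer shows up as an explicit empty result (or, in TensorFlow, an error) instead of a silent hang.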

@aselle added the stat:awaiting response and type:support labels on Feb 16, 2017

@yaroslavvb
Contributor

Being unresponsive to Ctrl+C is suspicious. Is it possible to kill the process using kill? I've seen unkillable TF training caused by 1) stuck IO (i.e., trying to write to NFS when the network is down) and 2) the Nvidia driver (it gets stuck sometimes). Both can be diagnosed by looking at the stuck call in /proc/<pid>/stack.
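
A minimal sketch of that check, assuming a Linux machine with /proc mounted. The helper names read_proc_file and snapshot are hypothetical. Reading /proc/<pid>/stack usually requires root, so the sketch returns None for anything it cannot read. Note also that /proc/<pid>/stack covers only the main thread; other threads have their own files under /proc/<pid>/task/<tid>/.

```python
import os


def read_proc_file(pid, name):
    """Return the stripped text of /proc/<pid>/<name>, or None if unreadable."""
    try:
        with open(os.path.join("/proc", str(pid), name)) as handle:
            return handle.read().strip()
    except OSError:
        # Missing file, missing /proc, or insufficient permissions.
        return None


def snapshot(pid):
    """Collect the scheduler state, wait channel, and kernel stack of a process."""
    status = read_proc_file(pid, "status") or ""
    state = None
    for line in status.splitlines():
        if line.startswith("State:"):
            state = line.split(":", 1)[1].strip()
            break
    return {
        "state": state,                          # e.g. "S (sleeping)" or "D (disk sleep)"
        "wchan": read_proc_file(pid, "wchan"),   # kernel function the task sleeps in
        "stack": read_proc_file(pid, "stack"),   # kernel stack, typically root only
    }
```

A state of "D (disk sleep)" means uninterruptible sleep, which would be consistent with the stuck-IO case. A state of "S (sleeping)" together with a process that kill can terminate points more toward a user-space wait.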

@gpfins

gpfins commented Feb 16, 2017

@javdrher and I can kill it, but it does not respond to Ctrl+C. I'll look into your suggestions, thank you.
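
One possible explanation for a process that dies on kill but ignores Ctrl+C, offered as an assumption rather than a diagnosis: Python-level signal handlers run only once the interpreter regains control in the main thread, so a KeyboardInterrupt stays pending while that thread is blocked inside native code. The standard-library faulthandler module installs a C-level handler instead, so it can still report where each Python thread is waiting. The sketch below uses the hypothetical helper name enable_stack_dump_on_signal; the choice of SIGUSR1 is arbitrary and assumes a Unix system.

```python
import faulthandler
import os
import signal
import sys


def enable_stack_dump_on_signal(stream=sys.stderr, signum=signal.SIGUSR1):
    """Dump the Python traceback of every thread to stream when signum arrives.

    stream must be a real file with a file descriptor, and it has to stay
    open for as long as the handler is registered.
    """
    faulthandler.register(signum, file=stream, all_threads=True)
    return signum


# Intended use (not executed here): call enable_stack_dump_on_signal() once
# at startup, before building the graph. When the process hangs, run
#   kill -USR1 <pid>
# from another shell and read the tracebacks from the process's stderr.
```

This complements the gdb backtrace from the original report: gdb shows the native frames, while this dump shows which Python call each thread was in.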

@prb12
Member

prb12 commented Mar 9, 2017

Closing due to lack of activity. Please reopen if necessary.

@prb12 closed this as completed on Mar 9, 2017

@edelmanjm

Having this issue as well. I'm guessing it's some sort of issue with my code rather than TF, but it's still unusual. I've been running off of CPU, so Nvidia drivers shouldn't be the issue.

Code is here, specifically steer_eval.py.

6 participants