TensorFlow hangs without explanation, some threads stay busy. #2788
You'll need to provide some details about what you are doing. The most common hanging issue is that you've created an empty queue and are waiting to dequeue elements from it without any enqueues.
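(For reference, a minimal sketch of that failure mode, assuming the 0.x-era queue API: a dequeue from a queue that nothing ever enqueues into blocks forever.)

```python
import tensorflow as tf

# A queue that nothing enqueues into: dequeue() blocks indefinitely.
q = tf.FIFOQueue(10, [tf.float32])
dequeued = q.dequeue()

with tf.Session() as sess:
    sess.run(dequeued)  # hangs forever: there are no pending enqueues
```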
I do have a queue for preloading data, though the model hangs after everything has executed, meaning the dequeue node already ran successfully. I'll try it without preloading, to remove any queue ops from the graph, and see what happens.
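(A sketch of the queue-free experiment being proposed, with hypothetical shapes; feeding batches through placeholders removes all queue ops from the graph.)

```python
import numpy as np
import tensorflow as tf

# Placeholders replace the dequeue ops; the shapes here are made up.
images = tf.placeholder(tf.float32, shape=[None, 24, 24, 3])
labels = tf.placeholder(tf.int32, shape=[None])

# ... build the model from `images`/`labels` instead of queue outputs ...

batch_images = np.zeros((128, 24, 24, 3), dtype=np.float32)
batch_labels = np.zeros((128,), dtype=np.int32)
# sess.run(train_op, feed_dict={images: batch_images, labels: batch_labels})
```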
You could also create all your sessions with a timeout, so that a hung run() call fails with an error instead of blocking forever.
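(A sketch of that timeout suggestion, assuming a build recent enough to have ConfigProto.operation_timeout_in_ms; a hung run() call then fails with DeadlineExceededError instead of blocking forever.)

```python
import tensorflow as tf

# Any blocking sess.run() call now raises DeadlineExceededError after 60s
# instead of hanging indefinitely.
config = tf.ConfigProto(operation_timeout_in_ms=60000)
sess = tf.Session(config=config)
```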
I do have to clarify that I don't think this has anything to do with queues, as I have this code running on 6 other machines without issue.
Yep, it's hanging again without queues.
I'm running with the timeout to see if the error message has any info.
The DeadlineExceededError doesn't seem to provide any information about which nodes in the graph are executing at the time of the error.
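(The probing loop being described is roughly this; `train_op` and `max_steps` are placeholders for the model's own training op and step count.)

```python
import tensorflow as tf

# With operation_timeout_in_ms set on the session, a stuck step surfaces
# as an exception, pinpointing the iteration (but not the node) that hung.
for step in range(max_steps):
    try:
        sess.run(train_op)
    except tf.errors.DeadlineExceededError:
        print('step %d exceeded the deadline' % step)
        raise
```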
Is this a CUDA-related issue, or is it possible to reproduce with just the CPU? There could be a bug here, but I think we'd need a somewhat minimized test case in order to reproduce it on our end.
I am seeing this when deploying it to only the CPU. Let me see if I can narrow down a test case.
It might have something to do with my system; I'm seeing the same behavior with cifar10_train.py.
I'm seeing similar behavior, and it also appears to be installation- or system-dependent. cifar10_train.py hangs after some random number of iterations. Same with my own models. This is on a system with 4 x Tesla K80s.
@mrry Suggestions? Seems like this one is going to be hard to debug. Assigning to you for now.
(Reassigning to Benoit, as my best guess is that it's an Eigen threadpool issue.) It might be a red herring, but the "busy" thread has a stack that includes the new Eigen threadpool, and perhaps it could be livelocking? @dvyukov, @rmlarsen, or @benoitsteiner would be best placed to comment on whether this is possible.

Just for clarification: when you say "virtual cores", do you mean hyperthreads, or are you running in a virtualized environment (and if so, which one)?

One possible workaround would be to build with the old Eigen threadpool. I believe all you need to do is add

#define EIGEN_USE_SIMPLE_THREAD_POOL

...wherever the Eigen threadpool is used. At the very least this would include this line in
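(If rebuilding is inconvenient, an alternative experiment, my suggestion rather than anything proposed in the thread, is to shrink the Eigen pool from Python; with a single intra-op thread a threadpool livelock should disappear.)

```python
import tensorflow as tf

# Run each op single-threaded so Eigen never fans work out across
# worker threads; if the hang vanishes, the threadpool is implicated.
config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1)
sess = tf.Session(config=config)
```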
Please run thread apply all bt in gdb while the process is hanging and post the output.
@mrry The i7-5930K has 6 cores but shows 12 in htop, so I guess the proper name for that is hyperthreads? I'm going to work on getting thread apply all in gdb.
The cifar model uses the local response normalization operation. Since this op doesn't have a GPU implementation, it is run on the CPU, and therefore a fairly high average CPU utilization is expected. The thing that I can't explain in your stack trace is that there are 10 calls to CopyTensor::ViaDMA running at the same time. Are you creating multiple towers?
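(A sketch of the CPU fallback being described, with made-up shapes: even when pinned to the GPU, lrn is silently moved back to the CPU under soft placement because it has no GPU kernel in this version.)

```python
import tensorflow as tf

x = tf.random_normal([32, 24, 24, 64])
with tf.device('/gpu:0'):
    y = tf.nn.local_response_normalization(x)  # no GPU kernel for this op

config = tf.ConfigProto(allow_soft_placement=True,
                        log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(y)  # the placement log shows the lrn node on /cpu:0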
I'm just running:

$ python3.4 cifar10_train.py
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so.4 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so.7.5 locally
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.228
pciBusID 0000:03:00.0
Total memory: 6.00GiB
Free memory: 5.66GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:03:00.0)
2016-06-15 11:33:33.201096: step 0, loss = 4.68 (11.4 examples/sec; 11.201 sec/batch)
2016-06-15 11:33:34.734543: step 10, loss = 4.65 (1402.6 examples/sec; 0.091 sec/batch)
2016-06-15 11:33:35.732699: step 20, loss = 4.64 (1123.6 examples/sec; 0.114 sec/batch)
2016-06-15 11:33:36.671899: step 30, loss = 4.62 (1432.7 examples/sec; 0.089 sec/batch)
2016-06-15 11:33:37.726346: step 40, loss = 4.60 (1641.9 examples/sec; 0.078 sec/batch)

It runs through about 40 steps, give or take 20, then locks up on a single iteration.
Sooo many mutexes, my eyes are bleeding.
@dvyukov Of course, this is a traceback during the hang. The traceback is taken from the run in my last comment ($ python3.4 cifar10_train.py), running to just past step 40. I don't remember how many threads were busy in this case. It hangs even if nothing is deployed to the GPU.
@mrry I get build errors if I use your #define EIGEN_USE_SIMPLE_THREAD_POOL workaround.
The only non-blocked thread in the trace is the one executing a CUDA ioctl. The rest are either blocked on the GPU mutex (waiting for the first thread) or thread pool threads out of work.
Most of the CPU-to-GPU and GPU-to-CPU copies come from the fact that the exponential moving average variables are all placed on the CPU, which causes a lot of small data transfers. Can you try to move them to the GPU and see if that helps reduce the likelihood of the hang? It will also help the model train faster.
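(The cifar10 example creates every variable through a _variable_on_cpu helper pinned to /cpu:0; a sketch of the change being suggested, with a hypothetical mirror-image helper.)

```python
import tensorflow as tf

# Hypothetical counterpart to cifar10.py's _variable_on_cpu: creating the
# variables on the GPU keeps the moving-average updates on-device and
# avoids the many small CPU<->GPU copies described above.
def _variable_on_gpu(name, shape, initializer):
    with tf.device('/gpu:0'):
        return tf.get_variable(name, shape, initializer=initializer)
```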
I commented out the code that deploys the variables to the CPU, and these are the new stacks (most are duplicates):
I'm going to try out the new drivers: 367.27 |
Didn't work.
I'm going to install the CPU-only build.
Still happens on a CPU-only install, so it has nothing to do with CUDA.
The only thread that isn't waiting on a mutex or calling syscall-template.S is thread 41:
debug build:
It seems that all worker threads except one are in some kind of livelock, while one thread is working. And the livelocked threads additionally slow down the working thread (due to contention in the kernel or on the memory bus).

@alexatknit, you said that "pausing and continuing the process using gdb will make it continue immediately". Is that true most of the time? You are running on a physical machine with an i7-5930K processor. No VMs involved. Right? What OS do you use? You said that the program hangs, but also that it completes successfully overnight. Does it just become slower? By how much? Is it like 10x? 100x?

In all stacks I see TensorContractionOp. I wonder if it submits super tiny tasks (due to cost over-estimation? @benoitsteiner @rmlarsen). @alexatknit What model do you use? Can you provide a reproducer? Does it use a contraction with some unusual rhs/lhs operands?
Or maybe a 2x100000000x2 kind of configuration?
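(The degenerate contraction being asked about would look something like this; n may need scaling down to fit in memory, since each operand is ~800 MB.)

```python
import tensorflow as tf

# (2 x n) . (n x 2): a tiny output with a huge inner dimension, the kind
# of skinny contraction suspected of producing super tiny Eigen tasks.
n = 100000000
a = tf.ones([2, n])
b = tf.ones([n, 2])
c = tf.matmul(a, b)

with tf.Session() as sess:
    print(sess.run(c))
```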
config:
@dvyukov Are there any updates on this?
@gunan No updates, still waiting for answers from @alexatknit.
@alexatknit Are you still running into this problem?
Sorry, I had abandoned the machine; I can try to spin it up again.
It appears to be powered down at the moment, so I don't have ssh access. I'll try again later.
I just ran v0.10 on the machine and it seems to be working fine now.
I just noticed
I've got a machine running a GTX 980 Ti with 32 GB of RAM (of which about half is used by the data loading pipeline) and an i7-5930K CPU driving a network. I'm running Ubuntu 14.04 with driver x86_64-361.45.11, CUDA 7.5.18_linux, and CUDNN linux-x64-v4.0-prod installed. Every 30 or so iterations the run method hangs and 2 to 5 virtual cores are busy, with the GPU doing nothing more than idling. I thought it hung indefinitely, but when I ran it overnight I saw that it actually completed what it was doing after some time. Pausing and continuing the process using gdb will make it continue immediately. I saw something like this with the initial release of 0.6.0, but it was fixed by 0.6.1, so I just pulled from master and assumed I would never see it again.

Inspecting the root process just shows that it's waiting on the run method:

Inspection of one of the busy threads is equally uninteresting:

Finally, the CPU usage reported by htop doesn't seem to be fully accounted for in the displayed processes: 5 virtual cores might be busy, but the processes are only credited with 0-300% CPU usage.