
tf.nn.softmax on GPU causes CUDA_ERROR_ILLEGAL_ADDRESS (okay on CPU) #5221

Closed
MInner opened this issue Oct 26, 2016 · 18 comments
MInner commented Oct 26, 2016

CentOS 7
TensorFlow 0.10.0
TITAN X (Pascal), driver 367.44

I restore a model previously saved with tf.train.Saver() and try to compute the probabilities of the outputs for a given input batch (a rough sketch of this flow is included below). Whenever I try to execute tf.nn.softmax on GPU, I get an error:

E tensorflow/stream_executor/cuda/cuda_driver.cc:1140] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS :: No stack trace available
F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
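
The restore-and-run flow is roughly the following (simplified sketch; build_model, checkpoint_path and tf_run_args are placeholders standing in for the real code linked at the bottom of this issue):

import tensorflow as tf

# rebuild the graph and restore the weights previously saved with tf.train.Saver()
model = build_model()              # placeholder for the real model-construction code
saver = tf.train.Saver()
s = tf.Session()
saver.restore(s, checkpoint_path)  # checkpoint_path: path to the saved checkpoint

# compute output probabilities for one input batch
with tf.device('/gpu:0'):
    probs_op = tf.nn.softmax(model.other.logits_labl)
probs = s.run(probs_op, tf_run_args)  # tf_run_args: the feed arguments used in the runs below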

The same computation works just fine on CPU, though:

print('run 1: logits')
logits = s.run(model.other.logits_labl, tf_run_args)
print(logits.shape) # (1001, 289)
print(np.max(logits), np.min(logits)) # 1.80996 -0.752239

print('run 2: probs in np')
probs = s.run(tf.nn.softmax(logits))
print(probs.shape) # (1001, 289)
print(np.sum(probs, axis=1)) # [ 1.   ...  1.00000012  1.          1.00000012]

print('run 3: probs on cpu')
with tf.device('/cpu:0'):
    t = tf.nn.softmax(model.other.logits_labl)
print(s.run(t, tf_run_args)) # okay

print('run 4: probs on gpu')
with tf.device('/gpu:0'):
    t2 = tf.nn.softmax(model.other.logits_labl)
print(s.run(t2, tf_run_args)) # error

produces:

run 1: logits
(1001, 289)
1.80996 -0.752239
run 2: probs in np
(1001, 289)
[ 1.          0.99999994  1.         ...,  1.00000012  1.          1.00000012]
run 3: probs on cpu
[[ 0.00353091  0.0032969   0.00355321 ...,  0.00368173  0.00337926
   0.00326502]
 [ 0.00343715  0.00313538  0.00379693 ...,  0.00426536  0.0032676
   0.0031463 ]
 [ 0.00346572  0.00300998  0.00389543 ...,  0.00458091  0.0031747
   0.0030867 ]
 ...,
 [ 0.0035709   0.00302548  0.00384262 ...,  0.0042819   0.00318443
   0.00299288]
 [ 0.00353101  0.00305104  0.00379521 ...,  0.00428836  0.0031993
   0.0029559 ]
 [ 0.00352879  0.00302152  0.00380528 ...,  0.0043787   0.00318416
   0.00294856]]
run 4: probs on gpu
E tensorflow/stream_executor/cuda/cuda_driver.cc:1140] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS :: No stack trace available
F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
Aborted

gdb does not give any additional details:

... same as above
[New Thread 0x7fff41ffb700 (LWP 26404)]
E tensorflow/stream_executor/cuda/cuda_driver.cc:1140] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS :: No stack trace available
E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:198] Unexpected Event status: 1

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff5effd700 (LWP 26391)]
0x00007ffff6a315f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install libX11-1.6.3-2.el7.x86_64 libXau-1.0.8-2.1.el7.x86_64 libXdmcp-1.1.1-6.1.el7.x86_64 libuuid-2.23.2-26.el7_2.3.x86_64

Surprisingly, this error started occurring only recently. I have been working with this same codebase for months and nothing like this happened before, but now it is 100% reproducible on my machine with this specific combination of code, driver version, etc. So it is not something that affects everyone, but rather something that shows up only under particular conditions. I might have changed the operation device placement recently.

code (unfortunately, it is a large piece of code that requires certain data; I failed to produce a minimal reproducing example): [link]

full gdb output (without prints): [link]

environment:

$ ls -l /usr/lib/libcu*
lrwxrwxrwx. 1 root root      12 Oct  3 15:36 /usr/lib/libcuda.so -> libcuda.so.1
lrwxrwxrwx. 1 root root      17 Oct  3 15:36 /usr/lib/libcuda.so.1 -> libcuda.so.367.44
-rwxr-xr-x. 1 root root 7747600 Oct  3 15:36 /usr/lib/libcuda.so.367.44
$ ls -l /usr/local/cuda/lib64/
lrwxrwxrwx. 1 1000 users       13 Jul 27 01:55 libcudnn.so -> libcudnn.so.5
lrwxrwxrwx. 1 1000 users       17 Jul 27 01:55 libcudnn.so.5 -> libcudnn.so.5.1.5
-rwxrwxr-x. 1 1000 users 79337624 Jul 27 01:53 libcudnn.so.5.1.5
-rw-rw-r--. 1 1000 users 69756172 Jul 27 01:53 libcudnn_static.a

Possibly related issues: #2117 #1450 #2810 #665 #1060 #4425

drpngx added the bug label Oct 26, 2016
drpngx (Contributor) commented Oct 26, 2016

Could you help narrow down the error with cuda-memcheck from NVIDIA?

To make it easier to trace, try setting CUDA_LAUNCH_BLOCKING=1 and running with --brain_gpu_sync_every_op.
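
For example, assuming the reproduction script is repro.py (a placeholder name; substitute whatever you use to trigger the error), something along the lines of:

CUDA_LAUNCH_BLOCKING=1 cuda-memcheck python repro.py

should make the kernel launches synchronous and let memcheck point at the offending kernel.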

MInner (Author) commented Oct 26, 2016

@drpngx memcheck with CUDA_LAUNCH_BLOCKING=1. Do I have to rebuild TensorFlow from source to set --brain_gpu_sync_every_op, or is there an easier way?

drpngx (Contributor) commented Oct 27, 2016

Whoops, you're right. It looks like you have to tweak gpu_device_factory.cc, but from the trace it's clear that it's in the Eigen ReduceInitKernel, in the dense softmax.

@zheng-xq for any additional insight. We don't have access to the data, but we do have access to the source code. If this is an actual bug in TensorFlow, it would be great to get to the bottom of it.

zheng-xq (Contributor) commented:

To sync after each op, you'll have to build from source by modifying this line.

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/gpu/gpu_device_factory.cc#L34

If you still have this problem with the latest driver, then we will need to know which kernel causes it. Combining that with memcheck and CUDA_LAUNCH_BLOCKING=1 tends to give the best answer.

drpngx (Contributor) commented Oct 27, 2016

@zheng-xq the memcheck above says it's in ReduceInitKernel from the Eigen softmax. I wonder if there are diagnostics we could print (dimensions, etc.) of the buffers vs. the expected op input.

MInner (Author) commented Oct 27, 2016

Here's the TF-related part of the code above, in case it helps with understanding the device placement.

drpngx (Contributor) commented Oct 27, 2016

Would it be possible for you to create a repro that we could try locally?

drpngx (Contributor) commented Oct 28, 2016

If you had a build with debug on, then we might be able to see what the intermediate stack frames are. ReductionInitKernel appears to have been passed nullptr + an offset (maybe 300 or so), so the output pointer is presumably pulled from a struct that has about 300 bytes worth of members before it, and that struct instance pointer is nullptr.

drpngx (Contributor) commented Oct 28, 2016

@benoitsteiner if you're interested.

drpngx added the stat:awaiting response (Status - Awaiting response from author) label Oct 28, 2016
benoitsteiner (Contributor) commented:

@MInner Can you try with the release candidate for 0.11? There was a bug in 0.10 that could explain your CUDA_ERROR_ILLEGAL_ADDRESS error; it has since been fixed.

benoitsteiner self-assigned this Oct 28, 2016
aselle added the type:bug label and removed the bug label Feb 9, 2017
2sin18 (Contributor) commented Apr 7, 2017

@benoitsteiner Which commit fixed this bug? I encountered the same error in TensorFlow 1.0.0 and am trying to reproduce it now.

MInner (Author) commented Apr 12, 2017

I also still encounter this problem when using large enough models with google/seq2seq during evaluation, on the same hardware as mentioned at the top of the issue.

drpngx (Contributor) commented Apr 13, 2017

@MInner to be clear, you are getting CUDA_ERROR_ILLEGAL_ADDRESS?

drpngx reopened this Apr 13, 2017
aselle removed the stat:awaiting response (Status - Awaiting response from author) label Apr 13, 2017
MInner (Author) commented Apr 13, 2017

@drpngx yes, along with other random errors, like shape mismatches in the middle of training. People in the seq2seq issue tracker (referenced above) suggest that this might be related to a race condition during concurrent training and evaluation in contrib.learn.experiment.train_and_evaluate. I also sometimes observe weird errors (not related to memory allocation) when I run two processes concurrently even on different GPUs, and often when I run two processes on the same GPU.

drpngx (Contributor) commented Apr 13, 2017 via email

MInner (Author) commented Apr 13, 2017

I meant this comment.

drpngx (Contributor) commented Apr 13, 2017

Thanks! Closing this one in favor of the other one, so that we just have one place to track it.

drpngx closed this as completed Apr 13, 2017
zhangjunhust commented:

Have you figured it out?
