
tf.nn.softmax on GPU causes CUDA_ERROR_ILLEGAL_ADDRESS (okay on CPU) #5221

Closed
MInner opened this issue Oct 26, 2016 · 18 comments
MInner commented Oct 26, 2016

CentOS 7
TensorFlow 0.10.0
TITAN X (Pascal), driver 367.44

I restore a model previously saved with tf.train.Saver() and try to compute the probabilities of the outputs for a given input batch (a rough sketch of this flow is included below). Whenever I try to execute tf.nn.softmax on GPU, I get an error:

E tensorflow/stream_executor/cuda/cuda_driver.cc:1140] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS :: No stack trace available
F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
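
The restore-and-run flow is roughly the following (simplified sketch; build_model, checkpoint_path and tf_run_args are placeholders standing in for the real code linked at the bottom of this issue):

import tensorflow as tf

# rebuild the graph and restore the weights previously saved with tf.train.Saver()
model = build_model()              # placeholder for the real model-construction code
saver = tf.train.Saver()
s = tf.Session()
saver.restore(s, checkpoint_path)  # checkpoint_path: path to the saved checkpoint

# compute output probabilities for one input batch
with tf.device('/gpu:0'):
    probs_op = tf.nn.softmax(model.other.logits_labl)
probs = s.run(probs_op, tf_run_args)  # tf_run_args: the feed arguments used in the runs below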

The same computation works just fine on CPU, though:

print('run 1: logits')
logits = s.run(model.other.logits_labl, tf_run_args)
print(logits.shape) # (1001, 289)
print(np.max(logits), np.min(logits)) # 1.80996 -0.752239

print('run 2: probs in np')
probs = s.run(tf.nn.softmax(logits))
print(probs.shape) # (1001, 289)
print(np.sum(probs, axis=1)) # [ 1.   ...  1.00000012  1.          1.00000012]

print('run 3: probs on cpu')
with tf.device('/cpu:0'):
    t = tf.nn.softmax(model.other.logits_labl)
print(s.run(t, tf_run_args)) # okay

print('run 4: probs on gpu')
with tf.device('/gpu:0'):
    t2 = tf.nn.softmax(model.other.logits_labl)
print(s.run(t2, tf_run_args)) # error

produces:

run 1: logits
(1001, 289)
1.80996 -0.752239
run 2: probs in np
(1001, 289)
[ 1.          0.99999994  1.         ...,  1.00000012  1.          1.00000012]
run 3: probs on cpu
[[ 0.00353091  0.0032969   0.00355321 ...,  0.00368173  0.00337926
   0.00326502]
 [ 0.00343715  0.00313538  0.00379693 ...,  0.00426536  0.0032676
   0.0031463 ]
 [ 0.00346572  0.00300998  0.00389543 ...,  0.00458091  0.0031747
   0.0030867 ]
 ...,
 [ 0.0035709   0.00302548  0.00384262 ...,  0.0042819   0.00318443
   0.00299288]
 [ 0.00353101  0.00305104  0.00379521 ...,  0.00428836  0.0031993
   0.0029559 ]
 [ 0.00352879  0.00302152  0.00380528 ...,  0.0043787   0.00318416
   0.00294856]]
run 4: probs on gpu
E tensorflow/stream_executor/cuda/cuda_driver.cc:1140] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS :: No stack trace available
F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
Aborted

gdb does not give any additional details:

... same as above
[New Thread 0x7fff41ffb700 (LWP 26404)]
E tensorflow/stream_executor/cuda/cuda_driver.cc:1140] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS :: No stack trace available
E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:198] Unexpected Event status: 1

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff5effd700 (LWP 26391)]
0x00007ffff6a315f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install libX11-1.6.3-2.el7.x86_64 libXau-1.0.8-2.1.el7.x86_64 libXdmcp-1.1.1-6.1.el7.x86_64 libuuid-2.23.2-26.el7_2.3.x86_64

Surprisingly, this error started occurring only recently. I have been working with this same codebase for months and nothing like this happened before, but now it is 100% reproducible on my machine with this specific combination of code, driver version, etc. So it is not something that affects everyone, but rather something that shows up only under particular conditions. I might have changed the operation device placement recently.

code (unfortunately, it is a large piece of code that requires certain data; I failed to produce a minimal reproducing example): [link]

full gdb output (without prints): [link]

environment:

$ ls -l /usr/lib/libcu*
lrwxrwxrwx. 1 root root      12 Oct  3 15:36 /usr/lib/libcuda.so -> libcuda.so.1
lrwxrwxrwx. 1 root root      17 Oct  3 15:36 /usr/lib/libcuda.so.1 -> libcuda.so.367.44
-rwxr-xr-x. 1 root root 7747600 Oct  3 15:36 /usr/lib/libcuda.so.367.44
$ ls -l /usr/local/cuda/lib64/
lrwxrwxrwx. 1 1000 users       13 Jul 27 01:55 libcudnn.so -> libcudnn.so.5
lrwxrwxrwx. 1 1000 users       17 Jul 27 01:55 libcudnn.so.5 -> libcudnn.so.5.1.5
-rwxrwxr-x. 1 1000 users 79337624 Jul 27 01:53 libcudnn.so.5.1.5
-rw-rw-r--. 1 1000 users 69756172 Jul 27 01:53 libcudnn_static.a

Possibly related issues: #2117 #1450 #2810 #665 #1060 #4425

drpngx added the bug label Oct 26, 2016
drpngx (Contributor) commented Oct 26, 2016

Could you help narrow down the error with cuda-memcheck from NVIDIA?

To make it easier to trace, try setting CUDA_LAUNCH_BLOCKING=1 and running with --brain_gpu_sync_every_op.
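
For example, assuming the reproduction script is repro.py (a placeholder name; substitute whatever you use to trigger the error), something along the lines of:

CUDA_LAUNCH_BLOCKING=1 cuda-memcheck python repro.py

should make the kernel launches synchronous and let memcheck point at the offending kernel.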

MInner (Author) commented Oct 26, 2016

@drpngx memcheck with CUDA_LAUNCH_BLOCKING=1. Do I have to rebuild TensorFlow from source to set --brain_gpu_sync_every_op, or is there an easier way?

drpngx (Contributor) commented Oct 27, 2016

Whoops, you're right. It looks like you have to tweak gpu_device_factory.cc, but from the trace it's clear that it's in the Eigen ReduceInitKernel, in the dense softmax.

@zheng-xq for any additional insight. We don't have access to the data, but we do have access to the source code. If this is an actual bug in TensorFlow, it would be great to get to the bottom of it.

zheng-xq (Contributor) commented:

To sync after each op, you'll have to build from source by modifying this line.

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/gpu/gpu_device_factory.cc#L34

If you still have this problem with the latest driver, then we will need to know which kernel causes it. Combining that with memcheck and CUDA_LAUNCH_BLOCKING=1 tends to give the best answer.

drpngx (Contributor) commented Oct 27, 2016

@zheng-xq the memcheck above says it's in ReduceInitKernel from the Eigen softmax. I wonder if there are diagnostics we could print (dimensions, etc.) of the buffers vs. the expected op input.

MInner (Author) commented Oct 27, 2016

Here's the TF-related part of the code above, in case it helps with understanding the device placement.

drpngx (Contributor) commented Oct 27, 2016

Would it be possible for you to create a repro that we could try locally?

drpngx (Contributor) commented Oct 28, 2016

If you had a build with debug on, then we might be able to see what the intermediate stack frames are. ReductionInitKernel appears to have been passed nullptr + an offset (maybe 300 or so), so the output pointer is presumably pulled from a struct that has about 300 bytes worth of members before it, and that struct instance pointer is nullptr.

drpngx (Contributor) commented Oct 28, 2016

@benoitsteiner if you're interested.

drpngx added the stat:awaiting response (Status - Awaiting response from author) label Oct 28, 2016
benoitsteiner (Contributor) commented:

@MInner Can you try with the release candidate for 0.11? There was a bug in 0.10 that could explain your CUDA_ERROR_ILLEGAL_ADDRESS error; it has since been fixed.

benoitsteiner self-assigned this Oct 28, 2016
aselle added the type:bug label and removed the bug label Feb 9, 2017
2sin18 (Contributor) commented Apr 7, 2017

@benoitsteiner Which commit fixed this bug? I encountered the same error in TensorFlow 1.0.0 and am trying to reproduce it now.

MInner (Author) commented Apr 12, 2017

I also still encounter this problem when using large enough models with google/seq2seq during evaluation, on the same hardware as mentioned at the top of the issue.

drpngx (Contributor) commented Apr 13, 2017

@MInner to be clear, you are getting CUDA_ERROR_ILLEGAL_ADDRESS?

drpngx reopened this Apr 13, 2017
aselle removed the stat:awaiting response (Status - Awaiting response from author) label Apr 13, 2017
MInner (Author) commented Apr 13, 2017

@drpngx yes, along with other random errors, like shape mismatches in the middle of training. People in the seq2seq issue tracker (referenced above) suggest that this might be related to a race condition during concurrent training and evaluation in contrib.learn.experiment.train_and_evaluate. I also sometimes observe weird errors (not related to memory allocation) when I run two processes concurrently even on different GPUs, and often when I run two processes on the same GPU.

drpngx (Contributor) commented Apr 13, 2017 via email

MInner (Author) commented Apr 13, 2017

I meant this comment.

drpngx (Contributor) commented Apr 13, 2017

Thanks! Closing this one in favor of the other one, so that we just have one place to track it.

drpngx closed this as completed Apr 13, 2017
zhangjunhust commented:

Have you figured it out?
