tf.nn.softmax on GPU causes CUDA_ERROR_ILLEGAL_ADDRESS (okay on CPU) #5221
Could you help narrow down the error? To make it easier to trace, probably setting
Whoops, you're right. It looks like you have to tweak it. @zheng-xq for any additional insight. We don't have access to the data, but we do have access to the source code. If this is an actual bug in TensorFlow, it would be great to get to the bottom of it.
To sync after each op, you'll have to build from source by modifying this line. If you still have the problem with the latest driver, then we will need to know which kernel causes it. Combining that with cuda-memcheck and CUDA_LAUNCH_BLOCKING=1 tends to give the best answer.
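A small sketch of how the synchronous-launch flag can be set from Python (the environment variable must be set before TensorFlow is imported, otherwise it has no effect; running the script under `cuda-memcheck python script.py` then attributes the error to a specific kernel):

```python
import os

# Force synchronous CUDA kernel launches so the failing kernel is the one
# reported, instead of a later op that merely observed the corrupted memory.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# TensorFlow must be imported *after* the variable is set for it to take effect:
# import tensorflow as tf
```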
@zheng-xq the memcheck above says it's in
Here's the TF-related part of the code above, in case it helps to understand device placement.
Would it be possible for you to create a repro that we could try locally?
If you had a build with debug on, then we might be able to see what the intermediate stack frames are.
@benoitsteiner if you're interested.
@MInner Can you try with the release candidate for 0.11? There was a bug in 0.10 that could explain your CUDA_ERROR_ILLEGAL_ADDRESS error, and it has been fixed since.
@benoitsteiner Which commit fixed this bug? I encountered the same error in TensorFlow 1.0.0 and am trying to reproduce it now.
I also still encounter this problem when using large enough models with
@MInner to be clear, you are getting
@drpngx yes, and other random errors like shape mismatches in the middle of training; people in the seq2seq issue tracker (referenced above) suggest that this might be related to a race condition during concurrent training and evaluation in contrib.learn.experiment.train_and_evaluate.
Please link to the issue on seq2seq.
I meant this comment.
Thanks! Closing this one in favor of the other one, so that we have just one place to track.
Have you figured it out?
Environment:
- CentOS 7
- TensorFlow 0.10.0
- TITAN X (Pascal), driver 367.44

I restore a model previously saved with tf.train.Saver() and try to compute probabilities of outputs for a given input batch. Whenever I try to execute tf.nn.softmax on GPU, I get an error, but the same computations work just fine on CPU. gdb does not give any additional details.

Surprisingly, this error occurred only recently. I had been working with this same codebase for months and nothing like this happened before, but now it is 100% reproducible on my machine with this specific setup (code, driver version, etc.), so it is not something that affects everyone; rather, it pops up randomly. I might have changed operation device placement recently.
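For reference on the CPU/GPU comparison, tf.nn.softmax computes a softmax along the last axis of the logits tensor; a minimal NumPy sketch of the same computation (a cross-check only, not the TensorFlow kernel itself) looks like this:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis (mirrors tf.nn.softmax semantics)."""
    # Subtract the per-row max so exp() never overflows on large logits.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])
probs = softmax(logits)
# Each row sums to 1; the second row would overflow without the max shift.
```

Running the restored model's logits through a reference like this on CPU can confirm that the GPU crash is in the kernel launch rather than in the values being fed.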
code (unfortunately, a large piece of code that requires certain data; I failed to produce a minimal reproducing example): [link]
full gdb output (without prints): [link]
Possibly related issues: #2117 #1450 #2810 #665 #1060 #4425