Failed to synchronize the stop event #14363

Closed
ljanyst opened this Issue Nov 8, 2017 · 25 comments

ljanyst commented Nov 8, 2017

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    Yes

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Linux Ubuntu 16.04

  • TensorFlow installed from (source or binary):
    source

  • TensorFlow version (use command below):
    b'v1.4.0-0-gd752244' 1.4.0

  • Python version:
    3.5.2

  • Bazel version (if compiling from source):
    0.7.0

  • GCC/Compiler version (if compiling from source):
    gcc (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609

  • CUDA/cuDNN version:
    9.0/7.0

  • GPU model and memory:
    Tesla V100-SXM2-16GB

  • Exact command to reproduce:

git clone https://github.com/ljanyst/image-segmentation-fcn.git
cd image-segmentation-fcn
# -O names the output file explicitly; otherwise the URL's query string ends
# up in the filename and the unzip step fails (the quotes also keep zsh from
# globbing the '?').
wget -O data_road.zip 'http://www.cvlibs.net/download.php?file=data_road.zip'
unzip data_road.zip
./train.py --data-dir data_road

Describe the problem

It seems like I am hitting some sort of CUDA/cuDNN synchronization/race issue. Please see the snippet in the next section for the exact error message. The problem only happens with the KITTI dataset; the exact same TensorFlow code works fine with the Cityscapes dataset. It also only happens on a Tesla V100: I tested the exact same software configuration on a Tesla K80 and a GeForce GTX 1080 Ti as well, and things work fine.

Source code / logs

2017-11-08 12:24:52.838039: E tensorflow/stream_executor/cuda/cuda_driver.cc:1080] failed to synchronize the stop event: CUDA_ERROR_ILLEGAL_ADDRESS
2017-11-08 12:24:52.838090: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Internal: error destroying CUDA event in context 0x51f18f0: CUDA_ERROR_ILLEGAL_ADDRESS
2017-11-08 12:24:52.838106: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Internal: error destroying CUDA event in context 0x51f18f0: CUDA_ERROR_ILLEGAL_ADDRESS
2017-11-08 12:24:52.838137: F tensorflow/stream_executor/cuda/cuda_dnn.cc:3218] failed to set stream for cudnn handle: CUDNN_STATUS_MAPPING_ERROR
zsh: abort (core dumped)  ./train.py --data-dir data_road
angersson (Member) commented Nov 8, 2017

@zheng-xq, can you take a look at this?

zheng-xq (Contributor) commented Nov 8, 2017

The synchronization error is only where the problem gets detected. The root cause is that some GPU kernel performed an illegal address access.

To root-cause this, the first step is to find the offending kernel. In our past experience, it could be either a kernel bug or a degenerate data entry.
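
One way to catch the offending kernel (a sketch, assuming the repro command above and the cuda-gdb that ships with the CUDA toolkit):

# Run the failing script under cuda-gdb; the illegal access then stops
# execution at the faulting kernel instead of surfacing later as a failed
# event synchronization.
cuda-gdb --args python3 ./train.py --data-dir data_road
(cuda-gdb) set cuda memcheck on    # report out-of-bounds accesses precisely
(cuda-gdb) run
# After the CUDA exception fires, list kernels; the faulting one is marked:
(cuda-gdb) info cuda kernels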

ljanyst commented Nov 9, 2017

Thanks for the hint @zheng-xq ! I have had a closer look, and the offending kernel seems to be:

CUDA Exception: Warp Out-of-range Address

Thread 28 "python" received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
[Switching focus to CUDA kernel 1994, grid 1995, block (0,0,0), thread (128,0,0), device 0, sm 0, warp 6, lane 0]
0x00007ffe7ac23a50 in volta_scudnn_128x128_stridedB_splitK_xregs_large_nn_v1_LOOP<<<(5,1,160),(256,1,1)>>> ()
(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status   SMs Mask   GridDim  BlockDim Invocation 
*   1994      -   0 1995 Active 0xffffffff (5,1,160) (256,1,1) .text.volta_scudnn_128x128_stridedB_splitK_xregs_large_nn_v1() 

I was unable to get any useful host-side stack trace because there appears to be something wrong with the DWARF symbols: Unable to access DWARF register number 83886081. I am not sure whether the problem is with the symbols in CUDA/cuDNN or in TensorFlow. Do you think that recompiling TensorFlow in debug mode will help? If so, how do I pass extra parameters to nvcc with Bazel?
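
For reference, a host-side debug build would look something like the following (a sketch using Bazel's standard options; whether extra flags actually reach nvcc depends on the CUDA crosstool):

# Rebuild TensorFlow with debug info and unstripped symbols so host-side
# stack traces are usable. -c dbg selects Bazel's debug compilation mode.
bazel build -c dbg --strip=never --config=cuda \
    //tensorflow/tools/pip_package:build_pip_package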

zheng-xq (Contributor) commented Nov 9, 2017

That seems to be a cuDNN bug. Adding the NVIDIA folks.

@benbarsdell, @nluehr, any insight into debugging this cuDNN kernel exception?

juliebernauer commented Nov 14, 2017

@ljanyst Let's talk offline so that I can have a look at this with debug info. Thanks.

juliebernauer commented Nov 16, 2017

@zheng-xq @ljanyst We have a repro and a fix. Rollout is planned for cuDNN 7.0.5 in mid-December.

ManuelaPa commented Nov 27, 2017

Hello,
I installed TensorFlow 1.4 with cuDNN 6 and CUDA 8.0.
I have the same problem, "cuda_event.cc:49 Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS", when I try to train with TensorFlow.
I have tried installing other versions, but I always hit the same issue, and on several computers, not only mine. Do you know what I have to do? Thanks

mholt commented Nov 29, 2017

Came here to report the exact same thing with our Volta, using the TensorFlow container on NVIDIA GPU Cloud. We will be happy to test the fix with cuDNN 7.0.5 and follow up. Please let us know if there are any other updates on this issue or if more information is needed.

RerRayne commented Dec 5, 2017

Same problem with WaveNet on V100:

2017-12-05 14:08:26.119341: E tensorflow/stream_executor/cuda/cuda_driver.cc:1080] failed to synchronize the stop event: CUDA_ERROR_ILLEGAL_ADDRESS
2017-12-05 14:08:26.119423: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Internal: error destroying CUDA event in context 0xbab66e0: CUDA_ERROR_ILLEGAL_ADDRESS
2017-12-05 14:08:26.119435: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Internal: error destroying CUDA event in context 0xbab66e0: CUDA_ERROR_ILLEGAL_ADDRESS
2017-12-05 14:08:26.119470: F tensorflow/stream_executor/cuda/cuda_dnn.cc:3218] failed to set stream for cudnn handle: CUDNN_STATUS_MAPPING_ERROR

Is there more precise timeline information on the fix? We would gladly try a beta version of cuDNN, if one exists.
Thank you!

tekumara commented Dec 11, 2017

Confirmed that cuDNN 7.0.5 from https://developer.nvidia.com/rdp/cudnn-download fixes this on the AWS p3.8xlarge (4 Volta GPUs).

juliebernauer commented Dec 11, 2017

@tekumara Glad to hear you can confirm the fix. @ManuelaPa @mholt @RerRayne @ljanyst This should be fixed by using cuDNN 7.0.5.

mholt commented Dec 13, 2017

After installing cuDNN 7.0.5, I am still seeing this error. :( I'm using the TensorFlow container from NVIDIA GPU Cloud. Does anyone know whether there are extra steps I need to take? I extracted the library files and moved them into place according to the installation instructions...

cliffwoolley commented Dec 13, 2017

@mholt - If you're using the NGC image for TensorFlow, you don't actually need to install cuDNN directly, as it's installed in the container for you already. The NGC frameworks release 17.12 includes cuDNN 7.0.5 and should fix this issue. Or are you saying you tried 17.12 and still see an issue?
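
For reference, pulling that release looks roughly like this (a sketch; assumes Docker with the nvidia-docker runtime and a login to the nvcr.io registry):

# The 17.12 NGC TensorFlow image bundles cuDNN 7.0.5, so no separate
# cuDNN install is needed inside the container.
docker pull nvcr.io/nvidia/tensorflow:17.12
nvidia-docker run -it --rm nvcr.io/nvidia/tensorflow:17.12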

mholt commented Dec 13, 2017

Ah, sorry, I mistakenly thought cuDNN was installed outside the container (but maybe that is actually CUDA) -- pulling and using the latest container fixed it. Looks like that patch did it. Thank you!

RerRayne commented Dec 14, 2017

@juliebernauer It works after updating. Thank you very much!

ljanyst commented Dec 14, 2017

Things work for me too now. Thanks @juliebernauer !

ManuelaPa commented Dec 15, 2017

@juliebernauer Thanks for the answer! I installed cuDNN 7.0.5, but it's not compatible with TensorFlow 1.4, which needs cudnn64_6.dll. Do you know what I should do?
Thank you

juliebernauer commented Dec 15, 2017

@ManuelaPa The developer website lists Win7 and Win10 versions for both CUDA 9.0 and CUDA 9.1, so one has to make sure to download and install the one needed, after first removing any previous versions on Windows (this is not necessary on Linux). Can you please try this? It should work for you. If not, may I suggest you file a ticket or report a bug on the NVIDIA developer website?

tensorflowbutler (Member) commented Jan 3, 2018

It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue? Please update the label and/or status accordingly.

shivaniag (Contributor) commented Jan 9, 2018

The update seems to fix the issue; closing this.

shivaniag closed this Jan 9, 2018

weiliu620 commented Jun 1, 2018

@juliebernauer So my understanding is that v9.1 has the fix, but I'm having the same error:

2018-06-01 09:49:37.379160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:06:00.0
totalMemory: 22.38GiB freeMemory: 22.21GiB
2018-06-01 09:49:37.379206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-06-01 09:49:37.673924: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-01 09:49:37.673988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0
2018-06-01 09:49:37.673996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N
2018-06-01 09:49:37.674543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21559 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:06:00.0, compute capability: 6.1)
2018-06-01 09:49:44.636341: E tensorflow/stream_executor/cuda/cuda_driver.cc:1080] failed to synchronize the stop event: CUDA_ERROR_ILLEGAL_ADDRESS
2018-06-01 09:49:44.636422: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Internal: error destroying CUDA event in context 0x6ab0fb0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-06-01 09:49:44.636433: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Internal: error destroying CUDA event in context 0x6ab0fb0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-06-01 09:49:44.636483: F tensorflow/stream_executor/cuda/cuda_dnn.cc:2328] failed to set stream for cudnn handle: CUDNN_STATUS_MAPPING_ERROR
Aborted

I'm on Red Hat EL 7, CUDA V9.1.85 (as reported by nvcc --version), and TF 1.7.

Do I need to upgrade to CUDA 9.2?

juliebernauer commented Jun 1, 2018

@weiliu620 You want to make sure you are indeed using the cuDNN version mentioned above. Upgrading CUDA won't change that by default (though it might get you to use a different directory). Cleaning up your LD_LIBRARY_PATH might help.
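
A quick way to check which cuDNN the dynamic loader actually picks up (a sketch; the header path assumes a default Linux install and may differ on your system):

# List every libcudnn visible to the dynamic loader, plus the search path in use.
ldconfig -p | grep cudnn
echo $LD_LIBRARY_PATH
# Read the installed version straight from the header:
grep -A 2 'define CUDNN_MAJOR' /usr/include/cudnn.h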

lcnature commented Jun 6, 2018

I still encounter the same problem as others reported, with CUDA 9.0 and cuDNN 7.0.2.

If I try cuDNN 7.1.2, I get a different error:

/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: cudnn PoolForward launch failed
         [[Node: AvgPool3D_15 = AvgPool3D[T=DT_FLOAT, data_format="NDHWC", ksize=[1, 2, 2, 2, 1], padding="SAME", strides=[1, 2, 2, 2, 1], _device="/job:localhost/replica:0/task:0/device:GPU:1"](ExpandDims_1)]]
         [[Node: mul_29/_23 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_47_mul_29", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'AvgPool3D_15', defined at:
 [hiding lines related to my own code]
  File "/home/xxx/.conda/envs/tf2/lib/python3.5/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 177, in avg_pool3d
    padding=padding, data_format=data_format, name=name)
  File "/home/xxx/.conda/envs/tf2/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/xxx/.conda/envs/tf2/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
  File "/home/xxx/.conda/envs/tf2/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): cudnn PoolForward launch failed
         [[Node: AvgPool3D_15 = AvgPool3D[T=DT_FLOAT, data_format="NDHWC", ksize=[1, 2, 2, 2, 1], padding="SAME", strides=[1, 2, 2, 2, 1], _device="/job:localhost/replica:0/task:0/device:GPU:1"](ExpandDims_1)]]
         [[Node: mul_29/_23 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_47_mul_29", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

This happens with multiple versions of TensorFlow I tried, from 1.5 to 1.7.

majthehero commented Oct 4, 2018

I have this problem on CUDA 10.0. I'm using TF 1.10.0, Keras 2.2.2, Windows 10, and an NVIDIA MX150 GPU.
Some NNs work with no problem; some fail.

christopher5106 commented Dec 14, 2018

I confirm that the cuDNN update solves the issue.
