failed to query event: CUDA_ERROR_LAUNCH_FAILED #33536

CryMasK · 2019-10-19T17:55:03Z

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
Windows 10 Enterprise, 64bit (1903)
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
NA
TensorFlow installed from (source or binary):
binary
TensorFlow version (use command below):
tensorflow-gpu-2.0
Python version:
3.6.8
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version:
10 / 7.6.0 (also have tried 7.6.1, 7.6.2, 7.6.3, 7.6.4)
GPU model and memory:
2 * RTX 2080 8G

Describe the current behavior

2019-10-20 01:32:26.104969: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-10-20 01:32:26.110674: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
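
For anyone comparing their own logs against the two lines above, here is a minimal sketch that splits such lines into timestamp, severity, source file, line number, and message. The regular expression is inferred only from the two lines quoted in this issue, and the names `LOG_LINE`, `parse_log_line`, and `SAMPLE` are made up for illustration; other TensorFlow log formats may not match.

```python
import re

# Layout inferred from the two log lines quoted in this issue (an assumption,
# not an official format description):
# "<date> <time>: <severity letter> <source file>:<line>] <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<severity>[IWEF]) "
    r"(?P<source>[^\s:]+):(?P<line>\d+)\] "
    r"(?P<message>.*)$"
)


def parse_log_line(text):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])
    return fields


# The two lines from this report, used as sample input.
SAMPLE = [
    "2019-10-20 01:32:26.104969: E tensorflow/stream_executor/cuda/cuda_event.cc:29] "
    "Error polling for event status: failed to query event: "
    "CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure",
    "2019-10-20 01:32:26.110674: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] "
    "Unexpected Event status: 1",
]

parsed = [parse_log_line(line) for line in SAMPLE]

# Keep only error (E) and fatal (F) entries.
problems = [p for p in parsed if p is not None and p["severity"] in ("E", "F")]
for p in problems:
    print(p["severity"], p["source"], p["line"], p["message"])
```

Here the `E` line carries the CUDA error name, and the `F` line is the one that coincides with the process stopping.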

The error occurs at random points during training, sometimes in the first epoch and sometimes after a few epochs.

When it occurs, a "python has stopped working" message also pops up.

I ran the same code on Google Colab, and it seems to work fine there.
I also reinstalled Python, TensorFlow, CUDA, cuDNN, and the GPU driver; nothing helped.

Code to reproduce the issue
My dataset has 353 samples, all padded to the same length (about 100000).
The model is just a simple LSTM:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Masking, TimeDistributed, LSTM, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras import backend as K

DUMMY_VALUE = -1.0

model = Sequential()
model.add(Masking(mask_value=DUMMY_VALUE, input_shape=(None, 100)))
model.add(Bidirectional(LSTM(100, return_sequences=True, implementation=1)))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
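# K_precision and K_recall are not defined in this snippet
# (presumably custom metric functions; their code is not shown in this post).
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy', K_precision, K_recall],
              sample_weight_mode='temporal')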
model.summary()

modelName = 'test'
checkpoint = ModelCheckpoint(filepath='./model_checkpoints/{epoch:02d}-{val_loss:.4f}_' + modelName + '.h5', verbose=1, save_best_only=True, mode='min')
es = EarlyStopping(monitor='val_loss', mode='min', patience=10, verbose=1)

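# padding_x, padding_y, and w (padded inputs, labels, and per-timestep
# sample weights) are not defined in this snippet.
histories = []
histories.append(model.fit(padding_x, padding_y, epochs=30, batch_size=2, validation_split=0.1, callbacks=[es, checkpoint], sample_weight=w))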
model.save(modelName + '.h5')
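
To give a rough sense of scale for the data described above, here is a back-of-the-envelope size estimate. It assumes the sequence length is exactly 100000 (the post only says "about") and that values are stored as float32, which the post does not state. The sample count, feature size, and batch size come from the description and the `input_shape` and `batch_size` arguments above. `tensor_bytes` is a hypothetical helper, and the numbers cover the raw inputs only, not LSTM activations or gradients.

```python
# Values from the post; SEQ_LENGTH and BYTES_PER_VALUE are assumptions.
NUM_SAMPLES = 353
SEQ_LENGTH = 100_000      # "about 100000" in the post
NUM_FEATURES = 100        # from input_shape=(None, 100)
BATCH_SIZE = 2            # from model.fit(..., batch_size=2)
BYTES_PER_VALUE = 4       # float32 (assumed)


def tensor_bytes(*shape, bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to store a dense tensor with the given shape."""
    total = bytes_per_value
    for dim in shape:
        total *= dim
    return total


# Whole padded input array, and the input of a single training batch.
dataset_bytes = tensor_bytes(NUM_SAMPLES, SEQ_LENGTH, NUM_FEATURES)
batch_bytes = tensor_bytes(BATCH_SIZE, SEQ_LENGTH, NUM_FEATURES)

print(f"padded inputs: {dataset_bytes / 1e9:.2f} GB")
print(f"one batch of inputs: {batch_bytes / 1e6:.0f} MB")
```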


rohit-s-shinde · 2019-10-20T16:50:45Z

This issue could be related to the drivers. Can you try reinstalling CUDA and check?

gadagashwini-zz · 2019-10-21T08:43:22Z

@CryMasK, downgrade the cuDNN version to 7.4 and try again. Let us know if the issue still persists. Thanks!

CryMasK · 2019-10-21T13:15:27Z

> @CryMasK, Downgrade the CuDNN version to 7.4 and try again. Let us know if still issue persists. Thanks!

tensorflow-gpu 2.0 does not run with cuDNN 7.4; the binary was compiled against 7.6.0:
2019-10-21 21:11:07.554798: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.4.1 but source was compiled with: 7.6.0. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
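
The rule stated in that message can be sketched as a small check. This is only my reading of the wording in the log line (same major version; for cuDNN 7.0 or later the loaded minor version may be equal or higher; otherwise the minor version must match), not TensorFlow's actual implementation. The names `VERSION_PAIR`, `parse_versions`, and `is_compatible` are made up for illustration.

```python
import re

# Extracts the two versions from the message quoted above.
VERSION_PAIR = re.compile(
    r"Loaded runtime CuDNN library: (\d+)\.(\d+)\.(\d+) "
    r"but source was compiled with: (\d+)\.(\d+)\.(\d+)"
)


def parse_versions(message):
    """Return ((loaded), (compiled)) version tuples, or None if not found."""
    match = VERSION_PAIR.search(message)
    if match is None:
        return None
    numbers = tuple(int(group) for group in match.groups())
    return numbers[:3], numbers[3:]


def is_compatible(loaded, compiled):
    """Apply the rule as worded in the log message (an interpretation)."""
    if loaded[0] != compiled[0]:
        return False                      # major versions must match
    if compiled[0] >= 7:
        return loaded[1] >= compiled[1]   # same or higher minor for 7.0+
    return loaded[1] == compiled[1]       # otherwise minor must match


message = (
    "Loaded runtime CuDNN library: 7.4.1 but source was compiled with: 7.6.0."
)
loaded, compiled = parse_versions(message)
print(loaded, compiled, is_compatible(loaded, compiled))
```

Under this reading, the 7.6.x versions listed in the system information pass the check against a 7.6.0 build, while 7.4.1 does not.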

After a week of testing, I found the problem.
I think there are compatibility issues with the Nvidia 436.** drivers.
I downgraded the driver to version 431.60, and it works well now.
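
To check which driver a machine is running, something like the sketch below could be used. It assumes `nvidia-smi` is on the PATH and returns an empty list if it is not. The helper names and the `SUSPECT_MAJOR` constant are hypothetical, and flagging the 436 series reflects only the experience reported in this thread, not a confirmed incompatibility.

```python
import subprocess

# Driver series that caused problems in this report (anecdotal, see above).
SUSPECT_MAJOR = 436


def driver_major(version):
    """Return the major part of a driver version string such as '431.60'."""
    return int(version.strip().split(".")[0])


def is_suspect_driver(version):
    """True if the version belongs to the series reported as problematic here."""
    return driver_major(version) == SUSPECT_MAJOR


def query_driver_versions():
    """Ask nvidia-smi for each GPU's driver version; [] if it is unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # text output; works on Python 3.6
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


for version in query_driver_versions():
    note = "series reported as problematic in this thread" if is_suspect_driver(version) else "ok"
    print(version, note)
```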

For reference, I also saw other problems with the newer driver:

When I train an LSTM model with somewhat larger input dimensions, it throws an error after a certain number of epochs, even though there is enough GPU memory.

For example:
internalerror: [_derived_] failed to call thenrnnbackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 300, 300, 1, 2000, 2, 300 ...

If I load a trained LSTM model from an .h5 file and then call .predict() or .fit(), it causes an OOM error.

wantalgh · 2020-09-14T05:40:00Z

I also ran into this problem. In my case it was caused by a mismatch between the graphics driver version and the CUDA version on my computer. I reinstalled CUDA from its installer and used the driver bundled with the package, and the problem went away.

yuanhzha · 2020-11-30T02:46:28Z

@CryMasK Do you have more thoughts on this problem now? Can you answer my question at: https://stackoverflow.com/questions/65067397/error-polling-for-event-status-failed-to-query-event-cuda-error-launch-failed

gadagashwini-zz self-assigned this Oct 21, 2019

gadagashwini-zz added the TF 2.0 Issues relating to TensorFlow 2.0 label Oct 21, 2019

gadagashwini-zz added the stat:awaiting response Status - Awaiting response from author label Oct 21, 2019

CryMasK closed this as completed Oct 21, 2019

gadagashwini-zz added comp:gpu GPU related issues type:support Support issues labels Oct 22, 2019

Saduf2019 mentioned this issue Mar 27, 2020

CUDNN_STATUS_INTERNAL_ERROR after hours of HyperParameter optimization. #37932

Closed

mmoussallam mentioned this issue May 6, 2020

[Bug] Crash after processing a few files deezer/spleeter#359

Closed

FLming mentioned this issue Jun 9, 2020

Hello, an error occurred during training; hoping for a reply FLming/CRNN.tf2#4

Closed

Saduf2019 mentioned this issue Oct 18, 2020

cuda_11.1.0_456.43_win10 + cudnn-11.1-windows-x64-v8.0.4.30 + master branch= some errors #44128

Closed

amahendrakar mentioned this issue Dec 15, 2020

Error during training: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1 #45658

Closed
