failed to query event: CUDA_ERROR_LAUNCH_FAILED #33536

Closed
CryMasK opened this issue Oct 19, 2019 · 5 comments
Labels: comp:gpu, stat:awaiting response, TF 2.0, type:support

Comments

CryMasK commented Oct 19, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Windows 10 Enterprise, 64bit (1903)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
    NA
  • TensorFlow installed from (source or binary):
    binary
  • TensorFlow version (use command below):
    tensorflow-gpu-2.0
  • Python version:
    3.6.8
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
    10 / 7.6.0 (also have tried 7.6.1, 7.6.2, 7.6.3, 7.6.4)
  • GPU model and memory:
    2 * RTX 2080 8G

Describe the current behavior

2019-10-20 01:32:26.104969: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-10-20 01:32:26.110674: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1

This error occurs at random during training.
Sometimes it appears in the first epoch, sometimes after a few epochs.

When the error appears, Windows also shows a "python has stopped working" dialog.

I ran the same code on Google Colab and it worked without problems.
I also reinstalled Python, TensorFlow, CUDA, cuDNN, and the GPU driver; nothing helped.
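
For anyone scanning a long training log for the same failure, here is a minimal sketch that splits lines like the two above into fields and picks out the first fatal (`F`) entry. The line layout is an assumption inferred only from those two lines, and `parse_tf_log_line` and `first_fatal` are hypothetical helper names, not part of TensorFlow.

```python
import re

# Assumed layout, inferred from the two log lines in this report:
# "<date> <time>: <severity letter> <source file>:<line>] <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<severity>[IWEF]) "
    r"(?P<source>[^\s:]+):(?P<line>\d+)\] "
    r"(?P<message>.*)$"
)


def parse_tf_log_line(text):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])
    return fields


def first_fatal(lines):
    """Return the parsed first line with severity F (fatal), or None."""
    for text in lines:
        parsed = parse_tf_log_line(text)
        if parsed is not None and parsed["severity"] == "F":
            return parsed
    return None


# The two lines quoted in this report.
sample = [
    "2019-10-20 01:32:26.104969: E tensorflow/stream_executor/cuda/cuda_event.cc:29] "
    "Error polling for event status: failed to query event: "
    "CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure",
    "2019-10-20 01:32:26.110674: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] "
    "Unexpected Event status: 1",
]
fatal = first_fatal(sample)
print(fatal["source"], fatal["line"], fatal["message"])
```

The `E` line carries the CUDA error itself; the `F` line that follows is the one that aborts the process, which matches the crash dialog described above.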

Code to reproduce the issue

My dataset has 353 samples, and all samples are padded to the same length (about 100000 time steps).
The model is just a simple LSTM:

import os

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Masking, TimeDistributed, LSTM, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras import backend as K

# Not shown in this report: K_precision and K_recall (custom metric functions),
# padding_x / padding_y (padded inputs and labels), and w (per-timestep sample weights).

DUMMY_VALUE = -1.0  # padding value that the Masking layer skips

model = Sequential()
model.add(Masking(mask_value=DUMMY_VALUE, input_shape=(None, 100)))
model.add(Bidirectional(LSTM(100, return_sequences=True, implementation=1)))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy', K_precision, K_recall],
              sample_weight_mode='temporal')
model.summary()

modelName = 'test'
# The checkpoint directory must exist before the first .h5 file is written.
os.makedirs('./model_checkpoints', exist_ok=True)
checkpoint = ModelCheckpoint(
    filepath='./model_checkpoints/{epoch:02d}-{val_loss:.4f}_' + modelName + '.h5',
    verbose=1, save_best_only=True, mode='min')
es = EarlyStopping(monitor='val_loss', mode='min', patience=10, verbose=1)

histories = []
histories.append(model.fit(padding_x, padding_y, epochs=30, batch_size=2,
                           validation_split=0.1, callbacks=[es, checkpoint],
                           sample_weight=w))
model.save(modelName + '.h5')

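
To put the data size in context, here is a rough back-of-the-envelope sketch of how large the padded input is. The sample count, sequence length, feature count and batch size come from the report; storing the inputs as float32 (4 bytes per value) is an assumption, and `tensor_bytes` is a hypothetical helper.

```python
# Figures from the report: 353 samples, sequences padded to about 100000 steps,
# 100 features per step (input_shape=(None, 100)), batch_size=2.
# Assumption: inputs are stored as float32, i.e. 4 bytes per value.
NUM_SAMPLES = 353
SEQ_LEN = 100_000
NUM_FEATURES = 100
BATCH_SIZE = 2
BYTES_PER_VALUE = 4


def tensor_bytes(samples, seq_len, features, bytes_per_value=BYTES_PER_VALUE):
    """Size in bytes of a dense (samples, seq_len, features) array."""
    return samples * seq_len * features * bytes_per_value


dataset_bytes = tensor_bytes(NUM_SAMPLES, SEQ_LEN, NUM_FEATURES)
batch_bytes = tensor_bytes(BATCH_SIZE, SEQ_LEN, NUM_FEATURES)

print(f"full input array: {dataset_bytes / 1e9:.2f} GB")  # 14.12 GB
print(f"one input batch:  {batch_bytes / 1e6:.0f} MB")    # 80 MB
```

This counts only the raw input. The per-timestep states that backpropagation through a 100000-step LSTM has to keep are not included and are likely much larger than a single input batch.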
@rohit-s-shinde

This issue may be related to the drivers. Can you try reinstalling CUDA and check again?

gadagashwini-zz self-assigned this Oct 21, 2019
gadagashwini-zz added the "TF 2.0" label Oct 21, 2019
@gadagashwini-zz
Contributor

@CryMasK, please downgrade cuDNN to 7.4 and try again. Let us know if the issue still persists. Thanks!

gadagashwini-zz added the "stat:awaiting response" label Oct 21, 2019
Author

CryMasK commented Oct 21, 2019

> @CryMasK, please downgrade cuDNN to 7.4 and try again. Let us know if the issue still persists. Thanks!

The tensorflow-gpu 2.0 binary does not run with cuDNN 7.4:
2019-10-21 21:11:07.554798: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.4.1 but source was compiled with: 7.6.0. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
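
The rule quoted in that message can be sketched as a small check. This is one reading of the sentence in the log, not TensorFlow's actual implementation; in particular, treating versions before 7.0 as requiring an exact major.minor match is an assumption, and `cudnn_runtime_ok` is a hypothetical helper name.

```python
def parse_version(text):
    """Split a version such as '7.4.1' into a tuple of ints: (7, 4, 1)."""
    return tuple(int(part) for part in text.split("."))


def cudnn_runtime_ok(loaded, compiled):
    """Apply the compatibility rule as worded in the error message.

    Major versions must match. From cuDNN 7.0 on, the loaded minor version may
    be equal to or higher than the compiled one. Before 7.0 this sketch assumes
    the minor version must match exactly.
    """
    loaded_major, loaded_minor = parse_version(loaded)[:2]
    compiled_major, compiled_minor = parse_version(compiled)[:2]
    if loaded_major != compiled_major:
        return False
    if compiled_major >= 7:
        return loaded_minor >= compiled_minor
    return loaded_minor == compiled_minor


# Versions taken from this thread.
print(cudnn_runtime_ok("7.4.1", "7.6.0"))  # False: the downgrade that was suggested
print(cudnn_runtime_ok("7.6.4", "7.6.0"))  # True: one of the versions already tried
```

Under this reading, every 7.6.x release listed in the system information satisfies the check, while 7.4.1 does not, which is consistent with the error above.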

After a week of testing, I found the problem.
I think there is a compatibility issue with the NVIDIA 436.** driver.
After I downgraded the driver to 431.60, training works well.
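
To check which driver a machine is running before or after such a downgrade, something like the following sketch may help. It assumes `nvidia-smi` is on `PATH` and returns `None` otherwise. The set of suspect branches reflects only the observation in this thread, not a confirmed list of bad drivers, and the helper names are hypothetical.

```python
import shutil
import subprocess

# Driver branch suspected in this thread (436.**). This is the reporter's
# observation, not a confirmed list of incompatible drivers.
SUSPECT_BRANCHES = {436}


def driver_branch(version):
    """Return the branch number of a driver version string, e.g. '431.60' -> 431."""
    return int(version.strip().split(".")[0])


def is_suspect(version):
    """True if the version belongs to a branch listed in SUSPECT_BRANCHES."""
    return driver_branch(version) in SUSPECT_BRANCHES


def query_driver_version():
    """Ask nvidia-smi for the installed driver version; None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=10,
            check=True,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    lines = result.stdout.strip().splitlines()
    # With several GPUs there is one line per GPU; the driver is shared.
    return lines[0].strip() if lines else None


version = query_driver_version()
if version is None:
    print("nvidia-smi not available; cannot read the driver version")
else:
    print(version, "suspect" if is_suspect(version) else "not in the suspect branch")

print(is_suspect("431.60"))  # False: the version that worked in this thread
```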

In case it helps anyone, here are other problems I saw with the newer driver:

  1. When I train an LSTM model with slightly larger input dimensions, it throws an error after a certain number of epochs, even though the GPU has enough memory. For example:

internalerror: [_derived_] failed to call thenrnnbackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 300, 300, 1, 2000, 2, 300 ...

  2. If I load a trained LSTM model from an .h5 file and then call .predict() or .fit(), it causes an OOM error.

@wantalgh

I also ran into this problem. In my case it was caused by a mismatch between the graphics driver version and the CUDA version on my computer. I reran the CUDA installer and used the driver version bundled with the package, and the problem went away.

yuanhzha commented Nov 30, 2020

@CryMasK Do you have more thoughts on this problem now? Can you answer my question at: https://stackoverflow.com/questions/65067397/error-polling-for-event-status-failed-to-query-event-cuda-error-launch-failed
