
Error during training: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1 #45658

Closed · sergorl opened this issue Dec 14, 2020 · 8 comments
Labels: comp:gpu (GPU related issues), stale, stat:awaiting response, TF 2.1, type:support

sergorl commented Dec 14, 2020


System information

  • OS Platform and Distribution: Windows 10
  • TensorFlow installed from (source or binary): binary, installed from conda
  • TensorFlow version: 2.1.0
  • Python version: 3.7.9
  • CUDA/cuDNN version: cuda64_101, cudnn64_7
  • GPU model and memory: GeForce RTX 2080 Super, 8 GB


Describe the current behavior

While training a neural network, the run aborted on the 17th epoch with:

F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1

I reran training many times, and each run failed at a different epoch.

Describe the expected behavior

Training should run stably to completion, without crashing.

Standalone code to reproduce the issue

I deployed this repo: https://github.com/arthurflor23/handwritten-text-recognition

Other info / logs

This code:

import tensorflow as tf

sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True))

gives me:

2020-12-14 19:14:24.943891: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-12-14 19:14:26.611932: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2020-12-14 19:14:26.614457: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-12-14 19:14:26.644016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Super computeCapability: 7.5
coreClock: 1.56GHz coreCount: 48 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-12-14 19:14:26.644168: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-12-14 19:14:26.647233: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-12-14 19:14:26.649912: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-12-14 19:14:26.651042: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-12-14 19:14:26.654689: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-12-14 19:14:26.656359: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-12-14 19:14:26.662690: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-12-14 19:14:26.662820: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-12-14 19:14:27.098083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-14 19:14:27.098175: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-12-14 19:14:27.098252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2020-12-14 19:14:27.098438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6265 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Super, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-12-14 19:14:27.101774: I tensorflow/core/common_runtime/direct_session.cc:358] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce RTX 2080 Super, pci bus id: 0000:01:00.0, compute capability: 7.5
(tf_gpu) D:\repositories\handwritten-text-recognition\src>python main.py --source=bentham --train
Weights are from ..\output\bentham\flor\checkpoint_weights.hdf5
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input (InputLayer)           [(None, 1024, 128, 1)]    0
_________________________________________________________________
conv2d (Conv2D)              (None, 1024, 64, 16)      160
_________________________________________________________________
p_re_lu (PReLU)              (None, 1024, 64, 16)      16
_________________________________________________________________
batch_normalization (BatchNo (None, 1024, 64, 16)      112
_________________________________________________________________
full_gated_conv2d (FullGated (None, 1024, 64, 16)      4640
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 1024, 64, 32)      4640
_________________________________________________________________
p_re_lu_1 (PReLU)            (None, 1024, 64, 32)      32
_________________________________________________________________
batch_normalization_1 (Batch (None, 1024, 64, 32)      224
_________________________________________________________________
full_gated_conv2d_1 (FullGat (None, 1024, 64, 32)      18496
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 512, 16, 40)       10280
_________________________________________________________________
p_re_lu_2 (PReLU)            (None, 512, 16, 40)       40
_________________________________________________________________
batch_normalization_2 (Batch (None, 512, 16, 40)       280
_________________________________________________________________
full_gated_conv2d_2 (FullGat (None, 512, 16, 40)       28880
_________________________________________________________________
dropout (Dropout)            (None, 512, 16, 40)       0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 512, 16, 48)       17328
_________________________________________________________________
p_re_lu_3 (PReLU)            (None, 512, 16, 48)       48
_________________________________________________________________
batch_normalization_3 (Batch (None, 512, 16, 48)       336
_________________________________________________________________
full_gated_conv2d_3 (FullGat (None, 512, 16, 48)       41568
_________________________________________________________________
dropout_1 (Dropout)          (None, 512, 16, 48)       0
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 256, 4, 56)        21560
_________________________________________________________________
p_re_lu_4 (PReLU)            (None, 256, 4, 56)        56
_________________________________________________________________
batch_normalization_4 (Batch (None, 256, 4, 56)        392
_________________________________________________________________
full_gated_conv2d_4 (FullGat (None, 256, 4, 56)        56560
_________________________________________________________________
dropout_2 (Dropout)          (None, 256, 4, 56)        0
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 256, 4, 64)        32320
_________________________________________________________________
p_re_lu_5 (PReLU)            (None, 256, 4, 64)        64
_________________________________________________________________
batch_normalization_5 (Batch (None, 256, 4, 64)        448
_________________________________________________________________
reshape (Reshape)            (None, 256, 256)          0
_________________________________________________________________
bidirectional (Bidirectional (None, 256, 256)          296448
_________________________________________________________________
dense (Dense)                (None, 256, 256)          65792
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256, 256)          296448
_________________________________________________________________
dense_1 (Dense)              (None, 256, 98)           25186
=================================================================
Total params: 922,354
Trainable params: 921,074
Non-trainable params: 1,280
_________________________________________________________________
Train for 1101 steps, validate for 172 steps
Epoch 1/1000
1100/1101 [============================>.] - ETA: 0s - loss: 19.5034
Epoch 00001: val_loss improved from inf to 18.13556, saving model to ..\output\bentham\flor\checkpoint_weights.hdf5
1101/1101 [==============================] - 150s 136ms/step - loss: 19.5006 - val_loss: 18.1356
Epoch 2/1000
1100/1101 [============================>.] - ETA: 0s - loss: 18.7811
Epoch 00002: val_loss did not improve from 18.13556
1101/1101 [==============================] - 140s 127ms/step - loss: 18.7732 - val_loss: 18.9815
Epoch 3/1000
1100/1101 [============================>.] - ETA: 0s - loss: 17.4834
Epoch 00003: val_loss did not improve from 18.13556
1101/1101 [==============================] - 140s 127ms/step - loss: 17.4750 - val_loss: 18.3697
Epoch 4/1000
1100/1101 [============================>.] - ETA: 0s - loss: 16.9503
Epoch 00004: val_loss improved from 18.13556 to 17.28087, saving model to ..\output\bentham\flor\checkpoint_weights.hdf5
1101/1101 [==============================] - 140s 127ms/step - loss: 16.9409 - val_loss: 17.2809
Epoch 5/1000
1100/1101 [============================>.] - ETA: 0s - loss: 16.1360
Epoch 00005: val_loss improved from 17.28087 to 16.63544, saving model to ..\output\bentham\flor\checkpoint_weights.hdf5
1101/1101 [==============================] - 139s 126ms/step - loss: 16.1276 - val_loss: 16.6354
Epoch 6/1000
1100/1101 [============================>.] - ETA: 0s - loss: 15.7264
Epoch 00006: val_loss improved from 16.63544 to 16.15779, saving model to ..\output\bentham\flor\checkpoint_weights.hdf5
1101/1101 [==============================] - 140s 128ms/step - loss: 15.7176 - val_loss: 16.1578
Epoch 7/1000
1100/1101 [============================>.] - ETA: 0s - loss: 15.0694
Epoch 00007: val_loss improved from 16.15779 to 15.39602, saving model to ..\output\bentham\flor\checkpoint_weights.hdf5
1101/1101 [==============================] - 139s 127ms/step - loss: 15.0607 - val_loss: 15.3960
Epoch 8/1000
1100/1101 [============================>.] - ETA: 0s - loss: 14.6364
Epoch 00008: val_loss improved from 15.39602 to 15.06812, saving model to ..\output\bentham\flor\checkpoint_weights.hdf5
1101/1101 [==============================] - 139s 126ms/step - loss: 14.6277 - val_loss: 15.0681
Epoch 9/1000
1100/1101 [============================>.] - ETA: 0s - loss: 14.4449
Epoch 00009: val_loss improved from 15.06812 to 15.01459, saving model to ..\output\bentham\flor\checkpoint_weights.hdf5
1101/1101 [==============================] - 139s 127ms/step - loss: 14.4367 - val_loss: 15.0146
Epoch 10/1000
1100/1101 [============================>.] - ETA: 0s - loss: 14.1694
Epoch 00010: val_loss improved from 15.01459 to 14.35110, saving model to ..\output\bentham\flor\checkpoint_weights.hdf5
1101/1101 [==============================] - 139s 127ms/step - loss: 14.1645 - val_loss: 14.3511
Epoch 11/1000
1100/1101 [============================>.] - ETA: 0s - loss: 13.7056
Epoch 00011: val_loss improved from 14.35110 to 13.85971, saving model to ..\output\bentham\flor\checkpoint_weights.hdf5
1101/1101 [==============================] - 139s 126ms/step - loss: 13.6979 - val_loss: 13.8597
Epoch 12/1000
1100/1101 [============================>.] - ETA: 0s - loss: 13.3614
Epoch 00012: val_loss did not improve from 13.85971
1101/1101 [==============================] - 140s 127ms/step - loss: 13.3553 - val_loss: 13.9131
Epoch 13/1000
1100/1101 [============================>.] - ETA: 0s - loss: 13.0623
Epoch 00013: val_loss improved from 13.85971 to 13.21627, saving model to ..\output\bentham\flor\checkpoint_weights.hdf5
1101/1101 [==============================] - 140s 127ms/step - loss: 13.0562 - val_loss: 13.2163
Epoch 14/1000
1100/1101 [============================>.] - ETA: 0s - loss: 12.9299
Epoch 00014: val_loss did not improve from 13.21627
1101/1101 [==============================] - 141s 128ms/step - loss: 12.9227 - val_loss: 13.3021
Epoch 15/1000
1100/1101 [============================>.] - ETA: 0s - loss: 12.6765
Epoch 00015: val_loss improved from 13.21627 to 13.18161, saving model to ..\output\bentham\flor\checkpoint_weights.hdf5
1101/1101 [==============================] - 141s 128ms/step - loss: 12.6724 - val_loss: 13.1816
Epoch 16/1000
1100/1101 [============================>.] - ETA: 0s - loss: 12.4314
Epoch 00016: val_loss improved from 13.18161 to 13.12220, saving model to ..\output\bentham\flor\checkpoint_weights.hdf5
1101/1101 [==============================] - 142s 129ms/step - loss: 12.4244 - val_loss: 13.1222
Epoch 17/1000
 130/1101 [==>...........................] - ETA: 1:56 - loss: 10.35982020-12-14 17:54:37.809588: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
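
Fatal gpu_event_mgr aborts like this one are often memory- or driver-related. One commonly tried mitigation, not verified for this particular crash, is enabling GPU memory growth so TensorFlow allocates device memory on demand instead of reserving nearly all of it up front. A minimal sketch using the standard TF 2.x API:

import tensorflow as tf

# Ask TensorFlow to grow GPU memory on demand instead of pre-allocating it.
# This must run before any op touches the GPU. It is a generic mitigation
# for GPU memory pressure, not a confirmed fix for this particular abort.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)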
amahendrakar (Contributor) commented

@sergorl,
Looking at similar issues #33536 and #43914, the error appears to be caused by the NVIDIA drivers.

Could you please update TensorFlow, CUDA, cuDNN, and the NVIDIA drivers to the latest versions as per the installation guide and check whether the issue persists. Thanks!
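
After upgrading, it helps to confirm which versions were actually picked up. A small sketch for reporting them (note: tf.sysconfig.get_build_info() only exists in TF 2.3 and later, so on older builds like TF 2.1 the cudart/cudnn DLL names in the startup log are the fallback):

import tensorflow as tf

print("TF version:", tf.__version__)

# Build-time CUDA/cuDNN versions; get_build_info() was added in TF 2.3,
# so guard it for older installs such as TF 2.1.
try:
    info = tf.sysconfig.get_build_info()
    print("CUDA:", info.get("cuda_version"), "cuDNN:", info.get("cudnn_version"))
except AttributeError:
    print("get_build_info() unavailable; check the DLL versions in the startup log")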


sergorl commented Dec 15, 2020

@amahendrakar, I did as you suggested and updated CUDA, cuDNN, and the GPU driver. But now everything gets stuck: there is no CUDA activity, yet GPU memory is still being consumed, and training freezes on the first iteration.

[screenshot attached: error]

amahendrakar (Contributor) commented

I did as you suggested and updated CUDA, cuDNN, and the GPU driver

@sergorl,
Could you please specify the TensorFlow, CUDA, and cuDNN versions you are using now?

Also, please run the below code snippet and share the output with us.

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Thanks!
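
As a side note, TF also ships a non-experimental spelling of this call, which later releases prefer:

import tensorflow as tf

# Non-experimental equivalent of the snippet above (available since TF 2.1):
print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))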

google-ml-butler bot commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler bot commented

Closing as stale. Please reopen if you'd like to work on this further.



anidh commented Mar 25, 2022

Hi @sergorl,
I've been facing a similar issue that happens randomly. Were you able to fix the problem?

Thanks,
Anidh Singh


animesh-wynk commented Nov 4, 2022

Hi @sergorl, @anidh,
I've been facing a similar issue that happens randomly. Were you able to fix the problem?
Thanks,
Animesh
