"Status: device kernel image is invalid" with A100 #47563

Closed
cgebbe opened this issue Mar 4, 2021 · 4 comments

Comments

cgebbe commented Mar 4, 2021

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS7
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): binary (pip)
  • TensorFlow version (use command below): 2.3.0
  • Python version: 3.6.3
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 11.2 (also have 10.1 installed)
  • GPU model and memory: A100-SXM4-40GB

Describe the current behavior

import tensorflow as tf
tf.constant(0)

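yields the error "Status: device kernel image is invalid". This issue mentions that tensorflow==2.3 no longer supports some older GPUs, but that shouldn't apply here, since the A100 is a recent GPU. Could this be a CUDA library mix-up? Note that the log below loads libcudart.so.10.1, although CUDA 11.2 is also installed.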
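One way to probe the library mix-up hypothesis is to ask the installed package which CUDA version it was built against. The following is a minimal sketch, not a confirmed diagnosis: it assumes `tf.sysconfig.get_build_info()` is available (it was added in TF 2.3) and that the returned mapping carries `cuda_version` and `cudnn_version` keys, which may differ between releases. `summarize_build` is a hypothetical helper introduced only for illustration.

```python
def summarize_build(info):
    """Format the CUDA/cuDNN versions found in a build-info mapping.

    Keys are read with .get() because the exact key set is an assumption
    and may differ between TensorFlow releases.
    """
    cuda = info.get("cuda_version", "unknown")
    cudnn = info.get("cudnn_version", "unknown")
    return f"built against CUDA {cuda}, cuDNN {cudnn}"


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment")
    else:
        # Importing TensorFlow does not initialize the GPU, so this runs
        # even when creating a tensor would fail.
        print(tf.__version__, summarize_build(tf.sysconfig.get_build_info()))
```

If the reported CUDA version is older than the toolkit the GPU needs, the package itself, rather than the local CUDA installation, is the likely place to look.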

Detailed log

Python 3.6.3 (default, Mar 20 2018, 13:50:41)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2021-03-04 14:19:04.766920: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>>> x = tf.constant(0)
2021-03-04 14:19:31.998272: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-03-04 14:19:32.186620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-03-04 14:19:32.188740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties:
pciBusID: 0000:41:00.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-03-04 14:19:32.190843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 2 with properties:
pciBusID: 0000:81:00.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-03-04 14:19:32.192913: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 3 with properties:
pciBusID: 0000:c1:00.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-03-04 14:19:32.192970: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-03-04 14:19:32.195229: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-03-04 14:19:32.196491: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-03-04 14:19:32.196916: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-03-04 14:19:32.198940: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-03-04 14:19:32.200061: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-03-04 14:19:32.204798: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-03-04 14:19:32.221788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1, 2, 3
2021-03-04 14:19:32.222380: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-04 14:19:32.238093: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3200000000 Hz
2021-03-04 14:19:32.246203: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x484daa0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-04 14:19:32.246307: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-03-04 14:19:32.624138: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x40cc380 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-03-04 14:19:32.624199: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): A100-SXM4-40GB, Compute Capability 8.0
2021-03-04 14:19:32.624218: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): A100-SXM4-40GB, Compute Capability 8.0
2021-03-04 14:19:32.624235: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): A100-SXM4-40GB, Compute Capability 8.0
2021-03-04 14:19:32.624286: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (3): A100-SXM4-40GB, Compute Capability 8.0
2021-03-04 14:19:32.633961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-03-04 14:19:32.636050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties:
pciBusID: 0000:41:00.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-03-04 14:19:32.638161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 2 with properties:
pciBusID: 0000:81:00.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-03-04 14:19:32.640212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 3 with properties:
pciBusID: 0000:c1:00.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-03-04 14:19:32.640284: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-03-04 14:19:32.640334: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-03-04 14:19:32.640364: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-03-04 14:19:32.640394: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-03-04 14:19:32.640421: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-03-04 14:19:32.640449: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-03-04 14:19:32.640476: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-03-04 14:19:32.656584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1, 2, 3
2021-03-04 14:19:32.656644: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 264, in constant
    allow_broadcast=True)
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 275, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 300, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 97, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/opt/app-root/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 539, in ensure_initialized
    context_handle = pywrap_tfe.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid
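
The log shows a CUDA 10.1 runtime being loaded for devices with compute capability 8.0. Support for compute capability 8.0 (Ampere) first appeared in CUDA 11.0, so this pairing is a plausible cause of the error, although nothing in this thread confirms it. The sketch below extracts both values from log text and flags the mismatch. `MIN_CUDA_MAJOR`, `SAMPLE_LOG`, and `find_mismatches` are hypothetical names, and the table lists only the one entry needed here.

```python
import re

# Lowest CUDA major version able to target a given compute capability.
# Only the entry relevant here is listed: 8.0 (Ampere) arrived with CUDA 11.0.
MIN_CUDA_MAJOR = {"8.0": 11}

# Two lines copied from the log above.
SAMPLE_LOG = """\
2021-03-04 14:19:04.766920: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
pciBusID: 0000:01:00.0 name: A100-SXM4-40GB computeCapability: 8.0
"""


def find_mismatches(log_text):
    """Return (capability, required_major, loaded_major) for each device
    whose compute capability needs a newer CUDA runtime than the one loaded."""
    runtime = re.search(r"libcudart\.so\.(\d+)\.(\d+)", log_text)
    if runtime is None:
        return []  # no CUDA runtime mentioned, nothing to compare
    loaded_major = int(runtime.group(1))
    caps = sorted(set(re.findall(r"computeCapability: (\d+\.\d+)", log_text)))
    return [
        (cap, MIN_CUDA_MAJOR[cap], loaded_major)
        for cap in caps
        if MIN_CUDA_MAJOR.get(cap, 0) > loaded_major
    ]


for cap, needed, loaded in find_mismatches(SAMPLE_LOG):
    print(f"compute capability {cap} needs CUDA >= {needed}, "
          f"but libcudart {loaded}.x was loaded")
```

If this is indeed the cause, a TensorFlow build against CUDA 11 would be the thing to try; the pip packages from TensorFlow 2.4 onward are built that way.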

@cgebbe cgebbe added the type:bug Bug label Mar 4, 2021
@Saduf2019 Saduf2019 added the TF 2.3 Issues related to TF 2.3 label Mar 4, 2021
Saduf2019 (Contributor) commented

@cgebbe
Could you please refer to these existing resolved issues and let us know whether any of them helps: #43911, #43701, #41990, #41132, #42428

@Saduf2019 Saduf2019 added the stat:awaiting response Status - Awaiting response from author label Mar 5, 2021
google-ml-butler commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Mar 12, 2021
google-ml-butler commented

Closing as stale. Please reopen if you'd like to work on this further.
