Can't find libdevice directory when training object detection #10673

@Perondas

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • [x] I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • [x] I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py

2. Describe the bug

While attempting to train a model I get the following errors:

Instructions for updating:
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
2022-06-16 08:42:21.581826: W tensorflow/core/framework/op_kernel.cc:1733] UNKNOWN: JIT compilation failed.
Traceback (most recent call last):
  File "E:\Git\models\research\object_detection\model_main_tf2.py", line 114, in <module>
    tf.compat.v1.app.run()
  File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\tensorflow\python\platform\app.py", line 36, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\absl\app.py", line 312, in run
    _run_main(main, args)
  File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\absl\app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "E:\Git\models\research\object_detection\model_main_tf2.py", line 105, in main
    model_lib_v2.train_loop(
  File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\object_detection\model_lib_v2.py", line 685, in train_loop
    losses_dict = _dist_train_step(train_input_iter)
  File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\tensorflow\python\eager\execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:

Detected at node 'train_input_images/write_summary/mod' defined at (most recent call last):
    File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\threading.py", line 930, in _bootstrap
      self._bootstrap_inner()
    File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\threading.py", line 973, in _bootstrap_inner
      self.run()
    File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\object_detection\model_lib_v2.py", line 629, in train_step_fn
      if record_summaries:
    File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\object_detection\model_lib_v2.py", line 630, in train_step_fn
      tf.compat.v2.summary.image(
    File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\tensorboard\plugins\image\summary_v2.py", line 141, in image
      tag=tag, tensor=lazy_tensor, step=step, metadata=summary_metadata
    File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\object_detection\model_lib_v2.py", line 599, in <lambda>
      lambda: global_step % num_steps_per_iteration == 0):
Node: 'train_input_images/write_summary/mod'
Detected at node 'train_input_images/write_summary/mod' defined at (most recent call last):
    File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\threading.py", line 930, in _bootstrap
      self._bootstrap_inner()
    File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\threading.py", line 973, in _bootstrap_inner
      self.run()
    File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\object_detection\model_lib_v2.py", line 629, in train_step_fn
      if record_summaries:
    File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\object_detection\model_lib_v2.py", line 630, in train_step_fn
      tf.compat.v2.summary.image(
    File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\tensorboard\plugins\image\summary_v2.py", line 141, in image
      tag=tag, tensor=lazy_tensor, step=step, metadata=summary_metadata
    File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\object_detection\model_lib_v2.py", line 599, in <lambda>
      lambda: global_step % num_steps_per_iteration == 0):
Node: 'train_input_images/write_summary/mod'
2 root error(s) found.
  (0) UNKNOWN:  JIT compilation failed.
         [[{{node train_input_images/write_summary/mod}}]]
         [[Identity_5/_494]]
         [[{{node train_input_images/write_summary/mod}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference__dist_train_step_67016]

3. Steps to reproduce

I am following this tutorial and copying the code from this Colab.

4. Expected behavior

The model should train with no errors.

5. Additional context

I deviated from the tutorial by installing protobuf 3.20.1, because otherwise I get the following error:

ImportError: cannot import name 'builder' from 'google.protobuf.internal' 

The libdevice error does not occur if I uninstall CUDA.
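
For anyone trying to reproduce this, the following is a minimal sketch (not part of the original report) of how to check whether the directory named in the error message exists. It assumes the toolkit root is exposed through the `CUDA_PATH` or `CUDA_DIR` environment variable; the helper names `libdevice_files` and `xla_flags_with_cuda_dir` are hypothetical. It also builds an `XLA_FLAGS` value using XLA's `--xla_gpu_cuda_data_dir` option, which may or may not resolve the error on a given setup.

```python
import os
from pathlib import Path


def libdevice_files(cuda_dir):
    """Return the names of libdevice bitcode files under <cuda_dir>/nvvm/libdevice."""
    libdevice_dir = Path(cuda_dir) / "nvvm" / "libdevice"
    if not libdevice_dir.is_dir():
        return []
    return sorted(p.name for p in libdevice_dir.glob("libdevice*.bc"))


def xla_flags_with_cuda_dir(cuda_dir, existing=""):
    """Append --xla_gpu_cuda_data_dir=<cuda_dir> to an existing XLA_FLAGS string."""
    flag = f"--xla_gpu_cuda_data_dir={cuda_dir}"
    return f"{existing} {flag}".strip()


if __name__ == "__main__":
    # Assumption: one of these variables points at the CUDA toolkit root.
    cuda_dir = os.environ.get("CUDA_PATH") or os.environ.get("CUDA_DIR")
    if cuda_dir is None:
        print("Neither CUDA_PATH nor CUDA_DIR is set")
    else:
        found = libdevice_files(cuda_dir)
        print(f"{cuda_dir}: {found or 'no libdevice*.bc files found'}")
        if found:
            # Only takes effect if set before TensorFlow is imported.
            os.environ["XLA_FLAGS"] = xla_flags_with_cuda_dir(
                cuda_dir, os.environ.get("XLA_FLAGS", "")
            )
```

The environment variable has to be set before `tensorflow` is imported, so this would need to run at the top of the training script or in the shell that launches it. A toolkit path containing spaces may need quoting, which this sketch does not handle.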

6. System information

  • Windows 10 Pro 21H1
  • TensorFlow installed via pip
  • TensorFlow version: 2.9.1
  • Python 3.9.7
  • CUDA: V11.7.64, cuDNN: 8.4.1
  • GPU model and memory: RTX 2070s, 8 GB

Labels: models:research, stat:awaiting response, type:bug
