Status: Closed
Labels: models:research, stat:awaiting response, type:bug
Description
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
- [x] I am reporting the issue to the correct repository. (Model Garden official or research directory)
- [x] I checked to make sure that this issue has not already been filed.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py
2. Describe the bug
While attempting to train a model, I get the following errors:
Instructions for updating:
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
(the line above is printed 12 times)
2022-06-16 08:42:21.581826: W tensorflow/core/framework/op_kernel.cc:1733] UNKNOWN: JIT compilation failed.
Traceback (most recent call last):
File "E:\Git\models\research\object_detection\model_main_tf2.py", line 114, in <module>
tf.compat.v1.app.run()
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\tensorflow\python\platform\app.py", line 36, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "E:\Git\models\research\object_detection\model_main_tf2.py", line 105, in main
model_lib_v2.train_loop(
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\object_detection\model_lib_v2.py", line 685, in train_loop
losses_dict = _dist_train_step(train_input_iter)
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\tensorflow\python\eager\execute.py", line 54, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:
Detected at node 'train_input_images/write_summary/mod' defined at (most recent call last):
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\threading.py", line 930, in _bootstrap
self._bootstrap_inner()
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\threading.py", line 973, in _bootstrap_inner
self.run()
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\object_detection\model_lib_v2.py", line 629, in train_step_fn
if record_summaries:
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\object_detection\model_lib_v2.py", line 630, in train_step_fn
tf.compat.v2.summary.image(
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\tensorboard\plugins\image\summary_v2.py", line 141, in image
tag=tag, tensor=lazy_tensor, step=step, metadata=summary_metadata
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\object_detection\model_lib_v2.py", line 599, in <lambda>
lambda: global_step % num_steps_per_iteration == 0):
Node: 'train_input_images/write_summary/mod'
Detected at node 'train_input_images/write_summary/mod' defined at (most recent call last):
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\threading.py", line 930, in _bootstrap
self._bootstrap_inner()
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\threading.py", line 973, in _bootstrap_inner
self.run()
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\object_detection\model_lib_v2.py", line 629, in train_step_fn
if record_summaries:
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\object_detection\model_lib_v2.py", line 630, in train_step_fn
tf.compat.v2.summary.image(
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\tensorboard\plugins\image\summary_v2.py", line 141, in image
tag=tag, tensor=lazy_tensor, step=step, metadata=summary_metadata
File "C:\Users\mergt\AppData\Local\Programs\Python\Python39\lib\site-packages\object_detection\model_lib_v2.py", line 599, in <lambda>
lambda: global_step % num_steps_per_iteration == 0):
Node: 'train_input_images/write_summary/mod'
2 root error(s) found.
(0) UNKNOWN: JIT compilation failed.
[[{{node train_input_images/write_summary/mod}}]]
[[Identity_5/_494]]
[[{{node train_input_images/write_summary/mod}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference__dist_train_step_67016]
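A quick way to check the path the libdevice error refers to is to look for an `nvvm/libdevice` folder under the CUDA installation root. The sketch below assumes the `CUDA_PATH` environment variable points at that root (the NVIDIA installer normally sets it on Windows); `CUDA_DIR` is checked only because it is the name printed in the message, not because TensorFlow is known to read it. The function name is hypothetical.

```python
import os
from pathlib import Path
from typing import Iterable, Optional


def find_cuda_root_with_libdevice(candidates: Iterable[Optional[str]]) -> Optional[Path]:
    """Return the first candidate CUDA root containing nvvm/libdevice, else None."""
    for root in candidates:
        if not root:
            continue  # skip environment variables that are unset or empty
        if (Path(root) / "nvvm" / "libdevice").is_dir():
            return Path(root)
    return None


if __name__ == "__main__":
    # CUDA_PATH: normally set by the NVIDIA installer on Windows.
    # CUDA_DIR: checked only because it is the name shown in the error (assumption).
    found = find_cuda_root_with_libdevice(
        [os.environ.get("CUDA_PATH"), os.environ.get("CUDA_DIR")]
    )
    if found is None:
        print("No nvvm/libdevice folder under CUDA_PATH or CUDA_DIR")
    else:
        print("libdevice folder:", found / "nvvm" / "libdevice")
```

If the folder exists but the error persists, one commonly suggested workaround is to point XLA at the CUDA root through the `XLA_FLAGS` environment variable (`--xla_gpu_cuda_data_dir=<CUDA root>`) before TensorFlow is imported; whether that resolves this particular failure is not confirmed in this report.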
3. Steps to reproduce
I am following this tutorial and copying the code from this Colab notebook.
4. Expected behavior
The model should train with no errors.
5. Additional context
I deviated from the tutorial by installing protobuf 3.20.1, because otherwise I got the following error:
ImportError: cannot import name 'builder' from 'google.protobuf.internal'
The libdevice / JIT compilation error does not occur if I uninstall CUDA.
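Instead of uninstalling CUDA, the GPU can be hidden from TensorFlow for a single run by setting `CUDA_VISIBLE_DEVICES=-1` in the environment of the training process, which is a lighter way to confirm that the failure only occurs on the GPU path. A minimal sketch, assuming training is launched as a child process; `cpu_only_env` is a hypothetical helper, and the commented launch line uses placeholder paths.

```python
import os
import subprocess
import sys
from typing import Dict, Mapping, Optional


def cpu_only_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy of the environment in which CUDA GPUs are hidden from the child process."""
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = "-1"  # "-1" matches no device index
    return env


if __name__ == "__main__":
    env = cpu_only_env()
    # Sanity check: the child process sees the variable.
    child = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
        env=env, capture_output=True, text=True, check=True,
    )
    print("child sees CUDA_VISIBLE_DEVICES =", child.stdout.strip())
    # The same env could be passed when launching training (placeholder paths):
    # subprocess.run([sys.executable, "model_main_tf2.py",
    #                 "--pipeline_config_path=PATH/TO/pipeline.config",
    #                 "--model_dir=PATH/TO/model_dir"], env=env, check=True)
```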
6. System information
- Windows 10 Pro 21H1
- TensorFlow installed via pip
- TensorFlow version: 2.9.1
- Python 3.9.7
- CUDA: V11.7.64, cuDNN: 8.4.1
- GPU model and memory: RTX 2070 Super, 8 GB
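The system information above can be collected with a short script, which also reports the CUDA and cuDNN versions TensorFlow was built against via `tf.sysconfig.get_build_info()`. Comparing those with the installed toolkit (CUDA 11.7, cuDNN 8.4.1 here) may help triage; TensorFlow's tested build configurations list CUDA 11.2 and cuDNN 8.1 for 2.9. This is a sketch: the helper names are hypothetical, and the TensorFlow part is skipped when TensorFlow is not importable.

```python
import platform
from importlib import metadata
from typing import Dict, Optional


def package_version(name: str) -> Optional[str]:
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def collect_system_info() -> Dict[str, Optional[str]]:
    info: Dict[str, Optional[str]] = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "tensorflow": package_version("tensorflow"),
        "protobuf": package_version("protobuf"),
    }
    try:
        import tensorflow as tf  # optional; skipped when TensorFlow is absent
    except ImportError:
        return info
    build = tf.sysconfig.get_build_info()
    # Versions TensorFlow was built against, not the locally installed toolkit.
    info["tf_built_cuda"] = build.get("cuda_version")
    info["tf_built_cudnn"] = build.get("cudnn_version")
    gpus = tf.config.list_physical_devices("GPU")
    info["gpus"] = ", ".join(d.name for d in gpus) or None
    return info


if __name__ == "__main__":
    for key, value in collect_system_info().items():
        print(f"{key}: {value}")
```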