
Eval doesn't work in TF2 OD API when batch_size != 1 #8999

Closed
qraleq opened this issue Jul 29, 2020 · 5 comments
Assignees
Labels
Labels: models:research (models that come under research directory), stat:awaiting response (waiting on input from the contributor), type:support

Comments


qraleq commented Jul 29, 2020

Prerequisites

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py

2. Describe the bug

Performing evaluation with batch_size: 1 works fine using the efficientdet_d0_coco17_tpu-32 model. When I change batch_size to any other value, I get the error copied below in "Additional context".

3. Steps to reproduce

Try evaluating the efficientdet_d0_coco17_tpu-32 model using batch_size: 8.
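For reference, the only change needed to trigger the error is the eval batch size in the pipeline config. A sketch of the relevant fragment (field names follow the TF2 OD API pipeline.proto; the rest of the config is elided):

```
eval_config {
  # batch_size: 1 works; any larger value (e.g. 8) triggers the error below
  batch_size: 8
  metrics_set: "coco_detection_metrics"
}
```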

4. Expected behavior

Evaluation should also work when batch_size is set to a value other than 1.

5. Additional context

Traceback (most recent call last):
File "model_main_tf2.py", line 119, in <module>
tf.compat.v1.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "model_main_tf2.py", line 94, in main
wait_interval=300, timeout=FLAGS.eval_timeout)
File "/usr/local/lib/python3.6/dist-packages/object_detection/model_lib_v2.py", line 976, in eval_continuously
global_step=global_step)
File "/usr/local/lib/python3.6/dist-packages/object_detection/model_lib_v2.py", line 783, in eager_eval_loop
eval_dict, losses_dict, class_agnostic = compute_eval_dict(features, labels)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 840, in _call
return self._stateless_fn(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2829, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
cancellation_manager=cancellation_manager)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
ctx=ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Shapes of all inputs must match: values[0].shape = [10] != values[1].shape = [1]
[[node stack_32 (defined at /usr/local/lib/python3.6/dist-packages/object_detection/model_lib.py:153) ]]
[[Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression_3/Reshape_86/_952]]
(1) Invalid argument: Shapes of all inputs must match: values[0].shape = [10] != values[1].shape = [1]
[[node stack_32 (defined at /usr/local/lib/python3.6/dist-packages/object_detection/model_lib.py:153) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_compute_eval_dict_78603]

Errors may have originated from an input operation.
Input Source operations connected to node stack_32:
Slice_5 (defined at /usr/local/lib/python3.6/dist-packages/object_detection/model_lib.py:265)

Input Source operations connected to node stack_32:
Slice_5 (defined at /usr/local/lib/python3.6/dist-packages/object_detection/model_lib.py:265)

Function call stack:
compute_eval_dict -> compute_eval_dict
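The failing op is a stack of per-image tensors whose leading dimensions differ (10 vs 1). A minimal NumPy illustration of that failure mode (hypothetical shapes chosen to match the message, not the actual OD API code):

```python
import numpy as np

# Two per-image results with different numbers of entries,
# e.g. 10 boxes for one image in the batch and 1 for another.
a = np.zeros(10)
b = np.zeros(1)

try:
    np.stack([a, b])  # all inputs to stack must share the same shape
except ValueError as err:
    print("stack failed:", err)
```

This mirrors the `Shapes of all inputs must match: values[0].shape = [10] != values[1].shape = [1]` error: with batch_size > 1, tensors coming from different images are stacked together and their shapes no longer agree.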

6. System information

== check python ===================================================
python version: 3.6.9
python branch:
python build version: ('default', 'Apr 18 2020 01:56:04')
python compiler version: GCC 8.4.0
python implementation: CPython

== check os platform ===============================================
os: Linux
os kernel version: #123-Ubuntu SMP Sat Jul 4 02:03:15 UTC 2020
os release version: 4.4.0-1111-aws
os platform: Linux-4.4.0-1111-aws-x86_64-with-Ubuntu-18.04-bionic
linux distribution: ('Ubuntu', '18.04', 'bionic')
linux os distribution: ('Ubuntu', '18.04', 'bionic')
mac version: ('', ('', '', ''), '')
uname: uname_result(system='Linux', node='e456acec5a2f', release='4.4.0-1111-aws', version='#123-Ubuntu SMP Sat Jul 4 02:03:15 UTC 2020', machine='x86_64', processor='x86_64')
architecture: ('64bit', '')
machine: x86_64

== are we in docker =============================================
Yes

== compiler =====================================================
c++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== check pips ===================================================
numpy 1.18.4
protobuf 3.11.3
tensorflow 2.3.0
tensorflow-addons 0.10.0
tensorflow-datasets 3.2.1
tensorflow-estimator 2.3.0
tensorflow-gpu 2.2.0
tensorflow-hub 0.8.0
tensorflow-metadata 0.22.2
tensorflow-model-optimization 0.4.0

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.version.VERSION = 2.3.0
tf.version.GIT_VERSION = v2.3.0-rc2-23-gb36436b087
tf.version.COMPILER_VERSION = 7.3.1 20180303

== env ==========================================================
LD_LIBRARY_PATH /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Wed Jul 29 11:12:20 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 45C P8 26W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

== cuda libs ===================================================
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1.243
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart_static.a

== tensorflow installed from info ==================
Name: tensorflow
Version: 2.3.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /usr/local/lib/python3.6/dist-packages
Required-by: tf-models-official

== python version ==============================================
(major, minor, micro, releaselevel, serial)
(3, 6, 9, 'final', 0)

== bazel version ===============================================

@qraleq qraleq added models:research models that come under research directory type:bug Bug in the code labels Jul 29, 2020

pn12 commented Jul 29, 2020

Yes, that would be the case, because image-domain models generally predict with a batch size of 1, i.e. they run classification/detection/prediction one image at a time.

Another way to understand this: why do we batch during training? First, for computational efficiency, and second, to give the model a collection of images, rather than a single one, before each weight adjustment.

In prediction there is no such need. Feeding the model a group of images would not change what it computes for each one.

Hope it helps.

@saikumarchalla saikumarchalla self-assigned this Jul 30, 2020
@saikumarchalla saikumarchalla added type:support stat:awaiting response Waiting on input from the contributor and removed type:bug Bug in the code labels Jul 30, 2020
@saikumarchalla

@qraleq Could you please respond to the above comment? Hope it helps. Thanks!

@qraleq qraleq closed this as completed Jul 30, 2020

LackesLab commented Aug 24, 2020

@pn12 Using a larger batch size should simply mean feeding the model batches of inputs instead of single images. When running inference over a large dataset, it is preferable to process batches of images rather than single ones to save time.

@qraleq Were you able to increase the batch size?
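On the batched-inference point above: a batched pass computes the same results as a one-by-one loop, just in fewer, larger operations. A toy NumPy sketch (a single matrix multiply standing in for a model forward pass; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((64, 128))    # 64 flattened "images"
weights = rng.random((128, 10))   # stand-in for model weights

# batch_size = 1: one forward pass per image
one_by_one = np.stack([img @ weights for img in images])

# batch_size = 64: a single batched forward pass
batched = images @ weights

assert np.allclose(one_by_one, batched)  # same predictions, fewer calls
```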


tazu786 commented Sep 18, 2020

@qraleq I have exactly the same problem. Did you manage to solve it?


qraleq commented Sep 19, 2020

@tazu786 I decided to stick with batch_size: 1 for now.

No branches or pull requests

5 participants