
Eval doesn't work in TF2 OD API when batch_size != 1 #8999

Closed
qraleq opened this issue Jul 29, 2020 · 5 comments
Assignees
Labels
Labels: models:research (models that come under research directory), stat:awaiting response (waiting on input from the contributor), type:support

Comments


qraleq commented Jul 29, 2020

Prerequisites

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py

2. Describe the bug

Performing evaluation with batch_size: 1 works fine using the efficientdet_d0_coco17_tpu-32 model. When I change batch_size to any other value, I get the error copied below in "Additional context".

3. Steps to reproduce

Try evaluating the efficientdet_d0_coco17_tpu-32 model using batch_size: 8.
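For reference, the only change needed to trigger the error is the eval batch size in the pipeline config. A sketch of the relevant fragment (field names follow the TF2 OD API pipeline.proto; the rest of the config is elided):

```
eval_config {
  # batch_size: 1 works; any larger value (e.g. 8) triggers the error below
  batch_size: 8
  metrics_set: "coco_detection_metrics"
}
```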

4. Expected behavior

Evaluation should also work when batch_size is set to a value other than 1.

5. Additional context

Traceback (most recent call last):
File "model_main_tf2.py", line 119, in <module>
tf.compat.v1.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "model_main_tf2.py", line 94, in main
wait_interval=300, timeout=FLAGS.eval_timeout)
File "/usr/local/lib/python3.6/dist-packages/object_detection/model_lib_v2.py", line 976, in eval_continuously
global_step=global_step)
File "/usr/local/lib/python3.6/dist-packages/object_detection/model_lib_v2.py", line 783, in eager_eval_loop
eval_dict, losses_dict, class_agnostic = compute_eval_dict(features, labels)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 840, in _call
return self._stateless_fn(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2829, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
cancellation_manager=cancellation_manager)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
ctx=ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Shapes of all inputs must match: values[0].shape = [10] != values[1].shape = [1]
[[node stack_32 (defined at /usr/local/lib/python3.6/dist-packages/object_detection/model_lib.py:153) ]]
[[Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression_3/Reshape_86/_952]]
(1) Invalid argument: Shapes of all inputs must match: values[0].shape = [10] != values[1].shape = [1]
[[node stack_32 (defined at /usr/local/lib/python3.6/dist-packages/object_detection/model_lib.py:153) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_compute_eval_dict_78603]

Errors may have originated from an input operation.
Input Source operations connected to node stack_32:
Slice_5 (defined at /usr/local/lib/python3.6/dist-packages/object_detection/model_lib.py:265)

Input Source operations connected to node stack_32:
Slice_5 (defined at /usr/local/lib/python3.6/dist-packages/object_detection/model_lib.py:265)

Function call stack:
compute_eval_dict -> compute_eval_dict
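The failing op is a stack of per-image tensors whose leading dimensions differ (10 vs 1). A minimal NumPy illustration of that failure mode (hypothetical shapes chosen to match the message, not the actual OD API code):

```python
import numpy as np

# Two per-image results with different numbers of entries,
# e.g. 10 boxes for one image in the batch and 1 for another.
a = np.zeros(10)
b = np.zeros(1)

try:
    np.stack([a, b])  # all inputs to stack must share the same shape
except ValueError as err:
    print("stack failed:", err)
```

This mirrors the `Shapes of all inputs must match: values[0].shape = [10] != values[1].shape = [1]` error: with batch_size > 1, tensors coming from different images are stacked together and their shapes no longer agree.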

6. System information

== check python ===================================================
python version: 3.6.9
python branch:
python build version: ('default', 'Apr 18 2020 01:56:04')
python compiler version: GCC 8.4.0
python implementation: CPython

== check os platform ===============================================
os: Linux
os kernel version: #123-Ubuntu SMP Sat Jul 4 02:03:15 UTC 2020
os release version: 4.4.0-1111-aws
os platform: Linux-4.4.0-1111-aws-x86_64-with-Ubuntu-18.04-bionic
linux distribution: ('Ubuntu', '18.04', 'bionic')
linux os distribution: ('Ubuntu', '18.04', 'bionic')
mac version: ('', ('', '', ''), '')
uname: uname_result(system='Linux', node='e456acec5a2f', release='4.4.0-1111-aws', version='#123-Ubuntu SMP Sat Jul 4 02:03:15 UTC 2020', machine='x86_64', processor='x86_64')
architecture: ('64bit', '')
machine: x86_64

== are we in docker =============================================
Yes

== compiler =====================================================
c++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== check pips ===================================================
numpy 1.18.4
protobuf 3.11.3
tensorflow 2.3.0
tensorflow-addons 0.10.0
tensorflow-datasets 3.2.1
tensorflow-estimator 2.3.0
tensorflow-gpu 2.2.0
tensorflow-hub 0.8.0
tensorflow-metadata 0.22.2
tensorflow-model-optimization 0.4.0

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.version.VERSION = 2.3.0
tf.version.GIT_VERSION = v2.3.0-rc2-23-gb36436b087
tf.version.COMPILER_VERSION = 7.3.1 20180303

== env ==========================================================
LD_LIBRARY_PATH /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Wed Jul 29 11:12:20 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 45C P8 26W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

== cuda libs ===================================================
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1.243
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart_static.a

== tensorflow installed from info ==================
Name: tensorflow
Version: 2.3.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /usr/local/lib/python3.6/dist-packages
Required-by: tf-models-official

== python version ==============================================
(major, minor, micro, releaselevel, serial)
(3, 6, 9, 'final', 0)

== bazel version ===============================================

@qraleq qraleq added models:research models that come under research directory type:bug Bug in the code labels Jul 29, 2020

pn12 commented Jul 29, 2020

Yes, that would be the case, because image-domain models generally predict with a batch size of 1, i.e. they run classification/detection/prediction one image at a time.

Another way to understand this: why do we batch during training? First, for computational efficiency, and second, to give the model a collection of images, rather than a single one, before each weight adjustment.

In prediction there is no such need. Feeding the model a group of images would not change what it computes for each one.

Hope it helps.

@saikumarchalla saikumarchalla self-assigned this Jul 30, 2020
@saikumarchalla saikumarchalla added type:support stat:awaiting response Waiting on input from the contributor and removed type:bug Bug in the code labels Jul 30, 2020
@saikumarchalla

@qraleq Could you please respond to the above comment? Hope it helps. Thanks!

@qraleq qraleq closed this as completed Jul 30, 2020

LackesLab commented Aug 24, 2020

@pn12 Using a larger batch size should simply mean feeding the model batches of inputs instead of single images. When running inference over a large dataset, it is preferable to process batches of images rather than single ones to save time.

@qraleq Were you able to increase the batch size?
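On the batched-inference point above: a batched pass computes the same results as a one-by-one loop, just in fewer, larger operations. A toy NumPy sketch (a single matrix multiply standing in for a model forward pass; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((64, 128))    # 64 flattened "images"
weights = rng.random((128, 10))   # stand-in for model weights

# batch_size = 1: one forward pass per image
one_by_one = np.stack([img @ weights for img in images])

# batch_size = 64: a single batched forward pass
batched = images @ weights

assert np.allclose(one_by_one, batched)  # same predictions, fewer calls
```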


tazu786 commented Sep 18, 2020

@qraleq I have exactly the same problem. Did you manage to solve it?


qraleq commented Sep 19, 2020

@tazu786 I decided to stick with batch_size: 1 for now.

No branches or pull requests

5 participants