
Issue with MirroredStrategy() when trying to interleave training and evaluation for TensorFlow 2 object detection models #8876

Open
sglvladi opened this issue Jul 16, 2020 · 1 comment
sglvladi commented Jul 16, 2020

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

https://github.com/sglvladi/models/blob/train_eval/research/object_detection/model_main_tf2.py

2. Describe the bug

I have been trying to modify research/object_detection/model_main_tf2.py so that it interleaves training and evaluation, similar to how research/object_detection/model_main.py did for TensorFlow 1.x. To do so, I made some changes to research/object_detection/model_lib_v2.py, adding a call to eager_eval_loop() whenever a new checkpoint becomes available. A comparison of all the changes can be found here. When I run model_main_tf2.py with these changes, I get the following error as soon as evaluation starts:

Traceback (most recent call last):
  File "model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 102, in main
    model_lib_v2.train_loop(
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\model_lib_v2.py", line 685, in train_loop
    eval_step_fn(latest_checkpoint)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\model_lib_v2.py", line 641, in eval_step_fn
    eager_eval_loop(detection_model,
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\model_lib_v2.py", line 870, in eager_eval_loop
    loss_metrics[loss_key].update_state(loss_tensor)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\keras\utils\metrics_utils.py", line 90, in decorated
    update_op = update_state_fn(*args, **kwargs)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\keras\metrics.py", line 355, in update_state
    update_total_op = self.total.assign_add(value_sum)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\distribute\values.py", line 981, in assign_add
    raise ValueError(
ValueError: SyncOnReadVariable does not support `assign_add` in cross-replica context when aggregation is set to `tf.VariableAggregation.SUM`.
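For context, a minimal sketch of the constraint behind this error (assuming TensorFlow 2.x; the strategy and metric below are illustrative, not taken from model_lib_v2.py): Keras metric accumulators created under a MirroredStrategy scope are backed by SyncOnReadVariables with SUM aggregation, and those only accept updates from a replica context, i.e. from a function dispatched via strategy.run(...). The traceback shows eager_eval_loop() calling update_state() directly from the cross-replica context, which is exactly what the ValueError rejects.

```python
import tensorflow as tf

# Illustrative sketch: a single-replica MirroredStrategy on CPU.
strategy = tf.distribute.MirroredStrategy(["/cpu:0"])

with strategy.scope():
    # The Mean metric's accumulators become SyncOnReadVariables
    # with SUM aggregation under this strategy scope.
    loss_metric = tf.keras.metrics.Mean(name="loss")

@tf.function
def eval_step(loss):
    # In-replica update: legal for SyncOnReadVariables.
    loss_metric.update_state(loss)

# Dispatch through strategy.run so the update happens in replica context;
# calling loss_metric.update_state(...) directly here (cross-replica)
# is what triggers the ValueError in the traceback above.
strategy.run(eval_step, args=(tf.constant(2.0),))
print(float(loss_metric.result()))  # 2.0
```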

After doing some digging, I found that changing the default distribution strategy at line 96 of model_main_tf2.py to OneDeviceStrategy (as shown here) gets rid of the error, and I am able to successfully interleave training and evaluation.
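The workaround amounts to swapping the strategy constructed around line 96 of model_main_tf2.py. A minimal sketch (the device string is an assumption; pick whatever matches your machine):

```python
import tensorflow as tf

# strategy = tf.distribute.MirroredStrategy()  # default; triggers the error above
strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")  # e.g. "/gpu:0" on a GPU box

with strategy.scope():
    # Variables created here (including Keras metric accumulators) are plain
    # single-device variables, so cross-replica updates work as expected.
    total = tf.Variable(0.0)

total.assign_add(1.5)
print(float(total.numpy()))  # 1.5
```

With OneDeviceStrategy there is only one replica and no SyncOnReadVariable aggregation, which is why the interleaved evaluation loop no longer fails.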

3. Steps to reproduce

Pull this version of the models and try to run model_main_tf2.py on any model with properly configured evaluation. For reference purposes, I am trying to train the SSD MobileNet V1 FPN 640x640 with the default config file, having set the appropriate paths and changed the following settings:

model {
  ssd {
    num_classes: 1
    ...
  }
  ...
}

train_config: {
  ...
  fine_tune_checkpoint_type: "detection"
  batch_size: 8
  use_bfloat16: false
  ...
}

The command used to train (and evaluate) the model is the following:

python model_main_tf2.py --model_dir=models/my_ssd_resnet50_v1_fpn --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config

where the assumption is made that checkpoint_dir is the same as model_dir.

4. Expected behavior

Upon execution of model_main_tf2.py, the process should train the model and run a single evaluation loop every time a new checkpoint is created. Note that this works fine when the distribution strategy is set to OneDeviceStrategy.

5. Additional context

None

6. System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • Mobile device name if the issue happens on a mobile device: N/A
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.2
  • Python version: 3.8
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: CUDA 10.1/ cuDNN 7.5.6
  • GPU model and memory: nVidia GTX 1070 Ti
@sglvladi added the models:research (models that come under research directory) and type:bug (Bug in the code) labels on Jul 16, 2020
@sglvladi changed the title to "Issue with MirroredStrategy() when trying to interleave training and evaluation for TensorFlow 2 object detection models" on Jul 16, 2020
turowicz commented Sep 11, 2020

Can we get an ETA on this? This is very important for QoL aspects of using TF.

I can see why people are fleeing to PyTorch.

cc @tombstone @ravikyram

@jaeyounkim added the models:research:odapi (ODAPI) label and removed the models:research (models that come under research directory) label on Jun 25, 2021