
Issue with MirroredStrategy() when trying to interleave training and evaluation for TensorFlow 2 object detection models #8876

Open
sglvladi opened this issue Jul 16, 2020 · 1 comment
sglvladi commented Jul 16, 2020

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

https://github.com/sglvladi/models/blob/train_eval/research/object_detection/model_main_tf2.py

2. Describe the bug

I have been trying to modify research/object_detection/model_main_tf2.py so that it interleaves training and evaluation, similar to how research/object_detection/model_main.py did for TensorFlow 1.x. To do so, I made some changes to research/object_detection/model_lib_v2.py, adding a call to eager_eval_loop() whenever a new checkpoint becomes available. A comparison of all the changes can be found here. When I run model_main_tf2.py with these changes, I get the following error as soon as evaluation starts:

Traceback (most recent call last):
  File "model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 102, in main
    model_lib_v2.train_loop(
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\model_lib_v2.py", line 685, in train_loop
    eval_step_fn(latest_checkpoint)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\model_lib_v2.py", line 641, in eval_step_fn
    eager_eval_loop(detection_model,
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\model_lib_v2.py", line 870, in eager_eval_loop
    loss_metrics[loss_key].update_state(loss_tensor)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\keras\utils\metrics_utils.py", line 90, in decorated
    update_op = update_state_fn(*args, **kwargs)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\keras\metrics.py", line 355, in update_state
    update_total_op = self.total.assign_add(value_sum)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\distribute\values.py", line 981, in assign_add
    raise ValueError(
ValueError: SyncOnReadVariable does not support `assign_add` in cross-replica context when aggregation is set to `tf.VariableAggregation.SUM`.
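For context, a minimal sketch of the constraint behind this error (assuming TensorFlow 2.x; the strategy and metric below are illustrative, not taken from model_lib_v2.py): Keras metric accumulators created under a MirroredStrategy scope are backed by SyncOnReadVariables with SUM aggregation, and those only accept updates from a replica context, i.e. from a function dispatched via strategy.run(...). The traceback shows eager_eval_loop() calling update_state() directly from the cross-replica context, which is exactly what the ValueError rejects.

```python
import tensorflow as tf

# Illustrative sketch: a single-replica MirroredStrategy on CPU.
strategy = tf.distribute.MirroredStrategy(["/cpu:0"])

with strategy.scope():
    # The Mean metric's accumulators become SyncOnReadVariables
    # with SUM aggregation under this strategy scope.
    loss_metric = tf.keras.metrics.Mean(name="loss")

@tf.function
def eval_step(loss):
    # In-replica update: legal for SyncOnReadVariables.
    loss_metric.update_state(loss)

# Dispatch through strategy.run so the update happens in replica context;
# calling loss_metric.update_state(...) directly here (cross-replica)
# is what triggers the ValueError in the traceback above.
strategy.run(eval_step, args=(tf.constant(2.0),))
print(float(loss_metric.result()))  # 2.0
```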

After doing some digging, I found that changing the default distribution strategy at line 96 of model_main_tf2.py to OneDeviceStrategy (as shown here) gets rid of the error, and I am able to successfully interleave training and evaluation.
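The workaround amounts to swapping the strategy constructed around line 96 of model_main_tf2.py. A minimal sketch (the device string is an assumption; pick whatever matches your machine):

```python
import tensorflow as tf

# strategy = tf.distribute.MirroredStrategy()  # default; triggers the error above
strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")  # e.g. "/gpu:0" on a GPU box

with strategy.scope():
    # Variables created here (including Keras metric accumulators) are plain
    # single-device variables, so cross-replica updates work as expected.
    total = tf.Variable(0.0)

total.assign_add(1.5)
print(float(total.numpy()))  # 1.5
```

With OneDeviceStrategy there is only one replica and no SyncOnReadVariable aggregation, which is why the interleaved evaluation loop no longer fails.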

3. Steps to reproduce

Pull this version of the models and try to run model_main_tf2.py on any model with properly configured evaluation. For reference purposes, I am trying to train the SSD MobileNet V1 FPN 640x640 with the default config file, having set the appropriate paths and changed the following settings:

model {
  ssd {
    num_classes: 1
    ...
  }
  ...
}

train_config: {
  ...
  fine_tune_checkpoint_type: "detection"
  batch_size: 8
  use_bfloat16: false
  ...
}

The command used to train (and evaluate) the model is the following:

python model_main_tf2.py --model_dir=models/my_ssd_resnet50_v1_fpn --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config

where the assumption is made that checkpoint_dir is the same as model_dir.

4. Expected behavior

Upon execution of model_main_tf2.py, the process should train the model and run a single evaluation loop every time a new checkpoint is created. Note that this works fine when the distribution strategy is set to OneDeviceStrategy.

5. Additional context

None

6. System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • Mobile device name if the issue happens on a mobile device: N/A
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.2
  • Python version: 3.8
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: CUDA 10.1/ cuDNN 7.5.6
  • GPU model and memory: nVidia GTX 1070 Ti
@sglvladi added the models:research (models that come under research directory) and type:bug (Bug in the code) labels on Jul 16, 2020
@sglvladi changed the title to "Issue with MirroredStrategy() when trying to interleave training and evaluation for TensorFlow 2 object detection models" on Jul 16, 2020
turowicz commented Sep 11, 2020

Can we get an ETA on this? This is very important for QoL aspects of using TF.

I can see why people are fleeing to PyTorch.

cc @tombstone @ravikyram

@jaeyounkim added the models:research:odapi (ODAPI) label and removed the models:research (models that come under research directory) label on Jun 25, 2021