
add_update in cross-replica mode is broken (BatchNormalization layer impossible to use) #29481

Closed
galeone opened this issue Jun 6, 2019 · 20 comments

Comments

@galeone commented Jun 6, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Archlinux
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v1.12.1-3374-g9eb67b17bf 2.0.0-dev20190605
  • Python version: 3.6
  • CUDA/cuDNN version: 10
  • GPU model and memory: 1080 Ti

Describe the current behavior

I expect to be able to do a forward pass with a model containing a BatchNormalization layer in training mode when using tf.distribute.MirroredStrategy, but I can't, because it raises the following exception:

RuntimeError: add_update was called in a cross-replica context. This is not expected. If you require this feature, please file an issue.

Why is it not expected?

Describe the expected behavior

It should work.
The commit that introduced this behavior is: 316cd57#diff-8eb7e20502209f082d0cb15119a50413

Code to reproduce the issue

import tensorflow as tf

model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(10),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(1),
    ]
)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
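    # In training mode, BatchNormalization calls add_update here, which raises the RuntimeError.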
    out = model(tf.zeros((1, 10)), training=True)
@galeone galeone changed the title add_update in cross-replica mode is broen (BatchNormalization layer impossibile to use) add_update in cross-replica mode is broken (BatchNormalization layer impossible to use) Jun 6, 2019
@achandraa achandraa self-assigned this Jun 10, 2019
@achandraa commented Jun 10, 2019

I was able to reproduce the issue on Colab with TensorFlow version 2.0.0-dev20190605.

@achandraa achandraa assigned ymodak and unassigned achandraa Jun 10, 2019
@ymodak ymodak assigned robieta and unassigned ymodak Jun 11, 2019
@luvwinnie commented Jun 20, 2019

Has this been resolved yet?

@XuChunqiao commented Jun 29, 2019

I ran into the same problem recently. Has this been resolved yet?

@davesc commented Jul 1, 2019

Has this been resolved? I can't use tf.distribute.MirroredStrategy with BatchNormalization in a training loop.

@robieta robieta assigned omalleyt12 and unassigned robieta Jul 15, 2019
@omalleyt12 (Contributor) commented Jul 15, 2019

Hi @galeone, when using custom training loops with DistributionStrategy you have to use experimental_run_v2. Please see: https://www.tensorflow.org/guide/distribute_strategy#using_tfdistributestrategy_with_custom_training_loops
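For reference, a minimal sketch of the pattern that guide describes (the step-function name and toy shapes here are illustrative, not taken from this thread): create the model inside strategy.scope(), then let experimental_run_v2 enter the replica context before the forward pass runs:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Variables must be created inside the strategy scope so they are mirrored.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(1),
    ])

@tf.function
def train_step(inputs):
    # This body runs once per replica, so add_update is called in a
    # replica context and BatchNormalization can record its moving-average updates.
    return model(inputs, training=True)

# experimental_run_v2 enters the replica context before calling train_step.
per_replica_out = strategy.experimental_run_v2(train_step, args=(tf.zeros((1, 10)),))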

@omalleyt12 omalleyt12 closed this Jul 15, 2019

@galeone (Author) commented Jul 15, 2019

Hi @omalleyt12, in my tests it still fails, even when using experimental_run_v2.
If the system has more than one GPU (and thus the distribution strategy can really distribute the computation), the same exception is raised even if I change the code in this way:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(1),
    ])

    def forward():
        return model(tf.zeros((1, 10)), training=True)

    print(strategy.experimental_run_v2(forward, args=()))
@vmarkovtsev (Contributor) commented Jul 31, 2019

I hit the same problem when I use batch normalization, too.

File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/normalization.py", line 659, in call
    outputs = self._fused_batch_norm(inputs, training=training)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/normalization.py", line 556, in _fused_batch_norm
    self.add_update(mean_update)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 1113, in add_update
    '`add_update` was called in a cross-replica context. This is not '
RuntimeError: `add_update` was called in a cross-replica context. This is not expected. If you require this feature, please file an issue

The problem happens while the Model is being built, so there is no way for me to try experimental_run_v2.
I am using tf-nightly-gpu-2.0-preview from today.
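For context, a minimal sketch of how a model can hit add_update during construction (an illustration assuming a functional model; not necessarily @vmarkovtsev's actual code):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Functional construction calls each layer on symbolic tensors, so
    # BatchNormalization.add_update runs right here, in the cross-replica
    # context -- before there is any step function to hand to experimental_run_v2.
    inputs = tf.keras.Input(shape=(10,))
    outputs = tf.keras.layers.BatchNormalization()(inputs)
    model = tf.keras.Model(inputs, outputs)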

@omalleyt12 (Contributor) commented Aug 20, 2019

@galeone @vmarkovtsev this should be fixed in the latest nightly, could you please check and confirm?

@omalleyt12 (Contributor) commented Aug 20, 2019

@vmarkovtsev if it's not fixed for your use case, could you provide a simple repro?

@vmarkovtsev (Contributor) commented Aug 20, 2019

I confirm that this issue is fixed now. Thank you @omalleyt12 !

@omalleyt12 (Contributor) commented Aug 20, 2019

Thanks!

@omalleyt12 omalleyt12 closed this Aug 20, 2019

@3fen commented Aug 30, 2019

@omalleyt12 I ran into the same problem in 2.0.0rc.

@omalleyt12 (Contributor) commented Aug 30, 2019

@3fen can you share a simple reproduction that is failing?

@3fen commented Aug 30, 2019

@omalleyt12 It is similar to the one in the main thread:

import tensorflow as tf

model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(10),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(1),
    ]
)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    out = model(tf.zeros((1, 10)), training=True)
    print(out)

Output:

RuntimeError: add_update was called in a cross-replica context. This is not expected. If you require this feature, please file an issue.

@omalleyt12 (Contributor) commented Aug 30, 2019

@3fen, when using custom training loops with DistributionStrategy you have to use experimental_run_v2. Please see: https://www.tensorflow.org/guide/distribute_strategy#using_tfdistributestrategy_with_custom_training_loops

@3fen commented Sep 2, 2019

@omalleyt12 Got it, thanks for the confirmation.

@DAEHEESHIN commented Sep 5, 2019

I hit the same problem when I use batch normalization, too.
I am using TensorFlow 2.0 b1.
[screenshot of the error traceback]

@vmarkovtsev (Contributor) commented Sep 9, 2019

@DAEHEESHIN v2.0 b1 is ancient; we are discussing the current nightly here.
