ValueError raised when saving a model created under MirroredStrategy #40366
Hi @djdongjin, can you please provide a reproducible example? What arguments did you pass when running the script? Thanks. Note this section from the docs: "We don't allow operations like `v.assign_add` in a cross-replica context for sync on read variables", which is essentially the error message you are seeing.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.
I hit this error as well, so I created a reproducible example. I also have a StackOverflow post about it.

```python
import tensorflow as tf


class Sampling(tf.keras.layers.Layer):
    """Uses (z_mean, z_log_var) to sample z, the vector encoding a digit."""

    def call(self, inputs):
        z_mean, z_log_var = inputs
        batch = tf.shape(z_mean)[0]
        dim = tf.shape(z_mean)[1]
        epsilon = tf.keras.backend.random_normal(shape=(batch, dim))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon


class Encoder(tf.keras.layers.Layer):
    """Maps MNIST digits to a triplet (z_mean, z_log_var, z)."""

    def __init__(self, latent_dim=32, intermediate_dim=64, name="encoder", **kwargs):
        super(Encoder, self).__init__(name=name, **kwargs)
        self.dense_proj = tf.keras.layers.Dense(intermediate_dim, activation="relu")
        self.dense_mean = tf.keras.layers.Dense(latent_dim)
        self.dense_log_var = tf.keras.layers.Dense(latent_dim)
        self.sampling = Sampling()

    def call(self, inputs):
        x = self.dense_proj(inputs)
        z_mean = self.dense_mean(x)
        z_log_var = self.dense_log_var(x)
        z = self.sampling((z_mean, z_log_var))
        return z_mean, z_log_var, z


class Decoder(tf.keras.layers.Layer):
    """Converts z, the encoded digit vector, back into a readable digit."""

    def __init__(self, original_dim, intermediate_dim=64, name="decoder", **kwargs):
        super(Decoder, self).__init__(name=name, **kwargs)
        self.dense_proj = tf.keras.layers.Dense(intermediate_dim, activation="relu")
        self.dense_output = tf.keras.layers.Dense(original_dim, activation="sigmoid")

    def call(self, inputs):
        x = self.dense_proj(inputs)
        return self.dense_output(x)


class VariationalAutoEncoder(tf.keras.Model):
    """Combines the encoder and decoder into an end-to-end model for training."""

    def __init__(self, original_dim, intermediate_dim=64, latent_dim=32, name="autoencoder", **kwargs):
        super(VariationalAutoEncoder, self).__init__(name=name, **kwargs)
        self.original_dim = original_dim
        self.encoder = Encoder(latent_dim=latent_dim, intermediate_dim=intermediate_dim)
        self.decoder = Decoder(original_dim, intermediate_dim=intermediate_dim)

    def call(self, inputs):
        z_mean, z_log_var, z = self.encoder(inputs)
        reconstructed = self.decoder(z)
        # Add KL divergence regularization loss.
        kl_loss = -0.5 * tf.reduce_mean(
            z_log_var - tf.square(z_mean) - tf.exp(z_log_var) + 1
        )
        self.add_loss(kl_loss)
        self.add_metric([0.], name="foo")
        return reconstructed


(x_train, _), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(60000, 784).astype("float32") / 255
original_dim = 784

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    vae = VariationalAutoEncoder(original_dim, 64, 32)
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
    vae.compile(optimizer, loss=tf.keras.losses.MeanSquaredError())

vae.fit(x_train, x_train, epochs=3, batch_size=64)
vae.save("vae")
```

For me, if I just remove the `add_metric` call, it works. As for environment, I'm using the Google AI Platform runtime version 2.3.
Hi @acarl005, thanks for providing a simple, reproducible example! I took a deeper look at this issue; it seems to be a known bug that has been fixed in 2.4. Please see this gist, which runs without error. You should now be able to export a Keras model trained with a custom metric. If you want to test on AI Platform, I think you'll need to use a custom container, since the latest runtime version is only 2.3.
Thanks @nikitamaia for the quick response, and for providing the gist. We'll work on upgrading to 2.4.
This still seems to be a problem: using TF 2.5.0, I get the same error with MirroredStrategy.
I'm getting this error with `tf.keras.metrics.CategoricalAccuracy()` (which inherits from `Mean`) when using MirroredStrategy. OS: Ubuntu 18.04.5.
I am facing the same issue with TensorFlow 2.7.0 MirroredStrategy using `tf.keras.metrics.Mean()`.
I am also seeing this issue with TensorFlow 2.7.0.
Facing the same issue with TensorFlow 2.8.0.
I am also getting this error in TensorFlow 2.8.0.
Same here. Any workarounds?
I'm using TensorFlow 2.9.1, and I found a workaround:

```python
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = ...
    loss_fn = ...
    optimizer = ...

# Out of scope, this works.
metrics = tf.keras.metrics.Mean(name='total_loss')
```

This works when you are using a custom training loop. The model should not be compiled within the strategy scope in this case.
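To make the workaround above concrete, here is a minimal, self-contained sketch (the model, data, and names are illustrative, not from the original comment): the model and optimizer live inside `strategy.scope()`, the `Mean` metric is created outside it so it is an ordinary variable, and a custom training loop reduces the per-replica losses before updating the metric.

```python
import numpy as np
import tensorflow as tf

# Single-device MirroredStrategy so the sketch runs on CPU-only machines.
strategy = tf.distribute.MirroredStrategy(devices=["/cpu:0"])

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.Adam(1e-3)

# Created OUTSIDE the scope, per the workaround: a plain, non-distributed metric.
total_loss = tf.keras.metrics.Mean(name="total_loss")

@tf.function
def train_step(x, y):
    def step_fn(x, y):
        with tf.GradientTape() as tape:
            pred = model(x, training=True)
            loss = tf.reduce_mean(tf.square(pred - y))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    per_replica_loss = strategy.run(step_fn, args=(x, y))
    # Reduce across replicas first, then update the ordinary metric
    # in the cross-replica context, which is safe for a plain variable.
    total_loss.update_state(
        strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=None))

x = tf.constant(np.random.rand(8, 4), dtype=tf.float32)
y = tf.constant(np.random.rand(8, 1), dtype=tf.float32)
train_step(x, y)
print(float(total_loss.result()))  # finite, non-negative mean squared error
```

Because the metric is not created under the strategy, it never becomes a `SyncOnReadVariable`, so the `assign_add` restriction does not apply to it.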
I'm facing the same issue under TF 2.8.0, but in my opinion it actually makes sense. A metric object is supposed to collect the metric tensors from all the replicas, so it needs to be created in the cross-replica context enabled with `strategy.scope()`, while its updates happen in the per-replica context. So the solution is to split metric creation and update: create the metric once inside the scope, and only call `update_state` from within `strategy.run`. The Distributed Training with Keras tutorial demonstrates this pattern.
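A hypothetical sketch of that creation/update split (names are illustrative): the `Mean` metric is built inside `strategy.scope()`, so its state is distributed, and `update_state` is only ever called from inside `strategy.run`, i.e. in the replica context where updating a sync-on-read variable is allowed.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(devices=["/cpu:0"])

with strategy.scope():
    # Cross-replica context: the metric's variables are created here.
    batch_loss = tf.keras.metrics.Mean(name="batch_loss")

@tf.function
def record(values):
    def step_fn(v):
        # Replica context: updating the distributed metric is allowed here.
        batch_loss.update_state(v)
    strategy.run(step_fn, args=(values,))

record(tf.constant([1.0, 3.0]))
print(float(batch_loss.result()))  # 2.0
```

Reading `batch_loss.result()` back in the cross-replica context aggregates the per-replica state, so no manual reduce is needed for the metric itself.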
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

**System information**

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:

```shell
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"
```
**Describe the current behavior**

The model is created inside a MirroredStrategy. When I save the model using `model.save(save_path)` after training, it raises:

```
ValueError: SyncOnReadVariable does not support `assign_add` in cross-replica context when aggregation is set to `tf.VariableAggregation.SUM`.
```

The error is triggered here. I also attached a complete traceback for your reference.
**Describe the expected behavior**

**Standalone code to reproduce the issue**

Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.

**Other info / logs**

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.