MultiWorkerMirroredStrategy does not work with Keras + accuracy metric #33531

Closed
vmarkovtsev opened this issue Oct 19, 2019 · 12 comments
Labels: comp:dist-strat (Distribution Strategy related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.0 (Issues relating to TensorFlow 2.0), type:bug (Bug)

Comments

@vmarkovtsev
Contributor

System information
The same environment as in #32654, but with 2 machines instead of 1, and the TensorFlow 2.0 release from PyPI.

Describe the current behavior

I am training DenseNet121 on ImageNet with standard Keras code and a custom dataset pipeline. model.compile is called with "accuracy" as the only metric. I am using MultiWorkerMirroredStrategy as described in the tutorial. Here is the log; I had to erase ~7,000 warnings which are all the same: 2019-10-19 12:23:10.615259: W tensorflow/core/framework/op_kernel.cc:309] OpKernelContext is tracking allocations but they are not being consumed by the StepStatsCollector.

Compiling with RMSprop
Fitting...
WARNING:tensorflow:`eval_fn` is not passed in. The `worker_fn` will be used if an "evaluator" task exists in the cluster.
`eval_fn` is not passed in. The `worker_fn` will be used if an "evaluator" task exists in the cluster.
WARNING:tensorflow:`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
2019-10-19 03:57:05.813768: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:400] Cannot find shardable dataset, adding a shard node at the end of the dataset instead. This may have performance implications.
2019-10-19 03:57:19.342401: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:400] Cannot find shardable dataset, adding a shard node at the end of the dataset instead. This may have performance implications.
2019-10-19 03:59:48.236258: I tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:316] Abandoning ScopedAllocatorOptimizer because input FusedBatchNormGradV3_99 output 1 is already assigned to scope_id 132
2019-10-19 03:59:48.236611: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:381] error: Internal: Abandoning ScopedAllocatorOptimizer because input FusedBatchNormGradV3_99 output 1 is already assigned to scope_id 132
2019-10-19 03:59:48.236834: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:990] error: Internal: Abandoning ScopedAllocatorOptimizer because input FusedBatchNormGradV3_99 output 1 is already assigned to scope_id 132
2019-10-19 03:59:48.237468: E tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1007] ScopedAllocatorOptimizer: Internal: Abandoning ScopedAllocatorOptimizer because input FusedBatchNormGradV3_99 output 1 is already assigned to scope_id 132
2019-10-19 03:59:48.237593: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:782] error: Internal: Abandoning ScopedAllocatorOptimizer because input FusedBatchNormGradV3_99 output 1 is already assigned to scope_id 132
2019-10-19 03:59:48.299255: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] scoped_allocator_optimizer failed: Internal: Abandoning ScopedAllocatorOptimizer because input FusedBatchNormGradV3_99 output 1 is already assigned to scope_id 132
2019-10-19 04:00:01.523007: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-10-19 04:00:11.506609: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-10-19 04:00:19.689077: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
2019-10-19 04:00:29.848332: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
2019-10-19 04:00:29.848931: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcupti.so.10.0'; dlerror: libcupti.so.10.0: cannot open shared object file: No such file or directory
2019-10-19 04:00:29.849025: W tensorflow/core/profiler/lib/profiler_session.cc:192] Encountered error while starting profiler: Unavailable: CUPTI error: CUPTI could not be loaded or symbol could not be found.
Train for 15974.0 steps

Epoch 00001: LearningRateScheduler reducing learning rate to 0.0009375.
Epoch 1/400
2019-10-19 04:00:34.268294: I tensorflow/core/platform/default/device_tracer.cc:588] Collecting 0 kernel records, 0 memcpy records.
2019-10-19 04:00:34.314465: E tensorflow/core/platform/default/device_tracer.cc:70] CUPTI error: CUPTI could not be loaded or symbol could not be found.
15973/15974 [============================>.] - ETA: 1s - loss: 8.3656 - accuracy: 0.01342019-10-19 12:14:21.634609: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:21.635077: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:21.635164: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:21.635253: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[replica_3/metrics/accuracy/AssignAddVariableOp_1/_55]]
2019-10-19 12:14:21.635336: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Cancelled: [_Derived_]Cancelled
Additional GRPC error information:
{"created":"@1571487261.635191370","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Cancelled","grpc_status":1}
2019-10-19 12:14:21.635412: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:21.635529: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:21.635680: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[replica_3/metrics/accuracy/AssignAddVariableOp_1/_43]]
2019-10-19 12:14:21.635764: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Cancelled: [_Derived_]Cancelled
Additional GRPC error information:
{"created":"@1571487261.635191370","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Cancelled","grpc_status":1}
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:21.635930: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[replica_1/metrics/accuracy/AssignAddVariableOp_1/_63]]
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
Additional GRPC error information:
{"created":"@1571487261.635191370","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Cancelled","grpc_status":1}
2019-10-19 12:14:21.636135: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext_3}}]]
2019-10-19 12:14:23.196391: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.196583: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.196683: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.196964: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.197197: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.197232: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[CollectiveReduce_3]]
	 [[CollectiveReduce_1/_16]]
2019-10-19 12:14:23.197283: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[CollectiveReduce_2]]
2019-10-19 12:14:23.197353: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[CollectiveReduce_3]]
	 [[CollectiveReduce/ReadVariableOp/_18]]
2019-10-19 12:14:23.197395: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.197460: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.197507: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[CollectiveReduce_3]]
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.197742: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.197870: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[CollectiveReduce_1]]
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 668, in on_start
    yield
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 372, in fit
    prefix='val_')
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 685, in on_epoch
    self.callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 298, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 963, in on_epoch_end
    self._save_model(epoch=epoch, logs=logs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 1001, in _save_model
    self.model.save(filepath, overwrite=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/network.py", line 975, in save
    signatures, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/save.py", line 112, in save_model
    model, filepath, overwrite, include_optimizer)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 109, in save_model_to_hdf5
    save_weights_to_hdf5_group(model_weights_group, model_layers)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 627, in save_weights_to_hdf5_group
    weight_values = K.batch_get_value(weights)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3296, in batch_get_value
    return [x.numpy() for x in tensors]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3296, in <listcomp>
    return [x.numpy() for x in tensors]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 389, in __getattr__
    return getattr(self.get(), name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 322, in get
    return self._get_cross_replica()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1237, in _get_cross_replica
    self, axis=None)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 805, in reduce
    return self._extended._reduce(reduce_op, value)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1436, in _reduce
    device_util.current() or "/device:CPU:0"))[0]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/collective_all_reduce_strategy.py", line 490, in _reduce_to
    reduce_op, value, destinations=destinations)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/cross_device_ops.py", line 282, in reduce
    destinations)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/cross_device_ops.py", line 1025, in reduce_implementation
    all_reduced = self._batch_all_reduce(reduce_op, [per_replica_value])[0]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/cross_device_ops.py", line 1091, in _batch_all_reduce
    dense_results = self._do_batch_all_reduce_dense(reduce_op, dense_values)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/cross_device_ops.py", line 1120, in _do_batch_all_reduce_dense
    "Id")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/cross_device_utils.py", line 365, in build_collective_reduce
    return collective_all_reduce()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 526, in _call
    return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.OutOfRangeError:  [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[CollectiveReduce_2]] [Op:__inference_collective_all_reduce_2894985]

Function call stack:
collective_all_reduce


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/user/vmarkovtsev/images/efficientoffice/efficientoffice/__main__.py", line 5, in <module>
    sys.exit(main())
  File "/user/vmarkovtsev/images/efficientoffice/efficientoffice/main.py", line 221, in main
    callbacks=[tensorboard_callback, checkpoint_callback, scheduler])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 789, in fit
    *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 776, in wrapper
    mode=dc.CoordinatorMode.INDEPENDENT_WORKER)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 853, in run_distribute_coordinator
    task_id, session_config, rpc_layer)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 360, in _run_single_worker
    return worker_fn(strategy)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 771, in _worker_fn
    return method(model, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 372, in fit
    prefix='val_')
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 671, in on_start
    self.callbacks._call_end_hook(mode)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 258, in _call_end_hook
    self.on_train_end()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 375, in on_train_end
    callback.on_train_end(logs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 940, in on_train_end
    self._training_state.delete_backup()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/distribute/multi_worker_training_state.py", line 161, in delete_backup
    tracking.AutoTrackable.__delattr__(self._model, CKPT_SAVED_EPOCH)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/tracking/tracking.py", line 94, in __delattr__
    super(AutoTrackable, self).__delattr__(name)
AttributeError: _ckpt_saved_epoch

Epoch 00001: loss improved from inf to 8.36576, saving model to model/DenseNet121-0001-8.366.hdf5
2019-10-19 12:14:33.567096: W tensorflow/core/common_runtime/eager/context.cc:290] Unable to destroy server_ object, so releasing instead. Servers don't support clean shutdown.

Describe the expected behavior

The epoch ends successfully instead of crashing.

Code to reproduce the issue

#!/usr/bin/env python3
import sys
import tensorflow as tf
# Otherwise nothing works, and it really sucks, but is declared in the docs
multi_worker_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

def main():
    batch_size = 12
    features_shape = 372, 558, 3
    labels = 10
    sample = tf.random.uniform(features_shape)

    def with_shape(t, shape):
        t = tf.squeeze(t)
        t.set_shape(shape)
        return t

    ds_train = tf.data.Dataset.from_tensors([sample]).map(lambda s: (s, tf.ones((labels,)))) \
        .repeat().batch(batch_size).map(lambda s, l: (with_shape(s, (batch_size,) + features_shape),
                                                      with_shape(l, (batch_size, labels))))
    ds_val = tf.data.Dataset.from_tensors([sample]).map(lambda s: (s, tf.ones((labels,)))) \
        .repeat().batch(batch_size).take(10).map(
        lambda s, l: (with_shape(s, (batch_size,) + features_shape), with_shape(l, (batch_size, labels))))
    with multi_worker_strategy.scope():
        model = tf.keras.applications.DenseNet121(
            weights=None, input_shape=features_shape, classes=labels)
        model.build((batch_size,) + features_shape)
        model.summary()
        optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
        cross_entropy = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)
        model.compile(optimizer=optimizer, loss=cross_entropy, metrics=["accuracy"])
    model.fit(ds_train, validation_data=ds_val, epochs=1, steps_per_epoch=100)


if __name__ == "__main__":
    sys.exit(main())
@rmothukuru rmothukuru self-assigned this Oct 21, 2019
@rmothukuru
Contributor

@vmarkovtsev,
I tried reproducing the error with the code you provided, but it resulted in no error. Here is the Gist.
Can you please help us reproduce the error? Thanks!

@rmothukuru rmothukuru added comp:dist-strat Distribution Strategy related issues comp:keras Keras related issues TF 2.0 Issues relating to TensorFlow 2.0 stat:awaiting response Status - Awaiting response from author labels Oct 21, 2019
@vmarkovtsev
Contributor Author

@rmothukuru You cannot reproduce it in Colab because it requires at least two physical nodes.

@vmarkovtsev
Contributor Author

Besides, you need to edit my snippet so that it proceeds to a second epoch (e.g. epochs=2), because the error happens at the epoch boundary.

@rchao
Contributor

rchao commented Dec 17, 2019

@vmarkovtsev, thanks for the report and apologies for the delay. I'm looking into this and will get back as soon as I find something. I was wondering how you set TF_CONFIG: is it set before launching this Python program?
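For reference, a minimal sketch of the usual TF_CONFIG setup for MultiWorkerMirroredStrategy, set in the environment on each worker before the training script starts (hostnames and ports below are placeholders):

import json
import os

# Hypothetical two-node cluster; replace the addresses with the real workers.
cluster = {"worker": ["node1.example.com:12345", "node2.example.com:12345"]}

# Each machine sets its own task index (0 on the first worker, 1 on the second)
# before tf.distribute.experimental.MultiWorkerMirroredStrategy() is created.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "worker", "index": 0},
})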

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 18, 2019
@rchao
Contributor

rchao commented Dec 24, 2019

As I looked into it, I have not been able to repro using the attached code (the only difference is that I've set TF_CONFIG on the two workers). That said, we can add a check before deleting the attr.
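The check referred to would presumably guard the attribute deletion that raises the AttributeError in the traceback above. A hypothetical sketch of such a guard (not the actual TensorFlow patch):

CKPT_SAVED_EPOCH = "_ckpt_saved_epoch"

def delete_backup(model):
    # Only remove the bookkeeping attribute if it is still present, so a second
    # call (or a worker that never set it) does not raise AttributeError.
    if hasattr(model, CKPT_SAVED_EPOCH):
        delattr(model, CKPT_SAVED_EPOCH)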

@rmothukuru rmothukuru added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jan 8, 2020
@Flamefire
Contributor

I can confirm this. I independently reported #36153, which seems to be the same issue. I haven't seen an influence of the accuracy metric, though, and it also happens when using a single node with multiple GPUs. It does NOT happen when using a single GPU only. It does happen when using 2 nodes with 1 GPU each.

I tried the code posted here but get multiple warnings:

: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:428] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Found an unshardable source dataset: name: "TensorDataset/_1"
Ignoring multi-device function optimization failure: Invalid argument: The graph couldn't be sorted in topological order.
2020-01-23 17:43:32.377687: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Cancelled: [Derived]Cancelled
Additional GRPC error information:
{"created":"@1579797812.377570803","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Cancelled","grpc_status":1}
2020-01-23 17:43:32.377719: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: [Derived]Cancelled
Additional GRPC error information:
{"created":"@1579797812.377570803","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Cancelled","grpc_status":1}
2020-01-23 17:43:32.377891: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Cancelled: [Derived]Cancelled
Additional GRPC error information:
{"created":"@1579797812.377570803","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Cancelled","grpc_status":1}
1

And then a similar error to mine:

WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 200 batches). You may need to use the repeat() function when building your dataset.
Epoch 2/2
Epoch 2/2
Traceback (most recent call last):
  File "git/tensorflow_tests/tf_issue_33531.py", line 50, in <module>
    sys.exit(main())
  File "git/tensorflow_tests/tf_issue_33531.py", line 46, in main
    model.fit(ds_train, validation_data=ds_val, epochs=2, steps_per_epoch=100)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 790, in fit
    *args, **kwargs)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 777, in wrapper
    mode=dc.CoordinatorMode.INDEPENDENT_WORKER)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 853, in run_distribute_coordinator
    task_id, session_config, rpc_layer)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 360, in _run_single_worker
    return worker_fn(strategy)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 772, in _worker_fn
    return method(model, **kwargs)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 187, in run_one_epoch
    aggregator.finalize()
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_utils.py", line 144, in finalize
    raise ValueError('Empty training data.')
ValueError: Empty training data.
Traceback (most recent call last):
  File "git/tensorflow_tests/tf_issue_33531.py", line 50, in <module>
    sys.exit(main())
  File "git/tensorflow_tests/tf_issue_33531.py", line 46, in main
    model.fit(ds_train, validation_data=ds_val, epochs=2, steps_per_epoch=100)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 790, in fit
    *args, **kwargs)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 777, in wrapper
    mode=dc.CoordinatorMode.INDEPENDENT_WORKER)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 853, in run_distribute_coordinator
    task_id, session_config, rpc_layer)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 360, in _run_single_worker
    return worker_fn(strategy)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 772, in _worker_fn
    return method(model, **kwargs)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 599, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.CancelledError: 2 root error(s) found.
  (0) Cancelled:  RPC Request was cancelled
	 [[node allreduce_1/CollectiveReduce (defined at git/tensorflow_tests/tf_issue_33531.py:46) ]]
	 [[densenet121/conv3_block1_0_bn/ReadVariableOp/_835]]
  (1) Cancelled:  RPC Request was cancelled
	 [[node allreduce_1/CollectiveReduce (defined at git/tensorflow_tests/tf_issue_33531.py:46) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_function_41558]

Errors may have originated from an input operation.
Input Source operations connected to node allreduce_1/CollectiveReduce:
 Cast_2 (defined at /scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/threading.py:926)

Input Source operations connected to node allreduce_1/CollectiveReduce:
 Cast_2 (defined at /scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/threading.py:926)

Function call stack:
distributed_function -> distributed_function

@robertlugg

So here's what I think is going on, based on the same error message I saw during my runs.

TL;DR: your dataset size must be an even multiple of your "total" (global) batch size.

Walking through what I saw:

I'm using a dataset of size, let's say, 3200. My batch size is 128, and I'm using tf.data datasets. Running without any strategy/data parallelism, it runs fine.

I then switch over to running two nodes with MultiWorkerMirroredStrategy and get the same error:

tensorflow.python.framework.errors_impl.OutOfRangeError:  [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[CollectiveReduce_2]] [Op:__inference_collective_all_reduce_2894985]

I realize that the true batch size is 128 * number of workers = 256. Note that 3200 is evenly divisible by 128, yet not by 256.
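
If that is the cause, one way to avoid a partial final batch is to drop the remainder when batching, so every step sees a full global batch on every worker. A minimal sketch using the sizes from this example (otherwise arbitrary):

import tensorflow as tf

DATASET_SIZE = 3200
PER_WORKER_BATCH = 128
NUM_WORKERS = 2
GLOBAL_BATCH = PER_WORKER_BATCH * NUM_WORKERS  # 256; 3200 % 256 != 0

ds = tf.data.Dataset.range(DATASET_SIZE)
# drop_remainder=True discards the last, incomplete batch, so no worker
# hits "End of sequence" before the others.
ds = ds.batch(GLOBAL_BATCH, drop_remainder=True)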

Again, not sure if it's the same problem, so buyer beware.

@Flamefire
Contributor

The actual issue is two things (I might have explained that in #36153):

Using those two, it works, but it's of course a pitfall with confusing error messages.
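
The two items are not quoted here, but the surrounding comments point at the dataset setup: the "input ran out of data" warning above suggests the dataset must be able to supply at least steps_per_epoch * epochs batches (e.g. via repeat()), and the divisibility observation suggests avoiding partial batches. A sketch of both applied to the repro snippet (an inference, not a confirmed list of the two fixes; sample and labels as defined in that script):

batch_size = 12
ds_train = (tf.data.Dataset.from_tensors([sample])
            .map(lambda s: (s, tf.ones((labels,))))
            .repeat()                                 # never run out of batches mid-epoch
            .batch(batch_size, drop_remainder=True))  # only full batches reach the workers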

@goldiegadde goldiegadde added this to In progress in TensorFlow 2.3.0 Aug 5, 2020
@goldiegadde
Contributor

Based on this comment, MultiWorkerMirroredStrategy can now handle a partial batch, and no error is raised with the TF 2.3.0 release.
I am closing this issue for now. @vmarkovtsev, feel free to re-open if this is still not working for you.

TensorFlow 2.3.0 automation moved this from In progress to Done Aug 5, 2020

@TSHTUM007

TSHTUM007 commented Sep 1, 2020

Hey, I have a hiccup with the multi-worker strategy: I want to include a validation set during training just to get a sense of whether the model overfits. Here is the error I am getting:

2020-09-01 13:17:58,695 WARNING (MainThread-32393) eval_fn is not passed in. The worker_fn will be used if an "evaluator" task exists in the cluster.
2020-09-01 13:17:58,695 WARNING (MainThread-32393) eval_strategy is not passed in. No distribution strategy will be used for evaluation.
2020-09-01 13:17:58,697 INFO (MainThread-32393) Using MirroredStrategy with devices ('/job:worker/task:71',)

@TSHTUM007

Here is the code to reproduce this issue

def main_fun(args, ctx):
    import tensorflow as tf
    tf.compat.v1.enable_eager_execution()
    from tensorflowonspark import compat

    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    BUFFER_SIZE = args.buffer_size
    BATCH_SIZE = args.batch_size
    NUM_WORKERS = args.cluster_size
    total_days, n_days, n_features, n_sequence = 60, 56, 1019, 4

    def parse_tfos(example_proto):
        num_features = 1019

        feature_def = {"day_response": tf.io.FixedLenFeature(n_sequence, tf.int64),
                       "days_features": tf.io.FixedLenFeature(n_sequence * n_days * n_features, tf.int64)}

        features = tf.io.parse_single_example(example_proto, feature_def)

        data = tf.cast(features['days_features'], tf.float64)
        label = tf.cast(features['day_response'], tf.float64)

        # data_validation = tf.cast(features['days_features'][(n_sequence - 1) * n_days * n_features:], tf.float64)
        # label_validation = tf.cast(features['day_response'][(n_sequence - 1) * n_days * n_features:], tf.float64)

        data = tf.reshape(data, (n_sequence, n_days, n_features))
        label = tf.reshape(label, (n_sequence, 1))

        # data_validation = tf.reshape(data_validation, (n_sequence - (n_sequence - 1), n_days, n_features))
        # label_validation = tf.reshape(label_validation, (n_sequence - (n_sequence - 1), 1))

        return (data, label)  # , (data_validation, label_validation)

    week_pattern_train = ctx.absolute_path(args.week_week_outcome_train)
    ds_train = tf.data.Dataset.list_files(week_pattern_train)
    ds_train = ds_train.repeat(args.epochs).shuffle(BUFFER_SIZE)
    ds_train = ds_train.interleave(tf.data.TFRecordDataset)

    week_pattern_validate = ctx.absolute_path(args.week_week_outcome_validate)
    ds_validate = tf.data.Dataset.list_files(week_pattern_validate)
    ds_validate = ds_validate.repeat(args.epochs).shuffle(BUFFER_SIZE)
    ds_validate = ds_validate.interleave(tf.data.TFRecordDataset)

    train_datasets_unbatched = ds_train.map(parse_tfos)
    validation_datasets_unbatched = ds_validate.map(parse_tfos)

    def build_and_compile_lstm_model():
        num_features = 1019
        n_days = 56
        model = tf.keras.Sequential([
            tf.keras.layers.LSTM(num_features, input_shape=(n_days, num_features)),
            tf.keras.layers.Dense(num_features, activation='relu'),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(int(num_features * .5), activation='softplus'),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(1),
        ])
        model.compile(loss='mean_squared_error', optimizer='adam')
        return model

    GLOBAL_BATCH_SIZE = BATCH_SIZE * NUM_WORKERS

    from tensorflow.keras.callbacks import EarlyStopping
    early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

    tf.io.gfile.makedirs(args.model_dir)
    filepath = args.model_dir + "/weights-{epoch:04d}"
    callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath=filepath, verbose=1, save_weights_only=False),
                 tf.keras.callbacks.TensorBoard(log_dir=args.model_dir)]

    steps_per_epoch = 200

    with strategy.scope():
        multi_worker_model = build_and_compile_lstm_model()
        multi_worker_model.fit(x=train_datasets_unbatched, epochs=args.epochs,  # steps_per_epoch=steps_per_epoch,
                               callbacks=callbacks,
                               validation_data=validation_datasets_unbatched)

    from tensorflow_estimator.python.estimator.export import export_lib
    export_dir = export_lib.get_timestamped_export_dir(args.export_dir)
    compat.export_saved_model(multi_worker_model, export_dir, ctx.job_name == 'chief')
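
One thing that stands out in this snippet (an observation, not a confirmed diagnosis): the datasets are passed to fit() unbatched, and GLOBAL_BATCH_SIZE is computed but never used. With MultiWorkerMirroredStrategy the dataset is normally batched with the global batch size before fit(), for example:

train_ds = train_datasets_unbatched.batch(GLOBAL_BATCH_SIZE, drop_remainder=True)
val_ds = validation_datasets_unbatched.batch(GLOBAL_BATCH_SIZE, drop_remainder=True)

with strategy.scope():
    multi_worker_model = build_and_compile_lstm_model()
    multi_worker_model.fit(x=train_ds, epochs=args.epochs,
                           callbacks=callbacks,
                           validation_data=val_ds)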
