MultiWorkerMirroredStrategy does not work with Keras + accuracy metric #33531

Closed
vmarkovtsev opened this issue Oct 19, 2019 · 12 comments
Labels: comp:dist-strat (Distribution Strategy related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.0 (Issues relating to TensorFlow 2.0), type:bug (Bug)

Comments

@vmarkovtsev
Contributor

System information
The same environment as in #32654, but with 2 machines instead of 1, and the TensorFlow 2.0 release from PyPI.

Describe the current behavior

I am training DenseNet121 on ImageNet with standard Keras code and a custom dataset pipeline. model.compile is called with "accuracy" as the only metric. I am using MultiWorkerMirroredStrategy as described in the tutorial. Here is the log; I had to erase ~7,000 warnings which are all the same: 2019-10-19 12:23:10.615259: W tensorflow/core/framework/op_kernel.cc:309] OpKernelContext is tracking allocations but they are not being consumed by the StepStatsCollector.

Compiling with RMSprop
Fitting...
WARNING:tensorflow:`eval_fn` is not passed in. The `worker_fn` will be used if an "evaluator" task exists in the cluster.
`eval_fn` is not passed in. The `worker_fn` will be used if an "evaluator" task exists in the cluster.
WARNING:tensorflow:`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
2019-10-19 03:57:05.813768: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:400] Cannot find shardable dataset, adding a shard node at the end of the dataset instead. This may have performance implications.
2019-10-19 03:57:19.342401: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:400] Cannot find shardable dataset, adding a shard node at the end of the dataset instead. This may have performance implications.
2019-10-19 03:59:48.236258: I tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:316] Abandoning ScopedAllocatorOptimizer because input FusedBatchNormGradV3_99 output 1 is already assigned to scope_id 132
2019-10-19 03:59:48.236611: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:381] error: Internal: Abandoning ScopedAllocatorOptimizer because input FusedBatchNormGradV3_99 output 1 is already assigned to scope_id 132
2019-10-19 03:59:48.236834: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:990] error: Internal: Abandoning ScopedAllocatorOptimizer because input FusedBatchNormGradV3_99 output 1 is already assigned to scope_id 132
2019-10-19 03:59:48.237468: E tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1007] ScopedAllocatorOptimizer: Internal: Abandoning ScopedAllocatorOptimizer because input FusedBatchNormGradV3_99 output 1 is already assigned to scope_id 132
2019-10-19 03:59:48.237593: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:782] error: Internal: Abandoning ScopedAllocatorOptimizer because input FusedBatchNormGradV3_99 output 1 is already assigned to scope_id 132
2019-10-19 03:59:48.299255: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] scoped_allocator_optimizer failed: Internal: Abandoning ScopedAllocatorOptimizer because input FusedBatchNormGradV3_99 output 1 is already assigned to scope_id 132
2019-10-19 04:00:01.523007: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-10-19 04:00:11.506609: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-10-19 04:00:19.689077: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
2019-10-19 04:00:29.848332: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
2019-10-19 04:00:29.848931: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcupti.so.10.0'; dlerror: libcupti.so.10.0: cannot open shared object file: No such file or directory
2019-10-19 04:00:29.849025: W tensorflow/core/profiler/lib/profiler_session.cc:192] Encountered error while starting profiler: Unavailable: CUPTI error: CUPTI could not be loaded or symbol could not be found.
Train for 15974.0 steps

Epoch 00001: LearningRateScheduler reducing learning rate to 0.0009375.
Epoch 1/400
2019-10-19 04:00:34.268294: I tensorflow/core/platform/default/device_tracer.cc:588] Collecting 0 kernel records, 0 memcpy records.
2019-10-19 04:00:34.314465: E tensorflow/core/platform/default/device_tracer.cc:70] CUPTI error: CUPTI could not be loaded or symbol could not be found.
15973/15974 [============================>.] - ETA: 1s - loss: 8.3656 - accuracy: 0.01342019-10-19 12:14:21.634609: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:21.635077: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:21.635164: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:21.635253: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[replica_3/metrics/accuracy/AssignAddVariableOp_1/_55]]
2019-10-19 12:14:21.635336: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Cancelled: [_Derived_]Cancelled
Additional GRPC error information:
{"created":"@1571487261.635191370","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Cancelled","grpc_status":1}
2019-10-19 12:14:21.635412: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:21.635529: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:21.635680: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[replica_3/metrics/accuracy/AssignAddVariableOp_1/_43]]
2019-10-19 12:14:21.635764: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Cancelled: [_Derived_]Cancelled
Additional GRPC error information:
{"created":"@1571487261.635191370","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Cancelled","grpc_status":1}
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:21.635930: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[replica_1/metrics/accuracy/AssignAddVariableOp_1/_63]]
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
Additional GRPC error information:
{"created":"@1571487261.635191370","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Cancelled","grpc_status":1}
2019-10-19 12:14:21.636135: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext_3}}]]
2019-10-19 12:14:23.196391: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.196583: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.196683: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.196964: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.197197: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.197232: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[CollectiveReduce_3]]
	 [[CollectiveReduce_1/_16]]
2019-10-19 12:14:23.197283: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[CollectiveReduce_2]]
2019-10-19 12:14:23.197353: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[CollectiveReduce_3]]
	 [[CollectiveReduce/ReadVariableOp/_18]]
2019-10-19 12:14:23.197395: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.197460: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.197507: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[CollectiveReduce_3]]
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.197742: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
2019-10-19 12:14:23.197870: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[CollectiveReduce_1]]
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 668, in on_start
    yield
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 372, in fit
    prefix='val_')
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 685, in on_epoch
    self.callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 298, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 963, in on_epoch_end
    self._save_model(epoch=epoch, logs=logs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 1001, in _save_model
    self.model.save(filepath, overwrite=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/network.py", line 975, in save
    signatures, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/save.py", line 112, in save_model
    model, filepath, overwrite, include_optimizer)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 109, in save_model_to_hdf5
    save_weights_to_hdf5_group(model_weights_group, model_layers)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 627, in save_weights_to_hdf5_group
    weight_values = K.batch_get_value(weights)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3296, in batch_get_value
    return [x.numpy() for x in tensors]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3296, in <listcomp>
    return [x.numpy() for x in tensors]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 389, in __getattr__
    return getattr(self.get(), name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 322, in get
    return self._get_cross_replica()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1237, in _get_cross_replica
    self, axis=None)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 805, in reduce
    return self._extended._reduce(reduce_op, value)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1436, in _reduce
    device_util.current() or "/device:CPU:0"))[0]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/collective_all_reduce_strategy.py", line 490, in _reduce_to
    reduce_op, value, destinations=destinations)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/cross_device_ops.py", line 282, in reduce
    destinations)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/cross_device_ops.py", line 1025, in reduce_implementation
    all_reduced = self._batch_all_reduce(reduce_op, [per_replica_value])[0]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/cross_device_ops.py", line 1091, in _batch_all_reduce
    dense_results = self._do_batch_all_reduce_dense(reduce_op, dense_values)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/cross_device_ops.py", line 1120, in _do_batch_all_reduce_dense
    "Id")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/cross_device_utils.py", line 365, in build_collective_reduce
    return collective_all_reduce()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 526, in _call
    return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.OutOfRangeError:  [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[CollectiveReduce_2]] [Op:__inference_collective_all_reduce_2894985]

Function call stack:
collective_all_reduce


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/user/vmarkovtsev/images/efficientoffice/efficientoffice/__main__.py", line 5, in <module>
    sys.exit(main())
  File "/user/vmarkovtsev/images/efficientoffice/efficientoffice/main.py", line 221, in main
    callbacks=[tensorboard_callback, checkpoint_callback, scheduler])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 789, in fit
    *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 776, in wrapper
    mode=dc.CoordinatorMode.INDEPENDENT_WORKER)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 853, in run_distribute_coordinator
    task_id, session_config, rpc_layer)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 360, in _run_single_worker
    return worker_fn(strategy)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 771, in _worker_fn
    return method(model, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 372, in fit
    prefix='val_')
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 671, in on_start
    self.callbacks._call_end_hook(mode)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 258, in _call_end_hook
    self.on_train_end()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 375, in on_train_end
    callback.on_train_end(logs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py", line 940, in on_train_end
    self._training_state.delete_backup()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/distribute/multi_worker_training_state.py", line 161, in delete_backup
    tracking.AutoTrackable.__delattr__(self._model, CKPT_SAVED_EPOCH)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/tracking/tracking.py", line 94, in __delattr__
    super(AutoTrackable, self).__delattr__(name)
AttributeError: _ckpt_saved_epoch

Epoch 00001: loss improved from inf to 8.36576, saving model to model/DenseNet121-0001-8.366.hdf5
2019-10-19 12:14:33.567096: W tensorflow/core/common_runtime/eager/context.cc:290] Unable to destroy server_ object, so releasing instead. Servers don't support clean shutdown.

Describe the expected behavior

The epoch ends successfully instead of crashing.

Code to reproduce the issue

#!/usr/bin/env python3
import sys
import tensorflow as tf
# Otherwise nothing works, and it really sucks, but is declared in the docs
multi_worker_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

def main():
    batch_size = 12
    features_shape = 372, 558, 3
    labels = 10
    sample = tf.random.uniform(features_shape)

    def with_shape(t, shape):
        t = tf.squeeze(t)
        t.set_shape(shape)
        return t

    ds_train = tf.data.Dataset.from_tensors([sample]).map(lambda s: (s, tf.ones((labels,)))) \
        .repeat().batch(batch_size).map(lambda s, l: (with_shape(s, (batch_size,) + features_shape),
                                                      with_shape(l, (batch_size, labels))))
    ds_val = tf.data.Dataset.from_tensors([sample]).map(lambda s: (s, tf.ones((labels,)))) \
        .repeat().batch(batch_size).take(10).map(
        lambda s, l: (with_shape(s, (batch_size,) + features_shape), with_shape(l, (batch_size, labels))))
    with multi_worker_strategy.scope():
        model = tf.keras.applications.DenseNet121(
            weights=None, input_shape=features_shape, classes=labels)
        model.build((batch_size,) + features_shape)
        model.summary()
        optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
        cross_entropy = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)
        model.compile(optimizer=optimizer, loss=cross_entropy, metrics=["accuracy"])
    model.fit(ds_train, validation_data=ds_val, epochs=1, steps_per_epoch=100)


if __name__ == "__main__":
    sys.exit(main())
@rmothukuru rmothukuru self-assigned this Oct 21, 2019
@rmothukuru
Contributor

@vmarkovtsev,
I tried reproducing the error with the code you provided, but it resulted in no error. Here is the Gist.
Can you please help us reproduce the error? Thanks!

@rmothukuru rmothukuru added comp:dist-strat Distribution Strategy related issues comp:keras Keras related issues TF 2.0 Issues relating to TensorFlow 2.0 stat:awaiting response Status - Awaiting response from author labels Oct 21, 2019
@vmarkovtsev
Contributor Author

@rmothukuru You cannot reproduce it in Colab because it requires at least two physical nodes.

@vmarkovtsev
Contributor Author

Besides, you need to edit my snippet so that it proceeds to a second epoch (e.g. epochs=2), because the error happens at the epoch boundary.

@rchao
Contributor

rchao commented Dec 17, 2019

@vmarkovtsev, thanks for the report and apologies for the delay. I'm looking into this and will get back as soon as I find something. I was wondering how you set TF_CONFIG: is it set before launching this Python program?
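For reference, a minimal sketch of the usual TF_CONFIG setup for MultiWorkerMirroredStrategy, set in the environment on each worker before the training script starts (hostnames and ports below are placeholders):

import json
import os

# Hypothetical two-node cluster; replace the addresses with the real workers.
cluster = {"worker": ["node1.example.com:12345", "node2.example.com:12345"]}

# Each machine sets its own task index (0 on the first worker, 1 on the second)
# before tf.distribute.experimental.MultiWorkerMirroredStrategy() is created.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "worker", "index": 0},
})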

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 18, 2019
@rchao
Contributor

rchao commented Dec 24, 2019

As I looked into it, I have not been able to repro using the attached code (the only difference is that I've set TF_CONFIG on the two workers). That said, we can add a check before deleting the attr.
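The check referred to would presumably guard the attribute deletion that raises the AttributeError in the traceback above. A hypothetical sketch of such a guard (not the actual TensorFlow patch):

CKPT_SAVED_EPOCH = "_ckpt_saved_epoch"

def delete_backup(model):
    # Only remove the bookkeeping attribute if it is still present, so a second
    # call (or a worker that never set it) does not raise AttributeError.
    if hasattr(model, CKPT_SAVED_EPOCH):
        delattr(model, CKPT_SAVED_EPOCH)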

@rmothukuru rmothukuru added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jan 8, 2020
@Flamefire
Contributor

I can confirm this. I independently reported #36153, which seems to be the same issue. I haven't seen an influence of the accuracy metric, though, and it also happens when using a single node with multiple GPUs. It does NOT happen when using a single GPU only. It does happen when using 2 nodes with 1 GPU each.

I tried the code posted here but get multiple warnings:

: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:428] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Found an unshardable source dataset: name: "TensorDataset/_1"
Ignoring multi-device function optimization failure: Invalid argument: The graph couldn't be sorted in topological order.
2020-01-23 17:43:32.377687: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Cancelled: [Derived]Cancelled
Additional GRPC error information:
{"created":"@1579797812.377570803","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Cancelled","grpc_status":1}
2020-01-23 17:43:32.377719: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: [Derived]Cancelled
Additional GRPC error information:
{"created":"@1579797812.377570803","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Cancelled","grpc_status":1}
2020-01-23 17:43:32.377891: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Cancelled: [Derived]Cancelled
Additional GRPC error information:
{"created":"@1579797812.377570803","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Cancelled","grpc_status":1}
1

And then a similar error to mine:

WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 200 batches). You may need to use the repeat() function when building your dataset.
Epoch 2/2
Epoch 2/2
Traceback (most recent call last):
  File "git/tensorflow_tests/tf_issue_33531.py", line 50, in <module>
    sys.exit(main())
  File "git/tensorflow_tests/tf_issue_33531.py", line 46, in main
    model.fit(ds_train, validation_data=ds_val, epochs=2, steps_per_epoch=100)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 790, in fit
    *args, **kwargs)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 777, in wrapper
    mode=dc.CoordinatorMode.INDEPENDENT_WORKER)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 853, in run_distribute_coordinator
    task_id, session_config, rpc_layer)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 360, in _run_single_worker
    return worker_fn(strategy)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 772, in _worker_fn
    return method(model, **kwargs)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 187, in run_one_epoch
    aggregator.finalize()
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_utils.py", line 144, in finalize
    raise ValueError('Empty training data.')
ValueError: Empty training data.
Traceback (most recent call last):
  File "git/tensorflow_tests/tf_issue_33531.py", line 50, in <module>
    sys.exit(main())
  File "git/tensorflow_tests/tf_issue_33531.py", line 46, in main
    model.fit(ds_train, validation_data=ds_val, epochs=2, steps_per_epoch=100)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 790, in fit
    *args, **kwargs)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 777, in wrapper
    mode=dc.CoordinatorMode.INDEPENDENT_WORKER)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 853, in run_distribute_coordinator
    task_id, session_config, rpc_layer)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 360, in _run_single_worker
    return worker_fn(strategy)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 772, in _worker_fn
    return method(model, **kwargs)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 599, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.CancelledError: 2 root error(s) found.
  (0) Cancelled:  RPC Request was cancelled
	 [[node allreduce_1/CollectiveReduce (defined at git/tensorflow_tests/tf_issue_33531.py:46) ]]
	 [[densenet121/conv3_block1_0_bn/ReadVariableOp/_835]]
  (1) Cancelled:  RPC Request was cancelled
	 [[node allreduce_1/CollectiveReduce (defined at git/tensorflow_tests/tf_issue_33531.py:46) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_function_41558]

Errors may have originated from an input operation.
Input Source operations connected to node allreduce_1/CollectiveReduce:
 Cast_2 (defined at /scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/threading.py:926)

Input Source operations connected to node allreduce_1/CollectiveReduce:
 Cast_2 (defined at /scratch/ws/s3248973-EasyBuild/easybuild-haswell/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/threading.py:926)

Function call stack:
distributed_function -> distributed_function

@robertlugg

So here's what I think is going on, based on the same error message I saw during my runs.

TL;DR: your dataset size must be an even multiple of your "total" (global) batch size.

Walking through what I saw:

I'm using a dataset of size, let's say, 3200. My batch size is 128, and I'm using tf.data datasets. Running without any strategy/data parallelism, it runs fine.

I then switch over to running two nodes with MultiWorkerMirroredStrategy and get the same error:

tensorflow.python.framework.errors_impl.OutOfRangeError:  [_Derived_]End of sequence
	 [[{{node IteratorGetNext_3}}]]
	 [[GroupCrossDeviceControlEdges_1/metrics/accuracy/div_no_nan/_127]]
	 [[CollectiveReduce_2]] [Op:__inference_collective_all_reduce_2894985]

I realize that the true batch size is 128 * number of workers = 256. Note that 3200 is evenly divisible by 128, yet not by 256.
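
If that is the cause, one way to avoid a partial final batch is to drop the remainder when batching, so every step sees a full global batch on every worker. A minimal sketch using the sizes from this example (otherwise arbitrary):

import tensorflow as tf

DATASET_SIZE = 3200
PER_WORKER_BATCH = 128
NUM_WORKERS = 2
GLOBAL_BATCH = PER_WORKER_BATCH * NUM_WORKERS  # 256; 3200 % 256 != 0

ds = tf.data.Dataset.range(DATASET_SIZE)
# drop_remainder=True discards the last, incomplete batch, so no worker
# hits "End of sequence" before the others.
ds = ds.batch(GLOBAL_BATCH, drop_remainder=True)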

Again, not sure if it's the same problem, so buyer beware.

@Flamefire
Contributor

The actual issue is two things (I might have explained that in #36153):

Using those two, it works, but it's of course a pitfall with confusing error messages.
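
The two items are not quoted here, but the surrounding comments point at the dataset setup: the "input ran out of data" warning above suggests the dataset must be able to supply at least steps_per_epoch * epochs batches (e.g. via repeat()), and the divisibility observation suggests avoiding partial batches. A sketch of both applied to the repro snippet (an inference, not a confirmed list of the two fixes; sample and labels as defined in that script):

batch_size = 12
ds_train = (tf.data.Dataset.from_tensors([sample])
            .map(lambda s: (s, tf.ones((labels,))))
            .repeat()                                 # never run out of batches mid-epoch
            .batch(batch_size, drop_remainder=True))  # only full batches reach the workers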

@goldiegadde goldiegadde added this to In progress in TensorFlow 2.3.0 Aug 5, 2020
@goldiegadde
Contributor

Based on this comment, MultiWorkerMirroredStrategy can now handle a partial batch, and no error is raised with the TF 2.3.0 release.
I am closing this issue for now. @vmarkovtsev, feel free to re-open if this is still not working for you.

TensorFlow 2.3.0 automation moved this from In progress to Done Aug 5, 2020

@TSHTUM007

TSHTUM007 commented Sep 1, 2020

Hey, I have a hiccup with the multi-worker strategy: I want to include a validation set during training just to get a sense of whether the model overfits. Here is the error I am getting:

2020-09-01 13:17:58,695 WARNING (MainThread-32393) eval_fn is not passed in. The worker_fn will be used if an "evaluator" task exists in the cluster.
2020-09-01 13:17:58,695 WARNING (MainThread-32393) eval_strategy is not passed in. No distribution strategy will be used for evaluation.
2020-09-01 13:17:58,697 INFO (MainThread-32393) Using MirroredStrategy with devices ('/job:worker/task:71',)

@TSHTUM007

Here is the code to reproduce this issue

def main_fun(args, ctx):
    import tensorflow as tf
    tf.compat.v1.enable_eager_execution()
    from tensorflowonspark import compat

    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    BUFFER_SIZE = args.buffer_size
    BATCH_SIZE = args.batch_size
    NUM_WORKERS = args.cluster_size
    total_days, n_days, n_features, n_sequence = 60, 56, 1019, 4

    def parse_tfos(example_proto):
        num_features = 1019

        feature_def = {"day_response": tf.io.FixedLenFeature(n_sequence, tf.int64),
                       "days_features": tf.io.FixedLenFeature(n_sequence * n_days * n_features, tf.int64)}

        features = tf.io.parse_single_example(example_proto, feature_def)

        data = tf.cast(features['days_features'], tf.float64)
        label = tf.cast(features['day_response'], tf.float64)

        # data_validation = tf.cast(features['days_features'][(n_sequence - 1) * n_days * n_features:], tf.float64)
        # label_validation = tf.cast(features['day_response'][(n_sequence - 1) * n_days * n_features:], tf.float64)

        data = tf.reshape(data, (n_sequence, n_days, n_features))
        label = tf.reshape(label, (n_sequence, 1))

        # data_validation = tf.reshape(data_validation, (n_sequence - (n_sequence - 1), n_days, n_features))
        # label_validation = tf.reshape(label_validation, (n_sequence - (n_sequence - 1), 1))

        return (data, label)  # , (data_validation, label_validation)

    week_pattern_train = ctx.absolute_path(args.week_week_outcome_train)
    ds_train = tf.data.Dataset.list_files(week_pattern_train)
    ds_train = ds_train.repeat(args.epochs).shuffle(BUFFER_SIZE)
    ds_train = ds_train.interleave(tf.data.TFRecordDataset)

    week_pattern_validate = ctx.absolute_path(args.week_week_outcome_validate)
    ds_validate = tf.data.Dataset.list_files(week_pattern_validate)
    ds_validate = ds_validate.repeat(args.epochs).shuffle(BUFFER_SIZE)
    ds_validate = ds_validate.interleave(tf.data.TFRecordDataset)

    train_datasets_unbatched = ds_train.map(parse_tfos)
    validation_datasets_unbatched = ds_validate.map(parse_tfos)

    def build_and_compile_lstm_model():
        num_features = 1019
        n_days = 56
        model = tf.keras.Sequential([
            tf.keras.layers.LSTM(num_features, input_shape=(n_days, num_features)),
            tf.keras.layers.Dense(num_features, activation='relu'),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(int(num_features * .5), activation='softplus'),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(1),
        ])
        model.compile(loss='mean_squared_error', optimizer='adam')
        return model

    GLOBAL_BATCH_SIZE = BATCH_SIZE * NUM_WORKERS

    from tensorflow.keras.callbacks import EarlyStopping
    early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

    tf.io.gfile.makedirs(args.model_dir)
    filepath = args.model_dir + "/weights-{epoch:04d}"
    callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath=filepath, verbose=1, save_weights_only=False),
                 tf.keras.callbacks.TensorBoard(log_dir=args.model_dir)]

    steps_per_epoch = 200

    with strategy.scope():
        multi_worker_model = build_and_compile_lstm_model()
        multi_worker_model.fit(x=train_datasets_unbatched, epochs=args.epochs,  # steps_per_epoch=steps_per_epoch,
                               callbacks=callbacks,
                               validation_data=validation_datasets_unbatched)

    from tensorflow_estimator.python.estimator.export import export_lib
    export_dir = export_lib.get_timestamped_export_dir(args.export_dir)
    compat.export_saved_model(multi_worker_model, export_dir, ctx.job_name == 'chief')
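
One thing that stands out in this snippet (an observation, not a confirmed diagnosis): the datasets are passed to fit() unbatched, and GLOBAL_BATCH_SIZE is computed but never used. With MultiWorkerMirroredStrategy the dataset is normally batched with the global batch size before fit(), for example:

train_ds = train_datasets_unbatched.batch(GLOBAL_BATCH_SIZE, drop_remainder=True)
val_ds = validation_datasets_unbatched.batch(GLOBAL_BATCH_SIZE, drop_remainder=True)

with strategy.scope():
    multi_worker_model = build_and_compile_lstm_model()
    multi_worker_model.fit(x=train_ds, epochs=args.epochs,
                           callbacks=callbacks,
                           validation_data=val_ds)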
