
collective_ops.all_reduce_v2 with ordering_token does not work correctly #56885

Open
chengmengli06 opened this issue Jul 25, 2022 · 12 comments
Assignees: crccw
Labels: comp:dist-strat (Distribution Strategy related issues), comp:ops (OPs related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.8, type:bug (Bug)

Comments

chengmengli06 commented Jul 25, 2022

Issue Type

Bug

Source

binary

Tensorflow Version

tf 2.5 or tf 2.8

Custom Code

No

OS Platform and Distribution

CentOS 7.2

Mobile device

No response

Python version

3.7

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current Behaviour?

I use MultiWorkerMirroredStrategy to train a deep recommendation model in EasyRec. However, training fails with the following error:

2022-07-25 11:57:07.263300: E tensorflow/core/common_runtime/base_collective_executor.cc:243] BaseCollectiveExecutor::StartAbort Invalid argument: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.

(The full log is in the "Relevant log output" section below.)

However, if I replace the call to collective_ops.all_gather_v2 with collective_ops.all_gather in tensorflow/python/distribute/cross_device_utils.py:381, everything runs fine. What does ordering_token mean?
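
For reference, a hedged sketch of the workaround described above. It assumes the _all_gather helper of CollectiveReplicaLauncher in tensorflow/python/distribute/cross_device_utils.py and the TF 2.5/2.8-era collective_ops API; the wrapper body and helper names below are reconstructed from the traceback, not copied from the actual TensorFlow or EasyRec code.

# Illustrative only: swap the v2 gather (which takes ordering_token) for the
# v1 gather inside the collective launcher's helper. Attribute and helper
# names are assumptions based on the traceback above.
from tensorflow.python.ops import collective_ops

def _all_gather(self, input_tensor, communication_hint='AUTO', timeout=0):
  instance_key = self._next_instance_key()  # assumed helper on the launcher
  # Original (v2) call, roughly as shown in the traceback:
  # return collective_ops.all_gather_v2(
  #     input_tensor, self._group_size, self._group_key, instance_key,
  #     communication_hint=communication_hint, timeout=timeout,
  #     ordering_token=self._get_ordering_token(communication_hint))
  # Workaround: fall back to the v1 op, which does not take ordering_token.
  return collective_ops.all_gather(
      input_tensor, self._group_size, self._group_key, instance_key,
      communication_hint=communication_hint, timeout=timeout)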

Standalone code to reproduce the issue

git clone https://github.com/alibaba/EasyRec.git
cd EasyRec
bash scripts/init.sh
TEST_DEVICES='' python -m easy_rec.python.test.train_eval_test TrainEvalTest.test_train_with_multi_worker_mirror

Relevant log output

I0725 11:56:59.385384 140272595433216 basic_session_run_hooks.py:262] loss = 1.0119569, step = 0
INFO:tensorflow:lr = 0.001,step = 0,cross_entropy_loss = 1.0084826,regularization_loss = 0.003870264,total_loss = 1.0123528
I0725 11:56:59.386929 140272595433216 basic_session_run_hooks.py:254] lr = 0.001,step = 0,cross_entropy_loss = 1.0084826,regularization_loss = 0.003870264,total_loss = 1.0123528
2022-07-25 11:57:07.263300: E tensorflow/core/common_runtime/base_collective_executor.cc:243] BaseCollectiveExecutor::StartAbort Invalid argument: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
2022-07-25 11:57:07.264955: E tensorflow/core/common_runtime/ring_alg.cc:276] Aborting RingGather with Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
Additional GRPC error information from remote target /job:worker/replica:0/task:1:
:{"created":"@1658721427.264762538","description":"Error received from peer ipv4:127.0.0.1:10838","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"[_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":3}
2022-07-25 11:57:07.265079: E tensorflow/core/common_runtime/ring_alg.cc:276] Aborting RingGather with Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
Additional GRPC error information from remote target /job:worker/replica:0/task:1:
:{"created":"@1658721427.264794128","description":"Error received from peer ipv4:127.0.0.1:10838","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"[_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":3}
2022-07-25 11:57:07.265118: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at collective_ops.cc:713 : Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
2022-07-25 11:57:07.265255: E tensorflow/core/common_runtime/ring_alg.cc:276] Aborting RingGather with Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
Additional GRPC error information from remote target /job:worker/replica:0/task:1:
:{"created":"@1658721427.264813400","description":"Error received from peer ipv4:127.0.0.1:10838","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"[_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":3}
2022-07-25 11:57:07.265326: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at collective_ops.cc:713 : Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
2022-07-25 11:57:07.268713: E tensorflow/core/common_runtime/ring_alg.cc:276] Aborting RingGather with Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
Additional GRPC error information from remote target /job:worker/replica:0/task:1:
:{"created":"@1658721427.264830344","description":"Error received from peer ipv4:127.0.0.1:10838","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"[_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":3}
2022-07-25 11:57:07.268788: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at collective_ops.cc:713 : Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
2022-07-25 11:57:07.269536: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at collective_ops.cc:713 : Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
Traceback (most recent call last):
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:worker/replica:0/task:0:
[_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
	 [[{{node CollectiveGatherV2_16}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/apsarapangu/disk3/mengli.cml/easy_rec_outer/EasyRec/easy_rec/python/train_eval.py", line 145, in <module>
    tf.app.run()
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/apsarapangu/disk3/mengli.cml/easy_rec_outer/EasyRec/easy_rec/python/train_eval.py", line 139, in main
    FLAGS.check_mode)
  File "/apsarapangu/disk3/mengli.cml/easy_rec_outer/EasyRec/easy_rec/python/main.py", line 330, in _train_and_evaluate_impl
    estimator_train.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/apsarapangu/disk3/mengli.cml/easy_rec_outer/EasyRec/easy_rec/python/compat/estimator_train.py", line 75, in train_and_evaluate
    _TrainingExecutor)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/estimator_training.py", line 290, in train_and_evaluate
    session_config=run_config.session_config)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 861, in run_distribute_coordinator
    task_id, session_config, rpc_layer)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 360, in _run_single_worker
    return worker_fn(strategy)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/estimator_training.py", line 252, in _worker_fn
    hooks=hooks)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1173, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1235, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1349, in _actual_train_model_distributed
    saving_listeners)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1427, in _train_with_estimator_spec
    estimator_spec, worker_hooks, saving_listeners)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1374, in _train_with_estimator_spec_distributed
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 779, in run
    run_metadata=run_metadata)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1284, in run
    run_metadata=run_metadata)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1385, in run
    raise six.reraise(*original_exc_info)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1370, in run
    return self._sess.run(*args, **kwargs)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1443, in run
    run_metadata=run_metadata)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1201, in run
    return self._sess.run(*args, **kwargs)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 968, in run
    run_metadata_ptr)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1369, in _do_run
    run_metadata)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:worker/replica:0/task:0:
[_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
	 [[node CollectiveGatherV2_16 (defined at /anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py:1319) ]]

Original stack trace for 'CollectiveGatherV2_16':
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/easy_rec_outer/EasyRec/easy_rec/python/train_eval.py", line 145, in <module>
    tf.app.run()
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/easy_rec_outer/EasyRec/easy_rec/python/train_eval.py", line 139, in main
    FLAGS.check_mode)
  File "/easy_rec_outer/EasyRec/easy_rec/python/main.py", line 330, in _train_and_evaluate_impl
    estimator_train.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/easy_rec_outer/EasyRec/easy_rec/python/compat/estimator_train.py", line 75, in train_and_evaluate
    _TrainingExecutor)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/estimator_training.py", line 290, in train_and_evaluate
    session_config=run_config.session_config)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 861, in run_distribute_coordinator
    task_id, session_config, rpc_layer)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 360, in _run_single_worker
    return worker_fn(strategy)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/estimator_training.py", line 252, in _worker_fn
    hooks=hooks)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1173, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1235, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1319, in _actual_train_model_distributed
    self.config))
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2833, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 679, in _call_for_each_replica
    self._container_strategy(), fn, args, kwargs)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_run.py", line 104, in call_for_each_replica
    return _call_for_each_replica(strategy, fn, args, kwargs)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_run.py", line 239, in _call_for_each_replica
    **merge_kwargs)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 597, in wrapper
    return func(*args, **kwargs)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/training/optimizer.py", line 676, in _distributed_apply
    ds_reduce_util.ReduceOp.SUM, grads_and_vars)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2402, in batch_reduce_to
    return self._batch_reduce_to(reduce_op, value_destination_pairs, options)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 770, in _batch_reduce_to
    options=self._communication_options.merge(options))
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 447, in batch_reduce
    options)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 1270, in batch_reduce_implementation
    for value, dest in value_destination_pairs
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 1270, in <listcomp>
    for value, dest in value_destination_pairs
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 1225, in reduce_implementation
    options)[0]
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 1212, in _all_reduce_per_replica_values
    self._all_reduce(reduce_op, values_by_device[i], i, options))
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 1175, in _all_reduce
    options.timeout_seconds))
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_utils.py", line 566, in all_reduce_indexed_slices
    length, communication_hint, timeout=timeout)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_utils.py", line 388, in _all_gather
    ordering_token=ordering_token)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/ops/collective_ops.py", line 200, in all_gather_v2
    ordering_token=ordering_token or [])
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/ops/gen_collective_ops.py", line 545, in collective_gather_v2
    timeout_seconds=timeout_seconds, name=name)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3565, in _create_op_internal
    op_def=op_def)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)
chengmengli06 (Author)
Could @crccw give some explanation of ordering_token?

@tilakrayal tilakrayal added TF 2.8 comp:ops OPs related issues comp:dist-strat Distribution Strategy related issues labels Jul 25, 2022
gadagashwini (Contributor) commented Jul 28, 2022

Hi @chengmengli06, I tried to replicate the issue with TF v2.9, but I don't see any error:

(base) gadag@ashwini-gpu:~/EasyRec$ TEST_DEVICES='' python -m easy_rec.python.test.train_eval_test TrainEvalTest.test_train_with_multi_worker_mirror
[2022-07-28 06:17:19,486][WARNING] pyhive is not installed.
[2022-07-28 06:17:19,592][INFO] GraphLearn is not installed. You can install it by "pip install https://easyrec.oss-cn-beijing.aliyuncs.com/3rdparty/graphlearn-0.7-cp27-cp27mu-linux_x86_64.whl"
[2022-07-28 06:17:20,430][WARNING] DataHub is not installed. You can install it by: pip install pydatahub
easy_rec version: 0.5.4
Usage: easy_rec.help()
/home/gadag/EasyRec/easy_rec/python/test/train_eval_test.py:639: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  LooseVersion(tf.__version__) != LooseVersion('2.3.0'),
2022-07-28 06:17:20.563878: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-28 06:17:20.564777: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-28 06:17:20.587171: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-28 06:17:20.587993: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-28 06:17:20.588737: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-28 06:17:20.589462: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Running tests under Python 3.9.12: /home/gadag/anaconda3/bin/python
[ RUN      ] TrainEvalTest.test_train_with_multi_worker_mirror
[  SKIPPED ] TrainEvalTest.test_train_with_multi_worker_mirror
----------------------------------------------------------------------
Ran 1 test in 0.002s

OK (skipped=1)

Could you confirm whether the original issue still persists? Thank you!

@gadagashwini gadagashwini added the stat:awaiting response Status - Awaiting response from author label Jul 28, 2022
chengmengli06 (Author)

It can be reproduced with the https://github.com/alibaba/EasyRec/tree/fix_mirrored_bug branch using TensorFlow 2.9.1. The test is temporarily skipped in the master branch.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Aug 4, 2022
gadagashwini (Contributor)

Hi @chengmengli06, I tried with TensorFlow 2.9.1, but I didn't see any error. Thank you!

@gadagashwini gadagashwini added the stat:awaiting response Status - Awaiting response from author label Aug 8, 2022
chengmengli06 (Author) commented Aug 10, 2022

Could you post your logs here? @gadagashwini

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Aug 10, 2022
chengmengli06 (Author)
[screenshot of the error output]
My output.

gadagashwini (Contributor)

Hi @chengmengli06, I tried with TF 2.9.1 and CUDA 11.4:

(base) gadag@ashwini-gpu:~/EasyRec$ TEST_DEVICES='' python -m easy_rec.python.test.train_eval_test TrainEvalTest.test_train_with_multi_worker_mirror
[2022-08-16 05:31:13,582][WARNING] pyhive is not installed.
[2022-08-16 05:31:13,683][INFO] GraphLearn is not installed. You can install it by "pip install https://easyrec.oss-cn-beijing.aliyuncs.com/3rdparty/graphlearn-0.7-cp27-cp27mu-linux_x86_64.whl"
[2022-08-16 05:31:14,443][WARNING] DataHub is not installed. You can install it by: pip install pydatahub
easy_rec version: 0.5.4
Usage: easy_rec.help()
/home/gadag/EasyRec/easy_rec/python/test/train_eval_test.py:639: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  LooseVersion(tf.__version__) != LooseVersion('2.3.0'),
2022-08-16 05:31:15.848221: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-16 05:31:15.849197: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-16 05:31:15.980279: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-16 05:31:15.981214: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-16 05:31:15.981963: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-16 05:31:15.982725: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Running tests under Python 3.9.12: /home/gadag/anaconda3/bin/python
[ RUN      ] TrainEvalTest.test_train_with_multi_worker_mirror
[  SKIPPED ] TrainEvalTest.test_train_with_multi_worker_mirror
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK (skipped=1)

@gadagashwini gadagashwini added the stat:awaiting response Status - Awaiting response from author label Aug 16, 2022
chengmengli06 (Author)

As you can see from the log, the test case is skipped. Could you check out the fix_mirrored_bug branch and run the test again?

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Aug 16, 2022
chengmengli06 (Author)

Any progress?

chengmengli06 (Author)

@gowthamkpr

@gowthamkpr gowthamkpr assigned crccw and unassigned gowthamkpr Aug 29, 2022
@gowthamkpr gowthamkpr added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Aug 29, 2022
chengmengli06 (Author)

@crccw any progress?

chengmengli06 (Author)

Could you explain the purpose of ordering_token? Does it control the order of communication? Is it related to NCCL? Maybe we could help with it.
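
For reference, a rough sketch of the call in question, assuming the TF 2.8-era collective_ops API; the group/instance keys are made up and the call only completes inside a live multi-worker collective group, so this is illustrative only, not the strategy's internal wiring.

# Sketch only: shows the v2 API surface, not how MultiWorkerMirroredStrategy
# wires it up internally. Requires `group_size` cooperating workers to run.
import tensorflow as tf
from tensorflow.python.ops import collective_ops, resource_variable_ops

group_size, group_key, instance_key = 2, 1, 100  # made-up keys

# As I understand it, the ordering token is a resource handle shared by the
# collectives of one launcher; ops touching the same resource get automatic
# control dependencies, which pins their launch order (relevant for NCCL).
ordering_var = resource_variable_ops.ResourceVariable(0.0)

@tf.function
def reduce_grad(grad):
  return collective_ops.all_reduce_v2(
      grad, group_size, group_key, instance_key,
      communication_hint='auto',
      ordering_token=ordering_var.handle)  # omit (None) for unordered behavior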
