
collective_ops.all_reduce_v2 with ordering_token does not work correctly #56885

Open
chengmengli06 opened this issue Jul 25, 2022 · 12 comments
Assignees: crccw
Labels: comp:dist-strat (Distribution Strategy related issues), comp:ops (OPs related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.8, type:bug (Bug)

Comments

chengmengli06 commented Jul 25, 2022

Issue Type

Bug

Source

binary

Tensorflow Version

tf 2.5 or tf 2.8

Custom Code

No

OS Platform and Distribution

CentOS 7.2

Mobile device

No response

Python version

3.7

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current Behaviour?

I use MultiWorkerMirroredStrategy to train a deep recommendation model in EasyRec. However, training fails with the following error:

2022-07-25 11:57:07.263300: E tensorflow/core/common_runtime/base_collective_executor.cc:243] BaseCollectiveExecutor::StartAbort Invalid argument: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.

(The full log is in the "Relevant log output" section below.)

However, if I replace the call to collective_ops.all_gather_v2 with collective_ops.all_gather in tensorflow/python/distribute/cross_device_utils.py:381, everything runs fine. What does ordering_token mean?
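
For reference, a hedged sketch of the workaround described above. It assumes the _all_gather helper of CollectiveReplicaLauncher in tensorflow/python/distribute/cross_device_utils.py and the TF 2.5/2.8-era collective_ops API; the wrapper body and helper names below are reconstructed from the traceback, not copied from the actual TensorFlow or EasyRec code.

# Illustrative only: swap the v2 gather (which takes ordering_token) for the
# v1 gather inside the collective launcher's helper. Attribute and helper
# names are assumptions based on the traceback above.
from tensorflow.python.ops import collective_ops

def _all_gather(self, input_tensor, communication_hint='AUTO', timeout=0):
  instance_key = self._next_instance_key()  # assumed helper on the launcher
  # Original (v2) call, roughly as shown in the traceback:
  # return collective_ops.all_gather_v2(
  #     input_tensor, self._group_size, self._group_key, instance_key,
  #     communication_hint=communication_hint, timeout=timeout,
  #     ordering_token=self._get_ordering_token(communication_hint))
  # Workaround: fall back to the v1 op, which does not take ordering_token.
  return collective_ops.all_gather(
      input_tensor, self._group_size, self._group_key, instance_key,
      communication_hint=communication_hint, timeout=timeout)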

Standalone code to reproduce the issue

git clone https://github.com/alibaba/EasyRec.git
cd EasyRec
bash scripts/init.sh
TEST_DEVICES='' python -m easy_rec.python.test.train_eval_test TrainEvalTest.test_train_with_multi_worker_mirror

Relevant log output

I0725 11:56:59.385384 140272595433216 basic_session_run_hooks.py:262] loss = 1.0119569, step = 0
INFO:tensorflow:lr = 0.001,step = 0,cross_entropy_loss = 1.0084826,regularization_loss = 0.003870264,total_loss = 1.0123528
I0725 11:56:59.386929 140272595433216 basic_session_run_hooks.py:254] lr = 0.001,step = 0,cross_entropy_loss = 1.0084826,regularization_loss = 0.003870264,total_loss = 1.0123528
2022-07-25 11:57:07.263300: E tensorflow/core/common_runtime/base_collective_executor.cc:243] BaseCollectiveExecutor::StartAbort Invalid argument: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
2022-07-25 11:57:07.264955: E tensorflow/core/common_runtime/ring_alg.cc:276] Aborting RingGather with Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
Additional GRPC error information from remote target /job:worker/replica:0/task:1:
:{"created":"@1658721427.264762538","description":"Error received from peer ipv4:127.0.0.1:10838","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"[_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":3}
2022-07-25 11:57:07.265079: E tensorflow/core/common_runtime/ring_alg.cc:276] Aborting RingGather with Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
Additional GRPC error information from remote target /job:worker/replica:0/task:1:
:{"created":"@1658721427.264794128","description":"Error received from peer ipv4:127.0.0.1:10838","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"[_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":3}
2022-07-25 11:57:07.265118: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at collective_ops.cc:713 : Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
2022-07-25 11:57:07.265255: E tensorflow/core/common_runtime/ring_alg.cc:276] Aborting RingGather with Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
Additional GRPC error information from remote target /job:worker/replica:0/task:1:
:{"created":"@1658721427.264813400","description":"Error received from peer ipv4:127.0.0.1:10838","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"[_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":3}
2022-07-25 11:57:07.265326: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at collective_ops.cc:713 : Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
2022-07-25 11:57:07.268713: E tensorflow/core/common_runtime/ring_alg.cc:276] Aborting RingGather with Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
Additional GRPC error information from remote target /job:worker/replica:0/task:1:
:{"created":"@1658721427.264830344","description":"Error received from peer ipv4:127.0.0.1:10838","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"[_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:1/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.\nThe error could be from a previous operation. Restart your program to reset.","grpc_status":3}
2022-07-25 11:57:07.268788: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at collective_ops.cc:713 : Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
2022-07-25 11:57:07.269536: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at collective_ops.cc:713 : Invalid argument: [_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
Traceback (most recent call last):
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:worker/replica:0/task:0:
[_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
	 [[{{node CollectiveGatherV2_16}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/apsarapangu/disk3/mengli.cml/easy_rec_outer/EasyRec/easy_rec/python/train_eval.py", line 145, in <module>
    tf.app.run()
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/apsarapangu/disk3/mengli.cml/easy_rec_outer/EasyRec/easy_rec/python/train_eval.py", line 139, in main
    FLAGS.check_mode)
  File "/apsarapangu/disk3/mengli.cml/easy_rec_outer/EasyRec/easy_rec/python/main.py", line 330, in _train_and_evaluate_impl
    estimator_train.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/apsarapangu/disk3/mengli.cml/easy_rec_outer/EasyRec/easy_rec/python/compat/estimator_train.py", line 75, in train_and_evaluate
    _TrainingExecutor)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/estimator_training.py", line 290, in train_and_evaluate
    session_config=run_config.session_config)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 861, in run_distribute_coordinator
    task_id, session_config, rpc_layer)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 360, in _run_single_worker
    return worker_fn(strategy)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/estimator_training.py", line 252, in _worker_fn
    hooks=hooks)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1173, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1235, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1349, in _actual_train_model_distributed
    saving_listeners)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1427, in _train_with_estimator_spec
    estimator_spec, worker_hooks, saving_listeners)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1374, in _train_with_estimator_spec_distributed
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 779, in run
    run_metadata=run_metadata)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1284, in run
    run_metadata=run_metadata)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1385, in run
    raise six.reraise(*original_exc_info)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1370, in run
    return self._sess.run(*args, **kwargs)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1443, in run
    run_metadata=run_metadata)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1201, in run
    return self._sess.run(*args, **kwargs)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 968, in run
    run_metadata_ptr)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1369, in _do_run
    run_metadata)
  File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:worker/replica:0/task:0:
[_Derived_]Collective ops is aborted by: Shape mismatch in the collective instance 244. Op at device /job:worker/replica:0/task:0/device:CPU:0 expected shape [98] but another member in the group expected shape [102]. This is likely due to different input shapes at different members of the collective op.
The error could be from a previous operation. Restart your program to reset.
	 [[node CollectiveGatherV2_16 (defined at /anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py:1319) ]]

Original stack trace for 'CollectiveGatherV2_16':
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/easy_rec_outer/EasyRec/easy_rec/python/train_eval.py", line 145, in <module>
    tf.app.run()
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/easy_rec_outer/EasyRec/easy_rec/python/train_eval.py", line 139, in main
    FLAGS.check_mode)
  File "/easy_rec_outer/EasyRec/easy_rec/python/main.py", line 330, in _train_and_evaluate_impl
    estimator_train.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/easy_rec_outer/EasyRec/easy_rec/python/compat/estimator_train.py", line 75, in train_and_evaluate
    _TrainingExecutor)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/estimator_training.py", line 290, in train_and_evaluate
    session_config=run_config.session_config)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 861, in run_distribute_coordinator
    task_id, session_config, rpc_layer)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 360, in _run_single_worker
    return worker_fn(strategy)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/estimator_training.py", line 252, in _worker_fn
    hooks=hooks)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1173, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1235, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1319, in _actual_train_model_distributed
    self.config))
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2833, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 679, in _call_for_each_replica
    self._container_strategy(), fn, args, kwargs)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_run.py", line 104, in call_for_each_replica
    return _call_for_each_replica(strategy, fn, args, kwargs)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_run.py", line 239, in _call_for_each_replica
    **merge_kwargs)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 597, in wrapper
    return func(*args, **kwargs)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/training/optimizer.py", line 676, in _distributed_apply
    ds_reduce_util.ReduceOp.SUM, grads_and_vars)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2402, in batch_reduce_to
    return self._batch_reduce_to(reduce_op, value_destination_pairs, options)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 770, in _batch_reduce_to
    options=self._communication_options.merge(options))
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 447, in batch_reduce
    options)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 1270, in batch_reduce_implementation
    for value, dest in value_destination_pairs
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 1270, in <listcomp>
    for value, dest in value_destination_pairs
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 1225, in reduce_implementation
    options)[0]
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 1212, in _all_reduce_per_replica_values
    self._all_reduce(reduce_op, values_by_device[i], i, options))
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 1175, in _all_reduce
    options.timeout_seconds))
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_utils.py", line 566, in all_reduce_indexed_slices
    length, communication_hint, timeout=timeout)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/distribute/cross_device_utils.py", line 388, in _all_gather
    ordering_token=ordering_token)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/ops/collective_ops.py", line 200, in all_gather_v2
    ordering_token=ordering_token or [])
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/ops/gen_collective_ops.py", line 545, in collective_gather_v2
    timeout_seconds=timeout_seconds, name=name)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3565, in _create_op_internal
    op_def=op_def)
  File "/anaconda3/envs/tf_py3_20/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)
chengmengli06 (Author)
Could @crccw give some explanation of ordering_token?

@tilakrayal tilakrayal added TF 2.8 comp:ops OPs related issues comp:dist-strat Distribution Strategy related issues labels Jul 25, 2022
gadagashwini (Contributor) commented Jul 28, 2022

Hi @chengmengli06, I tried to replicate the issue with TF v2.9, but I don't see any error:

(base) gadag@ashwini-gpu:~/EasyRec$ TEST_DEVICES='' python -m easy_rec.python.test.train_eval_test TrainEvalTest.test_train_with_multi_worker_mirror
[2022-07-28 06:17:19,486][WARNING] pyhive is not installed.
[2022-07-28 06:17:19,592][INFO] GraphLearn is not installed. You can install it by "pip install https://easyrec.oss-cn-beijing.aliyuncs.com/3rdparty/graphlearn-0.7-cp27-cp27mu-linux_x86_64.whl"
[2022-07-28 06:17:20,430][WARNING] DataHub is not installed. You can install it by: pip install pydatahub
easy_rec version: 0.5.4
Usage: easy_rec.help()
/home/gadag/EasyRec/easy_rec/python/test/train_eval_test.py:639: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  LooseVersion(tf.__version__) != LooseVersion('2.3.0'),
2022-07-28 06:17:20.563878: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-28 06:17:20.564777: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-28 06:17:20.587171: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-28 06:17:20.587993: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-28 06:17:20.588737: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-28 06:17:20.589462: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Running tests under Python 3.9.12: /home/gadag/anaconda3/bin/python
[ RUN      ] TrainEvalTest.test_train_with_multi_worker_mirror
[  SKIPPED ] TrainEvalTest.test_train_with_multi_worker_mirror
----------------------------------------------------------------------
Ran 1 test in 0.002s

OK (skipped=1)

Could you confirm whether the original issue still persists? Thank you!

@gadagashwini gadagashwini added the stat:awaiting response Status - Awaiting response from author label Jul 28, 2022
chengmengli06 (Author)

It can be reproduced with the https://github.com/alibaba/EasyRec/tree/fix_mirrored_bug branch using TensorFlow 2.9.1. The test is temporarily skipped in the master branch.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Aug 4, 2022
gadagashwini (Contributor)

Hi @chengmengli06, I tried with TensorFlow 2.9.1, but I didn't see any error. Thank you!

@gadagashwini gadagashwini added the stat:awaiting response Status - Awaiting response from author label Aug 8, 2022
chengmengli06 (Author) commented Aug 10, 2022

Could you post your logs here? @gadagashwini

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Aug 10, 2022
chengmengli06 (Author)
[screenshot of the error output]
My output.

gadagashwini (Contributor)

Hi @chengmengli06, I tried with TF 2.9.1 and CUDA 11.4:

(base) gadag@ashwini-gpu:~/EasyRec$ TEST_DEVICES='' python -m easy_rec.python.test.train_eval_test TrainEvalTest.test_train_with_multi_worker_mirror
[2022-08-16 05:31:13,582][WARNING] pyhive is not installed.
[2022-08-16 05:31:13,683][INFO] GraphLearn is not installed. You can install it by "pip install https://easyrec.oss-cn-beijing.aliyuncs.com/3rdparty/graphlearn-0.7-cp27-cp27mu-linux_x86_64.whl"
[2022-08-16 05:31:14,443][WARNING] DataHub is not installed. You can install it by: pip install pydatahub
easy_rec version: 0.5.4
Usage: easy_rec.help()
/home/gadag/EasyRec/easy_rec/python/test/train_eval_test.py:639: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  LooseVersion(tf.__version__) != LooseVersion('2.3.0'),
2022-08-16 05:31:15.848221: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-16 05:31:15.849197: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-16 05:31:15.980279: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-16 05:31:15.981214: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-16 05:31:15.981963: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-16 05:31:15.982725: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Running tests under Python 3.9.12: /home/gadag/anaconda3/bin/python
[ RUN      ] TrainEvalTest.test_train_with_multi_worker_mirror
[  SKIPPED ] TrainEvalTest.test_train_with_multi_worker_mirror
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK (skipped=1)

@gadagashwini gadagashwini added the stat:awaiting response Status - Awaiting response from author label Aug 16, 2022
chengmengli06 (Author)

As you can see from the log, the test case is skipped. Could you check out the fix_mirrored_bug branch and run the test again?

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Aug 16, 2022
chengmengli06 (Author)

Any progress?

chengmengli06 (Author)

@gowthamkpr

@gowthamkpr gowthamkpr assigned crccw and unassigned gowthamkpr Aug 29, 2022
@gowthamkpr gowthamkpr added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Aug 29, 2022
chengmengli06 (Author)

@crccw any progress?

chengmengli06 (Author)

Could you explain the purpose of ordering_token? Does it control the order of communication? Is it related to NCCL? Maybe we could help with it.
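
For reference, a rough sketch of the call in question, assuming the TF 2.8-era collective_ops API; the group/instance keys are made up and the call only completes inside a live multi-worker collective group, so this is illustrative only, not the strategy's internal wiring.

# Sketch only: shows the v2 API surface, not how MultiWorkerMirroredStrategy
# wires it up internally. Requires `group_size` cooperating workers to run.
import tensorflow as tf
from tensorflow.python.ops import collective_ops, resource_variable_ops

group_size, group_key, instance_key = 2, 1, 100  # made-up keys

# As I understand it, the ordering token is a resource handle shared by the
# collectives of one launcher; ops touching the same resource get automatic
# control dependencies, which pins their launch order (relevant for NCCL).
ordering_var = resource_variable_ops.ResourceVariable(0.0)

@tf.function
def reduce_grad(grad):
  return collective_ops.all_reduce_v2(
      grad, group_size, group_key, instance_key,
      communication_hint='auto',
      ordering_token=ordering_var.handle)  # omit (None) for unordered behavior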
