
Distributed training fails when using CollectiveAllReduceStrategy #24887

Closed
threeleafzerg opened this issue Jan 14, 2019 · 25 comments
@threeleafzerg


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    Yes.
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary):
    Source
  • TensorFlow version (use command below):
    tensorflow (1.12.0rc0)
  • Python version:
    Python3.4
  • Bazel version (if compiling from source):
    bazel 0.16
  • GCC/Compiler version (if compiling from source):
    gcc4.8.5
  • CUDA/cuDNN version:
  • GPU model and memory:

You can collect some of this information using our environment capture script
You can also obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the current behavior
I am trying to employ CollectiveAllReduceStrategy with the TensorFlow official ResNet model, following the instructions from https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute#multi-worker-training.
Code: https://github.com/threeleafzerg/models
Steps:

  1. Prepare dataset: python3 cifar10_download_and_extract.py --data_dir=${PWD}/cifar_data
  2. Start Worker1: https://github.com/threeleafzerg/models/blob/master/official/resnet/worker1.sh
  3. Start Worker2: https://github.com/threeleafzerg/models/blob/master/official/resnet/worker2.sh
    I expected the distributed training to start successfully.

But unfortunately, I got Python exceptions:
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 403, in _init_from_arg s
initial_value() if init_from_fn else initial_value,
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/contrib/distribute/python/collective_all_reduce_strategy.py", lin e 180, in _overridden_initial_value_fn
group_size, group_key, collective_instance_key)
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/ops/collective_ops.py", line 94, in broadcast_send
'Parameter group_size to broadcast_send must be at least 2.')
ValueError: Parameter group_size to broadcast_send must be at least 2.

Describe the expected behavior
Distributed training can start successfully.

Code to reproduce the issue
I have uploaded my experiment code in my private branch: https://github.com/threeleafzerg/models

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

@byronyi
Contributor

byronyi commented Jan 14, 2019

Could you try with tf-nightly? I believe we fixed a couple of bugs related to collective ops after we cut r1.12.

@threeleafzerg
Author

@byronyi,
Thanks for your quick response.
Following your instructions, I installed tf-nightly. I found two issues with the tf-nightly build.

  1. There's a bug in this line: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/estimator_training.py#L302. It should be "if 'evaluator' in cluster_spec.jobs:".
  2. After fixing bug 1), it complains that only STANDALONE_CLIENT mode is supported when calling estimator.train:
    File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1180, in _train_model_distributed
    hooks)
    File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/distribute/estimator_training.py", line 310, in estimator_train
    raise ValueError('Only STANDALONE_CLIENT mode is supported when you call '
    ValueError: Only STANDALONE_CLIENT mode is supported when you call estimator.train

So I followed the post below to launch in standalone client mode, but I still hit an issue. Can you help check whether there's anything wrong with my procedure?
Post: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute#standalone-client-mode
Step:

  1. Launch the cluster. On the first machine:

     #!/bin/bash
     export TF_CONFIG='{
       "cluster": {
         "worker": ["192.168.20.50:1111", "192.168.20.52:1112"]
       },
       "task": {"type": "worker", "index": 0}
     }'
     python3 tf_std_server.py

     and on the second machine:

     #!/bin/bash
     export TF_CONFIG='{
       "cluster": {
         "worker": ["192.168.20.50:1111", "192.168.20.52:1112"]
       },
       "task": {"type": "worker", "index": 1}
     }'
     python3 tf_std_server.py

     tf_std_server.py is from the ecosystem repo: https://github.com/tensorflow/ecosystem/blob/master/distribution_strategy/tf_std_server.py (see the sketch after this list for roughly what such a server script does).

  2. Change the distribute config:

     distribution_strategy = collective_all_reduce_strategy.CollectiveAllReduceStrategy(
         num_gpus_per_worker=0)
     run_config = tf.estimator.RunConfig(
         experimental_distribute=tf.contrib.distribute.DistributeConfig(
             train_distribute=distribution_strategy,
             remote_cluster={"worker": ["192.168.20.50:1111", "192.168.20.52:1112"]}),
         session_config=session_config,
         save_checkpoints_secs=60*60*24,
     )

     And run the real ResNet model:
     python3 cifar10_main.py --data_dir=${PWD}/cifar_data
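For reference, a minimal sketch of what such a standard server script boils down to (my paraphrase using the stock TF 1.x tf.train.Server API; the actual ecosystem tf_std_server.py may differ in details):

import json
import os

import tensorflow as tf

# Read the cluster layout and this task's identity from TF_CONFIG (set above).
tf_config = json.loads(os.environ["TF_CONFIG"])
cluster = tf.train.ClusterSpec(tf_config["cluster"])
task = tf_config["task"]

# Start the in-process gRPC server for this worker and block forever.
server = tf.train.Server(
    cluster, job_name=task["type"], task_index=task["index"], protocol="grpc")
server.join()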

But I encounter this assertion failure:
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 704, in _batch_reduce_to
reduce_op, value_destination_pairs)
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 277, in batch_reduce
return self._batch_reduce(reduce_op, value_destination_pairs)
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 872, in _batch_reduce
[v[0] for v in value_destination_pairs])
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 912, in _batch_all_reduce
"Id")
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/distribute/cross_device_utils.py", line 353, in build_collective_reduce
raise ValueError('num_workers * len(input_tensors) must be 2 or greater')
ValueError: num_workers * len(input_tensors) must be 2 or greater

@byronyi
Contributor

byronyi commented Jan 15, 2019

@yuefengz Could you take a look here?

@jvishnuvardhan jvishnuvardhan self-assigned this Jan 15, 2019
@jvishnuvardhan jvishnuvardhan added comp:model Model related issues type:bug Bug labels Jan 15, 2019
@yuefengz yuefengz self-assigned this Jan 16, 2019
@yuefengz
Contributor

The error message points out that you need at least two replicas for CollectiveAllReduceStrategy to work: either at least 2 GPUs on a single machine, or multiple machines.
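For concreteness, a sketch of the two setups (illustration only, using the TF 1.12/1.13 contrib API; host names and ports are placeholders):

import tensorflow as tf

# CPU-only is fine, but then the cluster must contain at least two workers.
# Each machine exports a TF_CONFIG listing >= 2 workers before the strategy
# is built, e.g.
#   TF_CONFIG='{"cluster": {"worker": ["host1:port1", "host2:port2"]},
#               "task": {"type": "worker", "index": 0}}'
strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(
    num_gpus_per_worker=0)

# Alternatively, a single machine works if it has at least two GPUs:
#   strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(
#       num_gpus_per_worker=2)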

For distributed training, please use tf.estimator.train_and_evaluate API.

Closing this issue, feel free to re-open it.

@threeleafzerg
Author

@yuefengz
My scenario is CPU-only distributed training, so I pass num_gpus_per_worker=0 to CollectiveAllReduceStrategy. Do you mean CollectiveAllReduceStrategy currently doesn't support CPU-only distributed training?

@yuefengz
Contributor

If you use CPU only, you have to have at least two machines.

@byronyi
Contributor

byronyi commented Jan 16, 2019

Try not touching num_gpus_per_worker, i.e. leave it at its default, and try again.

I tried it myself a couple of months ago and it did work with CPU-only TF. Let me know if your case does not.

@threeleafzerg
Author

@yuefengz Yes, I did do the distributed training on two machines with CPU only (192.168.20.50, 192.168.20.52).
@byronyi I tried leaving num_gpus_per_worker at its default (=2), but the error is still the same.
Here is what I found after debugging the code:
In standalone client mode, CollectiveAllReduceExtended uses the default init function _initialize_local_worker, which sets num_workers to 1:
def _initialize_local_worker(self, num_gpus_per_worker):
  """Initializes the object for local training."""
  self._is_chief = True
  self._num_workers = 1

  if num_gpus_per_worker:
    local_devices = tuple(
        "/device:GPU:%d" % i for i in range(num_gpus_per_worker))
  else:
    local_devices = ("/device:CPU:0",)
Thus its CollectiveAllReduce.num_workers is set to 1 too, which is why I got the error "ValueError: num_workers * len(input_tensors) must be 2 or greater".
So does that mean standalone client mode is not suitable for multi-machine distributed training?
I will modify the code to use tf.estimator.train_and_evaluate so that I can use independent worker mode.
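Roughly what I plan to change it to, as a sketch only (TF 1.12/1.13 contrib API; the model_fn/input_fn here are trivial stand-ins for the real ResNet code, and TF_CONFIG must be exported on every worker as in the scripts above):

import tensorflow as tf

def input_fn():
  # Tiny stand-in dataset; the real code uses the CIFAR-10 input pipeline.
  ds = tf.data.Dataset.from_tensor_slices(
      ({"x": [[1.0], [2.0]]}, [[0.0], [1.0]]))
  return ds.repeat().batch(2)

def model_fn(features, labels, mode):
  # Minimal linear model standing in for ResNet.
  logits = tf.layers.dense(features["x"], 1)
  loss = tf.losses.mean_squared_error(labels, logits)
  train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
      loss, global_step=tf.train.get_global_step())
  return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

# Independent worker mode: no remote_cluster here; each worker just runs this
# same script with its own TF_CONFIG.
strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(
    num_gpus_per_worker=0)
config = tf.estimator.RunConfig(
    experimental_distribute=tf.contrib.distribute.DistributeConfig(
        train_distribute=strategy))
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=input_fn, max_steps=100),
    tf.estimator.EvalSpec(input_fn=input_fn, steps=10))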

Background: I am a developer from the Intel TensorFlow team focusing on multi-node training. Currently we use Horovod as our allreduce solution, but it's said that distribution strategies will be the trend in TensorFlow for allreduce. So I am evaluating the impact of distribution strategies. Any comments and help are highly appreciated!

@byronyi
Contributor

byronyi commented Jan 17, 2019

Sorry, could you give a minimal reproducing example? I have tried it myself with the latest nightly and failed to find any problems.

https://gist.github.com/ca1b55e5a5423d5b3abb9efc6fd34b80

@byronyi
Contributor

byronyi commented Jan 17, 2019

By the way, I could not find a way to get rid of the following warning:

WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.

@yuefengz Any idea what is wrong?

@yuefengz
Contributor

@byronyi It might be something wrong with the model. I didn't see it with ResNet50. I'll re-run the examples to reproduce this warning later.

@threeleafzerg
Author

@byronyi Thank you for your enlightening script. I tried it both on our company's cluster and on my private machine. On our cluster the worker script hung, and on my private machine (a cleaner environment) the worker script got an error. I have pasted the log in https://gist.github.com/ca1b55e5a5423d5b3abb9efc6fd34b80. Can you help check?

@threeleafzerg
Author

@yuefengz Can you provide the full script of the ResNet50 example? Thanks!

@byronyi
Contributor

byronyi commented Jan 17, 2019

I have no idea why you hit the 'https' scheme not supported in proxy URI problem. I never saw it before.

Could you replace localhost with 127.0.0.1? Also, running in a TF Docker container might help.

@threeleafzerg
Author

@byronyi I set it to 127.0.0.1, but it's the same.
I found that it was caused by an invalid https_proxy setting. After I set https_proxy to the correct value, my private desktop hung too, just like our cluster. Anyway, I think it's related to gRPC's https_proxy handling. Do you know how to disable https_proxy at run time for gRPC? Also, I am confused why the TF server would need https_proxy even within a single machine (localhost).
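One thing I plan to try is scrubbing the proxy variables for the TF processes themselves (a sketch; I'm assuming gRPC honors the usual http_proxy/https_proxy/grpc_proxy variables and their no_proxy/no_grpc_proxy exclusions):

import os

# Drop any HTTP(S) proxy settings for this process and exempt the worker
# addresses explicitly, before TensorFlow/gRPC creates any channels.
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY", "grpc_proxy"):
  os.environ.pop(var, None)
os.environ["no_proxy"] = "localhost,127.0.0.1,192.168.20.50,192.168.20.52"
os.environ["no_grpc_proxy"] = os.environ["no_proxy"]

import tensorflow as tf  # imported only after the environment is scrubbed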

@threeleafzerg
Author

@byronyi I also asked one of my colleagues to run the same set of scripts, and they got the same result.

The Python call stack at the hang is as follows:
Call initializer instance with the dtype argument instead of passing it to the constructor
^CTraceback (most recent call last):
File "worker.py", line 48, in
eval_spec=eval_spec)
File "/home/sunbear/miniconda2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 462, in train_and_evaluate
estimator, train_spec, eval_spec, _TrainingExecutor)
File "/home/sunbear/miniconda2/lib/python2.7/site-packages/tensorflow/python/distribute/estimator_training.py", line 289, in train_and_evaluate
session_config=run_config.session_config)
File "/home/sunbear/miniconda2/lib/python2.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 778, in run_distribute_coordinator
session_config, rpc_layer)
File "/home/sunbear/miniconda2/lib/python2.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 466, in _run_between_graph_client
coord.join(threads_to_join)
File "/home/sunbear/miniconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 363, in join
while any(t.is_alive() for t in threads) and not self.wait_for_stop(1.0):
File "/home/sunbear/miniconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 311, in wait_for_stop
return self._stop_event.wait(timeout)
File "/home/sunbear/miniconda2/lib/python2.7/threading.py", line 614, in wait
self.__cond.wait(timeout)
File "/home/sunbear/miniconda2/lib/python2.7/threading.py", line 359, in wait
_sleep(delay)
KeyboardInterrupt

@threeleafzerg
Author

@byronyi I can run the distributed training (CollectiveAllReduceStrategy) successfully on my local machine, but it still fails in our company cluster (same set of software). Our cluster is managed by Slurm. Does that conflict with gRPC? Are there any precautions for using gRPC from a network perspective? (See the connectivity-check sketch after the logs below.)

Failed log for independent worker mode:
INFO:tensorflow:Graph was finalized.
2019-01-24 15:55:33.368302: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session 725fac0e10d1fc7b with config: device_filters: "/job:worker/task:1" device_filters: "/job:worker/task:1" allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE scoped_allocator_optimization: ON scoped_allocator_opts { enable_op: "CollectiveReduce" enable_op: "CollectiveReduce" enable_op: "CollectiveReduce" } } } experimental { collective_group_leader: "/job:worker/replica:0/task:0" }
2019-01-24 15:55:33.519166: E tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.cc:90] Bad response [Unavailable: Socket closed
Additional GRPC error information:
{"created":"@1548316533.519114543","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1036,"grpc_message":"Socket closed","grpc_status":14}] from GetStepSequenceAsync call to /job:worker/replica:0/task:0
2019-01-24 15:55:33.519293: E tensorflow/core/distributed_runtime/master_session.cc:1493] Bad status from collective_executor_mgr->RefreshStepIdSequence: Unavailable: Socket closed

Failed Log from standalone client mode:
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:An error was raised while a session was being created. This may be due to a preemption of a connected worker or parameter server. A new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed
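As a sanity check I'm also going to probe the raw gRPC endpoints from each machine, outside of TensorFlow (a sketch, assuming the grpcio Python package; the addresses are the worker endpoints from earlier):

import grpc

# A timeout here points at a network/proxy problem rather than at TensorFlow.
for target in ("192.168.20.50:1111", "192.168.20.52:1112"):
  channel = grpc.insecure_channel(target)
  try:
    grpc.channel_ready_future(channel).result(timeout=5)
    print(target, "reachable")
  except grpc.FutureTimeoutError:
    print(target, "NOT reachable")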

@threeleafzerg
Author

@byronyi I root-caused this issue: it was caused by the http_proxy settings. Thank you all for your support.

@byronyi
Contributor

byronyi commented Jan 24, 2019

@threeleafzerg No worries, and thanks for reporting your issue.

@threeleafzerg
Author

@byronyi BTW, do you have any public design doc or roadmap for distribution strategy that can be shared with us? Thanks!

@byronyi
Contributor

byronyi commented Feb 2, 2019

@threeleafzerg
Author

@byronyi Thanks!

@threeleafzerg
Author

@byronyi Sorry for bothering you again. Do you know of any open material about HorovodDistributionStrategy?

@byronyi
Contributor

byronyi commented Feb 22, 2019

Sorry I have no knowledge of that.

@vaibhavkumar049

How do I work around this error? I am using the TPU distribution strategy.

Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:worker/replica:0/task:0/device:CPU:0 in order to run ExperimentalAutoShardDataset: Unable to parse tensor proto
Additional GRPC error information:
{"created":"@1564365854.431149350","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to parse tensor proto","grpc_status":3} [Op:ExperimentalAutoShardDataset]
