
Distributed training fails when using CollectiveAllReduceStrategy #24887

Closed
threeleafzerg opened this issue Jan 14, 2019 · 25 comments
@threeleafzerg


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    Yes.
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary):
    Source
  • TensorFlow version (use command below):
    tensorflow (1.12.0rc0)
  • Python version:
    Python3.4
  • Bazel version (if compiling from source):
    bazel 0.16
  • GCC/Compiler version (if compiling from source):
    gcc4.8.5
  • CUDA/cuDNN version:
  • GPU model and memory:

You can collect some of this information using our environment capture script
You can also obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the current behavior
I am trying to employ CollectiveAllReduceStrategy with the TensorFlow official ResNet model, following the instructions from https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute#multi-worker-training.
Code: https://github.com/threeleafzerg/models
Steps:

  1. Prepare dataset: python3 cifar10_download_and_extract.py --data_dir=${PWD}/cifar_data
  2. Start Worker1: https://github.com/threeleafzerg/models/blob/master/official/resnet/worker1.sh
  3. Start Worker2: https://github.com/threeleafzerg/models/blob/master/official/resnet/worker2.sh
    I expected the distributed training to start successfully.

But unfortunately, I got Python exceptions:
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 403, in _init_from_arg s
initial_value() if init_from_fn else initial_value,
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/contrib/distribute/python/collective_all_reduce_strategy.py", lin e 180, in _overridden_initial_value_fn
group_size, group_key, collective_instance_key)
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/ops/collective_ops.py", line 94, in broadcast_send
'Parameter group_size to broadcast_send must be at least 2.')
ValueError: Parameter group_size to broadcast_send must be at least 2.

Describe the expected behavior
Distributed training can start successfully.

Code to reproduce the issue
I have uploaded my experiment code in my private branch: https://github.com/threeleafzerg/models

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

@byronyi
Contributor

byronyi commented Jan 14, 2019

Could you try with tf-nightly? I believe we fixed a couple of bugs related to collective ops after we cut r1.12.

@threeleafzerg
Author

@byronyi,
Thanks for your quick response.
Following your instructions, I installed tf-nightly. I found two issues with the tf-nightly build.

  1. There's a bug in this line: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/estimator_training.py#L302. It should be "if 'evaluator' in cluster_spec.jobs:".
  2. After fixing bug 1), it complains that only STANDALONE_CLIENT mode is supported when calling estimator.train:
    File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1180, in _train_model_distributed
    hooks)
    File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/distribute/estimator_training.py", line 310, in estimator_train
    raise ValueError('Only STANDALONE_CLIENT mode is supported when you call '
    ValueError: Only STANDALONE_CLIENT mode is supported when you call estimator.train

So I followed the post below to launch in standalone client mode, but I still hit an issue. Can you help check whether there's anything wrong with my procedure?
Post: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute#standalone-client-mode
Step:

  1. Launch the cluster. On the first machine:

     #!/bin/bash
     export TF_CONFIG='{
       "cluster": {
         "worker": ["192.168.20.50:1111", "192.168.20.52:1112"]
       },
       "task": {"type": "worker", "index": 0}
     }'
     python3 tf_std_server.py

     and on the second machine:

     #!/bin/bash
     export TF_CONFIG='{
       "cluster": {
         "worker": ["192.168.20.50:1111", "192.168.20.52:1112"]
       },
       "task": {"type": "worker", "index": 1}
     }'
     python3 tf_std_server.py

     tf_std_server.py is from the ecosystem repo: https://github.com/tensorflow/ecosystem/blob/master/distribution_strategy/tf_std_server.py (see the sketch after this list for roughly what such a server script does).

  2. Change the distribute config:

     distribution_strategy = collective_all_reduce_strategy.CollectiveAllReduceStrategy(
         num_gpus_per_worker=0)
     run_config = tf.estimator.RunConfig(
         experimental_distribute=tf.contrib.distribute.DistributeConfig(
             train_distribute=distribution_strategy,
             remote_cluster={"worker": ["192.168.20.50:1111", "192.168.20.52:1112"]}),
         session_config=session_config,
         save_checkpoints_secs=60*60*24,
     )

     And run the real ResNet model:
     python3 cifar10_main.py --data_dir=${PWD}/cifar_data
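For reference, a minimal sketch of what such a standard server script boils down to (my paraphrase using the stock TF 1.x tf.train.Server API; the actual ecosystem tf_std_server.py may differ in details):

import json
import os

import tensorflow as tf

# Read the cluster layout and this task's identity from TF_CONFIG (set above).
tf_config = json.loads(os.environ["TF_CONFIG"])
cluster = tf.train.ClusterSpec(tf_config["cluster"])
task = tf_config["task"]

# Start the in-process gRPC server for this worker and block forever.
server = tf.train.Server(
    cluster, job_name=task["type"], task_index=task["index"], protocol="grpc")
server.join()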

But I encounter this assertion failure:
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 704, in _batch_reduce_to
reduce_op, value_destination_pairs)
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 277, in batch_reduce
return self._batch_reduce(reduce_op, value_destination_pairs)
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 872, in _batch_reduce
[v[0] for v in value_destination_pairs])
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/distribute/cross_device_ops.py", line 912, in _batch_all_reduce
"Id")
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/distribute/cross_device_utils.py", line 353, in build_collective_reduce
raise ValueError('num_workers * len(input_tensors) must be 2 or greater')
ValueError: num_workers * len(input_tensors) must be 2 or greater

@byronyi
Contributor

byronyi commented Jan 15, 2019

@yuefengz Could you take a look here?

@jvishnuvardhan jvishnuvardhan self-assigned this Jan 15, 2019
@jvishnuvardhan jvishnuvardhan added comp:model Model related issues type:bug Bug labels Jan 15, 2019
@yuefengz yuefengz self-assigned this Jan 16, 2019
@yuefengz
Contributor

The error message points out that you need at least two replicas for CollectiveAllReduceStrategy to work: either at least 2 GPUs on a single machine, or multiple machines.
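For concreteness, a sketch of the two setups (illustration only, using the TF 1.12/1.13 contrib API; host names and ports are placeholders):

import tensorflow as tf

# CPU-only is fine, but then the cluster must contain at least two workers.
# Each machine exports a TF_CONFIG listing >= 2 workers before the strategy
# is built, e.g.
#   TF_CONFIG='{"cluster": {"worker": ["host1:port1", "host2:port2"]},
#               "task": {"type": "worker", "index": 0}}'
strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(
    num_gpus_per_worker=0)

# Alternatively, a single machine works if it has at least two GPUs:
#   strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(
#       num_gpus_per_worker=2)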

For distributed training, please use tf.estimator.train_and_evaluate API.

Closing this issue, feel free to re-open it.

@threeleafzerg
Author

@yuefengz
My scenario is CPU-only distributed training, so I pass num_gpus_per_worker=0 to CollectiveAllReduceStrategy. Do you mean CollectiveAllReduceStrategy currently doesn't support CPU-only distributed training?

@yuefengz
Contributor

If you use CPU only, you have to have at least two machines.

@byronyi
Contributor

byronyi commented Jan 16, 2019

Try not touching num_gpus_per_worker, i.e. leave it at its default, and try again.

I tried it myself a couple of months ago and it did work with CPU-only TF. Let me know if your case does not.

@threeleafzerg
Author

@yuefengz Yes, I did do the distributed training on two machines with CPU only (192.168.20.50, 192.168.20.52).
@byronyi I tried leaving num_gpus_per_worker at its default (=2), but the error is still the same.
Here is what I found after debugging the code:
In standalone client mode, CollectiveAllReduceExtended uses the default init function _initialize_local_worker, which sets num_workers to 1:
def _initialize_local_worker(self, num_gpus_per_worker):
  """Initializes the object for local training."""
  self._is_chief = True
  self._num_workers = 1

  if num_gpus_per_worker:
    local_devices = tuple(
        "/device:GPU:%d" % i for i in range(num_gpus_per_worker))
  else:
    local_devices = ("/device:CPU:0",)
Thus its CollectiveAllReduce.num_workers is set to 1 too, which is why I got the error "ValueError: num_workers * len(input_tensors) must be 2 or greater".
So does that mean standalone client mode is not suitable for multi-machine distributed training?
I will modify the code to use tf.estimator.train_and_evaluate so that I can use independent worker mode.
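Roughly what I plan to change it to, as a sketch only (TF 1.12/1.13 contrib API; the model_fn/input_fn here are trivial stand-ins for the real ResNet code, and TF_CONFIG must be exported on every worker as in the scripts above):

import tensorflow as tf

def input_fn():
  # Tiny stand-in dataset; the real code uses the CIFAR-10 input pipeline.
  ds = tf.data.Dataset.from_tensor_slices(
      ({"x": [[1.0], [2.0]]}, [[0.0], [1.0]]))
  return ds.repeat().batch(2)

def model_fn(features, labels, mode):
  # Minimal linear model standing in for ResNet.
  logits = tf.layers.dense(features["x"], 1)
  loss = tf.losses.mean_squared_error(labels, logits)
  train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
      loss, global_step=tf.train.get_global_step())
  return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

# Independent worker mode: no remote_cluster here; each worker just runs this
# same script with its own TF_CONFIG.
strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(
    num_gpus_per_worker=0)
config = tf.estimator.RunConfig(
    experimental_distribute=tf.contrib.distribute.DistributeConfig(
        train_distribute=strategy))
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=input_fn, max_steps=100),
    tf.estimator.EvalSpec(input_fn=input_fn, steps=10))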

Background: I am a developer from the Intel TensorFlow team focusing on multi-node training. Currently we use Horovod as our allreduce solution, but it's said that distribution strategies will be the trend in TensorFlow for allreduce. So I am evaluating the impact of distribution strategies. Any comments and help are highly appreciated!

@byronyi
Contributor

byronyi commented Jan 17, 2019

Sorry, could you give a minimal reproducing example? I have tried it myself with the latest nightly and failed to find any problems.

https://gist.github.com/ca1b55e5a5423d5b3abb9efc6fd34b80

@byronyi
Contributor

byronyi commented Jan 17, 2019

By the way, I could not find a way to get rid of the following warning:

WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.

@yuefengz Any idea what is wrong?

@yuefengz
Contributor

@byronyi It might be something wrong with the model. I didn't see it with ResNet50. I'll re-run the examples to reproduce this warning later.

@threeleafzerg
Author

@byronyi Thank you for your enlightening script. I tried it both on our company's cluster and on my private machine. On our cluster the worker script hung, and on my private machine (a cleaner environment) the worker script got an error. I have pasted the log in https://gist.github.com/ca1b55e5a5423d5b3abb9efc6fd34b80. Can you help check?

@threeleafzerg
Author

@yuefengz Can you provide the full script of the ResNet50 example? Thanks!

@byronyi
Contributor

byronyi commented Jan 17, 2019

I have no idea why you hit the 'https' scheme not supported in proxy URI problem. I never saw it before.

Could you replace localhost with 127.0.0.1? Also, running in a TF Docker container might help.

@threeleafzerg
Author

@byronyi I set it to 127.0.0.1, but it's the same.
I found that it was caused by an invalid https_proxy setting. After I set https_proxy to the correct value, my private desktop hung too, just like our cluster. Anyway, I think it's related to gRPC's https_proxy handling. Do you know how to disable https_proxy at run time for gRPC? Also, I am confused why the TF server would need https_proxy even within a single machine (localhost).
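One thing I plan to try is scrubbing the proxy variables for the TF processes themselves (a sketch; I'm assuming gRPC honors the usual http_proxy/https_proxy/grpc_proxy variables and their no_proxy/no_grpc_proxy exclusions):

import os

# Drop any HTTP(S) proxy settings for this process and exempt the worker
# addresses explicitly, before TensorFlow/gRPC creates any channels.
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY", "grpc_proxy"):
  os.environ.pop(var, None)
os.environ["no_proxy"] = "localhost,127.0.0.1,192.168.20.50,192.168.20.52"
os.environ["no_grpc_proxy"] = os.environ["no_proxy"]

import tensorflow as tf  # imported only after the environment is scrubbed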

@threeleafzerg
Author

@byronyi I also asked one of my colleagues to run the same set of scripts, and they got the same result.

The Python call stack at the hang is as follows:
Call initializer instance with the dtype argument instead of passing it to the constructor
^CTraceback (most recent call last):
File "worker.py", line 48, in
eval_spec=eval_spec)
File "/home/sunbear/miniconda2/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 462, in train_and_evaluate
estimator, train_spec, eval_spec, _TrainingExecutor)
File "/home/sunbear/miniconda2/lib/python2.7/site-packages/tensorflow/python/distribute/estimator_training.py", line 289, in train_and_evaluate
session_config=run_config.session_config)
File "/home/sunbear/miniconda2/lib/python2.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 778, in run_distribute_coordinator
session_config, rpc_layer)
File "/home/sunbear/miniconda2/lib/python2.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 466, in _run_between_graph_client
coord.join(threads_to_join)
File "/home/sunbear/miniconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 363, in join
while any(t.is_alive() for t in threads) and not self.wait_for_stop(1.0):
File "/home/sunbear/miniconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 311, in wait_for_stop
return self._stop_event.wait(timeout)
File "/home/sunbear/miniconda2/lib/python2.7/threading.py", line 614, in wait
self.__cond.wait(timeout)
File "/home/sunbear/miniconda2/lib/python2.7/threading.py", line 359, in wait
_sleep(delay)
KeyboardInterrupt

@threeleafzerg
Author

@byronyi I can run the distributed training (CollectiveAllReduceStrategy) successfully on my local machine, but it still fails in our company cluster (same set of software). Our cluster is managed by Slurm. Does that conflict with gRPC? Are there any precautions for using gRPC from a network perspective? (See the connectivity-check sketch after the logs below.)

Failed log for independent worker mode:
INFO:tensorflow:Graph was finalized.
2019-01-24 15:55:33.368302: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session 725fac0e10d1fc7b with config: device_filters: "/job:worker/task:1" device_filters: "/job:worker/task:1" allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE scoped_allocator_optimization: ON scoped_allocator_opts { enable_op: "CollectiveReduce" enable_op: "CollectiveReduce" enable_op: "CollectiveReduce" } } } experimental { collective_group_leader: "/job:worker/replica:0/task:0" }
2019-01-24 15:55:33.519166: E tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.cc:90] Bad response [Unavailable: Socket closed
Additional GRPC error information:
{"created":"@1548316533.519114543","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1036,"grpc_message":"Socket closed","grpc_status":14}] from GetStepSequenceAsync call to /job:worker/replica:0/task:0
2019-01-24 15:55:33.519293: E tensorflow/core/distributed_runtime/master_session.cc:1493] Bad status from collective_executor_mgr->RefreshStepIdSequence: Unavailable: Socket closed

Failed Log from standalone client mode:
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:An error was raised while a session was being created. This may be due to a preemption of a connected worker or parameter server. A new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed
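As a sanity check I'm also going to probe the raw gRPC endpoints from each machine, outside of TensorFlow (a sketch, assuming the grpcio Python package; the addresses are the worker endpoints from earlier):

import grpc

# A timeout here points at a network/proxy problem rather than at TensorFlow.
for target in ("192.168.20.50:1111", "192.168.20.52:1112"):
  channel = grpc.insecure_channel(target)
  try:
    grpc.channel_ready_future(channel).result(timeout=5)
    print(target, "reachable")
  except grpc.FutureTimeoutError:
    print(target, "NOT reachable")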

@threeleafzerg
Author

@byronyi I root-caused this issue: it was caused by the http_proxy settings. Thank you all for your support.

@byronyi
Contributor

byronyi commented Jan 24, 2019

@threeleafzerg No worries, and thanks for reporting your issue.

@threeleafzerg
Author

@byronyi BTW, do you have any public design doc or roadmap for distribution strategy that can be shared with us? Thanks!

@byronyi
Contributor

byronyi commented Feb 2, 2019

@threeleafzerg
Author

@byronyi Thanks!

@threeleafzerg
Author

@byronyi Sorry for bothering you again. Do you know of any open material about HorovodDistributionStrategy?

@byronyi
Contributor

byronyi commented Feb 22, 2019

Sorry I have no knowledge of that.

@vaibhavkumar049

How do I work around this error? I am using the TPU distribution strategy.

Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:worker/replica:0/task:0/device:CPU:0 in order to run ExperimentalAutoShardDataset: Unable to parse tensor proto
Additional GRPC error information:
{"created":"@1564365854.431149350","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to parse tensor proto","grpc_status":3} [Op:ExperimentalAutoShardDataset]
