Distributed training fails when using CollectiveAllReduceStrategy #24887
Comments
Could you try with
@byronyi,
So I followed the post above to launch the standalone client, but I still hit the issue. Can you help check whether there is any problem in my procedure?
export TF_CONFIG='{
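For reference, a minimal sketch of what a complete two-worker TF_CONFIG could look like when set from Python; the host addresses are the two machines mentioned later in this thread, and port 5000 is purely an assumed placeholder:

import json
import os

# Hypothetical two-worker cluster spec: the hosts are the machines named
# in this thread; port 5000 is an assumed placeholder.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["192.168.20.50:5000", "192.168.20.52:5000"]},
    # On the second machine this would be {"type": "worker", "index": 1}.
    "task": {"type": "worker", "index": 0},
})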
But I encountered this assertion failure:
@yuefengz Could you take a look here?
The error message points out that you need at least two replicas for the collective broadcast to work. For distributed training, please use more than one worker. Closing this issue; feel free to re-open it.
@yuefengz
If you use CPU only, you have to have at least two machines.
Try not touching num_gpus_per_worker, i.e. leave it at its default, and try again. I tried this myself a couple of months ago and it did work with CPU-only TF. Let me know if your case does not.
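For what it's worth, leaving num_gpus_per_worker at its default amounts to the following minimal sketch (assuming the TF 1.12 contrib API):

import tensorflow as tf

# With num_gpus_per_worker left at its default (0), the strategy runs
# CPU-only collectives, one replica per worker.
strategy = tf.contrib.distribute.CollectiveAllReduceStrategy()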
@yuefengz Yes, I did run the distributed training on two machines with CPU only (192.168.20.50 and 192.168.20.52).
Background: I am a developer on the Intel TensorFlow team focusing on multi-node training. Currently we use Horovod as our allreduce solution, but it is said that distribution strategies will be the trend in TensorFlow for allreduce, so I am evaluating the impact of distribution strategies. Any comments and help are highly appreciated!
Sorry, could you give a minimal reproducing example? I have tried it myself with the latest nightly and failed to find any problems.
By the way, I could not find a way to get rid of the following warning:
@yuefengz Any idea what is wrong?
@byronyi It might be something wrong with the model. I didn't see it with ResNet50. I'll re-run the examples later to try to reproduce this warning.
@byronyi Thank you for your enlightening script. I tried it both on our company's cluster and on my private machine. On our cluster the worker script hangs, and on my private machine (a cleaner environment) the worker script errors out. I have pasted the logs at https://gist.github.com/ca1b55e5a5423d5b3abb9efc6fd34b80. Can you help check?
@yuefengz Can you provide the full script of the ResNet50 example? Thanks!
I have no idea why you hit the 'https' scheme not supported in proxy URI problem; I have never seen that before. Could you replace localhost with 127.0.0.1? Running in a TF Docker container might also help.
@byronyi I set it to 127.0.0.1, but the result is the same.
@byronyi I also asked one of my colleagues to run the same set of scripts, and they got the same result. The hanging Python call stack is as follows:
@byronyi I can run the distributed training (CollectiveAllReduceStrategy) successfully on my local machine, but it still fails on our company cluster (same set of software). Our cluster is managed by Slurm. Does that conflict with gRPC? Are there any precautions for using gRPC from a network perspective?
Failed log for independent-worker mode:
Failed log from standalone-client mode:
@byronyi root-caused this issue: it was caused by http_proxy settings. Thank you all for your support.
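For anyone else who lands here: gRPC picks up the standard proxy environment variables, so a defensive cleanup before launching each worker might look like the sketch below (which variables your environment actually exports is an assumption):

import os

# Drop proxy settings so worker-to-worker gRPC connections go direct.
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    os.environ.pop(var, None)

# Alternatively, exempt the cluster hosts explicitly.
os.environ["no_proxy"] = "localhost,127.0.0.1,192.168.20.50,192.168.20.52"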
@threeleafzerg Nevermind, and thanks for reporting your issue.
@byronyi BTW, do you have any public design doc or public future plan for distribution strategies that can be shared with us? Thanks!
I'd suggest you take a look at https://github.com/tensorflow/community/blob/master/rfcs/20181016-replicator.md and tensorflow/community#55.
@byronyi Thanks!
@byronyi Sorry for bothering you again. Do you know of any open material about HorovodDistributionStrategy?
Sorry, I have no knowledge of that.
How do I work around this error? I am using the TPU distribution strategy.
Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:worker/replica:0/task:0/device:CPU:0 in order to run ExperimentalAutoShardDataset: Unable to parse tensor proto
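The thread does not answer this, but a commonly suggested workaround, offered here only as an untested sketch for recent TF 2.x versions, is to disable the dataset auto-sharding rewrite through tf.data options:

import tensorflow as tf

# Toy dataset standing in for the real input pipeline.
dataset = tf.data.Dataset.from_tensor_slices(list(range(1024))).batch(32)

# Turn off the auto-shard rewrite that inserts ExperimentalAutoShardDataset.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.OFF)
dataset = dataset.with_options(options)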
System information
Have I written custom code: Yes
OS platform and distribution: 16.04
TensorFlow installed from: Source
TensorFlow version: 1.12.0rc0
Python version: 3.4
Bazel version: 0.16
GCC/compiler version: 4.8.5
You can also obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
Describe the current behavior
I am trying to employ CollectiveAllReduceStrategy with the TensorFlow official ResNet model, following the instructions at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute#multi-worker-training.
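For context, the wiring those instructions describe looks roughly like the following sketch, assuming the TF 1.12 contrib API (DistributeConfig) and a trivial stand-in model_fn rather than the real ResNet one:

import tensorflow as tf

def model_fn(features, labels, mode):
    # Trivial stand-in for the ResNet model_fn: one scalar weight.
    w = tf.get_variable("w", shape=[], initializer=tf.zeros_initializer())
    loss = tf.reduce_mean(tf.square(tf.cast(features, tf.float32) * w - 1.0))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

# Hand the strategy to the Estimator through RunConfig.
strategy = tf.contrib.distribute.CollectiveAllReduceStrategy()
run_config = tf.estimator.RunConfig(
    experimental_distribute=tf.contrib.distribute.DistributeConfig(
        train_distribute=strategy))
estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)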
Code: https://github.com/threeleafzerg/models
Steps:
I expected the distributed training to start successfully, but unfortunately I got a Python exception:
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 403, in _init_from_arg s
initial_value() if init_from_fn else initial_value,
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/contrib/distribute/python/collective_all_reduce_strategy.py", lin e 180, in _overridden_initial_value_fn
group_size, group_key, collective_instance_key)
File "/home/zhouhaiy/.local/lib/python3.4/site-packages/tensorflow/python/ops/collective_ops.py", line 94, in broadcast_send
'Parameter group_size to broadcast_send must be at least 2.')
ValueError: Parameter group_size to broadcast_send must be at least 2.
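For reference, the check that fires here is a plain argument guard; paraphrased as a sketch with a hypothetical helper name, it amounts to:

def check_group_size(group_size):
    # Paraphrase of the guard in tensorflow/python/ops/collective_ops.py
    # (TF 1.12): a broadcast group of a single replica is rejected, which is
    # why a cluster that resolves to only one worker fails while
    # initializing variables.
    if group_size <= 1:
        raise ValueError(
            'Parameter group_size to broadcast_send must be at least 2.')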
Describe the expected behavior
Distributed training can start successfully.
Code to reproduce the issue
I have uploaded my experiment code to my fork: https://github.com/threeleafzerg/models