Some confusion about TensorFlow distributed training (the amount of synchronized data is much larger than the theoretical value) #66601
Comments
@certainly-cyber, see https://www.tensorflow.org/api_docs/python/tf/distribute/MultiWorkerMirroredStrategy. Also, I suspect you are using code written for TensorFlow 1.x, which is not actively supported anymore. Kindly convert the code to the latest version and use TensorFlow v2.15 or v2.16. Thank you!
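For reference, a minimal sketch of what the converted setup might look like in TF 2.x, assuming a CPU-only multi-worker job (the strategy comes from the docs linked above; the toy model and the cluster addresses in the comment are illustrative only):

```python
import tensorflow as tf

# TF_CONFIG must be set in each worker's environment, e.g.:
# {"cluster": {"worker": ["host0:12345", "host1:12345"]},
#  "task": {"type": "worker", "index": 0}}
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are replicated on every worker,
    # and their gradients are synchronized with collective all-reduce.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="sgd", loss="mse")
```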
I understand what you mean, but this API used to be available... right?
@certainly-cyber, although CollectiveAllReduceStrategy reduces computation overhead by distributing work across workers, there is additional overhead associated with transferring data between them. This overhead can include serialization, deserialization, and network latency, all of which can contribute to a larger communication volume than the theoretical minimum. Thank you!
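One concrete way to see that serialization alone adds bytes on top of the raw payload (this sketch uses tf.io.serialize_tensor purely as an illustration; the collective ops use their own wire format, so the exact numbers will differ):

```python
import tensorflow as tf

# 193 float32 gradients occupy 193 * 4 = 772 raw bytes.
grad = tf.zeros([193], dtype=tf.float32)
raw_bytes = 193 * 4

# Serializing to a TensorProto adds metadata (dtype, shape, framing).
serialized = tf.io.serialize_tensor(grad)
print(raw_bytes, len(serialized.numpy()))  # serialized form exceeds 772 bytes
```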
Okay, I got it. Thank you again for your answer~
@certainly-cyber, also, since this is not a bug or feature request, kindly file the issue on the TensorFlow Forum; there is also a larger community that reads questions there. Thank you!
OK, I got it~
@certainly-cyber,
Sure, my pleasure. |
Issue type
Support
Have you reproduced the bug with TensorFlow Nightly?
No
Source
source
TensorFlow version
TF1.14
Custom code
No
OS platform and distribution
Linux Ubuntu 20.04
Mobile device
No response
Python version
2
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
As far as I know, TensorFlow uses ring allreduce, which should mean the amount of synchronized data per worker is [2 * (N - 1) / N] * number of parameters * 4 bytes (reduce-scatter + all-gather), where N is the number of workers. For large N this simplifies to approximately:
2 * number of parameters * 4 bytes
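For concreteness, a small helper that evaluates this formula (the function name and the float32 element size are my own choices):

```python
def ring_allreduce_bytes(num_params, num_workers, bytes_per_param=4):
    # Reduce-scatter + all-gather: each phase moves (N - 1) / N of the
    # gradient data per worker, so the total is 2 * (N - 1) / N * payload.
    return 2 * (num_workers - 1) / num_workers * num_params * bytes_per_param

print(ring_allreduce_bytes(193, 2))  # 772.0 bytes for N = 2 workers
print(2 * 193 * 4)                   # 1544 bytes, the large-N simplification
```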
The distribution strategy I adopt is:
train_distribute=tf.contrib.distribute.CollectiveAllReduceStrategy(num_gpus_per_worker=0)
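Presumably that line sits inside the usual TF 1.x Estimator pattern, roughly like this (a sketch; the model_fn and the rest of the training setup are omitted):

```python
import tensorflow as tf

# TF 1.x (deprecated): the strategy is passed to the Estimator through
# RunConfig's train_distribute argument; num_gpus_per_worker=0 runs the
# collective all-reduce on CPU.
strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(num_gpus_per_worker=0)
config = tf.estimator.RunConfig(train_distribute=strategy)
# config is then handed to tf.estimator.Estimator(model_fn=..., config=config).
```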
On the other hand, I ran a simple experiment; my model has 193 parameters.
This should come to about 2 * 193 * 4 = 1544 bytes, but in fact I captured around 2500 bytes of outgoing packets when I used tcpdump to inspect the TensorFlow traffic on one worker node.
Why is there such a big gap? Is it related to the TensorFlow environment I am using (Kubeflow + TensorFlow training operator on Kubernetes)? Is there a more accurate formula?
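As a sanity check on the parameter count that feeds the estimate above, something like this (TF 2.x style; the toy layer size is chosen only to reproduce the 193-parameter figure from this report):

```python
import numpy as np
import tensorflow as tf

# Toy stand-in whose trainable variables total 193 elements
# (192 weights + 1 bias).
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(192,))])

num_params = sum(int(np.prod(v.shape)) for v in model.trainable_variables)
print(num_params)          # 193
print(2 * num_params * 4)  # 1544 bytes under the simplified bound
```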
Standalone code to reproduce the issue
Relevant log output
No response