Some confusion about TensorFlow distributed training (the amount of synchronized data is much larger than the theoretical value) #66601
Comments
@certainly-cyber, see https://www.tensorflow.org/api_docs/python/tf/distribute/MultiWorkerMirroredStrategy. Also, I suspect you are using code written for TensorFlow 1.x, which is not actively supported anymore. Kindly convert the code to the latest version and use TensorFlow v2.15 or v2.16. Thank you!
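For reference, a minimal sketch of what the converted setup might look like in TF 2.x, assuming a CPU-only multi-worker job (the strategy comes from the docs linked above; the toy model and the cluster addresses in the comment are illustrative only):

```python
import tensorflow as tf

# TF_CONFIG must be set in each worker's environment, e.g.:
# {"cluster": {"worker": ["host0:12345", "host1:12345"]},
#  "task": {"type": "worker", "index": 0}}
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are replicated on every worker,
    # and their gradients are synchronized with collective all-reduce.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="sgd", loss="mse")
```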
I understand what you mean, but this API used to be available... right?
@certainly-cyber, although CollectiveAllReduceStrategy reduces computation overhead by distributing work across workers, there is additional overhead associated with transferring data between them. This overhead can include serialization, deserialization, and network latency, all of which can contribute to a larger communication volume than the theoretical minimum. Thank you!
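One concrete way to see that serialization alone adds bytes on top of the raw payload (this sketch uses tf.io.serialize_tensor purely as an illustration; the collective ops use their own wire format, so the exact numbers will differ):

```python
import tensorflow as tf

# 193 float32 gradients occupy 193 * 4 = 772 raw bytes.
grad = tf.zeros([193], dtype=tf.float32)
raw_bytes = 193 * 4

# Serializing to a TensorProto adds metadata (dtype, shape, framing).
serialized = tf.io.serialize_tensor(grad)
print(raw_bytes, len(serialized.numpy()))  # serialized form exceeds 772 bytes
```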
Okay, I got it. Thank you again for your answer~
@certainly-cyber, also, since this is not a bug or feature request, kindly file the issue on the TensorFlow Forum; there is also a larger community that reads questions there. Thank you!
OK, I got it~
@certainly-cyber,
Sure, my pleasure. |
Issue type
Support
Have you reproduced the bug with TensorFlow Nightly?
No
Source
source
TensorFlow version
TF1.14
Custom code
No
OS platform and distribution
Linux Ubuntu 20.04
Mobile device
No response
Python version
2
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
As far as I know, TensorFlow uses ring allreduce, which should mean the amount of synchronized data per worker is [2 * (N - 1) / N] * number of parameters * 4 bytes (reduce-scatter + all-gather), where N is the number of workers. For large N this simplifies to approximately:
2 * number of parameters * 4 bytes
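For concreteness, a small helper that evaluates this formula (the function name and the float32 element size are my own choices):

```python
def ring_allreduce_bytes(num_params, num_workers, bytes_per_param=4):
    # Reduce-scatter + all-gather: each phase moves (N - 1) / N of the
    # gradient data per worker, so the total is 2 * (N - 1) / N * payload.
    return 2 * (num_workers - 1) / num_workers * num_params * bytes_per_param

print(ring_allreduce_bytes(193, 2))  # 772.0 bytes for N = 2 workers
print(2 * 193 * 4)                   # 1544 bytes, the large-N simplification
```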
The distribution strategy I adopt is:
train_distribute=tf.contrib.distribute.CollectiveAllReduceStrategy(num_gpus_per_worker=0)
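Presumably that line sits inside the usual TF 1.x Estimator pattern, roughly like this (a sketch; the model_fn and the rest of the training setup are omitted):

```python
import tensorflow as tf

# TF 1.x (deprecated): the strategy is passed to the Estimator through
# RunConfig's train_distribute argument; num_gpus_per_worker=0 runs the
# collective all-reduce on CPU.
strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(num_gpus_per_worker=0)
config = tf.estimator.RunConfig(train_distribute=strategy)
# config is then handed to tf.estimator.Estimator(model_fn=..., config=config).
```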
On the other hand, I ran a simple experiment; my model has 193 parameters.
This should come to about 2 * 193 * 4 = 1544 bytes, but in fact I captured around 2500 bytes of outgoing packets when I used tcpdump to inspect the TensorFlow traffic on one worker node.
Why is there such a big gap? Is it related to the TensorFlow environment I am using (Kubeflow + TensorFlow training operator on Kubernetes)? Is there a more accurate formula?
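As a sanity check on the parameter count that feeds the estimate above, something like this (TF 2.x style; the toy layer size is chosen only to reproduce the 193-parameter figure from this report):

```python
import numpy as np
import tensorflow as tf

# Toy stand-in whose trainable variables total 193 elements
# (192 weights + 1 bias).
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(192,))])

num_params = sum(int(np.prod(v.shape)) for v in model.trainable_variables)
print(num_params)          # 193
print(2 * num_params * 4)  # 1544 bytes under the simplified bound
```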
Standalone code to reproduce the issue
Relevant log output
No response