
Some confusion about TensorFlow distributed training (the amount of synchronized data is much larger than the theoretical value) #66601

Closed
certainly-cyber opened this issue Apr 29, 2024 · 9 comments
Assignees
Labels
comp:dist-strat Distribution Strategy related issues TF 1.14 for issues seen with TF 1.14 type:support Support issues

Comments

@certainly-cyber

Issue type

Support

Have you reproduced the bug with TensorFlow Nightly?

No

Source

source

TensorFlow version

TF1.14

Custom code

No

OS platform and distribution

Linux Ubuntu 20.04

Mobile device

No response

Python version

2

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

As far as I know, TensorFlow uses ring all-reduce, which means the amount of data synchronized per step should be roughly 2 * (N - 1) / N * (number of parameters) * 4 bytes (reduce-scatter + all-gather), where N is the number of workers. For large N this simplifies to about:
(2 * number of parameters * 4 bytes)
The distribution strategy I use is:
train_distribute=tf.contrib.distribute.CollectiveAllReduceStrategy(num_gpus_per_worker=0)
To check this, I ran a simple experiment; my model's parameters are as follows:

Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 16)                176
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17
=================================================================
Total params: 193
Trainable params: 193
Non-trainable params: 0

This should come to about 2 * 193 * 4 = 1544 bytes, but when I used tcpdump to capture the TensorFlow traffic on one worker node, I saw around 2500 bytes of outgoing packets.
Why is there such a big gap? Is this related to the TensorFlow environment I am using (Kubeflow + TensorFlow training operator on Kubernetes)? Is there a more accurate formula?
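For reference, this is the back-of-the-envelope calculation I am working from (the worker count of 2 is an assumption for illustration; gRPC framing, protobuf metadata and TCP/IP headers are not included, so the on-wire capture will be larger):

# Rough estimate of the ring all-reduce gradient payload per step.
num_params = 193        # from the model summary above
bytes_per_param = 4     # float32 gradients
num_workers = 2         # assumed; adjust to your cluster size

# Reduce-scatter + all-gather each move (N - 1) / N of the gradient data.
payload = 2 * (num_workers - 1) / num_workers * num_params * bytes_per_param
print(f"theoretical gradient payload: {payload:.0f} bytes")
# 772 bytes for N = 2; the simplified 2 * 193 * 4 = 1544 bytes is the large-N limit.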

Standalone code to reproduce the issue

This is the code I am using:
https://github.com/kubeflow/training-operator/blob/master/examples/tensorflow/distribution_strategy/estimator-API/keras_model_to_estimator.py

Relevant log output

No response

@google-ml-butler google-ml-butler bot added the type:support Support issues label Apr 29, 2024
@tilakrayal tilakrayal added TF 1.14 for issues seen with TF 1.14 comp:dist-strat Distribution Strategy related issues labels Apr 29, 2024
@tilakrayal
Contributor

@certainly-cyber,
tf.contrib.distribute.CollectiveAllReduceStrategy is no longer available; it was part of the deprecated tf.contrib API. CollectiveAllReduceStrategy is now exported as MultiWorkerMirroredStrategy.

https://www.tensorflow.org/api_docs/python/tf/distribute/MultiWorkerMirroredStrategy

I also suspect you are using code written for TensorFlow 1.x, which is no longer actively supported. Kindly convert the code to the latest version and use TensorFlow v2.15 or v2.16. Thank you!
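For reference, a minimal TF2 sketch of the equivalent setup (the cluster topology is assumed to be provided through the TF_CONFIG environment variable, which the Kubeflow training operator normally sets for each worker):

import tensorflow as tf

# TF 2.x replacement for tf.contrib.distribute.CollectiveAllReduceStrategy.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Same two-layer model as in the issue (input dimension of 10 is assumed).
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")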

@tilakrayal tilakrayal added the stat:awaiting response Status - Awaiting response from author label Apr 30, 2024
@certainly-cyber
Author

I understand what you mean, but this API used to be available... right?
For certain reasons, I may not be able to migrate to the corresponding TF2 version. In fact, the code above runs successfully and produces good training results with TF 1.14 and the CollectiveAllReduceStrategy API. The only point I am confused about is why the communication volume between workers is much larger than the theoretical value.
Looking forward to your reply, and have a nice day~

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Apr 30, 2024
@tilakrayal
Contributor

@certainly-cyber,
The API tf.contrib.distribute.CollectiveAllReduceStrategy is not available because it is part of the deprecated tf.contrib module. CollectiveAllReduceStrategy is MultiWorkerMirroredStrategy; CollectiveAllReduceStrategy was simply the name used in the implementation. Please refer to the MultiWorkerMirroredStrategy documentation.

[image]

Also, while CollectiveAllReduceStrategy reduces computation overhead by distributing work across workers, there is additional overhead associated with transferring data between them. This overhead includes serialization, deserialization, and network latency, all of which can make the communication volume larger than the theoretical minimum. Thank you!
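If you want explicit control over which collective implementation is used (and therefore what traffic tcpdump will show), the TF 2.x strategy accepts communication options; a small sketch, assuming a CPU-only ring setup similar to yours:

import tensorflow as tf

# Request the ring implementation of the collective all-reduce explicitly;
# AUTO and NCCL are the other choices (NCCL applies to GPU clusters only).
options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.RING
)
strategy = tf.distribute.MultiWorkerMirroredStrategy(communication_options=options)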

@tilakrayal tilakrayal added the stat:awaiting response Status - Awaiting response from author label May 7, 2024
@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label May 7, 2024
@certainly-cyber
Author

Okay, I got it. Thank you again for your answer~
By the way, if I adopt MultiWorkerMirroredStrategy, can we accurately calculate (or at least roughly estimate) the amount of data that needs to be synchronized? I already know how to estimate it from the number of model parameters, but that doesn't seem accurate enough :) For example, what does the format of a synchronized message look like, and what proportion of it is overhead? Thank you!

@tilakrayal
Contributor

@certainly-cyber,
In MultiWorkerMirroredStrategy, the synchronized messages primarily consist of gradients associated with the model's weights and biases. These gradients are numerical tensors representing the direction and magnitude for updating the model parameters during training.
https://www.tensorflow.org/api_docs/python/tf/distribute/MultiWorkerMirroredStrategy
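If you only need a rough per-step figure, one way is to sum the sizes of the trainable variables, since one gradient tensor of the same shape and dtype is all-reduced for each of them (a sketch; gRPC framing and TCP/IP headers on the wire are not counted):

def gradient_payload_bytes(model):
    # Approximate bytes of gradient data all-reduced per training step.
    return sum(v.shape.num_elements() * v.dtype.size
               for v in model.trainable_variables)

For the two-layer model in this issue that gives 193 * 4 = 772 bytes of raw gradient payload; serialization and transport overhead are added on top of that.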

Also, since this is not a bug or feature request, kindly file the question on the TensorFlow Forum. There is also a larger community that reads questions there. Thank you!

@tilakrayal tilakrayal added the stat:awaiting response Status - Awaiting response from author label May 21, 2024
@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label May 21, 2024
@certainly-cyber
Author

ok, I got it~
Thanks again for your answer. I'll keep exploring. Have a nice day!

@tilakrayal
Contributor

@certainly-cyber,
Glad the issue was resolved. Please feel free to move this issue to closed status. Thank you!

@tilakrayal tilakrayal added the stat:awaiting response Status - Awaiting response from author label May 21, 2024
@certainly-cyber
Author

Sure, my pleasure.
Thanks again~

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label May 21, 2024