collective_ops.all_reduce_v2 with ordering_token does not work correctly #56885
Could @crccw give some explanation of ordering_token?
Hi @chengmengli06,
Could you confirm whether the original issue still persists? Thank you!
It can be reproduced with this branch using TensorFlow 2.9.1: https://github.com/alibaba/EasyRec/tree/fix_mirrored_bug. The test is temporarily skipped in the master branch.
Hi @chengmengli06, I tried with TensorFlow 2.9.1, but I didn't see any error. Thank you!
Could you post your logs here? @gadagashwini
Hi @chengmengli06, I tried with TF 2.9.1 and CUDA 11.4.
As you can see from the log, the test case is skipped. Could you check out the fix_mirrored_bug branch and run the test again?
Any progress?
@crccw any progress? |
Could you explain the purpose of ordering_token? Does it control the order of communication? Is it related to NCCL? Maybe we could help with it.
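For context, the thread concerns the collective all-reduce that MirroredStrategy issues under the hood (CollectiveReduceV2, whose ordering_token input is not publicly documented). Below is a minimal sketch of the user-facing all-reduce path, assuming TF 2.x's public tf.distribute API; it runs on a single CPU replica so it needs no GPU, while with multiple GPUs the same code would exercise the cross-replica collective discussed in this issue.

```python
import tensorflow as tf

# Minimal sketch of the user-facing all-reduce path that eventually
# lowers to a collective reduce op. Pinned to a single CPU replica so
# no GPU (or NCCL) is required.
strategy = tf.distribute.MirroredStrategy(devices=["/cpu:0"])

with strategy.scope():
    v = tf.Variable(1.0)

@tf.function
def step():
    def replica_fn():
        # Per-replica computation; each replica produces a local value.
        return v + 1.0

    per_replica = strategy.run(replica_fn)
    # Cross-replica all-reduce (SUM). With a single replica this simply
    # returns that replica's value.
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)

print(step().numpy())  # 2.0 with a single replica
```

Running the EasyRec test from the fix_mirrored_bug branch on multiple GPUs should hit the multi-replica version of this path, which is where the ordering_token behavior matters.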
Issue Type
Bug
Source
binary
Tensorflow Version
TF 2.5 or TF 2.8
Custom Code
No
OS Platform and Distribution
CentOS 7.2
Mobile device
No response
Python version
3.7
Bazel version
No response
GCC/Compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current Behaviour?
Standalone code to reproduce the issue
Relevant log output