fault_tolerance_test fails on some systems #56717
Build output and test log file of a failing run are attached. See also c08fda5, which disables this test on macOS.
Hi @Flamefire !
Attached gist in 2.9 for reference.
So the change is: … Same issue. I also checked the Bazel output, started a new bash shell, set up the environment for this test as per the Bazel output, and changed the CWD to …
From what I can tell the test is faulty. See for example this error log:
So we have the following code at tensorflow/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.py, lines 239 to 244 (at 51fe2b1):
So there the comment clearly states "There may have been some elements prefetched from the first worker before it was stopped.", yet it unconditionally(!) calls … Not only is that code better written as …
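To illustrate the point, here is a minimal sketch of why asserting an exact element count after a worker restart is racy when some elements may already have been prefetched. All names here (`read_remaining`, `prefetched`) are hypothetical and only simulate the situation described above; this is not the actual test code.

```python
# Hypothetical sketch: after a worker is stopped, elements it had already
# handed out ("prefetched") are gone from the stream, so an exact-count
# assertion can fail even when fault tolerance works correctly.

def read_remaining(iterator):
    """Drain an iterator into a list (stand-in for the test's read loop)."""
    return list(iterator)

num_elements = 100
prefetched = 3  # elements consumed before the worker was stopped

# Simulate the iterator after restart: only the non-prefetched part remains.
iterator = iter(range(prefetched, num_elements))
results = read_remaining(iterator)

# An unconditional exact-count check would fail here:
# assert len(results) == num_elements   # 97 != 100

# A tolerant check allows for elements prefetched before the stop:
assert num_elements - prefetched <= len(results) <= num_elements
```

The tolerant bound matches the intent stated in the code comment, whereas the exact-count check only holds when nothing was prefetched.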
To further investigate this I added print loops at tensorflow/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.py line 234 and line 242 (at 51fe2b1), around the `next(iterator)` call.
The first loop prints out tensors 0-49, the second 50-99. So the comment "The dataset starts over now that we read from the new worker." doesn't seem to hold: the dataset continues till the end. I'm not sure where the mistake is and what the expected behavior should be. To me it looks like the fault tolerance works very well: the dataset is not disturbed by the worker restart (if that really happens). I hope that helps.
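The observation above can be illustrated with a plain Python iterator standing in for the dataset; this is a hypothetical simulation of the described behavior, not TensorFlow code.

```python
# Hypothetical illustration: a 100-element dataset read in two halves.
# If the stream resumes after the worker restart (as observed above),
# the second loop yields 50-99; if the dataset truly "started over",
# the second loop would yield 0-49 again.
dataset = iter(range(100))

first = [next(dataset) for _ in range(50)]   # before the worker restart
second = [next(dataset) for _ in range(50)]  # after the worker restart

assert first == list(range(50))
assert second == list(range(50, 100))  # continues; does not start over
```

This matches the printed output reported above (0-49, then 50-99) and contradicts the "starts over" comment in the test.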
Hi @gadagashwini!
Hi @Flamefire, the PR proposed by you has been merged. Is it good to close the issue now? Please check and confirm. Thanks!
@SuryanarayanaY I haven't proposed a PR to tensorflow to fix this issue so this is (likely) still an issue which needs investigation and a fix. See my comment above:
If you were referring to easybuilders/easybuild-easyconfigs#15882: that is a PR against something using TF where, in order to run the tests, I disabled the test which is seemingly broken as reported in this issue. So that is only a workaround, and not even in TF, so the issue in TF is unaffected by that PR.
Could you please confirm whether this issue is specific to RHEL 7 or to your particular system with the mentioned CPUs and GPUs? I can see you mentioned that the test passes on other systems. Please also confirm whether this is still an issue with the latest version. Also, TF currently supports Ubuntu Linux officially as per this source.
@SuryanarayanaY It is not specific to the system except that it seemingly needs a specific level of possible (or not possible) parallelism to run into the behavior I described in #56717 (comment)
Yes, this test seems to be flaky; the potential parallelism (i.e. number of CPU cores) might affect the likelihood of this occurring.
Issue Type: Bug
Source: source
Tensorflow Version: 2.7.1
Custom Code: No
OS Platform and Distribution: Linux RHEL 7
Mobile device: No response
Python version: 3.9.6
Bazel version: 3.7.2
GCC/Compiler version: 11.2
CUDA/cuDNN version: 11.4.1 / cuDNN 8.2.2.26
GPU model and memory: 8 * Tesla A100
Current Behaviour?
Standalone code to reproduce the issue
Run the following build command: `bazel test --config=noaws --config=nogcp --config=nohdfs --compilation_mode=opt --config=opt --subcommands --verbose_failures --jobs=1 --distinct_host_configuration=false --test_output=errors --build_tests_only --local_test_jobs=1 -- //tensorflow/python/data/experimental/kernel_tests/service:fault_tolerance_test`
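Since the failure appears to be flaky, one possible follow-up (an assumption on my part, not part of the original report) is to rerun the single test repeatedly using Bazel's `--runs_per_test` flag, mirroring the flags of the command above:

```shell
# Hypothetical sketch: rerun the test 20 times to gauge flakiness.
# --runs_per_test is a standard Bazel flag; other flags mirror the repro command.
bazel test --runs_per_test=20 --test_output=errors --local_test_jobs=1 \
  -- //tensorflow/python/data/experimental/kernel_tests/service:fault_tolerance_test
```

If only a fraction of the runs fail, that supports the theory that the exact-count assertion races with prefetching.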
Relevant log output