fault_tolerance_test fails on some systems #56717

Open · Flamefire opened this issue Jul 8, 2022 · 8 comments
Labels: stat:awaiting tensorflower (Status - Awaiting response from tensorflower), subtype:bazel (Bazel related Build_Installation issues), subtype: ubuntu/linux (Ubuntu/Linux Build/Installation Issues), TF 2.7 (Issues related to TF 2.7.0), type:build/install (Build and install issues)

Comments

Flamefire (Contributor) commented Jul 8, 2022

Issue Type: Bug
Source: source
Tensorflow Version: 2.7.1
Custom Code: No
OS Platform and Distribution: Linux RHEL 7
Mobile device: No response
Python version: 3.9.6
Bazel version: 3.7.2
GCC/Compiler version: 11.2
CUDA/cuDNN version: 11.4.1 / cuDNN 8.2.2.26
GPU model and memory: 8 * Tesla A100

Current Behaviour?

When running the test suite during/after the build, the test `//tensorflow/python/data/experimental/kernel_tests/service:fault_tolerance_test` fails on this system, while it seemingly passes on another system.
It seems to be very flaky or dependent on the number of CPUs/GPUs. This system has 96 cores and 8 GPUs.
The output is something like
> //tensorflow/python/data/experimental/kernel_tests/service:fault_tolerance_test (2/20 cached) FAILED in 18 out of 20 in 13.5s

Standalone code to reproduce the issue

Run the following build command: `bazel test --config=noaws --config=nogcp --config=nohdfs --compilation_mode=opt --config=opt --subcommands --verbose_failures --jobs=1 --distinct_host_configuration=false --test_output=errors --build_tests_only --local_test_jobs=1 -- //tensorflow/python/data/experimental/kernel_tests/service:fault_tolerance_test`

Relevant log output

======================================================================
FAIL: testAddWorkerMidJob_test_mode_eager_tfapiversion_2 (__main__.FaultToleranceTest)
FaultToleranceTest.testAddWorkerMidJob_test_mode_eager_tfapiversion_2
testAddWorkerMidJob_test_mode_eager_tfapiversion_2(mode='eager', tf_api_version=2)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/software/TensorFlow/2.7.1-foss-2021b-CUDA-11.4.1/lib/python3.9/site-packages/absl/testing/parameterized.py", line 314, in bound_param_test
    return test_method(self, **testcase_params)
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmpBD23m_-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/framework/test_combinations.py", line 366, in decorated
    execute_test_method()
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmpBD23m_-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/framework/test_combinations.py", line 349, in execute_test_method
    test_method(**kwargs_to_pass)
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmpBD23m_-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.py", line 220, in testAddWorkerMidJob
    self.assertCountEqual(2 * list(range(num_elements)), results)
AssertionError: Element counts were not equal:
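
For context, assertCountEqual from unittest compares the two sequences as unordered multisets, so the failure above means the collected results did not contain each expected element the expected number of times. A minimal standalone sketch (with made-up data, not the test's actual values):

import unittest

class CountEqualDemo(unittest.TestCase):
    def test_counts(self):
        num_elements = 3
        results = [2, 1, 0, 2, 0, 1]  # same elements as 2 * [0, 1, 2], in any order
        self.assertCountEqual(2 * list(range(num_elements)), results)  # passes
        # With e.g. results = [0, 1, 2] the call fails with
        # "Element counts were not equal", as in the log above.

if __name__ == "__main__":
    unittest.main()
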
Flamefire (Contributor, Author) commented

Build output and test log file of a test run with --test_env=CUDA_VISIBLE_DEVICES='1,2,3,4'

See also c08fda5, which disables this test on macOS.

@mohantym added the TF 2.7, subtype:bazel, subtype: ubuntu/linux, and type:build/install labels and removed the type:bug label on Jul 10, 2022
mohantym (Contributor) commented Jul 10, 2022

Hi @Flamefire!
Could you re-test with the command below and let us know? (I replaced --config=opt with -c opt, as it was complaining about a missing opt config in the .rc file with the original command.)

bazel test --config=noaws --config=nogcp --config=nohdfs --compilation_mode=opt -c opt --subcommands --verbose_failures --jobs=1 --distinct_host_configuration=false --test_output=errors --build_tests_only --local_test_jobs=1 -- //tensorflow/python/data/experimental/kernel_tests/service:fault_tolerance_test

Attached a gist for 2.9 for reference.
Thank you!

@mohantym added the stat:awaiting response label on Jul 10, 2022
Flamefire (Contributor, Author) commented

So the change is: --config=opt replaced by -c opt, correct?

Same issue. I also checked the Bazel output, started a new bash session, set up the environment for this test as per the Bazel output, changed the CWD to .../execroot/org_tensorflow/bazel-out/k8-opt/bin, and ran tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test directly. Same error.

Ran 41 tests in 19.142s

FAILED (failures=1, errors=6, skipped=22)

From what I can tell, the test is faulty. See, for example, this error log:

ERROR: testRestartWorker_test_mode_eager_tfapiversion_2_usesameport_True_faulttolerantmode_True_workdir_tmpworkdirplaceholder (__main__.FaultToleranceTest)
FaultToleranceTest.testRestartWorker_test_mode_eager_tfapiversion_2_usesameport_True_faulttolerantmode_True_workdir_tmpworkdirplaceholder
testRestartWorker_test_mode_eager_tfapiversion_2_usesameport_True_faulttolerantmode_True_workdir_tmpworkdirplaceholder(mode='eager', tf_api_version=2, use_same_port=True, fault_tolerant_mode=True, work_dir='tmp_work_dir_placeholder')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/data/ops/iterator_ops.py", line 800, in __next__
    return self._next_internal()
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/data/ops/iterator_ops.py", line 783, in _next_internal
    ret = gen_dataset_ops.iterator_get_next(
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/ops/gen_dataset_ops.py", line 2845, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/framework/ops.py", line 7107, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/software/TensorFlow/2.7.1-foss-2021b-CUDA-11.4.1/lib/python3.9/site-packages/absl/testing/parameterized.py", line 314, in bound_param_test
    return test_method(self, **testcase_params)
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/framework/test_combinations.py", line 366, in decorated
    execute_test_method()
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/framework/test_combinations.py", line 349, in execute_test_method
    test_method(**kwargs_to_pass)
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.py", line 245, in testRestartWorker
    val = next(iterator).numpy()
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/data/ops/iterator_ops.py", line 802, in __next__
    raise StopIteration
StopIteration

So we have a raise StopIteration in a __next__ function, which is a valid (Python) way to signal end-of-sequence. Going up the stack to the val = next(iterator).numpy() line, we have this code https://github.com/tensorflow/tensorflow/blob/v2.7.1/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.py#L242-L247 (or the equivalent in master):

# There may have been some elements prefetched from the first worker
# before it was stopped.
while True:
  val = next(iterator).numpy()
  if val == 0:
    break

So there the comment clearly states "There may have been some elements prefetched from the first worker before it was stopped", yet it unconditionally(!) calls next(iterator) in a while True loop.

Not only would that code be better written as while next(iterator).numpy() != 0, but the case where the iterator is exhausted before it returns 0 isn't handled at all! So better would be:

# For loops handle StopIteration and call `next` already
for val in iterator:
    if val.numpy() == 0:
        break
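
To make the failure mode concrete, here is a minimal standalone sketch (plain Python iterators stand in for the tf.data iterator, so the data is made up): if none of the remaining elements is 0, the while True form turns the exhaustion into a StopIteration that propagates out of the test, while the for form simply ends the loop.

def drain_with_while(iterator):
    # pattern currently used in the test: assumes a 0 will always show up
    while True:
        val = next(iterator)  # raises StopIteration once the iterator is exhausted
        if val == 0:
            break

def drain_with_for(iterator):
    # suggested pattern: exhaustion simply ends the loop
    for val in iterator:
        if val == 0:
            break

prefetched = [3, 2, 1]  # hypothetical leftover elements, none of them 0

try:
    drain_with_while(iter(prefetched))
except StopIteration:
    print("while True form raised StopIteration")

drain_with_for(iter(prefetched))
print("for form finished without raising")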

To further investigate this I added print statements to the loops to print out the values from next(iterator).

The first loop prints out tensors with 0-49, the second 50-99. So the comment "The dataset starts over now that we read from the new worker." doesn't seem to hold: the dataset continues to the end.
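
For illustration, a rough standalone sketch of that kind of instrumentation (a plain range iterator stands in for the tf.data service iterator, and the split at 50 only mirrors the observed output; the real loop conditions are in the test file linked above):

iterator = iter(range(100))      # stand-in for the tf.data service iterator

# first read loop (in the real test this runs around the worker restart)
for _ in range(50):
    val = next(iterator)
    print("first loop:", val)    # the real test printed 0-49 here

# second read loop (expected to start over at 0 after the restart)
for _ in range(50):
    val = next(iterator)
    print("second loop:", val)   # the real test printed 50-99, i.e. it continued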

I'm not sure where the mistake is or what the expected behavior should be. To me it looks like the fault tolerance works very well: the dataset is not disturbed by the worker restart (if that restart really happens).

I hope that helps.

@google-ml-butler bot removed the stat:awaiting response label on Jul 14, 2022
mohantym (Contributor) commented Jul 15, 2022

Hi @gadagashwini!
Could you look at this issue? Attached 2.7, 2.8, 2.9, and nightly gists for reference (--config=opt is not working; -c opt works instead).
Thank you!

@mohantym assigned gadagashwini and unassigned mohantym on Jul 15, 2022
@gadagashwini added the stat:awaiting tensorflower label on Jul 18, 2022
@mohantym changed the title from "fault_tolerance_test fails one some systems" to "fault_tolerance_test fails on some systems" on Nov 22, 2022
@SuryanarayanaY self-assigned this on Mar 22, 2023
SuryanarayanaY (Collaborator) commented

Hi @Flamefire,

The PR proposed by you has been merged. Is it good to close the issue now? Please check and confirm. Thanks!

@SuryanarayanaY added the stat:awaiting response label on Mar 22, 2023
Flamefire (Contributor, Author) commented

@SuryanarayanaY I haven't proposed a PR to TensorFlow to fix this issue, so this is (likely) still an issue that needs investigation and a fix. See my comment above:

I'm not sure where the mistake is or what the expected behavior should be. To me it looks like the fault tolerance works very well: the dataset is not disturbed by the worker restart (if that restart really happens).

If you were referring to easybuilders/easybuild-easyconfigs#15882: that is a PR against a project using TF where, in order to run the tests, I disabled the test that is seemingly broken as reported in this issue. So that is only a workaround, and not even in TF, so the issue in TF is unaffected by that PR.

@google-ml-butler bot removed the stat:awaiting response label on Mar 22, 2023
SuryanarayanaY (Collaborator) commented

@Flamefire,

Could you please confirm whether this issue is specific to RHEL 7 or to your particular system with the mentioned CPUs and GPUs? I can see you mentioned that the test passes on another system.

Please also confirm whether this is still an issue with the latest version.

Also, TF currently provides official build instructions for Ubuntu Linux, as per this source.

@SuryanarayanaY added the stat:awaiting response label on Apr 10, 2023
Flamefire (Contributor, Author) commented

@SuryanarayanaY It is not specific to the system, except that it seemingly needs a certain level of available (or unavailable) parallelism to run into the behavior I described in #56717 (comment).

I can see you mentioned that the test passes on another system.

Yes, this test seems to be flaky, and the potential parallelism (i.e. number of CPU cores) might affect the likelihood of this occurring.

@google-ml-butler bot removed the stat:awaiting response label on Apr 10, 2023