fault_tolerance_test fails on some systems #56717

Open · Flamefire opened this issue Jul 8, 2022 · 8 comments
Labels: stat:awaiting tensorflower (Status - Awaiting response from tensorflower), subtype:bazel (Bazel related Build_Installation issues), subtype: ubuntu/linux (Ubuntu/Linux Build/Installation Issues), TF 2.7 (Issues related to TF 2.7.0), type:build/install (Build and install issues)

Comments

Flamefire (Contributor) commented Jul 8, 2022

Issue Type: Bug
Source: source
Tensorflow Version: 2.7.1
Custom Code: No
OS Platform and Distribution: Linux RHEL 7
Mobile device: No response
Python version: 3.9.6
Bazel version: 3.7.2
GCC/Compiler version: 11.2
CUDA/cuDNN version: 11.4.1 / cuDNN 8.2.2.26
GPU model and memory: 8 * Tesla A100

Current Behaviour?

When running the test suite during/after the build, the test `//tensorflow/python/data/experimental/kernel_tests/service:fault_tolerance_test` fails on this system, while it seemingly passes on another system.
It seems to be very flaky or dependent on the number of CPUs/GPUs. This system has 96 cores and 8 GPUs.
The output is something like
> //tensorflow/python/data/experimental/kernel_tests/service:fault_tolerance_test (2/20 cached) FAILED in 18 out of 20 in 13.5s

Standalone code to reproduce the issue

Run the following build command: `bazel test --config=noaws --config=nogcp --config=nohdfs --compilation_mode=opt --config=opt --subcommands --verbose_failures --jobs=1 --distinct_host_configuration=false --test_output=errors --build_tests_only --local_test_jobs=1 -- //tensorflow/python/data/experimental/kernel_tests/service:fault_tolerance_test`

Relevant log output

======================================================================
FAIL: testAddWorkerMidJob_test_mode_eager_tfapiversion_2 (__main__.FaultToleranceTest)
FaultToleranceTest.testAddWorkerMidJob_test_mode_eager_tfapiversion_2
testAddWorkerMidJob_test_mode_eager_tfapiversion_2(mode='eager', tf_api_version=2)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/software/TensorFlow/2.7.1-foss-2021b-CUDA-11.4.1/lib/python3.9/site-packages/absl/testing/parameterized.py", line 314, in bound_param_test
    return test_method(self, **testcase_params)
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmpBD23m_-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/framework/test_combinations.py", line 366, in decorated
    execute_test_method()
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmpBD23m_-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/framework/test_combinations.py", line 349, in execute_test_method
    test_method(**kwargs_to_pass)
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmpBD23m_-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.py", line 220, in testAddWorkerMidJob
    self.assertCountEqual(2 * list(range(num_elements)), results)
AssertionError: Element counts were not equal:
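
For context, assertCountEqual from unittest compares the two sequences as unordered multisets, so the failure above means the collected results did not contain each expected element the expected number of times. A minimal standalone sketch (with made-up data, not the test's actual values):

import unittest

class CountEqualDemo(unittest.TestCase):
    def test_counts(self):
        num_elements = 3
        results = [2, 1, 0, 2, 0, 1]  # same elements as 2 * [0, 1, 2], in any order
        self.assertCountEqual(2 * list(range(num_elements)), results)  # passes
        # With e.g. results = [0, 1, 2] the call fails with
        # "Element counts were not equal", as in the log above.

if __name__ == "__main__":
    unittest.main()
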
Flamefire (Contributor, Author) commented

Build output and test log file of a test run with --test_env=CUDA_VISIBLE_DEVICES='1,2,3,4'

See also c08fda5, which disables this test on macOS.

@mohantym added the TF 2.7, subtype:bazel, subtype: ubuntu/linux, and type:build/install labels and removed the type:bug label on Jul 10, 2022
mohantym (Contributor) commented Jul 10, 2022

Hi @Flamefire!
Could you re-test with the command below and let us know? (I replaced --config=opt with -c opt, as it was complaining about a missing opt config in the .rc file with the original command.)

bazel test --config=noaws --config=nogcp --config=nohdfs --compilation_mode=opt -c opt --subcommands --verbose_failures --jobs=1 --distinct_host_configuration=false --test_output=errors --build_tests_only --local_test_jobs=1 -- //tensorflow/python/data/experimental/kernel_tests/service:fault_tolerance_test

Attached a gist for 2.9 for reference.
Thank you!

@mohantym added the stat:awaiting response label on Jul 10, 2022
Flamefire (Contributor, Author) commented

So the change is: --config=opt replaced by -c opt, correct?

Same issue. I also checked the Bazel output, started a new bash session, set up the environment for this test as per the Bazel output, changed the CWD to .../execroot/org_tensorflow/bazel-out/k8-opt/bin, and ran tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test directly. Same error.

Ran 41 tests in 19.142s

FAILED (failures=1, errors=6, skipped=22)

From what I can tell, the test is faulty. See, for example, this error log:

ERROR: testRestartWorker_test_mode_eager_tfapiversion_2_usesameport_True_faulttolerantmode_True_workdir_tmpworkdirplaceholder (__main__.FaultToleranceTest)
FaultToleranceTest.testRestartWorker_test_mode_eager_tfapiversion_2_usesameport_True_faulttolerantmode_True_workdir_tmpworkdirplaceholder
testRestartWorker_test_mode_eager_tfapiversion_2_usesameport_True_faulttolerantmode_True_workdir_tmpworkdirplaceholder(mode='eager', tf_api_version=2, use_same_port=True, fault_tolerant_mode=True, work_dir='tmp_work_dir_placeholder')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/data/ops/iterator_ops.py", line 800, in __next__
    return self._next_internal()
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/data/ops/iterator_ops.py", line 783, in _next_internal
    ret = gen_dataset_ops.iterator_get_next(
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/ops/gen_dataset_ops.py", line 2845, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/framework/ops.py", line 7107, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/software/TensorFlow/2.7.1-foss-2021b-CUDA-11.4.1/lib/python3.9/site-packages/absl/testing/parameterized.py", line 314, in bound_param_test
    return test_method(self, **testcase_params)
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/framework/test_combinations.py", line 366, in decorated
    execute_test_method()
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/framework/test_combinations.py", line 349, in execute_test_method
    test_method(**kwargs_to_pass)
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.py", line 245, in testRestartWorker
    val = next(iterator).numpy()
  File "/dev/shm/s3248973-EasyBuild/TensorFlow/2.7.1/foss-2021b-CUDA-11.4.1/tmp5xgzay-bazel-tf/69a307561ec5a7cdace7b5c5a8971ea2/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.runfiles/org_tensorflow/tensorflow/python/data/ops/iterator_ops.py", line 802, in __next__
    raise StopIteration
StopIteration

So we have a raise StopIteration in a __next__ function, which is a valid (Python) way to signal end-of-sequence. Going up the stack to the val = next(iterator).numpy() line, we have this code https://github.com/tensorflow/tensorflow/blob/v2.7.1/tensorflow/python/data/experimental/kernel_tests/service/fault_tolerance_test.py#L242-L247 (or the equivalent in master):

# There may have been some elements prefetched from the first worker
# before it was stopped.
while True:
  val = next(iterator).numpy()
  if val == 0:
    break

So there the comment clearly states "There may have been some elements prefetched from the first worker before it was stopped", yet it unconditionally(!) calls next(iterator) in a while True loop.

Not only would that code be better written as while next(iterator).numpy() != 0, but the case where the iterator is exhausted before it returns 0 isn't handled at all! So better would be:

# For loops handle StopIteration and call `next` already
for val in iterator:
    if val.numpy() == 0:
        break
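
To make the failure mode concrete, here is a minimal standalone sketch (plain Python iterators stand in for the tf.data iterator, so the data is made up): if none of the remaining elements is 0, the while True form turns the exhaustion into a StopIteration that propagates out of the test, while the for form simply ends the loop.

def drain_with_while(iterator):
    # pattern currently used in the test: assumes a 0 will always show up
    while True:
        val = next(iterator)  # raises StopIteration once the iterator is exhausted
        if val == 0:
            break

def drain_with_for(iterator):
    # suggested pattern: exhaustion simply ends the loop
    for val in iterator:
        if val == 0:
            break

prefetched = [3, 2, 1]  # hypothetical leftover elements, none of them 0

try:
    drain_with_while(iter(prefetched))
except StopIteration:
    print("while True form raised StopIteration")

drain_with_for(iter(prefetched))
print("for form finished without raising")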

To further investigate this I added print statements to the loops to print out the values from next(iterator).

The first loop prints out tensors with 0-49, the second 50-99. So the comment "The dataset starts over now that we read from the new worker." doesn't seem to hold: the dataset continues to the end.
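
For illustration, a rough standalone sketch of that kind of instrumentation (a plain range iterator stands in for the tf.data service iterator, and the split at 50 only mirrors the observed output; the real loop conditions are in the test file linked above):

iterator = iter(range(100))      # stand-in for the tf.data service iterator

# first read loop (in the real test this runs around the worker restart)
for _ in range(50):
    val = next(iterator)
    print("first loop:", val)    # the real test printed 0-49 here

# second read loop (expected to start over at 0 after the restart)
for _ in range(50):
    val = next(iterator)
    print("second loop:", val)   # the real test printed 50-99, i.e. it continued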

I'm not sure where the mistake is or what the expected behavior should be. To me it looks like the fault tolerance works very well: the dataset is not disturbed by the worker restart (if that restart really happens).

I hope that helps.

@google-ml-butler bot removed the stat:awaiting response label on Jul 14, 2022
mohantym (Contributor) commented Jul 15, 2022

Hi @gadagashwini!
Could you look at this issue? Attached 2.7, 2.8, 2.9, and nightly gists for reference (--config=opt is not working; -c opt works instead).
Thank you!

@mohantym assigned gadagashwini and unassigned mohantym on Jul 15, 2022
@gadagashwini added the stat:awaiting tensorflower label on Jul 18, 2022
@mohantym changed the title from "fault_tolerance_test fails one some systems" to "fault_tolerance_test fails on some systems" on Nov 22, 2022
@SuryanarayanaY self-assigned this on Mar 22, 2023
SuryanarayanaY (Collaborator) commented

Hi @Flamefire,

The PR proposed by you has been merged. Is it good to close the issue now? Please check and confirm. Thanks!

@SuryanarayanaY added the stat:awaiting response label on Mar 22, 2023
Flamefire (Contributor, Author) commented

@SuryanarayanaY I haven't proposed a PR to TensorFlow to fix this issue, so this is (likely) still an issue that needs investigation and a fix. See my comment above:

I'm not sure where the mistake is or what the expected behavior should be. To me it looks like the fault tolerance works very well: the dataset is not disturbed by the worker restart (if that restart really happens).

If you were referring to easybuilders/easybuild-easyconfigs#15882: that is a PR against a project using TF where, in order to run the tests, I disabled the test that is seemingly broken as reported in this issue. So that is only a workaround, and not even in TF, so the issue in TF is unaffected by that PR.

@google-ml-butler bot removed the stat:awaiting response label on Mar 22, 2023
SuryanarayanaY (Collaborator) commented

@Flamefire,

Could you please confirm whether this issue is specific to RHEL 7 or to your particular system with the mentioned CPUs and GPUs? I can see you mentioned that the test passes on another system.

Please also confirm whether this is still an issue with the latest version.

Also, TF currently provides official build instructions for Ubuntu Linux, as per this source.

@SuryanarayanaY added the stat:awaiting response label on Apr 10, 2023
Flamefire (Contributor, Author) commented

@SuryanarayanaY It is not specific to the system, except that it seemingly needs a certain level of available (or unavailable) parallelism to run into the behavior I described in #56717 (comment).

I can see you mentioned that the test passes on another system.

Yes, this test seems to be flaky, and the potential parallelism (i.e. number of CPU cores) might affect the likelihood of this occurring.

@google-ml-butler bot removed the stat:awaiting response label on Apr 10, 2023