[Data] Retry open files with exponential backoff #38773
Conversation
    OPEN_FILE_RETRY_MAX_BACKOFF_SECONDS = 32

    # The max number of retry attempts for opening file.
    OPEN_FILE_RETRY_MAX_ATTEMPTS = 10
Can we make these configurable by the user?
I feel these configurations are advanced, so I hesitate to put them in DataContext now. Users can change these Python variables directly, such as FILE_SIZE_FETCH_PARALLELIZATION_THRESHOLD and PATHS_PER_FILE_SIZE_FETCH_TASK above. We can add them to DataContext if we see more users ask for this.
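For reference, a minimal sketch of overriding these module-level variables directly, as suggested above (the module path matches the test further down in this conversation; the values here are purely illustrative):

    # Override the retry knobs at module level before reading/writing.
    # These are module-level constants in this PR, not DataContext fields.
    import ray.data.datasource.file_based_datasource as file_based_datasource

    file_based_datasource.OPEN_FILE_RETRY_MAX_BACKOFF_SECONDS = 64  # raise the backoff cap
    file_based_datasource.OPEN_FILE_RETRY_MAX_ATTEMPTS = 5  # allow fewer attempts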
    import random
    import time

    for i in range(OPEN_FILE_RETRY_MAX_ATTEMPTS):
Can we rewrite this to guard against OPEN_FILE_RETRY_MAX_ATTEMPTS set to 0? The name is also slightly confusing (can't tell if it's the total number of attempts or number of retries).
Let me rename it to OPEN_FILE_MAX_ATTEMPTS, and throw an exception if it's set smaller than 1.
Updated.
Signed-off-by: Cheng Su <scnju13@gmail.com>
    if OPEN_FILE_MAX_ATTEMPTS < 1:
        raise ValueError(
            "OPEN_FILE_MAX_ATTEMPTS cannot be negative or 0, but get: "
"OPEN_FILE_MAX_ATTEMPTS cannot be negative or 0, but get: " | |
"OPEN_FILE_MAX_ATTEMPTS cannot be negative or 0. Get: " |
Updated.
            (2 ** (i + 1)) * random.random(),
            OPEN_FILE_RETRY_MAX_BACKOFF_SECONDS,
        )
        logger.get_logger().debug(
Should this be an info? And also print the file name?
> should this be an info?

I think the point of the PR is to not pollute the user console, so we do not print the stack trace and we retry automatically. I thought about making it an info at the beginning, but I think users probably don't care about it. Think about the AWS CLI: it never prints out whether it retried aws s3 cp <...>.
> print the file name?

The file name is already included in the log; am I missing anything?
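Pieced together from the diff fragments in this conversation, here is a runnable sketch of the retry helper under review. The retryable-error check (OPEN_FILE_RETRY_MAX_ERROR_MESSAGES), the plain logging setup, and the exact log message are assumptions, not the PR's verbatim code:

    import logging
    import random
    import time

    logger = logging.getLogger(__name__)

    OPEN_FILE_MAX_ATTEMPTS = 10
    OPEN_FILE_RETRY_MAX_BACKOFF_SECONDS = 32
    # Assumed list of error substrings that are treated as retryable.
    OPEN_FILE_RETRY_MAX_ERROR_MESSAGES = ["AWS Error SLOW_DOWN"]

    def _open_file_with_retry(file_path, open_file):
        if OPEN_FILE_MAX_ATTEMPTS < 1:
            raise ValueError(
                "OPEN_FILE_MAX_ATTEMPTS cannot be negative or 0. "
                f"Get: {OPEN_FILE_MAX_ATTEMPTS}"
            )
        for i in range(OPEN_FILE_MAX_ATTEMPTS):
            try:
                return open_file()
            except Exception as e:
                is_retryable = any(m in str(e) for m in OPEN_FILE_RETRY_MAX_ERROR_MESSAGES)
                if is_retryable and i + 1 < OPEN_FILE_MAX_ATTEMPTS:
                    # Binary exponential backoff with full random jitter,
                    # capped at OPEN_FILE_RETRY_MAX_BACKOFF_SECONDS.
                    backoff = min(
                        (2 ** (i + 1)) * random.random(),
                        OPEN_FILE_RETRY_MAX_BACKOFF_SECONDS,
                    )
                    # Debug-level, so the user console is not polluted;
                    # the file name is included, per the thread above.
                    logger.debug(
                        f"Retrying open of {file_path} after {backoff:.2f}s "
                        f"(attempt {i + 1} failed with: {e})"
                    )
                    time.sleep(backoff)
                else:
                    raise e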
    @@ -162,6 +166,33 @@ def test_write_datasource(ray_start_regular_shared):
        assert ray.get(output.data_sink.get_rows_written.remote()) == 10


    def test_open_file_with_retry(ray_start_regular_shared):
It's kind of weird to put this test in test_formats.py.
Good catch, moved to test_file_based_datasource.py.
    ray.data.datasource.file_based_datasource.OPEN_FILE_MAX_ATTEMPTS = 3
    counter = Counter()
    with pytest.raises(OSError):
        _open_file_with_retry("dummy", lambda: counter.foo(4))
Also test the case where the function can succeed after retry?
@raulchen the assertion on line 184 above already tests it. Any other situation you are thinking about?
    _open_file_with_retry("dummy", lambda: counter.foo(4))
    ray.data.datasource.file_based_datasource.OPEN_FILE_MAX_ATTEMPTS = (
        original_max_attempts
    )
Nit: patch this attribute so it restores the original value even if the test fails. Example:

    def test_resource_constrained_triggers_autoscaling(monkeypatch):
        RESOURCE_REQUEST_TIMEOUT = 5
        monkeypatch.setattr(
            ray.data._internal.execution.autoscaling_requester,
            "RESOURCE_REQUEST_TIMEOUT",
            RESOURCE_REQUEST_TIMEOUT,
        )
Changed to try/finally to restore for now. I feel that's already simple enough.
    @@ -162,6 +166,33 @@ def test_write_datasource(ray_start_regular_shared):
        assert ray.get(output.data_sink.get_rows_written.remote()) == 10


    def test_open_file_with_retry(ray_start_regular_shared):
        class Counter:
Rename it to something like FlakyFileOpener? Also, max_attempts can be set in __init__.
Updated.
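Putting the review feedback together, a sketch of what the updated test might look like: a FlakyFileOpener with max_attempts set in __init__, and try/finally to restore the module constant. The retryable error string and assertion details are assumptions, not the PR's verbatim test:

    import pytest

    import ray.data.datasource.file_based_datasource as file_based_datasource
    from ray.data.datasource.file_based_datasource import _open_file_with_retry

    class FlakyFileOpener:
        def __init__(self, max_attempts: int):
            self.attempts = 0
            self.max_attempts = max_attempts

        def open(self):
            self.attempts += 1
            if self.attempts < self.max_attempts:
                # Assumed to match a retryable error message.
                raise OSError("AWS Error SLOW_DOWN")
            return "dummy"

    def test_open_file_with_retry(ray_start_regular_shared):
        original_max_attempts = file_based_datasource.OPEN_FILE_MAX_ATTEMPTS
        try:
            file_based_datasource.OPEN_FILE_MAX_ATTEMPTS = 3
            # Succeeds on the last allowed attempt.
            opener = FlakyFileOpener(3)
            assert _open_file_with_retry("dummy", opener.open) == "dummy"
            # Needs more attempts than allowed: the error is re-raised.
            opener = FlakyFileOpener(4)
            with pytest.raises(OSError):
                _open_file_with_retry("dummy", opener.open)
        finally:
            file_based_datasource.OPEN_FILE_MAX_ATTEMPTS = original_max_attempts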
Signed-off-by: Cheng Su <scnju13@gmail.com>
    if is_retryable and i + 1 < OPEN_FILE_MAX_ATTEMPTS:
        # Retry with binary exponential backoff with random jitter.
        backoff = min(
            (2 ** (i + 1)) * random.random(),
This is the same as the AWS retry behavior: https://docs.aws.amazon.com/sdkref/latest/guide/feature-retry-behavior.html
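To make the jitter behavior concrete, a small illustration of the sampled backoff per attempt under this formula (the 32s cap starts to bind from the sixth attempt, i = 5, onward):

    import random

    OPEN_FILE_RETRY_MAX_BACKOFF_SECONDS = 32
    for i in range(6):
        # Attempt i draws uniformly from [0, 2 ** (i + 1)), capped at 32s.
        backoff = min((2 ** (i + 1)) * random.random(), OPEN_FILE_RETRY_MAX_BACKOFF_SECONDS)
        print(f"attempt {i}: backoff = {backoff:.2f}s")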
This is a P0 bugfix for our users, who encountered this S3 issue in 2.6. cc @zhe-thoughts for review.
The test failure is not related here.
P0 fix awaited by a user; OK to merge.
* Retry open files with exponential backoff
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
* Retry open files with exponential backoff
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Why are these changes needed?
This PR retries the open-file call (for both read and write) with exponential backoff, internally in the Ray Data task. The motivation is to avoid throwing throttling exceptions at users when reading/writing many files on remote storage, such as S3.
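For context, a hypothetical workload where this matters: reading or writing many files against S3 can trigger throttling (e.g., SlowDown), and with this change the retry happens inside the task instead of surfacing to the user. The bucket paths below are illustrative:

    import ray

    # Each read/write task opens many files; transient S3 throttling errors
    # are now retried with exponential backoff inside the task.
    ds = ray.data.read_parquet("s3://example-bucket/input/")
    ds.write_parquet("s3://example-bucket/output/")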
Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.