
[Data] Add fault tolerance to remote tasks #41084

Merged
merged 5 commits into ray-project:master from fault-tolerance on Nov 14, 2023

Conversation

@bveeramani (Member) commented Nov 13, 2023

Why are these changes needed?

Fault tolerance is table stakes for Ray Data, and this PR adds the feature for batch inference. To ensure that tasks are retried in the case of system failures (like nodes crashing), this PR configures max_retries for all remote tasks. It also adds a chaos release test.

Successful release test run: https://buildkite.com/ray-project/release/builds/1001#018bc754-e4e0-4e19-bceb-6954bb95fc81
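For readers unfamiliar with the knobs involved, here is a minimal sketch of the Ray options that control task-level fault tolerance (illustration only, not the exact diff in this PR; -1 means retry or restart indefinitely after system failures):

import ray

# Sketch only: Ray Data sets defaults like these internally, so users don't
# have to pass them explicitly.
@ray.remote(max_retries=-1)  # re-run the task if its worker or node dies
def map_batch(batch):
    return batch

@ray.remote(max_restarts=-1, max_task_retries=-1)  # actor-task equivalent
class MapWorker:
    def map_batch(self, batch):
        return batch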

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
@@ -1394,6 +1394,7 @@ def get_and_run_node_killer(
     lifetime=None,
     no_start=False,
     max_nodes_to_kill=2,
+    node_kill_delay_s=0,
bveeramani (Member, Author) commented:

The node killer kills nodes at random times:

sleep_interval = random.random() * self.node_kill_interval_s

If a node is killed while the runtime env is being initialized, the program will fail even if Ray Data is fault-tolerant. To avoid this sort of flakiness, I've added a delay.
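Roughly, the intended behavior with the new delay looks like this (a hypothetical sketch built around the line quoted above; the class and helper names are assumptions, not the actual node-killer code):

import asyncio
import random

class NodeKillerSketch:
    def __init__(self, node_kill_delay_s=0, node_kill_interval_s=60, max_nodes_to_kill=2):
        self.node_kill_delay_s = node_kill_delay_s
        self.node_kill_interval_s = node_kill_interval_s
        self.max_nodes_to_kill = max_nodes_to_kill

    async def run(self):
        # Wait before the first kill so the job can finish runtime-env setup,
        # which task retries don't protect against.
        await asyncio.sleep(self.node_kill_delay_s)
        for _ in range(self.max_nodes_to_kill):
            # Then kill nodes at random times, as in the quoted line.
            sleep_interval = random.random() * self.node_kill_interval_s
            await asyncio.sleep(sleep_interval)
            self._kill_random_node()

    def _kill_random_node(self):
        # Placeholder: the real release test terminates a random worker node.
        pass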

@@ -352,7 +352,7 @@ def _apply_default_remote_args(ray_remote_args: Dict[str, Any]) -> Dict[str, Any
         "max_task_retries" not in ray_remote_args
         and ray_remote_args.get("max_restarts") != 0
     ):
-        ray_remote_args["max_task_retries"] = 5
+        ray_remote_args["max_task_retries"] = -1
bveeramani (Member, Author) commented:

Defaults are specified all over the code base. We should really consolidate them at some point. Maybe something like #39797.

I took a stab at it, but it's a bit more involved than I expected, so I'm leaving it to a future PR.
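For reference, after this change the helper reads roughly like the following (a simplified sketch reconstructed from the hunk above, not the exact Ray Data source):

from typing import Any, Dict

def _apply_default_remote_args(ray_remote_args: Dict[str, Any]) -> Dict[str, Any]:
    ray_remote_args = dict(ray_remote_args)
    if (
        "max_task_retries" not in ray_remote_args
        # Respect users who explicitly disabled actor restarts.
        and ray_remote_args.get("max_restarts") != 0
    ):
        # -1 retries actor tasks indefinitely after system failures.
        ray_remote_args["max_task_retries"] = -1
    return ray_remote_args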

@@ -18,12 +18,12 @@ def cached_remote_fn(fn: Any, **ray_remote_args) -> Any:
     """
     if fn not in CACHED_FUNCTIONS:
         default_ray_remote_args = {
-            "retry_exceptions": True,
bveeramani (Member, Author) commented:

This default was set two years ago by #18296, but I think we should get rid of it at this point, for a couple of reasons:

  1. It's obsolete. #38773 ([Data] Retry open files with exponential backoff) added more specific retries.
  2. It can lead to bad UX. For example, if a non-transient error occurs, Ray will repeatedly retry the task, and you won't realize there's an error until your program times out (or it never does!). Ideally, your program would fail immediately with the non-transient error (see the sketch below).
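To illustrate the second point, a hedged example of the failure mode (a hypothetical user task, not Ray Data code):

import ray

@ray.remote(retry_exceptions=True, max_retries=3)
def load(path: str):
    # A non-transient application bug: with retry_exceptions=True, Ray
    # re-runs the task on every failure instead of surfacing the error
    # right away, so the user just sees the job stall and retry.
    raise ValueError(f"unsupported file format: {path}")

# ray.get(load.remote("data.bin")) only raises after all retries are
# exhausted; with unlimited retries it would never raise at all.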

raulchen (Contributor) commented:

When I was running some large-scale workloads, I depended on this flag to retry tasks that failed due to AWS throttling or other temporary network issues.
If we remove it, we should allow retrying per op, e.g., setting it to true only for read/write.
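Something like the following would express that opt-in at the call sites (a sketch, assuming the read/write APIs accept ray_remote_args; bucket paths are hypothetical):

import ray

ds = ray.data.read_parquet(
    "s3://example-bucket/input/",
    ray_remote_args={"retry_exceptions": True},  # retry only this op's tasks
)
ds.write_parquet(
    "s3://example-bucket/output/",
    ray_remote_args={"retry_exceptions": True},
)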

bveeramani (Member, Author) commented:

@raulchen what were the specific errors you ran into? Shouldn't throttling be handled by #38773?

@raulchen (Contributor) commented Nov 13, 2023:

One example is "AWS Error NETWORK_CONNECTION during CreateMultipartUpload operation: curlCode: 28, Timeout was reached".
I guess it's because writes aren't covered by that PR.

bveeramani (Member, Author) commented:

Weird. We should be performing retries for writes:

with _open_file_with_retry(
    write_path,
    lambda: self.filesystem.open_output_stream(
        write_path, **self.open_stream_args
    ),
) as file:
    self.write_row_to_file(row, file)

@raulchen did you run into this issue after #38773 was merged? If so, I can add retry_exceptions for reads and writes.

raulchen (Contributor) commented:

Yes, I got that error with last week's nightly.
On second thought, I don't think it's a good idea to simply add retry_exceptions for reads and writes.
A counterexample: if users pass in a list of Parquet files to read and the files have different schemas, the read tasks will retry indefinitely without printing any info.
Maybe it's better to just add retries for IO-related code. I think #38773 already covers reads; we'll need to do something similar for writes.
It's ok to do that in a follow-up PR.
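A hedged sketch of what an IO-scoped retry for writes could look like (the helper name and the retryable-error check are assumptions, mirroring the open-file retry added in #38773):

import random
import time

def _call_with_retry(f, description, max_attempts=5, base_delay_s=1.0):
    # Retry only the filesystem call, not the whole task, so application
    # errors (e.g., mismatched schemas) still fail fast.
    for attempt in range(max_attempts):
        try:
            return f()
        except OSError as e:  # assumption: treat only IO errors as transient
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter.
            backoff = base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.5)
            print(f"Retrying {description} after error: {e}. Sleeping {backoff:.1f}s.")
            time.sleep(backoff)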

raulchen (Contributor) commented:

@bveeramani created an issue here #41211

@bveeramani bveeramani marked this pull request as ready for review November 13, 2023 07:57
@bveeramani bveeramani merged commit 29aea3d into ray-project:master Nov 14, 2023
17 of 23 checks passed
@bveeramani bveeramani deleted the fault-tolerance branch November 14, 2023 19:56
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023

3 participants