
[data] allow task failures during execution #41226

Merged
merged 17 commits into from
Nov 28, 2023

Conversation

raulchen
Contributor

@raulchen raulchen commented Nov 17, 2023

Why are these changes needed?

Add an option to allow task failures during dataset execution. Data from the failed tasks will be dropped. This can be useful to avoid the entire execution failing because of a small number of unexpected exceptions (e.g., due to corrupted data) that may happen after the execution has been running for a long time.
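The intended semantics can be sketched in plain Python. This is an illustrative simulation based on the description above, not the actual Ray Data executor; the function name and structure are assumptions:

```python
# Illustrative sketch only: simulates the proposed semantics with plain
# callables, not real Ray tasks.

def run_with_allowed_failures(tasks, max_errored=0):
    """Run tasks, dropping the outputs of failed ones.

    max_errored: abort once more than this many tasks have failed;
    a negative value means unlimited failures are tolerated.
    """
    results, num_errored = [], 0
    for task in tasks:
        try:
            results.append(task())
        except Exception:
            num_errored += 1
            # Data from the failed task is simply dropped.
            if max_errored >= 0 and num_errored > max_errored:
                raise RuntimeError(
                    f"Aborting: {num_errored} task failures exceed "
                    f"the allowed maximum of {max_errored}."
                )
    return results
```

With the default of 0, any failure aborts the run; with a negative threshold, all failures are silently dropped.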

Related issue number

close #41213

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@stephanie-wang stephanie-wang left a comment


Can we add the failures to the stats?

@@ -227,6 +227,11 @@ def __init__(
# The additional ray remote args that should be added to
# the task-pool-based data tasks.
self._task_pool_data_task_remote_args: Dict[str, Any] = {}
# Max number of task failures that are allowed before aborting
Contributor

Can we call this max_allowed_block_failures or something similar, since it counts blocks rather than tasks?

Also, prefer the term "errors" over "failures", since "failures" usually refers to system-level exceptions. Actually, does this option also apply to tasks that failed due to system-level exceptions? The comment should make clear what counts.

Member

+1

Contributor Author

sounds good

# Max number of task failures that are allowed before aborting
# the Dataset execution. Data of the failed tasks will be dropped.
# Unlimited if negative.
# By default, not failures are allowed.
Contributor

Suggested change
# By default, not failures are allowed.
# By default, no failures are allowed.

max_blocks_to_read_per_op[state] -= num_blocks_read
except Exception as e:
error_message = (
f'An exception occurred in a task of operator "{state.op}".'
Contributor

It would be good to print the exception traceback here.

Contributor Author

Yeah, it's printed below

@stephanie-wang stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 17, 2023
# the Dataset execution. Data of the failed tasks will be dropped.
# Unlimited if negative.
# By default, not failures are allowed.
self.max_allowed_task_failures = 0
Member

For consistency with the DataContext options, should we expose this in the constructor and define a constant to represent the default?

Contributor Author

I thought of that. But the parameter list is already too long, and in practice we never create DataContext objects with parameters, so I prefer not to add new ones.
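The pattern described here, configuring the option as an attribute on the current context rather than through a constructor parameter, might look like this. Class and attribute names are illustrative, not the real DataContext implementation:

```python
# Hypothetical sketch: a process-wide context object whose options are
# set as attributes after creation, rather than through an ever-growing
# constructor parameter list.

class DataContext:
    _current = None

    def __init__(self):
        # Default: no task failures allowed.
        self.max_errored_blocks = 0

    @classmethod
    def get_current(cls):
        # Lazily create and reuse a single process-wide instance.
        if cls._current is None:
            cls._current = cls()
        return cls._current


ctx = DataContext.get_current()
ctx.max_errored_blocks = 10  # opt in to tolerating up to 10 failures
```

Because every caller goes through get_current(), a setting changed once is visible everywhere without threading it through constructors.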

@@ -227,6 +227,11 @@ def __init__(
# The additional ray remote args that should be added to
# the task-pool-based data tasks.
self._task_pool_data_task_remote_args: Dict[str, Any] = {}
# Max number of task failures that are allowed before aborting
Member

+1

Comment on lines 717 to 722
_run(5, 0, 0, False)
_run(5, 0, 1, True)
_run(5, 2, 1, False)
_run(5, 2, 2, False)
_run(5, 2, 3, True)
_run(5, -1, 5, False)
Member

Nit: Might be cleaner to use pytest.mark.parametrize here

Signed-off-by: Hao Chen <chenh1024@gmail.com>
@@ -76,6 +76,8 @@ class OpRuntimeMetrics:
num_tasks_have_outputs: int = field(default=0, metadata={"map_only": True})
# Number of finished tasks.
num_tasks_finished: int = field(default=0, metadata={"map_only": True})
# Number of failed tasks.
num_tasks_failed: int = field(default=0, metadata={"map_only": True})
Contributor Author

Keep using the term "tasks" here for consistency with other metrics.

Contributor

Can we add this to the Dataset.stats() output?

assert (
"During handling of the above exception, another exception occurred"
not in out_str
), out_str
Contributor Author

This PR also makes the error stack trace more concise. cc @stephanie-wang @c21

Previously, it printed:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 347, in ray._raylet.StreamingObjectRefGenerator._next_sync
  File "python/ray/_raylet.pyx", line 4653, in ray._raylet.CoreWorker.try_read_next_object_ref_stream
  File "python/ray/_raylet.pyx", line 447, in ray._raylet.check_status
ray.exceptions.ObjectRefStreamEndOfStreamError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/interfaces/physical_operator.py", line 80, in on_data_ready
    meta = ray.get(next(self._streaming_gen))
  File "python/ray/_raylet.pyx", line 302, in ray._raylet.StreamingObjectRefGenerator.__next__
  File "python/ray/_raylet.pyx", line 365, in ray._raylet.StreamingObjectRefGenerator._next_sync
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 8, in <module>
  File "/Users/chenh/code/ray/python/ray/data/dataset.py", line 2459, in take_all
    for row in self.iter_rows():
  File "/Users/chenh/code/ray/python/ray/data/iterator.py", line 225, in _wrapped_iterator
    for batch in batch_iterable:
  File "/Users/chenh/code/ray/python/ray/data/iterator.py", line 183, in _create_iterator
    for batch in iterator:
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/iter_batches.py", line 176, in iter_batches
    next_batch = next(async_batch_iter)
  File "/Users/chenh/code/ray/python/ray/data/_internal/util.py", line 851, in make_async_gen
    raise next_item
  File "/Users/chenh/code/ray/python/ray/data/_internal/util.py", line 828, in execute_computation
    for item in fn(thread_safe_generator):
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/iter_batches.py", line 167, in _async_iter_batches
    yield from extract_data_from_batch(batch_iter)
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/util.py", line 210, in extract_data_from_batch
    for batch in batch_iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/iter_batches.py", line 306, in restore_original_order
    for batch in batch_iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/iter_batches.py", line 218, in threadpool_computations_format_collate
    yield from formatted_batch_iter
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/util.py", line 158, in format_batches
    for batch in block_iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/util.py", line 117, in blocks_to_batches
    for block in block_iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/util.py", line 54, in resolve_block_refs
    for block_ref in block_ref_iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/iter_batches.py", line 254, in prefetch_batches_locally
    for block_ref, metadata in block_ref_iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/util.py", line 808, in __next__
    return next(self.it)
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/legacy_compat.py", line 54, in execute_to_legacy_block_iterator
    for bundle in bundle_iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/interfaces/executor.py", line 37, in __next__
    return self.get_next()
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/streaming_executor.py", line 154, in get_next
    raise item
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/streaming_executor.py", line 223, in run
    while self._scheduling_loop_step(self._topology) and not self._shutdown:
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/streaming_executor.py", line 271, in _scheduling_loop_step
    num_errored_blocks = process_completed_tasks(
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/streaming_executor_state.py", line 416, in process_completed_tasks
    raise e
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/streaming_executor_state.py", line 383, in process_completed_tasks
    num_blocks_read = task.on_data_ready(
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/interfaces/physical_operator.py", line 93, in on_data_ready
    raise ex
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/interfaces/physical_operator.py", line 89, in on_data_ready
    ray.get(block_ref)
  File "/Users/chenh/code/ray/python/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/Users/chenh/code/ray/python/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/Users/chenh/code/ray/python/ray/_private/worker.py", line 2595, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::Map(map)() (pid=94579, ip=127.0.0.1)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/operators/map_transformer.py", line 371, in __call__
    for data in iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/operators/map_transformer.py", line 196, in __call__
    yield from self._row_fn(input, ctx)
  File "/Users/chenh/code/ray/python/ray/data/_internal/planner/plan_udf_map_op.py", line 234, in transform_fn
    out_row = fn(row)
  File "/Users/chenh/code/ray/python/ray/data/_internal/planner/plan_udf_map_op.py", line 120, in fn
    return op_fn(item, *fn_args, **fn_kwargs)
  File "<string>", line 6, in map
ValueError: foo

Now it prints:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 8, in <module>
  File "/Users/chenh/code/ray/python/ray/data/dataset.py", line 2459, in take_all
    for row in self.iter_rows():
  File "/Users/chenh/code/ray/python/ray/data/iterator.py", line 225, in _wrapped_iterator
    for batch in batch_iterable:
  File "/Users/chenh/code/ray/python/ray/data/iterator.py", line 183, in _create_iterator
    for batch in iterator:
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/iter_batches.py", line 176, in iter_batches
    next_batch = next(async_batch_iter)
  File "/Users/chenh/code/ray/python/ray/data/_internal/util.py", line 851, in make_async_gen
    raise next_item
  File "/Users/chenh/code/ray/python/ray/data/_internal/util.py", line 828, in execute_computation
    for item in fn(thread_safe_generator):
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/iter_batches.py", line 167, in _async_iter_batches
    yield from extract_data_from_batch(batch_iter)
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/util.py", line 210, in extract_data_from_batch
    for batch in batch_iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/iter_batches.py", line 306, in restore_original_order
    for batch in batch_iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/iter_batches.py", line 218, in threadpool_computations_format_collate
    yield from formatted_batch_iter
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/util.py", line 158, in format_batches
    for batch in block_iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/util.py", line 117, in blocks_to_batches
    for block in block_iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/util.py", line 54, in resolve_block_refs
    for block_ref in block_ref_iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/block_batching/iter_batches.py", line 254, in prefetch_batches_locally
    for block_ref, metadata in block_ref_iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/util.py", line 808, in __next__
    return next(self.it)
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/legacy_compat.py", line 54, in execute_to_legacy_block_iterator
    for bundle in bundle_iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/interfaces/executor.py", line 37, in __next__
    return self.get_next()
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/streaming_executor.py", line 154, in get_next
    raise item
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/streaming_executor.py", line 223, in run
    while self._scheduling_loop_step(self._topology) and not self._shutdown:
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/streaming_executor.py", line 271, in _scheduling_loop_step
    num_errored_blocks = process_completed_tasks(
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/streaming_executor_state.py", line 416, in process_completed_tasks
    raise e from None
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/streaming_executor_state.py", line 383, in process_completed_tasks
    num_blocks_read = task.on_data_ready(
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/interfaces/physical_operator.py", line 93, in on_data_ready
    raise ex from None
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/interfaces/physical_operator.py", line 89, in on_data_ready
    ray.get(block_ref)
  File "/Users/chenh/code/ray/python/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/Users/chenh/code/ray/python/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/Users/chenh/code/ray/python/ray/_private/worker.py", line 2595, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::Map(map)() (pid=95483, ip=127.0.0.1)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/operators/map_transformer.py", line 371, in __call__
    for data in iter:
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/operators/map_transformer.py", line 196, in __call__
    yield from self._row_fn(input, ctx)
  File "/Users/chenh/code/ray/python/ray/data/_internal/planner/plan_udf_map_op.py", line 234, in transform_fn
    out_row = fn(row)
  File "/Users/chenh/code/ray/python/ray/data/_internal/planner/plan_udf_map_op.py", line 120, in fn
    return op_fn(item, *fn_args, **fn_kwargs)
  File "<string>", line 6, in map
ValueError: foo
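The shorter trace comes from re-raising with `from None` (visible in the new trace's `raise e from None` frames), which suppresses Python's implicit exception chaining. A minimal standalone demonstration:

```python
# Minimal demonstration of why `raise e from None` shortens the trace:
# it suppresses implicit exception chaining, so the "During handling of
# the above exception, another exception occurred" sections disappear.
import traceback


def reraise(chained):
    try:
        raise StopIteration  # stands in for the internal stream-end error
    except StopIteration:
        try:
            raise ValueError("foo")  # the user's actual error
        except ValueError as e:
            if chained:
                raise  # implicit chaining: both tracebacks are printed
            else:
                raise e from None  # suppress the chained traceback


def trace_of(chained):
    try:
        reraise(chained)
    except ValueError:
        return traceback.format_exc()
```

Calling trace_of(True) yields a trace containing the "During handling of the above exception" section, while trace_of(False) yields only the ValueError portion.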

@raulchen raulchen removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 21, 2023
@raulchen
Contributor Author

@stephanie-wang @bveeramani All comments are addressed. Thanks!

Signed-off-by: Hao Chen <chenh1024@gmail.com>
@stephanie-wang
Contributor

LGTM, but can you add the failure stats to Dataset.stats()?

@raulchen
Contributor Author

LGTM, but can you add the failure stats to Dataset.stats()?

All the metrics with export_metric=True are already part of stats().
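The export mechanism described here can be sketched as a dataclass whose fields carry an export flag in their metadata; only flagged fields are collected into the stats output. Field and flag names are illustrative, not the exact OpRuntimeMetrics implementation:

```python
# Hypothetical sketch: metrics are dataclass fields, and only fields
# whose metadata flags them for export appear in the stats output.
from dataclasses import dataclass, field, fields


@dataclass
class OpRuntimeMetrics:
    num_tasks_finished: int = field(default=0, metadata={"export_metric": True})
    num_tasks_failed: int = field(default=0, metadata={"export_metric": True})
    internal_counter: int = field(default=0, metadata={})  # not exported

    def exported(self):
        # Collect only the fields flagged for export into a stats dict.
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if f.metadata.get("export_metric")
        }


m = OpRuntimeMetrics(num_tasks_finished=5, num_tasks_failed=1)
```

Under this scheme, adding a new exported metric is just a matter of declaring the field with the right metadata; no stats-reporting code needs to change.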

@raulchen raulchen merged commit ca51322 into ray-project:master Nov 28, 2023
15 of 16 checks passed
@raulchen raulchen deleted the allow-task-failure branch November 28, 2023 00:52
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023
Add an option to allow task failures during dataset execution. Data from the failed tasks will be dropped. This can be useful to avoid the entire execution failing because of a small number of unexpected exceptions (e.g., due to corrupted data) that may happen after the execution has been running for long time.

---------

Signed-off-by: Hao Chen <chenh1024@gmail.com>

Successfully merging this pull request may close these issues.

[data] allow execution to continue when some tasks fail
6 participants