[Data] Support UDF retries in case of transient exceptions#63023

Merged
edoakes merged 49 commits into ray-project:master from ayushk7102:04_25_trans_retries
May 13, 2026

Conversation

@ayushk7102
Contributor

@ayushk7102 ayushk7102 commented Apr 29, 2026

Description

Adds support for retrying UDF exceptions in Ray Data map tasks. Previously, any exception raised inside a map_batches / map UDF would immediately fail the task. This PR allows users to configure which exceptions should trigger a retry, enabling more resilient pipelines for transient errors (e.g. rate limits, flaky external services).

Two new DataContext fields control the behavior of UDF retries:

  • retried_map_errors: False (default, no retries), True (retry any exception), or a List[str] (retry only when the exception message contains one of the input substrings).
  • max_map_retries: Maximum retry attempts per task. Default is 3.
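The three settings of retried_map_errors can be read as a small decision rule. The following is a minimal sketch of that rule only; `should_retry` and `RetriedErrors` are hypothetical names used for illustration and are not part of Ray Data.

```python
from typing import List, Union

# Hypothetical alias for the documented value types of retried_map_errors.
RetriedErrors = Union[bool, List[str]]


def should_retry(error_message: str, retried_errors: RetriedErrors) -> bool:
    """Hypothetical helper mirroring the three documented settings."""
    if retried_errors is False:
        return False  # default: never retry
    if retried_errors is True:
        return True  # retry on any exception
    # A list of substrings: retry only if at least one occurs in the message.
    return any(pattern in error_message for pattern in retried_errors)
```

Under this reading, an empty list behaves like False, because no substring can match.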

Retries use the existing iterate_with_retry utility function with exponential backoff. Ray Data wraps every exception raised by user code in a UserCodeException, so we unwrap UserCodeException.__cause__ and match against the original UDF error message rather than the wrapper's.
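To illustrate why the unwrapping matters, here is a minimal sketch using plain Python exception chaining. `WrapperError` stands in for a wrapper such as UserCodeException, and `describe_with_cause` is a hypothetical helper; neither is Ray code, and the exact text Ray matches against is not shown in this description.

```python
class WrapperError(Exception):
    """Stand-in for a wrapper exception (hypothetical, not Ray's class)."""


def describe_with_cause(exc: BaseException) -> str:
    """Return the exception's text plus the text of its direct __cause__, if any."""
    parts = [f"{type(exc).__name__}: {exc}"]
    if exc.__cause__ is not None:
        parts.append(f"{type(exc.__cause__).__name__}: {exc.__cause__}")
    return " | ".join(parts)


def run_udf() -> None:
    try:
        raise ConnectionError("Error code: 429 - rate limited")
    except ConnectionError as original:
        # `raise ... from ...` stores the original error in __cause__.
        raise WrapperError("UDF failed") from original


try:
    run_udf()
except WrapperError as wrapped:
    wrapper_only = str(wrapped)           # text of the wrapper alone
    text = describe_with_cause(wrapped)   # wrapper text plus cause text
```

Here a pattern such as "429" is absent from `wrapper_only` but present in `text`, which is the situation the unwrapping is meant to handle.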

Example usage:

To retry a transform when it hits a rate limit error with the following message:

openai.RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current 
quota, please check your plan and billing details.', 'type': 'insufficient_quota', 
'param': None, 'code': 'insufficient_quota'}}

We can set the following fields on DataContext:

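  import ray

  ctx = ray.data.DataContext.get_current()
  # Retry when the error message contains either substring.
  ctx.retried_map_errors = ["RateLimit", "429"]
  ctx.max_map_retries = 5

  # `ds` is an existing Dataset and `my_udf` is the user's batch function.
  ds.map_batches(my_udf).take_all()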
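As a quick sanity check of the two substrings chosen above, the sketch below tests them against the sample error. It is plain Python with no Ray calls. The message is abbreviated, `matching_patterns` is a hypothetical helper, and whether the matched text includes the exception class name is an assumption that this description does not settle, so both cases are shown.

```python
from typing import List

patterns = ["RateLimit", "429"]

# The sample error, split into its class name and an abbreviated message.
class_name = "openai.RateLimitError"
message = "Error code: 429 - {'error': {'message': 'You exceeded your current quota'}}"


def matching_patterns(text: str, candidates: List[str]) -> List[str]:
    """Return the candidate substrings that occur in `text` (hypothetical helper)."""
    return [p for p in candidates if p in text]


# Case 1: only the message text is matched.
message_only = matching_patterns(message, patterns)

# Case 2 (assumption): the class name is part of the matched text.
with_class_name = matching_patterns(f"{class_name}: {message}", patterns)
```

In the first case only "429" matches, while in the second both substrings match, so listing both keeps the configuration robust to either behavior.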

Additional information

Implementation

  • In _map_task, read retried_map_errors from the context. If it is set, wrap the transform pipeline in a factory function and pass that factory to the existing iterate_with_retry utility instead of iterating directly.
  • iterate_with_retry catches exceptions, checks whether the message matches any of the input patterns, and retries with backoff for up to max_map_retries attempts. We extend iterate_with_retry to also check e.__cause__, which unwraps the UserCodeException into the actual exception raised by the UDF.
  • Add 4 unit tests in test_map.py covering retries exhausted, successful retry, retry-all, and non-matching exceptions.
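The factory-plus-retry pattern described above can be sketched as follows. This is a simplified illustration, not Ray's iterate_with_retry: the function name, its parameters, the replay-and-skip strategy, and the reading of `max_attempts` as total attempts are all assumptions. A factory is used because a partially consumed generator cannot be restarted, so each attempt needs a fresh iterable.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iterate_with_retry_sketch(
    make_iterable: Callable[[], Iterable[T]],
    patterns: List[str],
    max_attempts: int,
    base_delay_s: float = 0.0,
) -> Iterator[T]:
    """Hypothetical retry loop over a restartable iterable."""
    yielded = 0  # number of items already handed to the consumer
    for attempt in range(max_attempts):
        try:
            for index, item in enumerate(make_iterable()):
                if index < yielded:
                    continue  # skip items produced before the failure
                yielded += 1
                yield item
            return
        except Exception as exc:
            # Match against the error text and the text of its direct cause.
            text = f"{exc} {exc.__cause__ or ''}"
            is_last = attempt == max_attempts - 1
            if is_last or not any(p in text for p in patterns):
                raise
            time.sleep(base_delay_s * (2 ** attempt))  # exponential backoff


calls = {"n": 0}


def flaky_batches() -> Iterator[str]:
    """Fails after the first batch on the first two attempts."""
    calls["n"] += 1
    yield "batch-0"
    if calls["n"] < 3:
        raise RuntimeError("Error code: 429")
    yield "batch-1"


result = list(iterate_with_retry_sketch(flaky_batches, ["429"], max_attempts=5))
```

Skipping already-yielded items on replay only works when the factory produces the same items in the same order on every attempt, which is an assumption of this sketch.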

@ayushk7102 ayushk7102 requested a review from a team as a code owner April 29, 2026 18:33
@ayushk7102 ayushk7102 changed the title 04 25 trans retries [Data] Support UDF retries in case of transient exceptions Apr 29, 2026
@ayushk7102 ayushk7102 added the go add ONLY when ready to merge, run all tests label Apr 29, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a retry mechanism for User Defined Functions (UDFs) within Ray Data map tasks, allowing for transient error recovery based on configurable exception patterns. Key changes include the addition of retried_udf_errors and max_udf_retries to the DataContext, updates to the iterate_with_retry utility to include exception causes in pattern matching, and the integration of this retry logic into the MapOperator. Feedback focuses on refactoring duplicated iterator creation logic in map_operator.py to improve maintainability and simplifying the exception string construction in util.py.

Comment thread python/ray/data/_internal/execution/operators/map_operator.py Outdated
Comment thread python/ray/data/_internal/util.py Outdated
Comment thread python/ray/data/_internal/util.py Outdated
@ray-gardener ray-gardener Bot added the data Ray Data-related issues label Apr 29, 2026
Contributor

@richardliaw richardliaw left a comment

A couple comments:

  1. I'm not that big of a fan of putting udf in the naming
  2. Is there really a use case where you want to actually separate the read task exception classes from the map task exception classes?

ayushk7102 and others added 2 commits April 30, 2026 13:57
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
Comment thread python/ray/data/_internal/execution/operators/map_operator.py
@ayushk7102
Contributor Author

ayushk7102 commented May 4, 2026

  1. Is there really a use case where you want to actually separate the read task exception classes from the map task exception classes?

I think it makes sense to deduplicate behaviour by unifying the transient failures in map tasks and read tasks. The only counter-example I can think of is a user who explicitly wants RateLimit errors retried in the IO stage but not in the map_batches call. In that case we would waste effort retrying work when the correct behaviour would be to fail.

@ayushk7102
Contributor Author

  1. Is there really a use case where you want to actually separate the read task exception classes from the map task exception classes?

That being said, if we decide to unify io errors and map tasks, I think that would be better as a follow-up PR to limit scope. retried_io_errors is implemented by many datasource/sinks and this PR would then expand to include a bunch of renaming

ayushk7102 added 3 commits May 3, 2026 23:51
Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
@ayushk7102 ayushk7102 requested a review from richardliaw May 4, 2026 16:03
Comment thread python/ray/data/context.py
Comment thread python/ray/data/_internal/util.py Outdated
Comment thread python/ray/data/_internal/util.py Outdated
…to retry.py

Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
@ayushk7102 ayushk7102 self-assigned this May 12, 2026

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 4c44074. Configure here.

Comment thread python/ray/_common/retry.py Outdated
Contributor

@Kunchd Kunchd left a comment

Looks good overall, left some nits for core side.

Comment thread python/ray/_common/retry.py Outdated
Comment thread python/ray/_common/retry.py Outdated
Comment thread python/ray/_common/retry.py Outdated
Comment thread python/ray/_common/retry.py Outdated
Comment thread python/ray/_common/tests/test_retry.py
Contributor

@MengjinYan MengjinYan left a comment

Only nit comments for the core side change.

Comment thread python/ray/_common/retry.py Outdated
Comment thread python/ray/_common/tests/test_retry.py
@ayushk7102 ayushk7102 requested review from Kunchd and MengjinYan May 12, 2026 22:39
…ches_error and format_exception, updated docstrings

Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
Contributor

@Kunchd Kunchd left a comment

Thanks, lgtm!

@edoakes edoakes merged commit 0ae4317 into ray-project:master May 13, 2026
6 checks passed
dancingactor pushed a commit to dancingactor/ray that referenced this pull request May 13, 2026
am-kinetica pushed a commit to kineticadb/ray that referenced this pull request May 14, 2026
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026