Skip to content

Conversation

harshit-anyscale
Copy link
Contributor

@harshit-anyscale harshit-anyscale commented Aug 14, 2025

Summary

This pull request introduces Dead-Letter Queue (DLQ) functionality for async inference. Users can configure two DLQs:

  1. failed_task_queue – for tasks that fail during normal execution.
  2. unprocessable_task_queue – for tasks that cannot be processed (e.g., deserialization failures or missing handlers).

All unprocessable tasks will automatically be routed to the unprocessable_task_queue, while other failures will go to the failed_task_queue. The detailed behavior is defined in the RFC document.

Changes in this PR

  1. Integrated Celery signals (task_failure, task_unknown) to handle task failures.
  2. Added helper functions for moving tasks into the correct DLQ.
  3. Introduced tests to verify DLQ routing logic across different failure scenarios.
  4. Added a persistence test to ensure tasks are retried at-least-once as per the RFC’s NFR requirements.

Follow-up work (to be added in a separate PR)

Additional tests will be added in the next PR to keep this one focused and manageable. These will cover:

  1. Task processor metrics
  2. Task processor health checks
  3. Task cancellation (cancel_task)
  4. Multiple task consumers in a single Serve application
  5. Ensuring failed tasks are retried exactly max_retry + 1 times

Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
@harshit-anyscale harshit-anyscale self-assigned this Aug 14, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces dead-letter queue (DLQ) functionality for the task processor. Failed tasks and unknown tasks are now moved to configurable DLQs for later inspection. The changes include:

  • Using Celery signals (task_failure, task_unknown) to handle task failures.
  • New helper functions for creating queue configs and moving tasks.
  • Configuration options for failed and unprocessable task queues.
  • New tests to verify the DLQ logic for various failure scenarios.

My review focuses on ensuring the robustness of the DLQ mechanism, particularly around data serialization, and on improving the maintainability of the new tests. I've identified a couple of critical issues where failed task data could be lost due to serialization errors, and I've provided suggestions to fix them. I also recommend refactoring the new tests to reduce code duplication.

Signed-off-by: harshit <harshit@anyscale.com>
@harshit-anyscale harshit-anyscale changed the title Add tests and DLQ business logic add tests and DLQ business logic Aug 18, 2025
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
@harshit-anyscale harshit-anyscale marked this pull request as ready for review August 18, 2025 14:16
@harshit-anyscale harshit-anyscale requested a review from a team as a code owner August 18, 2025 14:16
Signed-off-by: harshit <harshit@anyscale.com>
@harshit-anyscale harshit-anyscale added the go add ONLY when ready to merge, run all tests label Aug 18, 2025
@ray-gardener ray-gardener bot added the serve Ray Serve Related Issue label Aug 18, 2025
@harshit-anyscale harshit-anyscale requested a review from zcin August 19, 2025 14:19
Copy link
Contributor

@zcin zcin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly LGTM!

Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
@abrarsheikh
Copy link
Contributor

In my opinion, the tests can be rewritten for simplicity; they are too verbose right now and kind of hard to read.

@harshit-anyscale
Copy link
Contributor Author

In my opinion, the tests can be rewritten for simplicity; they are too verbose right now and kind of hard to read.

refactored them to make it less verbose but retain all the checks & functionalities

Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Copy link
Contributor

@abrarsheikh abrarsheikh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the current implementation of the test is not deterministic and need to be improved. Let's work on that in the follow up PR.

@zcin zcin merged commit 1c637ac into master Sep 5, 2025
5 checks passed
@zcin zcin deleted the add-dlq-and-tests branch September 5, 2025 17:02
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
### Summary
This pull request introduces Dead-Letter Queue (DLQ) functionality for
async inference. Users can configure two DLQs:
1. `failed_task_queue` – for tasks that fail during normal execution.
2. `unprocessable_task_queue` – for tasks that cannot be processed
(e.g., deserialization failures or missing handlers).

All unprocessable tasks will automatically be routed to the
unprocessable_task_queue, while other failures will go to the
failed_task_queue. The detailed behavior is defined in the [RFC
document](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0).

### Changes in this PR
1. Integrated Celery signals (task_failure, task_unknown) to handle task
failures.
2. Added helper functions for moving tasks into the correct DLQ.
3. Introduced tests to verify DLQ routing logic across different failure
scenarios.
4. Added a persistence test to ensure tasks are retried at-least-once as
per the [RFC’s NFR
requirements](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0#heading=h.4om3bw49w03x).

### Follow-up work (to be added in a separate PR)
Additional tests will be added in the next PR to keep this one focused
and manageable. These will cover:

1. Task processor metrics
2. Task processor health checks
3. Task cancellation (cancel_task)
4. Multiple task consumers in a single Serve application
5. Ensuring failed tasks are retried exactly max_retry + 1 times

---------

Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
### Summary
This pull request introduces Dead-Letter Queue (DLQ) functionality for
async inference. Users can configure two DLQs:
1. `failed_task_queue` – for tasks that fail during normal execution.
2. `unprocessable_task_queue` – for tasks that cannot be processed
(e.g., deserialization failures or missing handlers).

All unprocessable tasks will automatically be routed to the
unprocessable_task_queue, while other failures will go to the
failed_task_queue. The detailed behavior is defined in the [RFC
document](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0).

### Changes in this PR
1. Integrated Celery signals (task_failure, task_unknown) to handle task
failures.
2. Added helper functions for moving tasks into the correct DLQ.
3. Introduced tests to verify DLQ routing logic across different failure
scenarios.
4. Added a persistence test to ensure tasks are retried at-least-once as
per the [RFC’s NFR
requirements](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0#heading=h.4om3bw49w03x).

### Follow-up work (to be added in a separate PR)
Additional tests will be added in the next PR to keep this one focused
and manageable. These will cover:

1. Task processor metrics
2. Task processor health checks
3. Task cancellation (cancel_task)
4. Multiple task consumers in a single Serve application
5. Ensuring failed tasks are retried exactly max_retry + 1 times

---------

Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
### Summary
This pull request introduces Dead-Letter Queue (DLQ) functionality for
async inference. Users can configure two DLQs:
1. `failed_task_queue` – for tasks that fail during normal execution.
2. `unprocessable_task_queue` – for tasks that cannot be processed
(e.g., deserialization failures or missing handlers).

All unprocessable tasks will automatically be routed to the
unprocessable_task_queue, while other failures will go to the
failed_task_queue. The detailed behavior is defined in the [RFC
document](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0).

### Changes in this PR
1. Integrated Celery signals (task_failure, task_unknown) to handle task
failures.
2. Added helper functions for moving tasks into the correct DLQ.
3. Introduced tests to verify DLQ routing logic across different failure
scenarios.
4. Added a persistence test to ensure tasks are retried at-least-once as
per the [RFC’s NFR
requirements](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0#heading=h.4om3bw49w03x).

### Follow-up work (to be added in a separate PR)
Additional tests will be added in the next PR to keep this one focused
and manageable. These will cover:

1. Task processor metrics
2. Task processor health checks
3. Task cancellation (cancel_task)
4. Multiple task consumers in a single Serve application
5. Ensuring failed tasks are retried exactly max_retry + 1 times

---------

Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
### Summary
This pull request introduces Dead-Letter Queue (DLQ) functionality for
async inference. Users can configure two DLQs:
1. `failed_task_queue` – for tasks that fail during normal execution.
2. `unprocessable_task_queue` – for tasks that cannot be processed
(e.g., deserialization failures or missing handlers).

All unprocessable tasks will automatically be routed to the
unprocessable_task_queue, while other failures will go to the
failed_task_queue. The detailed behavior is defined in the [RFC
document](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0).

### Changes in this PR
1. Integrated Celery signals (task_failure, task_unknown) to handle task
failures.
2. Added helper functions for moving tasks into the correct DLQ.
3. Introduced tests to verify DLQ routing logic across different failure
scenarios.
4. Added a persistence test to ensure tasks are retried at-least-once as
per the [RFC’s NFR
requirements](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0#heading=h.4om3bw49w03x).

### Follow-up work (to be added in a separate PR)
Additional tests will be added in the next PR to keep this one focused
and manageable. These will cover:

1. Task processor metrics
2. Task processor health checks
3. Task cancellation (cancel_task)
4. Multiple task consumers in a single Serve application
5. Ensuring failed tasks are retried exactly max_retry + 1 times

---------

Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: zac <zac@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants