add tests and DLQ business logic #55608

harshit-anyscale · 2025-08-14T11:19:30Z

Summary

This pull request introduces Dead-Letter Queue (DLQ) functionality for async inference. Users can configure two DLQs:

1. `failed_task_queue` – for tasks that fail during normal execution.
2. `unprocessable_task_queue` – for tasks that cannot be processed (e.g., deserialization failures or missing handlers).

All unprocessable tasks will automatically be routed to the unprocessable_task_queue, while other failures will go to the failed_task_queue. The detailed behavior is defined in the RFC document.
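
The routing rule described above can be sketched in a few lines. This is an illustrative sketch only, not the code in `task_processor.py`: the `FailureKind` enum and the `pick_dlq` function are hypothetical names introduced here, and the queue names are simply the two configuration keys named in this description.

```python
from enum import Enum


class FailureKind(Enum):
    """Hypothetical classification of why a task ended up failing."""

    HANDLER_RAISED = "handler_raised"          # the task ran and raised
    DESERIALIZATION_ERROR = "deserialization"  # the payload could not be decoded
    MISSING_HANDLER = "missing_handler"        # no handler is registered for the task


# Failure kinds where the task never reached a handler at all.
UNPROCESSABLE = {FailureKind.DESERIALIZATION_ERROR, FailureKind.MISSING_HANDLER}


def pick_dlq(kind: FailureKind) -> str:
    """Return the name of the dead-letter queue a failed task should go to."""
    if kind in UNPROCESSABLE:
        return "unprocessable_task_queue"
    return "failed_task_queue"
```

Keeping the decision in one small pure function like this makes the routing easy to unit test without a broker.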

Changes in this PR

1. Integrated Celery signals (`task_failure`, `task_unknown`) to handle task failures.
2. Added helper functions for moving tasks into the correct DLQ.
3. Introduced tests to verify DLQ routing logic across different failure scenarios.
4. Added a persistence test to ensure tasks are retried at least once, per the RFC's non-functional requirements (NFRs).
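
For readers unfamiliar with Celery signals, the sketch below shows roughly how handlers for `task_failure` and `task_unknown` could feed two DLQs. It is a sketch under assumptions, not the PR's implementation: the handler names, the `connect_dlq_handlers` function, and the in-memory `dead_letters` dict (standing in for real broker queues) are all hypothetical. The keyword arguments follow the ones Celery documents for these two signals.

```python
# In-memory stand-in for the two broker queues (assumption for illustration).
dead_letters = {"failed_task_queue": [], "unprocessable_task_queue": []}


def on_task_failure(sender=None, task_id=None, exception=None,
                    args=None, kwargs=None, traceback=None, einfo=None, **extra):
    """Record a task that ran and failed."""
    dead_letters["failed_task_queue"].append({
        "task_id": task_id,
        "args": list(args or ()),
        "kwargs": dict(kwargs or {}),
        "error": repr(exception),
    })


def on_task_unknown(sender=None, name=None, id=None, message=None, exc=None, **extra):
    """Record a message whose task name has no registered handler."""
    dead_letters["unprocessable_task_queue"].append({
        "task_id": id,
        "task_name": name,
        "error": repr(exc),
    })


def connect_dlq_handlers():
    """Attach the handlers to Celery's signals (requires Celery to be installed)."""
    from celery.signals import task_failure, task_unknown

    task_failure.connect(on_task_failure)
    task_unknown.connect(on_task_unknown)
```

The handlers are plain functions, so they can be exercised directly in tests without starting a worker.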

Follow-up work (to be added in a separate PR)

Additional tests will be added in the next PR to keep this one focused and manageable. These will cover:

1. Task processor metrics
2. Task processor health checks
3. Task cancellation (`cancel_task`)
4. Multiple task consumers in a single Serve application
5. Ensuring failed tasks are retried exactly `max_retry + 1` times
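
The `max_retry + 1` figure in the last item is one initial attempt plus `max_retry` retries. A toy retry loop makes the arithmetic concrete; `run_with_retries` is a hypothetical helper written for this illustration and says nothing about how the task processor or Celery schedules retries.

```python
def run_with_retries(task, max_retry):
    """Call `task` until it succeeds or the retries are used up.

    Returns (result, attempts). `result` is None if every attempt failed.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            # The first call is attempt 1, so retries are exhausted
            # once attempts exceeds max_retry.
            if attempts > max_retry:
                return None, attempts


def always_fails():
    raise RuntimeError("simulated task failure")
```

With `max_retry=3`, `run_with_retries(always_fails, 3)` makes four attempts in total before giving up.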

Signed-off-by: harshit <harshit@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces dead-letter queue (DLQ) functionality for the task processor. Failed tasks and unknown tasks are now moved to configurable DLQs for later inspection. The changes include:

- Using Celery signals (`task_failure`, `task_unknown`) to handle task failures.
- New helper functions for creating queue configs and moving tasks.
- Configuration options for failed and unprocessable task queues.
- New tests to verify the DLQ logic for various failure scenarios.

My review focuses on ensuring the robustness of the DLQ mechanism, particularly around data serialization, and on improving the maintainability of the new tests. I've identified a couple of critical issues where failed task data could be lost due to serialization errors, and I've provided suggestions to fix them. I also recommend refactoring the new tests to reduce code duplication.
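
To illustrate the serialization concern raised here: if a failed task's arguments are not JSON-serializable, a naive `json.dumps` raises and the DLQ record is never written. One possible guard is sketched below. It is not the fix adopted in the PR (the inline suggestions are not shown on this page), and `to_dlq_payload` is a hypothetical helper name.

```python
import json


def to_dlq_payload(task_id, args, kwargs, exc):
    """Serialize a failed task for a DLQ without losing it to a serialization error."""
    error = f"{type(exc).__name__}: {exc}"
    record = {"task_id": str(task_id), "args": args, "kwargs": kwargs, "error": error}
    try:
        return json.dumps(record)
    except (TypeError, ValueError):
        # Fall back to string representations so the record is still stored,
        # even though args/kwargs can no longer be replayed as-is.
        fallback = {
            "task_id": str(task_id),
            "args": repr(args),
            "kwargs": repr(kwargs),
            "error": error,
        }
        return json.dumps(fallback)
```

The trade-off is that the fallback record is for inspection only, since the original argument objects cannot be reconstructed from their `repr`.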

zcin

mostly LGTM!

abrarsheikh · 2025-08-22T05:32:14Z

In my opinion, the tests can be rewritten for simplicity; they are too verbose right now and kind of hard to read.

harshit-anyscale · 2025-08-28T10:26:23Z

> In my opinion, the tests can be rewritten for simplicity; they are too verbose right now and kind of hard to read.

Refactored them to be less verbose while retaining all the checks and functionality.

abrarsheikh

The current implementation of the test is not deterministic and needs to be improved. Let's work on that in the follow-up PR.
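
As a general illustration of the determinism point, a test that waits on an explicit signal is reproducible where a fixed `time.sleep` is not. The sketch below uses a plain `threading.Event` as a stand-in for the signal actor mentioned in the commit history; the function names are hypothetical and this is not the PR's test code.

```python
import threading


def run_blocked_task(started, release, results):
    """Simulated task that blocks until the test explicitly releases it."""
    started.set()            # tell the test the task is now running
    release.wait(timeout=5)  # block here instead of sleeping for a guessed duration
    results.append("done")


def demo():
    started, release = threading.Event(), threading.Event()
    results = []
    worker = threading.Thread(target=run_blocked_task, args=(started, release, results))
    worker.start()

    began = started.wait(timeout=5)  # no sleep: wait for the explicit signal
    before_release = list(results)   # the task cannot have finished yet
    release.set()                    # let the task complete
    worker.join(timeout=5)
    return began, before_release, results
```

Because the task cannot finish before `release.set()` is called, the intermediate state can be asserted without relying on timing.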

### Summary

This pull request introduces Dead-Letter Queue (DLQ) functionality for async inference. Users can configure two DLQs:

1. `failed_task_queue` – for tasks that fail during normal execution.
2. `unprocessable_task_queue` – for tasks that cannot be processed (e.g., deserialization failures or missing handlers).

All unprocessable tasks will automatically be routed to the unprocessable_task_queue, while other failures will go to the failed_task_queue. The detailed behavior is defined in the [RFC document](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0).

### Changes in this PR

1. Integrated Celery signals (task_failure, task_unknown) to handle task failures.
2. Added helper functions for moving tasks into the correct DLQ.
3. Introduced tests to verify DLQ routing logic across different failure scenarios.
4. Added a persistence test to ensure tasks are retried at-least-once as per the [RFC’s NFR requirements](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0#heading=h.4om3bw49w03x).

### Follow-up work (to be added in a separate PR)

Additional tests will be added in the next PR to keep this one focused and manageable. These will cover:

1. Task processor metrics
2. Task processor health checks
3. Task cancellation (cancel_task)
4. Multiple task consumers in a single Serve application
5. Ensuring failed tasks are retried exactly max_retry + 1 times

---------

Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>

harshit-anyscale added 2 commits August 14, 2025 11:10

add DLQ implementation

09dc33e

Signed-off-by: harshit <harshit@anyscale.com>

add DLQ implementation

30c3f48

Signed-off-by: harshit <harshit@anyscale.com>

harshit-anyscale self-assigned this Aug 14, 2025

gemini-code-assist bot reviewed Aug 14, 2025

add tests for tasks persistence across restarts

135645d

Signed-off-by: harshit <harshit@anyscale.com>

harshit-anyscale changed the title ~~Add tests and DLQ business logic~~ add tests and DLQ business logic Aug 18, 2025

harshit-anyscale added 4 commits August 18, 2025 09:39

review changes

7bfaff3

Signed-off-by: harshit <harshit@anyscale.com>

merge master

ca47973

Merge branch 'master' of github.com:ray-project/ray into add-dlq-and-…

2b65567

…tests

shift test to medium

b5ba218

Signed-off-by: harshit <harshit@anyscale.com>

harshit-anyscale marked this pull request as ready for review August 18, 2025 14:16

harshit-anyscale requested a review from a team as a code owner August 18, 2025 14:16

add comments

a721985

Signed-off-by: harshit <harshit@anyscale.com>

harshit-anyscale requested review from abrarsheikh, zcin and akyang-anyscale August 18, 2025 14:21

harshit-anyscale added the "go" label (add ONLY when ready to merge, run all tests) Aug 18, 2025

ray-gardener bot added the "serve" label (Ray Serve Related Issue) Aug 18, 2025

zcin reviewed Aug 19, 2025

harshit-anyscale added 2 commits August 19, 2025 08:03

review changes

a8da41d

Signed-off-by: harshit <harshit@anyscale.com>

Merge branch 'master' of github.com:ray-project/ray into add-dlq-and-…

685c0d2

…tests

harshit-anyscale requested a review from zcin August 19, 2025 14:19

zcin reviewed Aug 19, 2025

abrarsheikh reviewed Aug 19, 2025

harshit-anyscale added 2 commits August 20, 2025 11:43

review changes

df1a5b2

Signed-off-by: harshit <harshit@anyscale.com>

use signal actor instead of time sleep

c180110

Signed-off-by: harshit <harshit@anyscale.com>

harshit-anyscale requested review from abrarsheikh and zcin August 21, 2025 11:30

abrarsheikh reviewed Aug 21, 2025

harshit-anyscale added 4 commits August 26, 2025 11:51

Merge branch 'master' of github.com:ray-project/ray into add-dlq-and-…

d0eda50

…tests

review changes

04b81e5

Signed-off-by: harshit <harshit@anyscale.com>

Merge branch 'master' of github.com:ray-project/ray into add-dlq-and-…

d951cae

…tests

refactor changes

f2296ec

Signed-off-by: harshit <harshit@anyscale.com>

harshit-anyscale requested a review from abrarsheikh August 28, 2025 10:26

harshit-anyscale added 2 commits September 1, 2025 19:36

review changes

6858e95

Signed-off-by: harshit <harshit@anyscale.com>

review changes

28ac40e

Signed-off-by: harshit <harshit@anyscale.com>

abrarsheikh reviewed Sep 4, 2025

harshit-anyscale added 2 commits September 4, 2025 16:09

review changes

5b1d3d6

Signed-off-by: harshit <harshit@anyscale.com>

Merge branch 'master' of github.com:ray-project/ray into add-dlq-and-…

aeefd5b

…tests

harshit-anyscale requested a review from abrarsheikh September 4, 2025 16:11

akyang-anyscale reviewed Sep 5, 2025

abrarsheikh approved these changes Sep 5, 2025

akyang-anyscale approved these changes Sep 5, 2025

zcin merged commit 1c637ac into master Sep 5, 2025
5 checks passed

zcin deleted the add-dlq-and-tests branch September 5, 2025 17:02
