-
Notifications
You must be signed in to change notification settings - Fork 6.9k
add tests and DLQ business logic #55608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces dead-letter queue (DLQ) functionality for the task processor. Failed tasks and unknown tasks are now moved to configurable DLQs for later inspection. The changes include:
- Using Celery signals (
task_failure
,task_unknown
) to handle task failures. - New helper functions for creating queue configs and moving tasks.
- Configuration options for failed and unprocessable task queues.
- New tests to verify the DLQ logic for various failure scenarios.
My review focuses on ensuring the robustness of the DLQ mechanism, particularly around data serialization, and on improving the maintainability of the new tests. I've identified a couple of critical issues where failed task data could be lost due to serialization errors, and I've provided suggestions to fix them. I also recommend refactoring the new tests to reduce code duplication.
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly LGTM!
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
In my opinion, the tests can be rewritten for simplicity; they are too verbose right now and kind of hard to read. |
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
refactored them to make it less verbose but retain all the checks & functionalities |
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the current implementation of the test is not deterministic and need to be improved. Let's work on that in the follow up PR.
### Summary This pull request introduces Dead-Letter Queue (DLQ) functionality for async inference. Users can configure two DLQs: 1. `failed_task_queue` – for tasks that fail during normal execution. 2. `unprocessable_task_queue` – for tasks that cannot be processed (e.g., deserialization failures or missing handlers). All unprocessable tasks will automatically be routed to the unprocessable_task_queue, while other failures will go to the failed_task_queue. The detailed behavior is defined in the [RFC document](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0). ### Changes in this PR 1. Integrated Celery signals (task_failure, task_unknown) to handle task failures. 2. Added helper functions for moving tasks into the correct DLQ. 3. Introduced tests to verify DLQ routing logic across different failure scenarios. 4. Added a persistence test to ensure tasks are retried at-least-once as per the [RFC’s NFR requirements](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0#heading=h.4om3bw49w03x). ### Follow-up work (to be added in a separate PR) Additional tests will be added in the next PR to keep this one focused and manageable. These will cover: 1. Task processor metrics 2. Task processor health checks 3. Task cancellation (cancel_task) 4. Multiple task consumers in a single Serve application 5. Ensuring failed tasks are retried exactly max_retry + 1 times --------- Signed-off-by: harshit <harshit@anyscale.com> Signed-off-by: sampan <sampan@anyscale.com>
### Summary This pull request introduces Dead-Letter Queue (DLQ) functionality for async inference. Users can configure two DLQs: 1. `failed_task_queue` – for tasks that fail during normal execution. 2. `unprocessable_task_queue` – for tasks that cannot be processed (e.g., deserialization failures or missing handlers). All unprocessable tasks will automatically be routed to the unprocessable_task_queue, while other failures will go to the failed_task_queue. The detailed behavior is defined in the [RFC document](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0). ### Changes in this PR 1. Integrated Celery signals (task_failure, task_unknown) to handle task failures. 2. Added helper functions for moving tasks into the correct DLQ. 3. Introduced tests to verify DLQ routing logic across different failure scenarios. 4. Added a persistence test to ensure tasks are retried at-least-once as per the [RFC’s NFR requirements](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0#heading=h.4om3bw49w03x). ### Follow-up work (to be added in a separate PR) Additional tests will be added in the next PR to keep this one focused and manageable. These will cover: 1. Task processor metrics 2. Task processor health checks 3. Task cancellation (cancel_task) 4. Multiple task consumers in a single Serve application 5. Ensuring failed tasks are retried exactly max_retry + 1 times --------- Signed-off-by: harshit <harshit@anyscale.com> Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
### Summary This pull request introduces Dead-Letter Queue (DLQ) functionality for async inference. Users can configure two DLQs: 1. `failed_task_queue` – for tasks that fail during normal execution. 2. `unprocessable_task_queue` – for tasks that cannot be processed (e.g., deserialization failures or missing handlers). All unprocessable tasks will automatically be routed to the unprocessable_task_queue, while other failures will go to the failed_task_queue. The detailed behavior is defined in the [RFC document](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0). ### Changes in this PR 1. Integrated Celery signals (task_failure, task_unknown) to handle task failures. 2. Added helper functions for moving tasks into the correct DLQ. 3. Introduced tests to verify DLQ routing logic across different failure scenarios. 4. Added a persistence test to ensure tasks are retried at-least-once as per the [RFC’s NFR requirements](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0#heading=h.4om3bw49w03x). ### Follow-up work (to be added in a separate PR) Additional tests will be added in the next PR to keep this one focused and manageable. These will cover: 1. Task processor metrics 2. Task processor health checks 3. Task cancellation (cancel_task) 4. Multiple task consumers in a single Serve application 5. Ensuring failed tasks are retried exactly max_retry + 1 times --------- Signed-off-by: harshit <harshit@anyscale.com> Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
### Summary This pull request introduces Dead-Letter Queue (DLQ) functionality for async inference. Users can configure two DLQs: 1. `failed_task_queue` – for tasks that fail during normal execution. 2. `unprocessable_task_queue` – for tasks that cannot be processed (e.g., deserialization failures or missing handlers). All unprocessable tasks will automatically be routed to the unprocessable_task_queue, while other failures will go to the failed_task_queue. The detailed behavior is defined in the [RFC document](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0). ### Changes in this PR 1. Integrated Celery signals (task_failure, task_unknown) to handle task failures. 2. Added helper functions for moving tasks into the correct DLQ. 3. Introduced tests to verify DLQ routing logic across different failure scenarios. 4. Added a persistence test to ensure tasks are retried at-least-once as per the [RFC’s NFR requirements](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0#heading=h.4om3bw49w03x). ### Follow-up work (to be added in a separate PR) Additional tests will be added in the next PR to keep this one focused and manageable. These will cover: 1. Task processor metrics 2. Task processor health checks 3. Task cancellation (cancel_task) 4. Multiple task consumers in a single Serve application 5. Ensuring failed tasks are retried exactly max_retry + 1 times --------- Signed-off-by: harshit <harshit@anyscale.com> Signed-off-by: zac <zac@anyscale.com>
Summary
This pull request introduces Dead-Letter Queue (DLQ) functionality for async inference. Users can configure two DLQs:
failed_task_queue
– for tasks that fail during normal execution.unprocessable_task_queue
– for tasks that cannot be processed (e.g., deserialization failures or missing handlers).All unprocessable tasks will automatically be routed to the unprocessable_task_queue, while other failures will go to the failed_task_queue. The detailed behavior is defined in the RFC document.
Changes in this PR
Follow-up work (to be added in a separate PR)
Additional tests will be added in the next PR to keep this one focused and manageable. These will cover: