feat(log-ingestor): Allow an SQS listener job to spawn multiple concurrent tasks to consume notifications from the same queue (resolves #1977). #1989
Conversation
Walkthrough

Adds validated SQS listener config with concurrency and wait-time fields and validation, implements multi-task SQS listener orchestration (Task/TaskHandle per-task lifecycle), tightens the AwsClientManager trait to require Clone, updates job creation to validate configs and map validation errors to API responses, and adapts tests.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant JobMgr as IngestionJobManager
    participant SqsListener
    participant TaskHandle
    participant Task as Task<SqsClientManager>
    participant SqsClient as SqsClientWrapper
    User->>JobMgr: create_sqs_listener_job(raw_config)
    JobMgr->>JobMgr: ValidatedSqsListenerConfig::validate_and_create(raw_config)
    rect rgba(76, 175, 80, 0.5)
        JobMgr->>SqsListener: spawn(job_id, &sqs_client_manager, &config, &sender)
        Note over SqsListener: For each concurrent task (1..N)
        loop Create N TaskHandles
            SqsListener->>Task: instantiate Task{id, client_manager, config, sender}
            SqsListener->>TaskHandle: TaskHandle::spawn(Task, job_id)
            TaskHandle->>Task: tokio::spawn(task.run())
            Task-->>TaskHandle: JoinHandle<Result<()>>
            TaskHandle->>SqsListener: push to task_handles
        end
    end
    rect rgba(33, 150, 243, 0.5)
        loop Each Task
            Task->>SqsClient: ReceiveMessage(wait_time_seconds)
            alt Messages
                Task->>Task: process messages (log job_id, task_id)
            else No messages / error
                Task->>Task: log outcome with job_id, task_id
            end
        end
    end
    rect rgba(244, 67, 54, 0.5)
        User->>SqsListener: shutdown_and_join()
        loop For each TaskHandle
            SqsListener->>TaskHandle: cancel_token.cancel()
            TaskHandle->>Task: cancellation observed
            TaskHandle->>TaskHandle: await join_handle
            Note over TaskHandle: Log cancellation/exit for task_id
        end
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@components/clp-rust-utils/src/job_config/ingestion.rs`:
- Around line 53-76: The schema/docs conflict: decide whether wait_time_sec
(field wait_time_sec, default default_sqs_wait_time_sec) should be clamped to 20
or rejected; either remove or raise the #[schema(maximum = 20)] and update the
doc to say values >20 will be truncated to 20 (and implement clamping where
config is loaded), or keep the #[schema(maximum = 20)] and change the doc to say
values >20 will be rejected. Also eliminate duplicated magic numbers for
num_concurrent_listener_tasks (field num_concurrent_listener_tasks, default
default_num_concurrent_listener_tasks) by introducing shared constants (e.g.,
MIN_CONCURRENT_LISTENER_TASKS and MAX_CONCURRENT_LISTENER_TASKS) and use those
constants in the schema annotations, documentation text, and the runtime
validation in ingestion_job_manager.rs so the bounds stay in sync.
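A minimal sketch of the shared-constant approach the comment suggests, assuming the constant and function names (`MIN_CONCURRENT_LISTENER_TASKS`, `MAX_CONCURRENT_LISTENER_TASKS`, `validate_num_tasks`) — the real code would reference these same symbols from the schema annotations and from `ingestion_job_manager.rs`:

```rust
// Shared bounds for `num_concurrent_listener_tasks` so the schema annotations,
// documentation, and runtime validation stay in sync. Names are assumptions.
pub const MIN_CONCURRENT_LISTENER_TASKS: u16 = 1;
pub const MAX_CONCURRENT_LISTENER_TASKS: u16 = 32;

/// Validates the configured task count against the shared bounds.
fn validate_num_tasks(n: u16) -> Result<u16, String> {
    if (MIN_CONCURRENT_LISTENER_TASKS..=MAX_CONCURRENT_LISTENER_TASKS).contains(&n) {
        Ok(n)
    } else {
        Err(format!(
            "num_concurrent_listener_tasks must be in \
             {MIN_CONCURRENT_LISTENER_TASKS}..={MAX_CONCURRENT_LISTENER_TASKS}, got {n}"
        ))
    }
}
```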
In `@components/log-ingestor/src/ingestion_job_manager.rs`:
- Around line 40-41: The error string for the InvalidNumConcurrentListenerTasks
variant contains a stray trailing backtick; update the #[error(...)] attribute
on the InvalidNumConcurrentListenerTasks enum variant to remove the extra
backtick so the format becomes a clean message (e.g., change `"Invalid
`num_concurrent_listener_tasks`: {0}`"` to `"Invalid
`num_concurrent_listener_tasks`: {0}"` or remove all backticks), ensuring the
error macro and variant name InvalidNumConcurrentListenerTasks remain unchanged.
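The corrected message can be illustrated std-only (the real code uses `thiserror`'s `#[error(...)]` attribute; this hand-written `Display` impl is an equivalent sketch):

```rust
use std::fmt;

#[derive(Debug)]
enum Error {
    InvalidNumConcurrentListenerTasks(u16),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Equivalent to `#[error("Invalid `num_concurrent_listener_tasks`: {0}")]`
            // — note there is no stray trailing backtick after `{0}`.
            Error::InvalidNumConcurrentListenerTasks(n) => {
                write!(f, "Invalid `num_concurrent_listener_tasks`: {n}")
            }
        }
    }
}
```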
In `@components/log-ingestor/src/ingestion_job/sqs_listener.rs`:
- Around line 235-251: Move the assert that config.num_concurrent_listener_tasks
!= 0 to before allocating task_handles, and replace the lossless integer casts
with explicit from conversions: use
Vec::with_capacity(usize::from(config.num_concurrent_listener_tasks)) instead of
with_capacity(config.num_concurrent_listener_tasks as usize), and construct
Task.id with TaskId::from(task_id) (and similarly use usize::from(...) for any
other counts) so the code uses usize::from(...) and TaskId::from(...) instead of
as casts; keep the loop and TaskHandle::spawn usage unchanged.
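A sketch of the lossless-conversion pattern the comment asks for, with `TaskId` as a hypothetical newtype standing in for the real type in `sqs_listener.rs`:

```rust
/// Hypothetical stand-in for the task-id type used by the listener.
#[derive(Debug, PartialEq)]
struct TaskId(u32);

impl From<u16> for TaskId {
    fn from(v: u16) -> Self {
        TaskId(u32::from(v))
    }
}

/// Allocates per-task state, asserting the validated invariant up front and
/// using `usize::from` / `TaskId::from` instead of `as` casts.
fn allocate_handles(num_tasks: u16) -> Vec<TaskId> {
    assert!(num_tasks != 0, "validated config guarantees at least one task");
    // `usize::from(u16)` is always lossless, unlike an unchecked `as` cast.
    let mut handles = Vec::with_capacity(usize::from(num_tasks));
    for task_id in 0..num_tasks {
        handles.push(TaskId::from(task_id));
    }
    handles
}
```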
In `@components/log-ingestor/tests/test_ingestion_job.rs`:
- Around line 222-253: The test uses num_tasks values [1,4,16,64] but 64 exceeds
the validated API limit (32) enforced by the manager; change the 64 to a value ≤
32 (or add an explicit comment that 64 is an intentional internal stress test)
and make the test deterministic by isolating SQS state between iterations—either
purge the test queue or use unique job_id/prefix per iteration so deterministic
object keys ({prefix}/{idx:05}.log) from a prior run cannot satisfy later runs;
update the call sites in run_sqs_listener_test and the SqsListenerConfig loop
accordingly and document the intent if you keep >32 so reviewers know it
bypasses ingestion_job_manager.rs validation.
```rust
/// AWS SQS enforces a maximum wait time of 20 seconds. Any configured value greater than
/// 20 seconds will be truncated to 20 seconds.
```
CodeRabbit is probably correct? We reject values > 20 rather than truncate them.
```diff
-/// AWS SQS enforces a maximum wait time of 20 seconds. Any configured value greater than
-/// 20 seconds will be truncated to 20 seconds.
+/// AWS SQS enforces a maximum wait time of 20 seconds.
```
This is tricky. With our current setup, there's no way to enforce that the config holds valid values all the way down to the actual listener job spawn, so we need to perform validations manually. However, this case is different from the other two validations we perform at present:
- The other two validations are top-level, and an "invalid" value won't affect the actual ingestion job execution:
  - A custom endpoint won't propagate into the SQS job execution, since it's handled by the client manager.
  - A number larger than 32 given to the job also won't affect execution. The SQS job can handle an arbitrary number of coroutines; it's our top-level decision to not allow a number > 32.
- The wait time is different: a wait time larger than 20 will cause the operation to fail. Since there's no way to enforce config validation all the way down to the SQS job execution, we need to truncate inside the actual task to make sure it's under 20 anyway. That said, as long as we clearly document this truncation in the OpenAPI schema, it should be fair.
I can add the logic to reject >20 sec config on the top-level, but I will keep the truncation logic inside the SQS job just in case.
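The defense-in-depth truncation described above can be sketched as a one-line clamp (the constant name `MAX_WAIT_TIME_SEC` matches the one mentioned later in this thread; `effective_wait_time` is a hypothetical helper name):

```rust
/// AWS SQS rejects `WaitTimeSeconds` values above 20, so the task clamps the
/// configured value even if an out-of-range config slips past top-level validation.
const MAX_WAIT_TIME_SEC: u8 = 20;

fn effective_wait_time(configured: u8) -> u8 {
    configured.min(MAX_WAIT_TIME_SEC)
}
```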
Maybe we should just adopt the C philosophy: if the input isn't valid, then the behavior is undefined. You cannot validate everything in every function or via types.
So we should just reject >20 in routes.rs, and add a debug assert when receiving messages.
I don't agree with the C philosophy in general, lol. UB is the root of evil.
Actually, I have a better idea for handling this. We can wrap the validated config with a special type and pass this type all the way down to the ingestion job.
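The "wrap the validated config in a special type" idea is the newtype pattern: the raw config can only become a `ValidatedSqsListenerConfig` through `validate`, so any code receiving the wrapper knows the bounds already hold. A minimal sketch with assumed field names and bounds (20-second wait cap, 1..=32 tasks, per the discussion above):

```rust
pub struct SqsListenerConfig {
    pub wait_time_sec: u8,
    pub num_concurrent_listener_tasks: u16,
}

/// Can only be constructed via `validate`, so downstream code (all the way down
/// to the ingestion job) can rely on the invariants without re-checking.
pub struct ValidatedSqsListenerConfig(SqsListenerConfig);

impl ValidatedSqsListenerConfig {
    pub fn validate(raw: SqsListenerConfig) -> Result<Self, String> {
        if raw.wait_time_sec > 20 {
            return Err(format!("wait_time_sec must be <= 20, got {}", raw.wait_time_sec));
        }
        if raw.num_concurrent_listener_tasks == 0 || raw.num_concurrent_listener_tasks > 32 {
            return Err(format!(
                "num_concurrent_listener_tasks must be in 1..=32, got {}",
                raw.num_concurrent_listener_tasks
            ));
        }
        Ok(Self(raw))
    }

    pub fn wait_time_sec(&self) -> u8 {
        self.0.wait_time_sec
    }
}
```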
Types are not the solution to everything. Even tokio's implementation has undefined behaviors documented.
Co-authored-by: hoophalab <200652805+hoophalab@users.noreply.github.com>
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/log-ingestor/src/ingestion_job/sqs_listener.rs (1)
298-304: 🧹 Nitpick | 🔵 Trivial
Consider returning `&Uuid` or `Uuid` instead of `String` from `get_id`. Callers that need the UUID as a string can call `.to_string()` themselves, while callers that need the `Uuid` type benefit from avoiding an allocation. Also, the `get_` prefix is non-idiomatic in Rust — the convention is just `id()`.
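A sketch of the suggested accessor, using a hypothetical `JobId` newtype in place of `Uuid` to stay std-only:

```rust
/// Hypothetical stand-in for `Uuid`; `Copy` makes returning by value free.
#[derive(Clone, Copy, Debug, PartialEq)]
struct JobId(u128);

struct SqsListener {
    id: JobId,
}

impl SqsListener {
    /// Idiomatic accessor: named `id()` (no `get_` prefix) and returns the id
    /// type itself, so no `String` allocation; callers can stringify if needed.
    fn id(&self) -> JobId {
        self.id
    }
}
```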
🤖 Fix all issues with AI agents
In `@components/log-ingestor/src/ingestion_job_manager.rs`:
- Around line 162-167: The hard-coded upper bound 32 used when validating
config.num_concurrent_listener_tasks should be extracted into a shared constant
to avoid divergence with the config definition; add a public constant (e.g.,
MAX_CONCURRENT_LISTENER_TASKS) next to SqsListenerConfig in the config module,
replace the literal 32 in ingestion_job_manager.rs with that constant, and
update any other places (and tests) that rely on the value so both the
validation here and the config definition reference the same symbol instead of a
magic number; ensure Error::InvalidNumConcurrentListenerTasks still reports the
provided value unchanged.
In `@components/log-ingestor/src/ingestion_job/sqs_listener.rs`:
- Around line 262-294: There is a duplicated nested match awaiting the same
JoinHandle (use-after-move) in the loop over self.task_handles; remove the outer
match and replace with a single match on task_handle.join_handle.await that
handles the three arms (Ok(Ok(())) success logging, Ok(Err(_)) warn, and
Err(err) panic warn). Locate the loop iterating over self.task_handles in
sqs_listener.rs and update the handling around task_handle.join_handle.await so
the JoinHandle is awaited exactly once and the three cases are handled directly.
- Around line 256-296: The shutdown_and_join method contains a duplicated nested
match over task_handle.join_handle.await causing a compile error; remove the
outer match that binds Ok(task_result) and keep the inner match that handles
Ok(Ok(())), Ok(Err(_)), and Err(err) for the JoinHandle<Result<()>> result.
Iterate over self.task_handles (cancel tokens loop stays) and then consume
task_handles in the second loop, matching directly on
task_handle.join_handle.await inside shutdown_and_join so task_id and self.id
logging remain unchanged.
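The single-await, three-arm match this comment asks for can be sketched with a plain nested `Result` standing in for tokio's `JoinHandle<Result<()>>` output; `report` and the `JoinError` alias are hypothetical names used only for illustration:

```rust
// In the real code, `joined` would be the value of `task_handle.join_handle.await`,
// awaited exactly once (tokio's real error type is `tokio::task::JoinError`).
type JoinError = String;

fn report(task_id: u32, joined: Result<Result<(), String>, JoinError>) -> String {
    match joined {
        // Task ran to completion successfully.
        Ok(Ok(())) => format!("task {task_id} exited cleanly"),
        // Task ran but returned an application error.
        Ok(Err(err)) => format!("task {task_id} returned error: {err}"),
        // Task panicked or was aborted before completing.
        Err(join_err) => format!("task {task_id} panicked: {join_err}"),
    }
}
```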
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/log-ingestor/src/ingestion_job/sqs_listener.rs (1)
47-69: 🧹 Nitpick | 🔵 Trivial
`MAX_WAIT_TIME_SEC` duplicated as defense-in-depth is acceptable but could reference the shared constant. The local `MAX_WAIT_TIME_SEC` on line 49 duplicates the 20-second bound already enforced by `ValidatedSqsListenerConfig`. As defense-in-depth this is reasonable, but if you extract the validation bounds into constants (as suggested in the config file review), this could reference the same constant to stay in sync.
🤖 Fix all issues with AI agents
In `@components/log-ingestor/src/ingestion_job_manager.rs`:
- Around line 156-159: Update the doc comment for the function that currently
mentions Error::InvalidNumConcurrentListenerTasks to instead reference the
actual error variants returned by validation: Error::InvalidConfig(ConfigError)
and Error::InvalidSqsWaitTime; ensure the sentence that lists forwarded failures
from Self::create_s3_ingestion_job includes both of these validation errors and
clearly states they come from config/validation checks so readers can locate the
real variants (e.g., mention Error::InvalidConfig(ConfigError) and
Error::InvalidSqsWaitTime alongside the forwarded create_s3_ingestion_job
errors).
In `@components/log-ingestor/src/routes.rs`:
- Around line 211-212: The API description string for the SQS listener error
currently only references invalid concurrent listener tasks but omits the
wait_time_sec validation; update the description used in routes.rs (the
description string that references ConfigError) to either mention both
ConfigError::InvalidNumConcurrentListenerTasks and
ConfigError::InvalidSqsWaitTime explicitly or replace it with a more general
phrase such as "invalid configuration (e.g., invalid number of concurrent
listener tasks or invalid wait_time_sec)". Ensure the new text clearly signals
both possible validation failures so the documented error matches the
ConfigError variants.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/log-ingestor/src/ingestion_job_manager.rs (1)
146-186: ⚠️ Potential issue | 🟡 Minor
Doc comment for `create_sqs_listener_job` is incomplete — `CustomEndpointUrlNotSupported` is not listed. The `# Errors` section (lines 154–157) documents forwarded errors from `create_s3_ingestion_job` and `validate_and_create`, but omits `Error::CustomEndpointUrlNotSupported`, which is explicitly returned at line 165. While the omission may predate this PR, the doc was partially rewritten here (line 157), so it's a good time to complete it.

📝 Suggested doc fix

```diff
 /// # Errors
 ///
 /// Returns an error if:
 ///
+/// * [`Error::CustomEndpointUrlNotSupported`] if a custom endpoint URL is given.
 /// * Forwards [`Self::create_s3_ingestion_job`]'s return values on failure.
 /// * Forwards [`ValidatedSqsListenerConfig::validate_and_create`]'s return values on failure.
```
Description
This PR adds support for spawning multiple coroutines to process a single SQS queue. This enables higher message-processing throughput.
To support the multi-listener design, this PR also exposes the maximum wait time as a configurable option so that coroutines can avoid frequent empty responses.
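The throughput idea behind the PR — N concurrent consumers draining one shared queue — can be illustrated with a std-thread analogue (the real code uses tokio tasks and an SQS client; all names here are illustrative):

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Spawns `num_workers` consumers over one shared queue and returns the total
/// number of messages processed across all workers.
fn consume_with_workers(messages: Vec<String>, num_workers: usize) -> usize {
    let (tx, rx) = mpsc::channel();
    for msg in messages {
        tx.send(msg).unwrap();
    }
    // Dropping the sender lets workers observe a disconnected, drained channel.
    drop(tx);

    let rx = Arc::new(Mutex::new(rx));
    let mut handles = Vec::with_capacity(num_workers);
    for _ in 0..num_workers {
        let rx = Arc::clone(&rx);
        handles.push(thread::spawn(move || {
            let mut processed = 0;
            loop {
                // Lock only for the receive, mirroring per-task ReceiveMessage calls.
                match rx.lock().unwrap().try_recv() {
                    Ok(_msg) => processed += 1,
                    Err(_) => break, // queue drained and sender dropped
                }
            }
            processed
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```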
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
New Features
Bug Fixes / Behaviour
Tests
Chores