feat(log-ingestor): Allow an SQS listener job to spawn multiple concurrent tasks to consume notifications from the same queue (resolves #1977). #1989

Merged
LinZhihao-723 merged 5 commits into y-scope:main from LinZhihao-723:concurrent-listeners
Feb 16, 2026

@LinZhihao-723
Member

@LinZhihao-723 LinZhihao-723 commented Feb 13, 2026

Description

This PR adds support for spawning multiple coroutines to consume notifications from a single SQS queue, which enables higher message-processing throughput.

To suit the multi-listener design, this PR also exposes the maximum wait time as a configurable option so that coroutines can avoid frequent empty responses.
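
To make the wait-time trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes an idle queue where every long poll returns empty after exactly the configured wait time, and it ignores network latency. The function name `empty_polls_per_minute` is hypothetical and not part of this PR.

```rust
/// Approximate number of empty receive responses per minute on an idle queue.
///
/// Assumption: each task's long poll returns empty after exactly `wait_time_sec`
/// seconds. Returns `None` for a zero wait time (short polling), where the rate
/// depends only on request latency.
fn empty_polls_per_minute(num_tasks: u16, wait_time_sec: u16) -> Option<u32> {
    if wait_time_sec == 0 {
        return None;
    }
    // Widen before multiplying so the product cannot overflow `u16`.
    Some(u32::from(num_tasks) * 60 / u32::from(wait_time_sec))
}
```

Under these assumptions, 16 tasks with a 1-second wait issue about 960 empty requests per minute, while the same 16 tasks with a 20-second wait issue about 48.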

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Add unit tests to cover multiple listener tasks.
  • Ensure all workflows pass.
  • Ensure invalid concurrency configs are rejected.
  • Ensure that with 16 tasks per job, the ingestion speed is significantly improved.

Summary by CodeRabbit

  • New Features

    • SQS listener supports configurable concurrency and wait time with per-task validation and improved per-task logging.
  • Bug Fixes / Behaviour

    • Listener shutdown now completes cleanly even when individual tasks fail; invalid listener configs yield clear Bad Request responses.
  • Tests

    • Tests expanded to exercise multiple concurrency levels and noise-object scenarios.
  • Chores

    • Client wrappers made clonable and API surfaces adjusted to accept validated config references.

@LinZhihao-723 LinZhihao-723 requested a review from a team as a code owner February 13, 2026 05:13
@coderabbitai
Contributor

coderabbitai bot commented Feb 13, 2026

Walkthrough

Adds validated SQS listener config with concurrency and wait-time fields and validation, implements multi-task SQS listener orchestration (Task/TaskHandle per-task lifecycle), tightens AwsClientManager trait to require Clone, updates job creation to validate configs and map validation errors to API responses, and adapts tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Configuration: components/clp-rust-utils/src/job_config/ingestion.rs | Added num_concurrent_listener_tasks: u16 and wait_time_sec: u16 to SqsListenerConfig, introduced ConfigError enum, ValidatedSqsListenerConfig wrapper with validate_and_create and accessor, plus default provider functions. |
| AWS client manager & wrappers: components/log-ingestor/src/aws_client_manager.rs, components/log-ingestor/tests/aws_config.rs | Added Clone bound to AwsClientManagerType trait and derived Clone for SqsClientWrapper, S3ClientWrapper, and test AwsConfig. |
| Multi-task SQS listener: components/log-ingestor/src/ingestion_job/sqs_listener.rs | Replaced single-task listener with multi-task orchestration: added TaskId, TaskHandle, per-task IDs, ValidatedSqsListenerConfig usage, SqsListener holding multiple task handles, spawn and shutdown_and_join that cancels + joins all tasks, and adjusted wait-time typing/usage. |
| Job creation & manager: components/log-ingestor/src/ingestion_job_manager.rs, components/log-ingestor/src/ingestion_job.rs | create_sqs_listener_job now validates SqsListenerConfig via ValidatedSqsListenerConfig::validate_and_create, added InvalidConfig(ConfigError) error variant, spawn calls updated to pass borrowed refs; ingestion_job::shutdown_and_join no longer propagates SqsListener error. |
| API routes / error mapping: components/log-ingestor/src/routes.rs | Mapped new InvalidConfig(_) validation error to BAD_REQUEST and updated create_sqs_listener_job error description to include invalid concurrency. |
| Tests: components/log-ingestor/tests/test_ingestion_job.rs | Refactored tests to use ValidatedSqsListenerConfig, added run_sqs_listener_test helper, introduced NUM_NOISE_OBJECTS, looped tests over concurrency values (1,4,16,32), adjusted spawn and client wrapper call sites to pass references, and adapted shutdown handling. |
| Manifests / misc: manifest_file, Cargo.toml | Small manifest edits to accommodate derive/trait bound changes. |

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant JobMgr as IngestionJobManager
    participant SqsListener
    participant TaskHandle
    participant Task as Task<SqsClientManager>
    participant SqsClient as SqsClientWrapper

    User->>JobMgr: create_sqs_listener_job(raw_config)
    JobMgr->>JobMgr: ValidatedSqsListenerConfig::validate_and_create(raw_config)
    rect rgba(76, 175, 80, 0.5)
        JobMgr->>SqsListener: spawn(job_id, &sqs_client_manager, &config, &sender)
        Note over SqsListener: For each concurrent task (1..N)
        loop Create N TaskHandles
            SqsListener->>Task: instantiate Task{id, client_manager, config, sender}
            SqsListener->>TaskHandle: TaskHandle::spawn(Task, job_id)
            TaskHandle->>Task: tokio::spawn(task.run())
            Task-->>TaskHandle: JoinHandle<Result<()>>
            TaskHandle->>SqsListener: push to task_handles
        end
    end

    rect rgba(33, 150, 243, 0.5)
        loop Each Task
            Task->>SqsClient: ReceiveMessage(wait_time_seconds)
            alt Messages
                Task->>Task: process messages (log job_id, task_id)
            else No messages / error
                Task->>Task: log outcome with job_id, task_id
            end
        end
    end

    rect rgba(244, 67, 54, 0.5)
        User->>SqsListener: shutdown_and_join()
        loop For each TaskHandle
            SqsListener->>TaskHandle: cancel_token.cancel()
            TaskHandle->>Task: cancellation observed
            TaskHandle->>TaskHandle: await join_handle
            Note over TaskHandle: Log cancellation/exit for task_id
        end
    end
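
The spawn and shutdown phases of the diagram can be sketched with the standard library alone. This is a simplified stand-in, not the PR's code: it uses OS threads and an `AtomicBool` in place of tokio tasks and a cancellation token, and the `Listener` and `TaskHandle` types below are hypothetical reductions of `SqsListener` and `TaskHandle`.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for a per-task handle: an ID, a cancel flag, and a join handle.
struct TaskHandle {
    task_id: u16,
    cancel: Arc<AtomicBool>,
    join_handle: JoinHandle<Result<(), String>>,
}

/// Hypothetical stand-in for the listener that owns all task handles.
struct Listener {
    task_handles: Vec<TaskHandle>,
}

impl Listener {
    /// Spawns `num_tasks` workers that each poll until their cancel flag is set.
    fn spawn(num_tasks: u16, polls: Arc<AtomicUsize>) -> Self {
        let mut task_handles = Vec::with_capacity(usize::from(num_tasks));
        for task_id in 0..num_tasks {
            let cancel = Arc::new(AtomicBool::new(false));
            let (flag, counter) = (Arc::clone(&cancel), Arc::clone(&polls));
            let join_handle = thread::spawn(move || {
                while !flag.load(Ordering::Relaxed) {
                    // A real task would long-poll the queue here.
                    counter.fetch_add(1, Ordering::Relaxed);
                    thread::sleep(Duration::from_millis(1));
                }
                Ok(())
            });
            task_handles.push(TaskHandle { task_id, cancel, join_handle });
        }
        Self { task_handles }
    }

    /// Cancels every worker first, then joins them one by one.
    /// Returns the IDs of the workers that exited cleanly.
    fn shutdown_and_join(self) -> Vec<u16> {
        for handle in &self.task_handles {
            handle.cancel.store(true, Ordering::Relaxed);
        }
        let mut clean = Vec::new();
        for handle in self.task_handles {
            if let Ok(Ok(())) = handle.join_handle.join() {
                clean.push(handle.task_id);
            }
        }
        clean
    }
}
```

Cancelling all workers before joining any of them lets the workers wind down in parallel, so shutdown latency is bounded by the slowest worker rather than the sum of all of them.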

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Merge Conflict Detection ⚠️ Warning ❌ Merge conflicts detected (11 files):

⚔️ components/clp-rust-utils/src/job_config/ingestion.rs (content)
⚔️ components/core/src/clp_s/indexer/CMakeLists.txt (content)
⚔️ components/log-ingestor/src/aws_client_manager.rs (content)
⚔️ components/log-ingestor/src/compression/compression_job_submitter.rs (content)
⚔️ components/log-ingestor/src/ingestion_job.rs (content)
⚔️ components/log-ingestor/src/ingestion_job/sqs_listener.rs (content)
⚔️ components/log-ingestor/src/ingestion_job_manager.rs (content)
⚔️ components/log-ingestor/src/routes.rs (content)
⚔️ components/log-ingestor/tests/aws_config.rs (content)
⚔️ components/log-ingestor/tests/test_ingestion_job.rs (content)
⚔️ docs/src/_static/generated/log-ingestor-openapi.json (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title clearly and specifically describes the main change: enabling multiple concurrent SQS listener tasks for a single queue, which is the primary objective of this changeset across all modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@components/clp-rust-utils/src/job_config/ingestion.rs`:
- Around line 53-76: The schema/docs conflict: decide whether wait_time_sec
(field wait_time_sec, default default_sqs_wait_time_sec) should be clamped to 20
or rejected; either remove or raise the #[schema(maximum = 20)] and update the
doc to say values >20 will be truncated to 20 (and implement clamping where
config is loaded), or keep the #[schema(maximum = 20)] and change the doc to say
values >20 will be rejected. Also eliminate duplicated magic numbers for
num_concurrent_listener_tasks (field num_concurrent_listener_tasks, default
default_num_concurrent_listener_tasks) by introducing shared constants (e.g.,
MIN_CONCURRENT_LISTENER_TASKS and MAX_CONCURRENT_LISTENER_TASKS) and use those
constants in the schema annotations, documentation text, and the runtime
validation in ingestion_job_manager.rs so the bounds stay in sync.

In `@components/log-ingestor/src/ingestion_job_manager.rs`:
- Around line 40-41: The error string for the InvalidNumConcurrentListenerTasks
variant contains a stray trailing backtick; update the #[error(...)] attribute
on the InvalidNumConcurrentListenerTasks enum variant to remove the extra
backtick so the format becomes a clean message (e.g., change `"Invalid
`num_concurrent_listener_tasks`: {0}`"` to `"Invalid
`num_concurrent_listener_tasks`: {0}"` or remove all backticks), ensuring the
error macro and variant name InvalidNumConcurrentListenerTasks remain unchanged.

In `@components/log-ingestor/src/ingestion_job/sqs_listener.rs`:
- Around line 235-251: Move the assert that config.num_concurrent_listener_tasks
!= 0 to before allocating task_handles, and replace the lossless integer casts
with explicit from conversions: use
Vec::with_capacity(usize::from(config.num_concurrent_listener_tasks)) instead of
with_capacity(config.num_concurrent_listener_tasks as usize), and construct
Task.id with TaskId::from(task_id) (and similarly use usize::from(...) for any
other counts) so the code uses usize::from(...) and TaskId::from(...) instead of
as casts; keep the loop and TaskHandle::spawn usage unchanged.

In `@components/log-ingestor/tests/test_ingestion_job.rs`:
- Around line 222-253: The test uses num_tasks values [1,4,16,64] but 64 exceeds
the validated API limit (32) enforced by the manager; change the 64 to a value ≤
32 (or add an explicit comment that 64 is an intentional internal stress test)
and make the test deterministic by isolating SQS state between iterations—either
purge the test queue or use unique job_id/prefix per iteration so deterministic
object keys ({prefix}/{idx:05}.log) from a prior run cannot satisfy later runs;
update the call sites in run_sqs_listener_test and the SqsListenerConfig loop
accordingly and document the intent if you keep >32 so reviewers know it
bypasses ingestion_job_manager.rs validation.
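
Two of the suggestions above, shared bound constants and lossless `from` conversions, can be illustrated together. This is a minimal sketch: the constant names follow the review's suggestion, the lower bound of 1 is an assumption based on the non-zero assertion mentioned above, and both helper functions are hypothetical.

```rust
/// Shared bounds so that schema annotations, docs, and runtime validation can all
/// reference one definition. The lower bound of 1 is an assumption.
pub const MIN_CONCURRENT_LISTENER_TASKS: u16 = 1;
pub const MAX_CONCURRENT_LISTENER_TASKS: u16 = 32;

/// Hypothetical helper: checks the task count against the shared bounds.
fn is_valid_num_tasks(num_tasks: u16) -> bool {
    (MIN_CONCURRENT_LISTENER_TASKS..=MAX_CONCURRENT_LISTENER_TASKS).contains(&num_tasks)
}

/// Hypothetical helper: pre-allocates one slot per task ID.
fn allocate_task_ids(num_tasks: u16) -> Vec<u16> {
    // `usize::from` only exists for lossless widenings, whereas an `as` cast would
    // also compile silently if the source type later became wider than `usize`.
    let mut ids = Vec::with_capacity(usize::from(num_tasks));
    ids.extend(0..num_tasks);
    ids
}
```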

Contributor

@hoophalab hoophalab left a comment

Some nitpicks.

Validations:

  1. Successfully ingested logs from AWS SQS
  2. tests:rust-all passes

Comment on lines +70 to +71
/// AWS SQS enforces a maximum wait time of 20 seconds. Any configured value greater than
/// 20 seconds will be truncated to 20 seconds.
Contributor

Coderabbit is probably correct? We reject values > 20 rather than truncate them.

Suggested change
-/// AWS SQS enforces a maximum wait time of 20 seconds. Any configured value greater than
-/// 20 seconds will be truncated to 20 seconds.
+/// AWS SQS enforces a maximum wait time of 20 seconds.

Member Author

This is tricky. With our current setup, there's no way to guarantee that the config holds valid values all the way down to where the listener job is actually spawned, so we need to validate manually. However, this case differs from the other two validations we currently perform:

  • The other two validations are performed at the top level, but an "invalid" value there wouldn't affect the actual ingestion job execution.
    • A custom endpoint won't propagate into the SQS job execution, since it's handled by the client manager.
    • A number larger than 32 given to the job also won't affect the execution. The SQS job can handle an arbitrary number of coroutines; disallowing a number > 32 is our top-level decision.
  • The wait time is different. A wait time larger than 20 will cause the operation to fail. Since there's no way to enforce config validation all the way down to the SQS job execution, we need to truncate inside the actual task to make sure the value never exceeds 20. That said, I think this is fair as long as we clearly document the truncation in the OpenAPI schema.

I can add logic to reject a >20 sec config at the top level, but I will keep the truncation logic inside the SQS job just in case.
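
A minimal sketch of the truncation kept inside the task as a last line of defence. The helper name `effective_wait_time_sec` is hypothetical, and the `i32` return type is an assumption about what the receive call expects.

```rust
/// Upper bound on the SQS long-poll wait time, as stated in this thread.
const MAX_WAIT_TIME_SEC: u16 = 20;

/// Hypothetical helper: clamps the configured wait time right before the receive call,
/// so an out-of-range value can never reach SQS even if upstream validation is skipped.
fn effective_wait_time_sec(configured: u16) -> i32 {
    i32::from(configured.min(MAX_WAIT_TIME_SEC))
}
```

With top-level rejection also in place, the clamp should never change a value in practice; it only guards against a caller that bypasses validation.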

Contributor

Maybe we should just adopt the C philosophy: if the input isn't valid, then the behavior is undefined. You cannot validate everything in every function or through types.

So we should just reject >20 in routes.rs, and add a debug assert when receiving messages.

Member Author

I don't agree with the C philosophy in general, lol. UB is the root of evil.
Actually, I have a better idea for handling this. We can wrap the validated config in a special type and pass this type all the way down to the ingestion job.
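
The wrapper-type idea can be sketched as follows. The type, method, and variant names match those mentioned elsewhere in this PR, but the bodies are assumptions: the real `SqsListenerConfig` has more fields, and the `u16` payloads, the accessor name `get`, and the `spawn_listener` function are hypothetical.

```rust
/// Raw, user-supplied config, reduced to the two fields discussed in this thread.
#[derive(Debug, Clone)]
pub struct SqsListenerConfig {
    pub num_concurrent_listener_tasks: u16,
    pub wait_time_sec: u16,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ConfigError {
    InvalidNumConcurrentListenerTasks(u16),
    InvalidSqsWaitTime(u16),
}

/// The inner field is private, so code outside this module can only obtain a value
/// through `validate_and_create`. Holding one is proof that the checks ran.
#[derive(Debug, Clone)]
pub struct ValidatedSqsListenerConfig {
    config: SqsListenerConfig,
}

impl ValidatedSqsListenerConfig {
    pub fn validate_and_create(config: SqsListenerConfig) -> Result<Self, ConfigError> {
        let num_tasks = config.num_concurrent_listener_tasks;
        if !(1..=32).contains(&num_tasks) {
            return Err(ConfigError::InvalidNumConcurrentListenerTasks(num_tasks));
        }
        if config.wait_time_sec > 20 {
            return Err(ConfigError::InvalidSqsWaitTime(config.wait_time_sec));
        }
        Ok(Self { config })
    }

    /// Read-only access to the validated values.
    pub fn get(&self) -> &SqsListenerConfig {
        &self.config
    }
}

/// Hypothetical downstream function: it accepts only the validated type, so it cannot
/// be called with an unchecked config. Returns the number of tasks it would spawn.
fn spawn_listener(config: &ValidatedSqsListenerConfig) -> usize {
    usize::from(config.get().num_concurrent_listener_tasks)
}
```

The design choice here is that validation happens once at the boundary, and the type system carries the guarantee downward instead of each layer re-checking or documenting undefined behavior.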

Contributor

@hoophalab hoophalab Feb 16, 2026

Types are not the solution to everything. Even the tokio implementation has undefined behaviors in its documentation.

Co-authored-by: hoophalab <200652805+hoophalab@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/log-ingestor/src/ingestion_job/sqs_listener.rs (1)

298-304: 🧹 Nitpick | 🔵 Trivial

Consider returning &Uuid or Uuid instead of String from get_id.

Callers that need the UUID as a string can call .to_string() themselves, while callers that need the Uuid type benefit from avoiding an allocation. Also, the get_ prefix is non-idiomatic in Rust — the convention is just id().

🤖 Fix all issues with AI agents
In `@components/log-ingestor/src/ingestion_job_manager.rs`:
- Around line 162-167: The hard-coded upper bound 32 used when validating
config.num_concurrent_listener_tasks should be extracted into a shared constant
to avoid divergence with the config definition; add a public constant (e.g.,
MAX_CONCURRENT_LISTENER_TASKS) next to SqsListenerConfig in the config module,
replace the literal 32 in ingestion_job_manager.rs with that constant, and
update any other places (and tests) that rely on the value so both the
validation here and the config definition reference the same symbol instead of a
magic number; ensure Error::InvalidNumConcurrentListenerTasks still reports the
provided value unchanged.

In `@components/log-ingestor/src/ingestion_job/sqs_listener.rs`:
- Around line 262-294: There is a duplicated nested match awaiting the same
JoinHandle (use-after-move) in the loop over self.task_handles; remove the outer
match and replace with a single match on task_handle.join_handle.await that
handles the three arms (Ok(Ok(())) success logging, Ok(Err(_)) warn, and
Err(err) panic warn). Locate the loop iterating over self.task_handles in
sqs_listener.rs and update the handling around task_handle.join_handle.await so
the JoinHandle is awaited exactly once and the three cases are handled directly.
- Around line 256-296: The shutdown_and_join method contains a duplicated nested
match over task_handle.join_handle.await causing a compile error; remove the
outer match that binds Ok(task_result) and keep the inner match that handles
Ok(Ok(())), Ok(Err(_)), and Err(err) for the JoinHandle<Result<()>> result.
Iterate over self.task_handles (cancel tokens loop stays) and then consume
task_handles in the second loop, matching directly on
task_handle.join_handle.await inside shutdown_and_join so task_id and self.id
logging remain unchanged.
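
The single three-arm match described above can be sketched with a standard-library thread, whose `join()` result has the same nested shape as awaiting a tokio `JoinHandle<Result<()>>`. The `TaskOutcome` enum and `join_once` function are hypothetical names used only for illustration; the real code logs instead of returning a value.

```rust
use std::thread;

/// Hypothetical classification of how a task ended.
#[derive(Debug, PartialEq, Eq)]
enum TaskOutcome {
    Completed,
    Failed,
    Panicked,
}

/// Consumes the handle exactly once and handles all three cases in one match.
fn join_once(handle: thread::JoinHandle<Result<(), String>>) -> TaskOutcome {
    match handle.join() {
        // The task ran to completion and returned `Ok`.
        Ok(Ok(())) => TaskOutcome::Completed,
        // The task ran to completion but returned its own error.
        Ok(Err(_)) => TaskOutcome::Failed,
        // The task itself panicked, so there is no task result at all.
        Err(_) => TaskOutcome::Panicked,
    }
}
```

Because `join_once` takes the handle by value, the compiler rejects any attempt to join the same handle a second time, which is the use-after-move the review points out.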

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/log-ingestor/src/ingestion_job/sqs_listener.rs (1)

47-69: 🧹 Nitpick | 🔵 Trivial

MAX_WAIT_TIME_SEC duplicated as defense-in-depth is acceptable but could reference the shared constant.

The local MAX_WAIT_TIME_SEC on line 49 duplicates the 20-second bound already enforced by ValidatedSqsListenerConfig. As defense-in-depth this is reasonable, but if you extract the validation bounds into constants (as suggested in the config file review), this could reference the same constant to stay in sync.

🤖 Fix all issues with AI agents
In `@components/log-ingestor/src/ingestion_job_manager.rs`:
- Around line 156-159: Update the doc comment for the function that currently
mentions Error::InvalidNumConcurrentListenerTasks to instead reference the
actual error variants returned by validation: Error::InvalidConfig(ConfigError)
and Error::InvalidSqsWaitTime; ensure the sentence that lists forwarded failures
from Self::create_s3_ingestion_job includes both of these validation errors and
clearly states they come from config/validation checks so readers can locate the
real variants (e.g., mention Error::InvalidConfig(ConfigError) and
Error::InvalidSqsWaitTime alongside the forwarded create_s3_ingestion_job
errors).

In `@components/log-ingestor/src/routes.rs`:
- Around line 211-212: The API description string for the SQS listener error
currently only references invalid concurrent listener tasks but omits the
wait_time_sec validation; update the description used in routes.rs (the
description string that references ConfigError) to either mention both
ConfigError::InvalidNumConcurrentListenerTasks and
ConfigError::InvalidSqsWaitTime explicitly or replace it with a more general
phrase such as "invalid configuration (e.g., invalid number of concurrent
listener tasks or invalid wait_time_sec)". Ensure the new text clearly signals
both possible validation failures so the documented error matches the
ConfigError variants.
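
As a rough illustration of an error mapping that covers both validation failures, the sketch below maps each `ConfigError` variant to a 400 status with its own message. It is not the code in routes.rs: the `u16` payloads, the `Internal` variant, the `to_response` function, and the message wording are all assumptions.

```rust
/// Validation failures named in the review; the `u16` payloads are assumptions.
#[derive(Debug)]
enum ConfigError {
    InvalidNumConcurrentListenerTasks(u16),
    InvalidSqsWaitTime(u16),
}

/// Hypothetical subset of the job manager's error type.
#[derive(Debug)]
enum Error {
    InvalidConfig(ConfigError),
    Internal(String),
}

const BAD_REQUEST: u16 = 400;
const INTERNAL_SERVER_ERROR: u16 = 500;

/// Hypothetical mapping from an error to an HTTP status code and description.
fn to_response(err: &Error) -> (u16, String) {
    match err {
        Error::InvalidConfig(ConfigError::InvalidNumConcurrentListenerTasks(n)) => (
            BAD_REQUEST,
            format!("invalid `num_concurrent_listener_tasks`: {n}"),
        ),
        Error::InvalidConfig(ConfigError::InvalidSqsWaitTime(sec)) => {
            (BAD_REQUEST, format!("invalid `wait_time_sec`: {sec}"))
        }
        Error::Internal(msg) => (INTERNAL_SERVER_ERROR, msg.clone()),
    }
}
```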

Contributor

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/log-ingestor/src/ingestion_job_manager.rs (1)

146-186: ⚠️ Potential issue | 🟡 Minor

Doc comment for create_sqs_listener_job is incomplete — CustomEndpointUrlNotSupported is not listed.

The # Errors section (lines 154–157) documents forwarded errors from create_s3_ingestion_job and validate_and_create, but omits Error::CustomEndpointUrlNotSupported which is explicitly returned at line 165. While the omission may predate this PR, the doc was partially rewritten here (line 157), so it's a good time to complete it.

📝 Suggested doc fix
 /// # Errors
 ///
 /// Returns an error if:
 ///
+/// * [`Error::CustomEndpointUrlNotSupported`] if a custom endpoint URL is given.
 /// * Forwards [`Self::create_s3_ingestion_job`]'s return values on failure.
 /// * Forwards [`ValidatedSqsListenerConfig::validate_and_create`]'s return values on failure.

@LinZhihao-723 LinZhihao-723 merged commit 5f2e11b into y-scope:main Feb 16, 2026
22 checks passed
@junhaoliao junhaoliao added this to the February 2026 milestone Feb 26, 2026