
feat(clp-package): Add support for ingesting from custom S3 endpoints using log-ingestor; Add support for streaming search results from custom S3 endpoints using the API server. #1776

Merged
LinZhihao-723 merged 15 commits into y-scope:main from sudheergajula:log-ingestion
Dec 20, 2025

Conversation

@sudheergajula
Contributor

@sudheergajula sudheergajula commented Dec 15, 2025

Summary:
This PR includes the following fixes:

  1. ingestion.rs: add a placeholder field for the endpoint.
  2. client.rs: create the S3 client based on the endpoint and normalise the endpoint URL.

Test Plan - Tested log ingestion with custom S3 endpoints.
Payload:
curl -X POST http://host:3002/s3_scanner \
  -H "Content-Type: application/json" \
  -d '{
    "region": "us-east-1",
    "bucket_name": "logs-pub",
    "key_prefix": "logs/",
    "dataset": "default",
    "timestamp_key": "ts",
    "unstructured": false,
    "scanning_interval_sec": 1800,
    "start_after": "2025-01-01T00:00:00Z",
    "endpoint_url": "http://minio.com:9000"
  }'

Description

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Summary by CodeRabbit

  • New Features

    • Ingestion and client configs now accept an optional region and an optional custom endpoint URL; a global default AWS region is provided.
  • Bug Fixes / Behavior

    • Custom endpoints are threaded through ingestion, scanning and compression; conflict detection now considers endpoint. SQS listener rejects unsupported custom endpoints; S3 operations validate region/endpoint presence and surface clear errors.
  • Documentation

    • OpenAPI updated to document new BAD_REQUEST cases.
  • Tests

    • Tests updated for optional region and endpoint handling.


@sudheergajula sudheergajula requested a review from a team as a code owner December 15, 2025 08:29
@coderabbitai
Contributor

coderabbitai bot commented Dec 15, 2025

Walkthrough

S3 region changed from a required String to an optional NonEmptyString and a new optional endpoint_url (Option<NonEmptyString>) was added; these types and the new AWS_DEFAULT_REGION constant are propagated through clients, ingestion job management, API usage, tests, and manifests.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Config: ingestion/base | components/clp-rust-utils/src/job_config/ingestion.rs | BaseConfig.region: String → region: Option<NonEmptyString>; added endpoint_url: Option<NonEmptyString> with serde(default). |
| Core config types & tests | components/clp-rust-utils/src/clp_config/s3_config.rs, components/clp-rust-utils/tests/clp_config_test.rs | S3Config.region_code: String → Option<NonEmptyString>; added endpoint_url: Option<NonEmptyString>; tests updated for nullable serialization. |
| AWS defaults & lib exports | components/clp-rust-utils/src/aws.rs, components/clp-rust-utils/src/lib.rs | Added pub const AWS_DEFAULT_REGION: &str = "us-east-1" and pub mod aws;. |
| S3 client changes | components/clp-rust-utils/src/s3/client.rs | create_new_client now accepts endpoint: Option<&NonEmptyString>; region set inline with .region(Some(Region::new(region_id.to_string()))); endpoint mapped via ToString::to_string. |
| SQS client changes | components/clp-rust-utils/src/sqs/client.rs | create_new_client signature updated to accept endpoint: Option<&NonEmptyString>; region set inline; endpoint handled via ToString::to_string. |
| AWS client manager (log-ingestor) | components/log-ingestor/src/aws_client_manager.rs | SqsClientWrapper::create and S3ClientWrapper::create now accept region: Option<&NonEmptyString>; S3ClientWrapper::create also accepts endpoint_url: Option<&NonEmptyString> and forwards to clients (defaults to AWS_DEFAULT_REGION when None). |
| Ingestion job manager | components/log-ingestor/src/ingestion_job_manager.rs | Added error variants CustomEndpointUrlNotSupported(String) and MissingRegionCode; require either endpoint_url or region for S3 jobs; reject SQS jobs with custom endpoint; include endpoint_url in conflict checks and persisted entries. |
| Compression job wiring | components/log-ingestor/src/compression/compression_job_submitter.rs | Propagates ingestion_job_config.endpoint_url into constructed S3 input config. |
| API server client usage | components/api-server/src/client.rs | fetch_results_from_s3 now returns Result<Stream>; validates region/endpoint presence and uses s3_config.region_code.as_ref().map_or(AWS_DEFAULT_REGION, NonEmptyString::as_str) and s3_config.endpoint_url.as_ref() when creating S3 client. |
| Tests / test helpers | components/log-ingestor/tests/*, components/log-ingestor/tests/aws_config.rs, components/log-ingestor/tests/test_ingestion_job.rs | Tests updated to construct NonEmptyString for region/endpoint, pass Some(&...) / .as_ref(), and populate BaseConfig.endpoint_url; aws test helper changed to use NonEmptyString. |
| Cargo manifest | components/api-server/Cargo.toml | Added dependency non-empty-string = { version = "0.2.6", features = ["serde"] }. |
| Routes / error mapping | components/log-ingestor/src/routes.rs | Map CustomEndpointUrlNotSupported and MissingRegionCode to HTTP 400 and added BAD_REQUEST documentation for related endpoints. |
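
The validation and defaulting rules summarized in the table can be sketched in plain Rust. This is a minimal illustration, not the PR's code: `BaseConfig` is reduced to two `Option<String>` fields (the PR uses `Option<NonEmptyString>`), and the function names `validate_s3_job`, `validate_sqs_job`, `effective_region`, and `http_status` are hypothetical. Only the error variant names, the `AWS_DEFAULT_REGION` value, and the HTTP 400 mapping come from the walkthrough.

```rust
/// Default region named in the walkthrough.
pub const AWS_DEFAULT_REGION: &str = "us-east-1";

/// Simplified, hypothetical stand-in for the ingestion job's base config.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct BaseConfig {
    pub region: Option<String>,
    pub endpoint_url: Option<String>,
}

/// Error variants named in the walkthrough.
#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    CustomEndpointUrlNotSupported(String),
    MissingRegionCode,
}

/// S3 jobs need at least one of: a region or a custom endpoint.
pub fn validate_s3_job(config: &BaseConfig) -> Result<(), Error> {
    if config.region.is_none() && config.endpoint_url.is_none() {
        return Err(Error::MissingRegionCode);
    }
    Ok(())
}

/// SQS listener jobs are rejected when a custom endpoint is set.
pub fn validate_sqs_job(config: &BaseConfig) -> Result<(), Error> {
    match &config.endpoint_url {
        Some(url) => Err(Error::CustomEndpointUrlNotSupported(url.clone())),
        None => Ok(()),
    }
}

/// Region handed to the client builder: the explicit one, else the default.
pub fn effective_region(config: &BaseConfig) -> &str {
    config.region.as_deref().unwrap_or(AWS_DEFAULT_REGION)
}

/// Both variants map to HTTP 400 (BAD_REQUEST).
pub fn http_status(error: &Error) -> u16 {
    match error {
        Error::CustomEndpointUrlNotSupported(_) | Error::MissingRegionCode => 400,
    }
}
```

With a config that sets only `endpoint_url`, `validate_s3_job` passes and `effective_region` falls back to `us-east-1`, which matches the MinIO-style payload in the test plan if its `region` were omitted.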

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

  • Verify all call sites consistently use Option<&NonEmptyString> and that borrows compile without unnecessary clones.
  • Confirm serde default behaviour and tests for nullable endpoint_url/region produce expected payloads.
  • Ensure AWS_DEFAULT_REGION is applied wherever region is None (client builders, API usage).
  • Review ingestion conflict detection and persistence to ensure endpoint_url comparisons handle None vs Some correctly.
  • Check new error variants and routes mapping for consistent messages and OpenAPI documentation.
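
For the conflict-detection item above, comparing `Option` values directly already separates `None` (AWS) from `Some(url)` (custom store). The sketch below is an assumption-laden illustration: `JobKey`, `conflicts`, and the prefix-overlap rule are hypothetical, since the walkthrough only states that `endpoint_url` is included in conflict checks.

```rust
/// Hypothetical key describing what an ingestion job reads from.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct JobKey {
    pub endpoint_url: Option<String>,
    pub bucket_name: String,
    pub key_prefix: String,
}

/// Two jobs are treated as conflicting when they target the same endpoint and
/// bucket and one key prefix contains the other. `Option` equality means
/// `None` never matches `Some(_)`, so an AWS job and a MinIO job on an
/// identically named bucket do not conflict.
pub fn conflicts(a: &JobKey, b: &JobKey) -> bool {
    a.endpoint_url == b.endpoint_url
        && a.bucket_name == b.bucket_name
        && (a.key_prefix.starts_with(b.key_prefix.as_str())
            || b.key_prefix.starts_with(a.key_prefix.as_str()))
}
```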

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately describes the main changes: adding support for custom S3 endpoints in both log-ingestor and API server components, which is the primary focus of the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/log-ingestor/src/aws_client_manager.rs (1)

51-56: Consider adding endpoint support to SqsClientWrapper::create for consistency.

S3ClientWrapper::create now accepts an endpoint_url parameter, but SqsClientWrapper::create still hardcodes None. This asymmetry means SQS listener jobs created via create_sqs_listener_job cannot use custom endpoints (e.g., LocalStack), even though the underlying sqs::create_new_client supports it. The test on Line 165 works around this by calling the low-level function directly.

-    pub async fn create(region: &str, access_key_id: &str, secret_access_key: &str) -> Self {
+    pub async fn create(region: &str, access_key_id: &str, secret_access_key: &str, endpoint_url: Option<&str>) -> Self {
         let sqs_client =
-            clp_rust_utils::sqs::create_new_client(region, access_key_id, secret_access_key, None)
+            clp_rust_utils::sqs::create_new_client(region, access_key_id, secret_access_key, endpoint_url)
                 .await;
         Self::from(sqs_client)
     }

This would also require updating create_sqs_listener_job in ingestion_job_manager.rs to pass config.base.endpoint_url.as_deref().

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 67e6831 and fab04f0.

📒 Files selected for processing (5)
  • components/clp-rust-utils/src/job_config/ingestion.rs (1 hunks)
  • components/clp-rust-utils/src/s3/client.rs (1 hunks)
  • components/log-ingestor/src/aws_client_manager.rs (1 hunks)
  • components/log-ingestor/src/ingestion_job_manager.rs (1 hunks)
  • components/log-ingestor/tests/test_ingestion_job.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
📚 Learning: 2025-12-10T23:32:24.363Z

Applied to files:

  • components/clp-rust-utils/src/s3/client.rs
  • components/clp-rust-utils/src/job_config/ingestion.rs
  • components/log-ingestor/src/ingestion_job_manager.rs
  • components/log-ingestor/src/aws_client_manager.rs
  • components/log-ingestor/tests/test_ingestion_job.rs
🧬 Code graph analysis (1)
components/log-ingestor/src/aws_client_manager.rs (2)
components/clp-rust-utils/src/s3/client.rs (1)
  • create_new_client (18-43)
components/clp-rust-utils/src/sqs/client.rs (1)
  • create_new_client (14-34)
🔇 Additional comments (6)
components/clp-rust-utils/src/s3/client.rs (1)

37-40: LGTM!

The conditional handling of the optional endpoint is clean and correctly integrates with the existing builder pattern.

components/log-ingestor/tests/test_ingestion_job.rs (2)

183-184: LGTM!

The endpoint_url field is correctly populated for testing custom S3 endpoints.


259-260: LGTM!

The endpoint_url is correctly set for the S3 scanner test configuration.

components/log-ingestor/src/aws_client_manager.rs (1)

77-82: LGTM on S3 endpoint support.

The endpoint_url parameter is correctly added and forwarded to the underlying client creation.

components/log-ingestor/src/ingestion_job_manager.rs (2)

106-119: LGTM on S3 scanner endpoint propagation.

The endpoint_url is correctly extracted from the configuration and passed to the S3 client wrapper.


132-150: SQS listener doesn't use custom endpoint.

If the SqsClientWrapper::create is updated to support endpoint_url as suggested in aws_client_manager.rs, this function should also pass config.base.endpoint_url.as_deref() for consistency with S3 scanner jobs.
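
The `.as_deref()` forwarding mentioned here is a standard `Option<String>` to `Option<&str>` conversion. A minimal sketch follows, with a hypothetical `describe_client` standing in for the real client factory and a reduced `BaseConfig` (at this revision `region` is still a plain `String`):

```rust
/// Hypothetical stand-in for a client factory that takes an optional endpoint.
pub fn describe_client(region: &str, endpoint_url: Option<&str>) -> String {
    match endpoint_url {
        Some(url) => format!("region={region} endpoint={url}"),
        None => format!("region={region} endpoint=<AWS default>"),
    }
}

/// Reduced config: only the two fields relevant to client creation.
pub struct BaseConfig {
    pub region: String,
    pub endpoint_url: Option<String>,
}

/// Borrows the owned `Option<String>` as `Option<&str>` without cloning.
pub fn describe_job_client(config: &BaseConfig) -> String {
    describe_client(config.region.as_str(), config.endpoint_url.as_deref())
}
```

Later revisions of the PR switch the parameter to `Option<&NonEmptyString>`, where `.as_ref()` plays the same borrowing role.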

@sudheergajula sudheergajula force-pushed the log-ingestion branch 4 times, most recently from 4b372db to a5a03f9 Compare December 15, 2025 08:56
@hoophalab hoophalab self-requested a review December 15, 2025 18:24
@LinZhihao-723 LinZhihao-723 self-requested a review December 15, 2025 19:57
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/log-ingestor/src/ingestion_job_manager.rs (1)

107-119: Pass endpoint_url to SqsClientWrapper::create for consistency with S3 and support for SQS-compatible endpoints.

The S3 scanner job passes endpoint_url to the client wrapper (line 111), but create_sqs_listener_job (lines 134–137) does not pass config.base.endpoint_url to SqsClientWrapper::create. Since SqsListenerConfig inherits endpoint_url via BaseConfig and the underlying clp_rust_utils::sqs::create_new_client accepts an optional endpoint parameter, SqsClientWrapper::create should also accept and pass through the endpoint URL. This ensures consistency with S3-compatible stores that provide both S3 and SQS APIs (e.g., LocalStack).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eff5b96 and 80073c4.

📒 Files selected for processing (5)
  • components/clp-rust-utils/src/job_config/ingestion.rs (1 hunks)
  • components/clp-rust-utils/src/s3/client.rs (1 hunks)
  • components/log-ingestor/src/aws_client_manager.rs (1 hunks)
  • components/log-ingestor/src/ingestion_job_manager.rs (1 hunks)
  • components/log-ingestor/tests/test_ingestion_job.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
📚 Learning: 2025-12-10T23:32:24.363Z

Applied to files:

  • components/clp-rust-utils/src/job_config/ingestion.rs
  • components/clp-rust-utils/src/s3/client.rs
  • components/log-ingestor/tests/test_ingestion_job.rs
  • components/log-ingestor/src/aws_client_manager.rs
🧬 Code graph analysis (1)
components/log-ingestor/src/aws_client_manager.rs (2)
components/clp-rust-utils/src/s3/client.rs (1)
  • create_new_client (18-43)
components/clp-rust-utils/src/sqs/client.rs (1)
  • create_new_client (14-34)
🔇 Additional comments (5)
components/log-ingestor/tests/test_ingestion_job.rs (2)

176-186: LGTM!

The endpoint configuration is correctly populated in the BaseConfig for the SQS listener test. The .clone() is necessary since aws_config.endpoint is used again later in the test.


251-265: LGTM!

The endpoint configuration is correctly populated in the BaseConfig for the S3 scanner test. The move (without .clone()) is appropriate here since this is the final use of aws_config.endpoint.

components/clp-rust-utils/src/s3/client.rs (1)

37-40: LGTM!

The conditional endpoint handling is correctly implemented. The use of force_path_style(true) (Line 36) is appropriate for S3-compatible endpoints like MinIO and LocalStack.

components/log-ingestor/src/aws_client_manager.rs (1)

77-82: LGTM!

The S3ClientWrapper::create method correctly accepts and forwards the optional endpoint_url parameter to the underlying client creation function. This public API change is appropriately propagated throughout the codebase.

components/clp-rust-utils/src/job_config/ingestion.rs (1)

17-18: The endpoint_url field correctly supports custom S3-compatible stores. However, the region field remains required (String, not Option) even when using custom endpoints. In the AWS SDK for Rust, when a custom endpoint is used, the Region is only used for signing and is not used to route the request. This is the expected behaviour: region is needed for request signing, while the optional endpoint_url parameter determines where the request is actually routed. The current implementation correctly reflects this AWS SDK pattern.

Comment on lines +45 to +50
fn normalize_endpoint(endpoint: &str) -> String {
    match endpoint {
        ep if ep.trim().starts_with("http://") || ep.trim().starts_with("https://") => ep.into(),
        ep => format!("https://{ep}"),
    }
}
Contributor

⚠️ Potential issue | 🟡 Minor

Fix case-sensitivity and trimming inconsistencies in scheme detection.

Two issues in the endpoint normalization:

  1. Case-sensitivity: The scheme check won't match uppercase variants like "HTTP://" or "HTTPS://", causing them to receive an additional "https://" prefix.
  2. Trim inconsistency: The function uses trim() in the check but returns the original untrimmed ep. An endpoint like " http://example.com " would pass the check but be returned with leading/trailing whitespace intact.

Apply this diff to fix both issues:

 fn normalize_endpoint(endpoint: &str) -> String {
-    match endpoint {
-        ep if ep.trim().starts_with("http://") || ep.trim().starts_with("https://") => ep.into(),
-        ep => format!("https://{ep}"),
-    }
+    let trimmed = endpoint.trim();
+    let lower = trimmed.to_ascii_lowercase();
+    if lower.starts_with("http://") || lower.starts_with("https://") {
+        trimmed.to_string()
+    } else {
+        format!("https://{trimmed}")
+    }
 }
🤖 Prompt for AI Agents
In components/clp-rust-utils/src/s3/client.rs around lines 45 to 50, the
endpoint normalization is case-sensitive and returns the untrimmed input; fix by
first trimming the input into a local variable, perform the scheme check against
the lowercased trimmed value (e.g., starts_with "http://" or "https://"), and
return the trimmed endpoint (either as-is when it already contains a scheme or
prefixed with "https://" when it does not) so uppercase schemes and surrounding
whitespace are handled correctly.
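
To make the two reported issues concrete, the original helper and the suggested fix from this comment can be placed side by side. The function bodies are taken from the diff above; the name `normalize_endpoint_original` is introduced here only to keep both versions in one file.

```rust
/// The helper as originally submitted: the scheme check is case-sensitive and
/// the untrimmed input is returned.
pub fn normalize_endpoint_original(endpoint: &str) -> String {
    match endpoint {
        ep if ep.trim().starts_with("http://") || ep.trim().starts_with("https://") => ep.into(),
        ep => format!("https://{ep}"),
    }
}

/// The suggested fix: trim first, compare the scheme case-insensitively, and
/// return the trimmed value.
pub fn normalize_endpoint(endpoint: &str) -> String {
    let trimmed = endpoint.trim();
    let lower = trimmed.to_ascii_lowercase();
    if lower.starts_with("http://") || lower.starts_with("https://") {
        trimmed.to_string()
    } else {
        format!("https://{trimmed}")
    }
}
```

For example, `normalize_endpoint_original("HTTP://example.com")` yields `https://HTTP://example.com`, while the fixed version returns the input unchanged.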

Member

@LinZhihao-723 LinZhihao-723 left a comment

Hi @sudheergajula, thanks for your contribution!
As #1767 has been merged, I realize we need to modify/update a few more places in the code base to make log-ingestor fully support custom S3 endpoints. Would you mind if I directly push the rest of modifications into this PR so we can make it in the end-of-year release?

Comment on lines +45 to +50
fn normalize_endpoint(endpoint: &str) -> String {
    match endpoint {
        ep if ep.trim().starts_with("http://") || ep.trim().starts_with("https://") => ep.into(),
        ep => format!("https://{ep}"),
    }
}
Member

This method might not be necessary. A URL without http|https isn't considered an endpoint URL by AWS-CLI. I think we should leave this to users to provide a valid scheme.
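
If a caller wanted to fail fast on a missing scheme instead of rewriting the URL, a check along the following lines would do it. This is purely a hypothetical illustration: the PR does not show such a check, and `require_http_scheme` and `MissingScheme` are invented names.

```rust
/// Hypothetical error carrying the rejected endpoint.
#[derive(Debug, PartialEq, Eq)]
pub struct MissingScheme(pub String);

/// Accepts the endpoint unchanged when it starts with http:// or https://
/// (case-insensitive); otherwise reports it instead of guessing a scheme.
pub fn require_http_scheme(endpoint: &str) -> Result<&str, MissingScheme> {
    let lower = endpoint.to_ascii_lowercase();
    if lower.starts_with("http://") || lower.starts_with("https://") {
        Ok(endpoint)
    } else {
        Err(MissingScheme(endpoint.to_string()))
    }
}
```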

Contributor Author

Agreed. This was added to handle the http:// prefix for generate_s3_virtual_hosted_style_url; I think it is now addressed by generate_s3_url.

Sudheer Gajula added 2 commits December 17, 2025 11:07
…ages.

Summary:
This PR includes fixes for following:
  1. ingestion.rs to have place holder for endpoint.
  2. client.rs to create s3 client based on endpoint and normalise endpoint-url.

Test Plan - Tested log ingestion with custom s3 endpoints
Contributor Author

@sudheergajula sudheergajula left a comment

Addressed review comments

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/log-ingestor/src/ingestion_job_manager.rs (1)

132-139: Critical: SQS client creation is missing endpoint_url parameter.

The create_sqs_listener_job method creates an SQS client without passing config.base.endpoint_url, even though:

  1. The config contains endpoint_url in config.base
  2. The underlying clp_rust_utils::sqs::create_new_client supports an endpoint parameter
  3. The S3 client creation (line 111) correctly passes the endpoint_url

This inconsistency will prevent SQS listeners from working with custom S3-compatible endpoints like MinIO or LocalStack.

Apply this diff to pass the endpoint_url to the SQS client:

     let sqs_client_manager = SqsClientWrapper::create(
         config.base.region.as_str(),
         self.inner.aws_credentials.access_key_id.as_str(),
         self.inner.aws_credentials.secret_access_key.as_str(),
+        config.base.endpoint_url.as_ref(),
     )
     .await;

Note: This also requires updating SqsClientWrapper::create in aws_client_manager.rs to accept the endpoint_url parameter.

components/log-ingestor/src/aws_client_manager.rs (1)

52-57: Critical: SqsClientWrapper::create is missing endpoint_url parameter.

The SqsClientWrapper::create method doesn't accept an endpoint_url parameter, even though:

  1. The underlying clp_rust_utils::sqs::create_new_client supports it (line 54 shows it's called with None)
  2. The parallel S3ClientWrapper::create correctly accepts and forwards endpoint_url (line 78)

This prevents SQS clients from connecting to custom endpoints like MinIO or LocalStack.

Apply this diff to add endpoint_url support:

-    pub async fn create(region: &str, access_key_id: &str, secret_access_key: &str) -> Self {
+    pub async fn create(
+        region: &str,
+        access_key_id: &str,
+        secret_access_key: &str,
+        endpoint_url: Option<&NonEmptyString>,
+    ) -> Self {
         let sqs_client =
-            clp_rust_utils::sqs::create_new_client(region, access_key_id, secret_access_key, None)
+            clp_rust_utils::sqs::create_new_client(region, access_key_id, secret_access_key, endpoint_url)
                 .await;
         Self::from(sqs_client)
     }
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 80073c4 and 674db3b.

📒 Files selected for processing (7)
  • components/clp-rust-utils/src/job_config/ingestion.rs (1 hunks)
  • components/clp-rust-utils/src/s3/client.rs (3 hunks)
  • components/clp-rust-utils/src/sqs/client.rs (3 hunks)
  • components/log-ingestor/src/aws_client_manager.rs (2 hunks)
  • components/log-ingestor/src/ingestion_job_manager.rs (1 hunks)
  • components/log-ingestor/tests/aws_config.rs (2 hunks)
  • components/log-ingestor/tests/test_ingestion_job.rs (6 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
📚 Learning: 2025-12-10T23:32:24.363Z

Applied to files:

  • components/log-ingestor/src/aws_client_manager.rs
  • components/clp-rust-utils/src/job_config/ingestion.rs
  • components/clp-rust-utils/src/sqs/client.rs
  • components/clp-rust-utils/src/s3/client.rs
  • components/log-ingestor/tests/test_ingestion_job.rs
📚 Learning: 2025-09-17T22:51:15.765Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1304
File: components/webui/client/src/sql-parser/index.ts:98-115
Timestamp: 2025-09-17T22:51:15.765Z
Learning: In the CLP webui codebase, when designing APIs that accept optional string parameters, developers should omit the value (pass undefined) rather than pass empty strings. The API contract should be clear that empty strings are not valid inputs - only undefined should be used to indicate omission of optional clauses.

Applied to files:

  • components/clp-rust-utils/src/job_config/ingestion.rs
📚 Learning: 2025-01-17T23:25:38.165Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 673
File: components/log-viewer-webui/server/src/S3Manager.js:0-0
Timestamp: 2025-01-17T23:25:38.165Z
Learning: In S3Manager.js, URL validation is handled by the URL constructor which throws TypeError for invalid URLs, making additional scheme (s3://) validation redundant.

Applied to files:

  • components/clp-rust-utils/src/s3/client.rs
📚 Learning: 2025-04-25T20:46:20.140Z

Applied to files:

  • components/clp-rust-utils/src/s3/client.rs
🧬 Code graph analysis (2)
components/log-ingestor/src/aws_client_manager.rs (2)
components/clp-rust-utils/src/s3/client.rs (1)
  • create_new_client (19-43)
components/clp-rust-utils/src/sqs/client.rs (1)
  • create_new_client (15-37)
components/log-ingestor/tests/test_ingestion_job.rs (1)
components/clp-rust-utils/src/types.rs (1)
  • from_string (26-28)
🔇 Additional comments (5)
components/clp-rust-utils/src/sqs/client.rs (1)

6-6: LGTM! Consistent endpoint type handling.

The changes properly update the endpoint parameter to use NonEmptyString and apply conditional endpoint configuration, matching the pattern used in the S3 client.

Also applies to: 19-19, 33-35

components/clp-rust-utils/src/s3/client.rs (1)

6-6: LGTM! Consistent endpoint type handling.

The endpoint parameter type change to NonEmptyString and conditional configuration logic mirror the SQS client implementation, providing consistent handling across AWS clients.

Also applies to: 23-23, 38-40

components/log-ingestor/tests/aws_config.rs (1)

13-13: LGTM! Proper validation for test configuration.

The endpoint field is correctly changed to Option<NonEmptyString> with appropriate error handling for empty values.

Also applies to: 67-68
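
The "absent is fine, empty is an error" handling described here can be sketched with a small newtype. The `NonEmptyString` below is a local stand-in written for this example, not the API of the `non-empty-string` crate the PR depends on, and `parse_endpoint` is a hypothetical helper.

```rust
/// Error returned when an empty string is supplied.
#[derive(Debug, PartialEq, Eq)]
pub struct EmptyStringError;

/// Local stand-in for a string type that is guaranteed to be non-empty.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NonEmptyString(String);

impl NonEmptyString {
    pub fn new(value: String) -> Result<Self, EmptyStringError> {
        if value.is_empty() {
            Err(EmptyStringError)
        } else {
            Ok(Self(value))
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// `None` stays `None`; `Some("")` is an error; anything else is wrapped.
/// `transpose` turns `Option<Result<_, _>>` into `Result<Option<_>, _>`.
pub fn parse_endpoint(raw: Option<String>) -> Result<Option<NonEmptyString>, EmptyStringError> {
    raw.map(NonEmptyString::new).transpose()
}
```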

components/clp-rust-utils/src/job_config/ingestion.rs (1)

20-24: LGTM! Well-documented and properly configured field.

The endpoint_url field is correctly added with:

  • Clear documentation explaining its purpose for custom S3-compatible stores
  • Appropriate serde(default) for optional deserialization
  • Consistent schema annotation matching other optional string fields
components/log-ingestor/tests/test_ingestion_job.rs (1)

165-166: LGTM! Consistent endpoint handling in tests.

The test code correctly uses .as_ref() for endpoint parameters and properly populates endpoint_url in the configuration structures, aligning with the new endpoint handling throughout the codebase.

Also applies to: 183-183, 201-202, 247-248, 259-259

@sudheergajula
Contributor Author

Hi @sudheergajula, thanks for your contribution! As #1767 has been merged, I realize we need to modify/update a few more places in the code base to make log-ingestor fully support custom S3 endpoints. Would you mind if I directly push the rest of modifications into this PR so we can make it in the end-of-year release?

I don't have any objections

@LinZhihao-723 LinZhihao-723 changed the title from "refactor(clp-package): Fix to support endpoint-url for custom s3 stor…" to "feat(log-ingestor): Add support for ingesting from custom S3 endpoints." Dec 19, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 674db3b and f82e682.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • docs/src/_static/generated/log-ingestor-openapi.json is excluded by !**/generated/**
📒 Files selected for processing (15)
  • components/api-server/Cargo.toml (1 hunks)
  • components/api-server/src/client.rs (2 hunks)
  • components/clp-rust-utils/src/aws.rs (1 hunks)
  • components/clp-rust-utils/src/clp_config/s3_config.rs (1 hunks)
  • components/clp-rust-utils/src/job_config/ingestion.rs (1 hunks)
  • components/clp-rust-utils/src/lib.rs (1 hunks)
  • components/clp-rust-utils/src/s3/client.rs (3 hunks)
  • components/clp-rust-utils/src/sqs/client.rs (3 hunks)
  • components/clp-rust-utils/tests/clp_config_test.rs (3 hunks)
  • components/log-ingestor/src/aws_client_manager.rs (3 hunks)
  • components/log-ingestor/src/compression/compression_job_submitter.rs (1 hunks)
  • components/log-ingestor/src/ingestion_job_manager.rs (6 hunks)
  • components/log-ingestor/src/routes.rs (2 hunks)
  • components/log-ingestor/tests/aws_config.rs (2 hunks)
  • components/log-ingestor/tests/test_ingestion_job.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.

Applied to files:

  • components/api-server/src/client.rs
  • components/clp-rust-utils/src/aws.rs
  • components/clp-rust-utils/src/clp_config/s3_config.rs
  • components/clp-rust-utils/tests/clp_config_test.rs
  • components/clp-rust-utils/src/s3/client.rs
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

Applied to files:

  • components/api-server/src/client.rs
  • components/clp-rust-utils/src/clp_config/s3_config.rs
  • components/clp-rust-utils/src/sqs/client.rs
  • components/log-ingestor/src/aws_client_manager.rs
  • components/log-ingestor/tests/aws_config.rs
  • components/clp-rust-utils/src/job_config/ingestion.rs
  • components/log-ingestor/tests/test_ingestion_job.rs
  • components/clp-rust-utils/tests/clp_config_test.rs
  • components/clp-rust-utils/src/s3/client.rs
  • components/log-ingestor/src/ingestion_job_manager.rs
📚 Learning: 2025-09-17T22:51:15.765Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1304
File: components/webui/client/src/sql-parser/index.ts:98-115
Timestamp: 2025-09-17T22:51:15.765Z
Learning: In the CLP webui codebase, when designing APIs that accept optional string parameters, developers should omit the value (pass undefined) rather than pass empty strings. The API contract should be clear that empty strings are not valid inputs - only undefined should be used to indicate omission of optional clauses.

Applied to files:

  • components/clp-rust-utils/src/job_config/ingestion.rs
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.

Applied to files:

  • components/log-ingestor/tests/test_ingestion_job.rs
📚 Learning: 2025-01-17T23:25:38.165Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 673
File: components/log-viewer-webui/server/src/S3Manager.js:0-0
Timestamp: 2025-01-17T23:25:38.165Z
Learning: In S3Manager.js, URL validation is handled by the URL constructor which throws TypeError for invalid URLs, making additional scheme (s3://) validation redundant.

Applied to files:

  • components/clp-rust-utils/src/s3/client.rs
🧬 Code graph analysis (4)
components/log-ingestor/src/aws_client_manager.rs (2)
components/clp-rust-utils/src/s3/client.rs (1)
  • create_new_client (19-40)
components/clp-rust-utils/src/sqs/client.rs (1)
  • create_new_client (15-34)
components/log-ingestor/tests/test_ingestion_job.rs (1)
components/clp-rust-utils/src/types.rs (1)
  • from_string (26-28)
components/clp-rust-utils/tests/clp_config_test.rs (1)
components/clp-rust-utils/src/types.rs (1)
  • from_static_str (14-16)
components/log-ingestor/src/ingestion_job_manager.rs (1)
components/log-ingestor/src/aws_client_manager.rs (2)
  • create (53-66)
  • create (87-101)
🔇 Additional comments (19)
components/clp-rust-utils/src/aws.rs (1)

1-1: LGTM! Default region constant aligns with AWS conventions.

The choice of us-east-1 as the default region is appropriate and consistent with established practices for handling S3 configurations without explicit region specifications.

Based on learnings, this default ensures compatibility with legacy endpoints and S3 client creation paths that require a region string.

components/log-ingestor/src/compression/compression_job_submitter.rs (1)

69-69: LGTM! Endpoint URL properly propagated.

The endpoint_url is correctly forwarded from the ingestion job configuration to the S3 input configuration, enabling custom S3 endpoint support for compression jobs.

components/log-ingestor/src/routes.rs (2)

92-94: LGTM! Error handling correctly maps new variant.

The CustomEndpointUrlNotSupported error is appropriately mapped to HTTP 400 BAD_REQUEST, ensuring proper client feedback when custom endpoints are unsupported for specific job types.
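
As a rough illustration of the mapping described above, the sketch below matches on an error enum and returns an HTTP status code. Only the `CustomEndpointUrlNotSupported` variant and its 400 mapping come from the review; the enum name `JobError`, the other variants, their status codes, and the use of a plain `u16` are assumptions made to keep the example dependency-free.

```rust
/// Hypothetical subset of ingestion-job errors. Only `CustomEndpointUrlNotSupported`
/// is named in the review; the other variants are illustrative.
#[derive(Debug)]
enum JobError {
    CustomEndpointUrlNotSupported,
    PrefixConflict,
    Internal,
}

/// Maps an error to an HTTP status code (a plain `u16` instead of a framework type).
fn status_code(err: &JobError) -> u16 {
    match err {
        // Client-side mistake: the request asked for an unsupported combination.
        JobError::CustomEndpointUrlNotSupported => 400,
        // Assumed mapping, for illustration only.
        JobError::PrefixConflict => 409,
        // Assumed mapping, for illustration only.
        JobError::Internal => 500,
    }
}
```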


200-205: LGTM! OpenAPI documentation updated.

The BAD_REQUEST response documentation clearly communicates the limitation regarding custom endpoint URLs for SQS listener jobs.

components/clp-rust-utils/src/clp_config/s3_config.rs (1)

8-10: LGTM! Optional fields support custom S3 endpoints.

The changes correctly make region_code optional and add endpoint_url, enabling support for custom S3-compatible services (MinIO, LocalStack, etc.) that don't require AWS region codes.

Based on learnings, this aligns with PR #1767's approach where region_code is optional because custom endpoints use path-style URLs.

components/api-server/src/client.rs (1)

335-338: LGTM! Default region handling is correct.

The use of map_or(AWS_DEFAULT_REGION, ...) properly ensures a valid region string is always provided to the S3 client, falling back to us-east-1 when region_code is absent.

Based on learnings, this approach correctly addresses the requirement for a default region when none is specified.
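
The fallback can be sketched in isolation as follows. This assumes `String` in place of `NonEmptyString`, redefines the constant locally with the value stated in the review, and uses a hypothetical helper name `resolve_region`.

```rust
/// Local copy of the default region named in the review; the real constant
/// lives in `clp-rust-utils`.
const AWS_DEFAULT_REGION: &str = "us-east-1";

/// Returns the configured region, or the default when none is set.
/// `resolve_region` is a hypothetical helper, not a function from the PR.
fn resolve_region(region_code: Option<&String>) -> &str {
    region_code.map_or(AWS_DEFAULT_REGION, String::as_str)
}
```

`map_or` takes the default first and the mapping function second, so no intermediate `String` is allocated on either path.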

components/log-ingestor/tests/aws_config.rs (2)

13-16: LGTM! Test configuration uses validated types.

The update to NonEmptyString for endpoint and region fields ensures type-level validation and consistency with the production configuration structures.


67-71: LGTM! Validation ensures non-empty values.

The use of NonEmptyString::new(...).map_err(...) provides clear error messages when environment variables are set to empty strings, improving test configuration robustness.
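
The validation pattern described above can be sketched without external crates. `NonEmpty` below is a hypothetical stand-in for `NonEmptyString`, and `parse_required` is an assumed helper name; the error text is illustrative rather than the message used in the tests.

```rust
/// Hypothetical stand-in for `NonEmptyString`: a string that is never empty.
#[derive(Debug, Clone, PartialEq, Eq)]
struct NonEmpty(String);

impl NonEmpty {
    /// Returns the original value as the error when it is empty.
    fn new(value: String) -> Result<Self, String> {
        if value.is_empty() {
            Err(value)
        } else {
            Ok(Self(value))
        }
    }

    fn as_str(&self) -> &str {
        &self.0
    }
}

/// Validates a required setting and names the offending variable in the error.
fn parse_required(name: &str, raw: String) -> Result<NonEmpty, String> {
    NonEmpty::new(raw).map_err(|_| format!("{name} must not be empty"))
}
```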

components/clp-rust-utils/src/s3/client.rs (1)

6-6: LGTM!

The implementation correctly integrates the NonEmptyString wrapper for the endpoint parameter. The use of ToString::to_string for mapping is appropriate, and the direct Region construction is clean.

Also applies to: 19-40

components/clp-rust-utils/tests/clp_config_test.rs (1)

12-23: LGTM!

The test correctly reflects the updated S3Config structure with region_code as Option<NonEmptyString> and the new endpoint_url field. Serialization expectations are properly updated for both MessagePack and JSON formats.

Also applies to: 46-63

components/clp-rust-utils/src/job_config/ingestion.rs (1)

17-26: LGTM!

The addition of optional region and endpoint_url fields is well-implemented. The documentation clearly explains their purpose, and the serde(default) attributes ensure backward compatibility. This aligns with the learning that custom S3-compatible endpoints don't require AWS region codes.

Based on learnings, region is optional for custom S3 endpoints (MinIO, LocalStack).

components/clp-rust-utils/src/sqs/client.rs (1)

6-6: LGTM!

The implementation mirrors the S3 client changes and correctly integrates the NonEmptyString wrapper for the endpoint parameter. The pattern is consistent across both client types.

Also applies to: 15-34

components/log-ingestor/src/ingestion_job_manager.rs (3)

37-38: LGTM!

The CustomEndpointUrlNotSupported error variant is appropriately added to handle cases where custom endpoint URLs are not yet supported (e.g., SQS listener jobs).


109-122: LGTM!

The implementation correctly handles custom endpoint URLs:

  • S3 scanner jobs properly forward the endpoint_url to the client wrapper
  • SQS listener jobs explicitly reject custom endpoint URLs with a clear error message, which is appropriate since SQS listener support for custom endpoints is not yet implemented

Also applies to: 135-159
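
The per-job-type behaviour in the two bullets above could look roughly like this. `JobKind`, `ValidationError`, and `validate_endpoint` are hypothetical names; only the rule itself (S3 scanner accepts a custom endpoint, SQS listener rejects it) is taken from the review.

```rust
#[derive(Debug, PartialEq, Eq)]
enum JobKind {
    S3Scanner,
    SqsListener,
}

#[derive(Debug, PartialEq, Eq)]
enum ValidationError {
    CustomEndpointUrlNotSupported,
}

/// Rejects a custom endpoint for SQS listener jobs and accepts every other combination.
fn validate_endpoint(kind: &JobKind, endpoint_url: Option<&str>) -> Result<(), ValidationError> {
    match (kind, endpoint_url) {
        (JobKind::SqsListener, Some(_)) => Err(ValidationError::CustomEndpointUrlNotSupported),
        _ => Ok(()),
    }
}
```

Rejecting the request up front, rather than silently ignoring the endpoint, keeps the failure visible to the API caller.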


225-275: LGTM!

The conflict detection logic is correctly updated to include endpoint_url comparison. This ensures that ingestion jobs targeting different S3-compatible endpoints (even with the same bucket/prefix) are properly distinguished and don't conflict. The IngestionJobTableEntry structure is appropriately updated to store both optional region and endpoint_url.

Also applies to: 306-315
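
To make the endpoint-aware comparison concrete, here is a minimal sketch. It assumes two jobs conflict when they share the endpoint, region, bucket, and dataset and one key prefix is a prefix of the other; the struct `JobKey`, the function `conflicts`, and that exact overlap rule are assumptions, not the PR's implementation.

```rust
/// Hypothetical identity of an ingestion job for conflict checks.
#[derive(Debug, Clone, PartialEq, Eq)]
struct JobKey {
    endpoint_url: Option<String>,
    region: Option<String>,
    bucket_name: String,
    dataset: String,
    key_prefix: String,
}

/// Two jobs conflict only within the same endpoint/region/bucket/dataset scope,
/// and only when their key prefixes overlap (assumed rule).
fn conflicts(a: &JobKey, b: &JobKey) -> bool {
    a.endpoint_url == b.endpoint_url
        && a.region == b.region
        && a.bucket_name == b.bucket_name
        && a.dataset == b.dataset
        && (a.key_prefix.starts_with(b.key_prefix.as_str())
            || b.key_prefix.starts_with(a.key_prefix.as_str()))
}
```

With this rule, the same bucket and prefix on MinIO and on AWS S3 are treated as different sources, which matches the intent described in the review.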

components/log-ingestor/src/aws_client_manager.rs (2)

53-66: LGTM!

The SqsClientWrapper::create method correctly accepts an optional region parameter and properly defaults to AWS_DEFAULT_REGION when None. The use of map_or with NonEmptyString::as_str is clean and idiomatic.


87-101: LGTM!

The S3ClientWrapper::create method correctly handles both optional region and endpoint_url parameters. The region defaulting logic is consistent with the SQS wrapper, and the endpoint_url is properly forwarded to the underlying client factory.

components/log-ingestor/tests/test_ingestion_job.rs (2)

161-186: LGTM!

The test correctly reflects the updated API:

  • Client creation calls properly pass region.as_str() and Some(&aws_config.endpoint)
  • BaseConfig is updated with optional region and endpoint_url fields
  • Line 172 no longer uses .unwrap() (addressing the past review concern), instead using aws_config.endpoint directly

Also applies to: 197-203


243-264: LGTM!

The test_s3_scanner test is consistently updated following the same pattern as test_sqs_listener. The S3 client creation and configuration properly use the optional region and endpoint_url fields.

Member

@LinZhihao-723 LinZhihao-723 left a comment

Hi @sudheergajula, sorry for the late reply.
I've updated this PR to make sure it can fully support custom endpoints. The major changes are as follows:

  • As changed in #1767, the S3 region can now be None. I've updated the job config and the CLP config accordingly. However, the AWS SDK requires the region code to always be set, so I added a constant for the default region (us-east-1). I chose to let the caller decide which default region to use, since the desired default might not always be us-east-1.
  • #1767 also updated the compression job config to take an optional endpoint URL. I've updated the compression job submitter to use the given endpoint URL for compression when specified.
  • We don't want multiple ingestion jobs to monitor the same key prefix (we call this a prefix conflict), to avoid duplicate ingestion. Now that we support endpoint URLs, we need to make sure the detection is scoped to the same endpoint, not just the bucket name + region + dataset. The current check is a bit verbose. I've created an issue (#1805) to track this for future refactoring.
  • For the SQS listener, I've disabled custom endpoints for now for simplicity (as the SQS endpoint might be different from the S3 endpoint). We will add support for custom endpoints in future PRs.
  • I also updated the generated OpenAPI doc as we updated the ingestion job config and the API return code.

What I've tested:

  • Start a MinIO service, and set its credentials to be the same as my AWS credentials.
    Create two ingestion jobs and generate logs to the designated buckets on different endpoints in parallel using the helper script I created here:
    • The first job monitors MinIO (the given bucket + prefix) using the S3 scanner.
    • The second job monitors AWS S3 using the SQS listener (I configured an SQS queue to listen to the bucket).
  • Ensure all objects can be captured by the log ingestor.
  • Ensure all objects will be eventually submitted to CLP for compression by the log ingestor.
  • Ensure all compression jobs complete without errors, and these logs are searchable through CLP webui.

@hoophalab Please help me review the changes, thank you!

Contributor

@hoophalab hoophalab left a comment

LGTM. I've posted a diff with my suggested changes to save you time.

  1. Yes, we can use endpoint_url in the API server. I made the change in the diff.
  2. Can we make region Option<NonEmptyString> and default it to us-east-1 in rust-utils rather than in each caller? I also tried passing None directly to the S3 client, but MinIO doesn't support resolving the region automatically.
diff --git a/components/api-server/src/client.rs b/components/api-server/src/client.rs
index c2fd6eec..79ad212c 100644
--- a/components/api-server/src/client.rs
+++ b/components/api-server/src/client.rs
@@ -2,7 +2,6 @@ use std::pin::Pin;
 
 use async_stream::stream;
 use clp_rust_utils::{
-    aws::AWS_DEFAULT_REGION,
     clp_config::{
         AwsAuthentication,
         package::{
@@ -332,11 +331,8 @@ impl Client {
         let s3_client = clp_rust_utils::s3::create_new_client(
             credentials.access_key_id.as_str(),
             credentials.secret_access_key.as_str(),
-            s3_config
-                .region_code
-                .as_ref()
-                .map_or(AWS_DEFAULT_REGION, non_empty_string::NonEmptyString::as_str),
-            None,
+            s3_config.region_code.as_ref(),
+            s3_config.endpoint_url.as_ref(),
         )
         .await;
 
diff --git a/components/clp-rust-utils/src/s3/client.rs b/components/clp-rust-utils/src/s3/client.rs
index 73dfd14e..1d2ce062 100644
--- a/components/clp-rust-utils/src/s3/client.rs
+++ b/components/clp-rust-utils/src/s3/client.rs
@@ -1,3 +1,4 @@
+use crate::aws::AWS_DEFAULT_REGION;
 use aws_config::BehaviorVersion;
 use aws_sdk_s3::{
     Client,
@@ -19,7 +20,7 @@ use non_empty_string::NonEmptyString;
 pub async fn create_new_client(
     access_key_id: &str,
     secret_access_key: &str,
-    region_id: &str,
+    region_id: Option<&NonEmptyString>,
     endpoint: Option<&NonEmptyString>,
 ) -> Client {
     let credential = Credentials::new(
@@ -31,9 +32,13 @@ pub async fn create_new_client(
     );
     let base_config = aws_config::defaults(BehaviorVersion::latest()).load().await;
     let mut config_builder = Builder::from(&base_config)
-        .region(Region::new(region_id.to_string()))
         .credentials_provider(credential)
         .force_path_style(true);
+    config_builder.set_region(Some(Region::new(if let Some(id) = region_id {
+        id.to_string()
+    } else {
+        AWS_DEFAULT_REGION.to_owned()
+    })));
     config_builder.set_endpoint_url(endpoint.map(std::string::ToString::to_string));
     let config = config_builder.build();
     Client::from_conf(config)
diff --git a/components/clp-rust-utils/src/sqs/client.rs b/components/clp-rust-utils/src/sqs/client.rs
index a6596390..5cb5d974 100644
--- a/components/clp-rust-utils/src/sqs/client.rs
+++ b/components/clp-rust-utils/src/sqs/client.rs
@@ -1,3 +1,4 @@
+use crate::aws::AWS_DEFAULT_REGION;
 use aws_config::BehaviorVersion;
 use aws_sdk_sqs::{
     Client,
@@ -15,7 +16,7 @@ use non_empty_string::NonEmptyString;
 pub async fn create_new_client(
     access_key_id: &str,
     secret_access_key: &str,
-    region_id: &str,
+    region_id: Option<&NonEmptyString>,
     endpoint: Option<&NonEmptyString>,
 ) -> Client {
     let credential = Credentials::new(
@@ -26,9 +27,12 @@ pub async fn create_new_client(
         "clp-credential-provider",
     );
     let base_config = aws_config::defaults(BehaviorVersion::latest()).load().await;
-    let mut config_builder = Builder::from(&base_config)
-        .credentials_provider(credential)
-        .region(Region::new(region_id.to_string()));
+    let mut config_builder = Builder::from(&base_config).credentials_provider(credential);
+    config_builder.set_region(Some(Region::new(if let Some(id) = region_id {
+        id.to_string()
+    } else {
+        AWS_DEFAULT_REGION.to_owned()
+    })));
     config_builder.set_endpoint_url(endpoint.map(std::string::ToString::to_string));
     Client::from_conf(config_builder.build())
 }
diff --git a/components/log-ingestor/src/aws_client_manager.rs b/components/log-ingestor/src/aws_client_manager.rs
index 8e59606f..ab2cdd47 100644
--- a/components/log-ingestor/src/aws_client_manager.rs
+++ b/components/log-ingestor/src/aws_client_manager.rs
@@ -2,7 +2,6 @@ use anyhow::Result;
 use async_trait::async_trait;
 use aws_sdk_s3::Client as S3Client;
 use aws_sdk_sqs::Client as SqsClient;
-use clp_rust_utils::aws::AWS_DEFAULT_REGION;
 use non_empty_string::NonEmptyString;
 
 /// A marker trait for AWS client types.
@@ -55,13 +54,9 @@ impl SqsClientWrapper {
         access_key_id: &str,
         secret_access_key: &str,
     ) -> Self {
-        let sqs_client = clp_rust_utils::sqs::create_new_client(
-            access_key_id,
-            secret_access_key,
-            region.map_or(AWS_DEFAULT_REGION, NonEmptyString::as_str),
-            None,
-        )
-        .await;
+        let sqs_client =
+            clp_rust_utils::sqs::create_new_client(access_key_id, secret_access_key, region, None)
+                .await;
         Self::from(sqs_client)
     }
 }
@@ -93,7 +88,7 @@ impl S3ClientWrapper {
         let s3_client = clp_rust_utils::s3::create_new_client(
             access_key_id,
             secret_access_key,
-            region.map_or(AWS_DEFAULT_REGION, NonEmptyString::as_str),
+            region,
             endpoint_url,
         )
         .await;
diff --git a/components/log-ingestor/tests/test_ingestion_job.rs b/components/log-ingestor/tests/test_ingestion_job.rs
index 44f8c6c5..c7d93e10 100644
--- a/components/log-ingestor/tests/test_ingestion_job.rs
+++ b/components/log-ingestor/tests/test_ingestion_job.rs
@@ -161,7 +161,7 @@ async fn test_sqs_listener() -> Result<()> {
     let sqs_client = clp_rust_utils::sqs::create_new_client(
         aws_config.access_key_id.as_str(),
         aws_config.secret_access_key.as_str(),
-        aws_config.region.as_str(),
+        Some(&aws_config.region),
         Some(&aws_config.endpoint),
     )
     .await;
@@ -197,7 +197,7 @@ async fn test_sqs_listener() -> Result<()> {
     let s3_client = clp_rust_utils::s3::create_new_client(
         aws_config.access_key_id.as_str(),
         aws_config.secret_access_key.as_str(),
-        aws_config.region.as_str(),
+        Some(&aws_config.region),
         Some(&aws_config.endpoint),
     )
     .await;
@@ -243,7 +243,7 @@ async fn test_s3_scanner() -> Result<()> {
     let s3_client = clp_rust_utils::s3::create_new_client(
         aws_config.access_key_id.as_str(),
         aws_config.secret_access_key.as_str(),
-        aws_config.region.as_str(),
+        Some(&aws_config.region),
         Some(&aws_config.endpoint),
     )
     .await;

@hoophalab
Contributor

Validation:

  1. Successfully ingested logs from MinIO using the S3 scanner.
  2. The API server can query results read from MinIO.

@LinZhihao-723 LinZhihao-723 changed the title feat(log-ingestor): Add support for ingesting from custom S3 endpoints. feat(clp-package): Add support for ingesting from custom S3 endpoints using log-ingestor; Add support for streaming search results from custom S3 endpoints using the API server. Dec 19, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 320fb79 and 2494554.

📒 Files selected for processing (5)
  • components/api-server/src/client.rs (1 hunks)
  • components/clp-rust-utils/src/s3/client.rs (3 hunks)
  • components/clp-rust-utils/src/sqs/client.rs (3 hunks)
  • components/log-ingestor/src/aws_client_manager.rs (3 hunks)
  • components/log-ingestor/tests/test_ingestion_job.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

Applied to files:

  • components/api-server/src/client.rs
  • components/clp-rust-utils/src/s3/client.rs
  • components/log-ingestor/src/aws_client_manager.rs
  • components/clp-rust-utils/src/sqs/client.rs
  • components/log-ingestor/tests/test_ingestion_job.rs
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.

Applied to files:

  • components/api-server/src/client.rs
  • components/clp-rust-utils/src/s3/client.rs
📚 Learning: 2025-01-17T23:25:38.165Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 673
File: components/log-viewer-webui/server/src/S3Manager.js:0-0
Timestamp: 2025-01-17T23:25:38.165Z
Learning: In S3Manager.js, URL validation is handled by the URL constructor which throws TypeError for invalid URLs, making additional scheme (s3://) validation redundant.

Applied to files:

  • components/clp-rust-utils/src/s3/client.rs
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.

Applied to files:

  • components/log-ingestor/tests/test_ingestion_job.rs
🧬 Code graph analysis (2)
components/log-ingestor/src/aws_client_manager.rs (2)
components/clp-rust-utils/src/s3/client.rs (1)
  • create_new_client (21-45)
components/clp-rust-utils/src/sqs/client.rs (1)
  • create_new_client (17-38)
components/log-ingestor/tests/test_ingestion_job.rs (1)
components/clp-rust-utils/src/types.rs (1)
  • from_string (26-28)
🔇 Additional comments (8)
components/api-server/src/client.rs (1)

334-335: LGTM! Endpoint URL now properly propagated.

The changes correctly pass both the region and endpoint URL as references to the S3 client creation function. This addresses the previous issue where the endpoint parameter was hardcoded to None, which prevented custom S3-compatible endpoints from working for query result storage.

components/log-ingestor/tests/test_ingestion_job.rs (3)

164-187: LGTM! Consistent parameter handling throughout the test.

The test correctly uses borrowed references (Some(&aws_config.region), Some(&aws_config.endpoint)) for client creation and clones values for the BaseConfig struct. The direct usage of aws_config.endpoint in the format! macro at line 172 is appropriate since format! handles ownership internally.


197-203: LGTM! Consistent client creation pattern.

The S3 client creation follows the same correct pattern as the SQS client, passing borrowed references for both region and endpoint parameters.


243-264: LGTM! Test configuration properly updated.

The test_s3_scanner function correctly mirrors the parameter handling pattern from test_sqs_listener, ensuring consistency across the test suite.

components/log-ingestor/src/aws_client_manager.rs (1)

82-96: LGTM! S3 client wrapper properly supports custom endpoints.

The S3ClientWrapper::create method correctly accepts and propagates both the optional region and endpoint URL parameters to the underlying S3 client creation function, enabling full support for custom S3-compatible endpoints like MinIO and LocalStack.

components/clp-rust-utils/src/s3/client.rs (2)

6-25: LGTM! Function signature properly updated for optional parameters.

The imports and function signature correctly support optional region and endpoint parameters using NonEmptyString references. This aligns with the requirement that custom S3-compatible endpoints (MinIO, LocalStack) don't require AWS region codes.

Based on learnings, the region field is optional because custom endpoints use path-style URLs and don't require AWS region codes.


38-42: LGTM! Region fallback and endpoint handling implemented correctly.

The region configuration properly falls back to AWS_DEFAULT_REGION when no region is provided, which is appropriate for custom S3-compatible endpoints. The endpoint URL is correctly mapped to a String when present and passed to the config builder.

components/clp-rust-utils/src/sqs/client.rs (1)

6-36: LGTM! SQS client implementation consistent with S3 client.

The SQS client creation function mirrors the S3 client implementation, correctly handling optional region and endpoint parameters with the same fallback logic. The configuration builder pattern properly sets credentials, region (with AWS_DEFAULT_REGION fallback), and endpoint URL.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2494554 and 860a3cf.

⛔ Files ignored due to path filters (1)
  • docs/src/_static/generated/log-ingestor-openapi.json is excluded by !**/generated/**
📒 Files selected for processing (6)
  • components/api-server/src/client.rs (7 hunks)
  • components/clp-rust-utils/src/s3/client.rs (3 hunks)
  • components/log-ingestor/src/aws_client_manager.rs (3 hunks)
  • components/log-ingestor/src/ingestion_job_manager.rs (6 hunks)
  • components/log-ingestor/src/routes.rs (3 hunks)
  • components/log-ingestor/tests/test_ingestion_job.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:15.181Z
Learning: AWS S3 region codes follow the pattern `[region]-[zone]-[digit]` where the digit is always a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). Custom S3-compatible endpoints (MinIO, LocalStack) may use different region naming conventions.
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

Applied to files:

  • components/log-ingestor/tests/test_ingestion_job.rs
  • components/log-ingestor/src/aws_client_manager.rs
  • components/clp-rust-utils/src/s3/client.rs
  • components/log-ingestor/src/routes.rs
  • components/log-ingestor/src/ingestion_job_manager.rs
  • components/api-server/src/client.rs
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.

Applied to files:

  • components/log-ingestor/tests/test_ingestion_job.rs
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.

Applied to files:

  • components/log-ingestor/src/aws_client_manager.rs
  • components/clp-rust-utils/src/s3/client.rs
  • components/log-ingestor/src/routes.rs
  • components/log-ingestor/src/ingestion_job_manager.rs
  • components/api-server/src/client.rs
📚 Learning: 2025-01-17T23:25:38.165Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 673
File: components/log-viewer-webui/server/src/S3Manager.js:0-0
Timestamp: 2025-01-17T23:25:38.165Z
Learning: In S3Manager.js, URL validation is handled by the URL constructor which throws TypeError for invalid URLs, making additional scheme (s3://) validation redundant.

Applied to files:

  • components/clp-rust-utils/src/s3/client.rs
🧬 Code graph analysis (3)
components/log-ingestor/tests/test_ingestion_job.rs (1)
components/clp-rust-utils/src/types.rs (1)
  • from_string (26-28)
components/log-ingestor/src/aws_client_manager.rs (2)
components/clp-rust-utils/src/s3/client.rs (1)
  • create_new_client (19-40)
components/clp-rust-utils/src/sqs/client.rs (1)
  • create_new_client (17-38)
components/api-server/src/client.rs (2)
components/clp-rust-utils/src/s3/client.rs (1)
  • create_new_client (19-40)
components/clp-rust-utils/src/sqs/client.rs (1)
  • create_new_client (17-38)
🔇 Additional comments (14)
components/api-server/src/client.rs (7)

5-5: LGTM!

The import of AWS_DEFAULT_REGION is correctly placed and necessary for the default region fallback logic when connecting to custom S3 endpoints.


165-165: LGTM!

The documentation correctly reflects that errors from fetch_results_from_s3 are now forwarded due to the Result return type.


202-202: LGTM!

Error propagation is now correctly implemented with .await?, enabling proper error handling for custom S3 endpoints.


329-329: LGTM!

The return type change to Result<impl Stream, ClientError> correctly enables early error return for validation failures.


338-343: LGTM!

The validation logic correctly enforces that a region code must be provided when using the default AWS S3 endpoint, while allowing custom S3-compatible endpoints (MinIO, LocalStack) to omit the region. Based on learnings, this aligns with the requirement that only AWS S3 endpoints require region codes.
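The rule described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name `validate_region_or_endpoint` and the error type `S3ConfigError` are hypothetical, and plain `&str` values stand in for the project's own string types.

```rust
/// Hypothetical error type for this sketch; the PR's real error enum may differ.
#[derive(Debug, PartialEq, Eq)]
pub enum S3ConfigError {
    /// Neither a region code nor a custom endpoint URL was supplied.
    MissingRegionCode,
}

/// Returns `Ok(())` when the config carries enough information to build an S3 client.
///
/// With no custom endpoint, the default AWS S3 endpoint is assumed, so a region code
/// is required. With a custom endpoint (e.g. MinIO), the region may be omitted.
pub fn validate_region_or_endpoint(
    region_code: Option<&str>,
    endpoint_url: Option<&str>,
) -> Result<(), S3ConfigError> {
    if region_code.is_none() && endpoint_url.is_none() {
        return Err(S3ConfigError::MissingRegionCode);
    }
    Ok(())
}
```

Under these assumptions, only the combination "no region and no endpoint" is rejected; every other combination passes through to client creation.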


345-355: LGTM!

The S3 client creation correctly handles both AWS and custom S3 endpoints:

  • Region defaults to AWS_DEFAULT_REGION when not provided (for custom S3 endpoints)
  • endpoint_url is properly propagated to enable custom S3-compatible services
  • Implementation is consistent with the pattern used in sqs/client.rs

Based on learnings, this correctly addresses the previous review concern about missing endpoint propagation.


366-393: LGTM!

The stream is correctly wrapped in Ok() to match the Result return type, enabling proper error handling for the validation logic.

components/log-ingestor/src/routes.rs (1)

92-95: LGTM! Error handling properly aligned with new endpoint/region validation.

The error mappings for CustomEndpointUrlNotSupported and MissingRegionCode correctly return BAD_REQUEST, and the OpenAPI documentation accurately describes when these errors occur. This aligns well with the validation logic in ingestion_job_manager.rs.

Also applies to: 156-160, 207-211
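As a rough illustration of the mapping described above, the sketch below turns the two validation errors into HTTP 400. The variant names `CustomEndpointUrlNotSupported` and `MissingRegionCode` come from the review text; the enum name `JobError`, the `String` payload, the `Internal` variant, and the use of bare `u16` status codes instead of a web framework's types are all assumptions made to keep the example self-contained.

```rust
/// Hypothetical error enum; only the first two variant names appear in the review.
#[derive(Debug)]
pub enum JobError {
    /// A custom endpoint URL was given for a job type that does not support it.
    CustomEndpointUrlNotSupported(String),
    /// No region code was given although the default AWS endpoint is in use.
    MissingRegionCode,
    /// Placeholder for any server-side failure (assumed for this sketch).
    Internal(String),
}

/// Maps an error to an HTTP status code: client mistakes become 400, the rest 500.
pub fn status_code_for(err: &JobError) -> u16 {
    match err {
        JobError::CustomEndpointUrlNotSupported(_) | JobError::MissingRegionCode => 400,
        JobError::Internal(_) => 500,
    }
}
```

Grouping both validation errors in one match arm keeps the "bad request" cases together, which may make it easier to keep the OpenAPI documentation in sync.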

components/log-ingestor/src/ingestion_job_manager.rs (3)

114-122: LGTM! Proper validation for S3 scanner job creation.

The logic correctly enforces that either endpoint_url or region must be provided for S3 scanner jobs. When using custom S3-compatible endpoints (like MinIO), the region may not be required, while AWS S3 endpoints require a region. This validation aligns with the learnings from PR #1767.

Based on learnings, custom S3-compatible endpoints use path-style URLs and don't require AWS region codes.


144-149: LGTM! Clear rejection of unsupported SQS custom endpoints.

The explicit error for custom endpoint URLs on SQS listener jobs is appropriate since this feature is not yet implemented. The error message is descriptive and includes the problematic endpoint URL, making it easy for users to understand the limitation.


233-251: LGTM! Conflict detection properly includes endpoint_url.

The conflict detection now correctly considers endpoint_url in addition to region, bucket_name, dataset, and key_prefix. This ensures that jobs accessing the same bucket through different endpoints (e.g., AWS S3 vs. MinIO) are not incorrectly flagged as conflicting.
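One way to picture an endpoint-aware conflict check is sketched below. The struct `ScanTarget` and the function `conflicts` are hypothetical names, and the prefix-overlap rule (one key prefix being a prefix of the other) is an assumption for illustration; the PR's actual comparison may differ.

```rust
/// Hypothetical description of what an S3 scanner job reads from.
#[derive(Debug, Clone)]
pub struct ScanTarget {
    pub endpoint_url: Option<String>,
    pub region: Option<String>,
    pub bucket_name: String,
    pub dataset: String,
    pub key_prefix: String,
}

/// Two targets conflict only if they point at the same endpoint, region, bucket and
/// dataset, and their key prefixes overlap (assumed rule: one is a prefix of the other).
pub fn conflicts(a: &ScanTarget, b: &ScanTarget) -> bool {
    a.endpoint_url == b.endpoint_url
        && a.region == b.region
        && a.bucket_name == b.bucket_name
        && a.dataset == b.dataset
        && (a.key_prefix.starts_with(b.key_prefix.as_str())
            || b.key_prefix.starts_with(a.key_prefix.as_str()))
}
```

With this shape, a bucket named `logs-pub` on AWS S3 and a bucket with the same name behind a MinIO endpoint compare as different targets, because their `endpoint_url` values differ.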

components/log-ingestor/tests/test_ingestion_job.rs (1)

161-186: LGTM! Tests properly updated for optional region and endpoint_url.

The tests correctly:

  • Wrap region and endpoint in Some() when creating clients.
  • Construct BaseConfig with Some(aws_config.region.clone()) and Some(aws_config.endpoint.clone()).
  • Pass borrowed references (&aws_config.region, &aws_config.endpoint) to client creation functions.

All test updates align with the new type signatures and properly exercise the custom endpoint functionality.

Also applies to: 197-203, 243-264

components/log-ingestor/src/aws_client_manager.rs (2)

53-62: LGTM! SQS client correctly omits endpoint parameter.

The SqsClientWrapper::create function does not expose an endpoint parameter and passes None to the underlying create_new_client. This is consistent with the explicit rejection of custom endpoint URLs for SQS listener jobs in ingestion_job_manager.rs (lines 144-149), where the error states "SQS listener ingestion jobs do not support custom endpoint URLs yet."


83-97: LGTM! S3 client properly handles optional region and custom endpoints.

The S3ClientWrapper::create function correctly:

  • Accepts region: Option<&NonEmptyString> and defaults to AWS_DEFAULT_REGION when None.
  • Passes endpoint_url through to enable custom S3-compatible endpoints (MinIO, LocalStack, etc.).

The default region handling ensures that even when users provide a custom endpoint without a region, the S3 client still has a valid region value (which custom endpoints typically ignore).

Based on learnings, custom S3-compatible endpoints don't require specific AWS region codes.
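The fallback behaviour described above amounts to a one-line default. The sketch below assumes `AWS_DEFAULT_REGION` is `"us-east-1"` (the PR text only says a global default region is provided, so the value is a guess) and uses a hypothetical helper `resolve_region` with plain `&str` in place of `NonEmptyString`.

```rust
/// Assumed value for illustration; the PR only states that a global default exists.
pub const AWS_DEFAULT_REGION: &str = "us-east-1";

/// Returns the caller's region when present, otherwise the default region.
///
/// The S3 client needs some region value even for custom endpoints, which
/// typically ignore it.
pub fn resolve_region(region: Option<&str>) -> &str {
    region.unwrap_or(AWS_DEFAULT_REGION)
}
```

Keeping the fallback in one helper means every call site that builds a client would resolve the region the same way.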

Member

@LinZhihao-723 LinZhihao-723 left a comment

Further exploration shows that:

  • The region code must be set in the Rust S3 client.
  • When using the default AWS S3 endpoint, the correct region code must be given explicitly.

The latest commits fix this problem by checking whether both the endpoint and the region are None; if so, the given config is rejected as invalid. This validation happens on the caller's side, since the caller may also decide which default region code to use (for example, when using MinIO, the region code must NOT be provided, so the caller needs to feed a placeholder region code to the client).

hoophalab previously approved these changes Dec 20, 2025
Contributor

@hoophalab hoophalab left a comment

LGTM. One nitpick

Validations:

  1. Successfully ingested logs from MinIO and AWS S3.
  2. If the request isn't correct, an error message is returned.
  3. The API server can fetch query results from MinIO.

Co-authored-by: hoophalab <200652805+hoophalab@users.noreply.github.com>
Member

@LinZhihao-723 LinZhihao-723 left a comment

@sudheergajula Thank you for your contribution! We will merge this PR, and it will be part of the upcoming release.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4b56fb and 55cba9d.

📒 Files selected for processing (1)
  • components/api-server/src/client.rs (7 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:15.181Z
Learning: AWS S3 region codes follow the pattern `[region]-[zone]-[digit]` where the digit is always a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). Custom S3-compatible endpoints (MinIO, LocalStack) may use different region naming conventions.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 743
File: components/log-viewer-webui/server/src/plugins/S3Manager.ts:76-76
Timestamp: 2025-04-07T09:30:20.684Z
Learning: In the S3Manager plugin, an `isEnabled` option should be added and checked instead of checking if the region is null. This would make the plugin's behavior more explicit.
Learnt from: haiqi96
Repo: y-scope/clp PR: 673
File: components/log-viewer-webui/server/src/routes/query.js:9-10
Timestamp: 2025-01-17T22:33:23.547Z
Learning: When initializing S3Manager in log-viewer-webui, both StreamFilesS3Region and StreamFilesS3PathPrefix settings must be non-null for S3 functionality to work correctly.
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

Applied to files:

  • components/api-server/src/client.rs
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.

Applied to files:

  • components/api-server/src/client.rs
🔇 Additional comments (6)
components/api-server/src/client.rs (6)

5-5: LGTM!

The import of AWS_DEFAULT_REGION is appropriate and necessary for the default region fallback logic used when region_code is not provided.


165-165: LGTM!

The documentation update and error propagation are correct. The function now properly awaits fetch_results_from_s3 and uses the ? operator to propagate function-level errors, which aligns with the updated return type of fetch_results_from_s3.

Also applies to: 202-202


329-329: LGTM!

The return type change to Result<impl Stream<...>, ClientError> is appropriate. It enables the function to return validation errors before stream creation, which is necessary for the AWS region validation logic.


338-343: LGTM!

The validation logic correctly enforces that a region code must be provided when using the default AWS S3 endpoint (when endpoint_url is None). Custom S3-compatible endpoints (MinIO, LocalStack, etc.) can proceed without a region code, as they don't require AWS region codes.

Based on learnings, this aligns with the S3 endpoint requirements established in PR #1767.


345-355: LGTM! Past issue successfully addressed.

The S3 client is now correctly constructed with:

  • Region: Uses the configured region_code if provided, otherwise falls back to AWS_DEFAULT_REGION. This ensures custom S3-compatible endpoints work even when region is not required.
  • Endpoint: Properly propagates s3_config.endpoint_url via as_ref(), enabling custom S3 endpoints (MinIO, LocalStack, etc.) to function correctly.

This change addresses the critical issue flagged in past review comments where the endpoint parameter was passed as None, preventing custom S3 endpoints from working for query result storage.


366-366: LGTM!

The stream is correctly wrapped in Ok(...) to match the function's Result<impl Stream<...>, ClientError> return type. This allows the function to return validation errors before stream creation while still yielding a stream on success.

Also applies to: 393-393
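The "validate first, then return a lazy sequence wrapped in `Ok`" shape can be sketched with the standard library alone. To stay self-contained, the example below uses `Iterator` where the real function returns an async `Stream`, and the names `fetch_results` and `FetchError`, as well as the `fetched:` formatting, are hypothetical stand-ins rather than the PR's code.

```rust
/// Hypothetical error type standing in for the API server's `ClientError`.
#[derive(Debug, PartialEq, Eq)]
pub enum FetchError {
    MissingRegionCode,
}

/// Validates the config eagerly, then returns a lazy iterator over results.
///
/// The validation error surfaces before any item is produced, which is what
/// wrapping the sequence in `Result` makes possible.
pub fn fetch_results(
    region_code: Option<&str>,
    endpoint_url: Option<&str>,
    object_keys: Vec<String>,
) -> Result<impl Iterator<Item = String>, FetchError> {
    if region_code.is_none() && endpoint_url.is_none() {
        return Err(FetchError::MissingRegionCode);
    }
    // Stand-in for the per-object download; nothing runs until the iterator is polled.
    Ok(object_keys.into_iter().map(|key| format!("fetched:{key}")))
}
```

A caller can then use `?` on the function call itself to handle configuration problems, and deal with per-item failures separately while consuming the sequence.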

@LinZhihao-723 LinZhihao-723 merged commit c91f51d into y-scope:main Dec 20, 2025
33 of 34 checks passed
davidlion pushed a commit to davidlion/clp that referenced this pull request Jan 17, 2026
… using `log-ingestor`; Add support for streaming search results from custom S3 endpoints using the API server. (y-scope#1776)

Co-authored-by: Sudheer Gajula <sudheer.gajula@indexexchange.com>
Co-authored-by: LinZhihao-723 <zh.lin@mail.utoronto.ca>
Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>
Co-authored-by: hoophalab <200652805+hoophalab@users.noreply.github.com>