feat(clp-package): Add support for uploading query result files to S3. #1722
hoophalab merged 7 commits into y-scope:main
Conversation
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (2)
components/clp-py-utils/clp_py_utils/s3_utils.py (3)
- generate_s3_virtual_hosted_style_url (243-253)
- get_credential_env_vars (62-100)
- s3_put (281-309)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
- StorageType (135-137)
- get_directory (639-640)
- get_directory (650-651)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: package-image
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: build (ubuntu-24.04)
- GitHub Check: build (macos-15)
🔇 Additional comments (2)
components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (2)
15-19: LGTM! The addition of s3_put to the import group is correct and necessary for the new S3 upload functionality.
257-257: Verify that s3_config is guaranteed non-null when the storage type is S3. The code accesses storage_config.s3_config without an explicit null check. While the condition at line 254 checking StorageType.S3 should ensure the configuration exists, confirm that the configuration schema guarantees s3_config is present when the storage type is S3.
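If the schema doesn't already enforce this, a defensive accessor is one option. A minimal sketch, with illustrative stand-ins — `StorageConfig` and `S3Config` here are not the actual clp_config classes:

```python
# Sketch: fail fast if the config ever has S3 storage without s3_config.
# These classes are illustrative stand-ins, not the real clp_config types.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StorageType(Enum):
    FS = "fs"
    S3 = "s3"


@dataclass
class S3Config:
    bucket: str


@dataclass
class StorageConfig:
    type: StorageType
    s3_config: Optional[S3Config] = None


def get_s3_config(storage_config: StorageConfig) -> S3Config:
    """Return s3_config, raising a clear error instead of an AttributeError
    downstream if the invariant is ever violated."""
    if storage_config.type is not StorageType.S3:
        raise ValueError("storage type is not S3")
    if storage_config.s3_config is None:
        raise ValueError("s3_config must be set when storage type is S3")
    return storage_config.s3_config
```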
LinZhihao-723 left a comment:
Before we merge, can you confirm that the behavior on a failed upload is as expected (e.g., logs and status)?
storage_config = worker_config.stream_output.storage
if StorageType.S3 == storage_config.type and search_config.write_to_file:
    s3_config = storage_config.s3_config
    dest_path = f"{job_id}/{archive_id}"
Shall we add a "root" to the destination path like how we store stream output directory?
The files are uploaded to s3://test-bucket/streams/1/61f39b1c-bc17-409b-8abc-ee7ffb383077 if the following config is used. s3_put prepends key_prefix: "streams/" to dest_path, similar to extract_stream_task. Not sure whether this is what you are asking for?
```yaml
stream_output:
  storage:
    type: "s3"
    staging_directory: "var/data/staged-streams"  # Or a path of your choosing
    s3_config:
      region_code: "us-east-1"
      bucket: "test-bucket"
      key_prefix: "streams/"
      aws_authentication:
        type: "credentials"
        credentials:
          access_key_id: "..."
          secret_access_key: "..."
```
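For reference, the resulting object key is just the configured key_prefix followed by dest_path. A minimal sketch of that composition — `build_object_key` is a hypothetical helper for illustration, not the actual s3_put API:

```python
def build_object_key(key_prefix: str, job_id: int, archive_id: str) -> str:
    # Mirrors dest_path = f"{job_id}/{archive_id}" with key_prefix prepended,
    # as described for s3_put above. Illustrative only.
    return f"{key_prefix}{job_id}/{archive_id}"


# With the config above (key_prefix "streams/"), job 1's uploads land under:
# s3://test-bucket/streams/1/<archive_id>
```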
I tried uploading to a bucket that doesn't exist. The following logs were produced, and
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-11-15T20:07:22.256Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:380-392
Timestamp: 2024-11-15T20:07:22.256Z
Learning: The current implementation assumes single-threaded execution, so race conditions in functions like `is_target_extracted` in `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py` are not currently a concern.
Applied to files:
components/job-orchestration/job_orchestration/executor/query/fs_search_task.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (3)
components/clp-py-utils/clp_py_utils/s3_utils.py (3)
- generate_s3_virtual_hosted_style_url (243-253)
- get_credential_env_vars (62-100)
- s3_put (281-309)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)
- QueryTaskStatus (64-74)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
- StorageType (135-137)
- get_directory (639-640)
- get_directory (650-651)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: package-image
- GitHub Check: build (macos-15)
- GitHub Check: build (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (2)
components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (2)
15-19: LGTM! The import of s3_put is correctly added alongside the existing S3 utility imports.
29-29: LGTM! The import of QueryTaskStatus is correctly added for use in the upload failure handling.
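A rough sketch of what such upload failure handling could look like — the QueryTaskStatus values and the s3_put call here are illustrative stand-ins, not the actual job_orchestration code:

```python
# Illustrative sketch of upload failure handling: log the error and report
# a FAILED status rather than letting the exception propagate.
import logging
from enum import Enum

logger = logging.getLogger(__name__)


class QueryTaskStatus(Enum):  # stand-in for the real scheduler constants
    SUCCEEDED = "succeeded"
    FAILED = "failed"


def upload_results(s3_put, s3_config, results_path, dest_path):
    """Attempt the S3 upload; return a task status the scheduler can surface."""
    try:
        s3_put(s3_config, results_path, dest_path)
    except Exception:
        logger.exception("Failed to upload query results to S3: %s", dest_path)
        return QueryTaskStatus.FAILED
    return QueryTaskStatus.SUCCEEDED
```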
Description
Currently, the query workers can write query results to the local filesystem by setting "write_to_file": true in the API Server, which supports a higher max_num_results than MongoDB. However, in a multi-node setup without SeaweedFS, the API server cannot read query result files on the query worker's local filesystem.
This PR adds support for uploading those query result files to S3 in the query worker.
An upcoming PR (later today) will read the results from S3 in the API server.
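In outline, the worker-side branch could be sketched as follows — the helper name and stub enum are hypothetical, not the actual PR code:

```python
# Sketch of the worker-side flow this PR adds; names are illustrative.
from enum import Enum


class StorageType(Enum):  # stand-in for the real clp_config enum
    FS = "fs"
    S3 = "s3"


def maybe_upload_results(worker_config, search_config, job_id, archive_id,
                         results_path, s3_put):
    # Only upload when stream output is S3-backed and the search was asked
    # to write its results to a file.
    storage_config = worker_config.stream_output.storage
    if StorageType.S3 == storage_config.type and search_config.write_to_file:
        s3_config = storage_config.s3_config
        dest_path = f"{job_id}/{archive_id}"
        s3_put(s3_config, results_path, dest_path)
```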
Checklist
Validation performed
- Set stream_output storage type to s3.
- Ran a search with "write_to_file": true.
- Confirmed the query result files were uploaded to s3://test-bucket/streams/{search_job_id} (tested on LocalStack).
Summary by CodeRabbit
New Features
Bug Fixes / Reliability