feat(spider-execution-manager): Add single-process supervisor pool for the task executor.#326
feat(spider-execution-manager): Add single-process supervisor pool for the task executor.#326LinZhihao-723 wants to merge 10 commits into
Conversation
WalkthroughThis pull request implements a process pool manager for executing TDL tasks in isolated, long-lived subprocesses. A new ChangesExecution System Implementation
Integration Test Infrastructure and Test Cases
Build Configuration and Manifest Updates
🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
tests/huntsman/tdl-integration/tests/complex.rs (1)
105-106: ⚡ Quick winConsider using the
load()return value directly in other tests.Throughout the file, tests follow the pattern of discarding the
load()result and immediately callingget()to retrieve the same package. Now thatload()returns a package reference, these tests could use it directly for consistency and to eliminate redundant lookups.♻️ Example refactor for one test
fn version_is_compatible() -> anyhow::Result<()> { let path = lib_path(); let mut manager = TdlPackageManager::new(); - manager.load(&path)?; - let pkg = manager.get(PACKAGE_NAME).expect("package should be loaded"); + let pkg = manager.load(&path)?; assert_eq!(pkg.version(), Version::SPIDER_TDL);This pattern can be applied to tests at lines 102–110, 114–126, 130–167, 171–196, 200–224, 228–258, 262–300, 304–328, 332–346, and 350–364.
Also applies to: 117-118, 133-134, 174-175, 203-204, 231-232, 265-266, 307-308, 335-336, 353-354
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/huntsman/tdl-integration/tests/complex.rs` around lines 105 - 106, Tests currently discard the return value of manager.load(&path)? and call manager.get(PACKAGE_NAME) immediately after; change each test to use the PackageRef returned by manager.load(...) directly (e.g., let pkg = manager.load(&path)? ) instead of calling manager.get(...), replacing subsequent uses of the retrieved package with that local variable; update all occurrences mentioned (the blocks around manager.load/get and the listed line ranges) so they use the load() return value for consistency and to avoid redundant lookups.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/spider-task-executor/src/bin/spider_task_executor.rs`:
- Around line 8-10: The package resolution docs are inconsistent with the
loading behavior: update the comment that describes the path for an `Execute`
request so it matches the actual filename used by the loader; replace the
`${package}/${package}.so` description with the implemented
`${SPIDER_TDL_PACKAGE_DIR}/${package}/lib{package}.so` (i.e., mention the lib
prefix) so the docstring and the runtime lookup (which loads `lib{package}.so`)
are aligned.
In `@tests/huntsman/task-executor/tests/overhead_instrument.rs`:
- Around line 177-180: The test fails when SPIDER_TEST_INSTRUMENT_OUTPUT_DIR
isn't pre-created because the code calls File::create on path immediately;
before creating the file, call std::fs::create_dir_all(&output_dir) (or
equivalent) to ensure the directory exists and handle errors similarly (e.g.,
unwrap_or_else with a clear panic message). Update the block around output_dir,
OUTPUT_FILE, path, and the File::create call so the directory is created first,
then proceed to File::create(&path) and file.write_all(...).
- Around line 83-85: The percentile index math for p50/p95/p99 is using 1-based
nearest-rank logic on a 0-based vector, biasing results; update the index
computation to use 0-based nearest-rank by computing the 1-based rank =
(percentile * count + 50) / 100, converting to a 0-based index by subtracting 1,
and then clamping with .min(last) and ensuring no underflow (e.g., use
saturating_sub(1) or handle rank==0). Apply this change to the p50/p95/p99
calculations that reference samples, count, and last so indices are computed as
((count * 50 + 50) / 100).saturating_sub(1), ((count * 95 + 50) /
100).saturating_sub(1), and ((count * 99 + 50) / 100).saturating_sub(1) (with
proper casting to usize) before indexing samples.
---
Nitpick comments:
In `@tests/huntsman/tdl-integration/tests/complex.rs`:
- Around line 105-106: Tests currently discard the return value of
manager.load(&path)? and call manager.get(PACKAGE_NAME) immediately after;
change each test to use the PackageRef returned by manager.load(...) directly
(e.g., let pkg = manager.load(&path)? ) instead of calling manager.get(...),
replacing subsequent uses of the retrieved package with that local variable;
update all occurrences mentioned (the blocks around manager.load/get and the
listed line ranges) so they use the load() return value for consistency and to
avoid redundant lookups.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 40925990-1b4d-4733-8af9-8ee05f16d492
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (19)
Cargo.tomlcomponents/spider-execution-manager/Cargo.tomlcomponents/spider-execution-manager/src/lib.rscomponents/spider-execution-manager/src/process_pool.rscomponents/spider-task-executor/Cargo.tomlcomponents/spider-task-executor/src/bin/spider_task_executor.rscomponents/spider-task-executor/src/error.rscomponents/spider-task-executor/src/lib.rscomponents/spider-task-executor/src/manager.rscomponents/spider-task-executor/src/protocol.rstaskfiles/test.yamltests/huntsman/integration-test-tasks/Cargo.tomltests/huntsman/integration-test-tasks/src/lib.rstests/huntsman/task-executor/Cargo.tomltests/huntsman/task-executor/src/lib.rstests/huntsman/task-executor/tests/overhead_instrument.rstests/huntsman/task-executor/tests/test_executor.rstests/huntsman/task-executor/tests/test_process_pool.rstests/huntsman/tdl-integration/tests/complex.rs
| //! Package resolution: each `Execute` request names a TDL package; the executor looks for | ||
| //! `${SPIDER_TDL_PACKAGE_DIR}/${package}/${package}.so` and caches the loaded library by name. | ||
| //! |
There was a problem hiding this comment.
Fix package resolution docs to match the implemented filename.
Line 9 says ${package}/${package}.so, but Line 76 resolves lib{package}.so. Keep docs and behaviour aligned.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@components/spider-task-executor/src/bin/spider_task_executor.rs` around lines
8 - 10, The package resolution docs are inconsistent with the loading behavior:
update the comment that describes the path for an `Execute` request so it
matches the actual filename used by the loader; replace the
`${package}/${package}.so` description with the implemented
`${SPIDER_TDL_PACKAGE_DIR}/${package}/lib{package}.so` (i.e., mention the lib
prefix) so the docstring and the runtime lookup (which loads `lib{package}.so`)
are aligned.
| let path = pkg_dir.join(package).join(format!("lib{package}.so")); | ||
| manager.load(&path)? |
There was a problem hiding this comment.
Validate package names before joining filesystem paths.
package is used directly in path construction. Absolute paths or traversal components can escape SPIDER_TDL_PACKAGE_DIR and load unintended shared libraries.
Suggested fix
use std::{
- path::{Path, PathBuf},
+ path::{Component, Path, PathBuf},
time::Instant,
};
@@
fn run_task(
@@
) -> Result<Vec<u8>, ExecutorError> {
+ let package_path = Path::new(package);
+ if package_path.is_absolute()
+ || package_path
+ .components()
+ .any(|c| !matches!(c, Component::Normal(_)))
+ {
+ return Err(ExecutorError::InvalidLibrary(format!(
+ "invalid package name: {package}"
+ )));
+ }
+
let pkg = if let Some(pkg) = manager.get(package) {
pkg
} else {
let path = pkg_dir.join(package).join(format!("lib{package}.so"));
manager.load(&path)?| let p50 = samples[(count / 2).min(last)].as_secs_f64() * 1_000_000.0; | ||
| let p95 = samples[(count * 95 / 100).min(last)].as_secs_f64() * 1_000_000.0; | ||
| let p99 = samples[(count * 99 / 100).min(last)].as_secs_f64() * 1_000_000.0; |
There was a problem hiding this comment.
Fix percentile index math for 0-based sampling.
Current p50/p95/p99 indexes are biased because they don’t use a 0-based nearest-rank conversion. This skews reported overhead stats.
Proposed patch
- let p50 = samples[(count / 2).min(last)].as_secs_f64() * 1_000_000.0;
- let p95 = samples[(count * 95 / 100).min(last)].as_secs_f64() * 1_000_000.0;
- let p99 = samples[(count * 99 / 100).min(last)].as_secs_f64() * 1_000_000.0;
+ let pct_idx = |pct: usize| (((count * pct) + 99) / 100).saturating_sub(1).min(last);
+ let p50 = samples[pct_idx(50)].as_secs_f64() * 1_000_000.0;
+ let p95 = samples[pct_idx(95)].as_secs_f64() * 1_000_000.0;
+ let p99 = samples[pct_idx(99)].as_secs_f64() * 1_000_000.0;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let p50 = samples[(count / 2).min(last)].as_secs_f64() * 1_000_000.0; | |
| let p95 = samples[(count * 95 / 100).min(last)].as_secs_f64() * 1_000_000.0; | |
| let p99 = samples[(count * 99 / 100).min(last)].as_secs_f64() * 1_000_000.0; | |
| let pct_idx = |pct: usize| (((count * pct) + 99) / 100).saturating_sub(1).min(last); | |
| let p50 = samples[pct_idx(50)].as_secs_f64() * 1_000_000.0; | |
| let p95 = samples[pct_idx(95)].as_secs_f64() * 1_000_000.0; | |
| let p99 = samples[pct_idx(99)].as_secs_f64() * 1_000_000.0; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/huntsman/task-executor/tests/overhead_instrument.rs` around lines 83 -
85, The percentile index math for p50/p95/p99 is using 1-based nearest-rank
logic on a 0-based vector, biasing results; update the index computation to use
0-based nearest-rank by computing the 1-based rank = (percentile * count + 50) /
100, converting to a 0-based index by subtracting 1, and then clamping with
.min(last) and ensuring no underflow (e.g., use saturating_sub(1) or handle
rank==0). Apply this change to the p50/p95/p99 calculations that reference
samples, count, and last so indices are computed as ((count * 50 + 50) /
100).saturating_sub(1), ((count * 95 + 50) / 100).saturating_sub(1), and ((count
* 99 + 50) / 100).saturating_sub(1) (with proper casting to usize) before
indexing samples.
| let path = output_dir.join(OUTPUT_FILE); | ||
| let mut file = | ||
| File::create(&path).unwrap_or_else(|err| panic!("create {} failed: {err}", path.display())); | ||
| file.write_all(preamble.as_bytes()).expect("write preamble"); |
There was a problem hiding this comment.
Create the output directory before writing the markdown file.
If SPIDER_TEST_INSTRUMENT_OUTPUT_DIR is set but not pre-created, file creation fails and the benchmark test aborts unnecessarily.
Proposed patch
let path = output_dir.join(OUTPUT_FILE);
+ std::fs::create_dir_all(&output_dir)
+ .unwrap_or_else(|err| panic!("create {} failed: {err}", output_dir.display()));
let mut file =
File::create(&path).unwrap_or_else(|err| panic!("create {} failed: {err}", path.display()));📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let path = output_dir.join(OUTPUT_FILE); | |
| let mut file = | |
| File::create(&path).unwrap_or_else(|err| panic!("create {} failed: {err}", path.display())); | |
| file.write_all(preamble.as_bytes()).expect("write preamble"); | |
| let path = output_dir.join(OUTPUT_FILE); | |
| std::fs::create_dir_all(&output_dir) | |
| .unwrap_or_else(|err| panic!("create {} failed: {err}", output_dir.display())); | |
| let mut file = | |
| File::create(&path).unwrap_or_else(|err| panic!("create {} failed: {err}", path.display())); | |
| file.write_all(preamble.as_bytes()).expect("write preamble"); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/huntsman/task-executor/tests/overhead_instrument.rs` around lines 177 -
180, The test fails when SPIDER_TEST_INSTRUMENT_OUTPUT_DIR isn't pre-created
because the code calls File::create on path immediately; before creating the
file, call std::fs::create_dir_all(&output_dir) (or equivalent) to ensure the
directory exists and handle errors similarly (e.g., unwrap_or_else with a clear
panic message). Update the block around output_dir, OUTPUT_FILE, path, and the
File::create call so the directory is created first, then proceed to
File::create(&path) and file.write_all(...).
Description
This PR depends on #325.
New crate
This PR introduces a new crate
spider-execution-manager. Its first module,process_pool, is a single-process supervisor that owns onespider-task-executorsubprocess and serializes access to it. The public surface is one factory (ProcessPool::new) and one async API (ProcessPool::execute(request, hard_timeout) -> Result<Outcome, InternalError>). Multi-slot pools are deliberately deferred — the EM main loop only needs one process today, and the smaller surface keeps the consumer's integration minimal.Lifecycle and concurrency
The pool spawns the initial executor at construction. Concurrent
executecallers queue on a singletokio::Mutexaround the handle slot; the guard is held for the entire call, so the mutex is the entire concurrency gate (no separate semaphore). The parent is the sole authority for the hard timeout — if either the parent's timer fires or the child exits / closes stdout, the dead handle is dropped (Command::kill_on_drop(true)SIGKILLs the OS process), a fresh executor is spawned in the same critical section, and the next caller sees a healthy process. The four channel-level failure modes inside the protocol exchange (framed write failure, framed read failure, stdout EOF, response decode failure) all collapse intoOutcome::ExecutorCrash, which feeds the same respawn branch asOutcome::Timeout.Logging and observability
Each spawn allocates a monotonic
executor_idand routes the child's stderr to<log_dir>/<em_id>-<executor_id>.login create-or-append mode (rather than inheriting the parent's stderr). Per-spawn filenames mean each respawn naturally rotates onto a fresh file; a size/time rotator at the executor side is the documented next step. All pool-emitted tracing events carryexecutor_idas a typedu64field so JSON subscribers render it as a number; ID newtypes that lackDisplayuse debug formatting.Error model
InternalErroris the pool's own failure surface, distinct from a task-levelOutcome. Four variants cover the cases where the pool can't serve the current call:NotRunning(a prior respawn failed and the pool is unrecoverable),ExecutorCreationFailure(any I/O step during spawn),EncodeTaskContext, andEncodeTaskInputs. All non-NotRunningvariants usethiserror's#[from]so the body propagates with?throughout. There are no panics in either the happy path or the error path ofexecute.Cancellation safety (known limitation)
executeholds the mutex guard across.awaitpoints. If the caller's future is dropped mid-flight (e.g. a higher-layer timeout,tokio::select!losing the branch, ctrl-C), the guard releases but the executor keeps running and will eventually push a response into the stdout pipe — which the nextexecutecall would read as the first frame of its own request. The protocol no longer carriesinstance_id, so there's no automatic mismatch detection. The EM main loop will own these futures to completion, so this isn't reachable in practice today; documented as a follow-up to harden via a Drop guard that flags the handle dirty on cancellation.Test coverage
End-to-end tests added at
tests/huntsman/task-executor/tests/test_process_pool.rs, against the real task-executor binary using the existing harness:fibonacci_succeeds— happy path.always_fail_reports_task_error— in-task failure surfaces a decodableExecutorError.always_panic_returns_crash_then_respawns— panicking task returnsOutcome::ExecutorCrash; a follow-upexecuteon the same pool succeeds, proving the respawn.hard_timeout_kills_then_respawns— long-running task plus a short hard timeout returnsOutcome::Timeout; a follow-upexecutesucceeds, proving the SIGKILL + respawn.The respawn assertion is functional — the follow-up call returning
Outcome::Successis the visible evidence that a fresh process is alive and serving. No internal pool state is inspected.Checklist
breaking change.
Validation performed
Summary by CodeRabbit
Release Notes
New Features
Tests