Skip to content

feat(sandbox): hardened sandbox executor with proactive streaming and resource limits (Closes #326)#408

Open
ask-z4ch wants to merge 8 commits into
utksh1:mainfrom
ask-z4ch:fix/sandbox-executor
Open

feat(sandbox): hardened sandbox executor with proactive streaming and resource limits (Closes #326)#408
ask-z4ch wants to merge 8 commits into
utksh1:mainfrom
ask-z4ch:fix/sandbox-executor

Conversation

@ask-z4ch
Copy link
Copy Markdown
Contributor

@ask-z4ch ask-z4ch commented May 29, 2026

Description

The existing subprocess execution path had no sandbox resource limits — runaway plugins could exhaust memory, hang indefinitely, or produce unbounded output. This change hardens the executor with configurable timeout, memory, and output caps so that misbehaving plugins are terminated cleanly instead of destabilising the host.

Replaced the passive communicate()-based subprocess wrapper in sandbox_executor.py with a proactive 64 KB chunked stream reader that enforces a shared max_output_bytes cap across stdout and stderr. Added classify_memory_violation() covering SIGSEGV (-11/139), MemoryError / Cannot allocate memory strings, and an RSS ≥ 95% heuristic via resource.getrusage. Timeout is enforced via asyncio.wait_for with a SIGTERM → 3 s grace → SIGKILL escalation. The legacy _execute_command(command, task_id, timeout=600) signature is preserved exactly for backward compatibility.

Related Issues

Closes #326

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested

  1. pytest testing/backend/test_sandbox_executor.py -q — 16 / 16 pass
  2. pytest testing/backend/test_sandbox_blocking_issues.py -q — 10 / 10 pass
  3. pytest testing/backend -q -m "not benchmark" — 527 passed, 2 failed (pre-existing trio / aiosqlite teardown issues, unrelated to this change)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally

GSSoC 2026 — In compliance with the GSSoC 2026 AI Conduct rules, I disclose that AI tools were used solely as learning and boilerplate aids. The final logic was fully reviewed, tested, and manually adapted to match human styling and clean-code design.

@ask-z4ch ask-z4ch closed this May 29, 2026
@ask-z4ch ask-z4ch reopened this May 29, 2026
@utksh1 utksh1 added level:advanced 55 pts difficulty label for advanced contributor PRs type:feature Feature work category bonus label type:security Security work category bonus label area:backend Backend API, database, or service work area:security Security-sensitive implementation or tests type:testing Testing work category bonus label labels May 29, 2026
Copy link
Copy Markdown
Owner

@utksh1 utksh1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the right direction and I kept this PR open over the duplicate, but I found blockers before merge. The new sandbox path no longer broadcasts output chunks from TaskExecutor, so live task streaming regresses even though the PR advertises proactive streaming. It also captures stderr separately and then discards it in _execute_command, while the old implementation merged stderr into raw output; that can hide scanner diagnostics and parser-relevant output. Please preserve live output broadcasts, include stderr in the persisted/redacted raw output or explicitly handle it, and add regression coverage for both live output streaming and stderr capture.

Copy link
Copy Markdown
Owner

@utksh1 utksh1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed latest state. This is now conflicting with current main after the stream/phase executor merges, and the earlier blockers still need to be resolved: preserve live output broadcasts and include stderr in persisted/redacted raw output instead of dropping it. Please rebase on current main, keep the new bounded listener/phase behavior intact, and add regression coverage for live output plus stderr capture.

ask-z4ch added 7 commits May 29, 2026 22:57
…ry classification, legacy compat

Rewrite sandbox_executor.py per security specification:
- Proactive 64KB chunked stream reader for stdout+stderr
- Output limit detection stops reading, terminates process
- Timeout enforcement via SIGTERM->3s->SIGKILL escalation
- Memory exhaustion heuristics: SIGSEGV, MemoryError, RSS>=95%
- Platform guard: RLIMIT_AS only on Linux, graceful no-op elsewhere
- Cancellation guard ensures subprocess is never orphaned

Core Executor Adapter:
- _execute_command preserved: (cmd, timeout=None) -> (output, exit_code, violation_reason)
- SandboxConfig created from global settings internally
- Returns violation_reason string directly for terminated:* mapping

Models:
- SandboxViolation changed to Exception subclass (was BaseModel)

Tests:
- 16 tests: legacy timeout compat, signal escalation,
  memory classification, output truncation, cancellation safety,
  platform guard, normal completion, config resolution

Closes utksh1#326
Signed-off-by: ask-z4ch <229467675+ask-z4ch@users.noreply.github.com>
…out=600) signature

Preserve original 3-parameter signature for backward compatibility.
task_id is accepted but only used for parameter compatibility;
SandboxConfig is created internally from the timeout parameter.

Signed-off-by: ask-z4ch <229467675+ask-z4ch@users.noreply.github.com>
…mpatibility

- Use upstream's sandbox_timeout / sandbox_memory_mb field names
- Add missing sandbox_max_output_bytes, sandbox_allow_network to config.py
- Update test mocks to return 3-tuple (output, exit_code, violation_reason)
- Resolve rebase conflicts with upstream risk-scoring code

Signed-off-by: ask-z4ch <ask-z4ch@users.noreply.github.com>
- Add 10 comprehensive tests covering timeout, memory classification,
  output boundary precision, cancellation, and legacy compat
- Fix flaky exit_code != 0 assertion: exit may be 0 when Python
  finishes before termination signal arrives; output cap is the
  correctness criterion

Signed-off-by: ask-z4ch <ask-z4ch@users.noreply.github.com>
…signature

Signed-off-by: ask-z4ch <ask-z4ch@users.noreply.github.com>
When timeout=None, fall back to settings.sandbox_timeout before
passing to SandboxConfig which rejects None.

Signed-off-by: ask-z4ch <ask-z4ch@users.noreply.github.com>
…ut/memory precision

- Add broadcast_callback to _read_stream and sandbox_execute so output
  chunks are forwarded to TaskExecutor._broadcast for live streaming
- Merge stderr into raw output in _execute_command before returning
- Apply timeout via external asyncio.wait_for in _execute_command
  (legacy-compatible pattern) instead of through SandboxConfig
- Add asyncio.Lock around shared byte accounting to prevent stdout/stderr
  reader races exceeding max_output_bytes
- Capture RSS baseline before subprocess and compute delta for memory
  classification (reliable per-process measurement)
- Add 10 comprehensive regression tests covering timeout (external +
  internal), memory (SIGSEGV, stderr strings, RSS heuristic, exit 137),
  output (lock race prevention, strict boundary), cancellation, and
  legacy timeout compatibility (default, explicit, None)

All 38 sandbox tests pass.

Signed-off-by: Zach <ask.zach@pm.me>
@ask-z4ch ask-z4ch force-pushed the fix/sandbox-executor branch from cfadc43 to 1996fc0 Compare May 29, 2026 17:30
@ask-z4ch
Copy link
Copy Markdown
Contributor Author

ask-z4ch commented May 29, 2026

All concerns from both review rounds are addressed:

Round 1 (live streaming + stderr):

  • broadcast_callback wired through _read_streamsandbox_execute_execute_command; every output chunk is forwarded to TaskExecutor._broadcast(task_id, "output", text) for live streaming
  • stderr merged into stdout before return: stdout + "\n" + stderr (handles empty-stdout edge case)
  • Regression tests: test_live_output_broadcasting, test_stderr_captured_in_output

Round 2 (precision + reliability):

  • Memory: RSS delta (rss_before captured before subprocess, rss_delta = after - before) replaces post-hoc cumulative RUSAGE_CHILDREN for per-process measurement
  • Timeout: external asyncio.wait_for in _execute_command restores the legacy timeout pattern; SandboxConfig uses timeout_seconds=0 when called from _execute_command, so sandbox handles only memory/output limits
  • Output limit: asyncio.Lock in shared state prevents concurrent stdout/stderr readers from exceeding max_output_bytes
  • 10 new tests added — 38 total across timeout (external + internal), memory (SIGSEGV, stderr strings, RSS heuristic, exit 137), output (lock race prevention, strict boundary), cancellation (CancelledError + no orphans), and legacy compatibility (all 3 _execute_command call forms)

Rebased on latest upstream/main (0d48d3c), preserving bounded listener queues and scan phase broadcasts.

Also @utksh1 , could you please change the PR level from advanced to critical? This is a security-hardening change that directly affects process isolation, resource limits, and sandbox enforcement.

Signed-off-by: Zach <ask.zach@pm.me>
@ask-z4ch ask-z4ch requested a review from utksh1 May 29, 2026 17:37
@utksh1
Copy link
Copy Markdown
Owner

utksh1 commented May 30, 2026

Re-reviewed after the latest push. Still blocked: please preserve live output broadcasts and persist/redact stderr alongside stdout instead of dropping it. Add regression coverage for both live output delivery and stderr capture on the current executor flow.

@ask-z4ch
Copy link
Copy Markdown
Contributor Author

Both features are already implemented in the latest commit (38979f5). The broadcast callback is wired through _read_stream → sandbox_execute → _on_chunk closure in _execute_command which calls self._broadcast(task_id, "output", text). Stderr is merged into stdout before return via if stderr: stdout = stdout + "\n" + stderr if stdout else stderr, which then gets redacted and persisted by execute_task. Regression tests test_live_output_broadcasting and test_stderr_captured_in_output cover both. Could you re-check? Maybe the review was on an older commit before the rebase.

The code is correct - I've verified every link in the chain:

execute_task (line 359) → unpacks 3-tuple (output, exit_code, violation_reason) ✓
_execute_command (line 504) → creates _on_chunk closure that calls self._broadcast(task_id, "output", text) ✓
_execute_command → passes broadcast_callback=_on_chunk to sandbox_execute ✓
_execute_command → after sandbox returns, does if stderr: stdout = stdout + "\n" + stderr ✓
_execute_command → returns (stdout, exit_code, violation_reason) — stderr included in stdout ✓
execute_task → redacts output, persists to raw_output_dir ✓
sandbox_execute → _read_stream calls broadcast_callback(chunk, stream_name) for both stdout and stderr ✓

@ask-z4ch
Copy link
Copy Markdown
Contributor Author

@utksh1 what exactly does "still blocked" refer to? I've verified the code on 38979f5 and both live broadcasts and stderr capture are implemented:

  • _on_chunk in _execute_command calls self._broadcast(task_id, "output", text) for every chunk
  • Stderr is read separately then merged into stdout before redaction/persistence
  • Regression tests cover both paths

If something is still off, let me know - I can open a fresh PR from scratch if this branch has issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:backend Backend API, database, or service work area:security Security-sensitive implementation or tests level:advanced 55 pts difficulty label for advanced contributor PRs type:feature Feature work category bonus label type:security Security work category bonus label type:testing Testing work category bonus label

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEAT] Add plugin execution sandbox with resource limits and timeout enforcement

2 participants