Replace test-count recycling with resource-based worker limits#108

Merged
thejchap merged 1 commit into recycle-workers-on-test-count from claude/worker-monitoring-strategy-ZBL3W
May 11, 2026
@thejchap
Owner

Replace the fixed MAX_TESTS_PER_WORKER constant with a flexible, resource-aware worker recycling system that monitors memory usage, open file descriptors, and wall-clock age.

Summary

Worker processes are now recycled based on soft resource ceilings rather than a hard test count. This allows the runner to respond to actual resource pressure (memory leaks, FD accumulation) while remaining deterministic and testable.

Key Changes

  • New resource monitoring types (WorkerHealth, WorkerLimits, RecycleReason):

    • WorkerHealth: Captures worker self-reported RSS bytes and open FD count
    • WorkerLimits: Configurable soft ceilings (default: 1 GiB RSS, 200 FDs, 10 min age)
    • RecycleReason: Enum indicating which limit triggered recycling, with human-readable Display impl for debug logs
  • Pure recycle-decision logic (evaluate_recycle function):

    • Factored out of WorkerProcess for unit testability
    • Checks signals in priority order (memory > FDs > age) to report the strongest reason
    • Respects None limits (opt-out) and missing readings (platform unavailability)
  • Worker-side health reporting (Python):

    • New _measure_rss_bytes() and _measure_open_fds() functions using the resource module and the /proc and /dev filesystems
    • _measure_health() snapshot attached to every run_test/run_doctest response
    • Platform-aware: returns None for unavailable signals rather than guessing
  • Protocol updates:

    • New WorkerHealthWire struct for wire format
    • New RunTestResponseWire wrapper combining health snapshot with test result
    • Python side merges outcome dict with top-level health field
  • Pool API expansion:

    • New WorkerPool::with_python_path_and_limits() for tests to set custom limits
    • WorkerSpawnCtx struct bundles spawn-time parameters to reduce signature bloat
    • Recycling now deferred to end-of-unit (after finalize hooks) to preserve scope fixture teardown
  • Test updates:

    • worker_recycles_after_max_tests → worker_recycles_when_age_exceeds_limit: Uses a tiny max_age cap and a sleep to drive deterministic recycling
    • recycle_does_not_skip_scope_fixture_teardown: Simplified to use age-based recycling within a single unit
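
To make the decision logic above concrete, here is a minimal Rust sketch of how the three types and `evaluate_recycle` could fit together. It is not the code from this PR: the field names, variant payloads, function signature, log wording, and the strict `>` comparison are all assumptions made for illustration. The commit message describes the limits as `Option<u64>`; because the unit of the age cap is not stated there, this sketch uses `Option<Duration>` for age instead.

```rust
use std::fmt;
use std::time::Duration;

/// Worker self-reported readings. `None` means the platform could not measure it.
#[derive(Debug, Clone, Copy, Default)]
pub struct WorkerHealth {
    pub rss_bytes: Option<u64>,
    pub open_fds: Option<u64>,
}

/// Soft ceilings. `None` opts out of that signal. Field names are hypothetical.
#[derive(Debug, Clone, Copy)]
pub struct WorkerLimits {
    pub max_rss_bytes: Option<u64>,
    pub max_open_fds: Option<u64>,
    pub max_age: Option<Duration>,
}

impl Default for WorkerLimits {
    /// Defaults quoted in the PR description: 1 GiB RSS, 200 FDs, 10 min age.
    fn default() -> Self {
        Self {
            max_rss_bytes: Some(1024 * 1024 * 1024),
            max_open_fds: Some(200),
            max_age: Some(Duration::from_secs(10 * 60)),
        }
    }
}

impl WorkerLimits {
    /// Disables every cap, so a test can switch on one signal at a time.
    pub fn unlimited() -> Self {
        Self { max_rss_bytes: None, max_open_fds: None, max_age: None }
    }
}

/// Which limit tripped. Variant payloads are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecycleReason {
    Memory { rss_bytes: u64, limit: u64 },
    OpenFds { open_fds: u64, limit: u64 },
    Age { age: Duration, limit: Duration },
}

impl fmt::Display for RecycleReason {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RecycleReason::Memory { rss_bytes, limit } => {
                write!(f, "memory: rss {} B over limit {} B", rss_bytes, limit)
            }
            RecycleReason::OpenFds { open_fds, limit } => {
                write!(f, "open fds: {} over limit {}", open_fds, limit)
            }
            RecycleReason::Age { age, limit } => {
                write!(f, "age: {:?} over limit {:?}", age, limit)
            }
        }
    }
}

/// Pure decision function: checks memory, then FDs, then age, and returns
/// the first limit that is exceeded. A missing reading or a `None` limit
/// skips that signal.
pub fn evaluate_recycle(
    health: &WorkerHealth,
    age: Duration,
    limits: &WorkerLimits,
) -> Option<RecycleReason> {
    if let (Some(rss_bytes), Some(limit)) = (health.rss_bytes, limits.max_rss_bytes) {
        if rss_bytes > limit {
            return Some(RecycleReason::Memory { rss_bytes, limit });
        }
    }
    if let (Some(open_fds), Some(limit)) = (health.open_fds, limits.max_open_fds) {
        if open_fds > limit {
            return Some(RecycleReason::OpenFds { open_fds, limit });
        }
    }
    if let Some(limit) = limits.max_age {
        if age > limit {
            return Some(RecycleReason::Age { age, limit });
        }
    }
    None
}
```

Because the function takes plain values and touches no process state, the priority order and the opt-out paths can be unit-tested without spawning a worker, which is the testability benefit the description refers to.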

Notable Implementation Details

  • Recycling is checked at end-of-unit (not mid-test) to ensure per="scope" fixture teardown runs before process death
  • Health snapshot is updated after every test response, enabling responsive recycling
  • Default limits are tuned for real workloads: 1 GiB RSS (headroom for heavy native deps), 200 FDs (below macOS soft limit), 10 min age (bounds slow leaks)
  • WorkerLimits::unlimited() helper disables all caps for tests exercising one signal in isolation
  • Recycling reason is logged for post-mortem triage of memory leaks vs FD pressure vs age drift
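
The end-of-unit deferral can be illustrated with a small, self-contained Rust sketch. The `Worker` type, its fields, and its methods are hypothetical stand-ins rather than the `WorkerProcess` API from this PR, and only the memory signal is modelled: each test response refreshes the stored snapshot, while the limit is consulted only after the unit's finalize hooks have run.

```rust
/// Hypothetical stand-in for a worker handle. It records an event log so
/// the ordering of teardown and recycling is observable.
struct Worker {
    last_rss_bytes: Option<u64>,
    events: Vec<String>,
}

impl Worker {
    fn new() -> Self {
        Self { last_rss_bytes: None, events: Vec::new() }
    }

    /// Handling a test response only updates the health snapshot.
    /// No recycle decision is made mid-unit.
    fn run_test(&mut self, name: &str, reported_rss: Option<u64>) {
        self.events.push(format!("test {}", name));
        self.last_rss_bytes = reported_rss;
    }

    /// At the unit boundary: run finalize hooks first, then check the limit.
    /// Returns true when the worker should be replaced.
    fn finish_unit(&mut self, max_rss_bytes: Option<u64>) -> bool {
        self.events.push("finalize hooks".to_string());
        let over = matches!(
            (self.last_rss_bytes, max_rss_bytes),
            (Some(rss), Some(limit)) if rss > limit
        );
        if over {
            self.events.push("recycle".to_string());
        }
        over
    }
}
```

In this sketch a worker that crosses the limit on its first test still runs every remaining test in the unit and its finalize hooks before it is replaced, which is the ordering that keeps scope fixture teardown intact.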

https://claude.ai/code/session_01PMbxzSuASTEYbDFEu4SQqy

Replaces the hardcoded `MAX_TESTS_PER_WORKER = 128` cap with
self-reported resource snapshots: every `run_test`/`run_doctest`
response carries a `WorkerHealthWire { rss_bytes, open_fds }` that the
runner consults at unit boundaries alongside a wall-clock age check.
Recycle decisions now name the tripped signal via a `RecycleReason`
enum, so debug logs can attribute drops to memory pressure vs FD
exhaustion vs slow drift.

Soft ceilings live on `WorkerLimits` (1 GiB RSS / 200 FDs / 10 min by
default, all `Option<u64>` so platforms without `/proc/self/fd` or
`resource` simply skip that signal). Tests use the new
`WorkerPool::with_python_path_and_limits` constructor with tiny caps
(or `WorkerLimits::unlimited`) to exercise recycle behaviour
deterministically; production call sites in `tryke`, `tryke_server`,
and `tryke_dev` are unchanged because `with_python_path` defaults to
`WorkerLimits::default()`.

Test coverage: end-to-end age-based recycle test (replaces the
test-count one); existing scope-fixture-teardown test now triggers via
age cap; new unit tests for `evaluate_recycle` priority order and
no-signal/no-cap fallbacks.

https://claude.ai/code/session_01PMbxzSuASTEYbDFEu4SQqy
@thejchap thejchap merged commit c20ca3b into recycle-workers-on-test-count May 11, 2026
@thejchap thejchap deleted the claude/worker-monitoring-strategy-ZBL3W branch May 11, 2026 13:35
thejchap added a commit that referenced this pull request May 13, 2026
Co-authored-by: Claude <noreply@anthropic.com>