fix(util/worker, ai/task): TTL-based pendingAborts eviction; clarify runWithIterable bond #500

Merged
sroussey merged 8 commits into main from claude/loving-mendel-Hcznj-abort
May 15, 2026

@sroussey
Collaborator

Why

WorkerServerBase.pendingAborts previously used FIFO eviction at size 1000 (dropping the oldest 500 in bulk once the cap was crossed). A misbehaving client spamming aborts for ids that never landed could silently evict a legitimate pending abort before its matching call arrived — the run-fn would then execute un-aborted, defeating the whole abort-before-call race fix.

Separately, runWithIterable's top JSDoc misleadingly suggested a two-way abort bond ("leaves the parent alone"). The implementation is actually parent → child only, and the wording invited future contributors to reintroduce reverse propagation by accident.

What

TTL-based pendingAborts (packages/util/src/worker/WorkerServerBase.ts):

  • pendingAborts is now Map<string, number> (keyed on id, with the insertion timestamp as the value) instead of Set<string>.
  • 30s TTL via PENDING_ABORT_TTL_MS. Each insertion schedules a one-shot setTimeout cleanup (mirrors the existing scheduleCompletedRequestCleanup pattern); timer handles are tracked in pendingAbortTimers so dispose-style hooks can clear them.
  • Inline TTL sweep on every recordPendingAbort — drops any entries older than the TTL.
  • 10k hard cap (PENDING_ABORT_HARD_CAP) as a memory safety-net; when exceeded, evict the half with the lowest timestamps (oldest-by-timestamp, not insertion order).
  • consumePendingAbort rechecks the TTL on consumption: present + fresh → abort + delete + return true; present + expired → delete + return false; absent → return false.
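
The record/consume rules above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in WorkerServerBase.ts: the class name PendingAbortStore, its constructor parameters, and the injectable clock are hypothetical, the per-id cleanup timers are left out, and the caller is expected to perform the actual abort when consume returns true.

```typescript
// Hypothetical sketch; names and shape are assumptions, not the real WorkerServerBase.
class PendingAbortStore {
  // id -> insertion timestamp in milliseconds
  private readonly entries = new Map<string, number>();
  private readonly ttlMs: number;
  private readonly hardCap: number;
  private readonly now: () => number;

  constructor(ttlMs = 30_000, hardCap = 10_000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.hardCap = hardCap;
    this.now = now;
  }

  get size(): number {
    return this.entries.size;
  }

  // Remember that an abort arrived before its matching call.
  record(id: string): void {
    this.entries.set(id, this.now());
    if (this.entries.size > this.hardCap) {
      this.evictOldestHalf();
    }
  }

  // Present + fresh -> true; present + expired -> false; absent -> false.
  // The marker is deleted whenever it was present.
  consume(id: string): boolean {
    const ts = this.entries.get(id);
    if (ts === undefined) return false;
    this.entries.delete(id);
    return this.now() - ts <= this.ttlMs;
  }

  // Memory safety net: drop the half with the lowest timestamps,
  // regardless of the order in which ids were first inserted.
  private evictOldestHalf(): void {
    const oldestFirst = Array.from(this.entries).sort((a, b) => a[1] - b[1]);
    const dropCount = Math.floor(oldestFirst.length / 2);
    for (const [id] of oldestFirst.slice(0, dropCount)) {
      this.entries.delete(id);
    }
  }
}
```

Injecting the clock keeps the sketch testable without fake timers; a real implementation would presumably read Date.now() directly, as the description implies.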

Comment fix (packages/ai/src/task/base/runWithIterable.ts):

  • Replace the misleading paragraph with explicit one-way-bond description plus two @example blocks showing parent-cancel vs consumer-break semantics.
  • Add an inline // One-way bond: parent -> child only. marker above the addEventListener("abort", ...) line.
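
For readers unfamiliar with the pattern, a one-way bond of this kind might look like the sketch below. It is an illustration only: linkChildToParent is a hypothetical helper, and the real wiring inside runWithIterable is not shown in this PR description.

```typescript
// Hypothetical helper; illustrates a parent -> child abort bond in general,
// not the actual runWithIterable implementation.
function linkChildToParent(parent: AbortSignal): AbortController {
  const child = new AbortController();
  if (parent.aborted) {
    // Parent was already cancelled: the child starts out aborted.
    child.abort();
  } else {
    // One-way bond: parent -> child only.
    parent.addEventListener("abort", () => child.abort(), { once: true });
  }
  // Nothing listens on child.signal to abort the parent, so a consumer
  // that aborts its own child leaves the parent and its siblings running.
  return child;
}
```

Aborting the returned controller (for example from a finally block after a consumer breaks out of iteration) affects only that child; aborting the parent's controller reaches every linked child.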

Tests (packages/test/src/test/util/WorkerServerBase.race.test.ts):

  • New describe("WorkerServerBase pendingAborts eviction policy", ...) block with three cases under vi.useFakeTimers():
    • Eviction-correctness: 1500 distinct aborts in a single tick + matching call for a-500 → signal observed aborted (would have failed under FIFO).
    • TTL-expiry: single abort + vi.advanceTimersByTime(31_000) + matching call → signal observed un-aborted.
    • Hard-cap safety: HARD_CAP + 10 aborts with 1ms steps → pendingAborts.size <= HARD_CAP, newest id survives (consume succeeds), oldest id evicted (consume fails). Also asserts the behavioural path via the same run-fn registration shape as the existing tests.

Verification

  • bun test packages/test/src/test/util/WorkerServerBase.race.test.ts — runs the three new tests plus the three pre-existing abort-before-call race tests (regression coverage).
  • Type-check + lint on the two source files.

Risks

  • The 30s TTL must comfortably exceed worst-case message-port roundtrip latency between the main thread and the worker. In practice this is microseconds-to-milliseconds, so 30s leaves a safety margin of at least four orders of magnitude; not a concern.
  • consumePendingAbort now rechecks Date.now() - ts > TTL on every call (one extra subtraction). Negligible.
  • The per-id setTimeout adds one timer per abort. With the 10k hard cap that's at most 10k pending timers — well within Node/Bun limits, and they GC as they fire.

Rollback

Two independent commits, each rollback-able on its own:

  1. fix(util/worker): TTL-based pendingAborts eviction — revert restores the FIFO Set policy and removes the new tests.
  2. docs(ai/task): clarify runWithIterable abort bond is one-way — comment-only; revert restores the prior JSDoc verbatim.

Note: this PR is branched off main as claude/loving-mendel-Hcznj-abort rather than the default session branch claude/loving-mendel-Hcznj because a separate PR re-enabling CI is being authored on that branch in parallel. The two changes have independent review concerns and should be reviewable in isolation.


Generated by Claude Code

@pkg-pr-new

pkg-pr-new Bot commented May 14, 2026

@workglow/cli

npm i https://pkg.pr.new/@workglow/cli@500

@workglow/ai

npm i https://pkg.pr.new/@workglow/ai@500

@workglow/browser-control

npm i https://pkg.pr.new/@workglow/browser-control@500

@workglow/indexeddb

npm i https://pkg.pr.new/@workglow/indexeddb@500

@workglow/javascript

npm i https://pkg.pr.new/@workglow/javascript@500

@workglow/job-queue

npm i https://pkg.pr.new/@workglow/job-queue@500

@workglow/knowledge-base

npm i https://pkg.pr.new/@workglow/knowledge-base@500

@workglow/mcp

npm i https://pkg.pr.new/@workglow/mcp@500

@workglow/storage

npm i https://pkg.pr.new/@workglow/storage@500

@workglow/task-graph

npm i https://pkg.pr.new/@workglow/task-graph@500

@workglow/tasks

npm i https://pkg.pr.new/@workglow/tasks@500

@workglow/util

npm i https://pkg.pr.new/@workglow/util@500

workglow

npm i https://pkg.pr.new/workglow@500

@workglow/anthropic

npm i https://pkg.pr.new/@workglow/anthropic@500

@workglow/bun-webview

npm i https://pkg.pr.new/@workglow/bun-webview@500

@workglow/chrome-ai

npm i https://pkg.pr.new/@workglow/chrome-ai@500

@workglow/electron

npm i https://pkg.pr.new/@workglow/electron@500

@workglow/google-gemini

npm i https://pkg.pr.new/@workglow/google-gemini@500

@workglow/huggingface-inference

npm i https://pkg.pr.new/@workglow/huggingface-inference@500

@workglow/huggingface-transformers

npm i https://pkg.pr.new/@workglow/huggingface-transformers@500

@workglow/node-llama-cpp

npm i https://pkg.pr.new/@workglow/node-llama-cpp@500

@workglow/ollama

npm i https://pkg.pr.new/@workglow/ollama@500

@workglow/openai

npm i https://pkg.pr.new/@workglow/openai@500

@workglow/playwright

npm i https://pkg.pr.new/@workglow/playwright@500

@workglow/postgres

npm i https://pkg.pr.new/@workglow/postgres@500

@workglow/sqlite

npm i https://pkg.pr.new/@workglow/sqlite@500

@workglow/supabase

npm i https://pkg.pr.new/@workglow/supabase@500

@workglow/tf-mediapipe

npm i https://pkg.pr.new/@workglow/tf-mediapipe@500

commit: 1fa4277

@sroussey sroussey force-pushed the claude/loving-mendel-Hcznj-abort branch from 9d7d9ff to 3251e0c Compare May 15, 2026 00:20
sroussey added 5 commits May 15, 2026 00:26
Replace the FIFO eviction policy on `pendingAborts` in `WorkerServerBase`
with a TTL-keyed `Map<id, timestamp>`. Under the old policy a misbehaving
client spamming aborts for ids that never landed could silently evict a
legitimate pending abort (the first 500 entries were dropped once the set
crossed 1000) before its matching `call` arrived; the run-fn would then
execute un-aborted.

Changes:
 - `pendingAborts` is now `Map<string, number>`, keyed on id with the insertion timestamp as the value.
 - 30s TTL; per-id one-shot cleanup timer (mirrors the
   `scheduleCompletedRequestCleanup` pattern).
 - Inline TTL sweep on every record.
 - 10k hard cap as a memory safety-net; when exceeded, evict the half with
   the lowest timestamps (oldest-by-timestamp, not insertion order).
 - `consumePendingAbort` rechecks the TTL on consumption so a stale marker
   that slipped past the sweep cannot wrongly abort a fresh call.

Tests cover eviction correctness under a 1500-id flood, TTL expiry past
31s, and hard-cap behaviour preserving the newest ids while evicting the
oldest.

The top-of-function JSDoc on `runWithIterable` had misleading language
about "leaving the parent alone" / two-way bond. The implementation is
actually parent->child only:

  - context.signal.abort() mirrors into localAbort -> strategy stops.
  - localAbort.abort() (from the finally) does NOT touch parentSignal;
    sibling consumers of the parent signal keep running.

Rewrite the comment to spell out the one-way bond explicitly and add an
inline marker above the `addEventListener("abort", ...)` line so future
readers don't reintroduce a reverse propagation.

The new "hard-cap safety" test inserts 10,010 pendingAborts entries to
exercise the >10k cap eviction path. Each insertion runs the O(n) inline
TTL sweep in recordPendingAbort and schedules 2 fake setTimeouts, so the
test does ~50M Map iterations plus ~20k fake-timer registrations. This
exceeds vitest's default 15s timeout on slower CI runners, even though
the assertions themselves are correct.
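
The ~50M estimate can be reproduced with a small calculation. This sketch assumes the sweep visits every entry present after each insertion and ignores TTL expiry and hard-cap eviction, so it is an approximation; sweepIterations is a hypothetical name.

```typescript
// Total entries visited when an O(n) sweep runs after each of n insertions:
// 1 + 2 + ... + n = n * (n + 1) / 2. Assumes nothing is evicted along the way.
function sweepIterations(insertions: number): number {
  return (insertions * (insertions + 1)) / 2;
}

console.log(sweepIterations(10_010)); // 50105055, i.e. roughly 50M
```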

Give just this one test a 60s budget; the other two new tests are cheap
and keep the default timeout.

The inline `for (const [key, ts] of this.pendingAborts)` sweep ran on
every abort and made `recordPendingAbort` O(n) per call — O(n^2) under
the hard-cap test's 10,010-entry burst, which exceeded the 60s test
timeout (~100M iterations).

Cleanup is now amortised O(1) per call via three complementary mechanisms:
  1. Per-id `setTimeout` (already present) drops each entry after TTL.
  2. `consumePendingAbort` re-checks the timestamp at consume time
     (belt-and-braces for the TTL/timer-firing window).
  3. The hard-cap overflow branch (already present) evicts the oldest
     half in one pass when size exceeds PENDING_ABORT_HARD_CAP — the
     memory safety-net for pathological bursts that outrun the timers.

Behaviour is unchanged: the TTL-expiry test relies on the per-id
setTimeout (advancing the fake timers fires it); the hard-cap test relies on
the overflow eviction; the eviction-correctness test relies on a fresh
marker not being evicted (now guaranteed since nothing else touches the
map during a synchronous burst).

The hard-cap regression test was inserting 10_010 entries through
recordPendingAbort under vi.useFakeTimers(), which empirically takes
>60s on CI runners due to vitest's fake-timer heap costs (each setTimeout
registration is O(log n), and with 10k+ entries queued the constant
factors stack up).

Make the hard cap injectable via a constructor option (default 10_000)
so the test can use cap=100 and insert 110 entries — same assertions,
~100x less work. The eviction-correctness and TTL-expiry tests are
unchanged since they only use 1500 / 1 entries respectively.
@sroussey sroussey force-pushed the claude/loving-mendel-Hcznj-abort branch from 3251e0c to 94d2497 Compare May 15, 2026 00:26
@github-actions

github-actions Bot commented May 15, 2026

Coverage Report

Status | Category | Percentage | Covered / Total
🔵 | Lines | 62.23% | 21880 / 35155
🔵 | Statements | 62.1% | 22650 / 36470
🔵 | Functions | 64.03% | 4136 / 6459
🔵 | Branches | 50.8% | 10484 / 20634

File Coverage: No changed files found.
Generated in workflow #2253 for commit 1fa4277 by the Vitest Coverage Report Action

Contributor

Copilot AI left a comment

Pull request overview

This PR updates worker abort-race handling to use TTL-based pending abort retention instead of FIFO eviction, adds regression coverage for pending abort eviction behavior, and clarifies runWithIterable abort propagation documentation.

Changes:

  • Adds WorkerServerBaseOptions and replaces pendingAborts Set with timestamped TTL/cap-based Map handling.
  • Adds tests for abort flood survival, TTL expiry, and hard-cap eviction behavior.
  • Updates runWithIterable JSDoc to describe parent-to-child-only abort propagation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File | Description
packages/util/src/worker/WorkerServerBase.ts | Implements timestamped pending-abort tracking with TTL timers and hard-cap eviction.
packages/test/src/test/util/WorkerServerBase.race.test.ts | Adds coverage for the new pending-abort eviction policy.
packages/ai/src/task/base/runWithIterable.ts | Clarifies abort propagation documentation and inline comments.

Comments suppressed due to low confidence (1)

packages/util/src/worker/WorkerServerBase.ts:356

  • The hard cap only bounds pendingAborts; every abort still adds a timeout handle to pendingAbortTimers, and evicting entries at the cap does not clear those handles. A burst above the cap (or repeated aborts for the same id) can therefore create far more than 10k live timers for up to 30s, so the intended memory safety-net does not actually bound timer memory/work.
    const timer = setTimeout(() => {
      this.pendingAborts.delete(id);
      this.pendingAbortTimers.delete(timer);
    }, WorkerServerBase.PENDING_ABORT_TTL_MS);
    this.pendingAbortTimers.add(timer);
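
One way to address this concern, in line with the follow-up commit's move to a Map<id, timer>, is to key timer handles by id so that re-records and evictions can cancel them. The sketch below is illustrative only: TimedPendingAborts and its methods are hypothetical names, and the merged code may differ.

```typescript
// Hypothetical sketch; not the actual WorkerServerBase implementation.
class TimedPendingAborts {
  private readonly stamps = new Map<string, number>();
  private readonly timers = new Map<string, ReturnType<typeof setTimeout>>();
  private readonly ttlMs: number;

  constructor(ttlMs = 30_000) {
    this.ttlMs = ttlMs;
  }

  get timerCount(): number {
    return this.timers.size;
  }

  record(id: string): void {
    // Cancel the timer from any earlier record of this id so it cannot
    // delete the renewed entry before the new TTL elapses.
    this.clearTimer(id);
    this.stamps.set(id, Date.now());
    const timer = setTimeout(() => {
      this.stamps.delete(id);
      this.timers.delete(id);
    }, this.ttlMs);
    this.timers.set(id, timer);
  }

  // Returns true only for a marker that exists and is still within its TTL.
  consume(id: string): boolean {
    const ts = this.stamps.get(id);
    this.stamps.delete(id);
    this.clearTimer(id);
    return ts !== undefined && Date.now() - ts <= this.ttlMs;
  }

  // Dispose-style hook: cancel every outstanding timer.
  dispose(): void {
    this.timers.forEach((timer) => clearTimeout(timer));
    this.timers.clear();
    this.stamps.clear();
  }

  private clearTimer(id: string): void {
    const timer = this.timers.get(id);
    if (timer !== undefined) {
      clearTimeout(timer);
      this.timers.delete(id);
    }
  }
}
```

With at most one timer per id, the number of live timers can never exceed the number of pending entries, so a cap on the data map also bounds the timers as long as eviction calls the same clearing path.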

Comment on lines +350 to +353
    // to drop this id after its TTL elapses. Stored so dispose-style hooks
    // can clear pending timers; otherwise they GC naturally.
    const timer = setTimeout(() => {
      this.pendingAborts.delete(id);

    * to the provider, releasing the connection promptly rather than letting
    * the run keep streaming events into a closed queue.
    * Abort bond is strictly one-way: parent → child.
    * • context.signal.abort() → localAbort aborts → strategy sees it.

Comment on lines +84 to +85
    * hard-coded values. Subclasses (platform-specific `WorkerServer` classes)
    * accept the same shape and forward it via `super(options)`.

Address Copilot review comments on #500:

- recordPendingAbort: re-recording an abort for an id that was already
  pending used to leave the previous setTimeout active. That old timer
  would fire at its original +TTL and delete the renewed entry early,
  even though the most recent record had its own full TTL. Switch
  pendingAbortTimers from Set<timer> to Map<id, timer>, clearTimeout
  the previous timer on re-record, and remove timer entries from the
  map on consume / TTL-fire / hard-cap eviction. This also bounds the
  timer-handle memory (previously the hard cap only bounded the data
  map, leaving timers to accumulate until TTL).

- runWithIterable JSDoc: the example demonstrated an AbortSignal method
  that doesn't exist (signals can't be aborted directly; only their
  owning AbortController can). Rewrite the example to use a controller.

- WorkerServerBase JSDoc: the pendingAbortHardCap option is only
  reachable via `new WorkerServerBase(...)` directly. Platform
  subclasses use the default. Clarify in-line so consumers don't think
  `new WorkerServer({ pendingAbortHardCap: ... })` works.
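
The third point can be pictured with the sketch below. All names except pendingAbortHardCap are hypothetical, and the assumption that platform subclasses simply call super() without options is an illustration of the described behaviour, not a copy of the real classes.

```typescript
// Hypothetical shapes; only the option name pendingAbortHardCap comes from this PR.
interface SketchServerOptions {
  pendingAbortHardCap?: number;
}

class SketchServerBase {
  static readonly DEFAULT_PENDING_ABORT_HARD_CAP = 10_000;
  readonly pendingAbortHardCap: number;

  constructor(options: SketchServerOptions = {}) {
    // Fall back to the default when the option is not supplied.
    this.pendingAbortHardCap =
      options.pendingAbortHardCap ?? SketchServerBase.DEFAULT_PENDING_ABORT_HARD_CAP;
  }
}

// Assumed platform subclass: its constructor takes no options, so callers
// cannot override the cap and always get the default.
class SketchPlatformServer extends SketchServerBase {
  constructor() {
    super();
  }
}
```

A test can therefore construct the base class directly with a small cap (for example 100), while production code that goes through a platform subclass keeps the 10,000 default.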
@sroussey
Collaborator Author

@copilot resolve the merge conflicts in this pull request

Copilot AI and others added 2 commits May 15, 2026 17:21
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
Contributor

Copilot AI commented May 15, 2026

@copilot resolve the merge conflicts in this pull request

Resolved the merge conflicts by merging origin/main into this branch and fixing the WorkerServerBase conflict while preserving the TTL-based pending-abort behavior. I also fixed follow-up merge fallout (completed-request cap constants + aligned eviction test expectations) in 8fb963b.

@sroussey sroussey merged commit 1b9cc10 into main May 15, 2026
4 checks passed
@sroussey sroussey deleted the claude/loving-mendel-Hcznj-abort branch May 15, 2026 17:35

3 participants