fix(util/worker, ai/task): TTL-based pendingAborts eviction; clarify runWithIterable bond#500
Conversation
@workglow/cli
@workglow/ai
@workglow/browser-control
@workglow/indexeddb
@workglow/javascript
@workglow/job-queue
@workglow/knowledge-base
@workglow/mcp
@workglow/storage
@workglow/task-graph
@workglow/tasks
@workglow/util
workglow
@workglow/anthropic
@workglow/bun-webview
@workglow/chrome-ai
@workglow/electron
@workglow/google-gemini
@workglow/huggingface-inference
@workglow/huggingface-transformers
@workglow/node-llama-cpp
@workglow/ollama
@workglow/openai
@workglow/playwright
@workglow/postgres
@workglow/sqlite
@workglow/supabase
@workglow/tf-mediapipe
commit: |
9d7d9ff to
3251e0c
Compare
Replace the FIFO eviction policy on `pendingAborts` in `WorkerServerBase` with a TTL-keyed `Map<id, timestamp>`. Under the old policy a misbehaving client spamming aborts for ids that never landed could silently evict a legitimate pending abort (the first 500 entries were dropped once the set crossed 1000) before its matching `call` arrived; the run-fn would then execute un-aborted. Changes: - `pendingAborts` is now `Map<string, number>` keyed on insertion timestamp. - 30s TTL; per-id one-shot cleanup timer (mirrors the `scheduleCompletedRequestCleanup` pattern). - Inline TTL sweep on every record. - 10k hard cap as a memory safety-net; when exceeded, evict the half with the lowest timestamps (oldest-by-timestamp, not insertion order). - `consumePendingAbort` rechecks the TTL on consumption so a stale marker that slipped past the sweep cannot wrongly abort a fresh call. Tests cover eviction correctness under a 1500-id flood, TTL expiry past 31s, and hard-cap behaviour preserving the newest ids while evicting the oldest.
The top-of-function JSDoc on `runWithIterable` had misleading language
about "leaving the parent alone" / two-way bond. The implementation is
actually parent->child only:
- context.signal.abort() mirrors into localAbort -> strategy stops.
- localAbort.abort() (from the finally) does NOT touch parentSignal;
sibling consumers of the parent signal keep running.
Rewrite the comment to spell out the one-way bond explicitly and add an
inline marker above the `addEventListener("abort", ...)` line so future
readers don't reintroduce a reverse propagation.
The new "hard-cap safety" test inserts 10,010 pendingAborts entries to exercise the >10k cap eviction path. Each insertion runs the O(n) inline TTL sweep in recordPendingAbort and schedules 2 fake setTimeouts, so the test does ~50M Map iterations plus ~20k fake-timer registrations. This exceeds vitest's default 15s timeout on slower CI runners, even though the assertions themselves are correct. Give just this one test a 60s budget; the other two new tests are cheap and keep the default timeout.
The inline `for (const [key, ts] of this.pendingAborts)` sweep ran on
every abort and made `recordPendingAbort` O(n) per call — O(n^2) under
the hard-cap test's 10,010-entry burst, which exceeded the 60s test
timeout (~100M iterations).
Cleanup is now amortised O(1) per call via three complementary mechanisms:
1. Per-id `setTimeout` (already present) drops each entry after TTL.
2. `consumePendingAbort` re-checks the timestamp at consume time
(belt-and-braces for the TTL/timer-firing window).
3. The hard-cap overflow branch (already present) evicts the oldest
half in one pass when size exceeds PENDING_ABORT_HARD_CAP — the
memory safety-net for pathological bursts that outrun the timers.
Behaviour is unchanged: the TTL-expiry test relies on the per-id
setTimeout (real-timer advance fires it); the hard-cap test relies on
the overflow eviction; the eviction-correctness test relies on a fresh
marker not being evicted (now guaranteed since nothing else touches the
map during a synchronous burst).
The hard-cap regression test was inserting 10_010 entries through recordPendingAbort under vi.useFakeTimers(), which empirically takes >60s on CI runners due to vitest's fake-timer heap costs (each setTimeout registration is O(log n), and with 10k+ entries queued the constant factors stack up). Make the hard cap injectable via a constructor option (default 10_000) so the test can use cap=100 and insert 110 entries — same assertions, ~100x less work. The eviction-correctness and TTL-expiry tests are unchanged since they only use 1500 / 1 entries respectively.
3251e0c to
94d2497
Compare
Coverage Report
File CoverageNo changed files found. |
There was a problem hiding this comment.
Pull request overview
This PR updates worker abort-race handling to use TTL-based pending abort retention instead of FIFO eviction, adds regression coverage for pending abort eviction behavior, and clarifies runWithIterable abort propagation documentation.
Changes:
- Adds
WorkerServerBaseOptionsand replacespendingAbortsSetwith timestamped TTL/cap-basedMaphandling. - Adds tests for abort flood survival, TTL expiry, and hard-cap eviction behavior.
- Updates
runWithIterableJSDoc to describe parent-to-child-only abort propagation.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
packages/util/src/worker/WorkerServerBase.ts |
Implements timestamped pending-abort tracking with TTL timers and hard-cap eviction. |
packages/test/src/test/util/WorkerServerBase.race.test.ts |
Adds coverage for the new pending-abort eviction policy. |
packages/ai/src/task/base/runWithIterable.ts |
Clarifies abort propagation documentation and inline comments. |
Comments suppressed due to low confidence (1)
packages/util/src/worker/WorkerServerBase.ts:356
- The hard cap only bounds
pendingAborts; every abort still adds a timeout handle topendingAbortTimers, and evicting entries at the cap does not clear those handles. A burst above the cap (or repeated aborts for the same id) can therefore create far more than 10k live timers for up to 30s, so the intended memory safety-net does not actually bound timer memory/work.
const timer = setTimeout(() => {
this.pendingAborts.delete(id);
this.pendingAbortTimers.delete(timer);
}, WorkerServerBase.PENDING_ABORT_TTL_MS);
this.pendingAbortTimers.add(timer);
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // to drop this id after its TTL elapses. Stored so dispose-style hooks | ||
| // can clear pending timers; otherwise they GC naturally. | ||
| const timer = setTimeout(() => { | ||
| this.pendingAborts.delete(id); |
| * to the provider, releasing the connection promptly rather than letting | ||
| * the run keep streaming events into a closed queue. | ||
| * Abort bond is strictly one-way: parent → child. | ||
| * • context.signal.abort() → localAbort aborts → strategy sees it. |
| * hard-coded values. Subclasses (platform-specific `WorkerServer` classes) | ||
| * accept the same shape and forward it via `super(options)`. |
Address Copilot review comments on #500: - recordPendingAbort: re-recording an abort for an id that was already pending used to leave the previous setTimeout active. That old timer would fire at its original +TTL and delete the renewed entry early, even though the most recent record had its own full TTL. Switch pendingAbortTimers from Set<timer> to Map<id, timer>, clearTimeout the previous timer on re-record, and remove timer entries from the map on consume / TTL-fire / hard-cap eviction. This also bounds the timer-handle memory (previously the hard cap only bounded the data map, leaving timers to accumulate until TTL). - runWithIterable JSDoc: the example demonstrated an AbortSignal method that doesn't exist (signals can't be aborted directly; only their owning AbortController can). Rewrite the example to use a controller. - WorkerServerBase JSDoc: the pendingAbortHardCap option is only reachable via `new WorkerServerBase(...)` directly. Platform subclasses use the default. Clarify in-line so consumers don't think `new WorkerServer({ pendingAbortHardCap: ... })` works.
|
@copilot resolve the merge conflicts in this pull request |
Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
Agent-Logs-Url: https://github.com/workglow-dev/libs/sessions/b1cbba42-d4c5-496f-b2d4-05d7dae090a4 Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>
Resolved the merge conflicts by merging |
Why
WorkerServerBase.pendingAbortspreviously used FIFO eviction at size 1000 (dropping the oldest 500 in bulk once the cap was crossed). A misbehaving client spamming aborts for ids that never landed could silently evict a legitimate pending abort before its matchingcallarrived — the run-fn would then execute un-aborted, defeating the whole abort-before-call race fix.Separately,
runWithIterable's top JSDoc misleadingly suggested a two-way abort bond ("leaves the parent alone"). The implementation is actually parent → child only, and the wording invited future contributors to reintroduce reverse propagation by accident.What
TTL-based pendingAborts (
packages/util/src/worker/WorkerServerBase.ts):pendingAbortsis nowMap<string, number>keyed on insertion timestamp instead ofSet<string>.PENDING_ABORT_TTL_MS. Each insertion schedules a one-shotsetTimeoutcleanup (mirrors the existingscheduleCompletedRequestCleanuppattern); timer handles are tracked inpendingAbortTimersso dispose-style hooks can clear them.recordPendingAbort— drops any entries older than the TTL.PENDING_ABORT_HARD_CAP) as a memory safety-net; when exceeded, evict the half with the lowest timestamps (oldest-by-timestamp, not insertion order).consumePendingAbortrechecks the TTL on consumption: present + fresh → abort + delete + returntrue; present + expired → delete + returnfalse; absent → returnfalse.Comment fix (
packages/ai/src/task/base/runWithIterable.ts):@exampleblocks showing parent-cancel vs consumer-break semantics.// One-way bond: parent -> child only.marker above theaddEventListener("abort", ...)line.Tests (
packages/test/src/test/util/WorkerServerBase.race.test.ts):describe("WorkerServerBase pendingAborts eviction policy", ...)block with three cases undervi.useFakeTimers():callfora-500→ signal observed aborted (would have failed under FIFO).vi.advanceTimersByTime(31_000)+ matchingcall→ signal observed un-aborted.HARD_CAP + 10aborts with 1ms steps →pendingAborts.size <= HARD_CAP, newest id survives (consume succeeds), oldest id evicted (consume fails). Also asserts the behavioural path via the same run-fn registration shape as the existing tests.Verification
bun test packages/test/src/test/util/WorkerServerBase.race.test.ts— runs the three new tests plus the three pre-existing abort-before-call race tests (regression coverage).Risks
consumePendingAbortnow rechecksDate.now() - ts > TTLon every call (one extra subtraction). Negligible.setTimeoutadds one timer per abort. With the 10k hard cap that's at most 10k pending timers — well within Node/Bun limits, and they GC as they fire.Rollback
Two independent commits, each rollback-able on its own:
fix(util/worker): TTL-based pendingAborts eviction— revert restores the FIFO Set policy and removes the new tests.docs(ai/task): clarify runWithIterable abort bond is one-way— comment-only; revert restores the prior JSDoc verbatim.Note: this PR is branched off
mainasclaude/loving-mendel-Hcznj-abortrather than the default session branchclaude/loving-mendel-Hcznjbecause a separate PR re-enabling CI is being authored on that branch in parallel. The two changes have independent review concerns and should be reviewable in isolation.Generated by Claude Code