fix(memory-core): yield to event loop during seedEmbeddingCache (R2.A.2) #2
Merged
lesaai merged 2 commits into cc-mini/chat-completions-upstream-20260423 on Apr 24, 2026
Conversation
R2.A.2. The `.iterate()`-based seed (R2.A v1, a315280) prevents the V8 heap OOM, but the iterate loop still runs synchronously for ~117s on a 435K-row `embedding_cache`. wip-healthcheck SIGKILLs the gateway after its 30s probe timeout fails. No FATAL ERROR, no Abort trap.

Patch: convert `seedEmbeddingCache` to async and yield to the event loop every 1000 rows via `setImmediate`. Keeps memory bounded, preserves the streaming behavior, and restores `/health` responsiveness during the seed. The only caller is inside an existing async arrow wrapping `runMemoryAtomicReindex`'s build callback, so adding `await` is a one-line change.

Validation:
- `pnpm tsgo:prod`: green
- `pnpm test extensions/memory-core`: 512 passed, 3 skipped, 0 failed

Scope: does not soften wip-healthcheck (separate guardrail per Parker direction). Does not address the secondary listChunks path (R2.A.3).
Revert the top-of-file lint-suppression comments accidentally landed in the previous commit (f9e9970). They were added to work around an oxlint resolver false positive that turned out to be transient state, not a real lint failure. Production code shouldn't carry misleading explanations for problems that didn't actually persist.

Net diff of this branch vs base is now just the `seedEmbeddingCache` yield patch: function → async, `setImmediate` every 1000 rows, caller `await`. No lint comments, no file-level disables.
Read-only yield canary passed against the production ~/.openclaw/memory/main.sqlite embedding cache. Results:

```json
{
  "yieldEvery": 1000,
  "rows": 435136,
  "embeddingBytes": 8680106189,
  "durationMs": 26383,
  "timerTicks": 25,
  "maxTimerDelayMs": 147,
  "rssMb": 144,
  "maxRssMb": 150,
  "heapUsedMb": 18,
  "maxHeapUsedMb": 77
}
```

Interpretation: the 1000-row setImmediate cadence keeps the event loop responsive while scanning the full production embedding cache. This directly addresses the post-R2.A failure mode where `.iterate()` avoided the V8 heap OOM but starved `/health` long enough for the watchdog to restart the gateway. The canary was read-only and did not touch the live gateway.
Merged e3f6864 into cc-mini/chat-completions-upstream-20260423 — 93 of 97 checks passed
Canary candidate. Do not npm-link to live Lēsa until canary passes.
Summary
R2.A.2. The `.iterate()` patch from PR #1 (a315280) prevents the V8 heap OOM, but the iterate loop still runs synchronously for ~117s on a 435K-row `embedding_cache`. The gateway can't service HTTP `/health` probes during that window; wip-healthcheck's 30s probe timeout SIGKILLs the gateway after a single failure.

Live repro on 2026-04-24 ~15:31 PDT, post-PR #1 deploy: `HTTP probe failed: timeout (30000ms)` → `Restarting gateway (attempt 1/3)` → SIGKILL → LaunchAgent respawn. No FATAL ERROR, no Abort trap, no StatementSync::All stack. R2.A v1 worked at preventing V8 OOM; the new failure mode is event-loop blocking.

Fix
`seedEmbeddingCache` becomes `async` and yields to the event loop every 1000 rows via `await new Promise(resolve => setImmediate(resolve))`. Its only caller (the `async` arrow wrapping `runMemoryAtomicReindex`'s `build` callback) gains an `await` — one-line propagation.

The synchronous `.iterate()` / `insert.run()` pair stays the same; we just release the event loop for one tick every batch so HTTP `/health` (and other I/O work) can run between batches. Memory stays bounded, streaming behavior is preserved, and `/health` stays responsive during the seed. `YIELD_EVERY = 1000` rows ≈ tens of milliseconds of sync work per batch — well under the 30s probe timeout even with no further patience from the watchdog.
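The shape of the patch, sketched with stand-in types (the real function iterates a better-sqlite3-style prepared statement and runs a prepared insert; plain iterables and callbacks stand in for those here):

```typescript
const YIELD_EVERY = 1000;

// Stream rows synchronously, but release the event loop for one tick every
// YIELD_EVERY rows so pending I/O (such as /health probes) can run between
// batches. Row and insert types are illustrative stand-ins.
async function seedEmbeddingCache(
  rows: Iterable<{ key: string; embedding: Uint8Array }>,
  insertRow: (row: { key: string; embedding: Uint8Array }) => void,
): Promise<number> {
  let count = 0;
  for (const row of rows) {
    insertRow(row); // the synchronous insert.run() equivalent
    if (++count % YIELD_EVERY === 0) {
      await new Promise<void>((resolve) => setImmediate(resolve));
    }
  }
  return count;
}
```

`setImmediate` (rather than `setTimeout(…, 0)`) is the cheaper yield in Node: it defers only to the check phase of the current loop iteration, so per-batch overhead stays small while timers and I/O still get a turn.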
Validation
- `pnpm tsgo:prod`: green (core + extensions graphs)
- `pnpm test extensions/memory-core`: 512 passed, 3 skipped, 0 failed

Out of scope

- Softening wip-healthcheck: belongs in `wip-healthcheck-private`, not here. Filed separately.
- The `listChunks` `.all()` path in `manager-search.ts:246-252` (R2.A.3). Bigger surgery — the caller needs the full candidate set for cosine-similarity ranking, so converting to streaming requires a bounded top-K heap in the caller. Held until R2.A.2 canaries are clean.

Canary plan
- `/health` stays responsive throughout; gateway not SIGKILL'd by the watchdog
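For reference, the bounded top-K shape the out-of-scope R2.A.3 note points at could look like the following. This is a hypothetical sketch — `topK` and its sorted-array "heap" are illustrative, not `manager-search.ts` code, and a real binary heap would avoid the re-sort per insert for large K:

```typescript
// Stream candidates instead of materializing .all(), keeping only the K
// best-scoring items. A sorted array stands in for a heap (fine for small K).
function topK<T>(
  items: Iterable<T>,
  k: number,
  score: (item: T) => number,
): Array<{ item: T; score: number }> {
  const best: Array<{ item: T; score: number }> = [];
  for (const item of items) {
    const s = score(item);
    if (best.length < k) {
      best.push({ item, score: s });
      best.sort((a, b) => b.score - a.score); // descending by score
    } else if (s > best[best.length - 1].score) {
      best[best.length - 1] = { item, score: s }; // evict current worst
      best.sort((a, b) => b.score - a.score);
    }
  }
  return best;
}
```

With cosine similarity as the `score` function, this bounds memory to O(K) per query instead of the full candidate set — the reason the conversion needs the caller, not just the storage layer, to change.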