fix(db): wrap SET LOCAL hnsw.ef_search in $transaction (Phase 0e)#99
Merged
Merged
Conversation
Same root cause as Phase 0a (PR #97): a SET LOCAL Postgres session variable and the subsequent query were issued as separate Prisma calls. With pool connections, the SET LOCAL can land on one connection and the SELECT on another — the ef_search tuning silently reverts to the server default before the vector search runs. Phase 0a fixed this for the org-context session variable in withOrgContext. Phase 0e applies the same $transaction wrap to all three SET LOCAL hnsw.ef_search call sites surfaced during STEP 1 audit: 1. src/lib/knowledge/search.ts:201 (searchKnowledgeBase) - Hot path: 6+ production callers (KB chat/eval API routes, kb-search-handler runtime, MCP tool exposure, rag-inject, codebase-rag SDLC). - Dynamic ef_search (40 | 60 | 100) based on query word count. - Vector search on KBChunk. 2. src/lib/memory/hot-cold-tier.ts:113 (getColdMemories) - No production callers in src/ today, but exported and reachable dynamically. Patched for consistency with the other two sites so the pattern is uniform across the codebase. - ef_search = 40 (hardcoded). - Vector search on AgentMemory. 3. src/lib/runtime/handlers/memory-read-handler.ts:168 (memoryReadHandler vector search path) - Runtime handler for the memory_read node type. Triggered for every agent run that includes a memory_read node. - ef_search = 40 (hardcoded). - Vector search on AgentMemory. PLAN-V2 §4.5 named only search.ts; the audit instruction in that section ('Apply same pattern to other SET LOCAL callers if discovered during STEP 1 audit') is honored here with all three sites in one PR. FORENSIC HAL-9 covers the same finding. Tests: - 63 existing tests still pass (search.test.ts utility tests: 30; hot-cold-tier.test.ts: 16; memory-read-handler.test.ts: 17). - 6 new regression-guard tests added in two new files: src/lib/knowledge/__tests__/search-vector-shape.test.ts (3) src/lib/runtime/handlers/__tests__/memory-read-handler-vector-shape.test.ts (3) - hot-cold-tier.test.ts mock extended with $transaction delegating to a tx mock; existing assertions on $executeRawUnsafe still hold via tx-aliased fn references. - Each shape test asserts: (a) $transaction is invoked, (b) SET LOCAL runs on the tx, not the outer prisma, (c) callback receives tx, (d) outer prisma's direct exec/query paths are NOT used. - Mocked tests verify SHAPE of the fix. They cannot verify pool-survival semantics; that needs a real Postgres integration harness (TODO comments left for Phase 1+). Risk: - search.ts is on the KB hot path. $transaction holds a pool connection for the duration of the callback (one SET LOCAL + one SELECT, both microsecond-millisecond). Should not materially affect pool exhaustion under load; verify in staging if concerned. - hot-cold-tier.ts getColdMemories has no production callers, so functional risk is zero. - memory-read-handler.ts vector search is per-agent-run when memory_read node fires. Similar transaction profile to search.ts. Out of scope (intentional): - Migrating $executeRawUnsafe -> $executeRaw on hot-cold-tier.ts and memory-read-handler.ts. Both keep the existing Unsafe variant; SET LOCAL takes no parameters, no SQL injection vector. Refactoring to safer APIs is a separate workstream. Verification: - vitest run (search.test.ts + search-vector-shape.test.ts + hot-cold-tier.test.ts + memory-read-handler.test.ts + memory-read-handler-vector-shape.test.ts): 69 passed - tsc --noEmit -p tsconfig.json: exit 0 Refs: - skill-rls-rollout-PLAN-V2.md §4.5 (Phase 0e bundled bug fix) - skill-rls-rollout-FORENSIC-REPORT.md HAL-9 - PR #97 (Phase 0a — same fix pattern for withOrgContext) - PR #98 (docs/rls-tech-debt.md — tracking)
webdevcom01-cell
added a commit
that referenced
this pull request
May 20, 2026
The E2E job in CI has been failing on every push to main since at least PR #99 (Phase 0e merge) with the same error: Error: Timed out waiting 120000ms from config.webServer. Root cause: Playwright's `webServer.timeout` was set to 120_000 ms (2 minutes), but the CI command is `pnpm build && pnpm start`, which has to run a full Next.js production build on a fresh CI runner. The build alone takes 5-15 minutes (4m 11s in the dedicated CI Build job, 15.3 min in the Docker build job). The 120-second window expires before the server can listen on port 3000, and Playwright aborts with the timeout error above. This was masked until now because: 1. The E2E job only runs on push to main (and labeled PRs), so PR-level CI is green even when main-level CI is red. 2. Railway's "Wait for CI" toggle was off, so deploys went through despite the red CI status on the main branch. After enabling "Wait for CI" in preparation for Phase 0b migrations, this latent issue became a hard block: Railway now refuses to deploy any new main commit until CI is green. Fix: bump `webServer.timeout` to 1_500_000 ms (25 minutes) when running under CI. The original 120s remains in place locally so fast feedback isn't lost when running `pnpm test:e2e` against `pnpm dev`. Comment in the config explains the asymmetry. Followups out of scope for this PR: - Caching the `.next` build output between CI Build and E2E jobs so the build doesn't have to run twice. This is the proper long-term fix; tracked separately. - Investigating why Next.js compile is 15 minutes inside Docker builds vs 4 minutes in the dedicated Build job (see PR #98 docs/rls-tech-debt.md item #1). Verification: - tsc --noEmit -p tsconfig.json: exit 0 - Only one file changed (playwright.config.ts). - No app code, no test files, no migrations touched. Risk: low. Worst case, the new 25-minute cap is still too short and we widen it further. Best case, CI goes green and we unblock Phase 0b deploy + future PRs.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Same root cause as Phase 0a (PR #97): a SET LOCAL Postgres session variable and the subsequent query were issued as separate Prisma calls. With pool connections, the SET LOCAL can land on one connection and the SELECT on another — the ef_search tuning silently reverts to the server default before the vector search runs.
Phase 0a fixed this for the org-context session variable in withOrgContext. Phase 0e applies the same $transaction wrap to all three SET LOCAL hnsw.ef_search call sites surfaced during STEP 1 audit:
src/lib/knowledge/search.ts:201 (searchKnowledgeBase)
src/lib/memory/hot-cold-tier.ts:113 (getColdMemories)
src/lib/runtime/handlers/memory-read-handler.ts:168 (memoryReadHandler vector search path)
PLAN-V2 §4.5 named only search.ts; the audit instruction in that section ('Apply same pattern to other SET LOCAL callers if discovered during STEP 1 audit') is honored here with all three sites in one PR.
FORENSIC HAL-9 covers the same finding.
Tests:
Risk:
Out of scope (intentional):
Verification:
Refs:
Description
Briefly describe the changes in this PR and the motivation behind them.
Type of change
Checklist
any, Tailwind only)pnpm test)pnpm lint)pnpm typecheck)console.logstatements left in code