Skip to content

fix(db): wrap withOrgContext in $transaction so session var survives …#97

Merged
webdevcom01-cell merged 1 commit into
mainfrom
phase-0a/with-org-context-fix
May 18, 2026
Merged

fix(db): wrap withOrgContext in $transaction so session var survives …#97
webdevcom01-cell merged 1 commit into
mainfrom
phase-0a/with-org-context-fix

Conversation

@webdevcom01-cell
Copy link
Copy Markdown
Owner

…pool (Phase 0a)

Previously withOrgContext called client.$executeRawUnsafe to SET app.current_org_id, then ran fn(client) as a separate call. With Prisma connection pooling and Railway's 2 replicas, these two operations can land on different connections — the session variable evaporates between the set_config and the user's queries, and any RLS policy that reads app.current_org_id sees NULL.

Today this bug is masked because the application connects as a superuser (RLS bypassed at role level). Once Phase 0b introduces the non-bypass app_user role, every tenant-scoped query under this helper would silently return [].

The fix wraps the helper in client.$transaction(...), which pins a single connection for the duration of the callback. set_config is invoked on the tx client (not the outer pooled client) using $executeRaw with a tagged template (safer than the previous $executeRawUnsafe). Isolation, maxWait, and timeout are set per the rollout plan.

The callback signature changes from
fn: (client: PrismaClient) => Promise
to
fn: (tx: Prisma.TransactionClient) => Promise
because callers MUST use the tx for their queries — using the outer client would bypass the org context. withOrgContext is not yet called from any production route, so this signature change does not break any consumer in this repo.

Tests:

  • 5 existing unit tests rewritten against the new $transaction-based shape. They verify: options bag is correct (ReadCommitted, maxWait=5000, timeout=30000), set_config runs on the tx (not the outer client), the callback receives the tx instance, the callback's return value is propagated, errors propagate.
  • 1 new regression-guard test ensures the callback receives the tx, not the outer client — guards against accidentally reverting to the pool-leak shape.
  • 1 it.skip() integration placeholder with TODO for Phase 1+ real-Postgres harness. The mocked unit tests verify SHAPE of the fix but cannot verify the underlying pool-survival semantics — that requires a real connection pool.
  • 3 registerRLSMiddleware tests left untouched (legacy $use path, now marked @deprecated in JSDoc).

Refs:

Verification:

  • vitest src/lib/db/tests/rls-middleware.test.ts: 8 passed, 1 skipped
  • tsc --noEmit: exit 0

Description

Briefly describe the changes in this PR and the motivation behind them.

Type of change

  • New feature
  • Bug fix
  • Documentation update
  • Refactor (no behavior change)
  • Test update
  • Chore (build, CI, dependencies)

Checklist

  • I have read the Contributing Guide
  • My code follows the project code standards (TypeScript strict, no any, Tailwind only)
  • I have added tests that cover my changes
  • All tests pass (pnpm test)
  • Linting passes (pnpm lint)
  • Type checking passes (pnpm typecheck)
  • I have updated documentation if needed
  • No console.log statements left in code
  • No hardcoded secrets or internal URLs

…pool (Phase 0a)

Previously withOrgContext called client.$executeRawUnsafe to SET
app.current_org_id, then ran fn(client) as a separate call. With
Prisma connection pooling and Railway's 2 replicas, these two
operations can land on different connections — the session
variable evaporates between the set_config and the user's
queries, and any RLS policy that reads app.current_org_id sees
NULL.

Today this bug is masked because the application connects as a
superuser (RLS bypassed at role level). Once Phase 0b introduces
the non-bypass app_user role, every tenant-scoped query under
this helper would silently return [].

The fix wraps the helper in client.$transaction(...), which pins
a single connection for the duration of the callback. set_config
is invoked on the tx client (not the outer pooled client) using
$executeRaw with a tagged template (safer than the previous
$executeRawUnsafe). Isolation, maxWait, and timeout are set per
the rollout plan.

The callback signature changes from
  fn: (client: PrismaClient) => Promise<T>
to
  fn: (tx: Prisma.TransactionClient) => Promise<T>
because callers MUST use the tx for their queries — using the
outer client would bypass the org context. withOrgContext is
not yet called from any production route, so this signature
change does not break any consumer in this repo.

Tests:
  - 5 existing unit tests rewritten against the new
    $transaction-based shape. They verify: options bag is
    correct (ReadCommitted, maxWait=5000, timeout=30000),
    set_config runs on the tx (not the outer client), the
    callback receives the tx instance, the callback's return
    value is propagated, errors propagate.
  - 1 new regression-guard test ensures the callback receives
    the tx, not the outer client — guards against accidentally
    reverting to the pool-leak shape.
  - 1 it.skip() integration placeholder with TODO for Phase 1+
    real-Postgres harness. The mocked unit tests verify SHAPE
    of the fix but cannot verify the underlying pool-survival
    semantics — that requires a real connection pool.
  - 3 registerRLSMiddleware tests left untouched (legacy
    $use path, now marked @deprecated in JSDoc).

Refs:
  - skill-rls-rollout-PLAN-V2.md §4.1 (Phase 0a, CRITICAL #1)
  - skill-rls-rollout-FORENSIC-REPORT.md §1 bullet 1, §3 #1, HAL-1

Verification:
  - vitest src/lib/db/__tests__/rls-middleware.test.ts: 8 passed, 1 skipped
  - tsc --noEmit: exit 0
@webdevcom01-cell
Copy link
Copy Markdown
Owner Author

Self-review

Solo repo — no second reviewer available. Documenting due diligence
before admin merge.

Diff verification

  • Re-read src/lib/db/rls-middleware.ts diff end-to-end
  • Callback signature change from PrismaClient to
    Prisma.TransactionClient matches PLAN-V2 §4.1
  • $executeRaw tagged-template (not $executeRawUnsafe)
  • Transaction options: ReadCommitted, maxWait: 5000,
    timeout: 30000

Caller verification

  • grep -rn "withOrgContext(" src --include="*.ts" returns only
    the definition, the JSDoc reference in src/lib/prisma.ts, and the
    test file. Zero production callers — signature change cannot break
    any consumer.

Test verification

  • Baseline (pre-patch) local: 8 passed
  • Post-patch local: 8 passed | 1 skipped (intentional integration
    TODO for Phase 1+)
  • Local tsc --noEmit: exit 0

CI verification

  • CI / Build: green (4m)
  • CI / Lint: green (35s)
  • CI / Typecheck: green (48s)
  • CI / Unit Tests: green (1m)
  • CodeQL: no new alerts in code changed by this PR
  • CI / E2E: intentionally skipped — withOrgContext has no
    production callers, no surface to exercise. Phase 0b+ PRs that
    wire the helper into routes will carry the e2e label.
  • [n/a] Docker Build & Push: cancelled by 30-min CI timeout during
    post-build Collecting build traces step. Not a Required check
    on this repo, not specific to this PR (Next.js compile alone
    takes 15.3 min on main today). Tracked separately for CI infra
    workstream.

Scope verification

  • 2 files changed (production helper + its unit test)
  • No schema migrations
  • No env var changes
  • No dependency changes

Proceeding with admin merge (bypass rules) under solo-repo policy.

@webdevcom01-cell webdevcom01-cell merged commit 35140d3 into main May 18, 2026
7 of 8 checks passed
@webdevcom01-cell webdevcom01-cell deleted the phase-0a/with-org-context-fix branch May 18, 2026 10:16
webdevcom01-cell added a commit that referenced this pull request May 18, 2026
Same root cause as Phase 0a (PR #97): a SET LOCAL Postgres session
variable and the subsequent query were issued as separate Prisma
calls. With pool connections, the SET LOCAL can land on one
connection and the SELECT on another — the ef_search tuning
silently reverts to the server default before the vector search
runs.

Phase 0a fixed this for the org-context session variable in
withOrgContext. Phase 0e applies the same $transaction wrap to
all three SET LOCAL hnsw.ef_search call sites surfaced during
STEP 1 audit:

1. src/lib/knowledge/search.ts:201 (searchKnowledgeBase)
   - Hot path: 6+ production callers (KB chat/eval API routes,
     kb-search-handler runtime, MCP tool exposure, rag-inject,
     codebase-rag SDLC).
   - Dynamic ef_search (40 | 60 | 100) based on query word count.
   - Vector search on KBChunk.

2. src/lib/memory/hot-cold-tier.ts:113 (getColdMemories)
   - No production callers in src/ today, but exported and reachable
     dynamically. Patched for consistency with the other two sites
     so the pattern is uniform across the codebase.
   - ef_search = 40 (hardcoded).
   - Vector search on AgentMemory.

3. src/lib/runtime/handlers/memory-read-handler.ts:168
   (memoryReadHandler vector search path)
   - Runtime handler for the memory_read node type. Triggered for
     every agent run that includes a memory_read node.
   - ef_search = 40 (hardcoded).
   - Vector search on AgentMemory.

PLAN-V2 §4.5 named only search.ts; the audit instruction in that
section ('Apply same pattern to other SET LOCAL callers if
discovered during STEP 1 audit') is honored here with all three
sites in one PR.

FORENSIC HAL-9 covers the same finding.

Tests:
  - 63 existing tests still pass (search.test.ts utility tests: 30;
    hot-cold-tier.test.ts: 16; memory-read-handler.test.ts: 17).
  - 6 new regression-guard tests added in two new files:
    src/lib/knowledge/__tests__/search-vector-shape.test.ts (3)
    src/lib/runtime/handlers/__tests__/memory-read-handler-vector-shape.test.ts (3)
  - hot-cold-tier.test.ts mock extended with $transaction
    delegating to a tx mock; existing assertions on
    $executeRawUnsafe still hold via tx-aliased fn references.
  - Each shape test asserts: (a) $transaction is invoked, (b)
    SET LOCAL runs on the tx, not the outer prisma, (c) callback
    receives tx, (d) outer prisma's direct exec/query paths are
    NOT used.
  - Mocked tests verify SHAPE of the fix. They cannot verify
    pool-survival semantics; that needs a real Postgres
    integration harness (TODO comments left for Phase 1+).

Risk:
  - search.ts is on the KB hot path. $transaction holds a pool
    connection for the duration of the callback (one SET LOCAL +
    one SELECT, both microsecond-millisecond). Should not
    materially affect pool exhaustion under load; verify in
    staging if concerned.
  - hot-cold-tier.ts getColdMemories has no production callers,
    so functional risk is zero.
  - memory-read-handler.ts vector search is per-agent-run when
    memory_read node fires. Similar transaction profile to
    search.ts.

Out of scope (intentional):
  - Migrating $executeRawUnsafe -> $executeRaw on hot-cold-tier.ts
    and memory-read-handler.ts. Both keep the existing Unsafe
    variant; SET LOCAL takes no parameters, no SQL injection
    vector. Refactoring to safer APIs is a separate workstream.

Verification:
  - vitest run (search.test.ts + search-vector-shape.test.ts +
    hot-cold-tier.test.ts + memory-read-handler.test.ts +
    memory-read-handler-vector-shape.test.ts): 69 passed
  - tsc --noEmit -p tsconfig.json: exit 0

Refs:
  - skill-rls-rollout-PLAN-V2.md §4.5 (Phase 0e bundled bug fix)
  - skill-rls-rollout-FORENSIC-REPORT.md HAL-9
  - PR #97 (Phase 0a — same fix pattern for withOrgContext)
  - PR #98 (docs/rls-tech-debt.md — tracking)
webdevcom01-cell added a commit that referenced this pull request May 18, 2026
Running list of issues surfaced during Phase 0a that were
intentionally not addressed in PR #97. Each item is scoped to a
specific RLS phase it must be resolved before, so the rollout
itself doesn't accumulate hidden surprises.

Open items:

1. CI Docker Build & Push timeout (Medium, resolve before Phase 0b)
   - 30-min cap was hit by 29s on PR #97 during Collecting build
     traces. Compile alone takes 15.3 min.

2. MCP pool shutdown noise during SSG (Low, resolve before
   Phase 0e for cleaner build logs).

3. Railway 'Wait for CI' toggle is OFF (HIGH, resolve before
   Phase 0b — hard requirement).
   - Phase 0b adds migrations and would deploy app code against
     a not-yet-migrated database without this gate.

Plus a 'Future-watch' section: preview deploys, deployment
status reporting back to GitHub, auto-rollback on healthcheck
failure.

This is a documentation-only change. No code paths affected.
typecheck: exit 0 (no TS touched).

Refs:
  - skill-rls-rollout-PLAN-V2.md
  - skill-rls-rollout-FORENSIC-REPORT.md HAL-7, §3 #6
  - PR #97 (Phase 0a)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant