
perf(db): eliminate ResultRow wrapping overhead for PostgreSQL #1325

Merged
nicoloboschi merged 2 commits into main from fix/eliminate-resultrow-wrapping on Apr 29, 2026

Conversation

@nicoloboschi
Collaborator

Summary

The Oracle abstraction PR (#1307) introduced a ResultRow wrapper class around every asyncpg.Record returned by fetch()/fetchrow(). That added Python-level indirection on every column access (row["id"], row["text"], etc.): ~570K __getitem__ calls per 20-recall benchmark, each paying for object.__getattribute__ plus an isinstance check.

Fix: Make ResultRow a Protocol (interface) instead of a concrete class. asyncpg.Record already satisfies dict-like access natively in C. The PG backend now returns raw Records directly. Oracle uses DictResultRow (the concrete wrapper) since its native row types need it.
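
A minimal sketch of the new shape, assuming the Protocol only needs the dict-like trio named below (the real definitions in result.py may declare more):

```python
from typing import Any, Iterable, Protocol


class ResultRow(Protocol):
    """Structural interface; asyncpg.Record satisfies it with no wrapper."""

    def __getitem__(self, key: str) -> Any: ...
    def keys(self) -> Iterable[str]: ...
    def get(self, key: str, default: Any = None) -> Any: ...


class DictResultRow:
    """Concrete wrapper for backends (Oracle) whose native rows aren't dict-like."""

    def __init__(self, data: dict[str, Any]) -> None:
        self._data = data

    def __getitem__(self, key: str) -> Any:
        return self._data[key]

    def keys(self) -> Iterable[str]:
        return self._data.keys()

    def get(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)
```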

Benchmark (medium, 10K items, concurrency=4, same pg0 data)

| Variant | Mean | vs baseline |
| --- | --- | --- |
| v0.5.6 baseline | 0.648s | (baseline) |
| With ResultRow wrapping (before) | 0.805s | +24% |
| Without wrapping (this PR) | 0.680s | +5% (noise) |
| With observation_sources junction table | 0.680s | +5% (no impact) |

Identified via cProfile: result.py:29(__getitem__) was the #3 CPU consumer at 0.156s / 569K calls.
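
For context, a hotspot like this surfaces with a few lines of cProfile; run_recall_benchmark here is a hypothetical stand-in for the actual benchmark driver:

```python
import cProfile
import pstats

# run_recall_benchmark is hypothetical -- substitute the real benchmark entry point.
cProfile.run("run_recall_benchmark(recalls=20)", "recall.prof")

stats = pstats.Stats("recall.prof")
# Sorting by total time surfaced result.py:29(__getitem__) near the top.
stats.sort_stats("tottime").print_stats(10)
```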

Test plan

  • test_db_abstraction.py — all pass (updated to use DictResultRow)
  • test_link_expansion_retrieval.py — 2/2 pass
  • Lint passes
  • Recall benchmark matches v0.5.6 baseline on same data

Make ResultRow a Protocol instead of a concrete wrapper class. asyncpg.Record
already satisfies the dict-like access pattern (row["key"], .keys(), .get())
natively in C — wrapping it in a Python class added ~570K __getitem__ calls
per 20-recall benchmark, causing a measurable ~24% regression at 10K bank size.

Changes:
- ResultRow is now a Protocol (interface) in result.py
- DictResultRow is the concrete wrapper, used only by Oracle backend
- PostgresConnection.fetch/fetchrow return raw asyncpg.Record directly
- Oracle backend imports DictResultRow as ResultRow (no behavior change)
- Tests updated to use DictResultRow

Benchmark (medium, 10K items, concurrency=4, same pg0 data):
  v0.5.6 baseline:    0.648s mean
  With wrapping:       0.805s mean (+24%)
  Without wrapping:    0.680s mean (+5%, within noise)
  With junction table: 0.680s mean (observation_sources has zero impact)
nicoloboschi force-pushed the fix/eliminate-resultrow-wrapping branch 2 times, most recently from ba8dad0 to 891e104 on April 29, 2026 at 13:52
…kend-aware

Two performance fixes for the Oracle abstraction layer:

1. Make ResultRow a Protocol instead of a concrete wrapper class. asyncpg.Record
   satisfies dict-like access natively in C — wrapping added ~570K __getitem__
   calls per benchmark, causing a ~24% regression at 10K bank size.

2. Make observation source reads backend-dependent: PG uses native array ops
   (source_memory_ids column with &&, unnest), Oracle uses the observation_sources
   junction table. PG also skips junction table writes in the consolidator.
   At 33K scale, junction table reads doubled retrieval_graph latency (0.093s→0.186s).
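
A sketch of the split described in (2), using the table and column names above; the SQL bodies and bind-parameter styles are illustrative, not the shipped queries:

```python
async def fetch_observation_sources(conn, ops, memory_ids):
    """Read observation->memory links using the backend's preferred shape."""
    if ops.uses_observation_sources_table:  # Oracle: normalized junction table
        return await conn.fetch(
            "SELECT observation_id, memory_id FROM observation_sources "
            "WHERE memory_id IN (:ids)",
            memory_ids,
        )
    # PostgreSQL: native array ops on observations.source_memory_ids
    return await conn.fetch(
        "SELECT id AS observation_id, unnest(source_memory_ids) AS memory_id "
        "FROM observations WHERE source_memory_ids && $1::uuid[]",
        memory_ids,
    )
```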

Changes:
- ResultRow is now a Protocol; DictResultRow is the concrete wrapper (Oracle only)
- PostgresConnection.fetch/fetchrow return raw asyncpg.Record directly
- DataAccessOps.uses_observation_sources_table property (PG=False, Oracle=True)
- Consolidator guards junction table writes behind uses_observation_sources_table
- memory_engine.py and fact_storage.py branch reads by backend type

Benchmark (large, 33K items, concurrency=4, same pg0 data):
  v0.5.6 baseline:       0.853s mean
  Junction table reads:   1.027s mean (+20%)
  Array ops + no wrap:    1.014s mean (+19%, graph=0.091s matches baseline)
nicoloboschi force-pushed the fix/eliminate-resultrow-wrapping branch from 891e104 to 792aea6 on April 29, 2026 at 14:08
nicoloboschi merged commit d8ec2d7 into main on Apr 29, 2026
55 of 60 checks passed
nicoloboschi added a commit that referenced this pull request Apr 29, 2026
Line 312 of fact_storage.py references ``ops`` without ``handle_document_tracking``
declaring it as a parameter — straight NameError on every retain that walks
the upsert path. Bug landed on main in d8ec2d7 (#1325) when
``delete_stale_observations_for_memories`` started taking a backend-aware
``ops`` to choose between the PG array operator and the Oracle junction
table; the call site was added but the parameter wasn't threaded into the
enclosing function.

Fix: add ``ops=None`` to ``handle_document_tracking`` and pass ``pool.ops``
from each of the three call sites in orchestrator.py.

This is unrelated to the Alembic dialect-dispatcher refactor in this PR but
is what's blocking it — the NameError caused 17 retain tests to fail (and
left a pytest-xdist worker in a state that hung the whole job at 99%).
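
A sketch of the fix's shape, with the bookkeeping elided (handle_document_tracking and delete_stale_observations_for_memories are the real names from the commit; the body here is illustrative):

```python
async def handle_document_tracking(conn, memories, *, ops=None):
    # ... document upsert bookkeeping elided ...
    memory_ids = [m.id for m in memories]
    # ops now reaches the backend-aware delete instead of raising NameError.
    await delete_stale_observations_for_memories(conn, memory_ids, ops=ops)

# Each of the three orchestrator.py call sites forwards the pool's ops object:
#   await handle_document_tracking(conn, memories, ops=pool.ops)
```
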
cdbartholomew added a commit that referenced this pull request Apr 29, 2026
handle_document_tracking calls delete_stale_observations_for_memories
with ops=ops, but ops is not a parameter of handle_document_tracking
itself (introduced in #1325 as part of the backend-aware observation
read split). Every retain that hits the document-tracking path raises
NameError before any actual work happens.

Add ops as a kwarg-only parameter on handle_document_tracking and
forward pool.ops from each of the three call sites in
_streaming_retain_batch. Behaviorally a no-op for the PG path
(uses_observation_sources_table is False, so the existing PG branch
runs) and for the Oracle path (junction table branch already runs
when ops.uses_observation_sources_table is True).
youchi1 added a commit to youchi1/hindsight that referenced this pull request Apr 30, 2026
Regression from vectorize-io#1325 (perf(db): eliminate ResultRow wrapping). Line
fact_storage.py:312 was changed to pass ops=ops to
delete_stale_observations_for_memories, but ops was never added to
handle_document_tracking's signature, so every first-batch retain hit
NameError: name 'ops' is not defined.

Effect: every batch_retain task fails on the first-batch document
tracking path. The poller schedules a retry, the retry fails the same
way, and after the retry budget is exhausted the parent operation is
marked failed. No memories get stored. The gateway's
"hindsight: 1 retains (...) captured" log is misleading — that just
means dispatch succeeded; the daemon's actual write path was raising
all along.

Fix is to give handle_document_tracking an ops parameter (defaulting
to None to preserve the Oracle/junction-table fallback) and pass
pool.ops from both call sites in orchestrator._streaming_retain_batch.

Reproducible on any retain that triggers a first-batch document
upsert. Verified against the daemon's structured log on a live
deployment running main as of d8ec2d7.
nicoloboschi pushed a commit that referenced this pull request Apr 30, 2026
…1343)

* fix(async-ops): atomically commit batch_retain parent and child rows

submit_async_batch_retain inserts a parent row (status='pending',
task_payload=NULL — it's a status aggregator, not directly executable)
and then loops to insert one child row per sub-batch. The parent INSERT
and child INSERTs were not transactionally coupled: the parent's
INSERT ran in its own auto-committing connection, and each child went
through a separate _submit_async_operation call that acquired its own
connection.

Any failure between them (connection drop, asyncpg timeout, schema-
cache invalidation under concurrent load, or any other exception
raised during child setup) leaves a parent row with zero children.
The worker poller skips it forever because of the
"task_payload IS NOT NULL" filter, the status aggregator never fires
because there are no children to complete, and the row sits pending
indefinitely. It also pollutes queue-depth metrics that operators rely
on to size worker pools.

Fix: wrap parent INSERT and all child INSERTs in a single
async transaction so the create-batch operation is atomic — either
all rows become visible to workers or none are. Child INSERT SQL is
inlined for the duration of the transaction; _submit_async_operation
is left untouched so other callers are unaffected. submit_task() is
deferred to after the transaction commits because SyncTaskBackend
(used in tests) executes synchronously and would otherwise read the
not-yet-committed row.
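
A sketch of the resulting shape, assuming asyncpg-style connection and transaction semantics; the table and column names follow the text above, the rest is illustrative:

```python
async def submit_async_batch_retain(pool, sub_batches):
    async with pool.acquire() as conn:
        async with conn.transaction():  # parent + children commit or roll back together
            parent_id = await conn.fetchval(
                "INSERT INTO async_operations (status, task_payload) "
                "VALUES ('pending', NULL) RETURNING id"
            )
            child_ids = []
            for payload in sub_batches:  # child INSERTs inlined for the transaction
                child_ids.append(await conn.fetchval(
                    "INSERT INTO async_operations (status, parent_id, task_payload) "
                    "VALUES ('pending', $1, $2) RETURNING id",
                    parent_id, payload,
                ))
    # Notify only after commit: SyncTaskBackend (tests) executes synchronously
    # and must be able to read the committed rows.
    for child_id in child_ids:
        await submit_task(child_id)
    return parent_id
```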

Tests:
- New regression test
  test_submit_async_batch_retain_rolls_back_parent_on_child_failure
  monkeypatches BatchRetainChildMetadata to raise on the second
  sub-batch and asserts zero async_operations rows remain after the
  failure (parent must roll back together with children).
- Mirrors the existing
  test_submit_async_operation_leaves_claimable_row_when_submit_task_fails
  but at the parent-level (the child-level case was already fixed).

* test(async-retain-tags): rewrite for inlined child INSERT

submit_async_batch_retain now inserts children inline inside the
parent's transaction (rather than calling _submit_async_operation per
child) and notifies the task backend after commit. The pre-existing
test mocked _submit_async_operation and asserted on its call args;
that path no longer runs for children.

Replace those assertions with the new equivalent: count the INSERTs on
the connection, inspect the post-commit submit_task payload for
document_tags, and cross-check the JSON serialized into the child's
task_payload column. Same intent (document_tags propagates through to
the worker), aligned with the new code path.

* fix(retain): thread ops through handle_document_tracking

handle_document_tracking calls delete_stale_observations_for_memories
with ops=ops, but ops is not a parameter of handle_document_tracking
itself (introduced in #1325 as part of the backend-aware observation
read split). Every retain that hits the document-tracking path raises
NameError before any actual work happens.

Add ops as a kwarg-only parameter on handle_document_tracking and
forward pool.ops from each of the three call sites in
_streaming_retain_batch. Behaviorally a no-op for the PG path
(uses_observation_sources_table is False, so the existing PG branch
runs) and for the Oracle path (junction table branch already runs
when ops.uses_observation_sources_table is True).

* test(observation-invalidation): pass ops to handle_document_tracking

The test calls handle_document_tracking directly (rather than going
through the retain orchestrator) and didn't pass ops. With the param
defaulting to None, the inner delete_stale_observations_for_memories
call falls through to the Oracle junction-table read path and queries
a non-existent public.observation_sources relation under PG.

The orchestrator's three call sites already pass pool.ops; this test
just needs to mirror that. Pass memory._backend.ops to keep the test
backend-agnostic.
nicoloboschi added a commit that referenced this pull request Apr 30, 2026
…1330)

* feat(oracle): unify migrations under Alembic with dialect dispatcher

Oracle DDL was a 636-line idempotent file (`migrations_oracle.py`) outside
Alembic, which meant no version tracking, no per-tenant version table, and
schema drift every time a PG migration was added without a corresponding
Oracle change. This unifies both backends behind a single Alembic tree.

- New `alembic/_dialect.py::run_for_dialect(pg=, oracle=)` helper (a minimal
  sketch follows this list). Each migration declares `_pg_upgrade` /
  `_oracle_upgrade` and dispatches based on the live connection's dialect.
- `alembic/env.py` is dialect-aware: PG keeps the existing search_path /
  read-write session setup; Oracle uses `ALTER SESSION SET CURRENT_SCHEMA`
  and `DDL_LOCK_TIMEOUT`.
- `alembic/script.py.mako` scaffolds the new pattern by default.
- All 59 existing PG migrations refactored mechanically — bodies moved into
  `_pg_upgrade` / `_pg_downgrade`, top-level dispatchers added.
- New `o1a2b3c4d5e6_oracle_baseline` migration brings a fresh Oracle 23ai
  database to the current schema in one step (PG = no-op). Drops the legacy
  partition-conversion / dedup / `observation_sources` backfill since those
  only existed for pre-baseline Oracle installs we explicitly are not
  supporting.
- `OracleBackend.run_migrations()` now goes through the unified Alembic
  pipeline; `migrations.py` skips the PG-specific advisory lock + pgvector
  setup when the URL is Oracle.
- `migrations_oracle.py` deleted; tests updated to use `run_migrations()`.
- New `tests/test_migration_shape.py` lint fails CI if any migration omits
  `run_for_dialect` — keeps drift from re-emerging.
- CLAUDE.md updated with the new template and dialect-asymmetry guidance.
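
A minimal sketch of the dispatcher pattern from the first bullet; the upgrade bodies here are purely illustrative:

```python
from collections.abc import Callable

from alembic import op


def run_for_dialect(pg: Callable[[], None] | None = None,
                    oracle: Callable[[], None] | None = None) -> None:
    """Run the handler matching the live connection's dialect."""
    name = op.get_bind().dialect.name
    if name == "postgresql" and pg is not None:
        pg()
    elif name == "oracle" and oracle is not None:
        oracle()


def _pg_upgrade() -> None:
    op.execute("ALTER TABLE memories ADD COLUMN IF NOT EXISTS note TEXT")  # illustrative


def _oracle_upgrade() -> None:
    op.execute("ALTER TABLE memories ADD (note CLOB)")  # illustrative


def upgrade() -> None:
    run_for_dialect(pg=_pg_upgrade, oracle=_oracle_upgrade)
```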

* ci: run client integration tests against Oracle on oracle-tests label

Adds test-python-client-oracle and test-typescript-client-oracle. These
mirror the existing test-python-client / test-typescript-client jobs but
spin up Oracle 23ai as a service container and point the API server at it
via HINDSIGHT_API_DATABASE_BACKEND=oracle + DATABASE_URL.

Why a new job instead of matrixing the existing one: Oracle Free's image
takes ~2min to start and is network-heavy, so we don't want to pay that
cost on every PR — only when oracle-tests is opted in via the PR label,
matching the existing test-api-oracle gate.

Why client tests, not unit tests: the unit suite already runs against
both backends via the abstraction layer. Only the client tests exercise
full HTTP round-trips with real serialized payloads, so they catch API
changes that work on PG but break on Oracle (or vice versa) in ways the
abstraction can't see.

* refactor(oracle): tighten feature requirements and dedup is_oracle_url

- Move is_oracle_url to db_url.py and import from there in env.py and
  migrations.py — was duplicated in both.
- Type-annotate _configure_pg_session / _configure_oracle_session params
  (Engine, Connection); ty checks pass.
- Update the Oracle baseline comment around vector + text index creation
  to make the hard requirement explicit: VECTOR + CTXSYS must be
  available, the migration fails hard if either is missing. The
  swallow-only-ORA-00955 behavior was already correct; the previous
  comment misleadingly called it "best-effort".
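
The swallow-only-ORA-00955 behavior amounts to something like this sketch (the real code runs through SQLAlchemy inside the migration; the function name is hypothetical and error handling is shown at the driver level for clarity):

```python
import oracledb


def execute_ddl_tolerating_existing(cursor, ddl: str) -> None:
    """Create an index/table, ignoring only 'name already used' (ORA-00955)."""
    try:
        cursor.execute(ddl)
    except oracledb.DatabaseError as exc:
        (error,) = exc.args
        if error.full_code != "ORA-00955":
            raise  # e.g. missing VECTOR or CTXSYS must fail hard
```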

* chore(openclaw): apply pending prettier reformat to keep verify-generated-files green

Three formatting-only changes prettier wants to make. They've been stale
on main; CI's verify-generated-files runs lint with LINT_ALL=1 (vs the
"only changed integrations" local default), which surfaces them on every
unrelated PR. Folding them in here so this PR can land.

* fix(retain): plumb ops through handle_document_tracking

Line 312 of fact_storage.py references ``ops`` without ``handle_document_tracking``
declaring it as a parameter — straight NameError on every retain that walks
the upsert path. Bug landed on main in d8ec2d7 (#1325) when
``delete_stale_observations_for_memories`` started taking a backend-aware
``ops`` to choose between the PG array operator and the Oracle junction
table; the call site was added but the parameter wasn't threaded into the
enclosing function.

Fix: add ``ops=None`` to ``handle_document_tracking`` and pass ``pool.ops``
from each of the three call sites in orchestrator.py.

This is unrelated to the Alembic dialect-dispatcher refactor in this PR but
is what's blocking it — the NameError caused 17 retain tests to fail (and
left a pytest-xdist worker in a state that hung the whole job at 99%).

* test(observation): pass ops to handle_document_tracking in upsert test

The test calls fact_storage.handle_document_tracking directly, which
delegates to delete_stale_observations_for_memories(ops=ops). With ops=None
the helper falls back to the Oracle junction-table query and fails on PG
with "relation public.observation_sources does not exist". Real callers
(orchestrator, _delete_stale_observations_for_memories wrapper) all pass
self._backend.ops; the test just needs to do the same.

* ci: run client-against-oracle on every API change, drop label gate

Reserve the "oracle-tests" label for the heavy test-api-oracle (full unit
suite). The two client integration jobs against Oracle should run on every
API/client change just like their PG counterparts — the whole point is to
catch PG/Oracle drift before merge, which doesn't work if you have to
remember to label every PR. test-api-oracle keeps its label gate because
the full suite is too slow to run on every push.

* fix(oracle): rewrite path-style service to ?service_name= for SQLAlchemy

Oracle Free / Autonomous DB only register a service name with the listener,
but SQLAlchemy's oracle+oracledb dialect interprets the URL path as a SID.
That mismatch crashes alembic migrations on first connect:
  DPY-6003: SID "FREEPDB1" is not registered with the listener

Rewrite ``oracle://user:pass@host:port/SERVICE`` to
``oracle+oracledb://user:pass@host:port/?service_name=SERVICE`` so the
dialect uses the correct connect descriptor. ``?sid=`` and ``?service_name=``
already in the URL are passed through untouched.
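
A sketch of the rewrite rule, assuming a plain urllib transformation; the real helper's name and module are not shown in this PR:

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_oracle_url(url: str) -> str:
    """oracle://u:p@host:port/SERVICE -> oracle+oracledb://u:p@host:port/?service_name=SERVICE"""
    parts = urlsplit(url)
    if not parts.scheme.startswith("oracle"):
        return url
    if "sid=" in parts.query or "service_name=" in parts.query:
        return url  # explicit connect params pass through untouched
    service = parts.path.lstrip("/")
    if not service:
        return url
    return urlunsplit(("oracle+oracledb", parts.netloc, "/", f"service_name={service}", ""))
```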

Also adds scripts/dev/start-oracle.sh / stop-oracle.sh that spin up the same
Oracle 23ai Free image CI uses (``container-registry.oracle.com/database/free``)
and bootstrap the HINDSIGHT_TEST user, so we can repro this kind of issue
locally without round-tripping through GitHub Actions.

* fix(oracle): commit after migrations so alembic_version persists

On Oracle, alembic runs each migration with transactional_ddl=False
("Will assume non-transactional DDL"). Each CREATE TABLE auto-commits, but
the trailing ``UPDATE alembic_version SET version_num = ...`` is plain DML
that needs an explicit COMMIT. Without it the connection close rolls the
update back, leaving the schema fully created but the version row one
revision behind — so ``run_migrations`` reports success while the head row
sits at the previous revision.

Caught locally with the new scripts/dev/start-oracle.sh harness running the
same Oracle 23ai Free image CI uses; alembic_version was stuck at
``k6l7m8n9o0p1`` even though every table from the ``o1a2b3c4d5e6`` baseline
existed. After the fix it correctly advances to ``o1a2b3c4d5e6``, and a
second run is a no-op as expected.

PG already needs the same commit (Supabase RW-mode SET), so just drop the
``if not is_oracle`` guard.
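
The resulting shape is roughly this sketch (run_alembic_upgrade is a hypothetical wrapper standing in for the real migration driver):

```python
# With non-transactional DDL on Oracle, the trailing
# "UPDATE alembic_version SET version_num = ..." is plain DML
# and needs an explicit COMMIT before the connection closes.
with engine.connect() as connection:
    run_alembic_upgrade(connection)  # hypothetical wrapper around alembic upgrade
    connection.commit()  # Oracle: persists alembic_version; PG: persists the RW-mode SET
```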

* ci(oracle): run python client tests sequentially to avoid ORA-00060

The python client pyproject.toml defaults to -n auto (pytest-xdist).
Against Oracle that hits row-level deadlocks during retain cleanup —
ORA-00060 is logged repeatedly in the API server output and most tests
fail with "Internal Server Error" at fixture teardown. Same shape as the
existing test-api-oracle issue, which is already pinned to -n0.

Override to -n0 in the Oracle client job (only). The PG client job stays
parallel since pgvector + advisory locks handle concurrent retain fine.
TS client tests are unaffected — they run via vitest, not pytest.