
perf(db): eliminate ResultRow wrapping overhead for PostgreSQL #1325

Merged
nicoloboschi merged 2 commits into main from fix/eliminate-resultrow-wrapping on Apr 29, 2026

Conversation

@nicoloboschi
Collaborator

Summary

The Oracle abstraction PR (#1307) introduced a ResultRow wrapper class around every asyncpg.Record returned by fetch()/fetchrow(). That added Python-level indirection on every column access (row["id"], row["text"], etc.): ~570K __getitem__ calls per 20-recall benchmark, each paying for object.__getattribute__ plus an isinstance check.

Fix: Make ResultRow a Protocol (interface) instead of a concrete class. asyncpg.Record already satisfies dict-like access natively in C. The PG backend now returns raw Records directly. Oracle uses DictResultRow (the concrete wrapper) since its native row types need it.
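
A minimal sketch of the new shape, assuming the Protocol only needs the dict-like trio named below (the real definitions in result.py may declare more):

```python
from typing import Any, Iterable, Protocol


class ResultRow(Protocol):
    """Structural interface; asyncpg.Record satisfies it with no wrapper."""

    def __getitem__(self, key: str) -> Any: ...
    def keys(self) -> Iterable[str]: ...
    def get(self, key: str, default: Any = None) -> Any: ...


class DictResultRow:
    """Concrete wrapper for backends (Oracle) whose native rows aren't dict-like."""

    def __init__(self, data: dict[str, Any]) -> None:
        self._data = data

    def __getitem__(self, key: str) -> Any:
        return self._data[key]

    def keys(self) -> Iterable[str]:
        return self._data.keys()

    def get(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)
```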

Benchmark (medium, 10K items, concurrency=4, same pg0 data)

| Variant | Mean | vs baseline |
| --- | --- | --- |
| v0.5.6 baseline | 0.648s | (baseline) |
| With ResultRow wrapping (before) | 0.805s | +24% |
| Without wrapping (this PR) | 0.680s | +5% (noise) |
| With observation_sources junction table | 0.680s | +5% (no impact) |

Identified via cProfile: result.py:29(__getitem__) was the #3 CPU consumer at 0.156s / 569K calls.
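
For context, a hotspot like this surfaces with a few lines of cProfile; run_recall_benchmark here is a hypothetical stand-in for the actual benchmark driver:

```python
import cProfile
import pstats

# run_recall_benchmark is hypothetical -- substitute the real benchmark entry point.
cProfile.run("run_recall_benchmark(recalls=20)", "recall.prof")

stats = pstats.Stats("recall.prof")
# Sorting by total time surfaced result.py:29(__getitem__) near the top.
stats.sort_stats("tottime").print_stats(10)
```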

Test plan

  • test_db_abstraction.py — all pass (updated to use DictResultRow)
  • test_link_expansion_retrieval.py — 2/2 pass
  • Lint passes
  • Recall benchmark matches v0.5.6 baseline on same data

Make ResultRow a Protocol instead of a concrete wrapper class. asyncpg.Record
already satisfies the dict-like access pattern (row["key"], .keys(), .get())
natively in C — wrapping it in a Python class added ~570K __getitem__ calls
per 20-recall benchmark, causing a measurable ~24% regression at 10K bank size.

Changes:
- ResultRow is now a Protocol (interface) in result.py
- DictResultRow is the concrete wrapper, used only by Oracle backend
- PostgresConnection.fetch/fetchrow return raw asyncpg.Record directly
- Oracle backend imports DictResultRow as ResultRow (no behavior change)
- Tests updated to use DictResultRow

Benchmark (medium, 10K items, concurrency=4, same pg0 data):
  v0.5.6 baseline:    0.648s mean
  With wrapping:       0.805s mean (+24%)
  Without wrapping:    0.680s mean (+5%, within noise)
  With junction table: 0.680s mean (observation_sources has zero impact)
nicoloboschi force-pushed the fix/eliminate-resultrow-wrapping branch 2 times, most recently from ba8dad0 to 891e104 on April 29, 2026 at 13:52
…kend-aware

Two performance fixes for the Oracle abstraction layer:

1. Make ResultRow a Protocol instead of a concrete wrapper class. asyncpg.Record
   satisfies dict-like access natively in C — wrapping added ~570K __getitem__
   calls per benchmark, causing a ~24% regression at 10K bank size.

2. Make observation source reads backend-dependent: PG uses native array ops
   (source_memory_ids column with &&, unnest), Oracle uses the observation_sources
   junction table. PG also skips junction table writes in the consolidator.
   At 33K scale, junction table reads doubled retrieval_graph latency (0.093s→0.186s).
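
A sketch of the split described in (2), using the table and column names above; the SQL bodies and bind-parameter styles are illustrative, not the shipped queries:

```python
async def fetch_observation_sources(conn, ops, memory_ids):
    """Read observation->memory links using the backend's preferred shape."""
    if ops.uses_observation_sources_table:  # Oracle: normalized junction table
        return await conn.fetch(
            "SELECT observation_id, memory_id FROM observation_sources "
            "WHERE memory_id IN (:ids)",
            memory_ids,
        )
    # PostgreSQL: native array ops on observations.source_memory_ids
    return await conn.fetch(
        "SELECT id AS observation_id, unnest(source_memory_ids) AS memory_id "
        "FROM observations WHERE source_memory_ids && $1::uuid[]",
        memory_ids,
    )
```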

Changes:
- ResultRow is now a Protocol; DictResultRow is the concrete wrapper (Oracle only)
- PostgresConnection.fetch/fetchrow return raw asyncpg.Record directly
- DataAccessOps.uses_observation_sources_table property (PG=False, Oracle=True)
- Consolidator guards junction table writes behind uses_observation_sources_table
- memory_engine.py and fact_storage.py branch reads by backend type

Benchmark (large, 33K items, concurrency=4, same pg0 data):
  v0.5.6 baseline:       0.853s mean
  Junction table reads:   1.027s mean (+20%)
  Array ops + no wrap:    1.014s mean (+19%, graph=0.091s matches baseline)
nicoloboschi force-pushed the fix/eliminate-resultrow-wrapping branch from 891e104 to 792aea6 on April 29, 2026 at 14:08
nicoloboschi merged commit d8ec2d7 into main on Apr 29, 2026
55 of 60 checks passed
nicoloboschi added a commit that referenced this pull request Apr 29, 2026
Line 312 of fact_storage.py references ``ops`` without ``handle_document_tracking``
declaring it as a parameter — straight NameError on every retain that walks
the upsert path. Bug landed on main in d8ec2d7 (#1325) when
``delete_stale_observations_for_memories`` started taking a backend-aware
``ops`` to choose between the PG array operator and the Oracle junction
table; the call site was added but the parameter wasn't threaded into the
enclosing function.

Fix: add ``ops=None`` to ``handle_document_tracking`` and pass ``pool.ops``
from each of the three call sites in orchestrator.py.

This is unrelated to the Alembic dialect-dispatcher refactor in this PR but
is what's blocking it — the NameError caused 17 retain tests to fail (and
left a pytest-xdist worker in a state that hung the whole job at 99%).
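
A sketch of the fix's shape, with the bookkeeping elided (handle_document_tracking and delete_stale_observations_for_memories are the real names from the commit; the body here is illustrative):

```python
async def handle_document_tracking(conn, memories, *, ops=None):
    # ... document upsert bookkeeping elided ...
    memory_ids = [m.id for m in memories]
    # ops now reaches the backend-aware delete instead of raising NameError.
    await delete_stale_observations_for_memories(conn, memory_ids, ops=ops)

# Each of the three orchestrator.py call sites forwards the pool's ops object:
#   await handle_document_tracking(conn, memories, ops=pool.ops)
```
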
cdbartholomew added a commit that referenced this pull request Apr 29, 2026
handle_document_tracking calls delete_stale_observations_for_memories
with ops=ops, but ops is not a parameter of handle_document_tracking
itself (introduced in #1325 as part of the backend-aware observation
read split). Every retain that hits the document-tracking path raises
NameError before any actual work happens.

Add ops as a kwarg-only parameter on handle_document_tracking and
forward pool.ops from each of the three call sites in
_streaming_retain_batch. Behaviorally a no-op for the PG path
(uses_observation_sources_table is False, so the existing PG branch
runs) and for the Oracle path (junction table branch already runs
when ops.uses_observation_sources_table is True).
youchi1 added a commit to youchi1/hindsight that referenced this pull request Apr 30, 2026
Regression from vectorize-io#1325 (perf(db): eliminate ResultRow wrapping). Line
fact_storage.py:312 was changed to pass ops=ops to
delete_stale_observations_for_memories, but ops was never added to
handle_document_tracking's signature, so every first-batch retain hit
NameError: name 'ops' is not defined.

Effect: every batch_retain task fails on the first-batch document
tracking path. The poller schedules a retry, the retry fails the same
way, and after the retry budget is exhausted the parent operation is
marked failed. No memories get stored. The gateway's
"hindsight: 1 retains (...) captured" log is misleading — that just
means dispatch succeeded; the daemon's actual write path was raising
all along.

Fix is to give handle_document_tracking an ops parameter (defaulting
to None to preserve the Oracle/junction-table fallback) and pass
pool.ops from both call sites in orchestrator._streaming_retain_batch.

Reproducible on any retain that triggers a first-batch document
upsert. Verified against the daemon's structured log on a live
deployment running main as of d8ec2d7.
nicoloboschi pushed a commit that referenced this pull request Apr 30, 2026
…1343)

* fix(async-ops): atomically commit batch_retain parent and child rows

submit_async_batch_retain inserts a parent row (status='pending',
task_payload=NULL — it's a status aggregator, not directly executable)
and then loops to insert one child row per sub-batch. The parent INSERT
and child INSERTs were not transactionally coupled: the parent's
INSERT ran in its own auto-committing connection, and each child went
through a separate _submit_async_operation call that acquired its own
connection.

Any failure between them (connection drop, asyncpg timeout, schema-
cache invalidation under concurrent load, or any other exception
raised during child setup) leaves a parent row with zero children.
The worker poller skips it forever because of the
"task_payload IS NOT NULL" filter, the status aggregator never fires
because there are no children to complete, and the row sits pending
indefinitely. It also pollutes queue-depth metrics that operators rely
on to size worker pools.

Fix: wrap parent INSERT and all child INSERTs in a single
async transaction so the create-batch operation is atomic — either
all rows become visible to workers or none are. Child INSERT SQL is
inlined for the duration of the transaction; _submit_async_operation
is left untouched so other callers are unaffected. submit_task() is
deferred to after the transaction commits because SyncTaskBackend
(used in tests) executes synchronously and would otherwise read the
not-yet-committed row.
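
A sketch of the resulting shape, assuming asyncpg-style connection and transaction semantics; the table and column names follow the text above, the rest is illustrative:

```python
async def submit_async_batch_retain(pool, sub_batches):
    async with pool.acquire() as conn:
        async with conn.transaction():  # parent + children commit or roll back together
            parent_id = await conn.fetchval(
                "INSERT INTO async_operations (status, task_payload) "
                "VALUES ('pending', NULL) RETURNING id"
            )
            child_ids = []
            for payload in sub_batches:  # child INSERTs inlined for the transaction
                child_ids.append(await conn.fetchval(
                    "INSERT INTO async_operations (status, parent_id, task_payload) "
                    "VALUES ('pending', $1, $2) RETURNING id",
                    parent_id, payload,
                ))
    # Notify only after commit: SyncTaskBackend (tests) executes synchronously
    # and must be able to read the committed rows.
    for child_id in child_ids:
        await submit_task(child_id)
    return parent_id
```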

Tests:
- New regression test
  test_submit_async_batch_retain_rolls_back_parent_on_child_failure
  monkeypatches BatchRetainChildMetadata to raise on the second
  sub-batch and asserts zero async_operations rows remain after the
  failure (parent must roll back together with children).
- Mirrors the existing
  test_submit_async_operation_leaves_claimable_row_when_submit_task_fails
  but at the parent-level (the child-level case was already fixed).

* test(async-retain-tags): rewrite for inlined child INSERT

submit_async_batch_retain now inserts children inline inside the
parent's transaction (rather than calling _submit_async_operation per
child) and notifies the task backend after commit. The pre-existing
test mocked _submit_async_operation and asserted on its call args;
that path no longer runs for children.

Replace those assertions with the new equivalent: count the INSERTs on
the connection, inspect the post-commit submit_task payload for
document_tags, and cross-check the JSON serialized into the child's
task_payload column. Same intent (document_tags propagates through to
the worker), aligned with the new code path.

* fix(retain): thread ops through handle_document_tracking

handle_document_tracking calls delete_stale_observations_for_memories
with ops=ops, but ops is not a parameter of handle_document_tracking
itself (introduced in #1325 as part of the backend-aware observation
read split). Every retain that hits the document-tracking path raises
NameError before any actual work happens.

Add ops as a kwarg-only parameter on handle_document_tracking and
forward pool.ops from each of the three call sites in
_streaming_retain_batch. Behaviorally a no-op for the PG path
(uses_observation_sources_table is False, so the existing PG branch
runs) and for the Oracle path (junction table branch already runs
when ops.uses_observation_sources_table is True).

* test(observation-invalidation): pass ops to handle_document_tracking

The test calls handle_document_tracking directly (rather than going
through the retain orchestrator) and didn't pass ops. With the param
defaulting to None, the inner delete_stale_observations_for_memories
call falls through to the Oracle junction-table read path and queries
a non-existent public.observation_sources relation under PG.

The orchestrator's three call sites already pass pool.ops; this test
just needs to mirror that. Pass memory._backend.ops to keep the test
backend-agnostic.
nicoloboschi added a commit that referenced this pull request Apr 30, 2026
…1330)

* feat(oracle): unify migrations under Alembic with dialect dispatcher

Oracle DDL was a 636-line idempotent file (`migrations_oracle.py`) outside
Alembic, which meant no version tracking, no per-tenant version table, and
schema drift every time a PG migration was added without a corresponding
Oracle change. This unifies both backends behind a single Alembic tree.

- New `alembic/_dialect.py::run_for_dialect(pg=, oracle=)` helper (a minimal
  sketch follows this list). Each migration declares `_pg_upgrade` /
  `_oracle_upgrade` and dispatches based on the live connection's dialect.
- `alembic/env.py` is dialect-aware: PG keeps the existing search_path /
  read-write session setup; Oracle uses `ALTER SESSION SET CURRENT_SCHEMA`
  and `DDL_LOCK_TIMEOUT`.
- `alembic/script.py.mako` scaffolds the new pattern by default.
- All 59 existing PG migrations refactored mechanically — bodies moved into
  `_pg_upgrade` / `_pg_downgrade`, top-level dispatchers added.
- New `o1a2b3c4d5e6_oracle_baseline` migration brings a fresh Oracle 23ai
  database to the current schema in one step (PG = no-op). Drops the legacy
  partition-conversion / dedup / `observation_sources` backfill since those
  only existed for pre-baseline Oracle installs we explicitly are not
  supporting.
- `OracleBackend.run_migrations()` now goes through the unified Alembic
  pipeline; `migrations.py` skips the PG-specific advisory lock + pgvector
  setup when the URL is Oracle.
- `migrations_oracle.py` deleted; tests updated to use `run_migrations()`.
- New `tests/test_migration_shape.py` lint fails CI if any migration omits
  `run_for_dialect` — keeps drift from re-emerging.
- CLAUDE.md updated with the new template and dialect-asymmetry guidance.
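
A minimal sketch of the dispatcher pattern from the first bullet; the upgrade bodies here are purely illustrative:

```python
from collections.abc import Callable

from alembic import op


def run_for_dialect(pg: Callable[[], None] | None = None,
                    oracle: Callable[[], None] | None = None) -> None:
    """Run the handler matching the live connection's dialect."""
    name = op.get_bind().dialect.name
    if name == "postgresql" and pg is not None:
        pg()
    elif name == "oracle" and oracle is not None:
        oracle()


def _pg_upgrade() -> None:
    op.execute("ALTER TABLE memories ADD COLUMN IF NOT EXISTS note TEXT")  # illustrative


def _oracle_upgrade() -> None:
    op.execute("ALTER TABLE memories ADD (note CLOB)")  # illustrative


def upgrade() -> None:
    run_for_dialect(pg=_pg_upgrade, oracle=_oracle_upgrade)
```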

* ci: run client integration tests against Oracle on oracle-tests label

Adds test-python-client-oracle and test-typescript-client-oracle. These
mirror the existing test-python-client / test-typescript-client jobs but
spin up Oracle 23ai as a service container and point the API server at it
via HINDSIGHT_API_DATABASE_BACKEND=oracle + DATABASE_URL.

Why a new job instead of matrixing the existing one: Oracle Free's image
takes ~2min to start and is network-heavy, so we don't want to pay that
cost on every PR — only when oracle-tests is opted in via the PR label,
matching the existing test-api-oracle gate.

Why client tests, not unit tests: the unit suite already runs against
both backends via the abstraction layer. Only the client tests exercise
full HTTP round-trips with real serialized payloads, so they catch API
changes that work on PG but break on Oracle (or vice versa) in ways the
abstraction can't see.

* refactor(oracle): tighten feature requirements and dedup is_oracle_url

- Move is_oracle_url to db_url.py and import from there in env.py and
  migrations.py — was duplicated in both.
- Type-annotate _configure_pg_session / _configure_oracle_session params
  (Engine, Connection); ty checks pass.
- Update the Oracle baseline comment around vector + text index creation
  to make the hard requirement explicit: VECTOR + CTXSYS must be
  available, the migration fails hard if either is missing. The
  swallow-only-ORA-00955 behavior was already correct; the previous
  comment misleadingly called it "best-effort".
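
The swallow-only-ORA-00955 behavior amounts to something like this sketch (the real code runs through SQLAlchemy inside the migration; the function name is hypothetical and error handling is shown at the driver level for clarity):

```python
import oracledb


def execute_ddl_tolerating_existing(cursor, ddl: str) -> None:
    """Create an index/table, ignoring only 'name already used' (ORA-00955)."""
    try:
        cursor.execute(ddl)
    except oracledb.DatabaseError as exc:
        (error,) = exc.args
        if error.full_code != "ORA-00955":
            raise  # e.g. missing VECTOR or CTXSYS must fail hard
```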

* chore(openclaw): apply pending prettier reformat to keep verify-generated-files green

Three formatting-only changes prettier wants to make. They've been stale
on main; CI's verify-generated-files runs lint with LINT_ALL=1 (vs the
"only changed integrations" local default), which surfaces them on every
unrelated PR. Folding them in here so this PR can land.

* fix(retain): plumb ops through handle_document_tracking

Line 312 of fact_storage.py references ``ops`` without ``handle_document_tracking``
declaring it as a parameter — straight NameError on every retain that walks
the upsert path. Bug landed on main in d8ec2d7 (#1325) when
``delete_stale_observations_for_memories`` started taking a backend-aware
``ops`` to choose between the PG array operator and the Oracle junction
table; the call site was added but the parameter wasn't threaded into the
enclosing function.

Fix: add ``ops=None`` to ``handle_document_tracking`` and pass ``pool.ops``
from each of the three call sites in orchestrator.py.

This is unrelated to the Alembic dialect-dispatcher refactor in this PR but
is what's blocking it — the NameError caused 17 retain tests to fail (and
left a pytest-xdist worker in a state that hung the whole job at 99%).

* test(observation): pass ops to handle_document_tracking in upsert test

The test calls fact_storage.handle_document_tracking directly, which
delegates to delete_stale_observations_for_memories(ops=ops). With ops=None
the helper falls back to the Oracle junction-table query and fails on PG
with "relation public.observation_sources does not exist". Real callers
(orchestrator, _delete_stale_observations_for_memories wrapper) all pass
self._backend.ops; the test just needs to do the same.

* ci: run client-against-oracle on every API change, drop label gate

Reserve the "oracle-tests" label for the heavy test-api-oracle (full unit
suite). The two client integration jobs against Oracle should run on every
API/client change just like their PG counterparts — the whole point is to
catch PG/Oracle drift before merge, which doesn't work if you have to
remember to label every PR. test-api-oracle keeps its label gate because
the full suite is too slow to run on every push.

* fix(oracle): rewrite path-style service to ?service_name= for SQLAlchemy

Oracle Free / Autonomous DB only register a service name with the listener,
but SQLAlchemy's oracle+oracledb dialect interprets the URL path as a SID.
That mismatch crashes alembic migrations on first connect:
  DPY-6003: SID "FREEPDB1" is not registered with the listener

Rewrite ``oracle://user:pass@host:port/SERVICE`` to
``oracle+oracledb://user:pass@host:port/?service_name=SERVICE`` so the
dialect uses the correct connect descriptor. ``?sid=`` and ``?service_name=``
already in the URL are passed through untouched.
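
A sketch of the rewrite rule, assuming a plain urllib transformation; the real helper's name and module are not shown in this PR:

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_oracle_url(url: str) -> str:
    """oracle://u:p@host:port/SERVICE -> oracle+oracledb://u:p@host:port/?service_name=SERVICE"""
    parts = urlsplit(url)
    if not parts.scheme.startswith("oracle"):
        return url
    if "sid=" in parts.query or "service_name=" in parts.query:
        return url  # explicit connect params pass through untouched
    service = parts.path.lstrip("/")
    if not service:
        return url
    return urlunsplit(("oracle+oracledb", parts.netloc, "/", f"service_name={service}", ""))
```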

Also adds scripts/dev/start-oracle.sh / stop-oracle.sh that spin up the same
Oracle 23ai Free image CI uses (``container-registry.oracle.com/database/free``)
and bootstrap the HINDSIGHT_TEST user, so we can repro this kind of issue
locally without round-tripping through GitHub Actions.

* fix(oracle): commit after migrations so alembic_version persists

On Oracle, alembic runs each migration with transactional_ddl=False
("Will assume non-transactional DDL"). Each CREATE TABLE auto-commits, but
the trailing ``UPDATE alembic_version SET version_num = ...`` is plain DML
that needs an explicit COMMIT. Without it the connection close rolls the
update back, leaving the schema fully created but the version row one
revision behind — so ``run_migrations`` reports success while the head row
sits at the previous revision.

Caught locally with the new scripts/dev/start-oracle.sh harness running the
same Oracle 23ai Free image CI uses; alembic_version was stuck at
``k6l7m8n9o0p1`` even though every table from the ``o1a2b3c4d5e6`` baseline
existed. After the fix it correctly advances to ``o1a2b3c4d5e6``, and a
second run is a no-op as expected.

PG already needs the same commit (Supabase RW-mode SET), so just drop the
``if not is_oracle`` guard.
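
The resulting shape is roughly this sketch (run_alembic_upgrade is a hypothetical wrapper standing in for the real migration driver):

```python
# With non-transactional DDL on Oracle, the trailing
# "UPDATE alembic_version SET version_num = ..." is plain DML
# and needs an explicit COMMIT before the connection closes.
with engine.connect() as connection:
    run_alembic_upgrade(connection)  # hypothetical wrapper around alembic upgrade
    connection.commit()  # Oracle: persists alembic_version; PG: persists the RW-mode SET
```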

* ci(oracle): run python client tests sequentially to avoid ORA-00060

The python client pyproject.toml defaults to -n auto (pytest-xdist).
Against Oracle that hits row-level deadlocks during retain cleanup —
ORA-00060 is logged repeatedly in the API server output and most tests
fail with "Internal Server Error" at fixture teardown. Same shape as the
existing test-api-oracle issue, which is already pinned to -n0.

Override to -n0 in the Oracle client job (only). The PG client job stays
parallel since pgvector + advisory locks handle concurrent retain fine.
TS client tests are unaffected — they run via vitest, not pytest.