Skip to content

feat(oracle): unify migrations under Alembic with dialect dispatcher#1330

Merged
nicoloboschi merged 10 commits intomainfrom
feat/oracle-alembic-migrations
Apr 30, 2026
Merged

feat(oracle): unify migrations under Alembic with dialect dispatcher#1330
nicoloboschi merged 10 commits intomainfrom
feat/oracle-alembic-migrations

Conversation

@nicoloboschi
Copy link
Copy Markdown
Collaborator

Summary

  • Bring Oracle migrations into Alembic so both PG and Oracle share a single revision tree, version table, and tenant-isolation story before we ship Oracle support.
  • Each migration file dispatches via run_for_dialect(pg=..., oracle=...); _pg_upgrade / _oracle_upgrade slots make per-dialect intent explicit and visible in review.
  • A pytest lint (tests/test_migration_shape.py) fails CI if a migration skips the dispatcher, which is the guardrail that keeps PG/Oracle drift from coming back.

What changed

  • New alembic/_dialect.py helper + dialect-aware alembic/env.py (PG keeps search_path / RW-session setup; Oracle uses CURRENT_SCHEMA + DDL_LOCK_TIMEOUT).
  • 59 existing PG migrations mechanically refactored (bodies → _pg_upgrade, top-level dispatcher added). No behavioural change on PG.
  • New o1a2b3c4d5e6_oracle_baseline.py brings a fresh Oracle 23ai DB to current schema in one step. Dropped the legacy partition-conversion / memory_links dedup / observation_sources backfill — those existed only for pre-baseline Oracle installs we explicitly aren't supporting.
  • OracleBackend.run_migrations() now calls the shared Alembic pipeline. migrations.py skips the PG-only advisory-lock + pgvector setup on Oracle URLs.
  • migrations_oracle.py deleted (636 lines). Tests updated to call run_migrations().
  • script.py.mako scaffolds the new pattern; CLAUDE.md updated with the template and the dialect-asymmetry rule.

Open questions for review

  • Dropping the post-baseline patches. The old file converted memory_units to partitioning, deduped memory_links, and backfilled observation_sources on every run. Are we OK that the new baseline can't migrate any pre-existing Oracle install? My read was yes — Oracle hasn't shipped — but worth a confirm.
  • k6l7m8n9o0p1 conflict resolution. Resolved to keep the PG-no-op behaviour from main; updated the docstring to point at the new baseline. Worth a sanity check.
  • Concurrency. Oracle path skips PG advisory locks. Oracle DDL is autocommit and IF NOT EXISTS / 955-swallowing makes concurrent runs safe in practice, but if you want true cross-worker mutual exclusion later we should add DBMS_LOCK.

Test plan

  • ./scripts/hooks/lint.sh — clean.
  • tests/test_migration_shape.py — 60/60 pass.
  • tests/test_db_abstraction.py — 140/140 pass.
  • Alembic tree resolves to a single head (o1a2b3c4d5e6).
  • Live PG migration run end-to-end (couldn't validate locally — pg0 fixture failed to bind a TCP socket; failure was unrelated to this change).
  • Live Oracle migration run against Oracle 23ai (needs a running instance — please verify before merge).
  • Manual eyeball of CREATE INDEX CONCURRENTLY migrations (e.g. c1a2b3d4e5f6); they use op.execute("COMMIT") which still works through the dispatcher but is the trickiest case.

Oracle DDL was a 636-line idempotent file (`migrations_oracle.py`) outside
Alembic, which meant no version tracking, no per-tenant version table, and
schema drift every time a PG migration was added without a corresponding
Oracle change. This unifies both backends behind a single Alembic tree.

- New `alembic/_dialect.py::run_for_dialect(pg=, oracle=)` helper. Each
  migration declares `_pg_upgrade` / `_oracle_upgrade` and dispatches based
  on the live connection's dialect.
- `alembic/env.py` is dialect-aware: PG keeps the existing search_path /
  read-write session setup; Oracle uses `ALTER SESSION SET CURRENT_SCHEMA`
  and `DDL_LOCK_TIMEOUT`.
- `alembic/script.py.mako` scaffolds the new pattern by default.
- All 59 existing PG migrations refactored mechanically — bodies moved into
  `_pg_upgrade` / `_pg_downgrade`, top-level dispatchers added.
- New `o1a2b3c4d5e6_oracle_baseline` migration brings a fresh Oracle 23ai
  database to the current schema in one step (PG = no-op). Drops the legacy
  partition-conversion / dedup / `observation_sources` backfill since those
  only existed for pre-baseline Oracle installs we explicitly are not
  supporting.
- `OracleBackend.run_migrations()` now goes through the unified Alembic
  pipeline; `migrations.py` skips the PG-specific advisory lock + pgvector
  setup when the URL is Oracle.
- `migrations_oracle.py` deleted; tests updated to use `run_migrations()`.
- New `tests/test_migration_shape.py` lint fails CI if any migration omits
  `run_for_dialect` — keeps drift from re-emerging.
- CLAUDE.md updated with the new template and dialect-asymmetry guidance.
Adds test-python-client-oracle and test-typescript-client-oracle. These
mirror the existing test-python-client / test-typescript-client jobs but
spin up Oracle 23ai as a service container and point the API server at it
via HINDSIGHT_API_DATABASE_BACKEND=oracle + DATABASE_URL.

Why a new job instead of matrixing the existing one: Oracle Free's image
takes ~2min to start and is network-heavy, so we don't want to pay that
cost on every PR — only when oracle-tests is opted in via the PR label,
matching the existing test-api-oracle gate.

Why client tests, not unit tests: the unit suite already runs against
both backends via the abstraction layer. Only the client tests exercise
full HTTP round-trips with real serialized payloads, so they catch API
changes that work on PG but break on Oracle (or vice versa) in ways the
abstraction can't see.
- Move is_oracle_url to db_url.py and import from there in env.py and
  migrations.py — was duplicated in both.
- Type-annotate _configure_pg_session / _configure_oracle_session params
  (Engine, Connection); ty checks pass.
- Update the Oracle baseline comment around vector + text index creation
  to make the hard requirement explicit: VECTOR + CTXSYS must be
  available, the migration fails hard if either is missing. The
  swallow-only-ORA-00955 behavior was already correct; the previous
  comment misleadingly called it "best-effort".
…ated-files green

Three formatting-only changes prettier wants to make. They've been stale
on main; CI's verify-generated-files runs lint with LINT_ALL=1 (vs the
"only changed integrations" local default), which surfaces them on every
unrelated PR. Folding them in here so this PR can land.
Line 312 of fact_storage.py references ``ops`` without ``handle_document_tracking``
declaring it as a parameter — straight NameError on every retain that walks
the upsert path. Bug landed on main in d8ec2d7 (#1325) when
``delete_stale_observations_for_memories`` started taking a backend-aware
``ops`` to choose between the PG array operator and the Oracle junction
table; the call site was added but the parameter wasn't threaded into the
enclosing function.

Fix: add ``ops=None`` to ``handle_document_tracking`` and pass ``pool.ops``
from each of the three call sites in orchestrator.py.

This is unrelated to the Alembic dialect-dispatcher refactor in this PR but
is what's blocking it — the NameError caused 17 retain tests to fail (and
left a pytest-xdist worker in a state that hung the whole job at 99%).
The test calls fact_storage.handle_document_tracking directly, which
delegates to delete_stale_observations_for_memories(ops=ops). With ops=None
the helper falls back to the Oracle junction-table query and fails on PG
with "relation public.observation_sources does not exist". Real callers
(orchestrator, _delete_stale_observations_for_memories wrapper) all pass
self._backend.ops; the test just needs to do the same.
@nicoloboschi nicoloboschi added the oracle-tests Enable Oracle 23ai integration tests in CI label Apr 30, 2026
Reserve the "oracle-tests" label for the heavy test-api-oracle (full unit
suite). The two client integration jobs against Oracle should run on every
API/client change just like their PG counterparts — the whole point is to
catch PG/Oracle drift before merge, which doesn't work if you have to
remember to label every PR. test-api-oracle keeps its label gate because
the full suite is too slow to run on every push.
@nicoloboschi nicoloboschi removed the oracle-tests Enable Oracle 23ai integration tests in CI label Apr 30, 2026
Oracle Free / Autonomous DB only register a service name with the listener,
but SQLAlchemy's oracle+oracledb dialect interprets the URL path as a SID.
That mismatch crashes alembic migrations on first connect:
  DPY-6003: SID "FREEPDB1" is not registered with the listener

Rewrite ``oracle://user:pass@host:port/SERVICE`` to
``oracle+oracledb://user:pass@host:port/?service_name=SERVICE`` so the
dialect uses the correct connect descriptor. ``?sid=`` and ``?service_name=``
already in the URL are passed through untouched.

Also adds scripts/dev/start-oracle.sh / stop-oracle.sh that spin up the same
Oracle 23ai Free image CI uses (``container-registry.oracle.com/database/free``)
and bootstrap the HINDSIGHT_TEST user, so we can repro this kind of issue
locally without round-tripping through GitHub Actions.
On Oracle, alembic runs each migration with transactional_ddl=False
("Will assume non-transactional DDL"). Each CREATE TABLE auto-commits, but
the trailing ``UPDATE alembic_version SET version_num = ...`` is plain DML
that needs an explicit COMMIT. Without it the connection close rolls the
update back, leaving the schema fully created but the version row one
revision behind — so ``run_migrations`` reports success while the head row
sits at the previous revision.

Caught locally with the new scripts/dev/start-oracle.sh harness running the
same Oracle 23ai Free image CI uses; alembic_version was stuck at
``k6l7m8n9o0p1`` even though every table from the ``o1a2b3c4d5e6`` baseline
existed. After the fix it correctly advances to ``o1a2b3c4d5e6``, and a
second run is a no-op as expected.

PG already needs the same commit (Supabase RW-mode SET), so just drop the
``if not is_oracle`` guard.
The python client pyproject.toml defaults to -n auto (pytest-xdist).
Against Oracle that hits row-level deadlocks during retain cleanup —
ORA-00060 is logged repeatedly in the API server output and most tests
fail with "Internal Server Error" at fixture teardown. Same shape as the
existing test-api-oracle issue, which is already pinned to -n0.

Override to -n0 in the Oracle client job (only). The PG client job stays
parallel since pgvector + advisory locks handle concurrent retain fine.
TS client tests are unaffected — they run via vitest, not pytest.
@nicoloboschi nicoloboschi merged commit d39a2ca into main Apr 30, 2026
61 of 63 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant