Releases: yonk-labs/pg-raggraph

v0.5.0a14 — chunkshop 0.9.1 floor: bound symbol_aware over-parse

09 Jun 17:14
91bf165

Fixes the #79 root cause: large/generated code corpora no longer OOM the embedding step

chunkshop 0.9.0's path-less language detection parsed ~2× more files as code, so a single generated/minified file could explode into thousands of symbol_aware chunks and exhaust memory when embedding them.

chunkshop 0.9.1 (chunkshop#71/#72) adds two on-by-default guards — a content-detection fallback for generated/minified files and max_symbols_per_file=2000. This release raises the [chunkshop] extra floor to chunkshop>=0.9.1, so pg-raggraph's chunk_strategy="chunkshop:symbol_aware" path inherits the protection with no config change.

  • Verified: a generated .ts file with 3,000 functions drops from 3,000 to 74 chunks; normal code is untouched.
  • 467 unit and 17 integration tests pass under chunkshop 0.9.1.
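
A rough picture of how a per-file symbol cap bounds chunk counts. This is a sketch, not chunkshop's implementation: `chunk_file`, `split_fixed`, and the 4000-character window are hypothetical names and values chosen for illustration. Only the 2000 default comes from the notes above.

```python
from typing import List

MAX_SYMBOLS_PER_FILE = 2000  # default named in the release notes


def split_fixed(text: str, window: int = 4000) -> List[str]:
    """Hypothetical fallback: plain fixed-size windows over the raw text."""
    return [text[i:i + window] for i in range(0, len(text), window)] or [""]


def chunk_file(text: str, symbols: List[str],
               max_symbols: int = MAX_SYMBOLS_PER_FILE) -> List[str]:
    """Emit one chunk per symbol unless the file exceeds the cap."""
    if len(symbols) > max_symbols:
        # Likely generated or minified: avoid one chunk per symbol.
        return split_fixed(text)
    return list(symbols)


# A generated file with 3,000 tiny functions collapses to a few windows,
# while a normal file keeps its per-symbol chunks.
generated_symbols = [f"function f{i}() {{}}" for i in range(3000)]
generated_chunks = chunk_file("x" * 120_000, generated_symbols)
normal_chunks = chunk_file("def a(): ...\ndef b(): ...", ["a", "b"])
```

The point of the cap is that chunk count stops scaling with symbol count once a file looks machine-generated, so the embedding step sees a bounded batch.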

Together with v0.5.0a13 (cross-file resolver spill-to-DB), both halves of #79 are addressed.

Full changelog: see CHANGELOG.md.

v0.5.0a13 — cross-file resolver: fix OOM (spill to DB)

09 Jun 13:02
12b29f9

Fix: cross-file code-graph resolver OOM on large repos

The v0.5.0a12 cross-file resolver (cross_file_code_graph=True) accumulated one entry per call site (each with a source snippet) in memory — O(call-sites), ~427 B/call, projecting to ~3.6 GB at 100K files → OOM on big codebases.

Fix: CorpusCodeGraph now keeps only the small symbol index in memory and spills call sites to an UNLOGGED code_calls_stage table (migration 014). Phase-2 resolution drains them in keyset batches (resolve_batch(), verified byte-identical to the one-shot resolver) plus one resolve_class_edges() pass. Peak resolver memory is now O(batch + symbol index) instead of O(corpus call sites); scratch rows are deleted after.

  • Parity: batched == one-shot edges (unit test); 644 cross-file edges on self-ingest unchanged.
  • Memory: peak in-memory call list measured 0 (was ~24.7K).
  • No change to the resolved graph; default ingest path untouched.
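
The keyset-batch drain described above can be sketched with an in-memory SQLite table standing in for the UNLOGGED Postgres staging table. The column names (`id`, `caller`, `callee`) and the helper `drain_in_keyset_batches` are assumptions made for this illustration, not the project's schema or API.

```python
import sqlite3
from typing import Iterator, List, Tuple

Row = Tuple[int, str, str]


def drain_in_keyset_batches(conn: sqlite3.Connection,
                            batch_size: int) -> Iterator[List[Row]]:
    """Yield staged call sites in primary-key order, one batch at a time."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, caller, callee FROM code_calls_stage "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]  # keyset cursor: resume after the last row seen


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE code_calls_stage "
             "(id INTEGER PRIMARY KEY, caller TEXT, callee TEXT)")
conn.executemany(
    "INSERT INTO code_calls_stage (caller, callee) VALUES (?, ?)",
    [(f"f{i}", f"g{i}") for i in range(10)],
)

batches = list(drain_in_keyset_batches(conn, batch_size=4))
batch_sizes = [len(b) for b in batches]
conn.execute("DELETE FROM code_calls_stage")  # scratch rows removed afterwards
remaining = conn.execute("SELECT COUNT(*) FROM code_calls_stage").fetchone()[0]
```

Only one batch is held in Python at a time, and resuming from the last seen key avoids the growing scan cost of OFFSET-based paging.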

Full changelog: see CHANGELOG.md.

v0.5.0a12 — cross-file code graph

09 Jun 02:19
6c679b5

Cross-file code graph (#76)

chunk_strategy="chunkshop:symbol_aware" now supports corpus-wide cross-file resolution for the code call graph.

  • Opt in with ingest_records(..., cross_file_code_graph=True) or PGRG_CROSS_FILE_CODE_GRAPH=1. Default stays per-file/streaming (no regression).
  • One CorpusCodeGraph resolver accumulates the whole ingest (O(symbols), not content — bounded-memory streaming preserved), then materializes cross-file CALLS/INHERITS/IMPLEMENTS edges. Each carries resolved_intra_file in relationships.properties.
  • Self-ingest of src/pg_raggraph: 415 → 1,030 CALLS edges, 636 cross-file, symbols-with-callers 246 → 351.
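
To make the cross-file idea concrete, here is a toy resolver that looks up callee names in a corpus-wide symbol index and tags each edge with `resolved_intra_file`. It is a simplified sketch: the real CorpusCodeGraph is not shown in these notes, and `CallSite`, `Edge`, `resolve_calls`, and the plain name-to-file lookup are all assumptions for illustration.

```python
from typing import Dict, List, NamedTuple, Optional


class CallSite(NamedTuple):
    caller_file: str
    caller: str
    callee_name: str


class Edge(NamedTuple):
    src: str
    dst: str
    resolved_intra_file: bool


def resolve_calls(symbol_index: Dict[str, str],
                  call_sites: List[CallSite]) -> List[Edge]:
    """Resolve callee names against a corpus-wide index of symbol -> file."""
    edges = []
    for site in call_sites:
        target_file: Optional[str] = symbol_index.get(site.callee_name)
        if target_file is None:
            continue  # unresolved (e.g. external library); no edge emitted
        edges.append(Edge(
            src=f"{site.caller_file}::{site.caller}",
            dst=f"{target_file}::{site.callee_name}",
            resolved_intra_file=(target_file == site.caller_file),
        ))
    return edges


index = {"parse": "a.py", "load": "b.py"}
sites = [
    CallSite("a.py", "main", "parse"),  # defined in the same file
    CallSite("a.py", "main", "load"),   # defined in another file
    CallSite("a.py", "main", "print"),  # not in the index
]
edges = resolve_calls(index, sites)
```

A per-file resolver would see only the first of these calls; the second edge exists only because the index spans the whole ingest.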

Fixes the consumer symptom where most symbols had 0 callers because cross-file calls weren't traced (EnterpriseDB/bento#779).

Full changelog: see CHANGELOG.md.

v0.5.0a9 — idempotent migration 013

04 Jun 13:46
bbe36ca

Pre-release

Patch release for the migration 013 idempotency fix (#67, #68).

Fixed

  • Migration 013_relationships_unique.sql re-runs cleanly instead of looping forever. ALTER TABLE ... ADD CONSTRAINT has no IF NOT EXISTS in Postgres, so re-applying 013 on a database where relationships_ns_edge_unique already existed — the constraint applied out-of-band, or its pgrg_applied_migrations row lost to a backup/restore taken between the schema change and the bookkeeping commit — raised 42710 duplicate_object. Because the migration runs in one transaction, that rolled it back and never recorded it, so every subsequent startup / pgrg migrate retried and failed again (the partial-migration-not-tracked loop). The constraint add is now wrapped in a pg_constraint existence guard, so the migration converges to the same end state on the first run and any repeat run. Steps 1-2 (dedup DML) were already idempotent. Adds test_migration_013_is_rerunnable.
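
A pg_constraint existence guard of the kind described above might look like the following. This is a sketch and not the contents of migration 013: the table name `relationships` is inferred from other notes on this page, the column list is the one given for the UNIQUE constraint in the v0.5.0a1 notes, and `add_constraint_if_missing` is a hypothetical helper.

```python
# Sketch only: guarded SQL that is safe to re-run on Postgres.
GUARDED_ADD_CONSTRAINT = """
DO $$
BEGIN
    IF NOT EXISTS (
        SELECT 1 FROM pg_constraint
        WHERE conname = 'relationships_ns_edge_unique'
          AND conrelid = 'relationships'::regclass
    ) THEN
        ALTER TABLE relationships
            ADD CONSTRAINT relationships_ns_edge_unique
            UNIQUE (namespace, src_id, dst_id, rel_type);
    END IF;
END
$$;
"""


def add_constraint_if_missing(cursor) -> None:
    """Run the guarded statement on any DB-API cursor connected to Postgres."""
    cursor.execute(GUARDED_ADD_CONSTRAINT)
```

Because the ALTER TABLE only runs when the catalog lookup finds nothing, a second application is a no-op rather than a 42710 error, so the surrounding transaction can commit and record the migration.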

Upgrade note: any database currently stuck in the 013 failure loop converges automatically on the next startup or pgrg migrate.

v0.5.0a8 — RRF fusion + chunkshop 0.8.2

02 Jun 11:40
7a0fed4

First PyPI publish since 0.5.0a1 — brings the published package current with the chunkshop-0.6 integration line (a2→a8).

Highlights

  • Optional RRF fusion mode (fusion="rrf", #57) — Reciprocal Rank Fusion for naive + hybrid retrieval, alongside the default linear weighted scoring. Scale-free across the cosine/BM25 legs; per-call override on query()/ask(). Default-off — linear path byte-identical.
  • chunkshop floor 0.7.0 → 0.8.2 (10-language codeparse, import-aware resolution, remote embedder, perf).
  • Carries the full a2→a7 line: A/B-gate harness/runner/writer, MCP agent UX, background extraction, online embedding migration, code-graph queries, lede_spacy deterministic extraction.
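
For readers unfamiliar with Reciprocal Rank Fusion, the standard formula scores each item as the sum of 1 / (k + rank) over the ranked lists it appears in. The sketch below implements that textbook form; `rrf_fuse` is a hypothetical helper, and k = 60 is the constant commonly used in the literature, not necessarily pg-raggraph's setting.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_leg = ["c1", "c2", "c3"]  # e.g. ordered by cosine similarity
bm25_leg = ["c3", "c1", "c4"]    # e.g. ordered by BM25 score
fused = rrf_fuse([vector_leg, bm25_leg])
# fused == ["c1", "c3", "c2", "c4"]
```

Only ranks enter the score, which is why the method is scale-free: cosine similarities and BM25 scores never need to be normalized against each other.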

Packaging

  • The ab-gate extra is now an empty marker (PyPI forbids git deps in published metadata, and llm-judge has no PyPI release yet). Install the A/B-gate judge runtime from git: pip install 'llm-judge @ git+https://github.com/yonk-labs/llm-judge.git@main'. No A/B-gate runtime code change.

Install

pip install pg-raggraph==0.5.0a8

pip install 'pg-raggraph[all]' pulls the optional surfaces (server, community, langchain, llamaindex, mcp, chunkshop, lede_spacy).

Supersedes the v0.5.0a7 tag, whose upload PyPI rejected over the ab-gate git dependency; a8 is identical content plus the packaging fix.

v0.5.0a1 — background extraction + multi-worker safety + observability

28 May 13:18

Headline: defer_extraction=True decouples LLM/lede extraction from ingest(), giving 59× faster time-to-queryable (~18 ms/doc vs ~1063 ms/doc on lede_spacy MHR). A new pgrg extract CLI (--once / --daemon) drains the queue out-of-band, with multi-worker safety guaranteed by SELECT … FOR UPDATE SKIP LOCKED (claim) and a new UNIQUE (namespace, src_id, dst_id, rel_type) constraint + ON CONFLICT on writes.

Backward-compatible: synchronous-extract default behavior is byte-for-byte unchanged.

What's new

Background extraction

Pass defer_extraction=True to ingest_records() and the producer returns in chunk + embed time only — the document is immediately queryable via vector + BM25, with entity/relationship extraction deferred. Drain on your schedule:

# cron-driven
pgrg --db $PGRG_DSN extract --namespace crm --once

# always-on daemon — SIGTERM-graceful, emits metrics per iteration
pgrg --db $PGRG_DSN extract --namespace crm --daemon --poll-interval 1.0

New module: pg_raggraph.backfill (claim_pending / extract_documents / release_processing). New migration 012_documents_graph_status.sql adds the queue-tracking columns with a partial index on pending rows.
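
The claim step follows the usual Postgres queue pattern, which can be sketched as below. This is not the body of claim_pending: the `documents` table, the `graph_status` and `namespace` columns, the status values, and the helper `claim_batch` are assumptions for illustration, and the `%s` placeholders assume a psycopg-style driver.

```python
# Sketch only: claim up to N pending rows without blocking on rows that
# another worker has already locked.
CLAIM_PENDING_SQL = """
UPDATE documents
SET graph_status = 'processing'
WHERE id IN (
    SELECT id FROM documents
    WHERE namespace = %s AND graph_status = 'pending'
    ORDER BY id
    LIMIT %s
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
"""


def claim_batch(cursor, namespace: str, batch_size: int) -> list:
    """Mark a batch as 'processing' and return the claimed document IDs."""
    cursor.execute(CLAIM_PENDING_SQL, (namespace, batch_size))
    return [row[0] for row in cursor.fetchall()]
```

SKIP LOCKED makes concurrent workers pass over each other's locked rows instead of waiting, so two daemons polling the same namespace claim disjoint batches.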

Measured impact (MHR slice, lede_spacy, no LLM):

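| n_docs | SYNC ingest (A) | DEFER ingest (B) | DRAIN (C) | B+C total | A/B speedup |
|--------|-----------------|------------------|-----------|-----------|-------------|
| 20     | 21.27 s         | 0.36 s           | 7.24 s    | 7.60 s    | 59.0×       |
| 40     | 26.27 s         | 0.44 s           | 15.12 s   | 15.56 s   | 59.8×       |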

Total async path (B+C) is also faster than synchronous (A) because the synchronous path holds per-doc transactions open across extraction.

Multi-worker safety invariants

  • Namespace-scoped reaper. release_processing is keyword-only on namespace= / doc_ids=. The CLI passes its --namespace, so a worker starting in namespace A no longer steals namespace B's in-flight 'processing' claims.
  • Edge-level idempotency. Migration 013_relationships_unique.sql de-duplicates any existing rows and adds UNIQUE (namespace, src_id, dst_id, rel_type). Both ingest paths use ON CONFLICT DO UPDATE SET weight = GREATEST(...) — re-extraction is safe.
  • merge_entities updated to pre-delete colliding rows before the src_id/dst_id rewrite.
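
The edge-level idempotency above can be demonstrated with SQLite, whose upsert syntax closely mirrors Postgres. In this sketch the scalar MAX(a, b) stands in for Postgres GREATEST, and the table has only the columns named in these notes plus `weight`; the real schema has more.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE relationships (
        namespace TEXT, src_id INTEGER, dst_id INTEGER, rel_type TEXT,
        weight REAL,
        UNIQUE (namespace, src_id, dst_id, rel_type)
    )
""")

UPSERT = """
    INSERT INTO relationships (namespace, src_id, dst_id, rel_type, weight)
    VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (namespace, src_id, dst_id, rel_type)
    DO UPDATE SET weight = MAX(weight, excluded.weight)
"""

# Extracting the same edge three times leaves one row with the highest weight.
for weight in (0.5, 0.3, 0.9):
    conn.execute(UPSERT, ("crm", 1, 2, "CALLS", weight))

rows = conn.execute("SELECT weight FROM relationships").fetchall()
# rows == [(0.9,)]
```

Re-running extraction on the same document therefore cannot create duplicate edges or lower an existing weight.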

Run as many pgrg extract --daemon workers as you want against the same namespace.

Observability

Three new metric events per iteration, through the existing _emit_metric channel:

  • pgrg.backfill.claim: namespace, batch_size, claimed, latency_ms
  • pgrg.backfill.extract: claimed, ready, failed, entities, relationships, latency_ms
  • pgrg.backfill.queue_depth: per-status doc counts

Query-time hint

QueryResult.metadata['graph_status_summary'] and GraphRAG.status(ns)['graph_status'] expose per-status doc counts so callers can see whether the graph is still backfilling.
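
A caller could turn those per-status counts into a simple "still backfilling?" check. The sketch below assumes the summary is a mapping from status name to document count and that 'pending' and 'processing' are the in-flight statuses; both the shape and the helper `is_backfilling` are assumptions, not documented API.

```python
from typing import Mapping


def is_backfilling(status_counts: Mapping[str, int]) -> bool:
    """True while any document is waiting for or undergoing extraction."""
    in_flight = (status_counts.get("pending", 0)
                 + status_counts.get("processing", 0))
    return in_flight > 0


# Hypothetical summary of the kind a query result might carry.
summary = {"ready": 38, "pending": 2, "processing": 0, "failed": 0}
still_running = is_backfilling(summary)
```

A client might use this to caveat graph-dependent answers while vector and BM25 results are already complete.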

Benchmarks

  • benchmarks/defer_extraction_bench.py — A/B/C harness (sync ingest vs deferred ingest vs drain).
  • benchmarks/ingest_perf.py extended with --provider {local,http}. Measured: TEI HTTP CPU beats local fastembed 2.1× on bge-small (66 vs 140 ms/chunk).

Documentation

  • docs/cookbook/background-extraction.md — full guide with three architectural patterns (sync / cron / always-on daemon), end-to-end FastAPI walkthrough, operator playbook.
  • README.md / docs/README.md / docs/user-guide.md / docs/operations-guide.md — cross-references and discovery paths.

Production-readiness audit

skill-output/prod-ready/ contains the 16-finding audit that drove this release's safety fixes. P0s landed; P1s (timed background reaper, retry counter + cap, extractor health probe, multi-worker concurrency test) are tracked for the next cycle.

Verdict: single-worker / single-namespace deployments are production-ready; multi-worker deployments are safe by construction.

Compatibility

  • Synchronous ingest behavior: byte-for-byte unchanged.
  • Schema migrations 012 + 013 run automatically on connect; both are forward-only and idempotent.
  • release_processing signature is now keyword-only — any caller passing doc_ids positionally needs doc_ids= (the only known internal caller in tests is updated).

Full changelog

CHANGELOG.md

v0.4.0a1 — chunkshop 0.6.1, online embedding migration, code-graph queries

26 May 16:17
22140fe

Alpha release. Additive and backward-compatible — existing query/ingest behavior is unchanged (retrieval path untouched); validated against a 2.5 GB real corpus with no accuracy change.

Highlights

Online embedding-model migration: pgrg migrate-embeddings prepare/backfill/build-index/status/cutover/finalize. Change the embedding model/dimension on a live database via an expand/contract column swap: no parallel DB, online resumable backfill, brief atomic cutover. A startup guard (EmbeddingDimMismatch) fails fast if embedding_dim no longer matches the live column. See docs/cookbook/changing-embedding-dimensions.md.
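
The fail-fast behavior of the startup guard amounts to a dimension comparison before any query runs. The sketch below uses a local stand-in exception and a hypothetical `check_embedding_dim` helper; how the library reads the live column's dimension is not shown here.

```python
class DimMismatchError(RuntimeError):
    """Local stand-in for the library's EmbeddingDimMismatch."""


def check_embedding_dim(configured_dim: int, live_column_dim: int) -> None:
    """Raise at startup if the configured dimension differs from the column."""
    if configured_dim != live_column_dim:
        raise DimMismatchError(
            f"embedding_dim={configured_dim} but the live column is "
            f"vector({live_column_dim}); migrate embeddings before starting"
        )


check_embedding_dim(384, 384)  # matching dimensions: no error
```

Failing at startup is cheaper than discovering the mismatch on the first insert or similarity query, where the error would surface from the database instead.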

Code-graph queries: pgrg code-impact <fqn> and GraphRAG.code_impact(...) report a code symbol's callers/callees (with evidence) over the existing relationships graph, using cycle-safe recursive CTEs, --depth for transitive impact, and tree or --json output.
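
A depth-bounded recursive CTE over a call graph can be tried out in SQLite, as sketched below. The `calls` table and `callers_of` helper are hypothetical, and termination on cycles here comes from the depth bound; the project's own queries may use a different guard, such as path tracking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO calls VALUES (?, ?)", [
    ("a", "b"), ("b", "c"), ("c", "a"),  # a -> b -> c -> a forms a cycle
    ("d", "c"),
])

TRANSITIVE_CALLERS = """
WITH RECURSIVE impact(symbol, depth) AS (
    SELECT src, 1 FROM calls WHERE dst = :target
    UNION
    SELECT c.src, i.depth + 1
    FROM calls AS c JOIN impact AS i ON c.dst = i.symbol
    WHERE i.depth < :max_depth
)
SELECT symbol, MIN(depth) FROM impact GROUP BY symbol ORDER BY 2, 1
"""


def callers_of(target: str, max_depth: int):
    """Return (caller, shortest depth) pairs, nearest callers first."""
    return conn.execute(
        TRANSITIVE_CALLERS, {"target": target, "max_depth": max_depth}
    ).fetchall()


direct = callers_of("c", 1)      # [("b", 1), ("d", 1)]
transitive = callers_of("c", 3)  # [("b", 1), ("d", 1), ("a", 2), ("c", 3)]
```

At depth 3 the cycle brings `c` back as its own transitive caller, and the query still terminates because no row is expanded past the depth limit.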

Chunkshop 0.6.1 code surfaces: dependency floor 0.5.0 → 0.6.1. chunkshop:code_aware / chunkshop:symbol_aware chunkers; pgrg ingest-chunkshop-table (+ --with-code-edges); code_summary → CODE_SYMBOL description enrichment; top_terms folded into the chunk full-text search_vector.

Hardening: if_oversize fallback in chunkshop delegation; clear error when importing code_edges from a schema without the table.

Notes

  • lede / lede-spacy stay at >=0.4.5 (current latest).
  • Re-running an embedding migration is data-safe; embedding_old is preserved until finalize.

Full details in CHANGELOG.md (0.4.0a1).