Releases: yonk-labs/pg-raggraph
v0.5.0a14 — chunkshop 0.9.1 floor: bound symbol_aware over-parse
Fixes the #79 root cause: large/generated code corpora no longer OOM the embedding step
chunkshop 0.9.0's path-less language detection parsed ~2× more files as code, so a single generated/minified file could explode into thousands of symbol_aware chunks and exhaust memory when embedding them.
chunkshop 0.9.1 (chunkshop#71/#72) adds two on-by-default guards — a content-detection fallback for generated/minified files and max_symbols_per_file=2000. This release raises the [chunkshop] extra floor to chunkshop>=0.9.1, so pg-raggraph's chunk_strategy="chunkshop:symbol_aware" path inherits the protection with no config change.
- Verified: a 3,000-function generated .ts file drops 3,000 → 74 chunks; normal code untouched.
- 467 unit + 17 integration tests green under chunkshop 0.9.1.
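The per-file cap can be pictured roughly as follows. This is a minimal sketch, not chunkshop's implementation: chunk_file, extract_symbols and chunk_by_size are hypothetical names, the fixed-size fallback is an assumption, and only the max_symbols_per_file=2000 default comes from the note above.

```python
from typing import Callable, List

MAX_SYMBOLS_PER_FILE = 2000  # default named in the release note


def chunk_by_size(text: str, size: int = 2000) -> List[str]:
    """Hypothetical fallback: fixed-size windows over the raw content."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_file(
    text: str,
    extract_symbols: Callable[[str], List[str]],
    max_symbols: int = MAX_SYMBOLS_PER_FILE,
) -> List[str]:
    """Return one chunk per symbol unless the file has too many symbols."""
    symbols = extract_symbols(text)
    if len(symbols) > max_symbols:
        # Probably generated or minified: bound the chunk count instead
        # of emitting thousands of tiny symbol chunks.
        return chunk_by_size(text)
    return symbols
```

With a guard of this shape, the number of chunks sent to the embedder is bounded by file size rather than by symbol count.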
Together with v0.5.0a13 (cross-file resolver spill-to-DB), both halves of #79 are addressed.
Full changelog: see CHANGELOG.md.
v0.5.0a13 — cross-file resolver: fix OOM (spill to DB)
Fix: cross-file code-graph resolver OOM on large repos
The v0.5.0a12 cross-file resolver (cross_file_code_graph=True) accumulated one entry per call site (each with a source snippet) in memory — O(call-sites), ~427 B/call, projecting to ~3.6 GB at 100K files → OOM on big codebases.
Fix: CorpusCodeGraph now keeps only the small symbol index in memory and spills call sites to an UNLOGGED code_calls_stage table (migration 014). Phase-2 resolution drains them in keyset batches (resolve_batch(), verified byte-identical to the one-shot resolver) plus one resolve_class_edges() pass. Peak resolver memory is now O(batch + symbol index) instead of O(corpus call sites); scratch rows are deleted after.
- Parity: batched == one-shot edges (unit test); 644 cross-file edges on self-ingest unchanged.
- Memory: peak in-memory call list measured 0 (was ~24.7K).
- No change to the resolved graph; default ingest path untouched.
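Keyset batching, as used for the drain described above, might look like the sketch below. It uses SQLite from the standard library as a stand-in for Postgres, and the column names (id, caller, callee) and the drain_in_batches helper are assumptions for illustration; only the code_calls_stage table name comes from the release note.

```python
import sqlite3
from typing import Iterator, List, Tuple

Row = Tuple[int, str, str]  # (id, caller, callee); hypothetical columns


def drain_in_batches(conn: sqlite3.Connection, batch_size: int) -> Iterator[List[Row]]:
    """Yield staged call sites in primary-key order, one bounded batch at a time."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, caller, callee FROM code_calls_stage "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # keyset cursor: resume after the last row seen


def demo() -> List[int]:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE code_calls_stage "
        "(id INTEGER PRIMARY KEY, caller TEXT, callee TEXT)"
    )
    conn.executemany(
        "INSERT INTO code_calls_stage (caller, callee) VALUES (?, ?)",
        [("mod.f%d" % i, "g%d" % i) for i in range(10)],
    )
    sizes = [len(batch) for batch in drain_in_batches(conn, batch_size=4)]
    conn.execute("DELETE FROM code_calls_stage")  # scratch rows removed afterwards
    return sizes
```

Only one batch is held in memory at a time, and because each query seeks past the last seen key instead of using OFFSET, the cost per batch stays flat as the table grows.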
Full changelog: see CHANGELOG.md.
v0.5.0a12 — cross-file code graph
Cross-file code graph (#76)
chunk_strategy="chunkshop:symbol_aware" now supports corpus-wide cross-file resolution for the code call graph.
- Opt in with ingest_records(..., cross_file_code_graph=True) or PGRG_CROSS_FILE_CODE_GRAPH=1. Default stays per-file/streaming (no regression).
- One CorpusCodeGraph resolver accumulates the whole ingest (O(symbols), not content; bounded-memory streaming is preserved), then materializes cross-file CALLS / INHERITS / IMPLEMENTS edges. Each carries resolved_intra_file in relationships.properties.
- Self-ingest of src/pg_raggraph: 415 → 1,030 CALLS edges, 636 cross-file, symbols-with-callers 246 → 351.
Fixes the consumer symptom where most symbols had 0 callers because cross-file calls weren't traced (EnterpriseDB/bento#779).
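The two-phase idea (collect a corpus-wide symbol index, then resolve call sites against it) can be illustrated with a toy resolver. This is a name-only sketch under stated assumptions, not the CorpusCodeGraph implementation: resolve_calls and its tuple shapes are hypothetical, and the real resolver is described elsewhere in these notes as import-aware.

```python
from typing import Dict, List, Tuple

Definition = Tuple[str, str]     # (file, symbol name)
CallSite = Tuple[str, str, str]  # (file, caller name, callee name)


def resolve_calls(defs: List[Definition], calls: List[CallSite]) -> List[dict]:
    """Phase 1 builds a corpus-wide symbol index; phase 2 resolves calls against it."""
    index: Dict[str, List[str]] = {}
    for file, name in defs:
        index.setdefault(name, []).append(file)

    edges = []
    for file, caller, callee in calls:
        targets = index.get(callee, [])
        if len(targets) != 1:
            continue  # unknown or ambiguous name: this sketch emits no edge
        edges.append({
            "src": "%s:%s" % (file, caller),
            "dst": "%s:%s" % (targets[0], callee),
            "rel_type": "CALLS",
            "resolved_intra_file": targets[0] == file,
        })
    return edges
```

A per-file resolver would only ever see the definitions from the caller's own file, which is why calls into other files previously produced no edge.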
Full changelog: see CHANGELOG.md.
v0.5.0a9 — idempotent migration 013
Patch release for the migration 013 idempotency fix (#67, #68).
Fixed
- Migration 013_relationships_unique.sql re-runs cleanly instead of looping forever. ALTER TABLE ... ADD CONSTRAINT has no IF NOT EXISTS in Postgres, so re-applying 013 on a database where relationships_ns_edge_unique already existed (the constraint applied out-of-band, or its pgrg_applied_migrations row lost to a backup/restore taken between the schema change and the bookkeeping commit) raised 42710 duplicate_object. Because the migration runs in one transaction, that rolled it back and never recorded it, so every subsequent startup / pgrg migrate retried and failed again (the partial-migration-not-tracked loop). The constraint add is now wrapped in a pg_constraint existence guard, so the migration converges to the same end state on the first run and any repeat run. Steps 1-2 (dedup DML) were already idempotent. Adds test_migration_013_is_rerunnable.
Upgrade note: any database currently stuck in the 013 failure loop converges automatically on the next startup or pgrg migrate.
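An existence guard of the kind described could look like the sketch below. The SQL is an illustration, not the contents of migration 013: the table name relationships and the apply_guarded_constraint helper are assumptions, while the constraint name and column list are taken from these notes.

```python
# Guarded DDL: only add the constraint if no constraint of that name exists.
GUARDED_ADD_CONSTRAINT = """
DO $$
BEGIN
    IF NOT EXISTS (
        SELECT 1 FROM pg_constraint
        WHERE conname = 'relationships_ns_edge_unique'
    ) THEN
        ALTER TABLE relationships
            ADD CONSTRAINT relationships_ns_edge_unique
            UNIQUE (namespace, src_id, dst_id, rel_type);
    END IF;
END
$$;
"""


def apply_guarded_constraint(cursor) -> None:
    """Run the guarded DDL on any DB-API cursor; repeat calls are no-ops in Postgres."""
    cursor.execute(GUARDED_ADD_CONSTRAINT)
```

Constraint names are not unique across tables in Postgres, so a stricter guard would also filter on conrelid for the target table.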
v0.5.0a8 — RRF fusion + chunkshop 0.8.2
First PyPI publish since 0.5.0a1 — brings the published package current with the chunkshop-0.6 integration line (a2→a8).
Highlights
- Optional RRF fusion mode (fusion="rrf", #57): Reciprocal Rank Fusion for naive + hybrid retrieval, alongside the default linear weighted scoring. Scale-free across the cosine/BM25 legs; per-call override on query() / ask(). Default-off; the linear path is byte-identical.
- chunkshop floor 0.7.0 → 0.8.2 (10-language codeparse, import-aware resolution, remote embedder, perf).
- Carries the full a2→a7 line: A/B-gate harness/runner/writer, MCP agent UX, background extraction, online embedding migration, code-graph queries, lede_spacy deterministic extraction.
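For readers unfamiliar with Reciprocal Rank Fusion, the standard formulation is short enough to sketch. This is the textbook algorithm, not pg-raggraph's code: rrf_fuse is a hypothetical name, and k=60 is the constant commonly used in the literature, not a value confirmed by these notes.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_leg = ["a", "b", "c"]  # e.g. ordered by cosine similarity
bm25_leg = ["b", "c", "a"]    # e.g. ordered by BM25 score
fused = rrf_fuse([vector_leg, bm25_leg])
```

Because only ranks enter the formula, the cosine and BM25 legs need no score normalization, which is what "scale-free" refers to above.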
Packaging
- The ab-gate extra is now an empty marker (PyPI forbids git deps in published metadata, and llm-judge has no PyPI release yet). Install the A/B-gate judge runtime from git: pip install 'llm-judge @ git+https://github.com/yonk-labs/llm-judge.git@main'. No A/B-gate runtime code change.
Install
pip install pg-raggraph==0.5.0a8
pip install 'pg-raggraph[all]' pulls the optional surfaces (server, community, langchain, llamaindex, mcp, chunkshop, lede_spacy).
Supersedes the v0.5.0a7 tag, whose upload PyPI rejected over the ab-gate git dependency; a8 is identical content plus the packaging fix.
v0.5.0a1 — background extraction + multi-worker safety + observability
Headline: defer_extraction=True decouples LLM/lede extraction from ingest() — 59× faster time-to-queryable (~18 ms/doc vs ~1063 ms/doc on lede_spacy MHR). A new pgrg extract CLI (--once / --daemon) drains the queue out-of-band, with multi-worker safety guaranteed by SELECT … FOR UPDATE SKIP LOCKED (claim) and a new UNIQUE (namespace, src_id, dst_id, rel_type) constraint + ON CONFLICT on writes.
Backward-compatible: synchronous-extract default behavior is byte-for-byte unchanged.
What's new
Background extraction
Pass defer_extraction=True to ingest_records() and the producer returns in chunk + embed time only — the document is immediately queryable via vector + BM25, with entity/relationship extraction deferred. Drain on your schedule:
# cron-driven
pgrg --db $PGRG_DSN extract --namespace crm --once
# always-on daemon — SIGTERM-graceful, emits metrics per iteration
pgrg --db $PGRG_DSN extract --namespace crm --daemon --poll-interval 1.0
New module: pg_raggraph.backfill (claim_pending / extract_documents / release_processing). New migration 012_documents_graph_status.sql adds the queue-tracking columns with a partial index on pending rows.
Measured impact (MHR slice, lede_spacy, no LLM):
| n_docs | SYNC ingest | DEFER ingest | DRAIN | B+C total | A/B speedup |
|---|---|---|---|---|---|
| 20 | 21.27 s | 0.36 s | 7.24 s | 7.60 s | 59.0× |
| 40 | 26.27 s | 0.44 s | 15.12 s | 15.56 s | 59.8× |
Total async path (B+C) is also faster than synchronous (A) because the synchronous path holds per-doc transactions open across extraction.
Multi-worker safety invariants
- Namespace-scoped reaper. release_processing is keyword-only on namespace= / doc_ids=. The CLI passes its --namespace, so a worker starting in namespace A no longer steals namespace B's in-flight 'processing' claims.
- Edge-level idempotency. Migration 013_relationships_unique.sql de-duplicates any existing rows and adds UNIQUE (namespace, src_id, dst_id, rel_type). Both ingest paths use ON CONFLICT DO UPDATE SET weight = GREATEST(...), so re-extraction is safe.
- merge_entities updated to pre-delete colliding rows before the src_id/dst_id rewrite.
Run as many pgrg extract --daemon workers as you want against the same namespace.
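The edge-level idempotency above rests on a unique key plus an upsert. The sketch below reproduces the pattern with SQLite from the standard library, where scalar max() stands in for Postgres GREATEST. The arguments to GREATEST are elided in the note, so keeping the larger of the stored and incoming weight is an assumption, as is the write_edge helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE relationships (
        namespace TEXT, src_id INTEGER, dst_id INTEGER,
        rel_type TEXT, weight REAL,
        UNIQUE (namespace, src_id, dst_id, rel_type)
    )
    """
)

UPSERT = """
INSERT INTO relationships (namespace, src_id, dst_id, rel_type, weight)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (namespace, src_id, dst_id, rel_type)
DO UPDATE SET weight = max(weight, excluded.weight)
"""


def write_edge(namespace: str, src_id: int, dst_id: int, rel_type: str, weight: float) -> None:
    """Insert an edge, or keep the larger weight if the edge already exists."""
    conn.execute(UPSERT, (namespace, src_id, dst_id, rel_type, weight))


# Three workers extracting the same edge leave exactly one row behind.
write_edge("crm", 1, 2, "CALLS", 0.4)
write_edge("crm", 1, 2, "CALLS", 0.9)
write_edge("crm", 1, 2, "CALLS", 0.2)
rows = conn.execute("SELECT weight FROM relationships").fetchall()
```

Since the write is a no-op or a monotonic update, the order in which concurrent workers commit does not change the final row.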
Observability
Three new metric events per iteration, through the existing _emit_metric channel:
- pgrg.backfill.claim: namespace, batch_size, claimed, latency_ms
- pgrg.backfill.extract: claimed, ready, failed, entities, relationships, latency_ms
- pgrg.backfill.queue_depth: per-status doc counts
Query-time hint
QueryResult.metadata['graph_status_summary'] and GraphRAG.status(ns)['graph_status'] expose per-status doc counts so callers can see whether the graph is still backfilling.
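A caller might consume the per-status counts along these lines. This is a hedged sketch: graph_is_backfilling is a hypothetical helper, and the status keys pending and processing are inferred from wording elsewhere in these notes rather than from a documented schema.

```python
from typing import Mapping


def graph_is_backfilling(status_counts: Mapping[str, int]) -> bool:
    """True if any document is still waiting for, or undergoing, extraction."""
    waiting = status_counts.get("pending", 0)      # assumed status key
    in_flight = status_counts.get("processing", 0)  # assumed status key
    return waiting + in_flight > 0
```

A client could use a check like this to decide whether to label graph-derived answers as partial while the queue drains.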
Benchmarks
- benchmarks/defer_extraction_bench.py: A/B/C harness (sync ingest vs deferred ingest vs drain).
- benchmarks/ingest_perf.py extended with --provider {local,http}. Measured: TEI HTTP CPU beats local fastembed 2.1× on bge-small (66 vs 140 ms/chunk).
Documentation
- docs/cookbook/background-extraction.md: full guide with three architectural patterns (sync / cron / always-on daemon), end-to-end FastAPI walkthrough, operator playbook.
- README.md / docs/README.md / docs/user-guide.md / docs/operations-guide.md: cross-references and discovery paths.
Production-readiness audit
skill-output/prod-ready/ contains the 16-finding audit that drove this release's safety fixes. P0s landed; P1s (timed background reaper, retry counter + cap, extractor health probe, multi-worker concurrency test) are tracked for the next cycle.
Verdict: single-worker / single-namespace deployments are production-ready; multi-worker deployments are safe by construction.
Compatibility
- Synchronous ingest behavior: byte-for-byte unchanged.
- Schema migrations 012 + 013 run automatically on connect; both are forward-only and idempotent.
- release_processing signature is now keyword-only; any caller passing doc_ids positionally needs doc_ids= (the only known internal caller in tests is updated).
Full changelog
v0.4.0a1 — chunkshop 0.6.1, online embedding migration, code-graph queries
Alpha release. Additive and backward-compatible — existing query/ingest behavior is unchanged (retrieval path untouched); validated against a 2.5 GB real corpus with no accuracy change.
Highlights
Online embedding-model migration — pgrg migrate-embeddings prepare/backfill/build-index/status/cutover/finalize. Change the embedding model/dimension on a live database via an expand/contract column swap: no parallel DB, online resumable backfill, brief atomic cutover. A startup guard (EmbeddingDimMismatch) fails fast if embedding_dim no longer matches the live column. See docs/cookbook/changing-embedding-dimensions.md.
Code-graph queries — pgrg code-impact <fqn> and GraphRAG.code_impact(...) report a code symbol's callers/callees (with evidence) over the existing relationships graph — cycle-safe recursive CTEs, --depth for transitive impact, tree or --json.
Chunkshop 0.6.1 code surfaces — dependency floor 0.5.0 → 0.6.1. chunkshop:code_aware / chunkshop:symbol_aware chunkers; pgrg ingest-chunkshop-table (+ --with-code-edges); code_summary → CODE_SYMBOL description enrichment; top_terms folded into the chunk full-text search_vector.
Hardening — if_oversize fallback in chunkshop delegation; clear error when importing code_edges from a schema without the table.
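The cycle-safe, depth-limited traversal behind the code-graph queries in the highlights above can be mirrored in plain Python. The release implements it with recursive CTEs in SQL; the breadth-first version below is an in-memory analogue for illustration only, and transitive_callers is a hypothetical name.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple


def transitive_callers(
    edges: Iterable[Tuple[str, str]], target: str, depth: int
) -> Dict[str, int]:
    """Map each caller of `target` to its hop distance, up to `depth` hops."""
    callers_of: Dict[str, List[str]] = {}
    for caller, callee in edges:
        callers_of.setdefault(callee, []).append(caller)

    found: Dict[str, int] = {}
    queue = deque([(target, 0)])
    visited = {target}  # guards against cycles such as a -> b -> c -> a
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue  # depth limit reached on this branch
        for caller in callers_of.get(node, []):
            if caller not in visited:
                visited.add(caller)
                found[caller] = dist + 1
                queue.append((caller, dist + 1))
    return found


# (caller, callee) pairs containing a cycle: a -> b -> c -> a
call_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
direct = transitive_callers(call_edges, "c", depth=1)
deep = transitive_callers(call_edges, "c", depth=3)
```

The visited set plays the role that a path or seen-set column plays in a recursive CTE: without it, the cycle would be walked until the depth limit on every branch.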
Notes
- lede / lede-spacy stay at >=0.4.5 (current latest).
- Re-running an embedding migration is data-safe; embedding_old is preserved until finalize.
Full details in CHANGELOG.md (0.4.0a1).