Releases · yonk-labs/pg-raggraph

09 Jun 17:14

TheYonk

v0.5.0a14

91bf165

v0.5.0a14 — chunkshop 0.9.1 floor: bound symbol_aware over-parse Latest

Fixes the #79 root cause: large/generated code corpora no longer OOM the embedding step

chunkshop 0.9.0's path-less language detection parsed ~2× more files as code, so a single generated/minified file could explode into thousands of symbol_aware chunks and exhaust memory when embedding them.

chunkshop 0.9.1 (chunkshop#71/#72) adds two on-by-default guards — a content-detection fallback for generated/minified files and max_symbols_per_file=2000. This release raises the [chunkshop] extra floor to chunkshop>=0.9.1, so pg-raggraph's chunk_strategy="chunkshop:symbol_aware" path inherits the protection with no config change.

Verified: a 3,000-function generated .ts drops 3,000 → 74 chunks; normal code untouched.
467 unit + 17 integration green under chunkshop 0.9.1.
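The per-file symbol cap can be pictured roughly as follows. This is a minimal sketch, not chunkshop's implementation: `chunk_file`, `split_by_size`, and the 4,000-character window are hypothetical names and values chosen for illustration. Only the `max_symbols_per_file=2000` default comes from the notes above.

```python
from typing import List

MAX_SYMBOLS_PER_FILE = 2000  # default named in the release notes


def split_by_size(text: str, window: int = 4000) -> List[str]:
    """Hypothetical fallback: fixed-size character windows."""
    return [text[i:i + window] for i in range(0, len(text), window)]


def chunk_file(text: str, symbols: List[str],
               max_symbols: int = MAX_SYMBOLS_PER_FILE) -> List[str]:
    """Return one chunk per symbol unless the file exceeds the cap."""
    if len(symbols) > max_symbols:
        # Too many symbols: probably generated/minified, so bound the output
        # by falling back to size-based chunking (an assumed strategy).
        return split_by_size(text)
    return list(symbols)


# A synthetic "generated" file with 3,000 tiny functions.
generated = ["function f%d(){}" % i for i in range(3000)]
chunks = chunk_file("\n".join(generated), generated)
```

The point of the cap is that the chunk count stops scaling with the number of symbols once a file looks generated, which is what keeps the embedding step bounded.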

Together with v0.5.0a13 (cross-file resolver spill-to-DB), both halves of #79 are addressed.

Full changelog: see CHANGELOG.md.

09 Jun 13:02

TheYonk

v0.5.0a13

12b29f9

v0.5.0a13 — cross-file resolver: fix OOM (spill to DB)

Fix: cross-file code-graph resolver OOM on large repos

The v0.5.0a12 cross-file resolver (cross_file_code_graph=True) accumulated one entry per call site (each with a source snippet) in memory — O(call-sites), ~427 B/call, projecting to ~3.6 GB at 100K files → OOM on big codebases.

Fix: CorpusCodeGraph now keeps only the small symbol index in memory and spills call sites to an UNLOGGED code_calls_stage table (migration 014). Phase-2 resolution drains them in keyset batches (resolve_batch(), verified byte-identical to the one-shot resolver) plus one resolve_class_edges() pass. Peak resolver memory is now O(batch + symbol index) instead of O(corpus call sites), and the scratch rows are deleted once resolution finishes.

Parity: batched == one-shot edges (unit test); 644 cross-file edges on self-ingest unchanged.
Memory: peak in-memory call list measured 0 (was ~24.7K).
No change to the resolved graph; default ingest path untouched.
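The spill-and-drain pattern described above can be sketched with an in-memory SQLite table standing in for the UNLOGGED Postgres staging table. The column names (`id`, `caller`, `callee`), the `drain_in_batches` helper, and the batch size are assumptions for illustration; the real code_calls_stage schema and the resolve_batch() signature are not shown in these notes.

```python
import sqlite3
from typing import Callable, List, Tuple

Row = Tuple[int, str, str]


def drain_in_batches(conn: sqlite3.Connection,
                     resolve: Callable[[List[Row]], None],
                     batch_size: int = 1000) -> int:
    """Drain staged call sites in keyset (id > last_id) batches."""
    last_id = 0
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id, caller, callee FROM code_calls_stage "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        resolve(rows)           # only one batch is held in memory at a time
        last_id = rows[-1][0]   # keyset cursor: resume after the last id seen
        total += len(rows)
    conn.execute("DELETE FROM code_calls_stage")  # scratch rows removed at the end
    conn.commit()
    return total


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE code_calls_stage ("
    "id INTEGER PRIMARY KEY, caller TEXT, callee TEXT)"
)
conn.executemany(
    "INSERT INTO code_calls_stage (caller, callee) VALUES (?, ?)",
    [(f"mod.f{i}", f"mod.g{i % 7}") for i in range(2500)],
)

batch_sizes: List[int] = []
drained = drain_in_batches(conn, lambda rows: batch_sizes.append(len(rows)))
```

Keyset pagination (filtering on `id > last_id` rather than using OFFSET) keeps each batch query cheap regardless of how far the drain has progressed.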

Full changelog: see CHANGELOG.md.

09 Jun 02:19

TheYonk

v0.5.0a12

6c679b5

v0.5.0a12 — cross-file code graph

Cross-file code graph (#76)

chunk_strategy="chunkshop:symbol_aware" now supports corpus-wide cross-file resolution for the code call graph.

Opt in with ingest_records(..., cross_file_code_graph=True) or PGRG_CROSS_FILE_CODE_GRAPH=1. Default stays per-file/streaming (no regression).
One CorpusCodeGraph resolver accumulates the whole ingest (O(symbols), not content — bounded-memory streaming preserved), then materializes cross-file CALLS/INHERITS/IMPLEMENTS edges. Each carries resolved_intra_file in relationships.properties.
Self-ingest of src/pg_raggraph: 415 → 1,030 CALLS edges, 636 cross-file, symbols-with-callers 246 → 351.
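The two-phase idea (index every symbol first, then resolve call sites against the corpus-wide index) can be sketched as below. This is an illustrative simplification and not the CorpusCodeGraph implementation: `resolve_calls` is a hypothetical helper, and matching callees by bare name is an assumption made to keep the example short. The `resolved_intra_file` property and the CALLS edge type are taken from the notes above.

```python
from typing import Dict, List, Tuple

# (file, symbol name) definitions and (file, caller, callee name) call sites.
Definition = Tuple[str, str]
CallSite = Tuple[str, str, str]


def resolve_calls(defs: List[Definition],
                  calls: List[CallSite]) -> List[dict]:
    """Phase 1: index symbols corpus-wide. Phase 2: resolve each call."""
    index: Dict[str, str] = {name: path for path, name in defs}
    edges = []
    for path, caller, callee in calls:
        target_file = index.get(callee)
        if target_file is None:
            continue  # unresolved (external or dynamic call)
        edges.append({
            "src": caller,
            "dst": callee,
            "rel_type": "CALLS",
            # True when caller and callee live in the same file.
            "properties": {"resolved_intra_file": target_file == path},
        })
    return edges


defs = [("a.py", "load"), ("a.py", "main"), ("b.py", "parse")]
calls = [("a.py", "main", "load"), ("a.py", "main", "parse"),
         ("b.py", "parse", "print")]
edges = resolve_calls(defs, calls)
cross_file = [e for e in edges if not e["properties"]["resolved_intra_file"]]
```

A per-file resolver would only find the `main` to `load` edge here; the corpus-wide index is what surfaces `main` to `parse`.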

Fixes the consumer symptom where most symbols had 0 callers because cross-file calls weren't traced (EnterpriseDB/bento#779).

Full changelog: see CHANGELOG.md.

04 Jun 13:46

TheYonk

v0.5.0a9

bbe36ca

v0.5.0a9 — idempotent migration 013 Pre-release

Patch release for the migration 013 idempotency fix (#67, #68).

Fixed

Migration 013_relationships_unique.sql now re-runs cleanly instead of looping forever. ALTER TABLE ... ADD CONSTRAINT has no IF NOT EXISTS in Postgres, so re-applying 013 on a database where relationships_ns_edge_unique already existed raised 42710 duplicate_object. That state arises when the constraint was applied out-of-band, or when its pgrg_applied_migrations row was lost to a backup/restore taken between the schema change and the bookkeeping commit. Because the migration runs in one transaction, the error rolled it back and the migration was never recorded, so every subsequent startup or pgrg migrate retried and failed again (the partial-migration-not-tracked loop). The constraint add is now wrapped in a pg_constraint existence guard, so the migration converges to the same end state on the first run and on any repeat run. Steps 1-2 (dedup DML) were already idempotent. Adds test_migration_013_is_rerunnable.

Upgrade note: any database currently stuck in the 013 failure loop converges automatically on the next startup or pgrg migrate.
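The existence-guard pattern can be demonstrated with a runnable SQLite analog. SQLite's ALTER TABLE ... ADD COLUMN likewise has no IF NOT EXISTS, so the same "check the catalog, then apply" shape applies. This is a sketch under stated assumptions: the `column_exists` and `migrate` helpers, the two-column table, and the added `weight` column are hypothetical stand-ins, and the actual migration guards a constraint via pg_constraint rather than a column.

```python
import sqlite3


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """Catalog lookup; the Postgres fix checks pg_constraint instead."""
    return any(row[1] == column
               for row in conn.execute(f"PRAGMA table_info({table})"))


def migrate(conn: sqlite3.Connection) -> None:
    """Converges to the same schema whether run once or many times."""
    if not column_exists(conn, "relationships", "weight"):
        conn.execute("ALTER TABLE relationships ADD COLUMN weight REAL")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relationships (src_id INTEGER, dst_id INTEGER)")
migrate(conn)
migrate(conn)  # repeat run is a no-op instead of an error
columns = [row[1] for row in conn.execute("PRAGMA table_info(relationships)")]

# Without the guard, the second application fails, which is the loop trigger.
try:
    conn.execute("ALTER TABLE relationships ADD COLUMN weight REAL")
    unguarded_failed = False
except sqlite3.OperationalError:
    unguarded_failed = True
```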

02 Jun 11:40

TheYonk

v0.5.0a8

7a0fed4

v0.5.0a8 — RRF fusion + chunkshop 0.8.2 Pre-release

First PyPI publish since 0.5.0a1 — brings the published package current with the chunkshop-0.6 integration line (a2→a8).

Highlights

Optional RRF fusion mode (fusion="rrf", #57) — Reciprocal Rank Fusion for naive + hybrid retrieval, alongside the default linear weighted scoring. Scale-free across the cosine/BM25 legs; per-call override on query()/ask(). Default-off — linear path byte-identical.
chunkshop floor 0.7.0 → 0.8.2 (10-language codeparse, import-aware resolution, remote embedder, perf).
Carries the full a2→a7 line: A/B-gate harness/runner/writer, MCP agent UX, background extraction, online embedding migration, code-graph queries, lede_spacy deterministic extraction.
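For readers unfamiliar with Reciprocal Rank Fusion, the standard formula scores each document as the sum of 1 / (k + rank) over the ranked lists it appears in. The sketch below is a generic illustration, not pg-raggraph's code: `rrf_fuse` is a hypothetical helper, and k = 60 is the commonly used constant, assumed here because the notes do not state which value the library uses.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of doc ids; only ranks matter, not raw scores."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_leg = ["d1", "d2", "d3"]   # e.g. cosine-similarity order
bm25_leg = ["d3", "d1", "d4"]     # e.g. BM25 order
fused = rrf_fuse([vector_leg, bm25_leg])
# fused == ["d1", "d3", "d2", "d4"]
```

Because the inputs are ranks, the cosine and BM25 legs never need to be normalized onto a shared scale, which is what "scale-free" refers to.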

Packaging

The ab-gate extra is now an empty marker (PyPI forbids git deps in published metadata, and llm-judge has no PyPI release yet). Install the A/B-gate judge runtime from git: pip install 'llm-judge @ git+https://github.com/yonk-labs/llm-judge.git@main'. No A/B-gate runtime code change.

Install

pip install pg-raggraph==0.5.0a8

pip install 'pg-raggraph[all]' pulls the optional surfaces (server, community, langchain, llamaindex, mcp, chunkshop, lede_spacy).

Supersedes the v0.5.0a7 tag, whose upload PyPI rejected over the ab-gate git dependency; a8 is identical content plus the packaging fix.

28 May 13:18

TheYonk

v0.5.0a1

32117ff

v0.5.0a1 — background extraction + multi-worker safety + observability Pre-release

Headline: defer_extraction=True decouples LLM/lede extraction from ingest() — 59× faster time-to-queryable (~18 ms/doc vs ~1063 ms/doc on lede_spacy MHR). A new pgrg extract CLI (--once / --daemon) drains the queue out-of-band, with multi-worker safety guaranteed by SELECT … FOR UPDATE SKIP LOCKED (claim) and a new UNIQUE (namespace, src_id, dst_id, rel_type) constraint + ON CONFLICT on writes.

Backward-compatible: synchronous-extract default behavior is byte-for-byte unchanged.

What's new

Background extraction

Pass defer_extraction=True to ingest_records() and the producer returns in chunk + embed time only — the document is immediately queryable via vector + BM25, with entity/relationship extraction deferred. Drain on your schedule:

# cron-driven
pgrg --db "$PGRG_DSN" extract --namespace crm --once

# always-on daemon: SIGTERM-graceful, emits metrics per iteration
pgrg --db "$PGRG_DSN" extract --namespace crm --daemon --poll-interval 1.0

New module: pg_raggraph.backfill (claim_pending / extract_documents / release_processing). New migration 012_documents_graph_status.sql adds the queue-tracking columns with a partial index on pending rows.
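The claim-then-extract lifecycle can be modeled in memory as follows. This is a toy model under stated assumptions, not the pg_raggraph.backfill API: `claim`, `drain_once`, and `fake_extract` are hypothetical names, the status values pending / processing / ready / failed are inferred from these notes, and a `threading.Lock` stands in for the row locks that the database claim query would take.

```python
import threading
from typing import Callable, Dict, List

status: Dict[str, str] = {f"doc{i}": "pending" for i in range(5)}
_lock = threading.Lock()  # stands in for row locks taken by the claim query


def claim(batch_size: int) -> List[str]:
    """Atomically flip up to batch_size pending docs to 'processing'."""
    with _lock:
        picked = [d for d, s in status.items() if s == "pending"][:batch_size]
        for doc_id in picked:
            status[doc_id] = "processing"
        return picked


def drain_once(extract: Callable[[str], None], batch_size: int = 2) -> int:
    """One pass over the queue: claim batches until nothing is pending."""
    done = 0
    while True:
        batch = claim(batch_size)
        if not batch:
            return done
        for doc_id in batch:
            try:
                extract(doc_id)
                status[doc_id] = "ready"
            except Exception:
                status[doc_id] = "failed"  # recorded, not retried in this sketch
            done += 1


def fake_extract(doc_id: str) -> None:
    """Pretend extractor that fails on one document."""
    if doc_id == "doc3":
        raise ValueError("extractor error")


processed = drain_once(fake_extract)
```

Documents stay queryable through vector and BM25 search in every one of these states; only the graph edges arrive later.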

Measured impact (MHR slice, lede_spacy, no LLM):

n_docs	SYNC ingest (A)	DEFER ingest (B)	DRAIN (C)	B+C total	A/B speedup
20	21.27 s	0.36 s	7.24 s	7.60 s	59.0×
40	26.27 s	0.44 s	15.12 s	15.56 s	59.8×

Total async path (B+C) is also faster than synchronous (A) because the synchronous path holds per-doc transactions open across extraction.

Multi-worker safety invariants

Namespace-scoped reaper. release_processing is keyword-only on namespace= / doc_ids=. The CLI passes its --namespace, so a worker starting in namespace A no longer steals namespace B's in-flight 'processing' claims.
Edge-level idempotency. Migration 013_relationships_unique.sql de-duplicates any existing rows and adds UNIQUE (namespace, src_id, dst_id, rel_type). Both ingest paths use ON CONFLICT DO UPDATE SET weight = GREATEST(...) — re-extraction is safe.
merge_entities updated to pre-delete colliding rows before the src_id/dst_id rewrite.

Run as many pgrg extract --daemon workers as you want against the same namespace.
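Edge-level idempotency can be illustrated with SQLite's upsert syntax, which is close to Postgres's. This is a sketch, not the library's SQL: the table is reduced to the columns named above, `upsert_edge` is a hypothetical helper, and SQLite's two-argument `MAX()` stands in for Postgres `GREATEST()`. Comparing the stored weight with the incoming one is an assumption, since the notes elide the GREATEST arguments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relationships ("
    " namespace TEXT, src_id INTEGER, dst_id INTEGER, rel_type TEXT,"
    " weight REAL,"
    " UNIQUE (namespace, src_id, dst_id, rel_type))"
)


def upsert_edge(ns: str, src: int, dst: int, rel: str, weight: float) -> None:
    """Insert an edge, or keep the larger weight if it already exists."""
    conn.execute(
        "INSERT INTO relationships (namespace, src_id, dst_id, rel_type, weight)"
        " VALUES (?, ?, ?, ?, ?)"
        " ON CONFLICT (namespace, src_id, dst_id, rel_type)"
        " DO UPDATE SET weight = MAX(weight, excluded.weight)",
        (ns, src, dst, rel, weight),
    )


upsert_edge("crm", 1, 2, "CALLS", 0.4)
upsert_edge("crm", 1, 2, "CALLS", 0.9)  # re-extraction: same edge, higher weight
upsert_edge("crm", 1, 2, "CALLS", 0.2)  # lower weight does not overwrite
rows = conn.execute("SELECT weight FROM relationships").fetchall()
```

With the unique constraint in place, two workers writing the same edge converge on a single row regardless of write order.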

Observability

Three new metric events per iteration, through the existing _emit_metric channel:

pgrg.backfill.claim — namespace, batch_size, claimed, latency_ms
pgrg.backfill.extract — claimed, ready, failed, entities, relationships, latency_ms
pgrg.backfill.queue_depth — per-status doc counts

Query-time hint

QueryResult.metadata['graph_status_summary'] and GraphRAG.status(ns)['graph_status'] expose per-status doc counts so callers can see whether the graph is still backfilling.
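A caller might use the per-status counts along these lines. The dictionary shape and the status names (pending, processing, ready, failed) are assumptions inferred from the rest of these notes, and `graph_is_backfilling` is a hypothetical helper rather than part of the library.

```python
from typing import Dict


def graph_is_backfilling(summary: Dict[str, int]) -> bool:
    """True while any document is still waiting on graph extraction."""
    return summary.get("pending", 0) + summary.get("processing", 0) > 0


# Example summary, shaped as an assumption about the metadata payload.
summary = {"pending": 12, "processing": 3, "ready": 85, "failed": 0}
still_backfilling = graph_is_backfilling(summary)
```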

Benchmarks

benchmarks/defer_extraction_bench.py — A/B/C harness (sync ingest vs deferred ingest vs drain).
benchmarks/ingest_perf.py extended with --provider {local,http}. Measured: TEI HTTP CPU beats local fastembed 2.1× on bge-small (66 vs 140 ms/chunk).

Documentation

docs/cookbook/background-extraction.md — full guide with three architectural patterns (sync / cron / always-on daemon), end-to-end FastAPI walkthrough, operator playbook.
README.md / docs/README.md / docs/user-guide.md / docs/operations-guide.md — cross-references and discovery paths.

Production-readiness audit

skill-output/prod-ready/ contains the 16-finding audit that drove this release's safety fixes. P0s landed; P1s (timed background reaper, retry counter + cap, extractor health probe, multi-worker concurrency test) are tracked for the next cycle.

Verdict: single-worker / single-namespace deployments are production-ready; multi-worker deployments are safe by construction.

Compatibility

Synchronous ingest behavior: byte-for-byte unchanged.
Schema migrations 012 + 013 run automatically on connect; both are forward-only and idempotent.
release_processing signature is now keyword-only — any caller passing doc_ids positionally needs doc_ids= (the only known internal caller in tests is updated).
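The effect of a keyword-only signature is easy to see with a stand-in function. `release_processing_demo` below is hypothetical and only mirrors the `namespace=` / `doc_ids=` parameter names mentioned above; the real function's defaults and return value are not documented in these notes.

```python
from typing import Optional, Sequence


def release_processing_demo(*, namespace: Optional[str] = None,
                            doc_ids: Optional[Sequence[str]] = None) -> dict:
    """Stand-in with the same keyword-only shape (the bare * enforces it)."""
    return {"namespace": namespace, "doc_ids": list(doc_ids or [])}


ok = release_processing_demo(namespace="crm", doc_ids=["d1"])

try:
    release_processing_demo(["d1"])  # positional call, as older code might do
    positional_error = None
except TypeError as exc:
    positional_error = str(exc)
```

A positional call fails immediately with a TypeError, so affected callers surface at the first invocation instead of silently passing the wrong argument.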

Full changelog

CHANGELOG.md

26 May 16:17

TheYonk

v0.4.0a1

22140fe

v0.4.0a1 — chunkshop 0.6.1, online embedding migration, code-graph queries Pre-release

Alpha release. Additive and backward-compatible — existing query/ingest behavior is unchanged (retrieval path untouched); validated against a 2.5 GB real corpus with no accuracy change.

Highlights

Online embedding-model migration — pgrg migrate-embeddings prepare/backfill/build-index/status/cutover/finalize. Change the embedding model/dimension on a live database via an expand/contract column swap: no parallel DB, online resumable backfill, brief atomic cutover. A startup guard (EmbeddingDimMismatch) fails fast if embedding_dim no longer matches the live column. See docs/cookbook/changing-embedding-dimensions.md.

Code-graph queries — pgrg code-impact <fqn> and GraphRAG.code_impact(...) report a code symbol's callers/callees (with evidence) over the existing relationships graph — cycle-safe recursive CTEs, --depth for transitive impact, tree or --json.

Chunkshop 0.6.1 code surfaces — dependency floor 0.5.0 → 0.6.1. chunkshop:code_aware / chunkshop:symbol_aware chunkers; pgrg ingest-chunkshop-table (+ --with-code-edges); code_summary → CODE_SYMBOL description enrichment; top_terms folded into the chunk full-text search_vector.

Hardening — if_oversize fallback in chunkshop delegation; clear error when importing code_edges from a schema without the table.
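The transitive-callers query behind the code-graph highlight above can be approximated with a recursive CTE. The sketch runs on SQLite so that it is self-contained; the `calls` table, its columns, and the `callers` helper are hypothetical, and termination on cycles is guaranteed here by the depth cap together with UNION de-duplication, which may differ from the strategy pg-raggraph uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [("api.handler", "svc.load"), ("svc.load", "db.fetch"),
     ("db.fetch", "svc.load"),   # cycle: svc.load <-> db.fetch
     ("cli.main", "api.handler")],
)

CALLERS_SQL = """
WITH RECURSIVE impact(fqn, depth) AS (
    SELECT src, 1 FROM calls WHERE dst = :target
    UNION
    SELECT c.src, i.depth + 1
    FROM calls AS c JOIN impact AS i ON c.dst = i.fqn
    WHERE i.depth < :max_depth
)
SELECT fqn, MIN(depth) FROM impact
WHERE fqn <> :target
GROUP BY fqn ORDER BY MIN(depth), fqn
"""


def callers(target: str, max_depth: int = 3):
    """Direct and transitive callers of target, with shortest distance."""
    return conn.execute(
        CALLERS_SQL, {"target": target, "max_depth": max_depth}
    ).fetchall()


impact = callers("db.fetch")
# impact == [("svc.load", 1), ("api.handler", 2), ("cli.main", 3)]
```

Raising `max_depth` widens the blast radius reported, which is the role the `--depth` flag plays in the description above.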

Notes

lede / lede-spacy stay at >=0.4.5 (current latest).
Re-running an embedding migration is data-safe; embedding_old is preserved until finalize.
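The expand/contract flow can be modeled in memory to show why re-running is data-safe. This is a toy model, not the pgrg migrate-embeddings implementation: the `EmbeddingMigrationSketch` class and the `embedding_new` staging column are hypothetical, while `embedding_old` and the backfill / cutover / finalize phase names come from the notes above.

```python
from typing import Callable, Dict, List

Vector = List[float]


class EmbeddingMigrationSketch:
    """In-memory stand-in for one table going through expand/contract."""

    def __init__(self, rows: Dict[int, Vector]):
        # Expand: each row gains staging and old columns next to the live one.
        self.rows = {
            rid: {"embedding": vec, "embedding_new": None, "embedding_old": None}
            for rid, vec in rows.items()
        }

    def backfill(self, embed: Callable[[int], Vector], batch: int = 2) -> int:
        """Resumable: only rows still missing the new vector are touched."""
        todo = [r for r, c in self.rows.items() if c["embedding_new"] is None]
        for rid in todo[:batch]:
            self.rows[rid]["embedding_new"] = embed(rid)
        return len(todo[:batch])

    def cutover(self) -> None:
        """Swap columns once every row has been backfilled."""
        if any(c["embedding_new"] is None for c in self.rows.values()):
            raise RuntimeError("backfill incomplete")
        for c in self.rows.values():
            c["embedding_old"], c["embedding"] = c["embedding"], c["embedding_new"]
            c["embedding_new"] = None

    def finalize(self) -> None:
        """Contract: drop the preserved old vectors."""
        for c in self.rows.values():
            c["embedding_old"] = None


table = EmbeddingMigrationSketch({1: [0.1] * 4, 2: [0.2] * 4, 3: [0.3] * 4})
while table.backfill(lambda rid: [float(rid)] * 8):
    pass  # each call processes one batch; safe to stop and resume
table.cutover()
dims_after_cutover = {len(c["embedding"]) for c in table.rows.values()}
old_kept = all(c["embedding_old"] is not None for c in table.rows.values())
table.finalize()
```

Until `finalize()` runs, the old vectors are still present, so a rollback only has to swap the columns back.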

Full details in CHANGELOG.md (0.4.0a1).
