v0.5.0a1 — background extraction + multi-worker safety + observability

Pre-release

@TheYonk TheYonk released this 28 May 13:18
· 140 commits to main since this release

Headline: defer_extraction=True decouples LLM/lede extraction from ingest(), giving 59× faster time-to-queryable (~18 ms/doc vs ~1063 ms/doc on lede_spacy MHR). A new pgrg extract CLI (--once / --daemon) drains the queue out-of-band. Multi-worker safety is guaranteed by SELECT … FOR UPDATE SKIP LOCKED on claim, plus a new UNIQUE (namespace, src_id, dst_id, rel_type) constraint with ON CONFLICT on writes.

Backward-compatible: synchronous-extract default behavior is byte-for-byte unchanged.

What's new

Background extraction

Pass defer_extraction=True to ingest_records() and the producer returns in chunk + embed time only — the document is immediately queryable via vector + BM25, with entity/relationship extraction deferred. Drain on your schedule:

# cron-driven
pgrg --db $PGRG_DSN extract --namespace crm --once

# always-on daemon — SIGTERM-graceful, emits metrics per iteration
pgrg --db $PGRG_DSN extract --namespace crm --daemon --poll-interval 1.0
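
As a rough illustration of what "SIGTERM-graceful" polling can look like, the sketch below shows a generic drain loop. It is not pgrg's implementation: `run_daemon`, `install_sigterm_handler`, and the `drain_once` callback are hypothetical names, and the only assumption carried over from the CLI is a poll interval in seconds.

```python
import signal
import threading
from typing import Callable


def install_sigterm_handler(stop: threading.Event) -> None:
    """Ask the loop to stop on SIGTERM instead of killing it mid-batch."""
    signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())


def run_daemon(
    drain_once: Callable[[], int],  # hypothetical: processes one batch, returns doc count
    stop: threading.Event,
    poll_interval: float = 1.0,
) -> int:
    """Drain batches until `stop` is set; return the total documents processed."""
    total = 0
    while not stop.is_set():
        processed = drain_once()
        total += processed
        if processed == 0:
            # Queue is empty: sleep, but wake immediately if a stop is requested.
            stop.wait(poll_interval)
    return total
```

Because the stop flag is checked only between batches, an in-flight batch always finishes before the process exits.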

New module: pg_raggraph.backfill (claim_pending / extract_documents / release_processing). New migration 012_documents_graph_status.sql adds the queue-tracking columns with a partial index on pending rows.
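
For readers unfamiliar with the claim pattern, here is a minimal sketch of a SKIP LOCKED claim issued through a DB-API cursor. The table name `documents`, the columns `id`, `namespace`, and `graph_status`, and the function `claim_pending_sketch` are assumptions made for illustration; the real schema and the real `claim_pending` signature may differ.

```python
from typing import Any, List

# Assumed schema: documents(id, namespace, graph_status). Illustrative only.
CLAIM_SQL = """
UPDATE documents
   SET graph_status = 'processing'
 WHERE id IN (
        SELECT id
          FROM documents
         WHERE namespace = %s
           AND graph_status = 'pending'
         ORDER BY id
         LIMIT %s
           FOR UPDATE SKIP LOCKED
       )
RETURNING id
"""


def claim_pending_sketch(cur: Any, namespace: str, batch_size: int) -> List[Any]:
    """Atomically flip up to `batch_size` pending rows to 'processing'.

    Rows already locked by another worker are skipped rather than waited on,
    so concurrent workers receive disjoint batches.
    """
    cur.execute(CLAIM_SQL, (namespace, batch_size))
    return [row[0] for row in cur.fetchall()]
```

The inner SELECT takes the row locks and the outer UPDATE changes status in the same statement, so a row cannot be handed to two workers.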

Measured impact (MHR slice, lede_spacy, no LLM):

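| n_docs | SYNC ingest (A) | DEFER ingest (B) | DRAIN (C) | B+C total | A/B speedup |
|--------|-----------------|------------------|-----------|-----------|-------------|
| 20     | 21.27 s         | 0.36 s           | 7.24 s    | 7.60 s    | 59.0×       |
| 40     | 26.27 s         | 0.44 s           | 15.12 s   | 15.56 s   | 59.8×       |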

Total async path (B+C) is also faster than synchronous (A) because the synchronous path holds per-doc transactions open across extraction.
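
The derived figures can be recomputed from the published timings. The sketch below does that arithmetic; `summarize` is a hypothetical helper name. Because the table's timings are rounded to two decimals, the recomputed speedups can differ from the published ones in the last digit.

```python
# (n_docs, sync_s, defer_s, drain_s) as published in the table above.
ROWS = [
    (20, 21.27, 0.36, 7.24),
    (40, 26.27, 0.44, 15.12),
]


def summarize(n_docs: int, sync_s: float, defer_s: float, drain_s: float) -> dict:
    """Derive per-document latency, async total, and A/B speedup for one row."""
    return {
        "sync_ms_per_doc": 1000 * sync_s / n_docs,
        "defer_ms_per_doc": 1000 * defer_s / n_docs,
        "async_total_s": defer_s + drain_s,  # B + C
        "speedup": sync_s / defer_s,         # A / B
    }


for row in ROWS:
    print(row[0], summarize(*row))
```

The 20-document row yields the headline ~1063 ms/doc and ~18 ms/doc figures.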

Multi-worker safety invariants

  • Namespace-scoped reaper. release_processing is keyword-only on namespace= / doc_ids=. The CLI passes its --namespace, so a worker starting in namespace A no longer steals namespace B's in-flight 'processing' claims.
  • Edge-level idempotency. Migration 013_relationships_unique.sql de-duplicates any existing rows and adds UNIQUE (namespace, src_id, dst_id, rel_type). Both ingest paths use ON CONFLICT DO UPDATE SET weight = GREATEST(...) — re-extraction is safe.
  • merge_entities updated to pre-delete colliding rows before the src_id/dst_id rewrite.

Run as many pgrg extract --daemon workers as you want against the same namespace.
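
The edge-level idempotency can be demonstrated in miniature. The sketch below uses SQLite from the Python standard library as a stand-in for PostgreSQL, so the two-argument `max()` replaces `GREATEST`. The table layout, column types, and the helper names `make_db` and `upsert_edge` are assumptions for illustration, not the project's schema.

```python
import sqlite3

# SQLite stand-in: max(a, b) plays the role of PostgreSQL's GREATEST(a, b).
UPSERT_SQL = """
INSERT INTO relationships (namespace, src_id, dst_id, rel_type, weight)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (namespace, src_id, dst_id, rel_type)
DO UPDATE SET weight = max(weight, excluded.weight)
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory table carrying the same four-column UNIQUE key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE relationships (
            namespace TEXT    NOT NULL,
            src_id    INTEGER NOT NULL,
            dst_id    INTEGER NOT NULL,
            rel_type  TEXT    NOT NULL,
            weight    REAL    NOT NULL,
            UNIQUE (namespace, src_id, dst_id, rel_type)
        )
        """
    )
    return conn


def upsert_edge(conn, namespace, src_id, dst_id, rel_type, weight) -> None:
    """Insert an edge, or keep the larger weight if the edge already exists."""
    conn.execute(UPSERT_SQL, (namespace, src_id, dst_id, rel_type, weight))
```

Writing the same edge repeatedly, from one worker or several, leaves a single row holding the largest weight seen, which is why re-extraction does not duplicate edges.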

Observability

Three new metric events per iteration, through the existing _emit_metric channel:

  • pgrg.backfill.claim: namespace, batch_size, claimed, latency_ms
  • pgrg.backfill.extract: claimed, ready, failed, entities, relationships, latency_ms
  • pgrg.backfill.queue_depth: per-status doc counts

Query-time hint

QueryResult.metadata['graph_status_summary'] and GraphRAG.status(ns)['graph_status'] expose per-status doc counts so callers can see whether the graph is still backfilling.
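
A caller might interpret these counts along the following lines. This is a sketch that assumes the summary is a plain mapping from status name to document count, with the status names 'pending', 'processing', 'ready', and 'failed' inferred from these notes; the real payload shape may differ, and both helper names are hypothetical.

```python
from typing import Mapping


def is_backfilling(summary: Mapping[str, int]) -> bool:
    """True while any document is still waiting for, or undergoing, extraction."""
    return summary.get("pending", 0) + summary.get("processing", 0) > 0


def ready_fraction(summary: Mapping[str, int]) -> float:
    """Share of documents whose graph extraction has completed.

    Returns 1.0 for an empty namespace, since nothing remains to backfill.
    """
    total = sum(summary.values())
    return summary.get("ready", 0) / total if total else 1.0


# Assumed shape of QueryResult.metadata['graph_status_summary'].
example = {"pending": 30, "processing": 10, "ready": 60, "failed": 0}
print(is_backfilling(example), ready_fraction(example))
```

A service could use a check like this to flag results as "graph still warming up" while vector and BM25 retrieval are already complete.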

Benchmarks

  • benchmarks/defer_extraction_bench.py — A/B/C harness (sync ingest vs deferred ingest vs drain).
  • benchmarks/ingest_perf.py extended with --provider {local,http}. Measured: TEI HTTP CPU beats local fastembed 2.1× on bge-small (66 vs 140 ms/chunk).

Documentation

  • docs/cookbook/background-extraction.md — full guide with three architectural patterns (sync / cron / always-on daemon), end-to-end FastAPI walkthrough, operator playbook.
  • README.md / docs/README.md / docs/user-guide.md / docs/operations-guide.md — cross-references and discovery paths.

Production-readiness audit

skill-output/prod-ready/ contains the 16-finding audit that drove this release's safety fixes. P0s landed; P1s (timed background reaper, retry counter + cap, extractor health probe, multi-worker concurrency test) are tracked for the next cycle.

Verdict: single-worker / single-namespace deployments are production-ready; multi-worker deployments are safe by construction.

Compatibility

  • Synchronous ingest behavior: byte-for-byte unchanged.
  • Schema migrations 012 + 013 run automatically on connect; both are forward-only and idempotent.
  • release_processing signature is now keyword-only — any caller passing doc_ids positionally needs doc_ids= (the only known internal caller in tests is updated).
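
To illustrate what the keyword-only change means for callers, the stub below mimics the calling convention only. `release_processing_stub`, its `conn` first parameter, and its return value are hypothetical; the real function's other parameters and behavior are not shown in these notes.

```python
from typing import Optional, Sequence


def release_processing_stub(
    conn: object,
    *,  # everything after this marker must be passed by keyword
    namespace: Optional[str] = None,
    doc_ids: Optional[Sequence[int]] = None,
) -> dict:
    """Stand-in that only echoes its arguments; it performs no database work."""
    return {
        "namespace": namespace,
        "doc_ids": list(doc_ids) if doc_ids is not None else None,
    }


# Keyword form: accepted.
release_processing_stub(None, namespace="crm", doc_ids=[1, 2])

# Positional form: rejected by Python before the function body runs.
try:
    release_processing_stub(None, [1, 2])
except TypeError as exc:
    print("positional call rejected:", exc)
```

Code that previously passed doc_ids positionally fails loudly with a TypeError rather than silently binding the value to the wrong parameter.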

Full changelog

CHANGELOG.md