v0.3.0

@github-actions github-actions released this 07 Jun 17:18
· 90 commits to main since this release

The workflow + measurement release. Ships a new public-API surface
that closes the templated-workload retention gap end-to-end, in all three
bindings (Rust, Python, Node): analyze_query_set, the QueryRewrite
trait with two built-in implementations (Stripper and Vocabulary),
Document::context_with_rewrites(...) to compose them with an audit
trail, Vocabulary::enrich(...) as the chunk-side mirror, and evaluate
for deterministic A/B with no LLM judge. On the CUAD framework comparison
the full detect → compile → context_with_rewrites → A/B workflow takes
≥0.8 retention from 81.3% → 90.7% — a 9.4-point lift over raw BM25,
beating LlamaIndex's 86% by 4 points, at native BM25 latency (~2.5ms/query)
on default lexical retrieval. Worked example, hand-curated CUAD clause-name
dictionary, and a 6-arm probe contrasting the workflow vs hybrid+cross-encoder
live in docs/findings/CUAD_CLAUSE_EXPANSION.md and
docs/findings/CUAD_HYBRID_RERANK.md.
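
A deterministic A/B of this kind needs only gold labels and set arithmetic. The sketch below is a minimal, library-free illustration and not RedHop's evaluate implementation: `context_recall` and `ab_compare` are hypothetical names, the inputs are invented toy data, and the score is a plain gold-chunk recall with a threshold that merely echoes the 0.8 figure quoted above. How the release's retention metric is actually defined is not assumed here.

```python
def context_recall(retrieved_ids, gold_ids):
    """Fraction of gold chunk ids that appear among the retrieved ids."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must not be empty")
    return len(gold & set(retrieved_ids)) / len(gold)


def ab_compare(arm_a, arm_b, gold, threshold=0.8):
    """Score two retrieval arms against the same gold labels.

    arm_a, arm_b: dict of query id -> list of retrieved chunk ids.
    gold: dict of query id -> list of gold chunk ids.
    Returns the share of queries at or above the threshold for each arm.
    """
    if not gold:
        raise ValueError("gold must contain at least one query")
    rows = []
    for qid, gold_ids in gold.items():
        a = context_recall(arm_a.get(qid, []), gold_ids)
        b = context_recall(arm_b.get(qid, []), gold_ids)
        rows.append((qid, a, b))
    n = len(rows)
    share_a = sum(1 for _, a, _ in rows if a >= threshold) / n
    share_b = sum(1 for _, _, b in rows if b >= threshold) / n
    return {
        "share_a": share_a,
        "share_b": share_b,
        "delta_points": round((share_b - share_a) * 100, 1),
        "rows": rows,
    }


# Toy data, invented for illustration only.
gold = {"q1": ["c1", "c2"], "q2": ["c3"], "q3": ["c4", "c5"], "q4": ["c6"]}
baseline = {"q1": ["c1", "c9"], "q2": ["c3"], "q3": ["c8"], "q4": ["c6"]}
rewritten = {"q1": ["c1", "c2"], "q2": ["c3"], "q3": ["c4"], "q4": ["c6"]}

result = ab_compare(baseline, rewritten, gold)
print(result["share_a"], result["share_b"], result["delta_points"])
```

Because both arms are scored with the same pure function over the same gold set, rerunning the comparison gives the same numbers every time, which is the property an LLM-judge-free A/B relies on.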

Vocabulary.enrich(...) ships with bidirectional measured evidence on the
regime rule it follows. Positive side: docs/findings/SPIDER_ENRICH.md
measured +0.19 mean column recall on Spider-shape schema retrieval (curated
workload synonyms; n=30, candidate_k=10). Negative side:
docs/findings/CUAD_ENRICH_DEFINITIONS_NULL.md measured −2.0 pts on
CUAD prose chunks. The two findings together complete the four-corner
rule with measured evidence on all four corners: workload-pervasive
signal manipulation fails on either side of the pipeline; only
workload-curated semantics work. See docs/findings/VOCABULARY_ENRICH.md
for the regime rule, use-case ranking, and failure modes.

Breaking on the manual-chunks path (Python + Node): the typed
redhop.Chunk(text, *, source=None, id=None, metadata=None, ...) constructor
becomes the only accepted input shape for Document.from_chunks and the
low-level build_context / filter_context / analyze_context /
context_economics entry points. Bare strings and dicts both raise
ValueError with a migration hint pointing at the new constructor. The
trade-off is intentional: the dict path didn't expose chunk metadata at all,
so manually-constructed chunks couldn't carry page / heading / line
into citations — a real functional gap, not just ergonomics. The typed
Chunk closes that gap and surfaces source (provenance) and id
(identity) as the two distinct concepts they already are in the Rust core
(see Breaking below for the migration).

Added

Templated-workload helpers (Rust + Python + Node)

  • analyze_query_set(queries) → QuerySetReport — diagnostic that takes
    a representative sample of your queries and reports whether they share
    enough boilerplate to be templated, which terms are doing the dilution,
    and a coarse estimated_dilution_cost band. Cross-workload probe
    (docs/findings/QUERY_SET_ANALYZER.md): CUAD fires (share 0.66, cost
    high); HotpotQA + MuSiQue both stay quiet (0.00 and 0.12, both
    is_templated=False). Conservative by design — false positives push
    users toward a workaround that won't help, which is worse than staying
    quiet.
  • QueryRewrite trait + Stripper + Vocabulary — compiled,
    observable, token-level-correct replacement for the function-form
    rewrites originally drafted for this release. Each QueryRewrite
    implementation returns a RewriteResult { query, record } so every
    stage's {stage, from, to, matched, added, removed} lands on
    ContextReport::query_rewrites automatically when called through
    the chain.
    • Stripper::new(boilerplate) — compiled boilerplate-removal
      rewrite. Matches at token granularity through the analyzer (with a
      surface-form fallback for tokens like "of"/"the" that stem to
      empty), so a single-token strip cannot accidentally erase a
      substring inside a longer word (an "of" strip does not erase
      the "of" inside "office"). Replaces the substring-based
      drop_template_terms function originally drafted for 0.3.0.
    • Vocabulary::new(entries) / Vocabulary::bidirectional(entries)
      — compiled workload-curated equivalence classes. Tokenizes keys,
      synonyms, and the query through the same analyzer the BM25 index
      uses, so a vocabulary key "ip" cannot fire on the "ip" inside
      "recipient". Bidirectional mode treats every class member as a
      trigger (PTO ↔ "paid time off" ↔ "vacation"). The CUAD probe
      (docs/findings/CUAD_CLAUSE_EXPANSION.md) shows +3.0 points on top
      of the template-stripped baseline (the new token-level matching
      re-validates at 90.7% vs the substring-based predecessor's 90.3%
      — same workload, +0.4 from analyzer alignment).
    • Document::context_with_rewrites(query, &[&stripper, &vocab])
      — runs the chain left-to-right through retrieval. Each stage sees
      the previous stage's output; the per-stage RewriteRecords land on
      ctx.report.query_rewrites automatically.
    • Future-extensible. Both Stripper and Vocabulary are
      QueryRewrite implementations; user code can ship its own (e.g. a
      workload-specific normalizer) and chain it alongside the built-ins.
      The trait is exported on the public API surface.
    • Vocabulary::enrich(chunk) → RewriteResult: chunk-side
      mirror of apply, shipped as a primitive on mechanism reasoning
      with asymmetric measured evidence. The mechanism (a chunk-side
      doc2query variant) and the regime hypothesis
      (expected value ∝ shortness × opacity × dictionary-exists) are
      well-grounded; the positive prediction (short opaque coded
      units: schema columns, API symbols, error codes) is not yet
      measured by RedHop. Spider/BIRD as the schema-regime probe is
      queued, not run. The negative prediction (long prose chunks +
      workload-pervasive vocabulary will dilute, not help) has been
      measured directly: CUAD_ENRICH_DEFINITIONS_NULL
      regressed retention −2.0pt vs the 90.7% workflow baseline
      (~24-point loss on the 17/50 affected contracts). This completes
      the four-corner rule from CUAD_PRF_NULL + SUB_IDF_AUTO_DROP_NULL
      onto the chunk side: workload-pervasive signal manipulation fails
      on either side of the pipeline. Users adopting enrich should
      A/B on their own corpus with redhop::evaluate(...); the regime
      rule is a hypothesis, not a guarantee. The audit trail
      (per-chunk RewriteRecord with stage: "enrich") is returned to
      the caller so the A/B is auditable. Synthetic demo (not a
      benchmark): crates/examples/examples/enrich_code_search.rs.
      Full asymmetric-evidence framing + use case predictions + failure
      modes in docs/findings/VOCABULARY_ENRICH.md.
  • evaluate(query, ctx, gold) → EvalReport — in-process retrieval-eval
    scorer, no LLM judge. Self-eval (mean_grounding, evidence_density,
    retained_evidence_ratio, second_hop_rescues, low_confidence,
    estimated_waste_tokens) is always populated; gold-relative metrics
    (context_recall, context_precision, answer_token_recall) are
    optionally unlocked by passing gold_chunks and/or gold_answer.
    Composite overall blends whichever fields are available. Designed as
    a refraction of the same primitives the runtime uses to make its
    Decision Report — a low overall and report.low_confidence_retrieval
    are the same signal viewed twice, not independent measurements, so eval
    and runtime can never disagree. Rationale, contract details, and the
    10 / 11 / 9 Rust / Python / Node test pins are in
    docs/findings/EVALUATE_API.md.
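
The token-level matching and per-stage audit trail described for Stripper, Vocabulary, and the rewrite chain can be illustrated with a small, library-free sketch. Everything here is hypothetical stand-in code rather than RedHop's implementation: `tokenize`, `strip_stage`, `vocabulary_stage`, and `run_chain` are illustrative names, the tokenizer is a plain lowercase word split with no stemming, the strip stage drops individual boilerplate tokens instead of compiled phrases, and the record dicts only borrow the field names listed above.

```python
import re


def tokenize(text):
    """Lowercase word split; a stand-in for a real analyzer (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def strip_stage(query, boilerplate):
    """Remove whole tokens that occur in any boilerplate phrase."""
    drop = {tok for phrase in boilerplate for tok in tokenize(phrase)}
    tokens = tokenize(query)
    kept = [t for t in tokens if t not in drop]
    removed = [t for t in tokens if t in drop]
    out = " ".join(kept)
    record = {"stage": "strip", "from": query, "to": out,
              "matched": sorted(set(removed)), "added": [], "removed": removed}
    return out, record


def vocabulary_stage(query, entries):
    """Append synonyms when a key matches a whole-token sequence in the query."""
    tokens = tokenize(query)
    matched, added = [], []
    for key, synonyms in entries.items():
        key_tokens = tokenize(key)
        n = len(key_tokens)
        if n == 0:
            continue  # a key with no tokens can never match
        hit = any(tokens[i:i + n] == key_tokens
                  for i in range(len(tokens) - n + 1))
        if hit:
            matched.append(key)
            for synonym in synonyms:
                if synonym not in added:
                    added.append(synonym)
    out = " ".join([query, *added]) if added else query
    record = {"stage": "vocabulary", "from": query, "to": out,
              "matched": matched, "added": added, "removed": []}
    return out, record


def run_chain(query, stages):
    """Apply stages left to right, collecting one audit record per stage."""
    records = []
    for stage in stages:
        query, record = stage(query)
        records.append(record)
    return query, records


# Toy inputs, invented for illustration only.
boilerplate = ["highlight the parts of this contract related to"]
vocab = {"ip": ["intellectual property"]}
raw = "Highlight the parts of this contract related to IP for the recipient office"

final, records = run_chain(raw, [
    lambda q: strip_stage(q, boilerplate),
    lambda q: vocabulary_stage(q, vocab),
])
print(final)
for record in records:
    print(record["stage"], record["matched"], record["added"])
```

Because both stages compare whole tokens, stripping "of" leaves "office" intact and the "ip" key does not fire on "recipient"; a substring-based version of either stage would get both cases wrong.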

Findings (the evidence layer)

New findings document what was tried, what worked, and what was
falsified across this release:

  • Confirmed: QUERY_SET_ANALYZER, CUAD_RECALL_GAP,
    CUAD_CLAUSE_EXPANSION, MULTILINGUAL_ANALYZER, EVALUATE_API,
    CUAD_HYBRID_RERANK (substitute-not-stack rule), VOCABULARY_ENRICH
    (confirmed on both sides of the regime rule), SPIDER_ENRICH
    (the positive-side validation for Vocabulary.enrich(...): curated
    chunk-side enrichment on a Spider-shape sample lifted mean column
    recall +0.19 from 0.77 → 0.97, ≥0.8 retention 63% → 93%).
  • Null result / falsified: CUAD_PRF_NULL (unweighted PRF on
    boilerplate-heavy corpora), CUAD_CHUNK_FRAGMENTATION_NULL (chunker
    isn't the CUAD lever), SUB_IDF_AUTO_DROP_NULL (corpus-only IDF
    manipulation fails in both directions),
    CUAD_ENRICH_DEFINITIONS_NULL (chunk-side enrich on per-contract
    Definitions regressed −2.0 pts vs the 90.7% workflow baseline;
    ~24-point loss on the 17/50 contracts where Definitions were
    extractable — chunk-side parallel to CUAD_PRF_NULL's failure mode,
    measured directly).
  • The four-corner rule is now measured on all four corners.
    Workload-pervasive signal manipulation fails on either side of the
    pipeline; only workload-curated semantics work:
    query-side curated wins (CUAD_CLAUSE_EXPANSION +3.0pt) ·
    query-side auto fails (CUAD_PRF_NULL −3.7pt) ·
    chunk-side curated wins (SPIDER_ENRICH +0.19 mean recall) ·
    chunk-side auto fails (CUAD_ENRICH_DEFINITIONS_NULL −2.0pt).

Examples

Eleven new harnesses under crates/examples/examples/:
cuad_query_preprocessing, cuad_chunk_strategy_sweep,
cuad_chunk_fragmentation, cuad_clause_expansion, cuad_hybrid_rerank,
cuad_perf, cuad_prf, cuad_rust_vs_python_path,
multilingual_query_set_probe, query_set_analyzer_probe,
sub_idf_reweighting_probe.

Documentation

  • New workflow-lift chart .github/workflow_lift.svg embedded in the
    root README + binding READMEs — surfaces the 81 → 88 → 90.7% story
    visually.
  • Root README, python/README.md, nodejs/README.md "Templated
    workloads" section rewritten to detect → strip → (optional) vocabulary →
    A/B with Stripper / Vocabulary / context_with_rewrites tabled.
  • docs/CHOOSING_A_CONFIG.md step 3 leads with the new "two paths up
    the same hill" decision table contrasting retrieval="hybrid" (the
    one-knob alternative) vs BM25 + the helpers (best-quality).

Chat-RAG and chronology preservation

  • ContextConfig::preserve_order: bool — new field (default false,
    no behavior change for existing callers). When set, the assembled
    context emits selected chunks in source-document order instead of
    the strategy's relevance-emitted order. The selection step is
    untouched; only the final ordering changes. Designed for chat
    histories, narrative transcripts, and sequential logs where
    chronology / causality matters and a relevance-ranked emission would
    destroy the meaning ("after the refund came in" reads strangely if
    presented before "ordered the laptop").
  • The sort key is (source, chunk_position) where chunk_position
    prefers a chunk_index metadata field (stamped automatically by
    Document::from_chunks_with based on input order, so caller-supplied
    chunks via from_chunks get a stable chronology key for free) and
    falls back to the chunker's existing sentence_range.start for
    text-loaded paths.
  • Exposed across all three bindings:
    • Rust: ContextConfig { preserve_order: true, .. }; flows
      through LoadOptions::preserve_order for the text() /
      chunks() paths.
    • Python: redhop.Document.from_text(text, preserve_order=True)
      and from_chunks / from_file / from_bytes; also exposed on the
      low-level redhop.build_context(query, chunks, preserve_order=True)
      and redhop.filter_context(...).
    • Node: Document.fromText(text, { preserveOrder: true }) and
      siblings; also a preserveOrder?: boolean field on the
      ContextOptions shape consumed by buildContext and filterContext.
  • Worked example: crates/examples/examples/chat_rag.rs shows a
    12-turn chat where, on the query "shipping refund label return",
    the strategy picks four turns by relevance. With preserve_order
    off they are emitted as [turn-08, turn-03, turn-05, turn-06] (relevance);
    with preserve_order on they are emitted as [turn-03, turn-05, turn-06, turn-08]
    (chronological), so the LLM reads what was said in the order it was
    said. 3 new Rust unit tests pin the contract
    (preserve_order_off_emits_relevance_order,
    preserve_order_on_emits_document_order,
    preserve_order_groups_by_source).
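
The ordering rule above can be sketched in a few lines. This is an illustrative stand-in, not RedHop's implementation: `SelectedChunk`, `chunk_position`, and `emit` are hypothetical names, `sentence_start` stands in for sentence_range.start, and the chunk_index values are invented to mirror the chat example.

```python
from dataclasses import dataclass, field


@dataclass
class SelectedChunk:
    id: str
    source: str
    sentence_start: int = 0  # stand-in for sentence_range.start
    metadata: dict = field(default_factory=dict)


def chunk_position(chunk):
    """Prefer the chunk_index metadata field, else fall back to sentence start."""
    index = chunk.metadata.get("chunk_index")
    return index if index is not None else chunk.sentence_start


def emit(selected, preserve_order=False):
    """Selection is untouched; only the final ordering changes."""
    if not preserve_order:
        return list(selected)  # keep the strategy's relevance order
    return sorted(selected, key=lambda c: (c.source, chunk_position(c)))


# The four turns from the chat example, in the order the strategy picked them.
picked = [
    SelectedChunk("turn-08", "chat", metadata={"chunk_index": 7}),
    SelectedChunk("turn-03", "chat", metadata={"chunk_index": 2}),
    SelectedChunk("turn-05", "chat", metadata={"chunk_index": 4}),
    SelectedChunk("turn-06", "chat", metadata={"chunk_index": 5}),
]

print([c.id for c in emit(picked)])
print([c.id for c in emit(picked, preserve_order=True)])
```

Sorting on the (source, position) pair also keeps chunks from the same source together, so mixing several documents in one context does not interleave their timelines.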

Changed

  • Package registry URLs now point at https://www.redhopai.com as
    the canonical Homepage, with the GitHub repo kept as Repository
    (PyPI) / repository (npm) / repository (crates.io). Before this,
    PyPI displayed two identical "Homepage" and "Repository" links both
    pointing at GitHub; npm displayed neither. PyPI also gains
    Documentation, Changelog, Issues, and Evidence layer link
    entries; npm gains homepage, repository, bugs, and an
    expanded keywords array (reasoning, embeddings added).
  • Findings master table refreshed with new rows on
    /docs/benchmarks/ (website) and docs/findings/README.md (repo).
    Framework comparison row updated: the CUAD headline is now
    90.7% via Stripper + Vocabulary (was 88% via strip alone),
    beating LlamaIndex by 4 points. VOCABULARY_ENRICH row promoted from
    asymmetric measured evidence to confirmed on both sides of the
    regime rule after the SPIDER_ENRICH probe landed.
  • RewriteResult.query field renamed to RewriteResult.text
    (Rust). The same struct is the output of both query-side
    QueryRewrite::apply and chunk-side Vocabulary::enrich. The old
    query field name read awkwardly on the enrich path
    (vocab.enrich(chunk_text).query describes a chunk, not a query);
    text is neutral and accurate for both directions. The audit-record
    stage field is the signal of which side of the pipeline emitted
    the result ("strip" / "vocabulary" / "enrich"). Pre-publish
    rename — no callers exist outside the repo yet, but flagging for
    anyone building from source on a pre-release commit.
  • User-facing docs (README.md, python/README.md, nodejs/README.md,
    website) elevate the rewrite chain + audit trail + evaluate to a
    dedicated "Show your work" section. The 0.3.0 differentiator
    versus other RAG frameworks is that every transform is observable on
    the same Decision Report and every change is A/B-scoreable without
    an LLM judge; the previous docs surfaced the 3-call surface plus
    citations but understated the rewrite/audit/evaluate combo. The new
    section appears on every binding's README and as both a homepage
    card and a section on the website.

Fixed

  • Document.from_folder was constructing LoadOptions without
    preserve_order under --features files,semantic. Caught
    locally while writing examples/python/07_retrieval_tiers.py (a
    full-feature build). The bug was hidden in the lean (no-features)
    default build because the missing-field code path was behind
    #[cfg(feature = "files")]. The default published wheel ships with
    features = ["files", "semantic"], so end users would have hit it.
    Fixed; all 4 feature configurations (--no-default-features,
    --features files, --features semantic, --features files,semantic)
    now compile cleanly.

Breaking — redhop.Chunk is now the only accepted manual-chunks shape

  • Document.from_chunks + build_context + filter_context +
    analyze_context + context_economics now require typed
    redhop.Chunk(...) instances. Bare strings and plain dicts both
    raise ValueError with a migration hint:
    chunk 0: expected redhop.Chunk(text, source=..., ...); got str. As of
    0.3.0, strings and dicts are no longer accepted — wrap your input as
    `redhop.Chunk(text, source='myfile.txt')`.
    
  • What the new constructor looks like:
    redhop.Chunk(
        text,
        source=None,       # provenance: file path / URL / logical handle
        id=None,            # identity: stable id, defaults to c0, c1, …
        metadata=None,      # open dict; citations read page/heading/line
        token_count=None,   # auto from whitespace if omitted
        embedding=None,     # for pre-computed dense vectors
    )
    Node mirrors this with new redhop.Chunk(text, { source, id, metadata, tokenCount, embedding }).
  • Why this is a breaking change rather than a backward-compatible addition:
    the dict path didn't accept metadata at all, so manually-supplied
    chunks couldn't carry page/heading/line into citations. The two-ways-
    to-do-it cleanup is incidental; closing the metadata gap is the real
    reason. Strict typing also surfaces source (provenance) and id
    (identity) as distinct concepts the way the Rust core has always
    treated them — the dict path conflated them in practice.
  • Migration:
    | Before | After |
    | --- | --- |
    | from_chunks(["a", "b"]) | from_chunks([redhop.Chunk("a"), redhop.Chunk("b")]) |
    | from_chunks([{"text": "a", "source": "x.md"}]) | from_chunks([redhop.Chunk("a", source="x.md")]) |
    | from_chunks([{"text": "a", "id": "x", "source": "y.md"}]) | from_chunks([redhop.Chunk("a", id="x", source="y.md")]) |
    | buildContext(q, [{ id, text }, ...]) (Node) | buildContext(q, [new Chunk(text, { id }), ...]) |
  • What's new on the typed-chunks path: citations now pick up page,
    heading, and line from metadata={...} on chunks the user built
    themselves. Before 0.3.0 those fields were always None on the
    manual-chunks path — only the file loaders populated them.
  • Rust callers unaffected. The redhop::core::Chunk struct hasn't
    changed shape. Document::from_chunks(Vec<Chunk>) still takes
    Vec<redhop::core::Chunk> exactly as it did. A new public facade
    redhop::chunks_typed(Vec<Chunk>, &LoadOptions) was added so the
    bindings can route pre-formed chunks through the indexing pipeline
    without going through the chunker (preserving 1-to-1 chunk identity).
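
For codebases with many legacy call sites, a small adapter can make the Python migration mechanical. The sketch below is hypothetical helper code, not part of redhop: `legacy_to_chunk_args` is an illustrative name, and which keys the old dict path actually accepted is an assumption (the helper simply allows the keyword names from the constructor shown above). The real redhop.Chunk call is left as a comment so the block runs without the package installed.

```python
# Keyword names taken from the constructor shown above.
ALLOWED_KEYS = ("source", "id", "metadata", "token_count", "embedding")


def legacy_to_chunk_args(item):
    """Normalize a legacy str or dict chunk into (text, kwargs)."""
    if isinstance(item, str):
        return item, {}
    if isinstance(item, dict):
        if "text" not in item:
            raise ValueError("legacy chunk dict has no 'text' key")
        unknown = set(item) - {"text", *ALLOWED_KEYS}
        if unknown:
            raise ValueError(f"unsupported legacy chunk keys: {sorted(unknown)}")
        kwargs = {k: item[k] for k in ALLOWED_KEYS if item.get(k) is not None}
        return item["text"], kwargs
    raise TypeError(f"expected str or dict, got {type(item).__name__}")


legacy = [
    "a",
    {"text": "b", "source": "x.md"},
    {"text": "c", "id": "x", "source": "y.md"},
]
converted = [legacy_to_chunk_args(item) for item in legacy]
print(converted)

# With redhop installed, each pair would then become a typed chunk:
#   chunks = [redhop.Chunk(text, **kwargs) for text, kwargs in converted]
```

Keeping the adapter at the call-site boundary means the rest of the code only ever sees typed chunks, and the adapter can be deleted once the legacy inputs are gone.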

Breaking (Node only — Python and Rust callers unaffected)

  • Node BuiltContext is now a #[napi] class (was a plain
    #[napi(object)]). The four exposed properties (text, chunks,
    citations, report) remain readable as JS properties via getters, so
    existing user code that does ctx.text, ctx.chunks, etc., continues
    to work unchanged. The TypeScript type changes from
    interface BuiltContext { … } to class BuiltContext { … }. The
    reason for the change is that redhop.evaluate(query, ctx, …) needs
    access to the underlying Rust struct (chunk IDs, the full report shape)
    which a plain object can't carry.
    • What breaks: if you were calling JSON.stringify(ctx), class
      getters aren't enumerable by default and the output will be {}
      instead of the four-field object. Project to a plain object explicitly:
      JSON.stringify({ text: ctx.text, chunks: ctx.chunks, citations: ctx.citations, report: ctx.report }).
      No other behavior changes.