Releases: vysakh0/redhop

v0.3.3

09 Jun 13:26

Answer-quality eval surface (Rust + Python + Node) + audit of
defaulted-on heuristics.
Two threads in this release:
(1) a new in-process evaluate(...) / critique(...) surface for
closed-set answer-quality metrics — lexical + LLM-judged, with
claim-decomposed faithfulness and TP/FP/FN correctness; calibrated
against Ragas at n=200 HotpotQA (r=+0.664, MAE=0.151);
(2) the defaulted-on heuristics audit (five measured, one
already-flipped in 0.3.2, two API smells fixed).

Added — eval surface

  • evaluate(query, ctx, answer=, gold_answer=, judge=, decompose_faithfulness=, decompose_correctness=)
    in-process answer-quality eval that returns one EvalReport blending
    lexical (CI-deterministic) and LLM-judged metrics. Available in Rust
    (redhop::evaluate(...)), Python (redhop.evaluate(...)), and Node
    (evaluateWithJudge(...) for the async judge path). Faithfulness,
    relevancy, correctness in _lexical (no LLM) and _judged (opt-in)
    flavors; gold-relative metrics (context_recall,
    context_precision, answer_token_recall) when gold is provided.
  • critique(answer, aspects, judge=, context=, query=) — open-ended
    user-defined dimensions (harmfulness, conciseness, brand voice,
    etc.). One LLM call per aspect; polarity-corrected scores so high =
    good across the report regardless of highIsGood. Returns a
    CritiqueReport with per-aspect scores in input order.
  • summarize(reports) — aggregates a sequence of per-case
    EvalReports into a means + N + share-flagged summary, the same
    shape RedHop's runtime uses for its Decision Report.
  • Judge surface: Judge.from_callable(fn).cached() (Python),
    Judge.fromCallable(fn, name).cached() (Node), and the Rust
    Judge trait with CachedJudge and CallableJudge wrappers. One
    caching layer for any user-supplied LLM caller; an LRU sized by
    the caller. Single primitive supports faithfulness, relevancy,
    correctness, critique, and decomposed paths.
  • Claim-decomposed faithfulness (decompose_faithfulness=True):
    extracts atomic claims via a few-shot LLM call, then batch-verifies
    all of them in a single second LLM call. Two LLM calls regardless
    of how many claims were extracted. gpt-4o-mini correlates with
    Ragas's faithfulness at r=+0.664 on n=200 HotpotQA (see
    docs/findings/EVAL_JUDGED_CALIBRATION.md).
    Verifier prompt includes paraphrase-positive examples + negative
    entity-substitution examples to balance strictness and recall.
  • TP/FP/FN correctness (decompose_correctness=True): mirrors
    decomposed-faithfulness on the answer-vs-gold axis. Extracts
    claims from both the answer and the gold answer, classifies each as
    TP / FP / FN, returns F₁. Diagnostic counters
    (n_correctness_tp/fp/fn) surface the intermediate categorisation.
  • Refusal-aware decomposition — "I don't know" answers correctly
    produce mean_faithfulness_judged = None (0 claims extracted)
    instead of being scored as a vacuous 1.0. Surfaces refusals as a
    distinct category, not as faithfulness=1.
  • Diagnostic counters on EvalReport:
    n_faithfulness_claims_extracted, n_faithfulness_claims_supported,
    n_correctness_tp, n_correctness_fp, n_correctness_fn. Surface
    intermediate classifier counts so callers can debug WHY a metric
    landed where it did.
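
A minimal sketch of how the diagnostic counters above could combine into scores, assuming faithfulness is supported / extracted and correctness is the standard F₁ over TP / FP / FN. The function names are hypothetical and this is not RedHop's implementation; only the counter semantics and the refusal behavior come from the notes above. Returning None for an empty correctness comparison is an assumption.

```python
from typing import Optional


def faithfulness_from_claims(n_extracted: int, n_supported: int) -> Optional[float]:
    """Share of extracted claims that the verifier marked as supported.

    Returns None when no claims were extracted (for example an
    "I don't know" answer), so a refusal is not scored as a vacuous 1.0.
    """
    if n_extracted == 0:
        return None
    return n_supported / n_extracted


def correctness_f1(tp: int, fp: int, fn: int) -> Optional[float]:
    """Standard F1 over claim-level TP / FP / FN counts."""
    denom = 2 * tp + fp + fn
    if denom == 0:
        return None  # assumption: nothing to compare on either side
    return 2 * tp / denom


print(faithfulness_from_claims(4, 3))    # 0.75
print(faithfulness_from_claims(0, 0))    # None
print(round(correctness_f1(3, 1, 2), 3)) # 0.667
```

Keeping the raw counts next to the ratio is what lets a caller tell "no claims" apart from "all claims supported".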

Added — eval evidence

  • docs/COMPARISON_RAGAS.md — public-facing
    head-to-head with Ragas on claim-decomposed faithfulness. n=200
    HotpotQA, gpt-4o-mini, with Claude haiku as third-judge tie-breaker.
  • docs/findings/EVAL_JUDGED_CALIBRATION.md
    rewritten end-to-end: three-layer evidence (5-case wiring probe →
    5-case Ragas side-by-side → n=200 HotpotQA correlation + third-judge
    subset). Documents the v0→v4 prompt iteration that fixed four
    traceable failure modes (paraphrase rejection, comparative
    hallucination, compound-attribution dilution, wrong-entity
    substitution). Calls out single-shot LLM noise as a measured
    property of the workload (gpt-4o-mini at temp=0 is not
    deterministic through OpenRouter — ~20–30% per-case variance).
  • docs/findings/ANSWER_QUALITY_EVAL.md:
    full API tour for the new evaluate(...) + critique(...) surface.
  • docs/findings/EVAL_VS_RAGAS_SOURCE.md:
    source-read comparison of the two libraries' implementations.
  • bench/eval_correlation_hotpot.py — runs the n=200 Pearson r / MAE
    measurement on HotpotQA against Ragas with configurable context mode
    (supporting / distractor_only / all).
  • bench/eval_third_judge.py — Claude haiku tie-breaker via the local
    claude -p CLI; no API key needed.
  • bench/eval_faith_trace.py — diagnostic harness for tracing claim
    extraction + per-claim verifier votes on specific qids. Has a
    --variant v0/v1/v2/v3/v4 flag for prompt iteration without
    rebuilding the Rust crate.
  • bench/eval_judged_calibration.py — the 5-case wiring probe with
    optional Ragas side-by-side when installed.
  • bench/select_third_judge_subset.py — filters contested cases from
    a correlation-bench JSON so the third-judge run stays cheap.

Breaking — Rust only

  • redhop::evaluate(...) signature now takes six parameters:
    (query, ctx, answer, gold, judge, config) instead of the prior
    three-parameter shape. The Python and Node bindings absorb this via
    kwargs / options and are NOT breaking. Pass None for answer and
    judge and EvalConfig::default() for config to match the old
    behavior.

Added — defaulted-on heuristics audit

  • code_neighbors_default / codeNeighborsDefault — surfaces the
    ±N adjacent-chunk auto-pull on code chunks as a constructor kwarg on
    Python from_text/from_file/from_bytes/from_folder/from_chunks
    and as a field on the Node Options struct. Default 1 (unchanged
    behavior). Pass 0 to disable, or 2/3 for more aggressive
    expansion under loose token budgets. See
    docs/findings/CODE_NEIGHBORS_DEFAULT.md
    for the measured budget tradeoff.
  • prose_heading_default / proseHeadingDefault — surfaces the
    auto-attach of section-heading chunks to prose hits as the same
    constructor-level kwarg / option. Default true (unchanged). Pass
    false for memory-tight workloads where the heading isn't
    load-bearing. See
    docs/findings/PROSE_HEADING_DEFAULT.md
    for the measured +7pt ≥0.8 lift at typical budgets.
  • crates/redhop/src/load.rs: LoadOptions now exposes
    code_neighbors_default and prose_heading_default as
    Option<usize> / Option<bool>. Threads through read_folder_with
    for parity with the in-memory loaders.
  • Audit finding docs. Five new findings on the defaulted-on
    heuristics audit: RAW_ANALYZER (flipped in 0.3.2),
    HYBRID_CANDIDATE_POOL (inert knob, don't tune),
    PROSE_HEADING_DEFAULT (+7pt at typical budgets),
    BM25_SOURCE_FIELD (+4pt with signal, 0pt with noise),
    CODE_NEIGHBORS_DEFAULT (budget-dependent compromise).
  • Cross-binding parity tests for the two new kwargs (Python
    test_loaders.py, Node smoke.cjs).
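
To make the ±N adjacent-chunk idea concrete, here is a toy sketch of neighbor expansion over chunk indices. The function name is hypothetical and it ignores token budgets and chunk types, which the real runtime presumably takes into account; it only illustrates what 0, 1, or a larger N means.

```python
def expand_with_neighbors(hits: list[int], n_chunks: int, neighbors: int = 1) -> list[int]:
    """Return hit indices plus up to `neighbors` adjacent chunks on each side."""
    if neighbors < 0:
        raise ValueError("neighbors must be >= 0")
    selected: set[int] = set()
    for i in hits:
        lo = max(0, i - neighbors)             # clamp at the first chunk
        hi = min(n_chunks - 1, i + neighbors)  # clamp at the last chunk
        selected.update(range(lo, hi + 1))
    return sorted(selected)


print(expand_with_neighbors([0, 5], n_chunks=8, neighbors=1))  # [0, 1, 4, 5, 6]
print(expand_with_neighbors([0, 5], n_chunks=8, neighbors=0))  # [0, 5]
```

With N = 0 the hits pass through unchanged, which corresponds to the "disable" setting described above.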

Changed

  • No default values changed in this release. All defaults remain
    what they were after 0.3.2 — the new kwargs default to the existing
    Rust values (code_neighbors_default=1, prose_heading_default=true).
    Existing callers see zero behavior change; the new kwargs are an
    opt-out / tune surface only.

v0.3.0

07 Jun 17:18

The workflow + measurement release. Ships a new public-API surface
that closes the templated-workload retention gap end-to-end, in all three
bindings (Rust, Python, Node): analyze_query_set, the QueryRewrite
trait with two built-in implementations (Stripper and Vocabulary),
Document::context_with_rewrites(...) to compose them with an audit
trail, Vocabulary::enrich(...) as the chunk-side mirror, and evaluate
for deterministic A/B with no LLM judge. On the CUAD framework comparison
the full detect → compile → context_with_rewrites → A/B workflow takes
≥0.8 retention from 81.3% → 90.7% — a 9.4-point lift over raw BM25,
beating LlamaIndex's 86% by 4 points, at native BM25 latency (~2.5ms/query)
on default lexical retrieval. Worked example, hand-curated CUAD clause-name
dictionary, and a 6-arm probe contrasting the workflow vs hybrid+cross-encoder
live in docs/findings/CUAD_CLAUSE_EXPANSION.md and
docs/findings/CUAD_HYBRID_RERANK.md.

Vocabulary.enrich(...) ships with bidirectional measured evidence on the
regime rule it follows.
Positive side: docs/findings/SPIDER_ENRICH.md
measured +0.19 mean column recall on Spider-shape schema retrieval (curated
workload synonyms; n=30, candidate_k=10). Negative side:
docs/findings/CUAD_ENRICH_DEFINITIONS_NULL.md measured −2.0 pts on
CUAD prose chunks. The two findings together complete the four-corner
rule with measured evidence on all four corners: workload-pervasive
signal manipulation fails on either side of the pipeline; only
workload-curated semantics work. See docs/findings/VOCABULARY_ENRICH.md
for the regime rule, use-case ranking, and failure modes.

Breaking on the manual-chunks path (Python + Node): the typed
redhop.Chunk(text, *, source=None, id=None, metadata=None, ...) constructor
becomes the only accepted input shape for Document.from_chunks and the
low-level build_context / filter_context / analyze_context /
context_economics entry points. Bare strings and dicts both raise
ValueError with a migration hint pointing at the new constructor. The
trade-off is intentional: the dict path didn't expose chunk metadata at all,
so manually-constructed chunks couldn't carry page / heading / line
into citations — a real functional gap, not just ergonomics. The typed
Chunk closes that gap and surfaces source (provenance) and id
(identity) as the two distinct concepts they already are in the Rust core
(see Breaking below for the migration).

Added

Templated-workload helpers (Rust + Python + Node)

  • analyze_query_set(queries) → QuerySetReport — diagnostic that takes
    a representative sample of your queries and reports whether they share
    enough boilerplate to be templated, which terms are doing the dilution,
    and a coarse estimated_dilution_cost band. Cross-workload probe
    (docs/findings/QUERY_SET_ANALYZER.md): CUAD fires (share 0.66, cost
    high); HotpotQA + MuSiQue both stay quiet (0.00 and 0.12, both
    is_templated=False). Conservative by design — false positives push
    users toward a workaround that won't help, which is worse than staying
    quiet.
  • QueryRewrite trait + Stripper + Vocabulary — compiled,
    observable, token-level-correct replacement for the function-form
    rewrites originally drafted for this release. Each QueryRewrite
    implementation returns a RewriteResult { query, record } so every
    stage's {stage, from, to, matched, added, removed} lands on
    ContextReport::query_rewrites automatically when called through
    the chain.
    • Stripper::new(boilerplate) — compiled boilerplate-removal
      rewrite. Matches at token granularity through the analyzer (with a
      surface-form fallback for tokens like "of"/"the" that stem to
      empty), so a single-token strip cannot accidentally erase a
      substring inside a longer word (an "of" strip does not erase
      the "of" inside "office"). Replaces the substring-based
      drop_template_terms function originally drafted for 0.3.0.
    • Vocabulary::new(entries) / Vocabulary::bidirectional(entries)
      — compiled workload-curated equivalence classes. Tokenizes keys,
      synonyms, and the query through the same analyzer the BM25 index
      uses, so a vocabulary key "ip" cannot fire on the "ip" inside
      "recipient". Bidirectional mode treats every class member as a
      trigger (PTO ↔ "paid time off" ↔ "vacation"). The CUAD probe
      (docs/findings/CUAD_CLAUSE_EXPANSION.md) shows +3.0 points on top
      of the template-stripped baseline (the new token-level matching
      re-validates at 90.7% vs the substring-based predecessor's 90.3%
      — same workload, +0.4 from analyzer alignment).
    • Document::context_with_rewrites(query, &[&stripper, &vocab])
      — runs the chain left-to-right through retrieval. Each stage sees
      the previous stage's output; the per-stage RewriteRecords land on
      ctx.report.query_rewrites automatically.
    • Future-extensible. Both Stripper and Vocabulary are
      QueryRewrite implementations; user code can ship its own (e.g. a
      workload-specific normalizer) and chain it alongside the built-ins.
      The trait is exported on the public API surface.
    • Vocabulary::enrich(chunk) → RewriteResult — chunk-side
      mirror of apply, shipped as a primitive on mechanism reasoning
      with asymmetric measured evidence. The mechanism (a chunk-side
      doc2query variant) and the regime hypothesis
      (expected value ∝ shortness × opacity × dictionary-exists) are
      well-grounded; the positive prediction (short opaque coded
      units — schema columns, API symbols, error codes) is not yet
      measured by RedHop. Spider/BIRD as the schema-regime probe is
      queued, not run. The negative prediction (long prose chunks
      + workload-pervasive vocabulary will dilute, not help) has been
      measured directly: CUAD_ENRICH_DEFINITIONS_NULL regressed
      retention −2.0pt vs the 90.7% workflow baseline (~24-point loss
      on the 17/50 affected contracts). This completes the four-corner
      rule from CUAD_PRF_NULL + SUB_IDF_AUTO_DROP_NULL onto the chunk
      side: workload-pervasive signal manipulation fails on either side
      of the pipeline. Users adopting enrich should A/B on their own
      corpus with redhop::evaluate(...); the regime rule is a
      hypothesis, not a guarantee. Audit trail (per-chunk RewriteRecord
      with stage: "enrich") returned to the caller so the A/B is
      auditable. Synthetic demo (not a benchmark):
      crates/examples/examples/enrich_code_search.rs. Full
      asymmetric-evidence framing + use case predictions + failure
      modes in docs/findings/VOCABULARY_ENRICH.md.
  • evaluate(query, ctx, gold) → EvalReport — in-process retrieval-eval
    scorer, no LLM judge. Self-eval (mean_grounding, evidence_density,
    retained_evidence_ratio, second_hop_rescues, low_confidence,
    estimated_waste_tokens) is always populated; gold-relative metrics
    (context_recall, context_precision, answer_token_recall) are
    optionally unlocked by passing gold_chunks and/or gold_answer.
    Composite overall blends whichever fields are available. Designed as
    a refraction of the same primitives the runtime uses to make its
    Decision Report — a low overall and report.low_confidence_retrieval
    are the same signal viewed twice, not independent measurements, so eval
    and runtime can never disagree. Rationale, contract details, and the
    10 / 11 / 9 (Rust / Python / Node) test pins are in
    docs/findings/EVALUATE_API.md.
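
The token-level matching and the rewrite chain described above can be sketched in plain Python. This is a toy model, not the RedHop API: tokenize, Record, strip, expand, and run_chain are hypothetical names, the analyzer here only lowercases (no stemming), and vocabulary keys are assumed to be single tokens.

```python
import re
from dataclasses import dataclass, field


def tokenize(text: str) -> list[str]:
    """Toy analyzer: lowercase word tokens (a real analyzer would also stem)."""
    return re.findall(r"[a-z0-9]+", text.lower())


@dataclass
class Record:
    """Audit entry for one rewrite stage."""
    stage: str
    before: str
    after: str
    added: list[str] = field(default_factory=list)
    removed: list[str] = field(default_factory=list)


def strip(query: str, boilerplate: set[str]) -> tuple[str, Record]:
    """Drop whole tokens only, so "of" never erases part of "office"."""
    tokens = tokenize(query)
    kept = [t for t in tokens if t not in boilerplate]
    removed = [t for t in tokens if t in boilerplate]
    out = " ".join(kept)
    return out, Record("strip", query, out, removed=removed)


def expand(query: str, vocab: dict[str, list[str]]) -> tuple[str, Record]:
    """Append synonyms for tokens that exactly match a vocabulary key."""
    tokens = tokenize(query)
    added: list[str] = []
    for t in tokens:
        for syn in vocab.get(t, []):
            if syn not in tokens and syn not in added:
                added.append(syn)
    out = " ".join(tokens + added)
    return out, Record("vocabulary", query, out, added=added)


def run_chain(query: str, stages) -> tuple[str, list[Record]]:
    """Run stages left to right; each stage sees the previous stage's output."""
    records = []
    for stage in stages:
        query, rec = stage(query)
        records.append(rec)
    return query, records


stages = [
    lambda q: strip(q, {"of", "the"}),
    lambda q: expand(q, {"pto": ["vacation"]}),
]
final, trail = run_chain("the PTO policy of the office", stages)
print(final)  # pto policy office vacation
for rec in trail:
    print(rec.stage, rec.removed, rec.added)
```

A substring-based strip of "of" would have turned "office" into "fice"; matching on whole tokens avoids that class of bug, and the per-stage records make each rewrite auditable.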

Findings (the evidence layer)

New findings document what was tried, what worked, and what was
falsified across this release:

  • Confirmed: QUERY_SET_ANALYZER, CUAD_RECALL_GAP,
    CUAD_CLAUSE_EXPANSION, MULTILINGUAL_ANALYZER, EVALUATE_API,
    CUAD_HYBRID_RERANK (substitute-not-stack rule), VOCABULARY_ENRICH
    (confirmed on both sides of the regime rule), SPIDER_ENRICH
    (the positive-side validation for Vocabulary.enrich(...): curated
    chunk-side enrichment on a Spider-shape sample lifted mean column
    recall +0.19 from 0.77 → 0.97, ≥0.8 retention 63% → 93%).
  • Null result / falsified: CUAD_PRF_NULL (unweighted PRF on
    boilerplate-heavy corpora), CUAD_CHUNK_FRAGMENTATION_NULL (chunker
    isn't the CUAD lever), SUB_IDF_AUTO_DROP_NULL (corpus-only IDF
    manipulation fails in both directions),
    CUAD_ENRICH_DEFINITIONS_NULL (chunk-side enrich on per-contract
    Definitions regressed −2.0 pts vs the 90.7% workflow baseline;
    ~24-point loss on the 17/50 contracts where Definitions were
    extractable — chunk-side parallel to CUAD_PRF_NULL's failure mode,
    measured directly).
  • The four-corner rule is now measured on all four corners.
    Workload-pervasive signal manipulation fails on either side of the
    pipeline; only workload-curated semantics work:
    query-side curated wins (CUAD_CLAUSE_EXPANSION +3.0pt) ·
    query-side auto fails (CUAD_PRF_NULL −3.7pt) ·
    chunk-side curated wins (SPIDER_ENRICH +0.19 mean recall) ·
    chunk-side auto fails (CUAD_ENRICH_DEFINITIONS_NULL −2.0pt).

Examples

Eleven new harnesses under crates/examples/examples/:
cuad_query_preprocessing, cuad_chunk_strategy_sweep,
cuad_chunk_fragmentation, cuad_clause_expansion, cuad_hybrid_rerank,
cuad_perf, cuad_prf, cuad_rust_vs_python_path,
multilingual_query_set_probe, query_set_analyzer_probe,
sub_idf_reweighting_probe.

Documentation

  • New workflow-lift chart .github/workflow_lift.svg embedded in the
    root README + binding READMEs — surfaces the 81 → 88 → 90.7% story
    visually.
  • Root README, python/README.md, nodejs/README.md "Templated
    workloads" section rewritten to detect → strip → (optional) vocabulary →
    A/B with Stripper / Vocabulary / context_with_rewrites tabled.
  • docs/CHOOSING_A_CONFIG.md step 3 leads with the new "two paths...

v0.2.2

06 Jun 12:33

The binding parity + evidence layer release. No breaking changes for any
binding's callers. The Node binding gains 14 missing Report fields, the
documentation gets its first visual presentation (badges, charts,
architecture diagram), and the evidence layer grows by five new findings
that document what was tried, what worked, and what was falsified honestly.

Added

Node binding — full Report field-surface parity with Python

  • Report gains 14 fields + a permanent alias: strategy,
    requestedStrategy, inputTokens, tokenBudget, tokenUtilization,
    nInputChunks, nSelected, inputDistractorRatio,
    reasoningPreservationDelta, distractorsPruned, removedTotal,
    evidenceDensity, distractorRatio, estimatedWasteTokens, plus
    secondHopRescueCount (== secondHopRescues, the existing short
    name; both names will always be present and equal). Before this
    release Node's Report exposed roughly half of Python's surface —
    programmatic callers using report.totalTokens could not read
    report.nSelected or any of the economics fields. All additions are
    non-breaking; no existing field changed name or shape.
  • docs/API_STABILITY.md gains a "Known call-shape asymmetries"
    section documenting the two pre-existing idiomatic differences
    between the Python and Node bindings (from_text positional vs
    options-bag source; ctx.text() callable in Python vs ctx.text
    property in Node). These are stable within 0.x — neither will be
    silently flipped.

README + binding-page presentation

  • Multi-registry badges at the top of every README (root, Python,
    Node). PyPI / crates.io / npm version numbers, license, and a link
    to the evidence layer. Brand color (#e11d48) on the registry
    badges.
  • A retention-vs-frameworks bar chart (.github/retention_vs_frameworks.svg)
    showing the measured head-to-head numbers from
    FRAMEWORK_COMPARISON.md:
    HotpotQA multi-hop (RedHop 77%, LangChain 71%, LlamaIndex 72%) and
    CUAD contracts (RedHop 82%, LangChain 73%, LlamaIndex 86%). Hand-rolled
    SVG, no fake screenshots, every number traces to
    reports/framework_comparison.txt.
  • A pipeline architecture diagram (.github/architecture.svg)
    showing the five stages — Document → Chunking → Retrieval →
    Allocation → BuiltContext — with the calibrating finding named under
    each internal stage and a "YOU BRING / REDHOP OWNS / YOU GET" scope
    label.
  • A Decision Report visual (.github/decision_report.svg) —
    terminal-styled SVG rendering of ctx.report output. Same content
    as the ASCII block (which is preserved under a collapsed <details>
    for copy-paste).
  • A References section at the bottom of the root README, citing
    the named work each piece of the runtime leans on: BM25 (Robertson &
    Zaragoza 2009), Porter2 (Porter 2001), RRF (Cormack et al. 2009),
    Lost-in-the-Middle (Liu et al. 2023), NQC (Shtok et al. 2012), MDR
    (Xiong et al. 2021), and the HotpotQA/MuSiQue/CUAD evaluation
    datasets. Each citation links to the finding doc that uses that
    work.
  • The binding READMEs (python/README.md, nodejs/README.md) get
    the architecture + Decision Report visuals via absolute
    raw.githubusercontent.com URLs so they render on PyPI and npm
    package pages (not just on GitHub).

Structural test suite (crates/redhop/tests/)

  • proptest_invariants.rs — 9 property-based invariants
    pinning build_context's behavior under random valid inputs:
    build_context_never_panics, resolved_strategy_is_never_auto,
    auto_decision_triangle_holds, selection_is_subset_of_input,
    no_duplicate_chunk_ids_in_output, token_budget_respected,
    report_counts_match_reality, report_ratios_are_finite_and_in_range,
    build_context_is_deterministic. Adds proptest = "1" as a
    dev-dependency. Catches edge cases hand-written tests miss by
    construction (empty input, NaN scores, all-stopword query, single-token
    budget, …) — the bug class structural tests exist to close off.
  • default_calibration.rs — 9 pins binding each tuned default
    on ContextConfig::default() / DocumentConfig::default() to the
    finding that calibrated it (token_budget = 8192,
    auto_passthrough_max_tokens = 1500, distractor_min_grounding = 0.10,
    redundancy_max_cosine = 0.92, link_min_jaccard = 0.12,
    low_confidence_max_grounding ≡ distractor_min_grounding,
    target_tokens = 128, candidate_k = 20, Document.strategy = Auto).
    Silent default drift becomes impossible; intentional drift is
    documented in the same commit that makes it.
  • public_api_snapshot.rs — 5 compile-time guards against silent
    rename/removal of public symbols, plus type-bound signature pins for
    load-bearing functions and exhaustive-match pins on
    ContextStrategy / AutoDecision / RetrievalMethod string
    variants. Caught 4 real signature mistakes during authoring.
  • golden_quality.rs — 6 end-to-end retrieval-quality canaries
    on small inline corpora: G01 lexical-keyword match, G02 stemming
    finds morphological variants, G03 distractor doesn't dominate
    relevant chunk, G04 second-hop chunk survives default assembly, G05
    low-confidence signal fires on off-corpus queries, G06 citations
    carry correct source. The canary that fires between benchmark runs
    when RedHop "got dumber" on a real query shape.
  • Three field-set parity tests (test_report_field_surface_parity,
    test_built_context_field_surface_parity,
    test_context_economics_field_surface_parity in
    python/tests/test_parity_node.py). Each compares the SET of fields
    each binding exposes for the named return type. Auto-catches the gap
    class where a new #[getter] (Python) or pub field (Node) appears
    on one side without the other keeping up — the failure mode that hid
    the 14-field Node Report gap until a smoke test stumbled on
    strategy.
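
The style of invariant that proptest_invariants.rs pins can be illustrated with a stdlib-only randomized check over a toy selector. This is a sketch under stated assumptions: select_within_budget is a hypothetical greedy assembler, not build_context, and unlike proptest this loop does not shrink failing inputs.

```python
import random


def select_within_budget(chunks, budget):
    """Toy assembler: take chunks by descending score while they fit the budget.

    chunks is a list of (chunk_id, score, n_tokens) tuples.
    """
    selected, used = [], 0
    for cid, _score, n_tokens in sorted(chunks, key=lambda c: (-c[1], c[0])):
        if used + n_tokens <= budget:
            selected.append(cid)
            used += n_tokens
    return selected, used


def check_invariants(trials: int = 200, seed: int = 0) -> int:
    """Assert four structural invariants on random valid inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(0, 12)  # includes the empty-input edge case
        chunks = [(i, rng.random(), rng.randint(1, 50)) for i in range(n)]
        budget = rng.randint(1, 200)
        selected, used = select_within_budget(chunks, budget)
        assert set(selected) <= {c[0] for c in chunks}    # selection is a subset of input
        assert len(selected) == len(set(selected))         # no duplicate chunk ids
        assert used <= budget                              # token budget respected
        assert select_within_budget(chunks, budget) == (selected, used)  # deterministic
    return trials


print(check_invariants())  # 200
```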

Evidence layer — five new finding documents

  • MUSIQUE_RECALL_GAP.md
    decomposes the dense recall gap between HotpotQA (0.76) and MuSiQue
    (0.28) into five distinct contributors (gold density, retrieval signal
    type, wide-net coverage, embedder capacity, chunking) and documents an
    attempted full-pool RRF refactor of RetrievalMode::Hybrid that an
    honest A/B benchmark falsified. Branch feature/hybrid-full-pool-rrf
    on origin holds the working refactor as a research record; main keeps
    the existing Hybrid behavior. Includes 5 reproducible example
    harnesses under crates/examples/examples/musique_*.rs and
    hybrid_old_vs_new.rs (the A/B that closed the question).
  • RERANKING_LIMITS.md Update — 2026-06-06 (kind-label gate)
    — falsifies both directions of the
    HotpotQA-type-label gate proposed in the original finding's "open
    problem" section. Closes that probe.
  • RERANKING_LIMITS.md Update — 2026-06-06 (later, grounding gate)
    — documents the discovery and
    cross-corpus falsification of a grounding_top1 ≤ 0.35 gate that
    worked on HotpotQA (+0.031 lift, robust to 5-fold CV) but failed to
    generalize to MuSiQue. Also covers an NQC + WIG cross-corpus probe
    that didn't port. Closes the CE-gate research direction with a
    measured negative result. Includes the Phase A feature-logging
    harness (crates/examples/examples/ce_gate_feature_log{,_musique}.rs)
    for any future probe.
  • DENSE_RERANK_CEILING.md Update — 2026-06-06
    — falsifies MDR single-pass as a uniform policy
    (−0.05 vs dense baseline) while documenting a real +0.027 lift on the
    subset of queries where dense had a gold in the pool but missed it.
    Closes the single-shot MDR probe.
  • LOCAL_RERANK.md Update — 2026-06-06
    — notes the status of LocalRerankRetriever after the
    MuSiQue investigation: it is now a building block rather than the
    default Hybrid, but the "semantic recall without ANN" contract it
    established is intact and the working refactor is preserved on
    feature/hybrid-full-pool-rrf for future re-evaluation.

Changed

  • python/tests/test_parity_node.py now pins strategy +
    requested_strategy data-value parity (in addition to the field-set
    parity above). The harness previously normalized away these fields
    rather than testing them — direct dict-key access means a future
    regression that drops either field on either side fails with a clear
    KeyError instead of silently passing.
  • Documentation polish: python/README.md's from_text row shows the
    optional source= parameter; nodejs/README.md lists the full
    report shape; the top-level README's "hybrid" row in the
    Retrieval tiers table accurately reflects the shipped semantics.

Notes on the runtime

  • RetrievalMode::Hybrid is unchanged for this release. A full-pool
    RRF refactor was built end-to-end on feature/hybrid-full-pool-rrf
    (commit c81ffbe, all tests passing, fmt + clippy clean) but a direct
    A/B benchmark falsified the ship decision: at the user-facing
    candidate_k = 20 the new composition gave only +0.0074 on MuSiQue
    and +0.0017 on HotpotQA (both below the +0.02 pre-registered ship bar)
    and regressed HotpotQA at K=4 by −0.011. The wide-K wins are real
    (+0.07 MuSiQue@50, +0.034 HotpotQA@50) but not user-facing.
    LocalRerankRetriever's BM25-prune-then-RRF composition is still
    what RetrievalMode::Hybrid resolves to. Full A/B numbers and the
    ship-decision audit are in
    MUSIQUE_RECALL_GAP.md.
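
For readers unfamiliar with the fusion step named above, here is a minimal sketch of reciprocal rank fusion as defined by Cormack et al. (score = Σ 1 / (k + rank), with k = 60 in the paper). The function name is hypothetical, the document ids are made up, and the constant and candidate pools RedHop actually uses are not stated in these notes.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per id, with ranks starting at 1.
    Ties are broken by id so the output is deterministic.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


bm25 = ["a", "b", "c"]   # made-up lexical ranking
dense = ["b", "d", "a"]  # made-up dense ranking
fused = rrf_fuse([bm25, dense])
print([doc for doc, _ in fused])  # ['b', 'a', 'd', 'c']
```

Because RRF only sees ranks, which candidates enter each list matters as much as the formula; that is the difference between pruning with BM25 first and fusing over the full pool.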

v0.2.1

06 Jun 02:20

The robustness + bugfix patch release. Two real bugs fixed (one BM25
edge case, one cross-binding serde-compat break), one new Python helper,
and ~30 new tests pinning load-bearing contracts across the codebase.

Fixed

  • BM25: silent wildcard fallback on no-signal queries. Queries whose
    every term was filtered out (stopwords only, or all-out-of-vocab)
    silently fell back to a match-all wildcard, returning the corpus's
    top-BM25 chunks as if the query had matched something. Now returns
    an empty result set with a clear signal.
  • ContextReport.removed and .economics missing #[serde(default)].
    A binding payload from an older RedHop binary missing these fields
    would error on deserialize — a silent cross-version compatibility break
    for Python/Node callers shuttling ContextReport across the FFI as
    JSON. Both target types already derive Default; the fix is a no-op
    for fresh payloads and gracefully fills in zeros for old ones.
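
The first fix can be illustrated with a toy term lookup: when filtering leaves no usable query terms, the function returns an empty result instead of matching everything. This is a sketch of the contract only; search, STOPWORDS, and the index shape are hypothetical, there is no BM25 scoring here, and how RedHop surfaces the "clear signal" is not shown.

```python
STOPWORDS = {"the", "of", "a", "is"}  # tiny illustrative list


def search(query: str, index: dict[str, set[int]]) -> list[int]:
    """Return chunk ids matching any usable query term.

    A query with no usable terms (stopwords only, or all out-of-vocabulary)
    yields an empty list rather than falling back to a match-all.
    """
    terms = [t for t in query.lower().split() if t not in STOPWORDS and t in index]
    if not terms:
        return []  # no signal: report that nothing matched
    hits: set[int] = set()
    for t in terms:
        hits |= index[t]
    return sorted(hits)


index = {"rust": {0, 2}, "python": {1}}
print(search("the of", index))    # []
print(search("zig", index))       # []
print(search("the rust", index))  # [0, 2]
```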

Added

  • redhop.context_with_timeout (Python). Thin ThreadPoolExecutor
    watchdog around Document.context() for agent integrations that need
    to bail on slow queries:

    try:
        ctx = redhop.context_with_timeout(doc, q, timeout_ms=5000)
    except TimeoutError:
        ...

    Forwards budget / neighbors / include_heading. Scope is
    deliberately Python-only — true Rust-side cancellation needs hooks in
    Tantivy/ONNX that don't exist yet, and the docstring + TimeoutError
    message document the limitation.

  • docs/DEFAULT_PROVENANCE.md — every tuned default in
    ContextConfig / DocumentConfig linked back to the finding that
    justifies it (so callers can audit which numbers are calibrated vs
    arbitrary).
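
The watchdog pattern behind context_with_timeout can be sketched with the standard library alone. This is not the helper's source: call_with_timeout and slow are hypothetical names, and the sketch shares the limitation described above, namely that the worker thread keeps running after the caller gives up.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(fn, *args, timeout_ms: int, **kwargs):
    """Run fn in a worker thread and stop waiting after timeout_ms.

    The call is abandoned, not cancelled: Python cannot interrupt the
    worker thread, so it runs to completion in the background.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_ms / 1000)
    except FutureTimeout:
        # Re-raise as the builtin TimeoutError (the two differ before Python 3.11).
        raise TimeoutError(f"call exceeded {timeout_ms} ms; worker still running") from None
    finally:
        executor.shutdown(wait=False)  # do not block the caller on the worker


def slow(x):
    time.sleep(0.3)  # stand-in for a slow query
    return x


print(call_with_timeout(slow, 1, timeout_ms=2000))  # 1
try:
    call_with_timeout(slow, 2, timeout_ms=50)
except TimeoutError as exc:
    print("timed out:", exc)
```

Using shutdown(wait=False) instead of a with block matters here: the context-manager form would wait for the worker on exit and defeat the timeout.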

Internal — robustness tests

Seven new test passes (~30 tests) pinning load-bearing contracts that
were previously informal:

  • Determinism — same input → same output, Rust + cross-binding parity.
  • Internal invariants — 7+ consistency invariants across the strategy
    matrix (selected ⊆ input, removed.total matches drop count, etc.).
  • Concurrency: Send + Sync audit + 1024-call parallel stress.
  • Adversarial loaders — 9 tests covering corrupt PDFs, symlink loops,
    deep recursion, malformed DOCX/PPTX/XLSX.
  • Auto-gate boundary — pins the inclusive <= semantics at
    1499/1500/1501 input tokens + the custom-gate path.
  • Serde round-trip — every cross-FFI type (Chunk, Score,
    ContextReport, ...) survives JSON round-trip; forward-compat
    exercised via a minimal pre-0.1.3 payload.
  • Strategy semantics — 7 differential tests pinning the contrasts
    between all 5 ContextStrategy variants on a shared corpus
    (catches accidental strategy convergence).
  • Persisted cache — incremental cache hit/miss contract for
    read_folder_with(persist=true): per-file (mtime, size) skip,
    no-op reload doesn't rewrite, fingerprint invalidation on config
    change, deleted-file cleanup.
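
The per-file (mtime, size) skip and deleted-file cleanup from the persisted-cache item can be sketched as follows. This is an illustration of the contract, not read_folder_with itself: scan and the in-memory dict are hypothetical, and config-fingerprint invalidation is left out.

```python
import os
import tempfile


def scan(paths: list[str], cache: dict) -> list[str]:
    """Return the paths that need re-parsing and update the cache in place.

    cache maps path -> (mtime_ns, size); an unchanged pair means "skip".
    Entries for files that are no longer listed are dropped.
    """
    stale = []
    for p in paths:
        st = os.stat(p)
        fingerprint = (st.st_mtime_ns, st.st_size)
        if cache.get(p) != fingerprint:
            stale.append(p)
            cache[p] = fingerprint
    for gone in set(cache) - set(paths):
        del cache[gone]  # deleted-file cleanup
    return stale


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "a.txt")
    with open(path, "w") as f:
        f.write("hello")
    cache: dict = {}
    first = scan([path], cache)    # new file: parse it
    second = scan([path], cache)   # unchanged: skip
    with open(path, "w") as f:
        f.write("hello, world")    # size changes, so the fingerprint changes
    third = scan([path], cache)    # modified: parse again
    fourth = scan([], cache)       # file no longer listed: entry removed
    print(first == [path], second == [], third == [path], cache == {})
```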

No public API changes. Python and Node callers are unaffected aside
from the new context_with_timeout helper.

v0.2.0

03 Jun 13:34

The binding-parity + non-English release. Three months of incremental
quality work plus a focused arc on cross-binding consistency: Python, Node,
and Rust all expose the same surface, return the same values for the same
inputs, and drift is now actively prevented in CI. The Rust crate also gains
a pluggable lexical analyzer, closing the structural bug class (BM25 ↔
grounding-scorer disagreement) that 0.1.3–0.1.4 fixed by hand four times.

Breaking changes

Two source-level breaks for Rust callers. Code that fills these structs via
..Default::default() still compiles, and pip/npm consumers are unaffected:

  1. ContextConfig + DocumentConfig grew new required fields
    (analyzer: Arc<dyn Analyzer>) for the pluggable lexical analyzer.
    Callers constructing those structs via field literals from outside
    the crate need to add analyzer: redhop::analyzer::default_english().
  2. ContextConfig::default().token_budget changed from 2048 → 8192
    to align with the Python binding's long-standing default (which was
    shipping to PyPI users that whole time). Rust callers relying on the
    old 2048 default will now get a 4× larger assembled context. Set
    token_budget: 2048 explicitly to restore the old behavior. Python +
    Node users see no change.

Added

Pluggable lexical analyzer

  • crate::analyzer::Analyzer trait + SnowballAnalyzer (18
    Snowball Porter2 languages). First-class extension point: one analyzer
    drives BOTH the BM25 retriever AND the grounding scorer, so the two
    layers structurally cannot disagree on what "the same term" means.
    Design rationale in docs/design/ANALYZER_PLUGIN.md; usage in
    docs/LANGUAGE.md.
  • Document::with_analyzer(Arc<dyn Analyzer>) — mirrors
    with_embedder. Swaps the analyzer for both layers in lockstep.
  • LoadOptions::language: Option<String> — string-routed access to
    the 18 builtins (english, german, french, spanish, italian,
    portuguese, dutch, russian, swedish, norwegian, danish,
    finnish, romanian, hungarian, turkish, arabic, greek,
    tamil). Unknown language names return an error (no silent fallback
    to English).
  • Python language kwarg on every Document.from_* constructor.
  • Node language field on Options.
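
The "one analyzer drives both layers" idea and the "unknown language is an error" rule can be sketched as follows. This is an illustration under assumptions, not the crate's code: the trait shape, `ToyEnglish`, `Retriever`, `GroundingScorer`, `analyzer_for`, and `build_layers` are all hypothetical, and the toy analyzer only strips a trailing "s" where the real one uses Snowball stemmers.

```rust
use std::sync::Arc;

// Hypothetical trait shape; the real redhop `Analyzer` trait may differ.
trait Analyzer {
    fn analyze(&self, text: &str) -> Vec<String>;
}

// Toy analyzer: splits on non-alphanumerics, lowercases, strips a
// trailing "s". A stand-in for a real Snowball stemmer.
struct ToyEnglish;

impl Analyzer for ToyEnglish {
    fn analyze(&self, text: &str) -> Vec<String> {
        text.split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
            .map(|t| {
                let lower = t.to_lowercase();
                let stem = lower
                    .strip_suffix('s')
                    .filter(|s| s.len() > 2)
                    .unwrap_or(lower.as_str());
                stem.to_string()
            })
            .collect()
    }
}

// Both layers hold the same Arc, so they cannot tokenize differently.
struct Retriever {
    analyzer: Arc<dyn Analyzer>,
}

struct GroundingScorer {
    analyzer: Arc<dyn Analyzer>,
}

impl Retriever {
    // True if any analyzed query term appears in the analyzed document.
    fn matches(&self, query: &str, doc: &str) -> bool {
        let doc_terms = self.analyzer.analyze(doc);
        self.analyzer
            .analyze(query)
            .iter()
            .any(|t| doc_terms.contains(t))
    }
}

impl GroundingScorer {
    // Fraction of analyzed query terms found in the analyzed text.
    fn score(&self, query: &str, text: &str) -> f64 {
        let q = self.analyzer.analyze(query);
        if q.is_empty() {
            return 0.0;
        }
        let t = self.analyzer.analyze(text);
        let hits = q.iter().filter(|term| t.contains(*term)).count();
        hits as f64 / q.len() as f64
    }
}

// String-routed lookup: unknown names are an error, never a silent
// fallback to English.
fn analyzer_for(language: &str) -> Result<Arc<dyn Analyzer>, String> {
    match language {
        "english" => Ok(Arc::new(ToyEnglish)),
        other => Err(format!("unknown language: {}", other)),
    }
}

// Swap the analyzer for both layers in lockstep.
fn build_layers(language: &str) -> Result<(Retriever, GroundingScorer), String> {
    let analyzer = analyzer_for(language)?;
    let retriever = Retriever { analyzer: Arc::clone(&analyzer) };
    Ok((retriever, GroundingScorer { analyzer }))
}
```

Because both structs are built from one `Arc`, a query term such as "refunds" is reduced to the same token in retrieval and in scoring, which is the disagreement class the notes describe as structurally closed.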

Binding parity (Node catches up to Python)

  • Document.analyze(query) — pure diagnostics, returns the same
    Report shape as context().report without paying assembly cost.
  • Document.nFiles getter — number of source files indexed (1
    for single-source ctors, the readable count for fromFolder).
  • Document.skippedFiles getter — SkippedFile[] ({source,
    reason} pairs) for files fromFolder couldn't parse. Previously
    these were skipped silently, with no way to inspect them.
  • buildContext / filterContext / analyzeContext /
    contextEconomics top-level functions — the low-level "I do my own
    retrieval, just want RedHop for assembly" surface. Mirrors Python's
    same-named functions; takes ChunkInput[] + ContextOptions.
  • groundingScore(query, text) + linkStrength(a, b) — the
    observability primitives the strategies use internally, exposed so
    external code reuses RedHop's exact relevance notion instead of
    reimplementing.

Tests + infrastructure

  • crates/redhop/tests/quality_suite.rs — 45-test behavior-level
    suite organized by what a user perceives, not by code structure.
    Covers tokenization (T01-T07), multi-field reach (T08-T09), document
    structure (T10-T13), context assembly (T14-T20), hybrid contract
    (T21-T22), edge cases (T23-T26), Unicode/multilingual (T27-T30),
    adversarial queries (T31-T34), nested markdown (T35), cross-format
    mixed corpus (T36), non-English pinning (T37-T40), and the analyzer
    plugin (T41-T45). Found two real bugs on its first runs (an
    empty-query BM25 crash and an accent-folding gap), and a binding bug
    via T41-T44 (from_chunks silently dropping language= in Python).
  • python/tests/test_parity_node.py + nodejs/test/parity_runner.cjs
    — cross-binding parity harness. 6 tests hand identical inputs to
    Python and Node and diff structured outputs (caught the
    analyzeContext / contextEconomics token_budget divergence on
    its first run).
  • crates/cli/tests/cli_smoke.rs — first-ever CLI integration
    tests. Asserts --help works on each subcommand + a real
    analyze-context - stdin pipe.
  • Node CI job: .github/workflows/ci.yml now builds the napi
    addon and runs npm test on PRs. Previously PRs only exercised
    Rust + Python.
  • ASCII folding (café → cafe, Süßigkeit → Sussigkeit,
    naïve → naive) in both BM25 and the grounding scorer (via NFKD).
    New tests T27, T28, T39 pin this.
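
To make the folding behaviour concrete, here is a small table-driven sketch that reproduces the three examples above using only the standard library. It is a stand-in, not the crate's implementation: the notes say RedHop folds via NFKD (through the unicode-normalization crate), whereas `fold_term` and `same_term` below are hypothetical helpers covering only a handful of Latin letters. One detail worth knowing: ß has no Unicode decomposition, so any fold that yields "ss" needs an explicit mapping like the one shown; how RedHop handles that case is not stated here.

```rust
// Table-driven fold for a few accented Latin letters. A stand-in for a
// full NFKD-based fold, which would strip combining marks generically.
fn fold_char(c: char, out: &mut String) {
    match c {
        // a with grave, acute, circumflex, diaeresis
        '\u{e0}' | '\u{e1}' | '\u{e2}' | '\u{e4}' => out.push('a'),
        // e with grave, acute, circumflex, diaeresis
        '\u{e8}' | '\u{e9}' | '\u{ea}' | '\u{eb}' => out.push('e'),
        // i with grave, acute, circumflex, diaeresis
        '\u{ec}' | '\u{ed}' | '\u{ee}' | '\u{ef}' => out.push('i'),
        // o with grave, acute, circumflex, diaeresis
        '\u{f2}' | '\u{f3}' | '\u{f4}' | '\u{f6}' => out.push('o'),
        // u with grave, acute, circumflex, diaeresis
        '\u{f9}' | '\u{fa}' | '\u{fb}' | '\u{fc}' => out.push('u'),
        // Sharp s (U+00DF) has no decomposition; map it explicitly.
        '\u{df}' => out.push_str("ss"),
        other => out.push(other),
    }
}

// Fold every character of a term; case is left untouched.
fn fold_term(term: &str) -> String {
    let mut out = String::with_capacity(term.len());
    for c in term.chars() {
        fold_char(c, &mut out);
    }
    out
}

// Index side and query side both go through the same fold, so an
// unaccented query can match an accented document term.
fn same_term(indexed: &str, query: &str) -> bool {
    fold_term(indexed) == fold_term(query)
}
```

With this fold, `same_term("caf\u{e9}", "cafe")` holds, which is the property the pinned tests are described as guarding.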

Documentation

  • docs/LANGUAGE.md — honest scope of non-English support, by
    family + the Analyzer plugin's public API (Rust / Python / Node).
  • docs/design/ANALYZER_PLUGIN.md — rewritten to describe the
    shipped surface (was originally a proposal with several deviations).
  • README "Language support" section + per-package READMEs
    (python/README.md, nodejs/README.md) — language= examples.
  • docs/ARCHITECTURE.md — refreshed against the post-consolidation
    workspace (the pre-0.2 split of redhop-{core,context,…} into
    separate crates was rolled into one published redhop crate; diagram
    and crate-name references updated).
  • docs/API_STABILITY.md — full Node section added; Python section
    updated with language=, n_files, skipped_files; Rust section
    updated with the consolidated module paths.

Changed

  • Python folder walker unified with Rust's read_folder_with:
    −429 LOC in python/src/lib.rs (≈25% of the file). Removed the
    parallel build_folder_persisted, collect_files, PersistedIndex,
    CachedFile, fingerprint, etc. Both bindings now share Rust's
    single implementation; on-disk index format is byte-compatible with
    the previous Python writer, so existing caches reload cleanly.
  • strategy_from_str + retrieval_from_str consolidated to a
    single source of truth in redhop::load. Python's wrappers now
    forward to the Rust functions with map_err instead of duplicating
    the match arms.
  • Document carries n_files() and skipped_files() accessors on
    the Rust struct. Single-source constructors default to 1 / empty;
    read_folder_with (both simple and persisted paths) now records
    (source, reason) for each skipped file instead of silently dropping
    them.
  • MSRV bumped 1.75 → 1.77 across all three workspace declarations
    (workspace, python/Cargo.toml, nodejs/Cargo.toml) — the napi-rs
    2.x in the Node binding sets the actual floor; the inconsistency
    meant a 1.75 user hit a mysterious napi error instead of a clear MSRV
    one.
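
The "record (source, reason) instead of silently dropping" change can be pictured with a short sketch. Everything here is hypothetical (`FolderLoad`, `parse`, `load_folder`, and the UTF-8/blank-file rules are invented for illustration); only the idea of keeping a file count alongside a list of skipped files with reasons comes from the notes.

```rust
#[derive(Debug, Clone, PartialEq)]
struct SkippedFile {
    source: String,
    reason: String,
}

#[derive(Debug, Default)]
struct FolderLoad {
    texts: Vec<String>,
    skipped: Vec<SkippedFile>,
}

impl FolderLoad {
    // Number of files that were actually readable.
    fn n_files(&self) -> usize {
        self.texts.len()
    }
}

// Hypothetical parser: accepts only valid, non-blank UTF-8.
fn parse(bytes: &[u8]) -> Result<String, String> {
    let text = std::str::from_utf8(bytes)
        .map_err(|e| format!("invalid UTF-8: {}", e))?;
    if text.trim().is_empty() {
        return Err("empty file".to_string());
    }
    Ok(text.to_string())
}

// Walk the inputs, keeping a reason for every file that is skipped
// rather than dropping it without a trace.
fn load_folder(files: &[(&str, Vec<u8>)]) -> FolderLoad {
    let mut load = FolderLoad::default();
    for (source, bytes) in files {
        match parse(bytes) {
            Ok(text) => load.texts.push(text),
            Err(reason) => load.skipped.push(SkippedFile {
                source: source.to_string(),
                reason,
            }),
        }
    }
    load
}
```

A caller can then report `load.skipped` to the user, which is the introspection the old silent-skip behaviour lacked.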

Fixed

  • All-stopword query no longer crashes BM25. A query the analyzer
    pipeline reduces to zero positive terms ("", " ", "the and is of
    in or") used to surface as a hard Tantivy error (Invalid query:
    Only excluding terms given). The retriever now traps that error
    class (and the empty query class) and returns an empty result.
    Caught by quality_suite::t25 on its first run.
  • Python Document.from_chunks silently dropped language= — the
    pyo3 signature accepted the kwarg but the call into doc_config
    passed None instead of the user's value. Caught by the new Python
    analyzer test suite on its first run.
  • Node analyzeContext / contextEconomics diverged from Python:
    Node honored the user's token_budget option, while Python's
    equivalents hardcode usize::MAX because these are no-budget
    pure-analysis surfaces.
    Caught by the cross-binding parity tests on their first run.
  • Node index.d.ts was stale — the language and minCandidates
    fields were present on the Rust Options struct, but the typings
    hadn't been regenerated. TypeScript users got "Object literal may
    only specify known properties" on perfectly valid options.
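
The all-stopword fix follows a common pattern: treat "nothing left to search for" as an empty result rather than an error. A minimal sketch, assuming a stand-in engine (`engine_search`) and a stand-in error type (`QueryError`); neither is Tantivy's API, and the stopword list is just the example query from the notes.

```rust
#[derive(Debug, PartialEq)]
enum QueryError {
    // Stand-in for the engine error raised when no positive terms remain.
    NoPositiveTerms,
}

const STOPWORDS: [&str; 6] = ["the", "and", "is", "of", "in", "or"];

// Stand-in for the underlying engine: it errors when the analyzed
// query contains no terms. Returns indices of matching documents.
fn engine_search(query: &str, docs: &[&str]) -> Result<Vec<usize>, QueryError> {
    let terms: Vec<String> = query
        .split_whitespace()
        .map(|t| t.to_lowercase())
        .filter(|t| !STOPWORDS.contains(&t.as_str()))
        .collect();
    if terms.is_empty() {
        return Err(QueryError::NoPositiveTerms);
    }
    Ok(docs
        .iter()
        .enumerate()
        .filter(|(_, d)| {
            let lower = d.to_lowercase();
            terms.iter().any(|t| lower.contains(t.as_str()))
        })
        .map(|(i, _)| i)
        .collect())
}

// Wrapper: the "no positive terms" error becomes an empty result.
// Any other outcome passes through unchanged.
fn search(query: &str, docs: &[&str]) -> Result<Vec<usize>, QueryError> {
    match engine_search(query, docs) {
        Err(QueryError::NoPositiveTerms) => Ok(Vec::new()),
        other => other,
    }
}
```

Trapping only the specific error class keeps genuine engine failures visible while making empty, whitespace-only, and all-stopword queries behave like any other query with no hits.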

Notes

  • unicode-normalization promoted from transitive (via tantivy) to a
    direct dep of redhop. Used for the grounding scorer's NFKD fold.
  • Workspace test count: 320/320 (Rust) + 81/81 (Python, +1 BGE
    fixture skip) + Node smoke + analyzer suites. Was 260 at the v0.1.4
    tag.
  • CI gates: cargo fmt --all -- --check, cargo clippy --workspace
    --all-targets -- -D warnings, cargo test --workspace, cargo doc
    --workspace --no-deps --features files,semantic (warning-free), the
    cross-binding parity suite, and the Node smoke + analyzer suites.
    All six CI jobs green.
  • [package.metadata.docs.rs] all-features = true added to the
    redhop crate so the published doc page on docs.rs shows the
    files + semantic items instead of just the lean lexical surface.
  • 21 example files swept clean of hardcoded /Users/vysakh/... paths;
    they resolve datasets/models/exports through
    redhop_examples::{data_path, exports_path, model_path,
    bge_small_paths, ms_marco_paths} helpers that honor
    REDHOP_{DATA,EXPORTS,MODELS}_DIR env vars.
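
A path helper that honors an environment variable might look roughly like this. The sketch is an assumption about shape only: `REDHOP_DATA_DIR` is named above, but `resolve_dir`, the signature of `data_path`, and the `"data"` fallback directory are invented for illustration and may not match the helpers in redhop_examples.

```rust
use std::env;
use std::path::PathBuf;

// Pure helper: prefer the override when it is set and non-empty,
// otherwise use the fallback directory.
fn resolve_dir(override_dir: Option<String>, fallback: &str) -> PathBuf {
    match override_dir {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(fallback),
    }
}

// REDHOP_DATA_DIR comes from the release notes; the "data" fallback
// is an assumption made for this sketch.
fn data_path(relative: &str) -> PathBuf {
    resolve_dir(env::var("REDHOP_DATA_DIR").ok(), "data").join(relative)
}
```

Keeping the environment lookup in a thin wrapper around a pure function makes the resolution rule testable without mutating process-wide state.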

v0.1.4

02 Jun 10:36

Citation ergonomics — for both code and prose. Follow-on to 0.1.3's
BM25-quality theme: retrieval is now precise, but the assembled context
returned to the LLM left the user staring at a def line without the
implementation, or at a deep section paragraph without its parent heading.
Both gaps closed by default; both have explicit opt-out knobs. Plus three
prose-side fixes that surfaced during the audit (setext headings, PDF
heading heuristic, plumbing).

Changed

  • Document::context(query) on a code chunk now attaches ±1 neighbor
    chunks by default. Code is chunked as fixed-token windows, so a
    50-line function often spans 2-3 chunks; a hit on the chunk
    containing the def
    line would previously cite only the signature, omitting the body in the
    next chunk. With DocumentConfig::code_neighbors_default = 1
    (the new default), citations on code hits include the surrounding
    implementation. Behavior change for code-shaped corpora — set
    code_neighbors_default: 0 to restore the old chunk-only behavior. No
    effect on prose corpora (fires only on chunks tagged
    metadata["kind"] == "code").
  • Document::context(query) on a prose chunk with a section heading
    now attaches the section's opener chunk by default. A query that
    lands deep inside ## Refunds → ### Eligibility previously cited only
    the matched chunk — the LLM lost the section title. With
    DocumentConfig::prose_heading_default = true (the new default), the
    section's first chunk is attached. Behavior change for hierarchical
    prose — set prose_heading_default: false to disable. Only fires on
    chunks that carry non-empty metadata["heading"] (markdown, DOCX,
    PPTX, XLSX, and — new in this release — PDF).
  • Markdown sections now recognize setext headings (Title\n=====
    for H1, Title\n----- for H2) in addition to ATX (#/##/…). YAML
    frontmatter (--- ... --- at file start) is detected and excluded
    from setext scanning so its closing fence doesn't get treated as an
    H2 underline. Pandoc output / older docs / man pages now produce the
    same section structure as their ATX equivalents.
  • PDF chunks now carry best-effort heading metadata. A per-page
    heuristic lifts the first short, non-paragraph-shaped line into
    Section::heading (rejecting page-number footers, body lines ending
    in sentence punctuation, and lines ending in a digit). Lets the BM25
    heading-field search added in 0.1.3 actually reach PDF chunks by
    topic; previously metadata["heading"] was always None on PDFs.
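
The setext rule and the frontmatter exclusion can be sketched in a few dozen lines. This is a simplified illustration, not RedHop's parser: `Heading`, `frontmatter_len`, and `find_headings` are hypothetical names, and the sketch ignores details a full CommonMark implementation handles (indentation limits, code fences, list items, thematic breaks).

```rust
#[derive(Debug, PartialEq)]
struct Heading {
    level: u8,
    title: String,
    line: usize, // zero-based line index of the title
}

// True if `line` is a setext underline made only of `marker` characters.
fn is_underline(line: &str, marker: char) -> bool {
    let t = line.trim_end();
    !t.is_empty() && t.chars().all(|c| c == marker)
}

// Number of leading lines occupied by YAML frontmatter (0 if none).
fn frontmatter_len(lines: &[&str]) -> usize {
    if lines.first().map(|l| l.trim_end()) != Some("---") {
        return 0;
    }
    for (i, line) in lines.iter().enumerate().skip(1) {
        if line.trim_end() == "---" {
            return i + 1;
        }
    }
    0 // unterminated fence: treat the text as ordinary content
}

fn find_headings(text: &str) -> Vec<Heading> {
    let lines: Vec<&str> = text.lines().collect();
    let mut out = Vec::new();
    // Start after the frontmatter so its closing fence is never
    // mistaken for an H2 underline.
    let mut i = frontmatter_len(&lines);
    while i < lines.len() {
        let line = lines[i].trim_end();
        let hashes = line.chars().take_while(|&c| c == '#').count();
        if (1..=6).contains(&hashes) && line[hashes..].starts_with(' ') {
            // ATX heading: one to six '#' followed by a space.
            out.push(Heading {
                level: hashes as u8,
                title: line[hashes..].trim().to_string(),
                line: i,
            });
        } else if !line.trim().is_empty() && i + 1 < lines.len() {
            // Setext heading: a text line followed by ==== or ----.
            let level = if is_underline(lines[i + 1], '=') {
                1
            } else if is_underline(lines[i + 1], '-') {
                2
            } else {
                0
            };
            if level > 0 {
                out.push(Heading { level, title: line.trim().to_string(), line: i });
                i += 1; // skip the underline itself
            }
        }
        i += 1;
    }
    out
}
```

Without the `frontmatter_len` skip, a file beginning with `---`, `title: Demo`, `---` would report "title: Demo" as an H2, which is the misparse the frontmatter exclusion is described as avoiding.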

Added

  • DocumentConfig::code_neighbors_default: usize (default 1).
  • DocumentConfig::prose_heading_default: bool (default true).
    Both inherited via the Python / Node bindings' default config; no
    binding-surface change for callers who don't override.
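
To show what a neighbor-count knob of this kind does, here is a sketch of expanding hits on code chunks by ±n. `Kind` and `expand_hits` are hypothetical; the rule that only code-tagged neighbors are pulled in is an assumption of this sketch, since the notes only say the expansion fires on chunks tagged as code.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Code,
    Prose,
}

// Expand each hit on a code chunk by `neighbors` chunks on either side.
// Hits on prose chunks are kept as-is. Returns sorted, de-duplicated
// chunk indices; out-of-range hits are ignored.
fn expand_hits(kinds: &[Kind], hits: &[usize], neighbors: usize) -> Vec<usize> {
    let mut out = BTreeSet::new();
    for &hit in hits {
        if hit >= kinds.len() {
            continue;
        }
        out.insert(hit);
        if kinds[hit] != Kind::Code {
            continue;
        }
        let lo = hit.saturating_sub(neighbors);
        let hi = hit.saturating_add(neighbors).min(kinds.len() - 1);
        for i in lo..=hi {
            // Assumption: only neighbors that are themselves code chunks
            // are attached.
            if kinds[i] == Kind::Code {
                out.insert(i);
            }
        }
    }
    out.into_iter().collect()
}
```

With `neighbors = 1` a hit on the middle chunk of a three-chunk function returns all three chunks; with `neighbors = 0` it returns only the matched chunk, mirroring the opt-out described above.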

Fixed

  • (No code-bug fixes — 0.1.3 already closed the BM25 quality gaps. See
    the Notes section for the verified-not-broken embedding-persistence
    story.)

Notes

  • Embedding persistence verified. The 0.1.3 audit suspected that
    read_folder_with(persist=true) re-embedded every chunk on reload
    (paying ~30-60 sec of bge-small cost per cold start on a 1000-chunk
    codebase). The machinery is already correct: embedded_chunks()
    populates the Chunk::embedding field from the retriever cache before
    writing index.json, Embedding is Serialize/Deserialize, and
    LocalRerankRetriever::index short-circuits any chunk that comes back
    with an embedding already set. Round-trip test
    (crates/redhop/tests/embedding_persistence.rs) now pins this — a
    reload triggers exactly 1 embed call (the query), not N+1 (the query +
    every chunk). Locked in as a regression guard.
  • Eleven new tests across the citation-ergonomics theme: 3 for the code
    neighbor default, 3 for the prose heading default, 3 for setext
    headings + frontmatter handling, 2 for the PDF heading heuristic.
    111/111 tests pass under
    cargo test -p redhop --features files.
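
The "exactly 1 embed call on reload" guard described in the embedding-persistence note can be illustrated with a counting embedder. This is a sketch under assumptions: `Chunk`, `CountingEmbedder`, and `index` are simplified stand-ins, the embedding is a toy two-number vector, and an in-memory clone stands in for the index.json round trip.

```rust
#[derive(Debug, Clone)]
struct Chunk {
    text: String,
    embedding: Option<Vec<f32>>,
}

// Counts embed calls so a test can pin how many were made.
struct CountingEmbedder {
    calls: usize,
}

impl CountingEmbedder {
    // Toy embedding: [byte length, whitespace-separated word count].
    fn embed(&mut self, text: &str) -> Vec<f32> {
        self.calls += 1;
        vec![text.len() as f32, text.split_whitespace().count() as f32]
    }
}

// Index step: embed only chunks that arrive without an embedding, so
// chunks reloaded with their stored vectors cost nothing.
fn index(chunks: &mut [Chunk], embedder: &mut CountingEmbedder) {
    for chunk in chunks.iter_mut() {
        if chunk.embedding.is_none() {
            chunk.embedding = Some(embedder.embed(&chunk.text));
        }
    }
}
```

On a cold start every chunk is embedded once. After a reload with embeddings intact, indexing makes no calls, so the only embed call left is the one for the query itself.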