Releases · vysakh0/redhop

09 Jun 13:26

v0.3.3

6707e20

v0.3.3 Latest

Latest

Answer-quality eval surface (Rust + Python + Node) + audit of
defaulted-on heuristics. Two threads in this release:
(1) a new in-process evaluate(...) / critique(...) surface for
closed-set answer-quality metrics — lexical + LLM-judged, with
claim-decomposed faithfulness and TP/FP/FN correctness; calibrated
against Ragas at n=200 HotpotQA (r=+0.664, MAE=0.151);
(2) the defaulted-on heuristics audit (five measured, one
already-flipped in 0.3.2, two API smells fixed).

Added — eval surface

evaluate(query, ctx, answer=, gold_answer=, judge=, decompose_faithfulness=, decompose_correctness=) —
in-process answer-quality eval that returns one EvalReport blending
lexical (CI-deterministic) and LLM-judged metrics. Available in Rust
(redhop::evaluate(...)), Python (redhop.evaluate(...)), and Node
(evaluateWithJudge(...) for the async judge path). Faithfulness,
relevancy, correctness in _lexical (no LLM) and _judged (opt-in)
flavors; gold-relative metrics (context_recall,
context_precision, answer_token_recall) when gold is provided.
critique(answer, aspects, judge=, context=, query=) — open-ended
user-defined dimensions (harmfulness, conciseness, brand voice,
etc.). One LLM call per aspect; polarity-corrected scores so high =
good across the report regardless of highIsGood. Returns a
CritiqueReport with per-aspect scores in input order.
summarize(reports) — aggregates a sequence of per-case
EvalReports into a means + N + share-flagged summary, the same
shape RedHop's runtime uses for its Decision Report.
Judge surface — Judge.from_callable(fn).cached() (Python),
Judge.fromCallable(fn, name).cached() (Node), and the Rust
Judge trait with CachedJudge and CallableJudge wrappers. One
caching layer for any user-supplied LLM caller; an LRU sized by
the caller. Single primitive supports faithfulness, relevancy,
correctness, critique, and decomposed paths.
Claim-decomposed faithfulness (decompose_faithfulness=True):
extracts atomic claims via a few-shot LLM call, then batch-verifies
all of them in a single second LLM call. Two LLM calls regardless
of how many claims were extracted. gpt-4o-mini correlates with
Ragas's faithfulness at r=+0.664 on n=200 HotpotQA (see
docs/findings/EVAL_JUDGED_CALIBRATION.md).
Verifier prompt includes paraphrase-positive examples + negative
entity-substitution examples to balance strictness and recall.
TP/FP/FN correctness (decompose_correctness=True): mirrors
decomposed-faithfulness on the answer-vs-gold axis. Extracts
claims from both the answer and the gold answer, classifies each as
TP / FP / FN, returns F₁. Diagnostic counters
(n_correctness_tp/fp/fn) surface the intermediate categorisation.
Refusal-aware decomposition — "I don't know" answers correctly
produce mean_faithfulness_judged = None (0 claims extracted)
instead of being scored as a vacuous 1.0. Surfaces refusals as a
distinct category, not as faithfulness=1.
Diagnostic counters on EvalReport:
n_faithfulness_claims_extracted, n_faithfulness_claims_supported,
n_correctness_tp, n_correctness_fp, n_correctness_fn. Surface
intermediate classifier counts so callers can debug WHY a metric
landed where it did.

Added — eval evidence

docs/COMPARISON_RAGAS.md — public-facing
head-to-head with Ragas on claim-decomposed faithfulness. n=200
HotpotQA, gpt-4o-mini, with Claude haiku as third-judge tie-breaker.
docs/findings/EVAL_JUDGED_CALIBRATION.md
rewritten end-to-end: three-layer evidence (5-case wiring probe →
5-case Ragas side-by-side → n=200 HotpotQA correlation + third-judge
subset). Documents the v0→v4 prompt iteration that fixed four
traceable failure modes (paraphrase rejection, comparative
hallucination, compound-attribution dilution, wrong-entity
substitution). Calls out single-shot LLM noise as a measured
property of the workload (gpt-4o-mini at temp=0 is not
deterministic through OpenRouter — ~20–30% per-case variance).
docs/findings/ANSWER_QUALITY_EVAL.md —
full API tour for the new evaluate(...) + critique(...) surface.
docs/findings/EVAL_VS_RAGAS_SOURCE.md —
source-read comparison of the two libraries' implementations.
bench/eval_correlation_hotpot.py — runs the n=200 Pearson r / MAE
measurement on HotpotQA against Ragas with configurable context mode
(supporting / distractor_only / all).
bench/eval_third_judge.py — Claude haiku tie-breaker via the local
claude -p CLI; no API key needed.
bench/eval_faith_trace.py — diagnostic harness for tracing claim
extraction + per-claim verifier votes on specific qids. Has a
--variant v0/v1/v2/v3/v4 flag for prompt iteration without
rebuilding the Rust crate.
bench/eval_judged_calibration.py — the 5-case wiring probe with
optional Ragas side-by-side when installed.
bench/select_third_judge_subset.py — filters contested cases from
a correlation-bench JSON so the third-judge run stays cheap.

Breaking — Rust only

redhop::evaluate(...) signature now takes six parameters:
(query, ctx, answer, gold, judge, config) instead of the prior
three-parameter shape. The Python and Node bindings absorb this via
kwargs / options and are NOT breaking. Pass None for answer and
judge and EvalConfig::default() for config to match the old
behavior.

Added — defaulted-on heuristics audit

code_neighbors_default / codeNeighborsDefault — surfaces the
±N adjacent-chunk auto-pull on code chunks as a constructor kwarg on
Python from_text/from_file/from_bytes/from_folder/from_chunks
and as a field on the Node Options struct. Default 1 (unchanged
behavior). Pass 0 to disable, or 2/3 for more aggressive
expansion under loose token budgets. See
docs/findings/CODE_NEIGHBORS_DEFAULT.md
for the measured budget tradeoff.
prose_heading_default / proseHeadingDefault — surfaces the
auto-attach of section-heading chunks to prose hits as the same
constructor-level kwarg / option. Default true (unchanged). Pass
false for memory-tight workloads where the heading isn't
load-bearing. See
docs/findings/PROSE_HEADING_DEFAULT.md
for the measured +7pt ≥0.8 lift at typical budgets.
crates/redhop/src/load.rs — LoadOptions now exposes
code_neighbors_default and prose_heading_default as
Option<usize> / Option<bool>. Threads through read_folder_with
for parity with the in-memory loaders.
Audit finding docs. Five new findings on the defaulted-on
heuristics audit:
RAW_ANALYZER (flipped in 0.3.2),
HYBRID_CANDIDATE_POOL
(inert knob — don't tune),
PROSE_HEADING_DEFAULT
(+7pt at typical budgets),
BM25_SOURCE_FIELD (+4pt with
signal, 0pt with noise),
CODE_NEIGHBORS_DEFAULT
(budget-dependent compromise).
Cross-binding parity tests for the two new kwargs (Python
test_loaders.py, Node smoke.cjs).

Changed

No default values changed in this release. All defaults remain
what they were after 0.3.2 — the new kwargs default to the existing
Rust values (code_neighbors_default=1, prose_heading_default=true).
Existing callers see zero behavior change; the new kwargs are an
opt-out / tune surface only.

Assets 2

07 Jun 17:18

github-actions

v0.3.0

d41ba84

v0.3.0

The workflow + measurement release. Ships a new public-API surface
that closes the templated-workload retention gap end-to-end, in all three
bindings (Rust, Python, Node): analyze_query_set, the QueryRewrite
trait with two built-in implementations (Stripper and Vocabulary),
Document::context_with_rewrites(...) to compose them with an audit
trail, Vocabulary::enrich(...) as the chunk-side mirror, and evaluate
for deterministic A/B with no LLM judge. On the CUAD framework comparison
the full detect → compile → context_with_rewrites → A/B workflow takes
≥0.8 retention from 81.3% → 90.7% — a 9.4-point lift over raw BM25,
beating LlamaIndex's 86% by 4 points, at native BM25 latency (~2.5ms/query)
on default lexical retrieval. Worked example, hand-curated CUAD clause-name
dictionary, and a 6-arm probe contrasting the workflow vs hybrid+cross-encoder
live in docs/findings/CUAD_CLAUSE_EXPANSION.md and
docs/findings/CUAD_HYBRID_RERANK.md.

Vocabulary.enrich(...) ships with bidirectional measured evidence on the
regime rule it follows. Positive side: docs/findings/SPIDER_ENRICH.md
measured +0.19 mean column recall on Spider-shape schema retrieval (curated
workload synonyms; n=30, candidate_k=10). Negative side:
docs/findings/CUAD_ENRICH_DEFINITIONS_NULL.md measured −2.0 pts on
CUAD prose chunks. The two findings together complete the four-corner
rule with measured evidence on all four corners: workload-pervasive
signal manipulation fails on either side of the pipeline; only
workload-curated semantics work. See docs/findings/VOCABULARY_ENRICH.md
for the regime rule, use-case ranking, and failure modes.

Breaking on the manual-chunks path (Python + Node): the typed
redhop.Chunk(text, *, source=None, id=None, metadata=None, ...) constructor
becomes the only accepted input shape for Document.from_chunks and the
low-level build_context / filter_context / analyze_context /
context_economics entry points. Bare strings and dicts both raise
ValueError with a migration hint pointing at the new constructor. The
trade-off is intentional: the dict path didn't expose chunk metadata at all,
so manually-constructed chunks couldn't carry page / heading / line
into citations — a real functional gap, not just ergonomics. The typed
Chunk closes that gap and surfaces source (provenance) and id
(identity) as the two distinct concepts they already are in the Rust core
(see Breaking below for the migration).

Added

Templated-workload helpers (Rust + Python + Node)

analyze_query_set(queries) → QuerySetReport — diagnostic that takes
a representative sample of your queries and reports whether they share
enough boilerplate to be templated, which terms are doing the dilution,
and a coarse estimated_dilution_cost band. Cross-workload probe
(docs/findings/QUERY_SET_ANALYZER.md): CUAD fires (share 0.66, cost
high); HotpotQA + MuSiQue both stay quiet (0.00 and 0.12, both
is_templated=False). Conservative by design — false positives push
users toward a workaround that won't help, which is worse than staying
quiet.
QueryRewrite trait + Stripper + Vocabulary — compiled,
observable, token-level-correct replacement for the function-form
rewrites originally drafted for this release. Each QueryRewrite
implementation returns a RewriteResult { query, record } so every
stage's {stage, from, to, matched, added, removed} lands on
ContextReport::query_rewrites automatically when called through
the chain.
- Stripper::new(boilerplate) — compiled boilerplate-removal
  rewrite. Matches at token granularity through the analyzer (with a
  surface-form fallback for tokens like "of"/"the" that stem to
  empty), so a single-token strip cannot accidentally erase a
  substring inside a longer word (an "of" strip does not erase
  the "of" inside "office"). Replaces the substring-based
  drop_template_terms function originally drafted for 0.3.0.
- Vocabulary::new(entries) / Vocabulary::bidirectional(entries)
  — compiled workload-curated equivalence classes. Tokenizes keys,
  synonyms, and the query through the same analyzer the BM25 index
  uses, so a vocabulary key "ip" cannot fire on the "ip" inside
  "recipient". Bidirectional mode treats every class member as a
  trigger (PTO ↔ "paid time off" ↔ "vacation"). The CUAD probe
  (docs/findings/CUAD_CLAUSE_EXPANSION.md) shows +3.0 points on top
  of the template-stripped baseline (the new token-level matching
  re-validates at 90.7% vs the substring-based predecessor's 90.3%
  — same workload, +0.4 from analyzer alignment).
- Document::context_with_rewrites(query, &[&stripper, &vocab])
  — runs the chain left-to-right through retrieval. Each stage sees
  the previous stage's output; the per-stage RewriteRecords land on
  ctx.report.query_rewrites automatically.
- Future-extensible. Both Stripper and Vocabulary are
  QueryRewrite implementations; user code can ship its own (e.g. a
  workload-specific normalizer) and chain it alongside the built-ins.
  The trait is exported on the public API surface.
- Vocabulary::enrich(chunk) → RewriteResult — chunk-side
  mirror of apply shipped as a primitive on mechanism reasoning
  with asymmetric measured evidence. The mechanism (a chunk-side
  doc2query variant) and the regime hypothesis
  (expected value ∝ shortness × opacity × dictionary-exists) are
  well-grounded; the positive prediction (short opaque coded
  units — schema columns, API symbols, error codes) is not yet
  measured by RedHop. Spider/BIRD as the schema-regime probe is
  queued, not run. The negative prediction (long prose chunks
  - workload-pervasive vocabulary will dilute, not help) has been
    measured directly:
    CUAD_ENRICH_DEFINITIONS_NULL
    regressed retention −2.0pt vs the 90.7% workflow baseline
    (~24-point loss on the 17/50 affected contracts). This completes
    the four-corner rule from CUAD_PRF_NULL + SUB_IDF_AUTO_DROP_NULL
    onto the chunk side: workload-pervasive signal manipulation fails
    on either side of the pipeline. Users adopting enrich should
    A/B on their own corpus with redhop::evaluate(...) —
    the regime rule is a hypothesis, not a guarantee. Audit trail
    (per-chunk RewriteRecord with stage: "enrich") returned to
    the caller so the A/B is auditable. Synthetic demo (not a
    benchmark): crates/examples/examples/enrich_code_search.rs.
    Full asymmetric-evidence framing + use case predictions + failure
    modes in docs/findings/VOCABULARY_ENRICH.md.
evaluate(query, ctx, gold) → EvalReport — in-process retrieval-eval
scorer, no LLM judge. Self-eval (mean_grounding, evidence_density,
retained_evidence_ratio, second_hop_rescues, low_confidence,
estimated_waste_tokens) is always populated; gold-relative metrics
(context_recall, context_precision, answer_token_recall) are
optionally unlocked by passing gold_chunks and/or gold_answer.
Composite overall blends whichever fields are available. Designed as
a refraction of the same primitives the runtime uses to make its
Decision Report — a low overall and report.low_confidence_retrieval
are the same signal viewed twice, not independent measurements, so eval
and runtime can never disagree. Rationale, contract details, and the
10 / 11 / 9 Rust / Python / Node tests pin in
docs/findings/EVALUATE_API.md.

Findings (the evidence layer)

New findings document what was tried, what worked, and what was
falsified across this release:

Confirmed — QUERY_SET_ANALYZER, CUAD_RECALL_GAP,
CUAD_CLAUSE_EXPANSION, MULTILINGUAL_ANALYZER, EVALUATE_API,
CUAD_HYBRID_RERANK (substitute-not-stack rule), VOCABULARY_ENRICH
(confirmed on both sides of the regime rule), SPIDER_ENRICH
(the positive-side validation for Vocabulary.enrich(...): curated
chunk-side enrichment on a Spider-shape sample lifted mean column
recall +0.19 from 0.77 → 0.97, ≥0.8 retention 63% → 93%).
Null result / falsified — CUAD_PRF_NULL (unweighted PRF on
boilerplate-heavy corpora), CUAD_CHUNK_FRAGMENTATION_NULL (chunker
isn't the CUAD lever), SUB_IDF_AUTO_DROP_NULL (corpus-only IDF
manipulation fails in both directions),
CUAD_ENRICH_DEFINITIONS_NULL (chunk-side enrich on per-contract
Definitions regressed −2.0 pts vs the 90.7% workflow baseline;
~24-point loss on the 17/50 contracts where Definitions were
extractable — chunk-side parallel to CUAD_PRF_NULL's failure mode,
measured directly).
The four-corner rule is now measured on all four corners.
Workload-pervasive signal manipulation fails on either side of the
pipeline; only workload-curated semantics work:
query-side curated wins (CUAD_CLAUSE_EXPANSION +3.0pt) ·
query-side auto fails (CUAD_PRF_NULL −3.7pt) ·
chunk-side curated wins (SPIDER_ENRICH +0.19 mean recall) ·
chunk-side auto fails (CUAD_ENRICH_DEFINITIONS_NULL −2.0pt).

Examples

Eleven new harnesses under crates/examples/examples/:
cuad_query_preprocessing, cuad_chunk_strategy_sweep,
cuad_chunk_fragmentation, cuad_clause_expansion, cuad_hybrid_rerank,
cuad_perf, cuad_prf, cuad_rust_vs_python_path,
multilingual_query_set_probe, query_set_analyzer_probe,
sub_idf_reweighting_probe.

Documentation

New workflow-lift chart .github/workflow_lift.svg embedded in the
root README + binding READMEs — surfaces the 81 → 88 → 90.7% story
visually.
Root README, python/README.md, nodejs/README.md "Templated
workloads" section rewritten to detect → strip → (optional) vocabulary →
A/B with Stripper / Vocabulary / context_with_rewrites tabled.
docs/CHOOSING_A_CONFIG.md step 3 leads with the new "two paths...

Assets 2

06 Jun 12:33

github-actions

v0.2.2

9be8a18

v0.2.2

The binding parity + evidence layer release. No breaking changes for any
binding's callers. The Node binding gains 14 missing Report fields, the
documentation gets its first visual presentation (badges, charts,
architecture diagram), and the evidence layer grows by five new findings
that document what was tried, what worked, and what was falsified honestly.

Added

Node binding — full `Report` field-surface parity with Python

Report gains 14 fields + a permanent alias: strategy,
requestedStrategy, inputTokens, tokenBudget, tokenUtilization,
nInputChunks, nSelected, inputDistractorRatio,
reasoningPreservationDelta, distractorsPruned, removedTotal,
evidenceDensity, distractorRatio, estimatedWasteTokens, plus
secondHopRescueCount (== secondHopRescues, the existing short
name; both names will always be present and equal). Before this
release Node's Report exposed roughly half of Python's surface —
programmatic callers using report.totalTokens could not read
report.nSelected or any of the economics fields. All additions are
non-breaking; no existing field changed name or shape.
docs/API_STABILITY.md gains a "Known call-shape asymmetries"
section documenting the two pre-existing idiomatic differences
between the Python and Node bindings (from_text positional vs
options-bag source; ctx.text() callable in Python vs ctx.text
property in Node). These are stable within 0.x — neither will be
silently flipped.

README + binding-page presentation

Multi-registry badges at the top of every README (root, Python,
Node). PyPI / crates.io / npm version numbers, license, and a link
to the evidence layer. Brand color (#e11d48) on the registry
badges.
A retention-vs-frameworks bar chart (.github/retention_vs_frameworks.svg)
showing the measured head-to-head numbers from
FRAMEWORK_COMPARISON.md:
HotpotQA multi-hop (RedHop 77%, LangChain 71%, LlamaIndex 72%) and
CUAD contracts (RedHop 82%, LangChain 73%, LlamaIndex 86%). Hand-rolled
SVG, no fake screenshots, every number traces to
reports/framework_comparison.txt.
A pipeline architecture diagram (.github/architecture.svg)
showing the five stages — Document → Chunking → Retrieval →
Allocation → BuiltContext — with the calibrating finding named under
each internal stage and a "YOU BRING / REDHOP OWNS / YOU GET" scope
label.
A Decision Report visual (.github/decision_report.svg) —
terminal-styled SVG rendering of ctx.report output. Same content
as the ASCII block (which is preserved under a collapsed <details>
for copy-paste).
A References section at the bottom of the root README, citing
the named work each piece of the runtime leans on: BM25 (Robertson &
Zaragoza 2009), Porter2 (Porter 2001), RRF (Cormack et al. 2009),
Lost-in-the-Middle (Liu et al. 2023), NQC (Shtok et al. 2012), MDR
(Xiong et al. 2021), and the HotpotQA/MuSiQue/CUAD evaluation
datasets. Each citation links to the finding doc that uses that
work.
The binding READMEs (python/README.md, nodejs/README.md) get
the architecture + Decision Report visuals via absolute
raw.githubusercontent.com URLs so they render on PyPI and npm
package pages (not just on GitHub).

Structural test suite (`crates/redhop/tests/`)

proptest_invariants.rs — 9 property-based invariants
pinning build_context's behavior under random valid inputs:
build_context_never_panics, resolved_strategy_is_never_auto,
auto_decision_triangle_holds, selection_is_subset_of_input,
no_duplicate_chunk_ids_in_output, token_budget_respected,
report_counts_match_reality, report_ratios_are_finite_and_in_range,
build_context_is_deterministic. Adds proptest = "1" as a
dev-dependency. Catches edge cases hand-written tests miss by
construction (empty input, NaN scores, all-stopword query, single-token
budget, …) — the bug class structural tests exist to close off.
default_calibration.rs — 9 pins binding each tuned default
on ContextConfig::default() / DocumentConfig::default() to the
finding that calibrated it (token_budget = 8192,
auto_passthrough_max_tokens = 1500, distractor_min_grounding = 0.10,
redundancy_max_cosine = 0.92, link_min_jaccard = 0.12,
low_confidence_max_grounding ≡ distractor_min_grounding,
target_tokens = 128, candidate_k = 20, Document.strategy = Auto).
Silent default drift becomes impossible; intentional drift is
documented in the same commit that makes it.
public_api_snapshot.rs — 5 compile-time guards against silent
rename/removal of public symbols, plus type-bound signature pins for
load-bearing functions and exhaustive-match pins on
ContextStrategy / AutoDecision / RetrievalMethod string
variants. Caught 4 real signature mistakes during authoring.
golden_quality.rs — 6 end-to-end retrieval-quality canaries
on small inline corpora: G01 lexical-keyword match, G02 stemming
finds morphological variants, G03 distractor doesn't dominate
relevant chunk, G04 second-hop chunk survives default assembly, G05
low-confidence signal fires on off-corpus queries, G06 citations
carry correct source. The canary that fires between benchmark runs
when RedHop "got dumber" on a real query shape.
Three field-set parity tests (test_report_field_surface_parity,
test_built_context_field_surface_parity,
test_context_economics_field_surface_parity in
python/tests/test_parity_node.py). Each compares the SET of fields
each binding exposes for the named return type. Auto-catches the gap
class where a new #[getter] (Python) or pub field (Node) appears
on one side without the other keeping up — the failure mode that hid
the 14-field Node Report gap until a smoke test stumbled on
strategy.

Evidence layer — five new finding documents

MUSIQUE_RECALL_GAP.md —
decomposes the dense recall gap between HotpotQA (0.76) and MuSiQue
(0.28) into five distinct contributors (gold density, retrieval signal
type, wide-net coverage, embedder capacity, chunking) and documents an
attempted full-pool RRF refactor of RetrievalMode::Hybrid that an
honest A/B benchmark falsified. Branch feature/hybrid-full-pool-rrf
on origin holds the working refactor as a research record; main keeps
the existing Hybrid behavior. Includes 5 reproducible example
harnesses under crates/examples/examples/musique_*.rs and
hybrid_old_vs_new.rs (the A/B that closed the question).
RERANKING_LIMITS.md Update —
2026-06-06 (kind-label gate) — falsifies both directions of the
HotpotQA-type-label gate proposed in the original finding's "open
problem" section. Closes that probe.
RERANKING_LIMITS.md Update —
2026-06-06 (later, grounding gate) — documents the discovery and
cross-corpus falsification of a grounding_top1 ≤ 0.35 gate that
worked on HotpotQA (+0.031 lift, robust to 5-fold CV) but failed to
generalize to MuSiQue. Also covers an NQC + WIG cross-corpus probe
that didn't port. Closes the CE-gate research direction with a
measured negative result. Includes the Phase A feature-logging
harness (crates/examples/examples/ce_gate_feature_log{,_musique}.rs)
for any future probe.
DENSE_RERANK_CEILING.md
Update — 2026-06-06 — falsifies MDR single-pass as a uniform policy
(−0.05 vs dense baseline) while documenting a real +0.027 lift on the
subset of queries where dense had a gold in the pool but missed it.
Closes the single-shot MDR probe.
LOCAL_RERANK.md Update —
2026-06-06 — notes the status of LocalRerankRetriever after the
MuSiQue investigation: it is now a building block rather than the
default Hybrid, but the "semantic recall without ANN" contract it
established is intact and the working refactor is preserved on
feature/hybrid-full-pool-rrf for future re-evaluation.

Changed

python/tests/test_parity_node.py now pins strategy +
requested_strategy data-value parity (in addition to the field-set
parity above). The harness previously normalized away these fields
rather than testing them — direct dict-key access means a future
regression that drops either field on either side fails with a clear
KeyError instead of silently passing.
Documentation polish: python/README.md's from_text row shows the
optional source= parameter; nodejs/README.md lists the full
report shape; the top-level README's "hybrid" row in the
Retrieval tiers table accurately reflects the shipped semantics.

Notes on the runtime

RetrievalMode::Hybrid is unchanged for this release. A full-pool
RRF refactor was built end-to-end on feature/hybrid-full-pool-rrf
(commit c81ffbe, all tests passing, fmt + clippy clean) but a direct
A/B benchmark falsified the ship decision: at the user-facing
candidate_k = 20 the new composition gave only +0.0074 on MuSiQue
and +0.0017 on HotpotQA (both below the +0.02 pre-registered ship bar)
and regressed HotpotQA at K=4 by −0.011. The wide-K wins are real
(+0.07 MuSiQue@50, +0.034 HotpotQA@50) but not user-facing.
LocalRerankRetriever's BM25-prune-then-RRF composition is still
what RetrievalMode::Hybrid resolves to. Full A/B numbers and the
ship-decision audit are in
MUSIQUE_RECALL_GAP.md.

Assets 2

06 Jun 02:20

github-actions

v0.2.1

f2945af

v0.2.1

The robustness + bugfix patch release. Two real bugs fixed (one BM25
edge case, one cross-binding serde-compat break), one new Python helper,
and ~30 new tests pinning load-bearing contracts across the codebase.

Fixed

BM25: silent wildcard fallback on no-signal queries. Queries whose
every term was filtered out (stopwords only, or all-out-of-vocab)
silently fell back to a match-all wildcard, returning the corpus's
top-BM25 chunks as if the query had matched something. Now returns
an empty result set with a clear signal.
ContextReport.removed and .economics missing #[serde(default)].
A binding payload from an older RedHop binary missing these fields
would error on deserialize — a silent cross-version compatibility break
for Python/Node callers shuttling ContextReport across the FFI as
JSON. Both target types already derive Default; the fix is a no-op
for fresh payloads and gracefully fills in zeros for old ones.

Added

redhop.context_with_timeout (Python). Thin ThreadPoolExecutor
watchdog around Document.context() for agent integrations that need
to bail on slow queries:
```
try:
    ctx = redhop.context_with_timeout(doc, q, timeout_ms=5000)
except TimeoutError:
    ...
```
Forwards budget / neighbors / include_heading. Scope is
deliberately Python-only — true Rust-side cancellation needs hooks in
Tantivy/ONNX that don't exist yet, and the docstring + TimeoutError
message document the limitation.
docs/DEFAULT_PROVENANCE.md — every tuned default in
ContextConfig / DocumentConfig linked back to the finding that
justifies it (so callers can audit which numbers are calibrated vs
arbitrary).

Internal — robustness tests

Seven new test passes (~30 tests) pinning load-bearing contracts that
were previously informal:

Determinism — same input → same output, Rust + cross-binding parity.
Internal invariants — 7+ consistency invariants across the strategy
matrix (selected ⊆ input, removed.total matches drop count, etc.).
Concurrency — Send + Sync audit + 1024-call parallel stress.
Adversarial loaders — 9 tests covering corrupt PDFs, symlink loops,
deep recursion, malformed DOCX/PPTX/XLSX.
Auto-gate boundary — pins the inclusive <= semantics at
1499/1500/1501 input tokens + the custom-gate path.
Serde round-trip — every cross-FFI type (Chunk, Score,
ContextReport, ...) survives JSON round-trip; forward-compat
exercised via a minimal pre-0.1.3 payload.
Strategy semantics — 7 differential tests pinning the contrasts
between all 5 ContextStrategy variants on a shared corpus
(catches accidental strategy convergence).
Persisted cache — incremental cache hit/miss contract for
read_folder_with(persist=true): per-file (mtime, size) skip,
no-op reload doesn't rewrite, fingerprint invalidation on config
change, deleted-file cleanup.

No public API changes. Python and Node callers are unaffected aside
from the new context_with_timeout helper.

Assets 2

03 Jun 13:34

github-actions

v0.2.0

9f8664a

v0.2.0

The binding-parity + non-English release. Three months of incremental
quality work plus a focused arc on cross-binding consistency: Python, Node,
and Rust all expose the same surface, return the same values for the same
inputs, and drift is now actively prevented in CI. The Rust crate also gains
a pluggable lexical analyzer, closing the structural bug class (BM25 ↔
grounding-scorer disagreement) that 0.1.3–0.1.4 fixed by hand four times.

Breaking changes

Two source-level breaks for Rust callers; ..Default::default() and the
pip/npm consumers are unaffected:

ContextConfig + DocumentConfig grew new required fields
(analyzer: Arc<dyn Analyzer>) for the pluggable lexical analyzer.
Callers constructing those structs via field literals from outside
the crate need to add analyzer: redhop::analyzer::default_english().
ContextConfig::default().token_budget changed from 2048 → 8192
to align with the Python binding's long-standing default (which was
shipping to PyPI users that whole time). Rust callers relying on the
old 2048 default will now get a 4× larger assembled context. Set
token_budget: 2048 explicitly to restore the old behavior. Python +
Node users see no change.

Added

Pluggable lexical analyzer

crate::analyzer::Analyzer trait + SnowballAnalyzer (18
Snowball Porter2 languages). First-class extension point: one analyzer
drives BOTH the BM25 retriever AND the grounding scorer, so the two
layers structurally cannot disagree on what "the same term" means.
Design rationale in docs/design/ANALYZER_PLUGIN.md; usage in
docs/LANGUAGE.md.
Document::with_analyzer(Arc<dyn Analyzer>) — mirrors
with_embedder. Swaps the analyzer for both layers in lockstep.
LoadOptions::language: Option<String> — string-routed access to
the 18 builtins (english, german, french, spanish, italian,
portuguese, dutch, russian, swedish, norwegian, danish,
finnish, romanian, hungarian, turkish, arabic, greek,
tamil). Unknown language names return an error (no silent fallback
to English).
Python language kwarg on every Document.from_* constructor.
Node language field on Options.

Binding parity (Node catches up to Python)

Document.analyze(query) — pure diagnostics, returns the same
Report shape as context().report without paying assembly cost.
Document.nFiles getter — number of source files indexed (1
for single-source ctors, the readable count for fromFolder).
Document.skippedFiles getter — SkippedFile[] ({source, reason} pairs) for files fromFolder couldn't parse. Was a silent
skip with no introspection before.
buildContext / filterContext / analyzeContext /
contextEconomics top-level functions — the low-level "I do my own
retrieval, just want RedHop for assembly" surface. Mirrors Python's
same-named functions; takes ChunkInput[] + ContextOptions.
groundingScore(query, text) + linkStrength(a, b) — the
observability primitives the strategies use internally, exposed so
external code reuses RedHop's exact relevance notion instead of
reimplementing.

Tests + infrastructure

crates/redhop/tests/quality_suite.rs — 45-test behavior-level
suite organized by what a user perceives, not by code structure.
Covers tokenization (T01-T07), multi-field reach (T08-T09), document
structure (T10-T13), context assembly (T14-T20), hybrid contract
(T21-T22), edge cases (T23-T26), Unicode/multilingual (T27-T30),
adversarial queries (T31-T34), nested markdown (T35), cross-format
mixed corpus (T36), non-English pinning (T37-T40), and the analyzer
plugin (T41-T45). Found two real bugs on its first runs (an
empty-query BM25 crash and an accent-folding gap), and a binding bug
via T41-T44 (from_chunks silently dropping language= in Python).
python/tests/test_parity_node.py + nodejs/test/parity_runner.cjs
— cross-binding parity harness. 6 tests hand identical inputs to
Python and Node and diff structured outputs (caught the
analyzeContext / contextEconomics token_budget divergence on
its first run).
crates/cli/tests/cli_smoke.rs — first-ever CLI integration
tests. Asserts --help works on each subcommand + a real
analyze-context - stdin pipe.
Node CI job — .github/workflows/ci.yml now builds the napi
addon and runs npm test on PRs. Previously PRs only exercised
Rust + Python.
ASCII folding (café ↔ cafe, Süßigkeit ↔ Sussigkeit,
naïve ↔ naive) in both BM25 and the grounding scorer (via NFKD).
New tests T27, T28, T39 pin this.

Documentation

docs/LANGUAGE.md — honest scope of non-English support, by
family + the Analyzer plugin's public API (Rust / Python / Node).
docs/design/ANALYZER_PLUGIN.md — rewritten to describe the
shipped surface (was originally a proposal with several deviations).
README "Language support" section + per-package READMEs
(python/README.md, nodejs/README.md) — language= examples.
docs/ARCHITECTURE.md — refreshed against the post-consolidation
workspace (the pre-0.2 split of redhop-{core,context,…} into
separate crates was rolled into one published redhop crate; diagram
and crate-name references updated).
docs/API_STABILITY.md — full Node section added; Python section
updated with language=, n_files, skipped_files; Rust section
updated with the consolidated module paths.

Changed

Python folder walker unified with Rust's read_folder_with —
−429 LOC in python/src/lib.rs (≈25% of the file). Removed the
parallel build_folder_persisted, collect_files, PersistedIndex,
CachedFile, fingerprint, etc. Both bindings now share Rust's
single implementation; on-disk index format is byte-compatible with
the previous Python writer, so existing caches reload cleanly.
strategy_from_str + retrieval_from_str consolidated to a
single source of truth in redhop::load. Python's wrappers now
forward to the Rust functions with map_err instead of duplicating
the match arms.
Document carries n_files() and skipped_files() accessors on
the Rust struct. Single-source constructors default to 1 / empty;
read_folder_with (both simple and persisted paths) now records
(source, reason) for each skipped file instead of silently dropping
them.
MSRV bumped 1.75 → 1.77 across all three workspace declarations
(workspace, python/Cargo.toml, nodejs/Cargo.toml) — the napi-rs
2.x in the Node binding sets the actual floor; the inconsistency
meant a 1.75 user hit a mysterious napi error instead of a clear MSRV
one.

Fixed

All-stopword query no longer crashes BM25. A query the analyzer
pipeline reduces to zero positive terms ("", " ", "the and is of in or") used to surface as a hard Tantivy error (Invalid query: Only excluding terms given). The retriever now traps that error
class (and the empty query class) and returns an empty result.
Caught by quality_suite::t25 on its first run.
Python Document.from_chunks silently dropped language= — the
pyo3 signature accepted the kwarg but the call into doc_config
passed None instead of the user's value. Caught by the new Python
analyzer test suite on its first run.
Node analyzeContext / contextEconomics were honoring the
user's token_budget option; Python's equivalents hardcode
usize::MAX because these are no-budget pure-analysis surfaces.
Caught by the cross-binding parity tests on their first run.
Node index.d.ts was stale — the language field and
minCandidates field were present on the Rust Options struct but
hadn't been regenerated. TypeScript users got "Object literal may
only specify known properties" on perfectly valid options.

Notes

unicode-normalization promoted from transitive (via tantivy) to a
direct dep of redhop. Used for the grounding scorer's NFKD fold.
Workspace test count: 320/320 (Rust) + 81/81 (Python, +1 BGE
fixture skip) + Node smoke + analyzer suites. Was 260 at the v0.1.4
tag.
CI gates: cargo fmt --all -- --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace, cargo doc --workspace --no-deps --features files,semantic (warning-free), the
cross-binding parity suite, and the Node smoke + analyzer suites.
All six CI jobs green.
[package.metadata.docs.rs] all-features = true added to the
redhop crate so the published doc page on docs.rs shows the
files + semantic items instead of just the lean lexical surface.
21 example files swept clean of hardcoded /Users/vysakh/... paths;
they resolve datasets/models/exports through
redhop_examples::{data_path, exports_path, model_path, bge_small_paths, ms_marco_paths} helpers that honor
REDHOP_{DATA,EXPORTS,MODELS}_DIR env vars.

Assets 2

02 Jun 10:36

github-actions

v0.1.4

68255f0

v0.1.4

Citation ergonomics — for both code and prose. Follow-on to 0.1.3's
BM25-quality theme: retrieval is now precise, but the assembled context
returned to the LLM left the user staring at a def line without the
implementation, or at a deep section paragraph without its parent heading.
Both gaps closed by default; both have explicit opt-out knobs. Plus three
prose-side fixes that surfaced during the audit (setext headings, PDF
heading heuristic, plumbing).

Changed

Document::context(query) on a code chunk now attaches ±1 neighbor
chunks by default. Code is chunked as fixed-token windows so a 50-line
function often spans 2-3 chunks; a hit on the chunk containing the def
line would previously cite only the signature, omitting the body in the
next chunk. With DocumentConfig::code_neighbors_default = 1
(the new default), citations on code hits include the surrounding
implementation. Behavior change for code-shaped corpora — set
code_neighbors_default: 0 to restore the old chunk-only behavior. No
effect on prose corpora (fires only on chunks tagged
metadata["kind"] == "code").
Document::context(query) on a prose chunk with a section heading
now attaches the section's opener chunk by default. A query that
lands deep inside ## Refunds → ### Eligibility previously cited only
the matched chunk — the LLM lost the section title. With
DocumentConfig::prose_heading_default = true (the new default), the
section's first chunk is attached. Behavior change for hierarchical
prose — set prose_heading_default: false to disable. Only fires on
chunks that carry non-empty metadata["heading"] (markdown, DOCX,
PPTX, XLSX, and — new in this release — PDF).
Markdown sections now recognize setext headings (Title\n=====
for H1, Title\n----- for H2) in addition to ATX (#/##/…). YAML
frontmatter (--- ... --- at file start) is detected and excluded
from setext scanning so its closing fence doesn't get treated as an
H2 underline. Pandoc output / older docs / man pages now produce the
same section structure as their ATX equivalents.
PDF chunks now carry best-effort heading metadata. A per-page
heuristic lifts the first short, non-paragraph-shaped line into
Section::heading (rejecting page-number footers, body lines ending
in sentence punctuation, and lines ending in a digit). Lets the BM25
heading-field search added in 0.1.3 actually reach PDF chunks by
topic; previously metadata["heading"] was always None on PDFs.

Added

DocumentConfig::code_neighbors_default: usize (default 1).
DocumentConfig::prose_heading_default: bool (default true).
Both inherited via the Python / Node bindings' default config; no
binding-surface change for callers who don't override.

Fixed

(No code-bug fixes — 0.1.3 already closed the BM25 quality gaps. See
the Notes section for the verified-not-broken embedding-persistence
story.)

Notes

Embedding persistence verified. The 0.1.3 audit suspected that
read_folder_with(persist=true) re-embedded every chunk on reload
(paying ~30-60 sec of bge-small cost per cold start on a 1000-chunk
codebase). The machinery is already correct: embedded_chunks()
populates the Chunk::embedding field from the retriever cache before
writing index.json, Embedding is Serialize/Deserialize, and
LocalRerankRetriever::index short-circuits any chunk that comes back
with an embedding already set. Round-trip test
(crates/redhop/tests/embedding_persistence.rs) now pins this — a
reload triggers exactly 1 embed call (the query), not N+1 (the query +
every chunk). Locked in as a regression guard.
Eleven new tests across the citation-ergonomics theme: 3 for the code
neighbor default, 3 for the prose heading default, 3 for setext
headings + frontmatter handling, 2 for the PDF heading heuristic.
111/111 tests pass under
cargo test -p redhop --features files.

Assets 2

Releases: vysakh0/redhop

v0.3.3

Added — eval surface

Added — eval evidence

Breaking — Rust only

Added — defaulted-on heuristics audit

Changed

Uh oh!

v0.3.0

Added

Templated-workload helpers (Rust + Python + Node)

Findings (the evidence layer)

Examples

Documentation

Uh oh!

v0.2.2

Added

Node binding — full Report field-surface parity with Python

README + binding-page presentation

Structural test suite (crates/redhop/tests/)

Evidence layer — five new finding documents

Changed

Notes on the runtime

Uh oh!

v0.2.1

Fixed

Added

Internal — robustness tests

Uh oh!

v0.2.0

Breaking changes

Added

Pluggable lexical analyzer

Binding parity (Node catches up to Python)

Tests + infrastructure

Documentation

Changed

Fixed

Notes

Uh oh!

v0.1.4

Changed

Added

Fixed

Notes

Uh oh!

Node binding — full `Report` field-surface parity with Python

Structural test suite (`crates/redhop/tests/`)