Releases: vysakh0/redhop
v0.3.3
Answer-quality eval surface (Rust + Python + Node) + audit of
defaulted-on heuristics. Two threads in this release:
(1) a new in-process evaluate(...) / critique(...) surface for
closed-set answer-quality metrics — lexical + LLM-judged, with
claim-decomposed faithfulness and TP/FP/FN correctness; calibrated
against Ragas at n=200 HotpotQA (r=+0.664, MAE=0.151);
(2) the defaulted-on heuristics audit (five measured, one
already-flipped in 0.3.2, two API smells fixed).
Added — eval surface
evaluate(query, ctx, answer=, gold_answer=, judge=, decompose_faithfulness=, decompose_correctness=)—
in-process answer-quality eval that returns oneEvalReportblending
lexical (CI-deterministic) and LLM-judged metrics. Available in Rust
(redhop::evaluate(...)), Python (redhop.evaluate(...)), and Node
(evaluateWithJudge(...)for the async judge path). Faithfulness,
relevancy, correctness in_lexical(no LLM) and_judged(opt-in)
flavors; gold-relative metrics (context_recall,
context_precision,answer_token_recall) whengoldis provided.critique(answer, aspects, judge=, context=, query=)— open-ended
user-defined dimensions (harmfulness, conciseness, brand voice,
etc.). One LLM call per aspect; polarity-corrected scores so high =
good across the report regardless ofhighIsGood. Returns a
CritiqueReportwith per-aspect scores in input order.summarize(reports)— aggregates a sequence of per-case
EvalReports into a means + N + share-flagged summary, the same
shape RedHop's runtime uses for its Decision Report.- Judge surface —
Judge.from_callable(fn).cached()(Python),
Judge.fromCallable(fn, name).cached()(Node), and the Rust
Judgetrait withCachedJudgeandCallableJudgewrappers. One
caching layer for any user-supplied LLM caller; an LRU sized by
the caller. Single primitive supports faithfulness, relevancy,
correctness, critique, and decomposed paths. - Claim-decomposed faithfulness (
decompose_faithfulness=True):
extracts atomic claims via a few-shot LLM call, then batch-verifies
all of them in a single second LLM call. Two LLM calls regardless
of how many claims were extracted.gpt-4o-minicorrelates with
Ragas's faithfulness at r=+0.664 on n=200 HotpotQA (see
docs/findings/EVAL_JUDGED_CALIBRATION.md).
Verifier prompt includes paraphrase-positive examples + negative
entity-substitution examples to balance strictness and recall. - TP/FP/FN correctness (
decompose_correctness=True): mirrors
decomposed-faithfulness on the answer-vs-gold axis. Extracts
claims from both the answer and the gold answer, classifies each as
TP / FP / FN, returns F₁. Diagnostic counters
(n_correctness_tp/fp/fn) surface the intermediate categorisation. - Refusal-aware decomposition — "I don't know" answers correctly
producemean_faithfulness_judged = None(0 claims extracted)
instead of being scored as a vacuous 1.0. Surfaces refusals as a
distinct category, not as faithfulness=1. - Diagnostic counters on
EvalReport:
n_faithfulness_claims_extracted,n_faithfulness_claims_supported,
n_correctness_tp,n_correctness_fp,n_correctness_fn. Surface
intermediate classifier counts so callers can debug WHY a metric
landed where it did.
Added — eval evidence
- docs/COMPARISON_RAGAS.md — public-facing
head-to-head with Ragas on claim-decomposed faithfulness. n=200
HotpotQA, gpt-4o-mini, with Claude haiku as third-judge tie-breaker. - docs/findings/EVAL_JUDGED_CALIBRATION.md
rewritten end-to-end: three-layer evidence (5-case wiring probe →
5-case Ragas side-by-side → n=200 HotpotQA correlation + third-judge
subset). Documents the v0→v4 prompt iteration that fixed four
traceable failure modes (paraphrase rejection, comparative
hallucination, compound-attribution dilution, wrong-entity
substitution). Calls out single-shot LLM noise as a measured
property of the workload (gpt-4o-mini at temp=0 is not
deterministic through OpenRouter — ~20–30% per-case variance). - docs/findings/ANSWER_QUALITY_EVAL.md —
full API tour for the newevaluate(...)+critique(...)surface. - docs/findings/EVAL_VS_RAGAS_SOURCE.md —
source-read comparison of the two libraries' implementations. bench/eval_correlation_hotpot.py— runs the n=200 Pearson r / MAE
measurement on HotpotQA against Ragas with configurable context mode
(supporting/distractor_only/all).bench/eval_third_judge.py— Claude haiku tie-breaker via the local
claude -pCLI; no API key needed.bench/eval_faith_trace.py— diagnostic harness for tracing claim
extraction + per-claim verifier votes on specific qids. Has a
--variant v0/v1/v2/v3/v4flag for prompt iteration without
rebuilding the Rust crate.bench/eval_judged_calibration.py— the 5-case wiring probe with
optional Ragas side-by-side when installed.bench/select_third_judge_subset.py— filters contested cases from
a correlation-bench JSON so the third-judge run stays cheap.
Breaking — Rust only
redhop::evaluate(...)signature now takes six parameters:
(query, ctx, answer, gold, judge, config)instead of the prior
three-parameter shape. The Python and Node bindings absorb this via
kwargs / options and are NOT breaking. PassNoneforanswerand
judgeandEvalConfig::default()forconfigto match the old
behavior.
Added — defaulted-on heuristics audit
code_neighbors_default/codeNeighborsDefault— surfaces the
±N adjacent-chunk auto-pull on code chunks as a constructor kwarg on
Pythonfrom_text/from_file/from_bytes/from_folder/from_chunks
and as a field on the NodeOptionsstruct. Default1(unchanged
behavior). Pass0to disable, or2/3for more aggressive
expansion under loose token budgets. See
docs/findings/CODE_NEIGHBORS_DEFAULT.md
for the measured budget tradeoff.prose_heading_default/proseHeadingDefault— surfaces the
auto-attach of section-heading chunks to prose hits as the same
constructor-level kwarg / option. Defaulttrue(unchanged). Pass
falsefor memory-tight workloads where the heading isn't
load-bearing. See
docs/findings/PROSE_HEADING_DEFAULT.md
for the measured +7pt ≥0.8 lift at typical budgets.crates/redhop/src/load.rs—LoadOptionsnow exposes
code_neighbors_defaultandprose_heading_defaultas
Option<usize>/Option<bool>. Threads throughread_folder_with
for parity with the in-memory loaders.- Audit finding docs. Five new findings on the defaulted-on
heuristics audit:
RAW_ANALYZER(flipped in 0.3.2),
HYBRID_CANDIDATE_POOL
(inert knob — don't tune),
PROSE_HEADING_DEFAULT
(+7pt at typical budgets),
BM25_SOURCE_FIELD(+4pt with
signal, 0pt with noise),
CODE_NEIGHBORS_DEFAULT
(budget-dependent compromise). - Cross-binding parity tests for the two new kwargs (Python
test_loaders.py, Nodesmoke.cjs).
Changed
- No default values changed in this release. All defaults remain
what they were after 0.3.2 — the new kwargs default to the existing
Rust values (code_neighbors_default=1,prose_heading_default=true).
Existing callers see zero behavior change; the new kwargs are an
opt-out / tune surface only.
v0.3.0
The workflow + measurement release. Ships a new public-API surface
that closes the templated-workload retention gap end-to-end, in all three
bindings (Rust, Python, Node): analyze_query_set, the QueryRewrite
trait with two built-in implementations (Stripper and Vocabulary),
Document::context_with_rewrites(...) to compose them with an audit
trail, Vocabulary::enrich(...) as the chunk-side mirror, and evaluate
for deterministic A/B with no LLM judge. On the CUAD framework comparison
the full detect → compile → context_with_rewrites → A/B workflow takes
≥0.8 retention from 81.3% → 90.7% — a 9.4-point lift over raw BM25,
beating LlamaIndex's 86% by 4 points, at native BM25 latency (~2.5ms/query)
on default lexical retrieval. Worked example, hand-curated CUAD clause-name
dictionary, and a 6-arm probe contrasting the workflow vs hybrid+cross-encoder
live in docs/findings/CUAD_CLAUSE_EXPANSION.md and
docs/findings/CUAD_HYBRID_RERANK.md.
Vocabulary.enrich(...) ships with bidirectional measured evidence on the
regime rule it follows. Positive side: docs/findings/SPIDER_ENRICH.md
measured +0.19 mean column recall on Spider-shape schema retrieval (curated
workload synonyms; n=30, candidate_k=10). Negative side:
docs/findings/CUAD_ENRICH_DEFINITIONS_NULL.md measured −2.0 pts on
CUAD prose chunks. The two findings together complete the four-corner
rule with measured evidence on all four corners: workload-pervasive
signal manipulation fails on either side of the pipeline; only
workload-curated semantics work. See docs/findings/VOCABULARY_ENRICH.md
for the regime rule, use-case ranking, and failure modes.
Breaking on the manual-chunks path (Python + Node): the typed
redhop.Chunk(text, *, source=None, id=None, metadata=None, ...) constructor
becomes the only accepted input shape for Document.from_chunks and the
low-level build_context / filter_context / analyze_context /
context_economics entry points. Bare strings and dicts both raise
ValueError with a migration hint pointing at the new constructor. The
trade-off is intentional: the dict path didn't expose chunk metadata at all,
so manually-constructed chunks couldn't carry page / heading / line
into citations — a real functional gap, not just ergonomics. The typed
Chunk closes that gap and surfaces source (provenance) and id
(identity) as the two distinct concepts they already are in the Rust core
(see Breaking below for the migration).
Added
Templated-workload helpers (Rust + Python + Node)
analyze_query_set(queries) → QuerySetReport— diagnostic that takes
a representative sample of your queries and reports whether they share
enough boilerplate to be templated, which terms are doing the dilution,
and a coarseestimated_dilution_costband. Cross-workload probe
(docs/findings/QUERY_SET_ANALYZER.md): CUAD fires (share 0.66, cost
high); HotpotQA + MuSiQue both stay quiet (0.00 and 0.12, both
is_templated=False). Conservative by design — false positives push
users toward a workaround that won't help, which is worse than staying
quiet.QueryRewritetrait +Stripper+Vocabulary— compiled,
observable, token-level-correct replacement for the function-form
rewrites originally drafted for this release. EachQueryRewrite
implementation returns aRewriteResult { query, record }so every
stage's{stage, from, to, matched, added, removed}lands on
ContextReport::query_rewritesautomatically when called through
the chain.Stripper::new(boilerplate)— compiled boilerplate-removal
rewrite. Matches at token granularity through the analyzer (with a
surface-form fallback for tokens like "of"/"the" that stem to
empty), so a single-token strip cannot accidentally erase a
substring inside a longer word (an"of"strip does not erase
the"of"inside"office"). Replaces the substring-based
drop_template_termsfunction originally drafted for 0.3.0.Vocabulary::new(entries)/Vocabulary::bidirectional(entries)
— compiled workload-curated equivalence classes. Tokenizes keys,
synonyms, and the query through the same analyzer the BM25 index
uses, so a vocabulary key"ip"cannot fire on the"ip"inside
"recipient". Bidirectional mode treats every class member as a
trigger (PTO ↔ "paid time off" ↔ "vacation"). The CUAD probe
(docs/findings/CUAD_CLAUSE_EXPANSION.md) shows +3.0 points on top
of the template-stripped baseline (the new token-level matching
re-validates at 90.7% vs the substring-based predecessor's 90.3%
— same workload, +0.4 from analyzer alignment).Document::context_with_rewrites(query, &[&stripper, &vocab])
— runs the chain left-to-right through retrieval. Each stage sees
the previous stage's output; the per-stageRewriteRecords land on
ctx.report.query_rewritesautomatically.- Future-extensible. Both
StripperandVocabularyare
QueryRewriteimplementations; user code can ship its own (e.g. a
workload-specific normalizer) and chain it alongside the built-ins.
The trait is exported on the public API surface. Vocabulary::enrich(chunk) → RewriteResult— chunk-side
mirror ofapplyshipped as a primitive on mechanism reasoning
with asymmetric measured evidence. The mechanism (a chunk-side
doc2query variant) and the regime hypothesis
(expected value ∝ shortness × opacity × dictionary-exists) are
well-grounded; the positive prediction (short opaque coded
units — schema columns, API symbols, error codes) is not yet
measured by RedHop. Spider/BIRD as the schema-regime probe is
queued, not run. The negative prediction (long prose chunks- workload-pervasive vocabulary will dilute, not help) has been
measured directly:
CUAD_ENRICH_DEFINITIONS_NULL
regressed retention −2.0pt vs the 90.7% workflow baseline
(~24-point loss on the 17/50 affected contracts). This completes
the four-corner rule from CUAD_PRF_NULL + SUB_IDF_AUTO_DROP_NULL
onto the chunk side: workload-pervasive signal manipulation fails
on either side of the pipeline. Users adoptingenrichshould
A/B on their own corpus withredhop::evaluate(...)—
the regime rule is a hypothesis, not a guarantee. Audit trail
(per-chunkRewriteRecordwithstage: "enrich") returned to
the caller so the A/B is auditable. Synthetic demo (not a
benchmark):crates/examples/examples/enrich_code_search.rs.
Full asymmetric-evidence framing + use case predictions + failure
modes indocs/findings/VOCABULARY_ENRICH.md.
- workload-pervasive vocabulary will dilute, not help) has been
evaluate(query, ctx, gold) → EvalReport— in-process retrieval-eval
scorer, no LLM judge. Self-eval (mean_grounding,evidence_density,
retained_evidence_ratio,second_hop_rescues,low_confidence,
estimated_waste_tokens) is always populated; gold-relative metrics
(context_recall,context_precision,answer_token_recall) are
optionally unlocked by passinggold_chunksand/orgold_answer.
Compositeoverallblends whichever fields are available. Designed as
a refraction of the same primitives the runtime uses to make its
Decision Report — a lowoverallandreport.low_confidence_retrieval
are the same signal viewed twice, not independent measurements, so eval
and runtime can never disagree. Rationale, contract details, and the
10 / 11 / 9 Rust / Python / Node tests pin in
docs/findings/EVALUATE_API.md.
Findings (the evidence layer)
New findings document what was tried, what worked, and what was
falsified across this release:
- Confirmed —
QUERY_SET_ANALYZER,CUAD_RECALL_GAP,
CUAD_CLAUSE_EXPANSION,MULTILINGUAL_ANALYZER,EVALUATE_API,
CUAD_HYBRID_RERANK(substitute-not-stack rule),VOCABULARY_ENRICH
(confirmed on both sides of the regime rule),SPIDER_ENRICH
(the positive-side validation forVocabulary.enrich(...): curated
chunk-side enrichment on a Spider-shape sample lifted mean column
recall +0.19 from 0.77 → 0.97, ≥0.8 retention 63% → 93%). - Null result / falsified —
CUAD_PRF_NULL(unweighted PRF on
boilerplate-heavy corpora),CUAD_CHUNK_FRAGMENTATION_NULL(chunker
isn't the CUAD lever),SUB_IDF_AUTO_DROP_NULL(corpus-only IDF
manipulation fails in both directions),
CUAD_ENRICH_DEFINITIONS_NULL(chunk-side enrich on per-contract
Definitions regressed −2.0 pts vs the 90.7% workflow baseline;
~24-point loss on the 17/50 contracts where Definitions were
extractable — chunk-side parallel to CUAD_PRF_NULL's failure mode,
measured directly). - The four-corner rule is now measured on all four corners.
Workload-pervasive signal manipulation fails on either side of the
pipeline; only workload-curated semantics work:
query-side curated wins (CUAD_CLAUSE_EXPANSION+3.0pt) ·
query-side auto fails (CUAD_PRF_NULL−3.7pt) ·
chunk-side curated wins (SPIDER_ENRICH+0.19 mean recall) ·
chunk-side auto fails (CUAD_ENRICH_DEFINITIONS_NULL−2.0pt).
Examples
Eleven new harnesses under crates/examples/examples/:
cuad_query_preprocessing, cuad_chunk_strategy_sweep,
cuad_chunk_fragmentation, cuad_clause_expansion, cuad_hybrid_rerank,
cuad_perf, cuad_prf, cuad_rust_vs_python_path,
multilingual_query_set_probe, query_set_analyzer_probe,
sub_idf_reweighting_probe.
Documentation
- New workflow-lift chart
.github/workflow_lift.svgembedded in the
root README + binding READMEs — surfaces the 81 → 88 → 90.7% story
visually. - Root README,
python/README.md,nodejs/README.md"Templated
workloads" section rewritten to detect → strip → (optional) vocabulary →
A/B withStripper/Vocabulary/context_with_rewritestabled. docs/CHOOSING_A_CONFIG.mdstep 3 leads with the new "two paths...
v0.2.2
The binding parity + evidence layer release. No breaking changes for any
binding's callers. The Node binding gains 14 missing Report fields, the
documentation gets its first visual presentation (badges, charts,
architecture diagram), and the evidence layer grows by five new findings
that document what was tried, what worked, and what was falsified honestly.
Added
Node binding — full Report field-surface parity with Python
Reportgains 14 fields + a permanent alias:strategy,
requestedStrategy,inputTokens,tokenBudget,tokenUtilization,
nInputChunks,nSelected,inputDistractorRatio,
reasoningPreservationDelta,distractorsPruned,removedTotal,
evidenceDensity,distractorRatio,estimatedWasteTokens, plus
secondHopRescueCount(==secondHopRescues, the existing short
name; both names will always be present and equal). Before this
release Node'sReportexposed roughly half of Python's surface —
programmatic callers usingreport.totalTokenscould not read
report.nSelectedor any of the economics fields. All additions are
non-breaking; no existing field changed name or shape.docs/API_STABILITY.mdgains a "Known call-shape asymmetries"
section documenting the two pre-existing idiomatic differences
between the Python and Node bindings (from_textpositional vs
options-bagsource;ctx.text()callable in Python vsctx.text
property in Node). These are stable within 0.x — neither will be
silently flipped.
README + binding-page presentation
- Multi-registry badges at the top of every README (root, Python,
Node). PyPI / crates.io / npm version numbers, license, and a link
to the evidence layer. Brand color (#e11d48) on the registry
badges. - A retention-vs-frameworks bar chart (
.github/retention_vs_frameworks.svg)
showing the measured head-to-head numbers from
FRAMEWORK_COMPARISON.md:
HotpotQA multi-hop (RedHop 77%, LangChain 71%, LlamaIndex 72%) and
CUAD contracts (RedHop 82%, LangChain 73%, LlamaIndex 86%). Hand-rolled
SVG, no fake screenshots, every number traces to
reports/framework_comparison.txt. - A pipeline architecture diagram (
.github/architecture.svg)
showing the five stages — Document → Chunking → Retrieval →
Allocation → BuiltContext — with the calibrating finding named under
each internal stage and a "YOU BRING / REDHOP OWNS / YOU GET" scope
label. - A Decision Report visual (
.github/decision_report.svg) —
terminal-styled SVG rendering ofctx.reportoutput. Same content
as the ASCII block (which is preserved under a collapsed<details>
for copy-paste). - A References section at the bottom of the root README, citing
the named work each piece of the runtime leans on: BM25 (Robertson &
Zaragoza 2009), Porter2 (Porter 2001), RRF (Cormack et al. 2009),
Lost-in-the-Middle (Liu et al. 2023), NQC (Shtok et al. 2012), MDR
(Xiong et al. 2021), and the HotpotQA/MuSiQue/CUAD evaluation
datasets. Each citation links to the finding doc that uses that
work. - The binding READMEs (
python/README.md,nodejs/README.md) get
the architecture + Decision Report visuals via absolute
raw.githubusercontent.comURLs so they render on PyPI and npm
package pages (not just on GitHub).
Structural test suite (crates/redhop/tests/)
proptest_invariants.rs— 9 property-based invariants
pinningbuild_context's behavior under random valid inputs:
build_context_never_panics,resolved_strategy_is_never_auto,
auto_decision_triangle_holds,selection_is_subset_of_input,
no_duplicate_chunk_ids_in_output,token_budget_respected,
report_counts_match_reality,report_ratios_are_finite_and_in_range,
build_context_is_deterministic. Addsproptest = "1"as a
dev-dependency. Catches edge cases hand-written tests miss by
construction (empty input, NaN scores, all-stopword query, single-token
budget, …) — the bug class structural tests exist to close off.default_calibration.rs— 9 pins binding each tuned default
onContextConfig::default()/DocumentConfig::default()to the
finding that calibrated it (token_budget = 8192,
auto_passthrough_max_tokens = 1500,distractor_min_grounding = 0.10,
redundancy_max_cosine = 0.92,link_min_jaccard = 0.12,
low_confidence_max_grounding ≡ distractor_min_grounding,
target_tokens = 128,candidate_k = 20,Document.strategy = Auto).
Silent default drift becomes impossible; intentional drift is
documented in the same commit that makes it.public_api_snapshot.rs— 5 compile-time guards against silent
rename/removal of public symbols, plus type-bound signature pins for
load-bearing functions and exhaustive-match pins on
ContextStrategy/AutoDecision/RetrievalMethodstring
variants. Caught 4 real signature mistakes during authoring.golden_quality.rs— 6 end-to-end retrieval-quality canaries
on small inline corpora: G01 lexical-keyword match, G02 stemming
finds morphological variants, G03 distractor doesn't dominate
relevant chunk, G04 second-hop chunk survives default assembly, G05
low-confidence signal fires on off-corpus queries, G06 citations
carry correct source. The canary that fires between benchmark runs
when RedHop "got dumber" on a real query shape.- Three field-set parity tests (
test_report_field_surface_parity,
test_built_context_field_surface_parity,
test_context_economics_field_surface_parityin
python/tests/test_parity_node.py). Each compares the SET of fields
each binding exposes for the named return type. Auto-catches the gap
class where a new#[getter](Python) orpubfield (Node) appears
on one side without the other keeping up — the failure mode that hid
the 14-field NodeReportgap until a smoke test stumbled on
strategy.
Evidence layer — five new finding documents
MUSIQUE_RECALL_GAP.md—
decomposes the dense recall gap between HotpotQA (0.76) and MuSiQue
(0.28) into five distinct contributors (gold density, retrieval signal
type, wide-net coverage, embedder capacity, chunking) and documents an
attempted full-pool RRF refactor ofRetrievalMode::Hybridthat an
honest A/B benchmark falsified. Branchfeature/hybrid-full-pool-rrf
on origin holds the working refactor as a research record; main keeps
the existing Hybrid behavior. Includes 5 reproducible example
harnesses undercrates/examples/examples/musique_*.rsand
hybrid_old_vs_new.rs(the A/B that closed the question).RERANKING_LIMITS.mdUpdate —
2026-06-06 (kind-label gate) — falsifies both directions of the
HotpotQA-type-label gate proposed in the original finding's "open
problem" section. Closes that probe.RERANKING_LIMITS.mdUpdate —
2026-06-06 (later, grounding gate) — documents the discovery and
cross-corpus falsification of agrounding_top1 ≤ 0.35gate that
worked on HotpotQA (+0.031 lift, robust to 5-fold CV) but failed to
generalize to MuSiQue. Also covers an NQC + WIG cross-corpus probe
that didn't port. Closes the CE-gate research direction with a
measured negative result. Includes the Phase A feature-logging
harness (crates/examples/examples/ce_gate_feature_log{,_musique}.rs)
for any future probe.DENSE_RERANK_CEILING.md
Update — 2026-06-06 — falsifies MDR single-pass as a uniform policy
(−0.05 vs dense baseline) while documenting a real +0.027 lift on the
subset of queries where dense had a gold in the pool but missed it.
Closes the single-shot MDR probe.LOCAL_RERANK.mdUpdate —
2026-06-06 — notes the status ofLocalRerankRetrieverafter the
MuSiQue investigation: it is now a building block rather than the
defaultHybrid, but the "semantic recall without ANN" contract it
established is intact and the working refactor is preserved on
feature/hybrid-full-pool-rrffor future re-evaluation.
Changed
python/tests/test_parity_node.pynow pinsstrategy+
requested_strategydata-value parity (in addition to the field-set
parity above). The harness previously normalized away these fields
rather than testing them — direct dict-key access means a future
regression that drops either field on either side fails with a clear
KeyErrorinstead of silently passing.- Documentation polish:
python/README.md'sfrom_textrow shows the
optionalsource=parameter;nodejs/README.mdlists the full
reportshape; the top-level README's"hybrid"row in the
Retrieval tiers table accurately reflects the shipped semantics.
Notes on the runtime
RetrievalMode::Hybridis unchanged for this release. A full-pool
RRF refactor was built end-to-end onfeature/hybrid-full-pool-rrf
(commitc81ffbe, all tests passing, fmt + clippy clean) but a direct
A/B benchmark falsified the ship decision: at the user-facing
candidate_k = 20the new composition gave only +0.0074 on MuSiQue
and +0.0017 on HotpotQA (both below the +0.02 pre-registered ship bar)
and regressed HotpotQA at K=4 by −0.011. The wide-K wins are real
(+0.07 MuSiQue@50, +0.034 HotpotQA@50) but not user-facing.
LocalRerankRetriever's BM25-prune-then-RRF composition is still
whatRetrievalMode::Hybridresolves to. Full A/B numbers and the
ship-decision audit are in
MUSIQUE_RECALL_GAP.md.
v0.2.1
The robustness + bugfix patch release. Two real bugs fixed (one BM25
edge case, one cross-binding serde-compat break), one new Python helper,
and ~30 new tests pinning load-bearing contracts across the codebase.
Fixed
- BM25: silent wildcard fallback on no-signal queries. Queries whose
every term was filtered out (stopwords only, or all-out-of-vocab)
silently fell back to a match-all wildcard, returning the corpus's
top-BM25 chunks as if the query had matched something. Now returns
an empty result set with a clear signal. ContextReport.removedand.economicsmissing#[serde(default)].
A binding payload from an older RedHop binary missing these fields
would error on deserialize — a silent cross-version compatibility break
for Python/Node callers shuttlingContextReportacross the FFI as
JSON. Both target types already deriveDefault; the fix is a no-op
for fresh payloads and gracefully fills in zeros for old ones.
Added
-
redhop.context_with_timeout(Python). ThinThreadPoolExecutor
watchdog aroundDocument.context()for agent integrations that need
to bail on slow queries:try: ctx = redhop.context_with_timeout(doc, q, timeout_ms=5000) except TimeoutError: ...
Forwards
budget/neighbors/include_heading. Scope is
deliberately Python-only — true Rust-side cancellation needs hooks in
Tantivy/ONNX that don't exist yet, and the docstring +TimeoutError
message document the limitation. -
docs/DEFAULT_PROVENANCE.md— every tuned default in
ContextConfig/DocumentConfiglinked back to the finding that
justifies it (so callers can audit which numbers are calibrated vs
arbitrary).
Internal — robustness tests
Seven new test passes (~30 tests) pinning load-bearing contracts that
were previously informal:
- Determinism — same input → same output, Rust + cross-binding parity.
- Internal invariants — 7+ consistency invariants across the strategy
matrix (selected ⊆ input,removed.totalmatches drop count, etc.). - Concurrency —
Send + Syncaudit + 1024-call parallel stress. - Adversarial loaders — 9 tests covering corrupt PDFs, symlink loops,
deep recursion, malformed DOCX/PPTX/XLSX. - Auto-gate boundary — pins the inclusive
<=semantics at
1499/1500/1501 input tokens + the custom-gate path. - Serde round-trip — every cross-FFI type (
Chunk,Score,
ContextReport, ...) survives JSON round-trip; forward-compat
exercised via a minimal pre-0.1.3 payload. - Strategy semantics — 7 differential tests pinning the contrasts
between all 5ContextStrategyvariants on a shared corpus
(catches accidental strategy convergence). - Persisted cache — incremental cache hit/miss contract for
read_folder_with(persist=true): per-file(mtime, size)skip,
no-op reload doesn't rewrite, fingerprint invalidation on config
change, deleted-file cleanup.
No public API changes. Python and Node callers are unaffected aside
from the new context_with_timeout helper.
v0.2.0
The binding-parity + non-English release. Three months of incremental
quality work plus a focused arc on cross-binding consistency: Python, Node,
and Rust all expose the same surface, return the same values for the same
inputs, and drift is now actively prevented in CI. The Rust crate also gains
a pluggable lexical analyzer, closing the structural bug class (BM25 ↔
grounding-scorer disagreement) that 0.1.3–0.1.4 fixed by hand four times.
Breaking changes
Two source-level breaks for Rust callers; ..Default::default() and the
pip/npm consumers are unaffected:
ContextConfig+DocumentConfiggrew new required fields
(analyzer: Arc<dyn Analyzer>) for the pluggable lexical analyzer.
Callers constructing those structs via field literals from outside
the crate need to addanalyzer: redhop::analyzer::default_english().ContextConfig::default().token_budgetchanged from 2048 → 8192
to align with the Python binding's long-standing default (which was
shipping to PyPI users that whole time). Rust callers relying on the
old 2048 default will now get a 4× larger assembled context. Set
token_budget: 2048explicitly to restore the old behavior. Python +
Node users see no change.
Added
Pluggable lexical analyzer
crate::analyzer::Analyzertrait +SnowballAnalyzer(18
Snowball Porter2 languages). First-class extension point: one analyzer
drives BOTH the BM25 retriever AND the grounding scorer, so the two
layers structurally cannot disagree on what "the same term" means.
Design rationale indocs/design/ANALYZER_PLUGIN.md; usage in
docs/LANGUAGE.md.Document::with_analyzer(Arc<dyn Analyzer>)— mirrors
with_embedder. Swaps the analyzer for both layers in lockstep.LoadOptions::language: Option<String>— string-routed access to
the 18 builtins (english,german,french,spanish,italian,
portuguese,dutch,russian,swedish,norwegian,danish,
finnish,romanian,hungarian,turkish,arabic,greek,
tamil). Unknown language names return an error (no silent fallback
to English).- Python
languagekwarg on everyDocument.from_*constructor. - Node
languagefield onOptions.
Binding parity (Node catches up to Python)
Document.analyze(query)— pure diagnostics, returns the same
Reportshape ascontext().reportwithout paying assembly cost.Document.nFilesgetter — number of source files indexed (1
for single-source ctors, the readable count forfromFolder).Document.skippedFilesgetter —SkippedFile[]({source, reason}pairs) for filesfromFoldercouldn't parse. Was a silent
skip with no introspection before.buildContext/filterContext/analyzeContext/
contextEconomicstop-level functions — the low-level "I do my own
retrieval, just want RedHop for assembly" surface. Mirrors Python's
same-named functions; takesChunkInput[]+ContextOptions.groundingScore(query, text)+linkStrength(a, b)— the
observability primitives the strategies use internally, exposed so
external code reuses RedHop's exact relevance notion instead of
reimplementing.
Tests + infrastructure
crates/redhop/tests/quality_suite.rs— 45-test behavior-level
suite organized by what a user perceives, not by code structure.
Covers tokenization (T01-T07), multi-field reach (T08-T09), document
structure (T10-T13), context assembly (T14-T20), hybrid contract
(T21-T22), edge cases (T23-T26), Unicode/multilingual (T27-T30),
adversarial queries (T31-T34), nested markdown (T35), cross-format
mixed corpus (T36), non-English pinning (T37-T40), and the analyzer
plugin (T41-T45). Found two real bugs on its first runs (an
empty-query BM25 crash and an accent-folding gap), and a binding bug
via T41-T44 (from_chunkssilently droppinglanguage=in Python).python/tests/test_parity_node.py+nodejs/test/parity_runner.cjs
— cross-binding parity harness. 6 tests hand identical inputs to
Python and Node and diff structured outputs (caught the
analyzeContext/contextEconomicstoken_budgetdivergence on
its first run).crates/cli/tests/cli_smoke.rs— first-ever CLI integration
tests. Asserts--helpworks on each subcommand + a real
analyze-context -stdin pipe.- Node CI job —
.github/workflows/ci.ymlnow builds the napi
addon and runsnpm teston PRs. Previously PRs only exercised
Rust + Python. - ASCII folding (
café↔cafe,Süßigkeit↔Sussigkeit,
naïve↔naive) in both BM25 and the grounding scorer (via NFKD).
New tests T27, T28, T39 pin this.
Documentation
docs/LANGUAGE.md— honest scope of non-English support, by
family + theAnalyzerplugin's public API (Rust / Python / Node).docs/design/ANALYZER_PLUGIN.md— rewritten to describe the
shipped surface (was originally a proposal with several deviations).- README "Language support" section + per-package READMEs
(python/README.md,nodejs/README.md) —language=examples. docs/ARCHITECTURE.md— refreshed against the post-consolidation
workspace (the pre-0.2 split ofredhop-{core,context,…}into
separate crates was rolled into one publishedredhopcrate; diagram
and crate-name references updated).docs/API_STABILITY.md— full Node section added; Python section
updated withlanguage=,n_files,skipped_files; Rust section
updated with the consolidated module paths.
Changed
- Python folder walker unified with Rust's
read_folder_with—
−429 LOC inpython/src/lib.rs(≈25% of the file). Removed the
parallelbuild_folder_persisted,collect_files,PersistedIndex,
CachedFile,fingerprint, etc. Both bindings now share Rust's
single implementation; on-disk index format is byte-compatible with
the previous Python writer, so existing caches reload cleanly. strategy_from_str+retrieval_from_strconsolidated to a
single source of truth inredhop::load. Python's wrappers now
forward to the Rust functions withmap_errinstead of duplicating
the match arms.Documentcarriesn_files()andskipped_files()accessors on
the Rust struct. Single-source constructors default to1/ empty;
read_folder_with(both simple and persisted paths) now records
(source, reason)for each skipped file instead of silently dropping
them.- MSRV bumped 1.75 → 1.77 across all three workspace declarations
(workspace,python/Cargo.toml,nodejs/Cargo.toml) — the napi-rs
2.x in the Node binding sets the actual floor; the inconsistency
meant a 1.75 user hit a mysterious napi error instead of a clear MSRV
one.
Fixed
- All-stopword query no longer crashes BM25. A query the analyzer
pipeline reduces to zero positive terms (""," ","the and is of in or") used to surface as a hard Tantivy error (Invalid query: Only excluding terms given). The retriever now traps that error
class (and theempty queryclass) and returns an empty result.
Caught byquality_suite::t25on its first run. - Python
Document.from_chunkssilently droppedlanguage=— the
pyo3 signature accepted the kwarg but the call intodoc_config
passedNoneinstead of the user's value. Caught by the new Python
analyzer test suite on its first run. - Node
analyzeContext/contextEconomicswere honoring the
user'stoken_budgetoption; Python's equivalents hardcode
usize::MAXbecause these are no-budget pure-analysis surfaces.
Caught by the cross-binding parity tests on their first run. - Node
index.d.tswas stale — thelanguagefield and
minCandidatesfield were present on the RustOptionsstruct but
hadn't been regenerated. TypeScript users got "Object literal may
only specify known properties" on perfectly valid options.
Notes
unicode-normalizationpromoted from transitive (via tantivy) to a
direct dep of redhop. Used for the grounding scorer's NFKD fold.- Workspace test count: 320/320 (Rust) + 81/81 (Python, +1 BGE
fixture skip) + Node smoke + analyzer suites. Was 260 at the v0.1.4
tag. - CI gates:
cargo fmt --all -- --check,cargo clippy --workspace --all-targets -- -D warnings,cargo test --workspace,cargo doc --workspace --no-deps --features files,semantic(warning-free), the
cross-binding parity suite, and the Node smoke + analyzer suites.
All six CI jobs green. [package.metadata.docs.rs] all-features = trueadded to the
redhop crate so the published doc page on docs.rs shows the
files+semanticitems instead of just the lean lexical surface.- 21 example files swept clean of hardcoded
/Users/vysakh/...paths;
they resolve datasets/models/exports through
redhop_examples::{data_path, exports_path, model_path, bge_small_paths, ms_marco_paths}helpers that honor
REDHOP_{DATA,EXPORTS,MODELS}_DIRenv vars.
v0.1.4
Citation ergonomics — for both code and prose. Follow-on to 0.1.3's
BM25-quality theme: retrieval is now precise, but the assembled context
returned to the LLM left the user staring at a def line without the
implementation, or at a deep section paragraph without its parent heading.
Both gaps closed by default; both have explicit opt-out knobs. Plus three
prose-side fixes that surfaced during the audit (setext headings, PDF
heading heuristic, plumbing).
Changed
Document::context(query)on a code chunk now attaches ±1 neighbor
chunks by default. Code is chunked as fixed-token windows so a 50-line
function often spans 2-3 chunks; a hit on the chunk containing thedef
line would previously cite only the signature, omitting the body in the
next chunk. WithDocumentConfig::code_neighbors_default = 1
(the new default), citations on code hits include the surrounding
implementation. Behavior change for code-shaped corpora — set
code_neighbors_default: 0to restore the old chunk-only behavior. No
effect on prose corpora (fires only on chunks tagged
metadata["kind"] == "code").Document::context(query)on a prose chunk with a section heading
now attaches the section's opener chunk by default. A query that
lands deep inside## Refunds → ### Eligibilitypreviously cited only
the matched chunk — the LLM lost the section title. With
DocumentConfig::prose_heading_default = true(the new default), the
section's first chunk is attached. Behavior change for hierarchical
prose — setprose_heading_default: falseto disable. Only fires on
chunks that carry non-emptymetadata["heading"](markdown, DOCX,
PPTX, XLSX, and — new in this release — PDF).- Markdown sections now recognize setext headings (
Title\n=====
for H1,Title\n-----for H2) in addition to ATX (#/##/…). YAML
frontmatter (---...---at file start) is detected and excluded
from setext scanning so its closing fence doesn't get treated as an
H2 underline. Pandoc output / older docs / man pages now produce the
same section structure as their ATX equivalents. - PDF chunks now carry best-effort heading metadata. A per-page
heuristic lifts the first short, non-paragraph-shaped line into
Section::heading(rejecting page-number footers, body lines ending
in sentence punctuation, and lines ending in a digit). Lets the BM25
heading-field search added in 0.1.3 actually reach PDF chunks by
topic; previouslymetadata["heading"]was alwaysNoneon PDFs.
Added
DocumentConfig::code_neighbors_default: usize(default1).DocumentConfig::prose_heading_default: bool(defaulttrue).
Both inherited via the Python / Node bindings' default config; no
binding-surface change for callers who don't override.
Fixed
- (No code-bug fixes — 0.1.3 already closed the BM25 quality gaps. See
the Notes section for the verified-not-broken embedding-persistence
story.)
Notes
- Embedding persistence verified. The 0.1.3 audit suspected that
read_folder_with(persist=true)re-embedded every chunk on reload
(paying ~30-60 sec of bge-small cost per cold start on a 1000-chunk
codebase). The machinery is already correct:embedded_chunks()
populates theChunk::embeddingfield from the retriever cache before
writingindex.json,EmbeddingisSerialize/Deserialize, and
LocalRerankRetriever::indexshort-circuits any chunk that comes back
with an embedding already set. Round-trip test
(crates/redhop/tests/embedding_persistence.rs) now pins this — a
reload triggers exactly 1 embed call (the query), not N+1 (the query +
every chunk). Locked in as a regression guard. - Eleven new tests across the citation-ergonomics theme: 3 for the code
neighbor default, 3 for the prose heading default, 3 for setext
headings + frontmatter handling, 2 for the PDF heading heuristic.
111/111 tests pass under
cargo test -p redhop --features files.