v0.2.2
The binding parity + evidence layer release. No breaking changes for any
binding's callers. The Node binding gains 14 missing Report fields, the
documentation gets its first visual presentation (badges, charts,
architecture diagram), and the evidence layer grows by five new findings
that honestly document what was tried, what worked, and what was falsified.
Added
Node binding — full Report field-surface parity with Python
- Report gains 14 fields + a permanent alias: strategy,
  requestedStrategy, inputTokens, tokenBudget, tokenUtilization,
  nInputChunks, nSelected, inputDistractorRatio,
  reasoningPreservationDelta, distractorsPruned, removedTotal,
  evidenceDensity, distractorRatio, estimatedWasteTokens, plus
  secondHopRescueCount (== secondHopRescues, the existing short
  name; both names will always be present and equal). Before this
  release Node's Report exposed roughly half of Python's surface:
  programmatic callers using report.totalTokens could not read
  report.nSelected or any of the economics fields. All additions are
  non-breaking; no existing field changed name or shape.
- docs/API_STABILITY.md gains a "Known call-shape asymmetries"
  section documenting the two pre-existing idiomatic differences
  between the Python and Node bindings (from_text positional vs
  options-bag source; ctx.text() callable in Python vs ctx.text
  property in Node). These are stable within 0.x; neither will be
  silently flipped.
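To illustrate what the wider surface means for a caller, the sketch below checks a report payload for the 14 added fields and the alias invariant. It assumes the Node Report has been serialized to JSON and loaded as a plain dict; the helper name check_report_surface and the sample payload are hypothetical, and only the field names come from the list above.

```python
# Field names copied from the release note; everything else is illustrative.
ADDED_FIELDS = [
    "strategy", "requestedStrategy", "inputTokens", "tokenBudget",
    "tokenUtilization", "nInputChunks", "nSelected", "inputDistractorRatio",
    "reasoningPreservationDelta", "distractorsPruned", "removedTotal",
    "evidenceDensity", "distractorRatio", "estimatedWasteTokens",
]
ALIAS_PAIR = ("secondHopRescueCount", "secondHopRescues")


def check_report_surface(report: dict) -> list[str]:
    """Return a list of problems; an empty list means the surface looks complete."""
    problems = [
        f"missing field: {name}"
        for name in (*ADDED_FIELDS, *ALIAS_PAIR)
        if name not in report
    ]
    # The alias and the short name are documented as always present and equal.
    long_name, short_name = ALIAS_PAIR
    if long_name in report and short_name in report:
        if report[long_name] != report[short_name]:
            problems.append(f"{long_name} != {short_name}")
    return problems


# Hypothetical payload with placeholder values (not real RedHop output).
sample = {name: 0 for name in ADDED_FIELDS}
sample.update({"secondHopRescueCount": 2, "secondHopRescues": 2})
print(check_report_surface(sample))  # []
```

A check like this is most useful in a smoke test that runs against a freshly built binding, which is where the original gap was noticed.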
README + binding-page presentation
- Multi-registry badges at the top of every README (root, Python,
  Node). PyPI / crates.io / npm version numbers, license, and a link
  to the evidence layer. Brand color (#e11d48) on the registry
  badges.
- A retention-vs-frameworks bar chart
  (.github/retention_vs_frameworks.svg) showing the measured
  head-to-head numbers from FRAMEWORK_COMPARISON.md: HotpotQA
  multi-hop (RedHop 77%, LangChain 71%, LlamaIndex 72%) and CUAD
  contracts (RedHop 82%, LangChain 73%, LlamaIndex 86%). Hand-rolled
  SVG, no fake screenshots, every number traces to
  reports/framework_comparison.txt.
- A pipeline architecture diagram (.github/architecture.svg)
  showing the five stages (Document → Chunking → Retrieval →
  Allocation → BuiltContext), with the calibrating finding named under
  each internal stage and a "YOU BRING / REDHOP OWNS / YOU GET" scope
  label.
- A Decision Report visual (.github/decision_report.svg): a
  terminal-styled SVG rendering of ctx.report output. Same content
  as the ASCII block (which is preserved under a collapsed <details>
  for copy-paste).
- A References section at the bottom of the root README, citing
  the named work each piece of the runtime leans on: BM25 (Robertson &
  Zaragoza 2009), Porter2 (Porter 2001), RRF (Cormack et al. 2009),
  Lost-in-the-Middle (Liu et al. 2023), NQC (Shtok et al. 2012), MDR
  (Xiong et al. 2021), and the HotpotQA/MuSiQue/CUAD evaluation
  datasets. Each citation links to the finding doc that uses that
  work.
- The binding READMEs (python/README.md, nodejs/README.md) get
  the architecture + Decision Report visuals via absolute
  raw.githubusercontent.com URLs so they render on PyPI and npm
  package pages (not just on GitHub).
Structural test suite (crates/redhop/tests/)
- proptest_invariants.rs: 9 property-based invariants that pin
  build_context's behavior under random valid inputs
  (build_context_never_panics, resolved_strategy_is_never_auto,
  auto_decision_triangle_holds, selection_is_subset_of_input,
  no_duplicate_chunk_ids_in_output, token_budget_respected,
  report_counts_match_reality, report_ratios_are_finite_and_in_range,
  build_context_is_deterministic). Adds proptest = "1" as a
  dev-dependency. Catches edge cases hand-written tests miss by
  construction (empty input, NaN scores, all-stopword query, single-token
  budget, …), which is the bug class structural tests exist to close off.
- default_calibration.rs: 9 pins binding each tuned default
  on ContextConfig::default() / DocumentConfig::default() to the
  finding that calibrated it (token_budget = 8192,
  auto_passthrough_max_tokens = 1500, distractor_min_grounding = 0.10,
  redundancy_max_cosine = 0.92, link_min_jaccard = 0.12,
  low_confidence_max_grounding ≡ distractor_min_grounding,
  target_tokens = 128, candidate_k = 20, Document.strategy = Auto).
  Silent default drift becomes impossible; intentional drift is
  documented in the same commit that makes it.
- public_api_snapshot.rs: 5 compile-time guards against silent
  rename/removal of public symbols, plus type-bound signature pins for
  load-bearing functions and exhaustive-match pins on
  ContextStrategy / AutoDecision / RetrievalMethod string
  variants. Caught 4 real signature mistakes during authoring.
- golden_quality.rs: 6 end-to-end retrieval-quality canaries
  on small inline corpora (G01 lexical-keyword match, G02 stemming
  finds morphological variants, G03 distractor doesn't dominate
  relevant chunk, G04 second-hop chunk survives default assembly, G05
  low-confidence signal fires on off-corpus queries, G06 citations
  carry correct source). The canary that fires between benchmark runs
  when RedHop "got dumber" on a real query shape.
- Three field-set parity tests (test_report_field_surface_parity,
  test_built_context_field_surface_parity,
  test_context_economics_field_surface_parity in
  python/tests/test_parity_node.py). Each compares the SET of fields
  each binding exposes for the named return type. Auto-catches the gap
  class where a new #[getter] (Python) or pub field (Node) appears
  on one side without the other keeping up, the failure mode that hid
  the 14-field Node Report gap until a smoke test stumbled on
  strategy.
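The field-set comparison behind those parity tests can be sketched in a few lines. This is an illustration, not the contents of test_parity_node.py: it assumes Python exposes snake_case names (as requested_strategy suggests) and Node exposes camelCase, and the function names and the two field lists are hypothetical.

```python
def snake_to_camel(name: str) -> str:
    """Convert a snake_case name, e.g. requested_strategy -> requestedStrategy."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def field_surface_diff(python_fields, node_fields):
    """Compare two field surfaces after normalizing Python names to camelCase.

    Returns (missing_in_node, missing_in_python) as sorted lists.
    """
    py = {snake_to_camel(f) for f in python_fields}
    node = set(node_fields)
    return sorted(py - node), sorted(node - py)


# Hypothetical surfaces: Node lags behind Python by one field.
python_side = ["total_tokens", "strategy", "n_selected"]
node_side = ["totalTokens", "strategy"]

missing_in_node, missing_in_python = field_surface_diff(python_side, node_side)
print(missing_in_node, missing_in_python)  # ['nSelected'] []
```

Comparing sets rather than values is what lets a test of this kind fail the moment one binding grows a field the other lacks, without needing a fixture that exercises the new field.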
Evidence layer — five new finding documents
- MUSIQUE_RECALL_GAP.md:
  decomposes the dense recall gap between HotpotQA (0.76) and MuSiQue
  (0.28) into five distinct contributors (gold density, retrieval signal
  type, wide-net coverage, embedder capacity, chunking) and documents an
  attempted full-pool RRF refactor of RetrievalMode::Hybrid that an
  honest A/B benchmark falsified. Branch feature/hybrid-full-pool-rrf
  on origin holds the working refactor as a research record; main keeps
  the existing Hybrid behavior. Includes 5 reproducible example
  harnesses under crates/examples/examples/musique_*.rs and
  hybrid_old_vs_new.rs (the A/B that closed the question).
- RERANKING_LIMITS.md update of 2026-06-06 (kind-label gate):
  falsifies both directions of the
  HotpotQA-type-label gate proposed in the original finding's "open
  problem" section. Closes that probe.
- RERANKING_LIMITS.md update of 2026-06-06 (later, grounding gate):
  documents the discovery and
  cross-corpus falsification of a grounding_top1 ≤ 0.35 gate that
  worked on HotpotQA (+0.031 lift, robust to 5-fold CV) but failed to
  generalize to MuSiQue. Also covers an NQC + WIG cross-corpus probe
  that didn't port. Closes the CE-gate research direction with a
  measured negative result. Includes the Phase A feature-logging
  harness (crates/examples/examples/ce_gate_feature_log{,_musique}.rs)
  for any future probe.
- DENSE_RERANK_CEILING.md update of 2026-06-06:
  falsifies MDR single-pass as a uniform policy
  (−0.05 vs dense baseline) while documenting a real +0.027 lift on the
  subset of queries where dense had a gold in the pool but missed it.
  Closes the single-shot MDR probe.
- LOCAL_RERANK.md update of 2026-06-06:
  notes the status of LocalRerankRetriever after the
  MuSiQue investigation: it is now a building block rather than the
  default Hybrid, but the "semantic recall without ANN" contract it
  established is intact and the working refactor is preserved on
  feature/hybrid-full-pool-rrf for future re-evaluation.
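For readers unfamiliar with the terms, the sketch below shows reciprocal rank fusion (Cormack et al. 2009, with the commonly used constant k = 60) applied two ways: within a BM25-pruned pool, and over both complete rankings. It is one plausible reading of the two compositions, not RedHop's code; the function names, the pool size, and the toy rankings are all assumptions, and the real LocalRerankRetriever and the branch refactor may differ in pool sizes, ranking sources, and tie-breaking.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def prune_then_fuse(bm25_ranking, dense_ranking, prune_to):
    """Keep only the top `prune_to` BM25 hits, then fuse within that pool."""
    pool = set(bm25_ranking[:prune_to])
    return rrf([
        [d for d in bm25_ranking if d in pool],
        [d for d in dense_ranking if d in pool],
    ])


def full_pool_fuse(bm25_ranking, dense_ranking):
    """Fuse both complete rankings without pruning first."""
    return rrf([bm25_ranking, dense_ranking])


# Toy rankings (made up): "d" ranks first for dense but last for BM25.
bm25 = ["a", "b", "c", "d"]
dense = ["d", "a", "c", "b"]

print(prune_then_fuse(bm25, dense, prune_to=2))  # ['a', 'b']
print(full_pool_fuse(bm25, dense))               # ['a', 'd', 'b', 'c']
```

In the pruned variant a document outside the BM25 pool can never surface, however highly the dense ranking places it; the full-pool variant removes that restriction at the cost of ranking everything with both signals.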
Changed
- python/tests/test_parity_node.py now pins strategy +
  requested_strategy data-value parity (in addition to the field-set
  parity above). The harness previously normalized away these fields
  rather than testing them; direct dict-key access means a future
  regression that drops either field on either side fails with a clear
  KeyError instead of silently passing.
- Documentation polish: python/README.md's from_text row shows the
  optional source= parameter; nodejs/README.md lists the full
  report shape; the top-level README's "hybrid" row in the
  Retrieval tiers table accurately reflects the shipped semantics.
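Why direct dict-key access matters can be shown with a small, self-contained comparison. The lenient variant below uses .get() as a stand-in for any comparison that tolerates a missing field; it does not reproduce the harness's earlier normalization, and both helper names and the empty report dicts are hypothetical.

```python
def assert_parity_lenient(py_report: dict, node_report: dict, key: str) -> None:
    # .get() returns None for a missing key, so two reports that BOTH
    # lost the field compare equal and the check passes silently.
    assert py_report.get(key) == node_report.get(key)


def assert_parity_strict(py_report: dict, node_report: dict, key: str) -> None:
    # Direct indexing raises KeyError as soon as either side drops the field.
    assert py_report[key] == node_report[key]


# Hypothetical regression: the field vanished from both bindings.
py_report, node_report = {}, {}

assert_parity_lenient(py_report, node_report, "requested_strategy")  # passes

try:
    assert_parity_strict(py_report, node_report, "requested_strategy")
except KeyError as err:
    print("strict check failed with KeyError:", err)
```

The strict form turns a dropped field into an immediate, named failure, which is the behavior the changed test relies on.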
Notes on the runtime
RetrievalMode::Hybrid is unchanged for this release. A full-pool
RRF refactor was built end-to-end on feature/hybrid-full-pool-rrf
(commit c81ffbe, all tests passing, fmt + clippy clean) but a direct
A/B benchmark falsified the ship decision: at the user-facing
candidate_k = 20 the new composition gave only +0.0074 on MuSiQue
and +0.0017 on HotpotQA (both below the +0.02 pre-registered ship bar)
and regressed HotpotQA at K=4 (−0.011). The wide-K wins are real
(+0.07 MuSiQue@50, +0.034 HotpotQA@50) but not user-facing.
LocalRerankRetriever's BM25-prune-then-RRF composition is still
what RetrievalMode::Hybrid resolves to. Full A/B numbers and the
ship-decision audit are in MUSIQUE_RECALL_GAP.md.
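The arithmetic behind that decision is simple enough to restate as code. The deltas and the +0.02 bar are the figures quoted above; the decision rule itself (every user-facing delta must clear the bar and no listed configuration may regress) is an assumption made for this sketch, as is the function name, and the actual audit in MUSIQUE_RECALL_GAP.md may weigh things differently.

```python
SHIP_BAR = 0.02  # pre-registered minimum lift, as quoted in the note

# Measured deltas quoted in the note (new composition minus existing Hybrid).
deltas_at_k20 = {"MuSiQue": 0.0074, "HotpotQA": 0.0017}
other_deltas = {"HotpotQA@K=4": -0.011}


def ship_decision(user_facing: dict, other: dict, bar: float) -> tuple[bool, list[str]]:
    """Assumed rule: ship only if every user-facing delta clears the bar
    and nothing in `other` regresses."""
    reasons = []
    for name, delta in user_facing.items():
        if delta < bar:
            reasons.append(f"{name}: {delta:+.4f} is below the {bar:+.2f} bar")
    for name, delta in other.items():
        if delta < 0:
            reasons.append(f"{name}: regression of {delta:+.3f}")
    return (not reasons, reasons)


ship, reasons = ship_decision(deltas_at_k20, other_deltas, SHIP_BAR)
print("ship" if ship else "do not ship")
for reason in reasons:
    print(" -", reason)
```

Under this rule the wide-K gains do not enter the decision at all, which matches the note's point that they are real but not user-facing.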