v0.3.3
Answer-quality eval surface (Rust + Python + Node) + audit of
defaulted-on heuristics. Two threads in this release:
(1) a new in-process evaluate(...) / critique(...) surface for
closed-set answer-quality metrics — lexical + LLM-judged, with
claim-decomposed faithfulness and TP/FP/FN correctness; calibrated
against Ragas at n=200 HotpotQA (r=+0.664, MAE=0.151);
(2) the defaulted-on heuristics audit (five measured, one
already-flipped in 0.3.2, two API smells fixed).
Added — eval surface
evaluate(query, ctx, answer=, gold_answer=, judge=, decompose_faithfulness=, decompose_correctness=)—
in-process answer-quality eval that returns oneEvalReportblending
lexical (CI-deterministic) and LLM-judged metrics. Available in Rust
(redhop::evaluate(...)), Python (redhop.evaluate(...)), and Node
(evaluateWithJudge(...)for the async judge path). Faithfulness,
relevancy, correctness in_lexical(no LLM) and_judged(opt-in)
flavors; gold-relative metrics (context_recall,
context_precision,answer_token_recall) whengoldis provided.critique(answer, aspects, judge=, context=, query=)— open-ended
user-defined dimensions (harmfulness, conciseness, brand voice,
etc.). One LLM call per aspect; polarity-corrected scores so high =
good across the report regardless ofhighIsGood. Returns a
CritiqueReportwith per-aspect scores in input order.summarize(reports)— aggregates a sequence of per-case
EvalReports into a means + N + share-flagged summary, the same
shape RedHop's runtime uses for its Decision Report.- Judge surface —
Judge.from_callable(fn).cached()(Python),
Judge.fromCallable(fn, name).cached()(Node), and the Rust
Judgetrait withCachedJudgeandCallableJudgewrappers. One
caching layer for any user-supplied LLM caller; an LRU sized by
the caller. Single primitive supports faithfulness, relevancy,
correctness, critique, and decomposed paths. - Claim-decomposed faithfulness (
decompose_faithfulness=True):
extracts atomic claims via a few-shot LLM call, then batch-verifies
all of them in a single second LLM call. Two LLM calls regardless
of how many claims were extracted.gpt-4o-minicorrelates with
Ragas's faithfulness at r=+0.664 on n=200 HotpotQA (see
docs/findings/EVAL_JUDGED_CALIBRATION.md).
Verifier prompt includes paraphrase-positive examples + negative
entity-substitution examples to balance strictness and recall. - TP/FP/FN correctness (
decompose_correctness=True): mirrors
decomposed-faithfulness on the answer-vs-gold axis. Extracts
claims from both the answer and the gold answer, classifies each as
TP / FP / FN, returns F₁. Diagnostic counters
(n_correctness_tp/fp/fn) surface the intermediate categorisation. - Refusal-aware decomposition — "I don't know" answers correctly
producemean_faithfulness_judged = None(0 claims extracted)
instead of being scored as a vacuous 1.0. Surfaces refusals as a
distinct category, not as faithfulness=1. - Diagnostic counters on
EvalReport:
n_faithfulness_claims_extracted,n_faithfulness_claims_supported,
n_correctness_tp,n_correctness_fp,n_correctness_fn. Surface
intermediate classifier counts so callers can debug WHY a metric
landed where it did.
Added — eval evidence
- docs/COMPARISON_RAGAS.md — public-facing
head-to-head with Ragas on claim-decomposed faithfulness. n=200
HotpotQA, gpt-4o-mini, with Claude haiku as third-judge tie-breaker. - docs/findings/EVAL_JUDGED_CALIBRATION.md
rewritten end-to-end: three-layer evidence (5-case wiring probe →
5-case Ragas side-by-side → n=200 HotpotQA correlation + third-judge
subset). Documents the v0→v4 prompt iteration that fixed four
traceable failure modes (paraphrase rejection, comparative
hallucination, compound-attribution dilution, wrong-entity
substitution). Calls out single-shot LLM noise as a measured
property of the workload (gpt-4o-mini at temp=0 is not
deterministic through OpenRouter — ~20–30% per-case variance). - docs/findings/ANSWER_QUALITY_EVAL.md —
full API tour for the newevaluate(...)+critique(...)surface. - docs/findings/EVAL_VS_RAGAS_SOURCE.md —
source-read comparison of the two libraries' implementations. bench/eval_correlation_hotpot.py— runs the n=200 Pearson r / MAE
measurement on HotpotQA against Ragas with configurable context mode
(supporting/distractor_only/all).bench/eval_third_judge.py— Claude haiku tie-breaker via the local
claude -pCLI; no API key needed.bench/eval_faith_trace.py— diagnostic harness for tracing claim
extraction + per-claim verifier votes on specific qids. Has a
--variant v0/v1/v2/v3/v4flag for prompt iteration without
rebuilding the Rust crate.bench/eval_judged_calibration.py— the 5-case wiring probe with
optional Ragas side-by-side when installed.bench/select_third_judge_subset.py— filters contested cases from
a correlation-bench JSON so the third-judge run stays cheap.
Breaking — Rust only
redhop::evaluate(...)signature now takes six parameters:
(query, ctx, answer, gold, judge, config)instead of the prior
three-parameter shape. The Python and Node bindings absorb this via
kwargs / options and are NOT breaking. PassNoneforanswerand
judgeandEvalConfig::default()forconfigto match the old
behavior.
Added — defaulted-on heuristics audit
code_neighbors_default/codeNeighborsDefault— surfaces the
±N adjacent-chunk auto-pull on code chunks as a constructor kwarg on
Pythonfrom_text/from_file/from_bytes/from_folder/from_chunks
and as a field on the NodeOptionsstruct. Default1(unchanged
behavior). Pass0to disable, or2/3for more aggressive
expansion under loose token budgets. See
docs/findings/CODE_NEIGHBORS_DEFAULT.md
for the measured budget tradeoff.prose_heading_default/proseHeadingDefault— surfaces the
auto-attach of section-heading chunks to prose hits as the same
constructor-level kwarg / option. Defaulttrue(unchanged). Pass
falsefor memory-tight workloads where the heading isn't
load-bearing. See
docs/findings/PROSE_HEADING_DEFAULT.md
for the measured +7pt ≥0.8 lift at typical budgets.crates/redhop/src/load.rs—LoadOptionsnow exposes
code_neighbors_defaultandprose_heading_defaultas
Option<usize>/Option<bool>. Threads throughread_folder_with
for parity with the in-memory loaders.- Audit finding docs. Five new findings on the defaulted-on
heuristics audit:
RAW_ANALYZER(flipped in 0.3.2),
HYBRID_CANDIDATE_POOL
(inert knob — don't tune),
PROSE_HEADING_DEFAULT
(+7pt at typical budgets),
BM25_SOURCE_FIELD(+4pt with
signal, 0pt with noise),
CODE_NEIGHBORS_DEFAULT
(budget-dependent compromise). - Cross-binding parity tests for the two new kwargs (Python
test_loaders.py, Nodesmoke.cjs).
Changed
- No default values changed in this release. All defaults remain
what they were after 0.3.2 — the new kwargs default to the existing
Rust values (code_neighbors_default=1,prose_heading_default=true).
Existing callers see zero behavior change; the new kwargs are an
opt-out / tune surface only.