v0.3.3

github-actions released this 09 Jun 13:26
· 34 commits to main since this release

Answer-quality eval surface (Rust + Python + Node) + audit of
defaulted-on heuristics.
Two threads in this release:
(1) a new in-process evaluate(...) / critique(...) surface for
closed-set answer-quality metrics — lexical + LLM-judged, with
claim-decomposed faithfulness and TP/FP/FN correctness; calibrated
against Ragas at n=200 HotpotQA (r=+0.664, MAE=0.151);
(2) the defaulted-on heuristics audit (five measured, one
already-flipped in 0.3.2, two API smells fixed).

Added — eval surface

  • evaluate(query, ctx, answer=, gold_answer=, judge=, decompose_faithfulness=, decompose_correctness=):
    in-process answer-quality eval that returns one EvalReport blending
    lexical (CI-deterministic) and LLM-judged metrics. Available in Rust
    (redhop::evaluate(...)), Python (redhop.evaluate(...)), and Node
    (evaluateWithJudge(...) for the async judge path). Faithfulness,
    relevancy, correctness in _lexical (no LLM) and _judged (opt-in)
    flavors; gold-relative metrics (context_recall,
    context_precision, answer_token_recall) when gold is provided.
  • critique(answer, aspects, judge=, context=, query=) — open-ended
    user-defined dimensions (harmfulness, conciseness, brand voice,
    etc.). One LLM call per aspect; polarity-corrected scores so high =
    good across the report regardless of highIsGood. Returns a
    CritiqueReport with per-aspect scores in input order.
  • summarize(reports) — aggregates a sequence of per-case
    EvalReports into a means + N + share-flagged summary, the same
    shape RedHop's runtime uses for its Decision Report.
  • Judge surface: Judge.from_callable(fn).cached() (Python),
    Judge.fromCallable(fn, name).cached() (Node), and the Rust
    Judge trait with CachedJudge and CallableJudge wrappers. One
    caching layer for any user-supplied LLM caller; an LRU sized by
    the caller. Single primitive supports faithfulness, relevancy,
    correctness, critique, and decomposed paths.
  • Claim-decomposed faithfulness (decompose_faithfulness=True):
    extracts atomic claims via a few-shot LLM call, then batch-verifies
    all of them in a single second LLM call. Two LLM calls regardless
    of how many claims were extracted. gpt-4o-mini correlates with
    Ragas's faithfulness at r=+0.664 on n=200 HotpotQA (see
    docs/findings/EVAL_JUDGED_CALIBRATION.md).
    Verifier prompt includes paraphrase-positive examples + negative
    entity-substitution examples to balance strictness and recall.
  • TP/FP/FN correctness (decompose_correctness=True): mirrors
    decomposed-faithfulness on the answer-vs-gold axis. Extracts
    claims from both the answer and the gold answer, classifies each as
    TP / FP / FN, returns F₁. Diagnostic counters
    (n_correctness_tp/fp/fn) surface the intermediate categorisation.
  • Refusal-aware decomposition — "I don't know" answers correctly
    produce mean_faithfulness_judged = None (0 claims extracted)
    instead of being scored as a vacuous 1.0. Surfaces refusals as a
    distinct category, not as faithfulness=1.
  • Diagnostic counters on EvalReport:
    n_faithfulness_claims_extracted, n_faithfulness_claims_supported,
    n_correctness_tp, n_correctness_fp, n_correctness_fn. Surface
    intermediate classifier counts so callers can debug WHY a metric
    landed where it did.
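
The arithmetic that links the diagnostic counters to the reported scores can be sketched in a few lines of Python. This is an illustration only, not RedHop's implementation: the helper names faithfulness_from_counts and f1_from_counts are hypothetical, faithfulness is assumed to be the supported / extracted ratio, and F₁ is assumed to be the standard 2·TP / (2·TP + FP + FN). Returning None when every correctness counter is zero is also an assumption; the release notes only state the None behavior for zero extracted faithfulness claims.

```python
from typing import Optional


def faithfulness_from_counts(n_extracted: int, n_supported: int) -> Optional[float]:
    """Share of extracted claims that the verifier marked as supported.

    Returns None when no claims were extracted (for example an
    "I don't know" answer), so a refusal is not scored as a vacuous 1.0.
    """
    if n_extracted == 0:
        return None
    return n_supported / n_extracted


def f1_from_counts(tp: int, fp: int, fn: int) -> Optional[float]:
    """Standard F1 = 2*TP / (2*TP + FP + FN).

    Returning None when all three counts are zero is an assumption
    made for this sketch.
    """
    denom = 2 * tp + fp + fn
    if denom == 0:
        return None
    return 2 * tp / denom


print(faithfulness_from_counts(4, 3))   # 0.75
print(faithfulness_from_counts(0, 0))   # None
print(f1_from_counts(tp=3, fp=1, fn=2))
```

Reading the counters this way shows why a score landed where it did: a low F₁ with a high FN count points at missing gold claims, while a high FP count points at unsupported additions in the answer.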

Added — eval evidence

  • docs/COMPARISON_RAGAS.md — public-facing
    head-to-head with Ragas on claim-decomposed faithfulness. n=200
    HotpotQA, gpt-4o-mini, with Claude haiku as third-judge tie-breaker.
  • docs/findings/EVAL_JUDGED_CALIBRATION.md: rewritten end-to-end
    around three-layer evidence (5-case wiring probe →
    5-case Ragas side-by-side → n=200 HotpotQA correlation + third-judge
    subset). Documents the v0→v4 prompt iteration that fixed four
    traceable failure modes (paraphrase rejection, comparative
    hallucination, compound-attribution dilution, wrong-entity
    substitution). Calls out single-shot LLM noise as a measured
    property of the workload (gpt-4o-mini at temp=0 is not
    deterministic through OpenRouter — ~20–30% per-case variance).
  • docs/findings/ANSWER_QUALITY_EVAL.md: full API tour for the new
    evaluate(...) + critique(...) surface.
  • docs/findings/EVAL_VS_RAGAS_SOURCE.md: source-read comparison of the
    two libraries' implementations.
  • bench/eval_correlation_hotpot.py — runs the n=200 Pearson r / MAE
    measurement on HotpotQA against Ragas with configurable context mode
    (supporting / distractor_only / all).
  • bench/eval_third_judge.py — Claude haiku tie-breaker via the local
    claude -p CLI; no API key needed.
  • bench/eval_faith_trace.py — diagnostic harness for tracing claim
    extraction + per-claim verifier votes on specific qids. Has a
    --variant v0/v1/v2/v3/v4 flag for prompt iteration without
    rebuilding the Rust crate.
  • bench/eval_judged_calibration.py — the 5-case wiring probe with
    optional Ragas side-by-side when installed.
  • bench/select_third_judge_subset.py — filters contested cases from
    a correlation-bench JSON so the third-judge run stays cheap.
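
For readers unfamiliar with the two agreement statistics quoted in this release (Pearson r and MAE), the following is a minimal, dependency-free Python sketch of how they are computed over paired per-case scores. It is not the code in bench/eval_correlation_hotpot.py; the function names and the four score pairs are made up for illustration.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_error(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean of |x - y| over paired scores."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Made-up per-case faithfulness scores from two evaluators.
scores_a = [1.0, 0.5, 0.0, 0.75]
scores_b = [1.0, 0.5, 0.25, 1.0]

print(round(pearson_r(scores_a, scores_b), 3))
print(mean_absolute_error(scores_a, scores_b))  # 0.125
```

The two numbers answer different questions: r measures whether the evaluators rank cases similarly, while MAE measures how far apart the absolute scores are on average.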

Breaking — Rust only

  • redhop::evaluate(...) now takes six parameters,
    (query, ctx, answer, gold, judge, config), instead of the prior
    three-parameter shape. The Python and Node bindings absorb this via
    kwargs / options and are NOT breaking. To match the old behavior,
    pass None for answer and judge, and EvalConfig::default() for
    config.

Added — defaulted-on heuristics audit

  • code_neighbors_default / codeNeighborsDefault — surfaces the
    ±N adjacent-chunk auto-pull on code chunks as a constructor kwarg on
    Python from_text/from_file/from_bytes/from_folder/from_chunks
    and as a field on the Node Options struct. Default 1 (unchanged
    behavior). Pass 0 to disable, or 2/3 for more aggressive
    expansion under loose token budgets. See
    docs/findings/CODE_NEIGHBORS_DEFAULT.md
    for the measured budget tradeoff.
  • prose_heading_default / proseHeadingDefault — surfaces the
    auto-attach of section-heading chunks to prose hits as the same
    constructor-level kwarg / option. Default true (unchanged). Pass
    false for memory-tight workloads where the heading isn't
    load-bearing. See
    docs/findings/PROSE_HEADING_DEFAULT.md
    for the measured +7pt ≥0.8 lift at typical budgets.
  • crates/redhop/src/load.rs: LoadOptions now exposes
    code_neighbors_default and prose_heading_default as
    Option<usize> / Option<bool>. Threads through read_folder_with
    for parity with the in-memory loaders.
  • Audit finding docs. Five new findings on the defaulted-on
    heuristics audit:
    RAW_ANALYZER (flipped in 0.3.2),
    HYBRID_CANDIDATE_POOL (inert knob; don't tune),
    PROSE_HEADING_DEFAULT (+7pt at typical budgets),
    BM25_SOURCE_FIELD (+4pt with signal, 0pt with noise),
    CODE_NEIGHBORS_DEFAULT (budget-dependent compromise).
  • Cross-binding parity tests for the two new kwargs (Python
    test_loaders.py, Node smoke.cjs).
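
To make the ±N adjacent-chunk pull behind code_neighbors_default concrete, here is a small Python sketch of the general idea. It is an illustration under stated assumptions, not RedHop's code: the function name expand_with_neighbors is hypothetical, chunks are assumed to be indexed in document order, and the token-budget handling that the real heuristic is measured against is left out entirely.

```python
from typing import List


def expand_with_neighbors(hits: List[int], n_chunks: int, radius: int) -> List[int]:
    """Return the hit indices plus up to `radius` adjacent chunks on each side.

    Indices are clamped to [0, n_chunks - 1] and de-duplicated, so
    overlapping neighborhoods are merged. radius=0 returns the hits only.
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    selected = set()
    for hit in hits:
        lo = max(0, hit - radius)
        hi = min(n_chunks - 1, hit + radius)
        selected.update(range(lo, hi + 1))
    return sorted(selected)


print(expand_with_neighbors([0, 5], n_chunks=8, radius=1))  # [0, 1, 4, 5, 6]
print(expand_with_neighbors([0, 5], n_chunks=8, radius=0))  # [0, 5]
```

The sketch shows why the setting is budget-dependent: each step up in radius can add two more chunks per hit, which helps when the surrounding code carries needed context and costs tokens when it does not.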

Changed

  • No default values changed in this release. All defaults remain
    what they were after 0.3.2 — the new kwargs default to the existing
    Rust values (code_neighbors_default=1, prose_heading_default=true).
    Existing callers see zero behavior change; the new kwargs are an
    opt-out / tune surface only.