v0.1.17

silversurfer562 released this 16 May 03:32
d98fabe

Fixed

  • attune_rag.__version__ was stale. Both v0.1.15 and
    v0.1.16 shipped with __version__ = "0.1.14" because the
    release-prep flow only bumped pyproject.toml (the version
    PyPI uses) and never touched the in-source constant. Synced
    __version__ to 0.1.16 and added
    tests/unit/test_package_metadata.py asserting it matches
    importlib.metadata.version("attune-rag") so the next
    release-prep PR will fail CI if it forgets the bump.
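The guard described above can be sketched as follows. This is a minimal illustration, not the contents of tests/unit/test_package_metadata.py: the helper name `version_mismatch` and its injectable `lookup` parameter are assumptions made so the example runs without attune-rag installed. Only `importlib.metadata.version` is a real standard-library call.

```python
from importlib import metadata


def version_mismatch(source_version, dist_name, lookup=metadata.version):
    """Return None when the in-source constant matches the installed
    distribution metadata, otherwise a human-readable message.

    `lookup` defaults to importlib.metadata.version; it is injectable
    here only so the sketch can run without the package installed.
    """
    installed = lookup(dist_name)
    if installed == source_version:
        return None
    return (
        f"__version__ is {source_version!r} but "
        f"{dist_name} metadata reports {installed!r}"
    )


# Hypothetical stand-in for the real metadata lookup.
def fake_lookup(dist_name):
    return "0.1.16"


# The stale constant is detected; the synced one is not.
stale = version_mismatch("0.1.14", "attune-rag", lookup=fake_lookup)
synced = version_mismatch("0.1.16", "attune-rag", lookup=fake_lookup)

# In a real pytest module (assuming the package is installed) the check
# might read:
#     import attune_rag
#     assert version_mismatch(attune_rag.__version__, "attune-rag") is None
```

Comparing against the installed metadata rather than a hard-coded string means the test needs no edit at release time: bumping pyproject.toml without the constant is enough to make it fail.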

Added

  • Calibration ground-truth labeling kit. Two scripts under
    scripts/:

    • build_calibration_labeling_kit.py picks N queries from a
      --compare-thinking --json artifact (largest shifts + a
      few unchanged controls) and emits a markdown labeling
      template.
    • score_against_ground_truth.py reads the labeled markdown
      plus the artifact and reports which judge pass (off / on)
      aligned more closely with the human labels — the empirical
      signal that gates a future --thinking default-flip
      decision.

    Workflow documented in
    docs/rag/faithfulness-thinking-calibration.md. First kit
    for the 2026-05-15 run is committed at
    artifacts/calibration/ground-truth-2026-05-15.template.md
    (8 queries: 5 highest-shift + 3 controls). Known gap: the
    benchmark JSON doesn't yet capture the generator's answer
    text or retrieved passages, so the kit surfaces the judge's
    claim lists as a proxy; a follow-up will enrich the JSON.

  • Larger calibration kit (17 queries, v2). A re-run of
    --compare-thinking against the 40-query golden set on
    enriched-JSON output (post-#26, with answer + context
    embedded per query) produced a fresh artifact at
    artifacts/calibration/thinking-2026-05-15-v2.json. From
    that artifact, a 17-query labeling kit was built at
    artifacts/calibration/ground-truth-2026-05-15-v2.template.md
    (13 highest-shift + 4 controls). Since answer and
    context are embedded per query, the kit is now
    self-contained: no live API calls are needed at label time.
    The re-run also exposed call-to-call variance: v2's
    high-shift set differs substantially from v1's (e.g.,
    gq-017 went from Δ=+0.182 to Δ=−0.250 across the two
    runs). Because the judge is non-deterministic, each
    calibration captures a snapshot, not ground truth itself.
    See docs/rag/faithfulness-thinking-calibration.md.

  • Ground-truth validation of the v2 calibration (17 queries).
    Follow-up to the v1 round below. Patrick labeled the larger
    17-query v2 kit under the same strict-lens philosophy.
    Outcome: off-closer 6, on-closer 5, tied 6. Option B (keep
    --thinking opt-in) is confirmed at roughly 2× the sample
    size. The off-vs-on margin narrowed from 1.5× in v1
    (3 vs. 2) to 1.2× in v2 (6 vs. 5): on is more competitive
    than v1 suggested, but not enough to flip the call. Labels at
    artifacts/calibration/ground-truth-2026-05-15-v2.md;
    results appended to
    docs/rag/faithfulness-thinking-calibration.md.

  • Ground-truth validation of the v0.1.15 calibration.
    Patrick labeled the 8-query kit interactively under a strict
    lens; results are committed at
    artifacts/calibration/ground-truth-2026-05-15.md and
    written into the calibration doc. Outcome: among the 5
    high-shift queries, off-closer 3, on-closer 2, tied 0
    (all 3 controls were tied). The round also surfaced a
    phantom-claim pattern in judge-on: it paraphrases the
    answer into more specific claims, then flags its own
    paraphrases. The decision, Option B (keep --thinking
    opt-in), is now backed by empirical evidence rather than
    resting on absence of evidence.
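
The off-closer / on-closer / tied tallies reported in the two validation entries above could be computed along the following lines. This is a hedged sketch, not the logic of score_against_ground_truth.py: the row fields (`human`, `off`, `on`), the use of absolute difference as the closeness measure, and the function names are all assumptions, and the sample rows are made up for illustration.

```python
from collections import Counter


def closer_pass(human, judge_off, judge_on, tol=1e-9):
    """Return 'off', 'on', or 'tied' depending on which judge pass
    scored closer to the human label (absolute difference; assumed)."""
    off_err = abs(judge_off - human)
    on_err = abs(judge_on - human)
    if abs(off_err - on_err) <= tol:
        return "tied"
    return "off" if off_err < on_err else "on"


def tally(rows):
    """Count how many queries each judge pass won, plus ties."""
    counts = Counter(closer_pass(r["human"], r["off"], r["on"]) for r in rows)
    return {key: counts.get(key, 0) for key in ("off", "on", "tied")}


def off_vs_on_margin(counts):
    """Ratio of off-closer to on-closer counts (infinite if on never wins)."""
    if counts["on"] == 0:
        return float("inf")
    return counts["off"] / counts["on"]


# Made-up rows, for illustration only (not data from the artifacts).
rows = [
    {"id": "q1", "human": 0.50, "off": 0.50, "on": 0.75},  # off is closer
    {"id": "q2", "human": 1.00, "off": 0.50, "on": 0.75},  # on is closer
    {"id": "q3", "human": 0.75, "off": 0.75, "on": 0.75},  # tied
]
counts = tally(rows)

# Margins recomputed from the counts reported in the entries above.
v1_margin = off_vs_on_margin({"off": 3, "on": 2, "tied": 3})
v2_margin = off_vs_on_margin({"off": 6, "on": 5, "tied": 6})
```

Reading the margin as the off-closer count divided by the on-closer count gives 1.5 for the 8-query round (3 vs. 2) and 1.2 for the 17-query round (6 vs. 5), which is consistent with the figures quoted above.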