v0.1.17

silversurfer562 released this 16 May 03:32

d98fabe

Fixed

attune_rag.__version__ was stale. Both v0.1.15 and
v0.1.16 shipped with __version__ = "0.1.14" because the
release-prep flow only bumped pyproject.toml (the version
PyPI uses) and never touched the in-source constant. Synced
__version__ to 0.1.16 and added
tests/unit/test_package_metadata.py asserting it matches
importlib.metadata.version("attune-rag") so the next
release-prep PR will fail CI if it forgets the bump.
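
A minimal sketch of the kind of check such a test can make, not the actual contents of tests/unit/test_package_metadata.py. The helper name version_drift and its injectable lookup parameter are assumptions made for illustration; only importlib.metadata.version and the attune-rag distribution name come from the note above.

```python
from importlib import metadata
from typing import Callable, Optional


def version_drift(
    in_source: str,
    dist_name: str,
    lookup: Callable[[str], str] = metadata.version,
) -> Optional[str]:
    """Return a message if the in-source version differs from the installed
    distribution metadata, or None when the two agree.

    `lookup` defaults to importlib.metadata.version; it is injectable only so
    the check can be exercised without the package installed (an assumption
    of this sketch, not something the release note describes).
    """
    installed = lookup(dist_name)
    if installed == in_source:
        return None
    return (
        f"__version__ is {in_source!r} but the {dist_name} metadata "
        f"reports {installed!r}"
    )


# In a real test module the check would be roughly:
#   import attune_rag
#   assert version_drift(attune_rag.__version__, "attune-rag") is None
```

Comparing against the installed metadata rather than a hard-coded string means the check needs no edit on each release: whichever of the two places is forgotten, the mismatch surfaces in CI.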

Added

Calibration ground-truth labeling kit. Two scripts under
scripts/:
- build_calibration_labeling_kit.py picks N queries from a
  --compare-thinking --json artifact (largest shifts + a
  few unchanged controls) and emits a markdown labeling
  template.
- score_against_ground_truth.py reads the labeled markdown
  plus the artifact and reports which judge pass (off / on)
  aligned more closely with the human labels — the empirical
  signal that gates a future --thinking default-flip
  decision.
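
The selection step described above (largest shifts plus a few unchanged controls) could be sketched as follows. This is an illustration under stated assumptions, not the code in build_calibration_labeling_kit.py: the record fields "id" and "delta", the reading of "unchanged" as a shift of exactly zero, and the function name pick_queries are all hypothetical, and the sample records are made up.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def pick_queries(
    records: List[Record], n_shift: int, n_control: int
) -> Tuple[List[Record], List[Record]]:
    """Pick the n_shift records with the largest absolute shift, plus up to
    n_control records whose score did not move at all.

    Assumes each record carries a numeric "delta" (on-score minus off-score);
    the real artifact schema may differ.
    """
    by_shift = sorted(records, key=lambda r: abs(float(r["delta"])), reverse=True)
    shifted = by_shift[:n_shift]
    shifted_ids = {r["id"] for r in shifted}
    # Controls: unchanged queries that were not already picked as high-shift.
    controls = [
        r for r in records
        if float(r["delta"]) == 0.0 and r["id"] not in shifted_ids
    ][:n_control]
    return shifted, controls


# Made-up records for illustration only.
sample = [
    {"id": "example-1", "delta": 0.0},
    {"id": "example-2", "delta": -0.25},
    {"id": "example-3", "delta": 0.1},
    {"id": "example-4", "delta": 0.0},
    {"id": "example-5", "delta": 0.182},
]
shifted, controls = pick_queries(sample, n_shift=2, n_control=1)
```

Ranking by absolute value treats a large drop the same as a large gain, which matches the intent of labeling the queries where the two judge passes disagree most.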
Workflow documented in
docs/rag/faithfulness-thinking-calibration.md. First kit
for the 2026-05-15 run is committed at
artifacts/calibration/ground-truth-2026-05-15.template.md
(8 queries: 5 highest-shift + 3 controls). Known gap: the
benchmark JSON doesn't yet capture the generator's answer
text or retrieved passages, so the kit surfaces the judge's
claim lists as a proxy; a follow-up will enrich the JSON.

Larger calibration kit (17 queries, v2). A re-run of
--compare-thinking against the 40-query golden set on
enriched-JSON output (post-#26, with answer + context
embedded per query) produced a fresh artifact at
artifacts/calibration/thinking-2026-05-15-v2.json. From
that, a 17-query labeling kit was built at
artifacts/calibration/ground-truth-2026-05-15-v2.template.md
(13 highest-shift + 4 controls). Since answer and
context are embedded per query, the kit is now
self-contained — no live API calls needed at label time.
Surfaced a call-to-call-variance observation worth noting:
v2's high-shift set differs significantly from v1's
(e.g., gq-017 went from Δ=+0.182 to Δ=−0.250 across the
two runs); judge non-determinism means each calibration
captures a snapshot, not ground truth itself. See
docs/rag/faithfulness-thinking-calibration.md.

Ground-truth validation of the v2 calibration (17 queries).
Follow-up to the v1 round below. Patrick labeled the larger
17-query v2 kit under the same strict-lens philosophy.
Outcome: off-closer 6, on-closer 5, tied 6, so option B is
confirmed at 2× sample size. The off-vs-on margin narrowed
from 1.5× in v1 to 1.2× in v2: on is more competitive than v1
suggested, but not enough to flip the call. Labels at
artifacts/calibration/ground-truth-2026-05-15-v2.md;
results appended to
docs/rag/faithfulness-thinking-calibration.md.

Ground-truth validation of the v0.1.15 calibration.
Patrick labeled the 8-query kit interactively under a strict
lens; results are committed at
artifacts/calibration/ground-truth-2026-05-15.md and
written into the calibration doc. Outcome: among the 5
high-shift queries, off-closer 3, on-closer 2, tied 0
(3 controls were tied). Also surfaced a phantom-claim
pattern in judge-on (paraphrases the answer into more
specific claims, then flags its own paraphrases). The Option B
decision (keep --thinking opt-in) is now empirically
backed rather than resting on an absence of evidence.
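
For reference, the off-closer / on-closer / tied tallies and the 1.5× and 1.2× margins quoted in the two validation rounds can be reproduced with a sketch like the one below. The notes do not say how "closer" is measured; this assumes a numeric human score per query and absolute distance to each judge pass, and the names closer_pass, tally, and margin are hypothetical rather than taken from score_against_ground_truth.py.

```python
from collections import Counter
from typing import Iterable, Tuple


def closer_pass(human: float, off: float, on: float, tol: float = 1e-9) -> str:
    """Say which judge pass landed nearer the human label.

    Assumes all three values are scores on the same numeric scale.
    """
    d_off = abs(off - human)
    d_on = abs(on - human)
    if abs(d_off - d_on) <= tol:
        return "tied"
    return "off" if d_off < d_on else "on"


def tally(rows: Iterable[Tuple[float, float, float]]) -> Counter:
    """Count off-closer, on-closer, and tied queries over (human, off, on) rows."""
    return Counter(closer_pass(h, off, on) for h, off, on in rows)


def margin(off_closer: int, on_closer: int) -> float:
    """Ratio of off-closer to on-closer counts."""
    return off_closer / on_closer


# Counts reported in the notes: 3 vs 2 in the 8-query round, 6 vs 5 in the
# 17-query round.
v1_margin = margin(3, 2)  # 1.5
v2_margin = margin(6, 5)  # 1.2
```

Tied queries are left out of the ratio, which is why the 8-query round's 3 tied controls and the 17-query round's 6 ties do not affect the quoted margins.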
