v0.1.17
Fixed
attune_rag.__version__ was stale. Both v0.1.15 and
v0.1.16 shipped with __version__ = "0.1.14" because the
release-prep flow only bumped pyproject.toml (the version
PyPI uses) and never touched the in-source constant. Synced
__version__ to 0.1.16 and added
tests/unit/test_package_metadata.py asserting it matches
importlib.metadata.version("attune-rag") so the next
release-prep PR will fail CI if it forgets the bump.
Added
- Calibration ground-truth labeling kit. Two scripts under
scripts/: build_calibration_labeling_kit.py picks N queries from a
--compare-thinking --json artifact (largest shifts + a
few unchanged controls) and emits a markdown labeling
template. score_against_ground_truth.py reads the labeled markdown
plus the artifact and reports which judge pass (off / on)
aligned more closely with the human labels. That alignment is the
empirical signal that gates a future --thinking default-flip
decision.
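The selection rule described above (largest shifts plus a few unchanged controls) can be sketched as follows. This is an illustration only, not the logic of build_calibration_labeling_kit.py: the artifact schema is not shown in these notes, so the "id" and "delta" field names, the pick_kit_queries function, and the sample rows are all assumptions.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, float]]


def pick_kit_queries(
    rows: List[Row],
    n_shift: int = 5,
    n_controls: int = 3,
    eps: float = 1e-9,
) -> Dict[str, List[str]]:
    """Pick the n_shift queries with the largest absolute score shift,
    plus n_controls queries whose score did not change.

    Each row is assumed (hypothetically) to carry an "id" and a "delta",
    where delta is the on-minus-off score difference for that query.
    """
    # Largest |delta| first; ties broken by id so the output is deterministic.
    ranked = sorted(rows, key=lambda r: (-abs(r["delta"]), r["id"]))
    shifted = [r for r in ranked if abs(r["delta"]) > eps][:n_shift]

    # Controls are queries whose score is unchanged between the two passes.
    unchanged = sorted(
        (r for r in rows if abs(r["delta"]) <= eps), key=lambda r: r["id"]
    )
    controls = unchanged[:n_controls]

    return {
        "high_shift": [r["id"] for r in shifted],
        "controls": [r["id"] for r in controls],
    }


# Made-up rows, for illustration only.
example_rows = [
    {"id": "q1", "delta": 0.0},
    {"id": "q2", "delta": 0.3},
    {"id": "q3", "delta": -0.5},
    {"id": "q4", "delta": 0.0},
    {"id": "q5", "delta": 0.1},
]
kit = pick_kit_queries(example_rows, n_shift=2, n_controls=1)
```

Ranking by absolute value treats a large negative shift the same as a large positive one, which matches the intent of labeling the queries where the two judge passes disagree most.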
Workflow documented in
docs/rag/faithfulness-thinking-calibration.md. First kit
for the 2026-05-15 run is committed at
artifacts/calibration/ground-truth-2026-05-15.template.md
(8 queries: 5 highest-shift + 3 controls). Known gap: the
benchmark JSON doesn't yet capture the generator's answer
text or retrieved passages, so the kit surfaces the judge's
claim lists as a proxy; a follow-up will enrich the JSON.
- Larger calibration kit (17 queries, v2). A re-run of
--compare-thinking against the 40-query golden set on
enriched-JSON output (post-#26, with answer + context
embedded per query) produced a fresh artifact at
artifacts/calibration/thinking-2026-05-15-v2.json. From
that, a 17-query labeling kit was generated at
artifacts/calibration/ground-truth-2026-05-15-v2.template.md
(13 highest-shift + 4 controls). Since answer and
context are embedded per query, the kit is now
self-contained: no live API calls are needed at label time.
Surfaced a call-to-call-variance observation worth noting:
v2's high-shift set differs significantly from v1's
(e.g., gq-017 went from Δ=+0.182 to Δ=−0.250 across the
two runs); judge non-determinism means each calibration
captures a snapshot, not ground truth itself. See
docs/rag/faithfulness-thinking-calibration.md.
- Ground-truth validation of the v2 calibration (17 queries).
Follow-up to the v1 round below. Patrick labeled the larger
17-query v2 kit under the same strict-lens philosophy.
Outcome: off-closer 6, on-closer 5, tied 6. Option B is
confirmed at roughly 2× the sample size. The off-vs-on margin
narrowed from v1's 1.5× to 1.2×: on is more competitive than v1
suggested, but not enough to flip the call. Labels at
artifacts/calibration/ground-truth-2026-05-15-v2.md;
results appended to
docs/rag/faithfulness-thinking-calibration.md.
- Ground-truth validation of the v0.1.15 calibration.
Patrick labeled the 8-query kit interactively under a strict
lens; results are committed at
artifacts/calibration/ground-truth-2026-05-15.md and
written into the calibration doc. Outcome: among the 5
high-shift queries, off-closer 3, on-closer 2, tied 0
(3 controls were tied). Also surfaced a phantom-claim
pattern in judge-on (paraphrases the answer into more
specific claims, then flags its own paraphrases). The
Option B decision (keep --thinking opt-in) is now empirically
backed rather than resting on absence of evidence.
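The off-closer / on-closer / tied tallies and the margins quoted above can be reproduced with a small sketch. The real scoring rule in score_against_ground_truth.py is not described in these notes, so this assumes each query has a numeric human label and one numeric score per judge pass, and that "closer" means smaller absolute distance to the human label. The function names and the "human", "off", and "on" field names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def closer_pass(
    human: float, judge_off: float, judge_on: float, tol: float = 1e-9
) -> str:
    """Return "off", "on", or "tied" depending on which judge pass
    landed nearer to the human label (assumed distance metric)."""
    d_off = abs(judge_off - human)
    d_on = abs(judge_on - human)
    if abs(d_off - d_on) <= tol:
        return "tied"
    return "off" if d_off < d_on else "on"


def tally(rows: List[Dict[str, float]]) -> Dict[str, int]:
    """Count how many queries fall into each outcome."""
    counts = Counter(closer_pass(r["human"], r["off"], r["on"]) for r in rows)
    return {key: counts.get(key, 0) for key in ("off", "on", "tied")}


def off_vs_on_margin(counts: Dict[str, int]) -> Optional[float]:
    """Ratio of off-closer to on-closer queries; None if on-closer is 0."""
    if counts["on"] == 0:
        return None
    return counts["off"] / counts["on"]


# Tallies as reported in the entries above.
v1_counts = {"off": 3, "on": 2, "tied": 0}  # 5 high-shift queries
v2_counts = {"off": 6, "on": 5, "tied": 6}  # 17 queries
v1_margin = off_vs_on_margin(v1_counts)  # 3 / 2
v2_margin = off_vs_on_margin(v2_counts)  # 6 / 5
```

Under these assumptions the reported counts give margins of 1.5 for v1 and 1.2 for v2, matching the figures quoted in the v2 entry.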