v0.1.19

silversurfer562 released this 16 May 12:02

8edd93e

Phase 2 of the v1.0 roadmap — the --thinking default
decision is locked. No behavioral change ships in this
version (the default was already OFF and stays OFF). The
v3 ground-truth round (n = 30) gives a bootstrap 95 % CI
on (wins_off − wins_on) of [−1, +13] — point estimate
+6, CI includes zero. The decision rests on
"off-favored but not statistically distinguishable at this
sample size" plus judge variance well below the
escalation threshold, NOT on a positive CI. Phase 3's
API-surface groundwork (snapshot tests + deprecation
policy + __all__ audit) landed in 0.1.18 in parallel;
the formal API freeze still targets 0.2.0. See
ROADMAP-v1.md for sequencing.

Changed

--thinking default decision locked: stays OFF (Phase 2
of v1.0 roadmap). v3 ground-truth round at n=30 (15 shift
- 15 random + 2 controls). Bootstrap 95 % CI on
  (wins_off − wins_on) = [−1, +13] — point estimate +6,
  CI includes 0 (off favored but not statistically
  distinguishable). v3 off-to-on ratio 2.5× reverses the
  v1 → v2 narrowing (1.5× → 1.2× → 2.5×). Judge variance is
  small: margin_stdev = 0.0189, far below the 0.10
  escalation threshold (K=8 random-bucket queries × M=5 runs;
  5 of 8 hit σ=0 in both conditions). No baseline
  re-measurement needed because the default doesn't flip.
  Locked record:
  docs/specs/faithfulness-thinking-decision/decision.md.
  Calibration writeup:
  docs/rag/faithfulness-thinking-calibration.md.
docs/rag/faithfulness-thinking-calibration.md rewritten.
Top blockquote now states the locked Phase 2 decision
(was "pending"). The duplicated v2 ground-truth section
(two near-identical copies in the file) has been
deduplicated. A new "v3 (2026-05-16, n = 30 rubric + 2
controls)" section adds the aggregate-alignment table,
bootstrap CI, judge-variance discussion with the σ=0
finding, phantom-claim examples, the v1 → v2 → v3
comparison table, and a trace of all six rubric rules.
docs/specs/ROADMAP-v1.md marks Phase 2 complete
and Phase 3 unblocked; current-version row reflects
0.1.19.

Added

scripts/measure_judge_variance.py — judge-only
variance measurement. Re-runs the FaithfulnessJudge M times
per query in each condition (off, on) against captured
answer + context from a calibration artifact. Outputs
per-query mean/stdev + aggregate pooled stdevs +
margin_stdev. Used by Phase 2 to anchor the rubric's
noise-floor escalation rule.
Bootstrap CI + phantom-claim rate in
scripts/score_against_ground_truth.py. Extended with
--rubric-rule {legacy,design}, --control-ids,
--bootstrap-iters, --seed, and --variance. The
design-rule classifier flags a tie iff |off−on|,
|off−label|, |on−label| are all < 0.025. Bootstrap
resamples per-query verdicts B times and reports the 2.5 %
/ 97.5 % quantiles. Phantom-claim detection uses a
content-word overlap heuristic (overlap < 0.40 ⇒
flagged); honest about the limits — a literal-substring
matcher was tried first and yielded a meaningless 100 %
rate. The 6-rule acceptance rubric from design.md runs at
the bottom of every score run.
build_calibration_labeling_kit.py --n-random N +
--seed. New bucket: N queries drawn uniformly from the
remaining pool after shift+controls. Anchors the
noise-floor measurement on typical queries instead of
high-shift outliers. _select_queries now returns
(shifted, controls, random); the kit script's previous
flat-list return is gone (M2.1 / M2.2 of Phase 2).
docs/specs/faithfulness-thinking-decision/ — full
Phase 2 spec (requirements, design, tasks).
docs/specs/faithfulness-thinking-decision/decision.md —
locked, machine-readable YAML record of the Phase 2 verdict:
per-round win counts, bootstrap CI bounds, phantom rate,
variance numbers, prior-round comparisons, and the
methodology footnote about the mid-round labeler shift.
Future re-evaluations should start a successor spec
directory rather than amending this record.
v3 calibration artifacts at
artifacts/calibration/thinking-2026-05-16.json (paired
off+on benchmark at n=40),
ground-truth-2026-05-16.md (n=32 labels; 30 rubric +
2 controls), and variance-2026-05-16.json (K=8 × M=5
judge-variance measurement).
New unit-test coverage for the calibration toolchain.
tests/unit/test_measure_judge_variance.py (9 tests:
fake-judge integration, aggregate stdev math, CLI
validation) and an expanded tests/unit/test_calibration_scripts.py
(27 new tests across the design tie rule, content-word
tokenizer, phantom-rate detector, bootstrap CI, and all
six rubric branches; total 52 tests pass in 0.13 s).

Assets 2