v0.1.19

silversurfer562 released this 16 May 12:02
8edd93e

Phase 2 of the v1.0 roadmap — the --thinking default
decision is locked. No behavioral change ships in this
version (the default was already OFF and stays OFF). The
v3 ground-truth round (n = 30) gives a bootstrap 95 % CI
on (wins_off − wins_on) of [−1, +13] — point estimate
+6, CI includes zero. The decision rests on
"off-favored but not statistically distinguishable at this
sample size" plus judge variance well below the
escalation threshold, NOT on a positive CI. Phase 3's
API-surface groundwork (snapshot tests + deprecation
policy + __all__ audit) landed in 0.1.18 in parallel;
the formal API freeze still targets 0.2.0. See
ROADMAP-v1.md for sequencing.
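
For readers unfamiliar with the interval quoted above, here is a minimal sketch of a percentile bootstrap on (wins_off − wins_on). It is not the code in scripts/score_against_ground_truth.py: the function name `bootstrap_ci`, the verdict encoding, and the nearest-rank quantile are assumptions made for illustration, and the verdict counts are illustrative values chosen only to be consistent with the reported +6 point estimate and 2.5× ratio.

```python
import random

# Assumed encoding: each per-query verdict contributes +1 (off wins),
# -1 (on wins), or 0 (tie) to wins_off - wins_on.
SCORE = {"off": 1, "on": -1, "tie": 0}


def bootstrap_ci(verdicts, iters=10_000, seed=0, alpha=0.05):
    """Percentile bootstrap CI on (wins_off - wins_on).

    Resamples the per-query verdicts with replacement `iters` times and
    returns the alpha/2 and 1 - alpha/2 quantiles (nearest-rank).
    """
    rng = random.Random(seed)  # seeded so a run is reproducible
    n = len(verdicts)
    diffs = []
    for _ in range(iters):
        sample = rng.choices(verdicts, k=n)
        diffs.append(sum(SCORE[v] for v in sample))
    diffs.sort()
    lo = diffs[int((alpha / 2) * iters)]
    hi = diffs[min(iters - 1, int((1 - alpha / 2) * iters))]
    return lo, hi


if __name__ == "__main__":
    # Illustrative counts only (an assumption, not read from the artifact).
    verdicts = ["off"] * 10 + ["on"] * 4 + ["tie"] * 16
    print(bootstrap_ci(verdicts, iters=10_000, seed=0))
```

An interval that straddles zero, as reported for v3, means the resampled difference changes sign often enough that the off-favored point estimate cannot be separated from noise at this sample size.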

Changed

  • --thinking default decision locked: stays OFF (Phase 2
    of v1.0 roadmap).
    v3 ground-truth round at n=30 (15 shift + 15 random),
    plus 2 controls. Bootstrap 95 % CI on
    (wins_off − wins_on) = [−1, +13]; point estimate +6,
    CI includes 0 (off favored but not statistically
    distinguishable). The v3 off-to-on ratio of 2.5× reverses the
    v1 → v2 narrowing (1.5× → 1.2× → 2.5×). Judge variance is
    small: margin_stdev = 0.0189, far below the 0.10
    escalation threshold (K=8 random-bucket queries × M=5 runs;
    5 of 8 hit σ=0 in both conditions). No baseline
    re-measurement is needed because the default doesn't flip.
    Locked record:
    docs/specs/faithfulness-thinking-decision/decision.md.
    Calibration writeup:
    docs/rag/faithfulness-thinking-calibration.md.
  • docs/rag/faithfulness-thinking-calibration.md rewritten.
    Top blockquote now states the locked Phase 2 decision
    (was "pending"). The duplicated v2 ground-truth section
    (two near-identical copies in the file) has been
    deduplicated. A new "v3 (2026-05-16, n = 30 rubric + 2
    controls)" section adds the aggregate-alignment table,
    bootstrap CI, judge-variance discussion with the σ=0
    finding, phantom-claim examples, the v1 → v2 → v3
    comparison table, and a trace of all six rubric rules.
  • docs/specs/ROADMAP-v1.md marks Phase 2 complete
    and Phase 3 unblocked; current-version row reflects
    0.1.19.
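
The judge-variance check in the first bullet above can be sketched as follows. These notes do not spell out how margin_stdev is computed, so the definition here is an assumption: per-condition stdevs are pooled across queries, then combined as the stdev of a difference of independent scores. The function names and the `>=` comparison are hypothetical; only the 0.10 threshold comes from the notes.

```python
import math
import statistics

ESCALATION_THRESHOLD = 0.10  # threshold quoted in the release notes


def pooled_stdev(runs_per_query):
    """Pool sample variances across queries, weighted by degrees of freedom."""
    weighted_var = 0.0
    dof = 0
    for runs in runs_per_query:
        if len(runs) < 2:
            continue  # a single run carries no variance information
        weighted_var += (len(runs) - 1) * statistics.variance(runs)
        dof += len(runs) - 1
    return math.sqrt(weighted_var / dof) if dof else 0.0


def margin_stdev(off_runs, on_runs):
    """ASSUMED definition: stdev of (off - on) for independent judge runs."""
    return math.sqrt(pooled_stdev(off_runs) ** 2 + pooled_stdev(on_runs) ** 2)


def needs_escalation(off_runs, on_runs, threshold=ESCALATION_THRESHOLD):
    """Hypothetical rule: escalate when judge noise reaches the threshold."""
    return margin_stdev(off_runs, on_runs) >= threshold


if __name__ == "__main__":
    # Made-up scores: K=2 queries x M=5 runs per condition.
    off = [[0.8] * 5, [0.5, 0.5, 0.5, 0.5, 0.6]]
    on = [[0.7] * 5, [0.4] * 5]
    print(margin_stdev(off, on), needs_escalation(off, on))
```

A query whose M runs all return the same score contributes zero variance, which is how a σ=0 result for several queries pulls the pooled figure down.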

Added

  • scripts/measure_judge_variance.py — judge-only
    variance measurement. Re-runs the FaithfulnessJudge M times
    per query in each condition (off, on) against captured
    answer + context from a calibration artifact. Outputs
    per-query mean/stdev + aggregate pooled stdevs +
    margin_stdev. Used by Phase 2 to anchor the rubric's
    noise-floor escalation rule.
  • Bootstrap CI + phantom-claim rate in
    scripts/score_against_ground_truth.py.
    Extended with
    --rubric-rule {legacy,design}, --control-ids,
    --bootstrap-iters, --seed, and --variance. The
    design-rule classifier flags a tie iff |off−on|,
    |off−label|, |on−label| are all < 0.025. Bootstrap
    resamples per-query verdicts B times and reports the 2.5 %
    / 97.5 % quantiles. Phantom-claim detection uses a
    content-word overlap heuristic (claims with overlap < 0.40
    are flagged). The heuristic is approximate; a literal-substring
    matcher was tried first and yielded a meaningless 100 %
    rate. The 6-rule acceptance rubric from design.md runs at
    the bottom of every score run.
  • build_calibration_labeling_kit.py --n-random N +
    --seed.
    New bucket: N queries drawn uniformly from the
    remaining pool after shift+controls. Anchors the
    noise-floor measurement on typical queries instead of
    high-shift outliers. _select_queries now returns
    (shifted, controls, random); the kit script's previous
    flat-list return is gone (M2.1 / M2.2 of Phase 2).
  • docs/specs/faithfulness-thinking-decision/ — full
    Phase 2 spec (requirements, design, tasks).
  • docs/specs/faithfulness-thinking-decision/decision.md is the
    locked, machine-readable YAML record of the Phase 2 verdict:
    per-round win counts, bootstrap CI bounds, phantom rate,
    variance numbers, prior-round comparisons, and the
    methodology footnote about the mid-round labeler shift.
    Future re-evaluations should start a successor spec
    directory rather than amending this record.
  • v3 calibration artifacts at
    artifacts/calibration/thinking-2026-05-16.json (paired
    off+on benchmark at n=40),
    ground-truth-2026-05-16.md (n=32 labels; 30 rubric +
    2 controls), and variance-2026-05-16.json (K=8 × M=5
    judge-variance measurement).
  • New unit-test coverage for the calibration toolchain.
    tests/unit/test_measure_judge_variance.py (9 tests:
    fake-judge integration, aggregate stdev math, CLI
    validation) and an expanded tests/unit/test_calibration_scripts.py
    (27 new tests across the design tie rule, content-word
    tokenizer, phantom-rate detector, bootstrap CI, and all
    six rubric branches; total 52 tests pass in 0.13 s).
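
The design tie rule and the phantom-claim heuristic described under score_against_ground_truth.py can be sketched roughly as below. The 0.025 and 0.40 thresholds come from these notes; everything else is an assumption for illustration, including the function names, the tokenizer regex, the tiny stopword list, and the choice to divide the overlap by the number of content words in the claim. The actual script may differ on all of these.

```python
import re

TIE_EPS = 0.025          # from the notes: all pairwise gaps must be < 0.025
PHANTOM_OVERLAP = 0.40   # from the notes: overlap < 0.40 is flagged

# Illustrative stopword list (assumption; the real tokenizer is not shown here).
STOPWORDS = frozenset({
    "a", "an", "the", "is", "are", "was", "were", "of", "in", "on",
    "to", "and", "or", "that", "this", "it", "for", "with", "as", "by",
})


def is_design_tie(off, on, label, eps=TIE_EPS):
    """Tie iff |off-on|, |off-label| and |on-label| are all below eps."""
    return (
        abs(off - on) < eps
        and abs(off - label) < eps
        and abs(on - label) < eps
    )


def content_words(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def is_phantom(claim, context, threshold=PHANTOM_OVERLAP):
    """Flag a claim whose content words are mostly absent from the context."""
    claim_words = content_words(claim)
    if not claim_words:
        return False  # nothing to check; assumed to be treated as not phantom
    overlap = len(claim_words & content_words(context)) / len(claim_words)
    return overlap < threshold


if __name__ == "__main__":
    print(is_design_tie(0.90, 0.91, 0.92))
    print(is_phantom("The cache is flushed nightly",
                     "Logs rotate weekly on the server"))
```

Comparing sets of content words rather than literal substrings tolerates paraphrase, which is presumably why the substring matcher flagged everything.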