v2.4.6 — DEFCON Special

@TSchonleber released this 19 Apr 23:41
· 116 commits to main since this release

Top-heavy retrieval lift (plan-20260419) shipped end-to-end across a 6-way swarm (codex + claude-code).

Headline numbers

| Metric | Baseline → 2.4.6 | Plan target | Status |
| --- | --- | --- | --- |
| LoCoMo hybrid Hit@1 | 0.023 → 0.279 | +1.0pp | +25.5pp (crushed) |
| LoCoMo hybrid MRR | 0.032 → 0.394 | +0.5pp | +36.2pp (crushed) |
| LongMemEval Hit@1 | 0.882 → 0.869 | +0.8pp | flat within noise on n=289; FULL beats ROLLBACK +62.3pp like-for-like |
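
For readers unfamiliar with the metrics, here is a minimal sketch of how Hit@1, MRR, and a percentage-point delta are conventionally computed. It uses the standard textbook definitions and toy data; the function names are illustrative and are not taken from the brainctl benchmark harness, which may aggregate differently.

```python
from typing import Sequence, Set


def hit_at_1(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """1.0 if the top-ranked result is relevant, else 0.0."""
    return 1.0 if ranked_ids and ranked_ids[0] in relevant else 0.0


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean; 0.0 for an empty sequence."""
    return sum(values) / len(values) if values else 0.0


def pp_delta(baseline: float, current: float) -> float:
    """Difference between two rates, expressed in percentage points."""
    return (current - baseline) * 100.0


if __name__ == "__main__":
    # Toy queries: (ranked result ids, set of relevant ids).
    queries = [
        (["a", "b", "c"], {"a"}),  # relevant at rank 1
        (["x", "y", "z"], {"z"}),  # relevant at rank 3
        (["p", "q"], {"r"}),       # relevant result not retrieved
    ]
    print("Hit@1:", mean([hit_at_1(r, rel) for r, rel in queries]))
    print("MRR:  ", mean([reciprocal_rank(r, rel) for r, rel in queries]))
    # The MRR row of the table: 0.032 -> 0.394.
    print("MRR delta (pp):", round(pp_delta(0.032, 0.394), 1))
```

Deltas computed from the rounded table values can differ from the reported ones in the last digit, presumably because the reported deltas come from unrounded scores.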

What landed

  • I2/I3/I4 — unified Brain.search + cmd_search pipeline, regex intent router, last-mile CE reranker with BRAINCTL_CE_P95_BUDGET_MS gate
  • I6/I7 — BRAINCTL_TOPHEAVY_ROLLBACK=1 emergency bypass, docs refresh
  • I1 — frozen benchmarks/snapshots/baseline-20260419/ + --traces flag
  • I5 — benchmarks/snapshots/calibration-20260419/ 3-cell ablation + BRAINCTL_DISABLE_INTENT_ROUTER=1 ablation bypass
  • I8 — strict retrieval-gate CI job, per-slice Hit@1/MRR/nDCG@5 gates, cross-platform-aware p95 latency gate, PR-comment matrix
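
To make the three environment controls above concrete, here is a rough sketch of how flags like these could gate a retrieval pipeline. Only the variable names come from these notes. The helper functions, the regex patterns, the intent labels, and the exact semantics (for example, that the budget is compared against an observed reranker p95) are assumptions for illustration, not brainctl's actual implementation.

```python
import os
import re
from typing import List, Mapping, Pattern, Tuple

# Hypothetical intent patterns; the real router's rules are not shown in these notes.
INTENT_PATTERNS: List[Tuple[str, Pattern[str]]] = [
    ("temporal", re.compile(r"\b(when|before|after)\b", re.IGNORECASE)),
    ("entity", re.compile(r"^\s*who\b", re.IGNORECASE)),
]


def flag_enabled(env: Mapping[str, str], name: str) -> bool:
    """Assumption: a flag counts as set only when its value is exactly '1'."""
    return env.get(name, "") == "1"


def route_intent(query: str, env: Mapping[str, str]) -> str:
    """Return an intent label, or 'default' when the router is bypassed."""
    if flag_enabled(env, "BRAINCTL_TOPHEAVY_ROLLBACK"):
        return "default"  # emergency bypass: skip all top-heavy stages
    if flag_enabled(env, "BRAINCTL_DISABLE_INTENT_ROUTER"):
        return "default"  # ablation bypass: skip only the router
    for label, pattern in INTENT_PATTERNS:
        if pattern.search(query):
            return label
    return "default"


def should_rerank(env: Mapping[str, str], observed_p95_ms: float) -> bool:
    """Assumption: rerank only while the observed p95 stays within the budget."""
    if flag_enabled(env, "BRAINCTL_TOPHEAVY_ROLLBACK"):
        return False
    raw_budget = env.get("BRAINCTL_CE_P95_BUDGET_MS")
    if raw_budget is None:
        return True  # no budget configured: the gate stays open
    return observed_p95_ms <= float(raw_budget)


if __name__ == "__main__":
    print(route_intent("When did Alice move?", os.environ))
    print(should_rerank({"BRAINCTL_CE_P95_BUDGET_MS": "50"}, observed_p95_ms=42.0))
```

Passing the environment in as a mapping, instead of reading `os.environ` inside each function, keeps each bypass easy to exercise in tests.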

Fixes

  • init_schema.sql synced for migration 051 code_ingest_cache (fresh installs no longer need post-init migrate)
  • test_connection_lifecycle widened to filter by brain.py callsite (single-conn invariant holds post-unification)
  • Cross-platform latency-gate skip for subprocess-bound CLI ops (a darwin baseline compared against a fresh ubuntu-latest run no longer produces false positives)

Upgrade

pip install --upgrade brainctl==2.4.6

No runtime config change needed — main defaults already enable the top-heavy controls.

Follow-ups (not blocking)

  • Wire args.rerank through tests/bench/{locomo,longmemeval}_eval.py so CE dimension is measurable
  • Per-question_type slice analysis of the intent router (FULL == NO_INTENT on aggregate metrics)
  • Fix I5 driver _extract_metrics p95 parse, populate baseline_p95_ms: in tests/bench/budgets/*.yaml so the p95 leg flips from advisory to enforcing
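
The advisory-versus-enforcing distinction in the last item can be sketched as follows. This is an illustration only: the nearest-rank percentile, the `tolerance` multiplier, and the function names are assumptions; only the `baseline_p95_ms` key name comes from these notes.

```python
from typing import Optional, Sequence, Tuple


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def check_p95(
    samples_ms: Sequence[float],
    baseline_p95_ms: Optional[float],
    tolerance: float = 1.10,  # hypothetical: allow 10% drift over baseline
) -> Tuple[str, float]:
    """Return (verdict, observed_p95).

    With no baseline recorded, the gate can only report ('advisory').
    Once a baseline exists, it can pass or fail the run.
    """
    observed = p95(samples_ms)
    if baseline_p95_ms is None:
        return ("advisory", observed)
    if observed <= baseline_p95_ms * tolerance:
        return ("pass", observed)
    return ("fail", observed)


if __name__ == "__main__":
    latencies = [float(ms) for ms in range(1, 101)]  # toy data: 1..100 ms
    print(check_p95(latencies, baseline_p95_ms=None))
    print(check_p95(latencies, baseline_p95_ms=90.0))
    print(check_p95(latencies, baseline_p95_ms=80.0))
```

Under this reading, populating `baseline_p95_ms:` in the budget files is what moves the gate from the first branch to the other two.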

Full changelog

See CHANGELOG.md.