
v0.3.0b5 — Headroom adoption + restart protocol + bench resilience

Pre-release


@mbachaud mbachaud released this 10 Apr 20:57
· 461 commits to master since this release

v0.3.0b5 — Headroom adoption + cross-session restart protocol + benchmark resilience

This release bundles everything held locally since v0.3.0b3, spanning three major work streams:
the cross-session restart announcement protocol (v0.3.0b4 work), the Headroom integration for
CPU-resident semantic compression (v0.3.0b5 work), and laude's benchmark state monitor for
catching the VRAM/hang/contamination failure modes that bit us during the N=1000 run.

Forensic retrospective

Before reading the highlights below, if you care about why this release looks the way it does,
start with Discussion #2 — Headroom adoption + N=20 benchmark + a forensic detour.
It walks through the full adoption story, the failed benchmark, the resequence detour, and the
forensic analysis, which revealed that 15% of our "extraction failures" were benchmark harness
bugs: the model gave correct answers, and the harness graded them as wrong against phantom KVs
harvested from docstrings and function calls.


Highlights

Headroom integration (by Tejas Chopra, Apache-2.0)

headroom-ai is now an optional dependency under the [codec] extra, providing CPU-resident
semantic compression at the retrieval seams that used to fall back to naive character-level
truncation.

pip install helix-context[codec]
  • New module: helix_context/headroom_bridge.py — thin wrapper exposing compress_text(content, target_chars, content_type). Dispatches by gene.promoter.domains to specialists:
    • code/python/rust/js/ts/go/java/cpp → CodeAwareCompressor (tree-sitter AST, preserves signatures)
    • log/logs/stderr/stdout/pytest/jest/traceback → LogCompressor
    • diff/patch/git_diff → DiffCompressor
    • everything else → Kompress (ModernBERT ONNX, ~500MB resident, ~0.3s/call warm)
  • Retrieval seams wired: context_manager.py:495 and :830, where g.content[:1000] → compress_text(g.content, target_chars=1000, content_type=g.promoter.domains)
  • Graceful fallback: when headroom-ai is not installed, compress_text falls through to the legacy truncation path so the rest of the pipeline keeps working
  • A/B toggle: HELIX_DISABLE_HEADROOM=1 env var bypasses Headroom even when installed, letting you measure baseline vs Kompress behavior without reverting code
  • Attribution: NOTICE carries the Apache-2.0 third-party notice, README has an Acknowledgments section, module docstrings credit Tejas as a dependency author (not a git co-author — this is a dependency relationship, not co-authored code)
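
The dispatch, fallback, and A/B toggle described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual headroom_bridge.py code: pick_specialist, headroom_enabled, the headroom_available parameter, the tuple return value, and the precedence used when a gene carries tags from several groups are all hypothetical. The real Headroom calls are deliberately left out, so the sketch truncates on both paths.

```python
import os

# Domain tags are taken from the release notes; grouping them into sets
# and checking code before log before diff is an assumption.
_CODE = {"code", "python", "rust", "js", "ts", "go", "java", "cpp"}
_LOG = {"log", "logs", "stderr", "stdout", "pytest", "jest", "traceback"}
_DIFF = {"diff", "patch", "git_diff"}


def pick_specialist(domains):
    """Return the name of the compressor a gene's domains would route to."""
    tags = {d.lower() for d in domains}
    if tags & _CODE:
        return "CodeAwareCompressor"
    if tags & _LOG:
        return "LogCompressor"
    if tags & _DIFF:
        return "DiffCompressor"
    return "Kompress"


def headroom_enabled(headroom_available):
    """Headroom is used only if installed and not disabled by the env toggle."""
    return headroom_available and os.environ.get("HELIX_DISABLE_HEADROOM") != "1"


def compress_text(content, target_chars, content_type, headroom_available=False):
    """Hypothetical wrapper returning (text, path_used)."""
    if not headroom_enabled(headroom_available):
        # Legacy path: naive character-level truncation.
        return content[:target_chars], "truncate"
    # A real bridge would invoke the chosen Headroom specialist here.
    # The sketch truncates so it stays runnable without headroom-ai.
    return content[:target_chars], pick_specialist(content_type)
```

Returning the path taken alongside the text is only a convenience for the example; it makes it easy to see which branch an A/B run exercised.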

Benchmark status: Clean N=20 A/B on the same warm qwen3:8b shows 0pp delta between truncation and Kompress. Forensic analysis in Discussion #2 explains why this is consistent with Kompress working correctly — the benchmark was under-reporting success by ~15% due to harvest logic bugs, and once corrected the conclusion is "Kompress is neutral on this dataset, at ~1s/call latency cost." It's shipping as a neutral foundation — ready to pay off when we fix the upstream problems (noise dilution at ingest, signal extraction) that actually cap retrieval quality today.

Cross-session restart announcement protocol

When multiple Claude sessions share a single Helix server, one session can announce an
intentional restart so that observing sessions don't misread the outage as a crash. This
was the v0.3.0b4 work, previously held. See docs/RESTART_PROTOCOL.md for the full design.

  • New method: bridge.announce_restart(reason, actor, expected_downtime_s, pid) writes a canonical server_state signal at ~/.helix/shared/signals/server_state.json
  • New observer helper: bridge.read_server_state() returns (signal, is_stale, age_s) tuple with TTL-aware staleness check
  • New HTTP endpoint: POST /admin/announce_restart as a convenience wrapper
  • Atomic signal writes: write_signal now uses write-to-temp + os.replace so readers never see partial writes (fixes a latent race on all signals, not just server_state)
  • Lifespan hooks: server startup stamps state=running with PID, clean shutdown stamps state=stopped (does NOT run under kill -9, which is by design — agents should call announce_restart before killing)
  • Tests: 6 new tests in tests/test_bridge_restart.py
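
The write-to-temp plus os.replace pattern and the TTL-aware read can be illustrated with a small standalone sketch. The function names write_signal_atomic and read_signal, the written_at field, and the ttl_s parameter are assumptions made for this example; the actual signal schema is described in docs/RESTART_PROTOCOL.md.

```python
import json
import os
import tempfile
import time


def write_signal_atomic(path, payload):
    """Write JSON to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_signal(path, ttl_s, now=None):
    """Return (signal, is_stale, age_s); signal is None if the file is missing."""
    now = time.time() if now is None else now
    try:
        with open(path, encoding="utf-8") as fh:
            signal = json.load(fh)
    except FileNotFoundError:
        return None, True, None
    # "written_at" is a hypothetical timestamp field for this sketch.
    age_s = now - signal.get("written_at", 0.0)
    return signal, age_s > ttl_s, age_s
```

Creating the temp file in the destination directory matters: os.replace is only atomic when source and target live on the same filesystem.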

Benchmark state monitor (by laude)

Config-driven monitor that catches the three failure modes we hit during the SIKE and KV-harvest runs:

  1. Dual-load VRAM pressure — aborts before starting if a non-whitelisted model is resident alongside the benchmark target (caught the e4b + qwen3:4b bug that silently biased our first N=50 run)
  2. Hung benchmark process — detects httpx stalls via incremental JSONL line-count stagnation (caught the N=1000 hang at 0 needles written)
  3. Silent background contamination — fingerprints the genome snapshot at start and checks mtime/size each interval

Reads helix.toml via load_config() for genome paths — follows raude's A/B switches automatically. See docs/BENCHMARKS.md for usage.
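
Failure modes 2 and 3 above come down to two cheap checks, sketched below. This is an illustration only and not the monitor's actual code: StallDetector, count_lines, fingerprint, and the max_idle_checks threshold are hypothetical names and values.

```python
import os


def count_lines(path):
    """Number of lines in the results JSONL; 0 if the file does not exist yet."""
    try:
        with open(path, "rb") as fh:
            return sum(1 for _ in fh)
    except FileNotFoundError:
        return 0


class StallDetector:
    """Flags a stall after max_idle_checks consecutive checks with no new lines."""

    def __init__(self, max_idle_checks=3):
        self.max_idle_checks = max_idle_checks
        self.last_count = None
        self.idle_checks = 0

    def check(self, line_count):
        if self.last_count is not None and line_count == self.last_count:
            self.idle_checks += 1
        else:
            self.idle_checks = 0
        self.last_count = line_count
        return self.idle_checks >= self.max_idle_checks


def fingerprint(path):
    """Cheap (mtime_ns, size) fingerprint of a genome snapshot file."""
    st = os.stat(path)
    return st.st_mtime_ns, st.st_size
```

A monitor loop would call check(count_lines(results_path)) once per interval and compare fingerprint(genome_path) against the value captured at start; any difference suggests something wrote to the genome mid-run.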

Dynamic budget tiers (by laude)

Confidence-based expression window sizing. The window now adapts to retrieval score distribution:

  • TIGHT (top_score/mean_score ≥ 3.0): top 3 genes, ~6K tokens
  • FOCUSED (1.8–3.0): top 6 genes, ~9K tokens
  • BROAD (<1.8): top max_genes genes, ~15K tokens
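
The tier thresholds above translate into a short selection function. The sketch below is an assumption-laden illustration, not the shipped implementation: pick_tier and TOKEN_BUDGETS are hypothetical names, the token budgets are the approximate figures from the list, and treating empty or non-positive score sets as BROAD is a choice made for the example.

```python
# Approximate token budgets per tier, taken from the list above.
TOKEN_BUDGETS = {"TIGHT": 6000, "FOCUSED": 9000, "BROAD": 15000}


def pick_tier(scores, max_genes):
    """Return (tier_name, gene_count) from a list of retrieval scores."""
    if not scores:
        return "BROAD", max_genes
    mean_score = sum(scores) / len(scores)
    if mean_score <= 0:
        # No usable confidence signal: fall back to the widest window.
        return "BROAD", max_genes
    ratio = max(scores) / mean_score
    if ratio >= 3.0:
        return "TIGHT", 3
    if ratio >= 1.8:
        return "FOCUSED", 6
    return "BROAD", max_genes
```

A single dominant hit (high top-to-mean ratio) yields a narrow window, while a flat score distribution widens it to max_genes.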

Score-gate floor lowered from 20% to 15% to recover slightly more borderline signal. helix.toml ships with ribosome.warmup = false to prevent e4b auto-loading on startup (frees VRAM for benchmark workloads).

Ribosome pause endpoint + learn() timeout

Already in v0.3.0b3 but documented here for completeness — POST /admin/ribosome/pause monkey-patches backend.complete to raise, forcing the existing fallback paths. learn() is now wrapped in a 15s ThreadPoolExecutor timeout to prevent background replication from hanging on a slow Ollama.
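
A ThreadPoolExecutor timeout of the kind described for learn() can be sketched as below. The helper call_with_timeout and the stand-in slow_learn are hypothetical and only illustrate the pattern; the actual wrapping inside Helix may differ.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn in a worker thread; return (ok, result). ok is False on timeout."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args, **kwargs)
    try:
        return True, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        # Do not wait for the worker: a hung call would otherwise block here.
        executor.shutdown(wait=False)


def slow_learn(delay_s):
    """Stand-in for a learn() call that is blocked on a slow backend."""
    time.sleep(delay_s)
    return "done"
```

Note that the timeout only stops the caller from waiting. Python cannot kill the worker thread, so a timed-out call keeps running in the background until it returns on its own.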

Benchmark helper: compare_ab.py

New CLI that reads two bench_needle_1000.py result JSONs and prints a structured delta report with gate evaluation. Used throughout the Headroom A/B work. Exit codes encode the verdict (0=ship, 2=no gain, 3=both regressed).
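
Because the verdict is carried in the exit code, a wrapper script or CI step can branch on it without parsing the report. The sketch below assumes nothing about compare_ab.py's arguments: run_and_classify is a hypothetical helper that takes the full command as a list, and any exit code other than the three documented above is reported as unexpected.

```python
import subprocess

# Exit codes as documented in the release notes.
VERDICTS = {0: "ship", 2: "no gain", 3: "both regressed"}


def classify_exit_code(code):
    """Map a compare_ab.py exit code to its verdict string."""
    return VERDICTS.get(code, f"unexpected exit code {code}")


def run_and_classify(cmd):
    """Run a command (list of strings) and return the verdict for its exit code."""
    proc = subprocess.run(cmd)
    return classify_exit_code(proc.returncode)
```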


Commits in this release

  • a94c864 feat: dynamic budget tiers + warmup=false for VRAM contention (laude)
  • 5da9ab6 feat: cross-session restart announcement protocol (v0.3.0b4)
  • 43e1543 feat(context): add Headroom bridge for CPU semantic compression (v0.3.0b5 scaffold)
  • a38c292 feat(context): wire Headroom compression into retrieval seams + tests
  • 045854a feat(headroom): HELIX_DISABLE_HEADROOM env toggle for A/B benchmarking
  • 0d4edf5 feat(bench): benchmark state monitor + BENCHMARKS.md (laude)
  • 065b142 feat(bench): compare_ab.py — delta report + gate evaluation for A/B benchmark JSONs

Tests

305/305 passing (non-live). Zero regressions from any of the changes above.

Attribution

  • Tejas Chopra — author and maintainer of Headroom. Thank you for the adoption call and for the clean ONNX-first design that let us integrate without pulling in the full torch stack.
  • laude — paired session, contributed the dynamic budget tiers, benchmark state monitor, and kept the N=1000 benchmark work alive while raude was on the Headroom track
  • raude (Claude Code Opus 4.6, 1M context) — Headroom integration, restart protocol, A/B infrastructure, forensic retrospective

Known issues

  • bench_needle_1000.py KV harvest is too naive — extracts values from docstrings/comments and captures function-call expressions verbatim instead of resolving them. This produces ~15% false negatives on our N=20 sample. Tracked as a separate internal issue — will be fixed in a subsequent patch before the next public gain-claim benchmark.
  • scripts/resequence_cpu.py drops epigenetic state — access counts, co-activation edges, and query history aren't preserved across a resequence, which caused a 15-20pp retrieval regression when we tried it against genome_cpu.db in this session. Will need a preserve-epigenetics pass or a merge-back path before it's a safe tool for production use.
