v0.6.0
v0.6.0 (2026-02-23)
This release is published under the Apache-2.0 License.
Features
- Add live progress output to accuracy benchmark (17c854d)
Print per-question status lines to stderr as the benchmark runs, showing pass/fail/skip, format, fixture, question, and elapsed time.
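As an illustration of the kind of status line described above, here is a minimal sketch. The function name `report_progress`, the field order, and the exact line layout are assumptions made for this example, not the format used by benchmarks/accuracy.py.

```python
import sys


def report_progress(status, fmt, fixture, question, elapsed_s, stream=None):
    """Write one status line per question (hypothetical layout)."""
    if stream is None:
        stream = sys.stderr  # keep stdout free for the final report
    line = f"[{status.upper()}] {fmt} {fixture} | {question} ({elapsed_s:.1f}s)"
    # Flush so the line appears while the benchmark is still running.
    print(line, file=stream, flush=True)
    return line
```

Writing to stderr keeps progress lines out of any report that is captured from stdout.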
- Add max_table_columns and elide_mostly_zero_pct heuristics for wide table readability (21ad4bd)
Add two new experimental heuristics to help small LLMs parse wide tables:
- max_table_columns: caps table width, dropping rightmost columns (identity columns survive via ordering)
- elide_mostly_zero_pct: removes columns where most values are zero, annotating outliers with identity labels
Also adds --heuristics flag to benchmarks/accuracy.py for testing strategies without code changes, and updates config.py to support float-valued heuristic parameters.
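The two heuristics can be sketched roughly as follows. This is not the project's implementation: the helper names, the annotation text, and the reading of elide_mostly_zero_pct as a fraction between 0 and 1 are all assumptions for illustration.

```python
def is_zero(value):
    """True for numeric zero; booleans and strings do not count."""
    return isinstance(value, (int, float)) and not isinstance(value, bool) and value == 0


def cap_columns(columns, max_table_columns):
    """Keep the leftmost columns; identity columns survive if ordered first."""
    return columns[:max_table_columns]


def elide_mostly_zero(columns, rows, id_column, elide_mostly_zero_pct):
    """Drop columns whose share of zeros reaches the threshold, noting outliers."""
    kept, notes = [], []
    for col in columns:
        zero_share = sum(is_zero(r.get(col)) for r in rows) / len(rows) if rows else 0.0
        if col == id_column or zero_share < elide_mostly_zero_pct:
            kept.append(col)
            continue
        # Label each non-zero value with the row's identity so it stays answerable.
        outliers = [f"{r[id_column]}={r.get(col)}" for r in rows if not is_zero(r.get(col))]
        notes.append(f"{col}: 0 except " + ", ".join(outliers) if outliers else f"{col}: all 0")
    return kept, notes
```

In this sketch, a `restarts` column that is zero for three of four rows would be dropped at a threshold of 0.75 and replaced by a note naming the one row that differs.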
- Add multi-hop, arithmetic, and ranking benchmark questions (eb47680)
Add 12 harder questions requiring multi-step reasoning: multi-hop lookups, percentage calculations, inverse filtering, ranking beyond top-1, cross-section joins, and reading elided annotation values.
- Add per-tool heuristic overrides and harder benchmark questions (880e371)
- tool_heuristics config allows per-tool heuristic overrides that merge on top of base server heuristics
- Fix string value parsing in CONDENSER_HEURISTICS env var (previously coerced non-bool strings to True)
- Add 12 harder cross-reference/comparison/aggregation benchmark questions
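A minimal sketch of both ideas, under stated assumptions: `coerce_value` and `effective_heuristics` are hypothetical names, the tool name `list_pods` is made up, and the sketch covers only how a single string value might be typed, not the actual syntax of CONDENSER_HEURISTICS.

```python
def coerce_value(raw):
    """Turn one string into bool, int, float, or leave it as a string."""
    text = raw.strip()
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    # Non-bool, non-numeric strings stay strings instead of becoming True.
    return text


def effective_heuristics(base, tool_heuristics, tool_name):
    """Per-tool overrides win over base server heuristics; the rest is inherited."""
    return {**base, **tool_heuristics.get(tool_name, {})}
```

For example, with a base of `{"max_table_columns": 12}` and an override of `{"list_pods": {"max_table_columns": 6}}`, the hypothetical `list_pods` tool would see 6 while every other tool keeps 12.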
- Expand benchmark suite with multi-model matrix and new fixtures (cda4274)
Add multi-model accuracy benchmark infrastructure:
- benchmarks/fixtures.py: shared questions (90 across 5 fixtures), match functions, and fixture metadata extracted from accuracy.py
- benchmarks/matrix.py: multi-model orchestrator with resume support, incremental saves, and markdown report generation
- benchmarks/accuracy.py: refactored to import from fixtures.py, added per-question error handling and 600s timeout
Add synthetic test fixtures:
- tests/fixtures/aws_ec2_instances.json: 20 EC2 instances (33K tokens, 87% reduction) with deterministic generator script
- tests/fixtures/db_query_results.json: 150 SQL order rows (26K tokens, 57% reduction) with deterministic generator script
Benchmark results across 5 models (qwen3:1.7b/4b, llama3.1:8b, qwen3:14b/30b) show TOON matches or beats JSON accuracy on Kubernetes fixtures (100% TOON on both K8s fixtures for all models 4b+) while achieving 57-87% token reduction.
Document EC2 Tags condensing gap in docs/ec2-tags-fix.md — nested tag arrays are silently dropped during sub-table rendering, making 5 of 15 EC2 questions impossible to answer from TOON output.
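Resume support with incremental saves, as mentioned for the matrix orchestrator above, could look roughly like the sketch below. It is not taken from benchmarks/matrix.py: the function names, the `model::question` key scheme, and the JSON results file are assumptions for illustration.

```python
import json
import os


def load_results(path):
    """Return previously saved results, or an empty dict on a first run."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_results(results, path):
    """Write to a temp file, then rename, so an interrupted run keeps a valid file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2)
    os.replace(tmp_path, path)


def run_matrix(models, questions, ask, results_path):
    """Ask every model every question, skipping pairs already on disk."""
    results = load_results(results_path)
    for model in models:
        for question in questions:
            key = f"{model}::{question}"
            if key in results:
                continue  # resume: answered in an earlier run
            results[key] = ask(model, question)
            save_results(results, results_path)  # incremental save
    return results
```

Saving after each question trades a little I/O for the ability to stop and restart a long multi-model run without repeating finished work.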
- Pivot Key-Value arrays (AWS Tags) into scalar columns (e0c14fd)
Detect [{Key, Value}] arrays (AWS tag convention) and pivot them into scalar columns on the parent row (e.g. Tags.Name, Tags.Environment) instead of extracting them as cross-referenced sub-tables.
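The pivot described above can be sketched as follows. The function names and the detection rule (every item is a dict with exactly the keys `Key` and `Value`) are assumptions for this example, and the real condenser may handle edge cases differently.

```python
def is_key_value_array(value):
    """True for a non-empty list whose items are all {"Key": ..., "Value": ...} dicts."""
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(isinstance(item, dict) and set(item) == {"Key", "Value"} for item in value)
    )


def pivot_key_value_arrays(row):
    """Replace each Key/Value array with one scalar column per key on the parent row."""
    out = {}
    for name, value in row.items():
        if is_key_value_array(value):
            for item in value:
                out[f"{name}.{item['Key']}"] = item["Value"]
        else:
            out[name] = value
    return out


row = {
    "InstanceId": "i-123",
    "Tags": [
        {"Key": "Name", "Value": "web-1"},
        {"Key": "Environment", "Value": "prod"},
    ],
}
pivoted = pivot_key_value_arrays(row)
# pivoted == {"InstanceId": "i-123", "Tags.Name": "web-1", "Tags.Environment": "prod"}
```

Keeping the tags on the parent row means a question such as "which instance is tagged Environment=prod" can be answered from a single table rather than by joining against a sub-table.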
Refactoring
- Split benchmark reports into separate JSON and TOON tables (eb0cbc4)
The combined JSON/TOON accuracy cells were hard to scan. Split into two independent tables and move context window enablement under a "Local Models" heading since frontier models don't have those limits.
Detailed Changes: v0.5.1...v0.6.0