v0.2.0b1 — SIKE validation & MoE decoder

@mbachaud mbachaud released this 10 Apr 06:52
· 475 commits to master since this release

Highlights

This release establishes scale-invariant retrieval across model sizes from 0.6B to 8B parameters, validated by the SIKE benchmark. Retrieval is no longer the bottleneck — it's consistent at 10/10 across all tested models.

SIKE Benchmark Results (q4_0 KV cache)

| Model | Retrieval | Accuracy | Notes |
|---|---|---|---|
| qwen3:0.6b | 10/10 | 2/10 | Parameter floor: retrieval works, model can't use it |
| qwen3:1.7b | 10/10 | 3/10 | |
| qwen3:4b | 10/10 | 9/10 | Sweet spot: 2.5 GB VRAM |
| gemma4:e4b | 10/10 | 9/10 | MoE decoder enabled |
| qwen3:8b | 10/10 | 9/10 | |

MoE-aware decoder

  • Front-loads KV answer slate in first 200 tokens for SWA (sliding-window attention) models
  • Relevance-first gene ordering for MoE/small models (vs sequence_index for dense)
  • Automatic activation via MOE_MODEL_FAMILIES = ("gemma4",)
  • gemma4:e4b jumped from 5/10 → 9/10 accuracy with slate enabled
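
As a rough illustration of the ordering switch and slate placement described above, here is a minimal sketch. Only MOE_MODEL_FAMILIES and the sequence_index ordering key come from these notes; the Gene record, its relevance field, and the helper names is_moe_model, order_genes, and build_context are hypothetical, and the sketch does not enforce the 200-token budget.

```python
from dataclasses import dataclass

MOE_MODEL_FAMILIES = ("gemma4",)  # value taken from the release notes


@dataclass
class Gene:
    # Hypothetical record; the field names are assumptions for illustration.
    text: str
    sequence_index: int
    relevance: float


def is_moe_model(model: str) -> bool:
    """True when the family part of a tag like 'gemma4:e4b' is a listed MoE family."""
    family = model.split(":", 1)[0].lower()
    return family in MOE_MODEL_FAMILIES


def order_genes(genes: list[Gene], relevance_first: bool) -> list[Gene]:
    """Most relevant first for MoE/small models, document order for dense models."""
    if relevance_first:
        return sorted(genes, key=lambda g: g.relevance, reverse=True)
    return sorted(genes, key=lambda g: g.sequence_index)


def build_context(slate: str, genes: list[Gene], relevance_first: bool) -> str:
    """Place the answer slate ahead of the gene text so it lands at the very start."""
    body = "\n".join(g.text for g in order_genes(genes, relevance_first))
    return f"{slate}\n{body}" if slate else body
```

The point of putting the slate first is that a sliding-window model sees the candidate answers before any long context pushes them out of its window.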

Per-request model detection

  • Server reads body["model"] and adapts expression strategy per request
  • _should_use_slate() gates on downstream model name + param count
  • SMALL_MODEL_THRESHOLD_B = 3.2 — excludes qwen3:4b which works without slate
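
The gating logic might look roughly like the sketch below. The two constants are taken from these notes; everything else is an assumption, including the tag format ("family:size"), the param_count_b parser, and the rule that an unparseable size such as "e4b" falls back to the family check. It is a stand-in for _should_use_slate(), not its actual body.

```python
import re

MOE_MODEL_FAMILIES = ("gemma4",)   # from the release notes
SMALL_MODEL_THRESHOLD_B = 3.2      # from the release notes

# Matches size tags such as "0.6b" or "8b"; "e4b" deliberately does not match.
_PARAM_RE = re.compile(r"(\d+(?:\.\d+)?)b")


def param_count_b(model: str):
    """Parse billions of parameters from a tag like 'qwen3:1.7b'; None if absent."""
    tag = model.split(":", 1)[1].lower() if ":" in model else ""
    match = _PARAM_RE.fullmatch(tag)
    return float(match.group(1)) if match else None


def should_use_slate(body: dict) -> bool:
    """Hypothetical gate: slate for listed MoE families and sub-threshold models."""
    model = str(body.get("model", ""))
    family = model.split(":", 1)[0].lower()
    if family in MOE_MODEL_FAMILIES:
        return True
    size = param_count_b(model)
    return size is not None and size < SMALL_MODEL_THRESHOLD_B
```

With a threshold of 3.2, qwen3:1.7b gets the slate while qwen3:4b and qwen3:8b do not, which matches the behavior described above.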

Think-mode suppression for sub-3.2B models

  • Small models' reasoning loops consume the entire output budget without producing answers
  • Injects /no_think prefix and sets temperature=0 for Qwen3 sub-3.2B
  • q8_0 tested: worse than q4_0 (think mode gets more rope to hang itself)
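
A minimal sketch of the request rewrite, assuming an OpenAI-style chat body with a "messages" list and a top-level "temperature" field. That body shape, the choice of the last user message as the injection point, and the suppress_think helper are all assumptions; the notes only state that a /no_think prefix is injected and temperature is set to 0 for Qwen3 models under the threshold.

```python
import copy

SMALL_MODEL_THRESHOLD_B = 3.2  # from the release notes


def suppress_think(body: dict, size_b: float) -> dict:
    """Return a patched copy for small Qwen3 models; other requests pass through."""
    model = str(body.get("model", "")).lower()
    if not model.startswith("qwen3") or size_b >= SMALL_MODEL_THRESHOLD_B:
        return body  # unchanged, same object
    patched = copy.deepcopy(body)  # never mutate the caller's request
    for message in reversed(patched.get("messages", [])):
        if message.get("role") == "user":
            # Prefix the most recent user turn with the soft switch.
            message["content"] = "/no_think " + message["content"]
            break
    patched["temperature"] = 0
    return patched
```

Working on a deep copy keeps the original request intact, which matters if the server logs or retries it.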

Storage & operations

  • New Genome.vacuum() method + /admin/vacuum endpoint (752 MB → 523 MB, -30.4%)
  • Clear documentation distinguishing checkpoint / refresh / compact / vacuum operations
  • README refresh with badges, TOC, glossary, sample output
  • Test corpus composition breakdown with public/private repo split
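
These notes do not say what storage engine backs the genome. Assuming it is a SQLite file, the kind of space reclamation reported for vacuum can be sketched as follows. The vacuum_file and demo helpers, the genes table, and the file name are hypothetical and are not the Genome.vacuum() implementation.

```python
import os
import sqlite3
import tempfile


def vacuum_file(path: str) -> tuple[int, int]:
    """Run VACUUM on a SQLite file and return (bytes_before, bytes_after)."""
    before = os.path.getsize(path)
    conn = sqlite3.connect(path)
    try:
        conn.execute("VACUUM")  # rebuilds the file, dropping free pages
    finally:
        conn.close()
    return before, os.path.getsize(path)


def demo() -> tuple[int, int]:
    """Fill a throwaway database, delete half the rows, then vacuum it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE genes (id INTEGER PRIMARY KEY, payload BLOB)")
        conn.executemany(
            "INSERT INTO genes (payload) VALUES (?)",
            [(b"x" * 4096,) for _ in range(500)],
        )
        conn.commit()
        conn.execute("DELETE FROM genes WHERE id % 2 = 0")
        conn.commit()
        conn.close()
        return vacuum_file(path)


if __name__ == "__main__":
    before, after = demo()
    print(f"{before} bytes -> {after} bytes")
```

Deleting rows alone leaves the freed pages inside the file, so the size only drops once the file is rebuilt.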

Cumulative changes since v0.1.0b2

  • MoE-aware decoder with answer slate + relevance-first ordering
  • SIKE benchmark validation across 5 model scales
  • Per-request downstream model detection
  • Think suppression for sub-3.2B models
  • Genome.vacuum() + storage optimizations
  • README overhaul + SIKE benchmark docs

All 179 tests passing.

🤖 Generated with Claude Code