
feat: Ternary-everywhere refactor + STE shadow weight infrastructure#59

Merged
shift merged 3 commits into main from feat/ternary-everywhere (Apr 22, 2026)

Conversation

@shift (Owner) commented Apr 22, 2026

Ternary-Everywhere Refactor + STE Infrastructure

Summary

Two major code changes in this PR, plus supporting research notes:

1. Ternary-everywhere refactor (all base weights as {-1, 0, +1})

  • CpuLinear::from_weight() immediately quantizes FP32 → ternary
  • CpuMoELayer stores TernaryExpert for all expert gate/up/down projections
  • TernaryLinear::from_cpu_linear() uses raw ternary values (no FP32 round-trip)
  • TernaryMoELayer::from_cpu_moe() copies ternary directly
  • Backward pass dequantizes via .to_fp32() where needed
  • Memory reduction: 2-expert MoE ~2.1 GB ternary vs ~17 GB FP32
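As a reference for the quantization step above, here is a minimal sketch of FP32 → ternary conversion as `CpuLinear::from_weight()` might perform it. The exact quantization rule isn't shown in this PR, so this assumes BitNet-style absmean quantization; `quantize_ternary` and `dequantize` are hypothetical names, not the repo's API.

```rust
/// Hypothetical sketch of FP32 -> ternary quantization, assuming
/// BitNet-style absmean: scale α = mean(|w|), then round w/α into {-1, 0, +1}.
fn quantize_ternary(weights: &[f32]) -> (Vec<i8>, f32) {
    // Scale α is the mean absolute value of the weights.
    let alpha = weights.iter().map(|w| w.abs()).sum::<f32>() / weights.len() as f32;
    // Each weight maps to the nearest of {-1, 0, +1} after dividing by α.
    let ternary = weights
        .iter()
        .map(|w| (w / alpha).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (ternary, alpha)
}

/// The backward pass's `.to_fp32()` would do the inverse: w ≈ α * q.
fn dequantize(ternary: &[i8], alpha: f32) -> Vec<f32> {
    ternary.iter().map(|&q| q as f32 * alpha).collect()
}

fn main() {
    let w = [0.9_f32, -0.05, -1.1, 0.4];
    let (q, alpha) = quantize_ternary(&w);
    assert_eq!(q, vec![1, 0, -1, 1]); // small weights snap to 0, large ones to ±1
    println!("alpha = {alpha}, dequantized = {:?}", dequantize(&q, alpha));
}
```

Note that dequantization is lossy: all nonzero weights come back as ±α, which is why the quality-ceiling question (paper 2 below) matters.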

2. STE shadow weight infrastructure (BF16-ready)

  • ShadowPrecision trait: switchable BF16/FP32 for shadow weights
  • BitLinear<P>: ternary base + optional shadow, STE forward, stochastic rounding, boundary noise injection, running average α
  • BitMoELayer: per-block synchronized α across experts + PLE projections, ScaleSync enum (PerBlock vs PerExpert)
  • Added half crate with num-traits/serde/bytemuck features
  • 20 new tests (13 shadow_weights + 7 bit_moe)
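To make the STE forward/backward split concrete, here is a minimal single-weight sketch. `ShadowWeight` is a hypothetical stand-in, not the repo's `BitLinear<P>`/`ShadowPrecision` API: the forward pass always sees the ternary value, while gradients update the higher-precision shadow copy as if the quantizer were the identity.

```rust
/// Minimal straight-through-estimator sketch for one scalar weight
/// (hypothetical; the real `BitLinear<P>` is generic over shadow precision).
struct ShadowWeight {
    shadow: f32, // stand-in for the BF16/FP32 shadow value
    alpha: f32,  // running-average scale
}

impl ShadowWeight {
    /// Forward: quantize the shadow to ternary, then scale by α.
    fn forward(&self) -> f32 {
        let q = (self.shadow / self.alpha).round().clamp(-1.0, 1.0);
        q * self.alpha
    }

    /// Backward (STE): the quantizer's gradient is treated as identity,
    /// so the loss gradient flows straight through to the shadow weight.
    fn backward(&mut self, grad: f32, lr: f32) {
        self.shadow -= lr * grad;
    }
}

fn main() {
    let mut w = ShadowWeight { shadow: 0.3, alpha: 0.5 };
    assert_eq!(w.forward(), 0.5); // round(0.3 / 0.5) = 1, times α
    w.backward(2.0, 0.1);
    // The shadow moved even though the ternary value did not flip yet;
    // accumulated small updates can eventually push it across a boundary.
    assert!((w.shadow - 0.1).abs() < 1e-6);
}
```

Stochastic rounding and boundary noise injection (listed above) refine exactly this step: they randomize the `round()` so weights near a quantization boundary don't get stuck.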

3. Research papers

  • ste_ternary_training.md — STE viability analysis, shadow weight memory constraints
  • ternary_quality_ceiling.md — rank sweep experiment design, LoRA vs STE ceiling
  • expert_count_tradeoff.md — 4 big vs 128 small experts tradeoff analysis

Test Results

  • 1596 tests passing (20 new, 0 failures)
  • Clean build with --features vulkan
  • No new clippy warnings from this PR's own code (516 on main → 520 on branch; the delta comes from the half crate)

Architecture Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Scale α | Running average (not learnable) | Prevents feedback-loop death spiral |
| Scale sync | Per-block (all experts + PLE share α) | Prevents routing bias from scale differences |
| Shadow precision | BF16 | 50% memory savings, same dynamic range as FP32 |
| Gradient precision | FP32 | Accumulation stability |
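The "running average, not learnable" choice for α can be sketched as a block-wide EMA. This is a hypothetical illustration (`update_block_alpha` and the momentum value are assumptions, not the repo's code): a gradient-learned α can death-spiral because a shrinking α shrinks activations, which shrinks the gradient signal further, whereas an EMA of the mean |shadow weight| over all experts and PLE projections in a block just tracks the weights, and sharing one α per block keeps the router from favoring experts with larger scales.

```rust
/// Hypothetical running-average update for a block-wide scale α.
/// All experts (and PLE projections) in the block contribute to one α,
/// matching the PerBlock variant of the `ScaleSync` enum.
fn update_block_alpha(alpha: f32, expert_weights: &[&[f32]], momentum: f32) -> f32 {
    // Mean |w| across every weight in the block.
    let (sum, count) = expert_weights.iter().fold((0.0_f32, 0usize), |(s, n), w| {
        (s + w.iter().map(|x| x.abs()).sum::<f32>(), n + w.len())
    });
    let target = sum / count as f32;
    // EMA toward the current mean; no gradient flows into α.
    momentum * alpha + (1.0 - momentum) * target
}

fn main() {
    let expert_a: &[f32] = &[1.0, -1.0];
    let expert_b: &[f32] = &[0.5, -0.5];
    // Block mean |w| = (1 + 1 + 0.5 + 0.5) / 4 = 0.75.
    let next = update_block_alpha(1.0, &[expert_a, expert_b], 0.9);
    assert!((next - 0.975).abs() < 1e-6); // 0.9 * 1.0 + 0.1 * 0.75
}
```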

Memory Impact

| Component | Before | After |
| --- | --- | --- |
| Base weights (2-expert) | ~17 GB FP32 | ~2.1 GB ternary |
| Base weights (4-expert) | ~29 GB FP32 | ~3.7 GB ternary |
| LoRA + optimizer | ~200 MB | ~200 MB (unchanged) |
| STE shadow weights | N/A | ~8.5 GB BF16 (future, not wired yet) |
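A quick sanity check on the 2-expert figures. The parameter count and storage density below are back-solved assumptions, not stated in the PR: ~4.25B base weights from the 17 GB FP32 row at 4 bytes/weight, with ternary storage averaging roughly half a byte (~4 bits) per weight after packing and per-block scales.

```rust
/// Convert a parameter count to GB at a given storage density.
/// The densities used below are assumptions back-solved from the table.
fn gb_at(params: f64, bytes_per_weight: f64) -> f64 {
    params * bytes_per_weight / 1e9
}

fn main() {
    // ~4.25B base weights implied by the 17 GB FP32 row (4 bytes each).
    let params = 17.0e9 / 4.0;
    // ~0.5 byte/weight reproduces the ~2.1 GB ternary row.
    assert!((gb_at(params, 0.5) - 2.125).abs() < 0.05);
    // 2 bytes/weight (BF16) reproduces the ~8.5 GB shadow row exactly,
    // which is consistent with the same ~4.25B parameter count.
    assert!((gb_at(params, 2.0) - 8.5).abs() < 0.05);
    println!("implied base params ≈ {:.2}B", params / 1e9);
}
```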

Next Steps (post-merge)

  1. Wire ternary inference path (end-to-end test with quantized model)
  2. Rank sweep experiment (ranks 4/8/16/32/64) to find quality ceiling
  3. STE training pipeline (shadow weights → LoRA replacement)
  4. GPU kernel for block-wide α reduction in WGSL

[3da81652]

shift added 3 commits April 22, 2026 03:13
CpuLinear now quantizes to ternary on creation via from_weight().
CpuMoELayer stores TernaryExpert for all expert weights.
TernaryLinear::from_cpu_linear() uses raw ternary (no FP32 round-trip).
TernaryMoELayer::from_cpu_moe() copies ternary values directly.
Backward pass dequantizes via to_fp32() where needed.
Memory: 2-expert MoE ~2.1 GB ternary vs ~17 GB FP32.

ShadowPrecision trait: switchable BF16/FP32 for shadow weights.
BitLinear<P>: ternary base + optional shadow, STE forward path,
stochastic rounding, boundary noise injection, running average α.
BitMoELayer: per-block synchronized α across experts + PLE projections,
ScaleSync enum for PerBlock vs PerExpert modes.
Added half crate with num-traits/serde/bytemuck features.
20 new tests (13 shadow_weights + 7 bit_moe), all passing.

3 research papers + renderdoc added to flake for GPU kernel debugging.
STE viability: shadow weight memory constraint (BF16 saves 50%).
Quality ceiling: rank sweep needed to determine if LoRA can close gap.
Expert count: 4 big vs 128 small experts with same param budget.

@shift shift merged commit 874a2c9 into main Apr 22, 2026
4 checks passed
@shift shift deleted the feat/ternary-everywhere branch April 22, 2026 07:19