
feat: Ternary student model inference path #60

Merged
shift merged 1 commit into main from feat/ternary-inference-path on Apr 22, 2026
@shift shift commented Apr 22, 2026

Ternary Student Inference Path

Summary

Adds --student and --moe-experts flags to the infer command for end-to-end inference through the FerrisRes student model (ternary base + optional MoE).

Usage

# Dense student (no MoE) — ternary base weights only
ferrisres infer --model-path model.safetensors --config e2b \
  --tokenizer tokenizer.json --student \
  --prompt "Hello, world"

# MoE student (2 experts) — ternary experts
ferrisres infer --model-path model.safetensors --config e2b \
  --tokenizer tokenizer.json --student --moe-experts 2 \
  --prompt "Hello, world"

Pipeline

  1. Load Gemma 4 weights → gemma4_to_block_attnres() → ternary CpuBlockAttnResModel
  2. Optional: dense_ffn_to_moe() → MoE expansion (2 or 4 experts)
  3. generate_student() → autoregressive generation with per-step logging
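
The generation step can be pictured as a loop that reruns the whole model on the growing token sequence. The sketch below is illustrative only: the `NextTokenLogits` trait, `generate_greedy`, and the `Counter` toy model are hypothetical names, not the FerrisRes API, and greedy argmax decoding is an assumption, since the PR does not say how `generate_student()` picks the next token.

```rust
/// Hypothetical stand-in for the student model: maps the token sequence so far
/// to next-token logits. The real CpuBlockAttnResModel API is not shown in this PR.
pub trait NextTokenLogits {
    fn forward(&self, tokens: &[u32]) -> Vec<f32>;
}

/// Index of the largest logit (greedy decoding); ties resolve to the lowest index.
pub fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Greedy autoregressive loop: one full forward pass per generated token,
/// with a log line per step. Stops at `max_new_tokens` or when `eos` is produced.
pub fn generate_greedy<M: NextTokenLogits>(
    model: &M,
    prompt: &[u32],
    max_new_tokens: usize,
    eos: u32,
) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    for step in 0..max_new_tokens {
        let logits = model.forward(&tokens);
        let next = argmax(&logits);
        eprintln!("step {step}: token {next}");
        tokens.push(next);
        if next == eos {
            break;
        }
    }
    tokens
}

/// Toy model for demonstration: always predicts (last token + 1) mod vocab.
pub struct Counter {
    pub vocab: usize,
}

impl NextTokenLogits for Counter {
    fn forward(&self, tokens: &[u32]) -> Vec<f32> {
        let mut logits = vec![0.0f32; self.vocab];
        let last = *tokens.last().unwrap_or(&0) as usize;
        logits[(last + 1) % self.vocab] = 1.0;
        logits
    }
}
```

Rerunning the full forward pass each step keeps the loop simple, at the cost of recomputing earlier positions on every token.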

Memory

All base weights are ternary {-1, 0, +1} from creation:

  • Dense student: ~1.5 GB ternary (vs ~10 GB BF16 teacher)
  • 2-expert MoE: ~2.1 GB ternary (vs ~17 GB FP32)
  • 4-expert MoE: ~3.7 GB ternary (vs ~29 GB FP32)
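
To show why ternary weights are compact, here is a minimal sketch of packing four {-1, 0, +1} values into one byte at 2 bits each. The PR does not describe the actual storage layout, so the bit encoding and the names `pack_ternary`, `unpack_ternary`, and `packed_bytes` are assumptions for illustration. The sketch also does not try to reproduce the figures above, which may include scales and non-ternary tensors.

```rust
/// Pack ternary weights {-1, 0, +1} four to a byte, 2 bits each.
/// Encoding (an assumption for this sketch): 0b00 = 0, 0b01 = +1, 0b10 = -1.
pub fn pack_ternary(weights: &[i8]) -> Vec<u8> {
    let mut out = vec![0u8; packed_bytes(weights.len())];
    for (i, &w) in weights.iter().enumerate() {
        let code: u8 = match w {
            0 => 0b00,
            1 => 0b01,
            -1 => 0b10,
            _ => panic!("weight {w} is not ternary"),
        };
        // Weight i lives in byte i / 4, at bit offset (i % 4) * 2.
        out[i / 4] |= code << ((i % 4) * 2);
    }
    out
}

/// Inverse of `pack_ternary`; `len` is the original number of weights.
pub fn unpack_ternary(packed: &[u8], len: usize) -> Vec<i8> {
    (0..len)
        .map(|i| match (packed[i / 4] >> ((i % 4) * 2)) & 0b11 {
            0b01 => 1,
            0b10 => -1,
            _ => 0,
        })
        .collect()
}

/// Bytes needed to store `n` ternary weights at 2 bits each (rounded up).
pub fn packed_bytes(n: usize) -> usize {
    (n + 3) / 4
}
```

Under this layout a weight costs 2 bits instead of the 16 bits of BF16 or the 32 bits of FP32.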

New code

  • generate_student(): autoregressive generator for CpuBlockAttnResModel
  • Student inference path in cmd_infer() with memory diagnostics
  • --student / --moe-experts CLI flags

Depends on #59 (ternary-everywhere refactor).

[3da81652]

New --student and --moe-experts flags on infer command.
Loads model → converts to CpuBlockAttnResModel (ternary base) →
optional MoE expansion → autoregressive generation.
generate_student() function: full forward per step with logging.
Memory diagnostic reports ternary footprint on load.

@shift shift force-pushed the feat/ternary-inference-path branch from 9f8c141 to 25f4438 April 22, 2026 07:19
@shift shift merged commit b4e9fe7 into main Apr 22, 2026
4 checks passed
@shift shift deleted the feat/ternary-inference-path branch April 22, 2026 07:23
shift added a commit that referenced this pull request Apr 22, 2026
…ontention (#63)

rayon par_iter over 35 layers saturated memory bandwidth on a 32 GB system
holding a 10 GB mmap plus a 4 GB student model. Sequential conversion is faster
because a single thread gets exclusive access to that bandwidth.

Conversion: 770s (rayon) → expected ~470s (sequential, matching PR #60)
Prefill: 123s (rayon build) → expected ~42s (matching PR #60)

Keep rayon for ternary_matmul_parallel() only (small matrices, no contention).
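
For context on the kind of kernel that still benefits from row-level parallelism, here is a rough sketch of a ternary matrix-vector product that needs only additions and subtractions. It is not the project's `ternary_matmul_parallel()`: the function names are hypothetical, weights are kept as unpacked `i8` for clarity, and `std::thread::scope` stands in for rayon so the example needs no external crates.

```rust
use std::thread;

/// Dot product of one ternary row with x: add for +1, subtract for -1, skip 0.
fn row_dot(row: &[i8], x: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for (&wi, &xi) in row.iter().zip(x) {
        match wi {
            1 => acc += xi,
            -1 => acc -= xi,
            _ => {}
        }
    }
    acc
}

/// Sequential y = W x for a row-major ternary matrix W of shape rows x cols.
pub fn ternary_matvec(w: &[i8], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert!(cols > 0);
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    w.chunks_exact(cols).map(|row| row_dot(row, x)).collect()
}

/// Same product, with rows split into contiguous blocks, one scoped thread each.
pub fn ternary_matvec_parallel(
    w: &[i8],
    rows: usize,
    cols: usize,
    x: &[f32],
    threads: usize,
) -> Vec<f32> {
    assert!(cols > 0);
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    let threads = threads.max(1);
    // Rows per block, rounded up so every row is covered; at least 1.
    let rows_per = ((rows + threads - 1) / threads).max(1);
    let mut y = vec![0.0f32; rows];
    thread::scope(|s| {
        for (w_block, y_block) in w.chunks(rows_per * cols).zip(y.chunks_mut(rows_per)) {
            s.spawn(move || {
                for (row, out) in w_block.chunks_exact(cols).zip(y_block.iter_mut()) {
                    *out = row_dot(row, x);
                }
            });
        }
    });
    y
}
```

Each thread writes a disjoint slice of the output, so no locking is needed. Whether the split pays off depends on matrix size and available memory bandwidth, which is the trade-off the commit message describes.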