feat: Ternary student model inference path by shift · Pull Request #60 · shift/FerrisRes

shift · 2026-04-22T07:18:31Z

Ternary Student Inference Path

Summary

Adds --student and --moe-experts flags to the infer command for end-to-end inference through the FerrisRes student model (ternary base + optional MoE).

Usage

# Dense student (no MoE) — ternary base weights only
ferrisres infer --model-path model.safetensors --config e2b \
  --tokenizer tokenizer.json --student \
  --prompt "Hello, world"

# MoE student (2 experts) — ternary experts
ferrisres infer --model-path model.safetensors --config e2b \
  --tokenizer tokenizer.json --student --moe-experts 2 \
  --prompt "Hello, world"

Pipeline

Load Gemma 4 weights → gemma4_to_block_attnres() → ternary CpuBlockAttnResModel
Optional: dense_ffn_to_moe() → MoE expansion (2 or 4 experts)
generate_student() → autoregressive generation with per-step logging
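The PR does not say how the conversion step chooses between -1, 0 and +1. As a rough illustration only, here is a minimal sketch of one common scheme, absmean quantization with a single shared scale. The names `quantize_ternary` and `dequantize_ternary` are hypothetical and are not taken from FerrisRes; the actual `gemma4_to_block_attnres()` may work differently.

```rust
/// Quantize a weight slice to ternary codes {-1, 0, +1} plus one shared scale.
/// Assumption: absmean scaling; this is NOT confirmed to be FerrisRes's scheme.
fn quantize_ternary(weights: &[f32]) -> (Vec<i8>, f32) {
    if weights.is_empty() {
        return (Vec::new(), 0.0);
    }
    // Scale = mean absolute value of the weights.
    let scale = weights.iter().map(|w| w.abs()).sum::<f32>() / weights.len() as f32;
    if scale == 0.0 {
        // All-zero tensor: every code is 0.
        return (vec![0; weights.len()], 0.0);
    }
    // Divide by the scale, round to the nearest integer, clamp into [-1, 1].
    let codes: Vec<i8> = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (codes, scale)
}

/// Reconstruct approximate weights from ternary codes and the shared scale.
fn dequantize_ternary(codes: &[i8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}
```

With this scheme, small weights collapse to 0 and large ones saturate at ±1, so only the codes and one `f32` scale per tensor need to be kept.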

Memory

All base weights are ternary {-1, 0, +1} from creation:

Dense student: ~1.5 GB ternary (vs ~10 GB BF16 teacher)
2-expert MoE: ~2.1 GB ternary (vs ~17 GB FP32)
4-expert MoE: ~3.7 GB ternary (vs ~29 GB FP32)
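A back-of-envelope check on the dense figure, under assumptions the PR does not confirm: if the ~10 GB teacher is entirely BF16 (2 bytes per weight), that is roughly 5 billion weights, and storing each at 2 bits would take about 1.25 GB. The gap to ~1.5 GB could be scales or tensors kept at higher precision, but that is a guess. The sketch below shows the arithmetic and one hypothetical 2-bit packing layout; `packed_bytes` and `pack_ternary` are illustrative names, not FerrisRes functions.

```rust
/// Bytes needed to store `n` weights at `bits` bits each, rounded up.
fn packed_bytes(n: u64, bits: u64) -> u64 {
    (n * bits + 7) / 8
}

/// Pack ternary codes four per byte, 2 bits each.
/// Hypothetical encoding: 00 = 0, 01 = +1, 10 = -1 (lowest bits hold the first code).
fn pack_ternary(codes: &[i8]) -> Vec<u8> {
    let mut out = vec![0u8; (codes.len() + 3) / 4];
    for (i, &c) in codes.iter().enumerate() {
        let bits: u8 = match c {
            0 => 0b00,
            1 => 0b01,
            -1 => 0b10,
            _ => panic!("not a ternary code: {}", c),
        };
        // Each byte holds codes i, i+1, i+2, i+3 at bit offsets 0, 2, 4, 6.
        out[i / 4] |= bits << ((i % 4) * 2);
    }
    out
}
```

For example, `packed_bytes(5_000_000_000, 16)` gives 10,000,000,000 bytes and `packed_bytes(5_000_000_000, 2)` gives 1,250,000,000 bytes, an 8x reduction before any per-tensor overhead.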

New code

generate_student(): autoregressive generator for CpuBlockAttnResModel
Student inference path in cmd_infer() with memory diagnostics
--student / --moe-experts CLI flags
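For readers unfamiliar with the shape of such a generator, here is a minimal sketch of an autoregressive loop that re-runs a full forward pass each step and logs every token. Everything in it is an assumption for illustration: the `NextTokenLogits` trait, the greedy (argmax) token choice, the EOS handling and the toy `Counter` model are hypothetical and do not reflect the real `generate_student()` signature or sampling strategy.

```rust
/// Hypothetical stand-in for a model: maps a token sequence to next-token logits.
trait NextTokenLogits {
    fn forward(&self, tokens: &[u32]) -> Vec<f32>;
}

/// Index of the largest logit (greedy decoding); returns 0 for an empty slice.
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Greedy autoregressive generation: one full forward pass over the whole
/// sequence per step (no KV cache), logging each step to stderr.
fn generate_greedy<M: NextTokenLogits>(
    model: &M,
    prompt: &[u32],
    max_new_tokens: usize,
    eos: u32,
) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    for step in 0..max_new_tokens {
        let logits = model.forward(&tokens);
        let next = argmax(&logits);
        eprintln!("step {}: token {}", step, next);
        tokens.push(next);
        if next == eos {
            break;
        }
    }
    tokens
}

/// Toy model for demonstration: always predicts (last token + 1) mod vocab.
struct Counter {
    vocab: usize,
}

impl NextTokenLogits for Counter {
    fn forward(&self, tokens: &[u32]) -> Vec<f32> {
        let mut logits = vec![0.0; self.vocab];
        let last = *tokens.last().unwrap_or(&0) as usize;
        logits[(last + 1) % self.vocab] = 1.0;
        logits
    }
}
```

Re-running the full forward pass makes each step cost grow with sequence length, which is simple to implement and verify but slower than a cached decoder.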

Depends on #59 (ternary-everywhere refactor).


New --student and --moe-experts flags on the infer command. Loads the model → converts it to CpuBlockAttnResModel (ternary base) → optionally expands it to MoE → runs autoregressive generation. generate_student() runs a full forward pass per step with logging. A memory diagnostic reports the ternary footprint on load. [3da81652]

…ontention (#63)

rayon par_iter over 35 layers saturated memory bandwidth on a 32 GB system holding a 10 GB mmap plus the 4 GB student model. Sequential conversion is faster because a single thread has exclusive access to that bandwidth.

Conversion: 770 s (rayon) → expected ~470 s (sequential, matching PR #60)
Prefill: 123 s (rayon build) → expected ~42 s (matching PR #60)

Keep rayon for ternary_matmul_parallel() only (small matrices, no contention).
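To make the ternary matmul mentioned above concrete, here is a minimal sketch of a ternary matrix-vector product, where every weight is -1, 0 or +1 so the inner loop needs only additions and subtractions. This is an illustration under assumptions, not the FerrisRes kernel: the function names are hypothetical, the weights are unpacked `i8` codes with one shared scale, and the parallel variant uses `std::thread::scope` in place of rayon so the example has no external dependencies.

```rust
use std::thread;

/// Dot product of one ternary row with `x`, using only add and subtract.
fn row_dot(row: &[i8], x: &[f32]) -> f32 {
    let mut acc = 0.0;
    for (&c, &v) in row.iter().zip(x) {
        match c {
            1 => acc += v,
            -1 => acc -= v,
            _ => {} // zero weight contributes nothing
        }
    }
    acc
}

/// Sequential y = scale * (W x), with W row-major (rows x cols), entries in {-1, 0, +1}.
fn ternary_matvec(w: &[i8], rows: usize, cols: usize, scale: f32, x: &[f32]) -> Vec<f32> {
    assert!(cols > 0);
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    w.chunks(cols).map(|row| scale * row_dot(row, x)).collect()
}

/// Same product, split into contiguous row blocks, one scoped thread per block.
fn ternary_matvec_parallel(
    w: &[i8],
    rows: usize,
    cols: usize,
    scale: f32,
    x: &[f32],
    threads: usize,
) -> Vec<f32> {
    assert!(cols > 0);
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    let threads = threads.max(1);
    // Rows per block, rounded up, and at least 1 so chunk sizes are non-zero.
    let block = ((rows + threads - 1) / threads).max(1);
    let mut y = vec![0.0f32; rows];
    thread::scope(|s| {
        for (w_block, y_block) in w.chunks(block * cols).zip(y.chunks_mut(block)) {
            s.spawn(move || {
                for (row, out) in w_block.chunks(cols).zip(y_block.iter_mut()) {
                    *out = scale * row_dot(row, x);
                }
            });
        }
    });
    y
}
```

Each thread writes to a disjoint slice of the output, so no locking is needed; whether parallelism pays off still depends on matrix size and memory bandwidth, as the commit message describes.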

shift force-pushed the feat/ternary-inference-path branch from 9f8c141 to 25f4438 on April 22, 2026 07:19

shift merged commit b4e9fe7 into main Apr 22, 2026
4 checks passed

shift deleted the feat/ternary-inference-path branch April 22, 2026 07:23

shift mentioned this pull request Apr 22, 2026

fix: Revert rayon from weight conversion (63% slower) #63

Merged

shift merged 1 commit into main from feat/ternary-inference-path