feat: GPU ternary decode + wire GPU into --student infer #70
Merged
Conversation
New: `forward_decode_gpu()` — GPU-accelerated decode with KV cache:
- Q/K/V/O projections via the `gpu_ternary_matmul` WGSL kernel
- MoE expert gate/up/down projections via GPU ternary matmul
- Router, attention scores, PLE, and KV cache stay on CPU
- LM head stays on CPU (too large for iGPU buffers)

Wire GPU into `cmd_infer --student` path:
- Initialize `GpuMatmulAccelerator` in the student inference path
- Create `DispatchPlan` from the device profile
- `generate_student` now accepts GPU + dispatch params
- Decode uses `forward_decode_gpu` when a GPU is available, with CPU fallback (see the sketch below)
- Prefill stays on CPU (it must populate the KV cache for decode)

This gives GPU-accelerated decode on capable hardware while maintaining a full CPU fallback for RPi and other edge devices.
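The per-token dispatch described above might look roughly like the sketch below. Only `forward_decode_gpu`, `generate_student`, `GpuMatmulAccelerator`, and `DispatchPlan` are names from this PR; `forward_decode_cpu`, `Model`, `KvCache`, and the exact signatures are illustrative assumptions.

```rust
// Sketch of the per-token decode dispatch inside generate_student():
// use the GPU path when an accelerator is available, otherwise fall back
// to the existing CPU decode so RPi and other edge devices keep working.
// Model, KvCache, forward_decode_cpu, and the signatures below are
// placeholders, not the actual code in this PR.
fn decode_step(
    model: &Model,
    token: u32,
    pos: usize,
    kv_cache: &mut KvCache,
    gpu: Option<&GpuMatmulAccelerator>,
    plan: &DispatchPlan,
) -> Vec<f32> {
    match gpu {
        // GPU path: Q/K/V/O and MoE expert matmuls run on the WGSL ternary
        // kernel; router, attention scores, PLE, KV cache, and the LM head
        // stay on the CPU.
        Some(accel) => model.forward_decode_gpu(token, pos, kv_cache, accel, plan),
        // Full CPU fallback, identical to the pre-GPU decode path.
        None => model.forward_decode_cpu(token, pos, kv_cache),
    }
}
```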
GPU-accelerated student inference
New:
- `forward_decode_gpu()` — single-token decode using GPU ternary matmul with a CPU KV cache:
  - `gpu_ternary_matmul` WGSL kernel

Wire GPU into `cmd_infer --student` (see the wiring sketch after this summary):
- `GpuMatmulAccelerator` in the student inference path
- `DispatchPlan` from the device profile
- `generate_student()` accepts GPU + dispatch params
- `forward_decode_gpu` when GPU available

Remaining GPU work
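The accelerator wiring summarized above (initialize `GpuMatmulAccelerator`, build a `DispatchPlan` from the device profile, pass both into `generate_student`) could look roughly like this sketch. The constructor and helper names (`GpuMatmulAccelerator::new`, `DispatchPlan::from_profile`, `load_student_model`, `detect_device_profile`) are assumptions for illustration, not the repo's actual API.

```rust
// Sketch of wiring the GPU into cmd_infer --student. Only the type and
// function names GpuMatmulAccelerator, DispatchPlan, and generate_student
// come from this PR; everything else here is a placeholder.
fn cmd_infer_student(prompt: &str) -> anyhow::Result<String> {
    let model = load_student_model()?;

    // Try to bring up the GPU accelerator; if that fails we run CPU-only,
    // preserving the full fallback for RPi and other edge devices.
    let gpu = GpuMatmulAccelerator::new().ok();

    // The dispatch plan is derived from the detected device profile and
    // decides which matmuls are sent to the GPU (e.g. within iGPU buffer limits).
    let plan = DispatchPlan::from_profile(&detect_device_profile());

    // Prefill stays on CPU (it must populate the KV cache); decode then uses
    // forward_decode_gpu whenever `gpu` is Some, otherwise the CPU path.
    let output = generate_student(&model, prompt, gpu.as_ref(), &plan)?;
    Ok(output)
}
```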