
feat: GPU ternary decode + wire GPU into --student infer #70

Merged
shift merged 1 commit into main from feat/gpu-student-infer on Apr 23, 2026
Conversation


shift commented Apr 23, 2026

GPU-accelerated student inference

New: forward_decode_gpu()

Single-token decode using GPU ternary matmul with CPU KV cache:

  • Q/K/V/O projections → gpu_ternary_matmul WGSL kernel
  • MoE expert gate/up/down → GPU ternary matmul
  • Router, attention scores, PLE, KV cache → CPU (small matrices)
  • LM head → CPU (too large for iGPU 256MB buffer)
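The reduction the gpu_ternary_matmul WGSL kernel performs can be sketched on the CPU. This is a minimal illustration, not the repo's actual kernel: the i8 weight encoding in {-1, 0, +1} and the row-major layout are assumptions.

```rust
// CPU sketch of a ternary matrix-vector product (one decode step's
// Q/K/V/O projection). Ternary weights reduce each multiply-accumulate
// to an add, a subtract, or a skip -- no floating-point multiplies.
// Encoding and layout are illustrative assumptions.
fn ternary_matvec(weights: &[i8], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; rows];
    for r in 0..rows {
        let mut acc = 0.0f32;
        for c in 0..cols {
            match weights[r * cols + c] {
                1 => acc += x[c],  // +1 weight: add activation
                -1 => acc -= x[c], // -1 weight: subtract activation
                _ => {}            // 0 weight: skip entirely
            }
        }
        out[r] = acc;
    }
    out
}

fn main() {
    // 2x3 weight matrix (row-major) times a length-3 activation vector.
    let w: Vec<i8> = vec![1, -1, 0, 0, 1, 1];
    let x = vec![2.0, 3.0, 4.0];
    let y = ternary_matvec(&w, &x, 2, 3);
    println!("{:?}", y); // row 0: 2 - 3 = -1, row 1: 3 + 4 = 7
}
```

The multiply-free structure is what makes the kernel attractive on an iGPU: the per-element work is a branch and an add, and the packed weights stay small enough to stream through limited storage buffers.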

Wire GPU into cmd_infer --student

  • Initialize GpuMatmulAccelerator in student inference path
  • Create DispatchPlan from device profile
  • generate_student() accepts GPU + dispatch params
  • Decode uses forward_decode_gpu when GPU available
  • Full CPU fallback for RPi and other edge devices
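The wiring above amounts to a decode-path selection at startup. A hypothetical sketch of that selection follows; the `GpuAccel` and `DispatchPlan` shapes here are illustrative stand-ins, not the repo's actual `GpuMatmulAccelerator` or `DispatchPlan` types, and the 256 MB threshold is an assumed profile value.

```rust
// Illustrative decode-path selection: GPU decode only when an accelerator
// initialized and the device profile has buffer headroom; otherwise the
// CPU fallback path that edge devices (e.g. an RPi) always take.
struct GpuAccel; // stand-in for GpuMatmulAccelerator

struct DispatchPlan {
    use_gpu_decode: bool,
}

fn plan_from_profile(gpu: Option<&GpuAccel>, buffer_budget_mb: usize) -> DispatchPlan {
    DispatchPlan {
        use_gpu_decode: gpu.is_some() && buffer_budget_mb >= 256,
    }
}

fn decode_step(plan: &DispatchPlan) -> &'static str {
    if plan.use_gpu_decode {
        "forward_decode_gpu" // GPU ternary matmul path
    } else {
        "forward_decode_cpu" // full CPU fallback
    }
}

fn main() {
    println!("{}", decode_step(&plan_from_profile(Some(&GpuAccel), 256)));
    println!("{}", decode_step(&plan_from_profile(None, 0)));
}
```

Keeping the choice in a plan object (rather than branching ad hoc per layer) is what lets `generate_student()` take GPU + dispatch params and stay oblivious to which backend actually runs.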

Remaining GPU work

  • Prefill GPU path (currently CPU-only to populate KV cache)
  • Upload weights to VRAM once (resident mode)
  • Fix distillation double-forward (subtask 6)
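The prefill item exists because decode depends on a populated cache: each prompt token's key/value vectors must be appended before the first GPU decode step can attend over them. A single-head, attention-free sketch (illustrative only, not the repo's cache type):

```rust
// Minimal KV cache sketch: CPU prefill appends one K/V pair per prompt
// token; GPU decode then attends over every cached position. Until the
// prefill GPU path lands, this population step runs entirely on the CPU.
struct KvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

impl KvCache {
    fn new() -> Self {
        KvCache { keys: Vec::new(), values: Vec::new() }
    }

    fn append(&mut self, k: Vec<f32>, v: Vec<f32>) {
        self.keys.push(k);
        self.values.push(v);
    }

    fn len(&self) -> usize {
        self.keys.len()
    }
}

fn prefill(cache: &mut KvCache, prompt_len: usize) {
    for t in 0..prompt_len {
        // Placeholder projections; real prefill computes K/V per layer.
        cache.append(vec![t as f32; 4], vec![t as f32; 4]);
    }
}

fn main() {
    let mut cache = KvCache::new();
    prefill(&mut cache, 3);
    // Decode of token 4 now has 3 cached positions to attend over.
    println!("cached positions: {}", cache.len());
}
```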

This gives GPU-accelerated decode on capable hardware while
maintaining full CPU fallback for RPi and other edge devices.
shift merged commit eff89a8 into main on Apr 23, 2026
4 checks passed
shift deleted the feat/gpu-student-infer branch on April 23, 2026 02:15
