Releases · ReaNAiveD/inferaived

First public release of inferaived, an LLM inference engine written in Rust on top of wgpu.

Status: early / experimental. APIs will change. Expect rough edges.

Highlights

GPU-resident inference on wgpu 29 — runs on Vulkan, Metal, DX12, and browser WebGPU; no CUDA or external runtime required.
Supported models: Qwen3.5 and MiniCPM5 (including MiniCPM5's parallel hybrid layer stack). Weights load directly from Hugging Face safetensors.
Custom WGSL kernels: matmul, RoPE, RMSNorm, masked attention, mamba scan, delta rule, and samplers.
GPU KV cache with a continuous decode loop and argmax sampler.
Runnable examples: generate, chat_qwen35, chat_minicpm5, parallel_minicpm5, bench_decode.
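For readers new to the terms above, here is a minimal CPU-side sketch of what an RMSNorm kernel and an argmax sampler compute, plus the shape of a greedy decode loop. It is not the inferaived API and not the WGSL shipped in this release: the names `rms_norm`, `argmax`, and `greedy_decode` are hypothetical, the RMSNorm shown is the textbook form (a given model may parameterize the gain differently), and the `step` closure stands in for a model forward pass.

```rust
/// Textbook RMSNorm over one hidden vector (illustrative, not the crate's kernel):
/// y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len(), "x and weight must have the same length");
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * scale * w).collect()
}

/// Greedy (argmax) sampling: index of the largest logit.
/// The first index wins ties, NaN entries are skipped, and an empty
/// (or all-NaN) slice yields None.
fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &l) in logits.iter().enumerate() {
        if l.is_nan() {
            continue;
        }
        match best {
            Some((_, b)) if l <= b => {} // keep the current best
            _ => best = Some((i, l)),
        }
    }
    best.map(|(i, _)| i)
}

/// Greedy decode loop. `step` is a hypothetical stand-in for a forward pass
/// that returns next-token logits. It receives the full token history here;
/// an engine with a KV cache would feed only the newest token per step.
fn greedy_decode<F>(mut step: F, prompt: &[u32], max_new: usize, eos: u32) -> Vec<u32>
where
    F: FnMut(&[u32]) -> Vec<f32>,
{
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new {
        let logits = step(&tokens);
        let Some(next) = argmax(&logits) else { break };
        let next = next as u32;
        tokens.push(next);
        if next == eos {
            break; // stop once the end-of-sequence token is produced
        }
    }
    tokens
}
```

Passing the forward pass in as a closure keeps the sketch independent of any particular model or backend; the loop only needs a function from tokens to logits.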

To use the crate, add it to Cargo.toml:

[dependencies]
inferaived = "0.1"

Or run:

cargo add inferaived

License: MIT