Summary
Add Flare LLM as a third inference engine option alongside MLC WebLLM and Transformers.js. Flare is a pure Rust → WASM inference engine with WebGPU acceleration that loads standard GGUF files directly (no TVM compilation step).
Why
| | MLC WebLLM | Transformers.js | Flare |
|---|---|---|---|
| Model format | TVM artifacts | ONNX | Standard GGUF |
| Compilation | Needs TVM compile | Needs ONNX export | Direct HuggingFace GGUF |
| Language | C++ (Emscripten) | C++ (ONNX Runtime) | Pure Rust → WASM |
| Progressive loading | No | No | Yes |
| LoRA hot-swap | No | No | Yes |
| BitNet ternary | No | No | Yes |
| Speculative decoding | No | No | Yes |
| WASM binary size | ~15 MB | ~10 MB | ~5 MB (est.) |
Key advantage: users can grab any GGUF model from HuggingFace and use it immediately — no conversion pipeline.
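As a sketch of what "grab any GGUF and use it" could look like in practice: HuggingFace serves repo files over its standard `resolve` endpoint, so the raw GGUF bytes are one `fetch` away. The helper names and repo/file arguments below are illustrative, not part of any existing API.

```typescript
// Build the standard HuggingFace "resolve" URL for a file in a model repo.
function ggufUrl(repo: string, file: string): string {
  return `https://huggingface.co/${repo}/resolve/main/${file}`;
}

// Fetch the raw GGUF bytes -- no conversion step in between.
// The resulting Uint8Array would be handed straight to the engine's loader.
async function fetchGguf(repo: string, file: string): Promise<Uint8Array> {
  const res = await fetch(ggufUrl(repo, file));
  if (!res.ok) throw new Error(`GGUF download failed: ${res.status}`);
  return new Uint8Array(await res.arrayBuffer());
}
```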
Integration plan
- Publish `@aspect/flare` npm package via wasm-pack
- Create a `FlareEngine` adapter implementing the BrowserAI engine interface
- Map the BrowserAI API to the Flare WASM API:
  - `loadModel()` → `FlareEngine.load()` + `init_gpu()`
  - `generateText()` → `begin_stream()` + `next_token()` loop
  - `transcribeAudio()` → N/A (Flare doesn't do STT yet)
- Add GGUF models to the model registry
- Allow engine selection: `new BrowserAI({ engine: 'flare' })`
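A minimal sketch of the adapter shape, assuming the Flare WASM exports named in the mapping above (`load`, `init_gpu`, `begin_stream`, `next_token`); their exact signatures are guesses, and the `FlareWasm` interface exists only for illustration:

```typescript
// Assumed shape of the wasm-pack bindings exported by @aspect/flare.
// Every signature here is a hypothesis based on the API mapping above.
interface FlareWasm {
  load(modelBytes: Uint8Array): void;   // parse GGUF, build weight tensors
  init_gpu(): Promise<void>;            // acquire a WebGPU device/queue
  begin_stream(prompt: string): void;   // tokenize prompt + run prefill
  next_token(): string | null;          // next decoded token; null = done
}

class FlareEngine {
  constructor(private wasm: FlareWasm) {}

  // BrowserAI's loadModel() maps to load() + init_gpu().
  async loadModel(modelBytes: Uint8Array): Promise<void> {
    this.wasm.load(modelBytes);
    await this.wasm.init_gpu();
  }

  // BrowserAI's generateText() maps to a begin_stream()/next_token() loop,
  // optionally streaming tokens to a callback as they arrive.
  async generateText(
    prompt: string,
    onToken?: (tok: string) => void,
  ): Promise<string> {
    this.wasm.begin_stream(prompt);
    let out = "";
    for (let tok = this.wasm.next_token(); tok !== null; tok = this.wasm.next_token()) {
      out += tok;
      onToken?.(tok);
    }
    return out;
  }
}
```

The pull-based `next_token()` loop keeps backpressure trivial: the JS side decides when to ask for the next token, so streaming to the UI needs no extra queueing.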
Depends on
- #[flare-npm] Publish @aspect/flare npm package
- #[flare-adapter] FlareEngine adapter implementation
- #[flare-models] Add GGUF models to registry