Add Flare as a third inference engine backend #293

@sauravpanda

Summary

Add Flare LLM as a third inference engine option alongside MLC WebLLM and Transformers.js. Flare is a pure Rust → WASM inference engine with WebGPU acceleration that loads standard GGUF files directly (no TVM compilation step).

Why

                      MLC WebLLM         Transformers.js    Flare
Model format          TVM artifacts      ONNX               Standard GGUF
Compilation           Needs TVM compile  Needs ONNX export  None (direct HuggingFace GGUF)
Language              C++ (Emscripten)   C++ (ONNX RT)      Pure Rust → WASM
Progressive loading   No                 No                 Yes
LoRA hot-swap         No                 No                 Yes
BitNet ternary        No                 No                 Yes
Speculative decoding  No                 No                 Yes
WASM binary size      ~15 MB             ~10 MB             ~5 MB (est.)

Key advantage: users can grab any GGUF model from HuggingFace and use it immediately — no conversion pipeline.
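For illustration, a minimal sketch of what that looks like end to end. The `engine: 'flare'` option and the `loadModel()`/`generateText()` names come from this issue; the import path and the example model id are assumptions:

```ts
// Sketch only: '@browserai/browserai' and the model id below are
// illustrative assumptions, not a confirmed API surface.
import { BrowserAI } from '@browserai/browserai';

const ai = new BrowserAI({ engine: 'flare' }); // engine option proposed in this issue

// Any GGUF straight from HuggingFace, no conversion pipeline:
await ai.loadModel('TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF');

const reply = await ai.generateText('Explain WebGPU in one sentence.');
console.log(reply);
```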

Integration plan

  1. Publish @aspect/flare npm package via wasm-pack
  2. Create FlareEngine adapter implementing the BrowserAI engine interface
  3. Map the BrowserAI API to the Flare WASM API (see the adapter sketch after this list):
    • loadModel() → FlareEngine.load() + init_gpu()
    • generateText() → begin_stream() + next_token() loop
    • transcribeAudio() → N/A (Flare doesn't do STT yet)
  4. Add GGUF models to model registry
  5. Allow engine selection: new BrowserAI({ engine: 'flare' })
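A rough sketch of the adapter from steps 2–3, assuming @aspect/flare is a wasm-pack/wasm-bindgen package exposing the load(), init_gpu(), begin_stream(), and next_token() calls named above. The engine-interface shape here is illustrative, not BrowserAI's actual interface definition:

```ts
// Sketch only: assumes @aspect/flare exports a default init() for the WASM
// module plus a FlareEngine class with the calls named in this issue.
import init, { FlareEngine as FlareWasm } from '@aspect/flare';

export class FlareEngineAdapter {
  private engine: FlareWasm | null = null;

  // loadModel() → FlareEngine.load() + init_gpu()
  async loadModel(modelUrl: string): Promise<void> {
    await init();                                 // instantiate the WASM module
    this.engine = await FlareWasm.load(modelUrl); // fetch + parse the GGUF
    await this.engine.init_gpu();                 // set up WebGPU pipelines
  }

  // generateText() → begin_stream() + next_token() loop
  async generateText(prompt: string, onToken?: (tok: string) => void): Promise<string> {
    if (!this.engine) throw new Error('Call loadModel() first');
    this.engine.begin_stream(prompt);
    let text = '';
    for (;;) {
      const tok = this.engine.next_token(); // assumed to return null at end-of-stream
      if (tok == null) break;
      text += tok;
      onToken?.(tok);
    }
    return text;
  }

  // transcribeAudio() → N/A: Flare has no STT, so the adapter rejects it
  async transcribeAudio(_audio: Blob): Promise<string> {
    throw new Error('transcribeAudio is not supported by the Flare engine');
  }
}
```

Keeping the token loop on the adapter side would let BrowserAI surface streaming callbacks without any change at the WASM boundary.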

Depends on

  • #[flare-npm] Publish @aspect/flare npm package
  • #[flare-adapter] FlareEngine adapter implementation
  • #[flare-models] Add GGUF models to registry
