
feat(wasm): Qwen3/Llama model selector + real-time streaming#20

Merged
unamedkr merged 1 commit into main from feat/wasm-model-selector-streaming
Apr 10, 2026
Conversation

@unamedkr
Collaborator

Summary

Two improvements to the WASM browser demo:

1. Model selector — replace SmolLM2-135M with better models

The 135M model produced near-garbage output. Now users choose between:

| Model | Size | Quality | Use case |
| --- | --- | --- | --- |
| Qwen3 0.6B Q4_K_M | ~378 MB | Good for demo | Recommended default |
| Llama 3.2 1B Q4_K_M | ~770 MB | Better reasoning | "Higher quality" option |
  • Card-based UI with "Recommended" / "Higher quality" badges
  • Per-model chat templates (ChatML vs Llama 3 format)
  • Independent IndexedDB cache per model — switching doesn't evict
  • Auto-detects cached models on page load ("instant load" badge)
  • Custom GGUF drag-and-drop still works with generic ChatML template
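The per-model template switch can be sketched as plain string builders. This is a minimal sketch: the names `TEMPLATES`, `MODELS`, and `buildPrompt` are hypothetical, not the demo's actual identifiers, though the ChatML and Llama 3 token formats are standard.

```javascript
// Hypothetical sketch of per-model prompt formatting (the real template
// strings live in wasm/index.html; names here are illustrative).
const TEMPLATES = {
  // ChatML, used by Qwen models (and as the generic fallback).
  chatml: (sys, user) =>
    `<|im_start|>system\n${sys}<|im_end|>\n` +
    `<|im_start|>user\n${user}<|im_end|>\n` +
    `<|im_start|>assistant\n`,
  // Llama 3 instruct format, with header/eot special tokens.
  llama3: (sys, user) =>
    `<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n${sys}<|eot_id|>` +
    `<|start_header_id|>user<|end_header_id|>\n\n${user}<|eot_id|>` +
    `<|start_header_id|>assistant<|end_header_id|>\n\n`,
};

const MODELS = {
  "qwen3-0.6b": { template: "chatml" },
  "llama-3.2-1b": { template: "llama3" },
};

function buildPrompt(modelId, system, user) {
  // Unknown models (e.g. custom GGUF drops) fall back to generic ChatML.
  const entry = MODELS[modelId] || { template: "chatml" };
  return TEMPLATES[entry.template](system, user);
}
```

Keeping the template keyed by model id is what lets a dropped custom GGUF reuse the generic ChatML path without any extra configuration.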

2. Real-time token streaming

Previously the entire generation blocked the main thread — tokens appeared all at once after completion. Now:

  • wasm_generate_async() calls emscripten_sleep(0) after each token
  • Browser repaints between tokens — true streaming experience
  • Live tok/s counter updates during generation
  • Requires -sASYNCIFY build flag (added to build.sh)
  • JS generate() tries _wasm_generate_async first, falls back to sync
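The async-first dispatch might look like the following. This is a sketch assuming Emscripten's `ccall` glue (whose `{ async: true }` option returns a Promise under ASYNCIFY); the export names mirror the PR, but the wrapper itself is illustrative, not the demo's actual code.

```javascript
// Hypothetical wrapper: prefer the ASYNCIFY export, fall back to the
// blocking one so non-ASYNCIFY builds of the module still work.
async function generate(Module, prompt, maxTokens) {
  if (typeof Module._wasm_generate_async === "function") {
    // ASYNCIFY build: the C side calls emscripten_sleep(0) after each
    // token, yielding to the event loop so the DOM can repaint.
    return await Module.ccall(
      "wasm_generate_async", "string",
      ["string", "number"], [prompt, maxTokens],
      { async: true });
  }
  // Non-ASYNCIFY build: blocks until generation completes,
  // so tokens appear all at once.
  return Module.ccall(
    "wasm_generate", "string",
    ["string", "number"], [prompt, maxTokens]);
}
```

Feature-detecting the export (rather than a build flag) keeps one HTML file compatible with both build variants.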

Also

  • Added Qwen3-0.6B to Python _MODEL_REGISTRY
  • build.sh: added -sASYNCIFY, ASYNCIFY_IMPORTS, ASYNCIFY_STACK_SIZE=65536
  • Exported _wasm_generate_async in EXPORTED_FUNCTIONS
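Pieced together from the flags listed above, the emcc invocation in build.sh presumably resembles the fragment below. The ASYNCIFY values come from the PR description; the surrounding options are illustrative, not copied from the actual script.

```shell
# Assumed shape of the build command (config fragment, not the real build.sh).
emcc quant_wasm.c -O2 \
  -sASYNCIFY \
  -sASYNCIFY_IMPORTS=emscripten_sleep \
  -sASYNCIFY_STACK_SIZE=65536 \
  -sEXPORTED_FUNCTIONS=_wasm_generate,_wasm_generate_async,_malloc,_free \
  -o quant_wasm.js
```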

Files changed

  • wasm/index.html — model selector UI + streaming JS
  • wasm/quant_wasm.c — wasm_generate_async() with emscripten_sleep(0)
  • wasm/build.sh — ASYNCIFY flags
  • bindings/python/quantcpp/__init__.py — Qwen3-0.6B registry entry

Test plan

  • WASM rebuild with cd wasm && bash build.sh
  • Click "Qwen3 0.6B" — downloads, caches, loads, generates with streaming
  • Click "Llama 3.2 1B" — same flow, different model
  • Reload page — cached models show "instant load" badge
  • Drop custom GGUF — works with generic ChatML
  • Tokens appear one-by-one (not all at once after generation completes)
  • tok/s counter updates live during generation
  • Native build (cmake --build build) passes

🤖 Generated with Claude Code

feat(wasm): Qwen3/Llama model selector + real-time streaming

Replace the single SmolLM2-135M demo button with a two-card model
selector:

  - Qwen3 0.6B Q4_K_M (~378 MB) — recommended default. Much better
    quality than 135M, multilingual, reasonable download size.
  - Llama 3.2 1B Q4_K_M (~770 MB) — "higher quality" option for
    users willing to wait.

Each model has its own chat template (ChatML for Qwen, Llama 3
format for Llama) and IndexedDB cache key, so switching models
doesn't evict the other from cache.

Real-time streaming:
  - Add wasm_generate_async() in quant_wasm.c which calls
    emscripten_sleep(0) after each token, yielding control back
    to the browser event loop for DOM repaint.
  - Build with -sASYNCIFY + ASYNCIFY_IMPORTS=["emscripten_sleep"].
  - JS generate() now awaits _wasm_generate_async() with fallback
    to sync _wasm_generate() for non-ASYNCIFY builds.
  - Live tok/s counter updates during generation.

Also adds Qwen3-0.6B to the Python model registry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr unamedkr merged commit 10c49ff into main Apr 10, 2026
3 checks passed
@unamedkr unamedkr deleted the feat/wasm-model-selector-streaming branch April 10, 2026 05:04