
feat(wasm): Qwen3/Llama model selector + real-time streaming#20

Merged
unamedkr merged 1 commit into main from feat/wasm-model-selector-streaming
Apr 10, 2026
Conversation

@unamedkr
Collaborator

Summary

Two improvements to the WASM browser demo:

1. Model selector — replace SmolLM2-135M with better models

The 135M model produced near-garbage output. Now users choose between:

| Model | Size | Quality | Use case |
| --- | --- | --- | --- |
| Qwen3 0.6B Q4_K_M | ~378 MB | Good for demo | Recommended default |
| Llama 3.2 1B Q4_K_M | ~770 MB | Better reasoning | "Higher quality" option |
  • Card-based UI with "Recommended" / "Higher quality" badges
  • Per-model chat templates (ChatML vs Llama 3 format)
  • Independent IndexedDB cache per model — switching doesn't evict
  • Auto-detects cached models on page load ("instant load" badge)
  • Custom GGUF drag-and-drop still works with generic ChatML template
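The per-model template switch can be sketched as plain string builders. This is a minimal sketch: the names `TEMPLATES`, `MODELS`, and `buildPrompt` are hypothetical, not the demo's actual identifiers, though the ChatML and Llama 3 token formats are standard.

```javascript
// Hypothetical sketch of per-model prompt formatting (the real template
// strings live in wasm/index.html; names here are illustrative).
const TEMPLATES = {
  // ChatML, used by Qwen models (and as the generic fallback).
  chatml: (sys, user) =>
    `<|im_start|>system\n${sys}<|im_end|>\n` +
    `<|im_start|>user\n${user}<|im_end|>\n` +
    `<|im_start|>assistant\n`,
  // Llama 3 instruct format, with header/eot special tokens.
  llama3: (sys, user) =>
    `<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n${sys}<|eot_id|>` +
    `<|start_header_id|>user<|end_header_id|>\n\n${user}<|eot_id|>` +
    `<|start_header_id|>assistant<|end_header_id|>\n\n`,
};

const MODELS = {
  "qwen3-0.6b": { template: "chatml" },
  "llama-3.2-1b": { template: "llama3" },
};

function buildPrompt(modelId, system, user) {
  // Unknown models (e.g. custom GGUF drops) fall back to generic ChatML.
  const entry = MODELS[modelId] || { template: "chatml" };
  return TEMPLATES[entry.template](system, user);
}
```

Keeping the template keyed by model id is what lets a dropped custom GGUF reuse the generic ChatML path without any extra configuration.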

2. Real-time token streaming

Previously the entire generation blocked the main thread — tokens appeared all at once after completion. Now:

  • wasm_generate_async() calls emscripten_sleep(0) after each token
  • Browser repaints between tokens — true streaming experience
  • Live tok/s counter updates during generation
  • Requires -sASYNCIFY build flag (added to build.sh)
  • JS generate() tries _wasm_generate_async first, falls back to sync
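The async-first dispatch might look like the following. This is a sketch assuming Emscripten's `ccall` glue (whose `{ async: true }` option returns a Promise under ASYNCIFY); the export names mirror the PR, but the wrapper itself is illustrative, not the demo's actual code.

```javascript
// Hypothetical wrapper: prefer the ASYNCIFY export, fall back to the
// blocking one so non-ASYNCIFY builds of the module still work.
async function generate(Module, prompt, maxTokens) {
  if (typeof Module._wasm_generate_async === "function") {
    // ASYNCIFY build: the C side calls emscripten_sleep(0) after each
    // token, yielding to the event loop so the DOM can repaint.
    return await Module.ccall(
      "wasm_generate_async", "string",
      ["string", "number"], [prompt, maxTokens],
      { async: true });
  }
  // Non-ASYNCIFY build: blocks until generation completes,
  // so tokens appear all at once.
  return Module.ccall(
    "wasm_generate", "string",
    ["string", "number"], [prompt, maxTokens]);
}
```

Feature-detecting the export (rather than a build flag) keeps one HTML file compatible with both build variants.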

Also

  • Added Qwen3-0.6B to Python _MODEL_REGISTRY
  • build.sh: added -sASYNCIFY, ASYNCIFY_IMPORTS, ASYNCIFY_STACK_SIZE=65536
  • Exported _wasm_generate_async in EXPORTED_FUNCTIONS
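Pieced together from the flags listed above, the emcc invocation in build.sh presumably resembles the fragment below. The ASYNCIFY values come from the PR description; the surrounding options are illustrative, not copied from the actual script.

```shell
# Assumed shape of the build command (config fragment, not the real build.sh).
emcc quant_wasm.c -O2 \
  -sASYNCIFY \
  -sASYNCIFY_IMPORTS=emscripten_sleep \
  -sASYNCIFY_STACK_SIZE=65536 \
  -sEXPORTED_FUNCTIONS=_wasm_generate,_wasm_generate_async,_malloc,_free \
  -o quant_wasm.js
```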

Files changed

  • wasm/index.html — model selector UI + streaming JS
  • wasm/quant_wasm.c — wasm_generate_async() with emscripten_sleep(0)
  • wasm/build.sh — ASYNCIFY flags
  • bindings/python/quantcpp/__init__.py — Qwen3-0.6B registry entry

Test plan

  • WASM rebuild with cd wasm && bash build.sh
  • Click "Qwen3 0.6B" — downloads, caches, loads, generates with streaming
  • Click "Llama 3.2 1B" — same flow, different model
  • Reload page — cached models show "instant load" badge
  • Drop custom GGUF — works with generic ChatML
  • Tokens appear one-by-one (not all at once after generation completes)
  • tok/s counter updates live during generation
  • Native build (cmake --build build) passes

🤖 Generated with Claude Code

feat(wasm): Qwen3/Llama model selector + real-time streaming

Replace the single SmolLM2-135M demo button with a two-card model
selector:

  - Qwen3 0.6B Q4_K_M (~378 MB) — recommended default. Much better
    quality than 135M, multilingual, reasonable download size.
  - Llama 3.2 1B Q4_K_M (~770 MB) — "higher quality" option for
    users willing to wait.

Each model has its own chat template (ChatML for Qwen, Llama 3
format for Llama) and IndexedDB cache key, so switching models
doesn't evict the other from cache.

Real-time streaming:
  - Add wasm_generate_async() in quant_wasm.c which calls
    emscripten_sleep(0) after each token, yielding control back
    to the browser event loop for DOM repaint.
  - Build with -sASYNCIFY + ASYNCIFY_IMPORTS=["emscripten_sleep"].
  - JS generate() now awaits _wasm_generate_async() with fallback
    to sync _wasm_generate() for non-ASYNCIFY builds.
  - Live tok/s counter updates during generation.

Also adds Qwen3-0.6B to the Python model registry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr unamedkr merged commit 10c49ff into main Apr 10, 2026
3 checks passed
@unamedkr unamedkr deleted the feat/wasm-model-selector-streaming branch April 10, 2026 05:04