feat(wasm): Qwen3/Llama model selector + real-time streaming #20
Merged
Replace the single SmolLM2-135M demo button with a two-card model
selector:
- Qwen3 0.6B Q4_K_M (~378 MB) — recommended default. Much better
quality than 135M, multilingual, reasonable download size.
- Llama 3.2 1B Q4_K_M (~770 MB) — "higher quality" option for
users willing to wait.
Each model has its own chat template (ChatML for Qwen, Llama 3
format for Llama) and IndexedDB cache key, so switching models
doesn't evict the other from cache.
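The per-model template and cache-key setup described above can be sketched as follows; the names `MODELS`, `cacheKey`, and `template` are illustrative, not the PR's actual identifiers:

```javascript
// Hypothetical per-model config: each entry carries its own IndexedDB
// cache key and chat template, so switching models never overwrites the
// other model's cached weights.
const MODELS = {
  qwen3: {
    label: "Qwen3 0.6B Q4_K_M (~378 MB)",
    cacheKey: "qwen3-0.6b-q4_k_m", // distinct key -> no cache eviction
    // ChatML template used by Qwen
    template: (user) =>
      `<|im_start|>user\n${user}<|im_end|>\n<|im_start|>assistant\n`,
  },
  llama32: {
    label: "Llama 3.2 1B Q4_K_M (~770 MB)",
    cacheKey: "llama-3.2-1b-q4_k_m",
    // Llama 3 instruct template
    template: (user) =>
      `<|start_header_id|>user<|end_header_id|>\n\n${user}<|eot_id|>` +
      `<|start_header_id|>assistant<|end_header_id|>\n\n`,
  },
};

// Example: build the prompt for the currently selected model.
const prompt = MODELS.qwen3.template("Hello");
```

Because the cache key is part of the config rather than a single shared constant, downloading one model leaves the other's IndexedDB entry untouched.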
Real-time streaming:
- Add wasm_generate_async() in quant_wasm.c which calls
emscripten_sleep(0) after each token, yielding control back
to the browser event loop for DOM repaint.
- Build with -sASYNCIFY + ASYNCIFY_IMPORTS=["emscripten_sleep"].
- JS generate() now awaits _wasm_generate_async() with fallback
to sync _wasm_generate() for non-ASYNCIFY builds.
- Live tok/s counter updates during generation.
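The JS side of this fallback could look roughly like the sketch below, assuming the Emscripten module exposes `_wasm_generate_async` only in ASYNCIFY builds; the `onProgress` callback and the exact signatures are illustrative:

```javascript
// Hypothetical generate(): prefer the ASYNCIFY-enabled async export so the
// browser can repaint between tokens; fall back to the blocking export when
// the module was built without -sASYNCIFY.
async function generate(Module, promptPtr, maxTokens, onProgress) {
  const start = performance.now();
  let result;
  if (typeof Module._wasm_generate_async === "function") {
    // ASYNCIFY build: emscripten_sleep(0) inside the C loop yields per
    // token, so the event loop keeps running while this await is pending.
    result = await Module._wasm_generate_async(promptPtr, maxTokens);
  } else {
    // Non-ASYNCIFY build: blocks the main thread until generation is done.
    result = Module._wasm_generate(promptPtr, maxTokens);
  }
  const seconds = (performance.now() - start) / 1000;
  // Final tok/s figure; the real UI updates a live counter per token.
  if (onProgress) onProgress(maxTokens / seconds);
  return result;
}
```

The `typeof` check means the same JS works against both build variants, which matches the PR's "falls back to sync" behavior.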
Also adds Qwen3-0.6B to the Python model registry.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Two improvements to the WASM browser demo:
1. Model selector — replace SmolLM2-135M with better models
The 135M model produced near-garbage output. Now users choose between:
- Qwen3 0.6B Q4_K_M (~378 MB) — the recommended default
- Llama 3.2 1B Q4_K_M (~770 MB) — the higher-quality option
2. Real-time token streaming
Previously the entire generation blocked the main thread — tokens appeared all at once after completion. Now:
- wasm_generate_async() calls emscripten_sleep(0) after each token
- -sASYNCIFY build flag (added to build.sh)
- generate() tries _wasm_generate_async first, falls back to sync

Also
- Qwen3-0.6B added to the Python _MODEL_REGISTRY
- build.sh: added -sASYNCIFY, ASYNCIFY_IMPORTS, ASYNCIFY_STACK_SIZE=65536
- _wasm_generate_async in EXPORTED_FUNCTIONS

Files changed
- wasm/index.html — model selector UI + streaming JS
- wasm/quant_wasm.c — wasm_generate_async() with emscripten_sleep(0)
- wasm/build.sh — ASYNCIFY flags
- bindings/python/quantcpp/__init__.py — Qwen3-0.6B registry entry

Test plan
- cd wasm && bash build.sh
- cmake --build build passes

🤖 Generated with Claude Code