perf(wasm): Web Worker + no ASYNCIFY — maximum inference speed#28
Merged
Conversation
User feedback: "quantcpp command not found" + "garbage text from 135M"

1. Added `quantcpp` CLI entry point (pyproject.toml `[project.scripts]`)
   - `quantcpp "question"` — one-shot
   - `quantcpp` — interactive chat
   - `quantcpp --model path.gguf` — custom model
2. Default model changed from SmolLM2-135M to Llama-3.2-1B
   - 135M produces garbage text — terrible first impression
   - 1B is 750MB (bigger download) but actually useful output
   - SmolLM2-135M still available for bandwidth-constrained users
3. README Quick Start now shows `quantcpp` CLI first, Python second

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
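The `[project.scripts]` entry point mentioned above can be sketched as a pyproject.toml fragment. Note this is an assumption-laden sketch: only the entry-point name `quantcpp` comes from the commit; the module path `quantcpp.cli` and the function name `main` are hypothetical.

```toml
# Sketch only: module path (quantcpp.cli) and function (main) are
# assumptions; the `quantcpp` command name comes from the commit.
[project.scripts]
quantcpp = "quantcpp.cli:main"
```

With this in place, `pip install` generates a `quantcpp` console script that invokes the named function.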
Replace ASYNCIFY-based streaming with a dedicated Web Worker. Inference runs entirely in the worker thread; tokens stream to the main thread via postMessage(). The main thread never blocks.

Changes:
- inference-worker.js: new Web Worker that loads WASM + runs quant_generate() in a blocking loop, posting each token
- quant_wasm.c: simplified — removed ASYNCIFY, sleep, async variants. Single sync callback posts tokens via EM_JS
- build.sh: removed -sASYNCIFY and ASYNCIFY_IMPORTS. Added -mrelaxed-simd for FMA. Fixed 1GB memory (no growth penalty with pthreads). ALLOW_MEMORY_GROWTH=0
- index.html: generate() sends to worker, receives tokens via onmessage handler. Model loading via transferable ArrayBuffer

Performance impact:
- ASYNCIFY removal: ~30-50% less overhead (no stack unwind/rewind)
- Fixed memory: eliminates the pthreads + memory-growth penalty
- Relaxed SIMD: FMA instructions where available
- Binary: 384K → 256K (-33%)

Combined with pthreads (PR #27) and SIMD128 (PR #25): expected total speedup 8-15x vs the original single-thread build.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
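The worker/main-thread token protocol can be sketched in plain JavaScript. This is a minimal simulation, not the PR's actual inference-worker.js: the `{type: 'token'}` / `{type: 'done'}` message shapes and function names are illustrative assumptions. In the real worker, `post` would be `postMessage` and the token loop would be driven by the WASM `quant_generate()` callback.

```javascript
// Worker side (sketch): a blocking generate loop that posts each token.
// `tokens` stands in for output of quant_generate(); `post` stands in
// for the worker's postMessage.
function runInference(tokens, post) {
  for (const text of tokens) {
    post({ type: 'token', text }); // one message per generated token
  }
  post({ type: 'done' }); // signal end of generation
}

// Main-thread side (sketch): an onmessage handler that accumulates
// streamed tokens and fires a callback when generation finishes.
function makeTokenCollector(onDone) {
  let out = '';
  return (msg) => {
    if (msg.type === 'token') out += msg.text;
    else if (msg.type === 'done') onDone(out);
  };
}
```

Because the loop blocks only inside the worker, the main thread stays free to paint each token as its message arrives.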
Force-pushed from c8c6a05 to d8c7e3b.
Summary
Replace ASYNCIFY-based streaming with a Web Worker architecture for maximum WASM inference speed.
Architecture change
ASYNCIFY adds ~30-50% overhead by saving/restoring the entire call stack at every yield point. With a Web Worker, the inference loop blocks freely in the worker while the main thread handles DOM updates via message events — zero overhead.
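Model loading uses a transferable ArrayBuffer, so the weights are moved rather than copied into the worker. A hedged sketch of that hand-off: the helper name and message shape are assumptions, and the Worker construction itself is browser-only.

```javascript
// Build the postMessage arguments for handing the model to the worker.
// Listing the buffer in the transfer list moves ownership (zero-copy)
// instead of structured-cloning hundreds of MB of weights.
function buildLoadMessage(modelBuffer) {
  return {
    message: { type: 'load', model: modelBuffer },
    transfer: [modelBuffer],
  };
}

// Browser usage (illustrative, not run here):
//   const worker = new Worker('inference-worker.js');
//   const { message, transfer } = buildLoadMessage(buffer);
//   worker.postMessage(message, transfer);
//   // After this call, `buffer` is detached on the main thread.
```
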
All optimizations combined
Expected total: 8-15x over the original single-thread baseline of 0.9 tok/s → ~7-14 tok/s
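The throughput range follows directly from the multipliers; a quick check of the arithmetic:

```javascript
// 8-15x speedup applied to the 0.9 tok/s single-thread baseline.
const baseline = 0.9;       // tok/s, original build
const low = baseline * 8;   // lower bound of the expected range
const high = baseline * 15; // upper bound of the expected range
console.log(low.toFixed(1), high.toFixed(1)); // 7.2 13.5, i.e. ~7-14 tok/s
```
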
Binary size
384K → 256K (-33%), from dropping the ASYNCIFY instrumentation.
Test plan
🤖 Generated with Claude Code