perf(wasm): SIMD128 + O3 + LTO for 2-4x faster browser inference #25
Merged
Conversation
- `-msimd128`: 128-bit WASM SIMD auto-vectorization (5,116 SIMD ops in the output). All modern browsers support it (Chrome 91+, Firefox 89+, Safari 16.4+).
- `-O3 -flto`: aggressive optimization plus link-time inlining.
- Yield every 4 tokens instead of every token: ~75% less ASYNCIFY stack unwind/rewind overhead while keeping the UI responsive.

Binary: 244K → 320K (+31%, from SIMD instruction encoding). Expected: 2-4x faster matmul/attention inference in the browser.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
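The kind of inner loop `-msimd128` targets can be sketched as below: an f32 dot product of the sort that dominates matmul/attention. This is an illustrative sketch, not the repo's actual kernel; the intrinsic path is guarded so the same file also compiles natively (where only the scalar loop runs).

```c
#include <stddef.h>

#ifdef __wasm_simd128__
#include <wasm_simd128.h>  /* WASM SIMD intrinsics, enabled by -msimd128 */
#endif

/* Dot product of two f32 vectors. With -msimd128 the intrinsic path
 * processes 4 floats per instruction; the scalar loop handles the tail
 * (and serves as the full fallback on non-WASM builds). */
float dot_f32(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    size_t i = 0;
#ifdef __wasm_simd128__
    v128_t acc = wasm_f32x4_splat(0.0f);
    for (; i + 4 <= n; i += 4) {
        acc = wasm_f32x4_add(acc,
              wasm_f32x4_mul(wasm_v128_load(a + i), wasm_v128_load(b + i)));
    }
    sum = wasm_f32x4_extract_lane(acc, 0) + wasm_f32x4_extract_lane(acc, 1)
        + wasm_f32x4_extract_lane(acc, 2) + wasm_f32x4_extract_lane(acc, 3);
#endif
    for (; i < n; i++)  /* scalar tail / native fallback */
        sum += a[i] * b[i];
    return sum;
}
```

With `-O3`, clang will often auto-vectorize the plain scalar loop into the same `f32x4` operations without any intrinsics, which is where the bulk of the claimed speedup comes from.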
unamedkr added a commit that referenced this pull request on Apr 10, 2026
Enable WASM pthreads so inference uses multiple CPU cores in the
browser. Three changes:
1. coi-serviceworker.js: injects Cross-Origin-Opener-Policy and
Cross-Origin-Embedder-Policy headers into all responses via
Service Worker. This enables SharedArrayBuffer on GitHub Pages
and other static hosts that don't support custom HTTP headers.
Well-established pattern (used by FFmpeg.wasm, SQL.js, etc.).
2. build.sh: add -pthread, PTHREAD_POOL_SIZE=4, ENVIRONMENT=web,worker.
WASM binary now includes multi-threaded libc and pthread support.
3. quant_wasm.c: detect navigator.hardwareConcurrency (capped at 4)
and pass to quant_config.n_threads. Model load message shows
thread count ("Model loaded! Ready to chat. (4 threads)").
Expected speedup: 3-4x on multi-core devices (most modern laptops).
Combined with SIMD128 from PR #25: total 6-12x vs original build.
Binary: 320K → 384K (pthread runtime overhead).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
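The thread-count detection in point 3 can be sketched as follows. On the Emscripten side, `emscripten_num_logical_cores()` reflects `navigator.hardwareConcurrency` in the browser; the clamping helper and the `pick_n_threads` wrapper are hypothetical names for illustration, guarded so the sketch also compiles natively.

```c
#ifdef __EMSCRIPTEN__
#include <emscripten/threading.h>  /* emscripten_num_logical_cores() */
#endif

/* Cap the detected core count at 4 (matching PTHREAD_POOL_SIZE=4 from
 * build.sh) and never report less than 1 if detection fails. */
int clamp_thread_count(int detected) {
    if (detected < 1) return 1;
    return detected > 4 ? 4 : detected;
}

/* Pick the value to store in quant_config.n_threads. In the browser,
 * emscripten_num_logical_cores() mirrors navigator.hardwareConcurrency. */
int pick_n_threads(void) {
#ifdef __EMSCRIPTEN__
    return clamp_thread_count(emscripten_num_logical_cores());
#else
    return 1;  /* single-threaded fallback for this native sketch */
#endif
}
```

Capping at the pool size matters because with a fixed `PTHREAD_POOL_SIZE`, asking for more workers than the pool holds would stall waiting for threads that are never spawned.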
unamedkr added a commit that referenced this pull request on Apr 10, 2026
Replace ASYNCIFY-based streaming with a dedicated Web Worker. Inference runs entirely in the worker thread; tokens stream to the main thread via postMessage(). The main thread never blocks.

Changes:
- inference-worker.js: new Web Worker that loads the WASM module and runs quant_generate() in a blocking loop, posting each token
- quant_wasm.c: simplified; removed ASYNCIFY, sleep, and async variants. A single sync callback posts tokens via EM_JS
- build.sh: removed -sASYNCIFY and ASYNCIFY_IMPORTS. Added -mrelaxed-simd for FMA. Fixed 1GB memory (no growth penalty with pthreads): ALLOW_MEMORY_GROWTH=0
- index.html: generate() sends to the worker and receives tokens via an onmessage handler. Model loading via transferable ArrayBuffer

Performance impact:
- ASYNCIFY removal: ~30-50% less overhead (no stack unwind/rewind)
- Fixed memory: eliminates the pthreads+growth penalty
- Relaxed SIMD: FMA instructions where available
- Binary: 384K → 256K (-33%)

Combined with pthreads (PR #27) and SIMD128 (PR #25): expected total speedup 8-15x vs the original single-thread build.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
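The "single sync callback posts tokens via EM_JS" change can be sketched like this. Running inside a Web Worker, `postMessage()` delivers each token straight to the page's `onmessage` handler with no ASYNCIFY unwind/rewind around the blocking generate loop. The callback name and counter are hypothetical, and a native fallback is guarded in so the sketch compiles outside Emscripten.

```c
#include <stdio.h>

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
/* EM_JS embeds a JS function in the WASM module. Inside a worker,
 * postMessage() goes directly to the main thread's onmessage handler. */
EM_JS(void, post_token, (const char *tok), {
    postMessage({ type: 'token', text: UTF8ToString(tok) });
});
#else
/* Native fallback for this sketch: print the token to stdout. */
static void post_token(const char *tok) { fputs(tok, stdout); }
#endif

/* Hypothetical synchronous per-token callback handed to quant_generate().
 * It may block freely because the whole loop runs in the worker thread. */
static int tokens_posted = 0;

void on_token(const char *tok) {
    post_token(tok);
    tokens_posted++;
}
```

Because nothing here suspends the C stack, the `-sASYNCIFY` instrumentation (and its code-size and runtime cost) can be dropped entirely, which is what produces the 384K → 256K shrink.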
unamedkr added a commit that referenced this pull request on Apr 10, 2026
* fix: quantcpp CLI command + default to Llama-3.2-1B (user feedback)

User feedback: "quantcpp command not found" + "garbage text from 135M"

1. Added `quantcpp` CLI entry point (pyproject.toml [project.scripts])
   - `quantcpp "question"`: one-shot
   - `quantcpp`: interactive chat
   - `quantcpp --model path.gguf`: custom model
2. Default model changed from SmolLM2-135M to Llama-3.2-1B
   - 135M produces garbage text, a terrible first impression
   - 1B is 750MB (a bigger download) but gives actually useful output
   - SmolLM2-135M remains available for bandwidth-constrained users
3. README Quick Start now shows the `quantcpp` CLI first, Python second

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* perf(wasm): Web Worker architecture: eliminate ASYNCIFY for max speed

Replace ASYNCIFY-based streaming with a dedicated Web Worker. Inference runs entirely in the worker thread; tokens stream to the main thread via postMessage(). The main thread never blocks.

Changes:
- inference-worker.js: new Web Worker that loads the WASM module and runs quant_generate() in a blocking loop, posting each token
- quant_wasm.c: simplified; removed ASYNCIFY, sleep, and async variants. A single sync callback posts tokens via EM_JS
- build.sh: removed -sASYNCIFY and ASYNCIFY_IMPORTS. Added -mrelaxed-simd for FMA. Fixed 1GB memory (no growth penalty with pthreads): ALLOW_MEMORY_GROWTH=0
- index.html: generate() sends to the worker and receives tokens via an onmessage handler. Model loading via transferable ArrayBuffer

Performance impact:
- ASYNCIFY removal: ~30-50% less overhead (no stack unwind/rewind)
- Fixed memory: eliminates the pthreads+growth penalty
- Relaxed SIMD: FMA instructions where available
- Binary: 384K → 256K (-33%)

Combined with pthreads (PR #27) and SIMD128 (PR #25): expected total speedup 8-15x vs the original single-thread build.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Three build optimizations for the WASM demo, addressing slow inference speed:
Changes
- `-msimd128`: enable 128-bit WASM SIMD auto-vectorization
- `-O3 -flto`: aggressive optimization with link-time inlining
- Yield every 4 tokens (`% 4 == 0`) instead of every token, cutting ASYNCIFY overhead

Compatibility
WASM SIMD is supported by all modern browsers: Chrome 91+, Firefox 89+, Safari 16.4+ (covers ~96% of users as of 2026).
Binary size
244K → 320K (+31%, from SIMD instruction encoding). Acceptable tradeoff for 2-4x speed.
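The token-yield cadence from the changes above can be sketched as a simple modulus check; the helper names and the token-index parameter are illustrative, and the `emscripten_sleep()` call (which requires `-sASYNCIFY` at link time) is guarded so the sketch also compiles natively.

```c
/* Yield control to the browser event loop only on every 4th token.
 * Each yield costs an ASYNCIFY stack unwind + rewind, so yielding a
 * quarter as often removes ~75% of that overhead while the UI still
 * repaints several times per second at typical token rates. */
int should_yield(int token_index) {
    return token_index % 4 == 0;
}

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
static void maybe_yield(int token_index) {
    if (should_yield(token_index))
        emscripten_sleep(0);  /* suspends to the event loop; needs -sASYNCIFY */
}
#endif
```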
Test plan
🤖 Generated with Claude Code