perf(wasm): Web Worker + no ASYNCIFY — maximum inference speed#28
Merged
Conversation
User feedback: "quantcpp command not found" + "garbage text from 135M"

1. Added `quantcpp` CLI entry point (pyproject.toml `[project.scripts]`)
   - `quantcpp "question"` — one-shot
   - `quantcpp` — interactive chat
   - `quantcpp --model path.gguf` — custom model
2. Default model changed from SmolLM2-135M to Llama-3.2-1B
   - 135M produces garbage text — terrible first impression
   - 1B is 750MB (bigger download) but actually useful output
   - SmolLM2-135M still available for bandwidth-constrained users
3. README Quick Start now shows `quantcpp` CLI first, Python second

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
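The `[project.scripts]` entry point mentioned above can be sketched as a pyproject.toml fragment. Note this is an assumption-laden sketch: only the entry-point name `quantcpp` comes from the commit; the module path `quantcpp.cli` and the function name `main` are hypothetical.

```toml
# Sketch only: module path (quantcpp.cli) and function (main) are
# assumptions; the `quantcpp` command name comes from the commit.
[project.scripts]
quantcpp = "quantcpp.cli:main"
```

With this in place, `pip install` generates a `quantcpp` console script that invokes the named function.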
Replace ASYNCIFY-based streaming with a dedicated Web Worker. Inference runs entirely in the worker thread; tokens stream to the main thread via postMessage(). The main thread never blocks.

Changes:
- inference-worker.js: new Web Worker that loads WASM + runs quant_generate() in a blocking loop, posting each token
- quant_wasm.c: simplified — removed ASYNCIFY, sleep, async variants. Single sync callback posts tokens via EM_JS
- build.sh: removed -sASYNCIFY and ASYNCIFY_IMPORTS. Added -mrelaxed-simd for FMA. Fixed 1GB memory (no growth penalty with pthreads). ALLOW_MEMORY_GROWTH=0
- index.html: generate() sends to worker, receives tokens via onmessage handler. Model loading via transferable ArrayBuffer

Performance impact:
- ASYNCIFY removal: ~30-50% less overhead (no stack unwind/rewind)
- Fixed memory: eliminates the pthreads + memory-growth penalty
- Relaxed SIMD: FMA instructions where available
- Binary: 384K → 256K (-33%)

Combined with pthreads (PR #27) and SIMD128 (PR #25): expected total speedup 8-15x vs the original single-thread build.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
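The worker/main-thread token protocol can be sketched in plain JavaScript. This is a minimal simulation, not the PR's actual inference-worker.js: the `{type: 'token'}` / `{type: 'done'}` message shapes and function names are illustrative assumptions. In the real worker, `post` would be `postMessage` and the token loop would be driven by the WASM `quant_generate()` callback.

```javascript
// Worker side (sketch): a blocking generate loop that posts each token.
// `tokens` stands in for output of quant_generate(); `post` stands in
// for the worker's postMessage.
function runInference(tokens, post) {
  for (const text of tokens) {
    post({ type: 'token', text }); // one message per generated token
  }
  post({ type: 'done' }); // signal end of generation
}

// Main-thread side (sketch): an onmessage handler that accumulates
// streamed tokens and fires a callback when generation finishes.
function makeTokenCollector(onDone) {
  let out = '';
  return (msg) => {
    if (msg.type === 'token') out += msg.text;
    else if (msg.type === 'done') onDone(out);
  };
}
```

Because the loop blocks only inside the worker, the main thread stays free to paint each token as its message arrives.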
Force-pushed from c8c6a05 to d8c7e3b.
Summary
Replace ASYNCIFY-based streaming with a Web Worker architecture for maximum WASM inference speed.
Architecture change
ASYNCIFY adds ~30-50% overhead by saving/restoring the entire call stack at every yield point. With a Web Worker, the inference loop blocks freely in the worker while the main thread handles DOM updates via message events — zero overhead.
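Model loading uses a transferable ArrayBuffer, so the weights are moved rather than copied into the worker. A hedged sketch of that hand-off: the helper name and message shape are assumptions, and the Worker construction itself is browser-only.

```javascript
// Build the postMessage arguments for handing the model to the worker.
// Listing the buffer in the transfer list moves ownership (zero-copy)
// instead of structured-cloning hundreds of MB of weights.
function buildLoadMessage(modelBuffer) {
  return {
    message: { type: 'load', model: modelBuffer },
    transfer: [modelBuffer],
  };
}

// Browser usage (illustrative, not run here):
//   const worker = new Worker('inference-worker.js');
//   const { message, transfer } = buildLoadMessage(buffer);
//   worker.postMessage(message, transfer);
//   // After this call, `buffer` is detached on the main thread.
```
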
All optimizations combined
Expected total: 8-15x over the original single-thread baseline of 0.9 tok/s → ~7-14 tok/s
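The throughput range follows directly from the multipliers; a quick check of the arithmetic:

```javascript
// 8-15x speedup applied to the 0.9 tok/s single-thread baseline.
const baseline = 0.9;       // tok/s, original build
const low = baseline * 8;   // lower bound of the expected range
const high = baseline * 15; // upper bound of the expected range
console.log(low.toFixed(1), high.toFixed(1)); // 7.2 13.5, i.e. ~7-14 tok/s
```
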
Binary size
384K → 256K (-33%), from dropping the ASYNCIFY instrumentation.
Test plan
🤖 Generated with Claude Code