# linalg/wasm: relaxed-simd FMA in MMM kernels (`madd_f32x4!` macro) #2199

czoli1976 wants to merge 2 commits into sonos:main

## Conversation
*Force-pushed `300406c` to `031eec0`.*
---
for 1.: can we "probe" for relaxed and basically implement detection ourselves?
Extends sonos#2195's relaxed-simd FMA pattern from sigmoid/tanh to the 6 WASM f32 MMM kernels via a cfg-gated `madd_f32x4!` macro. 70 substitutions across `wasm_f32_4x4` / `4x1` / `8x1` / `16x1` / `32x1` / `8x8` (35 in AddMatMul + 35 in AddRowColProducts).

Speedup vs the `+simd128` baseline (M1 Pro, wasmtime 44.0.0):
- microbench_8x8 GEMM: 1.40–1.55× across DFN3-style and large shapes
- E2E across 8 models: 1.08–1.46× (Inception / SqueezeNet / MobileNet / modnet / MiniLM / DFN3 erb_dec / df_dec)

Quality: L2 norms identical to 7 sig figs; per-element diff at the 7th–8th decimal (FMA single rounding, within `Approximation::Close`). Bytecode: 1044 `f32x4.relaxed_madd` ops in the relaxed test binary, 0 in baseline. Confirms sonos#2195's "LLVM does not auto-emit FMA" finding extends to MMM kernels.

Adds an `examples/wasm-model-bench/` harness for the E2E numbers. See `examples/wasm-model-bench/MMM_MACRO_ATTRIBUTION.pdf` for full data.
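The 7th–8th-decimal per-element diff is the expected single-rounding behaviour of a fused multiply-add. A minimal host-side sketch of the effect, with scalar `f32::mul_add` standing in for `f32x4.relaxed_madd` (operand values are illustrative, not taken from the kernels):

```rust
/// Returns (two-rounding result, single-rounding result) for one a*b + c,
/// the accumulation shape the macro replaces in the MMM inner loops.
fn fma_vs_mul_add() -> (f32, f32) {
    let a = 1.0_f32 + 2.0_f32.powi(-12); // exactly representable in f32
    let b = 1.0_f32 + 2.0_f32.powi(-13); // exactly representable in f32
    let c = -1.0_f32;
    let unfused = a * b + c;     // round(a*b), then round(+c): the +simd128 path
    let fused = a.mul_add(b, c); // one rounding: the relaxed_madd path
    (unfused, fused)
}

fn main() {
    let (unfused, fused) = fma_vs_mul_add();
    println!("unfused = {unfused:e}");
    println!("fused   = {fused:e}");
    // The results differ only in the low bits (2^-25 for these operands),
    // the same order of magnitude as the 7th-8th decimal seen in the kernels.
    println!("diff    = {:e}", fused - unfused);
}
```

Both results are valid roundings of the same real-number expression; which one a relaxed-simd host produces is implementation-defined, which is exactly why the comparison is done under `Approximation::Close` rather than bit-equality.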
*Force-pushed `031eec0` to `fdfe675`.*
---
Thanks for the quick read. On the easy ones:

2. naming — kept.
3. AddRowColProducts — kept the 35 sites for consistency, as you suggest.
4. 4x4 — already converted in this PR (8 sites: 4 AddMatMul + 4 AddRowColProducts). The "keep 4x4 at all?" debate I'd happily punt to a separate PR — it's currently dead code (…).
5. linalg/benches — moved; both flavors run via:

```
RUSTFLAGS='-C target-feature=+simd128,+relaxed-simd' \
CARGO_TARGET_WASM32_WASIP1_RUNNER='wasmtime --env RUST_TEST_NOCAPTURE=1 --' \
cargo bench --target wasm32-wasip1 -p tract-linalg --bench wasm
```

Tests still pass clean on both flavors (1727 / 1740 passed, with 0 / 1044 `f32x4.relaxed_madd` ops respectively). Force-pushed (…).

1. runtime vs build-time dispatch — taking that one offline to think through; I'll come back with a separate post once I've actually walked through what's possible. Short version of the constraint: WASM relaxed-simd opcodes fail validation at module instantiation if the host runtime doesn't support the proposal, so a single binary can't degrade gracefully across hosts. Module selection happens at the JS host (or wasmtime config) layer. Want to think through what shape that should take inside tract before I reply.
Build flags, consumer-side dispatch playbook (browser + wasmtime), and quality notes. Documents the host-layer module-selection pattern as the WASM equivalent of x86/ARM runtime CPU detection — the validation-at-instantiation constraint rules out in-binary runtime dispatch.
---
Thanks for pushing on this — we share the dislike for build-time config and spent real time looking for an in-binary runtime alternative before defaulting to it.

**The constraint**

WASM validates the entire module at instantiation, before any code runs. A binary containing even one relaxed-simd opcode fails validation on a host without the proposal and never loads. This is also why the existing tract pattern doesn't transfer 1:1 here.

**Where runtime dispatch moves to**

One layer up — host-side module selection at load time:

```js
// Probe: validate a tiny module that contains a relaxed-simd opcode
// (the wasm-feature-detect pattern); hosts without the proposal reject it.
const wantRelaxed = WebAssembly.validate(relaxedSimdProbeBytes);
const url = wantRelaxed ? '/tract-relaxed.wasm' : '/tract.wasm';
```

Universal browser support since 2023 (Chrome 114+, Firefox 120+, Safari 17+), wasmtime 16+. Tract internals stay build-time gated; the consumer ships two binaries and picks at host-init time. The user-facing contract — "tract picks the fast path automatically on capable hosts" — is preserved; the detect-and-dispatch site just lives in the host shim instead of in the kernel registration logic. I've added `linalg/WASM_RELAXED_SIMD.md` documenting the pattern.

**Why macro vs separate kernels collapses under this**

Both patterns produce the same two binaries (no-relaxed / relaxed-required) and rely on the same host-side selection. Separate kernels would only matter if WASM had x86's runtime-trap behaviour — it doesn't. So the macro is strictly cheaper on source weight (~25 LoC vs ~3000 LoC for 6 duplicated kernel bodies) for an identical outcome.

**What I ruled out**

(…)

Happy to dig further if you see an angle I missed.
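To make the "no in-binary dispatch" point concrete: on wasm32 (and everywhere else), `cfg!(target_feature = ...)` folds to a compile-time constant, so there is nothing to branch on at runtime. A hypothetical sketch, not code from the PR:

```rust
// There is no CPUID analogue in wasm: target_feature is resolved when rustc
// runs, so this "detection" is a constant baked into the emitted module.
const RELAXED_SIMD_COMPILED_IN: bool = cfg!(target_feature = "relaxed-simd");

fn main() {
    // Whichever value this prints was decided at build time by RUSTFLAGS
    // (-C target-feature=...), not by the host the module runs on.
    println!("relaxed-simd compiled in: {RELAXED_SIMD_COMPILED_IN}");
}
```

On a `+relaxed-simd` build the constant is `true` and the relaxed opcodes are unconditionally present; on the baseline build it is `false` and they are unconditionally absent. Either way there is no single binary that carries both paths, which is why the selection must move to the host.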
---
Direct answer to the probe question: not from inside the wasm binary, for the same validation-at-instantiation reason as the broader runtime-dispatch issue. A few interpretations of "probe" and where each fails:

(…)

The general rule: any wasm code path that contains relaxed-simd opcodes blocks the host from loading the module without support. So self-detection in a single tract binary collapses to "always require relaxed-simd, then check at runtime" — which forfeits the fallback and means the binary doesn't load on legacy hosts at all. Host-side detection (…).

Also: noted on #6 — kept.
---
Understood. The binary does not even load if it features the new opcodes, killing more or less any possibility of runtime selection within a single object. So we have two wasm build configurations. Do we need to test both configurations? I think maybe the linalg tests, not necessarily the rest. Which configuration makes the most sense as a default in 2026? The conservative one, for compatibility? Or does the ecosystem already have wide adoption of the new opcodes, so we keep the strict opcodes for people with applications needing back compatibility, who will pay the extra complexity tax?
---
Two questions: *(a)* which configs CI tests, and *(b)* which is the default
in 2026. Taking them in order.
(a) CI: linalg-only on both, rest on one
Agreed — tract-linalg is the only crate where the macro lives and where the
f32x4.relaxed_madd opcode actually appears, so it's the only place where
the two configs can produce divergent bytecode. Everything downstream (
tract-core, tract-onnx, model E2E) just calls into the kernels via MMM and
is observationally identical between the two builds modulo speed.
Concretely:
| [wasm32-wasip1] | +simd128 | +simd128,+relaxed-simd |
|---|---|---|
| tract-linalg --lib | full suite | full suite (currently 1727 / 1740, +13 from #2195) |
| tract-core, tract-onnx, example bins | one of the two | skipped on the other |

Default the "one of the two" to +simd128 since it's the broader-compatibility build and exercises the no-relaxed madd_f32x4! fallback. Net CI cost: one extra `cargo test -p tract-linalg --target wasm32-wasip1` invocation per PR.
(b) The default in 2026 — short version
*Keep +simd128 as the default; document +simd128,+relaxed-simd as the
recommended production flag.* Reasoning below, with the data that drove it.
------------------------------
Relaxed-SIMD support, by runtime
Sources: caniuse.com/wf-wasm-simd-relaxed (April 2026), Mozilla Bugzilla
1706922, wasmtime PR #7285, V8 status.
| Runtime | Unflagged-default since | Status (May 2026) |
|---|---|---|
| Chrome / Chromium | *114* (May 2023) | Default on; current stable 151 |
| Edge | *114* (May 2023) | Default on (Chromium) |
| Opera | *100* (May 2023) | Default on (Chromium) |
| Samsung Internet | *23* (Aug 2023) | Default on |
| Firefox (desktop) | *146* (Jan 2026) | Default on; current stable 153 |
| Firefox (Android) | *150* (Mar 2026) | Default on |
| Chrome for Android | *114* (May 2023) | Default on |
| *Safari (desktop)* | — | *Behind a flag (Develop > Feature Flags > WebAssembly relaxed SIMD)*; through 26.5 / TP |
| *Safari (iOS)* | — | *Behind a flag*; through 26.5 |
| Node.js | *20* (~Apr 2023, V8 11+) | Default on |
| Deno | (V8 11+, ~mid-2023) | Default on |
| Bun | (since ~0.5) | Default on |
| Wasmtime | *15.0* (Nov 2023, PR #7285) | Default on |
| Wasmer | — | Off by default; enabled via `--enable-simd` |
The single-line summary: *everywhere except Safari (both flavours),
relaxed-simd is the default in 2026.* Safari is the load-bearing exception.
Browser market share, March/April 2026
Source: StatCounter GlobalStats (March 2026 all-platform; April 2026
desktop; March 2026 mobile).
**All platforms**

| Browser | Share | Relaxed-SIMD by default |
|---|---|---|
| Chrome | 66.7% | ✅ |
| Safari | 17.9% | ❌ (flag) |
| Edge | 5.79% | ✅ |
| Firefox | 2.33% | ✅ |
| Samsung Internet | 2.06% | ✅ |
| Opera | 2.0% | ✅ |
| *Total ✅ default* | *~78.9%* | |
| *Total ❌ flag* | *~17.9%* | |

**Desktop only (April 2026)**
| Browser | Share | Relaxed-SIMD by default |
|---|---|---|
| Chrome | 71.48% | ✅ |
| Edge | 11.52% | ✅ |
| Safari | 6.20% | ❌ (flag) |
| Firefox | 4.22% | ✅ |
| Opera | 2.28% | ✅ |
| Brave | 1.47% | ✅ |
| *Total ✅ default* | *~91.0%* | |
| *Total ❌ flag* | *~6.2%* | |

**Mobile only (March 2026)**
| Browser | Share | Relaxed-SIMD by default |
|---|---|---|
| Chrome | 65.10% | ✅ |
| Safari (iOS) | 25.99% | ❌ (flag) |
| Samsung Internet | 3.75% | ✅ |
| Opera | 1.58% | ✅ |
| UC Browser | 1.13% | partial |
| Firefox | 0.62% | ✅ |
| *Total ✅ default* | *~71.1%* | |
| *Total ❌ flag* | *~26.0%* | |
caniuse.com's headline number — *73.89% global usage* for unflagged
relaxed-simd as of April 2026 — agrees with the all-platform table (Safari
is the gap). The mobile picture is the weakest: ~26% of mobile sessions are
iOS Safari, which can't load a relaxed-simd binary at all.
The Safari problem, specifically
Safari shipped fixed-width SIMD in 16.4 (March 2023), so all the +simd128
work tract has done already lands. Relaxed-simd was added behind a flag in
2024 (State of WebAssembly 2025/2026
<https://platform.uno/blog/the-state-of-webassembly-2025-2026/>) and has
stayed behind a flag through Safari 26.5 / TP — including iOS. iOS is the
kicker: every browser on iOS uses WebKit under the hood (Apple policy), so
Chrome-on-iOS, Edge-on-iOS, Firefox-on-iOS all inherit the same flag-gated
state. There's no path around it short of the user toggling Develop >
Feature Flags themselves, which a tract consumer can't ask their end users
to do.
What this means for the default
A tract consumer who builds with +simd128,+relaxed-simd and ships the
resulting .wasm to a browser context will hit LinkError / CompileError on
roughly *18% of all sessions* (26% of mobile sessions, 6% of desktop) — the
binary won't even load. That's a hard failure visible to the end user, not
a silent perf regression.
A consumer who builds with the current +simd128 default loads everywhere,
runs at the speeds we already ship, and forfeits the 1.08–1.46× E2E win on
the 80%+ of sessions that *would* support relaxed-simd. That's a soft loss,
recoverable at any time by flipping a flag — and the consumer can flip it
themselves with a one-liner once they know their audience.
Asymmetric costs → conservative default:
- *Default = +simd128* (current behaviour, no breakage anywhere). Add a
one-line note in the build docs that +simd128,+relaxed-simd is
recommended for non-browser targets and for browser targets that ship two
builds with host-side dispatch (the WebAssembly.validate probe pattern in
linalg/WASM_RELAXED_SIMD.md).
- *Document the upgrade path explicitly* so a consumer who knows their
audience (e.g., "internal tool, all our users on Chrome 114+") can flip the
flag without spelunking. Two RUSTFLAG strings, one paragraph.
- *Revisit when Safari unflags relaxed-simd.* Once that lands in a
stable Safari release, the global ✅-default share crosses ~96% and the cost
equation flips. A separate PR can promote +simd128,+relaxed-simd to the
default at that point. Safari has the implementation behind the flag
already, so this is a "when" not an "if" — likely 2026 H2 or 2027 H1 based
on their typical flag-to-default cadence.
One alternative worth considering
If the project wants to treat browser and non-browser as separate
worlds: *default
+simd128,+relaxed-simd for wasm32-wasip1 / wasm32 server runtimes, default
+simd128 for wasm32-unknown-unknown browser builds*, gated in a build
script. Wasmtime, Node, Deno, Bun all support relaxed-simd by default; the
only place where the binary load fails is browsers, and
wasm32-unknown-unknown is the standard target for those. This keeps
server-side tract consumers on the fast path without breaking browser
consumers.
Trade-off: the asymmetry is non-obvious to new contributors and adds a
build-script branch. I'd lean against it for this PR but flag it as worth a
separate discussion if the project wants to lean into wasmtime/server-WASM
as a first-class deployment target.
------------------------------
*Concrete proposal for this PR:* keep default at +simd128, land the macro
substitutions and linalg/benches/wasm.rs move as already pushed (fdfe675),
keep linalg/WASM_RELAXED_SIMD.md as the documented upgrade path, add the
linalg-only dual-config CI matrix above. Promotion of +simd128,+relaxed-simd
to the default is a separate PR, gated on Safari unflagging.
Best Regards
CK
> On Thu, May 7, 2026 at 09:43, Mathieu Poumeyrol (*kali*) wrote:
> for 1. can we "probe" for relaxed and basically implement detection ourselves?
---
@kali, PDF attached, easier to review: relaxed-simd-ecosystem-2026.pdf
---
OK, thanks for the landscape report. So Safari is behind. I agree this means keeping the strict impl as default, documenting instructions on how to switch to relaxed, and hinting at how to deploy side by side. I guess we make a tech note in the doc/ folder for starters. Problem is discoverability; wasm is not even a crate of its own. I seriously need to spend some quality time with a bot to rewrite the top-level README. It barely mentions platform support at all, only frameworks and models :/ This is on my roadmap, but let's keep it simple here (=> tech note/recipe in doc/ or linalg/).
---
And that's exactly what you did. Great. Should we move this from draft to ready?
### Summary

Extends #2195's relaxed-simd FMA pattern from sigmoid/tanh element-wise kernels to the 6 WASM f32 MMM kernels in `linalg/src/wasm.rs` (`wasm_f32_4x4`, `4x1`, `8x1`, `16x1`, `32x1`, `8x8`). A `madd_f32x4!` macro (one definition pair, gated on `cfg(target_feature = "relaxed-simd")`) replaces 70 inner-loop sites of `f32x4_add(_, f32x4_mul(_, _))` (35 in `FusedKerSpec::AddMatMul`, 35 in `FusedKerSpec::AddRowColProducts`). All kernel signatures, registrations, and tests are unchanged.

### Pattern
Per kernel: 16 + 16 in 8x8, 4 + 4 in 4x4, 1 + 1 in 4x1, 2 + 2 in 8x1, 4 + 4 in 16x1, 8 + 8 in 32x1.
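The definition pair looks roughly like this. A sketch, not the exact PR diff: the real macro lives in `linalg/src/wasm.rs` and operates on `core::arch::wasm32` `v128` values; the scalar `f32x4_mul` / `f32x4_add` stand-ins below let the fallback arm run on any target.

```rust
// Relaxed arm: only compiled for wasm32 with +relaxed-simd, where it expands
// to the single-rounding fused form.
#[cfg(all(target_arch = "wasm32", target_feature = "relaxed-simd"))]
macro_rules! madd_f32x4 {
    ($a:expr, $b:expr, $c:expr) => {
        core::arch::wasm32::f32x4_relaxed_madd($a, $b, $c)
    };
}

// Fallback arm: the pre-existing mul-then-add pair (two roundings), matching
// the f32x4_add(_, f32x4_mul(_, _)) shape the macro replaces.
#[cfg(not(all(target_arch = "wasm32", target_feature = "relaxed-simd")))]
macro_rules! madd_f32x4 {
    ($a:expr, $b:expr, $c:expr) => {
        f32x4_add($c, f32x4_mul($a, $b))
    };
}

// Scalar stand-ins so the fallback arm is demonstrable off-wasm.
fn f32x4_mul(a: f32, b: f32) -> f32 { a * b }
fn f32x4_add(a: f32, b: f32) -> f32 { a + b }

fn main() {
    // The inner-loop accumulation shape substituted 70 times: acc = a*b + acc
    let acc = madd_f32x4!(2.0_f32, 3.0, 1.0);
    println!("{acc}");
}
```

Because both arms expand at the call site, the 70 substituted lines are identical in source; only the emitted bytecode differs between the two builds.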
### Why a macro instead of variant kernels (à la #2195)

#2195 dropped its baseline-simd128 sigmoid/tanh because the activation slot's no-relaxed fallback was the generic scalar polynomial — duplication was free. For MMM that reverses: x86_64 / arm64fp16 ship variant kernels because they have runtime CPU-feature detection. WASM has none — `target_feature` is purely compile-time. So for WASM the right shape is one source whose codegen flips with the build flag. The macro gives that.

### Performance — kernel-level (`microbench_8x8`, M1 Pro / wasmtime 44.0.0)

Same source, same harness, paired runs; only difference is the build flag.
### Performance — E2E models (this PR alone)

`wasm-model-bench` example crate added in this PR; both binaries built from this branch, only RUSTFLAGS differs (`+simd128` vs `+simd128,+relaxed-simd`). M1 Pro / wasmtime 44.0.0, median of n=5 reps.

Read: 8x8-dominated workloads (Inception, SqueezeNet, modnet) hit 1.18–1.46×. Depthwise-heavy (MobileNet) and bandwidth-bound GEMV-heavy (DFN3 GRU at M=256) gain less, around 1.08–1.10×. Transformer-class (MiniLM) lands at 1.10×, dominated by attention QK·V GEMMs. All consistent with the per-kernel speedup table.
### Quality — L2 norms identical to 7 sig figs

Per-element diff is in the 7th–8th decimal — the FMA single-rounding effect, well within `Approximation::Close(1e-4)`.

### Bytecode verification
`wasm-objdump` op counts for `f32x4.relaxed_madd` in `tract-linalg --tests` on `wasm32-wasip1`: 0 with `+simd128`, 1044 with `+simd128,+relaxed-simd`. Confirms #2195's "LLVM does not auto-emit FMA" finding extends to MMM kernels (~972 from #2195's sigmoid/tanh inlining + 70 from this PR's substitutions + ~2 LLVM opportunistic).
### Test plan

- `cargo test -p tract-linalg --lib --target wasm32-wasip1` (wasmtime):
  - `+simd128`: 1727 passed, 0 failed
  - `+simd128,+relaxed-simd`: 1740 passed, 0 failed (+13 sigmoid/tanh relaxed cases from #2195)
- `cargo check` clean on both flag combinations (no new warnings)
- `wasm-objdump -d` op-count verification (table above)
- (… `f32x4.relaxed_madd` to native FMA on either architecture, so the speedup should carry, but unverified)
Full kernel-level + E2E + quality + bytecode data, plus a real-world chained DFN3 inference run (libDF on canonical tract, VoiceBank+DEMAND corpus, 60.8% RTF reduction with 100% sample-level bit-equality and identical DNSMOS scores), in
examples/wasm-model-bench/MMM_MACRO_ATTRIBUTION.pdf.Diff
`linalg/src/wasm.rs`: macro definition (+25), 70 inner-loop substitutions (net 0), `microbench_8x8` test module (+115). New `examples/wasm-model-bench/` crate for the E2E numbers above. Net +335 / −70 in the existing files.

### Open questions @kali

1. **Macro vs duplicated relaxed kernels.** WASM has no runtime CPU-feature detection, so the x86_64 / arm64fp16 pattern (variant kernel + runtime dispatch) doesn't apply. The single-source macro flips codegen via `cfg(target_feature = "relaxed-simd")` at compile time; the `not(...)` arm keeps the current mul-then-add pair. Acceptable? Or do you prefer duplicated `WasmF32_8x8Relaxed` etc. matching #2195's split style anyway, accepting the LoC cost?
2. **`madd_f32x4!` naming.** Reviewer-call. Alternatives: `wasm_madd_f32x4!` (matches `wasm_f32_*` kernel naming), `f32x4_madd!` (matches the underlying intrinsic), `simd_madd!` (shortest). Any preference?
3. **`AddRowColProducts` substitution.** 35 of the 70 sites are in `FusedKerSpec::AddRowColProducts`, rarely the runtime bottleneck. I substituted it for symmetry. Worth keeping, or leaner to limit scope to `AddMatMul` only?
4. **Dead-code `wasm_f32_4x4`.** Registered but not selected by `mmm_f32` (always returns 8x8 via `max(mr*nr)`). Substituted for completeness — no perf impact since it's not called. Drop those substitutions to keep the diff smaller, or leave for symmetry?
5. **`microbench_8x8` placement.** Currently a `#[cfg(test)] #[ignore]` module inside `linalg/src/wasm.rs` (+115 LoC). Split into `linalg/benches/wasm_8x8.rs` instead? The `#[cfg(test)]` form lets it run via the same `cargo test --target wasm32-wasip1` invocation as the correctness suite, which keeps wasm-side benching frictionless — but `linalg/benches/` is the established home, so either way works.
6. **`examples/wasm-model-bench` lifetime.** Added here to produce the E2E numbers; useful beyond this PR (tract CI on wasm32, future #2195 / #2192-style perf claims). Keep in the workspace, or split into a separate "tract: add wasm-model-bench example crate" PR and rebase this one on top?
7. **GEMV M=100/M=256 variance (`microbench_dispatch_gemv`).** Up to 30% run-to-run within the same build during paired runs. Likely methodology bias from the all-4-kernels-back-to-back loop. Want me to land a separate methodology-cleanup PR before this one, or accept the headline 8x8 numbers + caveat noted?