
linalg/wasm: relaxed-simd FMA in MMM kernels (madd_f32x4! macro)#2199

Draft
czoli1976 wants to merge 2 commits into sonos:main from czoli1976:feature/wasm-mmm-8x8-relaxed-simd

Conversation

@czoli1976
Contributor

Summary

Extends #2195's relaxed-simd FMA pattern from sigmoid/tanh element-wise kernels to the 6 WASM f32 MMM kernels in linalg/src/wasm.rs (wasm_f32_4x4, 4x1, 8x1, 16x1, 32x1, 8x8).

A madd_f32x4! macro (one definition pair, gated on cfg(target_feature = "relaxed-simd")) replaces 70 inner-loop sites of f32x4_add(_, f32x4_mul(_, _)) (35 in FusedKerSpec::AddMatMul, 35 in FusedKerSpec::AddRowColProducts). All kernel signatures, registrations, and tests are unchanged.

Pattern

```rust
#[cfg(target_feature = "relaxed-simd")]
macro_rules! madd_f32x4 {
    ($acc:expr, $a:expr, $b:expr) => { f32x4_relaxed_madd($a, $b, $acc) };
}
#[cfg(not(target_feature = "relaxed-simd"))]
macro_rules! madd_f32x4 {
    ($acc:expr, $a:expr, $b:expr) => { f32x4_add($acc, f32x4_mul($a, $b)) };
}
```
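For illustration only, a scalar stand-in with the same call shape (plain f32 instead of v128 lanes, hypothetical names; the real macro wraps wasm intrinsics and only compiles on wasm32):

```rust
// Scalar stand-in for the substitution pattern: identical call shape,
// plain f32 arithmetic so it runs on any target.
macro_rules! madd {
    ($acc:expr, $a:expr, $b:expr) => {
        $acc + $a * $b
    };
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for (&x, &y) in a.iter().zip(b.iter()) {
        // Before this PR the site read: acc = f32x4_add(acc, f32x4_mul(x, y));
        acc = madd!(acc, x, y);
    }
    acc
}
```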

Per kernel: 16 + 16 in 8x8, 4 + 4 in 4x4, 1 + 1 in 4x1, 2 + 2 in 8x1, 4 + 4 in 16x1, 8 + 8 in 32x1.
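As a sanity check, the per-kernel pairs (AddMatMul sites, AddRowColProducts sites) sum to the 70 substitution sites claimed above:

```rust
// (AddMatMul, AddRowColProducts) site counts for 8x8, 4x4, 4x1, 8x1, 16x1, 32x1.
fn total_sites() -> u32 {
    let per_kernel: [(u32, u32); 6] = [(16, 16), (4, 4), (1, 1), (2, 2), (4, 4), (8, 8)];
    per_kernel.iter().map(|(m, r)| m + r).sum()
}
```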

Why a macro instead of variant kernels (à la #2195)

#2195 dropped its baseline-simd128 sigmoid/tanh because the activation slot's no-relaxed fallback was the generic scalar polynomial — duplication was free. For MMM that reverses:

  1. The existing simd128 kernel is the no-relaxed fallback. There's no scalar MMM kernel of comparable quality to fall back on.
  2. Each MMM kernel is ~500 LoC. Six duplicated "Relaxed" structs would be ~3000 LoC of near-identical source.

x86_64 / arm64fp16 ship variant kernels because they have runtime CPU-feature detection. WASM has none — target_feature is purely compile-time. So for WASM the right shape is one source whose codegen flips with the build flag. The macro gives that.

Performance — kernel-level (microbench_8x8, M1 Pro / wasmtime 44.0.0)

Same source, same harness, paired runs; only difference is the build flag.

| Shape | Baseline ns | Relaxed ns | Speedup |
| --- | --- | --- | --- |
| DFN3 m=64 k=64 n=8 | 2,547 | 1,845 | 1.38× |
| m=128 k=128 n=8 | 8,679 | 6,039 | 1.44× |
| m=256 k=256 n=64 | 249,803 | 165,875 | 1.51× |
| m=384 k=1536 n=8 | 267,932 | 172,875 | 1.55× |

Performance — E2E models (this PR alone)

wasm-model-bench example crate added in this PR; both binaries built from this branch, only RUSTFLAGS differs (+simd128 vs +simd128,+relaxed-simd). M1 Pro / wasmtime 44.0.0, median of n=5 reps.

| Model | Class | Baseline ms | Relaxed ms | Speedup |
| --- | --- | --- | --- | --- |
| Inception v3 (NNEF, tract CI canonical) | Vision CNN, deep + heavy GEMM | 427.3 | 293.2 | 1.46× |
| SqueezeNet 1.1 (ONNX, 4.7MB) | Vision CNN, 1×1 + 3×3 conv | 30.9 | 24.2 | 1.28× |
| modnet @ 512×512 (ONNX, real-time matting) | Vision CNN, im2col GEMM | 1,875.9 | 1,586.6 | 1.18× |
| all-MiniLM-L6-v2 (batch=1, seq=128) | Transformer (BERT-class) | 103.0 | 93.8 | 1.10× |
| DFN3 erb_dec (ONNX, T=100) | RNN-style audio | 14.65 | 13.41 | 1.09× |
| MobileNet v2 (ONNX, 14MB, tract CI) | Vision CNN, depthwise-heavy | 75.2 | 69.1 | 1.09× |
| DFN3 df_dec (NNEF post-concretize, T=100) | RNN-style audio | 11.9 | 11.0 | 1.08× |
| DFN3 df_dec (ONNX, T=100) | RNN-style audio | 12.58 | 11.60 | 1.08× |

Read: 8x8-dominated workloads (Inception, SqueezeNet, modnet) hit 1.18–1.46×. Depthwise-heavy (MobileNet) and bandwidth-bound GEMV-heavy (DFN3 GRU at M=256) gain less, around 1.08–1.10×. Transformer-class (MiniLM) lands at 1.10×, dominated by attention QK·V GEMMs. All consistent with the per-kernel speedup table.

Quality — L2 norms identical to 7 sig figs

| Model | Baseline L2 | Relaxed L2 |
| --- | --- | --- |
| Inception v3 | 6.477089e-2 | 6.477089e-2 |
| DFN3 df_dec | 1.080686e-2 | 1.080686e-2 |

Per-element diff is in the 7th–8th decimal — the FMA single-rounding effect, well within Approximation::Close (1e-4).
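The effect described here is the standard fused multiply-add single rounding. A sketch using std's `f32::mul_add` (which, like `f32x4.relaxed_madd` on FMA-capable hosts, rounds once instead of twice):

```rust
// Two roundings vs one: results agree to within a few ULPs, which is
// why the L2 norms above match to 7 significant figures.
fn madd_two_roundings(acc: f32, a: f32, b: f32) -> f32 {
    acc + a * b // round(a*b), then round(acc + round(a*b))
}

fn madd_fused(acc: f32, a: f32, b: f32) -> f32 {
    a.mul_add(b, acc) // single rounding of a*b + acc
}
```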

Bytecode verification

tract-linalg --tests on wasm32-wasip1:

| Build | f32x4.relaxed_madd count |
| --- | --- |
| +simd128 | 0 |
| +simd128,+relaxed-simd | 1044 |

Confirms #2195's "LLVM does not auto-emit FMA" finding extends to MMM kernels (~972 from #2195's sigmoid/tanh inlining + 70 from this PR's substitutions + ~2 LLVM opportunistic).

Test plan

Detailed bench attribution

Full kernel-level + E2E + quality + bytecode data, plus a real-world chained DFN3 inference run (libDF on canonical tract, VoiceBank+DEMAND corpus, 60.8% RTF reduction with 100% sample-level bit-equality and identical DNSMOS scores), in examples/wasm-model-bench/MMM_MACRO_ATTRIBUTION.pdf.

Diff

linalg/src/wasm.rs: macro definition (+25), 70 inner-loop substitutions (net 0), microbench_8x8 test module (+115). New examples/wasm-model-bench/ crate for the E2E numbers above. Net +335 / −70 in the existing files.


Open questions @kali

  1. Macro vs duplicated relaxed kernels. WASM has no runtime CPU-feature detection, so the x86_64 / arm64fp16 pattern (variant kernel + runtime dispatch) doesn't apply. The single-source macro flips codegen via cfg(target_feature = "relaxed-simd") at compile time:

    • baseline kernel remains the fallback in the not(...) arm
    • 70 sites across 6 kernels stay single-sourced (vs ~3000 LoC if duplicated)

    Acceptable? Or do you prefer duplicated WasmF32_8x8Relaxed etc. matching #2195's split style anyway, accepting the LoC cost?

  2. madd_f32x4! naming. Reviewer-call. Alternatives: wasm_madd_f32x4! (matches wasm_f32_* kernel naming), f32x4_madd! (matches the underlying intrinsic), simd_madd! (shortest). Any preference?

  3. AddRowColProducts substitution. 35 of the 70 sites are in FusedKerSpec::AddRowColProducts, rarely the runtime bottleneck. I substituted it for symmetry. Worth keeping, or leaner to limit scope to AddMatMul only?

  4. Dead-code wasm_f32_4x4. Registered but not selected by mmm_f32 (always returns 8x8 via max(mr*nr)). Substituted for completeness — no perf impact since it's not called. Drop those substitutions to keep the diff smaller, or leave for symmetry?

  5. microbench_8x8 placement. Currently a #[cfg(test)] #[ignore] module inside linalg/src/wasm.rs (+115 LoC). Split into linalg/benches/wasm_8x8.rs instead? The #[cfg(test)] form lets it run via the same cargo test --target wasm32-wasip1 invocation as the correctness suite, which keeps wasm-side benching frictionless — but linalg/benches/ is the established home, so either way works.

  6. examples/wasm-model-bench lifetime. Added here to produce the E2E numbers; useful beyond this PR (tract CI on wasm32, future #2195/#2192-style perf claims). Keep in the workspace, or split into a separate "tract: add wasm-model-bench example crate" PR and rebase this one on top?

  7. GEMV M=100/M=256 variance (microbench_dispatch_gemv): up to 30% run-to-run within the same build during paired runs. Likely methodology bias from the all-4-kernels-back-to-back loop. Want me to land a separate methodology-cleanup PR before this one, or accept the headline 8x8 numbers + caveat noted?

@czoli1976 czoli1976 closed this May 7, 2026
@czoli1976 czoli1976 deleted the feature/wasm-mmm-8x8-relaxed-simd branch May 7, 2026 07:50
@czoli1976 czoli1976 restored the feature/wasm-mmm-8x8-relaxed-simd branch May 7, 2026 07:51
@czoli1976 czoli1976 reopened this May 7, 2026
@czoli1976 czoli1976 force-pushed the feature/wasm-mmm-8x8-relaxed-simd branch from 300406c to 031eec0 on May 7, 2026 08:21
@kali
Collaborator

kali commented May 7, 2026

  1. If we have no runtime detection, can we maintain a runtime dispatch remote-controlled by... anything on the side? An env var (or equivalent)? Runtime configuration propagated through TLS? Any dirty trick? I really dislike build-time configuration; I would prefer to avoid introducing the pattern now if we can.
  2. madd_f32x4! is perfectly fine as long as it does not leak out of the wasm modules
  3. AddRowMatMul and the other fused ops are never critical, but you did them, so we might as well keep them for consistency
  4. please convert 4x4 too. but the right question to ask is whether or not we keep it in the code or ditch it altogether (but that may be another debate/PR)
  5. let's move them to linalg/benches
  6. keep them here, this PR is very manageable.
  7. I think GEMV benches will always be harder, as they're more about memory bandwidth than compute, making them more volatile. Specifically on a laptop or workstation with very sophisticated memory and power management. So that's life.

@kali
Collaborator

kali commented May 7, 2026

For 1.: can we "probe" for relaxed-simd and basically implement detection ourselves?

Extends sonos#2195's relaxed-simd FMA pattern from sigmoid/tanh to the 6 WASM
f32 MMM kernels via a cfg-gated madd_f32x4! macro. 70 substitutions across
wasm_f32_4x4 / 4x1 / 8x1 / 16x1 / 32x1 / 8x8 (35 in AddMatMul + 35 in
AddRowColProducts).

Speedup vs +simd128 baseline (M1 Pro, wasmtime 44.0.0):
- microbench_8x8 GEMM: 1.40-1.55x across DFN3-style and large shapes
- E2E across 8 models: 1.08-1.46x (Inception/SqueezeNet/MobileNet/modnet/
  MiniLM/DFN3 erb_dec/df_dec)

Quality: L2 norms identical to 7 sig figs; per-element diff at 7th-8th
decimal (FMA single-rounding, within Approximation::Close).

Bytecode: 1044 f32x4.relaxed_madd ops in the relaxed test binary, 0 in
baseline. Confirms sonos#2195's "LLVM does not auto-emit FMA" finding extends
to MMM kernels.

Adds examples/wasm-model-bench/ harness for the E2E numbers.
See examples/wasm-model-bench/MMM_MACRO_ATTRIBUTION.pdf for full data.
@czoli1976 czoli1976 force-pushed the feature/wasm-mmm-8x8-relaxed-simd branch from 031eec0 to fdfe675 on May 7, 2026 08:54
@czoli1976
Contributor Author

Thanks for the quick read. On the easy ones:

2. naming — kept madd_f32x4!. Confirmed it doesn't escape linalg/src/wasm.rs (and now linalg/benches/wasm.rs); not exported.

3. AddRowColProducts — kept the 35 sites for consistency, as you suggest.

4. 4x4 — already converted in this PR (8 sites: 4 AddMatMul + 4 AddRowColProducts). The "keep 4x4 at all?" debate I'd happily punt to a separate PR — it's currently dead code (mmm_f32 always returns 8x8 via max(mr*nr)), so its only cost is source weight. Worth filing an issue to track?

5. linalg/benches — moved both microbench_8x8 and microbench_32x1_isolated out of linalg/src/wasm.rs into a new linalg/benches/wasm.rs (single bench, two mods, gated on target_arch = "wasm32"). harness = false, manual Instant::now() timing — same shape as linalg/benches/arm64.rs. Run with:

```sh
RUSTFLAGS='-C target-feature=+simd128,+relaxed-simd' \
  CARGO_TARGET_WASM32_WASIP1_RUNNER='wasmtime --env RUST_TEST_NOCAPTURE=1 --' \
  cargo bench --target wasm32-wasip1 -p tract-linalg --bench wasm
```

(And +simd128 alone for the baseline comparison.) Net diff after the move: −240 LoC out of wasm.rs, +235 in the new bench file.

Tests still pass clean on both flavors (1727 / 1740 with 0 / 1044 f32x4.relaxed_madd ops).

Force-pushed (031eec0f → fdfe6759).


1. runtime vs build-time dispatch — taking that one offline to think through; I'll come back with a separate post once I've actually walked through what's possible. Short version of the constraint: WASM relaxed-simd opcodes fail validation at module instantiation if the host runtime doesn't support the proposal, so a single binary can't degrade gracefully across hosts. Module-selection happens at the JS host (or wasmtime config) layer. Want to think through what shape that should take inside tract before I reply.

Build flags, consumer-side dispatch playbook (browser + wasmtime), and
quality notes. Documents the host-layer module-selection pattern as the
WASM equivalent of x86/ARM runtime CPU detection — the validation-at-
instantiation constraint rules out in-binary runtime dispatch.
@czoli1976
Contributor Author

Thanks for pushing on this — we share the dislike for build-time config and spent real time looking for an in-binary runtime alternative before defaulting to cfg(target_feature). Short version: WASM's validation model forces the build-time choice, but the user-facing "runtime dispatch" contract is preserved at the host layer.

The constraint

WASM validates the entire module at instantiation, before any code runs. A binary containing f32x4.relaxed_madd fails to instantiate on hosts without relaxed-simd — LinkError / CompileError, not a runtime trap. So the x86/ARM pattern (one binary, both paths in source, runtime CPU detection picks at execution time) cannot be replicated in-binary on WASM: there's no point at which a wasm program "decides" whether to use FMA, because the FMA opcodes are either present (and host support is required) or absent.

This is also why the existing tract pattern doesn't transfer 1:1 here. arm64fp16 and the x86 FMA paths use compile-with-feature-enabled + runtime-check-and-register: intrinsics gated on target_feature, kernel struct only exists if the flag is set, and a second runtime check (is_aarch64_feature_detected!("fp16"), is_x86_feature_detected!("fma")) picks at registration time. In WASM the first half (intrinsic enablement) works the same; the second half (runtime detect-and-register) cannot, because the binary fails to load at all on hosts that lack the feature.
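For contrast, a portable sketch of the native detect-and-register half that WASM lacks (kernel names hypothetical; on non-x86 targets the probe collapses to a compile-time constant):

```rust
// Native pattern: compile the fast kernel behind target_feature, then let a
// runtime probe decide whether to register it. On wasm32 the probe half is
// impossible, so the registration list is fixed at build time instead.
fn register_kernels() -> Vec<&'static str> {
    let mut kernels = vec!["generic_f32_4x4"]; // always-safe scalar baseline
    if fma_available() {
        kernels.push("fma_f32_16x6"); // hypothetical runtime-selected variant
    }
    kernels
}

#[cfg(target_arch = "x86_64")]
fn fma_available() -> bool {
    std::arch::is_x86_feature_detected!("fma")
}

#[cfg(not(target_arch = "x86_64"))]
fn fma_available() -> bool {
    false // no runtime probe on this target
}
```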

Where runtime dispatch moves to

One layer up — host-side module selection at load time:

```js
const wantRelaxed = WebAssembly.validate(bytes, { builtins: ['relaxed_simd'] });
const url = wantRelaxed ? '/tract-relaxed.wasm' : '/tract.wasm';
```

Universal browser support since 2023 (Chrome 114+, Firefox 120+, Safari 17+), wasmtime 16+. Tract internals stay build-time gated; the consumer ships two binaries and picks at host-init time. The user-facing contract — "tract picks the fast path automatically on capable hosts" — is preserved; the detect-and-dispatch site just lives in the host shim instead of in the kernel registration logic.

I've added linalg/WASM_RELAXED_SIMD.md to this PR (commit a2655a6c) with the full playbook — build flags, JS browser snippet, wasmtime config snippet, quality notes — so downstreams have a default reference. Move/rename/restructure as you like.

Why macro vs separate kernels collapses under this

Both patterns produce the same two binaries (no-relaxed / relaxed-required) and rely on the same host-side selection. Separate kernels would only matter if WASM had x86's runtime-trap behaviour — it doesn't. So the macro is strictly cheaper on source weight (~25 LoC vs ~3000 LoC for 6 duplicated kernel bodies) for an identical outcome.

What I ruled out

  • Multi-module + import-fallback (kernels module loaded conditionally by host): cross-module call per kernel invocation eats the FMA win inside the inner K-loop. Boundary cost compounds against the per-call FMA savings.
  • Always-relaxed binary + atomic flag (env var flips kernel choice in-process): binary still requires relaxed-simd to instantiate, so it can't run on hosts without it. The flag would only let users opt OUT of FMA on hosts that DO support it (e.g., for bit-determinism) — different feature.
  • Component Model feature negotiation: works in principle, requires significant tract refactor; premature for this PR.
  • Drop the fallback (relaxed-only): regression on hosts pinned to wasm MVP.
  • Build-script auto-add of +relaxed-simd for wasm32 in 2026: would reduce friction but surprises users who have intentional reasons to opt out; not safe as a default.

Happy to dig further if you see an angle I missed.

@czoli1976
Contributor Author

Direct answer to the probe question: not from inside the wasm binary, for the same validation-at-instantiation reason as the broader runtime-dispatch issue. A few interpretations of "probe" and where each fails:

  1. Probe at init by attempting a relaxed_madd: the binary fails to instantiate on non-supporting hosts before any init code runs.
  2. Probe via try/catch on a kernel call: doesn't work — function-level validation is module-wide, so any function containing relaxed-simd opcodes prevents the entire module from loading on hosts without support.
  3. Probe via a separate tiny "is-relaxed-supported" wasm submodule loaded conditionally: works, but it's identical to WebAssembly.validate(bytes, { builtins: ['relaxed_simd'] }) — same primitive, just dressed up.

The general rule: any wasm code path that contains relaxed-simd opcodes blocks the host from loading the module without support. So self-detection in a single tract binary collapses to "always require relaxed-simd, then check at runtime" — which forfeits the fallback and means the binary doesn't load on legacy hosts at all.

Host-side detection (WebAssembly.validate in the browser, Engine::wasm_relaxed_simd in wasmtime — the playbook in linalg/WASM_RELAXED_SIMD.md) is the equivalent: same detection primitive, just at the boundary that can actually act on the result.

Also: noted on #6 — kept examples/wasm-model-bench in this PR.

@kali
Collaborator

kali commented May 7, 2026

Understood. The binary does not even load on an unsupporting host if it features the new opcodes, killing more or less any possibility for runtime selection within a single object.

So we have two wasm build configurations. Do we need to test both configurations? I think maybe the linalg tests, not necessarily the rest. Which configuration makes the most sense as a default in 2026? The conservative one, for compatibility? Or does the ecosystem already have wide adoption for the new opcodes, so we keep the strict opcodes for people with applications needing back compatibility, who will pay the extra complexity tax?

@czoli1976
Contributor Author

czoli1976 commented May 7, 2026 via email

@czoli1976
Contributor Author

@kali, PDF attached, easier to review: relaxed-simd-ecosystem-2026.pdf

@kali
Collaborator

kali commented May 8, 2026

OK, thanks for the landscape report. So Safari is behind. I agree this means keeping the strict impl as default, documenting instructions on how to switch to relaxed, and hinting at how to deploy side by side. I guess we make a tech note in the doc/ folder for starters. Problem is discoverability: wasm is not even a crate on its own. I seriously need to spend some quality time with a bot rewriting the top-level README. It barely mentions platform support at all, only frameworks and models :/ This is on my roadmap, but let's keep it simple here (=> tech note/recipe in doc/ or linalg/)

@kali
Collaborator

kali commented May 8, 2026

And that's exactly what you did, great. Should we move this from draft to ready?

