
linalg/wasm: relaxed-simd FMA in MMM kernels (madd_f32x4! macro)#2199

Draft
czoli1976 wants to merge 2 commits into sonos:main from czoli1976:feature/wasm-mmm-8x8-relaxed-simd

Conversation

@czoli1976
Contributor

Summary

Extends #2195's relaxed-simd FMA pattern from sigmoid/tanh element-wise kernels to the 6 WASM f32 MMM kernels in linalg/src/wasm.rs (wasm_f32_4x4, 4x1, 8x1, 16x1, 32x1, 8x8).

A madd_f32x4! macro (one definition pair, gated on cfg(target_feature = "relaxed-simd")) replaces 70 inner-loop sites of f32x4_add(_, f32x4_mul(_, _)) (35 in FusedKerSpec::AddMatMul, 35 in FusedKerSpec::AddRowColProducts). All kernel signatures, registrations, and tests are unchanged.

Pattern

```rust
#[cfg(target_feature = "relaxed-simd")]
macro_rules! madd_f32x4 {
    ($acc:expr, $a:expr, $b:expr) => { f32x4_relaxed_madd($a, $b, $acc) };
}
#[cfg(not(target_feature = "relaxed-simd"))]
macro_rules! madd_f32x4 {
    ($acc:expr, $a:expr, $b:expr) => { f32x4_add($acc, f32x4_mul($a, $b)) };
}
```
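For illustration only, a scalar stand-in with the same call shape (plain f32 instead of v128 lanes, hypothetical names; the real macro wraps wasm intrinsics and only compiles on wasm32):

```rust
// Scalar stand-in for the substitution pattern: identical call shape,
// plain f32 arithmetic so it runs on any target.
macro_rules! madd {
    ($acc:expr, $a:expr, $b:expr) => {
        $acc + $a * $b
    };
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for (&x, &y) in a.iter().zip(b.iter()) {
        // Before this PR the site read: acc = f32x4_add(acc, f32x4_mul(x, y));
        acc = madd!(acc, x, y);
    }
    acc
}
```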

Per kernel: 16 + 16 in 8x8, 4 + 4 in 4x4, 1 + 1 in 4x1, 2 + 2 in 8x1, 4 + 4 in 16x1, 8 + 8 in 32x1.
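As a sanity check, the per-kernel pairs (AddMatMul sites, AddRowColProducts sites) sum to the 70 substitution sites claimed above:

```rust
// (AddMatMul, AddRowColProducts) site counts for 8x8, 4x4, 4x1, 8x1, 16x1, 32x1.
fn total_sites() -> u32 {
    let per_kernel: [(u32, u32); 6] = [(16, 16), (4, 4), (1, 1), (2, 2), (4, 4), (8, 8)];
    per_kernel.iter().map(|(m, r)| m + r).sum()
}
```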

Why a macro instead of variant kernels (à la #2195)

#2195 dropped its baseline-simd128 sigmoid/tanh because the activation slot's no-relaxed fallback was the generic scalar polynomial — duplication was free. For MMM that reverses:

  1. The existing simd128 kernel is the no-relaxed fallback. There's no scalar MMM kernel of comparable quality to fall back on.
  2. Each MMM kernel is ~500 LoC. Six duplicated "Relaxed" structs would be ~3000 LoC of near-identical source.

x86_64 / arm64fp16 ship variant kernels because they have runtime CPU-feature detection. WASM has none — target_feature is purely compile-time. So for WASM the right shape is one source whose codegen flips with the build flag. The macro gives that.

Performance — kernel-level (microbench_8x8, M1 Pro / wasmtime 44.0.0)

Same source, same harness, paired runs; only difference is the build flag.

| Shape | Baseline ns | Relaxed ns | Speedup |
| --- | --- | --- | --- |
| DFN3 m=64 k=64 n=8 | 2,547 | 1,845 | 1.38× |
| m=128 k=128 n=8 | 8,679 | 6,039 | 1.44× |
| m=256 k=256 n=64 | 249,803 | 165,875 | 1.51× |
| m=384 k=1536 n=8 | 267,932 | 172,875 | 1.55× |

Performance — E2E models (this PR alone)

wasm-model-bench example crate added in this PR; both binaries built from this branch, only RUSTFLAGS differs (+simd128 vs +simd128,+relaxed-simd). M1 Pro / wasmtime 44.0.0, median of n=5 reps.

| Model | Class | Baseline ms | Relaxed ms | Speedup |
| --- | --- | --- | --- | --- |
| Inception v3 (NNEF, tract CI canonical) | Vision CNN, deep + heavy GEMM | 427.3 | 293.2 | 1.46× |
| SqueezeNet 1.1 (ONNX, 4.7MB) | Vision CNN, 1×1 + 3×3 conv | 30.9 | 24.2 | 1.28× |
| modnet @ 512×512 (ONNX, real-time matting) | Vision CNN, im2col GEMM | 1,875.9 | 1,586.6 | 1.18× |
| all-MiniLM-L6-v2 (batch=1, seq=128) | Transformer (BERT-class) | 103.0 | 93.8 | 1.10× |
| DFN3 erb_dec (ONNX, T=100) | RNN-style audio | 14.65 | 13.41 | 1.09× |
| MobileNet v2 (ONNX, 14MB, tract CI) | Vision CNN, depthwise-heavy | 75.2 | 69.1 | 1.09× |
| DFN3 df_dec (NNEF post-concretize, T=100) | RNN-style audio | 11.9 | 11.0 | 1.08× |
| DFN3 df_dec (ONNX, T=100) | RNN-style audio | 12.58 | 11.60 | 1.08× |

Read: 8x8-dominated workloads (Inception, SqueezeNet, modnet) hit 1.18–1.46×. Depthwise-heavy (MobileNet) and bandwidth-bound GEMV-heavy (DFN3 GRU at M=256) gain less, around 1.08–1.10×. Transformer-class (MiniLM) lands at 1.10×, dominated by attention QK·V GEMMs. All consistent with the per-kernel speedup table.

Quality — L2 norms identical to 7 sig figs

| Model | Baseline L2 | Relaxed L2 |
| --- | --- | --- |
| Inception v3 | 6.477089e-2 | 6.477089e-2 |
| DFN3 df_dec | 1.080686e-2 | 1.080686e-2 |

Per-element diff is in the 7th–8th decimal — the FMA single-rounding effect, well within Approximation::Close (1e-4).
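The effect described here is the standard fused multiply-add single rounding. A sketch using std's `f32::mul_add` (which, like `f32x4.relaxed_madd` on FMA-capable hosts, rounds once instead of twice):

```rust
// Two roundings vs one: results agree to within a few ULPs, which is
// why the L2 norms above match to 7 significant figures.
fn madd_two_roundings(acc: f32, a: f32, b: f32) -> f32 {
    acc + a * b // round(a*b), then round(acc + round(a*b))
}

fn madd_fused(acc: f32, a: f32, b: f32) -> f32 {
    a.mul_add(b, acc) // single rounding of a*b + acc
}
```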

Bytecode verification

tract-linalg --tests on wasm32-wasip1:

| Build | f32x4.relaxed_madd count |
| --- | --- |
| +simd128 | 0 |
| +simd128,+relaxed-simd | 1044 |

Confirms #2195's "LLVM does not auto-emit FMA" finding extends to MMM kernels (~972 from #2195's sigmoid/tanh inlining + 70 from this PR's substitutions + ~2 LLVM opportunistic).

Test plan

Detailed bench attribution

Full kernel-level + E2E + quality + bytecode data, plus a real-world chained DFN3 inference run (libDF on canonical tract, VoiceBank+DEMAND corpus, 60.8% RTF reduction with 100% sample-level bit-equality and identical DNSMOS scores), in examples/wasm-model-bench/MMM_MACRO_ATTRIBUTION.pdf.

Diff

linalg/src/wasm.rs: macro definition (+25), 70 inner-loop substitutions (net 0), microbench_8x8 test module (+115). New examples/wasm-model-bench/ crate for the E2E numbers above. Net +335 / −70 in the existing files.


Open questions @kali

  1. Macro vs duplicated relaxed kernels. WASM has no runtime CPU-feature detection, so the x86_64 / arm64fp16 pattern (variant kernel + runtime dispatch) doesn't apply. The single-source macro flips codegen via cfg(target_feature = "relaxed-simd") at compile time:

    • baseline kernel remains the fallback in the not(...) arm
    • 70 sites across 6 kernels stay single-sourced (vs ~3000 LoC if duplicated)

    Acceptable? Or do you prefer duplicated WasmF32_8x8Relaxed etc. matching #2195's split style anyway, accepting the LoC cost?

  2. madd_f32x4! naming. Reviewer-call. Alternatives: wasm_madd_f32x4! (matches wasm_f32_* kernel naming), f32x4_madd! (matches the underlying intrinsic), simd_madd! (shortest). Any preference?

  3. AddRowColProducts substitution. 35 of the 70 sites are in FusedKerSpec::AddRowColProducts, rarely the runtime bottleneck. I substituted it for symmetry. Worth keeping, or leaner to limit scope to AddMatMul only?

  4. Dead-code wasm_f32_4x4. Registered but not selected by mmm_f32 (always returns 8x8 via max(mr*nr)). Substituted for completeness — no perf impact since it's not called. Drop those substitutions to keep the diff smaller, or leave for symmetry?

  5. microbench_8x8 placement. Currently a #[cfg(test)] #[ignore] module inside linalg/src/wasm.rs (+115 LoC). Split into linalg/benches/wasm_8x8.rs instead? The #[cfg(test)] form lets it run via the same cargo test --target wasm32-wasip1 invocation as the correctness suite, which keeps wasm-side benching frictionless — but linalg/benches/ is the established home, so either way works.

  6. examples/wasm-model-bench lifetime. Added here to produce the E2E numbers; useful beyond this PR (tract CI on wasm32, future #2195/#2192-style perf claims). Keep in the workspace, or split into a separate "tract: add wasm-model-bench example crate" PR and rebase this one on top?

  7. GEMV M=100/M=256 variance (microbench_dispatch_gemv): up to 30% run-to-run within the same build during paired runs. Likely methodology bias from the all-4-kernels-back-to-back loop. Want me to land a separate methodology-cleanup PR before this one, or accept the headline 8x8 numbers + caveat noted?

@czoli1976 czoli1976 closed this May 7, 2026
@czoli1976 czoli1976 deleted the feature/wasm-mmm-8x8-relaxed-simd branch May 7, 2026 07:50
@czoli1976 czoli1976 restored the feature/wasm-mmm-8x8-relaxed-simd branch May 7, 2026 07:51
@czoli1976 czoli1976 reopened this May 7, 2026
@czoli1976 czoli1976 force-pushed the feature/wasm-mmm-8x8-relaxed-simd branch from 300406c to 031eec0 on May 7, 2026 08:21
@kali
Collaborator

kali commented May 7, 2026

  1. If we have no runtime detection, can we maintain a runtime dispatch remote-controlled by... anything on the side? An env var (or equivalent)? Runtime configuration propagated through TLS? Any dirty trick? I really dislike build-time configuration; I would prefer to avoid introducing the pattern now if we can.
  2. madd_f32x4! is perfectly fine as long as it does not leak out of the wasm modules
  3. AddRowMatMul and the other fused ops are never critical, but you did them, so we might as well keep them for consistency
  4. please convert 4x4 too. but the right question to ask is whether or not we keep it in the code or ditch it altogether (but that may be another debate/PR)
  5. let's move them to linalg/benches
  6. keep them here, this PR is very manageable.
  7. I think GEMV benches will always be harder, as they're more about memory bandwidth than compute, making them more volatile. Specifically on a laptop or workstation with very sophisticated memory and power management. So that's life.

@kali
Collaborator

kali commented May 7, 2026

For 1.: can we "probe" for relaxed-simd and basically implement detection ourselves?

Extends sonos#2195's relaxed-simd FMA pattern from sigmoid/tanh to the 6 WASM
f32 MMM kernels via a cfg-gated madd_f32x4! macro. 70 substitutions across
wasm_f32_4x4 / 4x1 / 8x1 / 16x1 / 32x1 / 8x8 (35 in AddMatMul + 35 in
AddRowColProducts).

Speedup vs +simd128 baseline (M1 Pro, wasmtime 44.0.0):
- microbench_8x8 GEMM: 1.40-1.55x across DFN3-style and large shapes
- E2E across 8 models: 1.08-1.46x (Inception/SqueezeNet/MobileNet/modnet/
  MiniLM/DFN3 erb_dec/df_dec)

Quality: L2 norms identical to 7 sig figs; per-element diff at 7th-8th
decimal (FMA single-rounding, within Approximation::Close).

Bytecode: 1044 f32x4.relaxed_madd ops in the relaxed test binary, 0 in
baseline. Confirms sonos#2195's "LLVM does not auto-emit FMA" finding extends
to MMM kernels.

Adds examples/wasm-model-bench/ harness for the E2E numbers.
See examples/wasm-model-bench/MMM_MACRO_ATTRIBUTION.pdf for full data.
@czoli1976 czoli1976 force-pushed the feature/wasm-mmm-8x8-relaxed-simd branch from 031eec0 to fdfe675 on May 7, 2026 08:54
@czoli1976
Contributor Author

Thanks for the quick read. On the easy ones:

2. naming — kept madd_f32x4!. Confirmed it doesn't escape linalg/src/wasm.rs (and now linalg/benches/wasm.rs); not exported.

3. AddRowColProducts — kept the 35 sites for consistency, as you suggest.

4. 4x4 — already converted in this PR (8 sites: 4 AddMatMul + 4 AddRowColProducts). The "keep 4x4 at all?" debate I'd happily punt to a separate PR — it's currently dead code (mmm_f32 always returns 8x8 via max(mr*nr)), so its only cost is source weight. Worth filing an issue to track?

5. linalg/benches — moved both microbench_8x8 and microbench_32x1_isolated out of linalg/src/wasm.rs into a new linalg/benches/wasm.rs (single bench, two mods, gated on target_arch = "wasm32"). harness = false, manual Instant::now() timing — same shape as linalg/benches/arm64.rs. Run with:

```sh
RUSTFLAGS='-C target-feature=+simd128,+relaxed-simd' \
  CARGO_TARGET_WASM32_WASIP1_RUNNER='wasmtime --env RUST_TEST_NOCAPTURE=1 --' \
  cargo bench --target wasm32-wasip1 -p tract-linalg --bench wasm
```

(And +simd128 alone for the baseline comparison.) Net diff after the move: −240 LoC out of wasm.rs, +235 in the new bench file.

Tests still pass clean on both flavors (1727 / 1740 with 0 / 1044 f32x4.relaxed_madd ops).

Force-pushed (031eec0f → fdfe6759).


1. runtime vs build-time dispatch — taking that one offline to think through; I'll come back with a separate post once I've actually walked through what's possible. Short version of the constraint: WASM relaxed-simd opcodes fail validation at module instantiation if the host runtime doesn't support the proposal, so a single binary can't degrade gracefully across hosts. Module-selection happens at the JS host (or wasmtime config) layer. Want to think through what shape that should take inside tract before I reply.

Build flags, consumer-side dispatch playbook (browser + wasmtime), and
quality notes. Documents the host-layer module-selection pattern as the
WASM equivalent of x86/ARM runtime CPU detection — the validation-at-
instantiation constraint rules out in-binary runtime dispatch.
@czoli1976
Contributor Author

Thanks for pushing on this — we share the dislike for build-time config and spent real time looking for an in-binary runtime alternative before defaulting to cfg(target_feature). Short version: WASM's validation model forces the build-time choice, but the user-facing "runtime dispatch" contract is preserved at the host layer.

The constraint

WASM validates the entire module at instantiation, before any code runs. A binary containing f32x4.relaxed_madd fails to instantiate on hosts without relaxed-simd — LinkError / CompileError, not a runtime trap. So the x86/ARM pattern (one binary, both paths in source, runtime CPU detection picks at execution time) cannot be replicated in-binary on WASM: there's no point at which a wasm program "decides" whether to use FMA, because the FMA opcodes are either present (and host support is required) or absent.

This is also why the existing tract pattern doesn't transfer 1:1 here. arm64fp16 and the x86 FMA paths use compile-with-feature-enabled + runtime-check-and-register: intrinsics gated on target_feature, kernel struct only exists if the flag is set, and a second runtime check (is_aarch64_feature_detected!("fp16"), is_x86_feature_detected!("fma")) picks at registration time. In WASM the first half (intrinsic enablement) works the same; the second half (runtime detect-and-register) cannot, because the binary fails to load at all on hosts that lack the feature.
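For contrast, a portable sketch of the native detect-and-register half that WASM lacks (kernel names hypothetical; on non-x86 targets the probe collapses to a compile-time constant):

```rust
// Native pattern: compile the fast kernel behind target_feature, then let a
// runtime probe decide whether to register it. On wasm32 the probe half is
// impossible, so the registration list is fixed at build time instead.
fn register_kernels() -> Vec<&'static str> {
    let mut kernels = vec!["generic_f32_4x4"]; // always-safe scalar baseline
    if fma_available() {
        kernels.push("fma_f32_16x6"); // hypothetical runtime-selected variant
    }
    kernels
}

#[cfg(target_arch = "x86_64")]
fn fma_available() -> bool {
    std::arch::is_x86_feature_detected!("fma")
}

#[cfg(not(target_arch = "x86_64"))]
fn fma_available() -> bool {
    false // no runtime probe on this target
}
```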

Where runtime dispatch moves to

One layer up — host-side module selection at load time:

```js
const wantRelaxed = WebAssembly.validate(bytes, { builtins: ['relaxed_simd'] });
const url = wantRelaxed ? '/tract-relaxed.wasm' : '/tract.wasm';
```

Universal browser support since 2023 (Chrome 114+, Firefox 120+, Safari 17+), wasmtime 16+. Tract internals stay build-time gated; the consumer ships two binaries and picks at host-init time. The user-facing contract — "tract picks the fast path automatically on capable hosts" — is preserved; the detect-and-dispatch site just lives in the host shim instead of in the kernel registration logic.

I've added linalg/WASM_RELAXED_SIMD.md to this PR (commit a2655a6c) with the full playbook — build flags, JS browser snippet, wasmtime config snippet, quality notes — so downstreams have a default reference. Move/rename/restructure as you like.

Why macro vs separate kernels collapses under this

Both patterns produce the same two binaries (no-relaxed / relaxed-required) and rely on the same host-side selection. Separate kernels would only matter if WASM had x86's runtime-trap behaviour — it doesn't. So the macro is strictly cheaper on source weight (~25 LoC vs ~3000 LoC for 6 duplicated kernel bodies) for an identical outcome.

What I ruled out

  • Multi-module + import-fallback (kernels module loaded conditionally by host): cross-module call per kernel invocation eats the FMA win inside the inner K-loop. Boundary cost compounds against the per-call FMA savings.
  • Always-relaxed binary + atomic flag (env var flips kernel choice in-process): binary still requires relaxed-simd to instantiate, so it can't run on hosts without it. The flag would only let users opt OUT of FMA on hosts that DO support it (e.g., for bit-determinism) — different feature.
  • Component Model feature negotiation: works in principle, requires significant tract refactor; premature for this PR.
  • Drop the fallback (relaxed-only): regression on hosts pinned to wasm MVP.
  • Build-script auto-add of +relaxed-simd for wasm32 in 2026: would reduce friction but surprises users who have intentional reasons to opt out; not safe as a default.

Happy to dig further if you see an angle I missed.

@czoli1976
Contributor Author

Direct answer to the probe question: not from inside the wasm binary, for the same validation-at-instantiation reason as the broader runtime-dispatch issue. A few interpretations of "probe" and where each fails:

  1. Probe at init by attempting a relaxed_madd: the binary fails to instantiate on non-supporting hosts before any init code runs.
  2. Probe via try/catch on a kernel call: doesn't work — function-level validation is module-wide, so any function containing relaxed-simd opcodes prevents the entire module from loading on hosts without support.
  3. Probe via a separate tiny "is-relaxed-supported" wasm submodule loaded conditionally: works, but it's identical to WebAssembly.validate(bytes, { builtins: ['relaxed_simd'] }) — same primitive, just dressed up.

The general rule: any wasm code path that contains relaxed-simd opcodes blocks the host from loading the module without support. So self-detection in a single tract binary collapses to "always require relaxed-simd, then check at runtime" — which forfeits the fallback and means the binary doesn't load on legacy hosts at all.

Host-side detection (WebAssembly.validate in the browser, Engine::wasm_relaxed_simd in wasmtime — the playbook in linalg/WASM_RELAXED_SIMD.md) is the equivalent: same detection primitive, just at the boundary that can actually act on the result.

Also: noted on #6 — kept examples/wasm-model-bench in this PR.

@kali
Collaborator

kali commented May 7, 2026

Understood. The binary does not even load on an unsupporting host if it features the new opcodes, killing more or less any possibility for runtime selection within a single object.

So we have two wasm build configurations. Do we need to test both configurations? I think maybe the linalg tests, not necessarily the rest. Which configuration makes the most sense as a default in 2026? The conservative one, for compatibility? Or does the ecosystem already have wide adoption for the new opcodes, so we keep the strict opcodes for people with applications needing back compatibility, who will pay the extra complexity tax?

@czoli1976
Contributor Author

czoli1976 commented May 7, 2026 via email

@czoli1976
Contributor Author

@kali, PDF attached, easier to review: relaxed-simd-ecosystem-2026.pdf

@kali
Collaborator

kali commented May 8, 2026

OK, thanks for the landscape report. So Safari is behind. I agree this means keeping the strict impl as default, documenting instructions on how to switch to relaxed, and hinting at how to deploy side by side. I guess we make a tech note in the doc/ folder for starters. Problem is discoverability: wasm is not even a crate on its own. I seriously need to spend some quality time with a bot rewriting the top-level README. It barely mentions platform support at all, only frameworks and models :/ This is on my roadmap, but let's keep it simple here (=> tech note/recipe in doc/ or linalg/)

@kali
Collaborator

kali commented May 8, 2026

And that's exactly what you did, great. Should we move this from draft to ready?

