[FEATURE] A2 fast-path WaveNet for A2 nano + A2 standard#251

Merged
sdatkinson merged 3 commits into sdatkinson:main from jfsantos:a2-fast
Apr 20, 2026

Conversation

@jfsantos (Contributor) commented Apr 18, 2026

A2 fast-path WaveNet

Adds a specialized forward-pass implementation for the A2 standard and A2 nano amp models, routed automatically when a .nam file matches the A2 shape. The generic WaveNet path is unchanged — it remains the default for everything else, and there's a CMake opt-out (-DNAM_ENABLE_A2_FAST=OFF) that disables the specialization entirely.

Why

The generic nam::wavenet::WaveNet is a fully general WaveNet: it supports multiple layer arrays, any channel count, any kernel size, any dilation, gating / FiLM / head1x1 / grouped convolutions, optional post-stack heads, dynamic buffer resizing. All of that flexibility costs something — even for a model that uses none of it. A2 nano and A2 standard share the same fixed architecture: 23 layers, LeakyReLU(0.01), no gating/FiLM/head1x1, layer1x1 active, kernel sizes {6, 15}, head conv k=16 bias=true, and channels ∈ {3, 8}. For those two models we can cut out all the dynamism and pick a loop structure tuned to the channel count.

What's in the PR

  1. Strict A2 shape detector (NAM/wavenet/a2_fast.h, a2_fast.cpp) — checks every knob against the A2 signature. If any field is off by one bit, it falls through to the generic path. Never silently routes a non-A2 model to the fast path.

  2. A2FastModel<Channels> template DSP (a2_fast.cpp) with explicit instantiations for Channels = 3 and Channels = 8. Shared by both specializations:

    • Column-major per-tap weight storage with a dedicated _load_weights that reads the stream in the same order the generic path does (including the trailing head_scale that WaveNet::set_weights_ consumes).
    • Pre-allocated work buffers sized in SetMaxBufferSize, nothing resizes during process().
    • Power-of-2 ring buffers with a max_buffer_size-wide tail mirror, so reads spanning past the wrap land in contiguous memory (no memmove-rewind jitter, constant-time write + read).

    Two internal strategies, dispatched at compile time on Channels:

    • Channels ≤ 4 (A2 nano): hand-unrolled 3×3 GEMV with all 9 weights lifted into named const float locals and the c-reduction kept in scalar temps a0/a1/a2. Tap 0 seeds z directly from conv_b (skipping the memset-to-zero pass). Final tap + mixin + LeakyReLU + head_sum accumulate + layer1x1 residual all inlined into a single loop on register-resident scalars. Matches the structure nam2c --fused generates for the same shape.
    • Channels ≥ 8 (A2 standard): Eigen fixed-LHS block GEMM (Eigen::Matrix<float, 8, Eigen::Dynamic>), one 8×8 × 8×num_frames GEMM per tap into a ring-buffer view, then block-wise colwise() += for bias, rank-1 outer product for mixin, elementwise .select() for LeakyReLU, another 8×8 × 8×num_frames GEMM for the layer1x1 residual. Leans into Eigen's tuned GEMM kernel for the size that has one, strips everything around it.

    Kernel size is a template parameter (_layer_forward_k<6> / _layer_forward_k<15>) dispatched by a switch on L.kernel_size, so the tap loop and per-tap weight offsets are compile-time constants.

  3. Dispatcher hook (wavenet/model.cpp::create_config) — one new if (is_a2_shape(...)) return create_a2_fast_config(...) right after the existing slimmable-wavenet branch and before the generic path. Gated on NAM_ENABLE_A2_FAST.

  4. C++ verification harness (tools/test/test_a2_fast.cpp) — 6 detector tests (accept nano/standard, reject tweaks to channels / kernel_sizes / activation / gating) plus 2 equivalence tests that build both paths from the same config and weights and assert the outputs match to 5e-5 on a two-tone input, at block sizes 64 and 256. Runs as part of run_tests under #if defined(NAM_ENABLE_A2_FAST).

  5. Benchmark tool (tools/bench_a2_fast) — loads a .nam, builds both fast and generic DSPs from the same weights, times each block separately (not per-iteration), and reports min / p50 / p99 / p99.9 / max / mean plus RTF.
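The power-of-2 ring with a tail mirror (item 2 above) can be sketched as follows. This is an illustrative toy, not the PR's actual class; the names `MirroredRing`, `push`, and `recent` are made up here:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative power-of-2 ring buffer with a tail mirror (hypothetical names).
// Each write is duplicated into a mirror region appended after the physical
// end, so any read of up to `maxSpan` recent samples is always contiguous in
// memory: no wrap handling, no memmove-rewind, constant-time write and read.
class MirroredRing
{
public:
  MirroredRing(size_t pow2Capacity, size_t maxSpan)
  : mMask(pow2Capacity - 1)
  , mMaxSpan(maxSpan)
  , mData(pow2Capacity + maxSpan, 0.0f)
  {
    assert((pow2Capacity & mMask) == 0 && "capacity must be a power of 2");
  }

  void push(float x)
  {
    const size_t i = mHead & mMask; // constant-time wrap: bitmask, no modulo
    mData[i] = x;
    if (i < mMaxSpan)
      mData[i + mMask + 1] = x; // mirror the first maxSpan slots past the end
    ++mHead;
  }

  // Pointer to the `n` most recent samples, oldest first; always contiguous,
  // even when the logical span crosses the wrap point.
  const float* recent(size_t n) const
  {
    assert(n <= mMaxSpan && n <= mHead);
    const size_t start = (mHead - n) & mMask;
    return &mData[start];
  }

private:
  size_t mMask, mMaxSpan;
  size_t mHead = 0;
  std::vector<float> mData;
};
```

The extra storage cost is only `maxSpan` floats per ring; in exchange, convolution taps read straight-line memory regardless of where the write head sits.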
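The ch=3 strategy (weights lifted into named locals, reduction kept in scalar temps) looks roughly like this sketch. The function name `gemv3_leaky` and its signature are hypothetical, and the real code additionally folds in the mixin, head_sum accumulate, and layer1x1 residual:

```cpp
// Sketch of the hand-unrolled 3x3 GEMV + LeakyReLU(0.01) step from the ch=3
// path (hypothetical name/signature; simplified). Lifting all 9 weights into
// const locals and keeping the reduction in scalar temps a0/a1/a2 keeps
// everything register-resident and lets the compiler auto-FMA at -Ofast.
inline void gemv3_leaky(const float* W, // 9 weights, column-major
                        const float* b, // 3 biases
                        const float* x, // 3 inputs
                        float* z)       // 3 outputs
{
  const float w00 = W[0], w10 = W[1], w20 = W[2];
  const float w01 = W[3], w11 = W[4], w21 = W[5];
  const float w02 = W[6], w12 = W[7], w22 = W[8];

  // Seed the accumulators from the bias (no separate memset-to-zero pass).
  float a0 = b[0], a1 = b[1], a2 = b[2];
  a0 += w00 * x[0] + w01 * x[1] + w02 * x[2];
  a1 += w10 * x[0] + w11 * x[1] + w12 * x[2];
  a2 += w20 * x[0] + w21 * x[1] + w22 * x[2];

  // LeakyReLU(0.01) on register-resident scalars.
  z[0] = a0 > 0.0f ? a0 : 0.01f * a0;
  z[1] = a1 > 0.0f ? a1 : 0.01f * a1;
  z[2] = a2 > 0.0f ? a2 : 0.01f * a2;
}
```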
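The kernel-size dispatch from item 2 — a runtime switch selecting a compile-time template instantiation — can be illustrated with a toy reduction. `dot_k` / `dot_dispatch` are placeholder names, not the PR's `_layer_forward_k`:

```cpp
#include <cassert>

// Toy version of dispatching a runtime kernel size to a template parameter.
// Inside the template the tap count K is a compile-time constant, so the
// loop fully unrolls and per-tap weight offsets fold into constants.
template <int K>
float dot_k(const float* w, const float* x)
{
  float acc = 0.0f;
  for (int t = 0; t < K; ++t) // trip count known at compile time
    acc += w[t] * x[t];
  return acc;
}

float dot_dispatch(int kernelSize, const float* w, const float* x)
{
  switch (kernelSize)
  {
    case 6: return dot_k<6>(w, x);   // analogous to _layer_forward_k<6>
    case 15: return dot_k<15>(w, x); // analogous to _layer_forward_k<15>
    default: assert(false && "unsupported kernel size"); return 0.0f;
  }
}
```

The switch is paid once per layer call; everything inside the instantiation is branch-free and unrollable.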

Numbers

Apple Silicon M1, block=64, 75,000 blocks per variant, Release build (-Ofast). Per-block timing in microseconds (lower is better):

A2 nano (Channels = 3)

| path | p50 | p99 | p99.9 | max | RTF | vs generic Eigen |
|---|---|---|---|---|---|---|
| Generic WaveNet (Eigen) | 27.3 | 34.1 | 44.8 | 105 | 49× | 1.00× |
| Generic WaveNet (NAM_USE_INLINE_GEMM) | 6.8 | 8.9 | 15.2 | 52 | 198× | 4.04× |
| a2_fast (this PR) | 4.4 | 5.5 | 11.6 | 24 | 307× | 6.23× |
| nam2c --fused (for reference) | 4.0 | 5.0 | 13.0 | 23 | 333× | ~6.8× |

A2 standard (Channels = 8)

| path | p50 | p99 | p99.9 | max | RTF | vs generic Eigen |
|---|---|---|---|---|---|---|
| Generic WaveNet (NAM_USE_INLINE_GEMM) | 38.0 | 45.8 | 56.8 | 107 | 35× | 0.95× |
| Generic WaveNet (Eigen) | 36.0 | 44.1 | 54.9 | 120 | 37× | 1.00× |
| a2_fast (this PR) | 31.8 | 38.8 | 49.1 | 99 | 42× | 1.14× |

A few observations worth calling out:

  • NAM_USE_INLINE_GEMM is a ch=3 win and a ch=8 regression. The hand-unrolled paths in conv1d.cpp cover 3×3 / 4×4 / 6×6 / 8×8 explicitly; the small ones beat Eigen cleanly, but the 8×8 case loses to Eigen's GEMM kernel. This explains why that flag isn't the default, and why a single "specialized path" strategy doesn't work across channel counts.
  • Jitter is tighter on the fast path in both regimes: max block time is 24 µs (ch=3) and 99 µs (ch=8), vs 105 µs and 120 µs for generic Eigen. Pow2 rings eliminate the periodic memmove spike. All variants remain well inside the 1333 µs audio deadline at 48 kHz, block=64; this matters more for small-block / high-sample-rate configurations than for typical desktop plugins.
  • The ch=3 result is essentially tied with nam2c --fused: we're within 9% at p50 and better at p99.9 (11.6 µs vs 13.0 µs). nam2c's generated C is a reasonable ceiling for this kind of portable code, and we're there.

What's NOT in this PR (deliberately)

  • No explicit SIMD intrinsics. Everything stays portable — Eigen handles the NEON/SSE dispatch for the ch=8 GEMMs, and the ch=3 scalar code gets auto-FMA'd at -Ofast. I tried fold-expression unrolling as an alternate approach and it regressed ch=8 because it scalarized and blocked clang's auto-vectorizer, so the ch=8 path leans on Eigen's tuned kernel rather than trying to hand-roll one.
  • Q15, LUT activations, DTCM placement, --fast-math approximations. These are useful on embedded targets (Cortex-M7) but add complexity without gain on desktop. The A2FastModel header is a natural home if we want to add them later, but not as part of this PR.
  • Widening the detector. It's strict on purpose — kernel_sizes and dilations must match the A2 pattern exactly. If a future A2-like variant ships with a different schedule, we'll need a second detector (or to loosen this one with care). The test suite has explicit "rejection" cases for each field, so the boundary is enforced.
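The "strict on purpose" boundary can be illustrated with a toy detector. `ShapeInfo` and its fields are hypothetical stand-ins here; the real detector inspects the parsed .nam config and checks many more knobs (dilations, head conv, layer1x1, FiLM, groups, etc.):

```cpp
#include <array>

// Toy strict shape check (hypothetical types; heavily abridged). Every field
// must match the A2 signature exactly; anything else falls through to the
// generic WaveNet path, so a non-A2 model is never silently routed here.
struct ShapeInfo
{
  int numLayers;
  int channels;
  std::array<int, 2> kernelSizes;
  bool gated;
  float leakySlope;
};

inline bool IsA2Shape(const ShapeInfo& s)
{
  if (s.numLayers != 23)
    return false;
  if (s.channels != 3 && s.channels != 8) // nano or standard, nothing else
    return false;
  if (s.kernelSizes != std::array<int, 2>{6, 15})
    return false;
  if (s.gated) // no gating (nor FiLM/head1x1 in the real check)
    return false;
  if (s.leakySlope != 0.01f)
    return false;
  return true;
}
```

Accept/reject tests mirror the PR's harness: each field gets an explicit rejection case so the boundary stays enforced.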

How to try it

cmake -B build && cmake --build build -j
./build/tools/run_tests                          # includes A2 tests
./build/tools/bench_a2_fast path/to/a2_nano.nam path/to/a2_standard.nam

Opt out: cmake -B build -DNAM_ENABLE_A2_FAST=OFF.

Developed with support and sponsorship from TONE3000

João Felipe Santos and others added 3 commits April 17, 2026 17:47
Routes models whose config matches the A2 shape signature (single layer
array, 23 layers, LeakyReLU(0.01), no gating/FiLM/head1x1, layer1x1
active, head conv k=16 bias=true, channels in {3, 8}, fixed kernel_sizes
and dilations) to a hand-tuned specialization that strips the
dynamic-shape overhead, feature-flag branches, memmove-rewind jitter,
and per-call allocations the generic path pays for.

Build-time opt-out: -DNAM_ENABLE_A2_FAST=OFF.

Numbers (Apple Silicon M1, block=64, 75k blocks sampled, p50 µs / RTF):

                          nano (ch=3)    standard (ch=8)
  Generic (Eigen)         27.3 / 49x     36.0 / 37x
  Generic (inline GEMM)    6.8 / 198x    38.0 / 35x
  a2_fast (this PR)        4.4 / 307x    31.8 / 42x
…ess())

Two per-call heap allocations in A2FastModel::process() moved to
pre-allocated members sized in SetMaxBufferSize: the float32 input copy
(_cond) and the float32 head output scratch (_head_out). Each was a
per-block std::vector ctor that escaped the earlier audit.

Adds test_process_realtime_safe_{nano,standard} using the same
allocation_tracking infrastructure as the generic WaveNet RT-safety
tests: overridden malloc/free/new/delete increment counters while
tracking is enabled; run_allocation_test_no_allocations asserts the
counts stay at 0 during process(). Exercised across block sizes
{1, 32, 64, 128, 256} for both Channels=3 (hand-rolled path) and
Channels=8 (Eigen::Map + fixed-LHS block GEMM path) — both clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
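The allocation-tracking approach described above can be sketched as follows. This is a simplified illustration, not the repo's actual infrastructure (which also hooks malloc/free); `gAllocCount` and `gTracking` are made-up names:

```cpp
#include <cstdlib>
#include <new>

// Simplified sketch of RT-safety allocation tracking: global operator
// new/delete are replaced so that allocations bump a counter while tracking
// is enabled. An RT-safety test then asserts the count stays at zero across
// process().
static int gAllocCount = 0;
static bool gTracking = false;

void* operator new(std::size_t n)
{
  if (gTracking)
    ++gAllocCount; // any hit during process() fails the RT-safety test
  if (void* p = std::malloc(n))
    return p;
  throw std::bad_alloc();
}

void operator delete(void* p) noexcept
{
  std::free(p);
}

void operator delete(void* p, std::size_t) noexcept // sized variant
{
  std::free(p);
}
```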
Linux CI uses Debug + -Werror, which surfaces two issues not caught by
the macOS Release build:

1. Unused fuse_post_conv lambda in _layer_forward_k. Earlier edits fully
   inlined the post-conv work for Channels==3 and moved to Eigen block
   ops for Channels>=8, so the lambda became dead code — tripping
   -Wunused-variable. Removed.

2. Missing explicit STL headers. libstdc++ is stricter than libc++ about
   transitive includes:
     - <algorithm>, <string>, <utility> (test_a2_fast.cpp)
     - <algorithm>, <cstdlib>, <utility> (bench_a2_fast.cpp)
     - <cstddef>, <iterator>, <utility> (a2_fast.cpp)

Reproduced both failures by building locally with
-DCMAKE_BUILD_TYPE=Debug, with and without -DNAM_USE_INLINE_GEMM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sdatkinson (Owner)

Fantastic summary, @jfsantos ! ❤️ Really appreciate that you captured all this in the PR; hopefully it'll help interested folks understand our thinking & where/how this works best.

Reviewing...


@sdatkinson sdatkinson left a comment


Great! Thanks!

Comment thread NAM/wavenet/a2_fast.cpp
if (la.value("groups_input_mixin", 1) != 1)
return false;

// Not slimmable
@sdatkinson (Owner)

Nit: slimmable with a layer array-level strategy

@sdatkinson sdatkinson merged commit a7037e5 into sdatkinson:main Apr 20, 2026
3 checks passed
