[FEATURE] A2 fast-path WaveNet for A2 nano + A2 standard#251

Merged
sdatkinson merged 3 commits into sdatkinson:main from jfsantos:a2-fast
Apr 20, 2026

Conversation

@jfsantos (Contributor) commented Apr 18, 2026

A2 fast-path WaveNet

Adds a specialized forward-pass implementation for the A2 standard and A2 nano amp models, routed automatically when a .nam file matches the A2 shape. The generic WaveNet path is unchanged — it remains the default for everything else, and there's a CMake opt-out (-DNAM_ENABLE_A2_FAST=OFF) that disables the specialization entirely.

Why

The generic nam::wavenet::WaveNet is a fully general WaveNet: it supports multiple layer arrays, any channel count, any kernel size, any dilation, gating / FiLM / head1x1 / grouped convolutions, optional post-stack heads, dynamic buffer resizing. All of that flexibility costs something — even for a model that uses none of it. A2 nano and A2 standard share the same fixed architecture: 23 layers, LeakyReLU(0.01), no gating/FiLM/head1x1, layer1x1 active, kernel sizes {6, 15}, head conv k=16 bias=true, and channels ∈ {3, 8}. For those two models we can cut out all the dynamism and pick a loop structure tuned to the channel count.

What's in the PR

  1. Strict A2 shape detector (NAM/wavenet/a2_fast.h, a2_fast.cpp) — checks every knob against the A2 signature. If any field is off by one bit, it falls through to the generic path. Never silently routes a non-A2 model to the fast path.

  2. A2FastModel<Channels> template DSP (a2_fast.cpp) with explicit instantiations for Channels = 3 and Channels = 8. Shared by both specializations:

    • Column-major per-tap weight storage with a dedicated _load_weights that reads the stream in the same order the generic path does (including the trailing head_scale that WaveNet::set_weights_ consumes).
    • Pre-allocated work buffers sized in SetMaxBufferSize, nothing resizes during process().
    • Power-of-2 ring buffers with a max_buffer_size-wide tail mirror, so reads spanning past the wrap land in contiguous memory (no memmove-rewind jitter, constant-time write + read).

    Two internal strategies, dispatched at compile time on Channels:

    • Channels ≤ 4 (A2 nano): hand-unrolled 3×3 GEMV with all 9 weights lifted into named const float locals and the c-reduction kept in scalar temps a0/a1/a2. Tap 0 seeds z directly from conv_b (skipping the memset-to-zero pass). Final tap + mixin + LeakyReLU + head_sum accumulate + layer1x1 residual all inlined into a single loop on register-resident scalars. Matches the structure nam2c --fused generates for the same shape.
    • Channels ≥ 8 (A2 standard): Eigen fixed-LHS block GEMM (Eigen::Matrix<float, 8, Eigen::Dynamic>), one 8×8 × 8×num_frames GEMM per tap into a ring-buffer view, then block-wise colwise() += for bias, rank-1 outer product for mixin, elementwise .select() for LeakyReLU, another 8×8 × 8×num_frames GEMM for the layer1x1 residual. Leans into Eigen's tuned GEMM kernel for the size that has one, strips everything around it.

    Kernel size is a template parameter (_layer_forward_k<6> / _layer_forward_k<15>) dispatched by a switch on L.kernel_size, so the tap loop and per-tap weight offsets are compile-time constants.

  3. Dispatcher hook (wavenet/model.cpp::create_config) — one new if (is_a2_shape(...)) return create_a2_fast_config(...) right after the existing slimmable-wavenet branch and before the generic path. Gated on NAM_ENABLE_A2_FAST.

  4. C++ verification harness (tools/test/test_a2_fast.cpp) — 6 detector tests (accept nano/standard, reject tweaks to channels / kernel_sizes / activation / gating) plus 2 equivalence tests that build both paths from the same config and weights and assert the outputs match to 5e-5 on a two-tone input, at block sizes 64 and 256. Runs as part of run_tests under #if defined(NAM_ENABLE_A2_FAST).

  5. Benchmark tool (tools/bench_a2_fast) — loads a .nam, builds both fast and generic DSPs from the same weights, times each block separately (not per-iteration), and reports min / p50 / p99 / p99.9 / max / mean plus RTF.
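The power-of-2 ring with a tail mirror (item 2 above) can be sketched as follows. This is an illustrative toy, not the PR's actual class; the names `MirroredRing`, `push`, and `recent` are made up here:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative power-of-2 ring buffer with a tail mirror (hypothetical names).
// Each write is duplicated into a mirror region appended after the physical
// end, so any read of up to `maxSpan` recent samples is always contiguous in
// memory: no wrap handling, no memmove-rewind, constant-time write and read.
class MirroredRing
{
public:
  MirroredRing(size_t pow2Capacity, size_t maxSpan)
  : mMask(pow2Capacity - 1)
  , mMaxSpan(maxSpan)
  , mData(pow2Capacity + maxSpan, 0.0f)
  {
    assert((pow2Capacity & mMask) == 0 && "capacity must be a power of 2");
  }

  void push(float x)
  {
    const size_t i = mHead & mMask; // constant-time wrap: bitmask, no modulo
    mData[i] = x;
    if (i < mMaxSpan)
      mData[i + mMask + 1] = x; // mirror the first maxSpan slots past the end
    ++mHead;
  }

  // Pointer to the `n` most recent samples, oldest first; always contiguous,
  // even when the logical span crosses the wrap point.
  const float* recent(size_t n) const
  {
    assert(n <= mMaxSpan && n <= mHead);
    const size_t start = (mHead - n) & mMask;
    return &mData[start];
  }

private:
  size_t mMask, mMaxSpan;
  size_t mHead = 0;
  std::vector<float> mData;
};
```

The extra storage cost is only `maxSpan` floats per ring; in exchange, convolution taps read straight-line memory regardless of where the write head sits.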
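The ch=3 strategy (weights lifted into named locals, reduction kept in scalar temps) looks roughly like this sketch. The function name `gemv3_leaky` and its signature are hypothetical, and the real code additionally folds in the mixin, head_sum accumulate, and layer1x1 residual:

```cpp
// Sketch of the hand-unrolled 3x3 GEMV + LeakyReLU(0.01) step from the ch=3
// path (hypothetical name/signature; simplified). Lifting all 9 weights into
// const locals and keeping the reduction in scalar temps a0/a1/a2 keeps
// everything register-resident and lets the compiler auto-FMA at -Ofast.
inline void gemv3_leaky(const float* W, // 9 weights, column-major
                        const float* b, // 3 biases
                        const float* x, // 3 inputs
                        float* z)       // 3 outputs
{
  const float w00 = W[0], w10 = W[1], w20 = W[2];
  const float w01 = W[3], w11 = W[4], w21 = W[5];
  const float w02 = W[6], w12 = W[7], w22 = W[8];

  // Seed the accumulators from the bias (no separate memset-to-zero pass).
  float a0 = b[0], a1 = b[1], a2 = b[2];
  a0 += w00 * x[0] + w01 * x[1] + w02 * x[2];
  a1 += w10 * x[0] + w11 * x[1] + w12 * x[2];
  a2 += w20 * x[0] + w21 * x[1] + w22 * x[2];

  // LeakyReLU(0.01) on register-resident scalars.
  z[0] = a0 > 0.0f ? a0 : 0.01f * a0;
  z[1] = a1 > 0.0f ? a1 : 0.01f * a1;
  z[2] = a2 > 0.0f ? a2 : 0.01f * a2;
}
```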
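The kernel-size dispatch from item 2 — a runtime switch selecting a compile-time template instantiation — can be illustrated with a toy reduction. `dot_k` / `dot_dispatch` are placeholder names, not the PR's `_layer_forward_k`:

```cpp
#include <cassert>

// Toy version of dispatching a runtime kernel size to a template parameter.
// Inside the template the tap count K is a compile-time constant, so the
// loop fully unrolls and per-tap weight offsets fold into constants.
template <int K>
float dot_k(const float* w, const float* x)
{
  float acc = 0.0f;
  for (int t = 0; t < K; ++t) // trip count known at compile time
    acc += w[t] * x[t];
  return acc;
}

float dot_dispatch(int kernelSize, const float* w, const float* x)
{
  switch (kernelSize)
  {
    case 6: return dot_k<6>(w, x);   // analogous to _layer_forward_k<6>
    case 15: return dot_k<15>(w, x); // analogous to _layer_forward_k<15>
    default: assert(false && "unsupported kernel size"); return 0.0f;
  }
}
```

The switch is paid once per layer call; everything inside the instantiation is branch-free and unrollable.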

Numbers

Apple Silicon M1, block=64, 75,000 blocks per variant, Release build (-Ofast). Per-block timing in microseconds (lower is better):

A2 nano (Channels = 3)

| path | p50 | p99 | p99.9 | max | RTF | vs generic Eigen |
|---|---|---|---|---|---|---|
| Generic WaveNet (Eigen) | 27.3 | 34.1 | 44.8 | 105 | 49× | 1.00× |
| Generic WaveNet (NAM_USE_INLINE_GEMM) | 6.8 | 8.9 | 15.2 | 52 | 198× | 4.04× |
| a2_fast (this PR) | 4.4 | 5.5 | 11.6 | 24 | 307× | 6.23× |
| nam2c --fused (for reference) | 4.0 | 5.0 | 13.0 | 23 | 333× | ~6.8× |

A2 standard (Channels = 8)

| path | p50 | p99 | p99.9 | max | RTF | vs generic Eigen |
|---|---|---|---|---|---|---|
| Generic WaveNet (NAM_USE_INLINE_GEMM) | 38.0 | 45.8 | 56.8 | 107 | 35× | 0.95× |
| Generic WaveNet (Eigen) | 36.0 | 44.1 | 54.9 | 120 | 37× | 1.00× |
| a2_fast (this PR) | 31.8 | 38.8 | 49.1 | 99 | 42× | 1.14× |

A few observations worth calling out:

  • NAM_USE_INLINE_GEMM is a ch=3 win and a ch=8 regression. The hand-unrolled paths in conv1d.cpp cover 3×3 / 4×4 / 6×6 / 8×8 explicitly; the small ones beat Eigen cleanly, but the 8×8 case loses to Eigen's GEMM kernel. This explains why that flag isn't the default, and why a single "specialized path" strategy doesn't work across channel counts.
  • Jitter is tighter on the fast path in both regimes: max block time is 24 µs (ch=3) and 99 µs (ch=8), vs 105 µs and 120 µs for generic Eigen. Pow2 rings eliminate the periodic memmove spike. All variants remain well inside the 1333 µs audio deadline at 48 kHz, block=64; this matters more for small-block / high-sample-rate configurations than for typical desktop plugins.
  • The ch=3 result is essentially tied with nam2c --fused: we're within 9% at p50 and better at p99.9 (11.6 µs vs 13.0 µs). nam2c's generated C is a reasonable ceiling for this kind of portable code, and we're there.

What's NOT in this PR (deliberately)

  • No explicit SIMD intrinsics. Everything stays portable — Eigen handles the NEON/SSE dispatch for the ch=8 GEMMs, and the ch=3 scalar code gets auto-FMA'd at -Ofast. I tried fold-expression unrolling as an alternate approach and it regressed ch=8 because it scalarized and blocked clang's auto-vectorizer, so the ch=8 path leans on Eigen's tuned kernel rather than trying to hand-roll one.
  • Q15, LUT activations, DTCM placement, --fast-math approximations. These are useful on embedded targets (Cortex-M7) but add complexity without gain on desktop. The A2FastModel header is a natural home if we want to add them later, but not as part of this PR.
  • Widening the detector. It's strict on purpose — kernel_sizes and dilations must match the A2 pattern exactly. If a future A2-like variant ships with a different schedule, we'll need a second detector (or to loosen this one with care). The test suite has explicit "rejection" cases for each field, so the boundary is enforced.
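The "strict on purpose" boundary can be illustrated with a toy detector. `ShapeInfo` and its fields are hypothetical stand-ins here; the real detector inspects the parsed .nam config and checks many more knobs (dilations, head conv, layer1x1, FiLM, groups, etc.):

```cpp
#include <array>

// Toy strict shape check (hypothetical types; heavily abridged). Every field
// must match the A2 signature exactly; anything else falls through to the
// generic WaveNet path, so a non-A2 model is never silently routed here.
struct ShapeInfo
{
  int numLayers;
  int channels;
  std::array<int, 2> kernelSizes;
  bool gated;
  float leakySlope;
};

inline bool IsA2Shape(const ShapeInfo& s)
{
  if (s.numLayers != 23)
    return false;
  if (s.channels != 3 && s.channels != 8) // nano or standard, nothing else
    return false;
  if (s.kernelSizes != std::array<int, 2>{6, 15})
    return false;
  if (s.gated) // no gating (nor FiLM/head1x1 in the real check)
    return false;
  if (s.leakySlope != 0.01f)
    return false;
  return true;
}
```

Accept/reject tests mirror the PR's harness: each field gets an explicit rejection case so the boundary stays enforced.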

How to try it

cmake -B build && cmake --build build -j
./build/tools/run_tests                          # includes A2 tests
./build/tools/bench_a2_fast path/to/a2_nano.nam path/to/a2_standard.nam

Opt out: cmake -B build -DNAM_ENABLE_A2_FAST=OFF.

Developed with support and sponsorship from TONE3000

João Felipe Santos and others added 3 commits April 17, 2026 17:47
Routes models whose config matches the A2 shape signature (single layer
array, 23 layers, LeakyReLU(0.01), no gating/FiLM/head1x1, layer1x1
active, head conv k=16 bias=true, channels in {3, 8}, fixed kernel_sizes
and dilations) to a hand-tuned specialization that strips the
dynamic-shape overhead, feature-flag branches, memmove-rewind jitter,
and per-call allocations the generic path pays for.

Build-time opt-out: -DNAM_ENABLE_A2_FAST=OFF.

Numbers (Apple Silicon M1, block=64, 75k blocks sampled, p50 µs / RTF):

                          nano (ch=3)    standard (ch=8)
  Generic (Eigen)         27.3 / 49x     36.0 / 37x
  Generic (inline GEMM)    6.8 / 198x    38.0 / 35x
  a2_fast (this PR)        4.4 / 307x    31.8 / 42x
…ess())

Two per-call heap allocations in A2FastModel::process() moved to
pre-allocated members sized in SetMaxBufferSize: the float32 input copy
(_cond) and the float32 head output scratch (_head_out). Each was a
per-block std::vector ctor that escaped the earlier audit.

Adds test_process_realtime_safe_{nano,standard} using the same
allocation_tracking infrastructure as the generic WaveNet RT-safety
tests: overridden malloc/free/new/delete increment counters while
tracking is enabled; run_allocation_test_no_allocations asserts the
counts stay at 0 during process(). Exercised across block sizes
{1, 32, 64, 128, 256} for both Channels=3 (hand-rolled path) and
Channels=8 (Eigen::Map + fixed-LHS block GEMM path) — both clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
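The allocation-tracking approach described above can be sketched as follows. This is a simplified illustration, not the repo's actual infrastructure (which also hooks malloc/free); `gAllocCount` and `gTracking` are made-up names:

```cpp
#include <cstdlib>
#include <new>

// Simplified sketch of RT-safety allocation tracking: global operator
// new/delete are replaced so that allocations bump a counter while tracking
// is enabled. An RT-safety test then asserts the count stays at zero across
// process().
static int gAllocCount = 0;
static bool gTracking = false;

void* operator new(std::size_t n)
{
  if (gTracking)
    ++gAllocCount; // any hit during process() fails the RT-safety test
  if (void* p = std::malloc(n))
    return p;
  throw std::bad_alloc();
}

void operator delete(void* p) noexcept
{
  std::free(p);
}

void operator delete(void* p, std::size_t) noexcept // sized variant
{
  std::free(p);
}
```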
Linux CI uses Debug + -Werror, which surfaces two issues not caught by
the macOS Release build:

1. Unused fuse_post_conv lambda in _layer_forward_k. Earlier edits fully
   inlined the post-conv work for Channels==3 and moved to Eigen block
   ops for Channels>=8, so the lambda became dead code — tripping
   -Wunused-variable. Removed.

2. Missing explicit STL headers. libstdc++ is stricter than libc++ about
   transitive includes:
     - <algorithm>, <string>, <utility> (test_a2_fast.cpp)
     - <algorithm>, <cstdlib>, <utility> (bench_a2_fast.cpp)
     - <cstddef>, <iterator>, <utility> (a2_fast.cpp)

Reproduced both failures by building locally with
-DCMAKE_BUILD_TYPE=Debug, with and without -DNAM_USE_INLINE_GEMM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sdatkinson (Owner)

Fantastic summary, @jfsantos ! ❤️ Really appreciate that you captured all this in the PR; hopefully it'll help interested folks understand our thinking & where/how this works best.

Reviewing...


@sdatkinson sdatkinson left a comment


Great! Thanks!

Comment thread NAM/wavenet/a2_fast.cpp
if (la.value("groups_input_mixin", 1) != 1)
return false;

// Not slimmable
@sdatkinson (Owner)

Nit: slimmable with a layer array-level strategy

@sdatkinson sdatkinson merged commit a7037e5 into sdatkinson:main Apr 20, 2026
3 checks passed
