[FEATURE] A2 fast-path WaveNet for A2 nano + A2 standard#251
Merged
sdatkinson merged 3 commits into sdatkinson:main on Apr 20, 2026
Conversation
Routes models whose config matches the A2 shape signature (single layer
array, 23 layers, LeakyReLU(0.01), no gating/FiLM/head1x1, layer1x1
active, head conv k=16 bias=true, channels in {3, 8}, fixed kernel_sizes
and dilations) to a hand-tuned specialization that strips the
dynamic-shape overhead, feature-flag branches, memmove-rewind jitter,
and per-call allocations the generic path pays for.
Build-time opt-out: -DNAM_ENABLE_A2_FAST=OFF.
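The routing decision reduces to a strict field-by-field signature check. A minimal sketch with an illustrative config struct (the real detector reads these fields from the `.nam` JSON config and checks more knobs than shown here; the struct and its field names are stand-ins, not the PR's types):

```cpp
#include <cassert>

// Illustrative stand-in for the parsed layer-array config. Defaults match
// the A2 nano signature described above.
struct LayerArrayShape
{
  int num_layer_arrays = 1;
  int num_layers = 23;
  int channels = 3; // 3 = nano, 8 = standard
  bool gated = false;
  float leaky_relu_alpha = 0.01f;
  int head_kernel_size = 16;
  bool head_bias = true;
};

// Strict signature check: every field must match, or we fall back to the
// generic WaveNet path. A subset of the checks described above.
bool is_a2_shape(const LayerArrayShape& s)
{
  if (s.num_layer_arrays != 1)
    return false;
  if (s.num_layers != 23)
    return false;
  if (s.channels != 3 && s.channels != 8)
    return false;
  if (s.gated)
    return false;
  if (s.leaky_relu_alpha != 0.01f)
    return false;
  if (s.head_kernel_size != 16 || !s.head_bias)
    return false;
  return true;
}
```

Any mismatch falls through to the generic path, so a non-A2 model is never silently routed to the specialization.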
Numbers (Apple M5, block=64, 75k blocks sampled, p50 µs / RTF):

                          nano (ch=3)    standard (ch=8)
  Generic (Eigen)         27.3 / 49x     36.0 / 37x
  Generic (inline GEMM)    6.8 / 198x    38.0 / 35x
  a2_fast (this PR)        4.4 / 307x    31.8 / 42x
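The memmove-rewind jitter mentioned above comes from ring buffers that periodically shift their contents to keep reads contiguous. A power-of-two ring with a mirrored tail avoids that entirely: writes near the start of the ring are duplicated past its end, so any recent window reads as one contiguous span. A minimal sketch (sizes and names are illustrative, not the PR's actual buffer code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Power-of-two ring buffer with a mirrored tail: a read of up to
// `mirror + 1` contiguous samples never has to handle a wrap, and nothing
// is ever memmove'd to "rewind" the buffer.
class MirroredRing
{
public:
  MirroredRing(std::size_t capacityPow2, std::size_t mirror)
    : _mask(capacityPow2 - 1), _mirror(mirror), _buf(capacityPow2 + mirror, 0.0f)
  {
    assert((capacityPow2 & (capacityPow2 - 1)) == 0 && "capacity must be pow2");
  }

  void push(float x)
  {
    _pos = (_pos + 1) & _mask; // constant-time wrap via bit mask
    _buf[_pos] = x;
    if (_pos < _mirror)
      _buf[_pos + _mask + 1] = x; // duplicate into the mirrored tail
  }

  // Pointer to the newest `len` samples, oldest first, always contiguous.
  // Requires len <= mirror + 1.
  const float* read(std::size_t len) const
  {
    assert(len <= _mirror + 1);
    const std::size_t start = (_pos + (_mask + 1) + 1 - len) & _mask;
    return _buf.data() + start; // mirror guarantees [start, start+len) is valid
  }

private:
  std::size_t _mask, _mirror;
  std::size_t _pos = 0;
  std::vector<float> _buf;
};
```

Both write and read are constant-time, which is what flattens the worst-case block times rather than just the median.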
…ess())
Two per-call heap allocations in A2FastModel::process() moved to
pre-allocated members sized in SetMaxBufferSize: the float32 input copy
(_cond) and the float32 head output scratch (_head_out). Each was a
per-block std::vector ctor that escaped the earlier audit.
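The fix pattern, hoisting per-call vectors into members sized once up front, can be sketched like this (`_cond`, `_head_out`, and `SetMaxBufferSize` are names from the PR; the surrounding class and its trivial DSP body are illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative model class: scratch buffers are allocated once, up front,
// instead of constructing a std::vector on every process() call.
class A2FastModelSketch
{
public:
  // Called from the non-realtime thread before audio starts.
  void SetMaxBufferSize(std::size_t maxFrames)
  {
    _cond.resize(maxFrames);     // float32 input copy
    _head_out.resize(maxFrames); // float32 head output scratch
  }

  // Realtime-safe: only touches pre-allocated storage.
  void process(const float* input, float* output, std::size_t numFrames)
  {
    assert(numFrames <= _cond.size()); // caller respects SetMaxBufferSize
    for (std::size_t i = 0; i < numFrames; ++i)
      _cond[i] = input[i];
    // ... the DSP would run here, using _head_out as scratch ...
    for (std::size_t i = 0; i < numFrames; ++i)
      output[i] = _cond[i];
  }

private:
  std::vector<float> _cond;
  std::vector<float> _head_out;
};
```

The point is that after `SetMaxBufferSize` returns, `process()` performs no heap operations regardless of the block size it is handed.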
Adds test_process_realtime_safe_{nano,standard} using the same
allocation_tracking infrastructure as the generic WaveNet RT-safety
tests: overridden malloc/free/new/delete increment counters while
tracking is enabled; run_allocation_test_no_allocations asserts the
counts stay at 0 during process(). Exercised across block sizes
{1, 32, 64, 128, 256} for both Channels=3 (hand-rolled path) and
Channels=8 (Eigen::Map + fixed-LHS block GEMM path) — both clean.
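The counting mechanism can be approximated with replaceable global `operator new`/`operator delete` (a simplified sketch: the PR's allocation_tracking infrastructure also intercepts malloc/free directly, and `run_no_alloc` here is an illustrative stand-in for `run_allocation_test_no_allocations`):

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <new>

// Global counters, bumped only while tracking is enabled. The default
// operator new[]/delete[] forward here, so array allocations count too.
static std::atomic<bool> g_tracking{false};
static std::atomic<long> g_allocs{0};

void* operator new(std::size_t n)
{
  if (g_tracking.load(std::memory_order_relaxed))
    ++g_allocs;
  if (void* p = std::malloc(n))
    return p;
  throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Run fn() with tracking enabled and assert it never touched the heap.
template <class Fn>
void run_no_alloc(Fn&& fn)
{
  g_allocs = 0;
  g_tracking = true;
  fn();
  g_tracking = false;
  assert(g_allocs.load() == 0 && "allocation during realtime section");
}
```

Wrapping each `process()` call in `run_no_alloc` is what turns "probably realtime-safe" into a failing test the moment a stray allocation sneaks back in.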
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Linux CI uses Debug + -Werror, which surfaces two issues not caught by
the macOS Release build:
1. Unused fuse_post_conv lambda in _layer_forward_k. Earlier edits fully
inlined the post-conv work for Channels==3 and moved to Eigen block
ops for Channels>=8, so the lambda became dead code — tripping
-Wunused-variable. Removed.
2. Missing explicit STL headers. libstdc++ is stricter than libc++ about
transitive includes:
- <algorithm>, <string>, <utility> (test_a2_fast.cpp)
- <algorithm>, <cstdlib>, <utility> (bench_a2_fast.cpp)
- <cstddef>, <iterator>, <utility> (a2_fast.cpp)
Reproduced both failures by building locally with
-DCMAKE_BUILD_TYPE=Debug, with and without -DNAM_USE_INLINE_GEMM.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner
Fantastic summary, @jfsantos ! ❤️ Really appreciate that you captured all this in the PR; hopefully it'll help interested folks understand our thinking & where/how this works best. Reviewing...
sdatkinson approved these changes on Apr 20, 2026
if (la.value("groups_input_mixin", 1) != 1)
  return false;

// Not slimmable
Owner
Nit--slimmable with a layer array-level strategy
A2 fast-path WaveNet
Adds a specialized forward-pass implementation for the A2 standard and A2 nano amp models, routed automatically when a `.nam` file matches the A2 shape. The generic WaveNet path is unchanged — it remains the default for everything else, and there's a CMake opt-out (`-DNAM_ENABLE_A2_FAST=OFF`) that disables the specialization entirely.

### Why
The generic `nam::wavenet::WaveNet` is a fully general WaveNet: it supports multiple layer arrays, any channel count, any kernel size, any dilation, gating / FiLM / head1x1 / grouped convolutions, optional post-stack heads, and dynamic buffer resizing. All of that flexibility costs something — even for a model that uses none of it. A2 nano and A2 standard share the same fixed architecture: 23 layers, LeakyReLU(0.01), no gating/FiLM/head1x1, `layer1x1` active, kernel sizes `{6, 15}`, head conv `k=16 bias=true`, and channels ∈ {3, 8}. For those two models we can cut out all the dynamism and pick a loop structure tuned to the channel count.

### What's in the PR
- **Strict A2 shape detector** (`NAM/wavenet/a2_fast.h`, `a2_fast.cpp`) — checks every knob against the A2 signature. If any field is off by one bit, it falls through to the generic path. Never silently routes a non-A2 model to the fast path.
- **`A2FastModel<Channels>` template DSP** (`a2_fast.cpp`) with explicit instantiations for `Channels = 3` and `Channels = 8`. Shared by both specializations:
  - a `_load_weights` that reads the stream in the same order the generic path does (including the trailing `head_scale` that `WaveNet::set_weights_` consumes);
  - buffers sized once in `SetMaxBufferSize`, so nothing resizes during `process()`;
  - pow2 ring buffers with a `max_buffer_size`-wide tail mirror, so reads spanning past the wrap land in contiguous memory (no memmove-rewind jitter, constant-time write + read).
- **Two internal strategies**, dispatched at compile time on `Channels`:
  - *Channels = 3*: hand-rolled scalar path. Weights in `const float` locals and the channel reduction kept in scalar temps `a0`/`a1`/`a2`. Tap 0 seeds `z` directly from `conv_b` (skipping the memset-to-zero pass). Final tap + mixin + LeakyReLU + `head_sum` accumulate + layer1x1 residual are all inlined into a single loop on register-resident scalars. Matches the structure `nam2c --fused` generates for the same shape.
  - *Channels = 8*: Eigen path (`Eigen::Matrix<float, 8, Eigen::Dynamic>`). One 8×8 × 8×num_frames GEMM per tap into a ring-buffer view, then block-wise `colwise() +=` for bias, a rank-1 outer product for mixin, elementwise `.select()` for LeakyReLU, and another 8×8 × 8×num_frames GEMM for the layer1x1 residual. Leans into Eigen's tuned GEMM kernel for the size that has one, and strips everything around it.
- **Kernel size as a template parameter** (`_layer_forward_k<6>` / `_layer_forward_k<15>`), dispatched by a `switch` on `L.kernel_size`, so the tap loop and per-tap weight offsets are compile-time constants.
- **Dispatcher hook** (`wavenet/model.cpp::create_config`) — one new `if (is_a2_shape(...)) return create_a2_fast_config(...)` right after the existing slimmable-wavenet branch and before the generic path. Gated on `NAM_ENABLE_A2_FAST`.
- **C++ verification harness** (`tools/test/test_a2_fast.cpp`) — 6 detector tests (accept nano/standard; reject tweaks to channels / kernel_sizes / activation / gating) plus 2 equivalence tests that build both paths from the same config and weights and assert the outputs match to 5e-5 on a two-tone input, at block sizes 64 and 256. Runs as part of `run_tests` under `#if defined(NAM_ENABLE_A2_FAST)`.
- **Benchmark tool** (`tools/bench_a2_fast`) — loads a `.nam`, builds both fast and generic DSPs from the same weights, times each block separately (not per-iteration), and reports min / p50 / p99 / p99.9 / max / mean plus RTF.

### Numbers
Apple Silicon M1, `block=64`, 75,000 blocks per variant, release build (`-Ofast`). Per-block timing in microseconds (lower is better).

**A2 nano (Channels = 3)**: generic (Eigen), generic with `NAM_USE_INLINE_GEMM`, `a2_fast`, and `nam2c --fused` (for reference).

**A2 standard (Channels = 8)**: generic (Eigen), generic with `NAM_USE_INLINE_GEMM`, and `a2_fast`.

A few observations worth calling out:
- `NAM_USE_INLINE_GEMM` is a ch=3 win and a ch=8 regression. The hand-unrolled paths in `conv1d.cpp` cover 3×3 / 4×4 / 6×6 / 8×8 explicitly; the small ones beat Eigen cleanly, but the 8×8 case loses to Eigen's GEMM kernel. That explains why the flag isn't the default — and why a single "specialized path" strategy doesn't work across channel counts.
- Worst-case `max` block time is 24 µs (ch=3) and 99 µs (ch=8), vs 105 µs and 120 µs for generic Eigen. Pow2 rings kill the periodic memmove spike. Still well inside the 1333 µs audio deadline at 48 kHz `block=64` for all variants — this matters more for small-block / high-sample-rate configurations than for typical desktop plugins.
- On `nam2c --fused`: we're within 9% at p50 and better at p99.9 (10.8 µs vs 13.0 µs). nam2c's generated C is the reasonable ceiling for this kind of code in portable C; we're there.

### What's NOT in this PR (deliberately)
- Hand-written SIMD: the hot loops rely on the compiler's auto-vectorizer at `-Ofast`. I tried fold-expression unrolling as an alternate approach and it regressed ch=8 because it scalarized and blocked clang's auto-vectorizer, so the ch=8 path leans on Eigen's tuned kernel rather than trying to hand-roll one.
- Further shape specializations: the `A2FastModel` header is a natural home if we want to add them later, but not as part of this PR.

### How to try it
Opt out: `cmake -B build -DNAM_ENABLE_A2_FAST=OFF`.

Developed with support and sponsorship from TONE3000