A2Fast: portable conv + memory optimizations (up to 1.6x on channels=8)#277
Merged
sdatkinson merged 7 commits intoJun 8, 2026
Merged
Conversation
Previously each of the 23 layers held its own std::vector<float> history. Heap allocators can place these at addresses that share the same L1 cache sets, causing conflict misses on the wide-dilation layers whose tap reads, _layer_in writes, and head-history writes simultaneously compete for cache. Replace with a single contiguous allocation (_history_arena) from which each layer takes a raw pointer slice. Sequential placement naturally distributes base addresses across cache sets because each slot size mod the set stride is non-zero. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>
Previously each layer accumulated its post-activation output into a separate _head_sum buffer, which was then zeroed before every block (memset of Channels*num_frames floats) and copied into _head_history by _head_ring_write (another memcpy of the same size) after all layers finished. Replace with direct writes: layer 0 assigns (IsFirst=true template parameter), layers 1-22 accumulate (IsFirst=false). The IsFirst branch resolves at compile time, so there is no runtime branch in the hot loop. _head_ring_write is removed; process() now manages the head ring write position directly, rewinding with a cheap memmove of kHeadKernelSize-1 columns when the end of the buffer is reached. Also eliminates ztile.setZero() for the Channels=8 Eigen path: tap 0 now assigns into ztile (noalias() =) rather than accumulating on top of a zero-initialised matrix, saving one Channels*num_frames write per layer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>
Previously _ring_write always refreshed the full tail mirror — a memcpy of maxBufferSize * Channels * sizeof(float) bytes per layer per block (4 KB for Ch=8 / 128 frames, 92 KB total across 23 layers per buffer). Most of the time no tap read crosses the pow2_size boundary, so the mirror copy was pure overhead. Replace with mirror-on-demand: after each write, compute each tap's next-call read range and mirror only the columns that actually overflow past pow2_size. Zero copy is emitted whenever no tap straddles the wrap. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>
Replaces the Eigen noalias() GEMM loop with an explicit T=4 frame-tiled,
tap-major C++ kernel. Structure of the inner body:
a[f][o] += Wcol[o] * h[f] (o = output channel, f = frame in tile)
Wcol = W[:,cp] is stride-1 in o, so the o loop vectorizes. h[f] is a
scalar loaded from the (column-major) history buffer, so the compiler
emits a broadcast-scalar FMA instruction — vmlaq_n_f32 on AArch64, the
AVX equivalent on x86 — without requiring architecture-specific intrinsics.
The T=4 tiling amortizes the two weight register loads (W[:,cp] = 8
floats = 2 Q-regs on AArch64) across four independent accumulator chains,
matching the weight-reuse strategy of the explicit NEON microkernel.
A scalar tail handles buffer sizes that are not multiples of 4; in
practice audio buffer sizes (64, 128, 256) are always multiples of 4.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Landon McCoy <landon-chaos@users.noreply.github.com>
…s.txt to compile to Release by default.
Allman braces for the new if constexpr / dispatcher blocks, no single-line loops, and collapse the manually-aligned declarations. No behavioral change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sdatkinson
requested changes
Jun 8, 2026
sdatkinson
left a comment
Owner
There was a problem hiding this comment.
CMakeLists.txt but otherwise LGTM
| // broadcast) with weight vector Wcol reused across all T frames — the same | ||
| // weight amortisation as an explicit NEON microkernel, without intrinsics. | ||
| // Equivalent SIMD broadcast-FMA instructions are emitted on x86 (AVX) and | ||
| // other SIMD targets automatically. |
|
|
||
| set(CMAKE_MODULE_PATH "${CMAKE_SOURCE_DIR}/cmake") | ||
|
|
||
| # Default to a Release build: without this, an empty CMAKE_BUILD_TYPE compiles |
Owner
There was a problem hiding this comment.
I think I prefer defaulting to Debug--slower performance is very noticeable; losing your assertions when running tests, less so :)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Optimizes the A2-shape fast-path WaveNet (
a2_fast.cpp) with portable, intrinsic-free changes. No change to model output; the generic WaveNet path is untouched.Changes
T×Channelsaccumulator local and reuses each weight column across 4 frames. The compiler emits SIMD broadcast-FMA (vmlaq_n_f32on AArch64, AVX equivalents on x86) — the same weight amortization as a hand-written microkernel, without intrinsics._head_historyat the ring slot (IsFirstselects=vs+=), eliminating the separate_head_sumbuffer, its per-block zeroing pass, and the head ring-write copy.layer1x1projection + residual store feed nothing downstream, so they're elided via anIsLasttemplate parameter.mbs×Channelsmirror refresh every block, predict each tap's next read range and mirror only the columns that actually overflow pastpow2_size— frequently zero.Releasewhen no build type is set (an emptyCMAKE_BUILD_TYPEcompiles unoptimized, ~10x slower for Eigen-heavy code).Benchmark (desktop, Apple Silicon, fast-path p50 µs/block, vs
main)Both builds forced to
-DCMAKE_BUILD_TYPE=Releaseto isolate the code changes. The unchanged generic WaveNet ran as a control and stayed flat in both builds, confirming the deltas are real.Channels=8 sees 1.3–1.6x from the tiled conv. Channels=3 is roughly flat on desktop; the arena/mirror changes target narrow embedded caches (Cortex-A8/M7) and aren't captured by a desktop run.
Some of these optimizations were contributed by landon-chaos, as he found them out while optimizing Core to run on his pedals :)