Faster (Anderson safe-scaling) nrm2 variants + fix munit labels#24

Merged
sigilante merged 1 commit into master from fix/issue3-fast-nrm2 on May 30, 2026

Conversation

@sigilante
Collaborator

Addresses #3. The current nrm2 uses the classic running-scale algorithm — a division plus several ops per element, which is especially slow in SoftFloat where every op is a function call (~7× slower than needed).

*nrm2_B

Adds snrm2_B/dnrm2_B/hnrm2_B/qnrm2_B implementing E. Anderson's safe-scaling 2-norm (ACM TOMS Algorithm 978, 2017; the algorithm in Reference-LAPACK/OpenBLAS that the issue links). Three fixed-threshold accumulators (small/medium/big) replace the per-element division while preserving overflow/underflow safety; the inner loop is abs → compare → square → add, with no division. Constants are powers of two, 2^k, chosen from each type's exponent range (unlike in hrotmg, float16's thresholds are modest enough to fit its range).

Side-by-side, per @sigilante's suggestion — the existing nrm2 is untouched. The fast algorithm returns different bits, so switching the default is a coordinated Lagoon/Hoon change for later; this lands the alternative so it can be benchmarked and validated first.

test_nrm2_B.c checks the medium band (3,4 → 5) and the overflow/underflow safety that is the whole point, across all four precisions (single 2^120, double 2^1000, half 1024, quad 2^8200; each of these values has a square that would overflow if computed naïvely).

munit labels (the companion fix)

Now that gemv/gemm tests live in level2/level3, test_all.c registers three sub-suites so a test's label matches its level instead of everything reading /blas/level1:

/blas/level1/...   (asum, axpy, …, nrm2_B, rot, complex)
/blas/level2/test_sgemv_…
/blas/level3/test_sgemm_…

191/191 pass. Leaves #3 open for the benchmark + eventual default switch.

🤖 Generated with Claude Code

Addresses #3. The current nrm2 uses the classic running-scale algorithm,
which performs a division and several ops per element -- expensive in
SoftFloat, where every op is a function call (~7x slower than needed).

Add snrm2_B/dnrm2_B/hnrm2_B/qnrm2_B implementing E. Anderson's safe-
scaling 2-norm (ACM TOMS Algorithm 978, 2017), as used by Reference-
LAPACK/OpenBLAS: three fixed-threshold accumulators (small/medium/big)
remove the per-element division while keeping overflow/underflow safety.
Constants are 2^k chosen from each type's exponent range (and, unlike
hrotmg, float16's modest thresholds fit its range fine).

These are side-by-side *_B variants per @sigilante's suggestion -- the
existing nrm2 is untouched, since the faster algorithm returns different
bits and switching it would need a coordinated Lagoon/Hoon change.

test_nrm2_B.c verifies the medium band (3,4 -> 5) and the whole point --
overflow/underflow safety -- across all four precisions (e.g. 2^120,
2^1000, half 1024, and quad 2^8200, all of whose squares overflow).

Also fixes the munit suite labels: now that gemv/gemm tests live in
level2/level3, test_all.c registers three sub-suites (/blas/level1,
/blas/level2, /blas/level3) so a test's label matches its level instead
of everything reading "/blas/level1".

191/191 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@sigilante sigilante merged commit 40ff4fa into master May 30, 2026
1 check passed
@sigilante sigilante deleted the fix/issue3-fast-nrm2 branch May 30, 2026 18:04