Faster (Anderson safe-scaling) nrm2 variants + fix munit labels #24
Merged
Addresses #3. The current nrm2 uses the classic running-scale algorithm, which performs a division and several ops per element. That is expensive in SoftFloat, where every op is a function call (~7x slower than needed).

Add snrm2_B/dnrm2_B/hnrm2_B/qnrm2_B implementing E. Anderson's safe-scaling 2-norm (ACM TOMS Algorithm 978, 2017), as used by Reference-LAPACK/OpenBLAS: three fixed-threshold accumulators (small/medium/big) remove the per-element division while keeping overflow/underflow safety. Constants are 2^k chosen from each type's exponent range (unlike with hrotmg, float16's thresholds are modest enough to fit its range).

These are side-by-side *_B variants per @sigilante's suggestion. The existing nrm2 is untouched, since the faster algorithm returns different bits and switching it would need a coordinated Lagoon/Hoon change.

test_nrm2_B.c verifies the medium band (3,4 -> 5) and the overflow/underflow safety that is the whole point, across all four precisions (e.g. 2^120, 2^1000, half 1024, quad 2^8200; all values whose squares overflow).

Also fixes the munit suite labels: now that gemv/gemm tests live in level2/level3, test_all.c registers three sub-suites (/blas/level1, /blas/level2, /blas/level3) so a test's label matches its level instead of everything reading "/blas/level1".

191/191 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Addresses #3. The current `nrm2` uses the classic running-scale algorithm: a division plus several ops per element, which is especially slow in SoftFloat, where every op is a function call (~7× slower than needed).

**`*nrm2_B`**

Adds `snrm2_B`/`dnrm2_B`/`hnrm2_B`/`qnrm2_B` implementing E. Anderson's safe-scaling 2-norm (ACM TOMS Algorithm 978, 2017; the algorithm in Reference-LAPACK/OpenBLAS that the issue links). Three fixed-threshold accumulators (small/medium/big) replace the per-element division while preserving overflow/underflow safety; the inner loop is `abs → compare → square → add`, with no division. Constants are `2^k` from each type's exponent range (float16's are modest enough to fit, unlike `hrotmg`).

Side-by-side, per @sigilante's suggestion: the existing `nrm2` is untouched. The fast algorithm returns different bits, so switching the default is a coordinated Lagoon/Hoon change for later; this lands the alternative so it can be benchmarked and validated first.

`test_nrm2_B.c` checks the medium band (3,4 → 5) and the overflow/underflow safety that is the whole point, across all four precisions (`2^120`, `2^1000`, half `1024`, quad `2^8200`; all values whose squares would overflow naïvely).

**munit labels (the companion fix)**
Now that gemv/gemm tests live in `level2`/`level3`, `test_all.c` registers three sub-suites (`/blas/level1`, `/blas/level2`, `/blas/level3`) so that a test's label matches its level instead of everything reading `/blas/level1`.

191/191 tests pass. Leaves #3 open for the benchmark and the eventual default switch.
🤖 Generated with Claude Code