urbit · sigilante · May 30, 2026 · May 30, 2026 · May 30, 2026 · May 30, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -22,6 +22,9 @@ jobs:
           git config --global url."https://github.com/".insteadOf "git@github.com:"
           git submodule update --init --recursive
 
+      - name: Lint test registration
+        run: ./tools/check_test_registration.sh
+
       - name: Build SoftFloat
         run: make -C SoftFloat/build/Linux-x86_64-GCC
 

diff --git a/HANDOFF.md b/HANDOFF.md
@@ -0,0 +1,164 @@
+# SoftBLAS — Handoff to urbit/numerics agent
+
+_Scratch document — not committed. Written 2026-05-30 after a five-reviewer audit
+(Gnome/Dwarf/Elf/Hobbit/Angel) of `urbit/SoftBLAS` and the follow-up fixes._
+
+## TL;DR
+
+A correctness audit found three silently-wrong-answer bugs, a process-killing
+error path, and an inert rounding-mode API. The first four are **fixed and merged
+to `master`**; CI now runs on every PR. The rounding-mode work lives in **PR #1**,
+which has been completed and reconciled with `master` but is **a breaking API
+change** that requires a coordinated Lagoon update in Vere.
+
+Your job picks up at: (1) finish the Lagoon side of the `rndMode` change, (2) bump
+the pinned SoftBLAS commit in Vere, (3) work the remaining audit backlog below.
+
+## What shipped this session
+
+| PR | State | Summary |
+|----|-------|---------|
+| **#8** | merged | `f128_abs` masked both float128 words (corrupted negative quads); `*gemm` indexed B with `ldc` instead of `ldb`; `*gemv` folded `incX/incY` into the matrix index. All three produced silently wrong results, none caught by the old tests. + regression tests. |
+| **#9** | merged | `exit(-1)` on invalid args (bad transpose/layout; `swap` N==0) → replaced with `return`. `swap` N==0 is now a no-op per BLAS spec. (Vere is the only client; it validates and bails via `u3m_bail`, so a plain `return` suffices — no error-callback machinery.) |
+| **#10** | merged | GitHub Actions CI: builds SoftFloat + SoftBLAS, runs the munit suite on every push/PR. |
+| **#1** | open, **mergeable**, **CI green** | Rounding-mode redesign — see below. Branch `sigilante/rounding-mode`. Merged up to `master` so it no longer reintroduces the #8 bugs. 140/140 tests pass (locally and in CI). |
+
+`master` is green (139 tests). The bugs were invisible before because every prior
+test used unit strides and `ldb == ldc`.
+
+## PR #1 (`sigilante/rounding-mode`) — what it is and what's left
+
+**Design:** rounding is now **per-call**. Every arithmetic routine gained a
+trailing `const uint_fast8_t rndMode` parameter and calls `_set_rounding(rndMode)`
+on entry (sets SoftFloat's `softfloat_roundingMode`). The old global
+`softblas_roundingMode` + `softblas_state.c` are **removed**. This supersedes the
+audit's "problem #5" (the old `set_rounding()` was never called, so the documented
+rounding control silently did nothing).
+
+`_set_rounding(a)` accepts `'n' 'z' 'u' 'd' 'a'`; an **unknown mode is a no-op**
+(leaves SoftFloat's current mode unchanged) rather than `exit(1)`.
+
+**Completed in this handoff's work (pushed to the branch):**
+- Converted all remaining implementations to match the header's `rndMode`
+  prototypes; restored the `'a'` (round-away) mode; made unknown-mode a no-op.
+- Updated every test call site; added `test_saxpy_rounding`, which proves the mode
+  actually changes results (`1.0 + 2⁻²⁴` → `1.0` under `'n'`, next-float-up under
+  `'u'`). Would fail if `_set_rounding` were inert.
+- Merged `origin/master` in: #8's fixes + `rndMode` now coexist (verified:
+  `sgemm` has `ldb` + `rndMode`; `sgemv` has `A[i*lda+j]` + `rndMode`; `f128_abs`
+  fixed). 140/140 pass.
+
+**⚠️ This is a breaking API change.** Every routine signature gained a trailing
+arg. All downstream callers must pass a mode.
+
+**Deliberately NOT given `rndMode`** (they do no rounding / header omits it):
+`isamax/idamax/ihamax/iqamax` (index-of-max = abs + compare), and `sdsdot/hsdot`
+(their prototypes don't take it). Leave these alone.
+
+**Converted but NOT built** (kept header-consistent for later): the complex
+(`caxpy/ccopy/cdotc/csrot/cscal/scasum/scnrm2`) and rotation
+(`*rot/*rotg/*rotm`) routines now carry `rndMode`, but they are **not** in the
+Makefile and several still have pre-existing compile bugs (see backlog). `*rotmg`
+were left unconverted (multi-line signatures; also unbuilt).
+
+## Immediate action items (Vere / Lagoon side)
+
+1. **Lagoon `rndMode` update — REQUIRED, do in tandem with merging PR #1.**
+   `vere/pkg/noun/jets/i/lagoon.c` has **~32 BLAS call sites** (`saxpy/sdot/sscal/
+   sgemm` and h/d/q variants) that call the *old* signatures. Each needs the
+   trailing `rndMode` argument. Lagoon already computes the mode — see
+   `_set_rounding(c3_w a)` at ~`lagoon.c:40`, which maps `%n/%z/%u/%d/%a`. Thread
+   that same char (the jet's rounding-mode argument) into each BLAS call. The
+   moment PR #1 merges, Vere won't compile until this is done.
+   - Note: `i*amax` calls take **no** `rndMode` — don't add it there.
+   - Lagoon currently sets `softfloat_roundingMode` directly (and the now-removed
+     `softblas_roundingMode`); after this change, drop the
+     `softblas_roundingMode = ...` writes (that global no longer exists) and rely
+     on passing `rndMode` per call.
+
+2. **Bump the pinned SoftBLAS commit in Vere.**
+   `vere/ext/softblas/build.zig.zon:10` pins `cbffb33f…` (an old commit that has
+   *all three* #8 bugs). After merges land, bump `.url` + `.hash` to a `master`
+   commit that includes #8/#9/#10 (and PR #1 once merged). Also check
+   `vere/ext/softblas/build.zig`'s source list still matches SoftBLAS's
+   `BLAS_SRCS` (no `softblas_state.c` anymore).
+
+3. **Merge PR #1** once Lagoon is ready. It's mergeable; "UNSTABLE" only means CI
+   isn't wired on that branch yet (CI reached `master` via #10 after PR #1 forked;
+   the merge commit already pulled `master` in, so a re-merge of latest `master`
+   would attach CI).
+
+## Remaining SoftBLAS backlog (from the audit, not yet done)
+
+A fuller list was kept in `WORKLIST.md` (was stashed under
+`stash@{0}` on the `ci/github-actions` branch: `git stash show -p stash@{0}`).
+Highlights, roughly by severity:
+
+**Correctness / determinism**
+- `qaxpy`: `nan_unify_q(QY)` only canonicalizes `QY[0]`; NaNs in `QY[1..]` escape →
+  breaks determinism for vectors length > 1. Unify per written element.
+- `*nrm2`: loop bound `(N-1)*incX` can integer-overflow → early exit / infinite loop.
+- `FNEG` macro (`1 << 31` / `1 << 63`): shifts a signed int into the sign bit = UB.
+  Affects complex conjugation. Use a width-correct unsigned literal.
+- `*swap`/`csrot`: declare `incX` as `uint64_t` then test `if (incX < 0)` (always
+  false) → negative strides become huge positives → OOB. (May differ post-PR#1.)
+- `iamax`: 0-based for N>1 but hardcodes `return 1` for N==1 — inconsistent; document
+  the convention (Fortran BLAS is 1-based; this is effectively 0-based).
+- `f128M_abs`: now masks the right word, but is still a structurally broken
+  `(float128_t*){…}` compound literal (currently unused).
+
+**Resource / robustness**
+- `qasum` leaks 16 bytes/call (heap accumulator never freed; use a stack temp).
+- `svec/dvec/hvec/qvec` don't null-check `malloc`.
+
+**Broken / orphaned code presented as "complete"**
+- `rot/rotg/rotm/rotmg` (all precisions) are declared in the header but **not built**
+  → link errors if a caller uses them. Either build+fix or document as unbuilt.
+- `sdsdot.c`, `hsdot.c`, `scnrm2.c` don't compile (undefined idents / complex type
+  errors). `qrot` prototype uses `float16_t*` (copy-paste). `*rotmg` compute
+  constants via integer multiply of hex bit-patterns (meaningless). `srot/drot/hrot`
+  compare SoftFloat structs to int literals with `!=` (UB) and omit
+  `#include "softblas.h"`.
+
+**Test-harness bugs (in `tests/test_all.c`)**
+- `*swap_stride` slots run `*swap_two` (the stride variant is never exercised).
+- `idamax/ihamax/iqamax` stride+13542 slots point at the **`isamax`** functions —
+  the d/h/q index-max tests don't test their own routine.
+
+**Hygiene / scope**
+- `v[0]=lo, v[1]=hi` float128 layout invariant should be documented in the header
+  (currently an `// XXX` note in `qdot.c`).
+- Consider an X-macro for the 6 genuinely-identical Level-1 routines
+  (axpy/copy/dot/scal/swap/asum) across `h/s/d`; keep `q` separate (pointer API).
+- Move internal helpers (`*vec`, `nan_unify_*`) to an internal header.
+- `benchmarking/` is dead (benchmarking was removed from planned work, commit
+  f4908d3) — candidate for deletion. Unused complex 16/128-bit (`i`/`v`) type stubs.
+- Naming sovereignty is limited by downstream clients — coordinate renames.
+
+## Build & test
+
+- **CI / Linux (canonical):** submodules (SoftFloat is an SSH submodule — rewrite to
+  HTTPS), `make -C SoftFloat/build/Linux-x86_64-GCC`, then `make tests && ./test_all`.
+  See `.github/workflows/ci.yml`.
+- **macOS gotcha:** the Makefile hardcodes `gcc`, `-LSoftFloat/build/Linux-x86_64-GCC`,
+  and `-l:softfloat.a` (GNU-ld only) — it does **not** build out-of-the-box on
+  Apple `ld`. During this work I built locally with `clang`, linking the checked-in
+  (arm64) `softfloat.a` directly and compiling the `BLAS_SRCS` list by hand. A
+  Darwin build path is an open backlog item.
+
+## Landmines / things that bit me
+
+- The whole reason the bugs survived: tests only used unit strides and `ldb==ldc`.
+  Any new routine work should add non-unit-stride / `ldb≠ldc` coverage.
+- PR #1's source merged cleanly with #8 because the changes were on different lines
+  (signature/`_set_rounding` at top vs. index fixes in the loop body) — but always
+  re-verify `f128_abs`, `*gemm` `ldb`, and `*gemv` indexing after any future merge.
+- When adding `rndMode` to test calls, watch for calls with trailing comments
+  (`sswap(0,…);  // …`) — a naive "ends with `);`" filter misses them.
+- Merging `master` into PR #1 silently re-took master's **Makefile**, which lists
+  `softblas_state.c` (deleted here) and the WIP complex routines (`scasum`/`ccopy`
+  don't compile) — this broke CI. Resolved by restoring the branch's `BLAS_SRCS`
+  (40 REAL routines) and re-deleting `softblas_state.c`. On any future `master`
+  merge, re-check the Makefile didn't resurrect either. The branch's complex/rot
+  files carry `rndMode` but are intentionally unbuilt and still WIP-broken
+  (e.g. `ccopy` declares `ix/iy` but uses `iX/iY`) — fix when completing complex.
diff --git a/README.md b/README.md
@@ -12,10 +12,11 @@ Following SoftFloat 3e and requiring a 64-bit OS, all quantities are passed by v
 
 - 1.0.0 (commit `7d05697aea5363dcf5f877a9c8b464e9c352d3d4`).  Basic suite of `REAL`-valued operations suitable for use with half, single, double, and quadruple precision floating-point numbers.
 - 1.1.0 (commit `f94acbcfd26cebd8e135ad9e8c7caa156fcc4ac9`).  Errors changed to `exit(-1)` instead of `return`.
+- Unreleased.  Invalid-argument errors `return` (no-op) again instead of `exit(-1)`, so the library never terminates the host process.  The rounding mode is now a per-call argument (see below) rather than a global.
 
 ### Planned work
 
-- [ ] Add rounding-mode propagation to function signatures (in progress).
+- [x] Add rounding-mode propagation to function signatures.
 - [ ] Complete complex-valued functions (in progress).
 - [ ] Use a linter.
 - [ ] Add (kelvin) versioning, at least on interface.
@@ -31,13 +32,13 @@ Following SoftFloat 3e and requiring a 64-bit OS, all quantities are passed by v
 |  64 | `d` |  `float64_t` |  64 | `z` |  `complex64_t` |
 | 128 | `q` | `float128_t` | 128 | `v` | `complex128_t` |
 
-The rounding mode should be set via the global variable `softblas_roundingMode` (a `char` `typedef`), defined in `softblas.h`.  Valid values are:
+Every routine takes a trailing rounding-mode argument `rndMode` (a `uint_fast8_t` holding one of the `char` codes below).  It is applied via `_set_rounding` at routine entry, before any arithmetic.  An unrecognized code leaves SoftFloat's current mode unchanged.  Valid values are:
 
-- `'n'` - round to nearest (even)
+- `'n'` - round to nearest, ties to even
 - `'z'` - round towards zero
-- `'u'` - round up
-- `'d'` - round down
-- `'a'` - round away from zero
+- `'u'` - round up (towards +∞)
+- `'d'` - round down (towards −∞)
+- `'a'` - round to nearest, ties away from zero
 
 Per Wikipedia:
 

diff --git a/include/softblas.h b/include/softblas.h
@@ -168,11 +168,16 @@ typedef struct {
 #define c64_div(a, b) (complex64_t){ f64_div( f64_add( f64_mul(a.real, b.real), f64_mul(a.imag, b.imag) ), f64_add( f64_mul(b.real, b.real), f64_mul(b.imag, b.imag) ) ), f64_div( f64_sub( f64_mul(a.imag, b.real), f64_mul(a.real, b.imag) ), f64_add( f64_mul(b.real, b.real), f64_mul(b.imag, b.imag) ) ) }
 #define c128_div(a, b) (complex128_t){ f128_div( f128_add( f128_mul(a.real, b.real), f128_mul(a.imag, b.imag) ), f128_add( f128_mul(b.real, b.real), f128_mul(b.imag, b.imag) ) ), f128_div( f128_sub( f128_mul(a.imag, b.real), f128_mul(a.real, b.imag) ), f128_add( f128_mul(b.real, b.real), f128_mul(b.imag, b.imag) ) ) }
 
-#define c16_conj(a) (complex16_t){ a.real, FNEG(a.imag) }
-#define c32_conj(a) (complex32_t){ a.real, FNEG(a.imag) }
-#define c64_conj(a) (complex64_t){ a.real, FNEG(a.imag) }
-#define c128_conj(a) (complex128_t){ a.real, FNEG(a.imag) }
-#define FNEG(x) ( x.v ^ (1 << (sizeof(x.v) * 8 - 1)) )
+//  FNEG flips the sign bit of a scalar-valued SoftFloat type (float16/32/64).
+//  The mask must be unsigned and at least as wide as x.v: a plain `1 << 31`
+//  is undefined (signed int overflow) and `1 << 63` does not fit in an int.
+#define FNEG(x) ( (x).v ^ ((uint64_t)1 << (sizeof((x).v) * 8 - 1)) )
+#define c16_conj(a) (complex16_t){ (a).real, FNEG((a).imag) }
+#define c32_conj(a) (complex32_t){ (a).real, FNEG((a).imag) }
+#define c64_conj(a) (complex64_t){ (a).real, FNEG((a).imag) }
+//  float128_t holds its sign bit in the high word v[1]; v is an array, so it
+//  cannot use FNEG (which xors the whole scalar). Flip v[1]'s sign bit only.
+#define c128_conj(a) (complex128_t){ (a).real, (float128_t){ (a).imag.v[0], (a).imag.v[1] ^ ((uint64_t)1 << 63) } }
 
 //  SOFTBLAS PROTOTYPES
 
@@ -188,7 +193,7 @@ void srotg(float32_t *a, float32_t *b, float32_t *c, float32_t *s, const uint_fa
 void srotm(const uint64_t N, float32_t *X, const uint64_t incX, float32_t *Y, const uint64_t incY, const float32_t *P, const uint_fast8_t rndMode);
 void srotmg(float32_t *D1, float32_t *D2, float32_t *X1, const float32_t y1, float32_t *P, const uint_fast8_t rndMode);
 void sscal(uint64_t N, float32_t SA, float32_t *SX, uint64_t incX, const uint_fast8_t rndMode);
-void sswap(uint64_t N, float32_t *SX, uint64_t incX, float32_t *SY, uint64_t incY, const uint_fast8_t rndMode);
+void sswap(uint64_t N, float32_t *SX, int64_t incX, float32_t *SY, int64_t incY, const uint_fast8_t rndMode);
 uint64_t isamax(uint64_t N, const float32_t *SX, uint64_t incX);
 
 //    Double-precision
@@ -202,7 +207,7 @@ void drotg(float64_t *a, float64_t *b, float64_t *c, float64_t *s, const uint_fa
 void drotm(const uint64_t N, float64_t *X, const uint64_t incX, float64_t *Y, const uint64_t incY, const float64_t *P, const uint_fast8_t rndMode);
 void drotmg(float64_t *D1, float64_t *D2, float64_t *X1, const float64_t y1, float64_t *P, const uint_fast8_t rndMode);
 void dscal(uint64_t N, float64_t DA, float64_t *DX, uint64_t incX, const uint_fast8_t rndMode);
-void dswap(uint64_t N, float64_t *DX, uint64_t incX, float64_t *DY, uint64_t incY, const uint_fast8_t rndMode);
+void dswap(uint64_t N, float64_t *DX, int64_t incX, float64_t *DY, int64_t incY, const uint_fast8_t rndMode);
 uint64_t idamax(uint64_t N, const float64_t *DX, uint64_t incX);
 
 //    Half-precision
@@ -216,7 +221,7 @@ void hrotg(float16_t *a, float16_t *b, float16_t *c, float16_t *s, const uint_fa
 void hrotm(const uint64_t N, float16_t *X, const uint64_t incX, float16_t *Y, const uint64_t incY, const float16_t *P, const uint_fast8_t rndMode);
 void hrotmg(float16_t *D1, float16_t *D2, float16_t *X1, const float16_t y1, float16_t *P, const uint_fast8_t rndMode);
 void hscal(uint64_t N, float16_t HA, float16_t *HX, uint64_t incX, const uint_fast8_t rndMode);
-void hswap(uint64_t N, float16_t *HX, uint64_t incX, float16_t *HY, uint64_t incY, const uint_fast8_t rndMode);
+void hswap(uint64_t N, float16_t *HX, int64_t incX, float16_t *HY, int64_t incY, const uint_fast8_t rndMode);
 uint64_t ihamax(uint64_t N, const float16_t *HX, uint64_t incX);
 
 //    Quad-precision
@@ -230,7 +235,7 @@ void qrotg(float128_t *a, float128_t *b, float128_t *c, float128_t *s, const uin
 void qrotm(const uint64_t N, float128_t *X, const uint64_t incX, float128_t *Y, const uint64_t incY, const float128_t *P, const uint_fast8_t rndMode);
 void qrotmg(float128_t *D1, float128_t *D2, float128_t *X1, const float128_t y1, float128_t *P, const uint_fast8_t rndMode);
 void qscal(uint64_t N, float128_t QA, float128_t *QX, uint64_t incX, const uint_fast8_t rndMode);
-void qswap(uint64_t N, float128_t *QX, uint64_t incX, float128_t *QY, uint64_t incY, const uint_fast8_t rndMode);
+void qswap(uint64_t N, float128_t *QX, int64_t incX, float128_t *QY, int64_t incY, const uint_fast8_t rndMode);
 uint64_t iqamax(uint64_t N, const float128_t *QX, uint64_t incX);
 
 //    Complex single-precision
@@ -239,7 +244,7 @@ void caxpy(uint64_t N, complex32_t CA, complex32_t *CX, int64_t incX, complex32_
 void ccopy(uint64_t N, const complex32_t *CX, int64_t incX, complex32_t *CY, int64_t incY, const uint_fast8_t rndMode);
 complex32_t cdotc(uint64_t N, const complex32_t *CX, int64_t incX, const complex32_t *CY, int64_t incY, const uint_fast8_t rndMode);
 float32_t scnrm2(uint64_t N, const complex32_t *CX, uint64_t incX, const uint_fast8_t rndMode);
-void csrot(const uint64_t N, complex32_t *CX, const uint64_t incX, complex32_t *CY, const uint64_t incY, const complex32_t c, const complex32_t s, const uint_fast8_t rndMode);
+void csrot(const uint64_t N, complex32_t *CX, const int64_t incX, complex32_t *CY, const int64_t incY, const complex32_t c, const complex32_t s, const uint_fast8_t rndMode);
 
 void cscal(uint64_t N, complex32_t CA, complex32_t *CX, uint64_t incX, const uint_fast8_t rndMode);
 void cswap(uint64_t N, complex32_t *CX, uint64_t incX, complex32_t *CY, uint64_t incY, const uint_fast8_t rndMode);

diff --git a/src/blas/level1/csrot.c b/src/blas/level1/csrot.c
@@ -1,6 +1,6 @@
 #include "softblas.h"
 
-void csrot(const uint64_t N, complex32_t *CX, const uint64_t incX, complex32_t *CY, const uint64_t incY, const complex32_t c, const complex32_t s, const uint_fast8_t rndMode) {
+void csrot(const uint64_t N, complex32_t *CX, const int64_t incX, complex32_t *CY, const int64_t incY, const complex32_t c, const complex32_t s, const uint_fast8_t rndMode) {
     _set_rounding(rndMode);
     complex32_t temp;
     int64_t ix = 0;

diff --git a/src/blas/level1/dnrm2.c b/src/blas/level1/dnrm2.c
@@ -14,7 +14,7 @@ float64_t dnrm2(uint64_t N, const float64_t *X, uint64_t incX, const uint_fast8_
         float64_t ssq = { SB_REAL64_ONE };
         float64_t absXI;
 
-        for (uint64_t ix = 0; ix < 1 + (N - 1) * incX; ix += incX) {
+        for (uint64_t k = 0, ix = 0; k < N; k++, ix += incX) {
             if (f64_ne(X[ix], (float64_t){ SB_REAL64_ZERO })) {
                 absXI = f64_abs(X[ix]);
                 if (f64_lt(scale, absXI)) {

diff --git a/src/blas/level1/dswap.c b/src/blas/level1/dswap.c
@@ -1,13 +1,13 @@
 #include "softblas.h"
 
-void dswap(uint64_t N, float64_t *DX, uint64_t incX, float64_t *DY, uint64_t incY, const uint_fast8_t rndMode) {
+void dswap(uint64_t N, float64_t *DX, int64_t incX, float64_t *DY, int64_t incY, const uint_fast8_t rndMode) {
     _set_rounding(rndMode);
     float64_t dtemp;
 
     if (N == 0) return;
 
-    uint64_t iX = 0;
-    uint64_t iY = 0;
+    int64_t iX = 0;
+    int64_t iY = 0;
     if (incX < 0) iX = (-N + 1) * incX;
     if (incY < 0) iY = (-N + 1) * incY;
     for (uint64_t i = 0; i < N; i++) {

diff --git a/src/blas/level1/hnrm2.c b/src/blas/level1/hnrm2.c
@@ -14,7 +14,7 @@ float16_t hnrm2(uint64_t N, const float16_t *X, uint64_t incX, const uint_fast8_
         float16_t ssq = { SB_REAL16_ONE };
         float16_t absXI;
 
-        for (uint64_t ix = 0; ix < 1 + (N - 1) * incX; ix += incX) {
+        for (uint64_t k = 0, ix = 0; k < N; k++, ix += incX) {
             if (f16_ne(X[ix], (float16_t){ SB_REAL16_ZERO })) {
                 absXI = f16_abs(X[ix]);
                 if (f16_lt(scale, absXI)) {