Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,9 @@ jobs:
git config --global url."https://github.com/".insteadOf "git@github.com:"
git submodule update --init --recursive

- name: Lint test registration
run: ./tools/check_test_registration.sh

- name: Build SoftFloat
run: make -C SoftFloat/build/Linux-x86_64-GCC

Expand Down
164 changes: 164 additions & 0 deletions HANDOFF.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,164 @@
# SoftBLAS — Handoff to urbit/numerics agent

_Scratch document — not committed. Written 2026-05-30 after a five-reviewer audit
(Gnome/Dwarf/Elf/Hobbit/Angel) of `urbit/SoftBLAS` and the follow-up fixes._

## TL;DR

A correctness audit found three silently-wrong-answer bugs, a process-killing
error path, and an inert rounding-mode API. The first four are **fixed and merged
to `master`**; CI now runs on every PR. The rounding-mode work lives in **PR #1**,
which has been completed and reconciled with `master` but is **a breaking API
change** that requires a coordinated Lagoon update in Vere.

Your job picks up at: (1) finish the Lagoon side of the `rndMode` change, (2) bump
the pinned SoftBLAS commit in Vere, (3) work the remaining audit backlog below.

## What shipped this session

| PR | State | Summary |
|----|-------|---------|
| **#8** | merged | `f128_abs` masked both float128 words (corrupted negative quads); `*gemm` indexed B with `ldc` instead of `ldb`; `*gemv` folded `incX/incY` into the matrix index. All three produced silently wrong results, none caught by the old tests. + regression tests. |
| **#9** | merged | `exit(-1)` on invalid args (bad transpose/layout; `swap` N==0) → replaced with `return`. `swap` N==0 is now a no-op per BLAS spec. (Vere is the only client; it validates and bails via `u3m_bail`, so a plain `return` suffices — no error-callback machinery.) |
| **#10** | merged | GitHub Actions CI: builds SoftFloat + SoftBLAS, runs the munit suite on every push/PR. |
| **#1** | open, **mergeable**, **CI green** | Rounding-mode redesign — see below. Branch `sigilante/rounding-mode`. Merged up to `master` so it no longer reintroduces the #8 bugs. 140/140 tests pass (locally and in CI). |

`master` is green (139 tests). The bugs were invisible before because every prior
test used unit strides and `ldb == ldc`.

## PR #1 (`sigilante/rounding-mode`) — what it is and what's left

**Design:** rounding is now **per-call**. Every arithmetic routine gained a
trailing `const uint_fast8_t rndMode` parameter and calls `_set_rounding(rndMode)`
on entry (sets SoftFloat's `softfloat_roundingMode`). The old global
`softblas_roundingMode` + `softblas_state.c` are **removed**. This supersedes the
audit's "problem #5" (the old `set_rounding()` was never called, so the documented
rounding control silently did nothing).

`_set_rounding(a)` accepts `'n' 'z' 'u' 'd' 'a'`; an **unknown mode is a no-op**
(leaves SoftFloat's current mode unchanged) rather than `exit(1)`.

**Completed in this handoff's work (pushed to the branch):**
- Converted all remaining implementations to match the header's `rndMode`
prototypes; restored the `'a'` (round-away) mode; made unknown-mode a no-op.
- Updated every test call site; added `test_saxpy_rounding`, which proves the mode
actually changes results (`1.0 + 2⁻²⁴` → `1.0` under `'n'`, next-float-up under
`'u'`). Would fail if `_set_rounding` were inert.
- Merged `origin/master` in: #8's fixes + `rndMode` now coexist (verified:
`sgemm` has `ldb` + `rndMode`; `sgemv` has `A[i*lda+j]` + `rndMode`; `f128_abs`
fixed). 140/140 pass.

**⚠️ This is a breaking API change.** Every routine signature gained a trailing
arg. All downstream callers must pass a mode.

**Deliberately NOT given `rndMode`** (they do no rounding / header omits it):
`isamax/idamax/ihamax/iqamax` (index-of-max = abs + compare), and `sdsdot/hsdot`
(their prototypes don't take it). Leave these alone.

**Converted but NOT built** (kept header-consistent for later): the complex
(`caxpy/ccopy/cdotc/csrot/cscal/scasum/scnrm2`) and rotation
(`*rot/*rotg/*rotm`) routines now carry `rndMode`, but they are **not** in the
Makefile and several still have pre-existing compile bugs (see backlog). `*rotmg`
were left unconverted (multi-line signatures; also unbuilt).

## Immediate action items (Vere / Lagoon side)

1. **Lagoon `rndMode` update — REQUIRED, do in tandem with merging PR #1.**
`vere/pkg/noun/jets/i/lagoon.c` has **~32 BLAS call sites** (`saxpy/sdot/sscal/
sgemm` and h/d/q variants) that call the *old* signatures. Each needs the
trailing `rndMode` argument. Lagoon already computes the mode — see
`_set_rounding(c3_w a)` at ~`lagoon.c:40`, which maps `%n/%z/%u/%d/%a`. Thread
that same char (the jet's rounding-mode argument) into each BLAS call. The
moment PR #1 merges, Vere won't compile until this is done.
- Note: `i*amax` calls take **no** `rndMode` — don't add it there.
- Lagoon currently sets `softfloat_roundingMode` directly (and the now-removed
`softblas_roundingMode`); after this change, drop the
`softblas_roundingMode = ...` writes (that global no longer exists) and rely
on passing `rndMode` per call.

2. **Bump the pinned SoftBLAS commit in Vere.**
`vere/ext/softblas/build.zig.zon:10` pins `cbffb33f…` (an old commit that has
*all three* #8 bugs). After merges land, bump `.url` + `.hash` to a `master`
commit that includes #8/#9/#10 (and PR #1 once merged). Also check
`vere/ext/softblas/build.zig`'s source list still matches SoftBLAS's
`BLAS_SRCS` (no `softblas_state.c` anymore).

3. **Merge PR #1** once Lagoon is ready. It's mergeable; "UNSTABLE" only means CI
isn't wired on that branch yet (CI reached `master` via #10 after PR #1 forked;
the merge commit already pulled `master` in, so a re-merge of latest `master`
would attach CI).

## Remaining SoftBLAS backlog (from the audit, not yet done)

A fuller list was kept in `WORKLIST.md` (was stashed under
`stash@{0}` on the `ci/github-actions` branch: `git stash show -p stash@{0}`).
Highlights, roughly by severity:

**Correctness / determinism**
- `qaxpy`: `nan_unify_q(QY)` only canonicalizes `QY[0]`; NaNs in `QY[1..]` escape →
breaks determinism for vectors length > 1. Unify per written element.
- `*nrm2`: loop bound `(N-1)*incX` can integer-overflow → early exit / infinite loop.
- `FNEG` macro (`1 << 31` / `1 << 63`): shifts a signed int into the sign bit = UB.
Affects complex conjugation. Use a width-correct unsigned literal.
- `*swap`/`csrot`: declare `incX` as `uint64_t` then test `if (incX < 0)` (always
false) → negative strides become huge positives → OOB. (May differ post-PR#1.)
- `iamax`: 0-based for N>1 but hardcodes `return 1` for N==1 — inconsistent; document
the convention (Fortran BLAS is 1-based; this is effectively 0-based).
- `f128M_abs`: now masks the right word, but is still a structurally broken
`(float128_t*){…}` compound literal (currently unused).

**Resource / robustness**
- `qasum` leaks 16 bytes/call (heap accumulator never freed; use a stack temp).
- `svec/dvec/hvec/qvec` don't null-check `malloc`.

**Broken / orphaned code presented as "complete"**
- `rot/rotg/rotm/rotmg` (all precisions) are declared in the header but **not built**
→ link errors if a caller uses them. Either build+fix or document as unbuilt.
- `sdsdot.c`, `hsdot.c`, `scnrm2.c` don't compile (undefined idents / complex type
errors). `qrot` prototype uses `float16_t*` (copy-paste). `*rotmg` compute
constants via integer multiply of hex bit-patterns (meaningless). `srot/drot/hrot`
compare SoftFloat structs to int literals with `!=` (UB) and omit
`#include "softblas.h"`.

**Test-harness bugs (in `tests/test_all.c`)**
- `*swap_stride` slots run `*swap_two` (the stride variant is never exercised).
- `idamax/ihamax/iqamax` stride+13542 slots point at the **`isamax`** functions —
the d/h/q index-max tests don't test their own routine.

**Hygiene / scope**
- `v[0]=lo, v[1]=hi` float128 layout invariant should be documented in the header
(currently an `// XXX` note in `qdot.c`).
- Consider an X-macro for the 6 genuinely-identical Level-1 routines
(axpy/copy/dot/scal/swap/asum) across `h/s/d`; keep `q` separate (pointer API).
- Move internal helpers (`*vec`, `nan_unify_*`) to an internal header.
- `benchmarking/` is dead (benchmarking was removed from planned work, commit
f4908d3) — candidate for deletion. Unused complex 16/128-bit (`i`/`v`) type stubs.
- Naming sovereignty is limited by downstream clients — coordinate renames.

## Build & test

- **CI / Linux (canonical):** submodules (SoftFloat is an SSH submodule — rewrite to
HTTPS), `make -C SoftFloat/build/Linux-x86_64-GCC`, then `make tests && ./test_all`.
See `.github/workflows/ci.yml`.
- **macOS gotcha:** the Makefile hardcodes `gcc`, `-LSoftFloat/build/Linux-x86_64-GCC`,
and `-l:softfloat.a` (GNU-ld only) — it does **not** build out-of-the-box on
Apple `ld`. During this work I built locally with `clang`, linking the checked-in
(arm64) `softfloat.a` directly and compiling the `BLAS_SRCS` list by hand. A
Darwin build path is an open backlog item.

## Landmines / things that bit me

- The whole reason the bugs survived: tests only used unit strides and `ldb==ldc`.
Any new routine work should add non-unit-stride / `ldb≠ldc` coverage.
- PR #1's source merged cleanly with #8 because the changes were on different lines
(signature/`_set_rounding` at top vs. index fixes in the loop body) — but always
re-verify `f128_abs`, `*gemm` `ldb`, and `*gemv` indexing after any future merge.
- When adding `rndMode` to test calls, watch for calls with trailing comments
(`sswap(0,…); // …`) — a naive "ends with `);`" filter misses them.
- Merging `master` into PR #1 silently re-took master's **Makefile**, which lists
`softblas_state.c` (deleted here) and the WIP complex routines (`scasum`/`ccopy`
don't compile) — this broke CI. Resolved by restoring the branch's `BLAS_SRCS`
(40 REAL routines) and re-deleting `softblas_state.c`. On any future `master`
merge, re-check the Makefile didn't resurrect either. The branch's complex/rot
files carry `rndMode` but are intentionally unbuilt and still WIP-broken
(e.g. `ccopy` declares `ix/iy` but uses `iX/iY`) — fix when completing complex.
13 changes: 7 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,10 +12,11 @@ Following SoftFloat 3e and requiring a 64-bit OS, all quantities are passed by v

- 1.0.0 (commit `7d05697aea5363dcf5f877a9c8b464e9c352d3d4`). Basic suite of `REAL`-valued operations suitable for use with half, single, double, and quadruple precision floating-point numbers.
- 1.1.0 (commit `f94acbcfd26cebd8e135ad9e8c7caa156fcc4ac9`). Errors changed to `exit(-1)` instead of `return`.
- Unreleased. Invalid-argument errors `return` (no-op) again instead of `exit(-1)`, so the library never terminates the host process. The rounding mode is now a per-call argument (see below) rather than a global.

### Planned work

- [ ] Add rounding-mode propagation to function signatures (in progress).
- [x] Add rounding-mode propagation to function signatures.
- [ ] Complete complex-valued functions (in progress).
- [ ] Use a linter.
- [ ] Add (kelvin) versioning, at least on interface.
Expand All @@ -31,13 +32,13 @@ Following SoftFloat 3e and requiring a 64-bit OS, all quantities are passed by v
| 64 | `d` | `float64_t` | 64 | `z` | `complex64_t` |
| 128 | `q` | `float128_t` | 128 | `v` | `complex128_t` |

The rounding mode should be set via the global variable `softblas_roundingMode` (a `char` `typedef`), defined in `softblas.h`. Valid values are:
Every routine takes a trailing rounding-mode argument `rndMode` (a `uint_fast8_t` holding one of the `char` codes below). It is applied via `_set_rounding` at routine entry, before any arithmetic. An unrecognized code leaves SoftFloat's current mode unchanged. Valid values are:

- `'n'` - round to nearest (even)
- `'n'` - round to nearest, ties to even
- `'z'` - round towards zero
- `'u'` - round up
- `'d'` - round down
- `'a'` - round away from zero
- `'u'` - round up (towards +∞)
- `'d'` - round down (towards −∞)
- `'a'` - round to nearest, ties away from zero

Per Wikipedia:

Expand Down
25 changes: 15 additions & 10 deletions include/softblas.h
Original file line number Diff line number Diff line change
Expand Up @@ -168,11 +168,16 @@ typedef struct {
#define c64_div(a, b) (complex64_t){ f64_div( f64_add( f64_mul(a.real, b.real), f64_mul(a.imag, b.imag) ), f64_add( f64_mul(b.real, b.real), f64_mul(b.imag, b.imag) ) ), f64_div( f64_sub( f64_mul(a.imag, b.real), f64_mul(a.real, b.imag) ), f64_add( f64_mul(b.real, b.real), f64_mul(b.imag, b.imag) ) ) }
#define c128_div(a, b) (complex128_t){ f128_div( f128_add( f128_mul(a.real, b.real), f128_mul(a.imag, b.imag) ), f128_add( f128_mul(b.real, b.real), f128_mul(b.imag, b.imag) ) ), f128_div( f128_sub( f128_mul(a.imag, b.real), f128_mul(a.real, b.imag) ), f128_add( f128_mul(b.real, b.real), f128_mul(b.imag, b.imag) ) ) }

#define c16_conj(a) (complex16_t){ a.real, FNEG(a.imag) }
#define c32_conj(a) (complex32_t){ a.real, FNEG(a.imag) }
#define c64_conj(a) (complex64_t){ a.real, FNEG(a.imag) }
#define c128_conj(a) (complex128_t){ a.real, FNEG(a.imag) }
#define FNEG(x) ( x.v ^ (1 << (sizeof(x.v) * 8 - 1)) )
// FNEG flips the sign bit of a scalar-valued SoftFloat type (float16/32/64).
// The mask must be unsigned and at least as wide as x.v: a plain `1 << 31`
// is undefined (signed int overflow) and `1 << 63` does not fit in an int.
#define FNEG(x) ( (x).v ^ ((uint64_t)1 << (sizeof((x).v) * 8 - 1)) )
#define c16_conj(a) (complex16_t){ (a).real, FNEG((a).imag) }
#define c32_conj(a) (complex32_t){ (a).real, FNEG((a).imag) }
#define c64_conj(a) (complex64_t){ (a).real, FNEG((a).imag) }
// float128_t holds its sign bit in the high word v[1]; v is an array, so it
// cannot use FNEG (which xors the whole scalar). Flip v[1]'s sign bit only.
#define c128_conj(a) (complex128_t){ (a).real, (float128_t){ (a).imag.v[0], (a).imag.v[1] ^ ((uint64_t)1 << 63) } }

// SOFTBLAS PROTOTYPES

Expand All @@ -188,7 +193,7 @@ void srotg(float32_t *a, float32_t *b, float32_t *c, float32_t *s, const uint_fa
void srotm(const uint64_t N, float32_t *X, const uint64_t incX, float32_t *Y, const uint64_t incY, const float32_t *P, const uint_fast8_t rndMode);
void srotmg(float32_t *D1, float32_t *D2, float32_t *X1, const float32_t y1, float32_t *P, const uint_fast8_t rndMode);
void sscal(uint64_t N, float32_t SA, float32_t *SX, uint64_t incX, const uint_fast8_t rndMode);
void sswap(uint64_t N, float32_t *SX, uint64_t incX, float32_t *SY, uint64_t incY, const uint_fast8_t rndMode);
void sswap(uint64_t N, float32_t *SX, int64_t incX, float32_t *SY, int64_t incY, const uint_fast8_t rndMode);
uint64_t isamax(uint64_t N, const float32_t *SX, uint64_t incX);

// Double-precision
Expand All @@ -202,7 +207,7 @@ void drotg(float64_t *a, float64_t *b, float64_t *c, float64_t *s, const uint_fa
void drotm(const uint64_t N, float64_t *X, const uint64_t incX, float64_t *Y, const uint64_t incY, const float64_t *P, const uint_fast8_t rndMode);
void drotmg(float64_t *D1, float64_t *D2, float64_t *X1, const float64_t y1, float64_t *P, const uint_fast8_t rndMode);
void dscal(uint64_t N, float64_t DA, float64_t *DX, uint64_t incX, const uint_fast8_t rndMode);
void dswap(uint64_t N, float64_t *DX, uint64_t incX, float64_t *DY, uint64_t incY, const uint_fast8_t rndMode);
void dswap(uint64_t N, float64_t *DX, int64_t incX, float64_t *DY, int64_t incY, const uint_fast8_t rndMode);
uint64_t idamax(uint64_t N, const float64_t *DX, uint64_t incX);

// Half-precision
Expand All @@ -216,7 +221,7 @@ void hrotg(float16_t *a, float16_t *b, float16_t *c, float16_t *s, const uint_fa
void hrotm(const uint64_t N, float16_t *X, const uint64_t incX, float16_t *Y, const uint64_t incY, const float16_t *P, const uint_fast8_t rndMode);
void hrotmg(float16_t *D1, float16_t *D2, float16_t *X1, const float16_t y1, float16_t *P, const uint_fast8_t rndMode);
void hscal(uint64_t N, float16_t HA, float16_t *HX, uint64_t incX, const uint_fast8_t rndMode);
void hswap(uint64_t N, float16_t *HX, uint64_t incX, float16_t *HY, uint64_t incY, const uint_fast8_t rndMode);
void hswap(uint64_t N, float16_t *HX, int64_t incX, float16_t *HY, int64_t incY, const uint_fast8_t rndMode);
uint64_t ihamax(uint64_t N, const float16_t *HX, uint64_t incX);

// Quad-precision
Expand All @@ -230,7 +235,7 @@ void qrotg(float128_t *a, float128_t *b, float128_t *c, float128_t *s, const uin
void qrotm(const uint64_t N, float128_t *X, const uint64_t incX, float128_t *Y, const uint64_t incY, const float128_t *P, const uint_fast8_t rndMode);
void qrotmg(float128_t *D1, float128_t *D2, float128_t *X1, const float128_t y1, float128_t *P, const uint_fast8_t rndMode);
void qscal(uint64_t N, float128_t QA, float128_t *QX, uint64_t incX, const uint_fast8_t rndMode);
void qswap(uint64_t N, float128_t *QX, uint64_t incX, float128_t *QY, uint64_t incY, const uint_fast8_t rndMode);
void qswap(uint64_t N, float128_t *QX, int64_t incX, float128_t *QY, int64_t incY, const uint_fast8_t rndMode);
uint64_t iqamax(uint64_t N, const float128_t *QX, uint64_t incX);

// Complex single-precision
Expand All @@ -239,7 +244,7 @@ void caxpy(uint64_t N, complex32_t CA, complex32_t *CX, int64_t incX, complex32_
void ccopy(uint64_t N, const complex32_t *CX, int64_t incX, complex32_t *CY, int64_t incY, const uint_fast8_t rndMode);
complex32_t cdotc(uint64_t N, const complex32_t *CX, int64_t incX, const complex32_t *CY, int64_t incY, const uint_fast8_t rndMode);
float32_t scnrm2(uint64_t N, const complex32_t *CX, uint64_t incX, const uint_fast8_t rndMode);
void csrot(const uint64_t N, complex32_t *CX, const uint64_t incX, complex32_t *CY, const uint64_t incY, const complex32_t c, const complex32_t s, const uint_fast8_t rndMode);
void csrot(const uint64_t N, complex32_t *CX, const int64_t incX, complex32_t *CY, const int64_t incY, const complex32_t c, const complex32_t s, const uint_fast8_t rndMode);

void cscal(uint64_t N, complex32_t CA, complex32_t *CX, uint64_t incX, const uint_fast8_t rndMode);
void cswap(uint64_t N, complex32_t *CX, uint64_t incX, complex32_t *CY, uint64_t incY, const uint_fast8_t rndMode);
Expand Down
2 changes: 1 addition & 1 deletion src/blas/level1/csrot.c
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
#include "softblas.h"

void csrot(const uint64_t N, complex32_t *CX, const uint64_t incX, complex32_t *CY, const uint64_t incY, const complex32_t c, const complex32_t s, const uint_fast8_t rndMode) {
void csrot(const uint64_t N, complex32_t *CX, const int64_t incX, complex32_t *CY, const int64_t incY, const complex32_t c, const complex32_t s, const uint_fast8_t rndMode) {
_set_rounding(rndMode);
complex32_t temp;
int64_t ix = 0;
Expand Down
2 changes: 1 addition & 1 deletion src/blas/level1/dnrm2.c
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ float64_t dnrm2(uint64_t N, const float64_t *X, uint64_t incX, const uint_fast8_
float64_t ssq = { SB_REAL64_ONE };
float64_t absXI;

for (uint64_t ix = 0; ix < 1 + (N - 1) * incX; ix += incX) {
for (uint64_t k = 0, ix = 0; k < N; k++, ix += incX) {
if (f64_ne(X[ix], (float64_t){ SB_REAL64_ZERO })) {
absXI = f64_abs(X[ix]);
if (f64_lt(scale, absXI)) {
Expand Down
6 changes: 3 additions & 3 deletions src/blas/level1/dswap.c
Original file line number Diff line number Diff line change
@@ -1,13 +1,13 @@
#include "softblas.h"

void dswap(uint64_t N, float64_t *DX, uint64_t incX, float64_t *DY, uint64_t incY, const uint_fast8_t rndMode) {
void dswap(uint64_t N, float64_t *DX, int64_t incX, float64_t *DY, int64_t incY, const uint_fast8_t rndMode) {
_set_rounding(rndMode);
float64_t dtemp;

if (N == 0) return;

uint64_t iX = 0;
uint64_t iY = 0;
int64_t iX = 0;
int64_t iY = 0;
if (incX < 0) iX = (-N + 1) * incX;
if (incY < 0) iY = (-N + 1) * incY;
for (uint64_t i = 0; i < N; i++) {
Expand Down
2 changes: 1 addition & 1 deletion src/blas/level1/hnrm2.c
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ float16_t hnrm2(uint64_t N, const float16_t *X, uint64_t incX, const uint_fast8_
float16_t ssq = { SB_REAL16_ONE };
float16_t absXI;

for (uint64_t ix = 0; ix < 1 + (N - 1) * incX; ix += incX) {
for (uint64_t k = 0, ix = 0; k < N; k++, ix += incX) {
if (f16_ne(X[ix], (float16_t){ SB_REAL16_ZERO })) {
absXI = f16_abs(X[ix]);
if (f16_lt(scale, absXI)) {
Expand Down
Loading
Loading