fix: first-run breakage (#559, #561) + #560 platform-aware diagnosis #562

Open
ruvnet wants to merge 3 commits into main from fix/issues-559-561

@ruvnet ruvnet commented May 14, 2026

Three real first-run breakages reported in the last few hours. Fresh-clone users hitting any of these would conclude the project doesn't work, and that perception is part of what feeds the "feels like mock" narrative in #557.

What's fixed

#559: ./verify pointed at removed v1/ paths

Wrapper hard-coded v1/data/proof / v1/src but the proof scripts moved to archive/v1/ long ago. Fresh clone failed before the pipeline ran. Reporter (@Fewmanism) provided the exact diff in the issue; applied verbatim across all four hits.

./verify   # now end-to-end PASS

#561 — firmware README would misflash and point at the wrong provisioner

  1. Wrong flash offset. Manual flash command put the app at 0x10000. Partition tables (partitions_display.csv, partitions_4mb.csv) put ota_0 at 0x20000. 0x10000 is phy_init — flashing the app there would corrupt PHY data. Both occurrences fixed; added 0xf000 ota_data_initial.bin which release bundles ship.
  2. Wrong provisioner path. README said python scripts/provision.py. There are two provision.py files in the repo: scripts/provision.py (275 lines, stale) and firmware/esp32-csi-node/provision.py (348 lines, has the full-replace fix for #391, "provision.py: esptool v5 incompat + NVS partition wipes existing keys when partial update"). README updated to point at the canonical one. The stale duplicate is a separate cleanup.

#560 — proof hash mismatches on macOS arm64 / Accelerate

@Fewmanism reports that with pinned numpy 1.26.4 / scipy 1.14.1 on macOS arm64, the proof's SHA-256 differs. Root cause: numpy/scipy use Accelerate.framework on darwin-arm64 and OpenBLAS on linux/windows x86_64. Accelerate's FFT + BLAS produce bit-different IEEE 754 output. That is not a code bug — the proof's bit-exact contract cannot hold across BLAS backends.

What this PR changes:

  • verify.py now prints a RUNTIME ENVIRONMENT block before the pipeline runs (platform, machine, Python version, numpy BLAS backend).
  • The FAIL message reorders causes: platform BLAS/FFT backend is the primary suspect (not "unlikely"), with a pointer to the printed env block.
  • New archive/v1/data/proof/REFERENCE_PLATFORMS.md documents the reference platforms (linux/windows x86_64 with OpenBLAS), the expected-MISMATCH platforms (darwin-arm64 with Accelerate, any MKL install), and three workable responses.

Converts #560 from "the proof is broken on my Mac" → "the proof has a documented single-backend contract".
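
The environment block described above could be gathered along these lines. This is a minimal sketch, not the code in verify.py: the function names are hypothetical, and it assumes NumPy 1.25 or newer, where `numpy.show_config(mode="dicts")` returns the build configuration as a dictionary.

```python
import platform
import sys


def numpy_blas_name() -> str:
    """Best-effort name of the BLAS library NumPy was built against."""
    try:
        import numpy as np
    except ImportError:
        return "numpy not installed"
    try:
        # mode="dicts" returns the build configuration instead of printing it.
        cfg = np.show_config(mode="dicts")
    except TypeError:
        # Older NumPy releases do not accept the mode keyword.
        return "unknown (NumPy too old for show_config(mode=...))"
    if not isinstance(cfg, dict):
        return "unknown"
    blas = cfg.get("Build Dependencies", {}).get("blas", {})
    return blas.get("name", "unknown") if isinstance(blas, dict) else "unknown"


def runtime_environment() -> dict:
    """Collect the four fields named in the PR description."""
    return {
        "platform": sys.platform,
        "machine": platform.machine(),
        "python": platform.python_version(),
        "numpy BLAS": numpy_blas_name(),
    }


def format_environment(env: dict) -> str:
    """Render the fields as an indented block under a fixed heading."""
    lines = ["RUNTIME ENVIRONMENT"]
    lines += [f"  {key}: {value}" for key, value in env.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_environment(runtime_environment()))
```

Printing this before the pipeline runs means a hash mismatch can be read against the backend name without re-running anything.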

Verification

  • ./verify on Windows x86_64 / OpenBLAS — VERDICT PASS, hash 8c0680d7…51c6 matches expected. RUNTIME ENVIRONMENT block prints numpy BLAS: scipy-openblas.
  • grep -E '0x10000|scripts/provision\.py' firmware/esp32-csi-node/README.md — no matches.

Closes

🤖 Generated with claude-flow

…gnosis

Three related fixes — a fresh-clone user hitting any of these would
conclude the project doesn't work; #557's "feels like mock" narrative
is fed in part by these breakages.

## #559 — `./verify` pointed at removed `v1/` paths

The wrapper hard-coded `v1/data/proof` / `v1/src`, but the proof scripts
moved to `archive/v1/` long ago. A fresh clone failed before the
pipeline could even run. User `Fewmanism` provided the exact diff in
the issue. Applied verbatim across four hits (PROOF_DIR, V1_SRC, the
Phase 3 scan-message, and the SKIP-state recovery hint).

  ./verify  # now PASS end-to-end

## #561 — firmware README would misflash and point at the wrong provisioner

Two real bring-up bugs:

1. Manual flash command put the app at `0x10000`. The partition tables
   (`partitions_display.csv`, `partitions_4mb.csv`) define `ota_0` at
   `0x20000`. `0x10000` is the start of `phy_init` data — flashing
   the app binary there would corrupt the PHY init data and the app
   would never run. The QEMU section already had the right `0x20000`,
   so this was an internal contradiction. Both occurrences fixed.

   Also added `0xf000 ota_data_initial.bin` to the manual flash
   command — the release bundle ships this binary and without it the
   bootloader can refuse to boot after a factory wipe.

2. `python scripts/provision.py` referenced the wrong file. There are
   actually TWO `provision.py` files in the repo (`scripts/` — 275
   lines, stale; `firmware/esp32-csi-node/` — 348 lines, has the
   issue #391 full-replace semantics fix). The canonical one is in
   the firmware dir. Both README occurrences fixed to point at the
   canonical path. (The stale `scripts/provision.py` is a separate
   cleanup; the historical ADRs that reference it are intentionally
   not touched.)
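
The offset mismatch in item 1 is the kind of drift a small check can catch. The sketch below parses an ESP-IDF style partition CSV and compares the `ota_0` offset against the address used in a flash command. The sample table is hypothetical (only the `ota_0` offset of `0x20000` comes from the text above), and the helper names are illustrative.

```python
import csv
import io

# Hypothetical partition table for illustration only. The real
# partitions_4mb.csv may differ in everything except the ota_0 offset.
SAMPLE_PARTITIONS = """\
# Name,   Type, SubType, Offset,   Size
nvs,      data, nvs,     0x9000,   0x6000
ota_0,    app,  ota_0,   0x20000,  0x1c0000
"""


def partition_offsets(csv_text: str) -> dict:
    """Map partition name -> offset for an ESP-IDF style partition CSV."""
    offsets = {}
    for row in csv.reader(io.StringIO(csv_text)):
        fields = [f.strip() for f in row]
        if not fields or not fields[0] or fields[0].startswith("#"):
            continue  # blank line or comment/header row
        if len(fields) < 4 or not fields[3]:
            continue  # offset left blank (auto-placed); skipped in this sketch
        offsets[fields[0]] = int(fields[3], 0)  # base 0 accepts the 0x prefix
    return offsets


def check_flash_offset(csv_text: str, flash_offset: int, partition: str = "ota_0") -> bool:
    """True if the flash command's address matches the partition's offset."""
    return partition_offsets(csv_text).get(partition) == flash_offset


print(check_flash_offset(SAMPLE_PARTITIONS, 0x10000))  # False
print(check_flash_offset(SAMPLE_PARTITIONS, 0x20000))  # True
```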

## #560 — proof hash mismatches on macOS arm64 / Accelerate

User `Fewmanism` reports that with the same pinned `numpy 1.26.4` /
`scipy 1.14.1` on macOS arm64, the proof's SHA-256 differs from the
published expected hash. The proof passes on linux-x86_64 and
windows-x86_64 (where wheels ship OpenBLAS); it mismatches on
darwin-arm64 (where numpy/scipy use Accelerate.framework). That is
not a code bug: on identical IEEE 754 inputs, Accelerate's FFT and
BLAS produce output that is bit-different from OpenBLAS's, and the
proof's bit-exact contract therefore cannot hold across backends.

What this commit changes:

- `verify.py` now prints a RUNTIME ENVIRONMENT block before the
  pipeline runs: platform, machine, Python version, numpy BLAS
  backend. Users on a non-reference backend see the cause up front.
- The FAIL message reorders causes: platform BLAS/FFT backend is
  now the *primary* suspect (not "unlikely"), with a pointer to
  the printed RUNTIME ENVIRONMENT block.
- New `archive/v1/data/proof/REFERENCE_PLATFORMS.md` documents the
  reference platforms (linux-x86_64 + windows-x86_64 with OpenBLAS),
  the expected-MISMATCH platforms (darwin-arm64 with Accelerate,
  any MKL install), and three workable responses for users hitting
  a non-reference backend (run on a reference platform, generate a
  local-reference hash, or use tolerance-based comparison — that
  last one is the roadmap path).

This converts #560 from "the proof is broken on my Mac" to "the proof
has a documented single-backend contract".
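
To illustrate why a bit-exact hash and a tolerance-based comparison behave differently, the stdlib-only sketch below perturbs one value by a single unit in the last place. The tolerances and function names are assumptions chosen for illustration, not values from the proof pipeline.

```python
import hashlib
import math
import struct


def sha256_of_floats(values) -> str:
    """Hash the raw little-endian float64 bytes, as a bit-exact check would."""
    payload = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(payload).hexdigest()


def within_tolerance(a, b, rel_tol=1e-9, abs_tol=1e-12) -> bool:
    """Element-wise closeness check; tolerances here are illustrative."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )


reference = [0.1, 0.25, 3.0]
# Simulate a backend whose rounding differs by one unit in the last place.
other_backend = [math.nextafter(reference[0], math.inf), 0.25, 3.0]

print(sha256_of_floats(reference) == sha256_of_floats(other_backend))  # False
print(within_tolerance(reference, other_backend))  # True
```

A single differing bit changes the hash completely, which is why the hash contract holds only on one backend while a tolerance check could span several.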

## Verification

- `./verify` (Windows x86_64 / OpenBLAS): VERDICT PASS, hash
  `8c0680d7…51c6` matches expected. RUNTIME ENVIRONMENT block prints
  numpy BLAS = `scipy-openblas`.
- `grep -E '0x10000|scripts/provision\.py' firmware/esp32-csi-node/README.md`:
  no matches.

Co-Authored-By: claude-flow <ruv@ruv.net>
@Fewmanism

Thank you for the quick and thorough follow-up. I appreciate you applying the verify fix, clarifying the ESP32-S3 flashing/provisioning docs, and documenting the macOS arm64 BLAS/hash behavior.

Same drift as #559 but in CI: the workflow ran `working-directory: v1`
on the two verify steps, but the Python codebase moved to `archive/v1/`
ages ago. The job failed with:

  An error occurred trying to start process '/usr/bin/bash' with
  working directory '/home/runner/work/RuView/RuView/v1'.
  No such file or directory

Fixed both occurrences (working-directory: v1 -> working-directory:
archive/v1).

Also added `SECRET_KEY` env var to both steps — `verify.py` transitively
imports `src.app` -> `src.config.settings` (since PR #547 introduced
pydantic-settings with a required `secret_key` field). The value is
never used for any auth path in the proof pipeline; it just needs to
satisfy the import chain. Same env-var workaround used locally to make
`./verify` pass.

After this commit, "Verify Pipeline Determinism (3.11)" should go green
on this PR.

Co-Authored-By: claude-flow <ruv@ruv.net>
# uses pydantic-settings with a required `secret_key` field. The proof
# only needs the import chain to resolve; the value is never used for
# any auth path in the proof pipeline.
SECRET_KEY: ci-proof-replay-only-not-a-real-secret
- working-directory: v1
+ working-directory: archive/v1
  env:
    SECRET_KEY: ci-proof-replay-only-not-a-real-secret
Two real bugs found while pushing the v0.8.0 image to Docker Hub:

## Rust 1.85 -> 1.90

`hnsw_rs 0.3.4` (transitive via wifi-densepose-ruvector ->
ruvector-attn-mincut -> hnsw_rs) calls `nbp.is_multiple_of(500_000)`.
`is_multiple_of` on unsigned integers was stabilised in Rust 1.87
(rust-lang/rust#128101 — RFC 3565). On 1.85 the compile fails with:

  error[E0658]: use of unstable library feature `unsigned_is_multiple_of`
   --> hnsw_rs-0.3.4/src/hnswio.rs:736:20

Pinned to 1.90 for reproducibility — a comment in the Dockerfile flags
the 1.87 MSRV requirement so a future downgrade can't quietly break it.

## .gitattributes — force LF on shell scripts + Dockerfile

Without a `.gitattributes`, `core.autocrlf=true` (the Git for Windows
installer default) converts shell scripts to CRLF on checkout. `COPY`ing
`docker/docker-entrypoint.sh` into a Linux image then preserves CRLF.
The shebang line `#!/bin/sh\r\n` causes `exec /app/docker-entrypoint.sh`
to fail with:

  exec /app/docker-entrypoint.sh: no such file or directory

The kernel tries to look up an interpreter literally named `/bin/sh\r`,
which doesn't exist. Container exits immediately. The first v0.8.0
image push (digest sha256:7957…44fa) suffered exactly this; the
re-pushed image (digest sha256:e9f4…d38315) was built on a
renormalised tree.
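
The failure mode above can be detected before an image is built. This is a minimal sketch with hypothetical helper names: it checks whether a script's first line ends in CRLF and shows the interpreter path as it reads up to the first LF. It assumes a shebang without arguments.

```python
def first_line_has_crlf(data: bytes) -> bool:
    """True if the first line (e.g. the shebang) ends with CRLF."""
    first_line, newline, _ = data.partition(b"\n")
    return bool(newline) and first_line.endswith(b"\r")


def interpreter_seen_by_kernel(data: bytes) -> str:
    """Return the shebang's interpreter path, read up to the first LF."""
    first_line = data.split(b"\n", 1)[0]
    if not first_line.startswith(b"#!"):
        return ""
    return first_line[2:].decode("ascii", errors="replace")


crlf_script = b"#!/bin/sh\r\necho hello\r\n"
lf_script = b"#!/bin/sh\necho hello\n"

print(first_line_has_crlf(crlf_script))               # True
print(repr(interpreter_seen_by_kernel(crlf_script)))  # '/bin/sh\r'
print(first_line_has_crlf(lf_script))                 # False
```

Running a check like this over the files copied into the image would flag a CRLF checkout before the container is started.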

The .gitattributes rule forces LF for:
  - *.sh / *.bash
  - Dockerfile*
  - docker/* (covers docker-entrypoint.sh + docker-compose.yml)
  - scripts/*
  - `verify` (the proof-replay wrapper — same root cause as if it
    had landed CRLF in someone's clone)

Binary file globs (*.bin, *.wasm, *.rvf, *.pcap, etc.) explicitly
marked binary so text-normalisation never touches them.

## CHANGELOG — drop the false `--introspection` flag claim

The CHANGELOG entry for v0.8.0 said the introspection endpoints were
"off by default, enabled via `--introspection`". That isn't true:
`sensing-server --help` has no such flag. The routes are mounted
unconditionally in `main.rs`. The per-frame `update()` p99 of
0.041 ms (~24× under D4's 1 ms budget) makes always-on viable; the
"off by default" framing came from an earlier draft of ADR-099 that
the implementation outgrew. Corrected.

## Verification

End-to-end smoke test of the pushed image:

  docker run -d -p 13000:3000 \
    -e CSI_SOURCE=simulated \
    -e SENSING_BIND_ADDR=0.0.0.0 \
    ruvnet/wifi-densepose:v0.8.0

  /health -> {"status":"ok","source":"simulated",...}
  /api/v1/info -> {"backend":"rust","features":{"ruvector":true,"signal_processing":true,...}}
  /api/v1/introspection/snapshot -> {"regime":"unknown",
    "regime_changed":false,"top_k_similarity":[]} (ADR-099 shape exact)
  /ui/observatory.html -> HTTP 200, 15 KB
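
The endpoint checks above could be scripted roughly as follows. This is a sketch under assumptions: the base URL follows the port mapping in the `docker run` command, the helper names are hypothetical, and only the three snapshot keys quoted above are checked rather than the full ADR-099 shape.

```python
import json
import urllib.request

# Base URL assumes the 13000:3000 port mapping from the docker run command.
BASE_URL = "http://localhost:13000"

# Keys taken from the snapshot response quoted above.
SNAPSHOT_KEYS = {"regime", "regime_changed", "top_k_similarity"}


def fetch_json(path: str, base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """GET a path on the running container and decode the JSON body."""
    with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def snapshot_shape_ok(payload: dict) -> bool:
    """True if the payload carries at least the three quoted keys."""
    return SNAPSHOT_KEYS.issubset(payload)


def smoke_test(base_url: str = BASE_URL) -> bool:
    """Check /health and the introspection snapshot; needs a live container."""
    health = fetch_json("/health", base_url)
    snapshot = fetch_json("/api/v1/introspection/snapshot", base_url)
    return health.get("status") == "ok" and snapshot_shape_ok(snapshot)
```

`smoke_test()` is only meaningful with the container running; `snapshot_shape_ok` can be exercised offline against a saved response.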

Published manifest digests:
  ruvnet/wifi-densepose:v0.8.0 -> sha256:e9f4c5af…d38315
  ruvnet/wifi-densepose:latest -> sha256:e9f4c5af…d38315

Co-Authored-By: claude-flow <ruv@ruv.net>