Skip to content

fix(firmware): SPI cache crash fix + node_id/filter_mac defensive copies + esptool v5 (rebased #397)#446

Merged
ruvnet merged 6 commits intomainfrom
fix/firmware-pr397-rebased
Apr 28, 2026
Merged

fix(firmware): SPI cache crash fix + node_id/filter_mac defensive copies + esptool v5 (rebased #397)#446
ruvnet merged 6 commits intomainfrom
fix/firmware-pr397-rebased

Conversation

@ruvnet
Copy link
Copy Markdown
Owner

@ruvnet ruvnet commented Apr 28, 2026

Rebases the 6 firmware-fix commits from #397 onto current main (the original PR is now CONFLICTING with the post-merge of nvsim work).

What this fixes

  • fix(firmware): move defensive node_id capture before wifi_init_sta()g_nvs_config can be clobbered by WiFi driver init on some devices (80:b5:4e:c1:be:b8), reverting node_id to Kconfig default. Capture now runs before wifi_init_sta().
  • fix(firmware): defensive copy of filter_mac to prevent callback crash — Same clobber path corrupted filter_mac reads in the CSI callback (100–500 Hz). Module-local statics s_filter_mac + s_filter_mac_set insulate the callback from any subsequent corruption.
  • fix(firmware): MGMT-only promiscuous filter to prevent SPI cache crash — ESP32-S3 nodes crashed in cache_ll_l1_resume_icache / wDev_ProcessFiq after ~2400 callbacks under DATA-frame load. Narrowed to WIFI_PROMIS_FILTER_MASK_MGMT (~10 Hz beacons).
  • fix(provision): write-flash → write_flash for esptool v5 compat — ESP-IDF v5.4 ships esptool 4.10 (underscore form only); standalone esptool v5 accepts both. Reverts to write_flash so both tools work.
  • fix(firmware): 50 Hz callback rate gate + sdkconfig extra IRAM optCSI_MIN_PROCESS_INTERVAL_US early-drops excess callbacks; CONFIG_ESP_WIFI_EXTRA_IRAM_OPT=y as defense-in-depth.
  • fix(firmware): address PR #397 review feedback — review fixes from the original PR.

Validation on COM7 (this session)

  • Build: 866 KB esp32-csi-node.bin, 54 % flash free
  • Flash: 5.7 s @ 1221 kbit/s, hash verified
  • 4-minute soak: 3000 CSI callbacks, 0 crash markers (old threshold was ~2400 before SPI cache crashes)
  • BR=9.2 br/min, persons=4, drops=0 — clean steady-state at MGMT-only ~10 Hz rate

Closes the path to a v0.6.3-esp32 firmware release. Original PR #397 should be closed once this lands.

proffesor-for-testing and others added 6 commits April 28, 2026 08:39
The original defensive copy in csi_collector_init() (line 172 of main.c)
runs AFTER wifi_init_sta() (line 147), which on some ESP32-S3 devices
corrupts g_nvs_config.node_id back to the Kconfig default of 1.

Reproduced on device 80:b5:4e:c1:be:b8 (ESP32-S3 QFN56 rev v0.2):
  - NVS provisioned with node_id=5
  - Release firmware (no fix): seed receives node_id=1 (clobbered)
  - This patch: seed receives node_id=5 (correct)

Changes:
  - Add csi_collector_set_node_id() called from main.c immediately
    after nvs_config_load(), before wifi_init_sta() runs
  - csi_collector_init() now detects and logs the clobber if early
    capture disagrees with current g_nvs_config value
  - Fallback path preserved: if set_node_id() is never called,
    init() still captures from g_nvs_config (backwards compatible)

Co-Authored-By: claude-flow <ruv@ruv.net>
The CSI callback reads g_nvs_config.filter_mac_set and filter_mac on
every invocation (100-500 Hz). If wifi_init_sta() corrupts g_nvs_config
(same root cause as the node_id clobber), the callback reads garbage
from the struct, leading to Core 0 LoadProhibited panic after ~2400
callbacks (~70 seconds of operation).

Extends the early-capture pattern from the node_id fix to also copy
filter_mac_set and filter_mac into module-local statics before WiFi
init runs. Adds canary logging to detect filter_mac corruption.

Observed on device 80:b5:4e:c1:be:b8 via serial:
  CSI cb #2400 → Guru Meditation Error: Core 0 panic'ed (LoadProhibited)
  → TG0WDT_SYS_RST → reboot → crash again at ~2900 callbacks

Refs #232 #375 #385 #386 #390

Co-Authored-By: Ruflo & AQE
The WiFi driver's wDev_ProcessFiq interrupt handler crashes with
LoadProhibited in cache_ll_l1_resume_icache when promiscuous mode
captures MGMT+DATA frames (100-500 interrupts/sec). The high interrupt
rate races with SPI flash cache operations, corrupting cache state.

Changes:
- Promiscuous filter: MGMT+DATA → MGMT-only (~10 Hz beacons)
- CSI config: disable htltf_en and stbc_htltf2_en (LLTF-only)

LLTF provides 64 subcarriers (HT20) — sufficient for presence,
breathing, and fall detection. The 10 Hz beacon rate eliminates
the SPI flash cache contention that caused the crash.

Verified on device 80:b5:4e:c1:be:b8:
- Before: LoadProhibited crash at ~1600-2400 callbacks (every ~70s)
- After: 2700+ callbacks over 4.7 minutes, zero crashes

Backtrace decode confirmed crash in ESP-IDF closed-source WiFi blob:
  _xt_lowint1 → wDev_ProcessFiq → spi_flash_restore_cache
  → cache_ll_l1_resume_icache → EXCVADDR=0x00000004 (NULL deref)

Co-Authored-By: Ruflo & AQE
esptool v5+ rejects hyphenated subcommands. The provision script
used 'write-flash' which fails with "invalid choice". Changed to
'write_flash' (underscore) which works with both old and new esptool.

Co-Authored-By: Ruflo & AQE
- Add early rate gate in wifi_csi_callback at 50 Hz (defense-in-depth,
  does not prevent crash alone but reduces callback execution time)
- Add null-data injection timer infrastructure (disabled — TX adds
  interrupt pressure that triggers the SPI cache crash, RuView#396)
- sdkconfig.defaults: add CONFIG_ESP_WIFI_EXTRA_IRAM_OPT=y
- sdkconfig.defaults: document SPIRAM XIP attempt (crashes differently)

Co-Authored-By: Ruflo & AQE
Applies @ruvnet's five review requests on PR #397 (RuView#397 comment
4289417527):

1. **Inline comment on `provision.py` `write_flash`** — ESP-IDF v5.4
   bundles esptool 4.10.0 (underscore-only). #391's hyphen swap broke
   the documented venv flow; kept the underscore form and added a
   three-line comment warning future maintainers not to "re-fix" it.

2. **Correct `edge_processing.c` sample_rate** (blocking) — changed
   hard-coded `20.0f` → `10.0f` at line 718 so
   `estimate_bpm_zero_crossing()` matches the MGMT-only CSI rate.
   Without this, breathing and heart-rate reports were 2× the true
   value. Added a comment tying the constant to the callback rate gate.

3. **Removed disabled probe-injection infrastructure** — dropped the
   forward declaration, the `CSI_PROBE_INTERVAL_MS` define, six static
   variables (`s_probe_timer`, `s_probe_tx_count`, `s_probe_tx_fail`,
   `s_ap_bssid`, `s_ap_bssid_known`), and three functions
   (`csi_send_probe_request`, `probe_timer_cb`,
   `csi_collector_start_probe_timer`). None were reachable.
   `csi_inject_ndp_frame()` reverted to the original ADR-029 stub.
   Can be revived from this commit's parent if needed.

4. **Cleaned `sdkconfig.defaults`** — removed the SPIRAM prose and
   commented-out `# CONFIG_SPIRAM is not set` line. Kept only the live
   `CONFIG_ESP_WIFI_EXTRA_IRAM_OPT=y` with a concise rationale.

5. **Bumped firmware version 0.6.1 → 0.6.2** and added four
   `[Unreleased]` CHANGELOG entries covering the SPI cache crash fix,
   the `filter_mac` / `node_id` clobber defense, the sample-rate
   correction, and the `write_flash` command-form revert.

Net: +39 / -128 across six files.

Validation in this devcontainer:
- Static sanity on modified C files: braces balance (csi_collector.c
  59/59; edge_processing.c 96/96), zero dangling references to removed
  probe-injection symbols.
- Rust workspace tests and Python proof not executed here — cargo not
  installed and pip blocked by PEP 668. Deferring hardware build +
  flash + miniterm verification to @ruvnet's COM7 per his offer in
  the review comment.

Co-Authored-By: claude-flow <ruv@ruv.net>
@ruvnet ruvnet merged commit f06d0c6 into main Apr 28, 2026
13 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants