ESP32-S3 deploy v2: gain-lock + NBVI + persistent baseline + auto-recalibrate + complex CSI + REST + drift channel (ADR-100..108, 110) #596
End-to-end deployment fixes that took the two ESP32-S3 sensor boards (room01, room02) from "boots but DSP frozen, OTA always rolls back" to "motion/presence/breathing all live, two consecutive OTA round-trips succeed". Full forensic write-up in docs/adr/ADR-098.

Firmware (firmware/esp32-csi-node/main/):
* csi_collector.c — remove esp_wifi_set_promiscuous(true): this call silenced the CSI RX callback entirely on this silicon revision (yield=0pps). Without it, callbacks resume at ~5-10 pps.
* edge_processing.c — root cause: incoming CSI frames carry 192 subcarriers but EDGE_MAX_SUBCARRIERS=128, so the size check early-returned every frame and Step 8 (motion) never ran. Truncate to 128 + warn once instead of returning.
* edge_processing.c — replace per-bin unwrapped-phase variance with temporal variance of per-frame broadband mean amplitude. Empirical separation on deployed hardware: empty 0.07-0.10, walking 3.5-14 (~44x). Scaled by /3.0 and clamped to [0,1].
* edge_processing.c — biquad fs 20.0 -> 10.0, matching the actual callback rate (was halving the breathing passband).
* ota_update.c — OTA_WITH_SEQUENTIAL_WRITES -> OTA_SIZE_UNKNOWN to erase the full target partition (stale tail of the previous larger image was crashing the new image on boot, looking like rollback).
* ota_update.c — httpd_config_t.stack_size = 8192 (default 4 KB overflowed in OTA verify path).
* main.c — log esp_reset_reason() and running_partition->label once at app_main start, so OTA outcomes are visible without guesswork.
* sdkconfig.defaults — local deployment defaults: tier=2, display disabled (no expander on these boards), 8192 timer stack.

Sensing server (v2/crates/wifi-densepose-sensing-server/):
* src/main.rs — parse_rv_feature_state() for the 0xC5110006 feature_state packet that RuView FW emits by default; this format was previously unhandled. Wire ahead of parse_esp32_vitals.
* src/main.rs — BaselineTracker with hysteretic motion gating on top of FW-reported scores, so UI sees clean boolean presence transitions.
* src/main.rs — refuse --source simulate; remove auto-fallback to synthetic data. Production builds never run on fake signals.
* src/main.rs/csi.rs — parse_csi_lean() for legacy FW 5.47 CSV packets; defence-in-depth for mistakenly flashed legacy sensors.

Desktop UI (v2/crates/wifi-densepose-desktop/):
* src/commands/discovery.rs — third discovery path: HTTP /status sweep across the local /24 in parallel with mDNS/UDP. mDNS+UDP-beacon are not advertised by current RuView FW. Replace sequential for-task-in-tasks select-with-deadline (which blocked on slow unrelated IPs) with futures::join_all + overall timeout.
* src/commands/server.rs — pass --bind-addr (was --bind); pass RUST_LOG env instead of unsupported --log-level; auto-load bundled wifi-densepose-v1.rvf next to the binary; reasonable defaults (esp32 source, 0.0.0.0 bind).
* ui/* — keep last good node list when a poll returns 0 (discovery is jittery on busy LANs); 8 s timeout (was 3 s); remove "simulate" from DataSource enum and Sensing dropdown; default Sensing source esp32.

Mobile UI (ui/mobile/):
* constants/websocket.ts — WS_PATH '/ws/sensing' + WS_PORT 8765 to match the RuView sensing-server's WS endpoint (was the legacy FastAPI /api/v1/stream/pose).
* services/ws.service.ts — derive WS host from serverUrl but use WS_PORT; remove simulation fallback paths entirely (no generateSimulatedData, no startSimulation on reconnect failure).
* stores/settingsStore.ts — serverUrl defaults to http://100.123.189.10:8080 (deployed Mac's Tailscale IP), so the phone connects from any network without LAN dependency.
* stores/matStore.ts — default dataSource='real', simulationAcknowledged=true; no synthetic triage data.
* screens/MATScreen, VitalsScreen — hide simulation overlay/badge.

Docker:
* docker/docker-compose.yml — sensing-server host port 5005 -> 5006 to match the RuView FW's compiled CSI_TARGET_PORT default.

Documentation:
* docs/adr/ADR-098-esp32s3-csi-deployment-fixes.md — full forensic ADR covering each decision, the empirical numbers that drove it, the false hypotheses we ruled out along the way, and open items.

Verified on hardware (both nodes):
* motion empty < 0.05 (room01 0.018, room02 0.070)
* motion walking > 0.3 within 1-3 s, saturates at 1.0
* motion decay < 0.1 within 5 s after leaving
* breathing 21-22 BPM detected after ~30 s stationary
* two consecutive OTA round-trips succeed without USB intervention
* discovery finds both sensors via HTTP sweep in <2 s

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
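The replacement motion metric in edge_processing.c (truncate oversize frames, then take the temporal variance of the per-frame broadband mean, scale by /3.0, clamp to [0,1]) can be sketched on the host side as follows. This is a minimal Python illustration, not the firmware code: the class name `BroadbandMotion` and the window length are assumptions, only the 128-subcarrier cap, the /3.0 divisor and the clamp come from the message above.

```python
from collections import deque
from statistics import pvariance

MAX_SUBCARRIERS = 128  # mirrors EDGE_MAX_SUBCARRIERS from the commit message


class BroadbandMotion:
    """Temporal variance of the per-frame broadband mean amplitude, mapped to [0, 1]."""

    def __init__(self, window=20, divisor=3.0):
        # window is a hypothetical value; the commit message does not state it
        self.history = deque(maxlen=window)
        self.divisor = divisor

    def push(self, amplitudes):
        # Truncate oversize frames instead of dropping them (the original bug
        # early-returned on every 192-subcarrier frame).
        amps = list(amplitudes)[:MAX_SUBCARRIERS]
        if not amps:
            return 0.0
        self.history.append(sum(amps) / len(amps))
        if len(self.history) < 2:
            return 0.0
        score = pvariance(self.history) / self.divisor
        return min(1.0, max(0.0, score))
```

A steady channel keeps the score at 0, while frame-to-frame swings in the broadband mean push it towards the clamp at 1.0.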
Operator's household environment showed CSI-variance presence detection failing — empty room produced HIGHER variance than an occupied room because ambient WiFi noise (neighbour APs, retransmits, BT-coex) dominated the broadband-variance signal at multi-meter range.

Deployed a TP-Link TL-WR841N in WISP mode as a dedicated isolated AP for the sensors:
* Sensors associate only with TP-Link_8340 (clean channel)
* TP-Link bridges to the household AP, NAT-forwards sensor UDP to the Mac
* Mac keeps its primary household-AP association — no LAN reconfig needed
* Empty-room variance dropped 50.7 → 35.8 (-30%)

Replaced presence classification with RSSI MAD-Δ override:
* Per-node rolling 120-sample (~10 s @ 12 Hz) window of frame RSSI
* Metric: mean(|Δrssi|) between consecutive frames — robust to int8 quantisation jitter
* Thresholds tuned for the operator's geometry:
    d <  0.20 → absent
    d <  0.55 → present_still
    d <  1.10 → present_moving
    d >= 1.10 → active
* Confidence field temporarily carries raw d for in-field threshold tuning
* CSI-based features (variance, motion_band_power, spectral_power) remain in features.* for vital-sign signal-quality and multi-node fusion paths

UI / tooling:
* New static/spectrum.html — live signal console: combined classification, all host-computed features (variance, motion_band, spectral, breathing band, RSSI, dominant_freq, change_points), per-node FW signals, and a 60-second variance trace. Served via `python -m http.server 8091`.
* static/calibrate.html — simpler per-node motion/presence/RSSI bars with peak-hold.

Desktop UI / discovery hardening (rolled in here because they came up during this debug session):
* commands/discovery.rs: HTTP sweep limited to 2..=60 hosts (was 1..=254), mDNS + UDP-broadcast paths disabled (current RuView FW doesn't advertise them and they were burning CPU every poll cycle). Per-request timeout set to 1500 ms with overall budget enforced via tokio::time::timeout + futures::join_all (replaces the previous sequential select loop that blocked on slow IPs).
* ui/hooks/useNodes.ts: poll interval 10 s → 30 s.
* ui/pages/Dashboard.tsx + NetworkDiscovery.tsx: merge new scan results into existing list instead of replacing — discovery races sometimes miss a node that was found a moment ago.

Firmware tuning:
* edge_processing.c: broadband-variance divisor /3.0 → /30.0 → /5.0 iterated; final /5.0 chosen for multi-meter geometry (sensor 1-3 m from activity zone). DEBUG_MOTION_DSP scaffolding removed.
* csi_collector.c: CSI_MIN_SEND_INTERVAL_US 20 ms → 4 ms so the host can see every available frame (real ceiling is the WiFi CSI callback rate).

Documentation:
* docs/adr/ADR-099 — full forensic write-up: measurement tables for sit/walk/empty, the RSSI-Δ rationale, the WISP setup procedure, calibration protocol for new deployments, and open items.

Verified end-to-end on hardware (sensors at 192.168.1.17/.19 → TP-Link at 192.168.1.14 → Mac at 192.168.1.21):
* UDP/5006 packets arrive ~12 Hz combined from both nodes
* Empty-room baseline d ≈ 0.49 measured (next: capture sit + walk to finalize thresholds)
* Vital signs continue to populate (breathing 9–11 BPM stable)
* Two consecutive OTA round-trips remain functional after the change

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
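The RSSI MAD-Δ override described in this commit is small enough to sketch in full. The Python below is an illustration under assumptions: the class name `RssiDelta` is hypothetical, and only the 120-sample window, the mean(|Δrssi|) metric and the four threshold bands are taken from the message.

```python
from collections import deque

# Upper bounds from the commit message; anything at or above 1.10 is "active".
THRESHOLDS = [(0.20, "absent"), (0.55, "present_still"), (1.10, "present_moving")]


class RssiDelta:
    """Per-node rolling window of frame RSSI with a mean(|delta|) metric."""

    def __init__(self, window=120):
        self.rssi = deque(maxlen=window)

    def push(self, rssi_dbm):
        self.rssi.append(rssi_dbm)

    def metric(self):
        vals = list(self.rssi)
        if len(vals) < 2:
            return 0.0
        # Mean absolute difference between consecutive frames.
        return sum(abs(b - a) for a, b in zip(vals, vals[1:])) / (len(vals) - 1)

    def classify(self):
        d = self.metric()
        for upper, label in THRESHOLDS:
            if d < upper:
                return label
        return "active"
```

Note that the measured empty-room baseline of d ≈ 0.49 quoted above would land in the `present_still` band with these thresholds, which matches the message's remark that the thresholds still need sit and walk captures before they are final.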
Ports Francesco Pace's ESPectre gain-lock (GPLv3) to RuView FW: takes the median of AGC and FFT scale over the first 300 packets after boot, then freezes both via phy_force_rx_gain / phy_fft_scale_force. With both sensors locked and proper AP→body→sensor geometry, a 30-s × 3-state capture (empty / still / walk) now separates by ×3.4–×5.9, where ADR-099 saw only ±0.02 differences inside ±0.10 noise. Adds static/raw.html — per-node 56-subcarrier amplitude bars + RSSI/broadband traces, no DSP, for live calibration. ADR-100 documents the technique, boot calibration values for the operator's deployment (AGC=42/44, both APPLIED), and the verified three-state separation table.
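The median-then-freeze logic can be illustrated without any radio calls. The sketch below is Python rather than firmware C, and `GainLock` and its method names are hypothetical; on the device the frozen values would be handed to phy_force_rx_gain / phy_fft_scale_force, which this sketch does not model.

```python
from statistics import median


class GainLock:
    """Collect AGC / FFT-scale samples, then freeze on their medians."""

    def __init__(self, calibration_packets=300):
        self.calibration_packets = calibration_packets
        self.agc_samples = []
        self.fft_samples = []
        self.locked = None  # (agc, fft_scale) once calibration completes

    def process(self, agc_gain, fft_scale):
        if self.locked is not None:
            return self.locked  # later packets never change the lock
        self.agc_samples.append(agc_gain)
        self.fft_samples.append(fft_scale)
        if len(self.agc_samples) >= self.calibration_packets:
            # The median ignores the occasional outlier packet during boot.
            self.locked = (int(median(self.agc_samples)),
                           int(median(self.fft_samples)))
        return self.locked
```

Using the median rather than the mean is what makes a single saturated packet during calibration harmless.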
After ADR-100 gain-lock reveals a clean baseline, the broadband CV of
mean amplitude separates EMPTY/STILL/WALK by 3-6× on the operator's
deployment where RSSI MAD-Δ overlapped within noise. Adds:
amp_presence_override(node_id, amps) — per-frame: rolling 4.5 s
short window for CV, 60 s long window for 95th-percentile baseline,
cross-node fusion (MAX CV gate, ANY baseline-drop → still),
3 s motion hysteresis to bridge step pauses.
amp_classify_from_latest() — readonly fusion for feature_state
(0xC5110006) and adaptive-model paths that don't carry raw amps.
Wired into the three SensingUpdate-producing paths (raw CSI,
feature_state, adaptive model). Marks rssi_presence_override as
dead_code, kept for reference.
Live test (10 samples @ 3 s):
walk: present_moving, CV 41-53 %, sustained through pauses
stop: absent (CV 4-8 %) after 3 s hold expires
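The short-window CV gate and the 3 s motion hold can be sketched for a single node as below. This is an assumption-laden illustration: `CvMotion` is a hypothetical name, the 15 % moving gate is the pre-NBVI value quoted later in this PR, and the 60 s baseline window, the `present_still` path and cross-node fusion are deliberately left out.

```python
from collections import deque
from statistics import mean, pstdev


class CvMotion:
    """Rolling coefficient of variation with a hold timer that bridges step pauses."""

    def __init__(self, short_win_s=4.5, hold_s=3.0, moving_cv=0.15):
        self.short_win_s = short_win_s
        self.hold_s = hold_s
        self.moving_cv = moving_cv
        self.samples = deque()       # (timestamp_s, broadband_mean)
        self.last_motion_t = None

    def update(self, t, broadband_mean):
        self.samples.append((t, broadband_mean))
        # Drop samples older than the short window.
        while self.samples and t - self.samples[0][0] > self.short_win_s:
            self.samples.popleft()
        vals = [v for _, v in self.samples]
        mu = mean(vals)
        cv = pstdev(vals) / mu if len(vals) > 1 and mu > 0 else 0.0
        if cv >= self.moving_cv:
            self.last_motion_t = t
        held = (self.last_motion_t is not None
                and t - self.last_motion_t <= self.hold_s)
        return ("present_moving" if held else "absent"), cv
```

The hold timer is what keeps the label at `present_moving` through a pause between steps, and lets it fall back only after the 3 s hold expires.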
Surfaces the raw-amplitude classifier's per-node decision in node_features[].classification so the UI can show which sensor is actually seeing motion at any moment. Lets the operator visually find the best sensor placement without physically moving things — just walk around and watch which badge lights up. Server side: adds amp_node_level() pure helper + amp_node_snapshot() that reads AMP_LATEST, then plugs it into build_node_features so the existing PerNodeFeatureInfo.classification carries the new labels. UI: adds a global badge in the top bar and a per-node badge inline in each h2, color-coded (grey/absent, blue/present_still, green/moving, red/active) plus the live per-node CV %.
…ture_state

Two server-side parsers (csi.rs::parse_esp32_frame and the duplicate in main.rs) read every field after `n_antennas` from offsets shifted by 2 bytes — n_subcarriers as u8 instead of u16, sequence at 10..14 instead of 12..16, rssi at 14 instead of 16. The saturating_neg() workaround hid the bug by always forcing a negative dBm value, so the trace looked plausible but was actually a byte from the middle of the sequence number. ADR-100 D3 documented this as an open item; this commit closes it.

Adds two regression tests in csi.rs (header-offset round-trip with distinctive values per field, plus 20-byte boundary case) so the layout contract can't drift again without CI catching it.

Even with both parsers correct, RSSI never reached the UI because the firmware now ships only rv_feature_state_t (0xC5110006) — raw CSI (0xC5110001) is no longer on the hot path. rv_feature_state had no RSSI field; both parsers fell back to a hard-coded rssi: -50.

To fix without a protocol bump: repurpose the first byte of the trailing `reserved` field (offset 54) as `int8_t rssi_dbm`. Firmware fills it from radio_ops::get_health()::rssi_median_dbm in emit_feature_state. Server reads buf[54] as i8; 0 means "not measured yet" → keeps the historical -50 fallback for backward compat with pre-update nodes.

Verified live on TP-Link WISP (192.168.0.100/101):
* node 1: -54 dBm
* node 2: -63 dBm
(was plateau -50.0 fallback)

Co-Authored-By: claude-flow <ruv@ruv.net>
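The server-side read of the repurposed byte is a one-liner plus a fallback. The Python sketch below mirrors the rule stated above (offset 54, signed 8-bit, 0 meaning "not measured yet"); the function name `feature_state_rssi` is hypothetical and the rest of the packet layout is not modelled.

```python
import struct

RSSI_OFFSET = 54          # first byte of the trailing `reserved` field
FALLBACK_RSSI_DBM = -50   # historical default for pre-update firmware


def feature_state_rssi(buf: bytes) -> int:
    """Return the RSSI carried in a feature_state packet, or the fallback."""
    if len(buf) <= RSSI_OFFSET:
        return FALLBACK_RSSI_DBM
    (raw,) = struct.unpack_from("<b", buf, RSSI_OFFSET)  # signed int8
    # 0 is the "not measured yet" sentinel written by older nodes.
    return raw if raw != 0 else FALLBACK_RSSI_DBM
```

Reading the byte as signed is the important part: 0xCA must decode to -54 dBm, not 202.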
…ated)

Adds the scaffolding for Narrow-Band Vital Information ranking: an exponentially-weighted moving variance per subcarrier (alpha = 0.02 → tau ≈ 10 s at 5 pps), refreshed every 25 frames into a stable_bin mask = bins whose EMA variance is below the across-band median. The intended payoff is to drive per-node CV in STILL down by averaging broad_mean_amp_history over quiet bins only (instead of all 128), so ADR-101's STILL/EMPTY classifier separates them at a smaller body block.

Activated path is REVERTED in this commit on purpose. Quiet bins by construction barely move, so windowed variance of their mean collapses to ~0 and motion_energy goes constant. Empirical verification 2026-05-17: motion_score pinned at 0.013/0.021 with std=0 across 125 frames after turning quiet-only averaging on; reverted to full-band push_val for motion_energy with a comment explaining why.

The right shape is a second channel in rv_feature_state_t carrying "baseline_quiet" alongside motion_score so the server can use one for classification and the other for motion gating — that's an additive protocol bump and a separate change. EMA state lands now so we don't have to wire it back from scratch when we do it.

Also kept from the earlier session: the n_subcarriers > 128 truncate fix (root cause of motion_energy = 0 — process_frame used to early-return on 384-byte CSI frames from this silicon) and the broadband-mean amplitude history that feeds Step 8.

Co-Authored-By: claude-flow <ruv@ruv.net>
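The per-subcarrier EMA variance and the below-median mask can be sketched as below. This is a Python illustration of the scaffolding, not the firmware code: `EmaVariance` is a hypothetical name, and the update rule is the standard exponentially-weighted variance recurrence, which is an assumption about how the firmware computes it. With alpha = 0.02 the time constant is 1/alpha = 50 frames, i.e. about 10 s at 5 pps as stated above.

```python
from statistics import median


class EmaVariance:
    """Exponentially-weighted mean/variance per subcarrier plus a stable-bin mask."""

    def __init__(self, n_bins, alpha=0.02, refresh_every=25):
        self.alpha = alpha
        self.refresh_every = refresh_every
        self.mean = [0.0] * n_bins
        self.var = [0.0] * n_bins
        self.stable = [True] * n_bins
        self.frames = 0

    def push(self, amps):
        a = self.alpha
        if self.frames == 0:
            self.mean = [float(x) for x in amps]  # seed the mean, variance stays 0
        else:
            for k, x in enumerate(amps):
                diff = x - self.mean[k]
                self.mean[k] += a * diff
                self.var[k] = (1.0 - a) * (self.var[k] + a * diff * diff)
        self.frames += 1
        if self.frames % self.refresh_every == 0:
            med = median(self.var)
            # A bin is "stable" when its EMA variance sits below the band median.
            self.stable = [v < med for v in self.var]
        return self.stable
```

The mask only says which bins are quiet; as the revert note explains, averaging over those bins alone is unsuitable as a motion signal because their variance is near zero by construction.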
scripts/ota-deploy.sh
Python 3 helper (the earlier bash version tripped over macOS bash 3.2's
missing associative arrays). One invocation with no arguments:
1. discovers nodes in the local /24 via ARP + /ota/status:8032 probe;
2. POSTs the firmware blob to every node in parallel;
3. waits for reboot, polls /ota/status until running_partition flips,
and fails-loud if any node stays on the old partition (typical
symptom of a panic on first boot from the new slot).
Supports `--build` (idf.py build first), `--no-verify`, explicit IP
list, and OTA_PSK=<token> for the ADR-050 Bearer auth path.
Measured cycle: ~25 s end-to-end for both room01 + room02.
static/mobile.html
Mobile-first sibling of static/raw.html. The desktop page is unreadable
on a 360-420 px screen — bars chart fights the narrow viewport, 11-12 px
font, controls overlap the badge. The mobile page:
- sticky global badge (30 px) + connection pill + reset (44 px tap);
- per-node card with 22 px node badge, 18 px stat tiles, 90 px trace;
- drops the bars chart (useless under 600 px wide);
- viewport-fit=cover, theme-color, apple-mobile-web-app meta tags;
- high-contrast palette tuned for outdoor light;
- reuses the /ws/sensing contract verbatim — anything that lights up
raw.html lights this up too.
main.rs ServeDir route
Adds `.nest_service("/static", ServeDir::new(.../static))` so
raw.html / mobile.html / calibrate.html / spectrum.html are served on
the main 8080 port. Previously they needed a separate
`python -m http.server 8091`, which the operator had to remember to
start by hand on every deploy. Now there's exactly one URL per device.
Reachable from a phone on the LAN:
http://<mac>:8080/static/mobile.html
http://<mac>:8080/static/raw.html
Co-Authored-By: claude-flow <ruv@ruv.net>
* nodes[].rssi_dbm of 0 used to display literally as "0.0 dBm", misleading the operator when rssi_history was empty on the first few ticks. It is now coerced to "--", and zeros are no longer pushed to the trace.
* per-node fps was the instantaneous 1/dt, which spiked to 235 when multiple SensingUpdate emit paths fired back-to-back. Replaced with a 1-second windowed counter — now matches the real ~38 fps per node.
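A windowed counter is immune to back-to-back updates in a way that 1/dt is not. The sketch below is Python for illustration (the page itself runs JavaScript); `WindowedFps` is a hypothetical name and timestamps are passed in explicitly so the behaviour is easy to check.

```python
from collections import deque


class WindowedFps:
    """Frames per second as a count of ticks inside a sliding window."""

    def __init__(self, window_s=1.0):
        self.window_s = window_s
        self.stamps = deque()

    def tick(self, t):
        self.stamps.append(t)
        # Discard ticks that have aged out of the window.
        while self.stamps and t - self.stamps[0] >= self.window_s:
            self.stamps.popleft()
        return len(self.stamps) / self.window_s
```

Two updates 1 ms apart would read as 1000 fps with the instantaneous formula; here they add a single extra count to the window.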
Ports Pace's NBVI = α·(σ/μ²) + (1-α)·(σ/μ) (α=0.5) into the amp_presence_override classifier. Per node, accumulates a 30-second ring of full amplitude vectors, every ~5 s ranks the subcarriers, picks top-12 by lowest NBVI, then computes broadband mean and CV ONLY on that subset instead of all 56 subcarriers.

Live impact on the operator's deployment (idle room, 2 pps ping):
* node 1 CV: 5% -> 3.1% (-38 %)
* node 2 CV: 7% -> 3.9% (-44 %)

Thresholds tightened proportionally to match the new baseline:
* active: 30 % -> 22 %
* present_moving: 15 % -> 10 %

This lets the detector catch subtler motion (e.g. waving while seated) without raising the false-positive rate above what we had before. Implemented entirely server-side — no firmware change, no second flash cycle. Algorithm parameters in const block for easy retuning.
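The ranking step follows directly from the formula. The Python sketch below assumes σ and μ are the standard deviation and mean of one subcarrier's amplitude over the calibration ring; the function names are hypothetical and the real implementation lives in the Rust server.

```python
from statistics import mean, pstdev

ALPHA = 0.5
TOP_K = 12


def nbvi(series, alpha=ALPHA):
    """NBVI = alpha * (sigma / mu^2) + (1 - alpha) * (sigma / mu) for one subcarrier."""
    mu = mean(series)
    if mu <= 0:
        return float("inf")  # dead subcarriers rank last
    sigma = pstdev(series)
    return alpha * (sigma / mu ** 2) + (1 - alpha) * (sigma / mu)


def select_top_k(history, k=TOP_K):
    """history: list of amplitude vectors (one per frame). Returns the k lowest-NBVI indices."""
    n_sub = len(history[0])
    scores = [(nbvi([frame[i] for frame in history]), i) for i in range(n_sub)]
    return sorted(i for _, i in sorted(scores)[:k])


def subset_cv(history, subset):
    """Coefficient of variation of the broadband mean over the chosen subset only."""
    means = [mean(frame[i] for i in subset) for frame in history]
    mu = mean(means)
    return pstdev(means) / mu if mu > 0 else 0.0
```

Restricting the CV to low-NBVI subcarriers is what lowers the idle baseline, which in turn is why the thresholds could be tightened.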
After 3393c1e made FW emit ~80 % feature_state packets and ~20 % raw CSI, the server's feature_state path was overwriting NodeInfo.amplitude with vec![] on every feature_state tick. raw.html's per-node bar chart ended up freezing for hundreds of milliseconds between rare raw-CSI packets, and /api/v1/sensing/latest mostly snapshotted an empty amps vector even though raw CSI was flowing. Fix: in the feature_state SensingUpdate builder, hand out ns.frame_history.back() (the last raw amps vector that the raw-CSI path pushed) instead of an empty Vec. Bars now refresh on every WS update (verified: 100/100 updates carry amps in a 4-s sample, was ~20/100 before the patch). Classifier behaviour unchanged — amp_presence_override still runs only when actual raw CSI arrives; this only affects what the UI displays.
Operator request: only one UI page open. raw.html (ADR-099 console, extended in ADR-101 with per-node classification badges) covers all live-debug use cases. mobile.html / spectrum.html / calibrate.html were either superseded or never adopted in the field — removing them reduces the surface that has to track ADR-101/102 contract changes. raw.html stays at /static/raw.html on the existing :8080 listener.
* docs/references/espectre-techniques.md — catalogues every Pace technique from Part-2 against what RuView has implemented, doesn't have, or has differently. Includes ranked open-items list.
* sensing-server: revert feature_state path to vec![] amplitudes. The previous fix made bars LOOK live by reissuing the last raw-CSI vector on every feature_state tick — operator reported this made the bars misleading (visually busy but unresponsive to movement). raw.html already skips empty-amp updates so bars now refresh only on actual fresh CSI, which is honest.
* raw.html: comment on the skip-empty branch for future-me.
Problem from ADR-103 v1: persisted NBVI-subset mean (19.86 in operator's
recording) drifted out of comparability after server restart because
NBVI re-selected a different top-12 subset, yielding a different mean
from the same channel. The classifier saw a current/baseline ratio > 1
even in a clearly empty room.
Fix:
1. Separate FULL-broadband mean (all non-zero subcarriers) from
NBVI-subset mean in amp_presence_override. NBVI subset still drives
CV / motion sensitivity. FULL is what gets compared to the
persistent baseline — stable across NBVI re-selection.
2. baseline.json schema v2: full_broadband_{mean,p50,p95,std,cv_pct}
replaces NBVI-only p95_amp/mean_amp. Loader prefers full_*; falls
back to legacy fields for backward compat.
3. NBVI Step 1 quiet-window finder (ESPectre): nbvi_select_top_k now
slides a window across the calibration history, picks the lowest-CV
sub-window, and ranks subcarriers using only that. Robust to brief
motion during the calibration buffer.
4. scripts/record-baseline.py v2: emits v2 schema, computes
full-broadband stats per node, trims head/tail transients, picks
cleanest 30-s sub-window, also saves per_subcarrier_mean for future
subcarrier-level comparison.
Operator workflow now: step out → run script → restart server →
forget about the empty-room ritual forever.
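The quiet-window finder from item 3 amounts to a sliding-window search for the lowest CV. The sketch below is a Python illustration on a scalar broadband series; `quietest_window` is a hypothetical name, and the real nbvi_select_top_k then ranks subcarriers using only the frames inside the returned window.

```python
from math import inf
from statistics import mean, pstdev


def quietest_window(broadband, win):
    """Return (start_index, cv) of the lowest-CV window of length `win`."""
    best_start, best_cv = 0, inf
    for start in range(0, len(broadband) - win + 1):
        chunk = broadband[start:start + win]
        mu = mean(chunk)
        cv = pstdev(chunk) / mu if mu > 0 else inf
        if cv < best_cv:
            best_start, best_cv = start, cv
    return best_start, best_cv
```

Because only the calmest stretch is used, someone briefly walking through during the capture does not contaminate the ranking.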
Pace's Problem #3 ("threshold=1.0 means different things on different devices") solved by normalizing the runtime CV against the empty-room baseline CV measured during calibration.

norm_cv = current_cv / baseline_cv gates:
* norm_cv ≥ 3.0 → present_moving
* norm_cv ≥ 6.0 → active

Baseline CV loaded per-node from data/baseline.json (full_broadband_cv_pct). When no calibration is loaded, falls back to the absolute gates (0.10 / 0.22) that were deployment-tuned earlier — keeps backwards compatibility. Both per-node `amp_node_level` and global `amp_classify_from_latest` use the same normalization. On the operator's deployment with baseline CV ~4 %, the universal 3×/6× gates map to ~12 %/24 % absolute — close to the 10 %/22 % hard-coded thresholds, but now any-room-portable.
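The gate selection reduces to one branch on whether a baseline CV is available. The Python sketch below uses the constants from the message; `classify_cv` and the `below_motion_gate` label are hypothetical, since the absent / present_still decision is made by other channels in the real classifier.

```python
NORM_MOVING, NORM_ACTIVE = 3.0, 6.0    # universal gates on cv / baseline_cv
ABS_MOVING, ABS_ACTIVE = 0.10, 0.22    # fallback when no calibration is loaded


def classify_cv(cv, baseline_cv=None):
    """Map a runtime CV (as a fraction) to a motion level."""
    if baseline_cv is not None and baseline_cv > 0:
        value, moving, active = cv / baseline_cv, NORM_MOVING, NORM_ACTIVE
    else:
        value, moving, active = cv, ABS_MOVING, ABS_ACTIVE
    if value >= active:
        return "active"
    if value >= moving:
        return "present_moving"
    return "below_motion_gate"
```

With a 4 % baseline, a 13 % runtime CV normalizes to 3.25 and lands in `present_moving`, the same outcome the absolute gate would give on this deployment.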
* ADR-101 raw-amplitude presence/motion classifier — per-node and cross-node fusion logic, hysteresis, per-node UI surface (`PerNodeFeatureInfo.classification` override).
* ADR-102 server-side NBVI subcarrier selection — formula, dead-zone gate, ESPectre Step-1 quiet-window finder, why we split FULL vs NBVI-subset broadband.
* ADR-103 persistent baseline + universal threshold normalization — JSON schema v2 at `v2/data/baseline.json`, FULL-broadband over NBVI for cross-restart stability, `norm_cv = cv / baseline_cv` with universal 3×/6× gates, recording script workflow.
* Updated espectre-techniques.md to reflect the DONE items (Steps 1+2+4 of NBVI, baseline persistence, universal threshold) and the remaining open items in priority order.

Each ADR ≤ 200 lines per the operator's docs convention; deep detail lives in `docs/references/espectre-techniques.md` (also ≤ 200) which the ADRs link to. README.md and CLAUDE.md unchanged (no extra content added; existing >200-line state pre-dates this session).
…section Catalogues, section-by-section against Pace's Part-2 article, every ESPectre technique RuView has and does not have, plus a prioritized roadmap (9 items, NVS persistence and FP-rate validation top of list). Replaces the 8-item inline "open items" stub in espectre-techniques.md with a 1-line forward link. Both files stay ≤ 200 lines per the docs convention.
…l data
Operator inspected the rich Docker UI tied to our backend and noticed
the dashboard showed a 17-keypoint skeleton even with no DensePose
model loaded. Tracing it: `derive_pose_from_sensing` synthesized
geometric placeholders, `pose_stats.average_confidence` was hard-coded
0.87, `pose_zones_summary` invented zones 2/3/4 as "clear", and
`/api/v1/info.features.pose_estimation` claimed `true` regardless.
All cosmetic noise that hid the real capability gap.
Changes:
* `derive_pose_from_sensing` is now an inert `Vec::new()` stub.
Heuristic logic kept in `derive_single_person_pose` (dead-code-warned
out by the rustc unused-fn lint) for the day someone wires a real
trained pose model in.
* `pose_current` returns persons only when `model_loaded == true`; the
endpoint always includes `model_loaded` so the UI can decide what
to render.
* `pose_stats` drops the fake `average_confidence: 0.87`.
* `pose_zones_summary` reports `zones_configured: 0` and an empty
`zones {}` instead of fabricating four zones.
* `api_info.features.pose_estimation` now mirrors `s.model_loaded`.
Sensing endpoints (`/api/v1/sensing/latest`, `/ws/sensing`) are
unchanged — they always carried real ESP32-derived data per ADR-101.
Continuation of ADR-105 (no synthetic outputs in production runtime).

The 20×20 SignalField heatmap was generated by mapping subcarrier index k to angle 2π·k/N and dropping a Gaussian hotspot — a totally fabricated spatial layout. A single sensor has no directional info so the resulting heatmap had no correspondence to where anything actually was in the room; UI showed believable-looking but physically meaningless hotspots. Operator asked for boots-on-the-ground honesty.

`generate_signal_field` now returns a zero-filled 20×1×20 grid. UI renders blank, which is the truthful state until a real multistatic localizer is wired (multi-AP attention from ADR-008 or the `MultistaticFuser` already in code).

Audit of remaining fields confirmed they are either:
- already gated on real data (vital_signs returns None when br < 1 BPM, persons/pose_keypoints/posture/signal_quality_score all None without model loaded),
- or processed from real CSI (classification, features.mean_rssi, features.variance, enhanced_motion when multi-AP pipeline active).

`--source simulate` was already disabled by an earlier change (exit code 2). `--pretrain` and `--train` synthetic fallbacks remain in code as developer tools but never touch the runtime sensing path.
Records the cleanup of six fake outputs the rich Docker UI exposed when pointed at our backend without a trained pose model loaded:
* D1 derive_pose_from_sensing → Vec::new()
* D2 pose_current → gated on s.model_loaded
* D3 pose_stats → drop hard-coded average_confidence 0.87
* D4 pose_zones_summary → drop fabricated zones, report real presence
* D5 api_info.pose_estimation → reflects s.model_loaded
* D6 generate_signal_field → returns zero-filled grid (was procedural)

Two implementation commits already on the branch: 9aa027e and 30244d2. Audit table confirms /api/v1/sensing/latest now carries only real ESP32-derived state. Out-of-scope items (--source simulate already disabled; --pretrain/--train synthetic fallbacks are explicit dev flags; vital_signs already gated on real detection) are documented so the next reader doesn't re-audit them.
Operator asked for maximum raw signal off the sensors so a future trained pose / fine-motion model has everything it needs, instead of only the amplitude scalar we surfaced before.

Adds four fields to NodeInfo:
* phases: Vec<f64> — per-subcarrier atan2(Q,I), radians
* n_antennas: u8 — RX antenna count from WiFi driver
* noise_floor_dbm: i8 — noise floor reported by ESP-IDF
* timestamp_us: u64 — per-frame µs timestamp from the sensor

Each is `skip_serializing_if = zero-or-empty` so feature_state ticks (which carry no raw CSI) stay slim in the WS payload — only real raw CSI frames populate them.

NodeState gains: latest_phases / latest_noise_floor / latest_n_antennas / latest_timestamp_us (per-node stash, replaces having to keep a parallel phase_history). The raw-CSI ingest path populates these on every frame.

Verified live: WS now emits 185 messages over 4 s (~46 fps) with both amplitude[56] and phases[56] populated; noise_floor reports -91 dBm; n_antennas reports 1 (ESP32-S3 single antenna).
Continuation of ADR-106 (max raw signal off sensors). Operator was running `ping -i 0.05 192.168.0.101 &` by hand to keep CSI callbacks firing on the sensors. Server now does this itself:
* Track per-node source addresses in NODE_ADDRS, populated on every recv_from via a cheap magic-byte peek (works for 0xC5110001 raw, 0xC5110002 vitals, 0xC5110006 feature_state).
* csi_keepalive_task spawns one `ping -i <interval> <ip>` child per discovered sensor, re-spawns if the child dies or the sensor IP changes. Default 25 pkt/s via --csi-keepalive-pps; 0 disables.

Why ICMP, not UDP: we tried a UDP-based keepalive (a tiny UDP packet sent to the sensor's known src port). The sensor rejected the closed-port UDP packet before the CSI callback fired on its side. ICMP echo gets handled in the WiFi stack regardless of any user-space listener, so CSI fires reliably.

Verified live, no external `ping` running:
* keepalive: ping -i 0.040 192.168.0.101 for node 1
* node 1: 55.6 Hz raw CSI (amp+phase populated)
* node 2: 55.6 Hz raw CSI (amp+phase populated)

Combined with ADR-106 NodeInfo fields (phases, noise_floor_dbm, n_antennas, timestamp_us) this gives downstream consumers — UI, classifier, future ML model — the full complex CSI signal at high rate without any operator-side ritual.
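The two decisions the keepalive task makes, which interval to pass to `ping` and which children need (re)spawning, are pure functions and can be sketched without touching the network. This is a Python illustration with hypothetical names (`keepalive_argv`, `respawn_plan`); only the `ping -i <interval> <ip>` shape, the pps-to-interval relation and the respawn-on-IP-change rule come from the message, and restarting children that died is left to the caller.

```python
def keepalive_argv(ip, pps):
    """Command line for one keepalive child, or None when disabled (pps == 0)."""
    if pps <= 0:
        return None
    # 25 pkt/s -> 0.040 s between echo requests.
    return ["ping", "-i", f"{1.0 / pps:.3f}", ip]


def respawn_plan(running, node_addrs):
    """Nodes whose child must be started or restarted.

    running:    node_id -> IP the current child is pinging
    node_addrs: node_id -> most recent source IP seen from that node
    """
    return {node: ip for node, ip in node_addrs.items() if running.get(node) != ip}
```

For the default rate this reproduces the command seen in the live check, `ping -i 0.040 192.168.0.101`.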
Records the two-part change that gets the maximum raw signal off the
sensors so the future model — and current fine-motion detection —
has everything the parent project describes:
D1 NodeInfo exposes phases[56], n_antennas, noise_floor_dbm,
timestamp_us in the WS payload (was amplitude-only).
D2 NodeState stashes latest phases/noise/timestamp/antenna count
so build_node_features can populate the new fields uniformly
without a parallel phase_history buffer.
D3 csi_keepalive_task spawns managed `ping` children per
discovered sensor address; replaces the operator's hand-run
`ping -i 0.05 …` workflow. CLI --csi-keepalive-pps controls
rate (default 25), 0 disables.
D4 Why ICMP not UDP: sensor rejects closed-port UDP before its
CSI callback fires; ICMP is handled in WiFi RX path regardless.
Verified: 55.6 Hz raw CSI per node with no shell ping; both
amplitude[56] and phases[56] populated; noise_floor=-91 dBm.
Two impl commits already on the branch: 4daa2c9, 8489efe.
Closes the first ADR-106 open item without an FW change. On every raw-CSI frame we now stamp `ns.latest_timestamp_us` with SystemTime::now() in µs since UNIX epoch. NodeInfo.timestamp_us surfaces it on WS via the already-wired skip_serializing_if guard. Accuracy is wall-clock + Mac monotonic + LAN jitter ≈ ~1 ms. Verified cross-node skew ts(node1) - ts(node2) = 1556 µs in a single test, well within the 5-10 ms tolerance needed for FFT-based vital-signs correlation across sensors. Sensor-side ESP-IDF rx_ctrl.timestamp (true RX-time µs) is still better and remains on the open list for a future FW header bump (reserved bytes [18..19] are only 2 of the 4 we'd need — header extension required, opt-in via new magic).
Eliminates the manual `scripts/record-baseline.py` ritual:
REST endpoints
GET /api/v1/baseline — current per-node baseline +
last_written_sec_ago + calibration_status
POST /api/v1/baseline/calibrate — start a background capture, optional
JSON body { duration_sec, trim_sec,
clean_window_sec, out }. Returns
immediately; status transitions
idle → running → complete | error: ...
Auto-recalibrate background task
Watches the live classifier. When motion_level=="absent" and CV<0.08 for
--auto-recalibrate-quiet-sec (default 1800 = 30 min) AND the last write
is older than --auto-recalibrate-min-age-sec (default 3600 = 1h),
silently re-runs the capture and live-reloads the override map. No
operator action needed.
Implementation
capture_baseline_to_disk() — in-process port of record-baseline.py:
trim head/tail, scan windows for lowest-
CV chunk, compute full-broadband stats,
write baseline.json, hot-reload override.
BASELINE_BUS — broadcast bus carrying every sensing_update
JSON so the capture can read live frames
without re-binding any sockets.
BASELINE_LAST_WRITTEN — SystemTime tracker for the cool-down.
BASELINE_CALIBRATION_STATUS — status string for the REST endpoint.
Verified live: POST /api/v1/baseline/calibrate (5 s test window) ->
capture wrote `/tmp/test_baseline.json` with n_samples=86 per node,
override hot-reloaded (visible via GET /api/v1/baseline). Real baseline
restored on next server restart from data/baseline.json.
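The auto-recalibrate trigger combines a quiet-streak timer with a cool-down on the last write. The Python sketch below is an illustration under assumptions: `QuietTracker` and `should_recalibrate` are hypothetical names, while the absent + CV < 0.08 condition and the 1800 s / 3600 s defaults are the ones listed above.

```python
class QuietTracker:
    """Tracks how long the room has been continuously absent with low CV."""

    def __init__(self, cv_max=0.08):
        self.cv_max = cv_max
        self.quiet_since = None

    def update(self, t, motion_level, cv):
        if motion_level == "absent" and cv < self.cv_max:
            if self.quiet_since is None:
                self.quiet_since = t
        else:
            self.quiet_since = None  # any activity resets the streak
        return 0.0 if self.quiet_since is None else t - self.quiet_since


def should_recalibrate(quiet_for_s, baseline_age_s,
                       quiet_sec=1800, min_age_sec=3600):
    """True when the room has been quiet long enough and the baseline is stale."""
    return quiet_for_s >= quiet_sec and baseline_age_s > min_age_sec
```

Both conditions must hold, so a long empty night triggers at most one capture per cool-down period rather than one every 30 minutes of silence.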
UI side of ADR-107: green "calibrate empty" button in raw.html next
to the existing reset/log-y controls. Click → confirm dialog tells
the operator to step out → POST /api/v1/baseline/calibrate with
90 s capture window → polls GET /api/v1/baseline every 2 s, surfaces
"recording… N/90 s" then "baseline updated ✓".
ADR-107 documents:
D1 in-process capture_baseline_to_disk (port of record-baseline.py)
D2 BASELINE_BUS broadcast forwarder so capture stays decoupled from
WS clients
D3 POST /api/v1/baseline/calibrate (immediate ack, background work)
D4 GET /api/v1/baseline (current state + cooldown + status)
D5 auto_recalibrate_task — 30-min absent+low-CV trigger, 1-h cooldown
D6 raw.html button + polling
Saves the comprehensive OTA pipeline reference written by another agent so future sessions don't lose the diagnostic flowchart or the "three FW prerequisites" causal chain. Tested live against current FW (v0.6.4): port 8032 reachable on both sensors, scripts/ota-deploy.sh round-trip works, both nodes successfully switched partitions (ota_0 ↔ ota_1) without USB+BOOT dance. OTA is the supported path for future FW changes from this session — sensor µs timestamp (ADR-106 open item), NVS persistence of gain-lock (gap-analysis #5), and any larger FW work. Kept whole (329 lines, over the usual 200-line cap for docs) because the flowchart and pitfall table lose meaning if split. The cap is a guideline for new project ADRs; a verbatim recipe is justified by diagnostic value.
… via OTA

Closes ADR-106 open item #1: server now receives the real WiFi RX timestamp from the sensor's hardware controller instead of stamping on receipt with SystemTime.

FW (csi_collector.c csi_serialize_frame): Append uint32_t = info->rx_ctrl.timestamp (µs since FW boot, monotonic per ESP-IDF docs) as 4 trailing bytes after I/Q data. Header layout unchanged → old server parsers still work (they ignore tail bytes per existing `if buf.len() >= expected` check).

Server (parse_esp32_frame): Opportunistically read trailing 4 bytes as u32 LE into Esp32Frame.sensor_timestamp_us. Old FW → None, new FW → Some(µs). udp_receiver_task uses sensor timestamp when present, falls back to server SystemTime if not. Result published as NodeInfo.timestamp_us.

Flashed both sensors via OTA (no USB dance):
* 192.168.0.101: ota_0 → ota_1 ✓
* 192.168.0.100: ota_1 → ota_0 ✓

Live verify: WS timestamps now sub-1e12 (sensor monotonic, ~39 s after FW boot), Δ between successive frames = 43.3 ms ≈ 23 fps sampling jitter, sub-ms precision. Cross-node skew = sensor boot time delta (here ~292 ms). For sync the host can subtract per-node boot offset learned from the first packet pair.
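The opportunistic tail read can be sketched as below. This is a Python illustration, not the Rust parser: `sensor_timestamp_us` and `frame_timestamp_us` are hypothetical names, and `expected_len` stands for header plus I/Q length, which the caller is assumed to know from the header fields.

```python
import struct


def sensor_timestamp_us(buf: bytes, expected_len: int):
    """Trailing u32 LE timestamp if the frame carries one, else None (old FW)."""
    if len(buf) >= expected_len + 4:
        (ts,) = struct.unpack_from("<I", buf, expected_len)
        return ts
    return None


def frame_timestamp_us(buf: bytes, expected_len: int, server_now_us: int) -> int:
    """Prefer the sensor's RX timestamp, fall back to the server clock."""
    ts = sensor_timestamp_us(buf, expected_len)
    return ts if ts is not None else server_now_us
```

One consequence of the 4-byte field worth keeping in mind: a u32 microsecond counter wraps after 2^32 µs, roughly 71.6 minutes, so any per-node boot-offset bookkeeping on the host has to tolerate the value jumping back to zero.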
…sence
ADR-102 Step 3 (FP-rate validation) — `nbvi_select_top_k` no longer
takes the literal top-K. Evaluates candidate K ∈ {6,8,10,12,16,20}
over the quiet window: for each, computes per-subset broadband CV
on a sliding sub-window and counts how many sub-windows cross the
moving threshold (0.10). Picks smallest K with fewest "false
positives" (ties broken by smallest total-NBVI). Defends against
the rare case where the literal top-12 happens to include a
subcarrier overlapping a noise source — the FP count surfaces it
and a tighter K wins.
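A rough sketch of the FP-count selection, under stated assumptions: the struct, the function names, and the exact tie-break order (fewest false positives, then smallest total NBVI, then smallest K) are one reading of the description above, not the code in `nbvi_select_top_k`.

```rust
/// Coefficient of variation (std / mean) of one sub-window.
fn cv(window: &[f64]) -> f64 {
    let n = window.len() as f64;
    let mean = window.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return 0.0;
    }
    let var = window.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    var.sqrt() / mean
}

/// Counts sliding sub-windows of a quiet-period series whose CV reaches `threshold`.
pub fn count_false_positives(series: &[f64], sub_win: usize, threshold: f64) -> usize {
    if sub_win == 0 {
        return 0; // slice::windows(0) would panic
    }
    series.windows(sub_win).filter(|w| cv(w) >= threshold).count()
}

/// One evaluated candidate subset size (hypothetical struct for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Candidate {
    pub k: usize,
    pub false_positives: usize,
    pub total_nbvi: f64,
}

/// Fewest false positives wins; ties go to smallest total NBVI, then smallest K.
pub fn pick_k(cands: &[Candidate]) -> Option<Candidate> {
    cands.iter().copied().min_by(|a, b| {
        a.false_positives
            .cmp(&b.false_positives)
            .then(a.total_nbvi.total_cmp(&b.total_nbvi))
            .then(a.k.cmp(&b.k))
    })
}
```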
ADR-104 (off-axis presence via per-subcarrier drift) — when
baseline.json carries `per_subcarrier_mean` for a node, server
loads the vector into AMP_BASELINE_PER_SUB. Each classifier tick
computes `drift = mean |Δ amp / baseline|` over the recent
AMP_SHORT_WIN frames vs that baseline. Drift ≥ 10 % → trigger
`present_still` even if broadband mean barely shifted. Catches the
case where the operator is in the room but off the AP→sensor line,
so individual subcarriers are perturbed without a global drop.
amp_node_level / amp_node_snapshot — per-node drift trigger
amp_classify_from_latest — cross-node MAX drift trigger
Drift channel is opportunistic: if baseline.json predates ADR-104
(no per_subcarrier_mean field), drift = 0 and classifier behaves
exactly as before. Re-record baseline via the calibrate-empty button
to populate the field and activate the channel.
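The drift metric can be illustrated with a short sketch. It assumes the mean is taken over every (frame, subcarrier) pair in the short window and that zero-baseline subcarriers are skipped; the function names and the `DRIFT_THRESHOLD` constant are hypothetical, and the real classifier may aggregate differently.

```rust
/// Drift threshold from the description above (10 %).
pub const DRIFT_THRESHOLD: f64 = 0.10;

/// Mean relative deviation |amp - baseline| / baseline over recent frames.
/// `recent` holds one per-subcarrier amplitude vector per frame.
pub fn subcarrier_drift(recent: &[Vec<f64>], baseline: &[f64]) -> f64 {
    if baseline.is_empty() {
        return 0.0; // baseline.json without per_subcarrier_mean: channel inactive
    }
    let mut sum = 0.0;
    let mut n = 0usize;
    for frame in recent {
        for (amp, base) in frame.iter().zip(baseline) {
            if *base != 0.0 {
                sum += ((amp - base) / base).abs();
                n += 1;
            }
        }
    }
    if n == 0 { 0.0 } else { sum / n as f64 }
}

/// True when the drift channel alone would report `present_still`.
pub fn drift_says_present(drift: f64) -> bool {
    drift >= DRIFT_THRESHOLD
}
```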
ADR-108: after the first successful gain-lock on FW, save the AGC and
FFT median values to NVS (namespace "csi_cfg", keys "gl_agc" / "gl_fft").
On every subsequent boot the FW loads them and immediately calls
phy_force_rx_gain / phy_fft_scale_force without waiting 300 packets
(~3-12 s) for fresh calibration.
Mechanics:
rv_gain_load_from_nvs / rv_gain_save_to_nvs — small NVS helpers in
the gain-lock module.
rv_gain_lock_process — `s_nvs_checked` static gate triggers a one-
shot load on the first packet after boot. If
a saved AGC ≥ MIN_SAFE_AGC is found, lock
immediately + mark locked. Otherwise fall
through to the existing 300-packet sampler.
Existing lock branch — after the median + force_*, save to NVS so
the next boot has the values.
Verified live: second OTA → 44 Hz raw CSI at WS in the first 3-s
sample after boot (was ~5-12 s gap before). Both nodes flashed via
WiFi (no USB), no MIN_SAFE_AGC skip in operator's deployment (AGC=44).
Tradeoff: NVS values are tied to sensor location + AP MAC + channel +
antenna. If the operator moves the sensor or swaps the AP, stale
values may be slightly off-optimal until they re-trigger calibration.
Today: erase NVS keys via console; future: dedicated FW endpoint.
New REST endpoint on FW HTTP server (port 8032) writes csi_cfg/target_ip + target_port to NVS and reboots. Body is plain text "IPv4:PORT" (e.g. 192.168.0.103:5005). Verified on both 192.168.0.100 and 192.168.0.101 — sensors silent after Mac IP move came back online in ~3 min instead of needing USB. Same PSK auth as /ota/recalibrate (ADR-050). Strict body parser rejects malformed input before touching NVS. Binary size +1 KB. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
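For reference, the "strict body parser" rule can be sketched like this. The firmware handler is C; this Rust version only illustrates the validation idea, and `parse_target` plus its exact rejection rules (no whitespace, no sign, no port 0, no 0.0.0.0) are assumptions for the example.

```rust
use std::net::Ipv4Addr;

/// Parses a plain-text "IPv4:PORT" body, rejecting anything else.
pub fn parse_target(body: &str) -> Option<(Ipv4Addr, u16)> {
    let (ip, port) = body.split_once(':')?;
    // str::parse::<u16> accepts a leading '+', so check digits explicitly.
    if port.is_empty() || !port.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let ip: Ipv4Addr = ip.parse().ok()?;
    let port: u16 = port.parse().ok()?; // fails on values above 65535
    if port == 0 || ip.is_unspecified() {
        return None;
    }
    Some((ip, port))
}
```

Validating before any NVS write keeps a malformed request from leaving a half-updated target behind.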
Pure-Rust port of scripts/train-wiflow-supervised.js inference path.
Loads ruv/ruview/wiflow-v1.json (lite scale, 186946 params) — base64
weights, 2 TCN blocks (k=3, d=[1,2]), 35→32→32 channels, FC 640→256→34.
BatchNorm uses per-window mean/var matching the JS impl. No new crates;
inline base64 decoder, hand-written math.
CLI: --wiflow-model PATH flips /api/v1/info {pose_estimation:true},
populates SensingUpdate.pose_keypoints per tick, pose_current returns
17 COCO keypoints. Verified on TP-Link/.100/.101 deployment.
Output values are sigmoid-saturated (transfer w/o fine-tune) — model
needs per-deployment LoRA adapter or re-train, follow-up Pack E.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
LiveDemoTab.fetchModels() now probes /api/v1/info after the RVF model list; when features.pose_estimation is true (i.e. --wiflow-model was loaded), inserts a virtual 'WiFlow-v1 (lite, 186K params, --wiflow-model)' option, marks it active, and populates name + PCK 0.929 in the panel. Cosmetic only — does not change inference path or pose_keypoints flow. Closes the UX inconsistency where the badge said MODEL INFERENCE but the dropdown said 'No model loaded'. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Audit fix bundle (10 areas; details in ADR-117 + commit body below).
Server (main.rs / wiflow_v1.rs):
- UDP receiver filters loopback/multicast/unspecified before NODE_ADDRS registration. Defends against `cargo test` cross-talk that spawned 250+ ping zombies on the production server's :5005 port.
- csi_keepalive_task pre-reaps `/sbin/ping -i 0.040` orphans at task entry. macOS doesn't propagate parent death, so killed servers used to leave init-parented pings running indefinitely.
- run_wiflow_inference stamps real classifier confidence onto every keypoint (was hardcoded 1.0); it reads 0.037 on live data, which is honest.
- run_wiflow_inference clones only the tail-20 frames inside the lock, not the full 600-deep VecDeque (~270 KB → ~9 KB per tick).
- wiflow_v1::build_input_from_history: zero-pad dead channel slots instead of duplicating subcarrier 0 across all of them. Comment said "zero the rest", prior code did the opposite.
- GET / now 308-redirects to /ui/index.html; API index moved to /api.
UI (ui/index.html, ui/components/LiveDemoTab.js):
- <section id="sensing"> gets a <div id="sensing-container"> child so app.js::SensingTab.mount has its mount point. Sensing tab was permanently blank.
- LiveDemoTab.fetchModels: only inject WiFlow into the dropdown if no RVF model is already active. Prevents silent flip back to WiFlow after every poll.
Tests (multi_node_test.rs):
- test_multi_node_udp_send probes 127.0.0.1:5005 first; if bind fails (e.g. a dev server is running), skip the send. Two-layer defense with the server-side filter above.
Docs (CHECKLIST.md, ADR-115, espectre-gap-analysis.md, ota-pipeline.md):
- CHECKLIST head sha + count refreshed (43→47 Done, head 0ec1e4b, ADR range to 001-117 with ADR-111 noted as intentionally absent).
- ADR-115 typo fixes: "ADR-100" → "ADR-110" (TP-Link WISP), "ADR-111" → "ADR-109" (AP-MAC tracking actually lives there).
- gap-analysis "Still open" table: 8 shipped items annotated with commit hashes; remainder reclassified Deferred with reason.
- ota-pipeline.md: new "Operator REST endpoints" section listing /ota/recalibrate (ADR-109) and /ota/set-target (ADR-115) with unauthed + bearer-token curl examples.
Verified post-restart:
- exactly 2 ping children, both parented to current PID, one per real sensor IP, no 127.0.0.1.
- GET / → 308 → /ui/index.html.
- /api/v1/info: pose_estimation=true, version 0.3.0.
- /api/v1/pose/current: 17 COCO keypoints, confidence 0.037 (real).
- cargo test --workspace: 13 passed / 0 failed / 5 ignored.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
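The source-address filter from the server list above amounts to a three-way check on the sender IP. A minimal sketch, assuming a hypothetical helper name `is_registrable_source`; only the `std::net::IpAddr` predicates are real API.

```rust
use std::net::IpAddr;

/// True when a UDP sender may be registered as a sensor node.
/// Loopback, multicast and unspecified sources are refused.
pub fn is_registrable_source(ip: IpAddr) -> bool {
    !(ip.is_loopback() || ip.is_multicast() || ip.is_unspecified())
}
```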
Audit on 6-node training data (151,329 frames) found 21 multicollinear pairs (|r|>0.85), one dead feature (amp_min constant 0), and only node[0] used in 8 of 15 features. Top per-feature F-stat = 15,497 but accuracy stuck at 44.4%: the classifier couldn't extract the signal that physical sensors were already capturing.
Refactor:
- Drop 8 dead/redundant features (amp_min, amp_range, breath_bp, spec_pow, motion_bp, amp_mean, amp_max, amp_iqr, amp_kurt).
- Keep 4 globals: variance, mean_rssi, dom_hz, change_pts.
- Add per-node features × all 6 nodes: amp_std, amp_skew, amp_entropy.
- New N_FEATURES = 22 (was 15). Z-score normalisation kept.
API change: features_from_runtime now takes &[(u8, &[f64])], so the caller must supply per-node amplitudes. New helper current_per_node_amps() reads AMP_HIST.nbvi_history.back() for all live nodes. Old data/adaptive_model.json removed (incompatible 15-feature schema).
Retrain result on same 151k frames: 44.4% → 49.58% accuracy (+5.2 pts). Total improvement vs 2-node baseline (40.4%): +9.2 pts. Live confidence distribution now meaningful (0.30-0.85) vs pre-fix near-uniform 0.04-0.10.
Sensor placement matters: n6 (near door, far from AP) sep_ratio=0.60 best; n1/n5 (near AP) ~0.01-0.06 nearly dead.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Single-hidden-layer perceptron (~3k params, ReLU + softmax) trained via manual backprop (no external ML crate). SGD + momentum 0.9 + weight decay 1e-4 + cosine LR decay, 30 epochs over 151,329 frames.
AdaptiveModel carries both LogReg and MLP weights side-by-side; classify() prefers MLP via is_trained() check, falls back to LogReg when loading legacy 15-feature models.
Result on same 6-node 7-class dataset:
LogReg (ADR-118): 49.58%
MLP (this): 53.53% (+3.95 pts)
Per-class gains concentrated on motion classes, exactly where non-linear feature combinations matter:
absent +1 (40% → 41%)
present_still tied (99% → 99%, class-imbalance ceiling)
transition +7 (29% → 36%)
active +8 (22% → 30%)
waving +4 (34% → 38%)
present_moving +9 (24% → 33%)
Cumulative session improvement vs 2-node 15-feature baseline: 40.4% → 53.53% (+13.1 pts).
Loss flatlines at 1.15 around epoch 10, the frame-level information ceiling for the 22-feature representation. Next big lever is temporal context (windowed LSTM/TCN), documented in Out-of-scope.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
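For readers unfamiliar with the shape of such a model, here is a minimal forward-pass sketch (input → ReLU hidden layer → softmax). The `Mlp` struct and its field layout are assumptions for illustration; the actual `AdaptiveModel` storage and training code are not shown in this PR text.

```rust
/// Hypothetical weight container: w1 is [hidden][input], w2 is [classes][hidden].
pub struct Mlp {
    pub w1: Vec<Vec<f64>>,
    pub b1: Vec<f64>,
    pub w2: Vec<Vec<f64>>,
    pub b2: Vec<f64>,
}

/// y = W x + b for one dense layer.
fn affine(w: &[Vec<f64>], b: &[f64], x: &[f64]) -> Vec<f64> {
    w.iter()
        .zip(b)
        .map(|(row, bias)| {
            let dot: f64 = row.iter().zip(x).map(|(wi, xi)| wi * xi).sum();
            dot + bias
        })
        .collect()
}

/// Numerically stable softmax (subtracts the max logit before exp).
pub fn softmax(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|z| (z - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

impl Mlp {
    /// Returns class probabilities for one feature vector.
    pub fn forward(&self, x: &[f64]) -> Vec<f64> {
        let hidden: Vec<f64> = affine(&self.w1, &self.b1, x)
            .into_iter()
            .map(|h| h.max(0.0)) // ReLU
            .collect();
        softmax(&affine(&self.w2, &self.b2, &hidden))
    }
}
```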
Adds WindowedMlpModel: 440 → 64 ReLU → n_classes, stacks last 20
frames × 22 features as input. Captures temporal patterns that
frame-level classifiers physically cannot see (walking cadence,
sit-stand cycles, gesture rhythm).
AppStateInner gets feature_window: VecDeque<[f64; 22]> (cap 20)
auto-pushed at the 3 tick sites before adaptive_override. The
classify_window API flattens the buffer (oldest first) + current
frame's features → 440-d input → softmax over classes. Cold-start
(<20 frames) falls back to frame-level MLP.
AdaptiveModel now carries all three classifiers side-by-side:
LogReg (ADR-118), MLP (ADR-119), W-MLP (this). classify_window
picks W-MLP first; legacy classify() picks MLP > LogReg.
Result on the same 6-node, 7-class, 151,329-frame dataset:
LogReg: 49.58%
MLP: 53.53%
W-MLP: 90.40% (+36.87 pts over MLP, +50.0 pts over original
2-node 15-feature LogReg baseline)
Per-class W-MLP accuracy:
absent 100% (was 41%)
present_still 100% (was 99%, saturated)
transition 86% (was 36%) — sit/stand cadence captured
waving 90% (was 38%) — gesture cadence captured
present_moving 82% (was 33%) — walking step cadence captured
active 74% (was 30%) — jumping bursts captured
Loss broke through frame-level plateau (1.15 → 0.25). Caveat:
90.4% is training-set accuracy; ~28k weights on ~30k windowed
samples means some overfitting likely. Held-out test set
recommended as follow-up.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
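The window flattening can be sketched as below. It assumes the current frame occupies the newest of the 20 slots (20 × 22 = 440); `push_and_flatten` is a hypothetical helper, not the `classify_window` API itself.

```rust
use std::collections::VecDeque;

pub const N_FEATURES: usize = 22;
pub const WINDOW: usize = 20;

/// Pushes the current frame's features, keeps at most WINDOW frames, and
/// returns the 440-d input (oldest frame first) once the window is full.
/// Returns None during cold start so the caller can fall back to the frame-level MLP.
pub fn push_and_flatten(
    win: &mut VecDeque<[f64; N_FEATURES]>,
    current: [f64; N_FEATURES],
) -> Option<Vec<f64>> {
    if win.len() == WINDOW {
        win.pop_front();
    }
    win.push_back(current);
    if win.len() < WINDOW {
        return None;
    }
    Some(win.iter().flatten().copied().collect())
}
```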
W-MLP claimed 90.4% training accuracy in ADR-120 but live UI kept
showing only the 4 baseline classes (absent/still/moving/active).
Root cause: 3 amp_presence_override / amp_classify_from_latest call
sites ALWAYS overwrite classification.motion_level after
adaptive_override runs, regardless of what the model decided. The
rule-based path only knows 4 classes; the 2 new ones (waving,
transition) emitted by the adaptive W-MLP were silently clobbered
every tick.
Hybrid priority:
rule-based wins → absent / present_still / present_moving / active
(ESPectre-style F1>96%, battle-tested)
adaptive wins → waving / transition (exclusive to ADR-120 W-MLP)
Implementation: new helper adaptive_owns_class() + ADAPTIVE_EXCLUSIVE_CLASSES
constant. Each of the 3 rule-based override blocks (multi-BSSID tick,
feature_state path, per-node loop) now guards on `if !adaptive_owns_class(
classification.motion_level)`. Skips the overwrite when the adaptive
model has just emitted a new class.
Live verification (30s sample):
transition: 14/30 (47%) — visible in live UI for the first time
present_still: 10/30 (33%)
present_moving: 1/30
absent: 1/30
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
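A sketch of the guard logic, assuming labels are plain strings. `adaptive_owns_class` and `ADAPTIVE_EXCLUSIVE_CLASSES` are named in the commit, but their signatures here are guesses, and `resolve_label` is a hypothetical wrapper added only to show the priority rule.

```rust
/// Classes only the adaptive W-MLP can emit.
pub const ADAPTIVE_EXCLUSIVE_CLASSES: [&str; 2] = ["waving", "transition"];

/// True when the label must not be overwritten by the rule-based path.
pub fn adaptive_owns_class(label: &str) -> bool {
    ADAPTIVE_EXCLUSIVE_CLASSES.iter().any(|c| *c == label)
}

/// Hybrid priority: keep the adaptive label for its exclusive classes,
/// otherwise let the rule-based classifier win.
pub fn resolve_label(adaptive: &str, rule_based: &str) -> String {
    if adaptive_owns_class(adaptive) {
        adaptive.to_string()
    } else {
        rule_based.to_string()
    }
}
```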
After hybrid priority fix (442c03d) the W-MLP labels reach the live UI but at ~10 Hz tick rate they flip between adjacent classes (transition / present_still / present_moving) too fast to read. Adds majority-vote smoothing over last 7 ticks (~700ms window) — snappy enough for real- time feedback, stable enough that the displayed label persists long enough to be readable. Implementation: static ADAPTIVE_LABEL_HISTORY VecDeque + helper adaptive_label_smooth() called at end of adaptive_override after the model emits its raw decision. Mode of last 7 raw labels wins; ties break sticky to the previous committed label. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
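The majority vote with sticky tie-break might look like the sketch below. `LabelSmoother` is a hypothetical type (the commit uses a static `ADAPTIVE_LABEL_HISTORY`), and choosing the most recent leader when the committed label loses is an assumption.

```rust
use std::collections::{HashMap, VecDeque};

pub struct LabelSmoother {
    window: usize,
    history: VecDeque<String>,
    committed: Option<String>,
}

impl LabelSmoother {
    pub fn new(window: usize) -> Self {
        Self { window: window.max(1), history: VecDeque::new(), committed: None }
    }

    /// Pushes one raw label and returns the smoothed (committed) label.
    pub fn push(&mut self, raw: &str) -> String {
        if self.history.len() == self.window {
            self.history.pop_front();
        }
        self.history.push_back(raw.to_string());

        let mut counts: HashMap<&str, usize> = HashMap::new();
        for label in &self.history {
            *counts.entry(label.as_str()).or_insert(0) += 1;
        }
        let best = counts.values().copied().max().unwrap_or(0);

        // Sticky: keep the committed label while it is still (tied for) the mode.
        let keep = self
            .committed
            .as_deref()
            .map_or(false, |c| counts.get(c) == Some(&best));
        if !keep {
            // Otherwise take the most recent label among the leaders.
            let winner = self
                .history
                .iter()
                .rev()
                .find(|l| counts[l.as_str()] == best)
                .cloned();
            self.committed = winner;
        }
        self.committed.clone().unwrap_or_default()
    }
}
```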
…2 5-tick confirm
Previous 15-tick majority window still flickered visibly in the live
UI ("it switches at the speed of light"). Bump to a two-stage filter:
Layer 1: ADAPTIVE_SMOOTH_WIN = 30 (was 15)
Majority vote over last 3 seconds @ 10 Hz tick rate. Doubles the
window — sustained signal dominates, brief glitches lose.
Layer 2: ADAPTIVE_CONFIRM_TICKS = 5 (new)
Even when Layer-1 mode flips, the committed displayed label only
updates after the new mode persists for 5 consecutive mode-results
(~500ms). Stops rapid bouncing between near-tied classes.
Effective dwell time: ≥3 seconds before any visible label change.
Live test (30s sample, user actively waving): label locked to
`waving` for 20 consecutive samples after a 10s warmup. No flicker.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
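Layer 2 on its own can be sketched as a small confirm gate fed with the Layer-1 mode. `ConfirmGate` is a hypothetical name; the constant mirrors `ADAPTIVE_CONFIRM_TICKS` from the commit, and resetting the streak when the mode returns to the committed label is an assumption.

```rust
pub const ADAPTIVE_CONFIRM_TICKS: usize = 5;

pub struct ConfirmGate {
    confirm_ticks: usize,
    committed: Option<String>,
    candidate: Option<String>,
    streak: usize,
}

impl ConfirmGate {
    pub fn new(confirm_ticks: usize) -> Self {
        Self { confirm_ticks, committed: None, candidate: None, streak: 0 }
    }

    /// Feeds one Layer-1 mode result; returns the label to display.
    pub fn update(&mut self, mode: &str) -> String {
        if self.committed.is_none() {
            // First tick: nothing to protect yet, commit immediately.
            self.committed = Some(mode.to_string());
        } else if self.committed.as_deref() == Some(mode) {
            // Mode agrees with the display: drop any pending candidate.
            self.candidate = None;
            self.streak = 0;
        } else {
            if self.candidate.as_deref() == Some(mode) {
                self.streak += 1;
            } else {
                self.candidate = Some(mode.to_string());
                self.streak = 1;
            }
            if self.streak >= self.confirm_ticks {
                self.committed = self.candidate.take();
                self.streak = 0;
            }
        }
        self.committed.clone().unwrap_or_default()
    }
}
```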
Previous smoothing covered only the adaptive_override path. The 5 other classification.motion_level writes (amp_presence_override and amp_classify_from_latest in 3 different tick handlers) wrote raw values that bypassed the smoother entirely, which explains the lingering "it switches at the speed of light" complaint after the two-layer fix.
New finalize_motion_label(&mut classification) runs at end-of-tick AFTER all overrides have settled and applies the same two-layer (30-tick mode + 5-tick confirm) smoothing uniformly to whatever label survived the priority cascade. Called from 3 sites:
- multi-BSSID tick handler
- feature_state tick handler
- per-node loop in broadcast tick task
adaptive_override now emits raw model label (no double-smoothing).
Verified: 30-second sample, user actively performing transitions, ZERO flips. Label persisted as `transition` all 30 samples.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds diagnostic endpoint returning the last 30 RAW model labels, their distribution, the smoother's internal buffer, committed + candidate labels, and consecutive count. Lets the operator distinguish "smoothing is sticky" from "model genuinely keeps outputting the same class" — without that signal, tuning smoothing parameters is shooting in the dark. Also relaxes smoothing back to 15/2 (Layer-1 1.5s majority + Layer-2 200ms confirm). The earlier 30/5 setting was over-damped because the actual problem was model overfitting, not flicker. Diagnostic finding on current live data: transition raw count: 25/30 (83%) present_still: 2 absent: 2 present_moving: 1 Model believes user is performing sit/stand transitions even when they're typing at the keyboard. Likely cause: `train_transition` recording captured ~3s pauses between sit-stand cycles, so the class signature is broad enough to grab typing/mouse motion. Fix is data-side (re-record cleaner transition class or add a desk_work class), not algorithm-side. ADR-120 follow-up notes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ruvnet
left a comment
Did a deep review across security and code-quality dimensions. Substantial, well-attested work — but two critical security findings need to land before this can merge, and the structure deserves a serious look before bringing 13K lines + 20 ADRs onto main in one shot.
Critical (must-fix-before-merge)
1. Unauthenticated baseline poisoning + arbitrary file write
v2/crates/wifi-densepose-sensing-server/src/main.rs:4485-4523 (baseline_calibrate), route registration at main.rs:7621.
The new POST /api/v1/baseline/calibrate is mounted under /api/v1/*, but the require_bearer middleware (src/bearer_auth.rs:86-114) is off whenever RUVIEW_API_TOKEN is unset — and that's the documented LAN-mode default from #443. In the default deployment, any host on the WiFi can:
- DoS-poison the baseline (trigger calibration with the operator present), corrupting org-wide presence detection.
- Write attacker-controlled JSON to arbitrary filesystem paths via the request body's `out` field, which is passed straight to `std::fs::write(out_path, …)` in `capture_baseline_to_disk` at `main.rs:5720` with no path sanitisation. Example: `{"out": "../../../etc/cron.d/x"}` or `{"out": "/proc/self/…"}` writes the payload anywhere the server can write.
No CSRF, no rate limit, no auth.
Fix: gate this route behind bearer auth unconditionally (move `.layer(require_bearer)` above this specific route, or split it). Reject any `out` that escapes a fixed `data/` dir via `Path::canonicalize()` + prefix check. Add a per-IP rate limit (1 calibration / 5 min).
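A possible shape for the canonicalize + prefix check, offered as a sketch rather than a drop-in patch. `resolve_out_path` and its parameters are hypothetical; because `canonicalize` needs an existing path, the sketch canonicalizes the parent directory and re-attaches the file name.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Resolves a client-supplied `out` value to a path inside `data_dir`,
/// or returns an error if it would escape that directory.
pub fn resolve_out_path(data_dir: &Path, out: &str) -> io::Result<PathBuf> {
    let requested = Path::new(out);
    let file_name = requested
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "missing file name"))?;
    let root = data_dir.canonicalize()?;
    // An absolute `out` replaces `root` in join(), and is then caught by the prefix check.
    let parent = match requested.parent() {
        Some(p) if !p.as_os_str().is_empty() => root.join(p),
        _ => root.clone(),
    };
    // Resolves `..` and symlinks; fails if the directory does not exist.
    let parent = parent.canonicalize()?;
    if !parent.starts_with(&root) {
        return Err(io::Error::new(io::ErrorKind::PermissionDenied, "path escapes data dir"));
    }
    Ok(parent.join(file_name))
}
```

This does not cover a pre-existing symlink at the final file name, which `std::fs::write` would follow; a stricter variant could refuse to overwrite symlinks.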
2. OTA flashing accepts arbitrary firmware with auth off by default + plaintext PSK
firmware/esp32-csi-node/main/ota_update.c:45-50 — ota_check_auth() returns true whenever the NVS security/ota_psk key is absent ("permissive for dev"). Init at :455-472 only logs a warning when no PSK is provisioned; nothing forces it. scripts/ota-deploy.sh:140-162 confirms the wire protocol is http://<ip>:8032/ota with Authorization: Bearer <psk> — never TLS.
Attack: a rogue host on the same WiFi (including the open WISP guest segment ADR-110 describes) POSTs an attacker-built firmware (≤900 KB) and the chip flashes + reboots into it — no signed-image verification beyond ESP-IDF's structural SHA-256, and Secure Boot V2 is not enforced anywhere in the repo. Even with a PSK provisioned, plaintext transport leaks the bearer token to any LAN sniffer. Same story for new POST /ota/recalibrate (:108-152) and POST /ota/set-target (:210-272) — both inherit "no PSK ⇒ no auth" and can factory-reset gain-lock or redirect ALL CSI traffic to an attacker-controlled aggregator IP.
Fix:
- Refuse to start the OTA server when `s_ota_psk[0] == '\0'` (fail-closed).
- Document and enforce ESP-IDF Secure Boot V2 + signed image verification (`CONFIG_SECURE_BOOT=y`, `CONFIG_SECURE_SIGNED_APPS_*`) before any production deployment.
- Gate `set-target` and `recalibrate` behind the same PSK and require the IP to match an allowlist or the existing aggregator's subnet.
High (must-address)
- No rate-limit / re-calibration lock on `/api/v1/baseline/calibrate`: the `main.rs:4493-4513` `running` guard only blocks while one capture is mid-flight; once it flips to `complete`, the attacker re-triggers immediately. Combined with finding 1 this is a sustained baseline-grinding DoS. Fix: enforce `min_age_sec` cool-down in the handler, not only in `auto_recalibrate_task` (`main.rs:5772-5776`).
- `ota-deploy.sh` reads `OTA_PSK` from env and embeds it in plaintext HTTP headers (`scripts/ota-deploy.sh:225-227, 142-149`) with no warning to the operator. Fix: print a loud warning unless an env opt-in is set, and add a TLS path (e.g., serve OTA over `:8033` via `esp_https_server`).
- No per-IP cap or payload backpressure on `/ws/sensing`, `/ws/introspection` (`main.rs:3519-3548, 3564-3593`). A single attacker pins broadcast subscriber slots; `bearer_auth.rs:11-13` explicitly excludes `/ws/*` from auth even when token mode is on.
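To make the cool-down suggestion concrete, here is a minimal per-IP gate. `Cooldown` and `try_acquire` are hypothetical names, and passing `now` in explicitly is a choice made only so the logic can be tested without sleeping.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Allows one action per client IP every `min_gap`.
pub struct Cooldown {
    min_gap: Duration,
    last: HashMap<IpAddr, Instant>,
}

impl Cooldown {
    pub fn new(min_gap: Duration) -> Self {
        Self { min_gap, last: HashMap::new() }
    }

    /// Returns true (and records the attempt) when `ip` is outside its cool-down.
    pub fn try_acquire(&mut self, ip: IpAddr, now: Instant) -> bool {
        let blocked = self
            .last
            .get(&ip)
            .map_or(false, |prev| now.saturating_duration_since(*prev) < self.min_gap);
        if blocked {
            return false;
        }
        self.last.insert(ip, now);
        true
    }
}
```

A handler would call `try_acquire` before starting a capture and answer with HTTP 429 when it returns false; a global (not per-IP) gate may be the better fit for calibration, since the baseline is shared.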
Medium / informational
- `static/raw.html:88, 114` uses `innerHTML` twice (both with server-controlled `nodeId` values; dynamic CSI uses `textContent`). Add `Content-Security-Policy: default-src 'self'` on the `nest_service("/static", …)` mount at `main.rs:7664-7669`.
- `parse_esp32_frame` truncates `n_subcarriers` u16→u8 (`src/csi.rs:153`, `main.rs:2076`); length check at `:162`/`:2086` keeps it safe but worth a comment.
- ADR-105 dropping synthetic data: `generate_signal_field` returns zero-filled grid, `derive_pose_from_sensing` returns empty `Vec`. Both degrade gracefully (no panic paths introduced). Confirmed safe.
- NVS gain-lock load (`csi_collector.c:111-122, :142-181`) correctly guards `agc >= RV_GAIN_MIN_SAFE_AGC=30` before applying, so the "sabotaged NVS → unsafe gain" scenario is mitigated.
- `scripts/record-baseline.py` is operator-controlled CLI, no remote vector.
Structural concerns
- `main.rs` adds 2,940 lines in one file. `CLAUDE.md` says "Keep files under 500 lines." Only `wiflow_v1.rs` (473 lines) was extracted. The new functionality should be split into `baseline.rs`, `amp_classifier.rs`, `nbvi.rs`, `drift_channel.rs`, `keepalive.rs`, `rest_baseline.rs`.
- `adaptive_classifier.rs` adds 714 lines with three new model types (`MlpModel`, `WindowedMlpModel`, retained legacy LogReg) co-located. ADR-118/119/120 should each have been their own module.
- `OnceLock<Mutex<HashMap<…>>>` global state pattern (`amp_hist_init`, `amp_latest_init`, etc.) is inconsistent with the rest of the codebase, which uses `Arc<RwLock<…>>` inside `AppState`. The fact that the replay test needs a `reset_classifier_state()` to clear globals confirms this is a code smell.
ADR concerns
- 20 ADRs in one PR vs the project's "one ADR per design decision, merged via its own PR" convention.
- ADR-098 looks like a backfill, should have shipped in a prior PR.
- ADR-111 is intentionally absent ("folded into 109") — leaves a confusing gap. Better: don't reserve, or reissue 109 with merged scope.
- ADR-118 / 119 / 120 are sequential supersession — 118 adds a 22-feature extractor + LogReg, 119 replaces with MLP, 120 replaces with Windowed MLP. The previous two are dead before review. Should have been one ADR with a "Considered alternatives" section.
- Several diff lines still reference the old "ADR-099" name in body text after the renumber to ADR-110 — cross-references will be confusing.
Test coverage
- 6 new `#[test]` functions for ~6,500 net production lines (~0.9 tests / 1K LOC).
- `csi.rs`: 2 (frame parser offset regression, good).
- `wiflow_v1.rs`: 3 trivial unit smoke (base64, sigmoid, zero-history); no golden-vector forward-pass test.
- `main.rs::replay_tests`: 1 integration test using the 2000-line JSONL fixtures, asserts F1 >= 0.85; good coverage for ADR-101/103/104.
- No tests for: ADR-102 (NBVI), ADR-103 (baseline persistence), ADR-104 (drift channel), ADR-107 (`/calibrate` REST), ADR-106 D3 (keepalive), ADR-119 (MLP forward), ADR-120 (Windowed MLP), ADR-118 (feature extractor). These are the highest-risk additions.
File org
- `CHECKLIST.md` at repo root: `CLAUDE.md` explicitly forbids "saving working files / text / mds to the root folder". 187 lines of session-handoff dashboard content. Either move to `docs/CHECKLIST.md` with an "ephemeral, may be stale" header, or (better) drop it entirely and capture the Done list in this PR's description and `CHANGELOG.md`.
Recommendation
block-pending-fix for the two criticals, then re-review. The criticals are small, surgical fixes — not a structural rewrite — so a single follow-up commit can address them.
If you have the bandwidth, please also consider splitting into 4–5 smaller PRs along these lines:
- firmware only: ADRs 098, 100, 108, 109, 115 + `firmware/**` + `scripts/ota-deploy.sh` + `docs/references/ota-pipeline.md` (~1.2K LOC, self-contained, easy to roll back).
- server CSI plumbing + safety-net removal: ADRs 105, 106 (~1.5K LOC, highest review priority).
- classifier + baseline + new modules: ADRs 101–104, 107, 110, 112–114 + new `baseline.rs` / `amp_classifier.rs` / `nbvi.rs` / `drift_channel.rs` (~2K LOC, hard cap `main.rs` net delta at 500 lines).
- pose loader: ADR-116 + `wiflow_v1.rs` (~500 LOC, already isolated; add a golden-vector test).
- classifier evolution: collapse ADR-118/119/120 into one ADR; fold ADR-117 into the relevant PR.
Merge-as-one is technically safe for what I sampled (no obvious bugs, mergeable, live-verified) — but the structural debt sets a precedent and the next main.rs refactor will be painful.
Genuine thanks for the depth of this work — the gain-lock, NBVI, persistent baseline, complex CSI exposure, and synthetic-data audit are all real improvements that the project needs. The criticals are well-defined and small. Happy to help with the two security fixes or with the restructure.
- CHECKLIST.md: refresh head sha (12e1cf9), date (2026-05-18), count (47 → 50 Done), explicit Done entries for ADR-118/119/120 with the full session accuracy trajectory (40.4% → 90.40%). - .gitignore: stop tracking deployment-specific training artifacts: v2/data/recordings/ (175 MB each), v2/data/adaptive_model.json (regenerated on each retrain), v2/data/baseline.json (regenerated on /api/v1/baseline/calibrate). - ui/style.css: ship the .sensing-class-label color rules for present_moving (yellow), waving (purple), transition (orange) — written during ADR-117 conversation but missed by that commit. - git rm --cached v2/data/adaptive_model.json (stays on disk; untracked). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User-stated rule: README.md and CLAUDE.md must not exceed 200 lines;
all detail goes into docs/ with a link. ADRs also targeted at ≤200.
Before:
README.md 542 lines
CLAUDE.md 407 lines
CHECKLIST 235 lines
ADR-116 224
ADR-117 245
ADR-120 209
After:
README.md 198 ✓
CLAUDE.md 149 ✓
CHECKLIST 199 ✓
ADR-116 191 ✓
ADR-117 199 ✓
ADR-120 200 ✓
ADR-115/118/119 already under (161 / 193 / 161)
New supporting docs (extracted content):
docs/use-cases.md — full deployment-tier catalogue + 60 ADR-041 edge modules
+ ADR-024 self-learning section, all moved from README
docs/architecture.md — pipeline diagram + module breakdown from README
docs/dev-handbook.md — crate map, RuvSense modules, build/firmware/release
/publish, witness verification — all moved from CLAUDE.md
docs/claude-swarm.md — V3 CLI commands, agent types, memory commands —
moved from CLAUDE.md
Trims (compress prose without losing facts):
ADR-116 — D7 honesty section + Verified Acceptance + Open Items
ADR-117 — Context narrative folded to bullets + Out of Scope condensed
ADR-120 — Out of Scope condensed
CHECKLIST — adaptive classifier entries compacted + Deferred grouped
CLAUDE.md now adds the ≤200-line rule explicitly to Behavioral Rules
+ Project Architecture + Pre-Merge Checklist so future sessions can't
forget it. README.md was a 67% reduction; CLAUDE.md 63%.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9 photos of the additional sensor/antenna hardware staged for ADR-120+ experimentation (captured 2026-05-18):
sensor_01 5× u.FL pigtail antennas (bare)
sensor_02 4× flat PCB-strip 2.4 GHz antennas w/ 3M backing + u.FL
sensor_03 HLK-LD2402 24 GHz mmWave radar (close-up, chip S1KM0008)
sensor_04 CP2102 USB-to-UART bridge (AMS1117-3.3 LDO)
sensor_05 HLK-LD2402 + USB-UART wired together (working setup)
sensor_06 ESP32-S3 dev board with microSD slot (back)
sensor_07 ESP32-S3-WROOM with OV-camera + ribbon FFC mounted
sensor_08 YD-ESP32-23 2022-V1.3 (back), spare matching nodes 1-6
sensor_09 YD-ESP32-23 (front), ESP32-S3-N16R8 + FTDI
assets/sensors/README.md catalogues each photo + suggests where each piece fits in the roadmap:
* u.FL antennas → attach to n1/n5 (near-AP, sep_ratio ~0.05 per ADR-118)
* HLK-LD2402 → vitals ground-truth reference for WiFi pipeline
* Camera-ESP32-S3 → on-device camera capture for WiFlow Pack E.2 retrain
* YD-ESP32-23 spare → flashable as node 7 when needed
Photos referenced only from this README, not used by any code path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Operator clarified: nodes 1 and 2 (.101 / .100) are ESP32-S3 + OV-camera
boards (sensor_06, sensor_07 in the photo set), NOT YD-ESP32-23. Nodes
3-6 (.102 / .104 / .105 / .106) are the YD-ESP32-23 boards with u.FL
external-antenna connectors (sensor_08, sensor_09).
Impact: Pack E.2 (WiFlow camera-supervised retrain) is closer than
previously assumed — the camera hardware is already deployed at nodes
1 and 2. Path becomes:
1. Extend FW with parallel camera_capture.c → stream MJPEG over UDP/HTTP
2. Run MediaPipe Pose on server (deps already installed in
~/.venv/ruview-train from earlier session)
3. Time-align with existing scripts/align-ground-truth.js
4. Retrain via scripts/train-wiflow-supervised.js --scale lite
The 4 PCB-strip antennas in sensor_02 map 1:1 to nodes 3-6 — drop-in
upgrade once each board is power-cycled to swap the antenna feed.
README now lists the per-node board type, IP, camera/u.FL status, and
which photos show each. No code changes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a dedicated blocking serial-reader thread that opens the
HLK-LD2402 over a CP2102 USB-UART bridge (default 115200 8N1),
parses ASCII `distance:<cm>\r\n` lines @ ~6 Hz, stores the latest
reading in a static OnceLock<Mutex<…>>, and exposes it via:
GET /api/v1/mmwave/latest →
{ "available": true, "distance_cm": 152, "age_ms": 90 }
{ "available": false } (port absent, stale > 2 s)
UI (Sensing tab) polls the endpoint every visible WS tick and
shows a new blue card "mmWave Radar (24 GHz)" with distance +
age bar. Card hides when unavailable.
CLI:
--mmwave-port /dev/cu.usbserial-1140
--mmwave-baud 115200 (default)
Both optional — server runs as before if the module is absent.
Open failure: single WARN log, reader thread exits, server keeps
serving WiFi sensing.
Verified live: distance 149-153 cm at ~6 Hz, REST returns fresh
readings with age_ms 55-127.
Out of scope (logged in ADR-121): Engineering Mode binary frames,
vitals cross-check vs ADR-021, W-MLP feature fusion, auto-reconnect.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
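The line format and the staleness rule can be illustrated as follows. Both helpers are hypothetical (the reader thread and handler are not shown in this PR text); the 2 s staleness limit and the JSON shapes are taken from the commit message above.

```rust
/// Readings older than this are reported as unavailable.
pub const STALE_AFTER_MS: u128 = 2_000;

/// Parses one Normal-Mode line such as "distance:152\r\n" into centimetres.
pub fn parse_distance_line(line: &str) -> Option<u32> {
    let rest = line
        .trim_end_matches(|c: char| c == '\r' || c == '\n')
        .strip_prefix("distance:")?;
    if rest.is_empty() || !rest.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    rest.parse().ok()
}

/// Builds the /api/v1/mmwave/latest body from the last reading and its age.
pub fn latest_json(distance_cm: Option<u32>, age_ms: u128) -> String {
    match distance_cm {
        Some(d) if age_ms <= STALE_AFTER_MS => format!(
            "{{ \"available\": true, \"distance_cm\": {}, \"age_ms\": {} }}",
            d, age_ms
        ),
        _ => "{ \"available\": false }".to_string(),
    }
}
```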
The ADR-117/118 docs sweep (commit 4075b60) extracted the Use Cases, How It Works, Edge Modules, and Self-Learning sections from README into docs/use-cases.md + docs/architecture.md, but left three classes of links dangling:
1. README anchor links pointing at section IDs that no longer exist in README:
#edge-intelligence-adr-041 → moved to docs/use-cases.md
#esp32-s3-hardware-pipeline → architecture detail in docs/
#vital-sign-detection → moved out
#sensing-server → moved out
#-quick-start → renamed during slim
Replaced with deep links into docs/use-cases.md or docs/dev-handbook.md / docs/architecture.md where appropriate.
2. Extracted docs (docs/use-cases.md etc.) had path links written from the perspective of repo root (docs/edge-modules/, v2/crates/...), which broke once the file moved into docs/. Bulk-rewrote via Python regex pass:
docs/edge-modules/X → edge-modules/X
docs/adr/X → adr/X
v2/... → ../v2/...
archive/... → ../archive/...
scripts/... → ../scripts/...
plugins/... → ../plugins/...
firmware/... → ../firmware/...
3. docs/use-cases.md self-reference #ai-backbone-ruvector: that section was never moved; replaced with prose + link to architecture.md.
Final scan: ZERO dangling anchors in the doc tree. One valid anchor `#edge-module-list` in use-cases.md points to a local `<details id="edge-module-list">` block.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…hable Previously raw.html lived only at v2/crates/wifi-densepose-sensing-server/static/raw.html. When the server is started with --ui-path /Users/arsen/Desktop/RuView/ui (the SPA path) the calibration console returns 404 on /ui/raw.html. Copy the file into ui/ alongside index.html so a single --ui-path covers both the SPA and the engineer-facing raw view. The static/ copy in the crate stays as the canonical source (referenced by ADRs 104/107); ui/raw.html is a deploy mirror. Live at http://localhost:8080/ui/raw.html. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a hidden-by-default 📡 mmWave pill next to the global badge + CV stat. Polls /api/v1/mmwave/latest at 5 Hz (every ~200 ms), slightly slower than the HLK-LD2402's ~6 Hz native cadence (~167 ms), so each poll returns a fresh latest reading while roughly one reading in six is never displayed. Pill shows: 📡 mmWave 152 cm · 60 ms (distance + age in ms since last reading). Fades to 50% opacity when age >1.5 s, hides entirely when the server reports `available: false` (port absent or stale >2 s). Synced both copies: ui/raw.html (deploy mirror) + static/raw.html (canonical source referenced by ADR-104 / ADR-107). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ADR-021 already publishes `vital_signs` inside SensingUpdate but the raw calibration console had no readout — the operator had to curl /api/v1/vital-signs to see breathing/HR. Add two pills (🫁 + 💓) next to the mmWave one and update them on every WS tick. Confidence < 20 % dims the pill so noise-floor estimates don't read as real values. Missing/zero rates fall back to "— BPM". Mirrored ui/raw.html → static/raw.html so both deployment paths serve the same console. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The breathing/HR pills carried raw BPM with no context. An operator glancing at "94 BPM" can't tell if that's normal or tachycardia without an external reference. Add inline "норма 12–20" / "норма 60–100" hints (Russian UI text; "норма" means "normal range"), dimmed so they don't compete with the live value, and tint the number amber when it falls outside the adult-at-rest range. Tooltip carries the medical terminology (bradypnea/tachypnea, bradycardia/tachycardia). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
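A minimal sketch of the range check behind the amber tint and the tooltip terms. The ranges and terminology come from the commit message; the function name, the `kind` keys, and the choice to treat both range bounds as inclusive are assumptions.

```python
# Adult-at-rest ranges from the commit message: (low, high, below-term, above-term).
RANGES = {
    "breathing": (12, 20, "bradypnea", "tachypnea"),
    "heart": (60, 100, "bradycardia", "tachycardia"),
}


def classify_rate(kind: str, bpm: float) -> tuple:
    """Return (amber, term): whether to tint the value, and the tooltip term."""
    low, high, below, above = RANGES[kind]
    if bpm < low:
        return True, below
    if bpm > high:
        return True, above
    return False, "normal"
```

For the "94 BPM" example in the text, `classify_rate("heart", 94)` reports an in-range value, which is the context the hint is meant to supply.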
Previously only WiFi-CSI produced breathing/HR estimates. With the HLK-LD2402 radar wired up we can compute a second, physically independent breathing estimate from chest-induced cm flicker in the distance time-series. This is a useful cross-check that catches the case when one modality is blind (e.g. WiFi-CSI when nodes are offline, or mmWave when nothing's in the radar's field of view).

mmwave.rs:
- Plumb a per-reading VitalSignDetector tuned for the module's 6 Hz Normal-Mode cadence (Nyquist 3 Hz comfortably covers the 0.1-0.5 Hz breathing band).
- Distance (cm) feeds the detector as the "amplitude" channel; phase is empty so heartbeat falls back to amplitude residual.
- Gate `current_vitals()` on data freshness so a disconnected radar doesn't return stale cached BPMs.

main.rs:
- New GET /api/v1/mmwave/vitals returning the same shape as /api/v1/vital-signs plus buffer status for UI warm-up feedback.

ui/raw.html:
- Each vital pill now shows both 📶 (WiFi-CSI) and 📡 (mmWave) values side-by-side, separated by `|`. mmWave HR is labelled "n/a": cm precision at 6 Hz puts heartbeat below the noise floor. Buffer fill (e.g. "120/180") shown while detector is warming up so the operator knows BPM is on the way.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
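To illustrate why a 6 Hz distance series is enough for breathing, here is a hedged sketch of a band-limited spectral peak search. It is not the `VitalSignDetector` from mmwave.rs: it evaluates the DFT directly on a fine frequency grid after a Hann window, and the function name, the 0.01 Hz grid step, and the use of 180 samples in the usage note are assumptions.

```python
import math


def dominant_bpm(samples, fs_hz, lo_hz, hi_hz, step_hz=0.01):
    """Return the strongest frequency in [lo_hz, hi_hz], in cycles per minute."""
    if hi_hz >= fs_hz / 2:
        raise ValueError("band must lie below the Nyquist frequency fs/2")
    n = len(samples)
    mean = sum(samples) / n
    # Remove the DC offset (the absolute distance), then apply a Hann window.
    windowed = [
        (x - mean) * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(samples)
    ]
    best_f, best_power = lo_hz, -1.0
    for k in range(int(round((hi_hz - lo_hz) / step_hz)) + 1):
        f = lo_hz + k * step_hz
        w = 2 * math.pi * f / fs_hz
        re = sum(v * math.cos(w * i) for i, v in enumerate(windowed))
        im = sum(v * math.sin(w * i) for i, v in enumerate(windowed))
        power = re * re + im * im
        if power > best_power:
            best_f, best_power = f, power
    return best_f * 60.0
```

With `fs_hz=6`, the Nyquist limit is 3 Hz, so the 0.1-0.5 Hz breathing band passes the check; a 180-sample buffer spans 30 s, i.e. at least three full cycles even at the slow end of the band.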
ADR-121 (Normal Mode) gave us distance and a passable breathing estimate but couldn't see the heartbeat: cardiac chest displacement (~0.5 mm) is well below the cm quantisation of `distance:NNN`. Engineering Mode streams per-range-gate energy at the same 6 Hz cadence (15 motion + 15 micromotion gates, u32 LE each). The micromotion bin at the target's distance carries enough cardiac modulation for FFT peak-detection in the 0.8-2.0 Hz band.

Live result, seated operator ~1.5 m from the radar (UI text is Russian; "норма" means "normal range"):

  🫁 📡 13.0 BPM · 37% норма 12-20
  💓 📡 76 BPM · 63% норма 60-100

Implementation:
- Send enable-config → set-mode(0x04) → disable-config on startup; fall back to Normal-Mode ASCII parsing if the sequence fails.
- Binary frame parser: F4 F3 F2 F1 | len(2) | 0x01 | dist(2) | 8z | motion[15]×u32 LE | micro[15]×u32 LE | F8 F7 F6 F5. Gate the ASCII line-drain on the engineering_mode flag: the first cut ran both unconditionally and destroyed 80% of partial frames mid-buffer.
- Target-gate selection: distance-bracketed gate first, mid-range micro-peak fallback, gate 1 default. Per-gate ring buffer of log-energies feeds a Hann + radix-2 FFT.
- /api/v1/mmwave/vitals now returns real `heart_rate_bpm`.
- raw.html: 💓 📡 pill now shows real values (no more "n/a" placeholder).
- New probe script v2/scripts/probe_ld2402_engineering.py used to reverse-engineer the wire format; kept in tree for next time.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
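The frame layout quoted above can be decoded with a fixed-size struct. The sketch below is an illustration in Python, not the Rust parser from mmwave.rs. Several details are assumptions: that `len` and `dist` are little-endian like the gate energies, that `8z` means eight reserved bytes to skip, that distance is in cm, and all function and key names.

```python
import struct

HEADER = bytes.fromhex("F4F3F2F1")
FOOTER = bytes.fromhex("F8F7F6F5")
# len(2) | type(1) | dist(2) | 8 reserved bytes | 15 motion u32 | 15 micro u32
_BODY = struct.Struct("<HBH8x15I15I")
FRAME_SIZE = len(HEADER) + _BODY.size + len(FOOTER)


def parse_frame(buf: bytes):
    """Return (frame, consumed). frame is None if no complete frame was found;
    consumed is how many leading bytes the caller may drop from its buffer."""
    start = buf.find(HEADER)
    if start < 0:
        # Keep the last 3 bytes: they may be the beginning of a header.
        return None, max(0, len(buf) - (len(HEADER) - 1))
    if len(buf) - start < FRAME_SIZE:
        # Partial frame: drop only the garbage before it and wait for more data.
        return None, start
    fields = _BODY.unpack_from(buf, start + len(HEADER))
    body_end = start + len(HEADER) + _BODY.size
    if fields[1] != 0x01 or buf[body_end:body_end + len(FOOTER)] != FOOTER:
        # False header match: resynchronise one byte later.
        return None, start + 1
    frame = {
        "length": fields[0],
        "distance_cm": fields[2],
        "motion": list(fields[3:18]),
        "micro": list(fields[18:33]),
    }
    return frame, start + FRAME_SIZE
```

Returning `start` rather than the full buffer length for an incomplete frame is the point of the sketch: a partial frame stays in the buffer until its tail arrives, which is the failure the unconditional ASCII line-drain caused.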
The FFT will always find *some* peak in 0.8-2.0 Hz, even on pure clutter, and the peak-to-mean ratio frequently lands at 0.5-0.7 "confidence" from noise alone. Net result: the HR pill showed 75-97 BPM with 60%+ confidence while the operator was across the room with their back to the radar.

Add a presence gate based on the target gate's micromotion energy:

  empty room       peak_micro_mid  1k-3k
  person nearby    peak_micro_mid  10k-20k
  person in beam   peak_micro_mid  40k-80k

Threshold at 20k. Below it we null both BR and HR (the breathing detector's internal buffer is still fed so it stays warm for instant re-acquisition).

New diagnostic endpoint GET /api/v1/mmwave/gates dumps current motion/micro arrays + the target gate so we can re-calibrate the threshold on new firmware.

UI: pill now shows "· нет цели" (no target) when presence=false, so the operator can tell "buffer warming up" from "nobody in beam" from "module fell back to Normal Mode".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
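The presence gate amounts to a single comparison applied before the vitals are published. A minimal sketch, assuming the gate is applied to already-computed estimates; the function name and the `presence` / `breathing_rate_bpm` keys are hypothetical (`heart_rate_bpm` is the field named for /api/v1/mmwave/vitals).

```python
PRESENCE_THRESHOLD = 20_000  # micromotion energy; "Threshold at 20k"


def gate_vitals(peak_micro_mid, breathing_bpm, heart_bpm):
    """Null both estimates when the target gate's energy is below threshold."""
    present = peak_micro_mid >= PRESENCE_THRESHOLD
    return {
        "presence": present,
        "breathing_rate_bpm": breathing_bpm if present else None,
        "heart_rate_bpm": heart_bpm if present else None,
    }
```

Note that the gate only masks the output. The detector still receives every sample, so its buffer stays warm while the room is empty.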
HLK-LD2402's antenna near-zone (0–70 cm) is a dead spot for its internal distance algorithm: gate-0 micromotion energy collapses to zero, and the firmware falls back to a sidelobe pick that lands at 1.5–2 m. Operator sitting 40 cm away saw "180 cm" jumping ±10 cm.

Detect the near-field state from the gate snapshot:

  motion[0] > 5k AND motion[0] >= peak_motion_mid AND micro[0] < 3k

Debounce across the last 6 frames (≈1 s) so a single jittery frame doesn't toggle the UI: gate energies swing 5–30k frame-to-frame when the target is breathing right against the module.

When the flag is set, the distance pill renders "<70 cm" with a tooltip explaining that vitals are unreliable at this range; the recommended sweet-spot is 0.7–2 m.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
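The near-field condition and its debounce can be sketched as below. The per-frame condition is taken from the commit message; the debounce policy is an assumption, since the message does not say how the 6 frames are combined. Here the flag changes only when all 6 recent frames agree. The class and method names are hypothetical.

```python
from collections import deque


class NearFieldDetector:
    """Debounced near-field flag over the last `window` gate snapshots."""

    def __init__(self, window: int = 6):
        self._history = deque(maxlen=window)
        self.near_field = False

    def update(self, motion, micro, peak_motion_mid) -> bool:
        raw = (
            motion[0] > 5_000
            and motion[0] >= peak_motion_mid
            and micro[0] < 3_000
        )
        self._history.append(raw)
        if len(self._history) == self._history.maxlen:
            # Assumed policy: flip only on a unanimous window (hysteresis).
            if all(self._history):
                self.near_field = True
            elif not any(self._history):
                self.near_field = False
        return self.near_field
```

At the 6 Hz frame rate a 6-frame window is about one second, so a single outlier frame cannot toggle the "<70 cm" rendering in either direction.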
… audit) (#623)

ota_check_auth() previously returned true when s_ota_psk[0] == '\0' ("permissive for dev"). A freshly-flashed node, or any node where nobody had provisioned an OTA PSK yet, accepted attacker-controlled firmware over plain HTTP on port 8032 from any host on the WiFi. No Secure Boot V2, no signed-image verification, no transport encryption. A single LAN call could brick or backdoor a node.

This was flagged in the deep security review of PR #596 but was a PRE-EXISTING bug in main, not new code from that PR, so it stood as a critical-severity production issue until this commit.

Fix:
- ota_check_auth() now returns false when no PSK is provisioned, with ESP_LOGW("OTA rejected: no PSK in NVS …") at the call site so the operator can diagnose the rejection from serial logs
- ota_update_init() ESP_LOGW message updated to surface the new posture at boot ("upload endpoint will REJECT all requests until provisioned")
- Doc comment on ota_check_auth() rewritten to make the contract explicit and reference the audit

The OTA HTTP server itself still starts even when no PSK is set. That lets the operator run `provision.py --ota-psk <hex>` over USB-CDC to write the NVS key without reflashing the firmware. The upload endpoint just refuses every request in the meantime.

Breaking change for any deployment that depended on the unauthenticated OTA path working out of the box. Documented in CHANGELOG under [Unreleased] / Security so it's visible at the next release cut.

Fix-marker RuView#596-ota-fail-closed (scripts/fix-markers.json) requires the new behaviour and forbids the old "permissive for dev" fallback strings, so a future revert fails CI.
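The behavioural change is the difference between failing open and failing closed on an empty PSK. The sketch below shows the fail-closed contract in Python for illustration only; the firmware function is C, and how the client presents its credential is not stated in the commit, so the `presented` argument and the function name are hypothetical.

```python
import hmac
from typing import Optional


def check_auth_fail_closed(provisioned_psk: str, presented: Optional[str]) -> bool:
    """Accept only when a PSK is provisioned AND the presented value matches."""
    if not provisioned_psk:
        # No PSK provisioned: reject everything (the old code returned True here).
        return False
    if presented is None:
        return False
    # Constant-time comparison avoids leaking the PSK through timing.
    return hmac.compare_digest(provisioned_psk.encode(), presented.encode())
```

The important branch is the first one: an unprovisioned node is treated as "nobody is authorised" rather than "everybody is authorised".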
End-to-end sensing pipeline overhaul on real ESP32-S3 hardware
(room01 192.168.0.101, room02 192.168.0.100, TP-Link_8340 WISP AP).
9 ADRs added, ~6.5k lines net, 32 commits, all OTA-flashed and live-
verified on the operator's deployment. Merged origin/main cleanly
(introspection tap from PR #554 coexists, our ADR-099 renamed to
ADR-110 to free the slot).
Shipped
Server (`v2/crates/wifi-densepose-sensing-server`)
…data/baseline.json) with universal threshold via CV-normalization (one threshold set works in any room)
Firmware (`firmware/esp32-csi-node`)
Ops & docs
Verified live
Notes for review
See `CHECKLIST.md` for the full Done / Open list with effort estimates.