Skip to content

hailo: NPU pipeline pool exploration + bridge cache/health parity (iter 234-249)#418

Merged
ruvnet merged 17 commits intomainfrom
hailo-pipeline-pool
May 4, 2026
Merged

hailo: NPU pipeline pool exploration + bridge cache/health parity (iter 234-249)#418
ruvnet merged 17 commits intomainfrom
hailo-pipeline-pool

Conversation

@ruvnet
Copy link
Copy Markdown
Owner

@ruvnet ruvnet commented May 4, 2026

Summary

Sixteen iterations on the hailo-backend follow-up branch, covering NPU pool exploration, bridge feature parity with embed.rs/bench.rs, observability gaps, and a mmwave parser type-safety fix.

What ships

NPU pipeline pool (iter 234-237, 239)

  • HefEmbedderPool: N independent network-group + vstream pairs on the shared vdevice
  • Wired behind RUVECTOR_NPU_POOL_SIZE env (default 1; deploys default 2)
  • Measured negative result on throughput — HailoRT serializes inferences at the vdevice level, 70 RPS ceiling holds (1000ms / 14ms per inference)
  • Measured positive result on tail latency — pool=2 at concurrency=4 cuts p50 from 56.7ms → 43.5ms (-23%) under multi-bridge concurrent load
  • Memory cost measured: pool=2 = +55 MB RSS, pool=4 = +164 MB; pool=2 captures the full latency win at minimum cost
  • Findings documented in hef_embedder_pool.rs so the next optimizer doesn't re-run the experiment

Bridge feature parity (iter 238, 240, 242-245)

All three bridges (ruvllm, ruview-csi, mmwave) now expose:

  • --cache <N> (32500× speedup at full hit rate; iter-238 measurement)
  • --cache-ttl <secs> (max-staleness bound; defense-in-depth against silent worker drift)
  • --health-check <secs> (background fingerprint probe; closes the silent-drift threat for long-running bridges)

ADR-172 §2a fingerprint+cache gate enforced uniformly across all four cluster CLIs (embed.rs, bench.rs, all three bridges).

Observability + log hygiene (iter 247, 248)

  • iter-247: RUVECTOR_LOG_TEXT_CONTENT=full capped at 200 chars + ellipsis marker. Worst-case journal volume drops 100× (4.5 MB/s → 42 KB/s @ 70 RPS, 64 KB requests). Char-boundary-safe with multi-byte UTF-8.
  • iter-248: worker logs RUVECTOR_NPU_POOL_SIZE at startup alongside the iter-180+ DoS-gate banner

CI + docs (iter 241, 244, 246)

  • iter-241: refreshed 4 stale "once iteration N" references that pointed at completed milestones
  • iter-244: dispatch microbench smoke-tested in hailo-backend-audit.yml CI (~30s, catches harness/API regressions)
  • iter-246: bridge env examples document the iter-238..245 flags with recommended hardened-deploy lines

Type safety (iter 249)

  • mmwave parser: Event::Unknown.payload_len widened u8 → u16 to match the protocol's 2-byte length field

Test plan

  • Cluster-bench measurements on cognitum-v0 (Pi 5 + Hailo-8) for iter-235/236/237
  • Worker rebuilt + deployed; iter-248 startup log verified live
  • All cluster + mmwave + hailo unit tests pass (added 4 new tests)
  • hailo-backend-audit.yml CI gate (cargo-deny + cargo-audit + clippy + test pyramid + cross-build aarch64)

🤖 Generated with claude-flow

ruvnet and others added 17 commits May 4, 2026 08:34
Queued post-iter-227 baseline. Single-pipeline HefEmbedder caps
cluster throughput at ~70 RPS because every gRPC request serializes
on a single Mutex<Inner>. Hailo-8 + PCIe DMA can overlap — ~14ms per
inference is mostly PCIe transfer (~12ms), only ~2ms NPU compute. A
multi-pipeline pool should unlock 2-4× throughput.

# Baseline (iter 227, single pipeline, cognitum-v0)
| concurrency | throughput | p50    | p99    |
|-------------|------------|--------|--------|
| 1           | 70.6 RPS   | 14.1ms | 15.8ms |
| 4           | 70.7 RPS   | 56.7ms | 74.7ms |
| 8           | 70.7 RPS   | 112.7ms| 170.7ms|

Throughput plateaus regardless of concurrency; p50 scales linearly
confirming the lock is the choke point.

# Skeleton (this commit)
- `HefEmbedderPool` mirroring CpuEmbedder's Vec<Mutex<Slot>> pattern.
- N independent HefPipeline instances on the shared vdevice;
  HailoRT's network-group scheduler arbitrates NPU access.
- `embed()`: try_lock each slot in turn; first free wins; fall back
  to blocking on slot 0 if all busy (matches cpu_embedder.rs).
- DEFAULT_POOL_SIZE = 4 (overlap PCIe write / NPU / PCIe read /
  host pre-post-processing without scheduler exhaustion).
- Compile-only test asserts Send + Sync so worker can hand out
  Arc<HefEmbedderPool> across tokio tasks.

# Iter 235 plan (next)
- Wire HefEmbedderPool into ruvector-hailo-worker as a feature-flag.
- Deploy to cognitum-v0; rerun cluster-bench at concurrency 1/4/8.
- Sweep pool_size ∈ {2,4,8} to find the throughput knee.
- Document delta vs iter-227 baseline.

# Why a separate type, not a HefEmbedder field
Single-pipeline path stays cheaper for low-load deploys (init time,
RAM, no scheduler overhead). Solo Pi running mmwave-bridge keeps
HefEmbedder; cluster workers handling many concurrent gRPC streams
switch to HefEmbedderPool.

Co-Authored-By: claude-flow <ruv@ruv.net>
… 235)

Builds on iter-234's pool skeleton. HailoEmbedder now picks between
single-pipeline and pool-of-pipelines NPU dispatch at open() time
via a new private `HefBackend` enum. Selector is the
`RUVECTOR_NPU_POOL_SIZE` env var:

  unset / = 1  → Single (preserves iter-162 default)
  >= 2         → Pool with N pipelines on the shared vdevice
  bad value    → falls back to Single (logs would be added later)

Default behavior unchanged — operators must opt into the pool. This
keeps the iter-227 baseline as the regression-floor: bench numbers
without RUVECTOR_NPU_POOL_SIZE set should match exactly.

# Baseline (re-stating from iter 234, single pipeline, cognitum-v0)
| concurrency | throughput | p50    | p99    |
|-------------|------------|--------|--------|
| 1           | 70.6 RPS   | 14.1ms | 15.8ms |
| 4           | 70.7 RPS   | 56.7ms | 74.7ms |
| 8           | 70.7 RPS   | 112.7ms| 170.7ms|

# Next (iter 236)
- Cross-compile the worker for aarch64 with the hailo feature
- Deploy to cognitum-v0 with `RUVECTOR_NPU_POOL_SIZE=4`
- Re-run cluster-bench at concurrency 1/4/8
- Document the throughput delta in the iter-236 commit
- Sweep pool_size ∈ {2,4,8} to find the knee

Co-Authored-By: claude-flow <ruv@ruv.net>
…iter 236)

Deployed iter-235's HefEmbedderPool to cognitum-v0 with
RUVECTOR_NPU_POOL_SIZE=4. Re-ran cluster-bench at concurrency 1/4/8
plus pool-size sweep at {2,4,8}. Throughput ceiling holds at 70.7 RPS
across every configuration — identical to iter-227 baseline.

# Before (iter 227, single pipeline)
| concurrency | throughput | p50    | p99    |
|-------------|------------|--------|--------|
| 1           | 70.6 RPS   | 14.1ms | 15.8ms |
| 4           | 70.7 RPS   | 56.7ms | 74.7ms |
| 8           | 70.7 RPS   | 112.7ms| 170.7ms|

# After (iter 235 deployed, RUVECTOR_NPU_POOL_SIZE=4)
| concurrency | throughput | p50    | p99    |
|-------------|------------|--------|--------|
| 1           | 70.6 RPS   | 14.1ms | 16.7ms |
| 4           | 70.7 RPS   | 43.5ms | 84.9ms |
| 8           | 70.7 RPS   | 112.9ms| 211.7ms|

# Pool-size sweep at fixed concurrency
| pool | concurrency | throughput | p50    |
|------|-------------|------------|--------|
| 2    | 4           | 70.7 RPS   | 43.3ms |
| 4    | 4           | 70.7 RPS   | 43.5ms |
| 8    | 8           | 70.7 RPS   | 112.9ms|

Delta: 0% throughput. p50 at c=4 dropped from 56.7ms → 43.5ms (a 23%
tail-latency improvement) because each request gets its own host-side
queue slot — but the NPU itself remains the choke point.

# Why the pool doesn't help
HailoRT's network-group scheduler serializes inferences at the vdevice
level. The Hailo-8 has one inference engine per chip and HailoRT does
NOT pipeline DMA-write / NPU-compute / DMA-read across configured
network groups. The 70 RPS = 1000ms / 14ms-per-inference ceiling is
a hard NPU+PCIe limit per single-batch HEF.

# What stays
- HefEmbedderPool kept in tree (no regression at pool=1 default;
  marginal p50 win at concurrency > 1).
- RUVECTOR_NPU_POOL_SIZE env knob remains operator-controlled.
- Pi systemd env reverted to RUVECTOR_NPU_POOL_SIZE=1 (matches the
  iter-227 acceptance baseline).
- Module docstring updated to record the negative result so the next
  optimizer doesn't waste another iteration on the same hypothesis.

# Iter 237 candidates (real throughput unlock)
- Async vstreams via hailo_vstream_recv_async — should overlap DMA
  with NPU compute *within* one network group.
- Batch-compiled HEF (--batch-size 4 via DFC) — needs Hailo SDK on
  a host machine; multi-day fork.

Co-Authored-By: claude-flow <ruv@ruv.net>
…237)

iter-236 confirmed pool size doesn't affect throughput (NPU-bound at
70 RPS regardless), but pool=2 at concurrency=4 cuts p50 latency 23%
vs single-pipeline (43.5ms vs 56.7ms baseline). The win is real for
multi-bridge deploys: cognitum-v0 runs ruvector-mmwave-bridge,
ruview-csi-bridge, and ruvllm-bridge all hitting the same worker, so
in-flight concurrency >1 is the steady state, not the exception.

# After (iter 237 deployed default)
| concurrency | throughput | p50    | p99    | vs baseline |
|-------------|------------|--------|--------|-------------|
| 1           | 70.6 RPS   | 14.1ms | 16.7ms | -           |
| 4           | 70.7 RPS   | 43.3ms | 84.7ms | -23% p50    |

Pool=2 chosen over pool=4: the latency win saturates at 2 (pool=4
gives the same p50). Each extra slot costs ~20 MB host-side
(tokenizer + embedding table copy); 2 slots is the floor that
captures the win without paying for unused capacity.

Cognitum-v0 systemd env updated to pool=2. Default in
ruvector-hailo.env.example bumped from "no entry" to RUVECTOR_NPU_POOL_SIZE=2
so future deploys get the latency win out of the box. Operators who
want the iter-227 baseline (single pipeline) can set =1.

Co-Authored-By: claude-flow <ruv@ruv.net>
The bridge previously constructed `HailoClusterEmbedder::new(...)`
without the existing coordinator-side LRU cache. RAG workloads
through ruvllm repeat the same context strings constantly (system
prompt, tool descriptions, frequently-cited docs) so the cache
hit rate is naturally high — but operators couldn't opt in
without re-coding the bridge.

# Cache-hit speedup measured iter-237 prep on cognitum-v0:
| configuration                        | throughput   | p50    | hit_rate |
|--------------------------------------|--------------|--------|----------|
| no cache (NPU bound, iter-227 base)  | 70.7 RPS     | 43.5ms | n/a      |
| --cache 4096 --cache-keyspace 64     | 2305282 RPS  | 0us    | 1.000    |

Delta: 32500x throughput, ~all latency removed at 100% hit rate.
The cache lives in-process so the bridge resolves a hit before
the gRPC call to the worker, which is why the speedup is so
dramatic — it doesn't touch the NPU at all.

# What ships
- New `--cache <N>` flag (default 0 = disabled, backward compat).
- ADR-172 section 2a guard: refuses cache > 0 with empty fingerprint
  unless --allow-empty-fingerprint is set (mirrors embed.rs +
  bench.rs gates — without a fingerprint binding, a stale cache
  could leak vectors across worker fleets that don't share the
  same model).
- --help updated with the iter-238 measurement.
- Operator-controlled, opt-in. No deploy default change.

Same cache implementation already exposed via embed.rs's --cache
and HailoClusterEmbedder::with_cache. The mmwave-bridge and
ruview-csi-bridge consume mostly-unique sensor data so they don't
benefit; deferring those bridges to a separate iter if measured
hit rates ever justify it.

Co-Authored-By: claude-flow <ruv@ruv.net>
iter-237's commit message claimed pool=2 cost "~20 MB per extra slot".
Direct ps measurement on cognitum-v0 showed the real cost is much
higher — ~55 MB per slot, dominated by HailoRT's per-network-group
DMA and ring buffers, not the host-side state I'd assumed:

  pool=1 → 87 MB RSS  (baseline)
  pool=2 → 142 MB RSS (+55 MB / +64%)
  pool=4 → 251 MB RSS (+164 MB / nearly 3x baseline)

The shared safetensors mmap (~90 MB) and HEF (~4 MB) ARE deduplicated
by the kernel page cache, but each HailoRT-configured network group
allocates its own DMA + ring-buffer set on top of the shared mmaps.

# What changes
- env example explains the actual measured cost so operators can
  budget RAM correctly. Pi 5 8 GB → pool=2 fits comfortably; 4 GB
  Pi 5 should run pool=1 to leave room for bridges + system.
- DEFAULT_POOL_SIZE constant in hef_embedder_pool.rs corrected
  from 4 to 2, matching the iter-237 deploy default and the
  iter-236 measurement that proved pool=4 buys nothing extra.

The iter-237 deployed default (pool=2) was already right empirically
— this iter just makes the docs match reality so the next reader
doesn't get the wrong picture.

Co-Authored-By: claude-flow <ruv@ruv.net>
Symmetric to iter-238 (ruvllm-bridge --cache). The CSI summary
text is a fixed-template NL string interpolating seven
small-cardinality fields (node_id, channel, rssi, noise, antennas,
subcarriers, magic-kind). In steady-state radar deploys these
fields have low entropy — channel and antenna counts are board
constants, rssi/noise float in narrow ranges, n_subcarriers is
fixed by the WiFi standard. Many frames produce identical NL
strings, which is exactly the workload where iter-238's
cluster-bench measurement showed 32500x speedup at full hit rate.

# What ships
- New `--cache <N>` flag (default 0 = disabled, backward compat).
- Same ADR-172 section 2a guard as ruvllm-bridge / embed.rs / bench.rs:
  refuses cache > 0 with empty fingerprint unless explicit opt-out.
- Startup banner reports cache size when enabled.
- --help updated with the iter-240 rationale.

Cache hit rate in real radar deploys is workload-specific and
needs operator measurement; a small `--cache 1024` is enough to
cover the discrete (channel, antenna, rssi-bucket) cross product
for a typical mmwave-paired CSI setup.

mmwave-bridge stays cache-less — radar packets carry continuous
timestamps + range/doppler bins so the per-packet text is unique
per frame; cache hit rate there would be near zero, paying memory
for nothing. Defer to a separate iter if measured radar traffic
ever shows duplicate strings.

Co-Authored-By: claude-flow <ruv@ruv.net>
Four cross-crate doc strings still pointed at "once iteration X
lands" milestones that have already shipped:

  ruvector-hailo/src/lib.rs:5      "once iter 3 lands the path dep"
  ruvector-hailo/src/lib.rs:424    "once iter 4 brings Mutex<Device>"
  ruvector-hailo-cluster/src/lib.rs:141  "once iter 14 brings ruvector-core"
  ruvector-hailo-cluster/src/bin/worker.rs:380  "later iters pipeline NPU"

The first three were closed by iter-218 (ADR-178 Gap B path-dep +
EmbeddingProvider impl). The fourth was partially addressed by the
iter-234..236 pool work — confirmed empirically that NPU dispatch
serializes at the vdevice level so concurrent embed_stream
fan-out can't help today. Each docstring now records the iter
that resolved the milestone (so a future reader knows whether to
trust the comment or chase the wrong rabbit).

Same anti-staleness pattern as iter-217's ADR-167 status-block
collapse — the stratigraphy of in-flight comments rots faster
than the code, and a fresh reader doesn't know which TODOs are
real until they've audited the git history.

No behavioral change.

Co-Authored-By: claude-flow <ruv@ruv.net>
Corrects iter-240's incorrect claim that mmwave radar packets
produce unique strings per frame. The radar payload carries
timestamps but the NL summary template *discards* them — only
four templates exist:

  "breathing rate {N} bpm at radar sensor"
  "heart rate {N} bpm at radar sensor"
  "nearest target distance {N} cm at radar sensor"
  "(no )?person detected at radar sensor"

The {N} integers live in narrow physiological ranges (breathing
10-30, heart rate 60-100, distance 0-500 cm), giving roughly 200
unique strings total across the entire mmwave domain. After the
warmup window every packet is a cache hit — exactly the workload
where iter-238's cluster-bench measured 32500x speedup.

# What ships
- New `--cache <N>` flag (default 0 = disabled, backward compat).
- Same ADR-172 section 2a guard as ruvllm-bridge / ruview-csi-bridge /
  embed.rs / bench.rs.
- Startup banner reports cache size when enabled.
- --help updated with the iter-242 rationale.

All three sensor bridges now expose --cache symmetrically:

  ruvllm-bridge      iter 238  (RAG context repeats)
  ruview-csi-bridge  iter 240  (CSI summary low-cardinality)
  mmwave-bridge      iter 242  (radar templates low-cardinality)

Co-Authored-By: claude-flow <ruv@ruv.net>
embed.rs and bench.rs already supported `--cache-ttl <secs>` for
ops who want a max-staleness bound on cached vectors; the bridges
exposed only `--cache` (TTL=0, LRU eviction only). Closes the
parity gap.

# Why TTL matters operationally
With LRU only, an entry that keeps getting hit lives forever in
the cache — even if the worker fleet has silently drifted (config
change that doesn't bump the HEF hash, NPU recalibration, etc.).
The fingerprint gate prevents *new* entries from being inserted
across a fleet split, but pre-existing entries persist.

A finite TTL bounds that worst-case staleness: every entry is
re-fetched at least once per TTL window, so a silent worker drift
self-heals after one TTL cycle of latency cost. Recommended deploy
default for long-running bridges: --cache-ttl 300 (5 min) — short
enough to bound drift, long enough to amortise the cache hit
across the steady-state workload.

# What ships
- All three bridges: ruvllm-bridge, ruview-csi-bridge, mmwave-bridge.
- New `--cache-ttl <secs>` flag (default 0 = no TTL, LRU only).
- Wired through the same `with_cache_ttl(cap, Duration)` API
  embed.rs uses, so the flag's semantics are bit-identical
  across all four cluster CLIs.
- Backward compatible: omitting --cache-ttl behaves exactly as
  iter-238/240/242 (LRU-only cache).

Co-Authored-By: claude-flow <ruv@ruv.net>
The cluster crate has had a Criterion microbench at
`benches/dispatch.rs` since iter-80 (P2cPool RNG path,
HashShardRouter content hashing, full embed_one_blocking against
in-memory transport) but it never ran in CI — it's only triggered
when an operator types `cargo bench --bench dispatch` locally.

Adding `cargo bench --bench dispatch -- --test` to the audit
workflow's test job. The `--test` flag runs each bench function
exactly once instead of criterion's default (~100 iterations +
warmup), so the cost is ~30 seconds in CI but the smoke catches:

  * bench harness panic from a removed dep or API change
  * imports broken by a refactor of the cluster surface
  * a hot-path function renamed without updating the bench

This is the fast variant of regression-gating — it doesn't detect
*numerical* regressions (a 2x slowdown that still completes
successfully). True regression detection needs baseline-file
comparison (criterion-perf-events / cargo-codspeed / similar) and
is parked as a separate iter when the hailo branch produces enough
historical data points to define meaningful thresholds.

Local verification (cognitum-v0 wasn't needed):
  cargo bench --bench dispatch -- --test
    → "Testing ..." for each bench function, all "Success"

Co-Authored-By: claude-flow <ruv@ruv.net>
embed.rs and bench.rs already supported background health checking
via spawn_health_checker since iter-99 — periodic fingerprint
probes with automatic ejection of mismatched workers and cache
clear-on-event. The bridges (mmwave, ruview-csi, ruvllm) didn't,
which is exactly the wrong place to skip it: bridges are the
*long-running* CLIs (mmwave deploys run for days), so silent
worker drift goes uncaught the longest there.

# Threat closed
Worker A is deployed with HEF X and fingerprint x-hash. Bridge
starts, validates fp at startup, hands out vectors. Operator
re-deploys worker A with HEF Y (new model) and fingerprint
y-hash. Bridge keeps dispatching, gets vectors back from worker
that no longer match its expected fp — silently producing wrong
embeddings until the bridge restarts.

With --health-check 30, the bridge probes every 30s, ejects the
drifted worker from the dispatch pool, clears any cached entries
keyed on the old fp, and stops poisoning downstream consumers
within ~one probe interval.

# What ships
- All three bridges: ruvllm-bridge, ruview-csi-bridge, mmwave-bridge.
- New `--health-check <secs>` flag (default 0 = disabled, backward
  compat with iter-238/240/242 behavior).
- When set, spawns a single-thread tokio runtime named
  "health-check" for the lifetime of main, hands its handle to
  spawn_health_checker, retains both via a let-bound _keepalive
  so dropping the runtime aborts the checker cleanly on Ctrl-C.
- Same HealthCheckerConfig as embed.rs (interval override, all
  other defaults from health_checker_config()).
- --help text updated with the iter-245 rationale.

Recommended deploy interval for long-running bridges: 30-60
seconds. Stricter (every 5s) is fine if the bridge is the only
load on the worker; looser (every 5min) is the floor — anything
beyond that, the threat window dominates over CPU savings.

Co-Authored-By: claude-flow <ruv@ruv.net>
…ter 246)

iter-238 (ruvllm-bridge --cache), iter-240/242 (other bridges
--cache), iter-243 (--cache-ttl), iter-245 (--health-check) all
shipped CLI flags but didn't update the deploy env templates.
Operators following the install scripts get a fresh
/etc/ruvector-mmwave-bridge.env that has no hint these knobs
even exist.

Closing the doc gap by adding annotated suggestions to all three
RUVECTOR_*_EXTRA_ARGS sections:

  ruvector-mmwave-bridge.env.example  → --cache + --cache-ttl + --health-check
  ruview-csi-bridge.env.example       → --cache + --cache-ttl + --health-check
  ruvllm-bridge.env.example           → --cache + --cache-ttl

Each example shows the recommended hardened deploy line so
operators can copy-paste:

  RUVECTOR_*_EXTRA_ARGS=--cache 4096 --cache-ttl 300 --health-check 30

(ruvllm-bridge omits --health-check from the typical deploy because
ruvllm typically forks the bridge per-session — health checking a
sub-second-lifetime process is a no-op.)

No code change. No behavioral change. Deploy parity / discoverability
fix only.

Co-Authored-By: claude-flow <ruv@ruv.net>
The audit-log Full mode rendered text verbatim — for an embed
request the iter-180 byte cap allows up to 64 KB. An operator
who flips RUVECTOR_LOG_TEXT_CONTENT=full to debug in prod could
push 64 KB × 70 RPS = 4.5 MB/s of journald traffic, which:
  * burns journal disk fast (10s of GB/hour)
  * produces single-line entries that break most ops tooling
    (long-line scanners, journalctl --grep regex backtracking)
  * makes individual entries unscannable by humans anyway

Capping at 200 chars per text preserves the debug utility — you
can still grep for content correlations against request_id — at
1/300th the worst-case journald volume. The cut is char-boundary-
safe (counted via str::chars()) so multi-byte UTF-8 doesn't panic
the rendering path.

# Worst case before vs after
Request: 64 KB UTF-8 text @ 70 RPS, RUVECTOR_LOG_TEXT_CONTENT=full
  Before: 64 KB × 70 = 4.5 MB/s journal volume per worker
  After:  600 B × 70 = 42 KB/s (200 chars + UTF-8 + framing)

Three tests added: short (≤cap, unchanged), long (truncated +
ellipsis marker), multi-byte (300×U+1F980 emoji = 1.2 KB,
truncates on a char boundary not byte boundary).

iter-180 capped REQUEST size; iter-190 capped RESPONSE size;
iter-247 caps the LOG-LINE size for the same defense-in-depth
reason. Full-mode logging stays the operator's footgun (per the
existing docstring) — but it's now a footgun that doesn't
exhaust the disk in 10 minutes.

Co-Authored-By: claude-flow <ruv@ruv.net>
iter-235 added the env-var knob for the HefEmbedderPool selector,
but the worker never logged the resolved value at startup. An
operator who flipped pool=2→4 (or back to 1 on a memory-constrained
4 GB Pi) had no confirmation the change actually took effect short
of inspecting RSS via `ps`.

Now the worker emits an info-level log line alongside the existing
iter-180/181/182/183/184 DoS-gate startup banner:

  NPU pipeline pool size pool_size=2 (iter 235; >=2 enables ...)

Same disclosure pattern as RUVECTOR_LOG_TEXT_CONTENT,
RUVECTOR_RATE_LIMIT_RPS, RUVECTOR_MAX_BATCH_SIZE, etc — every
operator-tunable env knob ends up in the journal at startup so
post-incident review can reconstruct the running config without
reading /etc/ruvector-hailo.env at the time of the incident.

No behavior change. Pure observability.

Co-Authored-By: claude-flow <ruv@ruv.net>
`Event::Unknown { frame_type, payload_len }` carried a u8 payload_len
even though the MR60BHA2 protocol uses a 2-byte length field. The
current parser caps payloads at MAX_PAYLOAD=64 (well within u8) so
this was never a runtime truncation, but:

- Type didn't match the protocol's intent — operators reading the
  emitted JSONL had to remember the implicit cap.
- `clippy::cast_possible_truncation` fired at the construction
  site (`payload.len() as u8`) and the bridge's emission site.
  Pedantic, but the alternative — silencing with `#[allow]` — is
  worse than just using the right type.

Now the construction site uses `u16::try_from(...).unwrap_or(u16::MAX)`,
which honestly handles any future MAX_PAYLOAD bump up to 65535
bytes. The mmwave-bridge JSONL formatter already prints the value
via `{}` so emission stays unchanged.

Test added that locks the field width: an unknown frame with a
60-byte payload must report payload_len=60. (300 bytes would
exercise the formerly-truncating path but the parser rejects
anything > MAX_PAYLOAD before the Event is constructed, so the
test stays inside the parser's contract.)

Surfaced by an iter-249 cargo clippy --pedantic sweep; same
audit pass also flagged stylistic warnings (missing backticks,
implicit format args) which are out of scope.

Co-Authored-By: claude-flow <ruv@ruv.net>
… 250)

Closes the doc gap surfaced by the iter-234..249 PR review:
ruvector-hailo-cluster had a 424-line operator README, but the 3
sibling crates (ruvector-hailo, ruvector-mmwave, hailort-sys)
shipped without one — `cargo doc --open` was the only on-ramp.

# What ships

- crates/ruvector-hailo/README.md         — embedding backend,
  3 feature-gated build paths, architecture diagram, iter-235+
  pool benchmark table, security posture summary, env vars
- crates/ruvector-mmwave/README.md        — MR60BHA2 wire format,
  parser API, criterion benchmark numbers, proptest fuzz suite
- crates/hailort-sys/README.md            — FFI binding scope,
  build requirements, why no safe wrapper at this layer
- crates/ruvector-hailo-cluster/README.md — added the iter-238
  cache-hit measurement table + the iter-234..237 pool benchmark
  table; refreshed the CLI section to enumerate all four cluster
  CLIs + the three bridges with their iter-243/245 flags

All builds verified clean:
  cargo build -p ruvector-hailo --no-default-features
  cargo build -p ruvector-hailo --features cpu-fallback
  cargo build -p ruvector-mmwave
  cargo build -p hailort-sys
  cargo build -p ruvector-hailo-cluster --bins

No code change. Documentation parity only.

Co-Authored-By: claude-flow <ruv@ruv.net>
@ruvnet ruvnet merged commit c7b0ba4 into main May 4, 2026
21 of 27 checks passed
@ruvnet ruvnet deleted the hailo-pipeline-pool branch May 4, 2026 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant