hailo: NPU pipeline pool exploration + bridge cache/health parity (iter 234-249)#418
Merged
Queued post-iter-227 baseline. Single-pipeline HefEmbedder caps
cluster throughput at ~70 RPS because every gRPC request serializes
on a single Mutex<Inner>. Hailo-8 + PCIe DMA can overlap — ~14ms per
inference is mostly PCIe transfer (~12ms), only ~2ms NPU compute. A
multi-pipeline pool should unlock 2-4× throughput.
# Baseline (iter 227, single pipeline, cognitum-v0)
| concurrency | throughput | p50 | p99 |
|-------------|------------|--------|--------|
| 1 | 70.6 RPS | 14.1ms | 15.8ms |
| 4 | 70.7 RPS | 56.7ms | 74.7ms |
| 8 | 70.7 RPS | 112.7ms| 170.7ms|
Throughput plateaus regardless of concurrency while p50 scales
linearly with it, confirming the lock is the choke point.
# Skeleton (this commit)
- `HefEmbedderPool` mirroring CpuEmbedder's Vec<Mutex<Slot>> pattern.
- N independent HefPipeline instances on the shared vdevice;
HailoRT's network-group scheduler arbitrates NPU access.
- `embed()`: try_lock each slot in turn; first free wins; fall back
to blocking on slot 0 if all busy (matches cpu_embedder.rs).
- DEFAULT_POOL_SIZE = 4 (overlap PCIe write / NPU / PCIe read /
host pre-post-processing without scheduler exhaustion).
- Compile-only test asserts Send + Sync so worker can hand out
Arc<HefEmbedderPool> across tokio tasks.
# Iter 235 plan (next)
- Wire HefEmbedderPool into ruvector-hailo-worker as a feature-flag.
- Deploy to cognitum-v0; rerun cluster-bench at concurrency 1/4/8.
- Sweep pool_size ∈ {2,4,8} to find the throughput knee.
- Document delta vs iter-227 baseline.
# Why a separate type, not a HefEmbedder field
Single-pipeline path stays cheaper for low-load deploys (init time,
RAM, no scheduler overhead). Solo Pi running mmwave-bridge keeps
HefEmbedder; cluster workers handling many concurrent gRPC streams
switch to HefEmbedderPool.
Co-Authored-By: claude-flow <ruv@ruv.net>
… 235)
Builds on iter-234's pool skeleton. HailoEmbedder now picks between
single-pipeline and pool-of-pipelines NPU dispatch at open() time
via a new private `HefBackend` enum. Selector is the
`RUVECTOR_NPU_POOL_SIZE` env var:
  unset / = 1 → Single (preserves iter-162 default)
  >= 2        → Pool with N pipelines on the shared vdevice
  bad value   → falls back to Single (logging to be added later)
Default behavior unchanged — operators must opt into the pool. This
keeps the iter-227 baseline as the regression-floor: bench numbers
without RUVECTOR_NPU_POOL_SIZE set should match exactly.
# Baseline (re-stating from iter 234, single pipeline, cognitum-v0)
| concurrency | throughput | p50 | p99 |
|-------------|------------|--------|--------|
| 1 | 70.6 RPS | 14.1ms | 15.8ms |
| 4 | 70.7 RPS | 56.7ms | 74.7ms |
| 8 | 70.7 RPS | 112.7ms| 170.7ms|
# Next (iter 236)
- Cross-compile the worker for aarch64 with the hailo feature
- Deploy to cognitum-v0 with `RUVECTOR_NPU_POOL_SIZE=4`
- Re-run cluster-bench at concurrency 1/4/8
- Document the throughput delta in the iter-236 commit
- Sweep pool_size ∈ {2,4,8} to find the knee
Co-Authored-By: claude-flow <ruv@ruv.net>
…iter 236)
Deployed iter-235's HefEmbedderPool to cognitum-v0 with
RUVECTOR_NPU_POOL_SIZE=4. Re-ran cluster-bench at concurrency 1/4/8
plus pool-size sweep at {2,4,8}. Throughput ceiling holds at 70.7 RPS
across every configuration — identical to iter-227 baseline.
# Before (iter 227, single pipeline)
| concurrency | throughput | p50 | p99 |
|-------------|------------|--------|--------|
| 1 | 70.6 RPS | 14.1ms | 15.8ms |
| 4 | 70.7 RPS | 56.7ms | 74.7ms |
| 8 | 70.7 RPS | 112.7ms| 170.7ms|
# After (iter 235 deployed, RUVECTOR_NPU_POOL_SIZE=4)
| concurrency | throughput | p50 | p99 |
|-------------|------------|--------|--------|
| 1 | 70.6 RPS | 14.1ms | 16.7ms |
| 4 | 70.7 RPS | 43.5ms | 84.9ms |
| 8 | 70.7 RPS | 112.9ms| 211.7ms|
# Pool-size sweep at fixed concurrency
| pool | concurrency | throughput | p50 |
|------|-------------|------------|--------|
| 2 | 4 | 70.7 RPS | 43.3ms |
| 4 | 4 | 70.7 RPS | 43.5ms |
| 8 | 8 | 70.7 RPS | 112.9ms|
Delta: 0% throughput. p50 at c=4 dropped from 56.7ms → 43.5ms (a 23%
median-latency improvement) because each request gets its own
host-side queue slot — but the NPU itself remains the choke point.
# Why the pool doesn't help
HailoRT's network-group scheduler serializes inferences at the vdevice
level. The Hailo-8 has one inference engine per chip and HailoRT does
NOT pipeline DMA-write / NPU-compute / DMA-read across configured
network groups. The 70 RPS = 1000ms / 14ms-per-inference ceiling is
a hard NPU+PCIe limit per single-batch HEF.
# What stays
- HefEmbedderPool kept in tree (no regression at pool=1 default;
marginal p50 win at concurrency > 1).
- RUVECTOR_NPU_POOL_SIZE env knob remains operator-controlled.
- Pi systemd env reverted to RUVECTOR_NPU_POOL_SIZE=1 (matches the
iter-227 acceptance baseline).
- Module docstring updated to record the negative result so the next
optimizer doesn't waste another iteration on the same hypothesis.
# Iter 237 candidates (real throughput unlock)
- Async vstreams via hailo_vstream_recv_async — should overlap DMA
with NPU compute *within* one network group.
- Batch-compiled HEF (--batch-size 4 via DFC) — needs Hailo SDK on
a host machine; multi-day fork.
Co-Authored-By: claude-flow <ruv@ruv.net>
…237)
iter-236 confirmed pool size doesn't affect throughput (NPU-bound at
70 RPS regardless), but pool=2 at concurrency=4 cuts p50 latency 23%
vs single-pipeline (43.5ms vs 56.7ms baseline). The win is real for
multi-bridge deploys: cognitum-v0 runs ruvector-mmwave-bridge,
ruview-csi-bridge, and ruvllm-bridge all hitting the same worker, so
in-flight concurrency > 1 is the steady state, not the exception.
# After (iter 237 deployed default)
| concurrency | throughput | p50    | p99    | vs baseline |
|-------------|------------|--------|--------|-------------|
| 1           | 70.6 RPS   | 14.1ms | 16.7ms | -           |
| 4           | 70.7 RPS   | 43.3ms | 84.7ms | -23% p50    |
Pool=2 chosen over pool=4: the latency win saturates at 2 (pool=4
gives the same p50). Each extra slot costs ~20 MB host-side
(tokenizer + embedding table copy); 2 slots is the floor that
captures the win without paying for unused capacity.
Cognitum-v0 systemd env updated to pool=2. Default in
ruvector-hailo.env.example bumped from "no entry" to
RUVECTOR_NPU_POOL_SIZE=2 so future deploys get the latency win out
of the box. Operators who want the iter-227 baseline (single
pipeline) can set =1.
Co-Authored-By: claude-flow <ruv@ruv.net>
The bridge previously constructed `HailoClusterEmbedder::new(...)`
without the existing coordinator-side LRU cache. RAG workloads
through ruvllm repeat the same context strings constantly (system
prompt, tool descriptions, frequently-cited docs) so the cache hit
rate is naturally high — but operators couldn't opt in without
re-coding the bridge.
# Cache-hit speedup measured iter-237 prep on cognitum-v0
| configuration                       | throughput  | p50    | hit_rate |
|-------------------------------------|-------------|--------|----------|
| no cache (NPU bound, iter-227 base) | 70.7 RPS    | 43.5ms | n/a      |
| --cache 4096 --cache-keyspace 64    | 2305282 RPS | 0us    | 1.000    |
Delta: 32500x throughput, ~all latency removed at 100% hit rate.
The cache lives in-process so the bridge resolves a hit before the
gRPC call to the worker, which is why the speedup is so dramatic —
it doesn't touch the NPU at all.
# What ships
- New `--cache <N>` flag (default 0 = disabled, backward compat).
- ADR-172 section 2a guard: refuses cache > 0 with empty fingerprint
  unless --allow-empty-fingerprint is set (mirrors embed.rs +
  bench.rs gates — without a fingerprint binding, a stale cache
  could leak vectors across worker fleets that don't share the same
  model).
- --help updated with the iter-238 measurement.
- Operator-controlled, opt-in. No deploy default change.
Same cache implementation already exposed via embed.rs's --cache and
HailoClusterEmbedder::with_cache. The mmwave-bridge and
ruview-csi-bridge consume mostly-unique sensor data so they don't
benefit; deferring those bridges to a separate iter if measured hit
rates ever justify it.
Co-Authored-By: claude-flow <ruv@ruv.net>
iter-237's commit message claimed pool=2 cost "~20 MB per extra
slot". Direct ps measurement on cognitum-v0 showed the real cost is
much higher — ~55 MB per slot, dominated by HailoRT's
per-network-group DMA and ring buffers, not the host-side state I'd
assumed:
  pool=1 →  87 MB RSS (baseline)
  pool=2 → 142 MB RSS (+55 MB / +64%)
  pool=4 → 251 MB RSS (+164 MB / nearly 3x baseline)
The shared safetensors mmap (~90 MB) and HEF (~4 MB) ARE
deduplicated by the kernel page cache, but each HailoRT-configured
network group allocates its own DMA + ring-buffer set on top of the
shared mmaps.
# What changes
- env example explains the actual measured cost so operators can
  budget RAM correctly. Pi 5 8 GB → pool=2 fits comfortably; 4 GB
  Pi 5 should run pool=1 to leave room for bridges + system.
- DEFAULT_POOL_SIZE constant in hef_embedder_pool.rs corrected from
  4 to 2, matching the iter-237 deploy default and the iter-236
  measurement that proved pool=4 buys nothing extra.
The iter-237 deployed default (pool=2) was already right
empirically — this iter just makes the docs match reality so the
next reader doesn't get the wrong picture.
Co-Authored-By: claude-flow <ruv@ruv.net>
Symmetric to iter-238 (ruvllm-bridge --cache). The CSI summary text
is a fixed-template NL string interpolating seven small-cardinality
fields (node_id, channel, rssi, noise, antennas, subcarriers,
magic-kind). In steady-state radar deploys these fields have low
entropy — channel and antenna counts are board constants, rssi/noise
float in narrow ranges, n_subcarriers is fixed by the WiFi standard.
Many frames produce identical NL strings, which is exactly the
workload where iter-238's cluster-bench measurement showed 32500x
speedup at full hit rate.
# What ships
- New `--cache <N>` flag (default 0 = disabled, backward compat).
- Same ADR-172 section 2a guard as ruvllm-bridge / embed.rs /
  bench.rs: refuses cache > 0 with empty fingerprint unless
  explicit opt-out.
- Startup banner reports cache size when enabled.
- --help updated with the iter-240 rationale.
Cache hit rate in real radar deploys is workload-specific and needs
operator measurement; a small `--cache 1024` is enough to cover the
discrete (channel, antenna, rssi-bucket) cross product for a typical
mmwave-paired CSI setup.
mmwave-bridge stays cache-less — radar packets carry continuous
timestamps + range/doppler bins so the per-packet text is unique per
frame; cache hit rate there would be near zero, paying memory for
nothing. Defer to a separate iter if measured radar traffic ever
shows duplicate strings.
Co-Authored-By: claude-flow <ruv@ruv.net>
Four cross-crate doc strings still pointed at "once iteration X
lands" milestones that have already shipped:
  ruvector-hailo/src/lib.rs:5    "once iter 3 lands the path dep"
  ruvector-hailo/src/lib.rs:424  "once iter 4 brings Mutex<Device>"
  ruvector-hailo-cluster/src/lib.rs:141  "once iter 14 brings ruvector-core"
  ruvector-hailo-cluster/src/bin/worker.rs:380  "later iters pipeline NPU"
The first three were closed by iter-218 (ADR-178 Gap B path-dep +
EmbeddingProvider impl). The fourth was partially addressed by the
iter-234..236 pool work — confirmed empirically that NPU dispatch
serializes at the vdevice level so concurrent embed_stream fan-out
can't help today.
Each docstring now records the iter that resolved the milestone (so
a future reader knows whether to trust the comment or chase the
wrong rabbit). Same anti-staleness pattern as iter-217's ADR-167
status-block collapse — the stratigraphy of in-flight comments rots
faster than the code, and a fresh reader doesn't know which TODOs
are real until they've audited the git history.
No behavioral change.
Co-Authored-By: claude-flow <ruv@ruv.net>
Corrects iter-240's incorrect claim that mmwave radar packets
produce unique strings per frame. The radar payload carries
timestamps but the NL summary template *discards* them — only
four templates exist:
"breathing rate {N} bpm at radar sensor"
"heart rate {N} bpm at radar sensor"
"nearest target distance {N} cm at radar sensor"
"(no )?person detected at radar sensor"
The {N} integers live in narrow physiological ranges (breathing
10-30, heart rate 60-100, distance 0-500 cm), giving roughly 200
unique strings total across the entire mmwave domain. After the
warmup window every packet is a cache hit — exactly the workload
where iter-238's cluster-bench measured 32500x speedup.
# What ships
- New `--cache <N>` flag (default 0 = disabled, backward compat).
- Same ADR-172 section 2a guard as ruvllm-bridge / ruview-csi-bridge /
embed.rs / bench.rs.
- Startup banner reports cache size when enabled.
- --help updated with the iter-242 rationale.
All three sensor bridges now expose --cache symmetrically:
ruvllm-bridge iter 238 (RAG context repeats)
ruview-csi-bridge iter 240 (CSI summary low-cardinality)
mmwave-bridge iter 242 (radar templates low-cardinality)
Co-Authored-By: claude-flow <ruv@ruv.net>
embed.rs and bench.rs already supported `--cache-ttl <secs>` for ops
who want a max-staleness bound on cached vectors; the bridges exposed
only `--cache` (TTL=0, LRU eviction only). Closes the parity gap.
# Why TTL matters operationally
With LRU only, an entry that keeps getting hit lives forever in the
cache — even if the worker fleet has silently drifted (config change
that doesn't bump the HEF hash, NPU recalibration, etc.). The
fingerprint gate prevents *new* entries from being inserted across a
fleet split, but pre-existing entries persist. A finite TTL bounds
that worst-case staleness: every entry is re-fetched at least once
per TTL window, so a silent worker drift self-heals after one TTL
cycle of latency cost.
Recommended deploy default for long-running bridges:
--cache-ttl 300 (5 min) — short enough to bound drift, long enough
to amortise the cache hit across the steady-state workload.
# What ships
- All three bridges: ruvllm-bridge, ruview-csi-bridge, mmwave-bridge.
- New `--cache-ttl <secs>` flag (default 0 = no TTL, LRU only).
- Wired through the same `with_cache_ttl(cap, Duration)` API embed.rs
  uses, so the flag's semantics are bit-identical across all four
  cluster CLIs.
- Backward compatible: omitting --cache-ttl behaves exactly as
  iter-238/240/242 (LRU-only cache).
Co-Authored-By: claude-flow <ruv@ruv.net>
The cluster crate has had a Criterion microbench at
`benches/dispatch.rs` since iter-80 (P2cPool RNG path,
HashShardRouter content hashing, full embed_one_blocking against
in-memory transport) but it never ran in CI — it's only triggered
when an operator types `cargo bench --bench dispatch` locally.
Adding `cargo bench --bench dispatch -- --test` to the audit
workflow's test job. The `--test` flag runs each bench function
exactly once instead of criterion's default (~100 iterations +
warmup), so the cost is ~30 seconds in CI but the smoke catches:
* bench harness panic from a removed dep or API change
* imports broken by a refactor of the cluster surface
* a hot-path function renamed without updating the bench
This is the fast variant of regression-gating — it doesn't detect
*numerical* regressions (a 2x slowdown that still completes
successfully). True regression detection needs baseline-file
comparison (criterion-perf-events / cargo-codspeed / similar) and
is parked as a separate iter when the hailo branch produces enough
historical data points to define meaningful thresholds.
Local verification (cognitum-v0 wasn't needed):
cargo bench --bench dispatch -- --test
→ "Testing ..." for each bench function, all "Success"
Co-Authored-By: claude-flow <ruv@ruv.net>
embed.rs and bench.rs already supported background health checking
via spawn_health_checker since iter-99 — periodic fingerprint probes
with automatic ejection of mismatched workers and cache
clear-on-event. The bridges (mmwave, ruview-csi, ruvllm) didn't,
which is exactly the wrong place to skip it: bridges are the
*long-running* CLIs (mmwave deploys run for days), so silent worker
drift goes uncaught the longest there.
# Threat closed
Worker A is deployed with HEF X and fingerprint x-hash. Bridge
starts, validates fp at startup, hands out vectors. Operator
re-deploys worker A with HEF Y (new model) and fingerprint y-hash.
Bridge keeps dispatching, gets vectors back from a worker that no
longer matches its expected fp — silently producing wrong embeddings
until the bridge restarts.
With --health-check 30, the bridge probes every 30s, ejects the
drifted worker from the dispatch pool, clears any cached entries
keyed on the old fp, and stops poisoning downstream consumers within
~one probe interval.
# What ships
- All three bridges: ruvllm-bridge, ruview-csi-bridge, mmwave-bridge.
- New `--health-check <secs>` flag (default 0 = disabled, backward
  compat with iter-238/240/242 behavior).
- When set, spawns a single-thread tokio runtime named
  "health-check" for the lifetime of main, hands its handle to
  spawn_health_checker, and retains both via a let-bound _keepalive
  so dropping the runtime aborts the checker cleanly on Ctrl-C.
- Same HealthCheckerConfig as embed.rs (interval override, all other
  defaults from health_checker_config()).
- --help text updated with the iter-245 rationale.
Recommended deploy interval for long-running bridges: 30-60 seconds.
Stricter (every 5s) is fine if the bridge is the only load on the
worker; looser (every 5 min) is the floor — anything beyond that,
the threat window dominates over the CPU savings.
Co-Authored-By: claude-flow <ruv@ruv.net>
…ter 246)
iter-238 (ruvllm-bridge --cache), iter-240/242 (other bridges
--cache), iter-243 (--cache-ttl), iter-245 (--health-check) all
shipped CLI flags but didn't update the deploy env templates.
Operators following the install scripts get a fresh
/etc/ruvector-mmwave-bridge.env that has no hint these knobs even
exist.
Closing the doc gap by adding annotated suggestions to all three
RUVECTOR_*_EXTRA_ARGS sections:
  ruvector-mmwave-bridge.env.example → --cache + --cache-ttl + --health-check
  ruview-csi-bridge.env.example      → --cache + --cache-ttl + --health-check
  ruvllm-bridge.env.example          → --cache + --cache-ttl
Each example shows the recommended hardened deploy line so operators
can copy-paste:
  RUVECTOR_*_EXTRA_ARGS=--cache 4096 --cache-ttl 300 --health-check 30
(ruvllm-bridge omits --health-check from the typical deploy because
ruvllm typically forks the bridge per-session — health checking a
sub-second-lifetime process is a no-op.)
No code change. No behavioral change. Deploy parity /
discoverability fix only.
Co-Authored-By: claude-flow <ruv@ruv.net>
The audit-log Full mode rendered text verbatim — for an embed
request the iter-180 byte cap allows up to 64 KB. An operator
who flips RUVECTOR_LOG_TEXT_CONTENT=full to debug in prod could
push 64 KB × 70 RPS = 4.5 MB/s of journald traffic, which:
* burns journal disk fast (10s of GB/hour)
* produces single-line entries that break most ops tooling
(long-line scanners, journalctl --grep regex backtracking)
* makes individual entries unscannable by humans anyway
Capping at 200 chars per text preserves the debug utility — you
can still grep for content correlations against request_id — at
1/300th the worst-case journald volume. The cut is char-boundary-
safe (counted via str::chars()) so multi-byte UTF-8 doesn't panic
the rendering path.
# Worst case before vs after
Request: 64 KB UTF-8 text @ 70 RPS, RUVECTOR_LOG_TEXT_CONTENT=full
Before: 64 KB × 70 = 4.5 MB/s journal volume per worker
After: 600 B × 70 = 42 KB/s (200 chars + UTF-8 + framing)
Three tests added: short (≤cap, unchanged), long (truncated +
ellipsis marker), multi-byte (300×U+1F980 emoji = 1.2 KB,
truncates on a char boundary not byte boundary).
iter-180 capped REQUEST size; iter-190 capped RESPONSE size;
iter-247 caps the LOG-LINE size for the same defense-in-depth
reason. Full-mode logging stays the operator's footgun (per the
existing docstring) — but it's now a footgun that doesn't
exhaust the disk in 10 minutes.
Co-Authored-By: claude-flow <ruv@ruv.net>
iter-235 added the env-var knob for the HefEmbedderPool selector,
but the worker never logged the resolved value at startup. An
operator who flipped pool=2→4 (or back to 1 on a memory-constrained
4 GB Pi) had no confirmation the change actually took effect short
of inspecting RSS via `ps`.
Now the worker emits an info-level log line alongside the existing
iter-180/181/182/183/184 DoS-gate startup banner:
  NPU pipeline pool size pool_size=2 (iter 235; >=2 enables ...)
Same disclosure pattern as RUVECTOR_LOG_TEXT_CONTENT,
RUVECTOR_RATE_LIMIT_RPS, RUVECTOR_MAX_BATCH_SIZE, etc. — every
operator-tunable env knob ends up in the journal at startup so
post-incident review can reconstruct the running config without
needing /etc/ruvector-hailo.env as it was at the time of the
incident.
No behavior change. Pure observability.
Co-Authored-By: claude-flow <ruv@ruv.net>
`Event::Unknown { frame_type, payload_len }` carried a u8 payload_len
even though the MR60BHA2 protocol uses a 2-byte length field. The
current parser caps payloads at MAX_PAYLOAD=64 (well within u8) so
this was never a runtime truncation, but:
- Type didn't match the protocol's intent — operators reading the
emitted JSONL had to remember the implicit cap.
- `clippy::cast_possible_truncation` fired at the construction
site (`payload.len() as u8`) and the bridge's emission site.
Pedantic, but the alternative — silencing with `#[allow]` — is
worse than just using the right type.
Now the construction site uses `u16::try_from(...).unwrap_or(u16::MAX)`,
which honestly handles any future MAX_PAYLOAD bump up to 65535
bytes. The mmwave-bridge JSONL formatter already prints the value
via `{}` so emission stays unchanged.
Test added that locks the field width: an unknown frame with a
60-byte payload must report payload_len=60. (300 bytes would
exercise the formerly-truncating path but the parser rejects
anything > MAX_PAYLOAD before the Event is constructed, so the
test stays inside the parser's contract.)
Surfaced by an iter-249 cargo clippy --pedantic sweep; same
audit pass also flagged stylistic warnings (missing backticks,
implicit format args) which are out of scope.
Co-Authored-By: claude-flow <ruv@ruv.net>
… 250)
Closes the doc gap surfaced by the iter-234..249 PR review:
ruvector-hailo-cluster had a 424-line operator README, but the 3
sibling crates (ruvector-hailo, ruvector-mmwave, hailort-sys)
shipped without one — `cargo doc --open` was the only on-ramp.
# What ships
- crates/ruvector-hailo/README.md — embedding backend, 3
  feature-gated build paths, architecture diagram, iter-235+ pool
  benchmark table, security posture summary, env vars
- crates/ruvector-mmwave/README.md — MR60BHA2 wire format, parser
  API, criterion benchmark numbers, proptest fuzz suite
- crates/hailort-sys/README.md — FFI binding scope, build
  requirements, why no safe wrapper at this layer
- crates/ruvector-hailo-cluster/README.md — added the iter-238
  cache-hit measurement table + the iter-234..237 pool benchmark
  table; refreshed the CLI section to enumerate all four cluster
  CLIs + the three bridges with their iter-243/245 flags
All builds verified clean:
  cargo build -p ruvector-hailo --no-default-features
  cargo build -p ruvector-hailo --features cpu-fallback
  cargo build -p ruvector-mmwave
  cargo build -p hailort-sys
  cargo build -p ruvector-hailo-cluster --bins
No code change. Documentation parity only.
Co-Authored-By: claude-flow <ruv@ruv.net>
Summary
Sixteen iterations on the hailo-backend follow-up branch, covering
NPU pool exploration, bridge feature parity with embed.rs/bench.rs,
observability gaps, and a mmwave parser type-safety fix.
What ships
NPU pipeline pool (iter 234-237, 239)
- HefEmbedderPool: N independent network-group + vstream pairs on the shared vdevice
- RUVECTOR_NPU_POOL_SIZE env knob (default 1; deploys default 2)
- Negative result documented in hef_embedder_pool.rs so the next optimizer doesn't re-run the experiment
Bridge feature parity (iter 238, 240, 242-245)
All three bridges (ruvllm, ruview-csi, mmwave) now expose:
- --cache <N> (32500× speedup at full hit rate; iter-238 measurement)
- --cache-ttl <secs> (max-staleness bound; defense-in-depth against silent worker drift)
- --health-check <secs> (background fingerprint probe; closes the silent-drift threat for long-running bridges)
ADR-172 §2a fingerprint+cache gate enforced uniformly across all five cluster CLIs (embed.rs, bench.rs, all three bridges).
Observability + log hygiene (iter 247, 248)
- RUVECTOR_LOG_TEXT_CONTENT=full capped at 200 chars + ellipsis marker. Worst-case journal volume drops 100× (4.5 MB/s → 42 KB/s @ 70 RPS, 64 KB requests). Char-boundary-safe with multi-byte UTF-8.
- Worker logs the resolved RUVECTOR_NPU_POOL_SIZE at startup alongside the iter-180+ DoS-gate banner
CI + docs (iter 241, 244, 246)
- Criterion dispatch bench smoke-run added to hailo-backend-audit.yml CI (~30s, catches harness/API regressions)
Type safety (iter 249)
- Event::Unknown.payload_len widened u8 → u16 to match the protocol's 2-byte length field
Test plan
🤖 Generated with claude-flow