feat(swarm): ADR-149 evaluation harness — GDOP, IQM+bootstrap, noise sweep by ruvnet · Pull Request #875 · ruvnet/RuView

ruvnet · 2026-05-30T21:17:30Z

Summary

Implements ADR-149 (statistically rigorous swarm evaluation methodology; peer-reviewed, status Accepted). Adds the Stage-1 kinematic evaluation harness for ruview-swarm: seeded multi-run rollouts → SAR + MARL metrics reported as IQM with 95% stratified-bootstrap CIs, plus a (σ, κ) CSI-noise sweep, GDOP tracking, and a RESULTS.md leaderboard generator. Pure Rust, no new dependencies.

What's new (`src/evals/`)

| File | Contents |
|---|---|
| `gdop.rs` | 2D Geometric Dilution of Precision via closed-form `(HᵀH)⁻¹`; `None` for <2 observers / collinear / singular geometry |
| `stats.rs` | IQM (Agarwal 2021), 95% stratified-bootstrap CI (deterministic LCG), probability-of-improvement |
| `metrics.rs` | `EpisodeMetrics` + `AggregateMetrics::from_strata` (IQM±CI, seed-stratified) |
| `runner.rs` | seeded kinematic rollout (driven by `FlightPattern`), seed×episode matrix, 3σ×3κ default noise sweep |
| `report.rs` + `bin/eval_swarm` | generates `evals/RESULTS.md` |
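
As an illustration of the `gdop.rs` row, here is a minimal sketch of a closed-form 2D GDOP. It is not the code from this PR. The function name `gdop_2d`, its signature, the singularity threshold, and the choice of unit line-of-sight vectors as the rows of `H` are all assumptions made for the example; only the `(HᵀH)⁻¹` trace-sqrt form and the `None` cases come from the description above.

```rust
/// Hypothetical sketch: 2D GDOP = sqrt(trace((HᵀH)⁻¹)).
/// Assumption: each row of H is the unit vector from the target to one observer.
pub fn gdop_2d(target: (f64, f64), observers: &[(f64, f64)]) -> Option<f64> {
    if observers.len() < 2 {
        return None; // fewer than 2 observers: geometry is undefined
    }
    // Accumulate the symmetric 2x2 matrix HᵀH = [[a, b], [b, d]].
    let (mut a, mut b, mut d) = (0.0_f64, 0.0_f64, 0.0_f64);
    for &(ox, oy) in observers {
        let (dx, dy) = (ox - target.0, oy - target.1);
        let r = (dx * dx + dy * dy).sqrt();
        if r < 1e-9 {
            continue; // observer sits on the target: no usable direction
        }
        let (ux, uy) = (dx / r, dy / r);
        a += ux * ux;
        b += ux * uy;
        d += uy * uy;
    }
    let det = a * d - b * b;
    if det.abs() < 1e-9 {
        return None; // collinear or otherwise singular geometry
    }
    // For a symmetric 2x2 matrix, trace of the inverse is (a + d) / det.
    Some(((a + d) / det).sqrt())
}
```

Under these assumptions, two observers at right angles around the target give `sqrt(2)`, while two observers on the same line through the target give `None`.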

The result it surfaces (the point of the methodology)

GDOP tracking exposes a real coverage-vs-localization-precision trade-off that point estimates would hide:

| Flight pattern | Coverage IQM [95% CI] | Localization (m) IQM [95% CI] | Detection | Mean GDOP |
|---|---|---|---|---|
| partitioned_lawnmower | 1.000 [1.000, 1.000] | 7.022 [5.669, 8.379] | 100% | 0.000 |
| pheromone | 0.662 [0.652, 0.671] | 4.110 [3.346, 5.141] | 95% | 1.598 |
| levy_flight | 0.490 | 3.523 | 100% | 0.000 |
| boustrophedon | 0.370 | 2.740 | 100% | 0.000 |
| spiral | 0.336 | 3.082 | 100% | 0.000 |
| potential_field | 0.254 | 4.343 | 100% | 0.000 |
| Wi2SAR (paper baseline) | n/a | 5.0 (paper) | n/a | n/a |

partitioned_lawnmower wins on coverage (disjoint strips), but its single-drone sightings (GDOP→0) give the worst localization; pheromone co-locates drones (GDOP 1.6) for better fusion. Coverage and localization precision genuinely trade off, which is exactly what the harness is built to reveal.
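
A pairwise comparison like the one above can also be expressed as a probability of improvement, which `stats.rs` lists among its statistics. The sketch below is one common formulation (the fraction of cross-pairs in which one method beats the other, with ties counted as half), not the PR's implementation; the signature and the `lower_is_better` flag are hypothetical.

```rust
/// Hypothetical sketch: probability that a run of X beats a run of Y.
/// Counts every (x, y) pair once; ties contribute 0.5.
pub fn probability_of_improvement(x: &[f64], y: &[f64], lower_is_better: bool) -> Option<f64> {
    if x.is_empty() || y.is_empty() {
        return None; // nothing to compare
    }
    let mut wins = 0.0_f64;
    for &a in x {
        for &b in y {
            let better = if lower_is_better { a < b } else { a > b };
            if better {
                wins += 1.0;
            } else if a == b {
                wins += 0.5;
            }
        }
    }
    Some(wins / (x.len() * y.len()) as f64)
}
```

For a metric such as localization error, `lower_is_better` would be `true`; for coverage it would be `false`. A value near 0.5 means neither method reliably beats the other.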

Methodology (ADR-149)

- Dual-stage pipeline: Stage 1 kinematic (this PR, full 10×50 matrix); Stage 2 Gazebo/PX4 SITL on the 3 median seeds for false-alarm + collision rate (documented follow-on)
- Statistical standard: ≥10 seeds, IQM + 95% stratified bootstrap, ≥3 baselines (Agarwal 2021 / Gorsane 2022 / rliable)
- Honest leaderboard position: no public leaderboard accepts CSI-SAR swarm submissions; Wi2SAR is a labeled paper-to-paper baseline
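
The statistical standard above can be sketched as follows. This is a minimal illustration, not the contents of `stats.rs`: the names `iqm`, `Lcg`, and `stratified_bootstrap_ci`, the LCG constants (Knuth's MMIX multiplier and increment), the percentile indexing, and the choice of one stratum per seed are all assumptions. IQM is taken here as the mean after trimming `floor(n/4)` values from each tail.

```rust
/// Interquartile mean: mean of what remains after dropping the lowest
/// and highest quarter of the sorted values.
pub fn iqm(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut v = values.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    let cut = v.len() / 4; // floor(0.25 * n) trimmed from each tail
    let mid = &v[cut..v.len() - cut];
    Some(mid.iter().sum::<f64>() / mid.len() as f64)
}

/// Small deterministic LCG so resampling is reproducible without dependencies.
/// The constants are an assumption; the PR does not say which LCG it uses.
pub struct Lcg(u64);

impl Lcg {
    pub fn new(seed: u64) -> Self {
        Lcg(seed)
    }
    fn next_index(&mut self, n: usize) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize) % n // use the high bits, which are better mixed
    }
}

/// 95% percentile CI of the IQM, resampling with replacement inside each
/// stratum (assumed: one stratum per seed) and pooling before aggregation.
pub fn stratified_bootstrap_ci(
    strata: &[Vec<f64>],
    resamples: usize,
    seed: u64,
) -> Option<(f64, f64)> {
    if resamples == 0 || strata.iter().all(|s| s.is_empty()) {
        return None;
    }
    let mut rng = Lcg::new(seed);
    let mut stats = Vec::with_capacity(resamples);
    let mut pooled = Vec::new();
    for _ in 0..resamples {
        pooled.clear();
        for s in strata.iter().filter(|s| !s.is_empty()) {
            for _ in 0..s.len() {
                pooled.push(s[rng.next_index(s.len())]);
            }
        }
        stats.push(iqm(&pooled)?);
    }
    stats.sort_by(|a, b| a.total_cmp(b));
    let lo = stats[(resamples as f64 * 0.025) as usize];
    let hi = stats[((resamples as f64 * 0.975) as usize).min(resamples - 1)];
    Some((lo, hi))
}
```

Because the generator is seeded, calling `stratified_bootstrap_ci` twice with the same strata, resample count, and seed returns the same interval, which keeps a generated leaderboard reproducible across runs.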

Tests

- `--no-default-features`: 116/116 (+13 eval tests)
- `--features full,train`: 133/133
- Clippy: 0 warnings in-crate (`-D warnings --no-deps`)
- `cargo run --bin eval_swarm` produces `RESULTS.md` ✓

Covered by the ruview-swarm CI guard (path-scoped feature matrix + clippy + ITAR guards).

…se sweep

Stage-1 kinematic evaluator per ADR-149 (peer-reviewed). Pure Rust, no new deps.

evals/:
- gdop.rs: 2D Geometric Dilution of Precision ((HᵀH)⁻¹ trace-sqrt); None for <2 observers or collinear/singular geometry
- stats.rs: IQM (Agarwal 2021) + 95% stratified-bootstrap CI (deterministic LCG) + probability_of_improvement
- metrics.rs: EpisodeMetrics + AggregateMetrics::from_strata (IQM±CI, seed-stratified)
- runner.rs: seeded kinematic rollout (FlightPattern-driven), seed×episode matrix, 3σ×3κ default noise sweep (Gaussian amplitude × von Mises phase)
- report.rs + eval_swarm bin: generates evals/RESULTS.md leaderboard

RESULTS.md surfaces the real coverage-vs-localization-precision trade-off via GDOP: partitioned wins coverage (100%) but single-drone sightings (GDOP 0 → 7.0m); pheromone gets multistatic fusion (GDOP 1.6 → 4.1m). Wi2SAR 5m paper-baseline row included. Stage-2 (Gazebo/PX4 SITL false-alarm + collision on median seeds) is documented follow-on.

Tests: 116 default / 133 full+train (+13 eval tests), 0 failed. Clippy clean (-D warnings).

Co-Authored-By: claude-flow <ruv@ruv.net>

ruvnet force-pushed the feat/adr-149-eval-harness branch from d6407ae to aabf7a7 Compare May 30, 2026 21:18

ruvnet merged commit 8d64434 into main May 30, 2026
49 of 51 checks passed

ruvnet deleted the feat/adr-149-eval-harness branch May 30, 2026 21:38

ruvnet merged 1 commit into main from feat/adr-149-eval-harness
