
feat(swarm): ADR-149 evaluation harness — GDOP, IQM+bootstrap, noise sweep #875

Merged
ruvnet merged 1 commit into main from feat/adr-149-eval-harness on May 30, 2026

@ruvnet ruvnet commented May 30, 2026

Summary

Implements ADR-149 (statistically-rigorous swarm evaluation methodology, peer-reviewed/Accepted). Adds the Stage-1 kinematic evaluation harness for ruview-swarm: seeded multi-run rollouts → SAR + MARL metrics with IQM + 95% stratified-bootstrap CIs, a (σ, κ) CSI-noise sweep, GDOP tracking, and a RESULTS.md leaderboard generator. Pure Rust, no new dependencies.

What's new (src/evals/)

| File | Contents |
| --- | --- |
| gdop.rs | 2D Geometric Dilution of Precision via closed-form (HᵀH)⁻¹; None for <2 observers / collinear / singular geometry |
| stats.rs | IQM (Agarwal 2021), 95% stratified-bootstrap CI (deterministic LCG), probability-of-improvement |
| metrics.rs | EpisodeMetrics + AggregateMetrics::from_strata (IQM±CI, seed-stratified) |
| runner.rs | seeded kinematic rollout (driven by FlightPattern), seed×episode matrix, 3σ×3κ default noise sweep |
| report.rs + bin/eval_swarm | generates evals/RESULTS.md |
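
To make the gdop.rs row concrete, here is a minimal sketch of a closed-form 2D GDOP. It is not the code from this PR: the function name `gdop_2d`, its signature, the tolerance `1e-9`, and the choice of unit line-of-sight vectors as the rows of H (a range-style measurement model) are all assumptions made for illustration.

```rust
/// Hypothetical sketch of 2D GDOP, assuming each row of H is the unit
/// line-of-sight vector from an observer to the target.
/// Returns None for fewer than 2 observers or (near-)singular geometry.
pub fn gdop_2d(observers: &[(f64, f64)], target: (f64, f64)) -> Option<f64> {
    if observers.len() < 2 {
        return None;
    }
    // Accumulate the symmetric 2x2 matrix HᵀH = [[a, b], [b, d]].
    let (mut a, mut b, mut d) = (0.0_f64, 0.0_f64, 0.0_f64);
    for &(ox, oy) in observers {
        let (dx, dy) = (target.0 - ox, target.1 - oy);
        let r = (dx * dx + dy * dy).sqrt();
        if r < 1e-9 {
            return None; // observer sits on the target: direction undefined
        }
        let (ux, uy) = (dx / r, dy / r);
        a += ux * ux;
        b += ux * uy;
        d += uy * uy;
    }
    // Closed-form inverse of a 2x2 matrix: trace((HᵀH)⁻¹) = (a + d) / det.
    let det = a * d - b * b;
    if det < 1e-9 {
        return None; // collinear observers make HᵀH singular
    }
    Some(((a + d) / det).sqrt())
}
```

Under these assumptions, two observers at right angles to the target, for example `gdop_2d(&[(10.0, 0.0), (0.0, 10.0)], (0.0, 0.0))`, give √2, while two observers on the same line through the target give `None`.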

The result it surfaces (the point of the methodology)

GDOP tracking exposes a real coverage-vs-localization-precision trade-off that point estimates would hide:

| Flight pattern | Coverage IQM [95% CI] | Localization (m) IQM [95% CI] | Detection | Mean GDOP |
| --- | --- | --- | --- | --- |
| partitioned_lawnmower | 1.000 [1.000, 1.000] | 7.022 [5.669, 8.379] | 100% | 0.000 |
| pheromone | 0.662 [0.652, 0.671] | 4.110 [3.346, 5.141] | 95% | 1.598 |
| levy_flight | 0.490 | 3.523 | 100% | 0.000 |
| boustrophedon | 0.370 | 2.740 | 100% | 0.000 |
| spiral | 0.336 | 3.082 | 100% | 0.000 |
| potential_field | 0.254 | 4.343 | 100% | 0.000 |
| Wi2SAR (paper baseline) | n/a | 5.0 (paper) | n/a | n/a |

partitioned_lawnmower wins coverage (disjoint strips), but its single-drone sightings (mean GDOP 0.000) give it the worst localization; pheromone co-locates drones (mean GDOP 1.6) for better fusion. Coverage and localization precision genuinely trade off, which is exactly what the harness is built to reveal.

Methodology (ADR-149)

  • Dual-stage pipeline: Stage 1 kinematic (this PR, full 10×50 matrix); Stage 2 Gazebo/PX4 SITL on the 3 median seeds for false-alarm + collision rate (documented follow-on)
  • Statistical standard: ≥10 seeds, IQM + 95% stratified bootstrap, ≥3 baselines (Agarwal 2021 / Gorsane 2022 / rliable)
  • Honest leaderboard position: no public leaderboard accepts CSI-SAR swarm submissions; Wi2SAR is a labeled paper-to-paper baseline
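
The statistical standard above can be sketched as follows. This is an illustrative sketch, not the contents of stats.rs: the names `Lcg`, `iqm`, and `stratified_bootstrap_ci` are hypothetical, the LCG constants are Knuth's MMIX parameters (the PR does not say which constants it uses), IQM is approximated as a trimmed mean that drops floor(n/4) values from each end, and each stratum is assumed to hold the per-episode scores of one seed.

```rust
/// Hypothetical deterministic LCG (Knuth MMIX constants, an assumption).
pub struct Lcg(u64);

impl Lcg {
    pub fn new(seed: u64) -> Self {
        Lcg(seed)
    }
    /// Advance the state and return an index in 0..n (n must be > 0).
    pub fn next_index(&mut self, n: usize) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize) % n // use the higher-quality upper bits
    }
}

/// Interquartile mean, approximated as the mean after dropping
/// floor(n/4) values from each end of the sorted sample.
pub fn iqm(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut v = values.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    let cut = v.len() / 4;
    let mid = &v[cut..v.len() - cut];
    Some(mid.iter().sum::<f64>() / mid.len() as f64)
}

/// 95% percentile CI of the IQM, resampling with replacement inside
/// each stratum so every seed keeps its original sample size.
pub fn stratified_bootstrap_ci(
    strata: &[Vec<f64>],
    resamples: usize,
    seed: u64,
) -> Option<(f64, f64)> {
    if resamples == 0 || strata.is_empty() || strata.iter().any(|s| s.is_empty()) {
        return None;
    }
    let mut rng = Lcg::new(seed);
    let mut stats = Vec::with_capacity(resamples);
    let mut pooled = Vec::new();
    for _ in 0..resamples {
        pooled.clear();
        for s in strata {
            for _ in 0..s.len() {
                pooled.push(s[rng.next_index(s.len())]);
            }
        }
        stats.push(iqm(&pooled)?);
    }
    stats.sort_by(|a, b| a.total_cmp(b));
    let at = |q: f64| stats[((stats.len() - 1) as f64 * q).round() as usize];
    Some((at(0.025), at(0.975)))
}
```

Because the generator is seeded explicitly, calling `stratified_bootstrap_ci` twice with the same strata and seed returns the same interval, which is the property a reproducible leaderboard needs from a deterministic LCG.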

Tests

  • --no-default-features: 116/116 (+13 eval tests)
  • --features full,train: 133/133
  • Clippy: 0 warnings in-crate (-D warnings --no-deps)
  • cargo run --bin eval_swarm produces RESULTS.md

Covered by the ruview-swarm CI guard (path-scoped feature matrix + clippy + ITAR guards).

Related

🤖 Generated with claude-flow

…se sweep

Stage-1 kinematic evaluator per ADR-149 (peer-reviewed). Pure Rust, no new deps.

evals/:
- gdop.rs: 2D Geometric Dilution of Precision ((HᵀH)⁻¹ trace-sqrt); None for
  <2 observers or collinear/singular geometry
- stats.rs: IQM (Agarwal 2021) + 95% stratified-bootstrap CI (deterministic LCG)
  + probability_of_improvement
- metrics.rs: EpisodeMetrics + AggregateMetrics::from_strata (IQM±CI, seed-stratified)
- runner.rs: seeded kinematic rollout (FlightPattern-driven), seed×episode matrix,
  3σ×3κ default noise sweep (Gaussian amplitude × von Mises phase)
- report.rs + eval_swarm bin: generates evals/RESULTS.md leaderboard
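
For reference, a probability-of-improvement statistic in the sense of Agarwal 2021 can be sketched as a pairwise comparison (a Mann-Whitney-style estimate of P(X > Y), with ties counted as one half). The signature below is hypothetical and may differ from the `probability_of_improvement` function in stats.rs.

```rust
/// Hypothetical sketch: estimate P(X > Y) over all pairs of run scores,
/// counting ties as 0.5. Assumes higher scores are better.
pub fn probability_of_improvement(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.is_empty() || y.is_empty() {
        return None;
    }
    let mut wins = 0.0_f64;
    for &xi in x {
        for &yj in y {
            if xi > yj {
                wins += 1.0;
            } else if xi == yj {
                wins += 0.5;
            }
        }
    }
    Some(wins / (x.len() * y.len()) as f64)
}
```

For a lower-is-better metric such as localization error, swapping the two arguments gives the probability that the first pattern improves on the second.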

RESULTS.md surfaces the real coverage-vs-localization-precision trade-off via GDOP:
partitioned wins coverage (100%) but single-drone sightings (GDOP 0 → 7.0m);
pheromone gets multistatic fusion (GDOP 1.6 → 4.1m). Wi2SAR 5m paper-baseline row included.

Stage-2 (Gazebo/PX4 SITL false-alarm + collision on median seeds) is documented follow-on.

Tests: 116 default / 133 full+train (+13 eval tests), 0 failed. Clippy clean (-D warnings).

Co-Authored-By: claude-flow <ruv@ruv.net>
@ruvnet ruvnet force-pushed the feat/adr-149-eval-harness branch from d6407ae to aabf7a7 Compare May 30, 2026 21:18
@ruvnet ruvnet merged commit 8d64434 into main May 30, 2026
49 of 51 checks passed
@ruvnet ruvnet deleted the feat/adr-149-eval-harness branch May 30, 2026 21:38