
Earth Moon Performance Investigation

Test User edited this page May 11, 2026 · 2 revisions

Earth–Moon Clementine performance investigation

Status: First-cut report from the #447 performance toolkit (PR #462). Covers Phase A (baselines), Phase B (per-phase drilldown), and Phase C (candidate inventory). Each named candidate (C1..Cn) is a tracking hook for a follow-on optimization PR.

Target: halve the tier3 earth_moon_clem runtime, from the current ~17 min wall-clock on the CI runner (~26 min on WSL2 dev hosts) to ≤8.5 min. Per-step cost is currently 83 µs/step; target ≤40 µs/step.

Bit-identity invariant. Every candidate below must preserve byte-equality of the tier3 cross-val JSON and pass all 53 bevy_parity_* tests. This rules out FMA contraction, operation reordering, target-cpu=native, and Rayon inside step(). Candidates that would require relaxing those constraints are flagged "RISKY" and are not pursued without a separate policy discussion.
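To make the invariant concrete, here is a minimal sketch (not taken from the codebase; all function names are illustrative) of why reordering and FMA contraction are excluded. Both transformations are mathematically neutral, yet they can change the bit pattern of an f64 result, which is exactly what a byte-equality check on the cross-val JSON would detect.

```rust
/// Sums three values left to right: (a + b) + c.
pub fn sum_left(a: f64, b: f64, c: f64) -> f64 {
    (a + b) + c
}

/// Same values, reassociated: a + (b + c).
pub fn sum_right(a: f64, b: f64, c: f64) -> f64 {
    a + (b + c)
}

/// a * b + c with two roundings (multiply, then add).
pub fn mul_then_add(a: f64, b: f64, c: f64) -> f64 {
    a * b + c
}

/// a * b + c with a single rounding (fused multiply-add).
pub fn fused(a: f64, b: f64, c: f64) -> f64 {
    a.mul_add(b, c)
}

/// Bit-level equality: stricter than `==`, it also separates 0.0 from -0.0.
pub fn bit_identical(x: f64, y: f64) -> bool {
    x.to_bits() == y.to_bits()
}

/// Returns (reorder_changes_bits, fma_changes_bits) for two small inputs.
pub fn demo() -> (bool, bool) {
    let reorder = !bit_identical(sum_left(0.1, 0.2, 0.3), sum_right(0.1, 0.2, 0.3));
    // x * x = 1 + 2^-51 + 2^-104; the last term is lost when the product
    // is rounded on its own, but survives inside a fused multiply-add.
    let x = 1.0 + f64::EPSILON;
    let c = -(1.0 + 2.0 * f64::EPSILON);
    let fma = !bit_identical(mul_then_add(x, x, c), fused(x, x, c));
    (reorder, fma)
}
```

Both flags come back true for these inputs, so a candidate that lets the compiler or a refactor reassociate or fuse operations cannot be assumed to pass the gate in §5.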

1. Executive summary

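| Metric | Current | Target | Gap (current − target) |
| --- | --- | --- | --- |
| tier3 wall-clock (CI runner) | ~17 min | ≤8.5 min | ~8.5 min |
| per_step_us (perf-runner, 100k × 5) | 83.26 | ≤40 | 43.26 |
| integration phase | 53.58 µs/step (64.4%) | ≤27 | 26.58 µs/step |
| ephemeris phase | 15.09 µs/step (18.1%) | ≤7.5 | 7.59 µs/step |
| environment phase | 13.63 µs/step (16.4%) | ≤7 | 6.63 µs/step |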

The hot path is gravity (combined 81% of step time across integration and environment phases), specifically the spherical-harmonics kernel for the Moon LP150Q 60×60 source, called 4× per RK4 step + 1× per environment update. Anything that reduces the per-call cost of gravitation compounds at 5× per step.

Top-3 candidates by estimated ROI (full table in §4):

  1. C1+C2 (flat-storage SH coefficients & helpers) — replace Vec<Vec<f64>> with flat arrays. Removes ≥1 pointer-deref per inner-loop access. Estimated 10-20% kernel speedup ⇒ 8-16% of step time (since gravity is ~80% of step). Bit-identical (same f64 ops, same order).
  2. C5 (ephemeris per-step Epoch reuse): Epoch::from_tdb_seconds + Epoch::to_time_scale ≈ 2-3% of step time each, called 3× per step with the same tdb_jd input. Cache once per step_internal. Estimated 3-5% of step time. Bit-identical.
  3. C3 (hoist vector3_transform from kernel to caller) — the inertial→pfix matrix-vector product runs once per gravitation call (5× per step, 1 source). Hoist once per step. Estimated 2-4% of step time. Bit-identical.

If C1+C2+C3+C5 land cumulatively, that's ~15-25% step-time reduction — a meaningful start but not yet a 2× speedup. Reaching the 8.5 min target requires further candidates from §4 or revisiting the bit-identity policy (§5).
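The ROI figures in this summary follow simple proportional arithmetic. The sketch below spells it out, assuming the run is dominated by stepping so that a fractional step-time saving maps directly onto wall-clock; the function names are illustrative and not part of the toolkit.

```rust
/// Fraction of total step time saved when a component that accounts for
/// `share` of the step becomes `reduction` cheaper (both fractions in 0..=1).
pub fn step_time_saved(share: f64, reduction: f64) -> f64 {
    share * reduction
}

/// Wall-clock after removing `saved` (a fraction) of the per-step cost.
/// ASSUMPTION: wall-clock scales linearly with per-step cost.
pub fn projected_minutes(baseline_min: f64, saved: f64) -> f64 {
    baseline_min * (1.0 - saved)
}

/// Fraction of step time that has to go to reach `target_min`.
pub fn required_saving(baseline_min: f64, target_min: f64) -> f64 {
    1.0 - target_min / baseline_min
}
```

With gravity at ~80% of the step, a 10-20% kernel speedup gives step_time_saved(0.8, 0.10) = 8% to step_time_saved(0.8, 0.20) = 16%. A cumulative 15-25% saving projects 17 min down to about 14.5-12.75 min, while required_saving(17.0, 8.5) is 50%, which is why the top candidates alone do not reach the target.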

2. Baseline measurements (Phase A)

A1 — perf-runner phase_timing baseline

Run: cargo xtask perf-baseline --steps 100000 --warmup 1000 --repeat 5 --phase-timing --output target/perf/baseline.json

mean_secs       8.326    stdev_secs 0.279   (5 cold-setup repeats)
per_step_us     83.26
max_rss_mb      43.87

phase_timings (us/step, summed over 500k step calls):
  time_advance        0.164  ( 0.20%)
  ephemeris          15.088  (18.13%)   ← #2 hot phase
  mass_recompute      0.033  ( 0.04%)
  integ_origins_pre   0.056  ( 0.07%)
  kinematic_pre       0.039  ( 0.05%)
  environment        13.629  (16.37%)   ← #3 hot phase
  interactions        0.093  ( 0.11%)
  integration        53.581  (64.36%)   ← #1 hot phase
  integ_origins_post  0.051  ( 0.06%)
  kinematic_post      0.048  ( 0.06%)
  derived             0.047  ( 0.06%)
  detached_subtrees   0.000  ( 0.00%)

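The three hot phases sum to 98.86% of measured step time. The nine JEOD orchestration phases listed above together account for 0.65%, and the remaining ~0.5% is not attributed to any phase (presumably timer and loop overhead). None of these is an optimization target.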

A2 — Criterion micro-benches (full-sample)

Run: cargo bench -p astrodyn_gravity --bench accumulate -p astrodyn_dynamics --bench integration -p astrodyn_runner --bench step

| Bench | Per-call |
| --- | --- |
| accumulate_gravity/d4 (Earth GGM05C @ 4×4) | ~65 ns |
| accumulate_gravity/d20 (Earth GGM05C @ 20×20) | 1.08 µs |
| accumulate_gravity/d60 (Moon LP150Q @ 60×60, production hot path) | 8.51 µs |
| rk4_sixdof_step/trivial_accel (RK4 plumbing only) | 43 ns |
| rk4_sixdof_step/degree60_accel (RK4 + d60 gravity) | 27 µs |
| simulation_step/earth_moon_clem_dt32hz (full step) | 57.88 µs |

Cross-checks:

  • d60 × 4 RK4 stages = 8.51 × 4 = 34.0 µs vs. degree60_accel = 27 µs, so the RK4 bench comes in ~7 µs under the naive product. Overheads cannot explain a deficit: the trivial_accel cost (43 ns) and the closure-captured RefCell<GottliebScratch> borrow (~1 ns × 4 = 4 ns) only add time, and both are negligible. The difference is attributed to LLVM cross-stage CSE between the 4 sub-steps.
  • simulation_step (57.88 µs) is 30% below the perf-runner per_step_us (83 µs) because the bench keeps the simulation hot in cache after 1000 warmup steps, whereas the perf-runner re-cold-starts between repeats and accumulates cache-miss + page-fault transients. The perf-runner number is the reliable production-cost estimate; the bench is the reliable per-iteration optimization tracker.

A3 — pprof flamegraphs

The release-with-debug profile's debug = "line-tables-only" was insufficient to symbolicate the samply trace without --unstable-presymbolicate. Criterion's pprof integration with PProfProfiler did produce symbolicated SVGs at:

  • target/criterion/simulation_step/earth_moon_clem_dt32hz/profile/flamegraph.svg
  • target/criterion/accumulate_gravity/{d4,d20,d60}/profile/flamegraph.svg
  • target/criterion/rk4_sixdof_step/{trivial_accel,degree60_accel}/profile/flamegraph.svg

The microbench flamegraphs (accumulate_gravity and rk4_sixdof_step) collapsed to bencher-harness frames — LTO inlined the kernel into the closure, so pprof can't see inside it without disabling LTO. The simulation_step flamegraph does resolve symbols at the step level because step_internal is too large to inline.

A4 — tier3 wall-clock baseline

Two runs captured during PR #462 verification:

  • Run 1 (pre-refactor, single-process): 1554 s = 25.9 min.
  • Run 2 (post-refactor, CPU-contended with parallel parity run): 2321 s = 38.7 min.

Run 1 is the reliable single-process baseline on this dev host. The CI runner profile is ~17 min per the issue.

Note on CI runtime-budget job (#447 PR-4): the 1500 s budget set in PR #462 assumed the issue's 17 min (1020 s) number. The WSL2 dev host takes ~26 min (1554 s) single-process, which already exceeds that budget. If the CI runner profile drifts toward WSL2-like cost, the budget should be raised rather than papered over.

3. Per-phase drilldown (Phase B)

Source: parsing <title> elements from the simulation_step flamegraph (the only flamegraph that resolves below the bencher harness). Inclusive-sample percentages (a sample counts for every frame in its stack, so percentages add to > 100%).
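As an illustration of the extraction step, the following std-only sketch pulls symbol names and inclusive percentages out of flamegraph SVG text. It assumes each frame carries a title of the form `symbol (N samples, P%)`, which is the layout inferno-style flamegraphs commonly use; the type and function names are hypothetical, and the parsing would need adjusting if the SVG produced by the pprof integration is laid out differently.

```rust
/// One flamegraph frame: symbol name and inclusive sample percentage.
#[derive(Debug, Clone, PartialEq)]
pub struct Frame {
    pub name: String,
    pub percent: f64,
}

/// Extracts every `<title>...</title>` element that parses as a frame.
/// ASSUMPTION: titles look like `symbol (N samples, P%)`.
/// Symbol names stay XML-escaped (e.g. `&lt;`), exactly as in the SVG.
pub fn parse_titles(svg: &str) -> Vec<Frame> {
    const OPEN: &str = "<title>";
    const CLOSE: &str = "</title>";
    let mut frames = Vec::new();
    let mut rest = svg;
    while let Some(start) = rest.find(OPEN) {
        let after = &rest[start + OPEN.len()..];
        let Some(end) = after.find(CLOSE) else { break };
        if let Some(frame) = parse_title(&after[..end]) {
            frames.push(frame);
        }
        rest = &after[end + CLOSE.len()..];
    }
    frames
}

fn parse_title(title: &str) -> Option<Frame> {
    let title = title.trim();
    // Symbol names may contain parentheses, so split on the last '('.
    let open = title.rfind('(')?;
    let name = title[..open].trim_end().to_string();
    let inner = title[open + 1..].strip_suffix(')')?;
    // Sample counts may contain thousands separators ("1,234"), so split
    // on the last ", " rather than the first comma.
    let percent_text = inner.rsplit(", ").next()?.strip_suffix('%')?;
    let percent = percent_text.trim().parse::<f64>().ok()?;
    Some(Frame { name, percent })
}

/// Frames at or above `min_percent`, hottest first.
pub fn hot_frames(svg: &str, min_percent: f64) -> Vec<Frame> {
    let mut frames: Vec<Frame> = parse_titles(svg)
        .into_iter()
        .filter(|f| f.percent >= min_percent)
        .collect();
    frames.sort_by(|a, b| b.percent.total_cmp(&a.percent));
    frames
}
```

Reading the SVG with std::fs::read_to_string and passing it to hot_frames with a small threshold would yield a list in the same shape as the listings in B1 and B3. Because the percentages are inclusive, they should not be summed.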

B1 — Integration phase (64.36% of step time)

  35.60%  run_integration                       (RK4 driver)
  35.26%  integrate_body                        (per-body RK4 driver)
  35.10%  gravitation                           (gravity dispatch — both env + integration)
  35.10%  GravityControl::evaluate_inner        (single-source eval site)

Reading: virtually all of run_integration is gravitation (35.10 of 35.60 percentage points). The RK4 plumbing itself is about 0.5% of step time (consistent with the bench showing trivial_accel at 43 ns vs degree60_accel at 27 µs).

The kernel itself (calc_nonspherical_with_scratch) does not appear as a separate frame because LTO inlined it into gravitation. The attribution lands on gravitation directly.

B2 — Environment phase (16.37% of step time)

The environment phase runs update_environment which calls gravitation once per source before RK4. The hot symbols are identical to B1 — same kernel, same dispatch. Optimizations on the kernel improve both phases.

B3 — Ephemeris phase (18.13% of step time)

Three queries per step in the Earth–Moon scenario:

  1. eph.get_state_typed(Earth, Moon, tdb_jd) — Earth position.
  2. eph.get_state_typed(Sun, Moon, tdb_jd) — Sun position.
  3. eph.get_body_rotation(Moon, tdb_jd) — Moon BPC libration matrix.

Hot symbols from the flamegraph (inclusive samples within step_internal):

   8.77%  Ephemeris::get_state_typed
   5.63%  anise::translate_to_parent           (Chebyshev evaluator entry)
   3.31%  Type2ChebyshevSet::evaluate          (the actual polynomial math)
   2.98%  ephemeris_path_to_root               (frame-chain walk)
   2.81%  spk_summary_at_epoch                 (SPK segment lookup)
   2.48%  hifitime::Epoch::to_time_scale       (time-scale conversion, inside translate)
   0.83%  Ephemeris::get_body_rotation         (Moon BPC; smaller than translate)

Insights:

  • Epoch::to_time_scale is 2.48% and called from inside anise's translate path. Since all three of our per-step queries use the same tdb_jd, the Epoch ctor + time-scale conversion is repeated redundantly 3× per step. Hoisting saves the duplicated work.
  • spk_summary_at_epoch and ephemeris_path_to_root are anise internals doing per-call SPK segment lookup + frame-chain walk. Both inputs (segment + chain) are time-invariant — the same (target, observer) pair traces the same chain every step. Caching the resolved frame chain across steps could eliminate both.
  • The Moon BPC rotation is only 0.83% — cheaper than the issue's "~16%" total ephemeris breakdown suggested. The bulk of the ephemeris phase is the two DE421 translate calls.
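The hoisting idea from the first insight can be sketched with stand-in types. Nothing below is the hifitime or anise API: StepEpoch, CountingEphemeris, and the placeholder query are hypothetical, and the conversion counter exists only to make the saved work visible.

```rust
use std::cell::Cell;

/// Stand-in for a time object whose construction is not free.
/// HYPOTHETICAL: this is not hifitime's Epoch.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct StepEpoch {
    pub tdb_seconds: f64,
}

/// Stand-in ephemeris that counts how often an epoch is built.
pub struct CountingEphemeris {
    conversions: Cell<u32>,
}

impl CountingEphemeris {
    pub fn new() -> Self {
        Self { conversions: Cell::new(0) }
    }

    pub fn conversions(&self) -> u32 {
        self.conversions.get()
    }

    /// Models the repeated cost: Julian date -> epoch object.
    pub fn epoch_from_tdb_jd(&self, tdb_jd: f64) -> StepEpoch {
        self.conversions.set(self.conversions.get() + 1);
        StepEpoch { tdb_seconds: (tdb_jd - 2_451_545.0) * 86_400.0 }
    }

    /// Placeholder query: any pure function of (body, epoch) would do.
    pub fn query_at(&self, body: u32, epoch: StepEpoch) -> f64 {
        epoch.tdb_seconds * 1.0e-3 + f64::from(body)
    }

    /// Current shape: every query rebuilds the epoch from the same tdb_jd.
    pub fn query(&self, body: u32, tdb_jd: f64) -> f64 {
        let epoch = self.epoch_from_tdb_jd(tdb_jd);
        self.query_at(body, epoch)
    }
}

/// Before: three queries, three conversions.
pub fn update_uncached(eph: &CountingEphemeris, tdb_jd: f64) -> [f64; 3] {
    [eph.query(0, tdb_jd), eph.query(1, tdb_jd), eph.query(2, tdb_jd)]
}

/// After: one conversion at the top, shared by all three queries.
pub fn update_cached(eph: &CountingEphemeris, tdb_jd: f64) -> [f64; 3] {
    let epoch = eph.epoch_from_tdb_jd(tdb_jd);
    [eph.query_at(0, epoch), eph.query_at(1, epoch), eph.query_at(2, epoch)]
}
```

Since the cached path feeds each query the same epoch value it would have built itself, the outputs stay bit-identical while the conversion count drops from three to one. Whether the real conversion inside anise's translate path can be bypassed this way depends on its API surface, which this sketch does not model.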

4. Candidate inventory (Phase C)

ROI estimates come from % of step time saved — derived from §3 percentages and conservative assumptions about how much of each hot symbol's cost the candidate eliminates. Each candidate must pass the §5 bit-identity gate before merge.

| ID | Candidate | Mechanism | Est. ROI | Cost | Risk |
| --- | --- | --- | --- | --- | --- |
| C1 | Flat-storage SH coefficients | Replace SphericalHarmonicsData.cnm / snm: Vec<Vec<f64>> (jagged, 60+1 rows) with flat Vec<f64> + (n,m) → idx helper. Inner-loop access drops one pointer-deref. | 8-16% | M (~200 LOC, all in one crate) | LOW: same f64 ops, same order |
| C2 | Flat-storage Gottlieb helpers | Same as C1 for xi / eta / zeta / upsilon: Vec<Vec<f64>>. Inner loop accesses these alongside coefficients. | 3-5% (bundle with C1) | M | LOW |
| C3 | Hoist vector3_transform from kernel | vector3_transform(t_parent_this, pos) runs once per gravitation entry. Move to the caller (accumulate.rs) so it runs once per step per source, not 5×. | 2-4% | M (cross-crate API change) | LOW |
| C4 | Inline annotations on hot helpers | Audit pnm() / pnm_mut() and other small helpers; add #[inline(always)] if codegen evidence shows missed inlining under LTO. | 0-2% (likely noise) | S | LOW |
| C5 | Per-step Epoch cache | Build Epoch::from_tdb_seconds(tdb_s) once at the top of update_ephemeris and pass it to the 3 query calls. Saves the Epoch ctor (line 105 + 158) and to_time_scale inside anise translate (2.48% of step time, inclusive across all queries per §3). | 3-5% | S (~30 LOC) | LOW |
| C6 | Anise frame-chain cache across steps | ephemeris_path_to_root (2.98%) + spk_summary_at_epoch (2.81%) re-resolve the same (target, observer) chain every step. Cache the resolved chain on Ephemeris keyed by (target, observer). Anise may already have this internally; verify before adding ours. | 3-5% (overlapping with C5) | M (depends on anise's internal API surface) | MEDIUM: need to confirm anise's chain resolution is deterministic per epoch |
| C7 | Bounds-check elimination audit | The kernel's pnm_flat[offset + m] indexed accesses may not be eliding bounds checks under LTO. Inspect codegen via cargo asm on calc_nonspherical_with_scratch. If bounds checks fire, use get_unchecked (locally #[allow(clippy::indexing_slicing)]) with safety comments. | 1-3% | S | LOW: read-only access into pre-sized buffers |
| C8 | RK4 sub-stage gravity reuse | RISKY. Inside RK4, the 4 sub-stage gravity evals use linearly-extrapolated source positions (the ephemeris already approximates this). For very small dt, the SH coefficient products (e.g. c_ii × cos_mlambda) might be reusable across sub-stages. Bit-identity breaks unless we can prove the f64 round-off is < 1 ULP at the integration accumulator. | (potentially 10-20%) | L | HIGH: likely breaks bit-identity; flag for policy discussion |
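For C1 and C2, the flat-storage idea can be sketched as follows. The sketch assumes the jagged rows are lower-triangular (row n holds m = 0..=n); if the real SphericalHarmonicsData rows are padded to full width, the index helper would be n * (degree + 1) + m instead. All names and the filler coefficients are illustrative.

```rust
/// Deterministic filler values standing in for real coefficients.
fn coeff(n: usize, m: usize) -> f64 {
    1.0 / ((n * 61 + m + 1) as f64)
}

/// Jagged lower-triangular storage: row n holds entries m = 0..=n.
pub fn jagged(degree: usize) -> Vec<Vec<f64>> {
    (0..=degree)
        .map(|n| (0..=n).map(|m| coeff(n, m)).collect())
        .collect()
}

/// Flat index of (n, m) in row-major lower-triangular order.
#[inline]
pub fn idx(n: usize, m: usize) -> usize {
    debug_assert!(m <= n);
    n * (n + 1) / 2 + m
}

/// Concatenates the rows into one contiguous buffer.
pub fn flatten(rows: &[Vec<f64>]) -> Vec<f64> {
    rows.iter().flatten().copied().collect()
}

/// Toy inner loop over jagged storage: two dereferences per access.
pub fn weighted_sum_jagged(c: &[Vec<f64>]) -> f64 {
    let mut acc = 0.0;
    for n in 0..c.len() {
        for m in 0..=n {
            acc += c[n][m] * (m as f64 + 1.0);
        }
    }
    acc
}

/// Same loop over flat storage: same operands, same order, one buffer.
pub fn weighted_sum_flat(c: &[f64], degree: usize) -> f64 {
    let mut acc = 0.0;
    for n in 0..=degree {
        for m in 0..=n {
            acc += c[idx(n, m)] * (m as f64 + 1.0);
        }
    }
    acc
}
```

Because only the address computation changes and every multiply and add happens on the same values in the same sequence, the two sums agree bit for bit, which is the property the LOW risk rating relies on. For degree 60 the flat buffer holds 61 × 62 / 2 = 1891 entries.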

Cumulative bit-identical estimate (C1+C2+C3+C5+C6+C7): ~20-35% step-time reduction. Brings tier3 from 17 min → roughly 11-13.6 min (17 × 0.65 to 17 × 0.80). Still short of the 8.5 min target.

The remaining gap to a 2× speedup requires either:

  • Successful C8 (RK4 sub-stage reuse) — but bit-identity-risky.
  • Algorithmic change to the Gottlieb recursion itself — e.g. precomputing the trigonometric cos_mlambda / sin_mlambda rows for a fixed-axis rotation, or using a tabulated normalized associated Legendre function. Out of scope for this report; needs numerical-methods research.
  • Reducing the SH degree from 60 → some lower number (would change physics, not in scope).

5. Bit-identity verification gate (Phase D)

Before any C1..C7 candidate can land:

  1. cargo nextest run --workspace -E 'test(bevy_parity)' → 233/233 passing. Must remain bit-identical between astrodyn_runner and astrodyn_bevy paths.
  2. cargo nextest run -E 'test(tier3_simulation_earth_moon_clem)' → passes under existing tolerances [0.832, 0.331, 0.972] m with no widening.
  3. diff target/tier3_crossval/tier3_earth_moon_clem.json{,.pre-change} → exit 0 (byte-identical max position errors).
  4. scripts/check_no_bypass_deps.sh → OK.
  5. cargo bench -p astrodyn_gravity --bench accumulate -p astrodyn_dynamics --bench integration -p astrodyn_runner --bench step → measurable speedup with stdev separation from baseline.

C8 fails #3 by hypothesis and is excluded until/unless the bit-identity policy is amended.
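Gate 3 is a plain byte comparison. For contexts where shelling out to diff is inconvenient (for example inside an xtask), a std-only equivalent might look like the sketch below; the helper names are hypothetical and not part of the existing toolkit.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Offset of the first differing byte, or None if the inputs are identical.
/// A length mismatch is reported at the end of the shorter input.
pub fn first_difference(a: &[u8], b: &[u8]) -> Option<usize> {
    if let Some(i) = a.iter().zip(b).position(|(x, y)| x != y) {
        return Some(i);
    }
    if a.len() != b.len() {
        Some(a.len().min(b.len()))
    } else {
        None
    }
}

/// True when both files have exactly the same bytes.
pub fn files_byte_identical(before: &Path, after: &Path) -> io::Result<bool> {
    let a = fs::read(before)?;
    let b = fs::read(after)?;
    Ok(first_difference(&a, &b).is_none())
}
```

Comparing raw bytes rather than parsed numbers is deliberate: a JSON round-trip through f64 could mask a formatting change, whereas the gate as written requires the serialized output itself to be unchanged.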

6. Sequencing recommendation

Order to attack candidates, by ROI / cost / dependency:

  1. C5 (per-step Epoch cache) — small, isolated change in crates/astrodyn_runner/src/simulation/step/ephemeris.rs. Lowest blast radius. Probably 3-5% in one PR. Land first to validate the bit-identity gate is easy to satisfy.
  2. C1 + C2 (flat-storage refactor, bundled) — same file (crates/astrodyn_gravity/src/spherical_harmonics_gravity_source.rs), same migration. Bundling avoids two rounds of bench/parity-test noise. ~8-20% — the biggest single PR in this set.
  3. C3 (hoist vector3_transform) — cross-crate API change; wait until after C1+C2 so the bench numbers are stable.
  4. C6 (anise frame-chain cache) — depends on anise API discovery; may be obsoleted by an anise upstream fix.
  5. C7 (bounds-check audit) — opportunistic; do alongside any of the above when reading the kernel.
  6. C4 (inline annotations) — opportunistic; combine with C7.

After 1-5 land, re-baseline and decide whether to invest in C8 (with policy discussion) or new candidates surfacing from the updated flamegraphs.
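For the C7 audit, one safe alternative worth trying before reaching for get_unchecked is to take the row slice once and iterate over it. This is a different technique from the one named in the candidate table, shown here only as an option; the function names and the toy dot product are illustrative, and whether it actually removes checks in calc_nonspherical_with_scratch has to be confirmed in the generated assembly.

```rust
/// Indexed form: each `pnm_flat[offset + m]` access may carry its own
/// bounds check if the optimizer cannot prove the index is in range.
pub fn row_dot_indexed(pnm_flat: &[f64], offset: usize, weights: &[f64]) -> f64 {
    let mut acc = 0.0;
    for m in 0..weights.len() {
        acc += pnm_flat[offset + m] * weights[m];
    }
    acc
}

/// Sliced form: a single range check up front, then iteration with no
/// indexing inside the loop. Operands and operation order are unchanged.
pub fn row_dot_sliced(pnm_flat: &[f64], offset: usize, weights: &[f64]) -> f64 {
    let row = &pnm_flat[offset..offset + weights.len()];
    let mut acc = 0.0;
    for (p, w) in row.iter().zip(weights) {
        acc += p * w;
    }
    acc
}
```

Both forms multiply and accumulate the same values in the same order, so the results match bit for bit, and the sliced form keeps the code free of unsafe blocks.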

7. Risk inventory

  • PR #462 not merged yet. This page references the toolkit that's on 447-perf-toolkit. The methodology and numbers stand regardless of merge timing; the file paths are pre-merge.
  • Anise upstream churn. C5/C6 touch anise call sites; an anise version bump may obsolete our caching. Pin anise's version in Cargo.lock for the duration of the C5/C6 PRs.
  • WSL2 vs CI variance. Numbers here are from a WSL2 dev host. CI runner numbers will differ; treat the percentage breakdowns as the portable metric, not the absolute µs/step values.
  • C8 policy unlock. If the team decides the 2× target is worth relaxing bit-identity (e.g. accepting tolerance widening from [0.832, 0.331, 0.972] m to [1.5, 1.5, 1.5] m), C8 becomes feasible and the candidate inventory grows. That's a CLAUDE.md amendment + maintainer call.

8. Follow-on issues

Filed against the top-3 candidates by ROI / cost:

  • #464 (C5) — Cache Epoch once per update_ephemeris instead of 3×. Estimated 3-5% step time, S cost, LOW risk.
  • #465 (C1 + C2) — Flat-storage refactor for SH coefficients + Gottlieb helpers. Estimated 8-20% step time, M cost, LOW risk. Biggest single PR.
  • #466 (C3) — Hoist vector3_transform from gravity kernel to caller. Estimated 2-4% step time, M cost (cross-crate API change), LOW risk.

Each issue cites this page, names the verification gates, and lists the file:line loci.

Appendix A — How to reproduce

# Phase A baselines (writes target/perf/baseline.json)
cargo xtask perf-baseline --steps 100000 --warmup 1000 --repeat 5 \
    --phase-timing --output target/perf/baseline.json

# Phase A full criterion run (writes target/criterion/<bench>/)
cargo bench -p astrodyn_gravity --bench accumulate \
            -p astrodyn_dynamics --bench integration \
            -p astrodyn_runner --bench step

# Phase A flamegraphs via pprof
# (writes target/criterion/<bench>/profile/flamegraph.svg)
cargo bench -p astrodyn_gravity --bench accumulate \
            -p astrodyn_runner --bench step \
            -p astrodyn_dynamics --bench integration -- --profile-time 10

# Phase A tier3 wall-clock baseline
cargo nextest run --release -p astrodyn_verif_jeod \
    --test tier3_sim_earth_moon \
    -E 'test(tier3_simulation_earth_moon_clem)'

Investigation captured against PR #462 commit 2ca4631 (447-perf-toolkit branch).
