Earth Moon Performance Investigation

Earth–Moon Clementine performance investigation

Status: First-cut report from the #447 performance toolkit (introduced in PR #462). Captures Phase A (baselines) + Phase B (per-phase drilldown) + Phase C (candidate inventory). Each named candidate (C1..Cn) is a tracking hook for a follow-on optimization PR.

Target: halve the tier3 earth_moon_clem runtime — from the current ~17 min wall-clock (~26 min on WSL2 dev hosts) down to ≤8.5 min. Per-step is currently 83 µs/step; target ≤40 µs/step.

Bit-identity invariant. Every candidate below must preserve byte-equality of the tier3 cross-val JSON and pass all 53 bevy_parity_* tests. This rules out FMA contraction, operation reordering, target-cpu=native, and Rayon inside step(). Candidates that would require relaxing those constraints are flagged "RISKY" and are not pursued without a separate policy discussion.
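Operation reordering is excluded because f64 addition is not associative: regrouping the same operands can flip the last bit of the result, which is enough to break byte-equality even when a tolerance check would still pass. A minimal illustration follows; the operands are chosen for demonstration and are not taken from the simulation.

```rust
/// Sum three values left to right: (a + b) + c.
fn sum_left(a: f64, b: f64, c: f64) -> f64 {
    (a + b) + c
}

/// Same operands, regrouped: a + (b + c).
fn sum_right(a: f64, b: f64, c: f64) -> f64 {
    a + (b + c)
}

/// Bit-identity is stricter than a tolerance check: compare raw IEEE-754 bits.
fn bit_identical(x: f64, y: f64) -> bool {
    x.to_bits() == y.to_bits()
}

fn main() {
    let (a, b, c) = (0.1_f64, 0.2_f64, 0.3_f64);
    let l = sum_left(a, b, c);
    let r = sum_right(a, b, c);
    println!("left  = {l:?} (bits {:#018x})", l.to_bits());
    println!("right = {r:?} (bits {:#018x})", r.to_bits());
    println!("bit-identical: {}", bit_identical(l, r));
}
```

The two sums agree to well within any physical tolerance yet differ in their bit patterns, which is exactly the kind of change the invariant forbids.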

1. Executive summary

Metric	Current	Target	Gap
tier3 wall-clock (CI runner)	~17 min	≤8.5 min	2×
`per_step_us` (perf-runner, 100k × 5)	83.26	≤40	2×
`integration` phase	53.58 µs/step (64.4%)	≤27	2×
`ephemeris` phase	15.09 µs/step (18.1%)	≤7.5	2×
`environment` phase	13.63 µs/step (16.4%)	≤7	2×

The hot path is gravity (combined 81% of step time across integration and environment phases), specifically the spherical-harmonics kernel for the Moon LP150Q 60×60 source, called 4× per RK4 step + 1× per environment update. Anything that reduces the per-call cost of gravitation compounds at 5× per step.

Top-3 candidates by estimated ROI (full table in §4):

C1+C2 (flat-storage SH coefficients & helpers): replace Vec<Vec<f64>> with flat arrays. Removes ≥1 pointer-deref per inner-loop access. Estimated 10-20% kernel speedup ⇒ 8-16% of step time (since gravity is ~80% of step). Bit-identical (same f64 ops, same order).
C5 (ephemeris per-step Epoch reuse): Epoch::from_tdb_seconds + Epoch::to_time_scale ≈ 2-3% of step time each, called 3× per step with the same tdb_jd input. Cache once per step_internal. Estimated 3-5% of step time. Bit-identical.
C3 (hoist vector3_transform from kernel to caller): the inertial→pfix matrix-vector product runs once per gravitation call (5× per step, 1 source). Hoist once per step. Estimated 2-4% of step time. Bit-identical.

If C1+C2+C3+C5 land cumulatively, that's ~15-25% step-time reduction — a meaningful start but not yet a 2× speedup. Reaching the 8.5 min target requires further candidates from §4 or revisiting the bit-identity policy (§5).
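The gap can be quantified with Amdahl's law. The sketch below is a rough estimate only: it takes the 81% gravity share quoted above and assumes, hypothetically, that the remaining 19% of the step is left untouched. The function names are illustrative and not part of the toolkit.

```rust
/// Fraction of step time saved when a component occupying `share` of the step
/// has its own runtime reduced by `reduction` (both in 0.0..=1.0).
fn step_time_saved(share: f64, reduction: f64) -> f64 {
    share * reduction
}

/// Speedup factor the component needs so that the whole step speeds up by
/// `target`, with everything else unchanged. Returns None when the untouched
/// remainder alone already uses up the target budget.
fn required_component_speedup(share: f64, target: f64) -> Option<f64> {
    let budget = 1.0 / target; // allowed fraction of the old step time
    let remainder = 1.0 - share; // part of the step that is not optimized
    if budget <= remainder {
        return None;
    }
    Some(share / (budget - remainder))
}

fn main() {
    let gravity_share = 0.81; // combined gravity share quoted in this report
    for reduction in [0.10, 0.20] {
        println!(
            "kernel time -{:.0}% => step time -{:.1}%",
            reduction * 100.0,
            step_time_saved(gravity_share, reduction) * 100.0
        );
    }
    match required_component_speedup(gravity_share, 2.0) {
        Some(s) => println!("gravity must be {s:.2}x faster for a 2x step"),
        None => println!("2x is unreachable by optimizing gravity alone"),
    }
}
```

Under those assumptions the gravity path alone would have to run about 2.6× faster to reach a 2× step, which is why a 10-20% kernel gain cannot close the gap by itself.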

2. Baseline measurements (Phase A)

A1 — perf-runner phase_timing baseline

Run: cargo xtask perf-baseline --steps 100000 --warmup 1000 --repeat 5 --phase-timing --output target/perf/baseline.json

mean_secs       8.326    stdev_secs 0.279   (5 cold-setup repeats)
per_step_us     83.26
max_rss_mb      43.87

phase_timings (us/step, summed over 500k step calls):
  time_advance        0.164  ( 0.20%)
  ephemeris          15.088  (18.13%)   ← #2 hot phase
  mass_recompute      0.033  ( 0.04%)
  integ_origins_pre   0.056  ( 0.07%)
  kinematic_pre       0.039  ( 0.05%)
  environment        13.629  (16.37%)   ← #3 hot phase
  interactions        0.093  ( 0.11%)
  integration        53.581  (64.36%)   ← #1 hot phase
  integ_origins_post  0.051  ( 0.06%)
  kinematic_post      0.048  ( 0.06%)
  derived             0.047  ( 0.06%)
  detached_subtrees   0.000  ( 0.00%)

The three hot phases sum to 98.86% of measured step time. The nine JEOD orchestration phases account for about 0.65% combined, and the rest of the remaining 1.14% is time not attributed to any listed phase. None of it is an optimization target.

A2 — Criterion micro-benches (full-sample)

Run: cargo bench -p astrodyn_gravity --bench accumulate -p astrodyn_dynamics --bench integration -p astrodyn_runner --bench step

Bench	Per-call
`accumulate_gravity/d4` (Earth GGM05C @ 4×4)	~65 ns
`accumulate_gravity/d20` (Earth GGM05C @ 20×20)	1.08 µs
`accumulate_gravity/d60` (Moon LP150Q @ 60×60, production hot path)	8.51 µs
`rk4_sixdof_step/trivial_accel` (RK4 plumbing only)	43 ns
`rk4_sixdof_step/degree60_accel` (RK4 + d60 gravity)	27 µs
`simulation_step/earth_moon_clem_dt32hz` (full step)	57.88 µs

Cross-checks:

d60 × 4 RK4 stages = 8.51 × 4 = 34.0 µs vs. degree60_accel = 27 µs, so the bench comes in about 7 µs under the naive product. Plumbing cannot explain the gap, because it only adds cost: trivial_accel is 43 ns, and the closure-captured RefCell<GottliebScratch> adds one borrow per call (~1 ns × 4 = 4 ns). The shortfall is attributed to LLVM cross-stage CSE between the 4 sub-steps.
simulation_step (57.88 µs) is 30% below the perf-runner per_step_us (83 µs) because the bench keeps the simulation hot in cache after 1000 warmup steps, whereas the perf-runner re-cold-starts between repeats and accumulates cache-miss + page-fault transients. The perf-runner number is the reliable production-cost estimate; the bench is the reliable per-iteration optimization tracker.

A3 — pprof flamegraphs

The release-with-debug profile's debug = "line-tables-only" was insufficient to symbolicate the samply trace without --unstable-presymbolicate. Criterion's pprof integration with PProfProfiler did produce symbolicated SVGs at:

target/criterion/simulation_step/earth_moon_clem_dt32hz/profile/flamegraph.svg
target/criterion/accumulate_gravity/{d4,d20,d60}/profile/flamegraph.svg
target/criterion/rk4_sixdof_step/{trivial_accel,degree60_accel}/profile/flamegraph.svg

The microbench flamegraphs (accumulate_gravity and rk4_sixdof_step) collapsed to bencher-harness frames — LTO inlined the kernel into the closure, so pprof can't see inside it without disabling LTO. The simulation_step flamegraph does resolve symbols at the step level because step_internal is too large to inline.

A4 — tier3 wall-clock baseline

Two runs captured during PR #462 verification:

Run 1 (pre-refactor, single-process): 1554 s = 25.9 min.
Run 2 (post-refactor, CPU-contended with parallel parity run): 2321 s = 38.7 min.

Run 1 is the reliable single-process baseline on this dev host. The CI runner profile is ~17 min per the issue.

Note on CI runtime-budget job (#447 PR-4): the 1500 s budget set in PR #462 assumed the issue's 17 min number. WSL2 dev host produces ~26 min single-process. If the CI runner profile drifts toward WSL2-like cost, the budget should be raised — not papered over.

3. Per-phase drilldown (Phase B)

Source: parsing <title> elements from the simulation_step flamegraph (the only flamegraph that resolves below the bencher harness). Inclusive-sample percentages (a sample counts for every frame in its stack, so percentages add to > 100%).
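To make the inclusive-percentage convention concrete, here is a small sketch that counts samples the same way. The stacks are made-up illustrative data, not values taken from the flamegraph, and `inclusive_percent` is a hypothetical helper rather than part of the toolkit.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Inclusive percentage per frame: a sample is credited once to every
/// distinct frame that appears anywhere in its stack.
fn inclusive_percent(stacks: &[Vec<&str>]) -> BTreeMap<String, f64> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    for stack in stacks {
        // Deduplicate so a recursive frame is not counted twice per sample.
        let unique: BTreeSet<&str> = stack.iter().copied().collect();
        for frame in unique {
            *counts.entry(frame.to_string()).or_insert(0) += 1;
        }
    }
    let total = stacks.len() as f64;
    counts
        .into_iter()
        .map(|(frame, n)| (frame, 100.0 * n as f64 / total))
        .collect()
}

fn main() {
    // Hypothetical samples, outermost frame first.
    let stacks = vec![
        vec!["step_internal", "run_integration", "gravitation"],
        vec!["step_internal", "run_integration", "gravitation"],
        vec!["step_internal", "update_ephemeris"],
        vec!["step_internal"],
    ];
    let pct = inclusive_percent(&stacks);
    for (frame, p) in &pct {
        println!("{p:6.2}%  {frame}");
    }
    let sum: f64 = pct.values().sum();
    println!("sum of inclusive percentages = {sum:.2}%");
}
```

Because a parent frame is credited for every sample in its children, nested frames such as a driver and the kernel it calls report nearly the same percentage, and the column total exceeds 100%.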

B1 — Integration phase (64.36% of step time)

  35.60%  run_integration                       (RK4 driver)
  35.26%  integrate_body                        (per-body RK4 driver)
  35.10%  gravitation                           (gravity dispatch — both env + integration)
  35.10%  GravityControl::evaluate_inner        (single-source eval site)

Reading: virtually all of run_integration is gravitation. The RK4 plumbing itself is <0.5% of step time (consistent with the bench showing trivial_accel at 43 ns vs degree60_accel at 27 µs).

The kernel itself (calc_nonspherical_with_scratch) does not appear as a separate frame because LTO inlined it into gravitation. The attribution lands on gravitation directly.

B2 — Environment phase (16.37% of step time)

The environment phase runs update_environment which calls gravitation once per source before RK4. The hot symbols are identical to B1 — same kernel, same dispatch. Optimizations on the kernel improve both phases.

B3 — Ephemeris phase (18.13% of step time)

Three queries per step in the Earth–Moon scenario:

eph.get_state_typed(Earth, Moon, tdb_jd) — Earth position.
eph.get_state_typed(Sun, Moon, tdb_jd) — Sun position.
eph.get_body_rotation(Moon, tdb_jd) — Moon BPC libration matrix.

Hot symbols from the flamegraph (inclusive samples within step_internal):

   8.77%  Ephemeris::get_state_typed
   5.63%  anise::translate_to_parent           (Chebyshev evaluator entry)
   3.31%  Type2ChebyshevSet::evaluate          (the actual polynomial math)
   2.98%  ephemeris_path_to_root               (frame-chain walk)
   2.81%  spk_summary_at_epoch                 (SPK segment lookup)
   2.48%  hifitime::Epoch::to_time_scale       (time-scale conversion, inside translate)
   0.83%  Ephemeris::get_body_rotation         (Moon BPC; smaller than translate)

Insights:

Epoch::to_time_scale is 2.48% and called from inside anise's translate path. Since all three of our per-step queries use the same tdb_jd, the Epoch ctor + time-scale conversion is repeated redundantly 3× per step. Hoisting saves the duplicated work.
spk_summary_at_epoch and ephemeris_path_to_root are anise internals doing per-call SPK segment lookup + frame-chain walk. Both inputs (segment + chain) are time-invariant — the same (target, observer) pair traces the same chain every step. Caching the resolved frame chain across steps could eliminate both.
The Moon BPC rotation is only 0.83% — cheaper than the issue's "~16%" total ephemeris breakdown suggested. The bulk of the ephemeris phase is the two DE421 translate calls.
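The hoist described in the first insight could look roughly like the following. `Epoch`, `EphemerisStub`, and their methods are hypothetical stand-ins written for this sketch, not the hifitime or anise API; the point is only that the epoch is built once per step and borrowed by all three queries.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a time-scale-converted epoch (NOT hifitime's type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Epoch {
    tdb_seconds: f64,
}

/// Hypothetical stand-in for the ephemeris; counts epoch constructions.
struct EphemerisStub {
    epoch_builds: Cell<u32>,
}

impl EphemerisStub {
    fn new() -> Self {
        Self { epoch_builds: Cell::new(0) }
    }

    /// Models the per-call cost being hoisted (ctor + time-scale conversion).
    fn build_epoch(&self, tdb_seconds: f64) -> Epoch {
        self.epoch_builds.set(self.epoch_builds.get() + 1);
        Epoch { tdb_seconds }
    }

    /// Query that takes a pre-built epoch instead of a raw time.
    fn query(&self, epoch: &Epoch, scale: f64) -> f64 {
        epoch.tdb_seconds * scale // placeholder for the real lookup
    }
}

/// Before: each of the three queries rebuilds the same epoch.
fn update_uncached(eph: &EphemerisStub, tdb_seconds: f64) -> [f64; 3] {
    [1.0, 2.0, 3.0].map(|s| eph.query(&eph.build_epoch(tdb_seconds), s))
}

/// After: build once per step, pass by reference to all three queries.
fn update_cached(eph: &EphemerisStub, tdb_seconds: f64) -> [f64; 3] {
    let epoch = eph.build_epoch(tdb_seconds);
    [1.0, 2.0, 3.0].map(|s| eph.query(&epoch, s))
}

fn main() {
    let (a, b) = (EphemerisStub::new(), EphemerisStub::new());
    let before = update_uncached(&a, 1234.5);
    let after = update_cached(&b, 1234.5);
    assert_eq!(before, after); // same inputs, same outputs
    println!(
        "epoch builds: uncached = {}, cached = {}",
        a.epoch_builds.get(),
        b.epoch_builds.get()
    );
}
```

Since the cached path feeds the identical epoch value into each query, the outputs are unchanged; only the number of constructions drops from three to one.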

4. Candidate inventory (Phase C)

ROI estimates come from % of step time saved — derived from §3 percentages and conservative assumptions about how much of each hot symbol's cost the candidate eliminates. Each candidate must pass the §5 bit-identity gate before merge.

ID	Candidate	Mechanism	Est. ROI	Cost	Risk
C1	Flat-storage SH coefficients	Replace `SphericalHarmonicsData.cnm / snm: Vec<Vec<f64>>` (jagged, 60+1 rows) with flat `Vec<f64>` + `(n,m) → idx` helper. Inner-loop access drops one pointer-deref.	8-16%	M (~200 LOC, all in one crate)	LOW — same f64 ops, same order
C2	Flat-storage Gottlieb helpers	Same as C1 for `xi / eta / zeta / upsilon: Vec<Vec<f64>>`. Inner loop accesses these alongside coefficients.	3-5% (bundle with C1)	M	LOW
C3	Hoist `vector3_transform` from kernel	`vector3_transform(t_parent_this, pos)` runs once per `gravitation` entry. Move to the caller (`accumulate.rs`) so it runs once per step per source, not 5×.	2-4%	M (cross-crate API change)	LOW
C4	Inline annotations on hot helpers	Audit `pnm()` / `pnm_mut()` and other small helpers; add `#[inline(always)]` if codegen evidence shows missed inlining under LTO.	0-2% (likely noise)	S	LOW
C5	Per-step `Epoch` cache	Build `Epoch::from_tdb_seconds(tdb_s)` once at the top of `update_ephemeris` and pass it to the 3 query calls. Saves the repeated Epoch ctor (line 105 + 158) and `to_time_scale` inside anise translate (2.48% of step time, inclusive across all queries).	3-5%	S (~30 LOC)	LOW
C6	Anise frame-chain cache across steps	`ephemeris_path_to_root` (2.98%) + `spk_summary_at_epoch` (2.81%) re-resolve the same `(target, observer)` chain every step. Cache the resolved chain on `Ephemeris` keyed by `(target, observer)`. Anise may already have this internally — verify before adding ours.	3-5% (overlapping with C5)	M (depends on anise's internal API surface)	MEDIUM — need to confirm anise's chain resolution is deterministic per epoch
C7	Bounds-check elimination audit	The kernel's `pnm_flat[offset + m]` indexed accesses may not be eliding bounds checks under LTO. Inspect codegen via `cargo asm` on `calc_nonspherical_with_scratch`. If bounds checks fire, use `get_unchecked` (locally `#[allow(clippy::indexing_slicing)]`) with safety comments.	1-3%	S	LOW — read-only access into pre-sized buffers
C8	RK4 sub-stage gravity reuse	RISKY. Inside RK4, the 4 sub-stage gravity evals use linearly-extrapolated source positions (the ephemeris already approximates this). For very small `dt`, the SH coefficient products (e.g. `c_ii × cos_mlambda`) might be reusable across sub-stages. Bit-identity breaks unless we can prove the f64 round-off is < 1 ULP at the integration accumulator.	(potentially 10-20%)	L	HIGH — likely breaks bit-identity; flag for policy discussion
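The `(n,m) → idx` helper named in C1 can be sketched as a triangular index into one contiguous buffer. This is a minimal illustration assuming a packed lower-triangular layout with m ≤ n; `FlatTri` and `tri_index` are hypothetical names, and the real `SphericalHarmonicsData` layout may differ.

```rust
/// Row-major index of (n, m) in a packed lower triangle, where row n holds
/// n + 1 entries (m = 0..=n). Rows 0..n contribute n(n+1)/2 entries.
#[inline]
fn tri_index(n: usize, m: usize) -> usize {
    debug_assert!(m <= n);
    n * (n + 1) / 2 + m
}

/// Hypothetical flat replacement for a jagged Vec<Vec<f64>>.
struct FlatTri {
    degree: usize,
    data: Vec<f64>,
}

impl FlatTri {
    /// Copy a jagged triangle (row n has n + 1 entries) into flat storage.
    fn from_jagged(rows: &[Vec<f64>]) -> Self {
        let degree = rows.len().saturating_sub(1);
        let mut data = Vec::with_capacity(tri_index(degree + 1, 0));
        for (n, row) in rows.iter().enumerate() {
            assert_eq!(row.len(), n + 1, "row {n} must have n + 1 entries");
            data.extend_from_slice(row);
        }
        Self { degree, data }
    }

    /// Single indexed load, with no intermediate row pointer to chase.
    #[inline]
    fn get(&self, n: usize, m: usize) -> f64 {
        self.data[tri_index(n, m)]
    }
}

fn main() {
    // Build a jagged degree-60 triangle with a recognizable value per (n, m).
    let jagged: Vec<Vec<f64>> = (0..=60usize)
        .map(|n| (0..=n).map(|m| (n * 100 + m) as f64).collect())
        .collect();
    let flat = FlatTri::from_jagged(&jagged);
    println!("degree {} -> {} coefficients", flat.degree, flat.data.len());
    // Every element must read back unchanged, so downstream f64 math is untouched.
    for n in 0..=flat.degree {
        for m in 0..=n {
            assert_eq!(flat.get(n, m).to_bits(), jagged[n][m].to_bits());
        }
    }
}
```

The conversion only copies values, so the stored f64 bit patterns and the order of arithmetic in the kernel are unaffected; what changes is how each coefficient is addressed.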

Cumulative bit-identical estimate (C1+C2+C3+C5+C6+C7): ~20-35% step-time reduction. That brings tier3 from 17 min to roughly 11-13.6 min, still short of the 8.5 min target.

The remaining gap to a 2× speedup requires either:

Successful C8 (RK4 sub-stage reuse) — but bit-identity-risky.
Algorithmic change to the Gottlieb recursion itself — e.g. precomputing the trigonometric cos_mlambda / sin_mlambda rows for a fixed-axis rotation, or using a tabulated normalized associated Legendre function. Out of scope for this report; needs numerical-methods research.
Reducing the SH degree from 60 → some lower number (would change physics, not in scope).

5. Bit-identity verification gate (Phase D)

Before any C1..C7 candidate can land:

1. cargo nextest run --workspace -E 'test(bevy_parity)' → 233/233 passing. Must remain bit-identical between astrodyn_runner and astrodyn_bevy paths.
2. cargo nextest run -E 'test(tier3_simulation_earth_moon_clem)' → passes under existing tolerances [0.832, 0.331, 0.972] m with no widening.
3. diff target/tier3_crossval/tier3_earth_moon_clem.json{,.pre-change} → exit 0 (byte-identical max position errors).
4. scripts/check_no_bypass_deps.sh → OK.
5. cargo bench -p astrodyn_gravity --bench accumulate -p astrodyn_dynamics --bench integration -p astrodyn_runner --bench step → measurable speedup with stdev separation from baseline.

C8 fails #3 by hypothesis and is excluded until/unless the bit-identity policy is amended.
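The gate does not define "stdev separation" numerically. One possible reading, offered here as an assumption rather than the project's rule, is that the candidate's mean plus k standard deviations stays below the baseline's mean minus k standard deviations. `RunStats` and `is_separated` are hypothetical names for this sketch.

```rust
/// Summary of one benchmark run.
#[derive(Clone, Copy)]
struct RunStats {
    mean_secs: f64,
    stdev_secs: f64,
}

/// True when `candidate` is faster than `baseline` by more than `k` standard
/// deviations on each side (an assumed criterion, not a project-defined one).
fn is_separated(baseline: RunStats, candidate: RunStats, k: f64) -> bool {
    candidate.mean_secs + k * candidate.stdev_secs
        < baseline.mean_secs - k * baseline.stdev_secs
}

fn main() {
    // Baseline values from the A1 perf-runner run in this report.
    let baseline = RunStats { mean_secs: 8.326, stdev_secs: 0.279 };
    // Hypothetical candidate runs, for illustration only.
    let small_gain = RunStats { mean_secs: 8.10, stdev_secs: 0.25 };
    let large_gain = RunStats { mean_secs: 7.40, stdev_secs: 0.25 };
    println!("small gain separated: {}", is_separated(baseline, small_gain, 1.0));
    println!("large gain separated: {}", is_separated(baseline, large_gain, 1.0));
}
```

With the A1 baseline's stdev at about 3.4% of the mean, a 3-5% candidate sits close to the noise floor under this criterion, so more repeats or the Criterion benches may be needed to demonstrate it.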

6. Sequencing recommendation

Order to attack candidates, by ROI / cost / dependency:

1. C5 (per-step Epoch cache): small, isolated change in crates/astrodyn_runner/src/simulation/step/ephemeris.rs. Lowest blast radius. Probably 3-5% in one PR. Land first to validate that the bit-identity gate is easy to satisfy.
2. C1 + C2 (flat-storage refactor, bundled): same file (crates/astrodyn_gravity/src/spherical_harmonics_gravity_source.rs), same migration. Bundling avoids two rounds of bench/parity-test noise. ~8-20%, the biggest single PR in this set.
3. C3 (hoist vector3_transform): cross-crate API change; wait until after C1+C2 so the bench numbers are stable.
4. C6 (anise frame-chain cache): depends on anise API discovery; may be obsoleted by an anise upstream fix.
5. C7 (bounds-check audit): opportunistic; do alongside any of the above when reading the kernel.
6. C4 (inline annotations): opportunistic; combine with C7.

After 1-5 land, re-baseline and decide whether to invest in C8 (with policy discussion) or new candidates surfacing from the updated flamegraphs.

7. Risk inventory

PR #462 not merged yet. This page references the toolkit that's on 447-perf-toolkit. The methodology and numbers stand regardless of merge timing; the file paths are pre-merge.
Anise upstream churn. C5/C6 touch anise call sites; an anise version bump may obsolete our caching. Pin anise's version in Cargo.lock for the duration of the C5/C6 PRs.
WSL2 vs CI variance. Numbers here are from a WSL2 dev host. CI runner numbers will differ; treat the percentage breakdowns as the portable metric, not the absolute µs/step values.
C8 policy unlock. If the team decides the 2× target is worth relaxing bit-identity (e.g. accepting tolerance widening from [0.832, 0.331, 0.972] m to [1.5, 1.5, 1.5] m), C8 becomes feasible and the candidate inventory grows. That's a CLAUDE.md amendment + maintainer call.

8. Follow-on issues

Filed against the top-3 candidates by ROI / cost:

#464 (C5) — Cache Epoch once per update_ephemeris instead of 3×. Estimated 3-5% step time, S cost, LOW risk.
#465 (C1 + C2) — Flat-storage refactor for SH coefficients + Gottlieb helpers. Estimated 8-20% step time, M cost, LOW risk. Biggest single PR.
#466 (C3) — Hoist vector3_transform from gravity kernel to caller. Estimated 2-4% step time, M cost (cross-crate API change), LOW risk.

Each issue cites this page, names the verification gates, and lists the file:line loci.

Appendix A — How to reproduce

# Phase A baselines (writes target/perf/baseline.json)
cargo xtask perf-baseline --steps 100000 --warmup 1000 --repeat 5 \
    --phase-timing --output target/perf/baseline.json

# Phase A full criterion run (writes target/criterion/<bench>/)
cargo bench -p astrodyn_gravity --bench accumulate \
            -p astrodyn_dynamics --bench integration \
            -p astrodyn_runner --bench step

# Phase A flamegraphs via pprof
# (writes target/criterion/<bench>/profile/flamegraph.svg)
cargo bench -p astrodyn_gravity --bench accumulate \
            -p astrodyn_runner --bench step \
            -p astrodyn_dynamics --bench integration -- --profile-time 10

# Phase A tier3 wall-clock baseline
cargo nextest run --release -p astrodyn_verif_jeod \
    --test tier3_sim_earth_moon \
    -E 'test(tier3_simulation_earth_moon_clem)'

Investigation captured against PR #462 commit 2ca4631 (447-perf-toolkit branch).
