Earth Moon Performance Investigation
Status: First-cut report from the #447 performance toolkit (landed in PR #462). Captures Phase A (baselines) + Phase B (per-phase drilldown) + Phase C (candidate inventory). Each named candidate (C1..Cn) is a tracking hook for a follow-on optimization PR.
Target: halve the tier3 earth_moon_clem runtime — from the
current ~17 min wall-clock (~26 min on WSL2 dev hosts) down to ≤8.5 min.
Per-step is currently 83 µs/step; target ≤40 µs/step.
Bit-identity invariant. Every candidate below must preserve
byte-equality of the tier3 cross-val JSON and pass all 53
bevy_parity_* tests. Rules out FMA contraction, op reordering,
target-cpu=native, Rayon inside step(). Candidates that would
require relaxing those constraints are flagged "RISKY" and not
pursued without a separate policy discussion.
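A minimal sketch of why the invariant excludes op reordering and FMA contraction. The function names are illustrative and do not come from the astrodyn crates; the inputs are chosen only because they make the rounding difference visible.

```rust
/// Sums left to right, the order a plain accumulation loop would use.
pub fn sum_left(a: f64, b: f64, c: f64) -> f64 {
    (a + b) + c
}

/// Same operands, reassociated. Mathematically equal, but rounded differently.
pub fn sum_right(a: f64, b: f64, c: f64) -> f64 {
    a + (b + c)
}

/// Multiply, round, then subtract and round again (two roundings).
pub fn mul_sub_unfused(x: f64, c: f64) -> f64 {
    x * x - c
}

/// Fused multiply-add: the product is not rounded before the subtraction.
pub fn mul_sub_fused(x: f64, c: f64) -> f64 {
    x.mul_add(x, -c)
}

/// Bit-level equality, which is stricter than `==` (it separates 0.0 from -0.0).
pub fn bit_identical(a: f64, b: f64) -> bool {
    a.to_bits() == b.to_bits()
}
```

With `x = 1 + 2^-30` and `c = 1 + 2^-29`, the unfused form returns exactly `0.0` while the fused form returns `2^-60`. A compiler flag that silently contracts the first form into the second would therefore change the bytes of the cross-val JSON.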
| Metric | Current | Target | Gap |
|---|---|---|---|
| tier3 wall-clock (CI runner) | ~17 min | ≤8.5 min | 2× |
| per_step_us (perf-runner, 100k × 5) | 83.26 | ≤40 | 2× |
| integration phase | 53.58 µs/step (64.4%) | ≤27 | 2× |
| ephemeris phase | 15.09 µs/step (18.1%) | ≤7.5 | 2× |
| environment phase | 13.63 µs/step (16.4%) | ≤7 | 2× |
The hot path is gravity (combined 81% of step time across
integration and environment phases), specifically the
spherical-harmonics kernel for the Moon LP150Q 60×60 source, called
4× per RK4 step + 1× per environment update. Anything that reduces
the per-call cost of gravitation compounds at 5× per step.
Top-3 candidates by estimated ROI (full table in §4):
- C1+C2 (flat-storage SH coefficients & helpers): replace `Vec<Vec<f64>>` with flat arrays. Removes ≥1 pointer-deref per inner-loop access. Estimated 10-20% kernel speedup ⇒ 8-16% of step time (since gravity is ~80% of step). Bit-identical (same f64 ops, same order).
- C5 (ephemeris per-step `Epoch` reuse): `Epoch::from_tdb_seconds` and `Epoch::to_time_scale` ≈ 2-3% of step time each, called 3× per step with the same `tdb_jd` input. Cache once per `step_internal`. Estimated 3-5% of step time. Bit-identical.
- C3 (hoist `vector3_transform` from kernel to caller): the inertial→pfix matrix-vector product runs once per `gravitation` call (5× per step, 1 source). Hoist once per step. Estimated 2-4% of step time. Bit-identical.
If C1+C2+C3+C5 land cumulatively, that's ~15-25% step-time reduction — a meaningful start but not yet a 2× speedup. Reaching the 8.5 min target requires further candidates from §4 or revisiting the bit-identity policy (§5).
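The scaling used in these estimates (a local speedup multiplied by the component's share of the step) can be sketched as below. The helper names are hypothetical; the figures in the usage note are the ones quoted above.

```rust
/// Fraction of total step time saved when a component that accounts for
/// `share` of the step becomes `local_saving` cheaper (both in 0.0..=1.0).
pub fn step_saving(share: f64, local_saving: f64) -> f64 {
    share * local_saving
}

/// Overall speedup factor obtained from a total step-time saving.
pub fn speedup(total_saving: f64) -> f64 {
    1.0 / (1.0 - total_saving)
}

/// Total step-time saving needed to reach a target speedup factor.
pub fn required_saving(target_speedup: f64) -> f64 {
    1.0 - 1.0 / target_speedup
}
```

A 10-20% kernel speedup on a ~80% share gives `step_saving(0.8, 0.10)` = 8% to `step_saving(0.8, 0.20)` = 16%. The upper cumulative estimate of 25% corresponds to `speedup(0.25)` ≈ 1.33×, whereas `required_saving(2.0)` is 50%, which is why the bit-identical candidates alone fall short.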
Run: cargo xtask perf-baseline --steps 100000 --warmup 1000 --repeat 5 --phase-timing --output target/perf/baseline.json
mean_secs 8.326 stdev_secs 0.279 (5 cold-setup repeats)
per_step_us 83.26
max_rss_mb 43.87
phase_timings (us/step, summed over 500k step calls):
time_advance 0.164 ( 0.20%)
ephemeris 15.088 (18.13%) ← #2 hot phase
mass_recompute 0.033 ( 0.04%)
integ_origins_pre 0.056 ( 0.07%)
kinematic_pre 0.039 ( 0.05%)
environment 13.629 (16.37%) ← #3 hot phase
interactions 0.093 ( 0.11%)
integration 53.581 (64.36%) ← #1 hot phase
integ_origins_post 0.051 ( 0.06%)
kinematic_post 0.048 ( 0.06%)
derived 0.047 ( 0.06%)
detached_subtrees 0.000 ( 0.00%)
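The three hot phases sum to 98.86% of measured step time. The remaining 1.14% covers all the JEOD orchestration phases combined (the rows above account for about 0.65% of it; the rest is presumably untimed per-step overhead). None of them is an optimization target.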
Run: cargo bench -p astrodyn_gravity --bench accumulate -p astrodyn_dynamics --bench integration -p astrodyn_runner --bench step
| Bench | Per-call |
|---|---|
| accumulate_gravity/d4 (Earth GGM05C @ 4×4) | ~65 ns |
| accumulate_gravity/d20 (Earth GGM05C @ 20×20) | 1.08 µs |
| accumulate_gravity/d60 (Moon LP150Q @ 60×60, production hot path) | 8.51 µs |
| rk4_sixdof_step/trivial_accel (RK4 plumbing only) | 43 ns |
| rk4_sixdof_step/degree60_accel (RK4 + d60 gravity) | 27 µs |
| simulation_step/earth_moon_clem_dt32hz (full step) | 57.88 µs |
Cross-checks:
- `d60` × 4 RK4 stages = 8.51 × 4 = 34.0 µs vs. `degree60_accel` = 27 µs. The bench comes in about 7 µs under the arithmetic. Neither the `trivial_accel` plumbing (43 ns) nor the closure-captured `RefCell<GottliebScratch>` borrow (~1 ns × 4 = 4 ns) is large enough to matter, so the difference is attributed to LLVM cross-stage CSE between the 4 sub-steps.
- `simulation_step` (57.88 µs) is 30% below the perf-runner per_step_us (83 µs) because the bench keeps the simulation hot in cache after 1000 warmup steps, whereas the perf-runner re-cold-starts between repeats and accumulates cache-miss + page-fault transients. The perf-runner number is the reliable production-cost estimate; the bench is the reliable per-iteration optimization tracker.
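The two cross-checks are plain arithmetic on the table values. A small sketch with hypothetical helper names, so the comparison can be repeated after a re-baseline:

```rust
/// Expected cost of `calls` back-to-back kernel evaluations, in µs.
pub fn stage_sum_us(per_call_us: f64, calls: u32) -> f64 {
    per_call_us * f64::from(calls)
}

/// How far `measured` falls below `reference`, as a fraction of `reference`.
pub fn fraction_below(reference: f64, measured: f64) -> f64 {
    (reference - measured) / reference
}
```

`stage_sum_us(8.51, 4)` gives 34.04 µs against the 27 µs bench, and `fraction_below(83.26, 57.88)` gives roughly 0.305, the 30% gap between the bench and the perf-runner.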
The release-with-debug profile's debug = "line-tables-only" was
insufficient to symbolicate the samply trace without
--unstable-presymbolicate. Criterion's pprof integration with
PProfProfiler did produce symbolicated SVGs at:
- target/criterion/simulation_step/earth_moon_clem_dt32hz/profile/flamegraph.svg
- target/criterion/accumulate_gravity/{d4,d20,d60}/profile/flamegraph.svg
- target/criterion/rk4_sixdof_step/{trivial_accel,degree60_accel}/profile/flamegraph.svg
The microbench flamegraphs (accumulate_gravity and rk4_sixdof_step)
collapsed to bencher-harness frames — LTO inlined the kernel into
the closure, so pprof can't see inside it without disabling LTO. The
simulation_step flamegraph does resolve symbols at the step level
because step_internal is too large to inline.
Two runs captured during PR #462 verification:
- Run 1 (pre-refactor, single-process): 1554 s = 25.9 min.
- Run 2 (post-refactor, CPU-contended with parallel parity run): 2321 s = 38.7 min.
Run 1 is the reliable single-process baseline on this dev host. The CI runner profile is ~17 min per the issue.
Note on CI runtime-budget job (#447 PR-4): the 1500 s budget set in PR #462 assumed the issue's 17 min number. WSL2 dev host produces ~26 min single-process. If the CI runner profile drifts toward WSL2-like cost, the budget should be raised — not papered over.
Source: parsing <title> elements from the simulation_step
flamegraph (the only flamegraph that resolves below the bencher harness).
Percentages are inclusive: a sample counts for every frame in its
stack, so they add up to more than 100%.
35.60% run_integration (RK4 driver)
35.26% integrate_body (per-body RK4 driver)
35.10% gravitation (gravity dispatch — both env + integration)
35.10% GravityControl::evaluate_inner (single-source eval site)
Reading: virtually all of run_integration is gravitation. The
RK4 plumbing itself is <0.5% of step time (consistent with the bench
showing trivial_accel at 43 ns vs degree60_accel at 27 µs).
The kernel itself (calc_nonspherical_with_scratch) does not appear as
a separate frame because LTO inlined it into gravitation. The
attribution lands on gravitation directly.
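A rough sketch of the `<title>` extraction described above, using only the standard library. It assumes each title reads `name (N samples, P%)`, which should be verified against the actual SVG; XML entities such as `&lt;` are left unescaped, and the function names are hypothetical.

```rust
/// One frame recovered from a flamegraph `<title>` element.
#[derive(Debug, PartialEq)]
pub struct Frame {
    pub name: String,
    pub percent: f64,
}

/// Parses one title, assumed to look like `name (N samples, P%)`.
/// Splits from the right because symbol names may contain " (" or ", ".
pub fn parse_title(text: &str) -> Option<Frame> {
    let (name, tail) = text.rsplit_once(" (")?;
    let tail = tail.strip_suffix("%)")?;
    let (_samples, pct) = tail.rsplit_once(", ")?;
    Some(Frame {
        name: name.to_string(),
        percent: pct.trim().parse().ok()?,
    })
}

/// Collects every parsable `<title>` in the SVG text, hottest frame first.
pub fn hot_frames(svg: &str) -> Vec<Frame> {
    let mut frames = Vec::new();
    let mut rest = svg;
    while let Some(start) = rest.find("<title>") {
        let after = &rest[start + "<title>".len()..];
        let Some(end) = after.find("</title>") else { break };
        if let Some(frame) = parse_title(&after[..end]) {
            frames.push(frame);
        }
        rest = &after[end + "</title>".len()..];
    }
    frames.sort_by(|a, b| b.percent.total_cmp(&a.percent));
    frames
}
```

Reading the SVG with `std::fs::read_to_string` and printing the first few entries of `hot_frames` would give a listing in the same shape as the one above.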
The environment phase runs update_environment which calls
gravitation once per source before RK4. The hot symbols are
identical to B1 — same kernel, same dispatch. Optimizations on the
kernel improve both phases.
Three queries per step in the Earth–Moon scenario:
- `eph.get_state_typed(Earth, Moon, tdb_jd)`: Earth position.
- `eph.get_state_typed(Sun, Moon, tdb_jd)`: Sun position.
- `eph.get_body_rotation(Moon, tdb_jd)`: Moon BPC libration matrix.
Hot symbols from the flamegraph (inclusive samples within step_internal):
8.77% Ephemeris::get_state_typed
5.63% anise::translate_to_parent (Chebyshev evaluator entry)
3.31% Type2ChebyshevSet::evaluate (the actual polynomial math)
2.98% ephemeris_path_to_root (frame-chain walk)
2.81% spk_summary_at_epoch (SPK segment lookup)
2.48% hifitime::Epoch::to_time_scale (time-scale conversion, inside translate)
0.83% Ephemeris::get_body_rotation (Moon BPC; smaller than translate)
Insights:
- `Epoch::to_time_scale` is 2.48% and called from inside anise's translate path. Since all three of our per-step queries use the same `tdb_jd`, the Epoch ctor + time-scale conversion is repeated redundantly 3× per step. Hoisting saves the duplicated work.
- `spk_summary_at_epoch` and `ephemeris_path_to_root` are anise internals doing per-call SPK segment lookup + frame-chain walk. Both inputs (segment + chain) are time-invariant: the same `(target, observer)` pair traces the same chain every step. Caching the resolved frame chain across steps could eliminate both.
- The Moon BPC rotation is only 0.83%, cheaper than the issue's "~16%" total ephemeris breakdown suggested. The bulk of the ephemeris phase is the two DE421 translate calls.
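The first insight (build the epoch once, pass it to all three queries) might look like the sketch below. `Epoch` and `Ephemeris` here are simplified stand-ins written for this example, not the hifitime or anise types, and the counter exists only to make the saved work observable.

```rust
use std::cell::Cell;

/// Stand-in for a time-library epoch (hypothetical, not `hifitime::Epoch`).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Epoch {
    tdb_seconds: f64,
}

/// Stand-in ephemeris that counts how many epochs were constructed.
pub struct Ephemeris {
    conversions: Cell<u32>,
}

impl Ephemeris {
    pub fn new() -> Self {
        Self { conversions: Cell::new(0) }
    }

    pub fn conversions(&self) -> u32 {
        self.conversions.get()
    }

    /// Models the per-query constructor + time-scale conversion cost.
    pub fn epoch_from_tdb_seconds(&self, tdb_seconds: f64) -> Epoch {
        self.conversions.set(self.conversions.get() + 1);
        Epoch { tdb_seconds }
    }

    /// Placeholder query: the real one would evaluate an ephemeris segment.
    pub fn state_at(&self, body: u32, epoch: Epoch) -> f64 {
        f64::from(body) + epoch.tdb_seconds
    }
}

/// Before: every query rebuilds the epoch from the same input.
pub fn update_uncached(eph: &Ephemeris, tdb_s: f64) -> [f64; 3] {
    [
        eph.state_at(1, eph.epoch_from_tdb_seconds(tdb_s)),
        eph.state_at(2, eph.epoch_from_tdb_seconds(tdb_s)),
        eph.state_at(3, eph.epoch_from_tdb_seconds(tdb_s)),
    ]
}

/// After: one epoch per step, shared by all three queries.
pub fn update_cached(eph: &Ephemeris, tdb_s: f64) -> [f64; 3] {
    let epoch = eph.epoch_from_tdb_seconds(tdb_s);
    [eph.state_at(1, epoch), eph.state_at(2, epoch), eph.state_at(3, epoch)]
}
```

Because the same epoch value feeds the same arithmetic, the two variants return identical bits in this model. Whether that holds for the real call sites depends on the conversion being a pure function of its input, which the bit-identity gate would confirm.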
ROI estimates come from % of step time saved — derived from §3
percentages and conservative assumptions about how much of each hot
symbol's cost the candidate eliminates. Each candidate must pass the
§5 bit-identity gate before merge.
| ID | Candidate | Mechanism | Est. ROI | Cost | Risk |
|---|---|---|---|---|---|
| C1 | Flat-storage SH coefficients | Replace `SphericalHarmonicsData.cnm` / `snm: Vec<Vec<f64>>` (jagged, 60+1 rows) with flat `Vec<f64>` + `(n,m) → idx` helper. Inner-loop access drops one pointer-deref. | 8-16% | M (~200 LOC, all in one crate) | LOW (same f64 ops, same order) |
| C2 | Flat-storage Gottlieb helpers | Same as C1 for `xi` / `eta` / `zeta` / `upsilon: Vec<Vec<f64>>`. Inner loop accesses these alongside coefficients. | 3-5% (bundle with C1) | M | LOW |
| C3 | Hoist `vector3_transform` from kernel | `vector3_transform(t_parent_this, pos)` runs once per `gravitation` entry. Move to the caller (`accumulate.rs`) so it runs once per step per source, not 5×. | 2-4% | M (cross-crate API change) | LOW |
| C4 | Inline annotations on hot helpers | Audit `pnm()` / `pnm_mut()` and other small helpers; add `#[inline(always)]` if codegen evidence shows missed inlining under LTO. | 0-2% (likely noise) | S | LOW |
| C5 | Per-step Epoch cache | Build `Epoch::from_tdb_seconds(tdb_s)` once at the top of `update_ephemeris` and pass to the 3 query calls. Saves Epoch ctor (line 105 + 158) and `to_time_scale` inside anise translate (2.48% of step time per query). | 3-5% | S (~30 LOC) | LOW |
| C6 | Anise frame-chain cache across steps | `ephemeris_path_to_root` (2.98%) + `spk_summary_at_epoch` (2.81%) re-resolve the same `(target, observer)` chain every step. Cache the resolved chain on `Ephemeris` keyed by `(target, observer)`. Anise may already have this internally; verify before adding ours. | 3-5% (overlapping with C5) | M (depends on anise's internal API surface) | MEDIUM (need to confirm anise's chain resolution is deterministic per epoch) |
| C7 | Bounds-check elimination audit | The kernel's `pnm_flat[offset + m]` indexed accesses may not be eliding bounds checks under LTO. Inspect codegen via `cargo asm` on `calc_nonspherical_with_scratch`. If bounds checks fire, use `get_unchecked` (locally `#[allow(clippy::indexing_slicing)]`) with safety comments. | 1-3% | S | LOW (read-only access into pre-sized buffers) |
| C8 | RK4 sub-stage gravity reuse | RISKY. Inside RK4, the 4 sub-stage gravity evals use linearly-extrapolated source positions (the ephemeris already approximates this). For very small dt, the SH coefficient products (e.g. `c_ii × cos_mlambda`) might be reusable across sub-stages. Bit-identity breaks unless we can prove the f64 round-off is < 1 ULP at the integration accumulator. | (potentially 10-20%) | L | HIGH (likely breaks bit-identity; flag for policy discussion) |
Cumulative bit-identical estimate (C1+C2+C3+C5+C6+C7): ~20-35% step-time reduction. Brings tier3 from 17 min → roughly 11-13.6 min. Still short of the 8.5 min target.
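A minimal sketch of the flat-storage idea behind C1/C2. It assumes the jagged rows are lower-triangular (row `n` holds orders `0..=n`), which has to be checked against the actual `SphericalHarmonicsData` layout; the type and method names are hypothetical.

```rust
/// Lower-triangular coefficient table stored in one contiguous buffer.
pub struct FlatTriangular {
    degree: usize,
    data: Vec<f64>,
}

impl FlatTriangular {
    /// Row `n` holds orders `m = 0..=n`, so it starts at n(n+1)/2.
    #[inline]
    pub fn idx(n: usize, m: usize) -> usize {
        n * (n + 1) / 2 + m
    }

    /// Copies a jagged table row by row. Values are moved, not recomputed,
    /// so every coefficient keeps its exact bit pattern.
    pub fn from_jagged(rows: &[Vec<f64>]) -> Self {
        let mut data = Vec::with_capacity(Self::idx(rows.len(), 0));
        for (n, row) in rows.iter().enumerate() {
            assert_eq!(row.len(), n + 1, "row {n} must hold orders 0..={n}");
            data.extend_from_slice(row);
        }
        Self { degree: rows.len().saturating_sub(1), data }
    }

    /// Single-indirection lookup replacing `rows[n][m]`.
    #[inline]
    pub fn get(&self, n: usize, m: usize) -> f64 {
        debug_assert!(m <= n && n <= self.degree);
        self.data[Self::idx(n, m)]
    }

    /// Contiguous view of one degree, convenient for the inner loop over `m`.
    pub fn row(&self, n: usize) -> &[f64] {
        &self.data[Self::idx(n, 0)..=Self::idx(n, n)]
    }
}
```

For a 60×60 field this layout holds 61 × 62 / 2 = 1891 values in a single allocation. Only the addressing changes; the kernel's arithmetic and its order stay as they are, which is the basis for the LOW risk rating.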
The remaining gap to a 2× speedup requires either:
- Successful C8 (RK4 sub-stage reuse), which is bit-identity-risky.
- Algorithmic change to the Gottlieb recursion itself, e.g. precomputing the trigonometric `cos_mlambda` / `sin_mlambda` rows for a fixed-axis rotation, or using a tabulated normalized associated Legendre function. Out of scope for this report; needs numerical-methods research.
- Reducing the SH degree from 60 → some lower number (would change physics, not in scope).
Before any C1..C7 candidate can land:
1. `cargo nextest run --workspace -E 'test(bevy_parity)'` → 233/233 passing. Must remain bit-identical between `astrodyn_runner` and `astrodyn_bevy` paths.
2. `cargo nextest run -E 'test(tier3_simulation_earth_moon_clem)'` → passes under existing tolerances `[0.832, 0.331, 0.972]` m with no widening.
3. `diff target/tier3_crossval/tier3_earth_moon_clem.json{,.pre-change}` → exit 0 (byte-identical max position errors).
4. `scripts/check_no_bypass_deps.sh` → OK.
5. `cargo bench -p astrodyn_gravity --bench accumulate -p astrodyn_dynamics --bench integration -p astrodyn_runner --bench step` → measurable speedup with stdev separation from baseline.
C8 fails #3 by hypothesis and is excluded until/unless the bit-identity policy is amended.
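The page does not define "stdev separation" precisely. One possible reading, sketched with hypothetical names, is that the candidate's mean plus `k` standard deviations must sit below the baseline's mean minus `k` standard deviations:

```rust
/// Mean and standard deviation of a repeated timing measurement.
#[derive(Clone, Copy, Debug)]
pub struct Sample {
    pub mean: f64,
    pub stdev: f64,
}

/// True when the candidate is faster and the `mean ± k·stdev` bands of the
/// two measurements do not overlap. The choice of `k` is a policy decision.
pub fn is_separated_speedup(baseline: Sample, candidate: Sample, k: f64) -> bool {
    candidate.mean + k * candidate.stdev < baseline.mean - k * baseline.stdev
}
```

Against the perf-runner baseline of 8.326 s ± 0.279 s and `k = 1`, an illustrative candidate at 7.9 s ± 0.25 s would not count as separated, while one at 7.4 s ± 0.25 s would. The candidate figures are made up for the example.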
Order to attack candidates, by ROI / cost / dependency:
1. C5 (per-step Epoch cache): small, isolated change in `crates/astrodyn_runner/src/simulation/step/ephemeris.rs`. Lowest blast radius. Probably 3-5% in one PR. Land first to validate the bit-identity gate is easy to satisfy.
2. C1 + C2 (flat-storage refactor, bundled): same file (`crates/astrodyn_gravity/src/spherical_harmonics_gravity_source.rs`), same migration. Bundling avoids two rounds of bench/parity-test noise. ~8-20%, the biggest single PR in this set.
3. C3 (hoist `vector3_transform`): cross-crate API change; wait until after C1+C2 so the bench numbers are stable.
4. C6 (anise frame-chain cache): depends on anise API discovery; may be obsoleted by an anise upstream fix.
5. C7 (bounds-check audit): opportunistic; do alongside any of the above when reading the kernel.
6. C4 (inline annotations): opportunistic; combine with C7.
After 1-5 land, re-baseline and decide whether to invest in C8 (with policy discussion) or new candidates surfacing from the updated flamegraphs.
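For the C7 audit, one safe alternative to `get_unchecked` is worth checking first: slicing the row once so the bounds check happens outside the loop. The sketch below uses hypothetical function names and a simplified dot product in place of the real kernel; whether the optimizer actually drops the per-element checks has to be confirmed from the generated assembly.

```rust
/// Indexed form: each `offset + m` access may carry its own bounds check.
pub fn dot_indexed(coeffs: &[f64], pnm: &[f64], offset: usize, count: usize) -> f64 {
    let mut acc = 0.0;
    for m in 0..count {
        acc += coeffs[offset + m] * pnm[offset + m];
    }
    acc
}

/// Sliced form: the range is validated once per slice, then the loop
/// iterates without indexing. Same operands, same order of operations.
pub fn dot_sliced(coeffs: &[f64], pnm: &[f64], offset: usize, count: usize) -> f64 {
    let c = &coeffs[offset..offset + count];
    let p = &pnm[offset..offset + count];
    let mut acc = 0.0;
    for (a, b) in c.iter().zip(p) {
        acc += a * b;
    }
    acc
}
```

Both forms multiply and accumulate in the same sequence, so the result is the same bit pattern, and no `unsafe` block is needed.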
- PR #462 not merged yet. This page references the toolkit that's on `447-perf-toolkit`. The methodology and numbers stand regardless of merge timing; the file paths are pre-merge.
- Anise upstream churn. C5/C6 touch anise call sites; an anise version bump may obsolete our caching. Pin anise's version in `Cargo.lock` for the duration of the C5/C6 PRs.
- WSL2 vs CI variance. Numbers here are from a WSL2 dev host. CI runner numbers will differ; treat the percentage breakdowns as the portable metric, not the absolute µs/step values.
- C8 policy unlock. If the team decides the 2× target is worth relaxing bit-identity (e.g. accepting tolerance widening from `[0.832, 0.331, 0.972]` m to `[1.5, 1.5, 1.5]` m), C8 becomes feasible and the candidate inventory grows. That's a CLAUDE.md amendment + maintainer call.
Filed against the top-3 candidates by ROI / cost:
- #464 (C5): Cache `Epoch` once per `update_ephemeris` instead of 3×. Estimated 3-5% step time, S cost, LOW risk.
- #465 (C1 + C2): Flat-storage refactor for SH coefficients + Gottlieb helpers. Estimated 8-20% step time, M cost, LOW risk. Biggest single PR.
- #466 (C3): Hoist `vector3_transform` from gravity kernel to caller. Estimated 2-4% step time, M cost (cross-crate API change), LOW risk.
Each issue cites this page, names the verification gates, and lists the file:line loci.
# Phase A baselines (writes target/perf/baseline.json)
cargo xtask perf-baseline --steps 100000 --warmup 1000 --repeat 5 \
--phase-timing --output target/perf/baseline.json
# Phase A full criterion run (writes target/criterion/<bench>/)
cargo bench -p astrodyn_gravity --bench accumulate \
-p astrodyn_dynamics --bench integration \
-p astrodyn_runner --bench step
# Phase A flamegraphs via pprof
# (writes target/criterion/<bench>/profile/flamegraph.svg)
cargo bench -p astrodyn_gravity --bench accumulate \
-p astrodyn_runner --bench step \
-p astrodyn_dynamics --bench integration -- --profile-time 10
# Phase A tier3 wall-clock baseline
cargo nextest run --release -p astrodyn_verif_jeod \
--test tier3_sim_earth_moon \
-E 'test(tier3_simulation_earth_moon_clem)'

Investigation captured against PR #462 commit 2ca4631 (447-perf-toolkit branch).