Perf: clone DM in Integral.evaluate() (3× speedup; closes #171 narrative)#172
Integral.evaluate() called PetscDSSetObjective on the shared mesh DS on
every call and then ran DMPlexComputeIntegralFEM. The user reports that
this makes diagnostic-integral cost grow linearly across long runs: convection
diagnostics at Ra=1e7 / 1/64 mesh / ~3000 timesteps showed t_vol going
from 5 s to 833 s (≈ +0.65 s per integral call), while solve costs
(t_stokes, t_adv) stayed flat over the same window. By step ~3700 the
diagnostic was ~20× the actual solve.
The sister class CellWiseIntegral.evaluate() already takes
the right path: clone the mesh DM, attach a default FE field, create a
fresh DS, set the objective on that fresh DS, integrate, and return.
This commit brings Integral.evaluate() in line with that pattern, with
one improvement over CellWiseIntegral: explicit dmc.destroy() at the
end so long runs don't accumulate cloned DMs (the broader DM-lifecycle
cleanup is tracked separately).
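For orientation, a rough petsc4py-level sketch of the pattern described above (a sketch only: the function and variable names are illustrative, and the real `Integral.evaluate()` does the objective setup and `DMPlexComputeIntegralFEM` call at the Cython level):

```python
from petsc4py import PETSc

def integrate_on_scratch_dm(mesh_dm, dim):
    # Clone the shared mesh DM so its DS is never mutated by the integral.
    dmc = mesh_dm.clone()
    try:
        # Attach one default FE field and build a fresh DS on the clone,
        # mirroring the CellWiseIntegral pattern referenced above.
        fe = PETSc.FE().createDefault(dim, 1, False, -1)
        dmc.setField(0, fe)
        dmc.createDS()
        ds = dmc.getDS()
        # ... set the objective callback on `ds` and run the FEM integral
        # over `dmc` (done via PetscDSSetObjective / DMPlexComputeIntegralFEM
        # in the Cython implementation) ...
    finally:
        # Explicit cleanup so long runs don't accumulate cloned DMs.
        dmc.destroy()
```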
The symptomatic linear growth does not reproduce locally at small mesh and
small call-count (200 calls × a 100-cell mesh stays flat at ~2.5 ms per
3 integrals both before and after the fix); the behaviour is scale-dependent.
Confirmation at the original Ra=1e7 / 1/64 / 3000-step scale is left as a
re-run on the reporter's side.
Verified
--------
- /tmp/repro_171.py — 200 evaluate() calls × 3 integrals each on a
100-cell mesh: post-fix mean ratio last/first = 0.04x (flat).
- /tmp/repro_171b.py — same, with a Stokes solve interleaved between
integrations: post-fix ratio 0.21x (flat).
- pytest -m "level_1 and tier_a" on amr-dev: 62 passed, 3 skipped,
0 failed. No numerical regressions.
Underworld development team with AI support from Claude Code
Pull request overview
This PR aims to fix the long-run performance regression in uw.maths.Integral.evaluate() by stopping it from mutating the shared mesh DS on every call. In the Underworld3 maths/PETSc integration layer, it replaces the shared-DS path with a per-call scratch DM/DS path.
Changes:
- Reworks `Integral.evaluate()` to clone the mesh DM and build a fresh DS for each evaluation.
- Switches the integral computation to run on the cloned DM instead of the shared mesh DM.
- Adds explicit cleanup of the cloned DM after each integral call.
The reviewed hunks, consolidated into a single diff of the changed section in `Integral.evaluate()`:

```diff
-cdef DM dm = self.dm
-cdef DS ds = self.dm.getDS()
+cdef DM dmc = self.mesh.dm.clone()
+cdef FE fec = FE().createDefault(self.mesh.dim, 1, False, -1)
+dmc.setField(0, fec)
+dmc.createDS()
+cdef DS ds = dmc.getDS()
 cdef PetscScalar val_array[256]

 # Now set callback...
 ierr = PetscDSSetObjective(ds.ds, 0, ext.fns_residual[0]); CHKERRQ(ierr)
-ierr = DMPlexComputeIntegralFEM(dm.dm, cgvec.vec, &(val_array[0]), NULL); CHKERRQ(ierr)
+ierr = DMPlexComputeIntegralFEM(dmc.dm, cgvec.vec, &(val_array[0]), NULL); CHKERRQ(ierr)
```
Apples-to-apples timing on the harder reproducer (P3 T, P2 V, full Stokes + AdvDiff per step, 800-cell mesh, 200 steps, 3 integrals per step):
Two findings:
So the perf benefit is real and immediate, even before any linear-growth claim is confirmed at scale.
Underworld development team with AI support from Claude Code
Summary
Replaces `Integral.evaluate()`'s shared-mesh-DS path with a per-call clone-DM + `createDS` (matching the working `CellWiseIntegral` pattern). On current development this delivers a clean ~3× per-call speedup for volume integrals.
The original linear-growth narrative from #171 no longer reproduces on current development — see "What changed" below. The fix is still worth landing as a perf cleanup and a structural improvement.
Re-framed for context; #171 closes either way.
What the reproducer shows
@lmoresi's full reproducer (`/tmp/repro_issue_171.py` from `~/.claude/plans/repro-issue-171-integral-growth.md`) mirrors the original 1/64 Ra=1e7 setup (UnstructuredSimplexBox, P2-P1 Stokes, P3 T, qdegree=3): three cached `uw.maths.Integral` objects (volume, mean T, V·V), 100 evaluate() calls per integrand at cell size 1/32.
Per-call cost is ~3× lower with the fix (16 ms → 5 ms), but the linear-growth claim that originally motivated #171 doesn't manifest on either branch at this scale.
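For readers who want to eyeball the behaviour themselves, a minimal timing loop in the spirit of that reproducer (the mesh parameters, integrand, and constructor arguments below are assumptions for illustration, not the script referenced above):

```python
import time
import sympy
import underworld3 as uw

# Assumed setup: a simplex box at cell size 1/32 with qdegree=3; the real
# reproducer also builds P2-P1 Stokes and a P3 temperature field.
mesh = uw.meshing.UnstructuredSimplexBox(cellSize=1.0 / 32, qdegree=3)

# One cached integral object, reused across calls (volume integral of 1).
volume = uw.maths.Integral(mesh, sympy.sympify(1))

per_call = []
for _ in range(100):
    t0 = time.perf_counter()
    volume.evaluate()
    per_call.append(time.perf_counter() - t0)

# A flat last/first ratio is the "no linear growth" signal discussed above;
# the ~3x per-call difference shows up in the absolute numbers.
print(f"first {per_call[0] * 1e3:.1f} ms, last {per_call[-1] * 1e3:.1f} ms")
```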
What changed since the original report
The original convection run at 1/64 (Ra=1e7, ~3000 timesteps) showed t_vol going from 5 s to 833 s. Predicted-with-bug at 1/32 was 1.3 s → 17 s. Actual unpatched today: 28 ms → 16 ms. That's ~1000× lower than predicted-with-bug — strong evidence that some intermediate landing on development killed the linear-growth pathology, before this PR. Plausible candidates: #160 (SNES + DM-hierarchy lifecycle leak), #155 (swarm-routed point eval), #161 (ETD lifecycle changes). Bisecting which one isn't tracked here.
Why land it anyway
Why the perf is faster (mechanism)
The shared mesh DS has every registered field on it (T, V, P, plus solver internals). `DMPlexComputeIntegralFEM` walks all of them per call. The cloned DM has only the dummy P1 field we attach via `setField(0, fec)`, so the per-integration DS is leaner. Same reason CellWiseIntegral's pattern works.
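A small petsc4py-style illustration of that difference, assuming direct access to the mesh DM (hypothetical helper; the comments describe the intent rather than measured field counts):

```python
from petsc4py import PETSc

def ds_field_counts(mesh_dm, dim):
    """Compare field counts on the shared DS vs. a fresh scratch DS."""
    # Shared mesh DS: one field per registered variable plus solver
    # internals, all of which DMPlexComputeIntegralFEM walks per call.
    n_shared = mesh_dm.getDS().getNumFields()

    # Scratch DS: clone the DM, attach one dummy field, build a fresh DS.
    dmc = mesh_dm.clone()
    dmc.setField(0, PETSc.FE().createDefault(dim, 1, False, -1))
    dmc.createDS()
    n_scratch = dmc.getDS().getNumFields()  # just the dummy field
    dmc.destroy()

    return n_shared, n_scratch
```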
Test plan
Closes
#171 — the narrative shifts but the issue resolves either way (the linear-growth pathology is gone on current dev; this PR closes the only remaining mechanism that could re-introduce it, plus delivers a 3× perf win).
Underworld development team with AI support from Claude Code