Core memory leak fixes: Cython deallocation & callback hardening #181
Swarm.advection() ended with self.migrate(delete_lost_points=True, max_its=1). max_its=1 only consults the closest domain centroid for each unclaimed particle, so any boundary particle whose owner happened to be the 2nd- or 3rd-closest rank failed points_in_domain on its sole try and was silently deleted. The override looks like leftover debugging — there is no reason for the end-of-advection migrate to truncate the kdtree retry. Drop it so the default max_its=10 is used, restoring the boundary-claim retry path.

Adds a parallel regression test (tests/parallel/test_0765) adapted from @bknight1's reproducer in the bug report. With the fix in place: 4 ranks, no particles lost. Without it: 30/726 lost on order=1 advection, 66/726 on order=2 rotation.

Reported and diagnosed by @bknight1.

Underworld development team with AI support from Claude Code
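The failure mode is easy to reproduce in miniature. The sketch below is a toy stand-in for the kdtree-based owner search in Swarm.migrate (function and variable names are illustrative, not the actual API): a boundary particle whose true owner is only the second-closest centroid is deleted when max_its=1 but claimed when the retry loop is allowed to run.

```python
import numpy as np

def claim_particle(particle, centroids, domains, max_its=10):
    """Try the max_its nearest domain centroids in order until one
    actually contains the particle (illustrative stand-in for the
    per-rank points_in_domain retry in Swarm.migrate)."""
    order = np.argsort(np.linalg.norm(centroids - particle, axis=1))
    for rank in order[:max_its]:
        if domains[rank](particle):      # points_in_domain check
            return rank
    return None                          # unclaimed -> silently deleted

# Rank 0 owns x < 1.0, rank 1 owns x >= 1.0, but rank 1's centroid sits
# far to the right: for a particle just past the boundary, the *closest*
# centroid belongs to the wrong rank.
centroids = np.array([[0.9, 0.0], [2.0, 0.0]])
domains = [lambda p: p[0] < 1.0, lambda p: p[0] >= 1.0]
particle = np.array([1.05, 0.0])

lost = claim_particle(particle, centroids, domains, max_its=1)   # None: deleted
kept = claim_particle(particle, centroids, domains, max_its=10)  # 1: claimed
```

With max_its=1 the single closest-centroid try fails and the particle is dropped; with the default retry budget the true owner is found on the second attempt.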
Long parallel runs occasionally OOM on HPC even when each step looks
small. memprobe gives a way to "light up" memory tracking on demand,
sample at regular intervals, and pin which subsystem is growing.
Components:
- src/underworld3/utilities/memprobe.py — snapshot/diff/probe/instrument
API. Snapshots capture process RSS (resource.getrusage), KDTree live +
total-constructed counts, and (with full=True) per-class Python
instance counts via gc.get_objects. Diffs filter out unchanged keys
and sort py_classes by absolute change so dominant suspects surface
first.
- ckdtree.pyx — module-level live-instance counter, accurate via
Cython's deterministic __cinit__/__dealloc__. Exposed as
uw.kdtree.live_count() and uw.kdtree.total_constructed().
- Stokes.solve() and NavierStokes.solve() decorated with
@memprobe.instrument(...). The decorator fast-returns when ENABLED
is False (one bool check, sub-microsecond), so it's safe on hot paths.
Activation:
- UW_MEMPROBE=1 env var → enables instrumentation hooks at import time.
- memprobe.enable() / disable() for runtime toggling.
- with memprobe.probe("label"): … for ad-hoc blocks.
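The "fast-return when disabled" decorator and the probe context manager can be sketched like this. Again a hedged analogue with assumed names: the real memprobe snapshots memory inside the probe, whereas this toy version only prints markers.

```python
import functools
from contextlib import contextmanager

ENABLED = False   # flipped by the UW_MEMPROBE=1 env var or enable()/disable()

def enable():
    global ENABLED
    ENABLED = True

def disable():
    global ENABLED
    ENABLED = False

@contextmanager
def probe(label):
    # A real probe would take before/after snapshots and print the diff.
    print(f"[memprobe] {label}: begin")
    try:
        yield
    finally:
        print(f"[memprobe] {label}: end")

def instrument(label):
    """When disabled, the per-call overhead is one bool check before
    delegating, which is why it is safe on hot solver paths."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            if not ENABLED:               # fast path: single bool check
                return fn(*args, **kwargs)
            with probe(label):
                return fn(*args, **kwargs)
        return inner
    return wrap

@instrument("solve")
def solve(x):
    return x * 2

r1 = solve(3)   # ENABLED is False: no probe output
enable()
r2 = solve(4)   # probed
```

The decorated function behaves identically in both modes; only the disabled-mode cost matters for hot paths.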
Skipped: parsing PETSc.Log object tables from Python. The runtime
flags -log_view and -malloc_dump give the same information more
reliably; documented in the guide and exposed via
memprobe.dump_petsc_leaks_at_finalize().
Adds tests/test_0780_memprobe.py (9 smoke tests) and
docs/developer/guides/memory-diagnostics.md with debugging recipes.
Does not fix issue #176 (Ben's OOM on HPC) — provides the
instrumentation he needs to bisect it.
- _rss_mb(): switch from resource.ru_maxrss (peak/high-water RSS) to psutil.Process().memory_info().rss (current RSS), with a /proc/self/statm Linux fallback and ru_maxrss kept only as a last resort. Current RSS is preferred so freed memory shows as a negative delta. (Copilot #1, #6)
- diff(): apply a 0.01 MiB threshold before including rss_mb in the delta. Smaller noise prints as "+0.00 MiB" and defeats the "no change" fast path. (Copilot #2)
- dump_petsc_leaks_at_finalize(): drop the leading "-" from PETSc option keys (the codebase convention is no dash; "-malloc_dump" was a no-op), and align the docstring with what the function actually sets. (Copilot #3)
- Remove unused `os` import from the test module. (Copilot #4)
- Soften the ckdtree.pyx comment claiming Cython "guarantees deterministic destruction" — that's CPython refcounting, which can lag for objects trapped in reference cycles until cyclic GC runs. (Copilot #5)
- Test: assert KDTree count deltas relative to the immediate before/after, never to absolute baselines. Earlier tests in the same pytest session can leave KDTree refs alive that the cyclic GC may collect at any time, shifting the absolute count. CI surfaced this with "assert 332 == 338" — six trees from earlier tests collected during our gc.collect().
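The _rss_mb() fallback chain and the diff threshold can be sketched as below. This is an illustrative reconstruction, not the shipped function: psutil is treated as optional, and the exact exception handling is an assumption.

```python
import resource

def _rss_mb():
    """Current RSS in MiB, preferring sources that can go *down* when
    memory is freed (sketch of the fallback chain described above)."""
    try:
        import psutil                      # current RSS, cross-platform
        return psutil.Process().memory_info().rss / 2**20
    except ImportError:
        pass
    try:
        # Linux fallback: /proc/self/statm field 1 is resident pages
        with open("/proc/self/statm") as f:
            pages = int(f.read().split()[1])
        return pages * resource.getpagesize() / 2**20
    except (OSError, IndexError, ValueError):
        pass
    # Last resort: peak RSS. Monotonic, so frees never show as negative.
    # (Linux reports KiB here; macOS reports bytes.)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

rss = _rss_mb()

# diff()'s noise threshold: deltas under 0.01 MiB would print as
# "+0.00 MiB" and defeat the "no change" fast path, so drop them.
delta = 0.004
include_rss = abs(delta) >= 0.01
```

Unix-only as written (the resource module is unavailable on Windows), matching the HPC context of the PR.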
CI on PR #179 surfaced a flaky failure in test_bc_accepts_raw_numbers: `any(k[0] == name for k in UWexpression._ephemeral_expr_names)` raised `RuntimeError: dictionary changed size during iteration`. The dict is mutated asynchronously by weakref finalizers running during cyclic GC, so iterating it directly races against those callbacks. The fix is to snapshot the keys with `list(...)` before iterating — at most a few hundred entries, negligible cost.

This is a pre-existing bug — it surfaces depending on test ordering, GC pressure, and timing. The memprobe PR's added tests happen to shift teardown state enough to trigger it consistently. Worth landing here so #179 isn't blocked.
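The race and its fix can be reproduced synthetically. The dict below is a hypothetical stand-in for UWexpression._ephemeral_expr_names; the mid-loop insertion plays the role of a weakref finalizer firing during cyclic GC.

```python
# Keys shaped like (name, id) to mirror the k[0] == name lookup.
d = {(f"expr_{i}", i): i for i in range(100)}

# Iterating the dict directly while it is mutated raises RuntimeError,
# exactly the failure CI surfaced.
err = False
try:
    for k in d:
        d[("finalizer", len(d))] = None   # finalizer firing mid-loop
except RuntimeError:
    err = True

# The fix: snapshot the keys first. list(d) copies at most a few
# hundred entries, so the cost is negligible, and the loop no longer
# touches the live dict's iterator.
matches = any(k[0] == "expr_5" for k in list(d))
```

CPython's dict iterator checks the size on every step, so any insertion or deletion between steps aborts the loop; iterating a snapshot sidesteps the check entirely.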
Implement __dealloc__ to ensure malloc'd function pointer arrays are freed when the object is garbage collected. Added re-allocation guards in allocate() to prevent leaks when a container is reused. This resolves heap growth observed when solvers are repeatedly instantiated and destroyed during parameter sweeps or time-looping.
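The dealloc-plus-reallocation-guard pattern can be mimicked in pure Python via ctypes. This is a toy analogue of the Cython PtrContainer (class and method names are illustrative): allocate() frees any previous block before allocating, and __del__ mirrors __dealloc__ so the block is released when the object is collected.

```python
import ctypes

# Resolve malloc/free from the C runtime (Unix: CDLL(None) exposes libc).
libc = ctypes.CDLL(None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

class PtrContainer:
    """Toy analogue of the Cython container holding a malloc'd array."""
    def __init__(self):
        self._ptr = None

    def allocate(self, nbytes):
        # Re-allocation guard: free the previous block before allocating,
        # so reusing the container does not leak the old array.
        if self._ptr is not None:
            libc.free(self._ptr)
            self._ptr = None
        self._ptr = libc.malloc(nbytes)

    def dealloc(self):
        # Mirrors __dealloc__: idempotent, safe even if never allocated.
        if self._ptr is not None:
            libc.free(self._ptr)
            self._ptr = None

    def __del__(self):
        self.dealloc()

c = PtrContainer()
c.allocate(64)
c.allocate(128)        # reuse: the 64-byte block is freed first
allocated = c._ptr is not None
c.dealloc()
```

Without the guard in allocate(), each reuse would orphan the previous block — exactly the heap growth seen when solvers are repeatedly instantiated in sweeps or time loops.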
Updated mesh and variable update callbacks to verify 'array.owner' before processing. This prevents AttributeErrors during complex object teardowns (e.g. at exit or during mesh replacement). Integrated explicit recursion guards and mesh_update_lock checks in MeshVariable callbacks to ensure PETSc synchronization doesn't occur during sensitive coordinate deformations, preventing state corruption.
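The owner-existence check and the recursion/lock guards can be sketched with weakrefs. A minimal analogue with assumed names (the real NDArray_With_Callback and MeshVariable differ in detail): the callback resolves its owner through a weak reference and exits early once the owner has been torn down.

```python
import weakref

class MeshVariable:
    def __init__(self):
        self._in_callback = False
        self.mesh_update_lock = False   # set True during mesh deformation
        self.synced = 0

    def sync(self):
        # Recursion / deformation guards: skip PETSc-style sync when
        # re-entered or while coordinates are being deformed.
        if self._in_callback or self.mesh_update_lock:
            return
        self._in_callback = True
        try:
            self.synced += 1            # stands in for PETSc sync
        finally:
            self._in_callback = False

class NDArrayWithCallback:
    def __init__(self, owner):
        self._owner_ref = weakref.ref(owner)

    @property
    def owner(self):
        return self._owner_ref()        # None once the owner is torn down

    def modified(self):
        owner = self.owner
        if owner is None:
            return                      # early exit: owner gone mid-teardown
        owner.sync()

var = MeshVariable()
arr = NDArrayWithCallback(var)
arr.modified()
count = var.synced        # 1
del var                   # teardown: the weakref goes dead
arr.modified()            # no AttributeError — guarded early return
gone = arr.owner is None
```

The same shape covers the exit-time case: callbacks that fire after their owner is collected become harmless no-ops instead of raising AttributeError.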
Pull request overview
This PR targets stability and memory behavior in Underworld3’s PETSc/NumPy interoperability paths, addressing leaks from Cython-managed allocations and hardening array callbacks that can fire during teardown or mesh/variable rebuilds.
Changes:
- Hardened SwarmVariable/MeshVariable/mesh coordinate array callbacks by resolving the owning object via `array.owner` and exiting early when the owner is gone.
- Added recursion/mesh-update locking guards in MeshVariable callbacks to reduce unsafe PETSc synchronization during mesh deformation.
- Added Cython-side cleanup for `PtrContainer` by introducing `__dealloc__` and freeing/reallocating function-pointer arrays on reuse.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/underworld3/swarm.py | Uses `array.owner` in swarm variable array callbacks to avoid failures during teardown and to reference the correct owning variable. |
| src/underworld3/discretisation/discretisation_mesh.py | Hardens mesh coordinate callbacks by dereferencing the mesh via `array.owner` before deforming and versioning. |
| src/underworld3/discretisation/discretisation_mesh_variables.py | Updates MeshVariable callbacks to use `array.owner`, adds recursion/mesh-update lock guards, and adjusts canonical callback logic. |
| src/underworld3/cython/petsc_types.pyx | Adds `__cinit__`/`__dealloc__` and explicit freeing on `allocate()` reuse to prevent leaks of malloc'd function pointer arrays. |
Review excerpt (diff hunks from the changed files):

```python
from underworld3.utilities import NDArray_With_Callback
# ...
array_obj = NDArray_With_Callback(flat_petsc_data)
# ...
# Single canonical callback for PETSc synchronization
print(f"Mesh update callback - mesh deform")
coords = array.reshape(-1, array.owner.cdim)
self._deform_mesh(coords, verbose=True)
coords = array.reshape(-1, mesh.cdim)
mesh._deform_mesh(coords, verbose=True)
# ...
# Increment mesh version to notify registered swarms of coordinate changes
```
|
Thanks @bknight1 — at a glance this looks orthogonal to the other in-flight work (#179 touched …).

To check before merge
Otherwise this looks ready. With CI passing and the above confirmations, I'd say merge.
@bknight1 — CI on this PR is failing; two things to address:

1. Rebase onto current …
- Added safety comments to NDArray callbacks in MeshVariable and SwarmVariable explaining owner-existence checks for object teardown.
- Documented _mesh_update_lock in Mesh and ensured _deform_mesh is properly wrapped to prevent PETSc sync conflicts during coordinate changes.
- Verified test stability for integrals and swarm statistics.
Documented the ownership check guards in NDArray callbacks to explain that they handle teardown race conditions during Python garbage collection (e.g. at application exit or mesh rebuilds).
Use global_sum and global_size for arithmetic mean calculation to ensure rank-independent assertions when comparing with global integration results.
- Restore default max_its in Swarm.migrate to fix advection particle loss (issue #175)
- Ensure canonical data array has owner for MeshVariable to enable PETSc sync
- Confirmed with parallel advection and integral tests
I have addressed the CI failures and the feedback regarding the mesh variable synchronization regression.

Fixes and verification:
Rebase: I have rebased the branch onto the latest …

All reported regressions are now resolved and verified locally.
Review pass — ready to merge

@bknight1 — the four pre-merge check items from 2026-05-11 are all addressed, and the regression root-cause + fix is well-explained. Going through the verification:

Check items resolved
Bug found, root cause documented, fix verified

The …

Small leftover from Copilot's review

Copilot flagged unconditional …

Cross-validation with #182

Once both #181 and #182 land, the leak-free fingerprint on a tight solver loop should be:
Worth a quick LGTM modulo the …

Underworld development team with AI support from Claude Code
This PR addresses several memory-related issues and stability concerns identified during OOM investigations on HPC clusters.
Key changes:
- Implemented `__dealloc__` in the `PtrContainer` class to ensure that malloc'd function pointer arrays are properly freed when Python objects are garbage collected. Added re-allocation guards to prevent leaks during container reuse.
- Updated array callbacks to verify that `array.owner` exists before execution. This prevents `AttributeError` during complex object teardown (e.g., at application exit or during mesh replacement).
- Added recursion guards and `mesh_update_lock` checks in `MeshVariable` callbacks to prevent PETSc synchronization during sensitive coordinate deformations.