Fix comprehensive memory leaks in solver setup, interpolation caching, and SubDM syncing#178
Merged
Merged
Conversation
…uards, Stokes cleanup
Contributor
There was a problem hiding this comment.
Pull request overview
This PR targets PETSc-related memory leaks and rebuild churn in the Underworld3 solver loop, especially under frequent coordinate updates and repeated solve iterations (per issue #176).
Changes:
- Added an LRU +
max_entriescap toDMInterpolationCacheto prevent unbounded cache growth when evaluation coordinates vary. - Improved PETSc object lifecycle management by explicitly destroying temporary SubDMs and cached decomposition objects (IS/SubDM), and by tearing down stale solver DM/SNES state before rebuilding.
- Switched many solver setters from full
is_setup=Falserebuilds to the narrower_needs_function_rewire=Truepath, and added a new memory leak regression test.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_0006_memory_leak.py | Adds a regression test intended to detect solver-loop memory growth across many coupled solves. |
| src/underworld3/utilities/_api_tools.py | Updates SymbolicProperty change-detection behavior for solver setup invalidation. |
| src/underworld3/systems/solvers.py | Replaces many is_setup=False invalidations with _needs_function_rewire=True; expands Stokes solve() signature. |
| src/underworld3/function/dminterpolation_cache.py | Implements an LRU eviction policy with max_entries and eviction stats. |
| src/underworld3/discretisation/discretisation_mesh_variables.py | Destroys temporary SubDM created during local→global vector sync. |
| src/underworld3/cython/petsc_generic_snes_solvers.pyx | Adds explicit cleanup for solver DM/SNES and decomposition caches; caches IS sets to avoid per-solve allocations; fixes local/global Vec restore. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -97,14 +97,6 @@ def __get__(self, obj, objtype=None): | |||
|
|
|||
| def __set__(self, obj, value): | |||
| """Set the value, with automatic unwrapping.""" | |||
Comment on lines
+113
to
+118
| if hasattr(obj, "is_setup"): | ||
| old_value = getattr(obj, self.attr_name, None) | ||
| # Use 'is not' check first for speed | ||
| if old_value is not value: | ||
| # Check for mathematical inequality (handles SymPy expressions) | ||
| if old_value != value: |
| if len(self._cache) >= self.max_entries: | ||
| # Remove oldest (first) item | ||
| oldest_key = next(iter(self._cache)) | ||
| del self._cache[oldest_key] |
| dt = 0.001 | ||
| advdiff.solve(timestep=dt) | ||
|
|
||
| with swarm.access(swarm): |
| for line in f: | ||
| if line.startswith('VmRSS:'): | ||
| return int(line.split()[1]) / 1024.0 # MiB | ||
| except: |
Comment on lines
+23
to
+34
| def test_stokes_advdiff_memory_leak(): | ||
| comm = MPI.COMM_WORLD | ||
| rank = comm.Get_rank() | ||
| size = comm.Get_size() | ||
|
|
||
| # Parameters | ||
| res = 32 | ||
| n_steps = 100 | ||
| warmup_step = 30 | ||
|
|
||
| # Setup mesh | ||
| mesh = uw.meshing.StructuredQuadBox(elementRes=(res, res)) |
Comment on lines
1237
to
1245
| zero_init_guess: bool = True, | ||
| timestep: float = None, | ||
| _force_setup: bool = False, | ||
| verbose=False, | ||
| verbose: bool = False, | ||
| debug: bool = False, | ||
| debug_name: str = None, | ||
| evalf=False, | ||
| order=None, | ||
| picard: int = 0, |
| new_dt = float(value.data) if hasattr(value, 'data') else float(value) | ||
| if np.isclose(old_dt, new_dt, rtol=1e-12, atol=1e-15): | ||
| return | ||
| except: |
| new_dt = float(value.data) if hasattr(value, 'data') else float(value) | ||
| if np.isclose(old_dt, new_dt, rtol=1e-12, atol=1e-15): | ||
| return | ||
| except: |
lmoresi
approved these changes
May 10, 2026
Member
lmoresi
left a comment
There was a problem hiding this comment.
Looks good, Ben,
I added some diagnostics at the same time you figured out the fix. They should be complimentary and help find the issues quickly next time.
Member
Author
|
Sounds good, thanks @lmoresi |
This was referenced May 11, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Resolves #176
Summary of Changes
This PR comprehensively addresses severe memory leaks identified in the solver loop, particularly during dynamic coordinate updates (e.g., swarm advection) and tight solve iterations.
DMInterpolationCachewith amax_entriescap (default: 10). This prevents unbounded memory growth when evaluation coordinates change every step.SubDMCleanup inMeshVariable: Added explicitsubdm.destroy()calls in_sync_lvec_to_gvecand otherMeshVariablesynchronization paths. Previously, temporarySubDMobjects created during data syncs were orphaned.SymbolicPropertyand solver setters to useself._needs_function_rewire = Trueinstead ofself.is_setup = Falsewhen only pointwise functions change. This avoids catastrophic, redundant teardowns and rebuilds of the PETScDMhierarchy..destroy()) of old PETSc FE objects,DMhierarchies, andSNESobjects within the_setup_discretisationand_buildpaths before allocating new ones.Verification and Test Findings
A new robust memory leak regression test (
tests/test_0006_memory_leak.py) was added. The test isolates steady-state leakage by warming up for 30 steps and measuring absolute memory growth over the subsequent 70 steps on a 32x32 mesh.developmentbranch): Memory grew by 77.83 MiB (~1.1 MiB leaked per step) over the final 70 steps, failing the test.fix/memory-leak-comprehensivebranch): Memory growth dropped to just 9.94 MiB (~0.14 MiB per step) over the same window. This remaining footprint represents high-water mark expansion rather than a leak. The test passes securely.All standard regression and parallel tests continue to pass.