[BUG] - Three stale-cache issues after swarm particle addition #215

@bknight1

Summary

Three independent stale-cache issues were found when a swarm is modified via add_particles_with_coordinates() and then used for interpolation or projection. All three cause silent data corruption or MPI deadlock (PETSc < 3.24) and affect the common pattern of re-adding particles to empty cells each timestep.


Bug 1 — Stale kd-tree in add_particles_with_coordinates

File: src/underworld3/swarm.py:3571-3577

add_particles_with_coordinates() calls self.dm.migrate() (the raw PETSc DMSwarm migration) rather than self.migrate() (the UW3 wrapper). The manual cache invalidation at lines 3574-3577 clears _particle_coordinates._canonical_data and each variable's _canonical_data, but misses self._kdtree:

# Lines 3574-3577 (BEFORE fix)
self._particle_coordinates._canonical_data = None
for var in self._vars.values():
    if hasattr(var, "_canonical_data"):
        var._canonical_data = None
# missing: self._kdtree = None

By contrast, Swarm._invalidate_canonical_data() at line 2648 correctly sets self._kdtree = None. It is called by self.migrate() (line 3463), but add_particles_with_coordinates bypasses that path.

Effect: After adding particles, swarm._get_kdtree() returns a kd-tree built from OLD particle coordinates. RBF interpolation (both for proxy mesh variables and for uw.function.evaluate() with rbf=True) looks up particle indices from the stale tree, accessing wrong PETSc memory locations. On PETSc 3.22.2 this produces an MPI deadlock inside the kd-tree query → rbf_evaluate → update_lvec path; on PETSc 3.24.2 it silently returns wrong interpolated values.

Fix applied: Added self._kdtree = None after line 3573.
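
One way to keep this class of bug from recurring is to route every coordinate-changing method through a single invalidation helper, so a new cache cannot be forgotten in one code path. The sketch below is a self-contained toy, not UW3 code: ToySwarm, _invalidate_caches, and the brute-force index that stands in for the kd-tree are hypothetical names used only for illustration.

```python
class ToySwarm:
    """Toy stand-in for a swarm with a lazily built nearest-neighbour index."""

    def __init__(self, coords):
        self._coords = [tuple(c) for c in coords]
        self._index = None  # plays the role of the cached kd-tree

    def _invalidate_caches(self):
        # Single place that drops every cache derived from particle coordinates.
        self._index = None

    def _get_index(self):
        # Rebuild lazily; a snapshot of the coordinates stands in for a kd-tree build.
        if self._index is None:
            self._index = list(self._coords)
        return self._index

    def add_particles(self, coords, invalidate=True):
        self._coords.extend(tuple(c) for c in coords)
        if invalidate:
            self._invalidate_caches()

    def nearest(self, point):
        # Brute-force nearest neighbour over the cached index.
        index = self._get_index()
        return min(
            range(len(index)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(index[i], point)),
        )


# Without invalidation the query runs against the old index.
stale = ToySwarm([(0.0, 0.0), (1.0, 0.0)])
stale.nearest((0.0, 0.0))                            # builds the index
stale.add_particles([(5.0, 5.0)], invalidate=False)
stale_hit = stale.nearest((5.0, 5.0))                # index predates the new particle

# With invalidation the index is rebuilt and the new particle is found.
fresh = ToySwarm([(0.0, 0.0), (1.0, 0.0)])
fresh.nearest((0.0, 0.0))
fresh.add_particles([(5.0, 5.0)])
fresh_hit = fresh.nearest((5.0, 5.0))
```

In the toy, the stale query resolves to an old particle while the invalidated one resolves to the newly added particle, which mirrors the wrong-index lookups described above.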


Bug 2 — Stale cached projector in _project_to_work_variable

File: src/underworld3/function/_function.pyx:529-642

_project_to_work_variable() caches Projection solver instances on the mesh object as _eval_projector_scalar (scalar) or _eval_{shape}_projector (tensor). The solver is created once and reused across all subsequent evaluate() calls on that mesh:

if not hasattr(mesh, "_eval_projector_scalar"):
    mesh._eval_projector_scalar = uw.systems.Projection(mesh, ...)
projector = mesh._eval_projector_scalar
projector.uw_function = scalar_expr
projector.solve(zero_init_guess=False)  # no _force_setup

When a Stokes solve (or any other solver modifying the DM) runs between two evaluate() calls, the cached projector's PETSc solver state (SNES/KSP/matrix decomposition) is stale. On PETSc 3.22.2, projector.solve() deadlocks because the cached matrix doesn't match the current DM state. PETSc 3.24.2 tolerates the mismatch but silently returns wrong results.

Fix applied: Changed both the scalar projector (line 640) and the tensor projector (line 613) to pass _force_setup=True:

projector.solve(zero_init_guess=False, _force_setup=True)

Note: The same stale-cached-projector pattern exists in user code that reuses Projection solver instances across timesteps or after Stokes solves. Any cached projection solver should either (a) pass _force_setup=True on every solve, or (b) track a DM version counter and auto-rebuild when the DM changes.
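
Option (b) could look roughly like the following. This is a minimal sketch under stated assumptions: ToyDM, its version attribute, and VersionedProjector are hypothetical, and neither PETSc nor UW3 is assumed to expose such a counter today.

```python
class ToyDM:
    """Toy stand-in for a DM that bumps a counter on every structural change."""

    def __init__(self):
        self.version = 0

    def modify(self):
        self.version += 1


class VersionedProjector:
    """Rebuilds its (toy) solver state whenever the DM version has moved on."""

    def __init__(self, dm):
        self.dm = dm
        self._built_for = None  # DM version the current setup was built against
        self.setup_count = 0

    def _setup(self):
        # A real solver would rebuild its matrices and SNES/KSP state here.
        self._built_for = self.dm.version
        self.setup_count += 1

    def solve(self):
        # Only pay for setup when the DM has changed since the last build.
        if self._built_for != self.dm.version:
            self._setup()
        return self._built_for


dm = ToyDM()
projector = VersionedProjector(dm)
projector.solve()  # first call: setup runs
projector.solve()  # DM unchanged: setup is reused
dm.modify()        # e.g. another solver touched the DM
projector.solve()  # version mismatch: setup runs again
```

Compared with passing _force_setup=True on every solve, a counter like this would avoid the rebuild cost when nothing has changed, at the price of requiring every DM-modifying path to bump the version.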


Bug 3 — Stale proxy mesh variable data after swarm write

File: src/underworld3/swarm.py:1034-1087 (proxy update pipeline)

When a SwarmVariable has proxy_degree > 0 (the default is proxy_degree=2), a proxy MeshVariable is created that stores RBF-interpolated values from the swarm. The update is lazy:

  1. swarm.access(var) modifies the canonical data array
  2. On exit, delay_callbacks_global fires the data callback
  3. The callback calls pack_raw_data_to_petsc() (line 478), which writes to PETSc and calls self._update() (line 1291), setting self._proxy_stale = True
  4. The actual re-interpolation (_rbf_to_meshVar) happens only when material.sym is accessed or _update_proxy_if_stale() is called

The problem: If code reads the proxy's MeshVariable DM directly (e.g., a Projection solver that evaluates its uw_function at quadrature points), it reads STALE data from the proxy's PETSc DM, because the lazy update hasn't fired yet.

Concrete scenario:

material = swarm.add_variable("material", 1, dtype=int, proxy_degree=2)
meshMat.uw_function = material.sym[0]  # triggers proxy update, stores symbol

# ... add particles and set new material values ...
meshMat.solve(_force_setup=True)
# ^ evaluates stored proxy symbol at quadrature points
# ^ proxy DM still contains data from the FIRST sym access — STALE

Why uw.function.evaluate(material.sym[0], ...) works: It re-accesses material.sym, which calls _update_proxy_if_stale() and re-interpolates from the current swarm.

Fix needed: Either:

  • (a) Document that _update_proxy_if_stale() must be called before using the proxy MeshVariable DM after a swarm write
  • (b) Make the evaluation pipeline check for stale proxies and auto-update before reading
  • (c) Remove the lazy proxy update pattern and update immediately on data write
  • (d) Add proxy update hooks in add_particles_with_coordinates and other swarm-mutating methods
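
To make the trade-off concrete, here is a small self-contained sketch of option (b), where every read path checks the stale flag before touching the proxy data. ToyProxy and its methods are hypothetical; a plain list copy stands in for the RBF re-interpolation, and none of this is the UW3 implementation.

```python
class ToyProxy:
    """Toy proxy field: a mesh-side copy of swarm data that is refreshed lazily."""

    def __init__(self, swarm_values):
        self._swarm_values = list(swarm_values)
        self._mesh_values = list(swarm_values)  # stands in for the proxy DM data
        self._stale = False
        self.refresh_count = 0

    def write(self, values):
        # A swarm write only marks the proxy stale; nothing is re-interpolated yet.
        self._swarm_values = list(values)
        self._stale = True

    def _update_if_stale(self):
        if self._stale:
            # Stands in for the RBF re-interpolation onto the mesh.
            self._mesh_values = list(self._swarm_values)
            self._stale = False
            self.refresh_count += 1

    def read_raw(self):
        # Direct read of the mesh-side data: the path that sees stale values.
        return list(self._mesh_values)

    def read_checked(self):
        # Option (b): check the stale flag before every read.
        self._update_if_stale()
        return list(self._mesh_values)


proxy = ToyProxy([1, 1, 1])
proxy.write([2, 2, 2])
raw = proxy.read_raw()          # still the values from before the write
checked = proxy.read_checked()  # refreshed before reading
```

The checked read keeps the laziness (one refresh per write, however many reads follow) while closing the gap that the raw read exposes.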

Reproduction

The test file tests/test_0112_swarm_add_particles.py contains test_proxy_updates_after_add_particles which reproduces Bug 1 (kd-tree) and Bug 3 (proxy staleness). Bug 2 was reproduced on Setonix HPC (PETSc 3.22.2) and confirmed locally on macOS (PETSc 3.24.2).

Environment

  • PETSc 3.22.2 (Setonix HPC) — deadlocks on Bugs 1 and 2
  • PETSc 3.24.2 (macOS) — silently returns wrong values on Bugs 1 and 2
  • Underworld3 development branch (as of 2026-05-29)

Related Files

  • src/underworld3/swarm.py — lines 3501, 3571-3577 (Bug 1), lines 1034-1087 (Bug 3)
  • src/underworld3/function/_function.pyx — lines 529-642 (Bug 2)
