On-disk snapshot toolkit v1.1 (stacked on #195, #196) by lmoresi · Pull Request #198 · underworldcode/underworld3

lmoresi · 2026-05-20T11:12:51Z

Summary

Adds the on-disk arm of the snapshot toolkit — Model.save_state(file=…) / Model.load_state(file=…). Pairs the in-memory work in #195 (the "git stash for timesteps") with a persistent format inspectable via standard h5 tools, parallel-correct, and built on top of #146's PETSc DMPlex primitives.

⚠️ Stacked

Based on feature/model-tracker (#196) → feature/in-memory-checkpoint (#195) → development. The PR will show a large diff until those land; once they merge, the diff narrows automatically to just the snapshot-disk additions. Targeting development directly because that is the actual destination; reviewers can focus on the new files: src/underworld3/checkpoint/disk_snapshot.py and tests/test_0010*.

What's here (six phases)

Phase	What	Commit
1	Metadata + skeleton + inspectability layer	`72681d1`
2	Mesh + meshvar bulk via #146's PETSc primitives	`e4a43e0`
3a	`/python_state` — Snapshottable dataclass round-trip	`8e3d04a`
3b	Swarms in per-swarm sidecars	`83061be`
5	Unified `Model.save_state` / `load_state` API	`df4f829`
4	`MeshVariable.read_timestep` format-aware dispatch	`3bb201d`
6	Per-rank swarm sidecars (parallel-correct on-disk)	`dbbf52a`

User-facing API

The whole disk path is exposed through the same two methods #196 introduced for in-memory:

# In-memory (from #195): ephemeral stash
token = model.save_state()
model.load_state(token)

# On-disk (this PR): persistent, inspectable, parallel-correct
model.save_state(file="step42.snap.h5")
model.load_state("step42.snap.h5")

# Existing selective-read entry point now reads BOTH formats:
T.read_timestep("step42.snap.h5", "T", 0)      # v1.1 wrapper
T.read_timestep("legacy_run", "T", 0)          # legacy write_timestep
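
The argument-type dispatch behind `load_state` (a `Snapshot` token restores from memory, a `str` or `os.PathLike` restores from disk, anything else raises `TypeError`) can be sketched as below. This is a minimal illustration rather than the code in `model.py`: the `Snapshot` class here is a stand-in, and `load_state_dispatch` is a hypothetical name that only reports which path would be taken.

```python
import os
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Snapshot:
    """Stand-in for the in-memory token returned by save_state()."""
    payload: dict = field(default_factory=dict)


def load_state_dispatch(source):
    """Report which restore path a load_state-style call would take."""
    if isinstance(source, Snapshot):
        return "memory"          # token from an earlier save_state()
    if isinstance(source, (str, os.PathLike)):
        return "disk"            # path to a .snap.h5 wrapper
    raise TypeError(
        f"load_state expects a Snapshot or a path, got {type(source).__name__}"
    )


print(load_state_dispatch(Snapshot()))              # in-memory route
print(load_state_dispatch(Path("step42.snap.h5")))  # on-disk route
```

Dispatching on type keeps one method name for both modes, at the cost of needing a clear error for anything that is neither a token nor a path.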

File layout

my_run.snap.h5         wrapper (h5py-inspectable: /metadata, /meshes, /swarms, /python_state)
my_run.snap.bulk/      companion directory
    {mesh}.mesh.00000.h5             mesh DM
    {mesh}.{var}.00000.h5            per mesh-variable
    {swarm}.swarm.rank{R:04d}of{S:04d}.h5  per-rank swarm sidecar

h5ls -v my_run.snap.h5/metadata shows run name, schema version, sim time, step, dim, MPI rank count, and inventories — no UW3 needed.
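
The naming conventions in the layout above can be written down as two small helpers. This is a sketch under assumptions: the commit messages describe the bulk directory as derived from the wrapper path by swapping `.h5` for `.bulk`, and the function names `bulk_dir_for` and `swarm_sidecar_name` are illustrative rather than the toolkit's own.

```python
from pathlib import Path


def bulk_dir_for(wrapper_path):
    """Derive the companion directory: 'run.snap.h5' -> 'run.snap.bulk'."""
    wrapper = Path(wrapper_path)
    if wrapper.suffix != ".h5":
        raise ValueError(f"expected a .h5 wrapper, got {wrapper.name}")
    return wrapper.with_suffix(".bulk")


def swarm_sidecar_name(swarm_name, rank, size):
    """Per-rank swarm sidecar file name, rank and size zero-padded to 4 digits."""
    return f"{swarm_name}.swarm.rank{rank:04d}of{size:04d}.h5"


bulk = bulk_dir_for("my_run.snap.h5")
print(bulk / swarm_sidecar_name("particles", rank=2, size=4))
```

Because the bulk directory is found by convention, the wrapper and its `.snap.bulk/` directory have to be moved together.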

Tests

77 single-rank tier-A tests across the snapshot suite (in-memory 24 + tracker 9 + on-disk 23 + 3 real-solver + 18 regression).
MPI ptests at -np 1 / 3 / 4 for both in-memory (ptest_0007) and on-disk (ptest_0010). Exact reconstruction confirmed across cross-rank particle distribution; recovers from real cross-rank particle loss.
Real-solver test (test_0008_snapshot_realsolver) shows bit-exact discard guarantee through an AdvDiffusion solve.

Design decisions captured

Inspectability is a hard requirement — pure PETSc HDF5 dumps don't pass h5ls-without-UW3, so we wrap PETSc bulk in a UW3-controlled metadata layer.
Swarms always sidecar ("bulk is a problem with swarms, always") — per-swarm + per-rank files, not inline.
read_timestep stays user-facing and selective — different use case (variable subsets, cross-resolution remap) from load_state's whole-model role. The format detection is hidden behind the call.
#146 (Thyagarajulu's PETSc DMPlex checkpoint reload for mesh variables) is reused, not replaced. This PR is additive layering on top of #146's primitives.
Disk-space measurement done (see commit messages): save_state is ~3× write_timestep for sparse setups, ~8× with typical Stokes-internal work variables. Conclusion: write_timestep stays as the selective-output path; a future vars=[…] filter on save_state would close the gap if needed.

Test plan

pixi run -e amr-dev pytest tests/test_0007_snapshot_inmemory.py tests/test_0008_snapshot_realsolver.py tests/test_0009_model_tracker.py tests/test_0010_snapshot_disk_format.py (77 tests)
cd tests/parallel && bash mpi_runner.sh (covers ptest_0007 + ptest_0010 at 1/3/4 ranks)
pixi run -e amr-dev docs-build — advanced/snapshot-restore.html renders with all sections

After #195 and #196 merge

Retarget this PR's base if needed (no action is required if development already contains those merges); the diff narrows automatically.

Underworld development team with AI support from Claude Code

Copilot

Pull request overview

Adds a persistent, on-disk backend for the snapshot toolkit via Model.save_state(file=...) / Model.load_state(path), layering an inspectable HDF5 “wrapper + bulk sidecars” format on top of PETSc DMPlex checkpoint primitives. This complements the in-memory snapshot token path by enabling durable restarts and selective reads while keeping snapshot/restore semantics (including solver-internal state via dataclass snapshots).

Changes:

Implement v1.1 disk snapshot format (.snap.h5 wrapper + .snap.bulk/ companion dir) including mesh/meshvar bulk, per-rank swarm sidecars, and /python_state dataclass serialization.
Unify APIs via Model.save_state(...) / Model.load_state(...), and make MeshVariable.read_timestep(...) dispatch format-aware (legacy vs v1.1 wrapper).
Add extensive serial + MPI test coverage plus user-facing docs and demo scripts.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 8 comments.

File	Description
tests/test_0010_snapshot_disk_format.py	Validates disk snapshot wrapper/bulk layout, inspectability, roundtrip, sidecars, and `read_timestep` dispatch.
tests/test_0009_model_tracker.py	Tests `Model.tracker` snapshot-managed semantics and git-stash behavior.
tests/test_0008_snapshot_realsolver.py	Real-solver confidence tests for snapshot restore/continuation guarantees.
tests/test_0007_snapshot_inmemory.py	In-memory snapshot suite expanded/maintained for meshes, swarms, DDt state, and continuation.
tests/run_snapshot_backstepping_demo.py	Time-series demo script illustrating adaptive-Δt back-stepping using snapshots.
tests/run_snapshot_backstepping_spatial.py	Spatial visualization demo companion for snapshot back-stepping.
tests/parallel/ptest_0010_snapshot_disk.py	MPI test for disk snapshots (wrapper + per-rank sidecars + exact reconstruction).
tests/parallel/ptest_0007_snapshot_inmemory.py	MPI test for in-memory snapshots (exact reconstruction + continuation).
tests/parallel/mpi_runner.sh	Adds snapshot ptests to the MPI runner script.
src/underworld3/systems/ddt.py	Adds Snapshottable state dataclasses + `.state` adapters and model registration for DDt flavors.
src/underworld3/swarm.py	Adds swarm population generation counter and snapshot payload/apply support.
src/underworld3/model.py	Adds `_state_bearers`, `Model.tracker`, and unified `save_state/load_state` API.
src/underworld3/discretisation/discretisation_mesh.py	Adds mesh snapshot payload/apply support for in-memory restore.
src/underworld3/discretisation/discretisation_mesh_variables.py	Adds v1.1 wrapper detection/bridge in `read_timestep`.
src/underworld3/checkpoint/tracker.py	Implements `ModelTracker` + `TrackerState` Snapshottable dataclass.
src/underworld3/checkpoint/state.py	Defines the `SnapshottableState` base and `Snapshottable` protocol.
src/underworld3/checkpoint/snapshot.py	In-memory snapshot orchestration (token capture/restore).
src/underworld3/checkpoint/disk_snapshot.py	Disk snapshot writer/reader, inspectability layer, sidecars, and python-state serialization.
src/underworld3/checkpoint/backend.py	Defines the snapshot backend protocol and in-memory backend implementation.
src/underworld3/checkpoint/__init__.py	Exposes snapshot toolkit public API surface.
src/underworld3/__init__.py	Imports `underworld3.checkpoint` at package import time.
docs/developer/guides/state-as-dataclass.md	Documents the state-as-dataclass contract for solver-internal state.
docs/developer/design/in_memory_checkpoint_design.md	Design note covering snapshot/restore motivation, semantics, and roadmap.
docs/advanced/snapshot-restore.md	User guide for save/load state (in-memory + on-disk) and tracker usage.
docs/advanced/index.md	Adds snapshot/restore to advanced docs index/toctree.

+DOFs, plus swarm positions and user swarm-variable data with
+rebuild-on-restore semantics. Solver-internal Python state, on-disk
+backend, schema versioning, mesh-DM rebuild, and cross-process restore
+are scheduled for follow-up PRs per the design note.


+"""Unitary in-memory (and, later, on-disk) snapshot toolkit.
+
+The first true unitary checkpoint in Underworld3 — captures enough state
+that a Model can be put back exactly as it was, suitable for backtrack on
+failure, multi-stage time integration, adaptive-Δt retry, and crash
+recovery.
+
+Distinct from the existing per-variable ``write_timestep`` /
+``read_timestep`` path, which serves visualisation and partial restart.
+That path stays in service of its existing role.
+
+See ``docs/developer/design/in_memory_checkpoint_design.md`` for the
+design rationale, scope, and roadmap. In v1 (this code), only an
+in-memory backend is implemented and only mesh + mesh-variable state is
+captured. Subsequent PRs add swarm coverage, solver-internal Python
+state (DDt history, parameter mutation history), an on-disk full-state
+backend, and schema versioning across UW3 releases.


+    ├── /metadata          (attrs: uw3_version, schema_version,
+    │                       created_at, step, sim_time, dt, dim,
+    │                       mesh_type, coordinate_system,
+    │                       mpi_ranks_at_write, variables_summary, ...)
+    ├── /mesh              (phase 2 — DMPlex topology + coords + labels)
+    ├── /variables         (phase 2 — one subgroup per mesh-variable)
+    ├── /swarms            (phase 3 — possibly @external_file refs)
+    └── /python_state      (phase 3 — Snapshottable dataclasses as attrs)


+            f"current {DISK_SNAPSHOT_SCHEMA_VERSION}; on-disk schema "
+            f"migration will land with phase 6 (not yet implemented)"


        output_base_name = os.path.join(outputPath, data_filename)
-        data_file = output_base_name + f".mesh.{data_name}.{index:05}.h5"
+        legacy_file = output_base_name + f".mesh.{data_name}.{index:05}.h5"

-        if not os.path.isfile(os.path.abspath(data_file)):
-            raise RuntimeError(f"{os.path.abspath(data_file)} does not exist")
+        is_v1_1 = (
+            os.path.isfile(data_filename)
+            and not data_filename.endswith(
+                f".mesh.{data_name}.{index:05}.h5"
+            )
+            and _is_snapshot_wrapper(data_filename)
+        )

-        import h5py
-        import numpy as np
+        if is_v1_1:
+            data_file = data_filename
+        else:
+            data_file = legacy_file
+            if not os.path.isfile(os.path.abspath(data_file)):


+        # and restore would silently no-op. `state` is therefore a
+        # reserved name and cannot be a user-managed quantity.
+        cls_attr = getattr(type(self), name, None)
+        if hasattr(cls_attr, "__set__") or hasattr(cls_attr, "__get__"):


+```text
+my_run.snap.h5         (~tens of KB; metadata, group structure)
+my_run.snap.bulk/      (per-mesh + per-swarm sidecars)
+    {mesh}.mesh.00000.h5
+    {mesh}.{var}.00000.h5     (one per mesh-variable)
+    {swarm}.swarm.h5          (one per swarm)
+```


+    """Write a complete on-disk snapshot of the model's mesh + mesh-variable
+    state (phase 2 scope; swarms and python_state land in phase 3).
+
+    Produces two artifacts:
+
+    - ``path`` — the wrapper HDF5 file with rich metadata and the group
+      structure inspectable via ``h5ls``.
+    - ``_bulk_dir_for(path)`` — companion directory containing the
+      PETSc HDF5 files (mesh DM + per-variable section/vec) produced
+      by #146's :meth:`Mesh.write_checkpoint`.


First slice of the on-disk snapshot format (v1.1). Establishes the file structure and the inspectability bar; no PETSc bulk yet (that is phase 2). Stacked on the in-memory snapshot toolkit (#195) and the model tracker (#196) so it can serialise both later.

What lands:

- src/underworld3/checkpoint/disk_snapshot.py
  - DISK_SNAPSHOT_SCHEMA_VERSION = 1
  - write_snapshot_skeleton(model, path): writes /metadata attrs + empty stub groups /mesh /variables /swarms /python_state (the structure phases 2+ will fill in).
  - read_snapshot_metadata(path): reads /metadata back as a plain dict, decodes JSON-encoded list fields for convenience, validates schema version.
  - inspect_snapshot(path): human-readable summary suitable for print(...) at a notebook prompt.
- src/underworld3/checkpoint/__init__.py: exports.
- tests/test_0010_snapshot_disk_format.py (7, tier_a level_1):
  - top-level group structure matches the spec
  - h5py-readable /metadata attrs cover identity, schema, tracker conventions, geometry, MPI rank count, and inventories of meshes / swarms / state-bearer classes / variables (the proxy for "an external user running h5ls/h5dump sees useful info")
  - read/write roundtrip
  - rejection of non-snapshot files and wrong-schema files with clear errors (not obscure h5py noise)
  - inspect_snapshot includes the key facts
  - skeleton groups carry `filled_by` attrs so phases 2/3 readers and external inspectors can tell whether content is populated yet.

Design notes encoded:

- UW3-controlled rich-metadata wrapper around PETSc bulk; pure PETSc HDF5 dumps fail the inspectability bar so are rejected as the format.
- List-typed metadata stored as JSON strings in scalar attrs so h5py / h5ls handle them cleanly; read API exposes them as plain Python lists alongside the *_json originals.
- Swarm storage left as a phase-3 decision: the metadata wrapper is designed to support `@external_file` on /swarms/swarm_X/ when individual swarms grow too bulky for a single file. No commitment to inline vs split until phase 3 has real swarm sizes in hand.

Stacked on feature/model-tracker; PRs to development after #195 and #196 land.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
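
The JSON-in-scalar-attr convention from the design notes above can be illustrated with plain dictionaries standing in for HDF5 attributes. This is a sketch only: the source confirms the `*_json` naming, but `encode_metadata`, `decode_metadata`, and the example keys are assumptions for illustration.

```python
import json


def encode_metadata(meta):
    """Store list-typed values as JSON strings under '<key>_json'."""
    attrs = {}
    for key, value in meta.items():
        if isinstance(value, (list, tuple)):
            attrs[f"{key}_json"] = json.dumps(list(value))
        else:
            attrs[key] = value
    return attrs


def decode_metadata(attrs):
    """Return the attrs plus a decoded plain list for every '*_json' key."""
    decoded = dict(attrs)
    for key, value in attrs.items():
        if key.endswith("_json"):
            decoded[key[: -len("_json")]] = json.loads(value)
    return decoded


attrs = encode_metadata({"step": 42, "meshes": ["mesh0"]})
print(decode_metadata(attrs))
```

Keeping the `*_json` original next to the decoded list means an external tool and the Python reader see the same stored value.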

…t roundtrip

Builds on phase 1's metadata wrapper to actually carry mesh + mesh-variable state to disk and read it back. Delegates the heavy lifting to #146's `Mesh.write_checkpoint` / `MeshVariable.read_checkpoint` PETSc-DMPlex primitives; phase 2's job is layout, dispatch, and tying the wrapper to the bulk data via a simple convention.

Layout (final v1.1 shape):

    /path/to/run.snap.h5        wrapper (h5py-inspectable)
    /path/to/run.snap.bulk/     companion directory (one per snap)
        {mesh_safe}.mesh.00000.h5
        {mesh_safe}.{var_clean}.00000.h5

Wrapper carries /meshes/{mesh_safe}/ with @name, @mesh_file, and /meshes/{mesh_safe}/variables/{var_safe}/ with @name, @components, @degree, @continuous, @external_file. The bulk-dir path is derived from the wrapper path by convention (`.h5` → `.bulk`), so no external_file attr is needed for the standard placement. Move them together; a clear FileNotFoundError fires if bulk is missing on read.

Phase 1 layout refactor folded in:

- /mesh (singular) → /meshes (plural): supports multi-mesh natively.
- /variables removed from the top level; it now nests under each mesh as /meshes/{name}/variables/{var}, matching the in-memory snapshot's mesh→vars structure.

New API:

- `write_snapshot(model, path)`: writes wrapper + bulk; covers every registered mesh and every allocated meshvar on each mesh. Lazy-allocated vars (_gvec is None) are skipped, the same rule as the in-memory path.
- `read_snapshot(model, path)`: loads var DOFs back into already-registered meshes by name. Mesh / variable mismatch raises a clear ValueError (mesh-rebuild on read is v1.2 scope).
- `write_snapshot_skeleton` / `read_snapshot_metadata` / `inspect_snapshot` stay as phase-1 metadata-only entry points.

Branch hygiene: merged origin/development (which now has #146) into this branch so the new code can actually call read_checkpoint. The merge was clean: #146 and the snapshot toolkit only overlap at different methods in `discretisation_mesh.py`, as the earlier analysis predicted. PR target will be development once #195/#196 land; the diff stays clean because the merged dev commits are already there.

Tests (12 total, 5 new in phase 2, tier_a level_1):

- write produces wrapper + bulk-dir with the expected file pattern
- wrapper populated with the per-mesh + per-var metadata that makes inspectability self-sufficient
- bit-exact write→scribble→read roundtrip on a 2D mesh with one scalar + one vector variable (np.array_equal, zero tolerance)
- missing bulk-dir → clear FileNotFoundError
- mismatched mesh on read → clear ValueError (not an obscure h5py trace)

Regression: 64 tests pass (24 snapshot + 9 tracker + 12 disk-format + 19 core/regression).

Phase 3 next: swarms (with the @external_file freedom kept open for bulky swarms) + /python_state for DDt + ModelTracker via dataclass-to-HDF5-attrs serialisation.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)

Serialises every registered Snapshottable's .state dataclass into a per-bearer group under /python_state, keyed by the same stable name the in-memory snapshot uses (f"{type(obj).__name__}_{obj.instance_number}"). ModelTracker (always auto-registered) and DDt state therefore now travel with the disk snapshot in addition to the mesh + meshvar bulk from phase 2.

Generic field serialisation (no per-class code):

- None -> attr "__none__" sentinel
- bool/int/float/str -> scalar attr (preserves type via h5py)
- numpy.ndarray -> dataset
- list/tuple -> attr <name>__json (JSON, handles None)
- dict -> subgroup, recursive (used by TrackerState.managed)
- unhandleable -> attr <name>__skipped = "<type info>"; restore keeps the *current* live value rather than clobbering it with a placeholder, so a documented partial round-trip (e.g. DDtSymbolicState.psi_star, which is sympy and would need srepr+sympify) doesn't break.

Restore uses the live obj.state as a type template + dataclasses.replace(...): captured fields override; skipped fields keep their current value. ValueError on state-bearer-not-registered keeps the same-rank/same-model contract.

Tests (4 new, 16 total tier_a level_1):

- tracker time/step/dt + user-added quantity (scalar + numpy array) round-trip exactly through disk
- /python_state group is h5py-inspectable: __bearer_class__, __state_class__, instance_number; TrackerState.managed visible as a subgroup with each managed key as an attr (so h5ls shows 'time', 'step', 'dt', 'my_q' directly)
- Symbolic DDt's primary BDF-control fields (dt_history, history_initialised, n_solves_completed, dt) round-trip; psi_star (sympy) is documented as skipped, so restore keeps the current value
- mismatched state-bearer set on read raises clearly

Phase 3b next: swarms in a per-swarm sidecar from day one (per Louis's "break out swarms" direction: bulk is always a swarm problem, so don't even try inline).

Regression: 68 tests pass (24 in-memory + 9 tracker + 16 disk-format + 19 core/regression).

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
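
The field-by-field rules and the template-based restore described above can be sketched without HDF5 by letting a plain dict stand in for a `/python_state` group. This is an illustration under assumptions: `DemoState`, `serialise_state`, and `restore_state` are hypothetical names, the `numpy.ndarray -> dataset` case is omitted to keep the example dependency-free, and dict fields are copied one level deep rather than recursively.

```python
import json
from dataclasses import dataclass, field, fields, replace

NONE_SENTINEL = "__none__"


def serialise_state(state):
    """Map each dataclass field into a dict standing in for an HDF5 group."""
    group = {}
    for f in fields(state):
        value = getattr(state, f.name)
        if value is None:
            group[f.name] = NONE_SENTINEL
        elif isinstance(value, (bool, int, float, str)):
            group[f.name] = value                      # scalar attr
        elif isinstance(value, (list, tuple)):
            group[f"{f.name}__json"] = json.dumps(list(value))
        elif isinstance(value, dict):
            group[f.name] = dict(value)                # subgroup stand-in
        else:
            group[f"{f.name}__skipped"] = type(value).__name__
    return group


def restore_state(live_state, group):
    """Use the live state as a template; skipped fields keep live values."""
    updates = {}
    for f in fields(live_state):
        if f"{f.name}__skipped" in group:
            continue                                   # keep current value
        if f"{f.name}__json" in group:
            updates[f.name] = json.loads(group[f"{f.name}__json"])
        elif f.name in group:
            stored = group[f.name]
            updates[f.name] = None if stored == NONE_SENTINEL else stored
    return replace(live_state, **updates)


@dataclass
class DemoState:
    dt: float = 0.0
    n_solves_completed: int = 0
    dt_history: list = field(default_factory=list)
    psi_star: object = None   # stands in for a non-serialisable sympy object


saved = DemoState(dt=0.1, n_solves_completed=3, dt_history=[0.1, 0.1],
                  psi_star=complex(1, 2))
live = DemoState(psi_star="live-expression")
print(restore_state(live, serialise_state(saved)))
```

Note that in this sketch tuples come back as lists after the JSON round-trip, and a string field whose value happens to equal the sentinel would be read back as `None`.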

Per Louis's direction ("break out the swarm information into a separate file in the first instance; bulk is a problem with swarms, always"), swarms always go to their own h5py-direct sidecar from day one. No inline-vs-split toggle: sidecar is the only path.

Layout:

    /path/to/run.snap.h5                            wrapper
    /path/to/run.snap.bulk/{swarm_safe}.swarm.h5    swarm sidecar (one per swarm)

Sidecar structure (h5py-native, no PETSc; swarms aren't DMPlex section/vec):

    @num_particles_local, @dim, @mesh_name, @population_generation
    /coordinates                   dataset, (n_local, dim)
    /variables/{var_clean_name}    dataset, (n_local, num_components)
        @num_components, @dtype

The sidecar's top-level @attrs and group structure mean `h5ls -v` on the sidecar alone tells you "this holds N particles in dim D on mesh M with these variables", the same inspectability bar as the wrapper.

Wrapper /swarms/{swarm_safe}/ carries metadata + the @external_file pointer to the sidecar in the bulk dir.

Restore mirrors the in-memory Swarm.apply_snapshot_payload exactly: clear local population via dm.removePoint loop, addNPoints at saved coords, write var data back. Same rebuild-on-restore semantics: the disk snapshot recovers from a particle-population mutation (added particles between snapshot and restore) just like the in-memory path does, proven by test_swarm_restore_recovers_after_particle_count_change.

Tests (5 new, 21 total tier_a level_1):

- swarm sidecar lands in bulk dir with predictable name; wrapper records external_file ref + mesh_name + var inventory
- sidecar is self-inspectable via h5py (file-level attrs + /coordinates + /variables with per-var attrs)
- whole swarm (coords + svar data) round-trips bit-exact through write → scribble → read
- rebuild-on-restore parity with in-memory path: snapshot, mutate population, restore → exact local population recovered
- PETSc-internal DMSwarm_* variables filtered at capture (same rule as in-memory)

MPI: single-rank only in this phase. The current rank-0-only sidecar write only captures rank 0's local particles in a parallel run. Phase 6 will either use h5py-mpi parallel HDF5 or per-rank sidecars to match #195's parallel exact-reconstruction guarantee.

73 tests pass (24 in-memory + 9 tracker + 21 disk-format + 19 core/regression).

Phase 4 next: format detection + dispatch in MeshVariable.read_timestep so it reads BOTH the legacy per-variable layout AND the new v1.1 sidecar format via the KDTree bridge. Closes the compatibility commitment from the design discussion.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)

Single user-facing entry point for all snapshot use cases. Same methods serve in-memory ephemeral stash and on-disk persistent snapshot; the dispatch is mechanical, the user has one API to learn:

    token = model.save_state()                 # in-memory, returns Snapshot
    model.load_state(token)                    # restore from token
    model.save_state(file="step42.snap.h5")    # on-disk, returns path
    model.load_state("step42.snap.h5")         # restore from disk
                                               # (also: load_state(file=…))

load_state dispatches on argument type: Snapshot → in-memory restore; str/PathLike → disk restore. Type-mismatched source raises TypeError with a clear message.

Renames replace the prior Model.snapshot() / Model.restore() pair from #195. Pre-merge, no public users to migrate; getting the user-facing API right now means there is never a disparate version shipped.

uw.checkpoint.{snapshot,restore,write_snapshot,read_snapshot,read_snapshot_metadata,inspect_snapshot,write_snapshot_skeleton} stay as power-user / lower-level entry points that save_state / load_state delegate to.

Files updated (mechanical renames, except the doc rewrite):

- src/underworld3/model.py: save_state / load_state methods replace snapshot / restore; load_state accepts positional Snapshot or str/os.PathLike, with TypeError on anything else.
- tests/test_0007_snapshot_inmemory.py: 23 callers renamed; obsolete test_snapshot_path_is_v1_1_scope deleted (v1.1 has landed).
- tests/test_0008_snapshot_realsolver.py: 3 tests renamed.
- tests/test_0009_model_tracker.py: 9 tests renamed.
- tests/test_0010_snapshot_disk_format.py: 21 tests; replace uw.checkpoint.write_snapshot / read_snapshot with model.save_state / model.load_state at user-style call sites; keep write_snapshot_skeleton + read_snapshot_metadata where the test is specifically exercising the lower-level entry points.
- tests/parallel/ptest_0007_snapshot_inmemory.py: np-1/3/4 ptest.
- tests/run_snapshot_backstepping_{demo,spatial}.py: demo scripts.
- docs/advanced/snapshot-restore.md: rewritten API section to show both modes; added "On-disk file layout" section and a "Choosing between paths" comparison table covering write_timestep, write_checkpoint, and save_state. Limitations section updated to reflect that on-disk is now real (was "in-memory only").

Regression: 75 single-rank tests pass (was 76, minus the deleted obsolete v1.1-scope test); MPI ptest at -np 4 still PASS with the parallel exact-reconstruction guarantee. Docs build clean with no snapshot-related warnings; the new layout + choosing-between-paths sections render.

Phase 4 (read_timestep format-aware dispatch for backward compat) becomes a nice-to-have at this point: save_state / load_state is the recommended surface, write_timestep / read_timestep keep their existing role unchanged. Phase 6 (parallel HDF5 / per-rank sidecars for on-disk MPI) is the remaining correctness item.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)

The selective-read entry point users already know (``var.read_timestep(...)``) now reads BOTH the legacy ``write_timestep`` per-variable HDF5 files AND v1.1 snapshot wrappers: same call, format detection is hidden inside the function. No user code has to learn a second API for the new format; existing scripts with ``var.read_timestep(...)`` calls keep working transparently against new files.

This is the compat commitment from the design discussion: "the clean interface lies beneath the surface for this case", meaning the format dispatch is hidden, not that read_timestep itself is hidden. read_timestep serves a different use case than save_state/load_state (selective per-variable, cross-resolution remap via KDTree, visualisation-style reads); both stay user-facing.

Implementation:

- ``uw.checkpoint.is_snapshot_wrapper(path)``: cheap format detector; checks for top-level /metadata + /meshes groups.
- ``uw.checkpoint.extract_var_via_bridge(wrapper_path, var_name)``: given a v1.1 wrapper + variable name, returns (coords, values) numpy arrays, exactly what the legacy file's h5 read produces. Mechanism: load source mesh from .mesh.h5 sidecar, rebuild source variable with matching degree/components, load DOFs via #146's MeshVariable.read_checkpoint, read out var.coords and var.array.
- MeshVariable.read_timestep: before its rank-0 (coord, value) read, dispatches on the file's format. v1.1 → bridge. Legacy → existing per-variable h5 read. Everything after (the source-swarm + query-swarm KDTree-routing machinery) is reused unchanged.

Tests (2 new, 23 total in test_0010, 77 across the snapshot suite):

- read_timestep against a v1.1 snapshot wrapper round-trips a variable bit-exact (KDTree query lands on captured DOF coords)
- read_timestep against a legacy write_timestep file still uses the legacy code path (belt-and-braces no-regression check)

Phase 6 (parallel on-disk MPI) remains as the production-readiness gate for the disk path.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
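
The format dispatch described in this commit can be sketched as a standalone function that mirrors the `is_v1_1` check from the diff. This is an assumption-laden illustration: `resolve_timestep_source` is a hypothetical name, and the real detector (which inspects the HDF5 file for /metadata and /meshes) is replaced here by an injected `is_wrapper` callable so the example needs only the standard library.

```python
import os


def resolve_timestep_source(data_filename, data_name, index,
                            output_path="", is_wrapper=lambda path: False):
    """Return (format, path) for a read_timestep-style call."""
    legacy_suffix = f".mesh.{data_name}.{index:05}.h5"
    if (
        os.path.isfile(data_filename)
        and not data_filename.endswith(legacy_suffix)
        and is_wrapper(data_filename)
    ):
        # The caller passed a v1.1 wrapper file directly.
        return "v1.1", data_filename
    # Otherwise treat the argument as a legacy write_timestep base name.
    legacy_file = os.path.join(output_path, data_filename) + legacy_suffix
    if not os.path.isfile(legacy_file):
        raise RuntimeError(f"{os.path.abspath(legacy_file)} does not exist")
    return "legacy", legacy_file
```

The legacy-suffix guard means a file that already looks like a per-variable `write_timestep` output is never probed as a wrapper.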

…n-disk

Closes the last production-readiness gate on the disk path. Swarm sidecars are now per-rank files: each rank writes its own {swarm_safe}.swarm.rank{R:04d}of{S:04d}.h5, the wrapper records the naming pattern + rank count, and on restore each rank opens its matching file. Same shape that #146 uses internally for mesh-var collectives via PETSc, just expressed as per-rank h5py files rather than a single parallel-HDF5 file (avoids the h5py-mpi build dependency).

Contract: same-rank-count restart only. Rank-count mismatch on read raises clearly with a pointer to mesh.write_timestep for the flexible-restart path. Each sidecar carries its writer's (mpi_rank, mpi_size_at_write) attrs so a wrong-rank-file load also fails cleanly.

Wrapper layout addition:

- /swarms/@filled_by = "phase3b+phase6"
- /swarms/@mpi_size_at_write
- /swarms/{name}/@sidecar_pattern (template with {rank}/{size})
- /swarms/{name}/@num_particles_global (gathered across ranks via MPI.SUM at write time)

Phase 6 implementation deliberately keeps the mesh-var collective path #146 already provides; no changes to mesh-side bulk write/read. Only the swarm-sidecar layer is rebuilt for per-rank operation.

Tests:

- 23 single-rank tests in test_0010 (unchanged count; updated the two that asserted the old single-file naming).
- New ptest_0010_snapshot_disk.py exercises -np 1/3/4: wrapper + per-rank sidecars present, particle count preserved, swarm round-trip exact (gather + sort by per-particle gid), tracker state restored, T mesh-var DOFs preserved (via partition-invariant min/max scalars; gathered DOF tables include partition-boundary duplicates that resist direct comparison).
- mpi_runner.sh registers the new ptest at -np 1 / 3 / 4.

Final tally: 77 single-rank tests green; parallel ptest_0007 (in-memory) and ptest_0010 (on-disk) both PASS at np 1/3/4.

Production verdict on the disk path: matches the in-memory path; correct in serial, in parallel, and through real solvers. The full v1.1 plan from project_snapshot_v1_1_disk_format.md is now landed: phases 1, 2, 3a, 3b, 4 (read_timestep dispatch), 5 (unified save_state/load_state API), 6 (parallel sidecars).

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
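
The same-rank-count contract can be sketched as a pre-load check. This is a hedged illustration: the attribute names `mpi_rank` and `mpi_size_at_write` come from the commit message, but `check_sidecar_matches`, the use of `ValueError`, and the message wording are assumptions, and plain dicts stand in for the HDF5 attrs.

```python
def check_sidecar_matches(rank, size, sidecar_attrs, size_at_write):
    """Raise if this rank should not load the sidecar it was handed."""
    if size != size_at_write:
        raise ValueError(
            f"snapshot was written on {size_at_write} ranks but is being "
            f"restored on {size}; same-rank-count restart only "
            "(see mesh.write_timestep for the flexible-restart path)"
        )
    writer = (sidecar_attrs["mpi_rank"], sidecar_attrs["mpi_size_at_write"])
    if writer != (rank, size):
        raise ValueError(
            f"sidecar was written by rank {writer[0]} of {writer[1]}, "
            f"not rank {rank} of {size}"
        )


# Rank 1 of 4 opening its own sidecar passes silently.
check_sidecar_matches(1, 4, {"mpi_rank": 1, "mpi_size_at_write": 4}, 4)
```

Checking the wrapper-level rank count first gives the clearer error in the common case of restarting on a different number of ranks.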

Copilot AI review requested due to automatic review settings May 20, 2026 11:12

Copilot started reviewing on behalf of lmoresi May 20, 2026 11:13 View session

Copilot AI reviewed May 20, 2026

lmoresi added 7 commits May 20, 2026 21:38

lmoresi force-pushed the feature/snapshot-disk branch from dbbf52a to eba7500 Compare May 20, 2026 11:38

lmoresi merged commit e09b8af into development May 20, 2026
1 check passed

lmoresi mentioned this pull request May 20, 2026

docs: snapshot toolkit — CHANGES entry + current API + toctree #199

Open

1 task

lmoresi merged 7 commits into development from feature/snapshot-disk
