On-disk snapshot toolkit v1.1 (stacked on #195, #196) #198
Merged
Pull request overview
Adds a persistent, on-disk backend for the snapshot toolkit via Model.save_state(file=...) / Model.load_state(path), layering an inspectable HDF5 “wrapper + bulk sidecars” format on top of PETSc DMPlex checkpoint primitives. This complements the in-memory snapshot token path by enabling durable restarts and selective reads while keeping snapshot/restore semantics (including solver-internal state via dataclass snapshots).
Changes:
- Implement v1.1 disk snapshot format (.snap.h5 wrapper + .snap.bulk/ companion dir) including mesh/meshvar bulk, per-rank swarm sidecars, and /python_state dataclass serialization.
- Unify APIs via Model.save_state(...) / Model.load_state(...), and make MeshVariable.read_timestep(...) dispatch format-aware (legacy vs v1.1 wrapper).
- Add extensive serial + MPI test coverage plus user-facing docs and demo scripts.
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tests/test_0010_snapshot_disk_format.py | Validates disk snapshot wrapper/bulk layout, inspectability, roundtrip, sidecars, and read_timestep dispatch. |
| tests/test_0009_model_tracker.py | Tests Model.tracker snapshot-managed semantics and git-stash behavior. |
| tests/test_0008_snapshot_realsolver.py | Real-solver confidence tests for snapshot restore/continuation guarantees. |
| tests/test_0007_snapshot_inmemory.py | In-memory snapshot suite expanded/maintained for meshes, swarms, DDt state, and continuation. |
| tests/run_snapshot_backstepping_demo.py | Time-series demo script illustrating adaptive-Δt back-stepping using snapshots. |
| tests/run_snapshot_backstepping_spatial.py | Spatial visualization demo companion for snapshot back-stepping. |
| tests/parallel/ptest_0010_snapshot_disk.py | MPI test for disk snapshots (wrapper + per-rank sidecars + exact reconstruction). |
| tests/parallel/ptest_0007_snapshot_inmemory.py | MPI test for in-memory snapshots (exact reconstruction + continuation). |
| tests/parallel/mpi_runner.sh | Adds snapshot ptests to the MPI runner script. |
| src/underworld3/systems/ddt.py | Adds Snapshottable state dataclasses + .state adapters and model registration for DDt flavors. |
| src/underworld3/swarm.py | Adds swarm population generation counter and snapshot payload/apply support. |
| src/underworld3/model.py | Adds _state_bearers, Model.tracker, and unified save_state/load_state API. |
| src/underworld3/discretisation/discretisation_mesh.py | Adds mesh snapshot payload/apply support for in-memory restore. |
| src/underworld3/discretisation/discretisation_mesh_variables.py | Adds v1.1 wrapper detection/bridge in read_timestep. |
| src/underworld3/checkpoint/tracker.py | Implements ModelTracker + TrackerState Snapshottable dataclass. |
| src/underworld3/checkpoint/state.py | Defines the SnapshottableState base and Snapshottable protocol. |
| src/underworld3/checkpoint/snapshot.py | In-memory snapshot orchestration (token capture/restore). |
| src/underworld3/checkpoint/disk_snapshot.py | Disk snapshot writer/reader, inspectability layer, sidecars, and python-state serialization. |
| src/underworld3/checkpoint/backend.py | Defines the snapshot backend protocol and in-memory backend implementation. |
| src/underworld3/checkpoint/__init__.py | Exposes snapshot toolkit public API surface. |
| src/underworld3/__init__.py | Imports underworld3.checkpoint at package import time. |
| docs/developer/guides/state-as-dataclass.md | Documents the state-as-dataclass contract for solver-internal state. |
| docs/developer/design/in_memory_checkpoint_design.md | Design note covering snapshot/restore motivation, semantics, and roadmap. |
| docs/advanced/snapshot-restore.md | User guide for save/load state (in-memory + on-disk) and tracker usage. |
| docs/advanced/index.md | Adds snapshot/restore to advanced docs index/toctree. |
Comment on lines +31 to +34
DOFs, plus swarm positions and user swarm-variable data with
rebuild-on-restore semantics. Solver-internal Python state, on-disk
backend, schema versioning, mesh-DM rebuild, and cross-process restore
are scheduled for follow-up PRs per the design note.
Comment on lines +1 to +17
"""Unitary in-memory (and, later, on-disk) snapshot toolkit.

The first true unitary checkpoint in Underworld3 - captures enough state
that a Model can be put back exactly as it was, suitable for backtrack on
failure, multi-stage time integration, adaptive-Δt retry, and crash
recovery.

Distinct from the existing per-variable ``write_timestep`` /
``read_timestep`` path, which serves visualisation and partial restart.
That path stays in service of its existing role.

See ``docs/developer/design/in_memory_checkpoint_design.md`` for the
design rationale, scope, and roadmap. In v1 (this code), only an
in-memory backend is implemented and only mesh + mesh-variable state is
captured. Subsequent PRs add swarm coverage, solver-internal Python
state (DDt history, parameter mutation history), an on-disk full-state
backend, and schema versioning across UW3 releases.
Comment on lines +20 to +27
├── /metadata       (attrs: uw3_version, schema_version,
│                    created_at, step, sim_time, dt, dim,
│                    mesh_type, coordinate_system,
│                    mpi_ranks_at_write, variables_summary, ...)
├── /mesh           (phase 2 - DMPlex topology + coords + labels)
├── /variables      (phase 2 - one subgroup per mesh-variable)
├── /swarms         (phase 3 - possibly @external_file refs)
└── /python_state   (phase 3 - Snapshottable dataclasses as attrs)
Comment on lines +226 to +227
f"current {DISK_SNAPSHOT_SCHEMA_VERSION}; on-disk schema "
f"migration will land with phase 6 (not yet implemented)"
Comment on lines 1222 to +1237
output_base_name = os.path.join(outputPath, data_filename)
data_file = output_base_name + f".mesh.{data_name}.{index:05}.h5"
legacy_file = output_base_name + f".mesh.{data_name}.{index:05}.h5"

if not os.path.isfile(os.path.abspath(data_file)):
    raise RuntimeError(f"{os.path.abspath(data_file)} does not exist")
is_v1_1 = (
    os.path.isfile(data_filename)
    and not data_filename.endswith(
        f".mesh.{data_name}.{index:05}.h5"
    )
    and _is_snapshot_wrapper(data_filename)
)

import h5py
import numpy as np
if is_v1_1:
    data_file = data_filename
else:
    data_file = legacy_file
    if not os.path.isfile(os.path.abspath(data_file)):

# and restore would silently no-op. `state` is therefore a
# reserved name and cannot be a user-managed quantity.
cls_attr = getattr(type(self), name, None)
if hasattr(cls_attr, "__set__") or hasattr(cls_attr, "__get__"):
Comment on lines +228 to +234
```text
my_run.snap.h5              (~tens of KB; metadata, group structure)
my_run.snap.bulk/           (per-mesh + per-swarm sidecars)
    {mesh}.mesh.00000.h5
    {mesh}.{var}.00000.h5   (one per mesh-variable)
    {swarm}.swarm.h5        (one per swarm)
```
Comment on lines +275 to +284
"""Write a complete on-disk snapshot of the model's mesh + mesh-variable
state (phase 2 scope; swarms and python_state land in phase 3).

Produces two artifacts:

- ``path`` - the wrapper HDF5 file with rich metadata and the group
  structure inspectable via ``h5ls``.
- ``_bulk_dir_for(path)`` - companion directory containing the
  PETSc HDF5 files (mesh DM + per-variable section/vec) produced
  by #146's :meth:`Mesh.write_checkpoint`.
First slice of the on-disk snapshot format (v1.1). Establishes the file structure and the inspectability bar; no PETSc bulk yet (that is phase 2). Stacked on the in-memory snapshot toolkit (#195) and the model tracker (#196) so it can serialise both later.

What lands:
- src/underworld3/checkpoint/disk_snapshot.py
  - DISK_SNAPSHOT_SCHEMA_VERSION = 1
  - write_snapshot_skeleton(model, path): writes /metadata attrs + empty stub groups /mesh /variables /swarms /python_state (the structure phases 2+ will fill in).
  - read_snapshot_metadata(path): reads /metadata back as a plain dict, decodes JSON-encoded list fields for convenience, validates schema version.
  - inspect_snapshot(path): human-readable summary suitable for print(...) at a notebook prompt.
- src/underworld3/checkpoint/__init__.py: exports.
- tests/test_0010_snapshot_disk_format.py (7, tier_a level_1):
  - top-level group structure matches the spec
  - h5py-readable /metadata attrs cover identity, schema, tracker conventions, geometry, MPI rank count, and inventories of meshes / swarms / state-bearer classes / variables (the proxy for "an external user running h5ls/h5dump sees useful info")
  - read/write roundtrip
  - rejection of non-snapshot files and wrong-schema files with clear errors (not obscure h5py noise)
  - inspect_snapshot includes the key facts
  - skeleton groups carry `filled_by` attrs so phases 2/3 readers and external inspectors can tell whether content is populated yet.

Design notes encoded:
- UW3-controlled rich-metadata wrapper around PETSc bulk; pure PETSc HDF5 dumps fail the inspectability bar so are rejected as the format.
- List-typed metadata stored as JSON strings in scalar attrs so h5py / h5ls handle them cleanly; read API exposes them as plain Python lists alongside the *_json originals.
- Swarm storage left as a phase-3 decision: the metadata wrapper is designed to support `@external_file` on /swarms/swarm_X/ when individual swarms grow too bulky for a single file. No commitment to inline vs split until phase 3 has real swarm sizes in hand.

Stacked on feature/model-tracker; PRs to development after #195 and #196 land.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
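The list-as-JSON convention in the design notes above can be illustrated with a small stand-alone sketch. It uses a plain dict in place of HDF5 attributes, and the helper names `encode_attrs` / `decode_attrs` are hypothetical, not the toolkit's API; only the `*_json` suffix convention comes from the commit message.

```python
import json


def encode_attrs(metadata):
    """Flatten metadata so every value is a scalar; lists become JSON strings."""
    attrs = {}
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            # Assumed naming: the encoded copy is stored under "<key>_json".
            attrs[f"{key}_json"] = json.dumps(list(value))
        else:
            attrs[key] = value
    return attrs


def decode_attrs(attrs):
    """Expose decoded Python lists alongside the *_json originals."""
    decoded = dict(attrs)
    for key, value in attrs.items():
        if key.endswith("_json"):
            decoded[key[: -len("_json")]] = json.loads(value)
    return decoded


attrs = encode_attrs({"schema_version": 1, "meshes": ["mesh0"], "swarms": []})
meta = decode_attrs(attrs)
```

Keeping the encoded string next to the decoded list means a reader without the Python API still sees the raw attribute exactly as it was written.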
…t roundtrip

Builds on phase 1's metadata wrapper to actually carry mesh + mesh-variable state to disk and read it back. Delegates the heavy lifting to #146's `Mesh.write_checkpoint` / `MeshVariable.read_checkpoint` PETSc-DMPlex primitives; phase 2's job is layout, dispatch, and tying the wrapper to the bulk data via a simple convention.

Layout (final v1.1 shape):
  /path/to/run.snap.h5          wrapper (h5py-inspectable)
  /path/to/run.snap.bulk/       companion directory (one per snap)
    {mesh_safe}.mesh.00000.h5
    {mesh_safe}.{var_clean}.00000.h5

Wrapper carries /meshes/{mesh_safe}/ with @name, @mesh_file, and /meshes/{mesh_safe}/variables/{var_safe}/ with @name, @components, @degree, @continuous, @external_file. The bulk-dir path is derived from the wrapper path by convention (`.h5` → `.bulk`), so no external_file attr is needed for the standard placement. Move them together; a clear FileNotFoundError fires if bulk is missing on read.

Phase 1 layout refactor folded in:
- /mesh (singular) → /meshes (plural): supports multi-mesh natively.
- /variables removed from the top level; it now nests under each mesh as /meshes/{name}/variables/{var}, matching the in-memory snapshot's mesh→vars structure.

New API:
- `write_snapshot(model, path)`: writes wrapper + bulk; covers every registered mesh and every allocated meshvar on each mesh. Lazy-allocated vars (_gvec is None) are skipped, the same rule as the in-memory path.
- `read_snapshot(model, path)`: loads var DOFs back into already-registered meshes by name. Mesh / variable mismatch raises a clear ValueError (mesh-rebuild on read is v1.2 scope).
- `write_snapshot_skeleton` / `read_snapshot_metadata` / `inspect_snapshot` stay as phase-1 metadata-only entry points.

Branch hygiene: merged origin/development (which now has #146) into this branch so the new code can actually call read_checkpoint. The merge was clean: #146 and the snapshot toolkit only overlap at different methods in `discretisation_mesh.py`, as the earlier analysis predicted. PR target will be development once #195/#196 land; the diff stays clean because the merged dev commits are already there.

Tests (12 total, 5 new in phase 2, tier_a level_1):
- write produces wrapper + bulk-dir with the expected file pattern
- wrapper populated with the per-mesh + per-var metadata that makes inspectability self-sufficient
- bit-exact write→scribble→read roundtrip on a 2D mesh with one scalar + one vector variable (np.array_equal, zero tolerance)
- missing bulk-dir → clear FileNotFoundError
- mismatched mesh on read → clear ValueError (not an obscure h5py trace)

Regression: 64 tests pass (24 snapshot + 9 tracker + 12 disk-format + 19 core/regression).

Phase 3 next: swarms (with the @external_file freedom kept open for bulky swarms) + /python_state for DDt + ModelTracker via dataclass-to-HDF5-attrs serialisation.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
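The `.h5` → `.bulk` path convention from the phase 2 message is simple enough to sketch. The function below is a guess at the behaviour the source attributes to `_bulk_dir_for`, written with `pathlib` only; the names `bulk_dir_for` and `require_bulk_dir` and the exact error wording are assumptions.

```python
from pathlib import Path


def bulk_dir_for(wrapper_path):
    """Map 'run.snap.h5' to its companion 'run.snap.bulk' directory."""
    p = Path(wrapper_path)
    if p.suffix != ".h5":
        raise ValueError(f"expected a .h5 wrapper path, got {p.name!r}")
    # Only the final suffix is swapped, so 'run.snap.h5' keeps its '.snap'.
    return p.with_suffix(".bulk")


def require_bulk_dir(wrapper_path):
    """Fail early, with a readable message, if the bulk directory was left behind."""
    bulk = bulk_dir_for(wrapper_path)
    if not bulk.is_dir():
        raise FileNotFoundError(
            f"bulk directory {bulk} is missing; move it together with {wrapper_path}"
        )
    return bulk


example = bulk_dir_for("/path/to/run.snap.h5")
```

Deriving the directory from the wrapper name keeps the pair relocatable, at the cost of requiring both to be moved together.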
Serialises every registered Snapshottable's .state dataclass into a
per-bearer group under /python_state, keyed by the same stable name
the in-memory snapshot uses (f"{type(obj).__name__}_{obj.instance_number}").
ModelTracker (always auto-registered) and DDt state therefore now
travel with the disk snapshot in addition to the mesh + meshvar
bulk from phase 2.
Generic field serialisation (no per-class code):
- None -> attr "__none__" sentinel
- bool/int/float/str-> scalar attr (preserves type via h5py)
- numpy.ndarray -> dataset
- list/tuple -> attr <name>__json (JSON, handles None)
- dict -> subgroup, recursive (used by TrackerState.managed)
- unhandleable -> attr <name>__skipped = "<type info>"
— restore keeps the *current* live value rather than clobbering
it with a placeholder, so a documented partial round-trip (e.g.
DDtSymbolicState.psi_star which is sympy and would need
srepr+sympify) doesn't break.
Restore uses the live obj.state as a type template + dataclasses.
replace(...): captured fields override; skipped fields keep their
current value. ValueError on state-bearer-not-registered keeps the
same-rank/same-model contract.
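The capture/restore rule above (captured fields override, skipped fields keep their live value) can be sketched without HDF5. This is a simplified stand-in, not the toolkit's code: it stores the record in a plain dict, handles only scalars and lists, omits the ndarray and dict cases, and `ToyDDtState` is a hypothetical dataclass.

```python
import dataclasses
import json

SKIPPED = object()  # marker for values this sketch cannot serialise


def encode_field(value):
    """Return a storable encoding, or SKIPPED for unhandleable types."""
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    if isinstance(value, (list, tuple)):
        try:
            return {"__json__": json.dumps(list(value))}
        except TypeError:
            return SKIPPED
    return SKIPPED


def capture(state):
    """Encode each dataclass field; skipped fields are left out of the record."""
    record = {}
    for field in dataclasses.fields(state):
        encoded = encode_field(getattr(state, field.name))
        if encoded is not SKIPPED:
            record[field.name] = encoded
    return record


def restore(live_state, record):
    """Use the live state as the template; only captured fields are replaced."""
    updates = {}
    for name, encoded in record.items():
        if isinstance(encoded, dict) and "__json__" in encoded:
            updates[name] = json.loads(encoded["__json__"])
        else:
            updates[name] = encoded
    return dataclasses.replace(live_state, **updates)


@dataclasses.dataclass
class ToyDDtState:
    dt: float
    dt_history: list
    psi_star: object  # stands in for a sympy expression


saved = capture(ToyDDtState(dt=0.1, dt_history=[0.1, 0.1], psi_star=complex(1, 2)))
live = ToyDDtState(dt=0.5, dt_history=[], psi_star="current-expression")
restored = restore(live, saved)
```

Because `psi_star` never enters the record, `dataclasses.replace` leaves the live value in place rather than overwriting it with a placeholder.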
Tests (4 new, 16 total tier_a level_1):
- tracker time/step/dt + user-added quantity (scalar + numpy array)
round-trip exactly through disk
- /python_state group is h5py-inspectable: __bearer_class__,
__state_class__, instance_number; TrackerState.managed visible as
a subgroup with each managed key as an attr (so h5ls shows
'time', 'step', 'dt', 'my_q' directly)
- Symbolic DDt's primary BDF-control fields (dt_history,
history_initialised, n_solves_completed, dt) round-trip; psi_star
(sympy) is documented as skipped — restore keeps current value
- mismatched state-bearer set on read raises clearly
Phase 3b next: swarms in a per-swarm sidecar from day one
(per Louis's "break out swarms" direction — bulk is always a swarm
problem, so don't even try inline).
Regression: 68 tests pass (24 in-memory + 9 tracker + 16 disk-format
+ 19 core/regression).
Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
Per Louis's direction ("break out the swarm information into a
separate file in the first instance — bulk is a problem with swarms,
always"), swarms always go to their own h5py-direct sidecar from day
one. No inline-vs-split toggle — sidecar is the only path.
Layout:
/path/to/run.snap.h5 wrapper
/path/to/run.snap.bulk/{swarm_safe}.swarm.h5 swarm sidecar (one
per swarm)
Sidecar structure (h5py-native, no PETSc — swarms aren't DMPlex
section/vec):
@num_particles_local, @dim, @mesh_name, @population_generation
/coordinates dataset, (n_local, dim)
/variables/{var_clean_name} dataset, (n_local, num_components)
@num_components, @dtype
The sidecar's top-level @attrs and group structure mean `h5ls -v`
on the sidecar alone tells you "this holds N particles in dim D on
mesh M with these variables" — same inspectability bar as the
wrapper.
Wrapper /swarms/{swarm_safe}/ carries metadata + the @external_file
pointer to the sidecar in the bulk dir.
Restore mirrors the in-memory Swarm.apply_snapshot_payload exactly:
clear local population via dm.removePoint loop, addNPoints at saved
coords, write var data back. Same rebuild-on-restore semantics — the
disk snapshot recovers from a particle-population mutation (added
particles between snapshot and restore) just like the in-memory path
does, proven by test_swarm_restore_recovers_after_particle_count_change.
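The rebuild-on-restore sequence described above (clear the local population, re-add points at the saved coordinates, write variable data back) can be mimicked with a toy container. `ToySwarm` and its methods are hypothetical stand-ins; the real path goes through the DMSwarm calls named in the message.

```python
class ToySwarm:
    """Hypothetical stand-in for a swarm: parallel lists of coords and values."""

    def __init__(self):
        self.coords = []
        self.values = []

    def add_points(self, coords, values):
        self.coords.extend(coords)
        self.values.extend(values)

    def clear(self):
        self.coords.clear()
        self.values.clear()


def capture(swarm):
    """Copy the population so later mutation cannot alter the snapshot."""
    return {"coords": list(swarm.coords), "values": list(swarm.values)}


def apply(swarm, payload):
    """Rebuild on restore: drop the current population, then re-add saved points."""
    swarm.clear()
    swarm.add_points(payload["coords"], payload["values"])


swarm = ToySwarm()
swarm.add_points([(0.0, 0.0), (1.0, 0.5)], [10.0, 20.0])
snapshot = capture(swarm)
swarm.add_points([(0.3, 0.3)], [99.0])  # population changes after the snapshot
apply(swarm, snapshot)
```

Rebuilding rather than overwriting in place is what lets the restore succeed when the particle count has changed since capture.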
Tests (5 new, 21 total tier_a level_1):
- swarm sidecar lands in bulk dir with predictable name; wrapper
records external_file ref + mesh_name + var inventory
- sidecar is self-inspectable via h5py (file-level attrs +
/coordinates + /variables with per-var attrs)
- whole swarm (coords + svar data) round-trips bit-exact through
write → scribble → read
- rebuild-on-restore parity with in-memory path: snapshot, mutate
population, restore → exact local population recovered
- PETSc-internal DMSwarm_* variables filtered at capture (same rule
as in-memory)
MPI: single-rank only in this phase. The current rank-0-only sidecar
write only captures rank 0's local particles in a parallel run.
Phase 6 will either use h5py-mpi parallel HDF5 or per-rank sidecars
to match #195's parallel exact-reconstruction guarantee.
73 tests pass (24 in-memory + 9 tracker + 21 disk-format + 19
core/regression).
Phase 4 next: format detection + dispatch in MeshVariable.read_timestep
so it reads BOTH the legacy per-variable layout AND the new v1.1
sidecar format via the KDTree bridge. Closes the compatibility
commitment from the design discussion.
Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
Single user-facing entry point for all snapshot use cases. Same
methods serve in-memory ephemeral stash and on-disk persistent
snapshot — the dispatch is mechanical, the user has one API to
learn:
token = model.save_state() # in-memory, returns Snapshot
model.load_state(token) # restore from token
model.save_state(file="step42.snap.h5") # on-disk, returns path
model.load_state("step42.snap.h5") # restore from disk
# (also: load_state(file=…))
load_state dispatches on argument type — Snapshot → in-memory
restore; str/PathLike → disk restore. Type-mismatched source raises
TypeError with a clear message.
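A minimal sketch of the type-based dispatch just described, assuming a stand-in `Snapshot` class and a free function in place of the `Model` method. The return values are placeholders that only show which branch was taken.

```python
import os
from pathlib import Path


class Snapshot:
    """Hypothetical stand-in for the in-memory snapshot token."""

    def __init__(self, payload):
        self.payload = payload


def load_state(source):
    """Dispatch on argument type: token -> in-memory, path -> disk."""
    if isinstance(source, Snapshot):
        return ("memory", source.payload)
    if isinstance(source, (str, os.PathLike)):
        return ("disk", os.fspath(source))
    raise TypeError(
        f"load_state expects a Snapshot or a path, got {type(source).__name__}"
    )


from_token = load_state(Snapshot({"step": 42}))
from_str = load_state("step42.snap.h5")
from_path = load_state(Path("step42.snap.h5"))
```

Checking for the token type first keeps the path branch free to accept anything that implements `os.PathLike`.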
Renames replace the prior Model.snapshot() / Model.restore() pair
from #195. Pre-merge, no public users to migrate; getting the
user-facing API right now means there is never a disparate version
shipped. uw.checkpoint.{snapshot,restore,write_snapshot,read_snapshot,
read_snapshot_metadata,inspect_snapshot,write_snapshot_skeleton}
stay as power-user / lower-level entry points that save_state /
load_state delegate to.
Files updated (mechanical renames, except the doc rewrite):
- src/underworld3/model.py: save_state / load_state methods replace
snapshot / restore; load_state accepts positional Snapshot or
str/os.PathLike, with TypeError on anything else.
- tests/test_0007_snapshot_inmemory.py — 23 callers renamed; obsolete
test_snapshot_path_is_v1_1_scope deleted (v1.1 has landed).
- tests/test_0008_snapshot_realsolver.py — 3 tests renamed.
- tests/test_0009_model_tracker.py — 9 tests renamed.
- tests/test_0010_snapshot_disk_format.py — 21 tests: replace
uw.checkpoint.write_snapshot / read_snapshot with model.save_state
/ model.load_state at user-style call sites; keep
write_snapshot_skeleton + read_snapshot_metadata where the test is
specifically exercising the lower-level entry points.
- tests/parallel/ptest_0007_snapshot_inmemory.py — np-1/3/4 ptest.
- tests/run_snapshot_backstepping_{demo,spatial}.py — demo scripts.
- docs/advanced/snapshot-restore.md — rewritten API section to show
both modes; added "On-disk file layout" section and a "Choosing
between paths" comparison table covering write_timestep,
write_checkpoint, and save_state. Limitations section updated to
reflect that on-disk is now real (was "in-memory only").
Regression: 75 single-rank tests pass (was 76 — minus the deleted
obsolete v1.1-scope test); MPI ptest at -np 4 still PASS with the
parallel exact-reconstruction guarantee. Docs build clean with no
snapshot-related warnings; the new layout + choosing-between-paths
sections render.
Phase 4 (read_timestep format-aware dispatch for backward compat)
becomes a nice-to-have at this point — save_state / load_state is
the recommended surface, write_timestep / read_timestep keep their
existing role unchanged. Phase 6 (parallel HDF5 / per-rank sidecars
for on-disk MPI) is the remaining correctness item.
Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
The selective-read entry point users already know (``var.read_timestep(...)``) now reads BOTH the legacy ``write_timestep`` per-variable HDF5 files AND v1.1 snapshot wrappers: same call, format detection is hidden inside the function. No user code has to learn a second API for the new format; existing scripts with ``var.read_timestep(...)`` calls keep working transparently against new files.

This is the compat commitment from the design discussion: "the clean interface lies beneath the surface for this case", meaning the format dispatch is hidden, not that read_timestep itself is hidden. read_timestep serves a different use case than save_state/load_state (selective per-variable, cross-resolution remap via KDTree, visualisation-style reads); both stay user-facing.

Implementation:
- ``uw.checkpoint.is_snapshot_wrapper(path)``: cheap format detector; checks for top-level /metadata + /meshes groups.
- ``uw.checkpoint.extract_var_via_bridge(wrapper_path, var_name)``: given a v1.1 wrapper + variable name, returns (coords, values) numpy arrays, exactly what the legacy file's h5 read produces. Mechanism: load source mesh from .mesh.h5 sidecar, rebuild source variable with matching degree/components, load DOFs via #146's MeshVariable.read_checkpoint, read out var.coords and var.array.
- MeshVariable.read_timestep: before its rank-0 (coord, value) read, dispatches on the file's format. v1.1 → bridge. Legacy → existing per-variable h5 read. Everything after (the source-swarm + query-swarm KDTree-routing machinery) is reused unchanged.

Tests (2 new, 23 total in test_0010, 77 across the snapshot suite):
- read_timestep against a v1.1 snapshot wrapper round-trips a variable bit-exact (KDTree query lands on captured DOF coords)
- read_timestep against a legacy write_timestep file still uses the legacy code path (belt-and-braces no-regression check)

Phase 6 (parallel on-disk MPI) remains as the production-readiness gate for the disk path.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
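The format dispatch described in the message above comes down to two checks: the legacy file-name suffix and the presence of the top-level /metadata and /meshes groups. The sketch below takes the group names as a plain list instead of opening an HDF5 file, and all three function names are hypothetical.

```python
def is_legacy_name(filename, data_name, index):
    """Legacy write_timestep files end in '.mesh.<var>.<00000>.h5'."""
    return filename.endswith(f".mesh.{data_name}.{index:05}.h5")


def looks_like_wrapper(top_level_groups):
    """A v1.1 wrapper is assumed to carry both /metadata and /meshes."""
    return {"metadata", "meshes"} <= set(top_level_groups)


def choose_reader(filename, data_name, index, top_level_groups):
    """Pick the v1.1 bridge only when both checks agree; otherwise stay legacy."""
    if not is_legacy_name(filename, data_name, index) and looks_like_wrapper(
        top_level_groups
    ):
        return "v1.1-bridge"
    return "legacy"


wrapper_choice = choose_reader(
    "run.snap.h5", "T", 0, ["metadata", "meshes", "swarms", "python_state"]
)
legacy_choice = choose_reader("out.mesh.T.00000.h5", "T", 0, ["fields"])
```

Defaulting to the legacy reader means an unrecognised file fails in the existing code path, with its existing error messages.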
…n-disk
Closes the last production-readiness gate on the disk path. Swarm
sidecars are now per-rank files: each rank writes its own
{swarm_safe}.swarm.rank{R:04d}of{S:04d}.h5, the wrapper records the
naming pattern + rank count, and on restore each rank opens its
matching file. Same shape that #146 uses internally for mesh-var
collectives via PETSc, just expressed as per-rank h5py files rather
than a single parallel-HDF5 file (avoids the h5py-mpi build
dependency).
Contract: same-rank-count restart only. Rank-count mismatch on read
raises clearly with a pointer to mesh.write_timestep for the
flexible-restart path. Each sidecar carries its writer's
(mpi_rank, mpi_size_at_write) attrs so a wrong-rank-file load
also fails cleanly.
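The per-rank naming pattern and the same-rank-count contract can be sketched as two small helpers. The file-name template is taken from the message above; the function names and the choice of `ValueError` are assumptions, since the source only says the mismatch "raises clearly".

```python
def sidecar_name(swarm_safe, rank, size):
    """Per-rank sidecar file name, following the pattern in the commit message."""
    return f"{swarm_safe}.swarm.rank{rank:04d}of{size:04d}.h5"


def check_rank_count(size_at_write, size_now):
    """Same-rank-count contract: refuse to restore on a different MPI size."""
    if size_at_write != size_now:
        raise ValueError(
            f"snapshot was written on {size_at_write} ranks but is being read on "
            f"{size_now}; use mesh.write_timestep for flexible restarts"
        )


names = [sidecar_name("swarm0", rank, 4) for rank in range(4)]
```

Encoding both rank and size in the name means a directory listing alone shows whether a complete set of sidecars is present.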
Wrapper layout addition:
- /swarms/@filled_by = "phase3b+phase6"
- /swarms/@mpi_size_at_write
- /swarms/{name}/@sidecar_pattern (template with {rank}/{size})
- /swarms/{name}/@num_particles_global (gathered across ranks via
MPI.SUM at write time)
Phase 6 implementation deliberately keeps the mesh-var collective
path #146 already provides — no changes to mesh-side bulk write/read.
Only the swarm-sidecar layer is rebuilt for per-rank operation.
Tests:
- 23 single-rank tests in test_0010 (unchanged count; updated the
two that asserted the old single-file naming).
- New ptest_0010_snapshot_disk.py exercises -np 1/3/4: wrapper +
per-rank sidecars present, particle count preserved, swarm round-
trip exact (gather + sort by per-particle gid), tracker state
restored, T mesh-var DOFs preserved (via partition-invariant
min/max scalars — gathered DOF tables include partition-boundary
duplicates that resist direct comparison).
- mpi_runner.sh registers the new ptest at -np 1 / 3 / 4.
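The two comparison strategies in the test list above can be shown on toy data: particles are matched by sorting on a global id, while mesh DOFs are compared through min/max because those are unaffected by duplicated partition-boundary values. Lists of lists stand in for per-rank data here; nothing below uses MPI, and the function names are hypothetical.

```python
def gather_sorted(per_rank_particles):
    """Merge per-rank (gid, value) lists and order them by global id."""
    merged = [p for rank_particles in per_rank_particles for p in rank_particles]
    return sorted(merged, key=lambda particle: particle[0])


def dof_summary(per_rank_dofs):
    """Min and max do not change when boundary DOFs appear on several ranks."""
    values = [v for rank_values in per_rank_dofs for v in rank_values]
    return min(values), max(values)


# Same particles, distributed differently across two ranks.
before = [[(0, 1.5), (2, 3.5)], [(1, 2.5)]]
after = [[(1, 2.5)], [(2, 3.5), (0, 1.5)]]
same_particles = gather_sorted(before) == gather_sorted(after)

# The value 2.0 sits on a partition boundary and is held by both ranks.
partitioned = dof_summary([[1.0, 2.0], [2.0, 3.0]])
serial = dof_summary([[1.0, 2.0, 3.0]])
```

Min/max is a weaker check than an element-wise comparison, which is the trade-off the commit message accepts for the mesh-variable case.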
Final tally: 77 single-rank tests green; parallel ptest_0007
(in-memory) and ptest_0010 (on-disk) both PASS at np 1/3/4.
Production verdict on the disk path: matches the in-memory path —
correct serial, parallel, and through real solvers. The full v1.1
plan from project_snapshot_v1_1_disk_format.md is now landed:
phases 1, 2, 3a, 3b, 4 (read_timestep dispatch), 5 (unified
save_state/load_state API), 6 (parallel sidecars).
Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
Summary
Adds the on-disk arm of the snapshot toolkit: Model.save_state(file=…) / Model.load_state(file=…). Pairs the in-memory work in #195 (the "git stash for timesteps") with a persistent format inspectable via standard h5 tools, parallel-correct, and built on top of #146's PETSc DMPlex primitives.

Based on feature/model-tracker (#196) → feature/in-memory-checkpoint (#195) → development. The PR will show a large diff until those land; once they merge, the diff narrows to just the snapshot-disk additions automatically. Targeting development directly because that's the actual destination; readers focus on the new files under src/underworld3/checkpoint/disk_snapshot.py and tests/test_0010*.

What's here (six phases)
72681d1
e4a43e0
/python_state: Snapshottable dataclass round-trip
8e3d04a
83061be
Model.save_state / load_state API
df4f829
MeshVariable.read_timestep format-aware dispatch
3bb201d
dbbf52a

User-facing API
The whole disk path is exposed through the same two methods #196 introduced for in-memory:

File layout
h5ls -v my_run.snap.h5/metadata shows run name, schema version, sim time, step, dim, MPI rank count, and inventories; no UW3 needed.

Tests
- ptest_0007) and on-disk (ptest_0010). Exact reconstruction confirmed across cross-rank particle distribution; recovers from real cross-rank particle loss.
- test_0008_snapshot_realsolver) shows bit-exact discard guarantee through an AdvDiffusion solve.

Design decisions captured
- h5ls-without-UW3, so we wrap PETSc bulk in a UW3-controlled metadata layer.
- read_timestep stays user-facing and selective: different use case (variable subsets, cross-resolution remap) from load_state's whole-model role. The format detection is hidden behind the call.
- write_timestep stays as the selective-output path; a future vars=[…] filter on save_state would close the gap if needed.

Test plan
- pixi run -e amr-dev pytest tests/test_0007_snapshot_inmemory.py tests/test_0008_snapshot_realsolver.py tests/test_0009_model_tracker.py tests/test_0010_snapshot_disk_format.py (77 tests)
- cd tests/parallel && bash mpi_runner.sh (covers ptest_0007 + ptest_0010 at 1/3/4 ranks)
- pixi run -e amr-dev docs-build: advanced/snapshot-restore.html renders with all sections

After #195 and #196 merge
Retarget this PR's base if needed (it will be already if dev contains those merges); the diff narrows automatically.

Underworld development team with AI support from Claude Code