
Plan F: ZFS path for enroot load docker:// #3

Merged
sodre merged 5 commits into zenroot/main from feature/zfs-f-docker-load
Apr 29, 2026

Conversation


sodre commented Apr 29, 2026

Summary

When ENROOT_STORAGE_BACKEND=zfs, enroot load docker://... no longer requires ENROOT_NATIVE_OVERLAYFS=y. The merged image is materialized into a ZFS template dataset (cached by image config digest) and the user's container is a zfs clone of it. Default dir backend behavior is byte-for-byte preserved.

Implements Plan F. Builds on Plan A (#1).
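
For orientation, the end state in dataset terms. A minimal sketch with hypothetical pool and dataset names; the real layout is owned by src/storage_zfs.sh:

```sh
# Hypothetical names for illustration only.
# One readonly template per image, keyed by the image config digest,
# frozen behind a @pristine snapshot:
zfs list -o name,origin tank/enroot/templates/sha256-abc123
# Each user container is a cheap copy-on-write clone of that snapshot:
zfs clone tank/enroot/templates/sha256-abc123@pristine \
    tank/enroot/containers/alpine_loaded
```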

What changed

  • src/storage_zfs.sh — added two helpers that own the ZFS-specific lifecycle so docker.sh only dispatches:
    • zfs::container_check NAME — early-exit gate that errors (or destroys with --force) if a container of that name already exists. Used so docker::load can fail fast before downloading layers it would discard.
    • zfs::docker_install_from_layers CACHE_KEY LAYER_COUNT UNPRIV NAME — full template lifecycle: cache lookup, atomic .tmp lock, enroot-nsenter --user --remap-root + mount -t overlay lowerdir=0:1:…:N + tar-pipe into the template's mountpoint, promote (rename + snapshot @pristine + readonly=on), then clone for the user. Race-safe across concurrent loads of the same image (losers wait for @pristine); ENOSPC mid-merge destroys the .tmp so retries can run. A condensed sketch follows this list.
  • src/docker.sh (docker::load) — three small dispatches:
    • One-line precondition tweak: ENROOT_NATIVE_OVERLAYFS=y is required only on the dir backend.
    • One-line existence-check dispatch: zfs::container_check on ZFS, the existing realpath/rmall block on dir.
    • Two-line merge dispatch: zfs::docker_install_from_layers on ZFS, the existing enroot-nsenter + tar | tar on dir.
  • doc/zfs.md — status note flipped to "Plans A, E, F implemented"; the "Where the ZFS backend is used" table row for enroot load docker:// rewritten to match the actual implementation (overlay merge reused; only the tar-pipe destination changes).
  • CLAUDE.md — active-design-proposals line updated.
  • doc/plans/2026-04-29-zfs-f-docker-load.md — rewritten to match the helper-based architecture.
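
A condensed sketch of the zfs::docker_install_from_layers lifecycle, with hypothetical dataset names; the real helper additionally handles lock contention, destroys the .tmp on ENOSPC, and wraps the merge in enroot-nsenter --user --remap-root:

```sh
# Illustrative only; run from the directory _prepare_layers populated with 0/..N/.
tmpl="tank/enroot/templates/${cache_key}"            # cache key = image config digest
if ! zfs list -H -t snapshot "${tmpl}@pristine" > /dev/null 2>&1; then
    zfs create "${tmpl}.tmp"                         # doubles as the lock: fails if a peer won
    mnt=$(zfs get -H -o value mountpoint "${tmpl}.tmp")
    # Same merge as the dir backend: overlay-mount the layer directories and
    # tar-pipe the merged view; only the destination differs.
    mkdir -p merged
    mount -t overlay overlay -o "lowerdir=$(seq -s: 0 "${layer_count}")" merged
    tar -C merged -cpf - . | tar -C "${mnt}" -xpf -
    umount merged
    zfs rename "${tmpl}.tmp" "${tmpl}"               # promote: rename + snapshot + freeze
    zfs snapshot "${tmpl}@pristine"
    zfs set readonly=on "${tmpl}"
fi
zfs clone "${tmpl}@pristine" "tank/enroot/containers/${name}"
```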

Scope-reduction note (read this before merging)

This PR is not a like-for-like port of Docker's own zfs storage driver. The original Plan F design followed Docker's pattern: each image layer becomes its own ZFS dataset cloned from its parent's snapshot, with whiteouts applied per-layer in shell. We deliberately scoped that out in favor of a single-pass merge into one template per image. That is a real trade-off — the simpler design loses some properties worth flagging:

  • Cross-image layer dedup at the dataset level. Two images sharing a Debian base store the base's bytes twice under this design (block-level dedup=on recovers it, at a cost of roughly 5–6 GB of RAM per TB indexed; see the sketch after this list). The per-layer-chain design dedups for free.
  • Incremental re-pull cost on lineage updates. When alpine:3.21 replaces alpine:3.20, the per-layer design re-merges only the changed top layers; this design re-merges the full stack.
  • Layer-granular cache invalidation. Per-layer chain lets you zfs destroy a single suspect layer; this design throws out the whole template.
  • Native ZFS introspection of the layer chain (zfs list -t all, layer-level zfs send).
  • Quota accounting that matches intuition when many images share lower layers.
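
Where the duplicated base bytes matter, block-level dedup can be confined to the templates subtree. This is the RAM-for-space trade flagged in the first bullet; pool and dataset names are hypothetical:

```sh
zfs set dedup=on tank/enroot/templates   # affects newly written blocks only
zpool status -D tank                     # inspect dedup table (DDT) size and ratio
```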

What we got in return: no in-shell whiteout/opaque-dir handling, no per-layer locks, no zfs promote flattening dance, ~5–15× fewer dataset objects per image, and ~90% of the work reused from the existing dir-backend pipeline.

This is a sensible default for sites with a small image set or where ZFS block-level dedup is acceptable. For HPC sites pulling many images that share common bases, a future plan ("Plan G") could add the per-layer chain as an opt-in alongside this code path (e.g. ENROOT_ZFS_LAYER_CHAIN=y), not as a replacement. Tracked as a known limitation.

Test Plan

Verified manually against a loopback ZFS pool on Linux 6.12.75 (aarch64), zfs-2.4.1; a matching setup is sketched after this list:

  • enroot load -n alpine_loaded docker://alpine succeeds with ENROOT_STORAGE_BACKEND=zfs ENROOT_NATIVE_OVERLAYFS=no; rootfs readable; enroot start prints alpine os-release; enroot remove reaps clone + template (refcount).
  • enroot load -n debian_loaded docker://debian:stable-slim (multi-layer) succeeds; dpkg present in rootfs; zero .wh.* markers leaked through; enroot start prints Debian os-release.
  • Cache reuse: a second enroot load of the same image takes ~0.7s and logs "Reusing cached template"; the templates dataset count is unchanged.
  • Existence check: a second load with the same name and without --force fails with "Container already exists: ..."; with ENROOT_FORCE_OVERRIDE=y it overwrites.
  • Dir backend regression: ENROOT_STORAGE_BACKEND=dir ENROOT_NATIVE_OVERLAYFS=no errors with the new combined precondition message; with ENROOT_NATIVE_OVERLAYFS=y the dir backend works as before.
  • Precondition relaxation: ENROOT_STORAGE_BACKEND=zfs ENROOT_NATIVE_OVERLAYFS=no succeeds.
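
For reproducibility, a minimal loopback-pool setup matching the environment above. The pool name, image path, and the ENROOT_ZFS_ROOT variable are hypothetical; the real configuration knob is documented in doc/zfs.md:

```sh
truncate -s 8G /var/tmp/enroot-zfs.img
sudo zpool create enroot-test /var/tmp/enroot-zfs.img   # file vdevs are fine for testing
export ENROOT_STORAGE_BACKEND=zfs ENROOT_NATIVE_OVERLAYFS=no
export ENROOT_ZFS_ROOT=enroot-test                      # hypothetical knob; see doc/zfs.md
enroot load -n alpine_loaded docker://alpine
enroot start alpine_loaded cat /etc/os-release
enroot remove alpine_loaded              # reaps the clone (and the template, by refcount)
sudo zpool destroy enroot-test           # tear down the loopback pool
```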

Known limitations

  • Per-layer dedup across distinct images is not done at the dataset level (see scope-reduction note).
  • The merge still runs through enroot-nsenter --user --remap-root and the kernel's overlay (or fuse-overlayfs) — we just redirect the result.
  • Concurrent loads of the same image are race-safe via the same .tmp lock as Plan A's ensure_template; the waiting side of that lock is pictured below.
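
The loser side of that lock can be pictured as a small poll on the @pristine snapshot; an illustrative sketch only, not the actual code in src/storage_zfs.sh:

```sh
# A peer holds ${tmpl}.tmp: wait until its promote lands (@pristine appears) or
# the .tmp vanishes (the peer failed, e.g. on ENOSPC), in which case retry the create.
while ! zfs list -H -t snapshot "${tmpl}@pristine" > /dev/null 2>&1; do
    zfs list -H "${tmpl}.tmp" > /dev/null 2>&1 || break   # peer gone: safe to retry
    sleep 1
done
```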

sodre added 5 commits April 29, 2026 09:44
The original plan assumed _prepare_layers leaves layer tarballs in
the cwd and designed a per-layer ZFS clone chain with manual
whiteout handling. Reality: _prepare_layers untars each layer into
a numbered directory and runs enroot-aufs2ovlfs to convert
whiteouts in-place. After it returns, directories 0/, 1/, ..., N/
are extracted, whiteout-converted layer trees ready for an
overlayfs lowerdir mount.

The revised plan reuses the dir backend's existing single-pass
overlay+tar-pipe mechanism unchanged, only redirecting the
destination from a directory to a ZFS clone's mountpoint that we
create outside the user namespace (per Plan E's userns-zfs lesson).
No new C helpers, no per-layer chain, no manual whiteout merge —
the kernel's overlay support is still the merge engine.

zfs::ensure_template_from_target is added as the atomic-template
lifecycle helper, separating the create-tmp/promote/snapshot/lock
machinery from the content-fill mechanism (which docker::load
provides via the existing nsenter+overlay+tar-pipe).

Signed-off-by: Patrick Sodré <patrick@zero-ae.com>
Lifts the ENROOT_NATIVE_OVERLAYFS=y precondition when the ZFS backend
is active; the ZFS path doesn't depend on kernel overlayfs after
the merge step (overlayfs is still used INSIDE the merge, but only
to materialize the merged tree; the result is captured into a ZFS
clone via tar-pipe).

ZFS path:
- Early existence check on the user's target name so we don't
  download layers we'll discard.
- Cache template by image config sha256 (the docker config blob's
  digest, available from _prepare_layers' first output line).
- Atomic .tmp dataset lock pattern same as zfs::ensure_template.
- Reuse the existing enroot-nsenter --user --remap-root + overlay
  + tar-pipe pipeline; only the destination changes from the
  user's rootfs directory to the template clone's mountpoint.
- Promote on success: rename, snapshot @pristine, readonly=on.
- Clone for user via zfs::clone_container.

Dir backend behavior is byte-for-byte preserved.

Signed-off-by: Patrick Sodré <patrick@zero-ae.com>
Signed-off-by: Patrick Sodré <patrick@zero-ae.com>
Per review feedback: keep ZFS code together so the delta to
existing files stays minimal. The previous commit had ~50 lines of
inline ZFS lifecycle (template caching, .tmp lock, overlay merge,
promotion, race-wait) directly in docker::load; this refactor moves
them into two src/storage_zfs.sh helpers and shrinks the docker.sh
delta to a precondition tweak + an early existence check + a single
dispatch call.

New helpers in src/storage_zfs.sh:

- zfs::container_check NAME
    Errors (or destroys with --force) if the container name already
    exists in the ZFS store. Used as an early-exit gate so we don't
    download Docker layers we'd just throw away.

- zfs::docker_install_from_layers CACHE_KEY LAYER_COUNT UNPRIV NAME
    Materializes the merged Docker rootfs into a ZFS template
    (cached by CACHE_KEY) and clones it as the user's container.
    Atomic across concurrent loads via the same .tmp lock pattern
    as zfs::ensure_template. Designed to be called from docker::load
    AFTER docker::_prepare_layers populates the cwd with directories
    0/, 1/, ..., N/. ENOSPC mid-merge destroys the .tmp so retries
    can run.

docker::load delta is now:
- one-line precondition tweak (relax NATIVE_OVERLAYFS=y on ZFS)
- one-line existence-check dispatch
- two-line merge-step dispatch

No behavior change; verified end-to-end (load, cache reuse,
existence error, force overwrite, dir-backend regression).

Signed-off-by: Patrick Sodré <patrick@zero-ae.com>
Signed-off-by: Patrick Sodré <patrick@zero-ae.com>