Plan F: ZFS path for enroot load docker:// #3
Merged
sodre merged 5 commits into zenroot/main on Apr 29, 2026
The original plan assumed _prepare_layers leaves layer tarballs in the cwd and designed a per-layer ZFS clone chain with manual whiteout handling. Reality: _prepare_layers untars each layer into a numbered directory and runs enroot-aufs2ovlfs to convert whiteouts in-place. After it returns, directories 0/, 1/, ..., N/ are extracted, whiteout-converted layer trees ready for an overlayfs lowerdir mount.

The revised plan reuses the dir backend's existing single-pass overlay+tar-pipe mechanism unchanged, only redirecting the destination from a directory to a ZFS clone's mountpoint that we create outside the user namespace (per Plan E's userns-zfs lesson). No new C helpers, no per-layer chain, no manual whiteout merge: the kernel's overlay support is still the merge engine.

zfs::ensure_template_from_target is added as the atomic-template lifecycle helper, separating the create-tmp/promote/snapshot/lock machinery from the content-fill mechanism (which docker::load provides via the existing nsenter+overlay+tar-pipe).

Signed-off-by: Patrick Sodré <patrick@zero-ae.com>
Lifts the ENROOT_NATIVE_OVERLAYFS=y precondition when ZFS backend is active; the ZFS path doesn't depend on kernel overlayfs after the merge step (overlayfs is still used INSIDE the merge, but only to materialize the merged tree; the result is captured into a ZFS clone via tar-pipe).

ZFS path:
- Early existence check on the user's target name so we don't download layers we'll discard.
- Cache template by image config sha256 (the docker config blob's digest, available from _prepare_layers' first output line).
- Atomic .tmp dataset lock pattern same as zfs::ensure_template.
- Reuse the existing enroot-nsenter --user --remap-root + overlay + tar-pipe pipeline; only the destination changes from the user's rootfs directory to the template clone's mountpoint.
- Promote on success: rename, snapshot @pristine, readonly=on.
- Clone for user via zfs::clone_container.

Dir backend behavior is byte-for-byte preserved.

Signed-off-by: Patrick Sodré <patrick@zero-ae.com>
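The merge step described in this commit can be sketched roughly as below. This is an illustration under stated assumptions, not the code in src/docker.sh: the function names lowerdir_for and tar_pipe are hypothetical, and the tar flags are a plain guess at a permission-preserving copy. Only the 0/, 1/, ..., N/ directory layout and the overlay + tar-pipe idea come from the commit message.

```shell
#!/usr/bin/env bash

# Build the overlay lowerdir list "0:1:...:N" for layer directories 0/ .. N/.
# (Hypothetical helper name; N is the index of the last layer directory.)
lowerdir_for() {
    local last="$1" i list="0"
    for ((i = 1; i <= last; i++)); do
        list+=":${i}"
    done
    printf '%s\n' "${list}"
}

# Copy a merged tree into a destination directory with a tar pipe.
# The body runs in a subshell so pipefail does not leak into the caller.
tar_pipe() (
    set -o pipefail
    tar -C "$1" -cpf - . | tar -C "$2" -xpf -
)

# Inside the user namespace, the merge would then look roughly like this
# (not executed here; template_mountpoint is an assumed variable):
#   mount -t overlay overlay -o "lowerdir=$(lowerdir_for "${n}")" rootfs
#   tar_pipe rootfs "${template_mountpoint}"
```

On the dir backend the second argument of the tar pipe would be the user's rootfs directory; on the ZFS backend it would be the template clone's mountpoint, which is the only difference this commit describes.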
Signed-off-by: Patrick Sodré <patrick@zero-ae.com>
Per review feedback: keep ZFS code together so the delta to
existing files stays minimal. The previous commit had ~50 lines of
inline ZFS lifecycle (template caching, .tmp lock, overlay merge,
promotion, race-wait) directly in docker::load; this refactor moves
them into two src/storage_zfs.sh helpers and shrinks the docker.sh
delta to a precondition tweak + an early existence check + a single
dispatch call.
New helpers in src/storage_zfs.sh:

  - zfs::container_check NAME
      Errors (or destroys with --force) if the container name already
      exists in the ZFS store. Used as an early-exit gate so we don't
      download Docker layers we'd just throw away.

  - zfs::docker_install_from_layers CACHE_KEY LAYER_COUNT UNPRIV NAME
      Materializes the merged Docker rootfs into a ZFS template
      (cached by CACHE_KEY) and clones it as the user's container.
      Atomic across concurrent loads via the same .tmp lock pattern
      as zfs::ensure_template. Designed to be called from docker::load
      AFTER docker::_prepare_layers populates the cwd with directories
      0/, 1/, ..., N/. ENOSPC mid-merge destroys the .tmp so retries
      can run.

docker::load delta is now:
- one-line precondition tweak (relax NATIVE_OVERLAYFS=y on ZFS)
- one-line existence-check dispatch
- two-line merge-step dispatch
No behavior change; verified end-to-end (load, cache reuse,
existence error, force overwrite, dir-backend regression).
Signed-off-by: Patrick Sodré <patrick@zero-ae.com>
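The template lifecycle that this commit moves into src/storage_zfs.sh can be pictured with the sketch below. It is a simplified illustration, not the real zfs::docker_install_from_layers (whose arguments are CACHE_KEY LAYER_COUNT UNPRIV NAME): the function name install_from_template, its argument order, and the idea of passing the fill step as a command are all assumptions. The zfs subcommands used (list, create, destroy, rename, snapshot, set, clone) are standard OpenZFS commands.

```shell
#!/usr/bin/env bash

# Hypothetical sketch of the atomic template lifecycle.
# Usage: install_from_template TEMPLATE_DATASET TARGET_DATASET FILL_CMD [ARGS...]
# FILL_CMD is called with the .tmp dataset name appended and must populate it.
install_from_template() {
    local template="$1" target="$2"
    shift 2
    local tmp="${template}.tmp"

    # Cache hit: the template was already promoted, so just clone it.
    if zfs list -H -t snapshot -o name "${template}@pristine" > /dev/null 2>&1; then
        zfs clone "${template}@pristine" "${target}"
        return
    fi

    # `zfs create` fails if the dataset already exists, so it doubles as the lock.
    zfs create "${tmp}" || return 1

    # Fill the .tmp dataset. On any failure (ENOSPC, for example) destroy it
    # so that a later retry can take the lock again.
    if ! "$@" "${tmp}"; then
        zfs destroy -r "${tmp}"
        return 1
    fi

    # Promote on success: rename, snapshot @pristine, readonly=on, then clone.
    zfs rename "${tmp}" "${template}" &&
        zfs snapshot "${template}@pristine" &&
        zfs set readonly=on "${template}" &&
        zfs clone "${template}@pristine" "${target}"
}
```

A caller that loses the race for the .tmp lock gets a non-zero status from this sketch; the real helper is described as waiting for the winner instead.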
Signed-off-by: Patrick Sodré <patrick@zero-ae.com>
Summary
When ENROOT_STORAGE_BACKEND=zfs, enroot load docker://... no longer requires ENROOT_NATIVE_OVERLAYFS=y. The merged image is materialized into a ZFS template dataset (cached by image config digest) and the user's container is a zfs clone of it. Default dir backend behavior is byte-for-byte preserved.

Implements Plan F. Builds on Plan A (#1).
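The relaxed precondition can be pictured as a small guard. This is a minimal sketch and not the code in src/docker.sh: the function name check_overlay_precondition and the error text are hypothetical, and treating an unset ENROOT_STORAGE_BACKEND as dir is an assumption based on dir being the default backend. Only the two environment variables and their values come from this PR.

```shell
#!/usr/bin/env bash

# Hypothetical guard: ENROOT_NATIVE_OVERLAYFS=y is required only on the dir backend.
check_overlay_precondition() {
    local backend="${ENROOT_STORAGE_BACKEND:-dir}"   # assumed default: dir
    local native="${ENROOT_NATIVE_OVERLAYFS:-}"

    # ZFS backend: the precondition is lifted.
    if [ "${backend}" = "zfs" ]; then
        return 0
    fi

    # Any other backend still needs native overlayfs.
    if [ "${native}" != "y" ]; then
        # Illustrative message only; the real wording is not shown in this PR.
        echo "ENROOT_NATIVE_OVERLAYFS=y is required on the ${backend} backend" >&2
        return 1
    fi
    return 0
}
```

With ENROOT_STORAGE_BACKEND=zfs and ENROOT_NATIVE_OVERLAYFS=no the guard passes; with ENROOT_STORAGE_BACKEND=dir and ENROOT_NATIVE_OVERLAYFS=no it fails, which matches the combinations listed in the test plan.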
What changed
- src/storage_zfs.sh: added two helpers that own the ZFS-specific lifecycle so docker.sh only dispatches:
  - zfs::container_check NAME: early-exit gate that errors (or destroys with --force) if a container of that name already exists. Used so docker::load can fail fast before downloading layers it would discard.
  - zfs::docker_install_from_layers CACHE_KEY LAYER_COUNT UNPRIV NAME: full template lifecycle, covering cache lookup, atomic .tmp lock, enroot-nsenter --user --remap-root + mount -t overlay lowerdir=0:1:…:N + tar-pipe into the template's mountpoint, promote (rename + snapshot @pristine + readonly=on), then clone for the user. Race-safe across concurrent loads of the same image (losers wait for @pristine); ENOSPC mid-merge destroys the .tmp so retries can run.
- src/docker.sh (docker::load): three small dispatches:
  - Precondition: ENROOT_NATIVE_OVERLAYFS=y is required only on the dir backend.
  - Existence check: zfs::container_check on ZFS, the existing realpath/rmall block on dir.
  - Merge step: zfs::docker_install_from_layers on ZFS, the existing enroot-nsenter + tar | tar on dir.
- doc/zfs.md: status note flipped to "Plans A, E, F implemented"; the "Where the ZFS backend is used" table row for enroot load docker:// rewritten to match the actual implementation (overlay merge reused; only the tar-pipe destination changes).
- CLAUDE.md: active-design-proposals line updated.
- doc/plans/2026-04-29-zfs-f-docker-load.md: rewritten to match the helper-based architecture.

Scope-reduction note (read this before merging)
This PR is not a like-for-like port of Docker's own zfs storage driver. The original Plan F design followed Docker's pattern: each image layer becomes its own ZFS dataset cloned from its parent's snapshot, with whiteouts applied per-layer in shell. We deliberately scoped that out in favor of a single-pass merge into one template per image. That's a real trade: the simpler design loses some properties worth flagging:

- […] dedup=on recovers it but at ~5–6 GB RAM per TB indexed). The per-layer-chain design dedups for free.
- […] alpine:3.21 replaces alpine:3.20, the per-layer design re-merges only the changed top layers; this design re-merges the full stack.
- […] zfs destroy a single suspect layer; this design throws out the whole template.
- […] zfs list -t all, layer-level zfs send).

What we got in return: no in-shell whiteout/opaque-dir handling, no per-layer locks, no zfs promote flattening dance, ~5–15× fewer dataset objects per image, and ~90% of the work is reused from the existing dir-backend pipeline.

This is a sensible default for sites with a small image set or where ZFS block-level dedup is acceptable. For HPC sites pulling many images that share common bases, a future plan ("Plan G") could add the per-layer chain as an opt-in alongside this code path (e.g. ENROOT_ZFS_LAYER_CHAIN=y), not as a replacement. Tracked as a known limitation.

Test Plan
Verified manually against a loopback ZFS pool on Linux 6.12.75 (aarch64), zfs-2.4.1:
- enroot load -n alpine_loaded docker://alpine succeeds with ENROOT_STORAGE_BACKEND=zfs ENROOT_NATIVE_OVERLAYFS=no; rootfs readable; enroot start prints alpine os-release; enroot remove reaps clone + template (refcount).
- enroot load -n debian_loaded docker://debian:stable-slim (multi-layer) succeeds; dpkg present in rootfs; zero .wh.* markers leaked through; enroot start prints Debian os-release.
- enroot load of the same image takes ~0.7s and logs "Reusing cached template "; templates dataset count unchanged.
- […] --force errors Container already exists: ...; with ENROOT_FORCE_OVERRIDE=y it overwrites.
- ENROOT_STORAGE_BACKEND=dir ENROOT_NATIVE_OVERLAYFS=no errors with the new combined precondition message; with ENROOT_NATIVE_OVERLAYFS=y the dir backend works as before.
- ENROOT_STORAGE_BACKEND=zfs ENROOT_NATIVE_OVERLAYFS=no succeeds.

Known limitations
- […] enroot-nsenter --user --remap-root and the kernel's overlay (or fuse-overlayfs); we just redirect the result.
- […] .tmp lock as Plan A's ensure_template.
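For the .tmp lock, the PR states that losers of a concurrent load wait for @pristine. One way that wait could look is sketched below. It is an assumption-laden illustration, not the code in src/storage_zfs.sh: the function name wait_for_pristine, the polling approach, and the timeout and interval parameters are all hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical loser path: poll until the winner publishes TEMPLATE@pristine.
# Usage: wait_for_pristine TEMPLATE_DATASET [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Returns 0 once the snapshot exists, 1 if the timeout is reached first.
wait_for_pristine() {
    local template="$1" timeout="${2:-300}" interval="${3:-1}"
    local waited=0

    until zfs list -H -t snapshot -o name "${template}@pristine" > /dev/null 2>&1; do
        if [ "${waited}" -ge "${timeout}" ]; then
            return 1
        fi
        sleep "${interval}"
        waited=$((waited + interval))
    done
    return 0
}
```

Once the wait returns successfully, the waiting process can clone the template's @pristine snapshot instead of repeating the merge.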