
Plan F: ZFS path for enroot load docker:// #3

Merged
sodre merged 5 commits into zenroot/main from feature/zfs-f-docker-load
Apr 29, 2026

Conversation


sodre commented Apr 29, 2026

Summary

When ENROOT_STORAGE_BACKEND=zfs, enroot load docker://... no longer requires ENROOT_NATIVE_OVERLAYFS=y. The merged image is materialized into a ZFS template dataset (cached by image config digest) and the user's container is a zfs clone of it. Default dir backend behavior is byte-for-byte preserved.

Implements Plan F. Builds on Plan A (#1).
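
For orientation, the end state in dataset terms. A minimal sketch with hypothetical pool and dataset names; the real layout is owned by src/storage_zfs.sh:

```sh
# Hypothetical names for illustration only.
# One readonly template per image, keyed by the image config digest,
# frozen behind a @pristine snapshot:
zfs list -o name,origin tank/enroot/templates/sha256-abc123
# Each user container is a cheap copy-on-write clone of that snapshot:
zfs clone tank/enroot/templates/sha256-abc123@pristine \
    tank/enroot/containers/alpine_loaded
```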

What changed

  • src/storage_zfs.sh — added two helpers that own the ZFS-specific lifecycle so docker.sh only dispatches:
    • zfs::container_check NAME — early-exit gate that errors (or destroys with --force) if a container of that name already exists. Used so docker::load can fail fast before downloading layers it would discard.
    • zfs::docker_install_from_layers CACHE_KEY LAYER_COUNT UNPRIV NAME — full template lifecycle: cache lookup, atomic .tmp lock, enroot-nsenter --user --remap-root + mount -t overlay lowerdir=0:1:…:N + tar-pipe into the template's mountpoint, promote (rename + snapshot @pristine + readonly=on), then clone for the user. Race-safe across concurrent loads of the same image (losers wait for @pristine); ENOSPC mid-merge destroys the .tmp so retries can run. A condensed sketch follows this list.
  • src/docker.sh (docker::load) — three small dispatches:
    • One-line precondition tweak: ENROOT_NATIVE_OVERLAYFS=y is required only on the dir backend.
    • One-line existence-check dispatch: zfs::container_check on ZFS, the existing realpath/rmall block on dir.
    • Two-line merge dispatch: zfs::docker_install_from_layers on ZFS, the existing enroot-nsenter + tar | tar on dir.
  • doc/zfs.md — status note flipped to "Plans A, E, F implemented"; the "Where the ZFS backend is used" table row for enroot load docker:// rewritten to match the actual implementation (overlay merge reused; only the tar-pipe destination changes).
  • CLAUDE.md — active-design-proposals line updated.
  • doc/plans/2026-04-29-zfs-f-docker-load.md — rewritten to match the helper-based architecture.
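
A condensed sketch of the zfs::docker_install_from_layers lifecycle, with hypothetical dataset names; the real helper additionally handles lock contention, destroys the .tmp on ENOSPC, and wraps the merge in enroot-nsenter --user --remap-root:

```sh
# Illustrative only; run from the directory _prepare_layers populated with 0/..N/.
tmpl="tank/enroot/templates/${cache_key}"            # cache key = image config digest
if ! zfs list -H -t snapshot "${tmpl}@pristine" > /dev/null 2>&1; then
    zfs create "${tmpl}.tmp"                         # doubles as the lock: fails if a peer won
    mnt=$(zfs get -H -o value mountpoint "${tmpl}.tmp")
    # Same merge as the dir backend: overlay-mount the layer directories and
    # tar-pipe the merged view; only the destination differs.
    mkdir -p merged
    mount -t overlay overlay -o "lowerdir=$(seq -s: 0 "${layer_count}")" merged
    tar -C merged -cpf - . | tar -C "${mnt}" -xpf -
    umount merged
    zfs rename "${tmpl}.tmp" "${tmpl}"               # promote: rename + snapshot + freeze
    zfs snapshot "${tmpl}@pristine"
    zfs set readonly=on "${tmpl}"
fi
zfs clone "${tmpl}@pristine" "tank/enroot/containers/${name}"
```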

Scope-reduction note (read this before merging)

This PR is not a like-for-like port of Docker's own zfs storage driver. The original Plan F design followed Docker's pattern: each image layer becomes its own ZFS dataset cloned from its parent's snapshot, with whiteouts applied per-layer in shell. We deliberately scoped that out in favor of a single-pass merge into one template per image. That is a real trade-off — the simpler design loses some properties worth flagging:

  • Cross-image layer dedup at the dataset level. Two images sharing a Debian base store the base's bytes twice under this design (block-level dedup=on recovers it, at a cost of roughly 5–6 GB of RAM per TB indexed; see the sketch after this list). The per-layer-chain design dedups for free.
  • Incremental re-pull cost on lineage updates. When alpine:3.21 replaces alpine:3.20, the per-layer design re-merges only the changed top layers; this design re-merges the full stack.
  • Layer-granular cache invalidation. Per-layer chain lets you zfs destroy a single suspect layer; this design throws out the whole template.
  • Native ZFS introspection of the layer chain (zfs list -t all, layer-level zfs send).
  • Quota accounting that matches intuition when many images share lower layers.
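
Where the duplicated base bytes matter, block-level dedup can be confined to the templates subtree. This is the RAM-for-space trade flagged in the first bullet; pool and dataset names are hypothetical:

```sh
zfs set dedup=on tank/enroot/templates   # affects newly written blocks only
zpool status -D tank                     # inspect dedup table (DDT) size and ratio
```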

What we got in return: no in-shell whiteout/opaque-dir handling, no per-layer locks, no zfs promote flattening dance, ~5–15× fewer dataset objects per image, and ~90% of the work reused from the existing dir-backend pipeline.

This is a sensible default for sites with a small image set or where ZFS block-level dedup is acceptable. For HPC sites pulling many images that share common bases, a future plan ("Plan G") could add the per-layer chain as an opt-in alongside this code path (e.g. ENROOT_ZFS_LAYER_CHAIN=y), not as a replacement. Tracked as a known limitation.

Test Plan

Verified manually against a loopback ZFS pool on Linux 6.12.75 (aarch64), zfs-2.4.1; a matching setup is sketched after this list:

  • enroot load -n alpine_loaded docker://alpine succeeds with ENROOT_STORAGE_BACKEND=zfs ENROOT_NATIVE_OVERLAYFS=no; rootfs readable; enroot start prints alpine os-release; enroot remove reaps clone + template (refcount).
  • enroot load -n debian_loaded docker://debian:stable-slim (multi-layer) succeeds; dpkg present in rootfs; zero .wh.* markers leaked through; enroot start prints Debian os-release.
  • Cache reuse: a second enroot load of the same image takes ~0.7s and logs "Reusing cached template"; the templates dataset count is unchanged.
  • Existence check: a second load with the same name and without --force fails with "Container already exists: ..."; with ENROOT_FORCE_OVERRIDE=y it overwrites.
  • Dir backend regression: ENROOT_STORAGE_BACKEND=dir ENROOT_NATIVE_OVERLAYFS=no errors with the new combined precondition message; with ENROOT_NATIVE_OVERLAYFS=y the dir backend works as before.
  • Precondition relaxation: ENROOT_STORAGE_BACKEND=zfs ENROOT_NATIVE_OVERLAYFS=no succeeds.
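
For reproducibility, a minimal loopback-pool setup matching the environment above. The pool name, image path, and the ENROOT_ZFS_ROOT variable are hypothetical; the real configuration knob is documented in doc/zfs.md:

```sh
truncate -s 8G /var/tmp/enroot-zfs.img
sudo zpool create enroot-test /var/tmp/enroot-zfs.img   # file vdevs are fine for testing
export ENROOT_STORAGE_BACKEND=zfs ENROOT_NATIVE_OVERLAYFS=no
export ENROOT_ZFS_ROOT=enroot-test                      # hypothetical knob; see doc/zfs.md
enroot load -n alpine_loaded docker://alpine
enroot start alpine_loaded cat /etc/os-release
enroot remove alpine_loaded              # reaps the clone (and the template, by refcount)
sudo zpool destroy enroot-test           # tear down the loopback pool
```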

Known limitations

  • Per-layer dedup across distinct images is not done at the dataset level (see scope-reduction note).
  • The merge still runs through enroot-nsenter --user --remap-root and the kernel's overlay (or fuse-overlayfs) — we just redirect the result.
  • Concurrent loads of the same image are race-safe via the same .tmp lock as Plan A's ensure_template; the waiting side of that lock is pictured below.
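
The loser side of that lock can be pictured as a small poll on the @pristine snapshot; an illustrative sketch only, not the actual code in src/storage_zfs.sh:

```sh
# A peer holds ${tmpl}.tmp: wait until its promote lands (@pristine appears) or
# the .tmp vanishes (the peer failed, e.g. on ENOSPC), in which case retry the create.
while ! zfs list -H -t snapshot "${tmpl}@pristine" > /dev/null 2>&1; do
    zfs list -H "${tmpl}.tmp" > /dev/null 2>&1 || break   # peer gone: safe to retry
    sleep 1
done
```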

sodre added 5 commits April 29, 2026 09:44
The original plan assumed _prepare_layers leaves layer tarballs in
the cwd and designed a per-layer ZFS clone chain with manual
whiteout handling. Reality: _prepare_layers untars each layer into
a numbered directory and runs enroot-aufs2ovlfs to convert
whiteouts in-place. After it returns, directories 0/, 1/, ..., N/
are extracted, whiteout-converted layer trees ready for an
overlayfs lowerdir mount.

The revised plan reuses the dir backend's existing single-pass
overlay+tar-pipe mechanism unchanged, only redirecting the
destination from a directory to a ZFS clone's mountpoint that we
create outside the user namespace (per Plan E's userns-zfs lesson).
No new C helpers, no per-layer chain, no manual whiteout merge —
the kernel's overlay support is still the merge engine.

zfs::ensure_template_from_target is added as the atomic-template
lifecycle helper, separating the create-tmp/promote/snapshot/lock
machinery from the content-fill mechanism (which docker::load
provides via the existing nsenter+overlay+tar-pipe).

Signed-off-by: Patrick Sodré <patrick@zero-ae.com>
Lifts the ENROOT_NATIVE_OVERLAYFS=y precondition when the ZFS backend
is active; the ZFS path doesn't depend on kernel overlayfs after
the merge step (overlayfs is still used INSIDE the merge, but only
to materialize the merged tree; the result is captured into a ZFS
clone via tar-pipe).

ZFS path:
- Early existence check on the user's target name so we don't
  download layers we'll discard.
- Cache template by image config sha256 (the docker config blob's
  digest, available from _prepare_layers' first output line).
- Atomic .tmp dataset lock pattern same as zfs::ensure_template.
- Reuse the existing enroot-nsenter --user --remap-root + overlay
  + tar-pipe pipeline; only the destination changes from the
  user's rootfs directory to the template clone's mountpoint.
- Promote on success: rename, snapshot @pristine, readonly=on.
- Clone for user via zfs::clone_container.

Dir backend behavior is byte-for-byte preserved.

Signed-off-by: Patrick Sodré <patrick@zero-ae.com>
Signed-off-by: Patrick Sodré <patrick@zero-ae.com>
Per review feedback: keep ZFS code together so the delta to
existing files stays minimal. The previous commit had ~50 lines of
inline ZFS lifecycle (template caching, .tmp lock, overlay merge,
promotion, race-wait) directly in docker::load; this refactor moves
them into two src/storage_zfs.sh helpers and shrinks the docker.sh
delta to a precondition tweak + an early existence check + a single
dispatch call.

New helpers in src/storage_zfs.sh:

- zfs::container_check NAME
    Errors (or destroys with --force) if the container name already
    exists in the ZFS store. Used as an early-exit gate so we don't
    download Docker layers we'd just throw away.

- zfs::docker_install_from_layers CACHE_KEY LAYER_COUNT UNPRIV NAME
    Materializes the merged Docker rootfs into a ZFS template
    (cached by CACHE_KEY) and clones it as the user's container.
    Atomic across concurrent loads via the same .tmp lock pattern
    as zfs::ensure_template. Designed to be called from docker::load
    AFTER docker::_prepare_layers populates the cwd with directories
    0/, 1/, ..., N/. ENOSPC mid-merge destroys the .tmp so retries
    can run.

docker::load delta is now:
- one-line precondition tweak (relax NATIVE_OVERLAYFS=y on ZFS)
- one-line existence-check dispatch
- two-line merge-step dispatch

No behavior change; verified end-to-end (load, cache reuse,
existence error, force overwrite, dir-backend regression).

Signed-off-by: Patrick Sodré <patrick@zero-ae.com>
Signed-off-by: Patrick Sodré <patrick@zero-ae.com>