Overlay could be implemented portably, unprivileged #1207

DrDaveD · 2017-12-07T02:23:18Z

Version of Singularity:

All of them.

Expected behavior

We should be able to bind mount to any directory, whether or not overlayfs is available. The fact that we can't is becoming a serious impediment for at least one LHC experiment's adoption of singularity.

Actual behavior

Bind mounting arbitrary directories only works when overlay is available, which is only on some operating systems and which requires root privileges.

Steps to reproduce behavior

Overlayfs could be avoided by using a different technique. You might call it "underlay" instead of "overlay". Simply create a scratch directory with mount points for all the -B options plus all the non-conflicting directories and files from the image. Bind mount in all the -B's and all the non-conflicting directories and files from the image onto this underlaying directory tree. We have tried this out with a script and passing a lot of extra -B options to singularity, and it works, but it would be much better if singularity did this itself instead of using overlayfs.
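
For anyone wanting to try the workaround, here is a minimal sketch of how the extra -B options might be generated. It is not the script mentioned above. The function name `underlay_bind_opts` is hypothetical, and it assumes that "conflicting" means a top-level image entry whose path is exactly one of the requested bind destinations; nested destinations are not handled.

```shell
# Hypothetical helper (not part of singularity): print one "-B src:dest"
# option per top-level image entry that is not itself a requested bind
# destination.
# usage: underlay_bind_opts IMAGE_DIR [BIND_DEST ...]
underlay_bind_opts() {
    local image="$1"
    shift
    local entry name dest skip
    for entry in "$image"/*; do
        [ -e "$entry" ] || continue        # empty image dir: glob did not match
        name=${entry##*/}                  # basename of the image entry
        skip=no
        for dest in "$@"; do
            # Assumption: only an exact top-level match counts as a conflict.
            [ "$dest" = "/$name" ] && skip=yes
        done
        if [ "$skip" = no ]; then
            printf '%s %s:/%s\n' -B "$entry" "$name"
        fi
    done
}
```

The printed options would be passed to singularity alongside the user's own -B options, with the scratch directory providing the mount points.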

DrDaveD · 2017-12-07T14:54:08Z

A problem when we simulate it completely unprivileged using singularity -B options is that '/' inside of the container is still writable by the user, and we want to completely isolate the user's code. If it were implemented inside of singularity, that problem could be solved by making '/' a read-only bind mount.

Here's a way to demonstrate that the concept works, using the userns_child_exec program from the series on lwn.net on "Namespaces in operation", on a kernel that has user mount namespaces enabled:

$ userns_child_exec -U -m -M "0 `id -u` 1" bash
# id
uid=0(root) gid=65534(nfsnobody) groups=65534(nfsnobody)
context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
# cd /tmp
# mkdir userns
# cd userns
# mkdir scratch final
# mount --bind scratch final
# mount --bind -o remount,ro final
# IMAGE=/cvmfs/cernvm-prod.cern.ch/cvm3
# for d in `ls $IMAGE`; do
    mkdir scratch/$d
    case $d in
        proc) mount -t proc proc final/proc;;
        sys)  mount -t sysfs sys final/sys;;
        tmp)  mount --bind /tmp final/tmp;;
        *)  mount --bind $IMAGE/$d final/$d;;
    esac
  done
# chroot final /bin/bash
# PATH=/bin
# mkdir /testing
mkdir: cannot create directory `/testing': Read-only file system

bauerm97 · 2017-12-08T00:34:40Z

So if I understand the idea correctly, the steps to this would be as follows:

1. Make a scratch directory
2. Populate scratch directory with mount points for each of the -B options, as well as for the directories in the root of the image
3. Bind mount all of the dirs from step 2
4. chroot into the scratch directory

Is that correct?

DrDaveD · 2017-12-08T12:33:32Z

That's not quite enough steps. Replace step 4 with:

4. Bind mount the scratch directory to a final directory.
5. Remount the final directory as read-only.
6. chroot into the final directory.

DrDaveD · 2017-12-08T12:46:27Z

No wait, that's not quite right either. Let me start from step 3 instead:

3. Bind mount the scratch directory to a final directory.
4. Remount the final directory as read-only.
5. Bind mount all the dirs from step 2 onto the final directory.
6. chroot into the final directory.
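
To make the corrected ordering concrete, here is a dry-run sketch that only prints the commands in that order and runs nothing. The function name `underlay_plan` and its arguments are hypothetical. Like the demonstration earlier in the thread, it treats every top-level image entry as a directory, and unlike that demonstration it does not special-case proc, sys or tmp.

```shell
# Hypothetical dry-run helper: print (do not execute) the underlay commands.
# usage: underlay_plan IMAGE_DIR SCRATCH FINAL [SRC:DEST ...]
underlay_plan() {
    local image="$1" scratch="$2" final="$3"
    shift 3
    local entry spec
    # Steps 1-2: scratch tree with mount points for image entries and -B dests.
    echo "mkdir -p $scratch $final"
    for entry in "$image"/*; do
        [ -e "$entry" ] || continue
        echo "mkdir -p $scratch/${entry##*/}"
    done
    for spec in "$@"; do
        echo "mkdir -p $scratch${spec#*:}"
    done
    # Steps 3-4: expose scratch as final, then make final read-only.
    echo "mount --bind $scratch $final"
    echo "mount --bind -o remount,ro $final"
    # Step 5: bind image entries first, then the -B style SRC:DEST pairs,
    # so a later bind shadows an image entry with the same destination.
    for entry in "$image"/*; do
        [ -e "$entry" ] || continue
        echo "mount --bind $entry $final/${entry##*/}"
    done
    for spec in "$@"; do
        echo "mount --bind ${spec%%:*} $final${spec#*:}"
    done
    # Step 6
    echo "chroot $final /bin/bash"
}
```

The point of the ordering is that the read-only remount applies to '/' in the final tree before anything is bound on top of it.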

DrDaveD · 2017-12-08T15:05:44Z

The one minor downside I can think of with this, compared to overlay, is that if new directories are added to the underlying image during the run, in either '/' or other places where -B options added bind mounts, the new directories will not show up to the user. That might be a reason to continue to support the current behavior with overlayfs as an option, but I think it makes sense for this mode to be the default because in most cases users won't care about that difference.

dtrudg · 2017-12-08T17:24:48Z

I guess this would fix the issue addressed by PR #1124? That is where I need to support the narrow case of docker images not having our /home2 user home location, on RHEL 6.x without overlay available.

olifre · 2017-12-08T17:27:38Z

Isn't PR #1124 about creating additional directories at "build" time?
This issue here is about adding directories after the image is built, extracted (i.e. "sandbox mode") and shipped via a readonly-FS (e.g. CVMFS).
So mkdir -p inside the image's root will not work.

dtrudg · 2017-12-08T17:32:05Z

OK - yes PR #1124 does create at build time. Reading the above I wasn't clear that this was sandbox mode specific.

afortiorama · 2017-12-14T09:33:32Z

This is to mount $PWD in an isolated manner, the way $HOME is mounted, but without requiring a top directory. It would be good if this were also an option in the configuration file, not only via -B.

DrDaveD · 2018-01-09T16:04:16Z

@bauerm97 This is assigned to you -- does that mean you plan to work on this? If so, any progress?

DrDaveD · 2018-01-10T20:01:12Z

@dctrud and @olifre I think this might fix the issue addressed by PR #1124 because you wouldn't have to add extra directories at build time if they weren't needed at run time. Right?

olifre · 2018-01-10T20:09:52Z

@DrDaveD Yes, indeed.
@dctrud Referencing your earlier comment: @DrDaveD's suggestion is actually not sandbox-mode specific.
Of course, it really helps in unprivileged (sandbox) mode, when overlayfs cannot be used since singularity runs unprivileged. But this solution could also be used for a privileged singularity which cannot use overlayfs (or is disallowed from using overlayfs by configuration).

dtrudg · 2018-01-10T22:13:03Z

@DrDaveD @olifre - cool, my gut feeling from a quick scan previously was that it would fix the issue of PR #1124, but then I got confused. We're using PR #1124 in production, as mounting without overlayfs available is pretty much essential for how anyone wants to use Singularity here. Since I'm a Python person, and wasn't sure how things would change with squashfs and now the SIF image changes, I thought injecting the bind points with a new layer tar.gz was just a nice 'safe' place to do it that I wouldn't need to mess with in future :-)

olifre · 2018-01-15T12:40:45Z

@DrDaveD and others: You may be interested in the latest comment on opencontainers/runc#1671 which references opencontainers/runc#1688 . Using rootless (unprivileged) containers mapping the uid to uid 0 inside the container (which does not grant extra privileges on the host) appears to allow usage of overlayfs inside the container.
With some extra logic outside, this can be used to create an arbitrary filesystem structure, I believe.
However, it requires overlayfs support, of course.

DrDaveD · 2018-01-16T23:09:52Z

@olifre I believe that works only in Ubuntu which has a custom kernel adding support for overlayfs in unprivileged user namespaces.

AkihiroSuda · 2018-01-17T06:45:37Z

Unfortunately, it requires an Ubuntu-specific kernel patch: http://kernel.ubuntu.com/git/ubuntu/ubuntu-artful.git/commit/fs/overlayfs?h=Ubuntu-4.13.0-25.29&id=0a414bdc3d01f3b61ed86cfe3ce8b63a9240eba7

However, if you are on XFS, I believe you could use copy_file_range(2) for copying files with reflink (CoW).
Since it requires executing the syscall on every file, the copy operation would be slower than overlay though.
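
As a rough illustration of the copy-with-reflink idea from the shell, GNU cp can be asked to clone where the filesystem supports it. This is a stand-in, not a direct call to copy_file_range(2): what cp does underneath depends on the coreutils version, and with `--reflink=auto` it silently falls back to an ordinary copy on filesystems without CoW support. The function name `clone_tree` is hypothetical.

```shell
# Hypothetical helper: copy a directory tree, sharing data blocks via
# reflink (CoW) where the filesystem allows it. Requires GNU cp.
# DEST_DIR must not exist yet; otherwise cp copies SRC_DIR into it.
# usage: clone_tree SRC_DIR DEST_DIR
clone_tree() {
    cp -a --reflink=auto -- "$1" "$2"
}
```

Even when every file is reflinked, cp still has to visit each file in the tree, which is the per-file cost mentioned above.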

olifre · 2018-01-17T09:11:56Z

@DrDaveD Thanks for clarifying!

@AkihiroSuda Do you happen to know what chances exist this is coming to vanilla kernels / what are the future plans?

> However, if you are on XFS [...]

Nice idea! That also works on btrfs, but for the use-cases throughout WLCG, the filesystem on which the containers are shipped is CVMFS (a read-only, HTTP-based, deduping filesystem with heavy caching), so at least for many of the actual users this would not help. Also, on btrfs / XFS the reflinking overhead may not be negligible for short container runtimes.

AkihiroSuda · 2018-01-17T09:24:47Z

> Do you happen to know what chances exist this is coming to vanilla kernels / what are the future plans?

I don't know, probably unlikely.

> the use-cases used throughout WLCG, the filesystem on which the containers are shipped is CVMFS

You might want to use some FUSE unionfs implementation, although it would be slow and likely to cause compatibility issues.

cyphar · 2018-02-19T21:57:11Z

@olifre Eric Biederman is currently working on making FUSE safe for unprivileged users. This work should allow for unprivileged overlayfs to be supported in upstream kernels (finally) -- the main concern with overlay is related to the permissions model for the current implementation.

olifre · 2018-02-19T22:05:26Z

@cyphar This looks interesting, many thanks for the heads-up on this matter!

DrDaveD · 2018-06-05T20:18:39Z

I am working on a PR for this.

stevekm · 2018-06-06T03:34:11Z

What are the software requirements of the expected PR, and will it eventually be available on CentOS 6?

DrDaveD · 2018-06-06T11:28:24Z

It should work everywhere, definitely CentOS 6.

giuseppe · 2018-06-25T12:22:25Z

FYI, I've played a bit with a FUSE implementation of overlay: https://github.com/giuseppe/containerfs

Working with the FUSE low level interface turned out to be much harder than I initially thought, so the current implementation is quite hacky/broken, although I managed to get it running in a userNS using Linux 4.18 and run some rootless containers.

DrDaveD · 2018-10-30T21:43:06Z

This is in singularity-3.0.0 and in EPEL's singularity-2.6.0.

olifre mentioned this issue Dec 7, 2017

Add support for bind mounts to directories not existing within a container on read-only FS hpc/charliecloud#96

Closed

olifre mentioned this issue Dec 8, 2017

Support for arbitrary bind mount directories on read-only FS for rootless containers opencontainers/runc#1671

Open

bauerm97 self-assigned this Dec 11, 2017

DrDaveD mentioned this issue Jan 10, 2018

Allow user pull/run of docker images with non-standard $HOME / missing bind paths and no overlayfs #1124

Closed

5 tasks

DrDaveD mentioned this issue Jan 17, 2018

Unable to bind sub-directory unless parent directory has r+w for 'other' users #1249

Closed

This was referenced Feb 7, 2018

[Hackathon] TODO list #1281

Closed

OverlayFS and 2.6.32-573 kernel #747

Closed

ArangoGutierrez added the hackathon label Feb 7, 2018

This was referenced Jun 6, 2018

Consolidate mounting code #1624

Closed

Add the underlay feature #1638

Closed

lukasheinrich mentioned this issue Jul 25, 2018

roadmap containers/fuse-overlayfs#4

Open

DrDaveD mentioned this issue Aug 14, 2018

External mounts without OverlayFS #1807

Closed

DrDaveD closed this as completed Oct 30, 2018
