Overlay could be implemented portably, unprivileged #1207

Closed
DrDaveD opened this issue Dec 7, 2017 · 25 comments

@DrDaveD
Collaborator

DrDaveD commented Dec 7, 2017

Version of Singularity:

All of them.

Expected behavior

We should be able to bind mount to any directory, whether or not overlayfs is available. The fact that we can't is becoming a serious impediment for at least one LHC experiment's adoption of singularity.

Actual behavior

Bind mounting arbitrary directories only works when overlay is available, which is only on some operating systems and which requires root privileges.

Steps to reproduce behavior

Overlayfs could be avoided by using a different technique. You might call it "underlay" instead of "overlay". Simply create a scratch directory with mount points for all the -B options plus all the non-conflicting directories and files from the image. Bind mount in all the -B's and all the non-conflicting directories and files from the image onto this underlaying directory tree. We have tried this out with a script and passing a lot of extra -B options to singularity, and it works, but it would be much better if singularity did this itself instead of using overlayfs.
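As a rough sketch of the "non-conflicting" part of that idea, the helper below lists the top-level entries of an image that are not shadowed by a requested bind destination. The function name `nonconflicting_entries` and its arguments are hypothetical, chosen only for illustration, and it assumes bind destinations are top-level paths such as `/data`; nested destinations and dotfiles are not handled.

```shell
#!/bin/bash
# Hypothetical helper (not part of singularity): print the top-level entries
# of IMAGE_DIR whose names do not collide with any requested bind destination.
# Assumption: bind destinations are top-level paths such as /data.
# usage: nonconflicting_entries IMAGE_DIR BIND_DEST...
nonconflicting_entries() {
    local image=$1; shift
    local entry name dest conflict
    for entry in "$image"/*; do
        [ -e "$entry" ] || continue      # image directory is empty
        name=${entry##*/}                # strip the leading path
        conflict=no
        for dest in "$@"; do
            if [ "$dest" = "/$name" ]; then
                conflict=yes
                break
            fi
        done
        if [ "$conflict" = no ]; then
            echo "$name"
        fi
    done
    return 0
}
```

For example, `nonconflicting_entries /path/to/image /data /cvmfs` would print every top-level image entry except `data` and `cvmfs`; those two would instead get empty mount points in the scratch directory for the -B binds.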

@DrDaveD
Collaborator Author

DrDaveD commented Dec 7, 2017

A problem when we simulate it completely unprivileged using singularity -B options is that '/' inside of the container is still writable by the user, and we want to completely isolate the user's code. If it were implemented inside of singularity, that problem could be solved by making '/' a read-only bind mount.

Here's a way to demonstrate that the concept works, using the userns_child_exec program from the series on lwn.net on "Namespaces in operation", on a kernel that has user mount namespaces enabled:

$ userns_child_exec -U -m -M "0 `id -u` 1" bash
# id
uid=0(root) gid=65534(nfsnobody) groups=65534(nfsnobody)
context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
# cd /tmp
# mkdir userns
# cd userns
# mkdir scratch final
# mount --bind scratch final
# mount --bind -o remount,ro final
# IMAGE=/cvmfs/cernvm-prod.cern.ch/cvm3
# for d in `ls $IMAGE`; do
    mkdir scratch/$d
    case $d in
        proc) mount -t proc proc final/proc;;
        sys)  mount -t sysfs sys final/sys;;
        tmp)  mount --bind /tmp final/tmp;;
        *)  mount --bind $IMAGE/$d final/$d;;
    esac
  done
# chroot final /bin/bash
# PATH=/bin
# mkdir /testing
mkdir: cannot create directory `/testing': Read-only file system

@bauerm97
Contributor

bauerm97 commented Dec 8, 2017

So if I understand the idea correctly, the steps to this would be as follows:

  1. Make a scratch directory
  2. Populate scratch directory with mount points for each of the -B options, as well as for the directories in the root of the image
  3. Bind mount all of the dirs from step 2
  4. chroot into the scratch directory

Is that correct?

@DrDaveD
Collaborator Author

DrDaveD commented Dec 8, 2017

That's not quite enough steps. Replace step 4 with:

  1. Bind mount the scratch directory to a final directory.
  2. Remount the final directory as read-only
  3. chroot into the final directory.

@DrDaveD
Collaborator Author

DrDaveD commented Dec 8, 2017

No wait, that's not quite right either. Let me start from step 3 instead:

  1. Bind mount the scratch directory to a final directory.
  2. Remount the final directory as read-only.
  3. Bind mount all the dirs from step 2 onto the final directory.
  4. chroot into the final directory.
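The corrected ordering can be sketched as a dry run that only prints the commands, so it can be inspected without any privileges. The function name `underlay_plan` and the `SRC:DEST` argument format are assumptions made for this illustration, and it handles directories only (file mount points would need `touch` rather than `mkdir`).

```shell
#!/bin/bash
# Dry run only: print the commands for the underlay steps in order,
# without executing any of them.
# Hypothetical interface: underlay_plan SCRATCH FINAL SRC:DEST...
underlay_plan() {
    local scratch=$1 final=$2; shift 2
    local pair src dest
    # Create the mount points while the scratch directory is still writable.
    for pair in "$@"; do
        dest=${pair#*:}                  # text after the first colon
        echo "mkdir -p $scratch$dest"
    done
    # Expose the scratch tree at the final location, then make it read-only.
    echo "mount --bind $scratch $final"
    echo "mount --bind -o remount,ro $final"
    # Bind each source onto its mount point under the final directory.
    for pair in "$@"; do
        src=${pair%%:*}                  # text before the first colon
        dest=${pair#*:}
        echo "mount --bind $src $final$dest"
    done
    echo "chroot $final /bin/bash"
}
```

Calling `underlay_plan /tmp/scratch /tmp/final /image/bin:/bin /home/user:/work` prints the two `mkdir` lines first, then the bind and read-only remount of the scratch tree, then the two content binds, and finally the `chroot`.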

@DrDaveD
Collaborator Author

DrDaveD commented Dec 8, 2017

The one minor downside I can think of with this approach, compared to overlay, is that if new directories are added to the underlying image during the run, in either '/' or other places where -B options added bind mounts, the new directories will not show up for the user. That might be a reason to continue to support the current behavior with overlayfs as an option, but I think it makes sense for this mode to be the default because in most cases users won't care about that difference.

@dtrudg
Contributor

dtrudg commented Dec 8, 2017

I guess this would fix the issue addressed by PR #1124 ? - where I need to support the narrow case of docker images not having our /home2 user home location on RHEL 6.x without overlay available

@olifre
Contributor

olifre commented Dec 8, 2017

Isn't PR #1124 about creating additional directories at "build" time?
This issue here is about adding directories after the image is built, extracted (i.e. "sandbox mode") and shipped via a readonly-FS (e.g. CVMFS).
So mkdir -p inside the image's root will not work.

@dtrudg
Contributor

dtrudg commented Dec 8, 2017

OK - yes PR #1124 does create at build time. Reading the above I wasn't clear that this was sandbox mode specific.

@bauerm97 bauerm97 self-assigned this Dec 11, 2017
@afortiorama

This is to mount $PWD in an isolated manner, the way $HOME is mounted, but without requiring a top directory. It would be good if this were also an option in the configuration file, not only via -B.

@DrDaveD
Collaborator Author

DrDaveD commented Jan 9, 2018

@bauerm97 This is assigned to you -- does that mean you plan to work on this? If so, any progress?

@DrDaveD
Collaborator Author

DrDaveD commented Jan 10, 2018

@dctrud and @olifre I think this might fix the issue addressed by PR #1124 because you wouldn't have to add extra directories at build time if they weren't needed at run time. Right?

@olifre
Contributor

olifre commented Jan 10, 2018

@DrDaveD Yes, indeed.
@dctrud Referencing your earlier comment: @DrDaveD's suggestion is actually not sandbox-mode specific.
Of course, it really helps in unprivileged (sandbox) mode, where overlayfs cannot be used because singularity runs unprivileged, but this solution could also be used by a privileged singularity that cannot use overlayfs (or is not allowed to by configuration).
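One rough way to tell whether the running kernel currently offers overlay at all is to look for it in `/proc/filesystems`. The sketch below does only that; the function name `overlay_listed` is made up for illustration. A listing there does not mean an unprivileged user may mount it, and an overlay module that is available but not yet loaded will not be listed.

```shell
#!/bin/bash
# Hypothetical check: succeed if "overlay" appears as a filesystem type in
# the given file (default /proc/filesystems).
# Each line there has the form "[nodev]<TAB>fstype".
# usage: overlay_listed [FILESYSTEMS_FILE]
overlay_listed() {
    local fsfile=${1:-/proc/filesystems}
    # Compare the last field exactly so that similar names do not match.
    awk '$NF == "overlay" { found = 1 } END { exit !found }' "$fsfile"
}
```

A caller could then fall back to the bind-mount approach with something like `overlay_listed || echo "overlay not listed, using bind mounts"`.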

@dtrudg
Contributor

dtrudg commented Jan 10, 2018

@DrDaveD @olifre - cool, my gut feeling from a quick scan previously was that it would fix the issue addressed by PR #1124 - but then I got confused. We're using PR #1124 in production, as mounting without overlayfs available is pretty much essential for how anyone wants to use Singularity here. Since I'm a Python person, and wasn't sure how things would change with squashfs and now the SIF image changes, I thought injecting the bind points with a new layer tar.gz was just a nice 'safe' place to do it that I wouldn't need to mess with in future :-)

@olifre
Contributor

olifre commented Jan 15, 2018

@DrDaveD and others: You may be interested in the latest comment on opencontainers/runc#1671 which references opencontainers/runc#1688 . Using rootless (unprivileged) containers mapping the uid to uid 0 inside the container (which does not grant extra privileges on the host) appears to allow usage of overlayfs inside the container.
With some extra logic outside, this can be used to create an arbitrary filesystem structure, I believe.
However, it requires overlayfs support, of course.

@DrDaveD
Collaborator Author

DrDaveD commented Jan 16, 2018

@olifre I believe that works only in Ubuntu which has a custom kernel adding support for overlayfs in unprivileged user namespaces.

@AkihiroSuda

Unfortunately, it requires an Ubuntu-specific kernel patch: http://kernel.ubuntu.com/git/ubuntu/ubuntu-artful.git/commit/fs/overlayfs?h=Ubuntu-4.13.0-25.29&id=0a414bdc3d01f3b61ed86cfe3ce8b63a9240eba7

However, if you are on XFS, I believe you could use copy_file_range(2) for copying files with reflink (CoW).
Since the syscall has to be executed on every file, the copy operation would be slower than overlay, though.
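From the shell, a similar copy-on-write copy can be approximated with GNU `cp --reflink=auto`. Note that this swaps in GNU cp's clone support rather than calling copy_file_range(2) directly. The function name `cow_copy_tree` is hypothetical, and GNU coreutils is assumed.

```shell
#!/bin/bash
# Hypothetical helper: copy the contents of SRC_DIR into DEST_DIR, sharing
# data blocks where the filesystem supports reflinks (e.g. XFS, btrfs).
# Assumes GNU cp; --reflink=auto falls back to a regular copy elsewhere.
# usage: cow_copy_tree SRC_DIR DEST_DIR
cow_copy_tree() {
    local src=$1 dest=$2
    mkdir -p "$dest"
    # -a preserves modes and timestamps (and ownership where permitted);
    # "$src"/. copies the directory contents, including dotfiles.
    cp -a --reflink=auto "$src"/. "$dest"/
}
```

Every file in the tree still has to be visited, so the per-file cost mentioned above applies here as well.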

@olifre
Contributor

olifre commented Jan 17, 2018

@DrDaveD Thanks for clarifying!

@AkihiroSuda Do you happen to know what chances exist this is coming to vanilla kernels / what are the future plans?

However, if you are on XFS [...]

Nice idea! That also works on btrfs, but for the use-cases throughout WLCG, the filesystem on which the containers are shipped is CVMFS (a read-only, HTTP-based, deduping filesystem with heavy caching), so at least for many of the actual users this would not help. Also, on btrfs / XFS the reflinking overhead for short container runtimes may not be negligible.

@AkihiroSuda

Do you happen to know what chances exist this is coming to vanilla kernels / what are the future plans?

I don't know, probably unlikely.

the use-cases used throughout WLCG, the filesystem on which the containers are shipped is CVMFS

You might want to use some FUSE unionfs implementation, although it would be slow and likely to cause compatibility issues.

@cyphar

cyphar commented Feb 19, 2018

@olifre Eric Biederman is currently working on making FUSE safe for unprivileged users. This work should allow for unprivileged overlayfs to be supported in upstream kernels (finally) -- the main concern with overlay is related to the permissions model for the current implementation.

@olifre
Contributor

olifre commented Feb 19, 2018

@cyphar This looks interesting, many thanks for the heads-up on this matter!

@DrDaveD
Collaborator Author

DrDaveD commented Jun 5, 2018

I am working on a PR for this.

@stevekm

stevekm commented Jun 6, 2018

What are the software requirements of the expected PR, and will it eventually be available on Centos6?

@DrDaveD
Collaborator Author

DrDaveD commented Jun 6, 2018

It should work everywhere, definitely CentOS6.

This was referenced Jun 6, 2018
@giuseppe

FYI, I've played a bit with a FUSE implementation of overlay: https://github.com/giuseppe/containerfs

Working with the FUSE low-level interface turned out to be much harder than I initially thought, so the current implementation is quite hacky/broken, although I managed to get it running in a userNS using Linux 4.18 and run some rootless containers.

@DrDaveD
Collaborator Author

DrDaveD commented Oct 30, 2018

This is in singularity-3.0.0 and in EPEL's singularity-2.6.0.

@DrDaveD DrDaveD closed this as completed Oct 30, 2018