Overlay could be implemented portably, unprivileged #1207
Comments
A problem when we simulate it completely unprivileged using singularity -B options is that '/' inside of the container is still writable by the user, and we want to completely isolate the user's code. If it were implemented inside of singularity, that problem could be solved by making '/' a read-only bind mount. Here's a way to demonstrate that the concept works, using the userns_child_exec program from the series on lwn.net on "Namespaces in operation", on a kernel that has user mount namespaces enabled:
|
So if I understand the idea correctly, the steps to this would be as follows:
Is that correct? |
That's not quite enough steps. Replace step 4 with:
|
No wait, that's not quite right either. Let me start from step 3 instead:
|
The one minor downside I can think of with this approach compared to overlay is that if new directories are added to the underlying image during the run, in either '/' or other places where -B options added bind mounts, the new directories will not show up to the user. That might be a reason to continue to support the current behavior with overlayfs as an option, but I think it makes sense for this mode to be the default because in most cases users won't care about that difference. |
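To make the limitation above concrete, here is a minimal sketch in Python. It assumes the list of image entries is captured once at container start; the function name and the temporary directory standing in for the image are hypothetical, and this is not how singularity itself is implemented.

```python
import os
import tempfile


def snapshot_top_level(image_root):
    """Record the image's top-level entries once, as a startup-time scan would."""
    return sorted(os.listdir(image_root))


# Hypothetical stand-in for an unpacked image.
image_root = tempfile.mkdtemp()
os.mkdir(os.path.join(image_root, "bin"))

# Entries captured when the container starts.
visible = snapshot_top_level(image_root)

# A directory is added to the underlying image while the container is running.
os.mkdir(os.path.join(image_root, "newdir"))

# Entries that exist in the image now but were never given a mount point.
missing = sorted(set(os.listdir(image_root)) - set(visible))
print(missing)
```

With overlayfs the lower layer is consulted on each lookup, so the late addition would normally appear; with a fixed set of bind mounts it does not.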
I guess this would fix the issue addressed by PR #1124 ? - where I need to support the narrow case of docker images not having our /home2 user home location on RHEL 6.x without overlay available |
Isn't PR #1124 about creating additional directories at "build" time? |
OK - yes PR #1124 does create at build time. Reading the above I wasn't clear that this was sandbox mode specific. |
This is to mount $PWD in an isolated manner, the way $HOME is mounted, but without requiring a top directory. It'd be good if this were also an option in the configuration file, not only via -B. |
@bauerm97 This is assigned to you -- does that mean you plan to work on this? If so, any progress? |
@DrDaveD Yes, indeed. |
@DrDaveD @olifre - cool, my gut feeling from a quick scan previously was that it would fix the issue of PR #1124 - but then I got confused. We're using PR #1124 in production, as mounting without overlayfs available is pretty much essential for how anyone wants to use Singularity here. Since I'm a Python person, and wasn't sure how things would change with squashfs and now the SIF image changes, I thought injecting the bind points with a new layer tar.gz was just a nice 'safe' place to do it that I wouldn't need to mess with in future :-) |
@DrDaveD and others: You may be interested in the latest comment on opencontainers/runc#1671 which references opencontainers/runc#1688 . Using rootless (unprivileged) containers mapping the uid to uid 0 inside the container (which does not grant extra privileges on the host) appears to allow usage of overlayfs inside the container. |
@olifre I believe that works only in Ubuntu which has a custom kernel adding support for overlayfs in unprivileged user namespaces. |
Unfortunately, it requires an Ubuntu-specific kernel patch: http://kernel.ubuntu.com/git/ubuntu/ubuntu-artful.git/commit/fs/overlayfs?h=Ubuntu-4.13.0-25.29&id=0a414bdc3d01f3b61ed86cfe3ce8b63a9240eba7 However, if you are on XFS, I believe you could use |
@DrDaveD Thanks for clarifying! @AkihiroSuda Do you happen to know what chances exist this is coming to vanilla kernels / what are the future plans?
Nice idea! That also works on btrfs, but for the use-cases used throughout WLCG, the filesystem on which the containers are shipped is CVMFS (a read-only, HTTP-based, deduping filesystem with heavy caching), so at least for many of the actual users this would not help. Also, on btrfs / XFS the reflinking overhead for short container runtimes may not be negligible. |
I don't know, probably unlikely.
You might want to use some FUSE unionfs implementation, although it would be slow and likely to cause compatibility issues. |
@olifre Eric Biederman is currently working on making FUSE safe for unprivileged users. This work should allow for unprivileged |
@cyphar This looks interesting, many thanks for the heads-up on this matter! |
I am working on a PR for this. |
What are the software requirements of the expected PR, and will it eventually be available on CentOS 6? |
It should work everywhere, definitely CentOS 6. |
FYI, I've played a bit with a FUSE implementation of overlay: https://github.com/giuseppe/containerfs Working with the FUSE low-level interface turned out to be much harder than I initially thought, so the current implementation is quite hacky/broken, although I managed to get it running in a userNS using Linux 4.18 and run some rootless containers. |
This is in singularity-3.0.0 and in EPEL's singularity-2.6.0. |
Version of Singularity:
All of them.
Expected behavior
We should be able to bind mount to any directory, whether or not overlayfs is available. The fact that we can't is becoming a serious impediment for at least one LHC experiment's adoption of singularity.
Actual behavior
Bind mounting arbitrary directories only works when overlay is available, which is only on some operating systems and which requires root privileges.
Steps to reproduce behavior
Overlayfs could be avoided by using a different technique. You might call it "underlay" instead of "overlay". Simply create a scratch directory with mount points for all the -B options plus all the non-conflicting directories and files from the image. Then bind mount all the -B sources and all the non-conflicting directories and files from the image onto this underlying directory tree. We have tried this out with a script that passes a lot of extra -B options to singularity, and it works, but it would be much better if singularity did this itself instead of using overlayfs.
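The planning half of that idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `plan_underlay`, the dict standing in for -B options, and the unpacked image directory are all hypothetical, bind destinations are assumed to be normalized absolute paths, and the sketch only computes which mounts would be needed; it performs no mounts and is not singularity's implementation.

```python
import os


def plan_underlay(image_root, binds, container_dir="/"):
    """Plan the bind mounts for an 'underlay' of container_dir.

    image_root: host path of an unpacked image (hypothetical layout).
    binds: dict mapping absolute container path -> host source path,
           standing in for -B options (assumed normalized, no trailing '/').
    Returns (host_source, container_dest) pairs. A source of None means
    "create a plain directory in the scratch tree" because a bind
    destination lies somewhere below that path.
    """
    prefix = container_dir.rstrip("/") + "/"
    image_dir = os.path.join(image_root, container_dir.lstrip("/"))
    names = set(os.listdir(image_dir)) if os.path.isdir(image_dir) else set()

    # Add the next path component of every bind destination below this directory.
    for dest in binds:
        if dest.startswith(prefix) and len(dest) > len(prefix):
            names.add(dest[len(prefix):].split("/")[0])

    plan = []
    for name in sorted(names):
        dest = prefix + name
        if dest in binds:
            # A requested bind takes the place of whatever the image has here.
            plan.append((binds[dest], dest))
        elif any(d.startswith(dest + "/") for d in binds):
            # Conflict: a bind lands inside this directory, so it must be a
            # real scratch directory whose children are planned recursively.
            plan.append((None, dest))
            plan.extend(plan_underlay(image_root, binds, dest))
        else:
            # Non-conflicting image entry: bind it in unchanged.
            plan.append((os.path.join(image_dir, name), dest))
    return plan
```

For example, calling `plan_underlay(root, {"/cvmfs": "/host/cvmfs", "/opt/data": "/host/data"})` on an image that contains `/opt` would yield a scratch `/opt` directory, a bind of `/host/data` at `/opt/data`, and binds for every other image entry. A bind destination nested inside another bind destination is ignored by this sketch.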