stage1: rkt pods should not be given CAP_SYS_ADMIN #576
In this issue, I explain how rkt's security is weakened by CAP_SYS_ADMIN and give some ideas on how to make rkt work without granting CAP_SYS_ADMIN to stage1's systemd.
systemd-nspawn does not drop CAP_SYS_ADMIN by default, so both stage1's systemd and the stage2 apps have CAP_SYS_ADMIN.
Docker, by contrast, does not give CAP_SYS_ADMIN to containers by default; correct isolation relies on the container not having it.
Example 1: attempting to remount a volume read-write without CAP_SYS_ADMIN does not work:
The Docker container is started with a volume bind mounted read-only from the host. Docker does not allow the container to remount the volume without the readonly limitation.
Example 2: attempting to remount a volume read-write with CAP_SYS_ADMIN is possible:
When CAP_SYS_ADMIN is given to the container, the readonly attribute of the volume cannot be enforced.
In rkt, it is also dangerous to let the container have CAP_SYS_ADMIN, for example because of the cgroup filesystems. Systemd uses cgroups heavily. It either mounts the cgroup filesystems in
Systemd-nspawn v215 (version currently used in rkt) does not mount the cgroup filesystems, so stage1's systemd has to do the mounting. It mounts them in read-write mode.
Since systemd-nspawn v219, the cgroup filesystems are mounted for the container as read-only, except the container's own subtree, which needs to be read-write (commit). Therefore, with systemd-nspawn v219, stage1's systemd does not need to mount any cgroup filesystem itself.
If the container could remount all the cgroup filesystems in read-write mode, it would be able to do lots of damage to the host.
Example 3: freeze the host's lighttpd from the rkt container:
The shell is initially in stage2's chroot. I break the chroot with /break-chroot and reach stage1's root. I can remount the cgroups read-write because I have CAP_SYS_ADMIN (or they were already mounted read-write with systemd-nspawn v215).
Even though the container cannot see the list of host processes in a cgroup because it is in a different pid namespace, I can still DoS them by freezing them with the freezer cgroup subsystem. I tested with a browser on http://localhost/ that lighttpd (running on the host) stopped responding after the rkt container froze it.
Why do rkt containers have CAP_SYS_ADMIN?
It is the default in systemd-nspawn
By default, and unless the option
CAP_SYS_ADMIN is not the only dangerous capability. CAP_DAC_READ_SEARCH is dangerous too: it allowed Docker containers to access files outside the container, such as the host's /etc/shadow.
By comparison, Docker's default capability list does not have CAP_DAC_READ_SEARCH or CAP_SYS_ADMIN:
Recommended by Systemd's container interface
Systemd's container interface recommends keeping CAP_SYS_ADMIN:
However, the only systemd .service files defined in a rkt container are the ones written by stage0:
Since rkt controls how those .service files are written, we can avoid using features that are incompatible with !CAP_SYS_ADMIN.
prepare-app (introduced in #546)
prepare-app bind mounts device nodes from stage1's /dev to stage2's /dev. The mount() syscall requires CAP_SYS_ADMIN. So if CAP_SYS_ADMIN is not given to the container, prepare-app as written in #546 will stop working.
This could be solved with a setuid binary (or filesystem capabilities) in stage1 to do the mount. /bin/mount is already setuid. When systemd is asked to mount a filesystem by a .mount unit, it does not use the mount() syscall directly but execs /bin/mount. So if .mount files were used to bind mount stage1's /dev files into stage2's /dev, CAP_SYS_ADMIN would not be needed. Current versions of systemd don't create the mount point correctly, but a systemd patch is available. Meanwhile, prepare-app could be made setuid.
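If .mount units were used for this, each device node bind mount might look something like the following sketch (the stage2 path and unit name are hypothetical; in systemd, the unit's filename must be the escaped form of the Where= path):

```ini
# Hypothetical opt-stage2-rootfs-dev-null.mount
[Unit]
Description=Bind mount /dev/null into stage2 (illustrative only)

[Mount]
What=/dev/null
Where=/opt/stage2/rootfs/dev/null
Type=none
Options=bind
```

Because systemd fulfills this by exec'ing the setuid /bin/mount rather than calling mount(2) itself, the mount would work even after CAP_SYS_ADMIN is dropped.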
Limiting CAP_SYS_ADMIN to stage1's systemd?
It is not enough to limit CAP_SYS_ADMIN to stage1's systemd and keep applications without CAP_SYS_ADMIN: applications can break the chroot, create a file in /run/systemd/system/foo.service containing arbitrary commands, and have stage1's systemd execute it with its full capabilities.
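Concretely, the escalation path is that systemd runs whatever ExecStart= says with systemd's own capability set. A dropped unit could be as small as this (entirely hypothetical content):

```ini
# Hypothetical /run/systemd/system/foo.service written by a
# chroot-escaped app; stage1's systemd, not the app, executes it,
# so it runs with CAP_SYS_ADMIN.
[Service]
ExecStart=/bin/sh -c 'mount -o remount,rw /sys/fs/cgroup/freezer'
```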
Isolators at the app level
Some of rkt's isolators will be implemented with cgroups: e.g. the "resource/cpu" isolator (appc/spec#192) can be implemented by putting the processes of the app in a specific cgroup on the "cpu" cgroup subsystem. It means stage1's systemd needs to have write access to
Systemd-nspawn does not manage read-write access for the container's subtree on cgroup subsystems. I don't know if it is feasible without adding support for the cgroup unified hierarchy. Systemd developers plan to add that support.
There have been some attempts to add CGroup Namespaces to the Linux kernel, which would make this read-only bind mount unnecessary. But so far, CGroup Namespaces don't exist.
thanks for the awesome write-up!
I am starting to wonder if we should move away from strictly requiring a shared mount namespace, and instead use the slightly more agnostic stipulation that apps within a pod must be able to access the same shared volumes (c.f. kubernetes/kubernetes#4701 (comment) ). Then we could potentially do something like give each app their own mount namespace, and limit CAP_SYS_ADMIN to stage1 systemd.
Arguably using systemd .mount files might be preferable anyway.
If the apps are in a different mount namespace, they can still access the rootfs of stage1 if they are in the same pid namespace through
Being in different mount namespaces does not block access to files outside of
To check this, I tried to send a file descriptor from outside the container to the container with fd-passing (SCM_RIGHTS) and amusingly I could
I am fine with apps running in different mount namespaces but it does not fix the issue of CAP_SYS_ADMIN when apps can run as root.
If we allow apps to run as root, I think that systemd must not be given more capabilities than the union of the caps of the apps. Then, an app running as root cannot get CAP_SYS_ADMIN, unless another app in the same pod is already allowed to have CAP_SYS_ADMIN.
In the section "5-1. CAP for resource control" of the cgroup unified hierarchy documentation, it is explained that control knobs (which will need to be used for Rocket isolators) currently don't need a special capability (being root is enough, because the knob files belong to root with -rw-r--r--), but this is planned to change to require a capability, most likely CAP_SYS_ADMIN. That could be a problem if we want to drop CAP_SYS_ADMIN from stage1's systemd.
One way to solve this could be to have stage1's /init prepare the cgroups before dropping capabilities and exec()ing stage1's systemd. On hosts running systemd, this is done by calling StartTransientUnit on the New Control Group Interfaces. This is what Docker does through libcontainer. But it means each app in the pod would have the same cgroup-based isolators.
Today I played a bit with LXCFS to see if it can solve some of our problems regarding this issue.
What is it?
Basically, it is a FUSE filesystem that communicates with cgmanager to offer a restricted view of the cgroup filesystem. This view can be bind-mounted inside a container and boom! We'll have the cgroups we need with rw permissions, exactly what we need for #811.
It also gives a virtualized view of some
Since systemd-nspawn v215 doesn't mount the cgroups and leaves this task to systemd, they're mounted RW so things Just Work. I tested LXCFS with systemd v219.
Then I entered stage1 and bind-mounted
Now we need to set some memory limitation, I did this directly with
Finally, I ran a small program that does a 100MB
Running cgtop we can see that the memory doesn't get over 50MB:
The next thing I tried is running all this without CAP_SYS_ADMIN. Unfortunately, this doesn't work because prepare-app uses the mount system call.
After talking to @alban, we tried to make prepare-app setuid but this also doesn't work. Dropping the capability in
It seems LXCFS works. Sounds a bit hacky but I guess that's what we have until Cgroup namespaces are a thing. I'll play a bit more with it.
Re: CAP_SYS_ADMIN I guess we should wait for user namespaces?
I have to read more about it; not sure what this means (@alban?):
@iaguis : good work!
Have you tried
I would like to know if systemd in stage1 can make app-level isolators work in the following configuration: without lxcfs, without
About your note on inherited capabilities: there are some explanations in CapabilityBoundingSet= and Capabilities=, but it isn't clear enough to me whether we can use that. I am not sure what it means in practice.
@alban I agree with your line of thinking - am also curious why it's considered unsafe. Poking through systemd-devel/git history I couldn't find a clear explanation of the move to mounting it read-only.
I did notice, from http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/:
@jonboulle Today I added a couple of patches to systemd and implemented the memory isolator. It's a WIP but it seems to work. I'll try to get an answer to whether this is secure or not and why.
FWIW, I just found we are not doing anything with the