
stage1: rkt pods should not be given CAP_SYS_ADMIN #576

alban opened this Issue Mar 6, 2015 · 12 comments



alban commented Mar 6, 2015

In this issue, I talk about how rkt's security is reduced because of CAP_SYS_ADMIN and give some ideas on how to make rkt work without giving CAP_SYS_ADMIN to stage1's systemd.

Current status

systemd-nspawn does not drop CAP_SYS_ADMIN by default, so stage1's systemd and stage2 have CAP_SYS_ADMIN.

Docker does not give CAP_SYS_ADMIN to containers by default; correct isolation is ensured precisely because the container does not have it.

Example 1: attempting to remount a volume read-write without CAP_SYS_ADMIN does not work:

host:/$ docker run -t -i -v /tmp/foo/:/tmp/foo:ro busybox
container:/ # mount -o remount,rw /tmp/foo/
mount: permission denied (are you root?)

The Docker container is started with a volume bind mounted read-only from the host. Docker does not allow the container to remount the volume without the readonly limitation.

Example 2: attempting to remount a volume read-write with CAP_SYS_ADMIN is possible:

host:/$ docker run --cap-add=SYS_ADMIN -t -i -v /tmp/foo/:/tmp/foo:ro busybox
container:/ # mount -o remount,rw /tmp/foo/
container:/ # touch /tmp/foo/written_from_the_container
container:/ #

When CAP_SYS_ADMIN is given to the container, the readonly attribute of the volume cannot be enforced.

In rkt, it is also dangerous to let the container have CAP_SYS_ADMIN, for example because of the cgroup filesystems. Systemd uses cgroups heavily: it mounts the cgroup filesystems in /sys/fs/cgroup/* itself if they are not already mounted; if systemd-nspawn mounted them beforehand, systemd does not need to remount them and just uses them as is.

Systemd-nspawn v215 (the version currently used in rkt) does not mount the cgroup filesystems, so stage1's systemd has to do the mounting, and it mounts them in read-write mode.

Since systemd-nspawn v219, the cgroup filesystems are mounted for the container as read-only, except the container's own tree, which needs to be read-write (commit). Therefore, with systemd-nspawn v219, stage1's systemd does not need to mount any cgroup filesystems.

If the container could remount all the cgroup filesystems in read-write mode, it would be able to do lots of damage to the host.

Example 3: freeze the host's lighttpd from the rkt container:

host:/$ rkt --debug --insecure-skip-verify run --interactive docker://ubuntu
container:/ # /break-chroot
container:/ # mount -o remount,bind,rw /sys/fs/cgroup/systemd
container:/ # mount -o remount,bind,rw /sys/fs/cgroup/freezer
container:/ # echo FROZEN >  /sys/fs/cgroup/freezer/my_http_server/freezer.state

The shell starts in stage2's chroot. I break out of the chroot with /break-chroot and reach stage1's root. I can remount the cgroups in read-write mode because I have CAP_SYS_ADMIN (or they were already read-write with systemd-nspawn v215).

Even though the container cannot see the list of host processes in a cgroup because it is in a different pid namespace, I can still DoS them by freezing them with the freezer cgroup subsystem. I tested with a browser on http://localhost/ that lighttpd (running on the host) stopped responding after the rkt container froze it.
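Whether a given hierarchy is currently mounted read-only can be checked from inside the container before attempting a remount; a minimal sketch in Python (standard library only; the cgroup path is illustrative):

```python
import os

def is_readonly(path):
    """Report whether the filesystem mounted at `path` is read-only,
    by checking the ST_RDONLY bit in the statvfs mount flags."""
    return bool(os.statvfs(path).f_flag & os.ST_RDONLY)

# e.g. is_readonly("/sys/fs/cgroup/freezer") — True only if nspawn
# (v219 and later) mounted the hierarchy read-only for the container.
```

With systemd-nspawn v215 this would report the hierarchies read-write from the start, so no remount is even needed before writing to freezer.state.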

Why do rkt containers have CAP_SYS_ADMIN?

It is the default in systemd-nspawn

By default, and unless the option --drop-cap is used, systemd-nspawn keeps the following capabilities for the container:

static uint64_t arg_retain =
        (1ULL << CAP_CHOWN) |
        (1ULL << CAP_DAC_OVERRIDE) |
        (1ULL << CAP_DAC_READ_SEARCH) |
        (1ULL << CAP_FOWNER) |
        (1ULL << CAP_FSETID) |
        (1ULL << CAP_IPC_OWNER) |
        (1ULL << CAP_KILL) |
        (1ULL << CAP_LEASE) |
        (1ULL << CAP_NET_BIND_SERVICE) |
        (1ULL << CAP_NET_BROADCAST) |
        (1ULL << CAP_NET_RAW) |
        (1ULL << CAP_SETGID) |
        (1ULL << CAP_SETFCAP) |
        (1ULL << CAP_SETPCAP) |
        (1ULL << CAP_SETUID) |
        (1ULL << CAP_SYS_ADMIN) |
        (1ULL << CAP_SYS_CHROOT) |
        (1ULL << CAP_SYS_NICE) |
        (1ULL << CAP_SYS_PTRACE) |
        (1ULL << CAP_SYS_TTY_CONFIG) |
        (1ULL << CAP_SYS_RESOURCE) |
        (1ULL << CAP_SYS_BOOT) |
        (1ULL << CAP_AUDIT_WRITE) |
        (1ULL << CAP_AUDIT_CONTROL) |
        (1ULL << CAP_MKNOD);

CAP_SYS_ADMIN is not the only dangerous one: CAP_DAC_READ_SEARCH is dangerous too, and it allowed Docker containers to access files outside the container, such as the host's /etc/shadow.

By comparison, Docker's default capability list does not have CAP_DAC_READ_SEARCH or CAP_SYS_ADMIN:

        Capabilities: []string{
                ...
        },

Recommended by Systemd's container interface

Systemd's container interface recommends keeping CAP_SYS_ADMIN:

Do not drop CAP_SYS_ADMIN from the container. A number of fs namespacing related settings, such as PrivateDevices=, ProtectHome=, ProtectSystem=, MountFlags=, PrivateTmp=, ReadWriteDirectories=, ReadOnlyDirectories=, InaccessibleDirectories=, MountFlags= need to be able to open new mount namespaces and the mount certain file system into it. You break all services that make use of these flags if you drop the flag. Note that already quite a number of services make use of this as we actively encourage users to make use of this security functionality. Also note that logind mounts XDG_RUNTIME_DIR as tmpfs for all logged in users and won't work either if you take away the capability.

However, the only systemd .service files defined in a rkt container are the ones written by stage0:

  • the app services (stage2): sha512-xxx.service
  • exit-watcher.service
  • prepare-app@.service (introduced in #546)
  • reaper.service

Since rkt controls how those .service files are written, we can take care of not using features incompatible with !CAP_SYS_ADMIN.

prepare-app (introduced in #546)

prepare-app bind mounts device nodes from stage1's /dev to stage2's /dev. The mount() syscall requires CAP_SYS_ADMIN. So if CAP_SYS_ADMIN is not given to the container, prepare-app as written in #546 will stop working.

This could be solved by using a setuid binary (or use filesystem capabilities) in stage1 to do the mount. /bin/mount is already setuid. When systemd is requested to mount a filesystem by a .mount file, it does not use the mount() syscall directly but execs /bin/mount. So if .mount files were to be used to bind mount stage1's /dev files to stage2's /dev, CAP_SYS_ADMIN would not be needed. Current versions of systemd don't create the mount point correctly but systemd has a patch available. Meanwhile, prepare-app could be setuid.
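As a sketch of that approach, a stage1 .mount unit bind-mounting a single device node might look like the following (the paths are illustrative, not rkt's actual layout; per systemd conventions the unit's filename must be the systemd-escaped form of the Where= path):

```ini
# opt-stage2-app-rootfs-dev-null.mount (hypothetical name and paths)
[Unit]
Description=Bind mount /dev/null into the app rootfs
DefaultDependencies=no

[Mount]
What=/dev/null
Where=/opt/stage2/app/rootfs/dev/null
Type=none
Options=bind
```

For a .mount unit like this, systemd execs /bin/mount rather than calling mount() itself, which is what makes the setuid route conceivable.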

@iaguis: do you use overlayfs for stage1's rootfs? Does overlayfs support filesystem capabilities with getcap and setcap? Is stage1 overlayfs mounted with or without "nosuid"?

Limiting CAP_SYS_ADMIN to stage1's systemd?

It is not enough to limit CAP_SYS_ADMIN to stage1's systemd while keeping applications without CAP_SYS_ADMIN: an application can break the chroot, create a file /run/systemd/system/foo.service containing CapabilityBoundingSet=CAP_SYS_ADMIN, and start it with "systemctl start".
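As a sketch, the escaped app would only need to drop something like the following (hypothetical) unit into /run/systemd/system/ and ask stage1's systemd to start it:

```ini
# /run/systemd/system/foo.service — written by the escaped app
[Service]
# stage1's systemd, which still holds CAP_SYS_ADMIN, grants it here:
CapabilityBoundingSet=CAP_SYS_ADMIN
ExecStart=/bin/sh -c 'mount -o remount,rw /sys/fs/cgroup/freezer'
```

The unit runs with whatever capabilities stage1's systemd can hand out, not with the app's restricted set.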

Isolators at the app level

Some of rkt's isolators will be implemented with cgroups: e.g. the "resource/cpu" isolator (appc/spec#192) can be implemented by putting the processes of the app in a specific cgroup of the "cpu" cgroup subsystem. This means stage1's systemd needs write access to /sys/fs/cgroup/cpu/. But this directory contains the cpu cgroup tree for all processes, including the host's, so it is unsafe to give complete read-write access to it. Instead, only the container's subtree should be mounted in read-write mode, as is done for the /sys/fs/cgroup/systemd/ filesystem.
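Concretely, such a cpu isolator boils down to two file writes in the app's subtree; a minimal sketch (hypothetical helper names, assuming a cgroup v1 layout) of what stage1 would have to be allowed to do:

```python
import os

CGROUP_ROOT = "/sys/fs/cgroup"  # assumed cgroup v1 mount point

def cpu_isolator_writes(subtree, shares, pid):
    """Return the (path, value) writes that apply a cpu isolator to an
    app: set cpu.shares on the app's subtree and move `pid` into it.
    Only the container's own subtree needs to be writable for this."""
    base = os.path.join(CGROUP_ROOT, "cpu", subtree.lstrip("/"))
    return [
        (os.path.join(base, "cpu.shares"), str(shares)),
        (os.path.join(base, "tasks"), str(pid)),
    ]

def apply_writes(writes):
    # Requires read-write access to the app's cpu subtree, nothing more.
    for path, value in writes:
        with open(path, "w") as f:
            f.write(value)
```

Everything outside `base` (including the host's part of the cpu tree) can stay read-only.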

Systemd-nspawn does not manage read-write access for the container subtree on cgroup subsystems. I don't know if it is feasible without adding support for the cgroup unified hierarchy. Systemd developers plan to add that support.

There have been some attempts to add CGroup Namespaces to the Linux kernel. They would make this read-only bind mount unnecessary. But so far, CGroup Namespaces don't exist.



jonboulle commented Mar 11, 2015

thanks for the awesome write-up!

I am starting to wonder if we should move away from strictly requiring a shared mount namespace, and instead use the slightly more agnostic stipulation that apps within a pod must be able to access the same shared volumes (cf. kubernetes/kubernetes#4701 (comment)). Then we could potentially do something like give each app its own mount namespace, and limit CAP_SYS_ADMIN to stage1 systemd.

Arguably using systemd .mount files might be preferable anyway.

@alban alban referenced this issue Mar 11, 2015


Functional testing for rkt #600



alban commented Mar 11, 2015

If the apps are in different mount namespaces, they can still access the rootfs of stage1 through /proc/1/root/ if they are in the same pid namespace. Apps running as root can steal capabilities from stage1's systemd by adding a new systemd unit file in /proc/1/root/usr/lib64/systemd/system/foo.service and starting it through the socket file /proc/1/root/run/systemd/private.

Being in different mount namespaces does not block access to files outside of / if the process can somehow get a file descriptor to them. Getting that file descriptor is possible in the same pid namespace with /proc/1/root/.

To check this, I tried to send a file descriptor from outside the container to the container with fd-passing (SCM_RIGHTS). Amusingly, I could chdir outside of the container root successfully, and the /bin/pwd command just returned (unreachable)/.
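The fd-passing experiment described above can be sketched in a few lines of Python (standard library only; /tmp stands in for a host directory the receiver's namespace cannot see):

```python
import array
import os
import socket

def send_fd(sock, fd):
    # Send one file descriptor as SCM_RIGHTS ancillary data.
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                           array.array("i", [fd]))])

def recv_fd(sock):
    # Receive one file descriptor sent with send_fd().
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_SPACE(4))
    level, ctype, data = ancdata[0]
    assert level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS
    return array.array("i", bytes(data[:4]))[0]

# Demo: pass a directory fd over a socketpair and fchdir() into it.
# A process inside a container could do the same if a process outside
# hands it an fd — different mount namespaces don't block this.
outside, inside = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
dirfd = os.open("/tmp", os.O_RDONLY)
send_fd(outside, dirfd)
received = recv_fd(inside)
os.fchdir(received)  # cwd now points outside any chroot the receiver had
```

The kernel duplicates the descriptor into the receiving process, so the receiver holds a live handle to a directory its own mount namespace never exposed.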



alban commented Mar 11, 2015

I am fine with apps running in different mount namespaces but it does not fix the issue of CAP_SYS_ADMIN when apps can run as root.

If we allow apps to run as root, I think that systemd must not be given more capabilities than the union of the caps of the apps. Then, an app running as root cannot get CAP_SYS_ADMIN, unless another app in the same pod is already allowed to have CAP_SYS_ADMIN.



alban commented Mar 11, 2015

In section "5-1. CAP for resource control" of the cgroup unified hierarchy documentation, it is explained that the control knobs (which the Rocket isolators will need to use) currently don't require a special capability (being root is enough because the knob files belong to root with -rw-r--r--), but this is planned to change to require a capability, most likely CAP_SYS_ADMIN. That could be a problem if we want to drop CAP_SYS_ADMIN from stage1's systemd.

One way to solve this could be to have stage1's /init prepare the cgroups before dropping the caps and exec()ing stage1's systemd. On hosts running systemd, this is done by calling StartTransientUnit on the New Control Group Interfaces; this is what Docker does through libcontainer. But it means each app in the pod would get the same cgroup-based isolators.

@jonboulle jonboulle added this to the v1.0.0 milestone Apr 17, 2015

@jonboulle jonboulle changed the title Rocket containers should not be given CAP_SYS_ADMIN rkt containers should not be given CAP_SYS_ADMIN Apr 24, 2015

@jonboulle jonboulle referenced this issue Apr 24, 2015


stage1: add per-app isolators #811


@jonboulle jonboulle changed the title rkt containers should not be given CAP_SYS_ADMIN stage1: rkt pods should not be given CAP_SYS_ADMIN Apr 25, 2015



iaguis commented Apr 27, 2015

Today I played a bit with LXCFS to see if it can solve some of our problems regarding this issue.

What is it?

Basically, it is a FUSE filesystem that communicates with cgmanager to offer a restricted view of the cgroup filesystem. This view can be bind-mounted inside a container and boom! We'll have the cgroups we need with rw permissions, exactly what we need for #811.

It also gives a virtualized view of some /proc files but I haven't tested this functionality.

Testing it

Since systemd-nspawn v215 doesn't mount the cgroups and leaves this task to systemd, they're mounted RW so things Just Work. I tested LXCFS with systemd v219.

I started cgmanager on the host, mounted the LXCFS filesystem on /var/lib/lxcfs and modified rkt so it bind-mounts /var/lib/lxcfs/cgroup to /newcgroup. I had to do this because in nspawn, the bind-mounts happen before it mounts the cgroups.

Then I entered stage1 and bind-mounted /newcgroup/memory to /sys/fs/cgroup/memory (I only tested the memory cgroup; in the Real World we'd mount all the subsystems). Perhaps we could do this with a service in stage1 that does this bind mount.

Now we need to set some memory limitation; I did this directly with systemctl:

-- stage1
# systemctl set-property sha512-cefde676d01cb6bc27e2a811e19a6c58.service MemoryLimit=50M

Finally, I ran a small program that does a 100MB malloc and memset and then sleeps:

-- stage2
# bigmem 100

Running cgtop, we can see that the memory doesn't go over 50MB:

/machine.slice/machine-rkt\x2dc7e3cec6\x2daae7\x2d4e14\x2da835\x2d3c488208775a.scope/system.slice/sha512-cefde676d01cb6bc27e2a811e19a6c58.service  2      -    49.9M        -        -


The next thing I tried was running all this without CAP_SYS_ADMIN. Unfortunately, this doesn't work because prepare-app uses the mount system call.

After talking to @alban, we tried to make prepare-app setuid, but this also doesn't work. Dropping the capability in systemd-nspawn means that CAP_SYS_ADMIN is excluded from the capability bounding set, which means that it cannot be recovered even if a file that has the capability in its inheritable set is executed.[1]


It seems LXCFS works. Sounds a bit hacky but I guess that's what we have until Cgroup namespaces are a thing. I'll play a bit more with it.

Re: CAP_SYS_ADMIN I guess we should wait for user namespaces? 😁

[1]: I have to read more about it; not sure what this means (@alban?):

Note that the bounding set masks the file permitted capabilities, but not the inherited capabilities. If a thread maintains a capability in its inherited set that is not in its bounding set, then it can still gain that capability in its permitted set by executing a file that has the capability in its inherited set.
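The sets the quoted paragraph talks about can be inspected via the Cap* lines of /proc/<pid>/status; a small helper (Python, assuming the usual /proc format) to decode them:

```python
def parse_caps(status_text):
    """Map the Cap* lines of /proc/<pid>/status (CapInh, CapPrm,
    CapEff, CapBnd) to integer bitmasks."""
    caps = {}
    for line in status_text.splitlines():
        if line.startswith("Cap"):
            name, _, value = line.partition(":")
            caps[name] = int(value.strip(), 16)
    return caps

CAP_SYS_ADMIN = 21  # bit number from <linux/capability.h>

def has_cap(mask, bit):
    return bool(mask & (1 << bit))

# On Linux one would feed it open("/proc/self/status").read(). Per the
# quoted rule, CapBnd masks a file's *permitted* bits, not the
# inherited ones: a capability absent from CapBnd can still reach the
# permitted set after exec if it is in CapInh and in the file's
# inheritable set.
```

For the nspawn case this doesn't help: --drop-capability clears the bit from CapInh as well as CapBnd, so there is nothing left to inherit.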



alban commented Apr 28, 2015

@iaguis : good work!

Have you tried Delegate=, suggested by @jonboulle in another discussion? When systemd-nspawn calls CreateMachineWithNetwork, is it possible to pass Delegate=true along with DeviceAllow and other scope_properties?

I would like to know if systemd in stage1 can make app-level isolators work in the following configuration: without lxcfs, without CAP_SYS_ADMIN, and with nspawn patched to have Delegate and the cgroup hierarchies (mem, etc.) mounted read-write in the same way as the name=systemd hierarchy (that is, only a subtree is read-write). If it works but is considered unsafe, I would like to know what is missing to make it safe.

About your [1] note on inherited capabilities: there are some explanations in CapabilityBoundingSet= and Capabilities=, but it isn't clear enough to me whether we can use that. I am not sure what it means in practice.



jonboulle commented Apr 28, 2015

@alban I agree with your line of thinking - am also curious why it's considered unsafe. Poking through systemd-devel/git history I couldn't find a clear explanation of the move to mounting it read-only.

I did notice, from the container interface documentation:

Either pre-mount all cgroup hierarchies in full into the container, or leave that to systemd which will do so if they are missing. Note that it is explicitly not OK to just mount a sub-hierarchy into the container as that is incompatible with /proc/$PID/cgroup (which lists full paths). Also the root-level cgroup directories tend to be quite different from inner directories, and that distinction matters. It is OK however, to mount the "upper" parts read-only of the hierarchies, and only allow write-access to the cgroup subtree the container runs in.


Don't mount only a subtree of the cgroupfs into the container. This will not work as /proc/$PID/cgroup lists full paths and cannot be matched up with the actual cgroupfs tree visible, then. (Also see above.)



iaguis commented Apr 28, 2015

@jonboulle Today I added a couple of patches to systemd and implemented the memory isolator. It's a WIP but it seems to work. I'll try to get an answer to whether this is secure or not and why.



jonboulle commented Dec 1, 2015

Where did we end up on this?



alban commented Dec 1, 2015

No progress on this. I think it is easier to use user namespaces to make CAP_SYS_ADMIN harmless rather than removing CAP_SYS_ADMIN from pods based on systemd.

@jonboulle jonboulle modified the milestones: v1+, v1.0.0 Jan 22, 2016



yifan-gu commented Apr 18, 2016

FWIW, I just found that we are not doing anything with the capabilities-remove-set defined in the pod manifest today. That is, we pass --capability to nspawn but not --drop-capability. I guess this is mentioned in the comments above, but I haven't read through them yet, so I'm leaving a comment here :)



lucab commented May 27, 2016

The issue described here has been fixed by implementing appc/spec#600 via #2493. Even though stage1 still has CAP_SYS_ADMIN, there is now better separation between stage1 and stage2, and apps receive by default a much smaller capability bounding set.

@lucab lucab closed this May 27, 2016
