systemd 248 broke read-only /sys/fs/cgroup mount in docker #19245

Closed
fthiery opened this issue Apr 8, 2021 · 5 comments

@fthiery

fthiery commented Apr 8, 2021

My use case is running systemd inside a docker container.

Starting with systemd 248 on the host, mounting /sys/fs/cgroup read-only into the container no longer works.

Workarounds:

  • boot the host with systemd.unified_cgroup_hierarchy=0
  • remove the ro flag from the docker run argument -v /sys/fs/cgroup:/sys/fs/cgroup:ro, but this contaminates the host cgroup tree and confuses e.g. docker top:
docker top debian-systemd
Error response from daemon: runc did not terminate successfully: container_linux.go:186: getting all container pids from cgroups caused: lstat /sys/fs/cgroup/system.slice/docker-817dfec3facbeb10c64d7b0fae478804b1177ae949e695e111b7c693569dd21a.scope: no such file or directory
: unknown
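
To confirm whether the first workaround is actually active on a host, one option is to inspect the kernel command line. The sketch below is a hypothetical helper (the name cgroup_cmdline_request is not part of systemd); it reads a command line on stdin and reports what systemd.unified_cgroup_hierarchy= asks for. It assumes the usual boolean spellings for "off" and treats any other value as a request for the unified hierarchy, which is a simplification.

```shell
#!/bin/sh
# Hypothetical helper: report what the kernel command line (read from stdin)
# requests through systemd.unified_cgroup_hierarchy=.
# Prints "non-unified", "unified", or "default" (option not present).
cgroup_cmdline_request() {
    # One option per line, then split each option at the first "=".
    tr ' ' '\n' | awk -F= '
        $1 == "systemd.unified_cgroup_hierarchy" { value = $2; seen = 1 }
        END {
            if (!seen)
                print "default"
            else if (value == "0" || value == "no" || value == "false" || value == "off")
                print "non-unified"
            else
                # Simplification: every other value counts as "unified".
                print "unified"
        }'
}
```

On a running host it would be called as `cgroup_cmdline_request < /proc/cmdline`. A result of "default" means the layout is decided by the default-hierarchy= value that `systemd --version` prints.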

The NEWS file mentions the following possibly related changes introduced in v248:

    * The existing ConditionControlGroupController= setting has been
      extended with two new values "v1" and "v2". "v2" means that the
      unified v2 cgroup hierarchy is used, and "v1" means that legacy v1
      hierarchy or the hybrid hierarchy are used.

    * Systems with the legacy cgroup v1 hierarchy are now marked as
      "tainted", to make it clearer that using the legacy hierarchy is not
      recommended.

    * systemd-detect-virt/ConditionVirtualization= will now explicitly
      detect Docker/Podman environments where possible. Moreover, they
      should be able to generically detect any container manager as long as
      it assigns the container a cgroup.

Any insight / advice / workarounds are welcome.

Used distribution

Arch Linux, kernel 5.11.11-arch1-1

$ docker version
Client:
 Version:           20.10.5
 API version:       1.41
 Go version:        go1.16
 Git commit:        55c4c88966
 Built:             Wed Mar  3 16:51:54 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.5
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16
  Git commit:       363e9a88a1
  Built:            Wed Mar  3 16:51:28 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e.m
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Expected behaviour

With the following example Dockerfile:

FROM debian:buster-slim

ENV container docker
ENV LC_ALL C
ENV DEBIAN_FRONTEND noninteractive

USER root
WORKDIR /root

RUN set -x

RUN apt-get update -y \
    && apt-get install --no-install-recommends -y systemd \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
    && rm -f /var/run/nologin

RUN rm -f /lib/systemd/system/multi-user.target.wants/* \
    /etc/systemd/system/*.wants/* \
    /lib/systemd/system/local-fs.target.wants/* \
    /lib/systemd/system/sockets.target.wants/*udev* \
    /lib/systemd/system/sockets.target.wants/*initctl* \
    /lib/systemd/system/sysinit.target.wants/systemd-tmpfiles-setup* \
    /lib/systemd/system/systemd-update-utmp*

VOLUME [ "/sys/fs/cgroup" ]

CMD ["/lib/systemd/systemd"]

On a systemd 247 host

$ /lib/systemd/systemd --version
systemd 247 (247.4-2-arch)
+PAM +AUDIT -SELINUX -IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid
$ docker build -t debian-systemd .
$ docker run -t --tmpfs /run --tmpfs /run/lock --tmpfs /tmp -v /sys/fs/cgroup:/sys/fs/cgroup:ro debian-systemd
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.

Welcome to Debian GNU/Linux 10 (buster)!

Set hostname to <bf431002c7c1>.
Couldn't move remaining userspace processes, ignoring: Input/output error
File /lib/systemd/system/systemd-journald.service:12 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[  OK  ] Listening on Journal Socket.
...
[  OK  ] Reached target Graphical Interface.

Unexpected behaviour you saw

On a systemd 248 host:

$ /lib/systemd/systemd --version
systemd 248 (248-3-arch)
+PAM +AUDIT -SELINUX -APPARMOR -IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP -SYSVINIT default-hierarchy=unified

$ docker build -t debian-systemd .
$ docker run -t --tmpfs /run --tmpfs /run/lock --tmpfs /tmp -v /sys/fs/cgroup:/sys/fs/cgroup:ro debian-systemd
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.

Welcome to Debian GNU/Linux 10 (buster)!

Set hostname to <fbb4fc19cb95>.
Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

CPU architecture issue was seen on

x86_64

@poettering
Member

hmm, this is between docker, your old container, and your local configuration. What does host systemd have to do with that?

I have no idea about docker, but they are pretty hostile towards systemd and still don't support cgroupsv2. Maybe take it up with them?

Maybe your host runs cgroupsv2 now and docker fails on that? What's your mount table like?

BTW, mucking around in /etc/systemd and /usr/lib/systemd looks really broken. systemd just works in reasonably not broken container managers, see https://systemd.io/CONTAINER_INTERFACE. Now, docker being its own thing ignores that, but I think you can easily make things match that document so that things just work for you too without patching around.

@fthiery
Author

fthiery commented Apr 8, 2021

hmm, this is between docker, your old container, and your local configuration. What does host systemd have to do with that?

Good question, not pointing fingers at systemd specifically, just noticed that the systemd upgrade seems to have triggered it.

I have no idea about docker, but they are pretty hostile towards systemd and still don't support cgroupsv2. Maybe take it up with them?

I will

Maybe your host runs cgroupsv2 now and docker fails on that? What's your mount table like?

$ grep cgroup /etc/mtab 
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755,inode64 0 0
cgroup2 /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
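
A mount table like the one above can be classified mechanically. The following is a minimal sketch, assuming the conventional mount points: a cgroup2 filesystem directly on /sys/fs/cgroup means unified, a cgroup2 filesystem on /sys/fs/cgroup/unified next to v1 controllers means hybrid, and only v1 mounts means legacy. The function name classify_cgroup_mounts is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: classify the cgroup layout from mount-table lines
# (the /etc/mtab or /proc/mounts format) read on stdin.
# Fields used: $2 = mount point, $3 = filesystem type.
classify_cgroup_mounts() {
    awk '
        $2 == "/sys/fs/cgroup"         && $3 == "cgroup2" { root_v2 = 1 }
        $2 == "/sys/fs/cgroup/unified" && $3 == "cgroup2" { hybrid = 1 }
        $3 == "cgroup"                                    { v1 = 1 }
        END {
            if (root_v2)     print "unified"
            else if (hybrid) print "hybrid"
            else if (v1)     print "legacy"
            else             print "unknown"
        }'
}
```

Run as `grep cgroup /etc/mtab | classify_cgroup_mounts`. Fed the table above, it prints "hybrid", because /sys/fs/cgroup itself is a tmpfs and the cgroup2 filesystem sits on /sys/fs/cgroup/unified.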

The most practical workaround is to boot with systemd.unified_cgroup_hierarchy=0 (if it helps)

BTW, mucking around in /etc/systemd and /usr/lib/systemd looks really broken. systemd just works in reasonably not broken container managers, see https://systemd.io/CONTAINER_INTERFACE. Now, docker being its own thing ignores that, but I think you can easily make things match that document so that things just work for you too without patching around.

I will check this out, thanks

@fthiery
Author

fthiery commented Apr 8, 2021

Btw, docker is supposed to have supported cgroupsv2 since v20.10

"This release continues Docker’s investment in our community Engine adding multiple new features including support for cgroups V2"

https://www.docker.com/blog/introducing-docker-engine-20-10/
https://docs.docker.com/engine/security/rootless/

This may also be relevant: https://serverfault.com/a/1054414/91453

@poettering
Member

btw, just bind mounting the /sys/fs/cgroup hierarchy is never going to work if cgroup namespaces are used, since then the host view of /sys/fs/cgroup will be visible to the container, but /proc/$PID/cgroup will report the namespaced view, and things are then utterly broken.
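
The mismatch described here can be checked for directly. This is a rough sketch, not anything systemd itself runs: it takes the cgroup v2 path a process reports for itself (the "0::" entry in /proc/PID/cgroup) and checks whether that PID is really listed in cgroup.procs at that path under a given mount point. Both function names are hypothetical, and PID translation across PID namespaces is ignored.

```shell
#!/bin/sh
# Extract the cgroup v2 path from /proc/PID/cgroup-style input on stdin.
# The v2 entry has hierarchy ID 0 and an empty controller list: "0::<path>".
unified_cgroup_path() {
    sed -n 's/^0:://p' | head -n 1
}

# Hypothetical check. Arguments: a PID, a file in /proc/PID/cgroup format,
# and a cgroup2 mount point. Succeeds only if the PID is listed in
# cgroup.procs at the path the file reports. Returns 2 if the file has
# no v2 entry at all.
pid_listed_where_reported() {
    pid=$1
    proc_cgroup_file=$2
    cgroup_root=$3
    path=$(unified_cgroup_path < "$proc_cgroup_file")
    [ -n "$path" ] || return 2
    grep -qx "$pid" "$cgroup_root$path/cgroup.procs" 2>/dev/null
}
```

Inside a container one might call `pid_listed_where_reported 1 /proc/1/cgroup /sys/fs/cgroup`. With a cgroup namespace plus a bind mount of the host tree, the process reports "/" while the host tree lists it under a path such as the docker-….scope directory seen in the docker top error, so the check would be expected to fail.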

Hence, what you are doing is pretty fishy, and I am not sure this ever could work. Either way, I doubt there's anything for us to address here.

@poettering
Member

Anyway, let's close this here; bind mounting the hierarchy when cgroupns is used cannot work. It's a wonder this wasn't visible before. I also don't see how this is a systemd issue in the first place. Please follow up with docker. And drop the bind mount, unless you explicitly turn off cgroupns, too.

gjhenrique added a commit to gjhenrique/rpi_stuff that referenced this issue Sep 10, 2021
New systemd versions on the host break systemd on the container
systemd/systemd#19245