systemd-run --user --scope ... doesn't work with unified cgroup hierarchy #3388

Closed
gdamjan opened this Issue May 30, 2016 · 42 comments

Comments

@gdamjan
Contributor

gdamjan commented May 30, 2016

Submission type

  • [ x ] Bug report
  • Request for enhancement (RFE)

systemd version the issue has been seen with

230 and 230+84 commits

Used distribution

Arch Linux (testing)

In case of bug report: Expected behaviour you didn't see

systemd-run --user --scope bash -c "sleep 100" should start a new scope and service in the user systemd instance.

In case of bug report: Unexpected behaviour you saw

systemd-run fails and the journal has run-r9c9b0f8d4fd14753bd9d4446b7a4c1ef.scope: Failed to add PIDs to scope's control group: Permission denied

The system was booted with systemd.unified_cgroup_hierarchy=1. Removing that option (and rebooting) makes systemd-run --scope work again.
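To confirm which hierarchy a machine is actually running, a minimal sketch like the following may help. It classifies a mount table in /proc/self/mounts format. The function name `cgroup_mode` is made up for illustration, and the hybrid check assumes the cgroup2 tree is mounted at /sys/fs/cgroup/unified, which is how it appears in the mount listing later in this thread.

```bash
#!/usr/bin/env bash
# cgroup_mode: read a mount table (same format as /proc/self/mounts) on stdin
# and print "unified", "hybrid", "legacy" or "none".
# Hypothetical helper for illustration; not part of systemd.
cgroup_mode() {
    local dev mnt fstype rest mode=none
    while read -r dev mnt fstype rest; do
        case "$fstype:$mnt" in
            cgroup2:/sys/fs/cgroup)
                # cgroup2 mounted directly on /sys/fs/cgroup: pure unified mode
                mode=unified
                break
                ;;
            cgroup2:/sys/fs/cgroup/unified)
                # cgroup2 mounted next to the v1 controllers: hybrid mode
                mode=hybrid
                ;;
            cgroup:*)
                # a v1 hierarchy; only counts as legacy if nothing better was seen
                if [[ $mode == none ]]; then
                    mode=legacy
                fi
                ;;
        esac
    done
    printf '%s\n' "$mode"
}

# Usage on a live system:
#   cgroup_mode < /proc/self/mounts
```

The failure reported here is the one seen when this prints `unified`.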

@evverx

Member

evverx commented May 31, 2016

-bash-4.3# cat /proc/cmdline
root=/dev/sda1 raid=noautodetect loglevel=2 init=/usr/lib/systemd/systemd ro console=ttyS0 selinux=0 systemd.log_level=debug systemd.unit=multi-user.target systemd.unified_cgroup_hierarchy=1

-bash-4.3# grep cgroup /proc/self/mounts
cgroup /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0

-bash-4.3# systemctl --version
systemd 230
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN

-bash-4.3# systemd-run --user --scope bash -c 'sleep 10'
Running scope as unit: run-rc42d127d342b4eb6ab73acdb4143cccf.scope

-bash-4.3# journalctl --sync

-bash-4.3# journalctl -b --no-hostname | grep run-rc42d127d342b4eb6ab73acdb4143cccf
May 31 04:35:48 systemd[153]: run-rc42d127d342b4eb6ab73acdb4143cccf.scope: Trying to enqueue job run-rc42d127d342b4eb6ab73acdb4143cccf.scope/start/fail
May 31 04:35:48 systemd[153]: run-rc42d127d342b4eb6ab73acdb4143cccf.scope: Installed new job run-rc42d127d342b4eb6ab73acdb4143cccf.scope/start as 13
May 31 04:35:48 systemd[153]: run-rc42d127d342b4eb6ab73acdb4143cccf.scope: Enqueued job run-rc42d127d342b4eb6ab73acdb4143cccf.scope/start as 13
May 31 04:35:48 systemd[153]: run-rc42d127d342b4eb6ab73acdb4143cccf.scope changed dead -> running
May 31 04:35:48 systemd[153]: run-rc42d127d342b4eb6ab73acdb4143cccf.scope: Job run-rc42d127d342b4eb6ab73acdb4143cccf.scope/start finished, result=done
May 31 04:35:58 systemd[153]: run-rc42d127d342b4eb6ab73acdb4143cccf.scope: cgroup is empty
May 31 04:35:58 systemd[153]: run-rc42d127d342b4eb6ab73acdb4143cccf.scope changed running -> dead
May 31 04:35:58 systemd[153]: run-rc42d127d342b4eb6ab73acdb4143cccf.scope: Collecting.

Maybe, https://github.com/systemd/systemd/blob/master/NEWS#L66

Therefore it
is necessary to also update systemd in the initramfs if using the
unified hierarchy. An updated SELinux policy is also required.

@evverx

Member

evverx commented May 31, 2016

Well, I've tried as non-root:

-bash-4.3$ id
uid=1002(hola) gid=1002(hola) groups=1002(hola)

-bash-4.3$ systemd-run --user --scope bash -c 'sleep 10'
Job for run-rd5b069a420f54cc592fbe1ebe2d90c90.scope failed.
See "systemctl status run-rd5b069a420f54cc592fbe1ebe2d90c90.scope" and "journalctl -xe" for details.
-bash-4.3$ journalctl -b | grep run-rd5b069a420f54cc592fbe1ebe2d90c90.
Hint: You are currently not seeing messages from other users and the system.
      Users in groups 'adm', 'systemd-journal', 'wheel' can see all messages.
      Pass -q to turn off this notice.
May 31 05:26:54 systemd-testsuite systemd[156]: run-rd5b069a420f54cc592fbe1ebe2d90c90.scope: Failed to load configuration: No such file or directory
May 31 05:26:54 systemd-testsuite systemd[156]: run-rd5b069a420f54cc592fbe1ebe2d90c90.scope: Trying to enqueue job run-rd5b069a420f54cc592fbe1ebe2d90c90.scope/start/fail
May 31 05:26:54 systemd-testsuite systemd[156]: run-rd5b069a420f54cc592fbe1ebe2d90c90.scope: Installed new job run-rd5b069a420f54cc592fbe1ebe2d90c90.scope/start as 7
May 31 05:26:54 systemd-testsuite systemd[156]: run-rd5b069a420f54cc592fbe1ebe2d90c90.scope: Enqueued job run-rd5b069a420f54cc592fbe1ebe2d90c90.scope/start as 7
May 31 05:26:54 systemd-testsuite systemd[156]: Failed to set pids.max on /user.slice/user-1002.slice/user@1002.service/run-rd5b069a420f54cc592fbe1ebe2d90c90.scope: No such file or directory
May 31 05:26:54 systemd-testsuite systemd[156]: run-rd5b069a420f54cc592fbe1ebe2d90c90.scope: Failed to add PIDs to scope's control group: Permission denied
May 31 05:26:54 systemd-testsuite systemd[156]: run-rd5b069a420f54cc592fbe1ebe2d90c90.scope changed dead -> failed
May 31 05:26:54 systemd-testsuite systemd[156]: run-rd5b069a420f54cc592fbe1ebe2d90c90.scope: Job run-rd5b069a420f54cc592fbe1ebe2d90c90.scope/start finished, result=failed
May 31 05:26:54 systemd-testsuite systemd[156]: run-rd5b069a420f54cc592fbe1ebe2d90c90.scope: Unit entered failed state.
@evverx

Member

evverx commented Jun 20, 2016

For the record:

  • systemd-run --user --scope runs a process PID inside the cgroup /user.slice/user-M.slice/session-N.scope
  • systemd --user runs mkdir /sys/fs/cgroup/user.slice/user-M.slice/user@M.service/run-PID.scope
  • systemd --user tries to write PID to /sys/fs/cgroup/user.slice/user-M.slice/user@M.service/run-PID.scope/cgroup.procs

This fails. The kernel documentation states:

The writer must have write access to the "cgroup.procs" file of the
common ancestor of the source and destination cgroups.

https://github.com/torvalds/linux/blob/5518f66b5a64b76fd602a7baf60590cd838a2ca0/Documentation/cgroup-v2.txt#L331

(systemd --user doesn't have write access to the /sys/fs/cgroup/user.slice/user-M.slice/cgroup.procs)
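The rule quoted above can be sketched as a small helper that finds the common ancestor of the source and destination cgroups, which is the directory whose cgroup.procs the writer needs write access to. The function name `common_ancestor` and the example paths are illustrative assumptions, not anything systemd ships.

```bash
#!/usr/bin/env bash
# common_ancestor SRC DST
# Print the deepest cgroup path that is an ancestor of both arguments.
# Hypothetical helper for illustration only.
common_ancestor() {
    local -a a b
    local i result=""
    # Split both paths on "/" after dropping the leading slash.
    IFS=/ read -r -a a <<< "${1#/}"
    IFS=/ read -r -a b <<< "${2#/}"
    for ((i = 0; i < ${#a[@]} && i < ${#b[@]}; i++)); do
        [[ "${a[i]}" == "${b[i]}" ]] || break
        result+="/${a[i]}"
    done
    # An empty result means the only shared ancestor is the root cgroup.
    printf '%s\n' "${result:-/}"
}

# Usage on a live system (paths are examples):
#   anc=$(common_ancestor /user.slice/user-1002.slice/session-4.scope \
#                         /user.slice/user-1002.slice/user@1002.service/run-x.scope)
#   [ -w "/sys/fs/cgroup$anc/cgroup.procs" ] || echo "migration would be refused"
```

For the case in this report the common ancestor is the user-M.slice directory, which is owned by root.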

@htejun

Contributor

htejun commented Sep 7, 2016

Hmm.. so the restriction is there to prevent !priv processes from jumping across isolation points. Is it problematic to give write access of /sys/fs/cgroup/user.slice/user-M.slice/cgroup.procs to the user?

@peterhoeg

peterhoeg commented Sep 8, 2016

fwiw, manually setting the permissions makes it work.

@evverx

Member

evverx commented Sep 11, 2016

@htejun , sorry for the delay.

Is it problematic to give write access of /sys/fs/cgroup/user.slice/user-M.slice/cgroup.procs to the user?

Well, there is a complex interaction between pam_systemd, systemd, systemd --user and systemd-logind. I'm not sure which part of systemd should chown/chmod the cgroup.procs.

Another question: what does Delegate= really mean?

For unprivileged services (i.e. those using the User= setting), this allows processes to create a subhierarchy beneath its control group path.

└── user-1001.slice
    ├── session-4.scope
    └── user@1001.service <-- Delegate=yes
        ├── dbus.socket
         ...
        ├── init.scope <-- systemd --user here
        ├── run-pid.scope
        ├── -.mount
        ├── run-user-1001.mount
        ...

Why does systemd --user try to move a process from session-4.scope to user@1001.service/run-pid.scope? Should Delegate allow this?
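To see where a given process sits in the tree above, one can read its /proc/PID/cgroup file. The sketch below extracts the cgroup v2 path (the entry that starts with `0::`). The function name `unified_cgroup_path` is an assumption made for this example.

```bash
#!/usr/bin/env bash
# unified_cgroup_path: read the contents of /proc/<pid>/cgroup on stdin and
# print the cgroup v2 path, i.e. the part after the "0::" prefix.
# Returns non-zero if no such entry is present (pure legacy hierarchy).
# Hypothetical helper for illustration only.
unified_cgroup_path() {
    local line
    while IFS= read -r line; do
        if [[ $line == 0::* ]]; then
            printf '%s\n' "${line#0::}"
            return 0
        fi
    done
    return 1
}

# Usage on a live system:
#   unified_cgroup_path < /proc/self/cgroup
```

Run from an ssh login shell, this would be expected to print a path ending in session-N.scope rather than one below user@M.service.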

@floppym

Contributor

floppym commented Nov 5, 2016

This bug now affects anyone running systemd >= 232 with a recent kernel, since the new cgroup fstype is now used for /sys/fs/cgroup/systemd by default.

@evverx

Member

evverx commented Nov 5, 2016

@floppym , right. Marked as "regression"

@peterhoeg

peterhoeg commented Nov 7, 2016

This bug has been around for as long as the new hierarchy has been available.

My workaround is this script:

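#!/usr/bin/env bash

set -euo pipefail

# cgroup.procs of the calling user's slice (unified hierarchy at /sys/fs/cgroup)
file="/sys/fs/cgroup/user.slice/user-$(id -u).slice/cgroup.procs"

# Hand the file to the current user and keep it writable for group root
sudo chown "$(whoami):root" "$file"
sudo chmod g+w "$file"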
@floppym

Contributor

floppym commented Feb 23, 2017

Should this not block the v233 release? Or will v233 have unified cgroup support disabled out of the box?

@evverx

Member

evverx commented Feb 25, 2017

@floppym , I'm not sure what to do about it.

According to f98220a

It is expected that general-purpose distributions might want to override this.

I think all distros will use ./configure ... --with-default-hierarchy=legacy

@floppym

Contributor

floppym commented Feb 25, 2017

@evverx @keszybz It just seems awfully strange to have a default configuration that breaks one of systemd's own provided tools.

@keszybz keszybz added this to the v233 milestone Feb 27, 2017

@poettering

Member

poettering commented Feb 28, 2017

So, I am not sure what the right fix is here. I am not convinced that we should grant access to the slice's cgroup.procs file to the user, as that's a cgroup PID 1 manages, and I am very sure unprivileged users should not be able to muck with stuff PID 1 manages. It's fine to muck with stuff where PID 1 explicitly gave up control (through Delegate=yes), but the .slice units are not of that type.

I mean, could we somehow mark the cgroup as something the user is permitted to jump across, without actually giving him access to the file itself? Right now, setting the +w bit on the file does two things: the user can move PIDs into the cgroup itself, and can move them across all its children. I am only interested in permitting the latter, not the former.

Note that this is only an issue when "systemd-run --scope" is used from a process that was not invoked by "systemd --user" itself. One could claim that that's even a feature. In my normal GNOME session I cannot reproduce this issue, because my gnome-terminals run as children of "systemd --user". If I ssh into the local machine with my own user name, however, I can reproduce it, as then the migration is not permitted.

I am not sure this really deserves to be a blocker for v233, and I will drop this milestone now. We should find a solution for this though.

@htejun any ideas?

@poettering poettering removed this from the v233 milestone Feb 28, 2017

@floppym

Contributor

floppym commented Feb 28, 2017

Using systemd-run --user --scope screen is suggested by logind.conf(5) and systemd-run(1) as a workaround to make tools like screen and tmux work when KillUserProcesses is enabled in logind.

It would be really nice if that suggestion actually worked. ;-)

@poettering

Member

poettering commented Feb 28, 2017

So, after chatting with @htejun: we can make this work as follows.

When we create the user slice, we should:

  1. create the slice's cgroup
  2. enable at least one controller in cgroup.subtree_control (e.g. "pids")
  3. chown the slice's cgroup.procs file to the user

This would be sufficient to make things work. Enabling the controller means that the cgroup can't accept PIDs anymore: it becomes an inner node even if it is located at a leaf. That way, as soon as we chown the slice's cgroup.procs file to the user, this will only enable the cross-movement of PIDs below this slice, but it won't permit the user to directly add PIDs to the slice cgroup, as the kernel will refuse this for inner nodes. And since the cgroup.subtree_control file is not chowned to the user, he can't change the inner node state on his own, hence we should be safe.
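The three steps above could be sketched roughly as follows. This only illustrates the proposal and is not how systemd ended up fixing the issue. The function name `prepare_user_slice` and its arguments are hypothetical, and on a real system the "pids" controller would also have to be enabled in the parent's cgroup.subtree_control before it can be enabled here.

```bash
#!/usr/bin/env bash
# prepare_user_slice ROOT SLICE UID
#   ROOT   mount point of the unified hierarchy (e.g. /sys/fs/cgroup)
#   SLICE  slice path relative to ROOT (e.g. user.slice/user-1000.slice)
#   UID    user who should be allowed to move PIDs below the slice
# Hypothetical helper; would have to run as root on a real system.
prepare_user_slice() {
    local root=$1 slice=$2 uid=$3
    local dir="$root/$slice"

    # 1. create the slice's cgroup (the kernel populates the control files)
    mkdir -p "$dir" || return

    # 2. enable one controller for the children, turning the slice into an
    #    inner node that cannot hold processes itself
    printf '+pids\n' > "$dir/cgroup.subtree_control" || return

    # 3. give the user write access to the slice's cgroup.procs, but leave
    #    cgroup.subtree_control owned by root
    chown "$uid" "$dir/cgroup.procs"
}

# Usage on a live system (as root, values are examples):
#   prepare_user_slice /sys/fs/cgroup user.slice/user-1000.slice 1000
```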

Now, all that sounds pretty straightforward... There's just one problem with it: so far we have no concept of user-ownership for a slice. Slices are just slices with no one owning them. We'd have to introduce that, and that's quite hard, as it means we'd have to have a User= setting in the slice, and that means we'd have to fork off a process to resolve it (since NSS in PID 1 is not OK), and that's yuck...

@evverx

Member

evverx commented Feb 28, 2017

enable at least one controller in cgroup.subtree_control (e.g. "pids")

README says

Kernel Config Options:
CONFIG_CGROUPS (it is OK to disable all controllers)

We can't enable at least one controller if all controllers are disabled. Am I missing something?
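Whether a controller can be enabled at all can be checked by reading the cgroup's cgroup.controllers file, which lists the available controllers separated by spaces and is empty when none are available. A minimal sketch, with the function name `controller_available` being an assumption for this example:

```bash
#!/usr/bin/env bash
# controller_available FILE NAME
# Succeed if controller NAME is listed in FILE, where FILE has the format of
# a cgroup.controllers file (space-separated names on one line, or empty).
# Hypothetical helper for illustration only.
controller_available() {
    local file=$1 want=$2 name
    local -a names=()
    [[ -r $file ]] || return 1
    # read returns non-zero on an empty file; that just leaves the array empty
    read -r -a names < "$file" || true
    for name in "${names[@]}"; do
        if [[ $name == "$want" ]]; then
            return 0
        fi
    done
    return 1
}

# Usage on a live system (path is an example):
#   controller_available /sys/fs/cgroup/user.slice/cgroup.controllers pids \
#       || echo "pids controller not available here"
```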

@evverx

Member

evverx commented Mar 1, 2017

Hm, what about a hybrid hierarchy?

-bash-4.3$ grep cgroup /proc/self/mountinfo
24 18 0:21 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:7 - tmpfs tmpfs ro,mode=755
25 24 0:22 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime shared:8 - cgroup2 cgroup rw
26 24 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:9 - cgroup cgroup rw,xattr,name=systemd
28 24 0:25 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,net_cls,net_prio
29 24 0:26 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,memory
30 24 0:27 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,cpu,cpuacct
31 24 0:28 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,devices
32 24 0:29 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,hugetlb
33 24 0:30 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,cpuset
34 24 0:31 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,freezer
35 24 0:32 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,blkio
36 24 0:33 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,pids
37 24 0:34 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:22 - cgroup cgroup rw,perf_event

-bash-4.3$ cat /sys/fs/cgroup/unified/cgroup.controllers # prints nothing
@keszybz

Member

keszybz commented Mar 1, 2017

We can't enable at least one controller if all controllers are disabled.

It's enough of a corner case to ignore. You need cgroups-v2, disabled controllers, and to be running outside of a user@.service child.

There's just one problem with it: so far we have no concept of user-ownership for a slice. Slices are just slices with no one owning them. We'd have to introduce that, and that's quite hard, as it means we'd have to have a User= setting in the slice

Wouldn't it be better to avoid generic support for this (at least right now), and just reach out from user@.service when starting it, and chown the slice's cgroup.procs? This could be done as an ExecStartPre=+ thing. As long as systemd-user@.service is inactive, transferring PIDs out of the slice would not be possible, but I think that's actually OK.

@evverx

Member

evverx commented Mar 1, 2017

It's enough of a corner case to ignore

Indeed.

But we can't enable at least one controller if the hybrid hierarchy is used. Is the hybrid hierarchy usage a corner case too?

@keszybz keszybz modified the milestones: v235, v234 Jun 17, 2017

@wavexx

wavexx commented Jun 28, 2017

Debian unstable is using hybrid as the default mode, which breaks --user + --scope.
Is it really expected that distributions request the legacy controller manually, or is there a reasonable fix pending?

As running processes in a transient scope was the expected migration away from KillUserProcesses=yes, it's a bit sad that the whole setup is broken.

@poettering

Member

poettering commented Sep 11, 2017

Hmm, so I tested this again in hybrid, on kernel 4.11. It appears to just work now here. Is this still an issue? Did the kernel change recently on this?

@lilydjwg

lilydjwg commented Sep 12, 2017

I don't know what hybrid mode is, but I still have the issue with Arch Linux, kernel 4.12.4, systemd 234.11.

@gdamjan

Contributor

gdamjan commented Sep 12, 2017

On Arch, with kernel 4.13.1 and systemd 234.11-9, the systemd-run command (from the original bug report) works in neither hybrid nor unified mode.

The error is:

Sep 12 14:30:09 arch-uefi-test systemd[276]: run-r072f0086eca04eb9900cd0002d92c2b6.scope: Failed to add PIDs to scope's control group: Permission denied

Unified mode was enabled with systemd.unified_cgroup_hierarchy=1; without it, systemd defaulted to hybrid mode.

@sourcejedi

Contributor

sourcejedi commented Sep 12, 2017

+1. I'm getting this message with systemd v234-424-gc13ee7cc8 installed from source. No options to ./configure, no options on kernel cmdline. System is Fedora 26 in a VM. Kernel 4.12.9-300.fc26.x86_64 and 4.11.11-300.fc26.x86_64, both show this result. Selinux is in permissive mode.

There doesn't seem to be a problem with the packaged systemd-233-6.fc26.x86_64, which is what I have on the physical host. I don't know why that would be.

@poettering

Member

poettering commented Sep 27, 2017

Hmm, after looking into this, I figure little has changed on the kernel side. For now the only thing we can do is move this to v236...

@poettering poettering modified the milestones: v235, v236 Sep 27, 2017

@poettering poettering modified the milestones: v236, v237 Nov 15, 2017

@akors

akors commented Jan 16, 2018

So what is the currently suggested way to keep a screen session running after logging out?

@keszybz

Member

keszybz commented Jan 17, 2018

Don't use unified hierarchy yet.

@poettering poettering modified the milestones: v237, v238 Jan 24, 2018

@gasinvein

gasinvein commented Jan 25, 2018

So is there any workaround without disabling unified hierarchy?

poettering added a commit to poettering/systemd that referenced this issue Feb 7, 2018

core: add new new bus call for migrating foreign processes to scope/service units

This adds a new bus call to service and scope units called
AttachProcesses() that moves arbitrary processes into the cgroup of the
unit. The primary user for this new API is systemd itself: the systemd
--user instance uses this call of the systemd --system instance to
migrate processes if itself gets the request to migrate processes and
the kernel refuses this due to access restrictions.

The primary use-case of this is to make "systemd-run --scope --user …"
invoked from user session scopes work correctly on pure cgroupsv2
environments. There, the kernel refuses to migrate processes between two
unprivileged-owned cgroups unless the requestor as well as the ownership
of the closest parent cgroup all match. This however is not the case
between the session-XYZ.scope unit of a login session and the
user@ABC.service of the systemd --user instance.

The new logic always tries to move the processes on its own, but if
that doesn't work when being the user manager, then the system manager
is asked to do it instead.

The new operation is relatively restrictive: it will only allow to move
the processes like this if the caller is root, or the UID of the target
unit, caller and process all match. Note that this means that
unprivileged users cannot attach processes to scope units, as those do
not have "owning" users (i.e. they have no User= field).

Fixes: systemd#3388
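The logic described in this commit message can be modelled with two small shell functions: one for the access rule (caller is root, or the UIDs of target unit, caller and process all match) and one for the "try directly, then ask the system manager" fallback. This is only a model of the description above, not systemd's implementation. All function names are hypothetical, and the two callbacks stand in for the direct cgroup.procs write and the bus call respectively.

```bash
#!/usr/bin/env bash
# attach_allowed CALLER_UID UNIT_UID PROCESS_UID
# UNIT_UID is the UID from the unit's User= setting, or empty if the unit has
# none (as is the case for scope units). Hypothetical model of the rule.
attach_allowed() {
    local caller=$1 unit=$2 proc=$3
    if [[ $caller == 0 ]]; then
        return 0        # root may always attach
    fi
    # otherwise the unit must have an owner, and all three UIDs must match
    [[ -n $unit && $caller == "$unit" && $caller == "$proc" ]]
}

# migrate_with_fallback PID DIRECT FALLBACK
# Call DIRECT with PID; if it fails, call FALLBACK with PID instead.
# DIRECT and FALLBACK are names of commands or shell functions supplied by
# the caller (placeholders for "write to cgroup.procs" and "ask the system
# manager").
migrate_with_fallback() {
    local pid=$1 direct=$2 fallback=$3
    "$direct" "$pid" 2>/dev/null || "$fallback" "$pid"
}
```

With these definitions, `attach_allowed 1002 "" 1002` fails, which matches the note that unprivileged users cannot attach processes to units without an owning user.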

poettering added a commit to poettering/systemd that referenced this issue Feb 9, 2018

poettering added a commit to poettering/systemd that referenced this issue Feb 9, 2018

poettering added a commit to poettering/systemd that referenced this issue Feb 12, 2018

@ghost ghost referenced this issue Feb 12, 2018

Closed

systemd error messages #1216

@keszybz keszybz removed the has-pr label Feb 15, 2018

dm0- pushed a commit to dm0-/systemd that referenced this issue Oct 30, 2018

Merge pull request systemd#3388 from bgilbert/disarm-kernel
sys-kernel/coreos-*: drop arm64 support