
Units with DynamicUser=yes fail in lxc container #9493

Closed
eworm-de opened this issue Jul 3, 2018 · 34 comments

@eworm-de
Contributor

eworm-de commented Jul 3, 2018

systemd version the issue has been seen with

v239

This has been reported for systemd-networkd, which recently switched to using a dynamic user, but I think other units failed before as well.

Used distribution

Arch Linux

Expected behaviour you didn't see

Successfully execute systemd-run --property=DynamicUser=yes /usr/bin/true

Unexpected behaviour you saw
Unit fails:

Jul 03 13:49:49 eworm systemd[7574]: run-r43f83bd017244093b488784c4e69b1bb.service: Failed to update dynamic user credentials: Permission denied
Jul 03 13:49:49 eworm systemd[7574]: run-r43f83bd017244093b488784c4e69b1bb.service: Failed at step USER spawning /usr/bin/true: Permission denied
Jul 03 13:49:49 eworm systemd[1]: run-r43f83bd017244093b488784c4e69b1bb.service: Main process exited, code=exited, status=217/USER
Jul 03 13:49:49 eworm systemd[1]: run-r43f83bd017244093b488784c4e69b1bb.service: Failed with result 'exit-code'.

Steps to reproduce the problem

  • Boot a lxc container
  • Run systemd-run --property=DynamicUser=yes /usr/bin/true
@poettering
Member

are you using userns? how many users have you assigned to the container?

@poettering
Member

can you run "strace -p1 -s500 -f" in the container, and check out what precisely fails?

@poettering
Member

@brauner, maybe you have an idea?

@eworm-de
Contributor Author

eworm-de commented Jul 3, 2018

are you using userns?

The host kernel has CONFIG_USER_NS enabled.

how many users have you assigned to the container?

The file /etc/passwd inside the container lists 23 users.

@eworm-de
Contributor Author

eworm-de commented Jul 3, 2018

can you run "strace -p1 -s500 -f" in the container, and check out what precisely fails?

See strace.log

@poettering
Member

poettering commented Jul 3, 2018

are you using userns?

The host kernel has CONFIG_USER_NS enabled.

Well, but is LXC configured to use it?

how many users have you assigned to the container?

The file /etc/passwd inside the container lists 23 users.

Nah, the question is what UID range you assigned to the container, i.e. what does cat /proc/self/uid_map say when you call it from inside the container?

@poettering
Member

Hmm, fcntl(F_SETLKW) fails with EACCES inside of the container, according to the logs you posted. This suggests POSIX file locks are disabled for the container. Do you know if LXC sets up a seccomp policy or so for that? Or have you explicitly masked that?

@poettering
Member

Also, what does cat /proc/1/limits | grep "file lock" say inside of your container?

poettering added the pid1 label Jul 3, 2018
@eworm-de
Contributor Author

eworm-de commented Jul 3, 2018

BTW, here is the downstream bug report:
https://bugs.archlinux.org/task/59155

I have access to a number of machines on two Proxmox hosts. I can't tell the details off-hand; I will have to figure them out.

Settings for these two differ, but the result is the same. The strace log shows EACCES for fcntl(F_SETLKW) for both:

# cat /proc/self/uid_map
         0          0 4294967295
# cat /proc/1/limits | grep "file lock"
Max file locks            unlimited            unlimited            locks
# cat /proc/self/uid_map
         0     100000      65536
# cat /proc/1/limits | grep "file lock"
Max file locks            unlimited            unlimited            locks

@poettering
Member

Is some MAC in effect on the host, i.e. AppArmor or SELinux or so?

I understand Proxmox is some commercial provider, and you have no access to the host side of things? I am not sure what "proxmox" precisely does, but if they disable POSIX file locking, then you'll always have trouble running non-trivial software inside such containers.

Please contact their support, and ask them to correct their MAC/seccomp policies to allow POSIX file locks. There's little else we can do on this.

systemd requires POSIX locks to work, and given that that's not precisely an exotic requirement, I am very sure it's Proxmox who should fix their policies on this.

@brauner
Contributor

brauner commented Jul 3, 2018

Let me rope in @Blub. @Blub, any idea what this could be caused by and if this is specific to Proxmox?

@brauner
Contributor

brauner commented Jul 3, 2018

Also, does your dmesg output show any AppArmor or MAC denials in general?

@poettering
Member

Also, I am pretty sure that blocked syscalls should return EPERM, not EACCES... EACCES is mostly for file access, while EPERM is usually used for API access.

@ghost

ghost commented Jul 3, 2018

It all works fine with 238, userns, LXC and all. In this case the LXC host is a VPS with Virtualizor based on KVM 7.4.1 as hypervisor. The moment 239 introduces DynamicUser the issue starts. Thus I reckon it is not specific to Proxmox.


It is perhaps an issue with AppArmor, though that was no issue with 238.

From the LXC host kernel log only this denial is apparent, and it is probably not relevant to the issue:

audit: type=1400 audit(1530602405.099:27): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/dev/" pid=2108 comm="mount" flags="rw, nosuid, remount"

Other than that, there are no MAC denials in the LXC host kernel log.

@poettering
Member

It all works fine with 238, userns, LXC and all. In this case the LXC host is a VPS with Virtualizor based on KVM 7.4.1 as hypervisor. The moment 239 introduces DynamicUser the issue starts. Thus I reckon it is not specific to Proxmox.

Did you actually run "systemd-run -p DynamicUser=1 /bin/true" on 238?

@ghost

ghost commented Jul 3, 2018

No, that did not come up in the discussion of #9427.

@Blub

Blub commented Jul 3, 2018

Proxmox by default uses the default lxc common.seccomp policy; with user namespaces it also adds ENOSYS for keyctl(). As for the AppArmor profile, it should be the same; there may at most be some additional /sys mount deny rule IIRC. I don't remember anything that would affect locking, so I'll have to take a closer look tomorrow.

@ghost

ghost commented Jul 4, 2018

@poettering

On 238:

systemd-run -p DynamicUser=1 /bin/true
Running as unit: run-re1b5d046863546018187412d08e2fee3.service

On 239:

systemd-run -p DynamicUser=1 /bin/true
Running as unit: run-rc32210ace7b245e991ff1c62afc5aef6.service

And since 239:

systemd[1]: Failed to start Network Service.

with the details in the aforementioned downstream bug report and #9427.

@eworm-de
Contributor Author

eworm-de commented Jul 4, 2018

For whatever reason systemd-run does not print errors and returns success. But the journal should tell a different story.

@poettering
Member

@eworm-de services default to Type=simple, both in regular service files and when systemd-run is used. Type=simple means that the processes are forked off by the service manager and not waited for in any way for the start to be considered successful. This means "systemd-run" won't wait for the execve() to succeed before returning successfully, but will do so immediately. Which is why you'll see such failures only in the journal.

I figure we could add a new Type= to systemd which would wait until systemd's own initializations in the child succeeded before considering the service up. This is not precisely trivial though, as conceptually there's already the problem that in order to know that the execve() was successful we'd have to ack the startup after the execve(), but the execve() means our code has already been swapped out by the new process, hence we cannot ack it...
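
For context, the generic trick for detecting an execve() failure from the parent is a close-on-exec pipe: the child writes errno into the pipe only if execve() fails, and the parent reading EOF is the only (implicit) signal of success, which is exactly the limitation described above. A rough standalone sketch of that general pattern (illustrative only, not systemd code; the /usr/bin/true target is just an example):

/* exec-report.c — sketch of the generic close-on-exec pipe trick, purely for
 * illustration; this is not what systemd does.  The child writes errno into
 * the pipe only if execve() fails; if execve() succeeds the pipe is closed
 * automatically (O_CLOEXEC) and the parent just sees EOF.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
        int p[2];

        if (pipe2(p, O_CLOEXEC) < 0) {
                perror("pipe2");
                return 1;
        }

        pid_t pid = fork();
        if (pid < 0) {
                perror("fork");
                return 1;
        }

        if (pid == 0) {                 /* child */
                close(p[0]);
                execl("/usr/bin/true", "true", (char *) NULL);

                /* Only reached if execve() failed: report errno to the parent. */
                int err = errno;
                (void) write(p[1], &err, sizeof(err));
                _exit(127);
        }

        /* parent */
        close(p[1]);
        int err = 0;
        ssize_t n = read(p[0], &err, sizeof(err));
        close(p[0]);

        if (n == 0)
                printf("execve() succeeded (got EOF on the pipe)\n");
        else
                printf("execve() failed: %s\n", strerror(err));

        (void) waitpid(pid, NULL, 0);
        return 0;
}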

@eworm-de
Contributor Author

eworm-de commented Jul 4, 2018

Indeed the host logs messages similar to this:

Jul 04 15:11:11 host audit[28404]: AVC apparmor="DENIED" operation="file_lock" profile="lxc-container-default-cgns" pid=28404 comm="(true)" family="unix" sock_type="dgram" protocol=0 addr=none

So there is nothing we can do from the systemd side, no?

@brauner
Contributor

brauner commented Jul 4, 2018

@eworm-de, what kernel is this on? Best to give uname -a and also what distro.

@brauner
Contributor

brauner commented Jul 4, 2018

That is, what distro is the host running?

@brauner
Contributor

brauner commented Jul 4, 2018

I suspect that these are the socket mediation AppArmor patches that got merged upstream but then got reverted. If this is a distro kernel carrying these patches nonetheless, that would explain it.

@poettering
Member

We issue those POSIX locks on an AF_UNIX socketpair() socket, btw.
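
A minimal standalone probe of this inside a container, without involving systemd (a sketch, not systemd's actual code; the file name and /tmp path are illustrative), takes a POSIX lock first on a regular temp file and then on one end of an AF_UNIX datagram socketpair(). On the affected hosts in this thread the socketpair lock is the one reported to come back with EACCES, while both should succeed on an unconfined host or a fixed kernel:

/* lock-probe.c — illustrative sketch, not systemd's actual implementation.
 * Build: cc -o lock-probe lock-probe.c
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static void try_lock(const char *what, int fd) {
        struct flock fl = {
                .l_type = F_WRLCK,
                .l_whence = SEEK_SET,
                .l_start = 0,
                .l_len = 0,            /* 0 = lock the whole file */
        };

        if (fcntl(fd, F_SETLKW, &fl) < 0)
                printf("%s: F_SETLKW failed: %s\n", what, strerror(errno));
        else
                printf("%s: F_SETLKW succeeded\n", what);
}

int main(void) {
        char path[] = "/tmp/lock-probe.XXXXXX";
        int file_fd, sp[2];

        /* 1. Lock on a regular file: checks whether fcntl() locks are blocked
         *    wholesale, e.g. by a seccomp policy. */
        file_fd = mkstemp(path);
        if (file_fd < 0) {
                perror("mkstemp");
                return 1;
        }
        unlink(path);
        try_lock("regular file", file_fd);

        /* 2. Lock on an AF_UNIX datagram socketpair: the case denied by the
         *    AppArmor profile in the logs above (operation="file_lock"). */
        if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sp) < 0) {
                perror("socketpair");
                return 1;
        }
        try_lock("AF_UNIX socketpair", sp[0]);

        close(file_fd);
        close(sp[0]);
        close(sp[1]);
        return 0;
}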

@eworm-de
Contributor Author

eworm-de commented Jul 4, 2018

This is a Proxmox distribution, based on Debian. The kernel is:

Linux host 4.15.17-3-pve #1 SMP PVE 4.15.17-13 (Mon, 18 Jun 2018 17:15:04 +0200) x86_64 GNU/Linux

@brauner
Contributor

brauner commented Jul 4, 2018

@eworm-de, can you try starting the container with lxc.apparmor.profile = unconfined set in its config and see if you see the same errors?

@brauner
Contributor

brauner commented Jul 4, 2018

So I suspect this is an actual AppArmor bug, and actually an old one. Taking locks on fds alone should be totally legal. I'm going to escalate this bug with the AppArmor people to get some motion on this.
@poettering, this is unrelated to systemd, but if you don't terribly mind it would be good if you could keep this issue open to track it.
For posterity, here's another instance of that bug: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/1575779.

@brauner
Contributor

brauner commented Jul 5, 2018

So, the good news is that this is all fixed upstream starting with 4.17 with the socket mediation patchset that got merged a short while ago. The bad news is that you would need to get Proxmox to backport this patchset and it is quite large:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=80a17a5f501ea048d86f81d629c94062b76610d4

I couldn't reproduce this on newer kernels, but it would be good if someone other than me could confirm that it is indeed fixed (looking at you, @Blub ;)).
Otherwise we can close this issue.

@brauner
Contributor

brauner commented Jul 5, 2018

I've requested the socket mediation patchset to be backported to the Ubuntu LTS kernels. This can be tracked here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780227

@ghost

ghost commented Jul 5, 2018

@brauner

Just updated the kernel of the LXC host (a VPS on a KVM 7.4.1 based hypervisor) to 4.17.4-041704.
Locks placed on AF_UNIX sockets are no longer denied by AppArmor.

  • unprivileged Arch Linux container with systemd 238
  • AppArmor 2.12 on the LXC host

Just for completeness:

  • During boot, the LXC host system log shows:

server kernel: [ 792.772569] audit: type=1400 audit(1530791008.499:39): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/sys/fs/cgroup/unified/" pid=11901 comm="systemd" fstype="cgroup2" srcname="cgroup2" flags="rw, nosuid, nodev, noexec"
server kernel: [ 792.772578] audit: type=1400 audit(1530791008.499:40): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/sys/fs/cgroup/unified/" pid=11901 comm="systemd" fstype="cgroup2" srcname="cgroup2" flags="rw, nosuid, nodev, noexec"
server kernel: [ 792.897652] audit: type=1400 audit(1530791008.623:41): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/sys/kernel/config/" pid=11978 comm="mount" fstype="configfs" srcname="configfs"
server kernel: [ 792.897703] audit: type=1400 audit(1530791008.623:42): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/sys/kernel/config/" pid=11978 comm="mount" fstype="configfs" srcname="configfs" flags="ro"
server kernel: [ 792.914358] audit: type=1400 audit(1530791008.639:43): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=11990 comm="(networkd)" flags="rw, rslave"
server kernel: [ 793.042265] audit: type=1400 audit(1530791008.767:44): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=12062 comm="(ostnamed)" flags="rw, rslave"


  • Running hostnamectl status in the container:

Jul 5 13:44:45 server kernel: [ 869.611220] audit: type=1400 audit(1530791085.334:45): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=12999 comm="(ostnamed)" flags="rw, rslave"

However, it did produce output, whilst previously with kernel 4.15 it would just time out.

@eworm-de
Contributor Author

eworm-de commented Jul 5, 2018

All these messages are about mounts being denied. The issue we are discussing here is with POSIX file locks (operation="file_lock"). So does this result in any logs on the host or inside the container?

systemd-run --property=DynamicUser=yes /usr/bin/true

@ghost

ghost commented Jul 5, 2018

updated kernel 4.17.4-041704
locks placed on AF_UNIX sockets are no longer denied by AppArmor

So does this result in any logs on the host or inside the container?

No

@poettering
Member

Given that this was fixed elsewhere by now, let's close this here. Hope that makes sense.
