binfmt: several cleanups #25693
@bluca Thank you for the review. The above comment is addressed. Upgrading the green label. |
I think Moreover, we want systemd-binfmt to trigger it, since we log the PID of the triggering process, and that should be systemd-binfmt, not PID 1. |
@poettering Thank you for the suggestion. I did not notice there exists |
@poettering responding here for visibility:
I tested your suggestion along with the other patches in this PR, but it does not appear to be sufficient. You can demonstrate this easily with the following:
|
what's the point of mounting that if it doesn't work? |
Fair point. According to this documentation, this is done to support "legacy init systems which require those to be mounted or be mountable inside the container." I have not found more specific references than that yet. Are there instances where one would expect |
(Not familiar, but may be related to https://lore.kernel.org/all/20211028103114.2849140-2-brauner@kernel.org/) |
Hmm, binfmt_misc is virtualized? interesting. @enr0n is your kernel new enough for that? |
@enr0n i don't understand how binfmt_misc ended up being mounted in your container? this is lxd, and lxd uses userns by default, so your container shouldn't have the perms to. are you turning that off? |
AFAICT these patches are not (yet?) in the mainline kernel. There was some chatter about the status of patches on the v2 thread a few months ago.
I'll have to look into the details to answer this better, but I am running a very plain LXD config with unprivileged containers by default. I can double check this with a fresh install to make sure I haven't missed something special in my setup. |
I just did the following in a clean Ubuntu 22.04 VM:
and binfmt_misc is still mounted in the container. |
LXD does this with the apparmor profile it seems:
$ sudo grep binfmt /var/snap/lxd/common/lxd/security/apparmor/profiles/lxd-lunar
# Handle binfmt
mount fstype=binfmt_misc -> /proc/sys/fs/binfmt_misc/,
deny /proc/sys/fs/binfmt_misc/{,**} rwklx, |
@enr0n but apparmor can only deny stuff normally allowed, but not allow stuff normally denied. So, I don't understand how unblocking binfmt_misc would do any good, the mounting should have been forbidden anyway... |
Just providing a bit of background here for why it's done this way in LXD. binfmt_misc is indeed bind-mounted by the container manager from the host system into the container. Now as for why this is done at all, it's because some other init systems (upstart on CentOS and older Ubuntu at least) expect to be able to mount binfmt_misc and fail quite spectacularly when they cannot do this (init stops, nothing starts up). |
hmm, so not sure how to address this best then. ideally we wouldn't trigger the binfmt_misc mount via a ConditionPathIsReadWrite= check (since as mentioned this is a bit icky: on the host the thing is mounted via autofs implemented by pid 1 itself, so we'd trigger the autofs we ourselves maintain, which won't cause a deadlock or so because there's a circuit breaker, but it's still ugly). Hence I'd prefer the ConditionPathIsMountPoint= check, which should cover things nicely, except for the LXD case where the thing is mounted but not usable. Sniff. |
Thank you for the explanation. But why not mount it read-only instead of using apparmor? At least until namespace support for binfmt comes to the kernel, mounting it read-only makes things simpler, and systemd-binfmt.service would then be happy. |
hmm, so with @yuwata's patch systemd-binfmt actually does the read-only check internally as well. So maybe the approach is actually to use ConditionPathIsMountPoint= here, which will cover all non-LXD use cases very nicely. And then LXD is covered via @yuwata's patch which then does an explicit read-only check early-on. This should make things clean for everyone. The only downside would be that in LXD the systemd-binfmt tool gets invoked but then immediately exits, while for other cases we wouldn't even start that. That sounds like an OK compromise, no? |
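For readers following along, the two-stage compromise above can be sketched roughly as follows. This is only an illustration in Python, not the systemd code (which is C). The function name `binfmt_action` and its return strings are made up for this sketch, and the read-only test via `statvfs` is an assumption about how such a check could look, not a description of what the patch does.

```python
import os


def binfmt_action(path: str = "/proc/sys/fs/binfmt_misc") -> str:
    """Return "skip", "exit-early" or "apply" for the given path.

    Hypothetical helper: it mimics the decision flow discussed in the
    thread, not the actual systemd-binfmt implementation.
    """
    # Stand-in for a mount-point condition on the unit: if nothing is
    # mounted at the path, the service would not be started at all.
    if not os.path.ismount(path):
        return "skip"

    # Stand-in for an early check inside the tool: the file system is
    # mounted, but if it is read-only there is nothing useful to do.
    if os.statvfs(path).f_flag & os.ST_RDONLY:
        return "exit-early"

    return "apply"
```

With this shape, a host without the mount never starts the tool, while a container with a read-only mount starts it and returns immediately, which is the downside mentioned above.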
For the LXD case, I think |
Then, reimplement path_is_read_only_fs() by the function to avoid race.
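As a rough illustration of why a descriptor-based check avoids a race: the path is resolved once, and every later query goes through the same open file descriptor, so the object cannot be swapped between the lookup and the check. The sketch below is Python, not the C from the commit. The name `fd_is_read_only_fs` is an assumption (the commit message does not name the new function here), and using `fstatvfs` with `ST_RDONLY` is only one plausible way to do the check.

```python
import os


def fd_is_read_only_fs(fd: int) -> bool:
    """Hypothetical fd-based check: query the file system behind an open fd."""
    return bool(os.fstatvfs(fd).f_flag & os.ST_RDONLY)


def path_is_read_only_fs(path: str) -> bool:
    """Path-based wrapper built on the fd-based helper."""
    # O_PATH (Linux) pins the object without opening it for I/O; fall back
    # to O_RDONLY on platforms where O_PATH is not available.
    flags = getattr(os, "O_PATH", os.O_RDONLY) | os.O_CLOEXEC
    fd = os.open(path, flags)
    try:
        return fd_is_read_only_fs(fd)
    finally:
        os.close(fd)
```

A caller that already holds a descriptor can call the fd-based helper directly and skip the second path lookup.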
No functional changes, just refactoring and preparation for later commits.
@poettering and @enr0n Thank you for the suggestions. All comments are addressed. PTAL. |
replaces #25690.