Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

binfmt: several cleanups #25693

Merged
merged 5 commits into from
Dec 15, 2022
Merged

binfmt: several cleanups #25693

merged 5 commits into from
Dec 15, 2022

Conversation

yuwata
Copy link
Member

@yuwata yuwata commented Dec 10, 2022

replaces #25690.

@yuwata yuwata added the binfmt label Dec 10, 2022
@github-actions github-actions bot added units please-review PR is ready for (re-)review by a maintainer labels Dec 10, 2022
src/binfmt/binfmt.c Outdated Show resolved Hide resolved
@bluca bluca added good-to-merge/with-minor-suggestions and removed please-review PR is ready for (re-)review by a maintainer labels Dec 10, 2022
@yuwata
Copy link
Member Author

yuwata commented Dec 12, 2022

@bluca Thank you for the review. The above comment is addressed. Upgrading the green label.

@yuwata yuwata added good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed and removed good-to-merge/with-minor-suggestions labels Dec 12, 2022
@poettering
Copy link
Member

poettering commented Dec 12, 2022

I think ConditionPathIsMountPoint=/proc/sys/fs/binfmt_misc would be a better approach, since it won't trigger the autofs that is used there. And that's good, since we when the autofs isn't triggered yet, PID 1 couldn't prohibit it anyway, as the deadlock-breakage of the autofs logic prohibits that.

Moreover, we want systemd-binfmt to trigger it, since we log the PID of the trggering process, and that should be systemd-binfmt, not PID 1.

src/basic/stat-util.c Outdated Show resolved Hide resolved
@poettering poettering added reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks and removed good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed labels Dec 12, 2022
@github-actions github-actions bot added please-review PR is ready for (re-)review by a maintainer and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Dec 12, 2022
@yuwata
Copy link
Member Author

yuwata commented Dec 12, 2022

@poettering Thank you for the suggestion. I did not notice there exists access_fd(). Now the function is used, and path_is_read_only_fs() is implemented on fd_is_read_only_fs(). PTAL!

@enr0n
Copy link
Contributor

enr0n commented Dec 12, 2022

@poettering responding here for visibility:

So, I think this should instead do: ConditionPathIsMountPoint=/proc/sys/fs/binfmt_misc to check if binfmt_misc or autofs is mounted there. That should mean we won't trigger the fs needlessly, and we should work on any system.

We generally avoid adding checks specific to container managers, i.e. any check linked to lxc directly is icky.

I tested your suggestion along with the other patches in this PR, but it does not appear to be sufficient. /proc/sys/fs/binfmt_misc is mounted, but since we are in an unprivileged container we still do not have permission to read/write this mount point. This thread has more details on this (note that the kernel patches for namespace support in binfmt_misc never landed).

You can demonstrate this easily with the following:

lxc launch ubuntu-daily:lunar lunar
lxc exec lunar bash
touch /usr/lib/binfmt.d/foo.conf
systemctl start systemd-binfmt.service

@poettering
Copy link
Member

what's the point of mounting that if it doesn't work?

@enr0n
Copy link
Contributor

enr0n commented Dec 12, 2022

what's the point of mounting that if it doesn't work?

Fair point. According to this documentation, this is done to support "legacy init systems which require those to be mounted or be mountable inside the container." I have not found more specific references than that yet.

Are there instances where one would expect /proc/sys/fs/binfmt_misc to securely work from containers? If not would it be acceptable to add ConditionVirtualization=!container instead of singling out LXD?

@yuwata
Copy link
Member Author

yuwata commented Dec 13, 2022

(Not familiar, but may be related to https://lore.kernel.org/all/20211028103114.2849140-2-brauner@kernel.org/)

@poettering
Copy link
Member

Hmm, binfmt_misc is virtualized? interesting. @enr0n is your kernel new enough for that?

@poettering
Copy link
Member

@enr0n i don't understand how binfmt_misc ended up being mounted in your container? this is lxd, and lxd uses userns by efault, so your container shouldn't have the perms to. are you turning that off?

@enr0n
Copy link
Contributor

enr0n commented Dec 13, 2022

Hmm, binfmt_misc is virtualized? interesting. @enr0n is your kernel new enough for that?

AFAICT these patches are not (yet?) in the mainline kernel. There was some chatter about the status of patches on the v2 thread a few months ago.

@enr0n i don't understand how binfmt_misc ended up being mounted in your container? this is lxd, and lxd uses userns by efault, so your container shouldn't have the perms to. are you turning that off?

I'll have to look into the details to answer this better, but I am running a very plain LXD config with unprivileged containers by default. I can double check this with a fresh install to make sure I haven't missed something special in my setup.

@enr0n
Copy link
Contributor

enr0n commented Dec 13, 2022

I just did the following in a clean Ubuntu 22.04 VM:

sudo snap install lxd
lxd init --auto
lxc launch ubuntu-daily:lunar lunar
lxc exec lunar bash

and binfmt_misc is still mounted in the container.

@enr0n
Copy link
Contributor

enr0n commented Dec 13, 2022

@enr0n i don't understand how binfmt_misc ended up being mounted in your container? this is lxd, and lxd uses userns by efault, so your container shouldn't have the perms to. are you turning that off?

LXD does this with the apparmor profile it seems:

$ sudo grep binfmt /var/snap/lxd/common/lxd/security/apparmor/profiles/lxd-lunar
  # Handle binfmt
  mount fstype=binfmt_misc -> /proc/sys/fs/binfmt_misc/,
  deny /proc/sys/fs/binfmt_misc/{,**} rwklx,

@poettering
Copy link
Member

@enr0n but apparmor can only deny stuff normally allowed, but not allow stuff normally denied.

So, I don't understand how unbloocking binfmt_misc would do any good, the mounting should have been forbidden anyway...

@stgraber
Copy link
Contributor

Just providing a bit of background here for why it's done this way in LXD.

binfmt_misc is indeed bind-mounted by the container manager from the host system into the container.
Until binfmt_misc namespacing is a thing, we don't actually want our containers messing with this, so apparmor is used to prevent actual access to it.

Now as for why this is done at all, it's because some other init systems (upstart on CentOS and older Ubuntu at least) expect to be able to mount binfmt_misc and fail quite spectacularly when they cannot do this (init stops, nothing starts up).
So this bind-mount and a few similar others are used to make those init systems happy by seeing it's already mounted and moving on.

@poettering
Copy link
Member

hmm, so not sure how do address this best then. ideally we wouldn't trigger the binfmt_misc mount via a ConditionReadWritePath= check (since as mentioned this is a bit icky, since on the host the thing is mounted via autofs implemented by pid 1 itself, so we'd trigger the autofs we ourselves maintain — which won't cause a deadlock or so because there's a circuit breaker, but it's still ugly). Hence I'd prefer the ConditionPathIsMountPoint= check, which should cover things nicely – except for the LXD case where the thing is mounted but not usable.

Sniff.

@yuwata
Copy link
Member Author

yuwata commented Dec 14, 2022

Thank you for the explanation.

But, why not to mount it read-only, but using apparmor? At least, until the namespace support for binfmt comes to the kernel, mounting with read-only makes things simpler, and then make systemd-binfmt.service become happy.

@poettering
Copy link
Member

hmm, so with @yuwata's patch systemd-binfmt actually does the read-only check internally as well. So maybe the approach is actually to use ConditionIsMountPoint= here, which will cover all non-LXD usecases very nicely. And then LXD is covered via @yuwata's patch which then does an explicit read-only check early-on.

This should make things clean for everyone. The only downside would be that in LXD the systemd-binfmt tool gets invoked but then immediately exits, while for other cases we wouldn't even start that.

That sounds like an OK compromise, no?

src/basic/stat-util.c Outdated Show resolved Hide resolved
src/binfmt/binfmt.c Outdated Show resolved Hide resolved
@poettering poettering added reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks and removed please-review PR is ready for (re-)review by a maintainer labels Dec 14, 2022
@enr0n
Copy link
Contributor

enr0n commented Dec 14, 2022

hmm, so with @yuwata's patch systemd-binfmt actually does the read-only check internally as well. So maybe the approach is actually to use ConditionIsMountPoint= here, which will cover all non-LXD usecases very nicely. And then LXD is covered via @yuwata's patch which then does an explicit read-only check early-on.

This should make things clean for everyone. The only downside would be that in LXD the systemd-binfmt tool gets invoked but then immediately exits, while for other cases we wouldn't even start that.

That sounds like an OK compromise, no?

For the LXD case, I think systemd-binfmt needs to handle -EACCES when trying to read or write /proc/sys/fs/binfmt_misc, not -EROFS. Otherwise yeah I think that's a good compromise.

Then, reimplement path_is_read_only_fs() by the function to avoid race.
No functional changes, just refactoring and preparation for later
commits.
@github-actions github-actions bot added please-review PR is ready for (re-)review by a maintainer and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Dec 14, 2022
@yuwata
Copy link
Member Author

yuwata commented Dec 14, 2022

@poettering and @enr0n Thank you for the suggestions. All comments are addressed. PTAL.

@poettering poettering added good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed and removed please-review PR is ready for (re-)review by a maintainer labels Dec 14, 2022
@yuwata yuwata merged commit 1af1c95 into systemd:main Dec 15, 2022
@yuwata yuwata deleted the binfmt branch December 15, 2022 03:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binfmt good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed units
Development

Successfully merging this pull request may close these issues.

None yet

5 participants