
A non-privileged user could (easily) DOS systemd by exceeding BUS_WQUEUE_MAX #13674

Closed

gleventhal opened this issue Sep 27, 2019 · 16 comments

@gleventhal commented Sep 27, 2019

systemd version the issue has been seen with

...systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN

This is the version that ships with CentOS Linux release 7.7.1908 (Core)

Used distribution

…CentOS Linux release 7.7.1908 (Core)

Expected behaviour you didn't see

… systemctl restart to run successfully, and cron to start a session and run its jobs successfully.

Unexpected behaviour you saw

…Cron failed with:
crond[31768]: pam_systemd(crond:session): Failed to create session: No buffer space available
systemd-logind failed with:
systemd-logind[2245]: Failed to start session scope session-2154.scope: No buffer space available

Steps to reproduce the problem

All I have to do is walk the user directories on our NFS, which uses automount. Each mounted directory adds a node to:

├─/org/freedesktop/systemd1/unit/nfs_mount_dir_etc...

After running:

$ find /nfs/users -maxdepth 2 -type d -exec ls {} \;

we see the errors shown above, and busctl times out during this period:

$ busctl tree org.freedesktop.systemd1 |grep systemd1/unit/nfs
Failed to introspect object /org/freedesktop/systemd1/unit/nfs_2emount of service org.freedesktop.systemd1: Connection timed out

I see that the BUS_WQUEUE_MAX and BUS_RQUEUE_MAX limits have been increased due to similar complaints, but I am skeptical about any solution that doesn't use something more dynamic.

This is particularly dangerous since a non-root user can prevent root crontabs from running and can prevent systemd from executing a command.

@gleventhal gleventhal changed the title A non-privileged user could DOS systemd by exhausting the WQueue A non-privileged user could (easily) DOS systemd by exceeding BUS_WQUEUE_MAX Sep 27, 2019
@gleventhal (author) commented Oct 22, 2019

@poettering Bump?

@poettering (member) commented Nov 3, 2019

We have to put limits on everything we allocate. Note that the queue lengths are just hard limits for dynamic allocation; the actual queues are allocated dynamically and are only ever slightly larger than what we need at any given moment.

Thus: these aren't limits you are supposed to ever hit, and if you do anyway, then other bad things have happened much earlier. They are safety nets, nothing more.
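
For illustration, a minimal sketch of that pattern (hypothetical names, not sd-bus's actual code): storage grows on demand, but enqueueing fails with a clear error once the hard cap is hit.

    /* Sketch only: a dynamically grown queue with a hard safety cap. */
    #include <errno.h>
    #include <stdlib.h>

    #define QUEUE_MAX 1024                          /* safety net, never preallocated */

    struct queue {
        void **items;
        size_t n_items;                             /* entries currently queued */
        size_t n_allocated;                         /* slots currently allocated */
    };

    static int queue_push(struct queue *q, void *item) {
        if (q->n_items >= QUEUE_MAX)
            return -ENOBUFS;                        /* "No buffer space available" */

        if (q->n_items >= q->n_allocated) {
            size_t n = q->n_allocated > 0 ? q->n_allocated * 2 : 8;
            void **p = realloc(q->items, n * sizeof(void *));
            if (!p)
                return -ENOMEM;
            q->items = p;
            q->n_allocated = n;
        }

        q->items[q->n_items++] = item;
        return 0;
    }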

They have been bumped a couple of times, most recently in 83a32ea.

You are running a very old version, maybe just upgrade to something newer? (Or ask your distro to backport the bump.)

Or to summarize this in simpler terms: if you manage to hit the limit IRL, then the limit needs to be bumped further for everyone, but not removed entirely, because removing it would mean run-away client code could consume resources without bounds, and we should never allow that. Failing at some point with a clear error is much better than failing too late, when errors can't be reported anymore.

Anyway, let's close this for now, as the limit was recently bumped and your version is four years old. Let's presume this is fixed.

If you run into the same issues with a current version of systemd, please report back and we can bump the limit further.

@gleventhal (author) commented Nov 6, 2019

@poettering Is it just me, or is Red Hat / CentOS not really keeping very current with systemd back-porting? I am using CentOS 7.8, which is reasonably current (for enterprise) and ships with systemd-219-67.el7_7.2.x86_64, but below are the values that build uses for these limits (orders of magnitude smaller than current upstream):

#define BUS_WQUEUE_MAX 1024
#define BUS_RQUEUE_MAX 64*1024

In addition, we also have the systemd-coredump bug where, if we hit E2BIG due to ProcessSizeMax, it fails to drop privileges when writing to the journal, so non-root users get false negatives when running coredumpctl on dumps that they own but that were too big to store (they should still have access to the journal entries, but don't because of this bug).

@floppym (contributor) commented Nov 6, 2019

I would guess there is little motivation to backport changes unless a paying customer requests it.

@gleventhal (author) commented Nov 7, 2019

Aw nuts. So I guess it's either deal with it, patch it, or nag Red Hat. :(

@poettering (member) commented Nov 8, 2019

/cc @msekletar @lnykryn something to backport to rhel?

@lnykryn (member) commented Nov 8, 2019

Yeah, looks reasonable
https://bugzilla.redhat.com/show_bug.cgi?id=1770158

@gleventhal @floppym Could you please file a bug in our Bugzilla next time? It would speed things up a lot.

@gleventhal (author) commented Nov 8, 2019

@lnykryn Yes, I will go through Bugzilla next time, thanks!

@jsmarkb commented Nov 14, 2019

@poettering I understand there has to be a ceiling, but could this ceiling be made user-configurable, e.g. through a sysctl, rather than (or in addition to) baking something into the code? We'd find it very useful to be able to fix this issue ourselves without having to roll a new version of systemd.
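
For illustration, a sketch of what such a user-configurable ceiling could look like (the environment variable name and helper are hypothetical; nothing like this exists in systemd today):

    /* Sketch only: let an environment variable override the compiled-in cap.
     * SYSTEMD_BUS_WQUEUE_MAX is a hypothetical name used purely to illustrate
     * the idea of a user-configurable ceiling. */
    #include <stdlib.h>

    #define BUS_WQUEUE_MAX_COMPILED 1024            /* the compiled-in cap (1024 in v219) */

    static size_t bus_wqueue_max(void) {
        const char *e = getenv("SYSTEMD_BUS_WQUEUE_MAX");
        if (e) {
            char *end = NULL;
            unsigned long long v = strtoull(e, &end, 10);
            if (end && *end == '\0' && v > 0)
                return (size_t) v;
        }
        return BUS_WQUEUE_MAX_COMPILED;             /* fall back to the built-in limit */
    }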

@jsmarkb commented Nov 27, 2019

@poettering I've been investigating further to try to understand how mounting a couple of thousand NFS mounts can trigger 1.3 million+ dbus messages and subsequent buffer space issues.

It turns out (news to me) that systemd is watching for changes to /proc/self/mountinfo and whenever there is a change in mounts, the entire mountinfo table is re-communicated from systemd to systemd-logind over dbus messages.

When you add 2,000 mount-points, for each mount-point that is added the entire table is sent again (4 property change messages per mount), and of course the table gets incrementally larger each time. I counted over 1.3 million dbus messages via a stap probe as a result.

Under normal circumstances, systemd-logind manages to keep the queue drained, but sometimes - if it is under pressure from other system events - it cannot keep up and systemd chokes. There is an exception in mount_setup_unit() whereby an fstype of "autofs" does not produce these messages, but there is no such exception for mount-points added by /usr/sbin/automount, where the fstype is "nfs".

Firstly, I don't really understand why systemd needs to watch /proc/self/mountinfo and set up mount units for any mount that appears. It would be nice to read an explanation of the benefits somewhere, and maybe to have an option to disable this behaviour if the user doesn't need it. If it is enabled, improving the logic so that the entire mount table is not re-communicated for each new mount would be a useful optimisation.
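
To illustrate the optimisation being suggested (hypothetical names, not systemd's actual code), the scraper could remember what has already been announced and send messages only for the delta:

    /* Sketch only: after re-parsing /proc/self/mountinfo, announce only the
     * entries that appeared or disappeared, not the entire table. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Stand-ins for the real bus notifications. */
    static void announce_added(const char *where)   { printf("mount added: %s\n", where); }
    static void announce_removed(const char *where) { printf("mount removed: %s\n", where); }

    struct mount_entry {
        const char *where;                          /* mount point path */
        bool announced;                             /* already communicated over the bus? */
        bool present;                               /* seen in the latest mountinfo pass */
    };

    static void emit_mount_delta(struct mount_entry *table, size_t n) {
        for (size_t i = 0; i < n; i++) {
            if (table[i].present && !table[i].announced) {
                announce_added(table[i].where);     /* genuinely new mount */
                table[i].announced = true;
            } else if (!table[i].present && table[i].announced) {
                announce_removed(table[i].where);   /* mount that went away */
                table[i].announced = false;
            }
            /* unchanged entries generate no bus traffic at all */
        }
    }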

Secondly, I was speaking with Ian Kent to try to determine if there's something that could be done in /usr/sbin/automount to make systemd ignore mount-points managed by it, but at the moment he doesn't know of any way to do that. In fact he admits he was previously unaware that systemd was setting up mount units for itself. Can we get some sort of change into systemd that would allow the automount daemon to communicate that it is responsible for the mount-points and that systemd should ignore them?

@poettering (member) commented Nov 27, 2019

@jsmarkb which systemd version are you using?

we track mounts as "mount units", so that people can have deps on them. it's kinda at the core of what systemd does: starting up in the right order, i.e. making sure that everything X needs is done before X is started, and that prominently includes mounts. If you don't want this behaviour you don't want systemd really.

systemd-logind doesn't track mounts though, it's PID 1 only.

We generally generate bus messages for mounts coming and going and changing state. If a mount doesn't change state we should not generate events for it; if we do anyway, that would be a bug. So yes, if you start n mount points you should get O(n) messages.

@jsmarkb commented Nov 27, 2019

@poettering I'm using 219 same as @gleventhal.

If a mount is added and removed dynamically by the inlined Linux automounter, then systemd won't be able to build dependencies on it, so the subsequent mount unit that is created serves no purpose as far as I can see?

"If you don't want this behaviour you don't want systemd really". Not really much choice these days :-)

I've watched systemd send dbus messages to systemd-logind for each and every new mount. I was puzzled as to why systemd-logind was involved, but it definitely was.

Right, events ought not be generated for mounts that don't change state, but that's not what I'm witnessing here.
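
As a back-of-envelope sketch of the gap between the two behaviours (assuming the 4 property-change messages per table entry mentioned above, and 2,000 mounts added one at a time):

    /* Rough bounds only: messages for n mounts added one at a time,
     * assuming 4 property-change messages per table entry. */
    #include <stdio.h>

    int main(void) {
        unsigned long long n = 2000, per_entry = 4;

        unsigned long long linear = per_entry * n;                 /* one set of messages per new mount */
        unsigned long long resend = per_entry * n * (n + 1) / 2;   /* whole table re-sent on every addition */

        printf("O(n) behaviour:           %llu messages\n", linear);   /* 8,000 */
        printf("full re-send per mount:   %llu messages\n", resend);   /* ~8 million */
        return 0;
    }

The 1.3 million messages counted with the stap probe sit well above the O(n) figure.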

@fsateler (member) commented Dec 29, 2019

#10268 partially addresses this, and was merged for 240.

@jsmarkb commented Jan 21, 2020

Wasn't that backed out by ec8126d7? I don't see it in the master branch anymore?

@jsmarkb commented Jan 29, 2020

@poettering - I'm not sure this issue ought to be closed.

If the fix was increasing BUS_WQUEUE_MAX, that doesn't change anything when the summary of this bug is "by exceeding BUS_WQUEUE_MAX": whatever you set BUS_WQUEUE_MAX to, you will always be able to trigger the problem by exceeding it. At best it kicks the can down the road; at worst it leaves customers with bigger environments at risk of hitting the exact same issue.

As for #10268 being a partial fix: as I pointed out 8 days ago, that commit was reverted in December 2018. So effectively we're left with a closed issue and no strategic fix.

I would suggest the solution is two-fold:

  1. Make BUS_WQUEUE_MAX user-configurable. This would, at least, provide customers some sort of immediate solution if they hit the problem.
  2. Reduce the number of dbus messages that systemd produces during a mount storm by optimising the data paths. This could manifest itself in a number of different patches, such as:
    • A re-implemented version of ec8126d7 that doesn't break system boot. Personally I'm unconvinced by the approach taken in this patch. Incidentally, I wonder how much of a negative impact current systemd behaviour with mount-points has on system boot time.
    • Fixing the current /proc/self/mountinfo scraper so that it remembers what information it has already communicated and only sends dbus messages for brand new mounts. This would significantly reduce the load, dropping the 1.3 million dbus messages I witnessed down to a much more reasonable number.
    • Working with kernel folk to get notifications going when filesystems are mounted and unmounted, after which the /proc/self/mountinfo scraper can be retired.

What do you think? Can we reopen this issue and start prioritising the possible solutions?

@jsmarkb commented Feb 26, 2020

I've now written a blog post that describes how this issue was discovered and why auto-mounting thousands of directories in quick succession can result in millions of dbus messages being sent between systemd and systemd-logind, causing the buffer to reach BUS_WQUEUE_MAX. The post contains plenty of data collected via SystemTap to prove the point. You can read all about it in Troubleshooting systemd with SystemTap.

The whole reason I blogged about it was because I didn't think that the problem was fully understood or fully resolved when this issue was closed.

@poettering - we still have no long-term fix for this issue. Would you like me to open a new issue or can this one be re-opened?
