A non-privileged user could (easily) DOS systemd by exceeding BUS_WQUEUE_MAX #13674
|
@poettering Bump?
|
We have to put limits on everything we allocate. Note that the queue lengths are just hard limits for dynamic allocation; the actual queues are always allocated dynamically, only slightly larger than what we need at the specific moment. Thus: these aren't limits you are supposed to ever hit, and if you do anyway, then other bad things have happened much earlier. They are safety nets, nothing more. They have been bumped a couple of times, most recently in 83a32ea. You are running a very old version, so maybe just upgrade to something newer? (Or ask your distro to backport the bump.)

To summarize this in simpler terms: if you manage to hit the limit IRL, then the limit needs to be bumped further for everyone, but not removed entirely, because removing it means run-away client code can consume resources without bounds, and we should never allow that. Failing at some point with a clear error is much better than failing too late, when errors can't be reported anymore.

Anyway, let's close this for now, as the limit was bumped only recently and your version is 4 years old. Let's presume this is fixed. If you run into the same issue with a current version of systemd, please report back and we can bump the limit further.
|
@poettering Is it just me, or is Red Hat / CentOS not really keeping current with systemd back-porting? I am using CentOS 7.8, which is reasonably current (for enterprise) and ships with systemd-219-67.el7_7.2.x86_64, but below are the values we have set for this issue (many orders of magnitude less than current). In addition, we also have the systemd-coredump bug where, if we hit E2BIG due to ProcessMax, it fails to drop privileges when writing to the journal, so that non-root users get false negatives when running coredumpctl on dumps that they own but that were too big to store (they should still have access to the journal entries, but don't due to this bug).
|
I would guess there is little motivation to backport changes unless a paying customer requests it.
|
Aw nuts. So I guess it's either deal with it, patch it, or nag Red Hat. :(
|
/cc @msekletar @lnykryn something to backport to rhel?
|
Yeah, looks reasonable. @gleventhal @floppym Could you please file a bug in our Bugzilla next time? It would speed things up a lot.
|
@lnykryn Yes, I will go through Bugzilla next time, thanks!
|
@poettering I understand there has to be a ceiling, but could we get this ceiling user-configurable, e.g. through a sysctl, rather than/in addition to baking something into the code? We'd find it very useful to be able to fix this issue ourselves without having to roll a new version of systemd.
|
@poettering I've been investigating further to try to understand how mounting a couple of thousand NFS mounts can trigger 1.3 million+ dbus messages and the subsequent buffer-space issues. It turns out (news to me) that systemd watches for changes to /proc/self/mountinfo, and whenever the set of mounts changes, the entire mountinfo table is re-communicated from systemd to systemd-logind via dbus messages. When you add 2,000 mount-points, the entire table is sent again for each mount-point that is added (4 property-change messages per mount), and of course the table gets incrementally larger each time. I counted over 1.3 million dbus messages via a stap probe as a result. Under normal circumstances systemd-logind manages to keep the queue drained, but sometimes, if it is under pressure from other system events, it cannot keep up and systemd chokes. There is an exception in mount_setup_unit() whereby an fstype of "autofs" will not produce these messages, but no such exception exists for mount-points added by /usr/sbin/automount, where the fstype is "nfs".

Firstly, I don't really understand why systemd needs to watch /proc/self/mountinfo and set up mount units for any mount that appears. It would be nice to read an explanation of the benefits somewhere, and perhaps to have an option to disable this behaviour for users who don't need it. If it is enabled, improving the logic so that the entire mount table is not re-communicated for each new mount would be a useful optimisation.

Secondly, I have been speaking with Ian Kent to try to determine whether something could be done in /usr/sbin/automount to make systemd ignore the mount-points it manages, but at the moment he doesn't know of any way to do that. In fact, he admits he was previously unaware that systemd was setting up mount units at all. Can we get some sort of change into systemd that would allow the automount daemon to communicate that it is responsible for these mount-points and that systemd should ignore them?
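To make the scaling problem concrete, here is a back-of-envelope model (my own sketch, not systemd code) of the message count when the whole table is re-sent on every change. The constant of 4 property-change messages per entry matches the stap observation above but is otherwise an assumption; real counts come out lower when poll notifications coalesce.

```python
# Hypothetical model: the entire mount table is re-sent for every new
# mount, with MSGS_PER_ENTRY messages per table entry per change.
MSGS_PER_ENTRY = 4  # assumed from the 4 property-change messages seen via stap

def total_messages(new_mounts, initial_mounts=0):
    """Messages emitted while adding new_mounts entries one at a time,
    re-sending the entire table (initial + added so far) on each change."""
    total = 0
    for k in range(1, new_mounts + 1):
        total += MSGS_PER_ENTRY * (initial_mounts + k)
    return total

# Growth is quadratic: doubling the mounts roughly quadruples the messages.
print(total_messages(1000))  # 2002000
print(total_messages(2000))  # 8004000
```

The 1.3 million messages measured for 2,000 mounts sits below this upper bound, which is consistent with some change notifications being coalesced into a single re-read.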
|
@jsmarkb which systemd version are you using? We track mounts as "mount units" so that people can have dependencies on them. This is at the core of what systemd does: starting up in the right order, i.e. making sure that everything X needs is done before X is started, and that prominently includes mounts. If you don't want this behaviour, you don't really want systemd. systemd-logind doesn't track mounts, though; that's PID 1 only.

We generally generate bus messages for mounts coming, going, and changing state. If a mount doesn't change state, we should not generate events for it; if we do anyway, it would be a bug. So yes, if you start n mount points you should get O(n) messages.
|
@poettering I'm using 219, same as @gleventhal. If a mount is added and removed dynamically by the in-kernel Linux automounter, then systemd won't be able to build dependencies on it, so the mount unit that is subsequently created serves no purpose as far as I can see. "If you don't want this behaviour you don't want systemd really" — not much choice these days :-) I've watched systemd send dbus messages to systemd-logind for each and every new mount. I was puzzled as to why systemd-logind was involved, but it definitely was. Right, events ought not to be generated for mounts that don't change state, but that's not what I'm witnessing here.
|
#10268 partially addresses this, and was merged for 240. |
|
Wasn't that backed out by ec8126d7? I don't see it in the master branch anymore.
|
@poettering - I'm not sure this issue ought to be closed; bumping the limit only raises the ceiling. If #10268 is a partial fix then, as I pointed out 8 days ago, that commit was reverted in December 2018. So effectively we're left with a closed issue and no strategic fix. I would suggest the solution is two-fold:
What do you think? Can we reopen this issue and start prioritising the possible solutions?
|
I've now written a blog post that describes how this issue was discovered and why auto-mounting thousands of directories in quick succession can result in millions of dbus messages. The whole reason I blogged about it was that I didn't think the problem was fully understood or fully resolved when this issue was closed. @poettering - we still have no long-term fix for this issue. Would you like me to open a new issue, or can this one be re-opened?
|
Maybe the solution is to have systemd maintain a snapshot of /proc/self/mountinfo and, upon change notification, compare the current version with the snapshot, then send dbus messages only for the changed items.
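A minimal sketch of that snapshot-and-diff idea (my own illustration, not systemd's implementation): key each record by its mount ID, keep the previous table, and report only additions, removals, and modifications.

```python
def parse_mountinfo(text):
    """Map mount ID (the first field of each /proc/self/mountinfo line)
    to the full record, so records can be compared across snapshots."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            table[fields[0]] = line
    return table

def diff_mounts(old, new):
    """Return (added, removed, changed) record lists between two snapshots."""
    added   = [new[i] for i in new if i not in old]
    removed = [old[i] for i in old if i not in new]
    changed = [new[i] for i in new if i in old and old[i] != new[i]]
    return added, removed, changed
```

With something like this, only the added/removed/changed lists would need to become dbus messages, making the per-mount cost independent of the total table size.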
|
@poettering - bump?
That's already what we do.
|
See #10872 for further discussion about making /proc/self/mountinfo change handling more efficient.
Do you mean via the changes to fix #4902 that are in PR 5087?
Interesting. How do I get this functionality, and which PR introduced the fix? I'd like to test it. The version of systemd I have here does not send dbus messages only for changed items; it re-sends dbus messages for every item whenever a change is made, which is why we were exceeding BUS_WQUEUE_MAX in the first place, as I've been explaining in this thread since 29th November.
@jsmarkb, did you get an answer to this?
|
@award-54 unfortunately not. @poettering please can you confirm your assertion by pointing to a PR? As far as I'm aware this is still an unresolved issue with systemd, but I would be extremely happy to be proved wrong.
|
@jsmarkb it always worked that way. The kernel API requires us to do that: we get a poll event (POLLERR/POLLPRI) from poll() on /proc/self/mountinfo each time it changes. You are then supposed to re-read the file and compare it with the previous state, which is what we always did.
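That poll-and-reread contract can be sketched standalone like this (my own minimal illustration, not systemd's event loop; the kernel surfaces a mount-table change as an exceptional condition on the open file):

```python
import select

def watch_mountinfo(on_change):
    """Watch /proc/self/mountinfo and call on_change(new_contents)
    whenever the mount table changes. The kernel flags a change as an
    exceptional poll condition (POLLPRI/POLLERR); the watcher must then
    re-read the whole file and compare it with the previous snapshot."""
    with open("/proc/self/mountinfo") as f:
        poller = select.poll()
        poller.register(f.fileno(), select.POLLPRI | select.POLLERR)
        previous = f.read()
        while True:
            poller.poll()           # blocks until the mount table changes
            f.seek(0)
            current = f.read()      # re-read and compare with previous state
            if current != previous:
                on_change(current)
                previous = current
```

Note that the notification carries no detail about *what* changed; computing the delta (and hence avoiding re-announcing unchanged mounts) is entirely the watcher's job.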
|
@poettering Are you sure? Was there a bug in the implementation? According to my SystemTap findings, systemd sent over 1.3 million messages to systemd-logind over dbus when 2,000 automounted NFS directories were mounted in quick succession. Analysing those dbus messages revealed that, for each new mount processed, 4 dbus messages were sent for every existing mount as well as for the new one. So it might be comparing previous state, but it is sending everything every time, not just the changes. This is why we were hitting the limit, and it is this behaviour for which I am seeking a fix.
|
Can you provide a reproducer on current systemd upstream (i.e. v245 or v244)? Otherwise, for older versions, if you are sure there's a bug, please contact your downstream distro.
|
I can reproduce the problem easily on RHEL 8.2 (systemd v239) by mounting lots of empty directories using the NFS automounter from another RHEL 8.2 server, using a simple shell loop like this:

Using the following stap script, I clocked 1,993,043 messages exchanged between systemd and systemd-logind through dbus when I mounted 1,000 empty directories via the automounter:

This was on a single-core VM with a vanilla install of RHEL 8.2, set up specifically for this testing, with 814 MB of RAM. 10 mounts took 0.328 seconds, 100 mounts took 3.926 seconds, and 1,000 mounts took 207.433 seconds (see the table at the bottom of this post). I was not running stap during the timing tests, for obvious reasons.

When I tried 2,000 mounts with stap running to count the dbus messages, the host crashed with out-of-memory errors. When I tried 2,000 mounts without stap, just to time it, the host crashed and rebooted after about 5 hours 20 minutes with:

I didn't see any out-of-memory errors in the logs afterwards, and I couldn't SSH into the box during this time (I was getting 'connection reset'), so I imagine there was a lot of paging activity going on. I did see that dbus-daemon was consuming 27.2% of memory and 42.2% of CPU early in the run, while my other terminal was still responding. So we are still able to DoS a RHEL 8 system from a non-privileged account by creating an NFS mount storm.

I cannot reproduce the same problem on Fedora 32 (systemd v245). The dbus messages seem different, more condensed, so I think something changed between v239 and v245. I'd love to know exactly what, but I don't know where to start digging in the code base, because @poettering your comment 'it always worked that way' doesn't seem to square with these test results. I'm seeing every single mount-point on RHEL 8 re-communicated every time there is a change, but I am not seeing that behaviour on Fedora 32.

I did notice that while Fedora 32 sends considerably fewer dbus messages than RHEL 8, it did seem to be sending them quite slowly. However, I haven't taken proper timings to back this up, as it seemed a tangent I wasn't prepared to follow at the time. The full timing results from my RHEL 8.2 test are below:
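The shell loop referenced in the comment above wasn't captured here, but a reproduction in the same spirit can be sketched as follows (paths and counts are hypothetical; this assumes an autofs map serving directories under the chosen root):

```python
import os

def touch_mounts(root, n):
    """Access n directories under an autofs-managed root in quick
    succession; the first access to each path triggers the actual NFS
    mount. Returns how many directories were readable."""
    ok = 0
    for i in range(n):
        try:
            os.listdir(os.path.join(root, "dir%d" % i))
            ok += 1
        except OSError:
            pass  # path absent or mount failed; keep going regardless
    return ok

# Hypothetical invocation against an automounted export:
# touch_mounts("/net/nfsserver/export", 1000)
```

The key property of the reproducer is only that the mounts arrive in rapid succession, so that each one lands while the mount table is still being re-processed for its predecessors.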
|
I repeated the autofs+systemd tests today on a dual-core VM with 4 GB RAM, comparing Fedora 32 (v245) with RHEL 8.2 (v239). My results are in the tables below. While conducting the tests, I noticed that on Fedora 32 systemd would peak at 85% CPU (according to top), while on RHEL 8.2 systemd peaked at 55% CPU, with dbus-daemon at up to 70% CPU and 5.7% MEM.

I find it interesting that Fedora 32 sends far fewer messages, but its average messages per second is also considerably lower. I wonder what changed between v239 and v245 that can explain this?

I suppose this is now a question of how much CPU time, memory, and wall-clock time can be considered "reasonable" for enumerating a given number of these autofs mounts. For 3,000 mounts, 5 minutes (Fedora 32) and 10 minutes (RHEL 8.2) seems unreasonable to me. Also, looking at my previous comment above (RHEL 8.2), I'm not sure it's obvious that enumerating 2,000 mounts on a host with limited RAM will crash it.
|
Maybe the systemd code I'm looking at is oldish, or maybe I just haven't dug into the functions that are called for each (and every) mountinfo record, I'm not sure. But systemd does call into these functions for every entry.

The sad fact is that systemd isn't alone, and that's because for many, many years mount table handling has been based on text files, and more recently on proc file system listings generated on the fly. Basically, user space applications have been forced to use outdated means to deal with these tables, and that isn't going to change quickly or easily, starting with the difficulty of getting kernel changes merged to provide a more modern API.

Having had to deal with mount tables myself for many years in autofs, I know that trying to keep an up-to-date copy of mount entries is "not" as simple as it sounds and will most likely lead to a seemingly unending stream of bugs, basically a time burner for little or no gain. The only real solution "is" an improved kernel API, and we (David Howells, Karel Zak, et al.) are trying hard to get what we think is a much better API into the kernel. But Linus wants to see actual user space development to show that this implementation (which is large, as it caters for more than just mount handling) will be used.

Karel Zak and I have been working on this, although I'm probably the only one looking at systemd, and I'm about to return to working on that again. So far we have mostly added the ability to use the new fsinfo() system call to libmount (used by systemd), and I'm about to start work on the notifications side of the proposed change. So I'm going to need someone in the systemd development community to advise and help me with this implementation. systemd appears to use its own notification-handling sub-system, which means these changes can't be restricted to libmount; that means changes to systemd proper.

Backporting these changes to distributions is a completely different task, but before that's even considered we need an established solution upstream. So let's do that.
|
If systemd were truly communicating only the presence of a new mount-point to systemd-logind over dbus when it arrives (and not re-sending every mount-point on the system), then how do we explain systemd sending 8 million dbus messages when enumerating only 3,000 mount-points on RHEL 8.2? How do we explain 314 seconds to enumerate 3,000 mount-points on Fedora 32? I agree that the kernel needs a better API, but even with the data we have available today I cannot understand these numbers.
|
Maybe it's not so surprising. Assume one dbus message is sent per mount-table entry and that, for some reason, a message is sent whether the entries read from the mount table are modified or not. If it's further assumed that there will sometimes be two mount-table reads per mount, one due to the poll on the proc mount table and one due to the SIGCHLD at mount(8) completion, then ...

Somewhere between 4.5 and 9 million messages would be seen as the mount table grows to 3,000 entries (not counting what was originally present). That doesn't take into account the entries already present in the mount table, but that probably wouldn't contribute a great deal, since it's a constant additive term of number_of_new_mounts * initial_number_of_mounts, umm, not sure I'm correct on that ... Perhaps there's something going wrong with the coalescing of poll notifications; perhaps the time it takes to complete a mount is long enough for systemd to receive individual poll notifications a fair bit of the time. It's probably worth considering how to avoid sending messages when a mount hasn't been modified.
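That estimate checks out numerically. A quick sanity check of the reasoning above, where the 1-vs-2 re-reads per mount are the two bounding assumptions:

```python
def messages(n_mounts, rereads_per_mount):
    """One message per mount-table entry per re-read, with the table
    growing from 1 to n_mounts entries as mounts are added."""
    return rereads_per_mount * sum(range(1, n_mounts + 1))

low  = messages(3000, 1)   # poll notification only
high = messages(3000, 2)   # poll plus SIGCHLD-triggered re-read
print(low, high)           # 4501500 9003000
```

The observed ~8 million messages for 3,000 mounts on RHEL 8.2 falls inside this 4.5-9 million band, which supports the whole-table-per-change model.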
|
Actually I think it's unlikely SIGCHLD handling comes into this.
|
@mrobson Thanks, I did try that patch and noticed that, with it, I could consistently and reliably hit That said, despite reaching
systemd version the issue has been seen with
This is the version that ships with CentOS Linux release 7.7.1908 (Core)
Used distribution
Expected behaviour you didn't see
Unexpected behaviour you saw
Steps to reproduce the problem
All I have to do is walk the user directories on our NFS, which uses automount. Each mounted directory adds a node to:
After running:
We see these errors after running this find command. busctl times out during this period:
I see that the WQUEUE and RQUEUE limits have been increased due to similar complaints, but I am skeptical about any solution that doesn't use something more dynamic.
This is particularly dangerous since a non-root user can prevent root crontabs from running and can prevent systemd from executing a command.