
A non-privileged user could (easily) DOS systemd by exceeding BUS_WQUEUE_MAX #13674

Closed
gleventhal opened this issue Sep 27, 2019 · 35 comments

@gleventhal commented Sep 27, 2019

systemd version the issue has been seen with

...systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN

This is the version that ships with CentOS Linux release 7.7.1908 (Core)

Used distribution

…CentOS Linux release 7.7.1908 (Core)

Expected behaviour you didn't see

… systemctl restart to run successfully, cron to start a session and run successfully.

Unexpected behaviour you saw

…Cron failed with:
crond[31768]: pam_systemd(crond:session): Failed to create session: No buffer space available
systemd-logind failed with:
systemd-logind[2245]: Failed to start session scope session-2154.scope: No buffer space available

Steps to reproduce the problem
All I have to do is walk the user directories on our NFS, which uses automount. Each mounted directory adds a node to:

├─/org/freedesktop/systemd1/unit/nfs_mount_dir_etc...

After running:

$ find /nfs/users -maxdepth 2 -type d -exec ls {} \;

we see the errors shown above, and busctl times out during this period:

$ busctl tree org.freedesktop.systemd1 | grep systemd1/unit/nfs
Failed to introspect object /org/freedesktop/systemd1/unit/nfs_2emount of service org.freedesktop.systemd1: Connection timed out

I see that the Wqueue and RQueue statically allocated buffers have been increased due to similar complaints, but I am skeptical about any solution that doesn't use something more dynamic.

This is particularly dangerous since a non-root user can prevent root crontabs from running and can prevent systemd from executing a command.

@gleventhal gleventhal changed the title A non-privileged user could DOS systemd by exhausting the WQueue A non-privileged user could (easily) DOS systemd by exceeding BUS_WQUEUE_MAX Sep 27, 2019
@gleventhal (Author) commented Oct 22, 2019

@poettering Bump?

@poettering (Member) commented Nov 3, 2019

We have to put limits on everything we allocate. Note that the queue lengths are just hard limits for dynamic allocation, and the actual queues are always allocated dynamically only slightly larger than what we need at the specific moment.

Thus: these aren't limits you are supposed to ever hit, and if you do anyway, then other bad things have happened much earlier. They are safety nets, nothing more.

They have been bumped a couple of times, most recently in 83a32ea.

You are running a very old version, maybe just upgrade to something newer? (or ask your distro to backport the bumping).

Or to summarize this in simpler terms: if you manage to hit the limit IRL, then the limit needs to be bumped further for everyone, but not removed entirely, because that means run-away client code can consume resources without bounds, and we should never allow that. Failing at some point with a clear error is much better than failing too late when errors can't be reported anymore.

Anyway, let's close this for now, as this was recently bumped and your version is ~4 years old. Let's presume this is fixed.

If you run into the same issue with a current version of systemd, please report back and we can bump the limit further.
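To illustrate the safety-net behaviour described above, here is a minimal sketch (not sd-bus's actual implementation; names and the small limit are hypothetical) of a queue that grows dynamically but rejects new entries with ENOBUFS once the hard ceiling is reached:

```python
import errno

# Hypothetical cap mirroring the role of BUS_WQUEUE_MAX; sd-bus's real
# queue handling in C differs, but the failure mode is the same.
WQUEUE_MAX = 1024

class WriteQueue:
    """Allocated only as large as currently needed, with a hard ceiling."""
    def __init__(self, limit=WQUEUE_MAX):
        self.limit = limit
        self.items = []

    def enqueue(self, msg):
        if len(self.items) >= self.limit:
            # The safety net: fail loudly with ENOBUFS ("No buffer space
            # available") instead of letting runaway client code consume
            # memory without bound.
            return -errno.ENOBUFS
        self.items.append(msg)
        return 0

# With a tiny limit of 4, the first 4 enqueues succeed and the rest fail.
q = WriteQueue(limit=4)
results = [q.enqueue(f"msg-{i}") for i in range(6)]
```

This is the trade-off described above: failing at a defined point with a clear error, rather than failing later when errors can no longer be reported.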

@gleventhal (Author) commented Nov 6, 2019

@poettering Is it just me, or is Red Hat / CentOS not really keeping very current with systemd back-porting? I am using CentOS 7.8, which is ~reasonably current (for enterprise) and ships with systemd-219-67.el7_7.2.x86_64, but below are the values we have set for this issue (orders of magnitude less than current):

#define BUS_WQUEUE_MAX 1024
#define BUS_RQUEUE_MAX 64*1024

In addition, we also have the systemd-coredump bug: if a dump hits E2BIG due to ProcessSizeMax, systemd-coredump fails to drop privileges when writing to the journal, so non-root users get false-negative results when running coredumpctl on dumps that they own but that were too big to store (they should still have access to the journal entries, but don't due to this bug).

@floppym (Contributor) commented Nov 6, 2019

I would guess there is little motivation to backport changes unless a paying customer requests it.

@gleventhal (Author) commented Nov 7, 2019

Aw nuts. So I guess it's either deal with it, patch it, or nag Red Hat. :(

@poettering (Member) commented Nov 8, 2019

/cc @msekletar @lnykryn something to backport to rhel?

@lnykryn (Member) commented Nov 8, 2019

Yeah, looks reasonable
https://bugzilla.redhat.com/show_bug.cgi?id=1770158

@gleventhal @floppym Could you please next time file a bug in our bugzilla? It would speed things up a lot.

@gleventhal (Author) commented Nov 8, 2019

@lnykryn Yes, I will go through Bugzilla next time, thanks!

@jsmarkb commented Nov 14, 2019

@poettering I understand there has to be a ceiling, but could we get this ceiling user-configurable, e.g. through a sysctl, rather than/in addition to baking something into the code? We'd find it very useful to be able to fix this issue ourselves without having to roll a new version of systemd.

@jsmarkb commented Nov 27, 2019

@poettering I've been investigating further to try to understand how mounting a couple of thousand NFS mounts can trigger 1.3 million+ dbus messages and subsequent buffer space issues.

It turns out (news to me) that systemd is watching for changes to /proc/self/mountinfo and whenever there is a change in mounts, the entire mountinfo table is re-communicated from systemd to systemd-logind over dbus messages.

When you add 2,000 mount-points, for each mount-point that is added the entire table is sent again (4 property change messages per mount), and of course the table gets incrementally larger each time. I counted over 1.3 million dbus messages via a stap probe as a result.

Under normal circumstances, systemd-logind manages to keep the queue drained, but sometimes - if it is under pressure from other system events - it cannot keep up and systemd chokes. There is an exception in mount_setup_unit() whereby fstype of "autofs" will not produce these messages, but no such exception for mount-points added by /usr/sbin/automount where fstype is "nfs".

Firstly, I don't really understand why systemd needs to watch /proc/self/mountinfo and set up mount units for any mount that appears. It would be nice to read an explanation of the benefits somewhere, and maybe to have an option to disable this behaviour if the user doesn't need it. If it is enabled, improving the logic so that the entire mount table is not re-communicated for each new mount would be a useful optimisation.

Secondly, I was speaking with Ian Kent to try to determine if there's something that could be done in /usr/sbin/automount to make systemd ignore mount-points managed by it, but at the moment he doesn't know of any way to do that. In fact he admits he was previously unaware that systemd was setting up mount units for itself. Can we get some sort of change into systemd that would allow the automount daemon to communicate that it is responsible for the mount-points and that systemd should ignore them?
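The growth pattern described above can be sketched numerically. Assuming, per the observation above, 4 property-change messages per table entry on each pass, rebroadcasting the whole table for every new mount gives a quadratic total (this is an upper bound; poll-event coalescing reduces the observed count):

```python
# Sketch of the rebroadcast pattern described above: every new mount
# causes k messages for *every* entry currently in the table, and the
# table grows by one each time, so totals grow quadratically.
def messages_for_mount_storm(n_mounts, msgs_per_entry=4):
    total = 0
    table_size = 0
    for _ in range(n_mounts):
        table_size += 1                       # one new mount appears
        total += msgs_per_entry * table_size  # whole table re-sent
    return total

# 2,000 mounts -> 4 * 2000*2001/2 = 8,004,000 messages as an upper
# bound; the 1.3M+ observed is consistent once poll notifications for
# several mounts coalesce into a single reread of the table.
print(messages_for_mount_storm(2000))  # 8004000
```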

@poettering (Member) commented Nov 27, 2019

@jsmarkb which systemd version are you using?

we track mounts as "mount units", so that people can have deps on them. it's kinda at the core of what systemd does: starting up in the right order, i.e. making sure that everything X needs is done before X is started, and that prominently includes mounts. If you don't want this behaviour you don't want systemd really.

systemd-logind doesn't track mounts though, it's PID 1 only.

we generally generate bus messages for mounts coming and going and changing state. If a mount doesn't change state we should not generate events for it. If we do anyway, it would be a bug. So yes, if you start n mount points you should get O(n) messages.

@jsmarkb commented Nov 27, 2019

@poettering I'm using 219 same as @gleventhal.

If a mount is added and removed dynamically by the inlined Linux automounter, then systemd won't be able to build dependencies on it, so the subsequent mount unit that is created serves no purpose as far as I can see?

"If you don't want this behaviour you don't want systemd really". Not really much choice these days :-)

I've watched systemd send dbus messages to systemd-logind for each and every new mount. I was puzzled as to why systemd-logind was involved, but it definitely was.

Right, events ought not be generated for mounts that don't change state, but that's not what I'm witnessing here.

@fsateler (Member) commented Dec 29, 2019

#10268 partially addresses this, and was merged for 240.

@jsmarkb commented Jan 21, 2020

Wasn't that backed out by ec8126d7? I don't see it in the master branch anymore?

@jsmarkb commented Jan 29, 2020

@poettering - I'm not sure this issue ought to be closed.

If the fix was increasing BUS_WQUEUE_MAX, that doesn't change anything if the summary of this bug is "by exceeding BUS_WQUEUE_MAX". Whatever you set BUS_WQUEUE_MAX to you will always be able to trigger the problem by exceeding it. At best, it kicks the can down the road, at worst it leaves customers with bigger environments at risk of hitting the exact same issue.

If #10268 is a partial fix, as I pointed out 8 days ago that commit was reverted in December 2018. So effectively we're left with a closed issue and no strategic fix.

I would suggest the solution is two-fold:

  1. Make BUS_WQUEUE_MAX user-configurable. This would, at least, provide customers some sort of immediate solution if they hit the problem.
  2. Reduce the number of dbus messages that systemd produces during a mount storm by optimising the data paths. This could manifest itself in a number of different patches, such as:
    • A re-implemented version of ec8126d7 that doesn't break system boot. Personally I'm unconvinced by the approach taken in this patch. Incidentally, I wonder how much of a negative impact current systemd behaviour with mount-points has on system boot time.
    • Fixing the current /proc/self/mountinfo scraper so that it remembers what information it has already communicated and only sends dbus messages for brand new mounts. This would significantly reduce the load, dropping the 1.3 million dbus messages I witnessed down to a much more reasonable number.
    • Working with kernel folk to get notifications going when filesystems are mounted and unmounted, after which the /proc/self/mountinfo scraper can be retired.

What do you think? Can we reopen this issue and start prioritising the possible solutions?

@jsmarkb commented Feb 26, 2020

I've written a blog now that describes how this issue was discovered and why auto-mounting thousands of directories in quick succession can result in millions of dbus messages being sent between systemd and systemd-logind, resulting in the buffer reaching BUS_WQUEUE_MAX. The blog contains plenty of data collected via SystemTap to prove the point. You can read all about it in Troubleshooting systemd with SystemTap.

The whole reason I blogged about it was because I didn't think that the problem was fully understood or fully resolved when this issue was closed.

@poettering - we still have no long-term fix for this issue. Would you like me to open a new issue or can this one be re-opened?

@Vadiml1024 commented Apr 30, 2020

Maybe the solution is to have systemd maintain a snapshot of /proc/self/mountinfo and, upon change notification, compare the current version with the snapshot and then send dbus messages only for the changed items.

@jsmarkb commented Apr 30, 2020

@poettering - bump?

@poettering (Member) commented Apr 30, 2020

Maybe the solution is to have systemd maintain a snapshot of /proc/self/mountinfo and, upon change notification, compare the current version with the snapshot and then send dbus messages only for the changed items.

that's already what we do.

@poettering (Member) commented Apr 30, 2020

See #10872 for further discussions about making /proc/self/mountinfo change handle more efficient.

@mattkeenan commented Apr 30, 2020

Maybe the solution is to have systemd maintain a snapshot of /proc/self/mountinfo and, upon change notification, compare the current version with the snapshot and then send dbus messages only for the changed items.

that's already what we do.

Do you mean the changes to fix #4902 that are in PR 5087?

@jsmarkb commented May 1, 2020

Maybe the solution is to have systemd maintain a snapshot of /proc/self/mountinfo and, upon change notification, compare the current version with the snapshot and then send dbus messages only for the changed items.

that's already what we do.

Interesting, how do I get this functionality, which PR introduced the fix? I'd like to test it. The version of systemd I have here does not send dbus messages for only changed items, it sends dbus messages for every item again whenever a change is made, which is why we were exceeding BUS_WQUEUE_MAX in the first place, as I've been explaining in this thread since 29th November.

@award-54 commented Jun 1, 2020

Interesting, how do I get this functionality, which PR introduced the fix? I'd like to test it. The version of systemd I have here does not send dbus messages for only changed items, it sends dbus messages for every item again whenever a change is made, which is why we were exceeding BUS_WQUEUE_MAX in the first place, as I've been explaining in this thread since 29th November.

@jsmarkb , did you get an answer to this?

@jsmarkb commented Jun 2, 2020

@award-54 unfortunately not. @poettering please can you confirm your assertion by providing a PR? As far as I'm aware this is still an unresolved issue with systemd, but I would be extremely happy to be proved wrong.

@poettering (Member) commented Jun 2, 2020

@jsmarkb it always worked that way. The kernel API requires us to do that: we get a POLLHUP event from poll() on /proc/self/mountinfo each time it changes. You are then supposed to reread the file and compare with previous state. Which is what we always did.
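The reread-and-compare approach described above can be sketched as follows (a simplified illustration, not systemd's actual C implementation; helper names are hypothetical and the mountinfo parsing is heavily abbreviated). The point is that only entries that were added, removed, or changed between snapshots should generate events:

```python
# Minimal sketch of "reread the file and compare with previous state".
# Real /proc/self/mountinfo lines carry more fields (see proc(5));
# field 0 is the mount ID and field 4 is the mount point.
def parse_mountinfo(text):
    table = {}
    for line in text.strip().splitlines():
        fields = line.split()
        mount_id, mount_point = fields[0], fields[4]
        table[mount_point] = (mount_id, line)
    return table

def diff_snapshots(old, new):
    added   = [mp for mp in new if mp not in old]
    removed = [mp for mp in old if mp not in new]
    changed = [mp for mp in new if mp in old and new[mp] != old[mp]]
    return added, removed, changed

old = parse_mountinfo("36 25 0:31 / /mnt/a rw - nfs srv:/a rw\n")
new = parse_mountinfo("36 25 0:31 / /mnt/a rw - nfs srv:/a rw\n"
                      "37 25 0:32 / /mnt/b rw - nfs srv:/b rw\n")
added, removed, changed = diff_snapshots(old, new)
# With a correct diff, only /mnt/b produces an event here.
```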

@jsmarkb commented Jun 2, 2020

@poettering Are you sure? Was there a bug in the implementation? According to my SystemTap findings, systemd sent over 1.3 million messages to systemd-logind over dbus when 2,000 automounted NFS directories were mounted in quick succession. Analysing those dbus messages revealed that for each new mount it processed, 4 dbus messages were sent for each existing mount as well as the new one. So it might be comparing previous state, but it is sending everything every time, not just the changes. This is why we were hitting the limit and it is this behaviour for which I am seeking a fix.

@poettering (Member) commented Jun 2, 2020

Can you provide a reproducer on current systemd upstream (i.e. v245 or v244)?

Otherwise, for older versions, if you are sure there's a bug, please contact your downstream distro.

@jsmarkb commented Jun 16, 2020

I can reproduce the problem easily on RHEL 8.2 (systemd v239) by mounting lots of empty directories using the NFS automounter from another RHEL 8.2 server using a simple shell loop like this:

# j=2000; time (i=0; while [ $i -lt $j ]; do ls /mnt/$i; ((i+=1)); done)

Using the following stap script I clocked 1,993,043 messages exchanged between systemd and systemd-logind through dbus when I mounted 1,000 empty directories via the automounter:

global write_count

probe process("/usr/lib/systemd/systemd").library("/usr/lib/systemd/libsystemd-shared-239.so").function("bus_socket_write_message")
{
    write_count[$bus->output_fd]++
}

probe timer.ms($1)
{
    foreach (fd in write_count)
        printf("systemd/bus_socket_write_message()       sent %6d messages -> fd %d\n", write_count[fd], fd)

    delete write_count
}

This was on a single-core VM with a vanilla install of RHEL 8.2, set up specifically for this testing, with 814MB of RAM, where 10 mounts took 0.328 seconds, 100 mounts took 3.926 seconds and 1,000 mounts took 207.433 seconds (see table at the bottom of this post). I was not running stap during the timing tests, for obvious reasons. When I tried 2,000 mounts with stap to count the number of dbus messages, the host crashed with out-of-memory errors. When I tried 2,000 mounts without stap, just to time it, the host crashed and rebooted after about 5 hours 20 minutes with:

kernel: Out of memory: Killed process 738 (systemd-logind) total-vm:327272kB, anon-rss:233504kB, file-rss:0kB, shmem-rss:0kB, UID:0

I didn't see any out of memory errors in the logs afterwards and I couldn't SSH into the box during this time (I was getting 'connection reset') so I imagine there was a lot of paging activity going on. I did see that dbus-daemon was consuming 27.2% of memory and 42.2% of CPU early on in the run while my other terminal was still responding. So we are still able to cause DoS on a RHEL 8 system from a non-privileged account by creating an NFS mount storm.

I cannot reproduce the same problem on Fedora 32 (systemd v245). The dbus messages seem to be different, more condensed, so I think something has changed between v239 and v245. I'd love to know exactly what changed, but I don't know where to start digging around the code base, because @poettering your comment 'it always worked that way' doesn't seem to ring true with the test results? I'm seeing every single mount-point on RHEL 8 re-communicated every time there is a change, but not seeing that behaviour on Fedora 32. I did notice that while Fedora 32 sends considerably fewer dbus messages than RHEL 8, it did seem to be sending them quite slowly. However, I haven't taken any proper timings to back this up as the issue seemed to be a tangent and not one I was prepared at that time to follow.

The full timing results from my RHEL 8.2 test are below:

  # of     time          # of
mounts    (secs)     messages
    10     0.328          270
    20     0.555          940
    30     0.781         2010
    40     1.13          3480
    50     1.538         5350
    60     1.9           7620
    70     2.498        10290
    80     2.853        13360
    90     3.355        16830
   100     3.926        20700
   200    11.61         81400
   300    23.444       182100
   400    38.081       322800
   500    56.518       503500
   600    78.301       724200
   700   104.076       984900
   800   134.282      1285600
   900   169.159      1620102
  1000   207.433      1993043
  2000   <crash>      <crash>

@jsmarkb commented Jun 18, 2020

I repeated the autofs+systemd tests today on a dual core VM with 4GB RAM, comparing Fedora 32 (v245) with RHEL 8.2 (v239). My results are in the tables below. When conducting the tests, I noticed on Fedora 32 that systemd would peak at 85% CPU (according to top), while on RHEL 8.2 systemd peaked at 55% CPU and dbus-daemon up to 70% CPU and 5.7% MEM. I find it interesting that the number of messages that systemd sends on Fedora 32 is much less but the average messages per second is also considerably less.

FEDORA 32

  # of     time    # of dbus   average msg
mounts    (secs)    messages       per sec
    10    0.644           68           106
    20    1.249          130           104
    30    1.872          184            98
    40    2.493          266           107
    50    3.155          286            91
    60    3.711          392           106
    70    4.445          432            97
    80    5.034          548           109
    90    5.636          622           110
   100    6.203          630           102
   200   15.591         1304            84
   300   20.224         2026           100
   400   27.938         2656            95
   500   33.768         3284            97
   600   41.830         3834            92
   700   49.766         4346            87
   800   58.314         5024            86
   900   68.016         5636            83
  1000   78.123         6220            80
  2000  187.784        12230            65
  3000  314.772        16724            53

RHEL 8.2

  # of     time    # of dbus   average msg
mounts    (secs)    messages       per sec
    10     0.444         140           315
    20     0.809         480           593
    30     1.228        1020           831
    40     1.499        1760          1174
    50     1.840        2700          1467
    60     2.215        3840          1734
    70     2.875        5180          1802
    80     3.101        6720          2167
    90     3.635        8616          2370
   100     4.019       10400          2588
   200     9.450       40800          4317
   300    14.272       91710          6426
   400    21.987      162388          7386
   500    28.749      252009          8766
   600    36.854      369550         10027
   700    46.043      497136         10797
   800    56.393      645432         11445
   900    68.743      843130         12265
  1000    82.178     1036786         12616
  2000   290.638     4219919         14520
  3000   635.100     8029527         12643

I wonder what changed between v239 and v245 that can explain this?

I suppose this is now a question of how much CPU time, memory, and wall-clock time can be perceived as "reasonable" for enumerating a certain number of these autofs mounts. For 3,000 mounts, 5 minutes (Fedora 32) and 10 minutes (RHEL 8.2) seems unreasonable to me.

Also, looking at my previous comment above (RHEL 8.2), I'm not sure it's obvious that enumerating 2,000 mounts on a host with limited RAM will cause it to crash.
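A back-of-envelope check on the two tables above supports this: going from 100 to 1,000 mounts (a 10x increase), the Fedora 32 (v245) message count scales roughly linearly, while the RHEL 8.2 (v239) count scales roughly quadratically.

```python
# Message counts taken from the tables above.
fedora = {100: 630, 1000: 6220}       # Fedora 32, systemd v245
rhel   = {100: 10400, 1000: 1036786}  # RHEL 8.2, systemd v239

fedora_ratio = fedora[1000] / fedora[100]  # ~9.9  -> consistent with O(n)
rhel_ratio   = rhel[1000] / rhel[100]      # ~99.7 -> consistent with O(n^2)
print(round(fedora_ratio, 1), round(rhel_ratio, 1))  # 9.9 99.7
```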

@raven-au commented Jun 19, 2020

Maybe the systemd code I'm looking at is oldish, or maybe I just haven't dug into the functions that are called for each (and every) mountinfo record, not sure.

But systemd does call into these functions for every entry.

The sad fact is systemd isn't alone and that's because for many, many years mount table handling has been based on text files and more recently on proc file system lists generated on the fly.

Basically, user space applications have been forced to use outdated means to deal with these tables and that isn't going to change quickly or easily, starting from the difficulty of getting kernel changes merged to provide a more modern API.

Having had to deal with mount tables myself for many years in autofs I know that trying to keep an up to date copy of mount entries is "not" as simple as it sounds and will most likely lead to a seemingly unending stream of bugs, basically a time burner for little or no gain.

The only real solution "is" an improved kernel API and we (David Howells, Karel Zak, et al.) are trying hard to get what we think is a much better API into the kernel.

But Linus wants to see actual user space development to show that this (large; it caters for other areas than just mount handling) implementation will be used.

Karel Zak and I have been working on this, although I'm probably the only one looking at systemd, and I'm about to return to working on that again.

So far we have mostly added the ability to use the new fsinfo() system call to libmount (used by systemd) and I'm about to start work on the new notifications side of the proposed change.

So I'm going to need someone in the systemd development community to advise and help me with this implementation. systemd appears to use its own notification-handling sub-system, which means these changes can't be restricted to libmount, so that means changes to systemd proper.

Backporting these changes to distributions is a completely different task but before that's even considered we need an established solution upstream.

So let's do that.
Anyone care to volunteer to help me with the systemd side of this?

@jsmarkb commented Jun 19, 2020

If systemd were truly only communicating the presence of a new mount-point over dbus to systemd-logind when it arrives (and not every single mount-point on the system again), then how do we explain away systemd sending 8 million dbus messages when enumerating only 3000 mount-points on RHEL 8.2? How do we explain 314 seconds to enumerate 3000 mount-points on Fedora 32?

I agree that the kernel needs a better API, but even given the data we have available today I cannot understand these numbers.

@raven-au commented Jul 6, 2020

Maybe it's not so surprising.

Assume one dbus message is sent per mount and that, for some reason, a message is sent whether or not the mounts read from the mount table were modified. If it's assumed that sometimes there will be two mount table reads, one due to the poll of the proc mount table and one due to SIGCHLD on mount(8) completion, then ...

mounts = 3000, messages_per_mount = 1
sum of i*messages_per_mount = 4501500 for i = 1..3000 mounts
sum of 2*i*messages_per_mount = 9003000 for i = 1..3000 mounts

So somewhere between 4.5 and 9 million messages would be seen as the number of entries in the mount table increases to 3000 (not counting what was originally present).

That doesn't take into account there would be some entries already present in the mount table but that probably wouldn't contribute a great deal since it's a constant additive of number_of_new_mounts * initial_number_of_mounts, umm not sure I'm correct on that ...

Perhaps there's something going wrong with coalescing of poll notifications, perhaps the time it takes to complete a mount is long enough for systemd to receive individual poll notifications a fair bit of the time.

It's probably worth considering trying to work out how to not send messages if a mount hasn't been modified.
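The sums above check out numerically:

```python
# Checking the arithmetic above directly: cumulative totals as the
# mount table grows from 1 to 3000 entries, with 1 message per entry
# per read, for one read and for two reads per mount event.
messages_per_mount = 1
one_read  = sum(i * messages_per_mount for i in range(1, 3001))
two_reads = sum(2 * i * messages_per_mount for i in range(1, 3001))
print(one_read, two_reads)  # 4501500 9003000
```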

@raven-au commented Jul 7, 2020

Actually I think it's unlikely SIGCHLD handling comes into this.
But I remember reading there are 2 messages per mount so the calculation is the same.

@mrobson commented Jul 15, 2020

@jsmarkb There was a change here to reduce PropertiesChanged messages on mounts: #15233, but that only looks to be in 246-rc right now.

@jsmarkb commented Sep 29, 2020

@mrobson Thanks, I did try that patch and noticed that, with it, I could consistently and reliably hit ENOBUFS from bus_send_internal while instrumenting with SystemTap and when enumerating 2,000 autofs-controlled mount-points, even though we have patched BUS_RQUEUE_MAX and BUS_WQUEUE_MAX to 393216. Without the patch, seeing an ENOBUFS from that function was far more hit-and-miss.

That said, despite reaching ENOBUFS I did not witness any user-visible side-effects. I was expecting to see, at the very least, some No buffer space available errors from either systemd or systemd-logind, as we had in the past; but not this time. More thorough testing is required, as I think the testing I performed was inconclusive, so we are not using this patch in production yet, and as such this issue is still outstanding for us.
