[WIP] Mount storms and sd-event rate-limiting #16537
Conversation
Force-pushed from 0e28c1e to 165651a
Ok, so I'm still getting used to the GitHub review procedure so I'll just comment here.

It looks like the change here restricts the time delay to increments of 1 second. If rate limiting is active on the source used to handle mount table updates, and the changes (that may eventually get merged) in #16671 are in use, then that will almost certainly cause message buffer overflow on (probably almost) every event following the timeout in the rapid mounting situation. And that will cause a fallback to the load-all-mount-table mechanism every time.

But equally, in the rapid mounting situation I also need to delay, just a little, and just once, to let the message buffer fill with more requests so I can grab a second batch of messages. I haven't actually tested the messages per event that I see in the rapid mounting case, and I'm about to do that, but I observed that waiting for 20 milliseconds just once, checking for another message buffer, and processing it further reduced CPU overhead from 15-26% to between 5-10% (for the 20000 mount test case I used). But even more than that, risking message buffer overflow like this really needs to empty the message buffer first, before waiting and immediately trying again. So I'd probably need to do this within the event handler immediately after clearing the message buffer if I see some percentage fill of messages indicating high mount/umount activity.

Looking at this change I didn't quite get where and how the rate limit gets set ... Assuming I can use this, the granularity of the time delay should be at most a millisecond.
I've done some testing with the patches here. Using just these two patches I didn't see much improvement running my simple 20000 mount test. I also tried using this with the watch notifications I'm working on. It seems that the overhead of the rate limit triggering is high, so I really need to use a short interval (I used 5 but 10 seemed the same IIRC, it's probably overshadowed by the rate limit overhead) to get this to function. I also needed to use a short burst value (5) to get it to change responsively enough. I think the problem is with the rate limit system overhead, or perhaps it's not quite the right tool for this task. It's really easy for me to overflow the message buffer and lose messages. What I saw was single message buffers for a while (too long really), then one or two large fill buffers, then more singles. But it did work sufficiently well for my case, albeit with somewhat more overhead than a single short sleep and a single retry only.
So I've done a little more testing with this. The patches I use include the mount notifications and fsinfo() changes as well as the changes here, updated to use milliseconds for the interval. My test is the usual 20000 autofs direct mounts. When I test booting into a kernel without the mount notifications and fsinfo() changes, to ensure my changes work in that case, I can also test the existing mount table processing. Using a 1000ms interval (the original in the patches here) I really don't see any difference at all, which is puzzling. @msekletar do you have any thoughts as to why that might be the case? It's clear that the rate limit can be slow to kick in but I don't think that can account for these findings.
Thank you for testing. The meaning of the two parameters should be the same as in other cases where we employ rate-limiting in systemd. Hence you configure the interval in seconds and the maximum number of events that can occur during that interval. If you stay below the maximum you are not ratelimited; once you go above, the next event will be dispatched right after the interval elapses.
One thing that I have seen is that disabling the source for some time and then turning it back on creates a ping-pong effect that ends up using a bunch of effort switching back and forth, handling an excessive number of events before switching again, instead of dealing with the excessive rate of wake-ups. It occurs to me that changing what this does a little might be more effective. Basically, when rate limited, disable the source and call the source event handler on the timer wakeup instead, then switch back to polling after some time to prevent over-servicing via the timer. The difficulty then is that you don't know when it's no longer rate limited, so there would need to be two intervals: one to define the interval after which to wake up and switch back, and another to define the timer wake-up interval at which to call the handler. That could be used to control the ping-pong effect, and the interval to switch back could be somewhat longer. If you think that's worth exploring I should be able to make some changes and perform some experiments.
RateLimit r;
sd_event_source *timeout_source;
bool effective;
} EventSourceRateLimit;
any reason why you made this a struct of its own? why not just merge this into the sd_event_source struct? it's only used there...
I am assuming that the ratelimiting feature isn't going to be used that often, so I don't want to waste the space unnecessarily in the client code. Having these three variables in a standalone struct made it easier to allocate the struct only in case ratelimiting is requested.
I think it would be simpler to fold this into the main event source structure. the needed space is 4 64bit values instead of 1, it's not too horrible. It feels like premature optimization to me to make this indirect.

We usually turn off ratelimiting by setting the interval or burst fields to zero. ratelimit_below() recognizes that explicitly. I'd stick to that same logic here for marking disabled ratelimiting.
Force-pushed from 165651a to e4758c9
@poettering I've updated the PR. I think I've addressed everything except things that aren't marked as resolved. PTAL.
This pull request introduces 1 alert when merging e4758c9 into eea63c0 - view on LGTM.com new alerts:
This latest revision introduces 3 failures in the selftests run during the Fedora RPM build.
Force-pushed from e4758c9 to 378e9db
Force-pushed from 378e9db to 3961eb1
@raven-au You are right. The argument value needs to be updated to account for the latest changes. Thanks!
Force-pushed from 3961eb1 to e2f079a
I still think the approach isn't quite right. Simply disabling the source for a bit has problems. That leads to a limited reduction (negligible from what I see, including with this latest push) in the overhead of handling the notification storm, and appears to cause a sort of ping/pong effect that negates the benefits of disabling the source as is done now.

That's why I suggested a slightly different approach above: one where the source is switched to a timer for some interval, which continues to fire every so often during the interval, then switches back to the poll wake-up, thereby avoiding the rapid-fire wakeups we get from the kernel. I think the ability of systemd to handle events fairly quickly is why we don't see multiple change events being merged.

In the past (when working on autofs) I have seen that waits of a second or more can cause visibly poor performance and interfere with things waiting on the processes that are delayed. I found that a delay of 100-200 milliseconds tends to result in much reduced CPU usage while still maintaining quite good response for things waiting on the delayed processing.

If we were to implement this along the lines I suggested then the interval could be much longer, say 1-4 or 5 seconds, with a timer wakeup of say 200 milliseconds that calls the proc mountinfo handler, then switch back to the poll event handler after the interval has elapsed and wait for the ratelimit to trigger again (perhaps). I expect that would work well for the proc mountinfo case, particularly because the handler doesn't depend on input from the poll event itself. Maybe you have particular reasons for not doing it that way, perhaps because it might limit the flexibility of the mechanism ... ?
This shouldn't happen. I mean, if the rate-limiting doesn't work in this specific instance that might imply we have broken rate-limiting in other places too, since this patch-set doesn't implement new rate-limiting primitives but reuses the implementation that we already use elsewhere. Could you please share more details about your test case?
Well, my test case may be somewhat artificial. I use an autofs direct mount map with 20000 entries, so at autofs startup there will be 20000 mounts performed and at autofs shutdown they will be umounted. So I suppose there will be a large number of notifications before the ratelimit kicks in because the rate is so high.
I should also add that with an autofs direct mount map of 5000 entries I see similar behaviour but for a much shorter duration, and direct mount maps of 5000-10000 entries are the sort of size that heavy autofs users would use. Although using direct mount maps isn't that common because of the side effects. I'm working to eliminate the overhead from autofs itself (where I can) but that isn't enough, because the side effects of large mount tables and periodic high rates of notifications affect several other important packages, obviously systemd for a start.
Hi Ian (@raven-au), I think I got a bit stuck here. I'll try to move forward by being very specific. First let me share my test case for this PR that I've been using so far. I will also share my test results.

```shell
#!/bin/bash

mkdir -p /test/mount

# Run tracing for a little bit longer than the reproducer to maximize the
# chance that the tracing script catches all events, i.e. also those at the
# very end.
bpftrace -e 'uprobe:/usr/lib/systemd/systemd:mount_load_proc_self_mountinfo / pid == 1 / { @[probe] = count(); } interval:s:12 { exit(); }' >bpftrace.log &

# Monitor CPU usage of systemd (PID 1).
pidstat -p 1 12 1 >pidstat.log &

timeout 10s /bin/bash -c 'while true ; do mount -t tmpfs tmpfs /test/mount && umount /test/mount ; done'

wait
```

Since we rate limit to dispatch at most 5 events per second, the assumption is that the probe will be hit at most 50 times in those ~10 seconds. Here are the results from the test VM running Fedora 33,
Just for reference, here is the same test but with the distribution-provided RPM,
Clearly, there is a huge increase in CPU cycles and number of calls with unpatched systemd. Note that the test was run on a system with an otherwise empty mount table (just the rootfs mount and the usual API filesystems mounted). When I ran the test with a mount table that had roughly 20 000 entries (autofs waiting to be triggered) I observed something strange: CPU utilisation was high in both cases, and @karelzak thinks that this code is no longer needed because the associated kernel bug is fixed. Btw, we loop in libmount because while reading a big mountinfo we need to invoke multiple
Hi. snip ...
Umm ... ok that should be around 200 ms between reads, I think that's a good amount. snip ...
LOL, my test case is pretty much designed to trigger that little hiccup. OTOH this is about dealing with high notification rates, which implies mountinfo changes, so I think we have a bit of an additional problem to deal with.
Yeah, but the fix went in fairly recently so it probably needs to stay in libmount ... Ian
You know, I think this isn't quite the right thing to do in libmount. Using poll() to check if the mount table has changed after a read is always going to be chasing an updated mount table under a sustained high rate of notifications. The mount table being out of date after reading it is always possible and frequently likely, so I think it's necessary to not worry too much about that and instead ensure we get the whole table in one read, to avoid the mount table lock acquire/release-between-reads kernel bug. I wonder if Karel would consider doing that instead? Ian
Oh great, the kernel seq file implementation is restricted to a maximum buffer size of page size (4k). |
The problem is (was) that the kernel skips entries in the mount table if you read mountinfo with multiple read() syscalls while someone else is umounting. For more details see

We're not able to detect it in userspace, but we can detect that the mount table has been modified between the read()s and we can re-read it. This is the idea of the libmount workaround. Yes, it's expensive and ugly.

The upstream kernel mountinfo read() bugfix is available since Linux 5.8 (it's already in Fedora and it will probably be backported to CentOS/RHEL-8.4). The libmount workaround is available in util-linux v2.35 (a year ago) and v2.36. I do not want this ugly read()+poll() stuff in util-linux v2.37 (release expected at the end of this year), and I'm going to remove this from Fedora rawhide this week to make things more usable for @msekletar.

The affected users with an old kernel probably also use old util-linux. From my point of view, it's good enough to add a warning to the ReleaseNotes rather than keep this expensive workaround in libmount for people who mix old kernels and new utils. Note, the issue has existed for a decade, we only started to care about it last year ;-)
This workaround has been introduced by http://github.com/karelzak/util-linux/commit/e4925f591c1bfb83719418b56b952830d15b77eb And originally requested by systemd/systemd#10872 It seems we do not need it anymore as the problem should be fixed in kernel since 5.8 (kernel commit 9f6c61f96f2d97cbb5f7fa85607bc398f843ff0f). Note that the libmount solution is very expensive as it repeats read() many times (until we get consistent result) if kernel is busy with mount table modification. This behaviour makes events management in systemd (or other places) pretty difficult as read mountinfo takes time on busy systems. Addresses: systemd/systemd#16537 Signed-off-by: Karel Zak <kzak@redhat.com>
Note, I have tested the original reproducer (tmpfs version), and it works as expected with kernel 5.8 and libmount without the workaround.
@msekletar and @raven-au, here is the Fedora util-linux package without the libmount workaround
Yeah, I thought you knew I was aware of the kernel patch, ;)
Yes, Mikos is trying to get this into RHEL now but we have had a small problem.
Also, yes, there is no choice, there's no other way, the kernel seq file implementation has made sure of that.
Mostly people didn't even know about it, and I didn't know about the seq file locking problem myself until you brought it up, ;) Ian
@@ -160,6 +160,9 @@ int sd_event_source_set_destroy_callback(sd_event_source *s, sd_event_destroy_t
int sd_event_source_get_destroy_callback(sd_event_source *s, sd_event_destroy_t *ret);
int sd_event_source_get_floating(sd_event_source *s);
int sd_event_source_set_floating(sd_event_source *s, int b);
int sd_event_source_set_ratelimit(sd_event_source *s, uint64_t interval_usec, uint64_t burst);
int sd_event_source_get_ratelimit(sd_event_source *s, uint64_t *interval_usec, uint64_t *burst);
doesn't matter, but we usually prefix return args with _ret
r = source_disable_ratelimit(s);
if (r < 0)
        return r;
hmm, i'd prefer if this function would be atomic in behaviour, i.e. fail and make no changes, or succeed and make the requested changes. Right now, if the new0() below fails we'll have disabled the ratelimit, but not enabled the new one.

i.e. allocate everything into temporary variables, then install them once you have acquired them all, and don't change state before that has completed.
static int ratelimit_timeout_handler(sd_event_source *s, uint64_t usec, void *userdata) {
        int r;

        sd_event_source *source = (sd_event_source *) userdata;
no need to cast, void* casts to everything implicitly.
static int ratelimit_timeout_handler(sd_event_source *s, uint64_t usec, void *userdata) {
        int r;

spurious newline
if (s->prepare)
        prioq_reshuffle(s->event->prepare, s, &s->prepare_index);
s->enabled = m;
r = m == SD_EVENT_OFF ? source_disable(s) : source_enable(s, m);
as elsewhere: this is not atomic: now ->enabled is updated before source_{disable/enable}() has a chance to succeed. Functions like this should either fail and leave things untouched, or succeed and then update state, but not leave things half-initialized if they fail in the middle.
Actually I did this on purpose, I wanted to be mindful about not changing behaviour unnecessarily. Previously we updated the state also before doing other actions.
s->enabled = m;
r = m == SD_EVENT_OFF ? source_disable(s) : source_enable(s, m);
if (r < 0)
        return r;
so if the ratelimit is effective and someone calls sd_event_source_set_enabled(), then this will possibly low-level enable the event source again even though it is still ratelimited...

I am pretty sure we should enable the event source on the low level only if both the user-controlled enabled state says so and the ratelimit is not in effect.

It would be great if the source_enable()/source_disable() split could be done in a separate, preparatory commit; it's easier to follow then.
if (r < 0)
        return r;

r = sd_event_source_set_priority(s->ratelimit->timeout_source, SD_EVENT_PRIORITY_IMPORTANT);
i still think this is wrong: this should have the same prio as the original event source, not be higher. If some event source is set to some really high prio, work on it should not be deferred just because the user installed a ton of event sources with ratelimits that run out.
(oh, and if the main event source changes prio — because somebody calls sd_event_source_set_priority() on it — the associated rate limit timer event source should change prio too. this is what i mean by "recursive" btw, which i don't like, i.e. all the props from the main event source always have to be propagated to the rate limit timer event source. I'd much rather have it that the primary event source acts as a timer sometimes and as an IO event source at other times, depending on whether the ratelimit is on or not)
@@ -1865,6 +1865,10 @@ static void mount_enumerate(Manager *m) {
        }

        (void) sd_event_source_set_description(m->mount_event_source, "mount-monitor-dispatch");

        r = sd_event_source_set_ratelimit(m->mount_event_source, 1000000, 5);
instead of 1000000 write USEC_PER_SEC to make this easier to read.

also maybe write a comment next to it: /* put a ratelimit on this event source: trigger only 5 times per 1s */
I've done some more testing with the ratelimit change. Initially, with my 20000 mount autofs start/stop test I saw some reduced CPU usage but it rapidly grew to be similar to the case without ratelimiting. To understand what was happening I performed some different tests. Here are the results without the ratelimit change:
And the results with the ratelimit change:
These results are quite interesting, although not that surprising when you think about it. In short, they show that the ratelimit is quite effective, but as the size of the mount table grows the number of notifications systemd can process drops and the ratelimit becomes less effective. That makes perfect sense, but it does show quite convincingly that an improved kernel mount table handling API is still very much needed. So, while this change looks worthwhile regardless, we don't really know what the environment is for the OpenShift/container high CPU usage cases we've been seeing, so we don't know how effective this will be in resolving them. Ian
What is the current rate-limit? Maybe we can be more aggressive; I don't think you need to update the in-systemd mount table more often than 2x per second. What about exporting this setting to a systemd config file? I can imagine that in some extreme cases users could modify this setting to reduce overhead. @raven-au it would be nice to re-test it with a different setting and compare results.
        return 1;
}

_public_ int sd_event_source_set_ratelimit(sd_event_source *s, uint64_t interval_usec, uint64_t burst) {
so this should be supported for all kinds of event sources, except for timer event sources btw, because i can't see how that would make sense there. i.e. return EINVAL if this is invoked for a timer event source
Note, for 20000 nodes:
Not so bad.
hmm, in your current code turning on the ratelimit means two event sources are created, and I think besides the priority the description string should be propagated too in some form (i.e. suffixed). Also, what happens if the ratelimit is on, and is currently in effect, and I then disable the event source? something needs to happen to the timer event source in this case.

This kind of recursive propagation between the user-facing event source and the internal auxiliary timer event source is something i really don't like about the current approach. I think there should be only one event source, and it should sometimes behave like the event source it is, but if the ratelimit is in effect it should just behave like a timer event source, and then when the ratelimit is over, behave again like it was supposed to behave originally. i.e. an event source with the rate limit in effect should follow most codepaths of a timer event source, and without it being in effect it should follow the codepaths it normally does.
IIUC, with the current hard-coded settings, this mechanism turns off the event source for 1 second if the rate of notifications is greater than 5 per second. That's already pretty aggressive; turning off the source for 1 second every time the rate limit is triggered could easily lead to sluggish mount unit responsiveness. But I'll wait to see what @msekletar thinks about the settings.
That sounds like what I thought would be a better approach (as described above early on), IIUC what you're saying. The other aspect of this approach is that if we ever get a suitable fsinfo() style mount information access method, and new mount notifications similar to what was posted recently, then this mechanism would lend itself well to taking advantage of the notifications queueing mechanism, so that notifications could be processed in batches of up to 256 at a time (256 was the libmount default, 512 the kernel maximum). Together with mount info lookup improvements that would really make a big difference. Unfortunately we only have the ratelimit mechanism available to us for improvement atm. Anyway, that's what I was thinking ...
@kzak I don't think this needs to be exposed in the configuration, because the option is very low-level and of no use to the vast majority of users. We can always add it if this proves to be a problem in the future. As for the current ratelimit settings (max. 5 events per second), I think it is a reasonable compromise. As @raven-au mentioned, going more aggressive (e.g. max 2 events per second) could be perceived negatively by some users. I'd also say that the current results are very promising; the added overhead is basically non-existent for the normal case (no or very few extra mount points) and even for the case of 2500 extra mounts the overhead is OK (~10%) in my opinion. I'd think that any admin doing this is prepared for some resources to be consumed. Even with 5000 mounts it is not that bad, systemd is on CPU just 15% of the time due to this, i.e. it would still be very responsive and would handle other requests in the meantime should they arrive.
@poettering I think I understand conceptually what you want me to do, but I don't know how to go about implementing it (how to change one event source type into the other and back at runtime, assuming that is what you meant). Maybe in the interest of saving everyone's time it would be better if you could take whatever I have and rework it as you see fit. I'd be more than happy. Otherwise, I fear I'd have to drop the ball on this, or maybe @raven-au can finish the job as he seems to understand what you have in mind.
Originally I was wondering if you were concerned about the ratelimit mechanism becoming less flexible if that change was made. But that's not the case with the mount information io handler; it's little more than a callback function that doesn't assert the poll() parameter the way the existing one does. Then locate the io userdata so it can be passed to the handler, and away you go. I was thinking that a callback could be added to the existing setup: if NULL it would behave like it does now, and if not it could switch the source type to a timer event and make the callback at the defined interval.
Well, maybe, but it seems to me that your understanding of the systemd event handling code is far better than mine; ideally we could work on it together?
I do think this is an important aspect of the implementation.
Yes, I agree, without eliminating the overhead of having to get the entire mount table there's not a lot more that can be done.
@poettering this sounds like a good recommendation, but I was wondering if you have some thoughts on how this would be implemented. The various source types hold source-specific information in a union. Obviously, adding a timer event source to the source struct is one way, but that's still the recursive bit that you'd like to keep away from.
Thinking about it, this could be done without compromising consistency by defining structures for each event type and using them in the union, giving them the same names as used in the union. Then, moving the time event out of the union, it can be accessed using the same syntax and name without the need to change existing code. So that would maintain consistency.
Closed in favor of #17274. I've changed the implementation quite a bit so I don't think it makes sense to continue here.
This is my stab at #15464. @poettering @keszybz @neilbrown PTAL.
Obviously this is missing docs; I will work on that once we have the implementation details fleshed out.