[WIP] Mount storms and sd-event rate-limiting #16537
Conversation
Force-pushed from 0e28c1e to 165651a
Ok, so I'm still getting used to the GitHub review procedure so I'll just comment here.

It looks like the change here restricts the time delay to increments of 1 second. If rate limiting is active on the source used to handle mount table updates, and the changes (that may eventually get merged) in #16671 are in use, then that will almost certainly cause message buffer overflow on (probably almost) every event following the timeout in the rapid mounting situation. And that will cause a fallback to the load-all-mount-table mechanism every time.

But equally, in the rapid mounting situation I also need to delay, just a little, and just once, to let the message buffer fill with more requests so I can grab a second batch of messages. I haven't actually tested the messages per event that I see in the rapid mounting case, and I'm about to do that, but I observed that waiting for 20 milliseconds just once, checking for another message buffer, and processing it further reduced CPU overhead from 15-26% to between 5-10% (for the 20000 mount test case I used). But even more than that, risking message buffer overflow like this really needs to empty the message buffer first, before waiting and immediately trying again. So I'd probably need to do this within the event handler immediately after clearing the message buffer if I see some percentage fill of messages indicating high mount/umount activity.

Looking at this change I didn't quite get where and how the rate limit gets set ... Assuming I can use this, the granularity of the time delay should be at most a millisecond.
I've done some testing with the patches here. Using just these two patches I didn't see much improvement running my simple 20000 mount test. I also tried using this with the watch notifications I'm working on. It seems that the overhead of the rate limit triggering is high, so I really need to use a short interval (I used 5 but 10 seemed the same IIRC, it's probably overshadowed by the rate limit overhead) to get this to function. I also needed to use a short burst value (5) to get it to change responsively enough. I think the problem is with the rate limit system overhead, or perhaps it's not quite the right tool for this task. It's really easy for me to overflow the message buffer and lose messages. What I saw was single message buffers for a while (too long really), then one or two large fill buffers, then more singles. But it did work sufficiently well for my case, albeit with somewhat more overhead than a single short sleep and a single retry only.
So I've done a little more testing with this. The patches I use include the mount notifications and fsinfo() changes as well as the changes here, updated to use milliseconds for the interval. My test is the usual 20000 autofs direct mounts. When I test booting into a kernel without the mount notifications and fsinfo() changes, to ensure my changes work in that case, I can also test the existing mount table processing. Using a 1000ms interval (the original in the patches here) I really don't see any difference at all, which is puzzling. @msekletar do you have any thoughts as to why that might be the case? It's clear that the rate limit can be slow to kick in but I don't think that can account for these findings.
Thank you for testing. The meaning of the two parameters should be the same as in other cases where we employ rate-limiting in systemd. Hence you configure the interval in seconds and the maximum number of events that can occur during that interval. If you stay below the maximum you are not ratelimited; once you go above, the next event will be dispatched right after the interval elapses.
One thing that I have seen is that disabling the source for some time and then turning it back on creates a ping-pong effect that ends up using a bunch of effort switching back and forth, handling an excessive number of events before switching again, instead of dealing with the excessive rate of wake-ups. It occurs to me that changing what this does a little might be more effective. Basically, when rate limited, disable the source and call the source event handler on the timer wakeup instead, then switch back to polling after some time to prevent over-servicing via the timer. The difficulty then is that you don't know when it's no longer rate limited, so there would need to be two intervals: one to define the interval after which to wake up and switch back, and another to define the timer wake-up interval at which to call the handler. That could be used to control the ping-pong effect, and the interval to switch back could be somewhat longer. If you think that's worth exploring I should be able to make some changes and perform some experiments.
RateLimit r;
sd_event_source *timeout_source;
bool effective;
} EventSourceRateLimit;
any reason why you made this a struct of its own? why not just merge this into the sd_event_source struct? it's only used there...
I am assuming that the ratelimiting feature isn't going to be used that often, so I don't want to waste the space unnecessarily in the client code. Having these three variables in a standalone struct made it easier to allocate the struct only in case ratelimiting is requested.
I think it would be simpler to fold this into the main event source structure. the needed space is 4 64bit values instead of 1, it's not too horrible. It feels like premature optimization to me to make this indirect.

We usually turn off ratelimiting by setting the interval or burst fields to zero. ratelimit_below() recognizes that explicitly. I'd stick to that same logic here for marking disabled ratelimiting.
Force-pushed from 165651a to e4758c9
@poettering I've updated the PR. I think I've addressed everything except things that aren't marked as resolved. PTAL.
This pull request introduces 1 alert when merging e4758c9 into eea63c0 - view on LGTM.com new alerts:
This latest revision introduces 3 failures in the selftests run during the Fedora RPM build.
Force-pushed from e4758c9 to 378e9db
Force-pushed from 378e9db to 3961eb1
@raven-au You are right. The argument value needs to be updated to account for the latest changes. Thanks!
Force-pushed from 3961eb1 to e2f079a
I still think the approach isn't quite right. Simply disabling the source for a bit has problems. That leads to a limited reduction (negligible from what I see, including with this latest push) in the overhead of handling the notification storm, and appears to cause a sort of ping/pong effect that negates the benefits of disabling the source as is done now.

That's why I suggested a slightly different approach above: one where the source is switched to a timer for some interval, which continues to fire every so often during the interval, then switches back to the poll wake-up, thereby avoiding the rapid-fire wakeups we get from the kernel. I think the ability of systemd to handle events fairly quickly is why we don't see multiple change events being merged.

In the past (when working on autofs) I have seen that waits of a second or more can cause visibly poor performance and interfere with things waiting on the processes that are delayed. I found that a delay of 100-200 milliseconds tends to result in much reduced CPU usage while still maintaining quite good response for things waiting on the delayed processing.

If we were to implement this along the lines I suggested then the interval could be much longer, say 1-4 or 5 seconds, with a timer wakeup of say 200 milliseconds that calls the proc mountinfo handler, then switch back to the poll event handler after the interval has elapsed and wait for the ratelimit to trigger again (perhaps). I expect that would work well for the proc mountinfo case, particularly because the handler doesn't depend on input from the poll event itself. Maybe you have particular reasons for not doing it that way, perhaps because it might limit the flexibility of the mechanism ... ?
This shouldn't happen. I mean, if the rate-limiting doesn't work in this specific instance that might imply we have broken rate-limiting in other places too, since this patch-set doesn't implement new rate-limiting primitives but reuses the implementation that we already use elsewhere. Could you please share more details about your test case?
Well, my test case may be somewhat artificial. I use an autofs direct mount map with 20000 entries, so at autofs startup there will be 20000 mounts performed and at autofs shutdown they will be umounted. So I suppose there will be a large number of notifications before the ratelimit kicks in because the rate is so high.
I should also add that with an autofs direct mount map of 5000 entries I see similar behaviour but for a much shorter duration, and direct mount maps of 5000-10000 entries are the sort of size that heavy autofs users would use. Although using direct mount maps isn't that common because of the side effects. I'm working to eliminate the overhead from autofs itself (where I can) but that isn't enough, because the side effects of large mount tables and periodic high rates of notifications affect several other important packages, obviously systemd for a start.
Hi Ian (@raven-au), I think I got a bit stuck here. I'll try to move forward by being very specific. First let me share my test case for this PR that I've been using so far. I will also share my test results.

```shell
#!/bin/bash

mkdir -p /test/mount

# Run tracing for a little bit longer than the reproducer to maximize the
# chance that the tracing script catches all events, i.e. also those at the
# very end.
bpftrace -e 'uprobe:/usr/lib/systemd/systemd:mount_load_proc_self_mountinfo / pid == 1 / { @[probe] = count(); } interval:s:12 { exit(); }' >bpftrace.log &

# Monitor CPU usage of systemd (PID 1).
pidstat -p 1 12 1 >pidstat.log &

timeout 10s /bin/bash -c 'while true ; do mount -t tmpfs tmpfs /test/mount && umount /test/mount ; done'

wait
```

Since we rate limit to dispatch at most 5 events per second, the assumption is that the probe will be hit at most 50 times in those ~10 seconds. Here are the results from the test VM running Fedora 33,
Just for reference, here is the same test but with the distribution-provided RPM,
Clearly, there is a huge increase in CPU cycles and number of calls with unpatched systemd. Note that the test was run on a system with an otherwise empty mount table (just the rootfs mount and the usual API filesystems mounted). When I ran the test with a mount table that had roughly 20 000 entries (autofs waiting to be triggered) I observed something strange: CPU utilisation was high in both cases, and @karelzak thinks that this code is no longer needed because the associated kernel bug is fixed. Btw, we loop in libmount because while reading a big mountinfo we need to invoke multiple
Hi. snip ...
Umm ... ok that should be around 200 ms between reads, I think that's a good amount. snip ...
LOL, my test case is pretty much designed to trigger that little hiccup. OTOH this is about dealing with high notification rates, which implies mountinfo changes, so I think we have a bit of an additional problem to deal with.
Yeah, but the fix went in fairly recently so it probably needs to stay in libmount ... Ian
You know, I think this isn't quite the right thing to do in libmount. Using poll() to check if the mount table has changed after a read is always going to be chasing an updated mount table under a sustained high rate of notifications. The mount table being out of date after reading it is always possible and frequently likely, so I think it's necessary to not worry too much about that and instead ensure we get the whole table in one read, to avoid the mount table lock acquire/release-between-reads kernel bug. I wonder if Karel would consider doing that instead? Ian
Oh great, the kernel seq file implementation is restricted to a maximum buffer size of page size (4k). |
The problem is (was) that the kernel skips entries in the mount table if you read mountinfo with multiple read() syscalls while someone else is umounting. For more details see

We're not able to detect it in userspace, but we can detect that the mount table has been modified between the read()s and we can re-read it. This is the idea of the libmount workaround. Yes, it's expensive and ugly.

The upstream kernel mountinfo read() bugfix is available since Linux 5.8 (it's already in Fedora and it will probably be backported to CentOS/RHEL-8.4). The libmount workaround is available in util-linux v2.35 (a year ago) and v2.36. I do not want this ugly read()+poll() stuff in util-linux v2.37 (release expected at the end of this year), and I'm going to remove this from Fedora rawhide this week to make things more usable for @msekletar.

The affected users with an old kernel probably also use old util-linux. From my point of view, it's good enough to add a warning to the ReleaseNotes rather than keep this expensive workaround in libmount for people who mix old kernels and new utils. Note, the issue has existed for a decade, we only started to care about it last year ;-)
This workaround has been introduced by http://github.com/karelzak/util-linux/commit/e4925f591c1bfb83719418b56b952830d15b77eb And originally requested by systemd/systemd#10872 It seems we do not need it anymore as the problem should be fixed in kernel since 5.8 (kernel commit 9f6c61f96f2d97cbb5f7fa85607bc398f843ff0f). Note that the libmount solution is very expensive as it repeats read() many times (until we get consistent result) if kernel is busy with mount table modification. This behaviour makes events management in systemd (or other places) pretty difficult as read mountinfo takes time on busy systems. Addresses: systemd/systemd#16537 Signed-off-by: Karel Zak <kzak@redhat.com>
Note, I have tested the original reproducer (tmpfs version), and it works as expected with kernel 5.8 and libmount without the workaround.
@msekletar and @raven-au, here is the Fedora util-linux package without the libmount workaround
Yeah, I thought you knew I was aware of the kernel patch, ;)
Yes, Mikos is trying to get this into RHEL now but we have had a small problem.
Also, yes, there is no choice, there's no other way, the kernel seq file implementation has made sure of that.
Mostly people didn't even know about it, and I didn't know about the seq file locking problem myself until you brought it up, ;) Ian
@@ -160,6 +160,9 @@ int sd_event_source_set_destroy_callback(sd_event_source *s, sd_event_destroy_t
int sd_event_source_get_destroy_callback(sd_event_source *s, sd_event_destroy_t *ret);
int sd_event_source_get_floating(sd_event_source *s);
int sd_event_source_set_floating(sd_event_source *s, int b);
int sd_event_source_set_ratelimit(sd_event_source *s, uint64_t interval_usec, uint64_t burst);
int sd_event_source_get_ratelimit(sd_event_source *s, uint64_t *interval_usec, uint64_t *burst);
doesn't matter, but we usually prefix return args with _ret
r = source_disable_ratelimit(s);
if (r < 0)
        return r;
hmm, i'd prefer if this function would be atomic in behaviour, i.e. fail and make no changes, or succeed and make the requested changes. Right now, if the new0() below fails we'll have disabled the ratelimit, but not enabled the new one.

i.e. allocate everything into temporary variables, then install them once you have acquired them all, and don't change state before that has completed.
static int ratelimit_timeout_handler(sd_event_source *s, uint64_t usec, void *userdata) {
        int r;

        sd_event_source *source = (sd_event_source *) userdata;
no need to cast, void* casts to everything implicitly.
static int ratelimit_timeout_handler(sd_event_source *s, uint64_t usec, void *userdata) {
        int r;

spurious newline
if (s->prepare)
        prioq_reshuffle(s->event->prepare, s, &s->prepare_index);
s->enabled = m;
r = m == SD_EVENT_OFF ? source_disable(s) : source_enable(s, m);
as elsewhere: this is not atomic: now ->enabled is updated before source_{disable/enable}() has a chance to succeed. Functions like this should either fail and leave things untouched, or succeed and then update state, but not leave things half-initialized if they fail in the middle.
Actually I did this on purpose, I wanted to be mindful about not changing behaviour unnecessarily. Previously we updated the state also before doing other actions.
s->enabled = m;
r = m == SD_EVENT_OFF ? source_disable(s) : source_enable(s, m);
if (r < 0)
        return r;
so if the ratelimit is effective and someone calls sd_event_source_set_enabled(), then this will possibly low-level enable the event source again even though it is still ratelimited...

I am pretty sure we should enable the event source on the low level only if both the user-controlled enabled state says so and the ratelimit is not in effect.

It would be great if the source_enable()/source_disable() split could be done in a separate, preparatory commit; it's easier to follow then.
if (r < 0)
        return r;

r = sd_event_source_set_priority(s->ratelimit->timeout_source, SD_EVENT_PRIORITY_IMPORTANT);
i still think this is wrong: this should have the same prio as the original event source, not be higher. If some event source is set to some really high prio, work on it should not be deferred just because the user installed a ton of event sources with ratelimits that run out.
(oh, and if the main event source changes prio — because somebody calls sd_event_source_set_priority() on it — the associated rate limit timer event source should change prio too. this is what i mean by "recursive" btw, which i don't like, i.e. all the props from the main event source always have to be propagated to the rate limit timer event source. I'd much rather have it that the primary event source acts as a timer sometimes and as an IO event source at other times, depending on whether the ratelimit is on or not)
@@ -1865,6 +1865,10 @@ static void mount_enumerate(Manager *m) {
        }

        (void) sd_event_source_set_description(m->mount_event_source, "mount-monitor-dispatch");

        r = sd_event_source_set_ratelimit(m->mount_event_source, 1000000, 5);
instead of 1000000 write USEC_PER_SEC to make this easier to read.

also maybe write a comment next to it: /* put a ratelimit on this event source: trigger only 5 times per 1s */
I've done some more testing with the ratelimit change. Initially, with my 20000 mount autofs start/stop test I saw some reduced CPU usage but it rapidly grew to be similar to the case without ratelimiting. To understand what was happening I performed some different tests. Here are the results without the ratelimit change:
And the results with the ratelimit change:
These results are quite interesting, although not that surprising when you think about it. In short, they show that the ratelimit is quite effective, but as the size of the mount table grows the number of notifications systemd can process drops and the ratelimit becomes less effective. That makes perfect sense, but it does show quite convincingly that an improved kernel mount table handling API is still very much needed. So, while this change looks worthwhile regardless, we don't really know what the environment is for the OpenShift/container high CPU usage cases we've been seeing, so we don't know how effective this will be in resolving them. Ian
What is the current rate-limit? Maybe we can be more aggressive; I don't think you need to update the in-systemd mount table more often than 2x per second. What about exporting this setting to a systemd config file? I can imagine that in some extreme cases users could modify this setting to reduce overhead. @raven-au it would be nice to re-test it with a different setting and compare results.
        return 1;
}

_public_ int sd_event_source_set_ratelimit(sd_event_source *s, uint64_t interval_usec, uint64_t burst) {
so this should be supported for all kinds of event sources, except for timer event sources btw, because i can't see how that would make sense there. i.e. return EINVAL if this is invoked for a timer event source
Note, for 20000 nodes:
Not so bad.
hmm, in your current code turning on the ratelimit means two event sources are created, and I think besides the priority the description string should be propagated too in some form (i.e. suffixed). Also, what happens if the ratelimit is on, and is currently in effect, and I then disable the event source? something needs to happen to the timer event source in this case.

This kind of recursive propagation between the user-facing event source and the internal auxiliary timer event source is something i really don't like about the current approach. I think there should be only one event source, and it should sometimes behave like the event source it is, but if the ratelimit is in effect it should just behave like a timer event source, and then when the ratelimit is over, behave again like it was supposed to behave originally. i.e. an event source with the rate limit in effect should follow most codepaths of a timer event source, and without it being in effect it should follow the codepaths it normally does.
IIUC, with the current hard-coded settings, this mechanism turns off the event source for 1 second if the rate of notifications is greater than 5 per second. That's already pretty aggressive; turning off the source for 1 second every time the rate limit is triggered could easily lead to sluggish mount unit responsiveness. But I'll wait to see what @msekletar thinks about the settings.
That sounds like what I thought would be a better approach (as described above early on), IIUC what you're saying. The other aspect of this approach is that if we ever get a suitable fsinfo() style mount information access method, and new mount notifications similar to what was posted recently, then this mechanism would lend itself well to taking advantage of the notifications queueing mechanism, so that notifications could be processed in batches of up to 256 at a time (256 was the libmount default, 512 the kernel maximum). Together with mount info lookup improvements that would really make a big difference. Unfortunately we only have the ratelimit mechanism available to us for improvement atm. Anyway, that's what I was thinking ...
@kzak I don't think this needs to be exposed in the configuration, because the option is very low-level and of no use to the vast majority of users. We can always add it if this proves to be a problem in the future. As for the current ratelimit settings (max. 5 events per second), I think it is a reasonable compromise. As @raven-au mentioned, going more aggressive (e.g. max 2 events per second) could be perceived negatively by some users. I'd also say that the current results are very promising; the added overhead is basically non-existent for the normal case (no or very few extra mount points) and even for the case of 2500 extra mounts the overhead is OK (~10%) in my opinion. I'd think that any admin doing this is prepared for some resources to be consumed. Even with 5000 mounts it is not that bad, systemd is on CPU just 15% of the time due to this, i.e. it would still be very responsive and would handle other requests in the meantime should they arrive.
@poettering I think I understand conceptually what you want me to do, but I don't know how to go about implementing it (how to change one event source type into the other and back at runtime, assuming that is what you meant). Maybe in the interest of saving everyone's time it would be better if you could take whatever I have and rework it as you see fit. I'd be more than happy. Otherwise, I fear I'd have to drop the ball on this, or maybe @raven-au can finish the job as he seems to understand what you have in mind.
Originally I was wondering if you were concerned about the ratelimit mechanism becoming less flexible if that change was made. But that's not the case with the mount information io handler; it's little more than a callback function that doesn't assert the poll() parameter the way the existing one does. Then locate the io userdata so it can be passed to the handler, and away you go. I was thinking that a callback could be added to the existing setup: if NULL it would behave like it does now, and if not it could switch the source type to a timer event and make the callback at the defined interval.
Well, maybe, but it seems to me that your understanding of the systemd event handling code is far better than mine; ideally we could work on it together?
I do think this is an important aspect of the implementation.
Yes, I agree, without eliminating the overhead of having to get the entire mount table there's not a lot more that can be done.
@poettering this sounds like a good recommendation, but I was wondering if you have some thoughts on how this would be implemented. The various source types hold source-specific information in a union. Obviously, adding a timer event source to the source struct is one way, but that's still the recursive bit that you'd like to keep away from.
Thinking about it, this could be done without compromising consistency by defining structures for each event type and using them in the union, giving them the same names as used in the union. Then, moving the time event out of the union, it can be accessed using the same syntax and name without the need to change existing code. So that would maintain consistency.
Closed in favor of #17274. I've changed the implementation quite a bit so I don't think it makes sense to continue here.
This is my stab at #15464. @poettering @keszybz @neilbrown PTAL.
Obviously this is missing docs; I will work on that once we have the implementation details fleshed out.