sd-event: fix failure to exit rate limiting state #19811
Open to alternatives if we really want to rate limit accesses to /proc/self/mountinfo. But given that systemd has shipped without mount processing rate limiting until now I feel it's acceptable to revert this in the meantime in order to clean up cgroups properly. It also inaccurately shows the leftover mount unit as "active" when the mount is unmounted but systemd hasn't cleaned up the unit. Example output:
|
Let's first understand what the problem is please. The fact that we rate limit the event source doesn't mean we should get out of sync, but instead that it just takes a bit longer to catch up. What's the precise issue you are seeing? |
I've been upgrading systemd from 247.3 to 248.2 (all cgroup2 hosts). On the hosts with systemd stable 248.2 that run container jobs, we've been seeing an increase in leftover cgroups from mount units. They seem to exist in systemd memory (at least through When our container manager starts/stops containers, on the bare metal host there can be 15 or more mounts and unmounts happening as part of this setup (usually a higher multiple of this since there can be multiple containers per host). This burst of mount/unmount operations appears to be causing this mount processing rate limit to get hit. When the container setup finishes and things settle down, some mount units, whose mountpoints are already unmounted, are still left over in pid1's memory. If If I do I did get a reliable repro of the "leaked" mount units on some of our devboxes with systemd 248.2 by starting/stopping multiple containers (~20+ mountpoints?). The same experiment with this revert does not have the problem. |
It sounds like what you're saying is that the rate limiting needs to either run the processing or queue an event as part of its exit from rate limiting. That 1 second pause is quite long and it wouldn't surprise me if there were no further events following for it to catch up, so to speak. |
The sd-event ratelimiting will suppress event handling on a specific source if triggered, for a while, but after that while passes it will fire again. This means, with the chosen parameters here, after 1s everything should always be settled. If it isn't, then this would be a bug in sd-event, since apparently the /proc/self/mountinfo event isn't triggered like it should after the 1s window passed. |
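For readers following along, here is a rough sketch of the "suppress for a while, then fire again" behaviour described above. It is a toy model, not the sd-event implementation: the `toy_ratelimit` struct and `toy_ratelimit_below()` are hypothetical names, timestamps are passed in by the caller instead of being read from a clock, and the burst value is whatever the caller picks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model of a rate limit window; not the sd-event data structure. */
struct toy_ratelimit {
        uint64_t interval_usec; /* length of one window, e.g. 1s = 1000000 */
        unsigned burst;         /* how many events may pass per window */
        uint64_t begin_usec;    /* start of the current window */
        unsigned num;           /* events let through in the current window */
};

/* Returns true if an event arriving at now_usec may be dispatched, false if it
 * is suppressed. Assumes now_usec never goes backwards (monotonic clock). */
static bool toy_ratelimit_below(struct toy_ratelimit *r, uint64_t now_usec) {
        /* First event ever, or the previous window has fully elapsed:
         * open a fresh window and let the event through. */
        if (r->num == 0 || now_usec - r->begin_usec >= r->interval_usec) {
                r->begin_usec = now_usec;
                r->num = 1;
                return true;
        }

        /* Still inside the window and under the burst limit. */
        if (r->num < r->burst) {
                r->num++;
                return true;
        }

        /* Burst exhausted: suppressed until the window ends. */
        return false;
}
```

The point of the model is the last branch: suppression is bounded by the window length, so once the interval has passed the next check succeeds again. The bug discussed in this thread is that nothing performed that "next check" for the mount event source.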
Maybe I'm missing something, but could PID1 prepare a mount tree cache with optimized mount trees which can be combined into final per-service mount trees with minimal number of mounting steps? This should reduce the number of mount events in the system, reduce |
@topimiettinen that should be unnecessary. The mounts configured in unit files only ever happen inside the mount namespace created for the relevant unit, they are not visible outside of it, and hence they do not result in mount events on the host. |
I just tested
However, there are no messages from |
I think the problem is maybe with our priority queue comparator functions (i.e. with @poettering any ideas? |
The cached mount trees could be kept private in a new mount manager process, running on a separate mount namespace, so PID1 and global mount namespace would not be polluted with the cache mounts. Perhaps all mount operations could be concentrated to this process and the mount manager process would fork off the services (or hand over fully mounted trees to PID1 somehow). It would be sort of privilege separation, though pointless since there's no mount specific CAP_SYS_MOUNT privilege which could be dropped by PID1. |
Hmm, there might indeed be something wrong with the ordering in the prioqs, but if so it's not obvious to me how. Note that leaving the ratelimit state is not triggered by either the pending or the prepare prioq, but by the "earliest" prioq, and that one doesn't much care whether the event source is ratelimited or not. See earliest_time_prioq_compare() for the ordering function for that prioq, and process_timer() where leaving the ratelimiting is handled. I am a bit puzzled about this, and wonder why the test case didn't catch this. I guess the fact that various other event sources with other states are in the event loop triggers the issue. Is there any isolated reproducer for this? @anitazha any chance you can gdb through it and figure out why process_timer() doesn't properly make the event source leave the ratelimit logic? i.e. why the event_source_leave_ratelimit() call from process_timer() is never issued? |
Hmm, maybe the offending event sources are actually marked "pending" while being ratelimited. The current code prefers to trigger event sources that are not marked "pending" yet (since after all the "earliest" prioq's purpose is to mark event sources as pending once their time has come, and it makes no sense to mark event sources pending that already are marked so). Maybe the bug is there: instead of ordering non-pending before pending we should order "non-pending OR ratelimited" before "pending AND not-ratelimited". @anitazha, any chance you could patch your local version accordingly? i.e. find earliest_time_prioq_compare(), and then change the middle part that currently reads like this:

    /* Move the pending ones to the end */
    if (!x->pending && y->pending)
            return -1;
    if (x->pending && !y->pending)
            return 1;

to something like this:

    if (event_source_timer_candidate(x) && !event_source_timer_candidate(y))
            return -1;
    if (!event_source_timer_candidate(x) && event_source_timer_candidate(y))
            return 1;

with a helper function:

    static bool event_source_timer_candidate(sd_event_source *s) {
            assert(s);
            /* Returns true for event sources that either are not pending yet (i.e. where it's worth to mark them pending) or which are currently ratelimited (i.e. where it's worth leaving the ratelimited state) */
            return !s->pending || s->ratelimited;
    }

or something like that? |
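To make the effect of the proposed ordering change easier to see outside of systemd, here is a small standalone sketch. It is only an illustration under simplifying assumptions: `struct toy_source`, `toy_timer_candidate()`, `toy_compare_old()` and `toy_compare_new()` are hypothetical stand-ins, the struct carries just three fields, and the real comparator in sd-event looks at more state than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for sd_event_source. */
struct toy_source {
        bool pending;       /* already queued for dispatch */
        bool ratelimited;   /* currently in the rate limit state */
        uint64_t next_usec; /* timer deadline, or end of the rate limit window */
};

/* Mirrors the helper suggested above: worth looking at if it is not yet
 * pending, or if it is ratelimited and needs to leave that state. */
static bool toy_timer_candidate(const struct toy_source *s) {
        assert(s);
        return !s->pending || s->ratelimited;
}

static int toy_compare_time(const struct toy_source *x, const struct toy_source *y) {
        if (x->next_usec < y->next_usec)
                return -1;
        if (x->next_usec > y->next_usec)
                return 1;
        return 0;
}

/* Old ordering: every pending source goes to the end, ratelimited or not. */
static int toy_compare_old(const struct toy_source *x, const struct toy_source *y) {
        if (!x->pending && y->pending)
                return -1;
        if (x->pending && !y->pending)
                return 1;
        return toy_compare_time(x, y);
}

/* Proposed ordering: only "pending AND not ratelimited" goes to the end. */
static int toy_compare_new(const struct toy_source *x, const struct toy_source *y) {
        if (toy_timer_candidate(x) && !toy_timer_candidate(y))
                return -1;
        if (!toy_timer_candidate(x) && toy_timer_candidate(y))
                return 1;
        return toy_compare_time(x, y);
}
```

In this model, a source that is both pending and ratelimited with an early deadline sorts behind an unrelated non-pending timer with a much later deadline under `toy_compare_old()`, so a loop that only inspects the head of the queue would not reach it until that later timer is due. Under `toy_compare_new()` both count as candidates and the earlier deadline wins.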
@poettering I tested a local patched version with the changes you suggested to modify the pending vs rate limited logic and it seems to be working. No more leaked cgroups/units with my repro. Also following @msekletar's logs above I see the event source does exit the rate limit state now:
|
Thanks for your help tracking this down @msekletar @poettering. I've updated this PR with the suggested fix (edited author to be Lennart since he gave the code) and added an extended test that triggers the issue. |
C part looks (obviously… ;-)) good to me. The shell part I'd love to have @keszybz' take on (or someone else who's really good at shell, @mrc0mmand maybe?) |
Please run shellcheck on the new test script, in case you haven't yet. Thanks! |
It's hard to trigger the failure to exit the rate limit state in isolation as it needs multiple event sources in order to show that it gets stuck in the queue. Hence why this is an extended test.
LGTM.
I figure the sleep_between() issue happens because earliest_time_prioq and latest_time_prioq are now utterly out of sync, since one orders the ratelimited stuff differently from the other. In sd-event, timer events contain info for two timestamps: the timestamp after which the event may be triggered, and the timestamp when it has to be triggered, i.e. one is the earliest time and one is the latest time to trigger. The difference between the two is controlled via the timer event "accuracy" value. If you specify a large accuracy value then this basically means the timer is not required to trigger at the precise time, but may be triggered a bit later. This gives the event loop some freedom to reduce wake-ups. Instead of waking up from epoll_wait() on each timer as precisely as we can, we try to prolong the sleeps a bit, saving a bit of CPU/energy, and then once we are awake process as much as we can before sleeping for longer. Anyway, so much about the backstory. What needs to be fixed here I guess is simply that the changes you made to earliest_time_prioq_compare() are also made to latest_time_prioq_compare(). |
The "accuracy" concept we only apply to proper timer events. For ratelimit stuff we say the earliest/latest time is actually the same. |
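The earliest/latest distinction from the two comments above can be sketched as follows. This is a hedged illustration only: `struct toy_timer`, `toy_earliest()` and `toy_latest()` are hypothetical names, and the sketch assumes that a ratelimited source uses the end of its rate limit window for both values, as described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a timer-like event source. */
struct toy_timer {
        bool ratelimited;
        uint64_t next_usec;          /* requested trigger time of a proper timer */
        uint64_t accuracy_usec;      /* allowed slack for a proper timer */
        uint64_t ratelimit_end_usec; /* when the rate limit window ends */
};

/* Earliest point at which the source may be handled. */
static uint64_t toy_earliest(const struct toy_timer *t) {
        return t->ratelimited ? t->ratelimit_end_usec : t->next_usec;
}

/* Latest point at which the source has to be handled. For a ratelimited
 * source no accuracy is applied, so earliest and latest coincide. */
static uint64_t toy_latest(const struct toy_timer *t) {
        if (t->ratelimited)
                return t->ratelimit_end_usec;

        /* Saturating add so a huge accuracy value cannot wrap around. */
        if (t->next_usec > UINT64_MAX - t->accuracy_usec)
                return UINT64_MAX;
        return t->next_usec + t->accuracy_usec;
}
```

Because both queues are derived from the same source state, whatever rule decides "is this source a candidate" has to be identical in both comparators, otherwise the head of one queue can refer to a source the other queue has pushed to the end.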
Instead of ordering non-pending before pending we should order "non-pending OR ratelimited" before "pending AND not-ratelimited". This fixes a bug where ratelimited events were ordered at the end of the priority queue and could be stuck there for an indeterminate amount of time.
3e360d8 to 81107b8
Based on that explanation (and a good look at the code) I refactored the two functions to keep them in sync. |
@yuwata have you seen these networkd-test.py failures before?
|
bionic-arm is having testbed problems, and bionic-s390x is #19881
This reverts commit d586f64. Rate limiting the mount_event_source can cause unmount events to be missed, which leads to mount unit cgroups being leaked (not cleaned up
when the mount is gone).
As a result of the discussion this PR is updated to fix the sd-event rate limit logic.