
sd-event: fix failure to exit rate limiting state #19811

Merged: 2 commits into systemd:main, Jun 10, 2021

Conversation

@anitazha (Member) commented Jun 3, 2021

This reverts commit d586f64.

Rate limiting the mount_event_source can cause unmount events to be
missed, which leads to mount unit cgroups being leaked (not cleaned up
when the mount is gone).

As a result of the discussion, this PR was updated to fix the sd-event rate limit logic.
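
For reference, the reverted commit's rate limit on the mount event source looked roughly like the sketch below; the interval/burst values shown are illustrative, not necessarily the exact ones from d586f64:

        /* Sketch: throttle dispatching of the /proc/self/mountinfo event
         * source to a handful of runs per second; values are illustrative. */
        r = sd_event_source_set_ratelimit(m->mount_event_source,
                                          1 * USEC_PER_SEC, /* interval */
                                          5);               /* burst: max dispatches per interval */
        if (r < 0)
                return log_error_errno(r, "Failed to enable rate limit on mount event source: %m");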

@anitazha (Member, Author) commented Jun 3, 2021

Open to alternatives if we really want to rate limit accesses to /proc/self/mountinfo. But given that systemd has shipped without mount processing rate limiting until now, I feel it's acceptable to revert this in the meantime in order to clean up cgroups properly.

As a side effect, the leftover mount unit is inaccurately shown as "active" when the mount is unmounted but systemd hasn't cleaned up the unit. Example output:

$ systemctl status 'mnt-xarfuse-uid\x2d0-ecb0d435\x2dseed\x2d4026531840.mount'
● mnt-xarfuse-uid\x2d0-ecb0d435\x2dseed\x2d4026531840.mount
     Loaded: not-found (Reason: Unit mnt-xarfuse-uid\x2d0-ecb0d435\x2dseed\x2d4026531840.mount not found.)
     Active: active (mounted)
        CPU: 0
     CGroup: /mnt-xarfuse-uid\x2d0-ecb0d435\x2dseed\x2d4026531840.mount

@anitazha added the pid1 label on Jun 3, 2021
@poettering (Member) commented

Let's first understand what the problem is, please. The fact that we rate limit the event source doesn't mean we should get out of sync, but rather that it just takes a bit longer to catch up.

What's the precise issue you are seeing?

@anitazha (Member, Author) commented Jun 4, 2021

I've been upgrading systemd from 247.3 to 248.2 (all cgroup2 hosts). On the hosts with systemd stable 248.2 that run container jobs, we've been seeing an increase in leftover cgroups from mount units. They seem to exist in systemd's memory (at least through systemctl list-units, hence the output above) and they have cgroups under /sys/fs/cgroup/. We noticed this due to increasing CPU utilization from polling all the leftover cgroups, lining up with the rollout phases.

When our container manager starts/stops containers, on the bare metal host there can be upwards of 15 mounts and unmounts happening as part of this setup (usually a higher multiple of this, since there can be multiple containers per host). This burst of mount/unmount operations appears to be causing the mount processing rate limit to get hit. When the container setup finishes and things settle down, some mount units, whose mountpoints are already unmounted, are still left over in pid1's memory. If systemctl daemon-reload was run at some point, then these mounts also have leftover cgroups associated with them.

If I do systemctl daemon-reexec, the "leaked" mount units + cgroups will get cleaned up. I assume that if the mount processing event gets triggered again after the rate limiting passes, they should also get cleaned up (I haven't tested this). The problem is that after the containers are set up on the host and the mounts/unmounts stop, there doesn't seem to be anything triggering the mount processing event to run again, so the "leaked" mount units + cgroups hang around for hours or days (or weeks? I've only been rolling out for 2 weeks so far). And then the eventual next trigger will probably be another burst of mounts/unmounts, causing an endless cycle...

I did get a reliable repro of the "leaked" mount units on some of our devboxes with systemd 248.2 by starting/stopping multiple containers (~20+ mountpoints?). The same experiment with this revert does not have the problem.

@raven-au commented Jun 4, 2021

It sounds like what you're saying is that the rate limiting needs to either run the processing or queue an event as part of its exit from rate limiting. That 1 second pause is quite long, and it wouldn't surprise me if there were no further events following for it to catch up on, so to speak.

@poettering (Member) commented

The sd-event rate limiting will suppress event handling on a specific source for a while if triggered, but after that while passes it will fire again. This means, with the parameters chosen here, after 1s everything should always be settled. If it isn't, then this would be a bug in sd-event, since apparently the /proc/self/mountinfo event isn't triggered like it should be after the 1s window has passed.
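
To make that contract concrete, here is a minimal standalone sketch against libsystemd's public sd-event API: a defer source limited to 5 dispatches per 1s interval should run in bursts of 5 with roughly 1s pauses in between; on a build affected by the bug discussed here, such a source can instead get stuck after its first burst:

#include <stdio.h>
#include <systemd/sd-event.h>

static int on_defer(sd_event_source *s, void *userdata) {
        unsigned *count = userdata;

        printf("dispatch #%u\n", ++*count);
        if (*count >= 15)
                return sd_event_exit(sd_event_source_get_event(s), 0);
        return 0;
}

int main(void) {
        sd_event *e = NULL;
        sd_event_source *s = NULL;
        unsigned count = 0;

        sd_event_default(&e);
        sd_event_add_defer(e, &s, on_defer, &count);

        /* Defer sources default to SD_EVENT_ONESHOT; keep this one firing
         * continuously so the rate limit actually kicks in. */
        sd_event_source_set_enabled(s, SD_EVENT_ON);

        /* Allow at most 5 dispatches per 1s interval. */
        sd_event_source_set_ratelimit(s, 1000000 /* usec */, 5);

        sd_event_loop(e);

        sd_event_source_unref(s);
        sd_event_unref(e);
        return 0;
}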

@topimiettinen (Contributor) commented

Maybe I'm missing something, but could PID1 prepare a mount tree cache with optimized mount trees which can be combined into final per-service mount trees with minimal number of mounting steps? This should reduce the number of mount events in the system, reduce /proc/self/mountinfo parsing and thus also improve performance. After reading all service files, PID1 should know all mounts it will ever perform (until next daemon-reload or starting mount units etc) and when constructing the mount tree for a service, instead of performing all mounts sequentially, it could perform some global optimization steps beforehand. On a typical system (unlike mine with excessive use of RootImage=), mount trees for most services should have a lot in common.

@poettering (Member) commented

@topimiettinen that should be unnecessary. The mounts configured in unit files only ever happen inside the mount namespace created for the relevant unit, they are not visible outside of it, and hence they do not result in mount events on the host.

@msekletar (Contributor) commented Jun 4, 2021

I just tested /p/s/mountinfo rate limiting on Fedora Rawhide and it is indeed broken. I see that we enter the ratelimited state correctly, but then we never leave it, and m->mount_event_source stays ratelimited even after the timeout. I have these messages in the journal:

[root@ci-vm ~]# journalctl -b -u init.scope | grep rate
Jun 04 17:45:16 ci-vm systemd[1]: Event source 0x561fbab2d6a0 (mount-monitor-dispatch) entered rate limit state.
Jun 04 17:59:37 ci-vm systemd[1]: Event source 0x55cb353e66a0 (mount-monitor-dispatch) entered rate limit state.
Jun 04 18:02:51 ci-vm systemd[1]: Event source 0x55fa78a576a0 (mount-monitor-dispatch) entered rate limit state.

However, there are no messages from event_source_leave_ratelimit() that we left rate limited state. I've also confirmed the same in gdb. Actually event_source_leave_ratelimit() wasn't even called. Weird...

@msekletar (Contributor) commented

I think the problem may be with our priority queue comparator functions (i.e. with pending_prioq_compare() and prepare_prioq_compare()). Those functions always order ratelimited event sources at the very end of those queues. I think the logic that then picks the new wakeup time and rearms the timer might be confused by this.

@poettering any ideas?
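
For reference, the ordering in question looks roughly like this (paraphrased from the comparators; CMP() is systemd's three-way comparison helper):

        /* Paraphrased: ratelimited sources always sort after
         * non-ratelimited ones, i.e. to the very end of the queue. */
        r = CMP(!!x->ratelimited, !!y->ratelimited);
        if (r != 0)
                return r;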

@topimiettinen (Contributor) commented

> @topimiettinen that should be unnecessary. The mounts configured in unit files only ever happen inside the mount namespace created for the relevant unit, they are not visible outside of it, and hence they do not result in mount events on the host.

The cached mount trees could be kept private in a new mount manager process, running in a separate mount namespace, so PID1 and the global mount namespace would not be polluted with the cache mounts. Perhaps all mount operations could be concentrated in this process, and the mount manager process would fork off the services (or hand over fully mounted trees to PID1 somehow). It would be a sort of privilege separation, though pointless, since there's no mount-specific privilege (a hypothetical CAP_SYS_MOUNT) that PID1 could drop.

@msekletar added this to the v249 milestone on Jun 7, 2021
@poettering (Member) commented

> I think the problem may be with our priority queue comparator functions (i.e. with pending_prioq_compare() and prepare_prioq_compare()). Those functions always order ratelimited event sources at the very end of those queues. I think the logic that then picks the new wakeup time and rearms the timer might be confused by this.
>
> @poettering any ideas?

hmm, there might indeed be something wrong with the ordering in the prioqs, but if so, it's not obvious to me how. Note that leaving the ratelimit state is not triggered by either the pending or the prepare prioq, but by the "earliest" prioq, and that one doesn't much care whether the event source is ratelimited or not. See earliest_time_prioq_compare() for that prioq's ordering function, and process_timer(), where leaving the rate limit state is handled.

I am a bit puzzled about this, and wonder why the test case didn't catch it. I guess the fact that various other event sources in other states are in the event loop triggers the issue.

Is there any isolated reproducer for this?

@anitazha any chance you can gdb through it and figure out why process_timer() doesn't properly make the event source leave the ratelimit state? i.e. why is the event_source_leave_ratelimit() call from process_timer() never issued?
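
For context, a simplified sketch of the shape of process_timer() (names approximate): it only ever examines the head of the "earliest" prioq, so a ratelimited source that never sorts to the head never reaches the event_source_leave_ratelimit() call.

        for (;;) {
                s = prioq_peek(d->earliest);
                if (!s || time_event_source_next(s) > n)
                        break; /* nothing due yet */

                if (s->ratelimited) {
                        /* The rate limit window of this source has ended: re-enable it. */
                        r = event_source_leave_ratelimit(s);
                        if (r < 0)
                                return r;
                        continue;
                }

                /* ... otherwise mark the timer event source pending ... */
        }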

@poettering (Member) commented

hmm, maybe the offending event sources are actually marked "pending" while being ratelimited. The current code prefers to trigger event sources that are not yet marked "pending" (since, after all, the "earliest" prioq's purpose is to mark event sources as pending once their time has come, and it makes no sense to mark event sources pending that already are marked so).

Maybe the bug is there: instead of ordering non-pending before pending we should order "non-pending OR ratelimited" before "pending AND not-ratelimited".

@anitazha, any chance you could patch your local version accordingly? i.e. find earliest_time_prioq_compare(), and then change the middle part that currently reads like this:

        /* Move the pending ones to the end */
        if (!x->pending && y->pending)
                return -1;
        if (x->pending && !y->pending)
                return 1;

to something like this:

        if (event_source_timer_candidate(x) && !event_source_timer_candidate(y))
                return -1;
        if (!event_source_timer_candidate(x) && event_source_timer_candidate(y))
                return 1;

with a helper function:

static bool event_source_timer_candidate(sd_event_source *s) {
        assert(s);

        /* Returns true for event sources that either are not pending yet (i.e. where it's worth
         * marking them pending) or which are currently ratelimited (i.e. where it's worth leaving
         * the ratelimited state) */
        return !s->pending || s->ratelimited;
}

or something like that?

@anitazha (Member, Author) commented Jun 8, 2021

@poettering I tested a locally patched version with the changes you suggested to modify the pending vs. ratelimited logic, and it seems to be working. No more leaked cgroups/units with my repro. Also, following @msekletar's logs above, I see the event source does exit the rate limit state now:

$ sudo journalctl -f -u init.scope -g rate
-- Journal begins at Mon 2021-06-07 18:27:20 PDT. --
Jun 07 18:29:04 devvm1510 systemd[1]: Event source 0x56455e8ae270 (mount-monitor-dispatch) entered rate limit state.
Jun 07 18:29:04 devvm1510 systemd[1]: Event source 0x56455e8ae270 (mount-monitor-dispatch) left rate limit state.

@anitazha changed the title from Revert "core: prevent excessive /proc/self/mountinfo parsing" to sd-event: fix failure to exit rate limiting state on Jun 8, 2021
@anitazha (Member, Author) commented Jun 8, 2021

Thanks for your help tracking this down, @msekletar @poettering. I've updated this PR with the suggested fix (editing the commit author to be Lennart since he provided the code) and added an extended test that triggers the issue.

Review threads (outdated, resolved): test/TEST-60-MOUNT-RATELIMIT/Makefile, test/units/testsuite-60.sh (2)
@poettering (Member) commented

The C part looks (obviously… ;-)) good to me. The shell part I'd love to have @keszybz's take on (or someone else who's really good at shell, @mrc0mmand maybe?)

@bluca (Member) commented Jun 8, 2021

Please run shellcheck on the new test script, in case you haven't yet. Thanks!

Review threads (resolved): test/units/testsuite-60.sh (2), src/tests.sh (outdated)
It's hard to trigger the failure to exit the rate limit state in
isolation, as it needs multiple event sources in order to show that it
gets stuck in the queue. Hence this is an extended test.
@yuwata (Member) left a review comment

LGTM.

@yuwata added the good-to-merge/waiting-for-ci label and removed the needs-review label on Jun 9, 2021
@poettering removed the good-to-merge/waiting-for-ci label on Jun 9, 2021
@poettering (Member) commented

I figure the sleep_between() issue happens because earliest_time_prioq and latest_time_prioq are now utterly out of sync, since one orders the ratelimited stuff differently from the other.

In sd-event, timer events carry two timestamps: the timestamp after which the event may be triggered, and the timestamp by which it has to be triggered, i.e. one is the earliest and one the latest time to trigger. The difference between the two is controlled via the timer event's "accuracy" value. If you specify a large accuracy value, this basically means the timer is not required to trigger at the precise time but may be triggered a bit later; this gives the event loop some freedom to reduce wake-ups. Instead of waking up from epoll_wait() for each timer as precisely as we can, we try to prolong the sleeps a bit, saving some CPU/energy, and then once we are awake we process as much as we can before sleeping for longer.
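
As a concrete illustration of the two timestamps, using the public API (values arbitrary; on_time is a hypothetical handler):

        /* A timer due at now+10s with 5s of accuracy slack: the loop may
         * dispatch it anywhere in [now+10s, now+15s], which lets sd-event
         * coalesce wake-ups. */
        uint64_t now_usec;
        sd_event_now(e, CLOCK_MONOTONIC, &now_usec);
        sd_event_add_time(e, &timer_source, CLOCK_MONOTONIC,
                          now_usec + 10 * 1000000ULL, /* earliest trigger time */
                          5 * 1000000ULL,             /* accuracy; latest = earliest + 5s */
                          on_time, NULL);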

Anyway, so much for the backstory. What needs to be fixed here, I guess, is simply that the change you made to earliest_time_prioq_compare() is also made to latest_time_prioq_compare().

@poettering added the reviewed/needs-rework label on Jun 9, 2021
@poettering (Member) commented

The "accuracy" concept we only apply to proper timer events. For ratelimit stuff we say the earliest/latest time is actually the same.

Instead of ordering non-pending before pending we should order
"non-pending OR ratelimited" before "pending AND not-ratelimited".
This fixes a bug where ratelimited events were ordered at the end of the
priority queue and could be stuck there for an indeterminate amount of
time.
@anitazha (Member, Author) commented

Based on that explanation (and a good look at the code) I refactored the two functions to keep them in sync.
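
One plausible shape of that refactor (a sketch; the merged code may differ in detail): both prioqs share a single comparison that treats "not pending OR ratelimited" as a timer candidate, and differ only in which timestamp they compare.

/* Sketch of a comparator shared by the "earliest" and "latest" prioqs;
 * time_func selects which timestamp to compare. Names and structure are
 * approximate, not the exact merged code. */
static int time_prioq_compare(const void *a, const void *b,
                              usec_t (*time_func)(const sd_event_source *s)) {
        const sd_event_source *x = a, *y = b;

        /* Enabled ones first */
        if (x->enabled != SD_EVENT_OFF && y->enabled == SD_EVENT_OFF)
                return -1;
        if (x->enabled == SD_EVENT_OFF && y->enabled != SD_EVENT_OFF)
                return 1;

        /* Timer candidates first: not yet pending, or waiting to leave
         * the rate limit state */
        if (event_source_timer_candidate(x) && !event_source_timer_candidate(y))
                return -1;
        if (!event_source_timer_candidate(x) && event_source_timer_candidate(y))
                return 1;

        /* Finally, order by time */
        return CMP(time_func(x), time_func(y));
}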

@poettering added the good-to-merge/waiting-for-ci label and removed the reviewed/needs-rework label on Jun 10, 2021
@bluca (Member) commented Jun 10, 2021

@yuwata have you seen these networkd-test.py failures before?

======================================================================
ERROR: test_bridge_port_priority (__main__.BridgeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/autopkgtest.76C9uV/build.PST/systemd/test/networkd-test.py", line 245, in test_bridge_port_priority
    self.assertEqual(self.read_attr('port1', 'brport/priority'), '32')
  File "/tmp/autopkgtest.76C9uV/build.PST/systemd/test/networkd-test.py", line 145, in read_attr
    with open(os.path.join('/sys/class/net', link, attribute)) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/sys/class/net/port1/brport/priority'

======================================================================
ERROR: test_bridge_port_priority_set_zero (__main__.BridgeTest)
It should be possible to set the bridge port priority to 0
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/autopkgtest.76C9uV/build.PST/systemd/test/networkd-test.py", line 258, in test_bridge_port_priority_set_zero
    self.assertEqual(self.read_attr('port2', 'brport/priority'), '32')
  File "/tmp/autopkgtest.76C9uV/build.PST/systemd/test/networkd-test.py", line 145, in read_attr
    with open(os.path.join('/sys/class/net', link, attribute)) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/sys/class/net/port2/brport/priority'

======================================================================
ERROR: test_bridge_port_property (__main__.BridgeTest)
Test the "[Bridge]" section keys
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/autopkgtest.76C9uV/build.PST/systemd/test/networkd-test.py", line 271, in test_bridge_port_property
    self.assertEqual(self.read_attr('port2', 'brport/priority'), '32')
  File "/tmp/autopkgtest.76C9uV/build.PST/systemd/test/networkd-test.py", line 145, in read_attr
    with open(os.path.join('/sys/class/net', link, attribute)) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/sys/class/net/port2/brport/priority'

----------------------------------------------------------------------
Ran 35 tests in 159.772s

FAILED (errors=3, skipped=2)
autopkgtest [04:45:26]: test networkd-test.py: -----------------------]

yuwata added a commit to yuwata/systemd that referenced this pull request Jun 10, 2021
@yuwata (Member) commented Jun 10, 2021

> @yuwata have you seen these networkd-test.py failures before?

I hope #19881 fixes the failures. At least, the failures should not be related to this PR.

yuwata added a commit to yuwata/systemd that referenced this pull request Jun 10, 2021
@bluca (Member) left a review comment

bionic-arm is having testbed problems, and bionic-s390x is #19881

@bluca merged commit 7ad9bad into systemd:main on Jun 10, 2021
yuwata added a commit to yuwata/systemd that referenced this pull request Jun 12, 2021
yuwata added a commit to yuwata/systemd that referenced this pull request Jun 12, 2021
@anitazha deleted the revert_mount_rl branch on June 24, 2021
Labels: good-to-merge/waiting-for-ci, pid1
Development

Successfully merging this pull request may close these issues.

v248 still monitoring /proc/self/mountinfo ?
8 participants