
sd-event: fix failure to exit rate limiting state #19811

Merged: 2 commits into systemd:main, Jun 10, 2021

Conversation

@anitazha (Member) commented Jun 3, 2021

This reverts commit d586f64.

Rate limiting the mount_event_source can cause unmount events to be
missed, which leads to mount unit cgroups being leaked (not cleaned up
when the mount is gone).

As a result of the discussion, this PR was updated to fix the sd-event rate limit logic.
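
For reference, the reverted commit's rate limit on the mount event source looked roughly like the sketch below; the interval/burst values shown are illustrative, not necessarily the exact ones from d586f64:

        /* Sketch: throttle dispatching of the /proc/self/mountinfo event
         * source to a handful of runs per second; values are illustrative. */
        r = sd_event_source_set_ratelimit(m->mount_event_source,
                                          1 * USEC_PER_SEC, /* interval */
                                          5);               /* burst: max dispatches per interval */
        if (r < 0)
                return log_error_errno(r, "Failed to enable rate limit on mount event source: %m");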

@anitazha (Member, Author) commented Jun 3, 2021

Open to alternatives if we really want to rate limit accesses to /proc/self/mountinfo. But given that systemd has shipped without mount processing rate limiting until now, I feel it's acceptable to revert this in the meantime in order to clean up cgroups properly.

As a side effect, the leftover mount unit is inaccurately shown as "active" when the mount is unmounted but systemd hasn't cleaned up the unit. Example output:

$ systemctl status 'mnt-xarfuse-uid\x2d0-ecb0d435\x2dseed\x2d4026531840.mount'
● mnt-xarfuse-uid\x2d0-ecb0d435\x2dseed\x2d4026531840.mount
     Loaded: not-found (Reason: Unit mnt-xarfuse-uid\x2d0-ecb0d435\x2dseed\x2d4026531840.mount not found.)
     Active: active (mounted)
        CPU: 0
     CGroup: /mnt-xarfuse-uid\x2d0-ecb0d435\x2dseed\x2d4026531840.mount

@anitazha added the pid1 label on Jun 3, 2021
@poettering (Member) commented

Let's first understand what the problem is, please. The fact that we rate limit the event source doesn't mean we should get out of sync, but rather that it just takes a bit longer to catch up.

What's the precise issue you are seeing?

@anitazha (Member, Author) commented Jun 4, 2021

I've been upgrading systemd from 247.3 to 248.2 (all cgroup2 hosts). On the hosts with systemd stable 248.2 that run container jobs, we've been seeing an increase in leftover cgroups from mount units. They seem to exist in systemd's memory (at least through systemctl list-units, hence the output above) and they have cgroups under /sys/fs/cgroup/. We noticed this due to increasing CPU utilization from polling all the leftover cgroups, lining up with the rollout phases.

When our container manager starts/stops containers, on the bare metal host there can be upwards of 15 mounts and unmounts happening as part of this setup (usually a higher multiple of this, since there can be multiple containers per host). This burst of mount/unmount operations appears to be causing the mount processing rate limit to get hit. When the container setup finishes and things settle down, some mount units, whose mountpoints are already unmounted, are still left over in pid1's memory. If systemctl daemon-reload was run at some point, then these mounts also have leftover cgroups associated with them.

If I do systemctl daemon-reexec, the "leaked" mount units + cgroups will get cleaned up. I assume that if the mount processing event gets triggered again after the rate limiting passes, they should also get cleaned up (I haven't tested this). The problem is that after the containers are set up on the host and the mounts/unmounts stop, there doesn't seem to be anything triggering the mount processing event to run again, so the "leaked" mount units + cgroups hang around for hours or days (or weeks? I've only been rolling out for 2 weeks so far). And then the eventual next trigger will probably be another burst of mounts/unmounts, causing an endless cycle...

I did get a reliable repro of the "leaked" mount units on some of our devboxes with systemd 248.2 by starting/stopping multiple containers (~20+ mountpoints?). The same experiment with this revert does not have the problem.

@raven-au commented Jun 4, 2021

It sounds like what you're saying is that the rate limiting needs to either run the processing or queue an event as part of its exit from rate limiting. That 1 second pause is quite long, and it wouldn't surprise me if there were no further events following for it to catch up on, so to speak.

@poettering (Member) commented

The sd-event rate limiting will suppress event handling on a specific source for a while if triggered, but after that while passes it will fire again. This means, with the parameters chosen here, after 1s everything should always be settled. If it isn't, then this would be a bug in sd-event, since apparently the /proc/self/mountinfo event isn't triggered like it should be after the 1s window has passed.
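
To make that contract concrete, here is a minimal standalone sketch against libsystemd's public sd-event API: a defer source limited to 5 dispatches per 1s interval should run in bursts of 5 with roughly 1s pauses in between; on a build affected by the bug discussed here, such a source can instead get stuck after its first burst:

#include <stdio.h>
#include <systemd/sd-event.h>

static int on_defer(sd_event_source *s, void *userdata) {
        unsigned *count = userdata;

        printf("dispatch #%u\n", ++*count);
        if (*count >= 15)
                return sd_event_exit(sd_event_source_get_event(s), 0);
        return 0;
}

int main(void) {
        sd_event *e = NULL;
        sd_event_source *s = NULL;
        unsigned count = 0;

        sd_event_default(&e);
        sd_event_add_defer(e, &s, on_defer, &count);

        /* Defer sources default to SD_EVENT_ONESHOT; keep this one firing
         * continuously so the rate limit actually kicks in. */
        sd_event_source_set_enabled(s, SD_EVENT_ON);

        /* Allow at most 5 dispatches per 1s interval. */
        sd_event_source_set_ratelimit(s, 1000000 /* usec */, 5);

        sd_event_loop(e);

        sd_event_source_unref(s);
        sd_event_unref(e);
        return 0;
}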

@topimiettinen (Contributor) commented

Maybe I'm missing something, but could PID1 prepare a mount tree cache with optimized mount trees which can be combined into final per-service mount trees with minimal number of mounting steps? This should reduce the number of mount events in the system, reduce /proc/self/mountinfo parsing and thus also improve performance. After reading all service files, PID1 should know all mounts it will ever perform (until next daemon-reload or starting mount units etc) and when constructing the mount tree for a service, instead of performing all mounts sequentially, it could perform some global optimization steps beforehand. On a typical system (unlike mine with excessive use of RootImage=), mount trees for most services should have a lot in common.

@poettering (Member) commented

@topimiettinen that should be unnecessary. The mounts configured in unit files only ever happen inside the mount namespace created for the relevant unit, they are not visible outside of it, and hence they do not result in mount events on the host.

@msekletar (Contributor) commented Jun 4, 2021

I just tested /p/s/mountinfo rate limiting on Fedora Rawhide and it is indeed broken. I see that we enter the ratelimited state correctly, but then we never leave it, and m->mount_event_source stays ratelimited even after the timeout. I have these messages in the journal:

[root@ci-vm ~]# journalctl -b -u init.scope | grep rate
Jun 04 17:45:16 ci-vm systemd[1]: Event source 0x561fbab2d6a0 (mount-monitor-dispatch) entered rate limit state.
Jun 04 17:59:37 ci-vm systemd[1]: Event source 0x55cb353e66a0 (mount-monitor-dispatch) entered rate limit state.
Jun 04 18:02:51 ci-vm systemd[1]: Event source 0x55fa78a576a0 (mount-monitor-dispatch) entered rate limit state.

However, there are no messages from event_source_leave_ratelimit() that we left rate limited state. I've also confirmed the same in gdb. Actually event_source_leave_ratelimit() wasn't even called. Weird...

@msekletar (Contributor) commented

I think the problem may be with our priority queue comparator functions (i.e. with pending_prioq_compare() and prepare_prioq_compare()). Those functions always order ratelimited event sources at the very end of those queues. I think the logic that then picks the new wakeup time and rearms the timer might be confused by this.

@poettering any ideas?
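
For reference, the ordering in question looks roughly like this (paraphrased from the comparators; CMP() is systemd's three-way comparison helper):

        /* Paraphrased: ratelimited sources always sort after
         * non-ratelimited ones, i.e. to the very end of the queue. */
        r = CMP(!!x->ratelimited, !!y->ratelimited);
        if (r != 0)
                return r;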

@topimiettinen (Contributor) commented

> @topimiettinen that should be unnecessary. The mounts configured in unit files only ever happen inside the mount namespace created for the relevant unit, they are not visible outside of it, and hence they do not result in mount events on the host.

The cached mount trees could be kept private in a new mount manager process, running in a separate mount namespace, so PID1 and the global mount namespace would not be polluted with the cache mounts. Perhaps all mount operations could be concentrated in this process, and the mount manager process would fork off the services (or hand over fully mounted trees to PID1 somehow). It would be a sort of privilege separation, though pointless, since there's no mount-specific privilege (a hypothetical CAP_SYS_MOUNT) that PID1 could drop.

@msekletar added this to the v249 milestone on Jun 7, 2021
@poettering (Member) commented

> I think the problem may be with our priority queue comparator functions (i.e. with pending_prioq_compare() and prepare_prioq_compare()). Those functions always order ratelimited event sources at the very end of those queues. I think the logic that then picks the new wakeup time and rearms the timer might be confused by this.
>
> @poettering any ideas?

hmm, there might indeed be something wrong with the ordering in the prioqs, but if so, it's not obvious to me how. Note that leaving the ratelimit state is not triggered by either the pending or the prepare prioq, but by the "earliest" prioq, and that one doesn't much care whether the event source is ratelimited or not. See earliest_time_prioq_compare() for that prioq's ordering function, and process_timer(), where leaving the rate limit state is handled.

I am a bit puzzled about this, and wonder why the test case didn't catch it. I guess the fact that various other event sources in other states are in the event loop triggers the issue.

Is there any isolated reproducer for this?

@anitazha any chance you can gdb through it and figure out why process_timer() doesn't properly make the event source leave the ratelimit state? i.e. why is the event_source_leave_ratelimit() call from process_timer() never issued?
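
For context, a simplified sketch of the shape of process_timer() (names approximate): it only ever examines the head of the "earliest" prioq, so a ratelimited source that never sorts to the head never reaches the event_source_leave_ratelimit() call.

        for (;;) {
                s = prioq_peek(d->earliest);
                if (!s || time_event_source_next(s) > n)
                        break; /* nothing due yet */

                if (s->ratelimited) {
                        /* The rate limit window of this source has ended: re-enable it. */
                        r = event_source_leave_ratelimit(s);
                        if (r < 0)
                                return r;
                        continue;
                }

                /* ... otherwise mark the timer event source pending ... */
        }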

@poettering (Member) commented

hmm, maybe the offending event sources are actually marked "pending" while being ratelimited. The current code prefers to trigger event sources that are not yet marked "pending" (since, after all, the "earliest" prioq's purpose is to mark event sources as pending once their time has come, and it makes no sense to mark event sources pending that already are marked so).

Maybe the bug is there: instead of ordering non-pending before pending we should order "non-pending OR ratelimited" before "pending AND not-ratelimited".

@anitazha, any chance you could patch your local version accordingly? i.e. find earliest_time_prioq_compare(), and then change the middle part that currently reads like this:

        /* Move the pending ones to the end */
        if (!x->pending && y->pending)
                return -1;
        if (x->pending && !y->pending)
                return 1;

to something like this:

        if (event_source_timer_candidate(x) && !event_source_timer_candidate(y))
                return -1;
        if (!event_source_timer_candidate(x) && event_source_timer_candidate(y))
                return 1;

with a helper function:

static bool event_source_timer_candidate(sd_event_source *s) {
        assert(s);

        /* Returns true for event sources that either are not pending yet (i.e. where it's worth
         * marking them pending) or which are currently ratelimited (i.e. where it's worth leaving
         * the ratelimited state) */
        return !s->pending || s->ratelimited;
}

or something like that?

@anitazha (Member, Author) commented Jun 8, 2021

@poettering I tested a locally patched version with the changes you suggested to modify the pending vs. ratelimited logic, and it seems to be working. No more leaked cgroups/units with my repro. Also, following @msekletar's logs above, I see the event source does exit the rate limit state now:

$ sudo journalctl -f -u init.scope -g rate
-- Journal begins at Mon 2021-06-07 18:27:20 PDT. --
Jun 07 18:29:04 devvm1510 systemd[1]: Event source 0x56455e8ae270 (mount-monitor-dispatch) entered rate limit state.
Jun 07 18:29:04 devvm1510 systemd[1]: Event source 0x56455e8ae270 (mount-monitor-dispatch) left rate limit state.

@anitazha changed the title from Revert "core: prevent excessive /proc/self/mountinfo parsing" to sd-event: fix failure to exit rate limiting state on Jun 8, 2021
@anitazha (Member, Author) commented Jun 8, 2021

Thanks for your help tracking this down, @msekletar @poettering. I've updated this PR with the suggested fix (editing the commit author to be Lennart since he provided the code) and added an extended test that triggers the issue.

Review threads (outdated, resolved): test/TEST-60-MOUNT-RATELIMIT/Makefile, test/units/testsuite-60.sh (2)
@poettering (Member) commented

The C part looks (obviously… ;-)) good to me. The shell part I'd love to have @keszybz's take on (or someone else who's really good at shell, @mrc0mmand maybe?)

@bluca (Member) commented Jun 8, 2021

Please run shellcheck on the new test script, in case you haven't yet. Thanks!

Review threads (resolved): test/units/testsuite-60.sh (2), src/tests.sh (outdated)
It's hard to trigger the failure to exit the rate limit state in
isolation, as it needs multiple event sources in order to show that it
gets stuck in the queue. Hence this is an extended test.
@yuwata (Member) left a review comment

LGTM.

@yuwata added the good-to-merge/waiting-for-ci label and removed the needs-review label on Jun 9, 2021
@poettering removed the good-to-merge/waiting-for-ci label on Jun 9, 2021
@poettering (Member) commented

I figure the sleep_between() issue happens because earliest_time_prioq and latest_time_prioq are now utterly out of sync, since one orders the ratelimited stuff differently from the other.

In sd-event, timer events carry two timestamps: the timestamp after which the event may be triggered, and the timestamp by which it has to be triggered, i.e. one is the earliest and one the latest time to trigger. The difference between the two is controlled via the timer event's "accuracy" value. If you specify a large accuracy value, this basically means the timer is not required to trigger at the precise time but may be triggered a bit later; this gives the event loop some freedom to reduce wake-ups. Instead of waking up from epoll_wait() for each timer as precisely as we can, we try to prolong the sleeps a bit, saving some CPU/energy, and then once we are awake we process as much as we can before sleeping for longer.
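
As a concrete illustration of the two timestamps, using the public API (values arbitrary; on_time is a hypothetical handler):

        /* A timer due at now+10s with 5s of accuracy slack: the loop may
         * dispatch it anywhere in [now+10s, now+15s], which lets sd-event
         * coalesce wake-ups. */
        uint64_t now_usec;
        sd_event_now(e, CLOCK_MONOTONIC, &now_usec);
        sd_event_add_time(e, &timer_source, CLOCK_MONOTONIC,
                          now_usec + 10 * 1000000ULL, /* earliest trigger time */
                          5 * 1000000ULL,             /* accuracy; latest = earliest + 5s */
                          on_time, NULL);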

Anyway, so much for the backstory. What needs to be fixed here, I guess, is simply that the change you made to earliest_time_prioq_compare() is also made to latest_time_prioq_compare().

@poettering added the reviewed/needs-rework label on Jun 9, 2021
@poettering (Member) commented

The "accuracy" concept we only apply to proper timer events. For ratelimit stuff we say the earliest/latest time is actually the same.

Instead of ordering non-pending before pending we should order
"non-pending OR ratelimited" before "pending AND not-ratelimited".
This fixes a bug where ratelimited events were ordered at the end of the
priority queue and could be stuck there for an indeterminate amount of
time.
@anitazha (Member, Author) commented

Based on that explanation (and a good look at the code) I refactored the two functions to keep them in sync.
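
One plausible shape of that refactor (a sketch; the merged code may differ in detail): both prioqs share a single comparison that treats "not pending OR ratelimited" as a timer candidate, and differ only in which timestamp they compare.

/* Sketch of a comparator shared by the "earliest" and "latest" prioqs;
 * time_func selects which timestamp to compare. Names and structure are
 * approximate, not the exact merged code. */
static int time_prioq_compare(const void *a, const void *b,
                              usec_t (*time_func)(const sd_event_source *s)) {
        const sd_event_source *x = a, *y = b;

        /* Enabled ones first */
        if (x->enabled != SD_EVENT_OFF && y->enabled == SD_EVENT_OFF)
                return -1;
        if (x->enabled == SD_EVENT_OFF && y->enabled != SD_EVENT_OFF)
                return 1;

        /* Timer candidates first: not yet pending, or waiting to leave
         * the rate limit state */
        if (event_source_timer_candidate(x) && !event_source_timer_candidate(y))
                return -1;
        if (!event_source_timer_candidate(x) && event_source_timer_candidate(y))
                return 1;

        /* Finally, order by time */
        return CMP(time_func(x), time_func(y));
}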

@poettering added the good-to-merge/waiting-for-ci label and removed the reviewed/needs-rework label on Jun 10, 2021
@bluca (Member) commented Jun 10, 2021

@yuwata have you seen these networkd-test.py failures before?

======================================================================
ERROR: test_bridge_port_priority (__main__.BridgeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/autopkgtest.76C9uV/build.PST/systemd/test/networkd-test.py", line 245, in test_bridge_port_priority
    self.assertEqual(self.read_attr('port1', 'brport/priority'), '32')
  File "/tmp/autopkgtest.76C9uV/build.PST/systemd/test/networkd-test.py", line 145, in read_attr
    with open(os.path.join('/sys/class/net', link, attribute)) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/sys/class/net/port1/brport/priority'

======================================================================
ERROR: test_bridge_port_priority_set_zero (__main__.BridgeTest)
It should be possible to set the bridge port priority to 0
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/autopkgtest.76C9uV/build.PST/systemd/test/networkd-test.py", line 258, in test_bridge_port_priority_set_zero
    self.assertEqual(self.read_attr('port2', 'brport/priority'), '32')
  File "/tmp/autopkgtest.76C9uV/build.PST/systemd/test/networkd-test.py", line 145, in read_attr
    with open(os.path.join('/sys/class/net', link, attribute)) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/sys/class/net/port2/brport/priority'

======================================================================
ERROR: test_bridge_port_property (__main__.BridgeTest)
Test the "[Bridge]" section keys
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/autopkgtest.76C9uV/build.PST/systemd/test/networkd-test.py", line 271, in test_bridge_port_property
    self.assertEqual(self.read_attr('port2', 'brport/priority'), '32')
  File "/tmp/autopkgtest.76C9uV/build.PST/systemd/test/networkd-test.py", line 145, in read_attr
    with open(os.path.join('/sys/class/net', link, attribute)) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/sys/class/net/port2/brport/priority'

----------------------------------------------------------------------
Ran 35 tests in 159.772s

FAILED (errors=3, skipped=2)
autopkgtest [04:45:26]: test networkd-test.py: -----------------------]

yuwata added a commit to yuwata/systemd that referenced this pull request Jun 10, 2021
@yuwata (Member) commented Jun 10, 2021

> @yuwata have you seen these networkd-test.py failures before?

I hope #19881 fixes the failures. At least, the failures should not be related to this PR.

yuwata added a commit to yuwata/systemd that referenced this pull request Jun 10, 2021
@bluca (Member) left a review comment

bionic-arm is having testbed problems, and bionic-s390x is #19881

@bluca merged commit 7ad9bad into systemd:main on Jun 10, 2021
yuwata added a commit to yuwata/systemd that referenced this pull request Jun 12, 2021
yuwata added a commit to yuwata/systemd that referenced this pull request Jun 12, 2021
@anitazha deleted the revert_mount_rl branch on June 24, 2021
Labels: good-to-merge/waiting-for-ci, pid1
Development

Successfully merging this pull request may close these issues.

v248 still monitoring /proc/self/mountinfo ?
8 participants