core/mount: minimize impact on mount storm. #10268

neilbrown · 2018-10-04T06:00:06Z

If create 2000 mounts (on a 1-CPU qemu VM) with
mkdir -p /MNT/{1..2000}
time for i in {1..2000}; do mount --bind /etc /MNT/$i ; done

it takes around 20 seconds to complete. Much of this time is taken up
by systemd repeatedly processing /proc/self/mountinfo.
If I disable the processing, the time drops to about 4 seconds.

I have reports that on a larger system with multiple active user sessions, each
with it's own systemd, the impact can be higher.

One particular use-case where a large number of mounts can be expected in quick
succession is when the "clearcase" SCM starts up.

This patch modifies the handling up events from /proc/self/mountinfo so
that systemd backs of when a storm is detected. Specifically the time to process
mountinfo is measured, and the process will not be repeated until 10 times
that duration has passed. This ensure systemd won't use more than 10% of
a CPU processing mountinfo.

With this patch, my test aboce takes about 5 seconds.

fsateler · 2018-10-04T15:45:53Z

The description sounds like it fixes #8703

poettering · 2018-10-05T11:21:55Z

hmm, i wish the kernel had better APIs for this, i.e. proper notifications for mounts being added or going away.

I am a bit puzzled by this though: why would this take 20s to process for systemd? i mean, the poll events of /proc/self/mountinfo should be coalesced, so that after the last mount is established it should take only a short time until PID 1 caught up. i.e. making 2,000 changes or 2,000,000 in a short time window should not really matter much as the kernel API should coalesce the POLLIN events for those events and systemd should process /proc/self/mountinfo only a few times while the mounts are being established and a last time immediately after the last was added.

Or is this not so much about the mount events being triggered but about the large size of mount table after all mounts have been established? If that's the case then this suggests that some processing logic when reading the mount table in systemd is O(n^2) or so (instead of the expected O(n)). Quite frankly that's totally possible, but I think we should really work on optimizing that then...

or are you saying that because systemd needs to process the mount table on each POLLIN with O(n) this simply takes away CPU time so much that your loop around mount() gets too little CPU time? i.e. that this is caused by CPU time starvation in the loop?

neilbrown · 2018-10-05T22:13:32Z

I agree about the poor api. Improvements might be coming (https://lwn.net/Articles/760714/) but they are still a long way off I think.

Parsing mountinfo does take longer as the file grows, but it looks linear. It does seem to do quite a lot of work (from eyeballing strace output) so there probably is room for improvement there. I can see it taking 4 seconds to parse when getting close to 2000 entries.
Does systemd do anything to detect "unchanged" lines and minimize the processing of them?

I think the problem is a combination of the parsing taking quite a lot of time, and the high rate of mount events making it that systemd is always runnable. If the mount loop is run with a 'nice' of -20 it runs much faster as systemd doesn't keep interrupting it
So optimizing the parsing would probably help a bit, but rate-limiting the parsing is probably the bigger help.

poettering

conceptually lgtm, but a few comments

(and yupp, this all feels like a big hack around crappy kernel APIs and crappy kernel CPU scheduling, if it's so easy to starve out other processes...)

poettering · 2018-10-10T08:42:05Z

src/core/manager.h

@@ -224,6 +224,9 @@ struct Manager {
        /* Data specific to the mount subsystem */
        struct libmnt_monitor *mount_monitor;
        sd_event_source *mount_event_source;
+        sd_event_source *mount_timeout_source;
+        uint64_t mount_last_read;


please uniformly use "usec_t" for time values. also consider suffixing the variable names with _usec or so, that it's always clear it's a time value and in which unit is used...

I used uint64_t for absolute times because that it what sd_event_add_time() expects. But I'll change to usec_t.

poettering · 2018-10-10T08:42:51Z

src/core/mount.c

@@ -53,6 +53,7 @@ static const UnitActiveState state_translation_table[_MOUNT_STATE_MAX] = {

 static int mount_dispatch_timer(sd_event_source *source, usec_t usec, void *userdata);
 static int mount_dispatch_io(sd_event_source *source, int fd, uint32_t revents, void *userdata);
+static int mount_dispatch_timeout_io(sd_event_source *source, usec_t usec, void *userdata);


if it's a timeout , then it's not "io", hence shouldn't carry the name "io"...

I think a better name would be "mount_dispatch_proc_self_mountinfo_timer()" or so

We are going to do some io (exactly the same way that mount_dispatch_io does) because there was a timeout... But I can also see that value of your suggestion - I'll go with that.

poettering · 2018-10-10T08:43:30Z

src/core/mount.c

@@ -1765,6 +1766,27 @@ static int mount_dispatch_io(sd_event_source *source, int fd, uint32_t revents,
        Iterator i;
        Unit *u;
        int r;
+        uint64_t this_run = now(CLOCK_MONOTONIC);


as above: please use usec_t...

Also, according to our CODING_STYLE we don't like mixing variable declarations and function invocations in one line...

poettering · 2018-10-10T08:44:27Z

src/core/mount.c

@@ -1765,6 +1766,27 @@ static int mount_dispatch_io(sd_event_source *source, int fd, uint32_t revents,
        Iterator i;
        Unit *u;
        int r;
+        uint64_t this_run = now(CLOCK_MONOTONIC);
+
+        if (source && this_run < m->mount_last_read + m->mount_last_duration * 10) {


this is going to explode if mount_last_duration has not been initialized yet, no?

poettering · 2018-10-10T08:44:53Z

src/core/mount.c

+                /* If we run too often we can steal CPU from
+                 * the process that is mounting/unmounting things.
+                 */
+                sd_event_source_set_enabled(source, SD_EVENT_OFF);


needs some error handling (even if it is just logging someithing at log_warning() level and then ignoring it

poettering · 2018-10-10T08:45:25Z

src/core/mount.c

+        }
+
+        /* If an error occurs, assume 10ms */
+        m->mount_last_duration = 10000;


please write as 10 * USEC_PER_MSEC

poettering · 2018-10-10T08:45:36Z

src/core/mount.c

@@ -1891,9 +1913,21 @@ static int mount_dispatch_io(sd_event_source *source, int fd, uint32_t revents,
                device_found_node(m, what, 0, DEVICE_FOUND_MOUNT);
        }

+        m->mount_last_read = this_run;
+        m->mount_last_duration = now(CLOCK_MONOTONIC) - this_run;


poettering · 2018-10-10T08:45:43Z

src/core/mount.c

+static int mount_dispatch_timeout_io(sd_event_source *source, usec_t usec, void *userdata) {
+        Manager *m = userdata;
+
+        sd_event_source_set_enabled(m->mount_event_source, SD_EVENT_ON);


error handling

poettering · 2018-10-10T08:47:13Z

src/core/mount.c

+        sd_event_source_set_enabled(m->mount_event_source, SD_EVENT_ON);
+        m->mount_timeout_source = sd_event_source_unref(source);
+        return mount_dispatch_io(NULL, mnt_monitor_get_fd(m->mount_monitor),
+                                 EPOLLIN, userdata);


can't say i am fan of invoking IO handlers with a NULL source here. that's just ugly... can't we find a different way? maybe just move the processing of the /proc/self/mountinfo changes into a new call of its own, that gets clean arguments

poettering · 2018-10-10T08:48:13Z

src/core/mount.c

        return 0;
 }

+static int mount_dispatch_timeout_io(sd_event_source *source, usec_t usec, void *userdata) {
+        Manager *m = userdata;


somewhere here there should be a big comment explaining the situation and why we do this all. i.e. that this is about CPU starving out other processes when a mount storm takes place. And that this is a hack around bad kernel APIs and CPU scheduling

neilbrown · 2018-11-02T00:00:46Z

Thanks for the thorough review. I think I have addressed everything you raised.
I disagree with your suggestion that this shows a weakness in the Linux scheduler.
The behaviour of systemd in this case (or of any process trying to collect mount notifications) is that the more often it is scheduled, the more total work it has to do. This means that "optimal" scheduling is to wait until there are no more mount events happening. Linux cannot know that this is optimal as in most cases frequent scheduling is good for a process, not bad.

poettering · 2018-11-21T10:51:14Z

src/core/mount.c

@@ -1734,6 +1735,8 @@ static void mount_enumerate(Manager *m) {
                        goto fail;
                }

+                m->mount_last_read_usec = 0;
+                m->mount_last_duration_usec = 0;


this should be necessary. All our objects when allocated are memset() to zero. Hence the Manager object is too

poettering · 2018-11-21T10:53:42Z

src/core/mount.c

        int r;

+        if (now(CLOCK_MONOTONIC) < usec_add(m->mount_last_read_usec,
+                                            m->mount_last_duration_usec * 10)) {


not that it matters, but given that we do the same usec_add() invocation again a few lines down, maybe just store the result in a variable here and reuse it?

poettering · 2018-11-21T10:59:04Z

CI failed, and it looks like this is caused by this commit:

exec-dynamicuser-fixeduser.service: Changed dead -> start
Assertion 's' failed at ../src/libsystemd/sd-event/sd-event.c:2154, function sd_event_source_set_enabled(). Ignoring.
Fail to reenable /proc/self/mountinfo monitor, ignoring: Invalid argument
/run/systemd/unit-root/home is duplicate.
/run/systemd/unit-root/root is duplicate.
/run/systemd/unit-root/run/user is duplicate.
/run/systemd/unit-root/home is redundant by /run/systemd/unit-root/
/run/systemd/unit-root/root is redundant by /run/systemd/unit-root/
/run/systemd/unit-root/run/user is redundant by /run/systemd/unit-root/
Failed to unshare the mount namespace: Operation not supported
exec-dynamicuser-fixeduser.service: Failed to set up namespace, and refusing to continue since the selected namespacing options alter mount environment non-trivially.
Bind mounts: 0, temporary filesystems: 0, root directory: no, root image: no, dynamic user: yes
exec-dynamicuser-fixeduser.service: Failed to set up mount namespacing: Operation not supported
exec-dynamicuser-fixeduser.service: Failed at step NAMESPACE spawning /bin/sh: Operation not supported
FAIL: test-execute (code: 139)
Segmentation fault (core dumped)

neilbrown · 2018-11-30T04:49:16Z

That CI failure must have been because mount_dispatch_proc_self_mountinfo_timer() ran after mount_shutdown(). I need to sd_event_source_unref(m->mount_time_source) in mount_shutdown().
I've added that and made the other changes you suggested - thanks.

poettering · 2018-12-08T21:49:31Z

Patch looks great, but unfortunately it needs a rebase right now to be mergeable. Can you rebase, please?

If we create 2000 mounts (on a 1-CPU qemu VM) with mkdir -p /MNT/{1..2000} time for i in {1..2000}; do mount --bind /etc /MNT/$i ; done it takes around 20 seconds to complete. Much of this time is taken up by systemd repeatedly processing /proc/self/mountinfo. If I disable the processing, the time drops to about 4 seconds. I have reports that on a larger system with multiple active user sessions, each with it's own systemd, the impact can be higher. One particular use-case where a large number of mounts can be expected in quick succession is when the "clearcase" SCM starts up. This patch modifies the handling up events from /proc/self/mountinfo so that systemd backs off when a storm is detected. Specifically the time to process mountinfo is measured, and the process will not be repeated until 10 times that duration has passed. This ensures systemd won't use more than 10% of real time processing mountinfo. With this patch, my test above takes about 5 seconds.

asavah · 2018-12-16T20:47:25Z

@neilbrown This is bad, like TOTALLY bad, results in unbootable system, maybe something went wrong with the rebase?

systemd falls back to an emergency shell.
Surprisingly while in the emergency shell systemctl status home.mount returns OK, and the fs is actually mounted.
The following errors may be observed in the journal:

Dec 16 22:22:20 msib450ig kernel: EXT4-fs (nvme0n1p2): mounted filesystem with ordered data mode. Opts: (null)
Dec 16 22:22:20 msib450ig kernel: VFS: Mounted root (ext4 filesystem) readonly on device 259:2.
Dec 16 22:22:20 msib450ig kernel: devtmpfs: mounted
Dec 16 22:22:20 msib450ig kernel: EXT4-fs (nvme0n1p2): re-mounted. Opts: discard
Dec 16 22:22:20 msib450ig systemd[1]: Mounting /boot/efi...
Dec 16 22:22:20 msib450ig systemd[1]: Mounting /home...
Dec 16 22:22:20 msib450ig systemd[1]: boot-efi.mount: Mount process finished, but there is no mount.
Dec 16 22:22:20 msib450ig systemd[1]: boot-efi.mount: Failed with result 'protocol'.
Dec 16 22:22:20 msib450ig systemd[1]: Failed to mount /boot/efi.
Dec 16 22:22:20 msib450ig systemd[1]: boot-efi.mount: Consumed 3ms CPU time.
Dec 16 22:22:20 msib450ig kernel: EXT4-fs (nvme0n1p3): mounted filesystem with ordered data mode. Opts: discard
Dec 16 22:22:20 msib450ig systemd[1]: home.mount: Mount process finished, but there is no mount.
Dec 16 22:22:20 msib450ig systemd[1]: home.mount: Failed with result 'protocol'.
Dec 16 22:22:20 msib450ig systemd[1]: Failed to mount /home.
Dec 16 22:22:20 msib450ig systemd[1]: home.mount: Consumed 5ms CPU time.

/etc/fstab is simple:

# fstab automatically generated for msib450ig
#<dev>  <mount point>   <type>  <options>       <dump>  <pass>
/dev/nvme0n1p1  /boot/efi       vfat    umask=0077      0       0
/dev/nvme0n1p2  /       ext4    defaults,noatime,discard        0       0
/dev/nvme0n1p3  /home   ext4    defaults,noatime,discard        0       0

Reverting this PR results in a working system

Dec 16 22:39:06 msib450ig kernel: VFS: Mounted root (ext4 filesystem) readonly on device 259:2.
Dec 16 22:39:06 msib450ig kernel: devtmpfs: mounted
Dec 16 22:39:06 msib450ig kernel: EXT4-fs (nvme0n1p2): re-mounted. Opts: discard
Dec 16 22:39:06 msib450ig systemd[1]: Mounting /home...
Dec 16 22:39:06 msib450ig systemd[1]: Mounting /boot/efi...
Dec 16 22:39:06 msib450ig systemd[1]: Mounted /boot/efi.
Dec 16 22:39:06 msib450ig kernel: EXT4-fs (nvme0n1p3): mounted filesystem with ordered data mode. Opts: discard
Dec 16 22:39:06 msib450ig systemd[1]: Mounted /home.

@poettering please consider reverting this unless there is a trivial fixas this PR is broken in runtime.

edit: typos

neilbrown · 2018-12-16T21:43:15Z

I suspect the patch subtly breaks this assumption:
/* Note that due to the io event priority logic, we can be sure the new mountinfo is loaded
* before we process the SIGCHLD for the mount command. */
as it will sometimes delay the loading of new mountinfo (that is the point of the patch).
Maybe if we add
m->mount_last_read_usec = 0;
in mount_spawn() before exec_spawn(). It is a bit of a hack, but should allow the mount io event to be handled immediately.
Possibly a cleaner approach would be to increment a counter in mount_spawn(), and decrement it in mount_sigchld_event(), and have mount_dispatch_io() always process the event if the count is non-zero.
Can you try one of those and see if the problem goes away?
Thanks!

asavah · 2018-12-16T22:06:33Z

@neilbrown

Maybe if we add
m->mount_last_read_usec = 0;
in mount_spawn() before exec_spawn()

Nope, won't even build, in mount_spawn m is Mount , not Manager

../../../src/systemd/src/core/mount.c: In function ‘mount_spawn’:
../../../src/systemd/src/core/mount.c:791:10: error: ‘Mount’ {aka ‘struct Mount’} has no member named ‘mount_last_read_usec’
         m->mount_last_read_usec = 0;
          ^~

If you could provide patches I would gladly test those.
My c skills are quite poor for what this project requires :)

neilbrown · 2018-12-16T22:21:17Z

m->meta.manager->mount_last_read_usec = 0;

should compile.
I'll write a patch for the second approach shortly.

neilbrown · 2018-12-16T22:39:14Z

I've just added a patch to my neilbrown/systemd/mount-storm branch. It is commit bc59f84
I have only compile-tested it.
I would really appreciate if you could test it - thanks.

mbiebl · 2018-12-16T22:47:08Z

I'm surprised this wasn't caught by the CI...

asavah · 2018-12-16T23:02:43Z

@neilbrown Nice, that patch does fix the issue. Tested 3 times, clean boot. TYVM

@mbiebl maybe it triggered local-fs.target failure due to the fact that it's an UEFI system with /boot/efi mount forced in fstab and that I have /home as a separate fs, dunno didn't check what CI does.

In case someone is interested - full log from a failing boot (without the patch linked by neilbrown)
https://pastebin.com/xVBKzRS0

neilbrown · 2018-12-17T04:50:03Z

Hmmm... I'm guessing I need to open a new pull request...
Done #11178

yuwata added pid1 scalability labels Oct 4, 2018

poettering added the needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer label Oct 5, 2018

poettering requested changes Oct 10, 2018

View reviewed changes

poettering added reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks and removed needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer labels Oct 10, 2018

neilbrown force-pushed the mount-storm branch from ce6d415 to 463bd36 Compare November 1, 2018 23:57

yuwata removed the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Nov 12, 2018

poettering reviewed Nov 21, 2018

View reviewed changes

poettering added the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Nov 21, 2018

neilbrown force-pushed the mount-storm branch from 463bd36 to 86d3296 Compare November 30, 2018 04:49

poettering added needs-rebase and removed ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR labels Dec 8, 2018

neilbrown force-pushed the mount-storm branch from 86d3296 to 314515a Compare December 9, 2018 21:53

poettering merged commit 89f9752 into systemd:master Dec 16, 2018

neilbrown deleted the mount-storm branch December 16, 2018 21:35

neilbrown restored the mount-storm branch December 17, 2018 04:46

evverx added the merged-but-needs-fixing label Dec 18, 2018

asavah mentioned this pull request Dec 18, 2018

mount: disable mount-storm protection while mount unit is starting. #11178

Merged

mbiebl mentioned this pull request Dec 18, 2018

udev: tiny cleanups #11196

Merged

asavah mentioned this pull request Dec 18, 2018

Regression: systemd-gpt-auto-generator causes boot delays and intermittent boot failure. #11205

Closed

fsateler mentioned this pull request Dec 29, 2019

A non-privileged user could (easily) DOS systemd by exceeding BUS_WQUEUE_MAX #13674

Closed

msekletar mentioned this pull request Apr 17, 2020

[RFE] Optimise /proc/self/mountinfo notification dispatch #15464

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

core/mount: minimize impact on mount storm. #10268

core/mount: minimize impact on mount storm. #10268

neilbrown commented Oct 4, 2018

fsateler commented Oct 4, 2018

poettering commented Oct 5, 2018 •

edited

Loading

neilbrown commented Oct 5, 2018

poettering left a comment

poettering Oct 10, 2018

neilbrown Nov 1, 2018

poettering Oct 10, 2018

neilbrown Nov 1, 2018

poettering Oct 10, 2018

poettering Oct 10, 2018

poettering Oct 10, 2018

poettering Oct 10, 2018

poettering Oct 10, 2018

poettering Oct 10, 2018

poettering Oct 10, 2018

poettering Oct 10, 2018

neilbrown commented Nov 2, 2018

poettering Nov 21, 2018

poettering Nov 21, 2018

poettering commented Nov 21, 2018

neilbrown commented Nov 30, 2018

poettering commented Dec 8, 2018

asavah commented Dec 16, 2018 •

edited

Loading

neilbrown commented Dec 16, 2018

asavah commented Dec 16, 2018

neilbrown commented Dec 16, 2018

neilbrown commented Dec 16, 2018

mbiebl commented Dec 16, 2018 •

edited

Loading

asavah commented Dec 16, 2018 •

edited

Loading

neilbrown commented Dec 17, 2018

core/mount: minimize impact on mount storm. #10268

core/mount: minimize impact on mount storm. #10268

Conversation

neilbrown commented Oct 4, 2018

fsateler commented Oct 4, 2018

poettering commented Oct 5, 2018 • edited Loading

neilbrown commented Oct 5, 2018

poettering left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

neilbrown commented Nov 2, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

poettering commented Nov 21, 2018

neilbrown commented Nov 30, 2018

poettering commented Dec 8, 2018

asavah commented Dec 16, 2018 • edited Loading

neilbrown commented Dec 16, 2018

asavah commented Dec 16, 2018

neilbrown commented Dec 16, 2018

neilbrown commented Dec 16, 2018

mbiebl commented Dec 16, 2018 • edited Loading

asavah commented Dec 16, 2018 • edited Loading

neilbrown commented Dec 17, 2018

poettering commented Oct 5, 2018 •

edited

Loading

asavah commented Dec 16, 2018 •

edited

Loading

mbiebl commented Dec 16, 2018 •

edited

Loading

asavah commented Dec 16, 2018 •

edited

Loading