Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core/mount: minimize impact on mount storm. #10268

Merged
merged 1 commit into from
Dec 16, 2018

Conversation

neilbrown
Copy link
Contributor

If create 2000 mounts (on a 1-CPU qemu VM) with
mkdir -p /MNT/{1..2000}
time for i in {1..2000}; do mount --bind /etc /MNT/$i ; done

it takes around 20 seconds to complete. Much of this time is taken up
by systemd repeatedly processing /proc/self/mountinfo.
If I disable the processing, the time drops to about 4 seconds.

I have reports that on a larger system with multiple active user sessions, each
with it's own systemd, the impact can be higher.

One particular use-case where a large number of mounts can be expected in quick
succession is when the "clearcase" SCM starts up.

This patch modifies the handling up events from /proc/self/mountinfo so
that systemd backs of when a storm is detected. Specifically the time to process
mountinfo is measured, and the process will not be repeated until 10 times
that duration has passed. This ensure systemd won't use more than 10% of
a CPU processing mountinfo.

With this patch, my test aboce takes about 5 seconds.

@fsateler
Copy link
Member

fsateler commented Oct 4, 2018

The description sounds like it fixes #8703

@poettering
Copy link
Member

poettering commented Oct 5, 2018

hmm, i wish the kernel had better APIs for this, i.e. proper notifications for mounts being added or going away.

I am a bit puzzled by this though: why would this take 20s to process for systemd? i mean, the poll events of /proc/self/mountinfo should be coalesced, so that after the last mount is established it should take only a short time until PID 1 caught up. i.e. making 2,000 changes or 2,000,000 in a short time window should not really matter much as the kernel API should coalesce the POLLIN events for those events and systemd should process /proc/self/mountinfo only a few times while the mounts are being established and a last time immediately after the last was added.

Or is this not so much about the mount events being triggered but about the large size of mount table after all mounts have been established? If that's the case then this suggests that some processing logic when reading the mount table in systemd is O(n^2) or so (instead of the expected O(n)). Quite frankly that's totally possible, but I think we should really work on optimizing that then...

or are you saying that because systemd needs to process the mount table on each POLLIN with O(n) this simply takes away CPU time so much that your loop around mount() gets too little CPU time? i.e. that this is caused by CPU time starvation in the loop?

@poettering poettering added the needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer label Oct 5, 2018
@neilbrown
Copy link
Contributor Author

I agree about the poor api. Improvements might be coming (https://lwn.net/Articles/760714/) but they are still a long way off I think.

Parsing mountinfo does take longer as the file grows, but it looks linear. It does seem to do quite a lot of work (from eyeballing strace output) so there probably is room for improvement there. I can see it taking 4 seconds to parse when getting close to 2000 entries.
Does systemd do anything to detect "unchanged" lines and minimize the processing of them?

I think the problem is a combination of the parsing taking quite a lot of time, and the high rate of mount events making it that systemd is always runnable. If the mount loop is run with a 'nice' of -20 it runs much faster as systemd doesn't keep interrupting it
So optimizing the parsing would probably help a bit, but rate-limiting the parsing is probably the bigger help.

Copy link
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conceptually lgtm, but a few comments

(and yupp, this all feels like a big hack around crappy kernel APIs and crappy kernel CPU scheduling, if it's so easy to starve out other processes...)

@@ -224,6 +224,9 @@ struct Manager {
/* Data specific to the mount subsystem */
struct libmnt_monitor *mount_monitor;
sd_event_source *mount_event_source;
sd_event_source *mount_timeout_source;
uint64_t mount_last_read;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please uniformly use "usec_t" for time values. also consider suffixing the variable names with _usec or so, that it's always clear it's a time value and in which unit is used...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used uint64_t for absolute times because that it what sd_event_add_time() expects. But I'll change to usec_t.

src/core/mount.c Outdated
@@ -53,6 +53,7 @@ static const UnitActiveState state_translation_table[_MOUNT_STATE_MAX] = {

static int mount_dispatch_timer(sd_event_source *source, usec_t usec, void *userdata);
static int mount_dispatch_io(sd_event_source *source, int fd, uint32_t revents, void *userdata);
static int mount_dispatch_timeout_io(sd_event_source *source, usec_t usec, void *userdata);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's a timeout , then it's not "io", hence shouldn't carry the name "io"...

I think a better name would be "mount_dispatch_proc_self_mountinfo_timer()" or so

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are going to do some io (exactly the same way that mount_dispatch_io does) because there was a timeout... But I can also see that value of your suggestion - I'll go with that.

src/core/mount.c Outdated
@@ -1765,6 +1766,27 @@ static int mount_dispatch_io(sd_event_source *source, int fd, uint32_t revents,
Iterator i;
Unit *u;
int r;
uint64_t this_run = now(CLOCK_MONOTONIC);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above: please use usec_t...

Also, according to our CODING_STYLE we don't like mixing variable declarations and function invocations in one line...

src/core/mount.c Outdated
@@ -1765,6 +1766,27 @@ static int mount_dispatch_io(sd_event_source *source, int fd, uint32_t revents,
Iterator i;
Unit *u;
int r;
uint64_t this_run = now(CLOCK_MONOTONIC);

if (source && this_run < m->mount_last_read + m->mount_last_duration * 10) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is going to explode if mount_last_duration has not been initialized yet, no?

src/core/mount.c Outdated
/* If we run too often we can steal CPU from
* the process that is mounting/unmounting things.
*/
sd_event_source_set_enabled(source, SD_EVENT_OFF);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs some error handling (even if it is just logging someithing at log_warning() level and then ignoring it

src/core/mount.c Outdated
}

/* If an error occurs, assume 10ms */
m->mount_last_duration = 10000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please write as 10 * USEC_PER_MSEC

src/core/mount.c Outdated
@@ -1891,9 +1913,21 @@ static int mount_dispatch_io(sd_event_source *source, int fd, uint32_t revents,
device_found_node(m, what, 0, DEVICE_FOUND_MOUNT);
}

m->mount_last_read = this_run;
m->mount_last_duration = now(CLOCK_MONOTONIC) - this_run;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usec_sub()

src/core/mount.c Outdated
static int mount_dispatch_timeout_io(sd_event_source *source, usec_t usec, void *userdata) {
Manager *m = userdata;

sd_event_source_set_enabled(m->mount_event_source, SD_EVENT_ON);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error handling

src/core/mount.c Outdated
sd_event_source_set_enabled(m->mount_event_source, SD_EVENT_ON);
m->mount_timeout_source = sd_event_source_unref(source);
return mount_dispatch_io(NULL, mnt_monitor_get_fd(m->mount_monitor),
EPOLLIN, userdata);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't say i am fan of invoking IO handlers with a NULL source here. that's just ugly... can't we find a different way? maybe just move the processing of the /proc/self/mountinfo changes into a new call of its own, that gets clean arguments

src/core/mount.c Outdated
return 0;
}

static int mount_dispatch_timeout_io(sd_event_source *source, usec_t usec, void *userdata) {
Manager *m = userdata;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

somewhere here there should be a big comment explaining the situation and why we do this all. i.e. that this is about CPU starving out other processes when a mount storm takes place. And that this is a hack around bad kernel APIs and CPU scheduling

@poettering poettering added reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks and removed needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer labels Oct 10, 2018
@neilbrown
Copy link
Contributor Author

Thanks for the thorough review. I think I have addressed everything you raised.
I disagree with your suggestion that this shows a weakness in the Linux scheduler.
The behaviour of systemd in this case (or of any process trying to collect mount notifications) is that the more often it is scheduled, the more total work it has to do. This means that "optimal" scheduling is to wait until there are no more mount events happening. Linux cannot know that this is optimal as in most cases frequent scheduling is good for a process, not bad.

@yuwata yuwata removed the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Nov 12, 2018
src/core/mount.c Outdated
@@ -1734,6 +1735,8 @@ static void mount_enumerate(Manager *m) {
goto fail;
}

m->mount_last_read_usec = 0;
m->mount_last_duration_usec = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be necessary. All our objects when allocated are memset() to zero. Hence the Manager object is too

src/core/mount.c Outdated
int r;

if (now(CLOCK_MONOTONIC) < usec_add(m->mount_last_read_usec,
m->mount_last_duration_usec * 10)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not that it matters, but given that we do the same usec_add() invocation again a few lines down, maybe just store the result in a variable here and reuse it?

@poettering
Copy link
Member

CI failed, and it looks like this is caused by this commit:

exec-dynamicuser-fixeduser.service: Changed dead -> start
Assertion 's' failed at ../src/libsystemd/sd-event/sd-event.c:2154, function sd_event_source_set_enabled(). Ignoring.
Fail to reenable /proc/self/mountinfo monitor, ignoring: Invalid argument
/run/systemd/unit-root/home is duplicate.
/run/systemd/unit-root/root is duplicate.
/run/systemd/unit-root/run/user is duplicate.
/run/systemd/unit-root/home is redundant by /run/systemd/unit-root/
/run/systemd/unit-root/root is redundant by /run/systemd/unit-root/
/run/systemd/unit-root/run/user is redundant by /run/systemd/unit-root/
Failed to unshare the mount namespace: Operation not supported
exec-dynamicuser-fixeduser.service: Failed to set up namespace, and refusing to continue since the selected namespacing options alter mount environment non-trivially.
Bind mounts: 0, temporary filesystems: 0, root directory: no, root image: no, dynamic user: yes
exec-dynamicuser-fixeduser.service: Failed to set up mount namespacing: Operation not supported
exec-dynamicuser-fixeduser.service: Failed at step NAMESPACE spawning /bin/sh: Operation not supported
FAIL: test-execute (code: 139)
Segmentation fault (core dumped)

@poettering poettering added the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Nov 21, 2018
@neilbrown
Copy link
Contributor Author

That CI failure must have been because mount_dispatch_proc_self_mountinfo_timer() ran after mount_shutdown(). I need to sd_event_source_unref(m->mount_time_source) in mount_shutdown().
I've added that and made the other changes you suggested - thanks.

@poettering
Copy link
Member

Patch looks great, but unfortunately it needs a rebase right now to be mergeable. Can you rebase, please?

@poettering poettering added needs-rebase and removed ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR labels Dec 8, 2018
If we create 2000 mounts (on a 1-CPU qemu VM) with
  mkdir -p /MNT/{1..2000}
  time for i in {1..2000}; do mount --bind /etc /MNT/$i ; done

it takes around 20 seconds to complete.  Much of this time is taken up
by systemd repeatedly processing /proc/self/mountinfo.
If I disable the processing, the time drops to about 4 seconds.

I have reports that on a larger system with multiple active user sessions, each
with it's own systemd, the impact can be higher.

One particular use-case where a large number of mounts can be expected in quick
succession is when the "clearcase" SCM starts up.

This patch modifies the handling up events from /proc/self/mountinfo so
that systemd backs off when a storm is detected.  Specifically the time to process
mountinfo is measured, and the process will not be repeated until 10 times
that duration has passed.  This ensures systemd won't use more than 10% of
real time processing mountinfo.

With this patch, my test above takes about 5 seconds.
@asavah
Copy link
Contributor

asavah commented Dec 16, 2018

@neilbrown This is bad, like TOTALLY bad, results in unbootable system, maybe something went wrong with the rebase?

systemd falls back to an emergency shell.
Surprisingly while in the emergency shell systemctl status home.mount returns OK, and the fs is actually mounted.
The following errors may be observed in the journal:

Dec 16 22:22:20 msib450ig kernel: EXT4-fs (nvme0n1p2): mounted filesystem with ordered data mode. Opts: (null)
Dec 16 22:22:20 msib450ig kernel: VFS: Mounted root (ext4 filesystem) readonly on device 259:2.
Dec 16 22:22:20 msib450ig kernel: devtmpfs: mounted
Dec 16 22:22:20 msib450ig kernel: EXT4-fs (nvme0n1p2): re-mounted. Opts: discard
Dec 16 22:22:20 msib450ig systemd[1]: Mounting /boot/efi...
Dec 16 22:22:20 msib450ig systemd[1]: Mounting /home...
Dec 16 22:22:20 msib450ig systemd[1]: boot-efi.mount: Mount process finished, but there is no mount.
Dec 16 22:22:20 msib450ig systemd[1]: boot-efi.mount: Failed with result 'protocol'.
Dec 16 22:22:20 msib450ig systemd[1]: Failed to mount /boot/efi.
Dec 16 22:22:20 msib450ig systemd[1]: boot-efi.mount: Consumed 3ms CPU time.
Dec 16 22:22:20 msib450ig kernel: EXT4-fs (nvme0n1p3): mounted filesystem with ordered data mode. Opts: discard
Dec 16 22:22:20 msib450ig systemd[1]: home.mount: Mount process finished, but there is no mount.
Dec 16 22:22:20 msib450ig systemd[1]: home.mount: Failed with result 'protocol'.
Dec 16 22:22:20 msib450ig systemd[1]: Failed to mount /home.
Dec 16 22:22:20 msib450ig systemd[1]: home.mount: Consumed 5ms CPU time.

/etc/fstab is simple:

# fstab automatically generated for msib450ig
#<dev>  <mount point>   <type>  <options>       <dump>  <pass>
/dev/nvme0n1p1  /boot/efi       vfat    umask=0077      0       0
/dev/nvme0n1p2  /       ext4    defaults,noatime,discard        0       0
/dev/nvme0n1p3  /home   ext4    defaults,noatime,discard        0       0

Reverting this PR results in a working system

Dec 16 22:39:06 msib450ig kernel: VFS: Mounted root (ext4 filesystem) readonly on device 259:2.
Dec 16 22:39:06 msib450ig kernel: devtmpfs: mounted
Dec 16 22:39:06 msib450ig kernel: EXT4-fs (nvme0n1p2): re-mounted. Opts: discard
Dec 16 22:39:06 msib450ig systemd[1]: Mounting /home...
Dec 16 22:39:06 msib450ig systemd[1]: Mounting /boot/efi...
Dec 16 22:39:06 msib450ig systemd[1]: Mounted /boot/efi.
Dec 16 22:39:06 msib450ig kernel: EXT4-fs (nvme0n1p3): mounted filesystem with ordered data mode. Opts: discard
Dec 16 22:39:06 msib450ig systemd[1]: Mounted /home.

@poettering please consider reverting this unless there is a trivial fixas this PR is broken in runtime.

edit: typos

@neilbrown neilbrown deleted the mount-storm branch December 16, 2018 21:35
@neilbrown
Copy link
Contributor Author

I suspect the patch subtly breaks this assumption:
/* Note that due to the io event priority logic, we can be sure the new mountinfo is loaded
* before we process the SIGCHLD for the mount command. */
as it will sometimes delay the loading of new mountinfo (that is the point of the patch).
Maybe if we add
m->mount_last_read_usec = 0;
in mount_spawn() before exec_spawn(). It is a bit of a hack, but should allow the mount io event to be handled immediately.
Possibly a cleaner approach would be to increment a counter in mount_spawn(), and decrement it in mount_sigchld_event(), and have mount_dispatch_io() always process the event if the count is non-zero.
Can you try one of those and see if the problem goes away?
Thanks!

@asavah
Copy link
Contributor

asavah commented Dec 16, 2018

@neilbrown

Maybe if we add
m->mount_last_read_usec = 0;
in mount_spawn() before exec_spawn()

Nope, won't even build, in mount_spawn m is Mount , not Manager

../../../src/systemd/src/core/mount.c: In function ‘mount_spawn’:
../../../src/systemd/src/core/mount.c:791:10: error: ‘Mount’ {aka ‘struct Mount’} has no member named ‘mount_last_read_usec’
         m->mount_last_read_usec = 0;
          ^~

If you could provide patches I would gladly test those.
My c skills are quite poor for what this project requires :)

@neilbrown
Copy link
Contributor Author

m->meta.manager->mount_last_read_usec = 0;

should compile.
I'll write a patch for the second approach shortly.

@neilbrown
Copy link
Contributor Author

I've just added a patch to my neilbrown/systemd/mount-storm branch. It is commit bc59f84
I have only compile-tested it.
I would really appreciate if you could test it - thanks.

@mbiebl
Copy link
Contributor

mbiebl commented Dec 16, 2018

I'm surprised this wasn't caught by the CI...

@asavah
Copy link
Contributor

asavah commented Dec 16, 2018

@neilbrown Nice, that patch does fix the issue. Tested 3 times, clean boot. TYVM

@mbiebl maybe it triggered local-fs.target failure due to the fact that it's an UEFI system with /boot/efi mount forced in fstab and that I have /home as a separate fs, dunno didn't check what CI does.

In case someone is interested - full log from a failing boot (without the patch linked by neilbrown)
https://pastebin.com/xVBKzRS0

@neilbrown neilbrown restored the mount-storm branch December 17, 2018 04:46
@neilbrown
Copy link
Contributor Author

Hmmm... I'm guessing I need to open a new pull request...
Done #11178

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

7 participants