Large number of LookupDynamicUserByName calls by nss-systemd slows down pid 1 #9128

davide125 · 2018-05-29T18:41:33Z

systemd version the issue has been seen with

238

Used distribution

CentOS 7

Expected behaviour you didn't see

pid 1 continues to run normally, children are reaped

Unexpected behaviour you saw

pid 1 slows down considerably, systemctl starts hanging, children stop being reaped

Steps to reproduce the problem
On a git master server, there are a large number of short-lived ssh connections, from a variety of different users. On this host, nss-systemd is enabled in /etc/nsswitch.conf for passwd and group, but pam_systemd is not in use (meaning, sessions are not tracked by logind and user processes are accounted under sshd.service). Running busctl monitor shows that every time an incoming ssh connection is established, a LookupDynamicUserByName call for the ssh login user is issued over the bus, which internally is processed by running dynamic_user_lookup_name (defined in src/core/dynamic-user.c). Over a variable period of time (generally between 6 hours and a day from what we've observed), the processing of these calls becomes slower and slower, to the point that several seconds delays are noticeable in the busctl monitor output. At the same time, commands like systemctl start lagging; eventually, pid 1 stops reaping children altogether, causing zombie processes to pile up by the thousand. It's worth noting that this only happens consistently on this one host, which happens to have a very large and essentially random distribution of user logins (meaning, all the time hundreds of different users might be logging in via ssh at the same time); similarly configured hosts with comparable amounts of traffic, but less distributed (meaning, most of the logins are concentrated into a few users) have not exhibited this issue so far. I'm still working on a synthetic repro for this issue, and will update this if I can come up with something useful. Also worth noting, as far as I can tell this is a regression in 238 (though I can't prove this conclusively at the moment).

The text was updated successfully, but these errors were encountered:

davide125 · 2018-05-29T18:42:52Z

I should add that for the time being removing nss-systemd references from /etc/nsswitch.conf on this host has mitigated the issue for us.

davide125 · 2018-05-29T20:25:25Z

Update: we're now seeing a similar issue on the same machine, this time with nss-mymachines. I see calls to MapToMachineGroup and MapToMachineUser on the bus:

‣ Type=method_call  Endian=l  Flags=0  Version=1  Priority=0 Cookie=2
  Sender=:1.55609  Destination=org.freedesktop.machine1  Path=/org/freedesktop/machine1  Interface=org.freedesktop.machine1.Manager  Member=MapToMachineGroup
  UniqueName=:1.55609
  MESSAGE "u" {
          UINT32 1000483;
  };

‣ Type=error  Endian=l  Flags=1  Version=1  Priority=0 Cookie=55152  ReplyCookie=2
  Sender=:1.18  Destination=:1.55609
  ErrorName=org.freedesktop.machine1.NoSuchGroupMapping  ErrorMessage="No matching group mapping for 1000483."
  UniqueName=:1.18
  MESSAGE "s" {
          STRING "No matching group mapping for 1000483.";
  };

Interestingly, this whole codepath seems limited to systemd-machined, so I'm not entirely sure why/how it's able to affect pid 1. Nevertheless, after a while the same symptoms manifest: pid 1 gets progressively slower and stops being able to reap children.

poettering · 2018-05-30T09:36:25Z

hmm, this smells if something is leaked in PID 1 when these requests are made, and this causes O(n) behaviour or worse somewhere. Any chance you can get some profiling data about PID 1 when this happens? i.e. "perf" data or so that tells us what PID 1 is spending its time on?

yuwata · 2018-05-30T14:23:56Z

I've tried to call LookupDynamicUserByName 1000000 times, but it seems everything fine. Memory usage is not changed. And number of opened fd is constant.

Is this issue related to #8015?

davide125 · 2018-05-30T18:19:54Z

@yuwata: no, on this host we're not using pam_systemd so logind isn't really in play.

It looks like we may have originally confused the effect for the cause here. Upon further testing, we found that this happens even when both nss-systemd and nss-mymachines are disabled, and there's little to no traffic on dbus. I've filed #9138 with some more details, as we think that's a separate issue, and we're still not sure if/how it relates to this one.

poettering · 2018-06-06T11:13:04Z

Now that #9148 is merged, is there anything left on this issue?

davide125 · 2018-06-06T14:45:28Z

No, I don't think so. Will reopen if we find evidence that this is still an issue.

poettering added pid1 needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer labels May 30, 2018

davide125 mentioned this issue May 30, 2018

systemd slows down considerably when watching a large number of PIDs #9138

Closed

davide125 closed this as completed Jun 6, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Large number of LookupDynamicUserByName calls by nss-systemd slows down pid 1 #9128

Large number of LookupDynamicUserByName calls by nss-systemd slows down pid 1 #9128

davide125 commented May 29, 2018

davide125 commented May 29, 2018

davide125 commented May 29, 2018 •

edited

poettering commented May 30, 2018

yuwata commented May 30, 2018

davide125 commented May 30, 2018

poettering commented Jun 6, 2018

davide125 commented Jun 6, 2018

Large number of LookupDynamicUserByName calls by nss-systemd slows down pid 1 #9128

Large number of LookupDynamicUserByName calls by nss-systemd slows down pid 1 #9128

Comments

davide125 commented May 29, 2018

davide125 commented May 29, 2018

davide125 commented May 29, 2018 • edited

poettering commented May 30, 2018

yuwata commented May 30, 2018

davide125 commented May 30, 2018

poettering commented Jun 6, 2018

davide125 commented Jun 6, 2018

davide125 commented May 29, 2018 •

edited