Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large number of LookupDynamicUserByName calls by nss-systemd slows down pid 1 #9128

Closed
davide125 opened this issue May 29, 2018 · 7 comments
Closed
Labels
needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer pid1

Comments

@davide125
Copy link
Contributor

systemd version the issue has been seen with

238

Used distribution

CentOS 7

Expected behaviour you didn't see

pid 1 continues to run normally, children are reaped

Unexpected behaviour you saw

pid 1 slows down considerably, systemctl starts hanging, children stop being reaped

Steps to reproduce the problem
On a git master server, there are a large number of short-lived ssh connections, from a variety of different users. On this host, nss-systemd is enabled in /etc/nsswitch.conf for passwd and group, but pam_systemd is not in use (meaning, sessions are not tracked by logind and user processes are accounted under sshd.service). Running busctl monitor shows that every time an incoming ssh connection is established, a LookupDynamicUserByName call for the ssh login user is issued over the bus, which internally is processed by running dynamic_user_lookup_name (defined in src/core/dynamic-user.c). Over a variable period of time (generally between 6 hours and a day from what we've observed), the processing of these calls becomes slower and slower, to the point that several seconds delays are noticeable in the busctl monitor output. At the same time, commands like systemctl start lagging; eventually, pid 1 stops reaping children altogether, causing zombie processes to pile up by the thousand. It's worth noting that this only happens consistently on this one host, which happens to have a very large and essentially random distribution of user logins (meaning, all the time hundreds of different users might be logging in via ssh at the same time); similarly configured hosts with comparable amounts of traffic, but less distributed (meaning, most of the logins are concentrated into a few users) have not exhibited this issue so far. I'm still working on a synthetic repro for this issue, and will update this if I can come up with something useful. Also worth noting, as far as I can tell this is a regression in 238 (though I can't prove this conclusively at the moment).

@davide125
Copy link
Contributor Author

I should add that for the time being removing nss-systemd references from /etc/nsswitch.conf on this host has mitigated the issue for us.

@davide125
Copy link
Contributor Author

davide125 commented May 29, 2018

Update: we're now seeing a similar issue on the same machine, this time with nss-mymachines. I see calls to MapToMachineGroup and MapToMachineUser on the bus:

‣ Type=method_call  Endian=l  Flags=0  Version=1  Priority=0 Cookie=2
  Sender=:1.55609  Destination=org.freedesktop.machine1  Path=/org/freedesktop/machine1  Interface=org.freedesktop.machine1.Manager  Member=MapToMachineGroup
  UniqueName=:1.55609
  MESSAGE "u" {
          UINT32 1000483;
  };

‣ Type=error  Endian=l  Flags=1  Version=1  Priority=0 Cookie=55152  ReplyCookie=2
  Sender=:1.18  Destination=:1.55609
  ErrorName=org.freedesktop.machine1.NoSuchGroupMapping  ErrorMessage="No matching group mapping for 1000483."
  UniqueName=:1.18
  MESSAGE "s" {
          STRING "No matching group mapping for 1000483.";
  };

Interestingly, this whole codepath seems limited to systemd-machined, so I'm not entirely sure why/how it's able to affect pid 1. Nevertheless, after a while the same symptoms manifest: pid 1 gets progressively slower and stops being able to reap children.

@poettering
Copy link
Member

hmm, this smells if something is leaked in PID 1 when these requests are made, and this causes O(n) behaviour or worse somewhere. Any chance you can get some profiling data about PID 1 when this happens? i.e. "perf" data or so that tells us what PID 1 is spending its time on?

@poettering poettering added pid1 needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer labels May 30, 2018
@yuwata
Copy link
Member

yuwata commented May 30, 2018

I've tried to call LookupDynamicUserByName 1000000 times, but it seems everything fine. Memory usage is not changed. And number of opened fd is constant.

Is this issue related to #8015?

@davide125
Copy link
Contributor Author

@yuwata: no, on this host we're not using pam_systemd so logind isn't really in play.

It looks like we may have originally confused the effect for the cause here. Upon further testing, we found that this happens even when both nss-systemd and nss-mymachines are disabled, and there's little to no traffic on dbus. I've filed #9138 with some more details, as we think that's a separate issue, and we're still not sure if/how it relates to this one.

@poettering
Copy link
Member

Now that #9148 is merged, is there anything left on this issue?

@davide125
Copy link
Contributor Author

No, I don't think so. Will reopen if we find evidence that this is still an issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer pid1
Development

No branches or pull requests

3 participants