logind randomly loses its dbus connection #2925
This suggests dbus kicked us from the bus because we sent invalid data to it. Any chance you can run dbus-daemon in debug mode and see what it says about logind's connection before this happens?
Doing that now. Meanwhile, while replacing the dbus package, I noticed the issue appear again (but maybe it is a different one?). It appears that logind does not know that it was kicked, and won't exit:
Precondition: logind and dbus are operational. Then, let's restart just dbus:
The login1 endpoint is gone, but logind is still running (under its old PID).
Restarting dbus is generally not supported. That disconnects all clients, and the system generally cannot recover from that. This is a dbus limitation.
These are some messages I am currently getting (while logind is still on the bus). Waiting for more.
So, this might be related to https://bugs.freedesktop.org/show_bug.cgi?id=95263. Is this possibly related to some component creating a lot of logind sessions quickly in a loop? E.g. some cron job, some sudo job, or maybe some script creating a lot of ssh connections?
@poettering Your guess would make sense in my case. I'm seeing these errors while running 50+ cron jobs every minute. Each job basically just checks for the existence of a lock file to see whether the script is already/still running. However, these concurrent cron jobs seem to occasionally (within 15-60 minutes) cause the reported dbus disconnect and subsequent systemd failures, which can only be "fixed" by restarting systemd-logind. This "seems" to happen at around 300 session logins. However, there is a point (seemingly around 1,000 session logins) beyond which even a logind restart will not let the system recover; only a full reboot works.
I have an Ubuntu 16.04 server where SSH logins and |
On Mon, Aug 15, 2016 at 12:23:10PM -0700, Ivan Kozik notifications@github.com wrote:
I've only broken that particular server once; it's been up 55 days and I can't reboot it yet :-) Without updating any packages or restarting anything, I just managed to break another updated Ubuntu 16.04 host (systemd 229-4ubuntu7) by flooding it with logins using:
from 10 terminals; I turned off the flood after 5-10 minutes, after which point every connection was delayed for ~30 seconds:
Not sure this is related to the bug, but here is what the first failure looks like when sending a flood of logins:
Edit: I just noticed on https://bugs.freedesktop.org/show_bug.cgi?id=95263 that Ubuntu is tracking this at https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1591411
I can confirm this bug on openSUSE 13.2 running systemd-210.1475218254.1e76ce0-25.48.1.x86_64. Restarting dbus and systemd-logind helps (for some time), then it starts failing again.
Hi @lemmy04, what version of dbus are you running? A fix for this was released recently (dbus-1-x11-1.8.22-19.1); it fixes the reported upstream bug (bsc#978477) in a similar way to the proposed/Ubuntu patch.
@Werkov I seem to be running that version of dbus, but I'm still seeing that issue...
Hm, strange. Do you see anything suspicious in the syslog before the failures begin?
Nope, nothing stands out.
Same problem here on Ubuntu 16.04 running as an LXC container.
Not reproducible on openSUSE 42.2's dbus-1.8.22 anymore; they seem to have a patch that bears resemblance to the discussion at https://bugs.freedesktop.org/show_bug.cgi?id=95263.
@jengelh I can still reproduce it here, and I'm running openSUSE 42.2 with that version of dbus.
I can't reproduce it... but it almost looks as if a large number of login attempts (read: ssh dictionary attack) triggers it.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=770135 looks suspiciously similar.
We have this problem on most, if not all, of our 16.04 systems, with the added twist that we cannot restart logind:
We use Ansible, so repeated SSH logins are a normal workflow for us.
On an up-to-date Ubuntu 16.04 with kernel 4.8, I get this logind restart loop every few seconds.
I also experience very slow SSH logins on that system (~20 seconds).
I believe I've been bumping into this about once a week on my laptop machines for a year. What information can I provide that would help? Arch Linux, LTS kernel 4.9.33-1-lts, systemd version 232. I cannot run any When this happens, my users are able to log into the machine, and run Please help.
Hi.
When a DBus name is released, NameOwnerChanged signal contains an empty string as new_owner. Commit bbc2908 changed interpretation of the empty string to a valid name, which is not consistent with values that are sent by dbus-daemon. As a side effect, this masks symptoms of systemd-logind dbus disconnections (systemd#2925) by completely restarting it so it can freshly reconnect to dbus.
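For reference, the D-Bus specification defines `NameOwnerChanged` with three string arguments (name, old_owner, new_owner): an empty `old_owner` means the name was previously unowned, and an empty `new_owner` means it is no longer owned. The sketch below is plain Python with no D-Bus bindings, and the `classify_owner_change` helper is hypothetical. It only illustrates the interpretation that dbus-daemon's values imply, namely that an empty `new_owner` signals a release rather than a valid new owner.

```python
from typing import NamedTuple


class NameOwnerChanged(NamedTuple):
    """Arguments of the org.freedesktop.DBus.NameOwnerChanged signal (all strings)."""
    name: str
    old_owner: str
    new_owner: str


def classify_owner_change(sig: NameOwnerChanged) -> str:
    """Hypothetical helper: classify a NameOwnerChanged signal.

    An empty string means "no owner", so it must never be treated as a
    valid unique connection name such as ":1.42".
    """
    had_owner = sig.old_owner != ""
    has_owner = sig.new_owner != ""
    if not had_owner and has_owner:
        return "acquired"     # name was unowned, now has an owner
    if had_owner and not has_owner:
        return "released"     # e.g. the owning connection left the bus
    if had_owner and has_owner:
        return "transferred"  # ownership moved to another connection
    return "invalid"          # both empty: not expected from dbus-daemon


if __name__ == "__main__":
    # A logind-like name losing its owner, as after a bus disconnect.
    sig = NameOwnerChanged("org.freedesktop.login1", ":1.7", "")
    print(classify_owner_change(sig))  # prints "released"
```

Treating the empty `new_owner` as a name in its own right would make a release look like a transfer, which is the inconsistency the commit message describes.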
Hi.
I'll come back for 3 and 4 when I see the issue. It should only be a couple of days. Thank you for the response.
This indicates my issue is currently with NFS, which I am using, though I had not accessed any files in the last few hours. Frequent suspends/wakes do sometimes make this pop up. First, looking just now, there seems to be a pile of open NFS-related issues on GitHub, so I'll see if I can find one that fits this situation. Second, I think there may be multiple issues that I've run into, so I'll report back the next time this happens unless it's NFS-related. Thanks for the commands to help diagnose.
We saw this today on a bunch of Debian machines in a cluster (Debian 8.6). We saw slow logins, and the systemd-logind service had failed (attempts to restart it would fail with messages like this):
I believe it's possible that part of the issue was that when these machines were booted, the external network connection might not have been up. This is speculation though.
The following may help clarify what versions we have:
I'm commenting here partly so that people who have this same issue on Debian will find this thread.
FTR, I'm now quite confident that the bug initially reported by @jengelh and @lemmy04 can be worked around by b007626 and ultimately eliminated with the fix for dbus bug 95619. The Debian/Ubuntu instances might still be affected by the original dbus bug 95263 (prior to dbus-1.11.10) (or it's something completely different given the last reports (unconnected
OK, let's close this, given that this appears to be fixed in current versions, according to @Werkov's findings. If this persists with current upstream versions, please reopen.
Sorry, not sure why this was closed without verification. ;) The Linux box running this VM is fine: I can leave an SSH session open for hours, come back, and it still works just fine. But this VM started kicking me out of SSH sessions quite frequently (approx. every 10 minutes).
It seems that various SSH login issues are attributed to this particular dbus-logind interaction. @slavamas From your description it seems more like a problem with the SSH daemon itself than the original issue here. I'd recommend checking the syslog, especially messages from the SSH service. (My guess would be resource starvation, so that not all SSH clients can be handled properly; it's just a wild guess though.)
In case of bug report: Unexpected behaviour you saw
Other services are prone to the issue as well; networkd, however, seems to have stayed alive ever since I applied the two patches mentioned on top of 228.