Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logind randomly loses its dbus connection #2925

Closed
1 task
jengelh opened this issue Mar 31, 2016 · 32 comments
Closed
1 task

logind randomly loses its dbus connection #2925

jengelh opened this issue Mar 31, 2016 · 32 comments
Labels
bug 🐛 Programming errors, that need preferential fixing login needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer

Comments

@jengelh
Copy link
Contributor

jengelh commented Mar 31, 2016

Submission type

  • [*] Bug report
  • Request for enhancement (RFE)

systemd version the issue has been seen with

v228 + d8ccf5f + 8936a5e

Used distribution

openSUSE Leap 42.1

In case of bug report: Unexpected behaviour you saw

logind somehow loses its connection to dbus, which causes long delays when trying to login with sshd

# busctl
11:17 dagent:/home/zarafa # busctl 
NAME                             PID PROCESS         USER             CONNECTION    UNIT                      SESSION    DESCRI
:1.0                               1 systemd         root             :1.0          init.scope                -          -     
:1.1                              24 systemd-networkd systemd-network  :1.1          systemd-networkd.service  -          -    
:1.373                         15545 busctl          root             :1.373        sshd.service              -          -     
org.freedesktop.DBus               - -               -                -             -                         -          -     
org.freedesktop.hostname1          - -               -                (activatable) -                         -         
org.freedesktop.locale1            - -               -                (activatable) -                         -         
org.freedesktop.login1             - -               -                (activatable) -                         -         
org.freedesktop.machine1           - -               -                (activatable) -                         -         
org.freedesktop.network1          24 systemd-networkd systemd-network  :1.1          systemd-networkd.service  -          -    
org.freedesktop.systemd1           1 systemd         root             :1.0          init.scope                -          -     
org.freedesktop.timedate1          - -               -                (activatable) -                         -         
11:17 dagent:/home/zarafa # systemctl status systemd-logind
% systemd-logind.service - Login Service
   Loaded: loaded (/usr/lib/systemd/system/systemd-logind.service; static; vendor preset: disabled)
   Active: active (running) since Tue 2016-03-29 15:54:26 CEST; 1 day 19h ago
     Docs: man:systemd-logind.service(8)
           man:logind.conf(5)
           http://www.freedesktop.org/wiki/Software/systemd/logind
           http://www.freedesktop.org/wiki/Software/systemd/multiseat
 Main PID: 128 (systemd-logind)
   Status: "Processing requests..."
   CGroup: /system.slice/system-systemd\x2dnspawn.slice/systemd-nspawn@dagent.service/system.slice/systemd-logind.service
           └─128 /usr/lib/systemd/systemd-logind
.
Mar 30 13:33:08 dagent systemd-logind[128]: New session 2029 of user zarafa.
Mar 30 13:58:31 dagent systemd-logind[128]: New session 2062 of user zarafa.
Mar 30 14:01:50 dagent systemd-logind[128]: Removed session 2062.
Mar 31 02:51:58 dagent systemd-logind[128]: Removed session 2029.
Mar 31 11:00:01 dagent systemd-logind[128]: Failed to abandon session scope: Transport endpoint is not connected
Mar 31 11:05:08 dagent systemd[1]: Started Login Service.
Mar 31 11:05:33 dagent systemd-logind[128]: Failed to abandon session scope: Transport endpoint is not connected
Mar 31 11:12:58 dagent systemd[1]: Started Login Service.
Mar 31 11:13:47 dagent systemd[1]: Started Login Service.
Mar 31 11:15:01 dagent systemd[1]: Started Login Service.
11:17 dagent:/home/zarafa # ps 128
  PID TTY      STAT   TIME COMMAND
  128 ?        Ss     0:01 /usr/lib/systemd/systemd-logind

The long wait is caused by dbus:

# journalctl -u dbus
Mar 31 11:05:08 dagent dbus[23]: [system] Activating via systemd: service name='org.freedesktop.login1' unit='dbus-org.freedesk...
Mar 31 11:05:33 dagent dbus[23]: [system] Failed to activate service 'org.freedesktop.login1': timed out...

Because the service is still running, "systemd start systemd-logind" is a no-op, and since the busname won't reappear, dbus considers it an activation failure.

Other services are prone to the issue as well; networkd however seems to continue living ever since applying said two patches on top of 228.

In case of bug report: Expected behaviour you didn't see

Don't lose the dbus connection.

In case of bug report: Steps to reproduce the problem

if only I knew. Unlike #2136, this issue happens to me without interaction, that is, no "daemon-reload", reexec, or other restarting action was executed to the best of my knowledge.

@poettering
Copy link
Member

this suggests dbus kicked us from the bus, because we sent invalid data to it. any chance you can run dbus-daemon in debug mode, and see what it says about logind's connection before this happens?

@poettering poettering added bug 🐛 Programming errors, that need preferential fixing login needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer labels Apr 1, 2016
@jengelh
Copy link
Contributor Author

jengelh commented Apr 1, 2016

Doing that now. Meanwhile, while replacing the dbus package, I noticed the issue appears again (but maybe it is different?). It appears that logind does not know that it was kicked, and won't exit:

# ps xaf | tail -n3
16184 ?        Ss     0:00 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
16228 ?        Ssl    0:00 /usr/lib/polkit-1/polkitd --no-debug
16251 ?        Ss     0:00 /usr/lib/systemd/systemd-logind
# busctl | grep login1
org.freedesktop.login1          16251 systemd-logind  root             :1.6          systemd-logind.service    -          -                  

Precondition: logind and dbus are operational. Then, let's restart just dbus:

# systemctl restart dbus
PolicyKit daemon disconnected from the bus.
We are no longer a registered authentication agent.
g_dbus_connection_real_closed: Remote peer vanished with error: Underlying GIOStream returned 0 bytes on an async read (g-io-error-quark, 0). Exiting.
# busctl | grep login1
org.freedesktop.login1              - -               -                (activatable) -                         -         
# ps xaf | tail -n3
15920 ?        S      0:00  \_ (sd-pam)
16251 ?        Ss     0:00 /usr/lib/systemd/systemd-logind
16330 ?        Ss     0:00 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation

The login1 endpoint is gone, but logind is still running (under as its old PID).
What I expected to see instead: process 16251 either exiting, or reconnecting to dbus.

@poettering
Copy link
Member

Restarting dbus is not supported generally. That disconnects all clients, and the system generally cannot recover from that. This is a dbus limitation.

@jengelh
Copy link
Contributor Author

jengelh commented Apr 27, 2016

Apr 27 11:52:53 dagent dbus-daemon[25]: 25: [policy.c(1025):bus_client_policy_check_can_send]   (policy) skipping rule because dest org.freedesktop.login1 isn't owned by receiver
Apr 27 11:52:53 dagent dbus-daemon[25]: 25: [policy.c(1226):bus_client_policy_check_can_receive]   (policy) skipping rule because origin org.freedesktop.login1 isn't owned by sender

are some messages I am currently getting (while logind is still on the bus). Waiting for more..

@poettering
Copy link
Member

So, this might be related to https://bugs.freedesktop.org/show_bug.cgi?id=95263.

Is this possibly related to some component creating a lot of logind sessions quickly in a loop? I.e. some cron job, some sudo job or maybe some script creating a lot of ssh connections or so?

@puravida
Copy link

puravida commented Aug 9, 2016

@poettering Your guess would make sense in my case. I'm seeing these errors by running 50+ cron jobs every minute. Each job basically just checks for the existence of a lock file to ensure that the script is already/still running. However, each of these concurrent cron jobs seem to occasionally (within 15-60 minutes) cause the reported dbus disconnect and subsequent systemd failures, which can only be "fixed" with a systemd-logind restart. It "seems" that this occurs at around 300 session logins. However, there is a point at which (seems like 1,000 session logins) even a logind restart will not allow the system to recover; only a full reboot will work.

@ivan
Copy link

ivan commented Aug 15, 2016

I have an Ubuntu 16.04 server where SSH logins and su are taking 30 seconds (possibly) because of this bug. systemctl restart systemd-logind fixes the problem only briefly. I'm logging into the server about once every 3 minutes over SSH.

@Werkov
Copy link
Contributor

Werkov commented Aug 15, 2016

On Mon, Aug 15, 2016 at 12:23:10PM -0700, Ivan Kozik notifications@github.com wrote:

I have an Ubuntu 16.04 server where SSH logins and su are taking 30
seconds (possibly) because of this bug. systemctl restart systemd-logind fixes the problem only briefly. I'm logging into the
server about once every 3 minutes over SSH.
That is AFAIK low intensity to trigger this. Do you see it ~regularly
or rather rarely? Same behavior may happen when dbus is updated and
restarted -- systemd-logind also must be restarted. Manually, sigh.

@ivan
Copy link

ivan commented Aug 15, 2016

I've only broken that particular server once; it's been up 55 days and I can't reboot it yet :-)

Without updating any packages or restarting anything, I just managed to break another updated Ubuntu 16.04 host (systemd 229-4ubuntu7) by flooding it with logins using:

while true; do timeout -k 4s 2s ssh USER@HOST "ls -1"; done

from 10 terminals; I turned off the flood after 5-10 minutes, after which point every connection was delayed for ~30 seconds:

Aug 15 19:41:19 HOST sshd[31621]: Accepted publickey for root from 192.168.1.2 port 39620 ssh2: RSA SHA256:XXXXXX
Aug 15 19:41:19 HOST sshd[31621]: pam_unix(sshd:session): session opened for user root by (uid=0)
Aug 15 19:41:19 HOST dbus[1664]: [system] Activating via systemd: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Aug 15 19:41:19 HOST systemd[1]: Started Login Service.
Aug 15 19:41:44 HOST sshd[31621]: pam_systemd(sshd:session): Failed to create session: Connection timed out
Aug 15 19:41:44 HOST dbus[1664]: [system] Failed to activate service 'org.freedesktop.login1': timed out

systemctl restart systemd-logind has again "fixed" it, probably just temporarily.

@ivan
Copy link

ivan commented Aug 18, 2016

Not sure this is related to the bug, but when sending a flood of logins, loginctl list-sessions seems to grow, and many sessions are never cleaned up.

Here is what the first failure looks like when doing that flood:

Aug 18 03:27:39 HOST sshd[15131]: Received disconnect from 192.168.1.2 port 46804:11: disconnected by user
Aug 18 03:27:39 HOST sshd[15131]: Disconnected from 192.168.1.2 port 46804
Aug 18 03:27:39 HOST sshd[14792]: pam_unix(sshd:session): session closed for user at
Aug 18 03:27:40 HOST systemd[1]: Started Session 1524 of user at.
Aug 18 03:27:40 HOST systemd[1]: Started Session 1525 of user at.
Aug 18 03:27:40 HOST systemd[1]: Started Session 1526 of user at.
Aug 18 03:27:40 HOST systemd[1]: Started Session 1527 of user at.
Aug 18 03:27:40 HOST sshd[15157]: pam_systemd(sshd:session): Failed to create session: Message recipient disconnected from message bus without replying
Aug 18 03:27:40 HOST systemd-logind[1738]: Failed to abandon session scope: Transport endpoint is not connected
Aug 18 03:27:40 HOST sshd[15153]: pam_systemd(sshd:session): Failed to create session: Message recipient disconnected from message bus without replying
Aug 18 03:27:40 HOST sshd[15186]: Received disconnect from 192.168.1.2 port 46810:11: disconnected by user
Aug 18 03:27:40 HOST sshd[15186]: Disconnected from 192.168.1.2 port 46810
Aug 18 03:27:40 HOST sshd[14897]: pam_unix(sshd:session): session closed for user at
Aug 18 03:27:40 HOST dbus[1613]: [system] Activating via systemd: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Aug 18 03:27:40 HOST systemd[1]: Started Login Service.
Aug 18 03:27:40 HOST sshd[15243]: Received disconnect from 192.168.1.2 port 46812:11: disconnected by user
Aug 18 03:27:40 HOST sshd[15243]: Disconnected from 192.168.1.2 port 46812

Edit: I just noticed on https://bugs.freedesktop.org/show_bug.cgi?id=95263 that Ubuntu is tracking this at https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1591411

@lemmy04
Copy link

lemmy04 commented Oct 23, 2016

I can confirm this bug on openSUSE 13.2 running systemd-210.1475218254.1e76ce0-25.48.1.x86_64.

restarting dbus and systemd-logind helps (for some time), then it starts failing again.
2 Machines in question are connected to the internet, port 22 is accessible, and is getting hit with dictionary attaches from time to time. iptables already limits connections to port 22 to 3/minute, 5 in bursts.

@Werkov
Copy link
Contributor

Werkov commented Oct 24, 2016

Hi @lemmy04, what version of dbus are you running? There was a fix released for this recently (dbus-1-x11-1.8.22-19.1), it fixes the reported upstream bug (bsc#978477) similar way as the proposed/Ubuntu patch does.

@lemmy04
Copy link

lemmy04 commented Oct 24, 2016

@Werkov i seem to be running that version of dbus, but i'm still seeing that issue...

@Werkov
Copy link
Contributor

Werkov commented Oct 24, 2016

Hm, strange. Do you see anything suspicious in the syslog preceding the failures begin?

@lemmy04
Copy link

lemmy04 commented Oct 24, 2016

nope, nothing stands out.

@h9k
Copy link

h9k commented Dec 28, 2016

Same problem here on Ubuntu 16.04 running as LXC container

@jengelh
Copy link
Contributor Author

jengelh commented Dec 28, 2016

Not reproducible on openSUSE 42.2's dbus-1.8.22 anymore; they seem to have a patch that bears resemblance to the discussion at https://bugs.freedesktop.org/show_bug.cgi?id=95263 .

@lemmy04
Copy link

lemmy04 commented Dec 28, 2016

@jengelh I can still reproduce it here, and I'm running openSUSE 42.2 with that version of dbus.

@Werkov
Copy link
Contributor

Werkov commented Jan 2, 2017

I believe there must be two subtly different causes and the patch mentioned by @jengelh resolves just one of them.
@lemmy04 Can you reproduce it at will? (I understood from your previous comment that you saw that just sporadically.)

@lemmy04
Copy link

lemmy04 commented Jan 3, 2017

I can't reproduce it... but it almost looks as if a large number of login attempts (read: ssh dictionary attack) triggers it.

@mbiebl
Copy link
Contributor

mbiebl commented Jan 3, 2017

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=770135 looks suspiciously similar.

@dannyman
Copy link

dannyman commented Jan 3, 2017

We have this problem on most, if not all of our 16.04 systems, with the added twist that we can not restart logind:

djh@elk03-16:~$ uname -a
Linux elk03-16 4.4.0-57-generic #78-Ubuntu SMP Fri Dec 9 23:50:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
djh@elk03-16:~$ sudo systemctl status systemd-logind.service
● systemd-logind.service - Login Service
   Loaded: loaded (/lib/systemd/system/systemd-logind.service; static; vendor preset: enabled)
   Active: activating (start) since Tue 2017-01-03 10:14:45 PST; 12s ago
     Docs: man:systemd-logind.service(8)
           man:logind.conf(5)
           http://www.freedesktop.org/wiki/Software/systemd/logind
           http://www.freedesktop.org/wiki/Software/systemd/multiseat
 Main PID: 20516 (systemd-logind)
    Tasks: 1
   Memory: 260.0K
      CPU: 5ms
   CGroup: /system.slice/systemd-logind.service
           └─20516 /lib/systemd/systemd-logind

Jan 03 10:14:45 elk03-16 systemd[1]: Starting Login Service...
djh@elk03-16:~$ sudo systemctl restart systemd-logind.service
Job for systemd-logind.service failed because the control process exited with error code. See "systemctl status systemd-logind.service" and "journalctl -xe" for details.
djh@elk03-16:~$ sudo systemctl status systemd-logind.service
● systemd-logind.service - Login Service
   Loaded: loaded (/lib/systemd/system/systemd-logind.service; static; vendor preset: enabled)
   Active: activating (start) since Tue 2017-01-03 10:15:35 PST; 8s ago
     Docs: man:systemd-logind.service(8)
           man:logind.conf(5)
           http://www.freedesktop.org/wiki/Software/systemd/logind
           http://www.freedesktop.org/wiki/Software/systemd/multiseat
 Main PID: 20541 (systemd-logind)
    Tasks: 1
   Memory: 228.0K
      CPU: 11ms
   CGroup: /system.slice/systemd-logind.service
           └─20541 /lib/systemd/systemd-logind

Jan 03 10:15:35 elk03-16 systemd[1]: Starting Login Service...
djh@elk03-16:~$ sudo journalctl -xe | tail -24
-- The result is failed.
Jan 03 10:15:35 elk03-16 systemd[1]: systemd-logind.service: Unit entered failed state.
Jan 03 10:15:35 elk03-16 systemd[1]: systemd-logind.service: Failed with result 'exit-code'.
Jan 03 10:15:35 elk03-16 systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
Jan 03 10:15:35 elk03-16 sudo[20529]: pam_unix(sudo:session): session closed for user root
Jan 03 10:15:35 elk03-16 systemd[1]: Failed to forward Released message: No buffer space available
Jan 03 10:15:35 elk03-16 systemd[1]: Stopped Login Service.
-- Subject: Unit systemd-logind.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit systemd-logind.service has finished shutting down.
Jan 03 10:15:35 elk03-16 systemd[1]: Starting Login Service...
-- Subject: Unit systemd-logind.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit systemd-logind.service has begun starting up.
Jan 03 10:15:35 elk03-16 systemd[1]: Failed to forward Released message: No buffer space available
Jan 03 10:15:44 elk03-16 sudo[20543]:      djh : TTY=pts/1 ; PWD=/home/djh ; USER=root ; COMMAND=/bin/systemctl status systemd-logind.service
Jan 03 10:15:44 elk03-16 sudo[20543]: pam_unix(sudo:session): session opened for user root by djh(uid=0)
Jan 03 10:15:44 elk03-16 sudo[20543]: pam_unix(sudo:session): session closed for user root
Jan 03 10:15:52 elk03-16 sudo[20546]:      djh : TTY=pts/1 ; PWD=/home/djh ; USER=root ; COMMAND=/bin/journalctl -xe
Jan 03 10:15:52 elk03-16 sudo[20546]: pam_unix(sudo:session): session opened for user root by djh(uid=0)

We use Ansible so repeated SSH logins is normal workflow for us.

@thomaszbz
Copy link

For up-to-date ubuntu 16.04, Kernel 4.8, I get this logind restart loop, every few seconds.

May  7 00:37:18 server systemd[1]: systemd-logind.service: Failed with result 'exit-code'.
May  7 00:37:18 server systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
May  7 00:37:18 server systemd[1]: Stopped Login Service.
May  7 00:37:18 server systemd[1]: Starting Login Service...
May  7 00:37:43 server systemd[1]: systemd-logind.service: Main process exited, code=exited, status=1/FAILURE
May  7 00:37:43 server systemd[1]: Failed to start Login Service.
May  7 00:37:43 server systemd[1]: systemd-logind.service: Unit entered failed state.
May  7 00:37:43 server systemd[1]: systemd-logind.service: Failed with result 'exit-code'.
May  7 00:37:43 server systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
May  7 00:37:43 server systemd[1]: Stopped Login Service.
May  7 00:37:43 server systemd[1]: Starting Login Service...

Plus, do I experience very slow logins via ssh on that system (~20 seconds).

@zachfi
Copy link

zachfi commented Jun 20, 2017

I believe I've been bumping into this about once a week on my laptop machines for a year. What information can I provide that helps? Archlinux LTS 4.9.33-1-lts system version 232.

I cannot run any systemctl commands, as I get a 'Connection timed out' message. So restarting services is a bit hard. Network is hard too, since NetworkManager has issues connecting to dbus. I initially thought this was dbus related, but I'm uncertain.

When this happens, my users are able to log into the machine, and run systemctl --user without issue, but all system systemctl commands fail. Pretty much the only thing that "fixes" it is to reboot with systemctl reboot -f, which I believe skips the connection process. I run crons a few times an hour, etc.

Please help.

@Werkov
Copy link
Contributor

Werkov commented Jun 20, 2017

Hi.

  1. What dbus version do you use?
  2. How often does this happen to you? How many cronjobs do you have in parallel at most?
  3. Can you see org.freedesktop.systemd1 owned by PID 1 in busctl output when this happens?
  4. What's in /proc/1/stack?

Werkov added a commit to Werkov/systemd that referenced this issue Jun 22, 2017
When a DBus name is released, NameOwnerChanged signal contains an empty string
as new_owner. Commit bbc2908 changed interpretation of the empty string to a
valid name, which is not consistent with values that are sent by dbus-daemon.

As a side effect, this masks symptoms of systemd-logind dbus disconnections
(systemd#2925) by completely restarting it so it can freshly reconnect to dbus.
keszybz pushed a commit that referenced this issue Jun 23, 2017
When a DBus name is released, NameOwnerChanged signal contains an empty string
as new_owner. Commit bbc2908 changed interpretation of the empty string to a
valid name, which is not consistent with values that are sent by dbus-daemon.

As a side effect, this masks symptoms of systemd-logind dbus disconnections
(#2925) by completely restarting it so it can freshly reconnect to dbus.
@zachfi
Copy link

zachfi commented Jun 23, 2017

Hi.

  1. dbus 1.10.18
  2. This happens on average about once a week, but sometimes more frequently. Parallel crons at most are probably 1-2. During one of the cron jobs, crontab -l is called perhaps twice.

I'll come back for 3 and 4 when I see the issue. Should only be a couple of days. Thank you for the response.

@zachfi
Copy link

zachfi commented Jun 26, 2017

  1. Yes.
$ sudo cat /proc/1/stack
[<ffffffffa0c3e9e4>] rpc_wait_bit_killable+0x24/0xa0 [sunrpc]
[<ffffffffa0c3f694>] __rpc_execute+0xe4/0x430 [sunrpc]
[<ffffffffa0c3feed>] rpc_execute+0x5d/0xc0 [sunrpc]
[<ffffffffa0c31298>] rpc_run_task+0xf8/0x140 [sunrpc]
[<ffffffffa0c313f3>] rpc_call_sync+0x53/0xc0 [sunrpc]
[<ffffffffa0a0c619>] nfs3_rpc_wrapper.constprop.5+0x39/0xa0 [nfsv3]
[<ffffffffa0a0d996>] nfs3_proc_getattr+0x56/0xb0 [nfsv3]
[<ffffffffa0c82386>] __nfs_revalidate_inode+0xf6/0x340 [nfs]
[<ffffffffa0c828bf>] nfs_revalidate_inode+0x4f/0x60 [nfs]
[<ffffffffa0c79d6f>] nfs_weak_revalidate+0x6f/0xd0 [nfs]
[<ffffffff81215b4a>] complete_walk+0x6a/0x80
[<ffffffff81216703>] path_lookupat+0xb3/0x100
[<ffffffff812189df>] filename_lookup+0xaf/0x140
[<ffffffff81218b46>] user_path_at_empty+0x36/0x40
[<ffffffff8120de96>] vfs_fstatat+0x66/0xc0
[<ffffffff8120e3b6>] SyS_newlstat+0x36/0x70
[<ffffffff815fc637>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[<ffffffffffffffff>] 0xffffffffffffffff

This indicates my issue currently is with nfs, which I am using, though I was not accessing any files in the last few hours. Frequent suspends/wakes do sometimes make this pop up.

First, looking just now, it looks like there are a pile of open NFS relatead issues on github, so I'll see if I can find one that fits this situation.

Second, I think there may be multiple issues that I've run into, so I'll report back the next time this happens unless its NFS related.

Thanks for the commands to help diagnose.

@danpovey
Copy link

We saw this today with bunch of Debian machines in a cluster (Debian 8.6).

We saw slow logins, and the logind-systemd systemd service had failed (attempts to restart it would fail, with messages like this):

Jul 17 17:52:04 b11 systemd-logind[98964]: Failed to enable subscription: Activation of org.freedesktop.systemd1 timed out
Jul 17 17:52:04 b11 systemd[1]: systemd-logind.service has no holdoff time, scheduling restart.
Jul 17 17:52:04 b11 systemd[1]: Unit systemd-logind.service entered failed state.
Jul 17 17:52:04 b11 systemd[1]: Failed to start Login Service.
Jul 17 17:52:04 b11 systemd[1]: systemd-logind.service: main process exited, code=exited, status=1/FAILURE
Jul 17 17:52:04 b11 dbus[1963]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 17 17:52:01 b11 CRON[99001]: pam_unix(cron:session): session closed for user root

I believe it's possible that part of the issue was that when these machines were booted, the external network connection might not have been up. This is speculation though.
Attempts to fix the problem by
systemctl restart systemd-logind.service
failed, but the following resolved it in all cases:
systemctl restart dbus; systemctl restart systemd-logind.service
The following messages may help explain-- restaring dbus seems to have fixed the freedesktop thingy:

Jul 17 17:52:20 b11 dbus[99150]: [system] Successfully activated service 'org.freedesktop.systemd1'
Jul 17 17:52:20 b11 dbus[99150]: [system] Activating systemd to hand-off: service name='org.freedesktop.Avahi' unit='dbus-org.freedesktop.
Avahi.service'

The following may help clarify what versions we have:

root@b11:~# dpkg -l | grep dbus | grep -v python | grep -v lib
ii  at-spi2-core                           2.14.0-1                             amd64        Assistive Technology Service Provider Interface (dbus core)
ii  dbus                                   1.8.20-0+deb8u1                      amd64        simple interprocess messaging system (daemon and utilities)
ii  dbus-x11                               1.8.20-0+deb8u1                      amd64        simple interprocess messaging system (X11 deps)
root@b11:~# dpkg -l | grep systemd  
ii  libpam-systemd:amd64                   215-17+deb8u5                        amd64        system and service manager - PAM module
ii  libsystemd-login0:amd64                215-17+deb8u5                        amd64        systemd login utility library (deprecated)
ii  libsystemd0:amd64                      215-17+deb8u5                        amd64        systemd utility library
ii  systemd                                215-17+deb8u5                        amd64        system and service manager
ii  systemd-sysv                           215-17+deb8u5                        amd64        system and service manager - SysV links

I'm commenting here partly so that if people have this same issue on Debian they will find this thread.

@Werkov
Copy link
Contributor

Werkov commented Aug 10, 2017

FTR, I'm now quite confident that the bug initially reported by @jengelh and @lemmy04 can be worked around by b007626 and ultimately eliminated with fixup of dbus bug 95619.

The Debian/Ubuntu instances might be still affected by the original dbus bug 95263 (prior to dbus-1.11.10) (or it's something completely different given given the last reports (unconnected systemctl, no avail with systemd-logind restart)).

@poettering
Copy link
Member

OK, let's close this, given that this appears to be fixed in current versions, according to @Werkov's findings. If this persists with current upstream versions, please reopen.

@slavamas
Copy link

slavamas commented Feb 9, 2018

sorry, not sure why this was closed without verification. ;)
this is what I see on my virtual machine (centos 7)
gdb -c kcore
[New process 1]
Core was generated by `BOOT_IMAGE=/vmlinuz-3.10.0-693.17.1.el7.x86_64 root=/dev/mapper/centos-root ro c'.
#0 0x0000000000000000 in ?? ()
(gdb) quit
[root@centos7 proc]# uname -a
Linux centos7 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

linux box running this vm is fine - i cal leave ssh session for hours, come back and it still works just fine . but this vm started kicking me out of ssh session quit frequently (appr. every 10 min)
I cannot immediately re-login:
ssh: connect to host 192.168.0.10 port 22: Connection refused
after few minutes it will happily let me in.
even if I logout normally and immediately want to login again - I am getting connection refused.
first timestamp - right after login. second one - when I got "broken pipe"
Fri Feb 9 11:19:32 PST 2018
[xx@centos7 ~]$ packet_write_wait: Connection to 192.168.0.10 port 22: Broken pipe
$ date
Fri Feb 9 11:34:29 PST 2018

@Werkov
Copy link
Contributor

Werkov commented Feb 9, 2018

It seems that various ssh logins issues are attributed to this particular dbus-logind interaction.

@slavamas From your description it seems rather like problems with SSH daemon itself rather than the original issue here. I'd recommend checking syslog, especially messages from SSH service. (My tip would be that there may be a resource starvation and not all SSH clients can't be properly handled, it's just a wild guess though.)

doylehuang pushed a commit to doylehuang/systemd that referenced this issue Aug 16, 2018
When a DBus name is released, NameOwnerChanged signal contains an empty string
as new_owner. Commit bbc2908 changed interpretation of the empty string to a
valid name, which is not consistent with values that are sent by dbus-daemon.

As a side effect, this masks symptoms of systemd-logind dbus disconnections
(systemd#2925) by completely restarting it so it can freshly reconnect to dbus.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐛 Programming errors, that need preferential fixing login needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer
Development

No branches or pull requests