Leak of scope units slowing down "systemctl list-unit-files" and delaying logins #1961
Comments
mattmcdowell commented Dec 3, 2015
I am also seeing this problem on CentOS 7. The delay is causing chef-client to fail when managing services after a month or two of uptime.
julianbrost commented Dec 20, 2015
Is anyone looking into this, or does anyone have hints on how to debug this further? This is pretty annoying, as it forces us to reboot every ~2 months.
keszybz (Member) commented Dec 25, 2015
Yeah, it's easy enough to trigger. Various components fail with exec errors, but pam_systemd also fails in dbus:
Dec 25 11:33:26 rawhide sshd[8020]: Accepted publickey for test from 10.0.0.1 port 54108 ssh2: RSA SHA256:Rah0
Dec 25 11:33:27 rawhide sshd[8020]: pam_systemd(sshd:session): Failed to connect to system bus: Resource temporarily unavailable
Dec 25 11:33:27 rawhide sshd[8020]: pam_unix(sshd:session): session opened for user test by (uid=0)
Dec 25 11:33:27 rawhide sshd[8020]: pam_unix(sshd:session): session closed for user test
... but the reason for left over scope units is probably this:
Dec 25 11:32:42 rawhide sshd[30542]: Accepted publickey for test from 10.0.0.1 port 51414 ssh2: RSA SHA256:Rah0
Dec 25 11:32:43 rawhide sshd[30542]: pam_unix(sshd:session): session opened for user test by (uid=0)
Dec 25 11:32:47 rawhide sshd[30542]: pam_systemd(sshd:session): Failed to release session: Message recipient disconnected from message bus without replying
Dec 25 11:32:47 rawhide sshd[30542]: pam_unix(sshd:session): session closed for user test
or
Dec 25 11:33:13 rawhide sshd[30602]: pam_systemd(sshd:session): Failed to release session: Failed to activate service 'org.freedesktop.login1': timed out
or
Dec 25 11:33:13 rawhide sshd[30558]: pam_systemd(sshd:session): Failed to release session: Connection timed out
keszybz (Member) commented Dec 25, 2015
Hm, it seems that this is not just a transient failure; systemd-logind becomes permanently broken:
$ journalctl -b -u systemd-logind
...
Dec 25 11:32:47 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:32:47 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:32:49 rawhide systemd[1]: Started Login Service.
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:13 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:14 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 11:33:29 rawhide systemd[1]: Started Login Service.
Dec 25 11:54:13 rawhide systemd[1]: Started Login Service.
Dec 25 11:54:53 rawhide systemd[1]: Started Login Service.
Dec 25 12:34:45 rawhide systemd[1]: Started Login Service.
Dec 25 12:39:46 rawhide systemd[1]: Started Login Service.
Dec 25 12:40:11 rawhide systemd-logind[1110]: Failed to abandon session scope: Transport endpoint is not connected
Dec 25 13:05:06 rawhide systemd[1]: Started Login Service.
Dec 25 13:12:24 rawhide systemd[1]: Started Login Service.
An example (normal) login session:
Dec 25 13:05:06 rawhide sshd[21228]: Accepted publickey for test from 10.0.0.1 port 55172 ssh2: RSA SHA256:Rah0
Dec 25 13:05:06 rawhide dbus[1097]: [system] Activating via systemd: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Dec 25 13:05:06 rawhide systemd[1]: Started Login Service.
Dec 25 13:05:32 rawhide dbus[1097]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Dec 25 13:05:32 rawhide sshd[21228]: pam_systemd(sshd:session): Failed to create session: Failed to activate service 'org.freedesktop.login1': timed out
Dec 25 13:05:32 rawhide sshd[21228]: pam_unix(sshd:session): session opened for user test by (uid=0)
So it seems that dbus thinks that logind is gone. But systemd thinks that logind is alive (it is; the process is still there), so when dbus asks systemd to activate logind, it just returns success. So the issue seems to be between dbus and logind.
keszybz (Member) commented Dec 25, 2015
So, running strace on logind shows that logind never even gets woken up; it only sends WATCHDOG=1 notifications to systemd.
Dec 25 11:32:48 rawhide dbus[1097]: [system] Activating via systemd: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Dec 25 11:33:13 rawhide dbus[1097]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Dec 25 11:33:29 rawhide dbus[1097]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30015ms)
Dec 25 11:33:29 rawhide dbus[1097]: [system] Activating via systemd: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Dec 25 11:33:29 rawhide dbus[1097]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30001ms)
Dec 25 11:33:29 rawhide dbus[1097]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30001ms)
Dec 25 11:33:29 rawhide dbus[1097]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30000ms)
Dec 25 11:33:29 rawhide dbus[1097]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30000ms)
Dec 25 11:33:29 rawhide dbus[1097]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30000ms)
Dec 25 11:33:30 rawhide dbus[1097]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30044ms)
...
Dec 25 11:33:32 rawhide dbus[1097]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30033ms)
Dec 25 11:33:32 rawhide dbus[1097]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30012ms)
Dec 25 11:33:54 rawhide dbus[1097]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Dec 25 11:54:13 rawhide dbus[1097]: [system] Activating via systemd: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Dec 25 11:54:38 rawhide dbus[1097]: [system] Failed to activate service 'org.freedesktop.login1': timed out
...
The question is whether this is a dbus problem or a systemd-logind problem.
One option would be for logind (and other systemd daemons that are only accessible through dbus) to send a ping to the dbus daemon every once in a while before notifying the watchdog. After all, they are useless if the connection to the dbus daemon goes down for any reason.
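That suggestion can be sketched roughly as below. The ping and the notification are stubbed out as shell functions (the names are made up for illustration) so the control flow is runnable here; in a real daemon the ping would be an org.freedesktop.DBus.Peer.Ping call and the notification would go through sd_notify(3).

```shell
# Stub: in logind this would be a D-Bus Peer.Ping round-trip.
# The argument simulates bus state: 0 = reachable, nonzero = dead.
ping_bus() { return "$1"; }

# Stub: in logind this would be sd_notify(0, "WATCHDOG=1").
notify_watchdog() { echo "WATCHDOG=1"; }

# Only report liveness when the bus actually answers; if the dbus
# connection is dead, withhold WATCHDOG=1 so that systemd's watchdog
# eventually restarts the daemon instead of keeping a zombie alive.
watchdog_tick() {
    if ping_bus "$1"; then
        notify_watchdog
    else
        echo "bus unreachable, skipping watchdog notification"
    fi
}

healthy=$(watchdog_tick 0)
broken=$(watchdog_tick 1)
echo "$healthy"
echo "$broken"
```

The point of the sketch is only the ordering: probe the bus first, and make the watchdog notification conditional on the probe.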
A commit in keszybz/systemd referenced this issue on Dec 28, 2015.
keszybz self-assigned this on Jan 4, 2016.
canoine commented Jan 22, 2016
Meanwhile, is there a way to "clean" these abandoned scope units without rebooting?
I tried removing all the related files and all the occurrences of the session numbers, and restarting dbus and systemd-logind, but to no avail. The list keeps growing, and I can't mess around with rebooting my production servers all the time.
mazingerzeta commented Jan 22, 2016
You don't actually need to reboot the hosts. @$work, we have found that it is enough to simply remove the scope files a la "rm -f /run/systemd/system/session-*.scope". The hosts that we have applied this to via cron have yet to lock up. It's early days still, but seems to do the trick.
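To see what that workaround does without touching a live host, here is the same rm pattern exercised against a scratch directory standing in for /run/systemd/system (the file names are made up for the demonstration):

```shell
# Scratch directory standing in for /run/systemd/system.
dir=$(mktemp -d)
touch "$dir/session-123.scope" "$dir/session-124.scope" "$dir/sshd.service"

# The workaround: remove only the leaked session scope files,
# leaving other unit files alone.
rm -f "$dir"/session-*.scope

remaining=$(ls "$dir")
echo "$remaining"        # only sshd.service is left
rm -rf "$dir"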
keszybz (Member) commented Jan 24, 2016
I'm still working on this. I started looking through the sd-event code, which resulted in a bunch of cleanups, unfortunately without any bearing on this bug. I should have some time to work on this this weekend.
canoine commented Jan 25, 2016
@mazingerzeta: sadly that isn't working on my systems (CentOS 7.2, using systemd 219).
I deleted the /run/systemd/system/session-*.scope files, the system/session-*.scope.d directories, and the sessions/* files, and cleaned the user/ files, but there are still 390 abandoned session-*.scope units when I run systemctl list-units on a server that runs several cron tasks every minute and was restarted ten days ago.
carlivar commented Jan 29, 2016
We worked around this with a cronjob on all of our systemd systems (CentOS 7 in our case):
* 2,14 * * * root /bin/rm -f /run/systemd/system/*.scope
part-timeDev commented Jan 30, 2016
I am facing the same issue with our git server.
Our build server polls for changes very frequently against all the different repos/branches using ssh, and I see an increasing number of abandoned sessions for that ssh user. I had the slow-login issue too, but was able to run the following command to stop the leftover sessions, and I now use it proactively from time to time:
systemctl |grep "of user git" |grep "abandoned" |grep -e "-[[:digit:]]" |sed "s/\.scope.*/.scope/" |xargs systemctl stop
Maybe that helps someone else until the underlying problem is finally solved.
PS: Thanks to everybody working on this issue.
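To show what that pipeline selects, here is its filter stage run against a few made-up lines in the shape of systemctl output (unit names, states, and the "git" user are invented for this demonstration); the real command then feeds the result to xargs systemctl stop, which is left out here:

```shell
# Sample lines in the shape of "systemctl" output (made up).
sample='session-1.scope loaded active running Session 1 of user git
session-1234.scope loaded active abandoned Session 1234 of user git
sshd.service loaded active running OpenSSH server daemon'

# The filter part of the pipeline: keep only abandoned, numbered
# session scopes of the git user, and trim everything after ".scope".
selected=$(printf '%s\n' "$sample" \
  | grep "of user git" | grep "abandoned" | grep -e "-[[:digit:]]" \
  | sed "s/\.scope.*/.scope/")
echo "$selected"
```

Only the abandoned session scope survives the filters, which is exactly what gets handed to systemctl stop in the real pipeline.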
canoine commented Feb 1, 2016
Ok, adding "systemctl stop" does the job. Thank you for the hint.
poettering added this to the v230 milestone on Feb 26, 2016.
lnykryn (Member) commented Mar 4, 2016
At least the CentOS version of this bug seems to be caused by lost messages from the cgroup agent. On my system the cgroup agent is started every time, but systemd sometimes doesn't get the message.
zxwing referenced this issue on Apr 6, 2016: Random failures of service module caused by a systemd bug #15295 (closed).
dragon9783 commented Apr 10, 2016
LGTM
tozachroberts commented Apr 12, 2016
Any update on when the official fix for this may happen? We have done our best with manual scripted workarounds, but we are still feeling the pain of this every day. I am happy to share specifics of our environment if that may help. We can replicate it quite quickly.
Two commits in zstackio/zstack-utility referenced this issue on Apr 23, 2016.
poettering (Member) commented May 4, 2016
OK, so I think I figured out one part of the puzzle: dbus-daemon mishandles incoming messages when a message without an auxiliary fd sits first in the socket buffer, immediately followed by one with an auxiliary fd. The kernel will already return the auxiliary fd with the first message, and dbus-daemon takes that as a broken message and aborts the connection.
poettering (Member) commented May 4, 2016
OK, it's slightly different even; I have filed a bug against dbus-daemon now:
poettering (Member) commented May 4, 2016
Here's another related dbus bug, where I just posted a patch:
https://bugs.freedesktop.org/show_bug.cgi?id=95264
(The issue is that we might end up losing cgroups agent messages much earlier than necessary, because of dbus-daemon's low listen backlog of 30)
poettering (Member) commented May 4, 2016
I prepped a set of fixes to logind in #3190 now, but they won't fix this issue fully – that really needs to be fixed in dbus, see https://bugs.freedesktop.org/show_bug.cgi?id=95263
poettering added the not-our-bug label on May 4, 2016.
poettering removed this from the v230 milestone on May 4, 2016.
A commit in poettering/systemd referenced this issue on May 4, 2016.
poettering referenced this issue on May 4, 2016: core: use an AF_UNIX/SOCK_DGRAM socket for cgroup agent notification #3191 (merged).
poettering (Member) commented May 4, 2016
And #3191 is another related PR, but also won't fix the issue, because it needs to be fixed in dbus.
A commit in poettering/systemd referenced this issue on May 5, 2016.
poettering (Member) commented May 5, 2016
I have now posted a dbus patch for this to the fdo bugzilla:
https://bugs.freedesktop.org/attachment.cgi?id=123493
This makes the issue go away for me here, but it would be good if one of you could test whether it works for you too.
I will close this bug now; everything else should be tracked in the fdo bug about this.
poettering closed this on May 5, 2016.
tozachroberts commented May 5, 2016
I have only a modest understanding of the patch, but we'll be trying to get a patched dbus in place and test it out over the next few weeks and as soon as I have some results I'll let you know. I am operating in enterprise-land, so it takes a while to make things happen. Thank you!
lnykryn (Member) commented May 5, 2016
I did a quick backport of #3191 to the RHEL version of systemd and it seems to fix my problem. Thanks!
A commit in lnykryn/systemd-rhel referenced this issue on May 10, 2016.
lnykryn (Member) commented May 10, 2016
I am sorry for abusing the upstream issue tracker (I have not found a related bug in the CentOS Mantis), but for those CentOS users (@mattmcdowell, @canoine, @carlivar): would you be willing to try this test build? https://people.redhat.com/lnykryn/systemd/bz1305608/
Commits in lnykryn/systemd-rhel referenced this issue on Jun 6, 2016 and Jul 27, 2016.
stanhu commented Aug 5, 2016
@poettering Thanks for your work on this. We were bitten hard by this problem on GitLab.com, which is running Ubuntu 16.04. We applied your latest patch (https://bugs.freedesktop.org/show_bug.cgi?id=95263#c13), and this seems to have made the problem go away for now. A customer also reported experiencing this issue on Red Hat Enterprise Linux 7, which runs systemd 219.
I would encourage getting these patches into stable releases ASAP:
Ubuntu bug thread: https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1591411
RedHat bug thread: https://bugzilla.redhat.com/show_bug.cgi?id=1271394
isodude commented Nov 2, 2016
Just got hit by this. What I find odd is that systemd (PID 1) takes forever to go through all of the scope files. Is it not possible to make the loop tighter, or even make it parallel? Right now I'm forced to remove all the files by hand instead.
openat(22, "session-79467.scope.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW|O_CLOEXEC) = 23
fstat(23, {st_mode=S_IFDIR|0755, st_size=140, ...}) = 0
fcntl(23, F_GETFL) = 0x38800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW)
fcntl(23, F_SETFD, FD_CLOEXEC) = 0
getdents(23, /* 7 entries */, 32768) = 304
getdents(23, /* 0 entries */, 32768) = 0
close(23) = 0
Or are there patches that I missed that correct this? I am running systemd-219-19.el7_2.13.
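Before cleaning up, it can help to gauge how many leaked scopes have piled up. A count like the one below works; it is run here against a scratch directory with made-up file names rather than the real /run/systemd/system:

```shell
# Scratch directory standing in for /run/systemd/system.
dir=$(mktemp -d)
touch "$dir"/session-100.scope "$dir"/session-101.scope "$dir"/session-102.scope
mkdir "$dir/session-100.scope.d"

# Count only the scope files themselves, not the .scope.d drop-in
# directories that accompany them on newer systemd versions.
leaked=$(find "$dir" -maxdepth 1 -type f -name 'session-*.scope' | wc -l)
echo "$leaked"
rm -rf "$dir"
```

On an affected host, swap the scratch directory for /run/systemd/system; a count in the thousands matches the slow list-unit-files behavior described above.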
Burebista1404 commented Dec 2, 2016
Hi everyone, if you need to fix this ASAP, here are the commands to clean up all your scope files and the loaded, abandoned sessions:
Cleanup abandoned sessions from systemd:
Delete session files
find /run/systemd/system -name "session-*.scope" -delete
Delete session directories
rm -rf /run/systemd/system/session*scope*
Remove the abandoned sessions
systemctl | grep "abandoned" | grep -e "-[[:digit:]]" | sed "s/\.scope.*/.scope/" | xargs systemctl stop
iDemonix commented Jan 30, 2017
@Burebista1404, your session directories command has fallen victim to Markdown italics; could you wrap the commands in quotes so they render literally?
Burebista1404 commented Jan 30, 2017
@iDemonix I've modified as requested. Thanks.
go-fish commented Jun 19, 2017
systemctl | grep "abandoned" | grep -e "-[[:digit:]]" | sed "s/\.scope.*/.scope/" | xargs systemctl stop
Note that this command may kill the processes under those session scopes, so use it with care.
Commits in Stratoscale/systemd referenced this issue on Jun 19-20, 2017.
jmaassen referenced this issue on Jun 21, 2017: systemd broken, which cripples sshd and complicates testing #443 (closed).
A commit in Werkov/systemd referenced this issue on Jun 22, 2017.
A commit in openSUSE/systemd referenced this issue on Jul 27, 2017.
diffcity commented Oct 31, 2017
What works for me here, on CentOS 7:
systemctl restart dbus-org.freedesktop.login1.service
systemctl restart systemd-logind.service
systemctl daemon-reload
Then wait a minute or two to see the df value change.
A commit in wtsi-hgi/hgi-systems referenced this issue on Nov 3, 2017.
A commit in glennpratt/systemd referenced this issue on Dec 7, 2017.
joshbmarshall commented Feb 14, 2018
@diffcity Your workaround is much cleaner than the others, and doesn't scare me nearly as much as deleting files directly. My servers last about 30 days before filling up /run, and your commands clear it out. I will be putting them in cron to run daily.
julianbrost commented Nov 19, 2015
There seems to be a race condition in systemd or logind which results in a leak of scope unit files in /run/systemd/system. This causes PID 1 to use 100% CPU and delays logins.
I am able to reproduce the issue on Debian jessie with systemd 215-17+deb8u2, on Debian sid with systemd 227-1 and 228-1, and on Arch Linux with systemd 227-1. To reproduce it, prepare the system as follows (install libpam-systemd and add a test user with an ssh key):
First look at the output for this command:
You should see exactly one file "session-$i.scope" (seems to exist only with older versions of systemd like in Debian jessie) and one directory "session-$i.scope.d" for your current login session.
Now let the following command run for a while (you may have to vary the sleep interval depending on how fast the host is, 5 minutes should be enough to get at least a few leaked units):
Then cancel that command and wait for the remaining ssh processes to terminate and look again at the output of
There you will see some remaining scopes that were leaked. Running the command
will now take significantly longer when enough of these leaked files have accumulated. When you strace PID 1, you will notice that it opens many files in /run/systemd/system. On one of our production systems with 35 days uptime, there are roughly 4000 of those leaked scope units and list-unit-files takes about 23 seconds. During that time, further ssh logins are delayed until the command terminates. When I used ; instead of & in the loop doing the ssh logins, I was unable to reproduce the issue, so it looks like it is some kind of race condition.
(I originally reported this as Debian bug #805477)
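The reproduction commands themselves were lost from this copy of the report, but from the description above the loop had roughly the shape below. This is a sketch, not the original script: the host name, user, and sleep interval are placeholders, and ssh is stubbed out with a shell function so the shape of the loop is runnable anywhere.

```shell
# Stub so the sketch runs without a real host; the actual reproduction
# performed real ssh logins as the prepared "test" user.
ssh() { echo "login $*"; }

run_loop() {
    for i in 1 2 3; do
        # Backgrounding each login with & is what makes the race likely;
        # with ; instead of &, the reporter could not reproduce the leak.
        ssh test@testhost true &
        sleep 0.1
    done
    wait
}

count=$(run_loop | wc -l)
echo "$count"
```

In the real run, the loop would be left running for about five minutes, after which some session-*.scope units remain in /run/systemd/system even though all ssh processes have exited.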