Regression in systemd-coredump 229: system hangs for 90 seconds during shutdown #2691

Closed
ebiggers opened this Issue Feb 22, 2016 · 45 comments

Projects

None yet
@ebiggers
Contributor

I am occasionally (> 50% of the time) experiencing a hang during shutdown. The problem started after upgrading to systemd 229. Reverting to systemd 228 makes the problem go away.

Attached the shutdown-log.txt file generated as detailed at this page: https://freedesktop.org/wiki/Software/systemd/Debugging/

shutdown-log.txt

@ohsix
ohsix commented Feb 22, 2016

thanks for getting log first, lots of people are noticing this

[ 114.155486] systemd[1]: session-c1.scope: Stopping timed out. Killing.

this is the notable bit; something in your user session isn't being polite

everything between
[ 24.080120] systemd[1]: session-c1.scope: Installed new job session-c1.scope/stop as 426
and above is notable, i don't have time to read the log carefully at the moment

@diegoviola

I've been experiencing the same problem since systemd 228 or earliest versions, I'm on Arch Linux (x86-64).

Here's my logs: http://ix.io/owz

@martinpitt
Contributor

@diegoviola , please file a different issue. The description of this issue says it's a regression of 229, so if you already saw this on 228 your problem must be different.

@martinpitt
Contributor

There is not a lot of useful information in that log, I'm afraid. Please start the debug shell, issue the shutdown, then switch to the debug shell (Ctrl+Alt+F9) and check with pgrep -a -u yourusername which processes in your session are still running and refuse to die. You can then verify that pkill -u yourusername makes the shutdown continue immediately.

@diegoviola

@martinpitt just because he didn't see the problem in 228 doesn't mean it's not the same issue, I've experienced this problem on Arch Linux as well as on Ubuntu.

Also, when I was testing this yesterday (on Arch Linux), I booted my computer normally, I login from the tty and the only thing I had running was bash (no X or any other programs), then I typed poweroff and I still got the issue of waiting 90 seconds during the shutdown process.

I would be happy to re-open or provide a different bug report after this problem is fixed, and in case my problem still persist after fixing this issue, but I believe we're talking about the same problem here.

@snckrz
snckrz commented Feb 22, 2016

Switching to the debug shell is not possible during the hanging. It doesn't do anything. (It does work during the session, so it is running correctly)

@ebiggers
Contributor

This is the output of the pgrep command:

708 /usr/lib/chromium/chromium --ppapi-flash-path=/usr/lib/PepperFlash/libpepflashplayer.so --ppapi-flash-version=18.0.0.204 --incognito
710 chromium
814 chromium

The pkill command did not work unless I added the -9 option, to issue SIGKILL.

@ebiggers
Contributor

This seems to make the problem go away:

echo core > /proc/sys/kernel/core_pattern
@sjuxax
sjuxax commented Feb 23, 2016

Been seeing the same thing here on ArchLinux for the last couple of months. Assumed it was something I broke.

@jsynacek
Contributor

Also see https://bugzilla.redhat.com/show_bug.cgi?id=1310608#c3. The problem is that there is a crash during shutdown and systemd-coredump takes awful lot of time to generate the dump.

@DoctorJellyface

I have the same symptoms but I'm not sure I'm experiencing the same thing. For me user@1000 timeouts. Here's the log file: shutdown-log.txt.

@ohsix
ohsix commented Feb 23, 2016

@diegoviola there's a thing with bugs and bug trackers, people are cooperating to try and find out what the bug actually is, you don't just jump on and say you have the same bug ... you can't unless you've already found the bug.

please verify what people have seen in following comments to see if this really is the same issue.

@ebiggers
Contributor

Although the problem seemed to go away for me by disabling systemd-coredump, there doesn't actually seem to be a systemd-coredump process running during the hang. These were all the processes without [ in their name during the hang:

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 20:40 ?        00:00:01 /sbin/init
root       277     1  0 20:40 tty9     00:00:00 /bin/sh
root       284     1  0 20:40 ?        00:00:00 /usr/lib/systemd/systemd-journald
root       310     1  0 20:40 ?        00:00:00 /usr/lib/systemd/systemd-udevd
systemd+   557     1  0 20:40 ?        00:00:00 /usr/lib/systemd/systemd-timesyncd
root       561     1  0 20:40 ?        00:00:00 /usr/lib/systemd/systemd-logind
root       699     1  0 20:40 ?        00:00:00 /usr/bin/dhclient -d -q -sf /usr/lib/networkmanager/nm-dhcp-helper -pf /var/run/dhclient-eth0.pid -lf /var/lib/NetworkManager/dhclient-de74f643-ea73-4c8b-8147-cec475f2f390-eth0.lease -cf /var/lib/NetworkManager/dhclient-eth0.conf eth0
e          766     1  9 20:41 ?        00:00:26 /usr/lib/chromium/chromium --ppapi-flash-path=/usr/lib/PepperFlash/libpepflashplayer.so --ppapi-flash-version=18.0.0.204 --incognito
root      1564   277  0 20:45 tty9     00:00:00 ps -ef
@elkrejzi
Contributor

Well, I've been experiencing the issue since at least systemd-227. It's not the coredump that's the problem.

It's systemd user session (started by display manager via pam_systemd.so) that is blocking the shutdown. When the desktop session manager terminates Xorg and logs out the user, there are still services started by systemd user session (user bus and others that don't need X running).

However, the session manager keeps running like the session is still there, even after the user has logged out. It's like the pam_systemd never told systemd logind that the user has logged out and it needs to terminate the user session.

I observed the behaviour by setting KillUserProcesses=yes in logind.conf. Even when you log in from tty and then you log out, the systemd user daemon (and, depending on the session, dbus services) will remain running for some time, which is why shutdown hangs - it's waiting for user service to terminate on its own. It's like the PAM module never sends a "user logged out" notification to the session manager.

@medhefgo
Contributor

Possible duplicate of issue #1615.

@gzmorell

Hi,
I have a similar problem. With systemd 229 shutdown takes 1 minute and 30 seconds to shutdown.
Downgrading to 228 the problem dissappears.
I am using Arch Linux 64 bit updated with LxQt and with KDE desktops and SDDM.
If I start a debug shell, and check which user process is still running (in a LxQt Session) it appears lxqt-panel as the only remaining process.
Killing this user process makes the system shutdown immediately.
I have already tried with a KDE session. Then there are several processes not stopped.
Killing all of them makes the system shutdown immediately too.
This can be reproduced with every session.
Furthermore if I log out the session, and then shutdown from SDDM the system shutdowns immediately.

@benjarobin
Contributor

I made a mistake in my previous test, that why I deleted my previous message.
Anyway I can reproduce very easily now the problem since I think I may understand why the session is not closed.

With KillUserProcesses=no (which is the default), if there is still a user process running which ignores SIGTERM for whatever reason, systemd wait the timeout of 90s (before sending SIGKILL to these process ?)

In KDE, when closing the session, plasma crash most of the time which lead to this session status :

c2 - benjarobin (1000)
       Since: mar. 2016-03-22 14:16:11 CET; 1min 21s ago
      Leader: 790
        Seat: seat0; vc7
     Display: :0
     Service: sddm; type x11; class user
     Desktop: KDE
       State: closing
        Unit: session-c2.scope
              ├─ 852 klauncher [kdeinit5] --fd=9                           
              ├─1040 gpg-agent --homedir /home/benjarobin/.gnupg --use-standard-socket --daemon
              └─1366 /usr/lib/drkonqi -platform xcb -display :0 --appname kdeinit5 --kdeinit --apppath /usr/bin --signal 11 --pid 852 --startupid 0

The blocking process is drkonqi, I have to send SIGKILL to it to able to shutdown the PC without the 90s timeout

@kyawthusoe3142

Same problem in Debian 8 systemd version 215-17.

@heavyhdx

Still getting this issue on a fresh Arch+KDE installation.

@bmpro
bmpro commented Mar 28, 2016

I can confirm. Same issue with 229. Reverted to 228 for now which works.

@firekage

I have the same behaviour on my desktop machine and on netbook and laptop machines. Also, with a clean install. In my case i have "a stop job is running for user", "a stop job is running for c2 session", "a stop job is running for dile manager".

@firekage

I have the same behaviour on my desktop machine and on netbook and laptop machines. Also, with a clean install. In my case i have "a stop job is running for user", "a stop job is running for c2 session", "a stop job is running for dile manager". Forget to mention: "a stop job is running for user manager UID 1000".

@xxxperimental

Alright, but how do I downgrade systemd to 228?
pacman -U systemd-228.tar.gz spits out:

error: missing package metadata in systemd-228.tar.gz
error: 'systemd-228.tar.gz': invalid or corrupted package

Other than that has anyone found any other solutions to this issue?

@diegoviola

How about git-bisect rather than downgrade? Downgrade is never a solution.

@benjarobin
Contributor

The true source of this "bug" is very very old, I think it was always there.
Here the hang is most of the time due to systemd-coredump which was improved but I think do not properly interact with SIGTERM

But the "true" source of this bug can be triggered with this simple program:

#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <stdlib.h>

static void sig_handler(int signo)
{
    printf("rcv_sig: %d\n", signo);
}

static void catch_sig(int signo)
{
    printf("Catching sig: %d\n", signo);

    if (signal(signo, sig_handler) == SIG_ERR)
    {
        perror("Failed to catch sig");
        exit(2);
    }
}

int main(void)
{
    catch_sig(SIGHUP); // 1
    catch_sig(SIGTERM);

    if (fork() == 0)
    {
        for(;;)
            sleep(1);
    }
    return 0;
}

Boot the PC without any graphical login , login to a tty, launch this app
Then from the same tty, just send : systemctl poweroff

Systemd wait 90s before sending SIGKILL (which is kind of expected, but 90s is quite long)

@benjarobin
Contributor

To reproduce the issue with systemd-coredump, use this program :

#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <stdlib.h>

static volatile int crash = 0;

static void sig_handler(int signo)
{
    printf("rcv_sig: %d\n", signo);
    if (signo == 15)
        crash = 1;
}

static void catch_sig(int signo)
{
    printf("Catching sig: %d\n", signo);

    if (signal(signo, sig_handler) == SIG_ERR)
    {
        perror("Failed to catch sig");
        exit(2);
    }
}

int main(void)
{
    catch_sig(SIGHUP); // 1
    catch_sig(SIGTERM);

    if (fork() == 0)
    {
        for(;;) {
            if (crash) {
                volatile int i;
                volatile int *pt = malloc(1);
                for (i = 0; i < 1000000; ++i)
                    *pt++ = i;
            }
            sleep(1);
        }
    }
    return 0;
}

Boot the PC without any graphical login , login to a tty, launch this app
Then from the same tty, just send : systemctl poweroff

With v228 the PC poweroff without any problem, and with v229 the PC wait 90s...

@benjarobin
Contributor

And I got the following result from the early shell (tty9) after running the application of my previous message (which wait the SIGTERM and crash with a segfault when the SIGTERM is received)

************ systemctl list-jobs ************
JOB UNIT                               TYPE  STATE  
287 local-fs.target                    stop  waiting
334 cryptsetup.target                  stop  waiting
306 user.slice                         stop  waiting
313 sysinit.target                     stop  waiting
309 sockets.target                     stop  waiting
348 systemd-random-seed.service        stop  waiting
290 tmp.mount                          stop  waiting
344 systemd-tmpfiles-setup.service     stop  waiting
307 user-0.slice                       stop  waiting
289 run-user-0.mount                   stop  waiting
367 systemd-sysctl.service             stop  waiting
292 shutdown.target                    start waiting
369 final.target                       start waiting
352 local-fs-pre.target                stop  waiting
286 boot-efi.mount                     stop  waiting
328 systemd-logind.service             stop  waiting
339 slices.target                      stop  waiting
347 systemd-remount-fs.service         stop  waiting
284 systemd-reboot.service             start waiting
337 paths.target                       stop  waiting
368 systemd-update-utmp.service        stop  waiting
327 systemd-user-sessions.service      stop  waiting
302 systemd-tmpfiles-setup-dev.service stop  waiting
283 reboot.target                      start waiting
357 systemd-ask-password-console.path  stop  waiting
359 systemd-modules-load.service       stop  waiting
342 systemd-vconsole-setup.service     stop  waiting
285 umount.target                      start waiting
305 swap.target                        stop  waiting
331 dbus.socket                        stop  waiting
318 basic.target                       stop  waiting
343 remote-fs.target                   stop  waiting
362 systemd-ask-password-wall.path     stop  waiting
291 media-sf_systemd.mount             stop  waiting
296 session-c1.scope                   stop  running

35 jobs listed.
***************** ps -AF ********************
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
root         1     0  0 10225  5176   1 23:10 ?        00:00:00 /sbin/init
root       151     1  0  3533  3164   1 23:10 tty9     00:00:00 /bin/sh
root       173     1  0  9670 16064   2 23:10 ?        00:00:00 /usr/lib/systemd/systemd-journald
root       211     1  0  8928  3936   3 23:10 ?        00:00:00 /usr/lib/systemd/systemd-udevd
root       295     1  0  3821  2632   1 23:10 ?        00:00:00 /usr/lib/systemd/systemd-logind
root       406     1  0  1044    76   3 23:13 ?        00:00:00 ./test-crash-systemd
root       477   151  0  3500  2632   3 23:13 tty9     00:00:00 /bin/bash ./rdbg
**************systemctl status **************
* localhost
    State: stopping
    Jobs: 35 queued
Failed: 1 units
    Since: Mon 2016-03-28 23:10:40 CEST; 2min 27s ago
CGroup: /
        |-init.scope
        | `-1 /sbin/init
        |-system.slice
        | |-systemd-journald.service
        | | `-173 /usr/lib/systemd/systemd-journald
        | |-systemd-logind.service
        | | `-295 /usr/lib/systemd/systemd-logind
        | |-systemd-udevd.service
        | | `-211 /usr/lib/systemd/systemd-udevd
        | `-debug-shell.service
        |   |-151 /bin/sh
        |   |-477 /bin/bash ./rdbg
        |   |-478 /bin/bash /dbg
        |   `-481 systemctl status
        `-user.slice
            `-user-0.slice
            `-session-c1.scope
                `-406 ./test-crash-systemd
********** loginctl session-status c1 *******
Failed to get session: Connection reset by peer
************* loginctl list-sessions ********
Failed to list sessions: Connection reset by peer
************** loginctl session-status ******
Could not get properties: Connection reset by peer

There are couple of thing very strange :

  • loginctl which no longer works (normal during shutdown ?)
  • No trace of systemd-coredump : So why systemd is waiting ?
  • Why ./test-crash-systemd is not closed/terminated

Ok, I get the associated log, and I think I get it. These lines are all from the same timestamp and this is the result of the systemctl poweroff

dbus[290]: [system] Activating via systemd: service name='org.freedesktop.PolicyKit1' unit='polkit.service'
systemd[1]: Starting Authorization Manager...
polkitd[411]: Started polkitd version 0.113
polkitd[411]: Loading rules from directory /etc/polkit-1/rules.d
polkitd[411]: Loading rules from directory /usr/share/polkit-1/rules.d
polkitd[411]: Finished loading, compiling and executing 1 rules
dbus[290]: [system] Successfully activated service 'org.freedesktop.PolicyKit1'
systemd[1]: Started Authorization Manager.
polkitd[411]: Acquired the name org.freedesktop.PolicyKit1 on the system bus
polkitd[411]: Registered Authentication Agent for unix-process:407:14409 (system bus name :1.4 [/usr/bin/pkttyagent --notify-fd 4 --fallback], object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale fr_FR.UTF-8)
systemd-logind[295]: System is rebooting.
systemd[1]: Stopping Session c1 of user root.
login[375]: pam_unix(login:session): session closed for user root
kernel: test-crash-syst[406]: segfault at 137a000 ip 00000000004005f8 sp 00007ffd7d2a7c10 error 6 in test-crash-systemd[400000+1000]
systemd[1]: Stopping User Manager for UID 0...
polkitd[411]: Unregistered Authentication Agent for unix-process:407:14409 (system bus name :1.4, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale fr_FR.UTF-8) (disconnected from bus)
systemd[1]: Closed Load/Save RF Kill Switch Status /dev/rfkill Watch.
systemd[1]: Stopped target Sound Card.
systemd[1]: Stopped target Timers.
systemd[1]: Stopped Daily Cleanup of Temporary Directories.
systemd[1]: Stopping Authorization Manager...
systemd[1]: Stopped target Graphical Interface.
systemd[1]: Stopped target Multi-User System.
systemd[1]: Stopped Daily rotation of log files.
systemd[1]: Stopped Daily verification of password and group files.
systemd[1]: Stopping VirtualBox Guest Service...
kernel: 00:02:19.379119 control  Guest control service stopped
kernel: 00:02:19.379325 control  Guest control worker returned with rc=VINF_SUCCESS
kernel: 00:02:19.379526 main     Session 0 is about to close ...
kernel: 00:02:19.379758 main     Stopping all guest processes ...
kernel: 00:02:19.380001 main     Closing all guest files ...
VBoxService[312]: 00:02:19.379119 control  Guest control service stopped
VBoxService[312]: 00:02:19.379325 control  Guest control worker returned with rc=VINF_SUCCESS
VBoxService[312]: 00:02:19.379526 main     Session 0 is about to close ...
VBoxService[312]: 00:02:19.379758 main     Stopping all guest processes ...
VBoxService[312]: 00:02:19.380001 main     Closing all guest files ...
systemd[1]: Stopped target Login Prompts.
systemd[1]: Stopping Getty on tty2...
systemd[1]: Stopping D-Bus System Message Bus...
systemd[1]: Stopping OpenSSH Daemon...
systemd[1]: Stopped Daily man-db cache update.
systemd[379]: Stopped target Default.
sshd[373]: Received signal 15; terminating.
systemd[379]: Stopped target Basic System.
systemd[379]: Stopped target Timers.
systemd[379]: Stopped target Sockets.
systemd-logind[295]: Failed to abandon session scope: Transport endpoint is not connected
systemd[1]: Starting Generate shutdown-ramfs...
systemd[379]: Closed D-Bus User Message Bus Socket.
systemd[379]: Reached target Shutdown.
systemd[379]: Starting Exit the Session...
systemd[379]: Stopped target Paths.
systemd[1]: Stopped D-Bus System Message Bus.
systemd[1]: Stopped OpenSSH Daemon.
systemd[1]: Stopped Getty on tty2.
systemd[1]: Stopped Authorization Manager.
systemd[1]: Requested transaction contradicts existing jobs: Transaction is destructive.
systemd[1]: systemd-coredump.socket: Failed to queue service startup job (Maybe the service file is missing or not a non-template unit?): Transaction is destructive.
systemd[1]: systemd-coredump.socket: Unit entered failed state.
systemd[385]: pam_unix(systemd-user:session): session closed for user root
systemd[1]: Removed slice system-getty.slice.
systemd[379]: Received SIGRTMIN+24 from PID 426 (kill).
systemd[1]: Stopped SSH Key Generation.
systemd[1]: Stopped target Network.
systemd[1]: Stopping dhcpcd on enp0s3...
systemd[1]: Stopped User Manager for UID 0.
dhcpcd[432]: sending signal TERM to pid 372
dhcpcd[432]: waiting for pid 372 to exit
dhcpcd[372]: received SIGTERM, stopping
dhcpcd[372]: enp0s3: removing interface
dhcpcd[372]: dhcpcd exited
mkinitcpio[424]: ==> Starting build: none
mkinitcpio[424]:   -> Running build hook: [sd-shutdown]
dhcpcd[432]: sending signal TERM to pid 372
dhcpcd[432]: waiting for pid 372 to exit
systemd[1]: Stopped dhcpcd on enp0s3.
systemd[1]: Removed slice system-dhcpcd.slice.
mkinitcpio[424]: ==> Build complete.
systemd[1]: Started Generate shutdown-ramfs.
VBoxService[312]: 00:02:19.635609 main     Ended.
kernel: 00:02:19.635609 main     Ended.
systemd[1]: Stopped VirtualBox Guest Service.

Here my thought :

  • systemd-coredump is called by the kernel
  • systemd-coredump save basic information and pass the fileno to /run/systemd/coredump
  • The systemd-coredump.socket should trig the systemd-coredump.service which should save the coredump by reading /run/systemd/coredump socket
  • systemd-coredump.socket failed, so the fileno of the crashed process is still open, and the process cannot be terminated
  • systemd waits...
@xxxperimental

Would it be safe to somehow disable or avoid systemd-coredump, so this issue doesn't come up?

@benjarobin
Contributor

Well I still have issues even if coredump is disabled...

  • /proc/sys/kernel/core_pattern contains |/bin/false
  • * hard core 0 is inside /etc/security/limits.conf
  • => ulimit -c return 0

I will try to investigate a little bit more this end of day.

@xxxperimental

Perhaps there is a way to make it time-out and terminate whatever processes are still running?

@nabildanial

I have this problem only when I am rebooting/shutdown from GNOME Flashback (Compiz). The hanging doesn't occur in GNOME 3. Arch Linux.

@ebiggers
Contributor

I've bisected the bug to the following commit:

commit 3c171f0b1ec3ce1b98777cca7330727b9ebfd17d
Author: Lennart Poettering <lennart@poettering.net>
Date:   Mon Feb 8 21:16:08 2016 +0100

    coredump: rework coredumping logic

This change, which was apparently merged only 24 hours before the final release of systemd 229 (!!!), made systemd-coredump work in two parts: one part directly invoked by the kernel, and one part activated by a UNIX domain socket.

For reasons that are not yet entirely clear to me, systemd refuses to start the socket-activated process during the shutdown sequence, and the crashing process hangs waiting for the coredump to complete. See, for example, the following log messages:

Mar 29 23:53:55 zzz systemd[1]: Requested transaction contradicts existing jobs: Transaction is destructive.
Mar 29 23:53:55 zzz systemd[1]: systemd-coredump.socket: Failed to queue service startup job (Maybe the service file is missing or not a
Mar 29 23:53:55 zzz systemd[1]: systemd-coredump.socket: Unit entered failed state.

The bug can be reproduced by starting any program that crashes when it receives SIGTERM, such as the example program that benjarobin posted, then trying to reboot.

The bug can be worked around by creating a file /etc/sysctl.d/50-coredump.conf with the following contents:

kernel.core_pattern=core

That causes the kernel to write coredumps directly, bypassing the buggy systemd code.

It also should be noted that some people may be encountering 90-second hangs for other reasons, such as processes ignoring SIGTERM. The bug I have been encountering, though, has been caused specifically by the bug in systemd-coredump described above, introduced in systemd 229.

@poettering, any thoughts on fixing this?

@ebiggers ebiggers changed the title from Regression in systemd 229: system hangs for 90 seconds during shutdown to Regression in systemd-coredump 229: system hangs for 90 seconds during shutdown Mar 30, 2016
@benjarobin
Contributor

@ebiggers Well ... I have already explained (#2691 (comment)) why there is a problem with systemd-core dump.
We already knew a month ago that the problem was related to systemd-core dump, we just do not know why (now there a beginning of an explanation). But thank you for the time spent, anyway you came to roughly the same conclusion as mine :-)

@Potomac
Potomac commented Mar 30, 2016

the workaround found by ebiggers ( /etc/sysctl.d/50-coredump.conf file ) works,

I no longer have the message "A stop job is running for Session c2 for user...."

I use archlinux, systemd 229-3

@benjarobin
Contributor

I highly recommend not to do that, you may have random core file created on your filesystem.
Instead just disable coredump, there is 2 ways :

  • ln -s /dev/null /etc/sysctl.d/50-coredump.conf
  • echo '* hard core 0' >> /etc/security/limits.conf
@Potomac
Potomac commented Mar 30, 2016

@benjarobin : we can set a custom path for storing cores files, with the parameter kernel.core_pattern :

http://unix.stackexchange.com/questions/192716/how-to-set-the-core-dump-file-location-and-name

for example in /etc/sysctl.d/50-coredump.conf :
kernel.core_pattern= /var/crash/core-%e-%s-%u-%g-%p-%t

%e is the filename
%g is the gid the process was running under
%p is the pid of the process
%s is the signal that caused the dump
%t is the time the dump occurred
%u is the uid the process was running under

@Potomac
Potomac commented Apr 4, 2016

despite the workaround found by Ebiggers I still have the bug ( "a stop job for session C2 user" )

it's a random bug, for example it can occur after 10 reboots ( that's why I thought it was solved, we need to test with a lot of reboot/shutdown, the bug will be considered as solved only when no bug occurs after at least 10~15 reboots )

@ebiggers ebiggers added a commit to ebiggers/systemd that referenced this issue Apr 8, 2016
@ebiggers ebiggers coredump: allow systemd-coredump to run during shutdown
Ever since systemd-coredump was made a socket-activatable service in
systemd 229, it has been unable to run during system shutdown.

This is a major problem.  If a process crashes when it receives SIGTERM,
then the kernel attempts to send the coredump to systemd-coredump through
a file descriptor.  systemd-coredump then passes the file descriptor onto
the socket.  However, systemd is then unable to start
systemd-coredump@.service, resulting in the file descriptor being kept
open.  This means that the crashing process does not terminate, and
systemd must wait its default timeout of 90 seconds before sending
SIGKILL.

This is not a hypothetical case, as I initially encountered it with the
chromium browser.  systemd cannot assume that applications do not crash.

To fix this, relax the requirements for systemd-coredump@.service to be
started.

Fixes #2691.
0e05732
@ebiggers ebiggers added a commit to ebiggers/systemd that referenced this issue Apr 8, 2016
@ebiggers ebiggers coredump: allow systemd-coredump to run during shutdown
Ever since systemd-coredump was made a socket-activated service in
systemd 229, it has been unable to run during system shutdown.

This is a problem if any processes crash during the shutdown sequence.
If a process crashes when it receives SIGTERM, then the kernel tries to
send the coredump to systemd-coredump.  systemd-coredump passes the
coredump file descriptor onto the socket.  However, systemd is then
unable to start systemd-coredump@.service due to that unit conflicting
shutdown.target.  This causes the file descriptor to be kept open, making
the crashing process stay alive and unresponsive to SIGTERM.  systemd
then must wait its default timeout of 90 seconds before sending SIGKILL,
significantly delaying the shutdown.

This is not a hypothetical case; I initially encountered it with the
chromium browser.  systemd cannot assume that applications do not crash.

To fix this, relax the requirements for systemd-coredump@.service to be
started.

Fixes #2691.
75b29cf
@ebiggers
Contributor
ebiggers commented Apr 8, 2016

Well, seeing as the systemd maintainers have been unresponsive to this bug, despite it being open six weeks, having the most comments of any open bug, and being a regression in the latest release, I have attempted to fix it myself.

See pull request #2993.

The fix relaxes the dependencies of systemd-coredump@.service, allowing it to be started during the shutdown sequence.

This fixes the problem for me. However, I do not know that it is the best solution; an alternative might be to disable coredumps during shutdown.

Note that as I mentioned before, not everyone who may be encountering a 90 second wait during shutdown is necessarily encountering this exact bug with systemd-coredump. The 90 second wait is apparently something that happens whenever processes are unresponsive to SIGTERM.

@omergoktas

ln -s /dev/null /etc/sysctl.d/50-coredump.conf

it worked, thanks

@poettering poettering added this to the v230 milestone Apr 18, 2016
@diegoviola

Weird, I switched back to Firefox (from Chromium) and the issue went away.

@ebiggers ebiggers added a commit to ebiggers/systemd that referenced this issue Apr 23, 2016
@ebiggers ebiggers coredump: allow systemd-coredump to run during shutdown
Ever since systemd-coredump was made a socket-activated service in
systemd 229, it has been unable to run during system shutdown.

This is a problem if any processes crash during the shutdown sequence.
If a process crashes when it receives SIGTERM, then the kernel tries to
send the coredump to systemd-coredump.  systemd-coredump passes the
coredump file descriptor onto the socket.  However, systemd is then
unable to start systemd-coredump@.service due to that unit conflicting
shutdown.target.  This causes the file descriptor to be kept open, making
the crashing process stay alive and unresponsive to SIGTERM.  systemd
then must wait its default timeout of 90 seconds before sending SIGKILL,
significantly delaying the shutdown.

This is not a hypothetical case; I initially encountered it with the
chromium browser.  systemd cannot assume that applications do not crash.

Fix the problem by making the following changes:

- Define a new slice coredump.slice with DefaultDependencies=No and place
  systemd-coredump@.service in it.  This is necessary because by default
  templated services are placed in a per-template slice which conflicts
  shutdown.target, thereby causing them to be stopped at shutdown.  Note
  that because the '-' character is special in slice names, I could not
  call the new slice systemd-coredump.slice.  Also, an alternative would
  have been to run the coredump services in system.slice; however, this
  was recommended against because coredump handling can be expensive.

- Remove Conflicts=shutdown.target and Before=shutdown.target from
  systemd-remount-fs.service.  This prevents this early boot service,
  which is ordered before systemd-coredump@.service, from being stopped
  at shutdown.  Stopping this service was a no-op anyway, since it is
  oneshot.

- Finally, remove Conflicts=shutdown.target and Before=shutdown.target
  from systemd-coredump@.service.  In combination with the other changes,
  this allows systemd-coredump@.service to be started during shutdown.

Fixes #2691.
8f289ec
@evverx evverx added a commit to evverx/systemd that referenced this issue Apr 24, 2016
@evverx evverx tests: add test for systemd#2691 0ae219b
@poettering poettering added a commit to poettering/systemd that referenced this issue Apr 29, 2016
@poettering poettering core: make sure to close connection fd when we fail to activate a per…
…-connection service

Fixes: #2993 #2691
a28ee75
@poettering poettering added the has-pr label Apr 29, 2016
@poettering
Member

My proposed fix is now waiting in #3148. Please have a look!

@poettering poettering added a commit to poettering/systemd that referenced this issue Apr 29, 2016
@poettering poettering core: make sure to close connection fd when we fail to activate a per…
…-connection service

Fixes: #2993 #2691
3e7a1f5
@poettering
Member

#3148 was applied now. Still would like to have some feedback if it fixes the coredump issues though. Will close this issue for now, but please report back should you still encounter problems even with this patch applied! thanks!

@poettering poettering closed this Apr 29, 2016
@ebiggers
Contributor

Thanks Lennart! I just tested the latest systemd master, and the problem no longer occurs. The systemd-coredump service fails as before, but it does not impede shutdown.

@poettering
Member

@ebiggers thanks for testing! And sorry this took so long!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment