[227 regression] Boot occasionally fails with "Connection timed out" #1505

Closed
martinpitt opened this Issue Oct 9, 2015 · 107 comments

Contributor

martinpitt commented Oct 9, 2015

https://bugs.debian.org/801354 reported that 227 introduced a rather major boot regression: You often get failures like

Oct 09 09:00:19 autopkgtest systemd-logind[616]: Failed to enable subscription: Connection timed out
Oct 09 09:00:19 autopkgtest systemd-logind[616]: Failed to fully start up daemon: Connection timed out
Oct 09 09:00:19 autopkgtest systemd[1]: Failed to subscribe to activation signal: Connection timed out
Oct 09 09:00:44 autopkgtest systemd[1]: Failed to register name: Connection timed out
Oct 09 09:00:44 autopkgtest systemd[1]: Failed to set up API bus: Connection timed out
Oct 09 09:01:56 autopkgtest su[764]: pam_systemd(su:session): Failed to create session: Connection timed out
Oct 09 09:01:58 autopkgtest lightdm[1578]: pam_systemd(lightdm-greeter:session): Failed to create session: Connection timed out

(same for lightdm, etc.) I now found a way to reproduce this in a VM, so I'll go bisect hunting.

As for the pam_systemd thing: I wonder if that's related to the regression where the user systemd instance now tries to unmount things during boot:

systemd[1653]: Reached target Default.
systemd[1653]: boot.mount: Mount process exited, code=exited status=1
systemd[1653]: Failed unmounting /boot.
systemd[1653]: run-user-1000.mount: Mount process exited, code=exited status=1
systemd[1653]: Failed unmounting /run/user/1000.
systemd[1653]: dev-dm\x2d2.swap: Unit entered failed state.
systemd[1653]: sys-kernel-debug.mount: Mount process exited, code=exited status=1
systemd[1653]: Failed unmounting /sys/kernel/debug.
[...]
umount[1657]: umount: /boot: umount failed: Operation not permitted
umount[1658]: umount: /run/user/1000: umount failed: Operation not permitted
swapoff[1659]: swapoff: Not superuser.

This might be something entirely different (and then mostly cosmetic), or it might be the cause of the failure to create a user session.

@martinpitt martinpitt added this to the v228 milestone Oct 9, 2015

@martinpitt martinpitt self-assigned this Oct 9, 2015

Contributor

martinpitt commented Oct 9, 2015

For the record:

[user systemd trying to unmount stuff] might be something entirely different (and then mostly cosmetic)

It's actually not -- as https://bugs.debian.org/801361 points out, this will actually unmount stuff if you log in as root.

I have the bisect running now, which will show if the unmounts and the timeouts have one and the same root cause. If not, I'll open a separate issue for the unmounts.

Contributor

martinpitt commented Oct 9, 2015

Found the culprit: It's a5bd3c3 from PR #1215. Reverting that fixes the connection timeouts and everything is hunky-dory again.

Curiously, pretty much the exact same symptom happened with 219 in February -- do you guys still remember http://lists.freedesktop.org/archives/systemd-devel/2015-February/028640.html ? Back then it was fixed in 6414444, which (more or less accidentally) already fixed the CMSG_SPACE allocation. The justification back then was different though (http://lists.freedesktop.org/archives/systemd-devel/2015-April/031364.html), so this isn't just the old bug back.

@maciejaszek , @dvdhrm , any idea what the real fix looks like? Thanks!

@martinpitt martinpitt removed their assignment Oct 9, 2015

Contributor

martinpitt commented Oct 9, 2015

The "user systemd tries to unmount" stuff is unrelated to this, I now reported issue #1507 about that.

Contributor

martinpitt commented Oct 9, 2015

FTR, I actually did see this occasionally when running CI tests for the Ubuntu trunk builds, but as the Debian ones (which I do more often) worked fine I didn't pay enough attention to them. Sorry for the process fail, it's always frustrating when we release with a major bug that we could have spotted before. I'll be more suspicious when the "boot smoke" test fails in the future!

Owner

zonque commented Oct 9, 2015

Thanks, @martinpitt, for bisecting this! Could you give this patch a try, please, and see if it fixes the regression? zonque@9f58f91

Owner

zonque commented Oct 9, 2015

Hmm, wait. No, that doesn't explain it.

Contributor

martinpitt commented Oct 9, 2015

@zonque: No surprise, but FTR: no difference with that patch.

Owner

poettering commented Oct 9, 2015

@martinpitt so, how and why precisely does the sendmsg() fail, and in which process? Do you have an strace of the process maybe, so that we can have a look?

manover pushed a commit to manover/systemd that referenced this issue Oct 11, 2015

Revert "sd_pid_notify_with_fds: fix computing msg_controllen"
It causes connection errors from various services on boot.

systemd/systemd#1505
Closes: #801354

Contributor

martinpitt commented Oct 12, 2015

@poettering: I'll see whether I can come up with a systemd unit which attaches strace to pid 1 early in boot. Things like logind etc. do a Subscribe D-Bus call to pid 1 (which fails with "Connection timed out"), so stracing logind etc. is uninteresting. I did try to add some log_warnings to sd_pid_notify_with_fds(), but the initial ones (before a successful sendmsg()) mysteriously never appeared in the journal or on stderr; I only ever saw successful calls. strace wouldn't show us the interesting numbers for the control header allocation, just the error code. But I'll keep digging.

@maciejaszek : Your original PR #1215 looks wrong. "CMSG_SPACE(0) may return a value other than 0" is intended, as it contains padding for alignment:

#define CMSG_SPACE(len) (CMSG_ALIGN (len) \
                         + CMSG_ALIGN (sizeof (struct cmsghdr)))

I think this padding must at least be part of the allocation below:

msghdr.msg_control = alloca(msghdr.msg_controllen);

Thus the original code before that PR looked right. However, you said that you got some EINVAL in some cases (without giving further detail/logs).

cmsg(3) is a bit confusing and contradictory. It says that msg.msg_controllen needs to be set twice: first "with the length of the control message buffer" without explaining that further (but the sum of all CMSG_SPACEs is certainly plausible), and after writing all the elements it needs to be "set to the sum of the CMSG_SPACE() of the length of all control messages in the buffer." This is what the original code (before PR #1215) already did, and it has worked for quite a number of releases.
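
To make this concrete, the description's convention for a single fd would look roughly like this (a minimal sketch of my own, not systemd code; send_one_fd is a made-up name):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Pass exactly one fd as SCM_RIGHTS ancillary data, following the cmsg(3)
 * description: msg_controllen is set twice, first to the size of the buffer
 * before CMSG_FIRSTHDR(), then to the sum of the CMSG_SPACE()s used. */
static ssize_t send_one_fd(int sock, int fd) {
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
                struct cmsghdr hdr;
                char buf[CMSG_SPACE(sizeof(int))];      /* correctly aligned */
        } control = {};
        struct msghdr msghdr = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = control.buf,
                .msg_controllen = sizeof(control.buf),  /* first: buffer size */
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msghdr);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        /* second: the sum of the CMSG_SPACE()s actually written */
        msghdr.msg_controllen = CMSG_SPACE(sizeof(int));

        return sendmsg(sock, &msghdr, MSG_NOSIGNAL);
}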

But the example on the manpage actually sets it to the sum of the CMSG_LENs, not the sum of the CMSG_SPACEs. So it might be that the description in the manpage is wrong and the example is right, and the manpage should instead say "Finally, the msg_controllen field of the msghdr should be set to the sum of the CMSG_LEN() of the length of all control messages in the buffer." I tested this theory with http://paste.ubuntu.com/12761809/ but that didn't help, I still get the connection timeouts.

So if we instead assume that the description is right and the example is wrong, we keep msg_controllen as the sum of CMSG_SPACEs, but in the special case of have_pid or n_fds == 0 we keep the extra padding for the allocation. That would be http://paste.ubuntu.com/12761826/ but it still fails.

So going back, the only code that actually works is the one with PR #1215 reverted, and I don't know why @maciejaszek got EINVALs there.

Owner

zonque commented Oct 12, 2015

@martinpitt if either n_fds == 0 or !have_pid, we only attach one cmsg to the message. Hence, msg.msg_controllen must not contain the size of the 2nd, unused message header, and MSG_NXTHDR is never called, which is correct. So #1215 does the right thing, but the code is somewhat obfuscated. I still don't know what the actual problem is though.

Contributor

martinpitt commented Oct 12, 2015

Wrt. stracing, this actually works quite well:

[Unit]
Description=strace pid 1
DefaultDependencies=no

[Service]
ExecStart=/usr/bin/strace -e socket,sendmsg,close -vvs1024 -o /run/pid1.trace -p 1

[Install]
WantedBy=sysinit.target

First I ran this stracing with current git head, i. e. with this bug.

From a failed boot, the journal is http://paste.ubuntu.com/12761903/ and the strace output is http://paste.ubuntu.com/12761900/ . Note that msg_controllen is zero in every call, and sendmsg() only ever fails with EAGAIN (10 times) or EPIPE (21 times).

From a successful boot, the journal is http://paste.ubuntu.com/12761945/ and the strace is http://paste.ubuntu.com/12761943/ . Note that there are no EAGAIN errors, just 19 EPIPE ones (and no other error codes). Just like above, all msg_controllen values are 0.

AFAICS the socket isn't marked as nonblocking, so the usual meaning of EAGAIN doesn't apply here. The second variant is documented as "(Internet domain datagram sockets) The socket referred to by sockfd had not previously been bound to an address and, upon attempting to bind it to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use." This also sounds implausible.

So I'm afraid I can't really make sense of the EAGAIN here...

Contributor

martinpitt commented Oct 12, 2015

I re-ran with the reverted patch, so that the "Connection timed out" errors disappear. The strace is http://paste.ubuntu.com/12762156/. As expected there are no more EAGAIN errors, but there is still not a single sendmsg() call with msg_controllen != 0. I ran this in a loop, and I never got a single EAGAIN. So I suppose the cases where control messages are actually being sent are from the client side (logind, journald, etc.). I'll get some straces there to compare.

With current git head (i. e. without reverting), the EAGAINs coincide exactly with the failed boots with connection errors (in pid 1).

Contributor

maciejaszek commented Oct 12, 2015

@martinpitt: the bug occurred when I tried to pass fds but have_pid was set to 0. After my change, everything booted without any problems. I'm looking at the code, but it would be good to test changes -- do you have any image on which I can reproduce this error?

Contributor

martinpitt commented Oct 12, 2015

I do have an image, but it's 2.1 GB. I started uploading it now, but it'll take some two hours.

This is more or less a standard Ubuntu cloud image (http://cloud-images.ubuntu.com/wily/current/wily-server-cloudimg-amd64-disk1.img) with some extra stuff installed (network-manager, policykit-1, lightdm), upgraded to systemd 227 with the persistent journal enabled, and then rebooted in a loop. I also got this behaviour on full desktop images back then. The whole machinery to do that (autopkgtest etc.) is not that simple to reproduce on other distros. But it seems to me that enabling the persistent journal is somehow a key ingredient.

In the previous case (http://lists.freedesktop.org/archives/systemd-devel/2015-February/028640.html) we eventually got bug reports on pretty much every distro (Arch, Fedora, Debian, Ubuntu, etc.), but it seems no developer except me could reliably reproduce this. It's such an iffy heisenbug.

Contributor

martinpitt commented Oct 12, 2015

When this bug happens, stracing logind shows 27 sendmsg() calls, all of which succeed. There is exactly one call with ancillary data:

sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1\4\0\0\0\17\0\0\0(\0\0\0\5\1u\0\21\0\0\0\6\1s\0\5\0\0\0:1.11\0\0\0\10\1g\0\1h\0\0\t\1u\0\1\0\0\0", 56}, {"\0\0\0\0", 4}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {17}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 60

Whereas with the reverted patch we get multiple calls but they all have exactly the same control length "20", and again there are no errors:

sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1\4\0\0\0\r\0\0\0(\0\0\0\5\1u\0\35\0\0\0\6\1s\0\4\0\0\0:1.5\0\0\0\0\10\1g\0\1h\0\0\t\1u\0\1\0\0\0", 56}, {"\0\0\0\0", 4}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {17}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 60
sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1`\0\0\0\37\0\0\0000\0\0\0\5\1u\0\2\0\0\0\6\1s\0\5\0\0\0:1.12\0\0\0\10\1g\0\10soshusub\0\0\0\t\1u\0\1\0\0\0", 64}, {"\2\0\0\0c1\0\0\"\0\0\0/org/freedesktop/login1/session/c1\0\0\r\0\0\0/run/user/109\0\0\0\0\0\0\0m\0\0\0\5\0\0\0seat0\0\0\0\7\0\0\0\0\0\0\0", 96}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {18}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1`\0\0\0/\0\0\0000\0\0\0\5\1u\0\2\0\0\0\6\1s\0\5\0\0\0:1.15\0\0\0\10\1g\0\10soshusub\0\0\0\t\1u\0\1\0\0\0", 64}, {"\2\0\0\0c2\0\0\"\0\0\0/org/freedesktop/login1/session/c2\0\0\r\0\0\0/run/user/109\0\0\0\0\0\0\0m\0\0\0\5\0\0\0seat0\0\0\0\7\0\0\0\0\0\0\0", 96}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {18}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1`\0\0\0?\0\0\0000\0\0\0\5\1u\0\2\0\0\0\6\1s\0\5\0\0\0:1.18\0\0\0\10\1g\0\10soshusub\0\0\0\t\1u\0\1\0\0\0", 64}, {"\2\0\0\0c3\0\0\"\0\0\0/org/freedesktop/login1/session/c3\0\0\r\0\0\0/run/user/109\0\0\0\0\0\0\0m\0\0\0\5\0\0\0seat0\0\0\0\7\0\0\0\0\0\0\0", 96}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {18}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1`\0\0\0O\0\0\0000\0\0\0\5\1u\0\2\0\0\0\6\1s\0\5\0\0\0:1.21\0\0\0\10\1g\0\10soshusub\0\0\0\t\1u\0\1\0\0\0", 64}, {"\2\0\0\0c4\0\0\"\0\0\0/org/freedesktop/login1/session/c4\0\0\r\0\0\0/run/user/109\0\0\0\0\0\0\0m\0\0\0\5\0\0\0seat0\0\0\0\7\0\0\0\0\0\0\0", 96}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {18}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1`\0\0\0_\0\0\0000\0\0\0\5\1u\0\2\0\0\0\6\1s\0\5\0\0\0:1.24\0\0\0\10\1g\0\10soshusub\0\0\0\t\1u\0\1\0\0\0", 64}, {"\2\0\0\0c5\0\0\"\0\0\0/org/freedesktop/login1/session/c5\0\0\r\0\0\0/run/user/109\0\0\0\0\0\0\0m\0\0\0\5\0\0\0seat0\0\0\0\7\0\0\0\0\0\0\0", 96}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {18}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
sendmsg(11, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1X\0\0\0s\0\0\0000\0\0\0\5\1u\0\2\0\0\0\6\1s\0\5\0\0\0:1.25\0\0\0\10\1g\0\10soshusub\0\0\0\t\1u\0\1\0\0\0", 64}, {"\2\0\0\0c6\0\0\"\0\0\0/org/freedesktop/login1/session/c6\0\0\v\0\0\0/run/user/0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 88}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {18}}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 152

Sorry, this makes no sense at all to me; at this point I'm just monkey-patching around...

Owner

dvdhrm commented Oct 12, 2015

Just for clarity, a5bd3c3 looks correct. The previous code was definitely not correct. Maybe there is still something wrong, but I checked all the kernel CMSG macros and internal handling, and it looks fine. Please correct me, if I'm wrong.

Furthermore, the log messages don't mention a failure in sendmsg(2). Instead, what I see is lightdm calling pam, calling logind, calling pid1, calling AddMatch on dbus-daemon. The latter fails, and the error code is passed through the chain to lightdm. Maybe this is not a direct chain but rather an activation chain, but that's probably irrelevant.
This makes me wonder why AddMatch on dbus-daemon fails with ETIMEDOUT. This might be worth investigating.

Anyway, we cannot ignore that reverting a5bd3c3 fixes your issue. This somehow smells like stack corruption to me... I'll see whether valgrind throws something interesting.

Furthermore, can you give some details on how you reproduced this? Is this 32 bit? 64 bit? 32 bit on 64 bit? x86? ARM? etc. That is, I am trying to figure out why none of us sees it on our production systems.

Contributor

martinpitt commented Oct 12, 2015

@dvdhrm: We got at least three different reporters hours after we uploaded 227 to Debian sid. People there used i386 (32 bit) and x86_64, lightdm or gdm3, etc., and as this also kills journald, logind, and rfkill, I don't believe this depends on a particular login manager or even architecture.

However, I could never reproduce it on my Debian test images. I just tried taking a standard x86_64 Ubuntu desktop VM install, upgrading to systemd 227, enabling the persistent journal, and rebooting often; but I couldn't trigger it like that. Half a year ago it felt pretty much exactly the same, and I got it once or twice with manual tests on a desktop VM, but it was too frustrating to reproduce that way as this seems to be highly dependent on timing, sun rays, local air pressure, and what not.

The upload of my autopkgtest VM where this is reasonably easy to reproduce finally finished: http://people.canonical.com/~pitti/tmp/adt-wily+systemd227.img (2.1 GB)

qemu-system-x86_64 -enable-kvm -m 2048 -drive file=adt-wily+systemd227.img,if=virtio -net nic,model=virtio -net user,hostfwd=tcp::2222-:22 -nographic -serial stdio -monitor none -snapshot

As this was based on a minimal VM, lightdm doesn't actually start up, so there's no usable graphic output. The VM can be driven by the serial console, or (once it starts up), over ssh. The above QEMU command starts it with a console on stdio, and you can use ssh -p 2222 ubuntu@localhost to log into it with ssh.

This VM mostly just needs to be rebooted a couple of times (like 5), and then it reliably produces that hang for me. User is "ubuntu", password "ubuntu", sudo works without password.

Contributor

martinpitt commented Oct 12, 2015

Some missing info:

  • It is enough to replace these three binaries for testing a fix/change: /lib/systemd/systemd{,-logind,-journald} .
  • systemd was configured like that: ./configure CFLAGS='-g -O0 -ftrapv' --sysconfdir=/etc --localstatedir=/var --libdir=/usr/lib --with-rootprefix= --with-rootlibdir=/lib --enable-split-usr PYTHON=python3

Contributor

maciejaszek commented Oct 12, 2015

I've written some test service, which calls sd_pid_notify_with_fds and straced pid 1:

recvmsg(16, {msg_name(0)=NULL, msg_iov(1)=[{"FDSTORE=1", 4096}], msg_controllen=56, [{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, {pid=330, uid=0, gid=0}}, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, [13]}], msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 9
open("/proc/330/cgroup", O_RDONLY|O_CLOEXEC) = 15

// ...

epoll_ctl(4, EPOLL_CTL_ADD, 13, {0, {u32=3773119200, u64=94329844873952}}) = 0
recvmsg(16, 0x7ffc0f5f1b60, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)

It looks like the manager receives it properly, adds the fd to the watched fds, and then tries to receive something one more time, which fails.

Contributor

martinpitt commented Oct 12, 2015

For the record: a5bd3c3 changed both the allocation and the value of msg_controllen. As @dvdhrm suspected a memory error, I tried with

-                msghdr.msg_control = alloca(msghdr.msg_controllen);
+                msghdr.msg_control = alloca(msghdr.msg_controllen + 100);

which keeps the current control length, but allocates some extra space in case it overflows due to padding. With this the bug is still present, so I don't believe it's just a short allocation.

As for my earlier "AFAICS the socket isn't marked as nonblocking, so the usual meaning of EAGAIN doesn't apply here" → this is obviously wrong. The socket itself is not opened with SOCK_NONBLOCK, and sendmsg() is not called with MSG_DONTWAIT in the code -- but in the strace we clearly see MSG_DONTWAIT|MSG_NOSIGNAL. This explains the EAGAIN.

Are these sendmsg() calls really intended to be non-blocking? If so, where does that come from? I don't see it in the code. And then we actually need to handle the EAGAIN case, as otherwise it falls through to the "try with our own ucred instead" case and eventually fails completely. If the non-blocking send is not intended, that's a likely thing to fix then?
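
If the non-blocking send does turn out to be intended, handling EAGAIN could look roughly like this (a purely hypothetical sketch, not systemd code; sendmsg_retry is my name for it):

#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Retry a non-blocking sendmsg() when the socket buffer is full, instead of
 * letting EAGAIN fall through to the ucred fallback. */
static ssize_t sendmsg_retry(int fd, const struct msghdr *msg, int flags) {
        for (;;) {
                ssize_t n = sendmsg(fd, msg, flags);
                if (n >= 0 || errno != EAGAIN)
                        return n;

                /* wait until the socket becomes writable again */
                struct pollfd p = { .fd = fd, .events = POLLOUT };
                if (poll(&p, 1, -1) < 0 && errno != EINTR)
                        return -1;
        }
}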

Owner

zonque commented Oct 12, 2015

Hmm, there's a nasty detail in CMSG_NXTHDR(), but that should have resulted in an assertion. Anyway, could you try zonque@41b112f ?

Owner

dvdhrm commented Oct 12, 2015

Seriously? This loop fixes it?

Contributor

martinpitt commented Oct 12, 2015

@zonque: That's it, you are a star! No more hangs with the alloca0. Almost no EAGAINs now either (only a few from sending log messages to the journal). Turns out the EAGAIN/DONTWAIT bit was a red herring, as these were from the journal's sendmsg() calls. I guess I got a lot of those because journald was hanging when the bug happened.

Please just fix the "sd-daemin" typo in the commit log :-)

Thanks!

Contributor

martinpitt commented Oct 12, 2015

David Herrmann [2015-10-12 6:12 -0700]:

Seriously? This loop fixes it?

I deleted the comment. I got a successful run with it, but it seems
this was just pure luck. Yay race conditions, sorry for the false
alarm.

Owner

zonque commented Oct 12, 2015

I still wonder why that didn't result in a triggered assertion on the sender side.

@dvdhrm dvdhrm closed this in #1540 Oct 12, 2015

Contributor

martinpitt commented Oct 12, 2015

Argh... I just re-ran the test (with the alloca0), and the second time it failed again, as it did during a third run. While the alloca0 seems to help (I now got failures after (roughly) the 7th reboot iteration instead of already the 2nd), it still looks exactly as before once the bug hits.

@martinpitt martinpitt reopened this Oct 12, 2015

Owner

zonque commented Oct 12, 2015

@martinpitt, what does strace show on your test machine with this simple program?

#include <unistd.h>
#include <systemd/sd-daemon.h>

int main(int argc, char **argv) {
    int x[2];

    pipe(x);
    sd_pid_notify_with_fds(0, 0, "bla", x, 2);

    return 0;
}

Owner

zonque commented Oct 12, 2015

(you have to run that from a unit file, so that the NOTIFY_SOCKET env var is set)

Contributor

martinpitt commented Oct 13, 2015

@zonque: I use this complete .c program:

/* gcc -o testsdn testsdn.c -lsystemd */
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <systemd/sd-daemon.h>

int main(int argc, char **argv) {
    int x[2];

    printf("test sd_pid_notify_with_fds: $NOTIFY_SOCKET=%s\n", getenv("NOTIFY_SOCKET"));
    pipe(x);
    sd_pid_notify_with_fds(0, 0, "bla", x, 2);

    return 0;
}

and this test unit:

[Unit]
Description=Test sd_notify()
DefaultDependencies=no
After=systemd-journald.service
Conflicts=shutdown.target
Before=shutdown.target

[Service]
Type=notify
ExecStart=/usr/bin/strace -fvvs1024 -o /run/sdn.trace /home/ubuntu/testsdn

[Install]
WantedBy=sysinit.target

With current master I get

socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 5
sendmsg(5, {msg_name(21)={sa_family=AF_LOCAL, sun_path="/run/systemd/notify"}, msg_iov(1)=[{"bla", 3}], msg_controllen=24, {cmsg_len=24, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {3, 4}}, msg_flags=0}, MSG_NOSIGNAL) = 3
close(5)

With master+reverted patch I get this instead:

sendmsg(5, {msg_name(21)={sa_family=AF_LOCAL, sun_path="/run/systemd/notify"}, msg_iov(1)=[{"bla", 3}], msg_controllen=40, {cmsg_len=24, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {3, 4}}, msg_flags=0}, MSG_NOSIGNAL) = -1 EINVAL (Invalid argument)

which I guess is what @maciejaszek 's patch fixed. I. e. the extra 16 bytes in msg_controllen are CMSG_SPACE(0), i. e. the aligned sizeof(struct cmsghdr) of the unused second header.

So far, no surprises here.

Owner

poettering commented Oct 13, 2015

A couple of questions: did anyone manage to reproduce this issue on anything that's not Ubuntu?

As this causes AddMatch in dbus-daemon to time out, what precisely is dbus-daemon doing in this case? What does a backtrace of dbus-daemon show when this happens?

Does downgrading dbus-daemon to something older have any effect on the issue?

A wild guess of mine might be that there's some confusion with the fds in PID 1, and it starts treating some dbus fd as a notify fd or vice versa... @martinpitt if you strace logind and dbus-daemon simultaneously, do you see the AddMatch message transferred properly? Do you see it being sent by logind, and received by dbus-daemon? In the strace output, can you verify that the right fds are used when reading/writing?

Contributor

mbiebl commented Oct 13, 2015

@poettering we at least had several Debian unstable users who were affected by this as well

Contributor

dongsupark commented Oct 13, 2015

I can also reproduce this error on a VM with Fedora 23, systemd 227.
With vanilla systemd v227 (without the fix zonque/systemd@40f4423), it hangs on starting systemd-logind once out of 5 trials.
OTOH, with the commit zonque/systemd@40f4423, it also hangs on starting systemd-logind, but only once out of 12~15 trials. That looks definitely better than before, but the issue still persists, I suppose.

[   28.364190] systemd[1]: Failed to register name: Connection timed out
[   28.375244] systemd[1]: Failed to set up API bus: Connection timed out

Oct 13 15:46:42 myfedora2 systemd[1]: systemd-logind.service: Main process exited, code=exited, status=1/FAILURE
Oct 13 15:46:42 myfedora2 systemd[1]: Failed to start Login Service.
Oct 13 15:46:42 myfedora2 systemd[1]: systemd-logind.service: Unit entered failed state.
Oct 13 15:46:42 myfedora2 systemd[1]: systemd-logind.service: Failed with result 'exit-code'.
Oct 13 15:46:42 myfedora2 systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
Oct 13 15:46:42 myfedora2 systemd[1]: Stopped Login Service.

And I also tested with the Ubuntu VM by @martinpitt, which failed once in 5 trials. So I don't think it's a distro-specific issue.

Owner

zonque commented Oct 13, 2015

I don't get it. @dongsupark, @martinpitt, could you boot your system with rw init= pointing to a shell script with the following content:

#!/bin/sh
exec /usr/bin/valgrind --leak-check=full --log-file=/valgrind.out /usr/lib/systemd/systemd

... and then paste the valgrind log somewhere?

coling commented Oct 13, 2015

Just for reference (I don't want to muddy the waters), but I've also had issues with logind on v227. Several times I've found myself unable to unlock my session, and switching VTs fails to start a getty process. The emergency shell shows logind as failed, and list-jobs shows we're waiting for the logind restart job. All very confusing. I've never been able to get much debug info out of it, but when it happens again I'll see what I can dig out. I suspect it's related. It didn't seem to happen with 226. Not much to add, I appreciate -- like I say, I will see what I can dig up next time it happens.

Member

lnykryn commented Oct 14, 2015

A couple of questions: did anyone manage to reproduce this issue on anything that's not Ubuntu?

I think I've hit the same issue on F22 with systemd 227, but it happened just twice. There was a weird delay when dbus was starting, and then logind failed.

Contributor

martinpitt commented Oct 14, 2015

Anyway, we cannot ignore that reverting a5bd3c3 fixes your issue.

It seems sufficiently clear now (after the stracing above) that this commit was indeed correct. Reverting merely has the effect that sd_pid_notify_with_fds() goes back to being silently broken as before -- i. e. it never actually triggered the code paths which now run and cause these timeout issues. So we need to see what those actually are. @zonque: I don't believe that this is a valgrind-ish class of problem, so I haven't done this for now.

if you strace logind and dbus-daemon simultaneously, do you see the AddMatch message transferred properly? Do you see it being sent by logind, and received by dbus-daemon?

In fact I don't. http://paste.ubuntu.com/12779425/ is the entire strace of logind, and the only appearance of D-Bus is the (successful) connect()/AUTH, but then it never gets a response and terminates:

connect(11, {sa_family=AF_LOCAL, sun_path="/var/run/dbus/system_bus_socket"}, 33) = 0
sendmsg(11, {msg_name(0)=NULL, msg_iov(3)=[{"\0AUTH EXTERNAL ", 15}, {"30", 2}, {"\r\nNEGOTIATE_UNIX_FD\r\nBEGIN\r\n", 28}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 45
recvmsg(11, 0x7ffdf6d8c220, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=11, events=POLLIN}], 1, {24, 985173000}, NULL, 8) = 0 (Timeout)
sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"PRIORITY=3\nSYSLOG_FACILITY=4\nCODE_FILE=src/login/logind.c\nCODE_LINE=628\nCODE_FUNCTION=manager_connect_bus\nERRNO=110\nSYSLOG_IDENTIFIER=systemd-logind\n", 149}, {"MESSAGE=", 8}, {"Failed to add match for JobRemoved: Connection timed out", 56}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 214
sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"PRIORITY=3\nSYSLOG_FACILITY=4\nCODE_FILE=src/login/logind.c\nCODE_LINE=1149\nCODE_FUNCTION=main\nERRNO=110\nSYSLOG_IDENTIFIER=systemd-logind\n", 135}, {"MESSAGE=", 8}, {"Failed to fully start up daemon: Connection timed out", 53}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 197
sendmsg(12, {msg_name(21)={sa_family=AF_LOCAL, sun_path="/run/systemd/notify"}, msg_iov(1)=[{"STOPPING=1\nSTATUS=Shutting down...", 34}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 34

So indeed this points towards a D-Bus problem. The strace of dbus-daemon (http://paste.ubuntu.com/12779475/) ends with

newfstatat(AT_FDCWD, "/run/systemd/system/", {st_dev=makedev(0, 18), st_ino=198, st_mode=S_IFDIR|0755, st_nlink=2, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=40, st_atime=2015/10/14-11:25:56, st_mtime=2015/10/14-11:25:56, st_ctime=2015/10/14-11:25:56}, AT_SYMLINK_NOFOLLOW) = 0
close(4)                                = 0
write(2, "Failed to start message bus: No socket received.\n", 49) = 49

which sounds like a problem with handing over dbus.socket's proxy Unix socket to dbus-daemon once it actually starts?

FTR, http://paste.ubuntu.com/12779487/ is the strace -e socket,sendmsg,recvmsg,close of pid 1. I don't see that much there, just that both dbus.socket and dbus.service started, and then

recvmsg(44, 0x7ffe5bf4b300, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(33, {msg_name(0)=NULL, msg_iov(4)=[{"PRIORITY=3\nSYSLOG_FACILITY=3\nCODE_FILE=src/core/dbus.c\nCODE_LINE=782\nCODE_FUNCTION=bus_setup_api\nERRNO=110\nSYSLOG_IDENTIFIER=systemd\n", 133}, {"MESSAGE=", 8}, {"Failed to subscribe to NameOwnerChanged signal: Connection timed out", 68}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 210

But I'll get a full strace of pid 1, and will try to modify strace1.service to run even earlier (with some additional Before= entries).

Owner

poettering commented Oct 14, 2015

Ah! The "No socket received" is highly interesting! This is where we should continue debugging!

Could you put together a script that writes all env vars and the long listings of /proc/self/fd/ and /proc/self/fdinfo into a file and then execs dbus-daemon? Then, please change dbus.service to invoke this script rather than the actual binary. With that we should be able to see what precisely dbus-daemon gets passed.

Contributor

martinpitt commented Oct 14, 2015

http://people.canonical.com/~pitti/tmp/straces.tar.xz has the full straces including pid1. The latter is now 100 MB, so pastebin is unwieldy.

dbus.trace already contains the env vars (["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "LISTEN_PID=643", "LISTEN_FDS=1", "LISTEN_FDNAMES=dbus.socket"]).

I put this into the test setup:

    cat <<EOF > /dbuswrap
#!/bin/sh
env > /run/dbus.txt
ls -lR /proc/\$\$/fd/ >> /run/dbus.txt
grep -r . /proc/\$\$/fdinfo >> /run/dbus.txt
exec /usr/bin/strace -vvs1024 -o /run/dbus.trace /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
EOF
    chmod 755 /dbuswrap
    sed -i '/ExecStart=/ s!^.*$!ExecStart=/dbuswrap!' /lib/systemd/system/dbus.service

With that I get

LISTEN_PID=627
LISTEN_FDS=1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LISTEN_FDNAMES=dbus.socket
PWD=/
/proc/627/fd/:
total 0
lr-x------ 1 root root 64 Oct 14 12:12 0 -> /dev/null
l-wx------ 1 root root 64 Oct 14 12:12 1 -> /run/dbus.txt
lr-x------ 1 root root 64 Oct 14 12:12 10 -> /dbuswrap
lrwx------ 1 root root 64 Oct 14 12:12 11 -> socket:[12232]
lrwx------ 1 root root 64 Oct 14 12:12 2 -> socket:[12232]
lrwx------ 1 root root 64 Oct 14 12:12 3 -> socket:[11457]
/proc/627/fdinfo/0:pos: 0
/proc/627/fdinfo/0:flags:   0100000
/proc/627/fdinfo/0:mnt_id:  20
/proc/627/fdinfo/1:pos: 0
/proc/627/fdinfo/1:flags:   0102001
/proc/627/fdinfo/1:mnt_id:  22
/proc/627/fdinfo/2:pos: 0
/proc/627/fdinfo/2:flags:   02
/proc/627/fdinfo/2:mnt_id:  8
/proc/627/fdinfo/3:pos: 0
/proc/627/fdinfo/3:flags:   02
/proc/627/fdinfo/3:mnt_id:  8
/proc/627/fdinfo/10:pos:    250
/proc/627/fdinfo/10:flags:  02100000
/proc/627/fdinfo/10:mnt_id: 23
/proc/627/fdinfo/11:pos:    0
/proc/627/fdinfo/11:flags:  02000002
/proc/627/fdinfo/11:mnt_id: 8

and the corresponding strace http://paste.ubuntu.com/12779676/ . I can't make much sense of that though, can you? Grepping for the socket:[..] stuff after the startup yields nothing, so I'll adjust the script to grep /proc for these sockets immediately, to see what the other end is.

Owner

poettering commented Oct 14, 2015

@martinpitt the strace cannot work that way; you need to specify at least "-D", as otherwise LISTEN_PID will not match the PID of dbus-daemon, and then dbus-daemon rightfully ignores any passed fds.
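
For illustration, the check that makes -D matter looks roughly like this (a simplified sketch of the sd_listen_fds(3) logic, not the actual implementation). Without -D, strace rather than dbus-daemon owns the LISTEN_PID, so the check fails and the passed socket is ignored:

#include <stdlib.h>
#include <unistd.h>

/* Simplified: a socket-activated daemon only accepts passed fds if
 * LISTEN_PID matches its own PID; the fds themselves start at fd 3. */
static int listen_fds_simplified(void) {
        const char *e = getenv("LISTEN_PID");

        if (!e || (pid_t) atol(e) != getpid())
                return 0;               /* fds are not meant for us */

        e = getenv("LISTEN_FDS");
        return e ? atoi(e) : 0;
}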

Owner

poettering commented Oct 14, 2015

@martinpitt hmm, if you had the strace line like this earlier already, maybe the "No socket received" thing is simply a result of that?

Contributor

johannbg commented Oct 14, 2015

This downstream bug might be related to this issue as well
https://bugzilla.redhat.com/show_bug.cgi?id=1271394

Contributor

martinpitt commented Oct 14, 2015

Thanks for pointing that out! That also explains why the VMs didn't finish booting at all any more with the dbus stracing. New test setup:

    cat <<EOF > /dbuswrap
#!/bin/sh
env > /run/dbus.txt
ls -l /proc/\$\$/fd/ /proc/1/fd/ >> /run/dbus.txt
grep -r . /proc/\$\$/fdinfo /proc/1/fdinfo >> /run/dbus.txt
exec /usr/bin/strace -Dvvs1024 -o /run/dbus.trace /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
EOF
    chmod 755 /dbuswrap
    sed -i 's!ExecStart=!&/usr/bin/strace -Dvvs1024 -o /run/logind.trace !' /lib/systemd/system/systemd-logind.service
    sed -i '/ExecStart=/ s!^.*$!ExecStart=/dbuswrap!' /lib/systemd/system/dbus.service

dbus is now actually running and logind does the AddMatch things.

logind strace: http://paste.ubuntu.com/12780100/
dbus strace: http://paste.ubuntu.com/12780105/
dbus.txt from above with fd and env info: http://paste.ubuntu.com/12780129/

Owner

poettering commented Oct 14, 2015

@martinpitt OK, so if the AddMatch works now and many of the issues at hand were triggered through the wrong strace invocation, what precisely remains now?

Does the AddMatch work correctly now, in all cases? So what's not working, and what precisely is this bug about now?

Contributor

martinpitt commented Oct 14, 2015

@poettering: The wrong straces happened since #1505 (comment) . strace now shows that logind and dbus are running, but systemd still doesn't consider them running:

systemd[1]: systemd-logind.service: Start operation timed out. Terminating.
systemd[1]: systemd-logind.service: Failed with result 'timeout'

This is something that the newly working sd_pid_notify_with_fds() now triggers. Before, that call was broken (failing with EINVAL) and thus apparently hid this bug.

Can you reproduce the hangs with the test VM? (See #1505 (comment)) This doesn't contain any strace or other debug annotation, I add all that after booting.

Owner

poettering commented Oct 14, 2015

@martinpitt logind is Type=dbus, and hence systemd waits for the org.freedesktop.login1 name to appear on the bus before it is considered started. When this hangs, has the name properly appeared?

Note that logind also sends READY=1 with sd_notify, but PID 1 does not use that for state changes, if Type= is not notify...

when this hangs, can you use "systemd-analyze dump" to get the full state dump of pid 1, and then check the "NotifyState=" field for the logind.service unit? What does it say?

Contributor

martinpitt commented Oct 14, 2015

I just realized that the previous logind strace was useless, as logind keeps restarting itself and thus the trace is always just the latest one, without the interesting error message. Another /^Restart/d in the seddery takes care of that; now it starts/fails only once.

The logind strace (http://paste.ubuntu.com/12780571/) shows some successful D-Bus calls for the AddMatch:

sendmsg(12, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\226\0\0\0\2\0\0\0\177\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\3\1s\0\10\0\0\0AddMatch\0\0\0\0\0\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\10\1g\0\1s\0\0", 144}, {"\221\0\0\0type='signal',sender='org.freedesktop.systemd1',interface='org.freedesktop.systemd1.Manager',member='JobRemoved',path='/org/freedesktop/systemd1'\0", 150}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 294
recvmsg(12, {msg_name(0)=NULL, msg_iov(1)=[{"l\4\1\1\t\0\0\0\2\0\0\0\215\0\0\0\1\1o\0\25\0\0\0", 24}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
recvmsg(12, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/DBus\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\f\0\0\0NameAcquired\0\0\0\0\6\1s\0\4\0\0\0:1.0\0\0\0\0\10\1g\0\1s\0\0\7\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\4\0\0\0:1.0\0", 145}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 145
recvmsg(12, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\0\0\0\0\3\0\0\0005\0\0\0\6\1s\0\4\0\0\0", 24}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
recvmsg(12, {msg_name(0)=NULL, msg_iov(1)=[{":1.0\0\0\0\0\5\1u\0\2\0\0\0\7\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0", 48}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 48

but finally one with a timeout:

sendmsg(12, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\1\0\0\0\0\6\0\0\0\221\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/systemd1\0\0\0\0\0\0\0\3\1s\0\t\0\0\0Subscribe\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.systemd1\0\0\0\0\0\0\0\0", 168}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 168
recvmsg(12, 0x7ffcaabb9450, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=12, events=POLLIN}], 1, {24, 999978000}, NULL, 8) = 0 (Timeout)

after that it shuts down:

sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"PRIORITY=3\nSYSLOG_FACILITY=4\nCODE_FILE=src/login/logind.c\nCODE_LINE=671\nCODE_FUNCTION=manager_connect_bus\nSYSLOG_IDENTIFIER=systemd-logind\n", 139}, {"MESSAGE=", 8}, {"Failed to enable subscription: Connection timed out", 51}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 199
sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"PRIORITY=3\nSYSLOG_FACILITY=4\nCODE_FILE=src/login/logind.c\nCODE_LINE=1149\nCODE_FUNCTION=main\nERRNO=110\nSYSLOG_IDENTIFIER=systemd-logind\n", 135}, {"MESSAGE=", 8}, {"Failed to fully start up daemon: Connection timed out", 53}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 197
socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 13
sendmsg(13, {msg_name(21)={sa_family=AF_LOCAL, sun_path="/run/systemd/notify"}, msg_iov(1)=[{"STOPPING=1\nSTATUS=Shutting down...", 34}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 34
close(13)                               = 0
exit_group(1)

Contributor

martinpitt commented Oct 14, 2015

while boot is hanging, systemd-analyze is failing as well with "Failed issue method call: Connection timed out". If the dump output is still useful after it failed, here it is:

-> Unit systemd-logind.service:
       NotifyAccess: main
       NotifyState: stopping
       BusName: org.freedesktop.login1
       Bus Name Good: no

So trying anything dbus-y during that time is difficult, and strace seems to be a better bet...

Owner

poettering commented Oct 14, 2015

hmm, you can make systemd log the same output to the journal via SIGUSR2 to PID 1. I am really interested in getting this output at the moment where the hang happens.

Also, could you get a backtrace of dbus-daemon and PID 1 when this hangs?

Owner

zonque commented Oct 14, 2015

FWIW, I was finally able to trigger this with F23 in a VM, but it took ~20 reboot cycles. So this is definitely not an Ubuntu specific issue.

Contributor

martinpitt commented Oct 15, 2015

FTR, I'm not using kdbus in these tests, and I'm sure most/all of the Debian bug reporters don't either.

@zonque: not sure whether you already tried, but v225+cherry-pick a5bd3c3 also causes this bug. Which corroborates that fixing sd_notify* merely exposed this bug, but it was already present before.

So what I could do now is start a bisect from, say, 219 up to 225, cherry-picking a5bd3c3 on each iteration. Or did you already do that? I. e. somehow you got to bbc2908 -- which bisection did you do?

Owner

zonque commented Oct 15, 2015

@martinpitt, still confusing, but sure, let's try. I didn't try this yet but bisected from v225 upwards. Cherry-picking a5bd3c3 won't work before 6414444, however, so v220 seems like a good starting point, and maybe it's a good idea to just test the tagged releases first.

Contributor

martinpitt commented Oct 15, 2015

Indeed v220+a5bd3c3 fails (after 14 iterations). 6414444 was indeed the commit which fixed http://lists.freedesktop.org/archives/systemd-devel/2015-February/028640.html, i. e. when the symptoms were exactly the same. 6414444 doesn't apply to 219 and earlier any more, and I don't think going back in time is going to be useful -- it's pretty clear by now that the sd_notify handling is somehow either the cause or more likely the trigger of this, and the underlying bug has been lurking for a long time already.

Contributor

dongsupark commented Oct 15, 2015

@dongsupark do stock v225 or v226 (which both have bbc2908 as well) work for you?

@zonque hard to tell. What I can tell for sure is that both v226 and v227 had the hanging issue. That's what I stumbled across last week. I already tried to go back to v225, but I'm not sure I tested v225 enough to tell whether it's working at all. For that I'd need to do the reboot test again with v225.

Contributor

martinpitt commented Oct 20, 2015

I did some further experiments wrt. @zonque's analysis in #1505 (comment) .

First, I tried this diff, which abort()s on have_pid and logs the cases where n_fds > 0 || have_pid, i. e. where commit a5bd3c3 is relevant: http://paste.ubuntu.com/12875520/ . There are no SIGABRTs, and on a failed boot I get some fifty

systemd-journald[230]: XXX sd_pid_notify_with_fds(pid=0, n_fds=1): control_len 24

but no other messages; IOW, all invocations of sd_pid_notify_with_fds() either don't go through that code path (no pid and no fds) or have exactly one fd.

So commit a5bd3c3 should have zero effective change in the first line (for n_fds); just to confirm that we are not completely crazy, I ran the tests with

-                        (n_fds > 0 ? CMSG_SPACE(sizeof(int) * n_fds) : 0) +
+                        CMSG_SPACE(sizeof(int) * n_fds) +

and as expected it still fails.

Then the effective change of that commit is that we stop allocating a second header for 0 bytes, i. e.

-                        CMSG_SPACE(0);
+                        0

right? And indeed with just

-                        (have_pid ? CMSG_SPACE(sizeof(struct ucred)) : 0);
+                        CMSG_SPACE(0);

things work again. This is indeed highly o_O -- the first CMSG_SPACE should already allocate space for the struct cmsghdr, and CMSG_FIRSTHDR does not refer "beyond" the first cmsghdr struct, so having extra space shouldn't matter at all. Some further logging shows that sizeof(struct cmsghdr) == 16, sizeof(int) * n_fds == 4, and msghdr.msg_controllen == 24, which all makes sense given /usr/include/x86_64-linux-gnu/bits/socket.h and the extra 4 bytes for alignment.

Owner

poettering commented Oct 20, 2015

Smells like a stack corruption issue then, i.e. where the location of things on the stack starts to matter. A pity that valgrind wasn't able to track this one down for us though...

Maybe a build using --enable-address-sanitizer will catch something

Contributor

martinpitt commented Oct 20, 2015

I tried malloc0() instead of alloca0() and added a free(msghdr.msg_control) after finish:, but exactly the same result. Also, as pointed out in #1505 (comment) merely increasing the allocation size doesn't help at all, one really needs to bump/change msg_controllen to have any noticeable effect. We never got any kind of crash or even assertion for this stuff.

Already established, but not very explicitly, and thus repeating for the summary:

  • With current git head (both when the hang happens and when not), all sendmsg() calls from sd_pid_notify_with_fds() (one from pid 1, one from logind, lots from journald) succeed; there are no errors. I added both success and failure logging to that function, and additionally an abort() on sendmsg() failure.
  • With a5bd3c3 reverted, journald indeed does get several EINVAL failures of sendmsg(), with pid == 0, n_fds==1 and msg_controllen == 40 (i. e. the "good" 24 plus the extra 16 for the unnecessary second header).

I really continue to be convinced that this is not a memory corruption issue at all; we don't have a single piece of evidence for that, but lots of evidence that it isn't one. This bug has been here for quite some time, and commit a5bd3c3 merely uncovered it. I. e. the fact that previously journald failed to notify pid 1 about the fd and now it succeeds exposes some lockup issue.

Owner

zonque commented Oct 20, 2015

I really continue to be convinced that this is not a memory corruption issue at all; we don't have a single piece of evidence for that, but lots of evidence that it isn't one. This bug has been here for quite some time, and commit a5bd3c3 merely uncovered it. I. e. the fact that previously journald failed to notify pid 1 about the fd and now it succeeds exposes some lockup issue.

Right, with parameter failures on the sender side, the receiver continues to work. I also did some more tests and I'm starting to believe this is a kernel issue with file descriptor passing over sockets that eventually (modulo some conditions that I don't know yet) confuses the state of the receiver. I'm currently writing a standalone test program to reproduce this, but I didn't have any luck so far. I'll keep you posted.

Owner

zonque commented Oct 20, 2015

Right, with parameter failures on the sender side, the receiver continues to work.

To elaborate a bit more: even when the received fds are closed immediately after recvmsg() in PID1, without any further handling, storing or anything, the bug is still reproducible. On the sender's side (journald), the error condition from sendmsg() is currently not checked and doesn't have any side effects in the code, other than the fd not being stored in PID1.
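
For clarity, the receive-and-discard experiment amounts to something like this (a sketch, not the actual manager code):

#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Take any SCM_RIGHTS fds out of a received message and close them
 * immediately, without storing or otherwise handling them. */
static void close_received_fds(struct msghdr *msghdr) {
        struct cmsghdr *cmsg;

        for (cmsg = CMSG_FIRSTHDR(msghdr); cmsg; cmsg = CMSG_NXTHDR(msghdr, cmsg))
                if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
                        int *fds = (int *) CMSG_DATA(cmsg);
                        size_t i, n = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);

                        for (i = 0; i < n; i++)
                                close(fds[i]);
                }
}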

Contributor

benjarobin commented Oct 22, 2015

Right now the only tool that uses the pid argument is src/notify/notify.c, and I think nobody uses it. And as @martinpitt said, adding an abort() on the have_pid condition does not trigger any error.

And since nobody is using the pid argument (it's always 0), the code before 6414444 was valid (as long as the pid argument is not used)! msg_controllen doesn't need to be an aligned size.

Commit 6414444 breaks the code, it doesn't fix it; it just hides the problem. The code was fixed again in commit a5bd3c3.

So if there is any bug inside systemd, the bug predates v219.

Contributor

martinpitt commented Oct 22, 2015

Commit 6414444 was right in the sense that before it we got connection timeouts which looked pretty much exactly like the ones now; back then it was triggered by 13790ad, which first made use of this functionality.

FWIW, that's not how I read man cmsg(3): it uses msg.msg_controllen = sizeof buf with char buf[CMSG_SPACE(sizeof myfds)]; it would be odd not to include the size of the first struct cmsghdr?

Contributor

benjarobin commented Oct 22, 2015

CMSG_LEN(len) => (CMSG_ALIGN(sizeof(struct cmsghdr)) + (len))
CMSG_SPACE(len) => (CMSG_ALIGN(sizeof(struct cmsghdr)) + CMSG_ALIGN(len))

sizeof(struct cmsghdr) = 16 so this is already aligned =>
CMSG_ALIGN(sizeof(struct cmsghdr)) = 16

So in our case, with only one fd and no pid => only one struct cmsghdr:

  • msg.msg_controllen = 20 before 6414444
  • msg.msg_controllen = 20 + 4 + 16 = 40 after 6414444 (which is wrong)!
  • msg.msg_controllen = 20 + 4 = 24 after a5bd3c3 (the + 4 is for alignment)

The alignment is optional and only needed if we want more than one cmsghdr
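
These numbers are easy to verify with a trivial program (values as on x86_64 glibc):

#include <stdio.h>
#include <sys/socket.h>

int main(void) {
        printf("sizeof(struct cmsghdr)  = %zu\n", sizeof(struct cmsghdr));            /* 16 */
        printf("CMSG_LEN(sizeof(int))   = %zu\n", (size_t) CMSG_LEN(sizeof(int)));    /* 20 */
        printf("CMSG_SPACE(sizeof(int)) = %zu\n", (size_t) CMSG_SPACE(sizeof(int)));  /* 24 */
        printf("plus extra CMSG_SPACE(0): %zu\n",
               (size_t) (CMSG_SPACE(sizeof(int)) + CMSG_SPACE(0)));                   /* 40 */
        return 0;
}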

Contributor

benjarobin commented Oct 22, 2015

Commit 6414444 disabled the functionality added by 13790ad: since sendmsg() always failed when trying to pass an fd, the fd of journald was no longer passed.

Edit: Is what I say true? Did fd passing work between 6414444 and a5bd3c3?

Owner

zonque commented Oct 22, 2015

@benjarobin the code is correct as it stands right now. Quoting the man pages:

To create ancillary data, first initialize the msg_controllen member of the msghdr with the length of the control message buffer. [...] In each control message, initialize cmsg_len (with CMSG_LEN()), the other cmsghdr header fields, and the data portion using CMSG_DATA(). Finally, the msg_controllen field of the msghdr should be set to the sum of the CMSG_SPACE() of the length of all control messages in the buffer.

Also, the code does the right thing and successfully transmits the fds (and the right number of fds) to PID1, so that's not the issue. The problem is that there is a side effect of that action which eventually causes PID1 to trip up. The code we're looking at here merely allows that side effect to happen.

Contributor

benjarobin commented Oct 22, 2015

@zonque Sorry if I misspoke; I wanted to say that the current code is correct and fully valid, and that the code before 6414444 was valid as long as the pid argument was not used. The kernel doesn't check and doesn't care that msg_controllen is a multiple of 8.

=> The purpose of my messages was to say that the bug predates 6414444.

With commit 6414444, when we try to transmit the fds, we also send another uninitialized cmsghdr with no data. I have no idea what the kernel does when it "parses" the second cmsghdr, which contains random data.
Funnily enough, with an invalid / corrupted second cmsghdr there is no bug.

Contributor

benjarobin commented Oct 22, 2015

@zonque @martinpitt I edited / improved / fixed my previous messages; sorry again, I misspoke.

Contributor

martinpitt commented Oct 22, 2015

Edit: Is what I say true? Did fd passing work between 6414444 and a5bd3c3?

It's actually the other way round: it was broken before a5bd3c3, and started working with a5bd3c3. This then uncovered the underlying hang with passing fds.

Contributor

benjarobin commented Oct 22, 2015

@martinpitt My English is so bad... That was a question, sorry. So we agree on that: fd passing was not working between 6414444 and a5bd3c3.
So my conclusion stands: the bug predates 6414444! Do you agree with that?

Owner

dvdhrm commented Oct 23, 2015

@benjarobin You cannot conclude that. The trigger and the cause might not be the same commit.

Owner

poettering commented Oct 28, 2015

hmm, so I am trying to reproduce this now, and have turned off kdbus again. I am trying to reproduce this in a container, but I have no luck... Did anyone manage to reproduce this in nspawn?

Owner

poettering commented Oct 28, 2015

hmm, so I tried this with bare metal now too, rebooted 15 times in a row, and dbus and logind were fine each time. I am trying this with current git on F23. Can anyone tell me how to reproduce this on a system like that?

Owner

poettering commented Oct 28, 2015

@cinquecento btw, please do not dump arbitrary logs in this bug report (or any other) unless requested. It's hard enough to follow this bug report as long as it is, given that there were other issues unrelated to the actual bug discussed here, and dumping huge log files just makes it even harder.

Owner

zonque commented Oct 28, 2015

hmm, so I tried this with bare metal now too, rebooted 15 times in a row, and dbus and logind were fine each time. I am trying this with current git on F23. Can anyone tell me how to reproduce this on a system like that?

As I said, it takes around 40-200 reboots to trigger in my case. I'm so far only testing with a VM image.

@poettering poettering closed this in #1707 Oct 28, 2015

@zonque zonque reopened this Oct 28, 2015

Owner

zonque commented Oct 28, 2015

#1707 is not a fix for #1505, unfortunately. Re-opening hence.

Contributor

benjarobin commented Oct 28, 2015

I ran some tests and got a current call trace of all processes using sysrq-trigger; here is the result:

Contributor

benjarobin commented Oct 28, 2015

Here is the backtrace obtained with gdb of systemd, dbus-daemon, systemd-logind, systemd-journald, and systemd-udevd, running with systemd 227 + a cherry-pick of the commit from #1707: http://benjarobin.free.fr/Divers/benjarobin-systemd-1505-3.tar.xz

Contributor

martinpitt commented Oct 29, 2015

As per our conversation I updated http://people.canonical.com/~pitti/tmp/adt-wily+systemd227.img to have gdb, debug symbols for libc, dbus, and libcap, and I removed the apt proxy config.

There are two snapshots now: "upstream" is with systemd, systemd-logind, and systemd-journald from current master, and "revert-a5bd3c3" has that commit reverted (in spirit). The VM defaults to "upstream". Note that this reversion actually stopped working reliably as a workaround; some commit in the last few days broke it (perhaps #1707). It was meant to provide a reliable first boot before the test stuff gets deployed, so it's not that useful any more. With that, this VM fails quickly: usually at the first or second boot, and in my runs tonight it never survived more than 5 reboots.

Note that there is no snapshot yet of the running state when this happens. I need to look into whether and how this can be done with qemu (I didn't see this in https://en.wikibooks.org/wiki/QEMU/Monitor or the manpages). But the hang reproduces very often, so this is maybe not that important.

I also did one more experiment: http://paste.ubuntu.com/12995227/ (resulting in the journal http://paste.ubuntu.com/12995347/). I think we stared at the sendmsg() side long enough now, so I looked at the receiving end. I ensured that all ctrl messages received fall into the two existing cases "fd array" or "ucred", i. e. we don't receive any unhandled/unknown messages.

BTW: manager_dispatch_notify_fd does not actually seem to handle a pid? I know we don't use that feature in systemd itself, but as the sender API supports it now, shouldn't control messages with a pid be handled somehow? But this is just a tangent...

An interesting (I think) observation is that even with completely ignoring received fd arrays (as in the above patch) we still get the bug. So the problem is not with their handling and triggering further notifications further down in the code (I thought this might perhaps lead to deadlocks due to cyclic notifications); it almost seems like the mere act of receiving an fd array and doing nothing with it already causes this. @zonque had a theory above that this is some kernel bug; maybe this corroborates it?

Contributor

benjarobin commented Oct 29, 2015

The backtrace obtained with gdb a little bit later of systemd, dbus-daemon, systemd-logind, systemd-journald, and systemd-udevd, running with systemd 227 + a cherry-pick of the commit from #1707: http://benjarobin.free.fr/Divers/benjarobin-systemd-1505-4.tar.xz
I don't have time to analyse it now; I hope it will help you.

journald is not able to write the log content; it just hangs. journald gets unstuck when I send SIGKILL to all processes using the Magic SysRq key.

Contributor

benjarobin commented Oct 29, 2015

My analysis of the deadlock with systemd 227 + commit b215b0e:
We have a nice deadlock loop:
[journald] --- sd_pid_notify_with_fds ---> [systemd] --- sd_bus_call_method --> [dbus-daemon] --- vsyslog --> [journald]

[logind] is stuck since it tries to communicate with systemd (notify) or dbus, and those processes hang, stuck in the deadlock loop.

Owner

zonque commented Oct 29, 2015

@benjarobin hmm, could you elaborate a little more, please? The notify socket is a non-blocking DGRAM socket. Even if we bail from manager_dispatch_notify_fd() without doing anything (ignoring and disposing of any message immediately in case it contains FDSTORE=), we still see the issue. And in my gdb and strace debug attempts, I don't see any such deadlocks. What am I missing?

Contributor

benjarobin commented Oct 29, 2015

@zonque Well, I'm just showing the facts (take for example 7AKu3T-d3.log in the 1505-4.tar.xz archive): journald is stuck inside the sd_pid_notify_with_fds() function. Yes, the process shouldn't be stuck like that; I have no idea why, but all the backtraces show this process stuck inside this function.

It's very easy to reproduce the problem on my computer; in fact, it's much harder to get it to boot normally. With systemd 226 I do not have any problems.

Contributor

benjarobin commented Oct 29, 2015

@zonque The socket used for sendmsg() is not opened with SOCK_NONBLOCK, so the DGRAM socket can block if there is not enough room to queue the message.

Owner

zonque commented Oct 29, 2015

The socket used for sendmsg() is not opened with SOCK_NONBLOCK, so the DGRAM socket can block if there is not enough room to queue the message.

This is confusing. Our messages are tiny, and way smaller than PIPE_BUF, so this should never happen. Are you really seeing this call block in strace or gdb? Does journald hang in your case? Because it doesn't in my setup.

Contributor

benjarobin commented Oct 29, 2015

@zonque Yes, I do see this call block in gdb (I did not set up strace). Did you check the archives, which contain the test script and the result logs?
Could you please share your test setup?

I am currently trying to reproduce the problem with systemd patched to use SOCK_NONBLOCK for sendmsg().

Owner

dvdhrm commented Oct 29, 2015

@benjarobin But systemd (pid1) never does sd_bus_call_method(). In other words, pid1 never does a synchronous method call. So I cannot see why pid1 would be stuck in your case. If it weren't stuck, then it should still dispatch notify messages and journald would continue as usual.

Contributor

benjarobin commented Oct 29, 2015

@dvdhrm Well, maybe you are wrong. Did you check the backtraces? If anybody has a problem accessing/downloading the archives, let me know.

src/core/manager.c:2039 is pid1's code, isn't it? I don't know anything about the systemd code, so I may have made a mistake; that's why I gave you the test script and the result logs.

Owner

dvdhrm commented Oct 29, 2015

@benjarobin, manager.c:2039 is the main loop, so this is not really correct. Anyway, I can see that AddMatch and friends are blocking D-Bus calls, and they will indeed deadlock if dbus-daemon logs and journald blocks on pid1. @poettering might wanna comment on this.

Contributor

benjarobin commented Oct 29, 2015

I am not able to reproduce the hang with this code:
systemd 227 + b215b0e, plus, inside sd_pid_notify_with_fds(), the socket() call using SOCK_NONBLOCK and sendmsg() using MSG_DONTWAIT.

I set up an automatic reboot of the computer on boot success, and everything looks fine...

But the patch as applied is not a real solution; we shouldn't drop the sd_pid_notify_with_fds() message if the kernel buffer is full. A rough sketch of the tested change follows.
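This is only an illustration of the described workaround, not the actual systemd diff; notify_send_nonblocking() and its signature are invented:

#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int notify_send_nonblocking(const char *socket_path, const char *state) {
        struct sockaddr_un sa = { .sun_family = AF_UNIX };
        int fd, r = 0;

        if (strlen(socket_path) >= sizeof(sa.sun_path))
                return -EINVAL;
        strcpy(sa.sun_path, socket_path);

        /* SOCK_NONBLOCK is the crux of the workaround */
        fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC | SOCK_NONBLOCK, 0);
        if (fd < 0)
                return -errno;

        if (sendto(fd, state, strlen(state), MSG_DONTWAIT | MSG_NOSIGNAL,
                   (struct sockaddr *) &sa,
                   offsetof(struct sockaddr_un, sun_path) + strlen(socket_path)) < 0)
                /* EAGAIN means the receiver queue is full: the message is
                 * silently dropped, which is exactly the caveat above */
                r = (errno == EAGAIN) ? 0 : -errno;

        close(fd);
        return r;
}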

Owner

dvdhrm commented Oct 29, 2015

@benjarobin, you might be on to something. Talking with @zonque about this a bit more, imagine the following:

system boots up, everything goes on as normal. dbus-daemon is started up and triggers pid1 bus activation. pid1 receives the call and triggers API bus registration. pid1 goes into bus_init_api(), which calls sd_bus_add_match() synchronously on dbus-daemon. This eventually fails, causing bus_init_api() to fail without ever setting m->api_bus. Thus, we end up with an ETIMEDOUT error on sd_bus_add_match() in the logs (as described by @martinpitt), and we also get the behavior @zonque and I experienced, where pid1 is fully working via the registered API, but never sends out any bus events on the api-bus (because its vtables are registered and working, but the api_bus pointer is not set).
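Schematically, that failure path looks like this. bus_init_api() and api_bus are the systemd-internal names referenced above; the stub Manager type, the callback, and the match string are invented for illustration (build with libsystemd):

#include <systemd/sd-bus.h>

typedef struct Manager {
        sd_bus *api_bus;   /* stays NULL if the setup below fails */
} Manager;

static int signal_cb(sd_bus_message *msg, void *userdata, sd_bus_error *ret_error) {
        return 0;
}

static int bus_init_api_sketch(Manager *m, sd_bus *bus) {
        int r;

        /* Synchronous round-trip (AddMatch) to dbus-daemon. If dbus-daemon
         * is itself blocked on logging to journald, this times out. */
        r = sd_bus_add_match(bus, NULL,
                             "type='signal',interface='org.freedesktop.DBus'",
                             signal_cb, m);
        if (r < 0)
                return r;      /* -ETIMEDOUT: we bail out early here... */

        m->api_bus = bus;      /* ...so this never runs, and later bus events
                                * are silently skipped even though the vtables
                                * keep working */
        return 0;
}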

Now the remaining question is: why does sd_bus_add_match() time out? And more importantly, why is that related to the sd_notify_*() call in the journal?

Some facts that might help: dbus-daemon logs synchronously. Hence, if that log stream is full, dbus-daemon blocks on the log stream to the journal. The journal might be blocking (who knows why) on pid1, and pid1 blocks on dbus-daemon via sd_bus_add_match(). The latter times out at some point, causing the problems as described.

I still don't get why the journal blocks, though. The sd_notify call allocates a new socket and only sends a single DGRAM message. This should never block, as the kernel has a small but existent internal buffer per DGRAM socket.

Also weird: why does a sleep() call in front of the sd_notify solve the issue?

I think the blocking sd_bus_add_match might indeed be the underlying cause, but I'm still not sure how the full pseudo-deadlock circle goes.

Owner

zonque commented Oct 29, 2015

Ok, @benjarobin, you deserve a prize for this. I profiled the sd_pid_notify_with_fds() call in journald, and in the failed case it in fact spends >25s in that call. This explains the deadlock; now we need to think about a solution. Of course, after that timeout everything goes back to normal, and hence none of the tasks remain in a hung state. Thanks a million, this one was driving me nuts.

Contributor

benjarobin commented Oct 29, 2015

Thanks to @martinpitt, who gave me the basic idea for the test script. And thanks to my computer, which has a reproduction rate close to 90%.
But could anyone explain to me why sd_pid_notify_with_fds() is blocking?

Owner

dvdhrm commented Oct 29, 2015

@benjarobin: Simple. pid1 has a single DGRAM socket on which it receives sd_notify() messages from all other processes. The receive queue is limited to 16 messages by default. If there are 16 unit startups in parallel, the queue is full and journald blocks on the sd_notify().

At the same time, pid1 blocks on AddMatch, which blocks on dbus-daemon, which blocks on logging.

The fix is probably to make dbus-daemon logging async, and to make journald's sd_notify() async as well.
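The bounded receive queue is easy to demonstrate standalone. This is a sketch, not systemd code; the exact cutoff is governed by net.unix.max_dgram_qlen, so the count varies by system:

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void) {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0) {
                perror("socketpair");
                return 1;
        }

        int sent = 0;
        for (;;) {
                /* MSG_DONTWAIT, so we see EAGAIN instead of blocking */
                if (send(sv[0], "READY=1", 7, MSG_DONTWAIT) < 0) {
                        if (errno == EAGAIN || errno == EWOULDBLOCK)
                                break;
                        perror("send");
                        return 1;
                }
                sent++;
        }

        printf("queue filled after %d unread datagrams\n", sent);
        return 0;
}

A sender without MSG_DONTWAIT would simply hang at the point where this loop gets EAGAIN, which is exactly journald's situation during a boot with many parallel unit startups.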

Owner

poettering commented Nov 1, 2015

My patch to fix this properly is now waiting for you in #1745. Please have a look. But note that #1737 should be reviewed/merged first, as #1745 builds on it.

@poettering poettering closed this in e22aa3d Nov 2, 2015

ssahani pushed a commit to ssahani/systemd that referenced this issue Nov 3, 2015

journald: never block when sending messages on NOTIFY_SOCKET socket
Otherwise we might run into deadlocks, when journald blocks on the
notify socket on PID 1, and PID 1 blocks on IPC to dbus-daemon and
dbus-daemon blocks on logging to journald. Break this cycle by making
sure that journald never ever blocks on PID 1.

Note that this change disables support for the event loop watchdog,
as these messages are sent in blocking style by sd-event. That should
not be a big loss though, as people reported frequent problems with the
watchdog hitting journald on excessively slow IO.

Fixes: #1505.

DeX77 added a commit to frugalware/frugalware-current that referenced this issue Nov 7, 2015

systemd-227-5-x86_64
* relbump
* fixes issues about "Connection timed out"
  see systemd/systemd#1505