[227 regression] Boot occasionally fails with "Connection timed out" #1505
martinpitt added the regression label on Oct 9, 2015
martinpitt added this to the v228 milestone on Oct 9, 2015
martinpitt self-assigned this on Oct 9, 2015
martinpitt added the release-critical label on Oct 9, 2015
|
For the record:
It's actually not -- as https://bugs.debian.org/801361 points out, this will actually unmount stuff if you log in as root. I have the bisect running now, which will show if the unmounts and the timeouts have one and the same root cause. If not, I'll open a separate issue for the unmounts. |
|
Found the culprit: it's a5bd3c3 from PR #1215. Reverting that fixes the connection timeouts and everything is hunky-dory again. Curiously, pretty much the exact same symptom happened with 219 in February -- do you guys still remember http://lists.freedesktop.org/archives/systemd-devel/2015-February/028640.html ? Back then it was fixed in 6414444, which already (more or less accidentally) fixed the […]. @maciejaszek, @dvdhrm, any idea what the real fix should look like? Thanks! |
martinpitt removed their assignment on Oct 9, 2015
|
The "user systemd tries to unmount" stuff is unrelated to this, I now reported issue #1507 about that. |
|
FTR, I actually did see this occasionally when running CI tests for the Ubuntu trunk builds, but as the Debian ones (which I do more often) worked fine I didn't pay enough attention to them. Sorry for the process fail, it's always frustrating when we release with a major bug that we could have spotted before. I'll be more suspicious when the "boot smoke" test fails in the future! |
|
Thanks, @martinpitt, for bisecting this! Could you give this patch a try, please, and see if it fixes the regression? zonque@9f58f91 |
|
Hmm, wait. No, that doesn't explain it. |
|
@zonque: No surprise, but FTR: no difference with that patch. |
|
@martinpitt soo, how and why precisely does the sendmsg() fail and in which process? Do you have an strace of the process maybe, so that we can have a look? |
poettering added the needs-reporter-feedback label on Oct 9, 2015
manover pushed a commit to manover/systemd that referenced this issue on Oct 11, 2015
|
@poettering: I'll see whether I can come up with an early systemd unit which attaches strace to pid 1 early; things like logind etc. do a […].

@maciejaszek: Your original PR #1215 looks wrong. "CMSG_SPACE(0) may return value other than 0" is intended, as it contains padding for alignment: […]
I think this padding must at least be part of the allocation below: […]
Thus the original code before that PR looked right. However, you said that you got some […].
But the example on the manpage actually sets it to the sum of the […]. So if we instead assume that the description is right and the example is wrong, we keep […]. So going back, the only code that actually works is the one with PR #1215 reverted, and I don't know why @maciejaszek got […]. |
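For reference, here is a minimal sketch (my own code, not systemd's sd_pid_notify_with_fds()) of passing a single fd via SCM_RIGHTS; it shows where the CMSG_SPACE()/CMSG_LEN() distinction discussed above enters the picture: CMSG_SPACE() includes the alignment padding and is what the buffer has to be sized with, while CMSG_LEN() is the padding-free value stored in cmsg_len.

```c
/* Hedged sketch, not the systemd source: pass one fd over an AF_UNIX socket
 * via SCM_RIGHTS. */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_one_fd(int sock, int fd) {
        union {
                struct cmsghdr hdr;
                uint8_t buf[CMSG_SPACE(sizeof(int))];  /* CMSG_SPACE: header + data + padding */
        } control;
        struct iovec iov = { .iov_base = "x", .iov_len = 1 };
        struct msghdr mh = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = &control,
                .msg_controllen = sizeof(control),     /* i. e. the CMSG_SPACE() sum */
        };
        struct cmsghdr *cmsg;

        memset(&control, 0, sizeof(control));
        cmsg = CMSG_FIRSTHDR(&mh);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));        /* CMSG_LEN: header + data, no padding */
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        /* Whether msg_controllen should stay at the CMSG_SPACE() sum or be
         * trimmed to the CMSG_LEN() sum is exactly the ambiguity debated here. */
        return sendmsg(sock, &mh, MSG_NOSIGNAL) < 0 ? -1 : 0;
}
```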
|
@martinpitt if either […] |
|
Wrt. stracing, this actually works quite well: […]

First I ran this stracing with current git head, i. e. with this bug. From a failed boot the journal is http://paste.ubuntu.com/12761903/ and the strace output is http://paste.ubuntu.com/12761900/ . Note that there are zero instances of […]. From a successful boot the journal is http://paste.ubuntu.com/12761945/ and the strace is http://paste.ubuntu.com/12761943/ . Note that there are no […]. AFAICS the socket isn't marked as nonblocking, so the usual meaning of EAGAIN doesn't apply here. So I'm afraid I can't really make sense of the […]. |
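Whether EAGAIN is even expected here depends on the socket being non-blocking, either via its file status flags or via MSG_DONTWAIT passed to the individual recvmsg()/sendmsg() call. A generic helper (mine, not from the thread) to check the former:

```c
/* Generic check, not from the thread: does this fd have O_NONBLOCK set?
 * EAGAIN can also come from MSG_DONTWAIT passed per call, even when the fd
 * itself is blocking. */
#include <fcntl.h>
#include <stdbool.h>

static bool fd_is_nonblocking(int fd) {
        int flags = fcntl(fd, F_GETFL);
        return flags >= 0 && (flags & O_NONBLOCK);
}
```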
|
I re-ran with the reverted patch, so that the "Connection timed out" errors disappear. strace is http://paste.ubuntu.com/12762156/. As expected there are no more EAGAINs. With current git head (i. e. without reverting), the EAGAINs coincide exactly with the failed boots with connection errors (in pid 1). |
|
@martinpitt: bug occurred when I tried to pass fds, but have_pid was set to 0. After my change everything booted without any problems. I'm looking at the code, but it would be good to test changes - do you have any image on which I can reproduce this error? |
|
I do have an image, but it's 2.1 GB. I started uploading it now, but it'll take some two hours. This is more or less a standard Ubuntu cloud image (http://cloud-images.ubuntu.com/wily/current/wily-server-cloudimg-amd64-disk1.img) with some extra stuff installed (network-manager, policykit-1, lightdm), upgraded to systemd 227 with persistent journal enabled, and then rebooted in a loop. I also got this behaviour on full desktop images back then. The whole machinery to do that (autopkgtest etc.) is not that simple to reproduce on other distros. But it seems to me that enabling persistent journal is somehow a key ingredient. In the previous case (http://lists.freedesktop.org/archives/systemd-devel/2015-February/028640.html) we eventually got bug reports on pretty much every distro (Arch, Fedora, Debian, Ubuntu, etc.), but it seems no developer except me could reliably reproduce this. It's such an iffy heisenbug. |
|
When this bug happens, stracing logind shows 27 […].

Whereas with the reverted patch we get multiple calls, but they all have exactly the same control length "20", and again there are no errors: […]

Sorry, this makes no sense at all to me; at this point I'm just monkey-patching around. |
|
Just for clarity, a5bd3c3 looks correct. The previous code was definitely not correct. Maybe there is still something wrong, but I checked all the kernel CMSG macros and internal handling, and it looks fine. Please correct me if I'm wrong. Furthermore, the log messages don't mention a failure in sendmsg(2). Instead, what I see is lightdm calling pam, calling logind, calling pid1, calling AddMatch on dbus-daemon. The latter fails and the error code is passed through the chain to lightdm. Maybe this is not a direct chain, but rather an activation chain. But that's probably irrelevant. Anyway, we cannot ignore that reverting a5bd3c3 fixes your issue. This somehow smells like stack corruption to me... I'll see whether valgrind throws something interesting. Furthermore, can you give some details on how you reproduced this? Is this 32bit? 64bit? 32bit on 64bit? x86? ARM? etc. That is, trying to figure out why none of us sees it on their production system. |
|
@dvdhrm: We got at least three different reporters hours after we uploaded 227 to Debian sid. People there used i386 (32 bit) and x86_64, lightdm or gdm3, etc., and as this also kills journald, logind, rfkill, I don't believe this is dependent on a particular login manager or even architecture. However, I could never reproduce it on my Debian test images. I just tried taking a standard x86_64 Ubuntu desktop VM install, upgrading to systemd 227, enabling persistent journal, and rebooting often; but I couldn't trigger it like that. Half a year ago it pretty much felt exactly the same, and I got it once or twice with manual tests on a desktop VM, but it was too frustrating to reproduce that way as this seems to be highly dependent on timing, sun rays, local air pressure, and what not. The upload of my autopkgtest VM where this is reasonably easy to reproduce finally finished: http://people.canonical.com/~pitti/tmp/adt-wily+systemd227.img (2.1 GB)
[…]
As this was based on a minimal VM, lightdm doesn't actually start up, so there's no usable graphic output. The VM can be driven by the serial console, or (once it starts up) over ssh. The above QEMU command starts it with a console on stdio, and you can use […]. This VM mostly just needs to be rebooted a couple of times (like 5), and then it reliably produces that hang for me. User is "ubuntu", password "ubuntu", sudo works without password. |
|
Some missing info: […]
|
|
I've written a test service which calls sd_pid_notify_with_fds, and straced pid 1: […]

It looks like the manager is receiving it properly, adds the fds to the watched fds, and then tries to receive something one more time, which fails. |
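The source of that test service is not included in the comment; a hypothetical reconstruction (the "FDSTORE=1" state string and the single pipe fd are my assumptions, not taken from the thread) might look like this:

```c
/* Hypothetical test service, not the original: pass one pipe fd to PID 1.
 * Must be started from a unit so that $NOTIFY_SOCKET is set for it. */
#include <unistd.h>
#include <systemd/sd-daemon.h>

int main(void) {
        int p[2];

        if (pipe(p) < 0)
                return 1;

        /* Hand the read end to the manager; a return value > 0 means the
         * notification datagram was sent. */
        return sd_pid_notify_with_fds(0, 0, "FDSTORE=1", p, 1) > 0 ? 0 : 1;
}
```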
|
For the record: a5bd3c3 changed both the allocation and the value of […]. I tried a variant which keeps the current control length, but allocates some extra space in case it overflows due to padding. With this the bug is still present, so I don't believe it's just a short allocation.

As for my earlier "AFAICS the socket isn't marked as nonblocking, so the usual meaning of EAGAIN doesn't apply here" → this is obviously wrong. The socket itself is not opened with […]. Are these […]? |
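A hedged sketch of what such a variant could look like (assumed shape, not the actual diff that was tested): size the buffer with CMSG_SPACE() so that any padding fits, but keep msg_controllen at the un-padded CMSG_LEN() value that a5bd3c3 introduced.

```c
#include <stdint.h>
#include <sys/socket.h>

/* Assumed shape of the experiment, one fd and no ucred: */
static union {
        struct cmsghdr hdr;
        uint8_t buf[CMSG_SPACE(sizeof(int))];   /* allocate the padded size ... */
} control;

static struct msghdr mh = {
        .msg_control = &control,
        /* ... but report only the un-padded length, as after a5bd3c3 */
        .msg_controllen = CMSG_LEN(sizeof(int)),
};
```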
|
Hmm, there's a nasty detail in […]. |
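Judging by the title of the fix that eventually lands (PR #1540, "sd-daemon: wipe out memory before using CMSG_NXTHDR()"), the detail presumably concerns walking to the second control header over uninitialized stack memory. A hedged sketch of the safe pattern follows; the remark about glibc's CMSG_NXTHDR() bounds check is my understanding, not something stated in the thread.

```c
/* Hedged sketch, not the actual #1540 patch: build SCM_RIGHTS plus
 * SCM_CREDENTIALS headers in one control buffer. The buffer is zeroed first;
 * glibc's CMSG_NXTHDR() peeks at the length field at the position of the
 * next header for its bounds check, so stack garbage there may make it
 * return NULL and the second header would silently never be written. */
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* buf must be CMSG_SPACE(sizeof(int)) + CMSG_SPACE(sizeof(struct ucred)) bytes. */
static int build_control(struct msghdr *mh, uint8_t *buf, size_t buflen, int fd, pid_t pid) {
        struct ucred ucred = { .pid = pid, .uid = getuid(), .gid = getgid() };
        struct cmsghdr *cmsg;

        memset(buf, 0, buflen);                  /* the crucial "wipe out memory" step */
        mh->msg_control = buf;
        mh->msg_controllen = buflen;

        cmsg = CMSG_FIRSTHDR(mh);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        cmsg = CMSG_NXTHDR(mh, cmsg);            /* may be NULL over garbage memory */
        if (!cmsg)
                return -1;
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_CREDENTIALS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(struct ucred));
        memcpy(CMSG_DATA(cmsg), &ucred, sizeof(struct ucred));
        return 0;
}
```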
|
Seriously? This loop fixes it? |
|
@zonque: That's it, you are a star! No more hangs with the […]. Please just fix the "sd-daemin" typo in the commit log :-) Thanks! |
|
David Herrmann [2015-10-12 6:12 -0700]: […]

I deleted the comment. I got a successful run with it, but it seems […].
martinpitt removed the needs-reporter-feedback label on Oct 12, 2015
zonque referenced this issue on Oct 12, 2015: sd-daemon: wipe out memory before using CMSG_NXTHDR() #1540 (Merged)
|
I still wonder why that didn't result in a triggered assertion on the sender side. |
dvdhrm closed this in #1540 on Oct 12, 2015
|
Argh... I just re-ran the test (with the […]). |
martinpitt reopened this on Oct 12, 2015
|
@martinpitt, what does […] show if you run something like this?

```c
#include <unistd.h>
#include <systemd/sd-daemon.h>

int main(int argc, char **argv) {
        int x[2];

        pipe(x);
        sd_pid_notify_with_fds(0, 0, "bla", x, 2);
        return 0;
}
```
|
|
(you have to run that from a unit file, so that […]) |
|
@zonque: I use this complete .c program: […]

and this test unit: […]

With current master I get […]

With master + reverted patch I get this instead: […]

which I guess is what @maciejaszek's patch fixed, i. e. the extra 16 bytes in msg_controllen are the aligned […]. So far, no surprises here. |
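For what it's worth, here is a throwaway check (mine, not from the thread) that prints the relevant macro values; on x86_64 both CMSG_LEN(0) and CMSG_SPACE(0) come out as 16, i. e. the size of one aligned, empty cmsghdr, which is one plausible source of a 16-byte delta in msg_controllen.

```c
/* Throwaway check: print the cmsg size macros (values are platform
 * dependent; on x86_64 sizeof(struct cmsghdr) is 16 and alignment is 8). */
#include <stdio.h>
#include <sys/socket.h>

int main(void) {
        printf("CMSG_LEN(0)                 = %zu\n", (size_t) CMSG_LEN(0));
        printf("CMSG_SPACE(0)               = %zu\n", (size_t) CMSG_SPACE(0));
        printf("CMSG_LEN(2 * sizeof(int))   = %zu\n", (size_t) CMSG_LEN(2 * sizeof(int)));
        printf("CMSG_SPACE(2 * sizeof(int)) = %zu\n", (size_t) CMSG_SPACE(2 * sizeof(int)));
        return 0;
}
```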
|
A couple of questions: did anyone manage to reproduce this issue on anything that's not Ubuntu? As this causes AddMatch in dbus-daemon to time out, what precisely is dbus-daemon doing in this case? What does a backtrace of dbus-daemon show when this happens? Does downgrading dbus-daemon to something older have any effect on the issue? A wild guess of mine might be that there's some confusion with the fds in PID 1, and it starts treating some dbus fd as notify fd or vice versa... @martinpitt if you strace logind and dbus-daemon simultaneously, do you see the AddMatch message transferred properly? Do you see it being sent by logind, and received by dbus-daemon? In the strace output, can you verify that the right fds are used when reading/writing the fds? |
|
@poettering we at least had several Debian unstable users as well who were affected by this |
|
I can also reproduce this error on a VM with Fedora 23, systemd 227.
And I also tested with the Ubuntu VM by @martinpitt, which failed once in 5 trials. So I don't think it's a distro-specific issue. |
|
I don't get it. @dongsupark, @martinpitt, could you boot your system with […]
... and then paste the valgrind log somewhere? |
coling commented on Oct 13, 2015
|
Just for reference (I don't want to muddy the waters), but I've also had issues with logind on v227. Several times I've found I've been unable to unlock my session, and switching to VTs fails to start a getty process. The emergency shell shows logind as failed, and list-jobs shows we're waiting for the logind restart job. All very confusing. I've never been able to get very much debug out of it, but when it happens again I'll see what I can dig out. I suspect it's related. It didn't seem to happen with 226. Not much to add, I appreciate -- like I say, I will see what I can dig up next time it happens. |
I think I've hit the same issue on F22 with systemd 227. But it happened just twice. There was a weird delay when dbus was starting and then logind failed. |
It seems sufficiently clear now (after the stracing above) that this commit was indeed correct. Reverting merely has the effect that […].

In fact I don't. http://paste.ubuntu.com/12779425/ is the entire strace of logind, and the only appearance of D-Bus is the (successful) […].

So indeed this points towards a D-Bus problem. The strace of dbus-daemon (http://paste.ubuntu.com/12779475/) ends with […], which sounds like a problem with handing over […]. FTR, http://paste.ubuntu.com/12779487/ is the […].

But I'll get a full strace of pid 1, and will see to modify […]. |
|
Ah! The "No socket received" is highly interesting! This is where we should continue debugging! Could you put together a script that writes all env vars and the long contents of /proc/self/fd/ and /proc/self/fdinfo into a file and then execs dbus-daemon? Then, please change dbus.service to invoke this script rather than the actual binary. With that we should be able to see what precisely dbus-daemon gets passed. |
|
http://people.canonical.com/~pitti/tmp/straces.tar.xz has the full straces including pid 1. The latter is now 100 MB, so pastebin is unwieldy. dbus.trace already contains the env vars (…). I put this into the test setup: […]

With that I get […]

and the corresponding strace http://paste.ubuntu.com/12779676/ . I can't make much sense of that though, can you? Grepping for the socket:[..] stuff after the startup yields nothing, so I'll adjust the script to grep /proc for these sockets immediately, to see what the other end is. |
|
@martinpitt the strace cannot work that way, you need to specify at least "-D", as otherwise the LISTEN_PID will not match the PID of dbus-daemon, and then dbus-daemon rightfully ignores any passed fds. |
|
@martinpitt hmm, if you had the strace line like this earlier already, maybe the "No socket received" thing is simply a result of that? |
|
This downstream bug might be related to this issue as well |
|
Thanks for pointing that out! That also explains why the VMs didn't finish booting at all any more with the dbus stracing. New test setup: […]

dbus is now actually running and logind does the AddMatch things. logind strace: http://paste.ubuntu.com/12780100/ |
|
@martinpitt OK, so if the AddMatch works now and many of the issues at hand were triggered through the wrong strace, what precisely remains now? Does the AddMatch work correctly now, in all cases? So what's not working now -- what is this bug about precisely now? |
|
@poettering: The wrong straces happened since #1505 (comment). strace now shows that logind and dbus are running, but systemd still doesn't consider them running: […]

This is something that the newly working […]. Can you reproduce the hangs with the test VM? (See #1505 (comment).) This doesn't contain any strace or other debug annotation; I add all that after booting. |
|
@martinpitt logind is Type=dbus, and hence systemd waits for the org.freedesktop.login1 name to appear on the bus before it is considered started. When this hangs, has the name properly appeared? Note that logind also sends READY=1 with sd_notify, but PID 1 does not use that for state changes if Type= is not notify... when this hangs, can you use "systemd-analyze dump" to get the full state dump of pid 1, and then check the "NotifyState=" field for the logind.service unit? What does it say? |
|
I just realized that the previous logind strace was useless, as logind keeps restarting itself and thus the trace is always just the latest one, without the interesting error message. Another logind strace (http://paste.ubuntu.com/12780571/) shows some successful D-Bus calls for the AddMatch: […]

but finally one with a timeout: […]

after that it shuts down: […]
|
|
While boot is hanging, systemd-analyze is failing as well with "Failed to issue method call: Connection timed out". If the dump output is still useful after it failed, here it is: […]

So trying anything dbus-y during that time is difficult, and strace seems to be a better bet. |
|
hmm, you can make systemd log the same output to the journal via SIGUSR2 to PID 1. I am really interested in getting this output at the moment where the hang happens. Also, could you get a backtrace of dbus-daemon and PID 1 when this hangs? |
|
FWIW, I was finally able to trigger this with F23 in a VM, but it took ~20 reboot cycles. So this is definitely not an Ubuntu-specific issue. |
|
FTR, I'm not using kdbus in these tests, and I'm sure most/all of the Debian bug reporters don't either. @zonque: not sure whether you already tried, but v225 + cherry-picked a5bd3c3 also causes this bug, which corroborates that fixing […]. So what I could do now is to start a bisect from, say, 219 or so up to 225, cherry-picking a5bd3c3 on each iteration. Or did you already do that? I. e. somehow you got towards bbc2908 -- which bisection did you do? |
|
@martinpitt, still confusing, but sure, let's try. I didn't try this yet but bisected from v225 upwards. Cherry-picking a5bd3c3 won't work before 6414444, however, so v220 seems like a good starting point, and maybe it's a good idea to just test the tagged releases first. |
|
Indeed v220+a5bd3c3 fails (after 14 iterations). 6414444 was indeed the commit which fixed http://lists.freedesktop.org/archives/systemd-devel/2015-February/028640.html, i. e. when the symptoms were exactly the same. 6414444 doesn't apply to 219 and earlier any more, and I don't think going back in time is going to be useful -- it's pretty clear by now that the sd_notify handling is somehow either the cause or more likely the trigger of this, and the underlying bug has been lurking for a long time already. |
@zonque hard to tell. What I can tell for sure is that both v226 and v227 had the hanging issue. That's what I stumbled across last week. I already tried to go back to v225, but I'm not sure I tested v225 enough to be able to tell whether it's working at all. For that I'd need to do the reboot test again with v225. |
coling referenced this issue on Oct 16, 2015: [227 regression] rfkill socket job hangs and blocks other jobs (still present in 228) #1579 (Closed)
|
I did some further experiments wrt. @zonque's analysis in #1505 (comment). First, this diff that […]: […]

but no other messages; IOW, all invocations of […]. So commit a5bd3c3 should have zero effective change in the first line (for n_fds); just to confirm that we are not completely crazy, I ran the tests with […]

and as expected it still fails. Then the effective change of that commit is that we stop allocating a second header for 0 bytes, i. e. […]

right? And indeed with just […]

things work again. This is indeed highly o_O -- the first […] |
|
Smells like a stack corruption issue then, i.e. where the location of things on stack starts to matter. A pity that valgrind wasn't able to track this one down for us though... |
thparamboll commented on Oct 20, 2015
|
Maybe a build using --enable-address-sanitizer will catch something |
|
I tried […]. Already established, but not very explicitly, and thus repeating for the summary: […]

I really continue to be convinced that this is not a memory corruption issue at all; we don't have a single piece of evidence for that, but lots of evidence that it isn't. This bug has been here for quite some time, and commit a5bd3c3 merely uncovered it. I. e. the fact that previously journald failed to notify pid 1 about the fd, and now it succeeds, exposes some lockup issue. |
Right, with parameter failures on the sender side, the receiver continues to work. I also did some more tests and I'm starting to believe this is a kernel issue with file descriptor passing over sockets that eventually (modulo some conditions that I don't know yet) confuses the state of the receiver. I'm currently writing a standalone test program to reproduce this, but I didn't have any luck so far. I'll keep you posted. |
To elaborate a bit more: even when the received fds are closed immediately after […] |
|
Right now the only tool that uses the pid argument is […]. And since nobody is using the pid argument (it is always 0), the code before 6414444 was valid (as long as the pid argument is not used)! The […]

The commit 6414444 breaks the code, it doesn't fix it; this commit just hides the problem. The code was fixed again in commit a5bd3c3. So if there is any bug inside systemd, the bug is prior to v219. |
|
Commit 6414444 was right in the sense that before it we got connection timeouts which looked pretty much exactly like now; back then it was triggered by 13790ad, which first made use of this functionality. FWIW, that's not how I read man cmsg(3): it uses […] |
|
So in our case, with only one fd and no pid, we have only one […].

The alignment is optional and only needed if we want more than one […]. |
|
@benjarobin the code is correct as it stands right now. Quoting the man pages: […]
Also, the code does the right thing and successfully transmits the fds (and the right number of fds) to PID1, so that's not the issue. The problem is there is a side-effect of that action, which causes PID1 to trip up eventually. The code we're looking at here only allows the other effect to happen. |
|
@zonque Sorry if I misspoke. I wanted to say that the current code is correct and fully valid, and the code before 6414444 was valid as long as the pid argument was not used. The kernel doesn't check and doesn't care whether msg_controllen is a multiple of 8. => The purpose of my messages was to say that the bug is prior to 6414444. With commit 6414444, when we try to transmit the fds, we also send another uninitialized cmsghdr with no data. I have no idea what the kernel does when it "parses" the second cmsghdr, which contains random data. |
|
@zonque @martinpitt I edited / improved / fixed my previous messages, sorry again I misspoke |
|
@martinpitt My English is so bad... That was a question, sorry. So we agree on that: fd passing is not working between 6414444 and a5bd3c3. |
|
@benjarobin You cannot conclude that. The trigger and the cause might not be the same commit. |
|
hmm, so I am trying to reproduce this now, and turned off kdbus again. I am trying to reproduce this in a container, but I have no luck... Did anyone manage to reproduce this in nspawn? |
|
hmm, so I tried this on bare metal now too, rebooted 15 times in a row, and dbus and logind were fine each time. I am trying this with current git on F23. Can anyone tell me how to reproduce this on a system like that? |
|
@cinquecento btw, please do not dump arbitrary logs in this bug report (or any), unless this is requested. It's hard enough to follow this bug report as long as it is, given that there were other issues unrelated to the actual bug being discussed here; dumping huge log files just makes it even harder. |
As I said, it takes around 40-200 reboots to trigger in my case. I'm so far only testing with a VM image. |
dvdhrm referenced this issue on Oct 28, 2015: core: fix priority ordering in notify-handling #1707 (Merged)
poettering closed this in #1707 on Oct 28, 2015
zonque reopened this on Oct 28, 2015
|
I did run some tests and got a current call trace of all processes using sysrq-trigger; here is the result: […]
|
|
Here is the backtrace obtained with gdb of systemd, dbus-daemon, systemd-logind, systemd-journald and systemd-udevd, running with systemd 227 + cherry-pick of commit ref #1707: http://benjarobin.free.fr/Divers/benjarobin-systemd-1505-3.tar.xz |
|
As per our conversation I updated http://people.canonical.com/~pitti/tmp/adt-wily+systemd227.img to have gdb, debug symbols for libc, dbus, and libcap, and I removed the apt proxy config. There are two snapshots now: "upstream" is with systemd, systemd-logind, and systemd-journald from current master, and "revert-a5bd3c3" has that commit reverted (in spirit). The VM defaults to "upstream". Note that this reversion actually stopped working reliably as a workaround; some commit in the last few days broke that (perhaps #1707). It was meant to provide a reliable first boot before test stuff can be deployed, so it's not that useful any more. With that, this VM fails quickly: usually at the first or second boot, and in my runs tonight it never survived more than 5. Note that there is no snapshot yet of the running state when this happens. I need to look into whether and how this can be done with qemu (I didn't see this in https://en.wikibooks.org/wiki/QEMU/Monitor or the manpages). But the hang reproduces very often, so this is maybe not that important. I also did one more experiment: http://paste.ubuntu.com/12995227/ (resulting in the journal http://paste.ubuntu.com/12995347/). I think we stared at the […]. BTW: an interesting (I think) observation is that even with completely ignoring received fd arrays (as in the above patch) we still get the bug. So the problem is not with their handling and triggering further notifications further down in the code (I thought this might lead to deadlocks due to cyclic notifications perhaps); it almost seems like the mere act of receiving an fd array and doing nothing with it already causes this. @zonque had some theory above of this being a kernel bug; maybe this corroborates it? |
|
The backtrace obtained with gdb a little bit later of systemd, dbus-daemon, systemd-logind, systemd-journald and systemd-udevd, running with systemd 227 + cherry-pick of commit ref #1707: http://benjarobin.free.fr/Divers/benjarobin-systemd-1505-4.tar.xz. journald is not able to write the content of the log; journald just hangs. journald gets unstuck when I send SIGKILL to all processes using the Magic SysRq key. |
|
The analysis of the deadlock with systemd 227 + commit b215b0e: […]. [logind] is stuck since the process tries to communicate with systemd (notify) or dbus, and those processes hang, being themselves stuck in a deadlock loop. |
|
@benjarobin hmm, could you elaborate a little more, please? The notify socket is a non-blocking DGRAM socket. Even if we bail from […] |
|
@zonque Well, I am just showing the facts (take for example the 7AKu3T-d3.log in the 1505-4.tar.xz archive): journald is stuck inside the sd_pid_notify_with_fds function. Yes, the process shouldn't be stuck like that; I have no idea why all backtraces show this process stuck inside this function. It's very easy to reproduce the problem on my computer -- it's much harder to boot normally on it. With systemd 226 I do not have any problem. |
|
@zonque The socket used for sendmsg() is not opened with SOCK_NONBLOCK, so the DGRAM socket can block if there is not enough room to store the message. |
This is confusing. Our messages are tiny, and way smaller than […] |
|
@zonque Yes, I do see this call block with gdb (I did not set up strace). Did you check the archives which contain the test script and the result log? I am currently trying to reproduce the problem with systemd patched to use SOCK_NONBLOCK for sendmsg. |
|
@benjarobin But systemd (pid1) never does sd_bus_call_method(). In other words, pid1 never does a synchronous method call. So I cannot see why pid1 is stuck in your case? If it weren't stuck, then it should still dispatch notify messages and journal would continue as usual. |
|
@dvdhrm Well, maybe you are wrong. Did you check the backtrace? If anybody has a problem accessing/downloading the archives, let me know: […]
|
|
@benjarobin, manager.c:2039 is the main-loop, so this is not really correct. Anyway, I can see that AddMatch and friends are blocking DBus calls, and they will indeed dead-lock if dbus-daemon logs and journald blocks on pid1. @poettering might wanna comment on this. |
|
I am not able to reproduce the hang with this code: […]. I did set up an auto-reboot of the computer on boot success, and everything looks fine... But the applied patch is not a solution; we shouldn't drop the sd_pid_notify_with_fds() call if the kernel buffer is full. |
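To make "the kernel buffer is full" concrete, here is a small standalone toy (my own, not the patch or test script referred to above): with a blocking AF_UNIX datagram pair and a receiver that never reads, the sender eventually stalls in send() once the unread-datagram limit (bounded by /proc/sys/net/unix/max_dgram_qlen) is reached, which is the same kind of blocking a synchronous sd_notify() sender would experience against a congested notify socket.

```c
/* Toy demonstration, not the reproducer from the thread: a blocking
 * SOCK_DGRAM sender against a receiver that never reads. The alarm() just
 * keeps the expected hang from lasting forever. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
        int sv[2];
        const char msg[] = "READY=1";

        if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
                return 1;

        alarm(10);  /* bail out of the hang after 10 seconds */

        for (unsigned i = 0; ; i++) {
                printf("sending datagram %u\n", i);
                fflush(stdout);
                /* Blocks once the receive queue of sv[1] is full. */
                if (send(sv[0], msg, strlen(msg), 0) < 0) {
                        perror("send");
                        break;
                }
        }
        return 0;
}
```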
|
@benjarobin, you might be on to something. Talking with @zonque about this a bit more, imagine the following: the system boots up, everything goes on as normal. dbus-daemon is started up and triggers pid1 bus activation. pid1 receives the call and triggers API bus registration. pid1 goes into […]. Now the remaining question is: why does […]?

Some facts that might help: dbus-daemon logs synchronously. Hence, that log might be full, thus dbus-daemon is blocking on the log stream to the journal. The journal might be blocking (who knows why) on pid1, and pid1 blocks on dbus-daemon via […].

I still don't get why the journal blocks, though. The […]. Also weird: why does a sleep() call in front of the […]? I think the blocking […] |
|
Ok, @benjarobin, you deserve a prize for this. I profiled the […] |
|
Thanks to @martinpitt, who gave me the basic idea for the test script. And thanks to my computer, which has a reproduction rate close to 90%. |
|
@benjarobin: Simple. pid1 has a single DGRAM socket where it receives sd_notify() messages from all other processes. The receive queue is limited to 16 messages by default. If there are 16 unit startups in parallel, the queue is full and journald blocks on the sd_notify(). At the same time, pid1 blocks on AddMatch, which blocks on dbus-daemon, which blocks on logging. The fix is probably to make dbus-daemon logging async, and to make journald's sd_notify async as well. |
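A hedged sketch of the "make the sender async" direction (my own code, not the actual fix that was merged): send the notification datagram with MSG_DONTWAIT so that a full receive queue on PID 1's notify socket yields EAGAIN instead of blocking the caller, which can then queue the message and retry later. The socket setup mirrors what a NOTIFY_SOCKET-style sender would do; the function name and error handling are mine.

```c
/* Sketch only: non-blocking notify send. Abstract-namespace sockets
 * ("@"-prefixed NOTIFY_SOCKET values) are ignored here for brevity. */
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <unistd.h>

static int notify_nonblocking(const char *socket_path, const char *state) {
        struct sockaddr_un sa = { .sun_family = AF_UNIX };
        struct iovec iov = { .iov_base = (char *) state, .iov_len = strlen(state) };
        struct msghdr mh = {
                .msg_name = &sa,
                .msg_namelen = offsetof(struct sockaddr_un, sun_path) + strlen(socket_path),
                .msg_iov = &iov,
                .msg_iovlen = 1,
        };
        int fd, r;

        if (strlen(socket_path) >= sizeof(sa.sun_path))
                return -EINVAL;
        strncpy(sa.sun_path, socket_path, sizeof(sa.sun_path) - 1);

        fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
        if (fd < 0)
                return -errno;

        r = sendmsg(fd, &mh, MSG_DONTWAIT | MSG_NOSIGNAL) < 0 ? -errno : 0;
        close(fd);
        return r;   /* -EAGAIN means "receive queue full, try again later" */
}
```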
martinpitt commented on Oct 9, 2015
https://bugs.debian.org/801354 reported that 227 introduced a rather major boot regression: you often get failures like […]
(same for lightdm, etc.) I now found a way to reproduce this in a VM, so I'll go bisect hunting.
As for the pam_systemd thing: I wonder if that's related to the regression that the user systemd instance is now trying to unmount things during boot: […]
This might be something entirely different (and then mostly cosmetic), or it might be the cause of the failure to create a user session.