timeouts, event source returned error #2200
Comments
zonque referenced this issue Dec 21, 2015: sd-event: improve debugging of event source errors #2204 (merged)

zonque commented Dec 21, 2015
Strange that nobody else reported this. Unfortunately, the event source that reported the error does not have a description set, which makes it hard to guess which one it is that's failing. You could apply the patch in #2204 and see what the error message says. Another, probably better option is to install the debug symbols and attach gdb to PID 1.
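Something along these lines, assuming openSUSE's debuginfo packaging (the repository alias and package name are assumptions and may differ on your system):

    # Enable the debuginfo repository and install symbols for systemd
    zypper modifyrepo --enable repo-debug
    zypper install systemd-debuginfo gdb

    # Attach to PID 1. While gdb holds PID 1 stopped, the system cannot
    # reap children or answer D-Bus calls, so keep the session short.
    gdb -p 1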
zonque added the sd-event label Dec 21, 2015
robinro commented Dec 21, 2015
Thanks @zonque for the suggestions. The fact that nobody reported this really is weird. Our setup is rather ordinary, so my best guess is that this bug is really non-obvious due to the long time until it shows up. Maybe we speed it up due to some … The upgrade to v228 before was not complete because the … I now have gdb running to get the state when the error shows up.

Unluckily I killed systemd on one of the systems by accident (debugging PID 1 needs more care than the gdb use I did so far), so now I only have 2 systems with broken systemd that I can run gdb on. Those only show errors about twice per day so far and I haven't caught it yet. I'll post an update as soon as I get anything non-trivial out of it. In the log there are also lines …
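For the record, a one-shot invocation is less risky than an interactive session, since a forgotten prompt can't leave PID 1 stopped indefinitely (a sketch):

    # Grab a full backtrace of all threads of PID 1 and detach immediately
    gdb -p 1 -batch -ex 'set pagination off' -ex 'thread apply all bt full'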
robinro commented Dec 23, 2015
Since hooking up gdb to PID 1, the errors and timeouts haven't shown up again. I'll keep you updated when they return.

fbuihuu commented
I took a look at the log you initially posted and there are some weird things happening in it. First, systemd-journald was killed because it didn't send back the "keep-alive ping" to PID 1 in time. It would be interesting to understand why it got stuck. If you grep for the string "watchdog timeout!" in your log you will see that it happens a couple of times. systemd-logind got restarted for the same reason; not sure restarting logind is supported, though.

After journald was restarted, it stopped logging at "Dec 19 22:59:21" and resumed 19 minutes later, so there is a big "hole" in your log; maybe it's related to the restart of journald. Quickly after journald resumed, the OOM killer was triggered.

So your system seems to run out of memory (due to the "scan" process?) after some time, which causes all those weird behaviours.
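Something like this against the exported log (the file name is taken from the gist linked in the report):

    # Find the watchdog kills and the OOM-killer invocation
    grep -n 'watchdog timeout!' systemd.log
    grep -n -i 'out of memory' systemd.log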
Probably the same behaviour as the one described in issue #1353.

poettering commented Dec 23, 2015
We only track bugs in the most recent versions of systemd upstream. Given that you say that this does not occur in 228, I will hence close this bug now.
poettering closed this Dec 23, 2015

mbiebl commented
@poettering hm, the bug reporter said "After the upgrade to v228 I had problems with the debug symbols and so far didn't get output." Doesn't quite sound to me like it necessarily didn't happen under v228.
BTW: see https://gist.github.com/ohsix/3b7cfd9ef8ef82477133#gistcomment-1656622

robinro commented Dec 28, 2015
@fbuihuu The out-of-memory event was some computational job that got out of control. I believe this is unrelated to the systemd issue; the other affected machines didn't have any such OOM events. #1353 sounds like an idea. If I see it correctly, the watchdog runs every 20 seconds here; maybe under some unexpected load this is not sufficient. We use swap extensively, so our typical usage is similar to #1353. On the other hand, one system was completely idle and still died. But maybe there was some NFS issue at that time, so it still hung for more than 20 seconds.
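For reference, one can check and raise the per-service watchdog interval like this (the 3min value is only an illustration, not a recommendation from this thread):

    # Show journald's current watchdog interval (backs the WatchdogSec= setting)
    systemctl show -p WatchdogUSec systemd-journald.service

    # Raise it via a drop-in, e.g.
    # /etc/systemd/system/systemd-journald.service.d/watchdog.conf:
    #   [Service]
    #   WatchdogSec=3min
    # then reload and restart:
    systemctl daemon-reload
    systemctl restart systemd-journald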
robinro commented Dec 28, 2015
@poettering @mbiebl Because the problem appears only after a long time and across many machines, I can't really test v228 without deploying it on all of them, which is unrealistic for us at the moment.
robinro commented Dec 28, 2015
I have a new guess at the issue. Pretty much the only non-standard thing we do with systemd is two custom timers: one calls a oneshot service once a day and conflicts with the other. After reading #1981 I thought the Conflicts= setting in combination with timers might trigger something similar. Sadly, so far I was not able to produce a timer/conflict setup that reliably triggers the bug. Nevertheless I cleaned up the timers and we'll see whether this gets rid of future timeouts.
robinro commented Jan 4, 2016
It really was the timers. At least we haven't had a system failure since I changed them. Sadly I was not able to get a reproducible setup that quickly segfaults systemd; my best guess is that it's somewhere in the handling of conflicting timers. The setup that broke systemd was roughly as sketched below:
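(A reconstruction from the description; the unit names job1/job2 appear in the text above, everything else — paths and exact values — is hypothetical.)

    # job1.timer -- OnUnitActive= measures from job1's last activation,
    # so with a oneshot service the start time drifts forward each day
    # by the job's runtime.
    [Timer]
    OnUnitActive=24h

    # job1.service
    [Unit]
    Conflicts=job2.service   # starting job1 kills a running job2
    [Service]
    Type=oneshot
    ExecStart=/usr/local/bin/job1   # hypothetical path

    # job2.timer and job2.service looked the same with the names swapped.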
The setup has multiple flaws: OnUnitActive= with a oneshot leads to the point in time moving through the day (the runtime is added to the starting time); the approx. two weeks it took for the bug to appear were the time it took job2's start to drift until it coincided with job1's. The Conflicts= is ugly as it kills the other service. As a fix we use OnCalendar= and run the jobs such that they don't overlap (see the sketch below). Running the timers ridiculously often and simultaneously was not enough to trigger the bug, so maybe some other condition is needed. Thanks to everyone who helped with this somewhat dubious bug. I learned a lot about debugging systemd in the process and now feel confident to run gdb on PID 1.
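A minimal sketch of that fix, assuming hypothetical daily times chosen so the jobs can never overlap:

    # job1.timer -- fixed wall-clock schedule instead of OnUnitActive=
    [Timer]
    OnCalendar=*-*-* 02:00:00

    # job2.timer -- far enough away that the two jobs never run at once
    [Timer]
    OnCalendar=*-*-* 14:00:00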
robinro commented Dec 20, 2015
Problem
After about 2 weeks since the last reboot, systemctl status starts to fail with "Failed to get properties: Connection timed out" at increasing rates, until systemd becomes completely unresponsive.

System
openSUSE 42.1
systemd v210 (upgraded to v228 after the bug happened)
settings: 42.1 defaults + log-level debug, filter on pam-session output, some custom services that run with timers or continuously
Details
The bug starts to appear after about 2 weeks of continuous runtime on about 10% of our machines and seems to be independent of hardware or use.
With log-level=debug, each failed systemctl status produces "Event source 0x7df93e0 returned error, disabling: Connection timed out", and the log is full of such lines (maybe 10 per second). The log for one of the affected machines is at https://gist.githubusercontent.com/robinro/87f0b8889eabf6a1cbc2/raw/db375fdf8f34b76e9fd8ae52d22e635d634ea6d8/systemd.log
The machine was running 3 weeks with modest desktop use. The testjob (systemctl status dbus every 15 mins) first failed at 02:45:00; one can also search for postfix to find the line. Based on an IRC conversation with ohsix I tried the steps discussed at https://gist.github.com/ohsix/3b7cfd9ef8ef82477133. With v210 I got that s->type is 0. After the upgrade to v228 I had problems with the debug symbols and so far didn't get output.
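For anyone retracing those steps, the inspection looks roughly like this (the pointer is the one from the error message above; the type and description fields live in sd-event's internal struct sd_event_source, so check them against your exact source version):

    # With systemd debug symbols installed, inspect the event source
    # whose address the error message printed, then detach.
    gdb -p 1 -batch \
        -ex 'print ((sd_event_source *)0x7df93e0)->type' \
        -ex 'print ((sd_event_source *)0x7df93e0)->description'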
This bug may well be specific to our setup, as v210 is in wide distribution and I couldn't find anybody else with similar problems. Please let me know how best to debug this further.
This is also reported at https://bugzilla.opensuse.org/show_bug.cgi?id=958346