timeouts, event source returned error #2200

Closed
robinro opened this Issue Dec 20, 2015 · 12 comments

robinro commented Dec 20, 2015

Problem

About 2 weeks after the last reboot, systemctl status starts to fail with Failed to get properties: Connection timed out, at an increasing rate, until systemd becomes completely unresponsive.

System

openSUSE 42.1
systemd v210 (upgraded to v228 after the bug happened)
settings: 42.1 defaults + log-level debug, filter on pam-session output, some custom services that run with timers or continuously

Details

The bug starts to appear after about 2 weeks of continuous runtime, on about 10% of our machines, and seems to be independent of hardware or usage.
With log-level=debug, each failed systemctl status produces Event source 0x7df93e0 returned error, disabling: Connection timed out. The log is full of messages like these (maybe 10 per second):

Dec 20 20:51:28 hostXXX systemd[1]: Got notification message for unit systemd-logind.service
Dec 20 20:51:28 hostXXX systemd[1]: systemd-logind.service: Got notification message from PID 8293 (WATCHDOG=1...)
Dec 20 20:51:28 hostXXX systemd[1]: systemd-logind.service: got WATCHDOG=1
Dec 20 20:51:28 hostXXX systemd[1]: Got notification message for unit systemd-journald.service
Dec 20 20:51:28 hostXXX systemd[1]: systemd-journald.service: Got notification message from PID 1669 (WATCHDOG=1...)
Dec 20 20:51:28 hostXXX systemd[1]: systemd-journald.service: got WATCHDOG=1

The log for one of the affected machines is at https://gist.githubusercontent.com/robinro/87f0b8889eabf6a1cbc2/raw/db375fdf8f34b76e9fd8ae52d22e635d634ea6d8/systemd.log
The machine had been running for 3 weeks with modest desktop use. The test job (systemctl status dbus every 15 minutes) first failed at 02:45:00; one can also search for postfix to find the line.

Based on an IRC with ohsix I tried the steps discussed at https://gist.github.com/ohsix/3b7cfd9ef8ef82477133. With v210 I got that s->type is 0. After the upgrade to v228 I had problems with the debug symbols and so far didn't get output.

This bug may be specific to our setup, since v210 is widely distributed and I couldn't find anybody else with similar problems. Please let me know how best to debug this further.

This is also reported at https://bugzilla.opensuse.org/show_bug.cgi?id=958346

Owner

zonque commented Dec 21, 2015

Strange that nobody else reported this. Unfortunately, the event source that reported the error does not have a description set, which makes it hard to guess which one it is that's failing.

You could apply the patch in #2204 and see what the error message says. Another, probably better option is to install the debug symbols, attach gdb to PID 1, add a breakpoint in sd-event.c:2289 (for v228, that should be correct), and then wait for the event to occur. Enter p s->io.callback to resolve the address of the source handler. That should give us a hint where to dig further.
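Roughly, the gdb session would look like this (just a sketch; the exact line number is the one mentioned above and will differ between versions/builds):

gdb --pid=1
# at the (gdb) prompt:
b sd-event.c:2289
c
# ... once the breakpoint fires:
p s->io.callback
detach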

zonque added the sd-event label Dec 21, 2015

robinro commented Dec 21, 2015

Thanks @zonque for the suggestions. The fact that nobody else has reported this really is weird. Our setup is rather ordinary, so my best guess is that the bug is non-obvious because of the long time it takes to show up. Maybe we trigger it faster thanks to some */15 timers that most people don't have.

The earlier upgrade to v228 was not complete because the daemon-reexec didn't succeed; I only realized this when trying to run gdb. Running daemon-reexec 10 times eventually made it work, and with v228 actually running the "Event source" errors are gone, as are all the timeouts. So upgrading to v228 (assuming daemon-reexec succeeds at some point) seems to be a workaround for the issue.

I now have gdb attached so I can capture the state when the "Event source" errors are printed.
For others trying to do the same, it's as simple as

zypper ar http://download.opensuse.org/debug/distribution/leap/42.1/repo/oss/ debug
zypper in systemd-debug{info,source}
gdb --pid=1
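# the next two commands are typed at the (gdb) prompt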
b sd-event.c:1924
c

Unfortunately I accidentally killed systemd on one of the systems (debugging PID 1 needs more care than the gdb use I'm accustomed to), so now I only have 2 systems with a broken systemd that I can run gdb on. Those only show the errors about twice per day so far and I haven't caught one yet. I'll post an update as soon as I get anything non-trivial out of it.

In the log there are also lines like systemd[1]: Failed to send unit change signal for session-3292.scope: No buffer space available. Where do those come from? The first half is in dbus-unit.c, but the actual error comes from somewhere outside, in dbus?
The only occurrence in dbus I found is in dbus/dbus-sysdeps-win.c, which doesn't sound like something that should be running here.

robinro commented Dec 23, 2015

Since hooking gdb up to PID 1, the errors and timeouts haven't shown up again. I'll keep you updated if they return.

Contributor

fbuihuu commented Dec 23, 2015

I took a look at the log you initially posted and there are some weird things happening in it.

First, systemd-journald was killed because it didn't send back the "keep-alive ping" to PID 1 in time. It would be interesting to understand why it got stuck. You can grep for the string " watchdog timeout!" in your log and you will see that it happens a couple of times. systemd-logind got restarted for the same reason; I'm not sure restarting logind is supported, though. The first occurrence is:

Dec 19 22:54:09 hostXXX systemd[1]: systemd-journald.service watchdog timeout!

After journald was restarted, it stopped logging at "Dec 19 22:59:21" and only resumed 19 minutes later:

Dec 19 22:59:21 hostXXX systemd[1]: Got notification message for unit systemd-logind.service
Dec 19 23:18:14 hostXXX systemd-journal[29620]: Permanent journal is using 640.0M (max allowed 4.0G, trying to leave 4.0G free of 32.2G available → current limit 4.0G).

so there is a big "hole" in your log; maybe it's related to the restart of journald.

Shortly after journald resumed, the OOM killer was triggered:

Dec 19 23:18:15 hostXXX kernel: qmgr invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
...
Dec 19 23:18:15 hostXXX kernel: Out of memory: Kill process 22270 (scan) score 410 or sacrifice child
Dec 19 23:18:15 hostXXX kernel: Killed process 22270 (scan) total-vm:5394560kB, anon-rss:1139176kB, file-rss:0kB

So your system seems to run out of memory (due to the "scan" process?) after some time, which causes all of these weird behaviours.

Contributor

fbuihuu commented Dec 23, 2015

First, systemd-journald was killed because it didn't send back the "keep-alive ping" to PID 1 in time. It would be interesting to understand why it got stuck.

Probably the same behaviour as the one described in issue #1353.

Owner

poettering commented Dec 23, 2015

We only track bugs in the most recent upstream versions of systemd. Given that you say this does not occur in v228, I will hence close this bug now.

poettering closed this Dec 23, 2015

Contributor

mbiebl commented Dec 24, 2015

@poettering hm, the bug reporter said "After the upgrade to v228 I had problems with the debug symbols and so far didn't get output." That doesn't quite sound to me like it necessarily didn't happen under v228.

Member

evverx commented Dec 24, 2015

BTW:

Based on an IRC with ohsix I tried the steps discussed at https://gist.github.com/ohsix/3b7cfd9ef8ef82477133

See https://gist.github.com/ohsix/3b7cfd9ef8ef82477133#gistcomment-1656622

you'll need debug symbols, various ways to do that, on fedora dnf debuginfo-install package will install pretty much everything required for a specific package

It's not easy on Ubuntu.
See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1039998/comments/12
See also: https://gist.github.com/evverx/bd0a6748f3c6254d2021#perf-probe

robinro commented Dec 28, 2015

@fbuihuu
Thanks for the detailed look at the logs.

The out-of-memory event was some computational job that got out of control. I believe this is unrelated to the systemd issue; the other affected machines haven't had any out-of-memory events since their last reboot. In any case, the OOM kill only affected the scan program and should not harm anything else, especially not kill systemd/journald/...

#1353 sounds like a possibility. If I see it correctly, the watchdog runs every 20 seconds here; maybe under some unexpected load that is not sufficient. We use swap extensively, so our typical usage is similar to #1353. On the other hand, one system was completely idle and still died. But maybe there was some NFS issue at the time, so it still hung for more than 20 seconds.
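For reference, the configured watchdog timeout can be read off the unit with something like:

systemctl show -p WatchdogUSec systemd-journald.service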

robinro commented Dec 28, 2015

@poettering @mbiebl
I'm not sure the bug behind this problem is fixed in systemd v228.
What I know so far:

  • with v210 running, systemd dies on our machines; starting about 2 weeks into their runtime it hits a couple of them every week
  • there are different stages of 'dies': at first it starts throwing timeouts on some interactions (we now monitor the answer of systemctl status, see the sketch after this list), then this escalates to no answer to anything at all. In the early stages one can still run systemctl daemon-reexec, at least after a couple of tries; at the later stage this also times out
  • in the early stage, an upgrade to v228 gets rid of the timeouts. I'm not sure whether that is because it fixes the root of the problem or because it "empties the queue" or whatever else is causing the timeouts.
  • in the late stage of the bug, an upgrade of the running system is not possible, because daemon-reexec fails. The only way I know out of this phase is a hard reboot (via sysrq).
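For reference, the monitoring check is essentially just a timeout-wrapped status query; the unit and the 10-second limit below are illustrative, not our exact script:

# alert when PID 1 stops answering within a reasonable time
timeout 10 systemctl status dbus.service > /dev/null || echo "systemd unresponsive on $(hostname)"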

Because the problem only shows up after a long time and across many machines, I can't really test v228 without deploying it on all machines, which is unrealistic for us at the moment.

robinro commented Dec 28, 2015

I have a new guess at the issue. Pretty much the only non-standard thing we do with systemd is two custom timers: one calls a oneshot service once a day and conflicts with the other. After reading #1981 I thought it possible that the Conflicts= setting in combination with timers might trigger something similar. Sadly, so far I have not been able to produce a timer/conflict setup that triggers the bug. Nevertheless I cleaned up the timers, and we'll see whether that gets rid of future timeouts.

robinro commented Jan 4, 2016

It really was the timers. At least we haven't had a system failure since I changed them.

Sadly I was not able to get a reproducible setup that quickly segfaults systemd. My best guess is that it's somewhere in the handling of conflicting timers.

The setup that broke systemd was:

  • job1
    • service: typical runtime 3 seconds, might go up to 30 minutes under load
    • timer: OnBootSec=10min, OnUnitActiveSec=15min
  • job2
    • service: oneshot, TimeoutStartSec=300, typical runtime 30 seconds, can go up to several minutes (until the timeout), Conflicts=job1.service
    • timer: OnBootSec=34min, OnUnitActiveSec=24h

The setup has multiple flaws: OnUnitActiveSec= with a oneshot service makes the start time drift through the day (the runtime is added to the previous start time). The roughly two weeks it took for the bug to appear was the time it took job2 to drift into starting simultaneously with job1. The Conflicts= is ugly because it kills the other service. As a fix we now use OnCalendar= and schedule the jobs so that they don't overlap.
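The replacement scheduling looks roughly like this (a sketch with illustrative times; the point is only that job2 now has a fixed daily slot instead of one that drifts into job1's 15-minute grid):

# job1.timer (sketch)
[Timer]
OnCalendar=*:0/15

# job2.timer (sketch; fixed daily time in a window where job1 is normally quick)
[Timer]
OnCalendar=*-*-* 03:40:00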

Running the timers ridiculously often and simultaneously was not enough to trigger the bug, so maybe some other condition is needed as well.

Thanks to everyone who helped with this somewhat dubious bug. I learned a lot about debugging systemd in the process and now feel confident running gdb on PID 1 😀
