systemd-resolve general protection fault #18427
Comments
Additional error messages generated since the original bug was posted a few hours ago:
Please install the dbgsym package for systemd, then generate a backtrace either by installing systemd-coredump or by running systemd-resolved directly under gdb. https://wiki.debian.org/HowToGetABacktrace has a few hints.
@mbiebl - Sure thing, no problem. I've installed |
@poettering I'm using the
Since restarting the service to capture a crash dump, I haven't seen any crashes yet. Is there a DNS query test case that reproduces those crash bugs?
Yeah, I was planning to do that today, but I got stuck on some other issue.
That sounds wrong. We have had this bug for a while in Fedora, with many many people being hit, and generally the result is that resolved crashes and is immediately restarted. Apart from the interrupted query and noise in the logs, this isn't visible to users. The daemon getting "stuck" after a crash is unexpected. |
Somewhat surprisingly, I haven't had any systemd-resolved crashes since enabling the systemd-coredump support as requested. |
I've now had another set of crashes with an upgraded systemd package (
More of these in dmesg:
I've attached the output of Is there other information that would be useful for me to collect? |
There are three asserts in
Your backtrace has only function names and addresses which would require too much work to resolve for non-debian people. For one of the cases with SIGABRT, not SIGSEGV, please attach gdb and print a backtrace:
$ sudo coredumpctl debug <pid> [tab completion of the pid should work here]
...
(gdb) bt full
...
(gdb) up 3
(gdb) print *q
Here's pid 447:
For completeness, and before I clear the corefiles, I wanted to share at least a small amount of the information they contain. The two SIGSEGV corefiles both look like crashes that follow a timeout on a query. The query is routed over an upstream TLS resolver, as indicated in my configuration for systemd-resolved.
I was thinking of trying to capture the network traffic that leads to either a SIGABRT or a SIGSEGV in systemd-resolved. However, after looking at another corefile (SIGSEGV), I have the impression that it would be non-trivial to capture interesting packets in a pcap: I think both bugs are triggered by a query (or the lack of a response?) inside a TLS session, rather than by a query or an answer on the UDP listener on my local network segment. Does that seem to be the case to anyone else? If the trigger is inside an encrypted connection, is there a systemd way to log the TLS session keys so the TLS connection can be decrypted? Part of me hopes the answer is no, but if it is possible, I'm willing to try something else to track these issues down.
We generally operate on the assumption that a source is "gone" as soon as we unref it. This is generally true because we have the only reference. But if something else holds the reference, our unref doesn't really stop the source and it could fire again. In particular, on_query_timeout() is called with DnsQuery* as userdata, and it calls dns_query_stop() which invalidates that pointer. If it was ever called again, we'd be accessing already-freed memory. I don't see what would hold the reference. sd-event takes a temporary reference, but on the sd_event object, not on the individual sources. And our sources are non-floating, so there is no reference from the sd_event object to the sources. For systemd#18427.
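To illustrate the hazard described in the commit message above, here is a minimal toy model in C. It does not use sd-event: `ToySource`, `ToyQuery`, `toy_source_unref()`, `toy_source_disable_unref()` and `simulate_stale_calls()` are all hypothetical names invented for this sketch, and the "extra reference" is simply assumed to exist, since the thread never establishes who would hold one. It only shows why dropping a reference does not stop a source that someone else still references, while disabling it first does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a reference-counted event source (NOT sd_event_source). */
typedef struct ToySource {
    unsigned n_ref;
    bool enabled;
    void (*callback)(void *userdata);
    void *userdata;
} ToySource;

/* Toy stand-in for the query that owns a timeout source. */
typedef struct ToyQuery {
    ToySource *timeout_source;
    bool stopped;          /* set once the query has been torn down */
    unsigned stale_calls;  /* callbacks that ran after teardown */
} ToyQuery;

static ToySource *toy_source_new(void (*cb)(void *), void *userdata) {
    ToySource *s = calloc(1, sizeof *s);
    if (!s)
        abort();
    s->n_ref = 1;
    s->enabled = true;
    s->callback = cb;
    s->userdata = userdata;
    return s;
}

/* Drop one reference; the source is only freed when the count hits zero. */
static ToySource *toy_source_unref(ToySource *s) {
    if (!s)
        return NULL;
    assert(s->n_ref > 0);
    if (--s->n_ref == 0)
        free(s);
    return NULL;
}

/* Disable first, so the callback cannot fire even if others hold references. */
static ToySource *toy_source_disable_unref(ToySource *s) {
    if (s)
        s->enabled = false;
    return toy_source_unref(s);
}

/* What the event loop would do: run the callback only if still enabled. */
static bool toy_source_dispatch(ToySource *s) {
    if (!s->enabled)
        return false;
    s->callback(s->userdata);
    return true;
}

static void on_timeout(void *userdata) {
    ToyQuery *q = userdata;
    /* In real code the query would already be freed here: use-after-free. */
    if (q->stopped)
        q->stale_calls++;
}

/* Returns how many times the callback ran after the query was stopped. */
unsigned simulate_stale_calls(bool disable_first) {
    ToyQuery q = {0};
    ToySource *extra;

    q.timeout_source = toy_source_new(on_timeout, &q);

    /* Assumption for the sketch: something else took a second reference. */
    extra = q.timeout_source;
    extra->n_ref++;

    /* The query is stopped and releases its own reference. */
    q.stopped = true;
    q.timeout_source = disable_first
        ? toy_source_disable_unref(q.timeout_source)
        : toy_source_unref(q.timeout_source);

    /* The source is still alive through the extra reference and may fire. */
    toy_source_dispatch(extra);
    toy_source_unref(extra);

    return q.stale_calls;
}
```

In this model `simulate_stale_calls(false)` reports one callback after teardown, while `simulate_stale_calls(true)` reports none. sd-event does offer `sd_event_source_disable_unref()` for this kind of pattern, but whether the actual fix looks like this is not shown in the thread.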
I'm pretty sure this is fixed by #18832. |
We generally operate on the assumption that a source is "gone" as soon as we unref it. This is generally true because we have the only reference. But if something else holds the reference, our unref doesn't really stop the source and it could fire again. In particular, on_query_timeout() is called with DnsQuery* as userdata, and it calls dns_query_stop() which invalidates that pointer. If it was ever called again, we'd be accessing already-freed memory. I don't see what would hold the reference. sd-event takes a temporary reference, but on the sd_event object, not on the individual sources. And our sources are non-floating, so there is no reference from the sd_event object to the sources. For systemd#18427. (cherry picked from commit 9793530)
We generally operate on the assumption that a source is "gone" as soon as we unref it. This is generally true because we have the only reference. But if something else holds the reference, our unref doesn't really stop the source and it could fire again. In particular, on_query_timeout() is called with DnsQuery* as userdata, and it calls dns_query_stop() which invalidates that pointer. If it was ever called again, we'd be accessing already-freed memory. I don't see what would hold the reference. sd-event takes a temporary reference, but on the sd_event object, not on the individual sources. And our sources are non-floating, so there is no reference from the sd_event object to the sources. For systemd#18427. (cherry picked from commit 9793530) (cherry picked from commit 78a43c3)
systemd version the issue has been seen with
systemd 247 (247.2-5)
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified
Used distribution
Debian GNU/Linux Bullseye
Linux kernel version used (uname -a)
Linux apu4d4 5.10.0-1-amd64 #1 SMP Debian 5.10.4-1 (2020-12-31) x86_64 GNU/Linux
CPU architecture issue was seen on
x86_64
Expected behaviour you didn't see
systemd-resolved should stay running and automatically repair itself so DNS services do not need a manual restart.
Unexpected behaviour you saw
systemd-resolved stopped replying to clients on the local LAN.
The kernel message log had the following entries:
Steps to reproduce the problem
The /etc/systemd/resolved.conf file:
With this configuration, systemd-resolved requires a near-daily restart (systemctl restart systemd-resolved) after it becomes unresponsive. Today was the first time that I observed log entries that appear to have been created around the same time that the resolved daemon became unresponsive to local LAN queries.