systemd-resolved: sporadic DNSSEC «missing-key» failures for signed FQDNs #12545

Open
toreanderson opened this issue May 12, 2019 · 1 comment
@toreanderson toreanderson commented May 12, 2019

systemd version the issue has been seen with

systemd-241-8.git9ef65cb.fc30.x86_64

Used distribution

Fedora 30

Expected behaviour you didn't see

Consistent successful lookups of DNSSEC-signed FQDNs

Unexpected behaviour you saw

Sporadic lookup failures of DNSSEC-signed FQDNs due to «missing-key»

Steps to reproduce the problem

Start out with a /etc/systemd/resolved.conf that enables DNSSEC and specifies upstream DNS servers that are known to support DNSSEC (note that I can easily reproduce the issue with other DNS servers too, so this does not appear to be an upstream DNS server issue):

[Resolve]
DNS=1.0.0.1 1.1.1.1
DNSSEC=yes

For the record, here's the output from resolvectl status with the above configuration.

Restart systemd-resolved and make two queries for an FQDN known to have valid DNSSEC signatures. In this case, I'm using ring.nlnog.net, which has perfect test results both on Verisign Labs's DNSSEC Analyzer and on DNSViz.

(The issue occurs with other unrelated FQDNs as well, e.g., www.jernia.no. However, not all FQDNs suffer from the issue - I've been unable to reproduce the issue for www.ripe.net, for example.)

Observe how the first query tends to fail with «resolve call failed: DNSSEC validation failed: missing-key», while the subsequent query tends to work fine:

$ date; sudo systemctl restart systemd-resolved.service; sleep 3; resolvectl query ring.nlnog.net; sleep 3; resolvectl query ring.nlnog.net     
su. 12. mai 10:21:49 +0200 2019
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
ring.nlnog.net: 95.211.149.24                  -- link: wlp2s0

-- Information acquired via protocol DNS in 279.3ms.
-- Data is authenticated: yes

I am attaching a PCAP file containing the traffic between the upstream DNS server and systemd-resolved captured during the above console output. Frames 1-16 are from the initial failing query issued at 10:21:52, while frames 17-24 are from the successful subsequent query at 10:21:56. I am also attaching debug-level journal output from systemd-resolved taken in the same time span.

Note that the first lookup does not always fail, so a few attempts might be necessary to reproduce.

The non-debug log messages from systemd-resolved vary slightly from one failed query to the next. Sometimes it logs four lines, like so:

mai 12 10:41:42 sloth.fud.no systemd-resolved[15103]: DNSSEC validation failed for question nlnog.net IN DNSKEY: missing-key
mai 12 10:41:42 sloth.fud.no systemd-resolved[15103]: DNSSEC validation failed for question ring.nlnog.net IN DS: missing-key
mai 12 10:41:42 sloth.fud.no systemd-resolved[15103]: DNSSEC validation failed for question ring.nlnog.net IN DNSKEY: missing-key
mai 12 10:41:42 sloth.fud.no systemd-resolved[15103]: DNSSEC validation failed for question ring.nlnog.net IN A: missing-key

Other times, it will only log two lines, like so:

mai 12 10:43:15 sloth.fud.no systemd-resolved[15591]: DNSSEC validation failed for question ring.nlnog.net IN DNSKEY: missing-key
mai 12 10:43:15 sloth.fud.no systemd-resolved[15591]: DNSSEC validation failed for question ring.nlnog.net IN A: missing-key

This can also be demonstrated by issuing 100 queries in rapid succession after flushing the caches. Typically I get 0, 1 or 2 failures initially; the remaining queries succeed.

$ sudo resolvectl flush-caches; for i in {1..100}; do resolvectl query ring.nlnog.net >/dev/null && echo OK || echo FAIL; done | uniq -c
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
      1 FAIL
     99 OK
$ sudo resolvectl flush-caches; for i in {1..100}; do resolvectl query ring.nlnog.net >/dev/null && echo OK || echo FAIL; done | uniq -c
    100 OK
$ sudo resolvectl flush-caches; for i in {1..100}; do resolvectl query ring.nlnog.net >/dev/null && echo OK || echo FAIL; done | uniq -c
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
      1 FAIL
     99 OK
$ sudo resolvectl flush-caches; for i in {1..100}; do resolvectl query ring.nlnog.net >/dev/null && echo OK || echo FAIL; done | uniq -c
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
      2 FAIL
     98 OK

However, if I first disable caching by setting Cache=no in /etc/systemd/resolved.conf, the results are much more erratic:

$ for i in {1..100}; do resolvectl query ring.nlnog.net >/dev/null && echo OK || echo FAIL; done | uniq -c
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
      3 FAIL
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
      2 OK
      1 FAIL
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
      4 OK
      1 FAIL
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
      2 OK
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
      5 FAIL
ring.nlnog.net: resolve call failed: DNSSEC validation failed: missing-key
      1 OK
[...]

It would seem that caching successfully masks the failures except for the first ones.
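
Because `uniq -c` only groups consecutive results, the overall failure rate in the Cache=no run is hard to read off. The following is a minimal sketch of a helper that totals the results instead. The function name `probe_dnssec` and its arguments are hypothetical, not part of systemd, and it assumes `resolvectl` is on the PATH:

```shell
# probe_dnssec NAME [COUNT]
# Hypothetical helper: runs COUNT lookups of NAME through systemd-resolved
# and prints a single summary line with the number of successes and failures.
probe_dnssec() {
    name=$1
    count=${2:-100}   # default to 100 lookups, as in the loops above
    ok=0
    fail=0
    i=0
    while [ "$i" -lt "$count" ]; do
        # resolvectl exits non-zero when the lookup fails,
        # e.g. on "DNSSEC validation failed: missing-key".
        if resolvectl query "$name" >/dev/null 2>&1; then
            ok=$((ok + 1))
        else
            fail=$((fail + 1))
        fi
        i=$((i + 1))
    done
    echo "name=$name ok=$ok fail=$fail total=$count"
}
```

Calling `probe_dnssec ring.nlnog.net 100` once with caching enabled (after `sudo resolvectl flush-caches`) and once with Cache=no should make the difference between the two setups directly comparable.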

@kpfleming kpfleming commented Jul 18, 2020

I'm seeing this too, on only one machine, using systemd 241-7~deb10u4 (Debian Buster).

In my case the name being resolved is repo.powerdns.com. The systemd-resolved journal shows missing-key entries for "IN DNSKEY powerdns.com", "IN A repo.powerdns.com", and "IN AAAA repo.powerdns.com". The system has PowerDNS Recursor 4.3.2-1pdns.buster running on 127.0.0.1 and systemd-resolved is configured to use it. Sending the same queries directly to pdns-recursor using 'dig' produces the correct results.

I can't get any queries for this name to resolve without producing the 'missing-key' result. As a workaround I've stopped using the systemd-resolved stub resolver on this system.
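
For anyone wanting to repeat this comparison, here is a rough sketch of the two lookups side by side. The helper name `compare_lookup` is hypothetical, and the sketch assumes `dig` and `resolvectl` are both installed; it is not necessarily the exact set of commands I ran:

```shell
# compare_lookup NAME SERVER
# Hypothetical helper: queries SERVER directly with dig, then asks
# systemd-resolved for the same name, so the two results can be compared.
compare_lookup() {
    name=$1
    server=$2
    echo "== dig @$server (direct to the upstream resolver) =="
    # +dnssec sets the DO bit so that RRSIG records are included in the answer.
    dig +dnssec "@$server" "$name" A
    echo "== resolvectl (via systemd-resolved) =="
    # Report the exit status if the lookup through systemd-resolved fails.
    resolvectl query "$name" || echo "resolvectl exited with status $?"
}
```

For the case described above, that would be `compare_lookup repo.powerdns.com 127.0.0.1`. If the dig half returns signed records while the resolvectl half reports missing-key, the upstream resolver is probably not at fault.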
