systemd-resolved fails on SERVFAIL instead of trying another upstream server #7147
Comments
See 201d995 for an explanation of why we cache SERVFAIL.
And here is a 12-year-old discussion of glibc ending up in the same situation, which resulted in the agreement to try the next server: https://bugzilla.redhat.com/show_bug.cgi?id=160914

From my point of view, even if the behaviour can be explained, it does not lead to a good end-user experience: I could not reach "google" because 1 of 4 upstream Windows DNS servers temporarily flaked out, and nothing recovered until systemd-resolved was restarted. I also tried flushing the cache with "systemd-resolve --flush-caches", but it still did not try the next server.

As I understand it, any upstream server may behave like this at any time, and the condition is usually temporary. Examples include out-of-memory conditions, query rate limits, maximum connection limits, and database access problems. The failure should not be cached indefinitely or for a long period of time. RFC 2308 (https://tools.ietf.org/html/rfc2308), section 7.1, restricts this to a period of 5 minutes. I am fairly sure that elsewhere in the RFCs a resolver that sees a SERVFAIL response is made responsible for sending the query to other servers.

In my particular case the upstream server sent SERVFAIL in response to every query, for every domain. Is this really the expected behaviour of a robust system?
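To make the RFC 2308 section 7.1 point concrete, here is a minimal sketch of a SERVFAIL cache that never holds an entry for longer than five minutes and keys each entry on the query name, type, class, and server address. The class name `ServfailCache` and its methods are hypothetical, chosen for illustration only; this is not how systemd-resolved implements its cache.

```python
import time

# RFC 2308 section 7.1: a cached server failure must not outlive 5 minutes.
MAX_SERVFAIL_TTL = 300.0


class ServfailCache:
    """Hypothetical SERVFAIL cache keyed on (name, type, class, server)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the behaviour can be tested
        self._expiry = {}     # key -> time at which the entry expires

    @staticmethod
    def _key(name, rtype, rclass, server):
        # DNS names are case-insensitive, so normalise before keying.
        return (name.lower(), rtype, rclass, server)

    def remember(self, name, rtype, rclass, server, ttl=MAX_SERVFAIL_TTL):
        # Clamp the lifetime so no caller can exceed the 5 minute cap.
        lifetime = min(ttl, MAX_SERVFAIL_TTL)
        self._expiry[self._key(name, rtype, rclass, server)] = self._clock() + lifetime

    def is_failed(self, name, rtype, rclass, server):
        key = self._key(name, rtype, rclass, server)
        expiry = self._expiry.get(key)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expiry[key]  # expired: forget it and allow a retry
            return False
        return True
```

Because the server address is part of the key, a failure recorded for 10.51.50.3 says nothing about 10.66.0.198, so the other servers remain eligible for the same query.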
@mistralol How did you fully flush the cache to recover from this?
Currently, we accept SERVFAIL after downgrading fully, cache it and move on. Let's extend this a bit: after downgrading fully, if the SERVFAIL logic continues to be an issue, then use a different DNS server if there are any. Fixes: systemd#7147
A fix is waiting in #7591. It would be greatly appreciated if you could test it against your server setup, if the servers are still showing this behaviour.
Submission type
systemd version the issue has been seen with
systemd --version
systemd 232
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN
Used distribution
Ubuntu
In case of bug report: Expected behaviour you didn't see
DNS lookup failure. systemd-resolved did not move on to the next DNS server on failure. It simply cached the result indefinitely from an upstream DNS server that returned SERVFAIL.
Running systemd-resolve --flush-caches also failed to correct the issue.
Only after restarting systemd-resolved was it able to resolve domain names correctly again.
In case of bug report: Unexpected behaviour you saw
SERVFAIL is returned from systemd-resolved. However, in this case 3 of the 4 name servers work and return the correct answer; the remaining one returns SERVFAIL. SERVFAIL is a soft error, so the resolver should move on to the next DNS server on its list. systemd-resolved fails to do this. NXDOMAIN, by contrast, is a hard failure and should be passed on to the client performing the DNS lookup.
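The behaviour I would expect can be sketched roughly as follows. This is only an illustration of the soft versus hard failure distinction, not systemd-resolved code: the function `resolve_with_failover` and the `query` callback are hypothetical names, and the callback is assumed to return an `(rcode, answer)` pair for a given server and name.

```python
# DNS RCODE values as assigned in RFC 1035.
NOERROR, SERVFAIL, NXDOMAIN = 0, 2, 3


def resolve_with_failover(name, servers, query):
    """Try each server in order and return (rcode, answer, server).

    SERVFAIL is treated as a soft error: move on to the next server.
    Any other response, including NXDOMAIN, is passed straight back.
    """
    last = (SERVFAIL, None, None)
    for server in servers:
        rcode, answer = query(server, name)
        if rcode == SERVFAIL:
            last = (rcode, None, server)  # remember it, but keep trying
            continue
        return rcode, answer, server
    # Every configured server failed, so SERVFAIL is all we can report.
    return last
```

With the configuration above, a SERVFAIL from 10.51.50.3 would simply cause the same query to be sent to one of the other three servers.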
DNS Config looks like
Link 3 (enp0s31f6)
Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
LLMNR setting: yes
MulticastDNS setting: no
DNSSEC setting: no
DNSSEC supported: no
DNS Servers: 10.66.0.198
             10.51.50.3
             10.51.4.61
             10.51.50.2
Manual lookup gets the following.
$ host www.google.co.uk 10.66.0.198
Using domain server:
Name: 10.66.0.198
Address: 10.66.0.198#53
Aliases:
www.google.co.uk has address 216.58.209.131
www.google.co.uk has IPv6 address 2a00:1450:400f:804::2003
$ host www.google.co.uk 10.51.50.3
Using domain server:
Name: 10.51.50.3
Address: 10.51.50.3#53
Aliases:
Host www.google.co.uk not found: 2(SERVFAIL)
$ host www.google.co.uk 10.51.4.61
Using domain server:
Name: 10.51.4.61
Address: 10.51.4.61#53
Aliases:
www.google.co.uk has address 172.217.20.99
www.google.co.uk has IPv6 address 2a00:1450:4007:80c::2003
$ host www.google.co.uk 10.51.50.2
Using domain server:
Name: 10.51.50.2
Address: 10.51.50.2#53
Aliases:
www.google.co.uk has address 216.58.192.35
www.google.co.uk has IPv6 address 2607:f8b0:4008:805::2003
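For anyone who wants to repeat the per-server check above without typing each command, a small script along these lines could run `host` against every configured server and summarise the result. It is a rough sketch that assumes the `host` utility is installed and that its output has the same shape as shown above; the helper names `classify_host_output` and `check_servers` are made up for this example.

```python
import re
import subprocess

# Matches failure lines such as "Host ... not found: 2(SERVFAIL)".
_FAILURE = re.compile(r"not found: \d+\((\w+)\)")


def classify_host_output(text):
    """Reduce the output of `host` to a short status string."""
    match = _FAILURE.search(text)
    if match:
        return match.group(1)  # e.g. "SERVFAIL" or "NXDOMAIN"
    if " has address " in text or " has IPv6 address " in text:
        return "NOERROR"
    return "UNKNOWN"


def check_servers(name, servers):
    """Run `host NAME SERVER` for each server and classify the output."""
    results = {}
    for server in servers:
        proc = subprocess.run(
            ["host", name, server],
            capture_output=True,
            text=True,
            check=False,  # host exits non-zero on lookup failure
        )
        results[server] = classify_host_output(proc.stdout + proc.stderr)
    return results
```

Calling `check_servers("www.google.co.uk", ["10.66.0.198", "10.51.50.3", "10.51.4.61", "10.51.50.2"])` returns a dictionary mapping each server address to its status, which makes the one misbehaving server easy to spot.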
In case of bug report: Steps to reproduce the problem
With multiple DNS servers configured, one of which is starting up and returning SERVFAIL, run:
$ host www.google.co.uk
Restarting systemd-resolved immediately resolves the error, provided it selects a different upstream DNS server on restart.