systemd-resolved fails on SERVFAIL instead of trying another upstream server #7147

Closed
mistralol opened this issue Oct 20, 2017 · 4 comments
Labels
resolve RFE 🎁 Request for Enhancement, i.e. a feature request

@mistralol
Submission type

  • Bug report

systemd version the issue has been seen with

systemd --version
systemd 232
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN

Used distribution

Ubuntu

In case of bug report: Expected behaviour you didn't see

A working DNS lookup. systemd-resolved did not move on to the next DNS server on failure; it simply cached the failure indefinitely from an upstream DNS server that returned SERVFAIL.

Running systemd-resolve --flush-caches also failed to correct the issue.

Only after restarting systemd-resolved was it able to correctly resolve domain names again.
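
For reference, a minimal sketch of the recovery steps described above (assuming the commands shipped with systemd 232 on Ubuntu; exact output may differ between versions):

$ systemd-resolve --flush-caches            # clearing the cache alone did not help here
$ sudo systemctl restart systemd-resolved   # only a restart recovered name resolution
$ host www.google.co.uk                     # re-test the lookup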

In case of bug report: Unexpected behaviour you saw

SERVFAIL is returned from systemd-resolved. In this case 3 of the 4 name servers work and return the correct answer; the 4th returns SERVFAIL. SERVFAIL is a soft error, so the resolver should move on to the next DNS server on its list, but systemd-resolved fails to do this. NXDOMAIN, by contrast, is a hard failure and should be passed on to the client performing the DNS lookup.

The DNS config looks like this:

Link 3 (enp0s31f6)
      Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: no
    DNSSEC supported: no
         DNS Servers: 10.66.0.198
                      10.51.50.3
                      10.51.4.61
                      10.51.50.2

Manual lookups against each server give the following:

$ host www.google.co.uk 10.66.0.198
Using domain server:
Name: 10.66.0.198
Address: 10.66.0.198#53
Aliases:

www.google.co.uk has address 216.58.209.131
www.google.co.uk has IPv6 address 2a00:1450:400f:804::2003

$ host www.google.co.uk 10.51.50.3
Using domain server:
Name: 10.51.50.3
Address: 10.51.50.3#53
Aliases:

Host www.google.co.uk not found: 2(SERVFAIL)

$ host www.google.co.uk 10.51.4.61
Using domain server:
Name: 10.51.4.61
Address: 10.51.4.61#53
Aliases:

www.google.co.uk has address 172.217.20.99
www.google.co.uk has IPv6 address 2a00:1450:4007:80c::2003

$ host www.google.co.uk 10.51.50.2
Using domain server:
Name: 10.51.50.2
Address: 10.51.50.2#53
Aliases:

www.google.co.uk has address 216.58.192.35
www.google.co.uk has IPv6 address 2607:f8b0:4008:805::2003

In case of bug report: Steps to reproduce the problem

With one of multiple configured DNS servers still starting up (and therefore returning SERVFAIL):

$ host www.google.co.uk

Host www.google.co.uk not found: 2(SERVFAIL)

Restarting systemd-resolved instantly resolves the error if it selects a different upstream DNS server on restart.
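
Purely as an illustration of the expected fallback behaviour (this is not how systemd-resolved is implemented internally), the per-server lookups above can be chained in a shell loop that moves on to the next configured server whenever one fails, e.g. with SERVFAIL; the addresses are the ones from the link configuration shown earlier:

$ for srv in 10.66.0.198 10.51.50.3 10.51.4.61 10.51.50.2; do
>   if host www.google.co.uk "$srv" >/dev/null 2>&1; then
>     echo "answer obtained from $srv"; break
>   fi
>   echo "$srv failed (e.g. SERVFAIL), trying the next server"
> done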

@poettering
Member

See 201d995 for an explanation of why we cache SERVFAIL.

@poettering added the resolve and RFE 🎁 Request for Enhancement, i.e. a feature request labels Oct 23, 2017
@mistralol
Author

And this 12-year-old discussion around glibc, which ended up in the same situation and resulted in the agreement to try the next server: https://bugzilla.redhat.com/show_bug.cgi?id=160914

From my viewpoint, even though the behaviour can be explained, it does not make for a good end-user experience to be unable to reach "google" because 1 of 4 upstream Windows DNS servers temporarily flaked out, with no recovery short of restarting systemd-resolved. I also tried flushing the cache with "systemd-resolve --flush-caches", but again it did not try the next server.

In my understanding, any upstream server may behave like this at any time, and the condition may be temporary. Examples would include: out-of-memory limits, query rate limits, maximum connection limits, database access problems.

It should not cache this failure indefinitely, or even for a long period of time. RFC 2308 (https://tools.ietf.org/html/rfc2308) restricts this to a period of 5 minutes in section 7.1. I am quite sure that elsewhere in the RFCs, when a resolver sees a SERVFAIL response, it is responsible for sending queries to the other servers.

In my particular case the upstream server actually sent a SERVFAIL for every query, for every domain. Do you really consider this the expected behaviour of a robust system?

@the-maldridge

@mistralol How did you fully flush the cache to recover from this?

poettering added a commit to poettering/systemd that referenced this issue Dec 8, 2017
Currently, we accept SERVFAIL after downgrading fully, cache it and move
on. Let's extend this a bit: after downgrading fully, if the SERVFAIL
logic continues to be an issue, then use a different DNS server if there
are any.

Fixes: systemd#7147
@poettering
Member

Fix waiting in #7591. It would be greatly appreciated if you could test this against your server setup, if it is still showing this behaviour.
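
A possible way to test this is sketched below; it assumes the patched systemd-resolved is installed and that one of the configured upstreams still returns SERVFAIL. The expectation under the patch is that resolved switches to another configured server instead of caching the failure:

$ sudo systemctl restart systemd-resolved
$ systemd-resolve www.google.co.uk      # query through resolved itself
$ systemd-resolve --status              # show the per-link DNS server configuration
$ journalctl -u systemd-resolved -n 50  # look for messages about switching servers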

poettering added a commit to poettering/systemd that referenced this issue Dec 12, 2017, with the same commit message as above.