
optionally, begin every transaction with first DNS server in list instead of last working one #5755

Open
1 of 2 tasks
diego-treitos opened this issue Apr 18, 2017 · 131 comments
Labels
resolve RFE 🎁 Request for Enhancement, i.e. a feature request

Comments

diego-treitos commented Apr 18, 2017

Submission type

  • Bug report
  • Request for enhancement (RFE)

NOTE: Do not submit anything other than bug reports or RFEs via the issue tracker!

systemd version the issue has been seen with

Version 232

NOTE: Do not submit bug reports about anything but the two most recently released systemd versions upstream!

Used distribution

Ubuntu 17.04

In case of bug report: Expected behaviour you didn't see

When I have two nameservers like:

192.168.0.1
8.8.8.8

Defined in that order in /etc/resolv.conf, I would expect the same behaviour as with plain resolv.conf: first use 192.168.0.1 and, if for some reason it is not available, fall back to 8.8.8.8.

Instead, I am seeing that systemd-resolved switches nameservers seemingly at random:

Apr 18 16:40:01 boi systemd-resolved[1692]: Switching to DNS server 8.8.8.8 for interface eth0.
Apr 18 16:40:01 boi systemd-resolved[1692]: Switching to DNS server 192.168.0.1 for interface eth0.
Apr 18 16:40:06 boi systemd-resolved[1692]: Switching to DNS server 8.8.8.8 for interface eth0.
Apr 18 16:40:06 boi systemd-resolved[1692]: Switching to DNS server 192.168.0.1 for interface eth0.
Apr 18 16:40:11 boi systemd-resolved[1692]: Switching to DNS server 8.8.8.8 for interface eth0.
Apr 18 16:40:16 boi systemd-resolved[1692]: Switching to DNS server 192.168.0.1 for interface eth0.
Apr 18 16:40:16 boi systemd-resolved[1692]: Switching to DNS server 8.8.8.8 for interface eth0.
Apr 18 16:40:21 boi systemd-resolved[1692]: Switching to DNS server 192.168.0.1 for interface eth0.
Apr 18 19:16:09 boi systemd-resolved[1692]: Switching to DNS server 8.8.8.8 for interface eth0.

In case of bug report: Unexpected behaviour you saw

Random nameserver use

In case of bug report: Steps to reproduce the problem

Just have two nameservers configured and use the systemd-resolved service.

poettering (Member) commented Apr 24, 2017

resolved will always begin with the first configured DNS server, and switch to any other only after failures to contact it. If you turn on debug logging in resolved (by setting the SYSTEMD_LOG_LEVEL=debug env var for it), you'll see the precise reason it switched over. Switching over can have many reasons: the IP route to the destination is missing, the server might simply not respond at all, or respond only with an error...

To turn on debug logging, use "systemctl edit systemd-resolved" and add these two lines:

[Service]
Environment=SYSTEMD_LOG_LEVEL=debug

and issue "systemctl restart systemd-resolved", then watch the output with "journalctl -u systemd-resolved -f" and look for the lines announcing the switch and the context before them.

I am pretty sure the output you'll see then will explain enough, hence I am closing this now. Feel free to reopen if it doesn't.

diego-treitos (Author) commented Apr 25, 2017

I cannot reopen the issue, I am afraid.

Regarding the issue, first of all, what I said is that I would expect systemd-resolved to behave just like the plain system resolv.conf file: first try one DNS server and then the other (per request).
I am not seeing this in systemd-resolved: it seems that when it switches (for whatever reason) it stays with that server, and subsequent requests are never checked against the primary DNS server.
In my case I have:

Primary DNS: 192.168.0.250
Secondary DNS: 8.8.8.8

The primary DNS works just fine. I never see it offline. Actually, when systemd-resolved switches to 8.8.8.8 I can just test the resolving like this:

$ dig +short router.lar

$ dig +short router.lar @192.168.0.250
192.168.0.1

So here we see that despite the primary DNS server being available, systemd-resolved is not using it.
This is happening in two different computers I have.

I've never had any problems with either of them until I upgraded them to the new Ubuntu version that uses systemd-resolved. On one of them I have already disabled systemd-resolved and it works just fine (just like before using systemd-resolved). So clearly there is something wrong with the systemd-resolved behaviour.

Just in case, I enabled the debug logging as you requested and this is what I see for the requests:

Apr 25 11:00:42 boi systemd-resolved[5221]: Got DNS stub UDP query packet for id 22949
Apr 25 11:00:42 boi systemd-resolved[5221]: Looking up RR for router.lar IN A.
Apr 25 11:00:42 boi systemd-resolved[5221]: NXDOMAIN cache hit for router.lar IN A
Apr 25 11:00:42 boi systemd-resolved[5221]: Transaction 63967 for <router.lar IN A> on scope dns on eth0/* now complete with <rcode-failure> from cache (unsigned).
Apr 25 11:00:42 boi systemd-resolved[5221]: Freeing transaction 63967.
Apr 25 11:00:42 boi systemd-resolved[5221]: Sending response packet with id 22949 on interface 1/AF_INET.

And this is what I see when it switches (I added >>> markers to make the switches easy to spot):

>>> Apr 25 07:40:06 boi systemd-resolved[5221]: Switching to DNS server 8.8.8.8 for interface eth0.
Apr 25 07:40:06 boi systemd-resolved[5221]: Cache miss for go.trouter.io IN AAAA
Apr 25 07:40:06 boi systemd-resolved[5221]: Transaction 47232 for <go.trouter.io IN AAAA> scope dns on eth0/*.
Apr 25 07:40:06 boi systemd-resolved[5221]: Using feature level UDP+EDNS0+DO for transaction 47232.
Apr 25 07:40:06 boi systemd-resolved[5221]: Using DNS server 8.8.8.8 for transaction 47232.
Apr 25 07:40:06 boi systemd-resolved[5221]: Sending query packet with id 47232.
Apr 25 07:40:06 boi systemd-resolved[5221]: Timeout reached on transaction 29131.
Apr 25 07:40:06 boi systemd-resolved[5221]: Retrying transaction 29131.
>>> Apr 25 07:40:06 boi systemd-resolved[5221]: Switching to DNS server 192.168.0.250 for interface eth0.
Apr 25 07:40:06 boi systemd-resolved[5221]: Cache miss for go.trouter.io IN A
Apr 25 07:40:06 boi systemd-resolved[5221]: Transaction 29131 for <go.trouter.io IN A> scope dns on eth0/*.
Apr 25 07:40:06 boi systemd-resolved[5221]: Using feature level UDP for transaction 29131.
Apr 25 07:40:06 boi systemd-resolved[5221]: Sending query packet with id 29131.
Apr 25 07:40:06 boi systemd-resolved[5221]: Got DNS stub UDP query packet for id 350
Apr 25 07:40:06 boi systemd-resolved[5221]: Looking up RR for go.trouter.io IN A.
Apr 25 07:40:06 boi systemd-resolved[5221]: Processing query...
Apr 25 07:40:06 boi systemd-resolved[5221]: Got DNS stub UDP query packet for id 30693
Apr 25 07:40:06 boi systemd-resolved[5221]: Looking up RR for go.trouter.io IN AAAA.
Apr 25 07:40:06 boi systemd-resolved[5221]: Processing query...
Apr 25 07:40:11 boi systemd-resolved[5221]: Got DNS stub UDP query packet for id 63769
Apr 25 07:40:11 boi systemd-resolved[5221]: Looking up RR for browser.pipe.aria.microsoft.com IN A.
Apr 25 07:40:11 boi systemd-resolved[5221]: Processing query...
Apr 25 07:40:11 boi systemd-resolved[5221]: Timeout reached on transaction 47737.
Apr 25 07:40:11 boi systemd-resolved[5221]: Retrying transaction 47737.
>>> Apr 25 07:40:11 boi systemd-resolved[5221]: Switching to DNS server 8.8.8.8 for interface eth0.
Apr 25 07:40:11 boi systemd-resolved[5221]: Cache miss for browser.pipe.aria.microsoft.com IN A

It looks like it switches each time it has a problem resolving a record, and then it keeps using that name server for subsequent requests.

poettering (Member) commented Apr 25, 2017

Regarding the issue, first of all, what I said is that I would expect systemd-resolved to behave just like the plain system resolv.conf file: first try one DNS and then the other (per request).

This is what happens. However, in contrast to classic nss-dns we have memory: when we notice that a DNS server didn't respond, returned some failure, or for some other reason wasn't working for us, and we skip to the next one, we remember that, and the next lookup is attempted with the new one. If that one fails too, we skip to the next one, and so on, until we reach the end of the list and start again from the beginning.

This behaviour has the big advantage that we can build on what we learnt about a DNS server before, and don't waste the same timeout on a DNS server for each lookup should it not respond.

Or to say this differently: specifying multiple DNS servers is not a way to merge DNS zones. It's simply a way to define alternative servers should the first DNS server not work correctly.
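The "sticky" selection policy described above can be sketched in a few lines of Python. This is a simplified illustration only: resolved itself is written in C, and the class and method names here are invented.

```python
class ServerList:
    """Sketch of resolved's "sticky" server selection: remember the
    server that last worked and only advance on failure, wrapping
    around at the end of the list."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.current = 0  # index of the server we currently trust

    def pick(self):
        # Every new lookup starts with the remembered server,
        # not necessarily the first one in the configured list.
        return self.servers[self.current]

    def report_failure(self):
        # Timeout or error on the current server: advance to the
        # next one, wrapping back to the start after the last.
        self.current = (self.current + 1) % len(self.servers)

servers = ServerList(["192.168.0.1", "8.8.8.8"])
assert servers.pick() == "192.168.0.1"
servers.report_failure()            # first server timed out
assert servers.pick() == "8.8.8.8"  # later lookups stay here
```

The key difference from classic resolv.conf behaviour is that `current` persists across lookups instead of being reset to 0 for each query.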

If you want to route lookups in specific zones to specific DNS servers, then resolved doesn't really offer a nice way for that. A hack is to define multiple interfaces however, and configure different DNS servers and domains for them.

poettering (Member) commented Apr 25, 2017

Apr 25 07:40:06 boi systemd-resolved[5221]: Timeout reached on transaction 29131.
Apr 25 07:40:06 boi systemd-resolved[5221]: Retrying transaction 29131.
Apr 25 07:40:06 boi systemd-resolved[5221]: Switching to DNS server 192.168.0.250 for interface eth0.

This is where the server switches, and the lines before tell you why: the DNS server didn't respond to our query with transaction ID 29131. Why it didn't respond isn't known: somehow no UDP response packet was received. This could be because the query or the response packet simply got dropped on the way, or because the server refused to reply... Either way, resolved will retry but use a different DNS server, in the hope that works better.

poettering (Member) commented Apr 25, 2017

Apr 25 07:40:11 boi systemd-resolved[5221]: Timeout reached on transaction 47737.
Apr 25 07:40:11 boi systemd-resolved[5221]: Retrying transaction 47737.
Apr 25 07:40:11 boi systemd-resolved[5221]: Switching to DNS server 8.8.8.8 for interface eth0.

and here the same thing, when it switches back: the response for transaction 47737 wasn't received either, hence resolved tries the other server again, switching back.

diego-treitos (Author) commented Apr 25, 2017

This is where the server switches, and the lines before tell you why: the DNS server didn't respond to our query with transaction ID 29131. Why it didn't respond isn't known: somehow no UDP response packet was received. This could be because the query or the response packet simply got dropped on the way, or because the server refused to reply... Either way, resolved will retry but use a different DNS server, in the hope that works better.

Yes, I see that. And precisely because it uses UDP, it is easier for some packets to get dropped and for the DNS server to be switched.
Surely you see the advantage of the configuration I have in place. In networks like a small company's, you may send those nameservers via DHCP to all computers in your network so they have resolution for local and external domains. However, if for some reason the local DNS server goes down, all your computers can still resolve internet domains. In other words, it is much more likely that your local DNS fails than that the Google DNS does, so it acts as a strong failover.

With the current systemd implementation you lose that priority in resolving names as it works more like a round-robin, and I understand the advantages of that in many scenarios (quick DNS failover switch).

I think it would be great to have some configuration options on this like:

  • Choose between RR mode or Prioritized mode
  • Number of attempts before switching to next nameserver

Or even to periodically check the primary nameserver's availability, so it can be used again as soon as possible.

diego-treitos (Author) commented Apr 25, 2017

poettering (Member) commented Apr 25, 2017

BTW, odd thing is that it looks easier to switch to the external nameserver, when this is never able to resolve the local domains.

Not sure I grok what you are trying to say? Note that if a DNS lookup results in a NODATA or NXDOMAIN reply, then that's considered final, and no other DNS server is tried. Again, defining multiple DNS servers is not a way to merge zones, it's a way to deal with unreliable servers, the assumption is always that all DNS servers configured provide the same dataset.

So I think I grok what you are trying to do, but quite frankly, I think that even without resolved involved, this scheme is not reliable, and basically just takes advantage of a specific implementation detail of nss-dns/glibc. You are merging two concepts in what you are trying to do: fallback due to unreliable servers, and "merging" of zones. And I think for the latter it would be better to do proper per-domain request routing, for which an RFE is filed in #5573 for example.

thomasleplus commented May 27, 2017

I have a similar situation to @diego-treitos. My company has a single internal DNS server, so our DHCP server provides it as the primary DNS and OpenDNS as the secondary. If any request to our DNS server fails, systemd will switch to OpenDNS and I lose the ability to connect to internal servers. And since OpenDNS doesn't fail, I never switch back to our DNS server unless I disconnect and reconnect my network.

I agree that the proper solution would be having a reliable DNS server or, even better, two internal servers for redundancy. But while I try to convince our sysadmins of that, IMHO it would be nice to have an option.

diego-treitos (Author) commented May 29, 2017

I agree with that. I know that this may not be a direct problem with systemd, but this service is being used to replace a previous one, so I think it would be nice if it could work just like the service it is replacing.

chrisisbd commented Jun 7, 2017

Yes, I agree that this is a problem. I have just upgraded a system to ubuntu 17.04 and what used to work in 16.04 now no longer works. We need a way to say that the second DNS is only to be used if the first one fails, the first one should always be tried first.

chrisisbd commented Jun 7, 2017

Here's my output after adding the debug logging, it doesn't seem to make much sense:-

Jun 7 10:36:28 t470 systemd-resolved[2161]: Using system hostname 't470'.
Jun 7 10:36:28 t470 systemd-resolved[2161]: New scope on link *, protocol dns, family *
Jun 7 10:36:28 t470 systemd-resolved[2161]: Found new link 3/wlp4s0
Jun 7 10:36:28 t470 systemd-resolved[2161]: Found new link 2/enp0s31f6
Jun 7 10:36:28 t470 systemd-resolved[2161]: Found new link 1/lo
Jun 7 10:36:28 t470 systemd-resolved[2161]: Sent message type=method_call sender=n/a destination=org.freedesktop.DBus object=/org/freedesktop/DBus interface=org.freedesktop.DBus member=Hello cookie=1 reply_cookie=0 error=n/a
Jun 7 10:36:28 t470 systemd-resolved[2161]: Got message type=method_return sender=org.freedesktop.DBus destination=:1.283 object=n/a interface=n/a member=n/a cookie=1 reply_cookie=1 error=n/a
Jun 7 10:36:28 t470 systemd-resolved[2161]: Sent message type=method_call sender=n/a destination=org.freedesktop.DBus object=/org/freedesktop/DBus interface=org.freedesktop.DBus member=RequestName cookie=2 reply_cookie=0 error=n/a
Jun 7 10:36:28 t470 systemd-resolved[2161]: Got message type=method_return sender=org.freedesktop.DBus destination=:1.283 object=n/a interface=n/a member=n/a cookie=4 reply_cookie=2 error=n/a
Jun 7 10:36:28 t470 systemd-resolved[2161]: Sent message type=method_call sender=n/a destination=org.freedesktop.DBus object=/org/freedesktop/DBus interface=org.freedesktop.DBus member=AddMatch cookie=3 reply_cookie=0 error=n/a
Jun 7 10:36:28 t470 systemd-resolved[2161]: Got message type=method_return sender=org.freedesktop.DBus destination=:1.283 object=n/a interface=n/a member=n/a cookie=5 reply_cookie=3 error=n/a
Jun 7 10:36:28 t470 systemd-resolved[2161]: Got message type=signal sender=org.freedesktop.DBus destination=:1.283 object=/org/freedesktop/DBus interface=org.freedesktop.DBus member=NameAcquired cookie=2 reply_cookie=0 error=n/a
Jun 7 10:36:28 t470 systemd-resolved[2161]: Got message type=signal sender=org.freedesktop.DBus destination=:1.283 object=/org/freedesktop/DBus interface=org.freedesktop.DBus member=NameAcquired cookie=3 reply_cookie=0 error=n/a
Jun 7 10:37:38 t470 systemd-resolved[2161]: Got DNS stub UDP query packet for id 1936
Jun 7 10:37:38 t470 systemd-resolved[2161]: Looking up RR for esprimo.zbmc.eu IN A.
Jun 7 10:37:38 t470 systemd-resolved[2161]: Switching to fallback DNS server 8.8.8.8.
Jun 7 10:37:38 t470 systemd-resolved[2161]: Cache miss for esprimo.zbmc.eu IN A
Jun 7 10:37:38 t470 systemd-resolved[2161]: Transaction 53812 for <esprimo.zbmc.eu IN A> scope dns on /.
Jun 7 10:37:38 t470 systemd-resolved[2161]: Using feature level UDP+EDNS0+DO+LARGE for transaction 53812.
Jun 7 10:37:38 t470 systemd-resolved[2161]: Using DNS server 8.8.8.8 for transaction 53812.
Jun 7 10:37:38 t470 systemd-resolved[2161]: Sending query packet with id 53812.
Jun 7 10:37:38 t470 systemd-resolved[2161]: Processing query...
Jun 7 10:37:39 t470 systemd-resolved[2161]: Processing incoming packet on transaction 53812.
Jun 7 10:37:39 t470 systemd-resolved[2161]: Verified we get a response at feature level UDP+EDNS0+DO from DNS server 8.8.8.8.
Jun 7 10:37:39 t470 systemd-resolved[2161]: Added NXDOMAIN cache entry for esprimo.zbmc.eu IN ANY 1799s
Jun 7 10:37:39 t470 systemd-resolved[2161]: Transaction 53812 for <esprimo.zbmc.eu IN A> on scope dns on / now complete with from network (unsigned).
Jun 7 10:37:39 t470 systemd-resolved[2161]: Sending response packet with id 1936 on interface 1/AF_INET.
Jun 7 10:37:39 t470 systemd-resolved[2161]: Freeing transaction 53812.
Jun 7 10:37:39 t470 systemd-resolved[2161]: Got DNS stub UDP query packet for id 1919
Jun 7 10:37:39 t470 systemd-resolved[2161]: Looking up RR for esprimo IN A.
Jun 7 10:37:39 t470 systemd-resolved[2161]: Sending response packet with id 1919 on interface 1/AF_INET.
Jun 7 10:37:39 t470 systemd-resolved[2161]: Processing query...

So why does it switch to using 8.8.8.8? It doesn't seem to have even tried 192.168.1.2.

poettering (Member) commented Jun 7, 2017

@chrisisbd The "Switching to fallback DNS server 8.8.8.8." message indicates that you have no DNS servers configured at all, in which case resolved uses compiled-in fallback servers, because it tries hard to just work even on a locally misconfigured system.

chrisisbd commented Jun 7, 2017

No, I have a working DNS on the LAN which (when I use it from xubuntu 16.04 systems) works perfectly.

The relevant part from 'systemd-resolve --status' is:-

Link 3 (wlp4s0)
  Current Scopes: DNS
  LLMNR setting: yes
  MulticastDNS setting: no
  DNSSEC setting: no
  DNSSEC supported: no
  DNS Servers: 192.168.1.2
               8.8.8.8
  DNS Domain: zbmc.eu

Most of the time local names resolve OK on the 17.04 system too but it (randomly?) falls back to using the 8.8.8.8 server for no obvious reason.

amazon750 commented Jun 8, 2017

Hi Lennart, thanks for all of your work so far. I'm trying to keep using systemd, but you can add me to the list of people for whom the old behaviour seemed to be standardised and useful, and the new behaviour seems like a regression.

the assumption is always that all DNS servers configured provide the same dataset.

That assumption doesn't seem universal. I too have local names that aren't in public DNS, and some local overrides for external names, neither of which work if the failover happens (I only have a secondary server listed for the same reason as these other fellas: to keep internet access working more reliably for the rest of the local users if the primary fails). Under the old system, with the same resolv.conf and the same primary DNS server, things worked as I designed nearly 100% of the time. Now, with systemd, it's become quite unreliable. I hadn't needed to do per-domain request routing before, but I'd be fine with that solution. I also like the suggestion of a switch to choose which behaviour the local admin prefers. Anything would be better, I've been reduced to editing /etc/hosts to relieve some frustration, which I haven't otherwise done in years.

And I think for the latter it would be better to do proper per-domain request routing, for which an RFE is filed in #5573 for example

Actually, on thinking about it further, that isn't as good. I would still prefer to use my internal DNS as primary for everything, and have it forward requests that it can't answer. Then again, maybe my preference is a bad practice, and won't be supported. But as mentioned, this all used to work, now it doesn't. If that's by design and won't be changed, that's unfortunate.

lifeboy commented Jun 23, 2017

This is a problem, @poettering. The behaviour is a major change from the expected way and doesn't work in practice. If I specify 3 nameservers, the expectation that the first is always queried first, is settled. You can't change that now unilaterally.

Consider this scenario:

I have a VPN connection to a development environment where I have VMs running various tests. On the gateway of that cluster I run a DNS forwarder (dnsmasq) on pfSense (192.168.121.1). There I override public DNS records to resolve to a server on the LAN. This is not uncommon, and similar scenarios exist in many corporate environments. In addition to overriding existing public DNS entries, I also add my own in-house entries for my test servers.
In addition to this development cluster, we run various production clusters on a similar basis (192.168.0.1). Again, a DNS forwarder allows a domain to resolve to a LAN address instead of the public address.

Since we don't work in one location, and precisely therefore use VPNs to connect to the various clusters, we need the expected behaviour: always try to resolve in this order:
192.168.121.1
192.168.0.1
8.8.8.8

What happens with systemd-resolved is this:
Try to resolve abc.com from 192.168.121.1. It resolves. Open tools and work on servers.
In the course of time, some entry for xyz.com doesn't resolve from 192.168.121.1. It does resolve from 192.168.0.1, and xyz.com is now accessible. However, quite soon after that abc.com is not found any more, because 192.168.0.1 doesn't have records for abc.com.

The only way to restore this is to clear the DNS cache and restart systemd-resolved.

This is not acceptable and at the least we need a way to prevent this automatic jumping to a dns server lower down in the priority list.

A comment by @mourednik was marked as off-topic.

@systemd systemd locked and limited conversation to collaborators Jun 26, 2017
@systemd systemd unlocked this conversation Jul 9, 2017
keszybz (Member) commented Jul 9, 2017

Hm, we could allow DNS= configuration to specify two tiers of servers (e.g. with the syntax DNS=1.2.3.4 -8.8.8.8 -8.8.4.4), where those minus-prefixed servers would only be used if the non-minus-prefixed servers fail. Not sure about the details — the way I think could work would be to: first, round-robin on the first tier servers, and then fall back to the second tier once all of the first-tier servers have accumulated enough failures, like maybe 5 out of last 10 queries. And after some timeout, let's say 10 minutes, we should switch back.

Of course such a setup is not useful for merging zones (as @poettering wrote) in any reliable way, but it makes it easier to achieve "soft failure", where some local names stop working but the internet is not fully broken when the local nameserver goes down. Also, thanks to automatic switching back after a delay, things would "unbreak" automatically.
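The two-tier idea sketched in the comment above could be prototyped like this. Purely illustrative Python: the class and parameter names are invented, and the thresholds (5 of the last 10 queries, 10-minute cool-down) simply mirror the example numbers in the proposal.

```python
import time
from collections import deque

class TieredSelector:
    """Sketch of two-tier server selection: first-tier servers are
    used until too many of their recent queries failed, then the
    second tier takes over for a cool-down period, after which the
    first tier is tried again."""

    def __init__(self, primary, fallback, window=10, threshold=5,
                 cooldown=600.0, clock=time.monotonic):
        self.primary = list(primary)
        self.fallback = list(fallback)
        self.results = deque(maxlen=window)   # True = failure
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failed_over_at = None

    def record(self, failed):
        # Track the outcome of each query against the first tier.
        self.results.append(failed)
        if sum(self.results) >= self.threshold:
            self.failed_over_at = self.clock()
            self.results.clear()

    def pick(self):
        if self.failed_over_at is not None:
            if self.clock() - self.failed_over_at < self.cooldown:
                return self.fallback[0]
            self.failed_over_at = None  # cool-down over: try primary again
        return self.primary[0]
```

With this shape, a dead local nameserver degrades to "local names broken, internet still works", and the cool-down makes things unbreak automatically, as described above.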

lifeboy commented Jul 14, 2017

@keszybz keszybz removed the not-a-bug label Jul 15, 2017
keszybz (Member) commented Jul 15, 2017

This was already explained above (#5755 (comment)), but I'll try again:

  • we rotate among servers when the current server is not responding to provide reliability.
  • we remember which server is "good" so that there's no initial delay.

Contrary to what you wrote, DNS clients do not cache answers in general. In fact, short-lived programs cannot cache answers even if they wanted to; every time a program is restarted it starts with a clean slate. The place where caching is performed is inside systemd-resolved (or in another external cache, like nscd, sssd, etc., but with systemd-resolved running the idea is that you don't need those).

With DNSSEC, the delay from a nonresponding name server becomes even more noticeable. We might want to adjust caching details, but caching is a central feature of systemd-resolved and it is not going away (both the memory of the "last good" server and the cache of previously queried resource records). So if you want something to change to accommodate your use case, help design a solution like the one proposed (#5755 (comment)) so that it works for you.
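The resolver-side caching described here can be sketched minimally. Illustrative Python only; resolved's real cache is in C and considerably more involved (it also honours the DNS stub protocol, DNSSEC state, and so on).

```python
import time

class RecordCache:
    """Minimal resolver-side record cache: positive and negative
    (NXDOMAIN) answers are both stored until their TTL expires, so
    short-lived client programs benefit without caching anything
    themselves."""

    def __init__(self, clock=time.monotonic):
        self.entries = {}
        self.clock = clock

    def put(self, name, rtype, answer, ttl):
        # Store the answer together with its absolute expiry time.
        self.entries[(name, rtype)] = (answer, self.clock() + ttl)

    def get(self, name, rtype):
        entry = self.entries.get((name, rtype))
        if entry is None:
            return None
        answer, expires = entry
        if self.clock() >= expires:
            del self.entries[(name, rtype)]  # TTL elapsed: drop it
            return None
        return answer  # may be a sentinel such as "NXDOMAIN"
```

Note how a cached NXDOMAIN keeps being served until its TTL runs out; that is exactly the "NXDOMAIN cache hit for router.lar" effect visible in the debug logs earlier in this thread.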

@keszybz keszybz reopened this Jul 15, 2017
lifeboy commented Jul 16, 2017

I think @keszybz's workaround is not a good idea. It's still not a solution that keeps the established behaviour and adds strange new "features" for those that wish to enable them. Why does @poettering insist on breaking things that work just fine?

I'm being forced off systemd more and more, and I now see that's a good thing. The more people move away, the better.

keszybz (Member) commented Jul 16, 2017

Well, we keep telling you why, and you keep asking.

jonesmz commented Oct 24, 2022

If you have two DCs then they both serve the same DNS zone data, so it doesn't matter which one systemd-resolved picks because they're equivalent

Reboot all your DCs, or have a network switch go down or reboot, and now all your Linux systems can't authenticate because systemd-resolved thinks 8.8.8.8 will have the same data as a local DNS server.

It's a real-world use case where systemd-resolved is the problem, and the solution is to always start with the first server in the list, or to automatically switch back to the first server when it comes back online.

tebruno99 commented Oct 24, 2022

These DNS issues have been around for many years, and every bug filed about them digresses into "works for me" or "doesn't work for me".

systemd-resolved replaced behavior that worked for everyone for many years with something that works only for some. Restoring the original behavior breaks nothing for the "works for me" group and fixes everything for the "doesn't work for me" group. Please just fix it so the conflict can end, so we can stop wasting all this energy and frustration across these threads/bugs, and so everyone can embrace systemd-resolved as the great solution it could be.

paketb0te commented Oct 25, 2022

Reboot all your DCs, or have a network switch go down or reboot, and now all your Linux systems can't authenticate because systemd-resolved thinks 8.8.8.8 will have the same data as a local DNS server.

Its a real-world use case where systemd-resolved is the problem, and the solution is to always start with the first server in the list, or to automatically switch back to the first server when it comes back online.

I disagree. IMO the problem is one of misconfiguration: you should not use a nameserver that serves a different zone (e.g. a public resolver) as a fallback.

mattiaslundstrom commented Oct 25, 2022

This is ridiculous. The non-configurable behavior of systemd-resolved breaks quite a few network configurations that exist in practice. What counts as "correct" configuration according to some theoretical reading of the standard is irrelevant to this practical issue. Why is this so hard to fix?

Like many others I have had to disable systemd-resolved because of this issue. In my case because my employer's network and VPN solution is "incorrectly" configured.

darkdragon-001 commented Oct 25, 2022

What is "correct" configuration according to some theoretical reading of the standard is irrelevant to this practical issue.

Standards exist to be implemented as-is; otherwise they have no reason to exist. If you think the standard should be improved, you should go to the standards committee to change it. Here is the wrong place for such requests.

Like many others I have had to disable systemd-resolved because of this issue. In my case because my employer's network and VPN solution is "incorrectly" configured.

VPNs work fine if you configure them correctly (I have never had any problems with my personal or my work VPNs). I even think systemd-resolved is way superior in handling VPNs, as it allows DNS routing based on device and domain. For example, I set my company's domains as a search prefix for my VPN, and in turn only my company's DNS is used for those addresses, while I can use my own DNS in a privacy-friendly manner for all other queries.

diego-treitos (Author) commented Oct 25, 2022

paketb0te commented Oct 25, 2022

@darkdragon-001 #5755 (comment)

The existence of split DNS referenced there (RFC 2775, sec. 3.8) has, IMO, nothing to do with the issue at hand: in such cases, the fallback server(s) should be other internal nameservers.

Always using the servers in the order they are configured in resolv.conf would mean that we potentially have to wait for a timeout on every single request before trying the next server, which would degrade performance substantially.
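The cost described here can be put in numbers. Illustrative arithmetic only, assuming a fixed per-server timeout and a strict-order resolver with no memory:

```python
def strict_order_latency(servers, timeout):
    """Worst-case extra latency per query if servers were always
    tried in configured order and only the last one answered."""
    dead = len(servers) - 1  # servers that must time out first
    return dead * timeout

# Two dead servers before a working third one, 5 s timeout each:
# every single query pays 10 s before the live server is even tried.
assert strict_order_latency(["dead1", "dead2", "live"], 5.0) == 10.0
```

This is exactly the per-lookup penalty that resolved's "remember the last good server" behaviour avoids.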

jonesmz commented Oct 25, 2022

I disagree, IMO the problem is one of misconfiguration - you should not use a Nameserver that serves a different Zone (e.g. a public resolver) as a Fallback.

Ah yes. Because it's OK to prevent the users on my network from having any internet access at all if there is a problem with an internal machine.

It's weird how Windows deals with this just fine. And dnsmasq. It's only systemd-resolved that can't handle the configuration I have.

jonesmz commented Oct 25, 2022

Always using the servers in the order they are configured in resolve.conf would mean that we potentially have to wait for a timeout on each single request before trying the next server - which would degrade performance substantially

That's why it would be an option, which you don't need to opt into, but I would.

mattiaslundstrom commented Oct 25, 2022

Like many others I have had to disable systemd-resolved because of this issue. In my case because my employer's network and VPN solution is "incorrectly" configured.

VPN is working fine if you correctly configure it (never had any problems with my personal as well as my work VPNs). I even think systemd-resolved is way superior in handling VPNs as it allows DNS based on device and domain. For example I set my company domains as search prefix for my VPN and in turns only my companie's DNS is used for these addresses while I can use is own DNS in a privacy friendly matter for all other queries.

I am not the admin of this network or these DNS servers. Even if systemd-resolved would work were this environment configured differently than it actually is, that does not help me at all. Also, this type of configuration is not unique to my situation; it is fairly common.

This is the de facto situation for many people and the refusal to accept this reality is baffling to me. Why are you doing this?

I guess I will just continue not using systemd-resolved since it does not work. Which is fine, I guess...

@darkstar

darkstar commented Oct 25, 2022

I guess I will just continue not using systemd-resolved since it does not work. Which is fine, I guess...

That's probably the best way forward, if it doesn't work for you for some weird reason. I guess the only reason this issue is not yet "fixed" is that there's only a handful of people who have these kinds of problems, while it seems to work fine for most systemd-resolved users, me included.

@jonesmz

jonesmz commented Oct 25, 2022

there's only a handful of people who have these kinds of problems, while it seems to work fine for most systemd-resolved users, me included

Then why participate in the GitHub issue? You don't experience the well-described issue; other people do. Presumably you don't experience bugs in software features you don't use in most applications.

@mourednik

mourednik commented Oct 25, 2022

That's probably the best way forward, if it doesn't work for you for some weird reason. I guess the only reason this issue is not yet "fixed" is that there's only a handful of people who have these kinds of problems, while it seems to work fine for most systemd-resolved users, me included

"Works for me" is not a valid resolution for a defect. That would get you fired in some places. I subscribed to this issue five years ago when I had to uninstall a systemd distro and replace it with Devuan to connect to the VPN at work.

@lilydjwg

lilydjwg commented Oct 25, 2022

I had to uninstall a systemd distro and replace it with Devuan to connect to the VPN at work.

You can disable systemd-resolved by editing /etc/nsswitch.conf (and configuring your own /etc/resolv.conf). I've noticed Arch Linux added it as soon as issues appeared.
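For reference, a minimal sketch of that change on a typical glibc system; the nameserver addresses are placeholders, and your distribution's stock `nsswitch.conf` line may contain additional services:

```
# /etc/nsswitch.conf -- resolve "hosts" via files and plain DNS,
# dropping the "resolve [!UNAVAIL=return]" entry that routes through resolved
hosts: files dns

# /etc/resolv.conf -- replace the symlink to resolved's stub with a static file
nameserver 192.168.0.1
nameserver 8.8.8.8
```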

@bronger

bronger commented Oct 26, 2022

I guess the only reason this issue is not yet "fixed" is that there's only a handful of people who have these kinds of problems, while it seems to work fine for most systemd-resolved users, …

Such conjectures are not helpful in my opinion. Besides, it is questionable: this issue is in the top per mille of all 1700 systemd issues on GitHub, no matter which criterion you use (number of participants, number of likes, number of comments).

@tmccombs

tmccombs commented Oct 26, 2022

Standards are there to implement them as is, otherwise they don't have any reason to exist. If you think the standard should be improved, you should go to the standards committee to change it. Here is the wrong place for such requests.

I don't think systemd-resolved is even following the current standards. See my comment above.

VPN is working fine if you correctly configure it (never had any problems with my personal as well as my work VPNs). I even think systemd-resolved is way superior in handling VPNs as it allows DNS based on device and domain.

Just because it works for you doesn't mean it works for everyone. There are many, many ways a VPN can be set up. Consider the case where I have a split-tunnel VPN but want all DNS queries to go to the VPN's DNS servers normally. However, if the VPN has connectivity problems, or slows down too much, I want DNS to (temporarily) fall back to public DNS servers, so that I'm still able to access any domains that aren't part of the VPN. Also, while there are multiple DNS servers, it is still possible that a misconfiguration or something similar could take them all down simultaneously. And again, in that situation, I would like to use public DNS (or a local DNS server) until the VPN DNS servers come back online.

You can disable systemd-resolved by editing /etc/nsswitch.conf (and configure your own /etc/resolv.conf). I've noticed Arch Linux added it soon as issues appeared.

How you disable systemd-resolved is distribution-specific, and in some cases quite involved. On Ubuntu, disabling systemd-resolved is explicitly "unsupported" and can break other things, such as resolvconf.

@darkstar

darkstar commented Oct 26, 2022

The author has already stated that the current behavior is not considered a bug and works as designed. Sure, we can debate endlessly about this, but to me it seems the decision has already been made.

Thus, I see this thread here only for giving people tips on how they can get their configs to work, which I tried to do in multiple posts above.

And, contrary to some popular opinions on the Internet, you actually can remove parts of systemd and replace them with other means that work better in your configuration. E.g., if your local setup somehow requires zone merging from different nameservers, you can use a DNS resolver that supports that while still using a systemd-based distro.

@bronger

bronger commented Oct 27, 2022

The author has already stated that the current behavior is not considered a bug and works as designed. …, but to me it seems the decision has already been made.

Then the author should close this thing immediately. Instead, he ran away and left the rest of us in uncertainty.

@tmccombs

tmccombs commented Oct 27, 2022

The author has already stated that the current behavior is not considered a bug and works as designed. Sure, we can debate endlessly about this, but to me it seems the decision has already been made.

This isn't a bug report, it is a feature request to make this behavior configurable. Perhaps that wasn't the original intent of the filer, but that is the current state. And while the maintainers are opposed to changing the default behavior, I haven't seen any indication they are opposed to a configurable option to at least have a recovery mechanism to switch back to a preferred nameserver once it is healthy again. If there is no hope of such a feature, then what is the point of keeping this issue open?

@ZigMeowNyan

ZigMeowNyan commented Oct 27, 2022

The author has already stated that the current behavior is not considered a bug and works as designed. Sure, we can debate endlessly about this, but to me it seems the decision has already been made.

This isn't a bug report, it is a feature request to make this behavior configurable.

Systemd's resolved wiki page on Freedesktop.org claims that this is "a fully featured DNS resolver implementation". There are multiple RFCs peppered throughout this issue that discuss the reality of split DNS and how common it is to have both internal and external DNS services, which are expected to serve different addresses. The resolved implementation breaks that standard. The systemd team have deliberately chosen not to implement this functionality, despite it predating resolved and being a common feature of the internet and an integral part of various IETF RFCs which describe and standardize the internet.

So there is indeed a bug. The bug either lies in the uselessly stubborn refusal to acknowledge internet topologies and standards or in the extremely generous description of resolved as a fully featured DNS resolver implementation. This product is marketed as something it isn't. Either fix the gaping pothole or fix the marketing and all wiki descriptions that falsely describe resolved as "fully featured" or "standards compliant".

Resolved cannot be standards-compliant if its behavior undermines existing standards, and it cannot be fully featured if a feature required to implement the standards is missing. These terms require a 100% implementation of all related standards. "Flexible," "partially compliant," or even "mostly compliant" would be acceptable terms to describe resolved.

I have little hope for the former at this point, given project behavior, but perhaps with enough persistence and shame we can get the latter. Maybe it will even influence distros to not make resolved a default component. I'm more than fine with all the other pieces, but this one's poor implementation wastes way too much user time on unexpected standards-breaking behavior.

@darkstar

darkstar commented Oct 27, 2022

There might be multiple RFCs on split-horizon DNS, but none of those in any way mandates or requires the exact mechanism that is being discussed here. Split-horizon DNS works fine with systemd-resolved if you configure it correctly.

@ZigMeowNyan

ZigMeowNyan commented Oct 27, 2022

There might be multiple RFCs on split horizon DNS, but none of those in any way mandates or requires the exact mechanism that is being discussed here.

You should really read those RFCs before you make authoritative statements like that. It helps you avoid being incorrect.

I personally referenced several RFCs back in February (as did many others) which mandate that you're allowed to specify DNS servers in order of preference. And also that DNS clients are supposed to respect that order of preference. Resolved is both a DNS server and a DNS client, and, as a DNS client, it should be respecting the order of configured DNS servers by default. It's not my fault that they chose to read some RFCs and make incorrect architectural assumptions about it. But it is their fault that they refuse to acknowledge the behavior outlined in the RFCs which mandate how DNS fits into the larger networking picture.

Split-horizon DNS works fine with systemd-resolved if you configure it correctly

No, systemd-resolved can handle split-horizon DNS only if all configured nameservers resolve addresses identically. Which means that resolved cannot handle split DNS - it relies on other services to implement workarounds which hide split DNS to cope with its blatant shortcoming. That's like claiming a car can drive on the water as long as it's always inside a boat, or that a broken umbrella works as long as you only use it indoors. It's a deliberate flaw that the project has chosen to blame on others in lieu of correcting. The configuration which you call "correct" is a workaround for this flaw. You're demanding that network administrators of various skill levels and capacities effectively implement a DNS condom to cope with resolved.

Read RFC 2775 starting at section 3.8.

There are many possible variants on this

Oh look, there's no one, true way to implement split DNS. Your adherence to your preferred configuration is preference - not canon.

the basic point is that the correspondence between a given FQDN (fully qualified domain name) and a given IPv4 address is no longer universal

There's not a 1-1 mapping between addresses and FQDNs. Nor is there a mandate that nameservers resolve identically. Load balancing alone, which is also mentioned, may cause a name server to resolve the address differently each time it's asked. Also, the RFC has many other gems, like section 3.1's discussion of the intranet and section 4's acknowledgement of how many ways there are to skin the intranet/internet cat.

Split DNS can separate DNS information in software or in hardware. A hardware separation is running separate DNS servers. What's more, DNS software should never make the assumption that all DNS servers will resolve addresses identically. That's why respecting the order of DNS servers was included in the first place. Usage of multiple DNS servers which resolve addresses differently is treated as a permitted and common practice in RFCs. Which means that implementations which fail to handle non-identical name servers are wrong. Choosing to treat name servers as identical shows either ignorance or blatant disregard of officially sanctioned usages and implementations of the service. There are some advantages to implementing your preferred configuration, but also some disadvantages. Which is why the choice has always been left to the network admin. Not the resolved developer.

All of this is detailed in the many, many RFCs which you've failed to read. And you can't just pick and choose the RFCs like that. Networks, the internet, and the services that comprise their functionality do not exist in a vacuum. It's obvious that if your implementation breaks other standards which depend on standardized behavior, it's either wrong or incomplete. DNS is a core service that needs to behave in predictable ways. Implementations which refuse to cooperate shouldn't be allowed to call themselves "fully featured".

@tmccombs

tmccombs commented Oct 28, 2022

@ZigMeowNyan I agree with a lot of what you've said, although I do think the current behavior is advantageous in the case where the DNS servers can be treated as identical, because if the "most preferred" nameserver is down you don't have to pay the latency of waiting for a response from that nameserver on every DNS request until it comes back up. And I would guess that for the majority of users the nameservers can be treated as identical, so I think it makes sense to keep that as the default. Of course, trying to switch back to the preferred nameserver after some number of requests or some period of time wouldn't really hurt in that case.

I do agree that without some way to specify an order for servers, resolved isn't really "fully featured", but it's not like this would be the first time that term has been used loosely ;) and I wouldn't really call some optimistic wording in a wiki a bug. But that's really all just semantics.

The point is that maybe someday a configurable option will be added to resolved to get the desired behavior. I'm a little doubtful the maintainers will implement it, but if someone else contributed a pull request they might accept it. Maybe someday I'll get around to trying my hand at it, but since I've found a solution that doesn't use resolved and I'm completely unfamiliar with the codebase, it isn't a very high priority for me.

@bronger

bronger commented Oct 28, 2022

@poettering Would you accept a pull request that makes this configurable?

@darkdragon-001

darkdragon-001 commented Oct 28, 2022

Order of preference is not the only way to implement split DNS. Resolved allows different name servers on different interfaces (e.g. a company name server on the VPN interface, which is used only when the interface is connected). One can even explicitly list the zones via search domains, which can be pushed to the client by many VPN protocols. I think this is superior, as it gives more semantics to decide which name server should be used for which FQDN efficiently, instead of just an order, which can be interpreted differently, as this discussion and the various RFCs show.
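For illustration, the per-interface routing described here can also be set up manually with `resolvectl` on a running system; the interface names, domain, and addresses below are placeholders, not a recommended configuration:

```
# Send queries for corp.example (and its subdomains) only to the VPN's resolver
resolvectl dns tun0 10.8.0.1
resolvectl domain tun0 '~corp.example'

# Everything else keeps using the LAN resolver on eth0
resolvectl dns eth0 192.168.0.1
```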

@tmccombs

tmccombs commented Oct 28, 2022

Split dns is not the only reason to want this preference. As I've mentioned before, you may have one DNS server that is faster than the other(s), for example because it is closer. This is the main use case for me. (I should note that being able to send requests to all nameservers in parallel would also solve this, at the cost of more traffic).

And what do you do if you have multiple nonequivalent dns servers that are on the same interface? For example if you have split dns that is on a local network rather than a vpn? An alternative solution to this would be to allow specifying different domains for different nameservers on the same interface.

It also means you have to keep the list of domains for the server in sync between the dns server and your vpn server or dhcp server or client configuration.

And not all VPNs have a way to send search domains down. For example, WireGuard doesn't (although solutions based on it might).

@darkdragon-001

darkdragon-001 commented Oct 28, 2022

Split dns is not the only reason to want this preference. As I've mentioned before, you may have one DNS server that is faster than the other(s), for example because it is closer. This is the main use case for me. (I should note that being able to send requests to all nameservers in parallel would also solve this, at the cost of more traffic).

In order to optimize speed, I agree that parallel requests (to N out of M name servers) or regular probing of all name servers for speed might improve the current situation. Round robin would be another scheme, falling more into the category of load balancing. Many of these schemes would make split DNS, as requested by many people in this thread, even more complicated. We should keep these options open for the future, though.

And what do you do if you have multiple nonequivalent dns servers that are on the same interface? For example if you have split dns that is on a local network rather than a vpn? An alternative solution to this would be to allow specifying different domains for different nameservers on the same interface.

I am not sure, but could very well imagine that the resolved API already supports this. There is just no automated way to derive such a resolved configuration from the network interface configuration, as this usually only allows a list of DNS servers and a list of search domains.

It also means you have to keep the list of domains for the server in sync between the dns server and your vpn server or dhcp server or client configuration.

And not all vpns have a way to send search domains down. For example wireguard doesn't (although solutions based on it might).

Many clients might still have this configured manually (or using some other asset deployment/management solution) in such a case as search domains are often used for many internal services anyways.

@tmccombs

tmccombs commented Oct 28, 2022

I am not sure but could very well imagine that the resolved API already supports this.

There is not. The best you might be able to do with the current situation is to set up a dummy interface for one of the DNS servers. However, the most obvious way to do that doesn't work. Maybe it would be possible to set up a virtual interface that forwards traffic for the DNS server to the original interface, but setting that up would be pretty complicated. See #5573

@dbear496

dbear496 commented Nov 1, 2022

I see multiple people argue that looking at the name servers in order of preference would increase latency because the most preferred name server, which is potentially down, must be queried every time. However, this does not need to be the case. A potential solution could query the last used name server (as resolved already does) and also query the most preferred name server in parallel. This will allow resolved to quickly detect when the most preferred server comes back online and switch to it when it does.

But for those of us for which the current way of doing things is broken, performance is not as important as getting it to work.
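A minimal sketch of that parallel-probe idea (all names here are hypothetical illustrations, not systemd-resolved internals): each lookup races the most-preferred server against the last-working one, so the resolver can switch back as soon as the preferred server recovers, without paying a timeout on every request while it is down.

```python
import concurrent.futures

def query_with_probe(resolve, preferred, current, name, timeout=2.0):
    """Return (answer, server_to_use_next).

    `resolve(server, name)` is a caller-supplied function that returns an
    answer or raises on failure/timeout. We query the last-working server
    and, in parallel, probe the preferred one; if the preferred server
    answers in time, it becomes the server for subsequent lookups.
    """
    servers = list(dict.fromkeys([preferred, current]))  # dedupe, keep order
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = {srv: pool.submit(resolve, srv, name) for srv in servers}
        try:
            # Preferred server answered in time: switch back to it.
            return futures[preferred].result(timeout=timeout), preferred
        except Exception:
            # Preferred server still down: keep using the last-working one.
            return futures[current].result(timeout=timeout), current
```

A real implementation would also want to cap the extra probe traffic (e.g. probe only every Nth query or on a timer), but the sketch shows why ordered preference need not cost a timeout on every lookup.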
