systemd-resolved does not keep the order of the DNS servers #5755

Open
diego-treitos opened this Issue Apr 18, 2017 · 44 comments

diego-treitos commented Apr 18, 2017

Submission type

  • Bug report
  • Request for enhancement (RFE)

NOTE: Do not submit anything other than bug reports or RFEs via the issue tracker!

systemd version the issue has been seen with

Version 232

NOTE: Do not submit bug reports about anything but the two most recently released systemd versions upstream!

Used distribution

Ubuntu 17.04

In case of bug report: Expected behaviour you didn't see

With 2 nameservers like:

192.168.0.1
8.8.8.8

defined in that order in /etc/resolv.conf, I would expect the same behaviour as with plain resolv.conf: first use 192.168.0.1 and, if for some reason it is not available, fall back to 8.8.8.8.

Instead, I am seeing that systemd-resolved switches nameservers seemingly at random:

Abr 18 16:40:01 boi systemd-resolved[1692]: Switching to DNS server 8.8.8.8 for interface eth0.
Abr 18 16:40:01 boi systemd-resolved[1692]: Switching to DNS server 192.168.0.1 for interface eth0.
Abr 18 16:40:06 boi systemd-resolved[1692]: Switching to DNS server 8.8.8.8 for interface eth0.
Abr 18 16:40:06 boi systemd-resolved[1692]: Switching to DNS server 192.168.0.1 for interface eth0.
Abr 18 16:40:11 boi systemd-resolved[1692]: Switching to DNS server 8.8.8.8 for interface eth0.
Abr 18 16:40:16 boi systemd-resolved[1692]: Switching to DNS server 192.168.0.1 for interface eth0.
Abr 18 16:40:16 boi systemd-resolved[1692]: Switching to DNS server 8.8.8.8 for interface eth0.
Abr 18 16:40:21 boi systemd-resolved[1692]: Switching to DNS server 192.168.0.1 for interface eth0.
Abr 18 19:16:09 boi systemd-resolved[1692]: Switching to DNS server 8.8.8.8 for interface eth0.

In case of bug report: Unexpected behaviour you saw

Random nameserver use

In case of bug report: Steps to reproduce the problem

Just configure 2 nameservers and use the systemd-resolved service.
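
Concretely, the setup described above boils down to a resolv.conf along these lines (addresses as given in the report):

# /etc/resolv.conf
nameserver 192.168.0.1
nameserver 8.8.8.8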

poettering (Member) commented Apr 24, 2017

resolved will always begin with the first configured DNS server, and switch to any other only after failures to contact it. If you turn on debug logging in resolved (by setting the SYSTEMD_LOG_LEVEL=debug env var for it), then you'll see the precise reason it switched over. Switching over can have many reasons: the IP route to the destination is missing, the server might simply not respond at all, or respond only with an error...

To turn on debug logging, use "systemctl edit systemd-resolved", then write these two lines:

[Service]
Environment=SYSTEMD_LOG_LEVEL=debug

and issue "systemctl restart systemd-resolved", then watch the output with "journalctl -u systemd-resolved -f", and look for the lines announcing the switch and the context before it.

I am pretty sure the output you'll see then will explain enough, hence I am closing this now. Feel free to reopen if it doesn't.
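
For reference, a non-interactive equivalent of "systemctl edit" is to write the drop-in file by hand; a minimal sketch, assuming root privileges and the standard drop-in path:

# Equivalent to "systemctl edit systemd-resolved": create the drop-in manually.
mkdir -p /etc/systemd/system/systemd-resolved.service.d
cat > /etc/systemd/system/systemd-resolved.service.d/debug.conf <<'EOF'
[Service]
Environment=SYSTEMD_LOG_LEVEL=debug
EOF
# Reload unit definitions, restart the service, and follow its log output:
systemctl daemon-reload
systemctl restart systemd-resolved
journalctl -u systemd-resolved -f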

diego-treitos commented Apr 25, 2017

I cannot reopen the issue, I am afraid.

Regarding the issue, first of all, what I said is that I would expect systemd-resolved to behave just like the plain system resolv.conf file: first try one DNS server and then the other (per request).
I am not seeing this in systemd-resolved: it seems that when it switches (for whatever reason) it stays with that server, and subsequent requests are no longer checked against the primary DNS server.
In my case I have:

Primary DNS: 192.168.0.250
Secondary DNS: 8.8.8.8

The primary DNS works just fine. I never see it offline. Actually, when systemd-resolved switches to 8.8.8.8 I can just test the resolving like this:

$ dig +short router.lar

$ dig +short router.lar @192.168.0.250
192.168.0.1

So here we see that despite the primary DNS server being available, systemd-resolved is not using it.
This is happening on two different computers I have.

I never had any problems with either of them until I upgraded them to the new Ubuntu version that uses systemd-resolved. On one of them I already disabled systemd-resolved and it works just fine (just like before systemd-resolved). So clearly there is something wrong with the systemd-resolved behaviour.

Just in case, I enabled the debug logging as you requested and this is what I see for the requests:

Abr 25 11:00:42 boi systemd-resolved[5221]: Got DNS stub UDP query packet for id 22949
Abr 25 11:00:42 boi systemd-resolved[5221]: Looking up RR for router.lar IN A.
Abr 25 11:00:42 boi systemd-resolved[5221]: NXDOMAIN cache hit for router.lar IN A
Abr 25 11:00:42 boi systemd-resolved[5221]: Transaction 63967 for <router.lar IN A> on scope dns on eth0/* now complete with <rcode-failure> from cache (unsigned).
Abr 25 11:00:42 boi systemd-resolved[5221]: Freeing transaction 63967.
Abr 25 11:00:42 boi systemd-resolved[5221]: Sending response packet with id 22949 on interface 1/AF_INET.

And this is what I see when it switches (I added >>> markers to make the switches easy to see):

>>> Apr 25 07:40:06 boi systemd-resolved[5221]: Switching to DNS server 8.8.8.8 for interface eth0.
Apr 25 07:40:06 boi systemd-resolved[5221]: Cache miss for go.trouter.io IN AAAA
Apr 25 07:40:06 boi systemd-resolved[5221]: Transaction 47232 for <go.trouter.io IN AAAA> scope dns on eth0/*.
Apr 25 07:40:06 boi systemd-resolved[5221]: Using feature level UDP+EDNS0+DO for transaction 47232.
Apr 25 07:40:06 boi systemd-resolved[5221]: Using DNS server 8.8.8.8 for transaction 47232.
Apr 25 07:40:06 boi systemd-resolved[5221]: Sending query packet with id 47232.
Apr 25 07:40:06 boi systemd-resolved[5221]: Timeout reached on transaction 29131.
Apr 25 07:40:06 boi systemd-resolved[5221]: Retrying transaction 29131.
>>> Apr 25 07:40:06 boi systemd-resolved[5221]: Switching to DNS server 192.168.0.250 for interface eth0.
Apr 25 07:40:06 boi systemd-resolved[5221]: Cache miss for go.trouter.io IN A
Apr 25 07:40:06 boi systemd-resolved[5221]: Transaction 29131 for <go.trouter.io IN A> scope dns on eth0/*.
Apr 25 07:40:06 boi systemd-resolved[5221]: Using feature level UDP for transaction 29131.
Apr 25 07:40:06 boi systemd-resolved[5221]: Sending query packet with id 29131.
Apr 25 07:40:06 boi systemd-resolved[5221]: Got DNS stub UDP query packet for id 350
Apr 25 07:40:06 boi systemd-resolved[5221]: Looking up RR for go.trouter.io IN A.
Apr 25 07:40:06 boi systemd-resolved[5221]: Processing query...
Apr 25 07:40:06 boi systemd-resolved[5221]: Got DNS stub UDP query packet for id 30693
Apr 25 07:40:06 boi systemd-resolved[5221]: Looking up RR for go.trouter.io IN AAAA.
Apr 25 07:40:06 boi systemd-resolved[5221]: Processing query...
Apr 25 07:40:11 boi systemd-resolved[5221]: Got DNS stub UDP query packet for id 63769
Apr 25 07:40:11 boi systemd-resolved[5221]: Looking up RR for browser.pipe.aria.microsoft.com IN A.
Apr 25 07:40:11 boi systemd-resolved[5221]: Processing query...
Apr 25 07:40:11 boi systemd-resolved[5221]: Timeout reached on transaction 47737.
Apr 25 07:40:11 boi systemd-resolved[5221]: Retrying transaction 47737.
>>> Apr 25 07:40:11 boi systemd-resolved[5221]: Switching to DNS server 8.8.8.8 for interface eth0.
Apr 25 07:40:11 boi systemd-resolved[5221]: Cache miss for browser.pipe.aria.microsoft.com IN A

It looks like it switches each time it has a problem resolving a record, and then keeps using that nameserver for subsequent requests.
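
A quick way to observe the behaviour being described here, using the same tools as elsewhere in this thread (on later systemd versions the equivalent command is "resolvectl"):

# Show per-link DNS configuration; depending on version this also prints
# the currently selected server.
systemd-resolve --status
# Watch server switches as they happen:
journalctl -u systemd-resolved -f | grep "Switching to DNS server"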

poettering (Member) commented Apr 25, 2017

> Regarding the issue, first of all, what I said is that I would expect systemd-resolved to behave just like the plain system resolv.conf file: first try one DNS server and then the other (per request).

This is what happens. However, in contrast to classic nss-dns we have memory: when we notice that a DNS server didn't respond, returned some failure, or for some other reason wasn't working for us, and we skip to the next one, we remember that, and the next lookup is attempted with the new one. If that one fails too, we'll skip to the next one and the next one and so on, until we reach the end of the list and start from the beginning again.

This behaviour has the big advantage that we can build on what we learnt about a DNS server before, and don't waste the same timeout on a DNS server for every lookup should it not respond.

Or to say this differently: if you specify multiple DNS servers, that's not a way to merge DNS zones. It's simply a way to define alternative servers should the first DNS server not work correctly.

If you want to route lookups in specific zones to specific DNS servers, resolved doesn't really offer a nice way to do that. A hack, however, is to define multiple interfaces and configure different DNS servers and domains for them, as sketched below.
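
To make that multi-interface hack concrete: with systemd-networkd managing the links, each .network file carries its own DNS= and Domains= settings, and a "~"-prefixed routing domain steers lookups under that suffix to that link's servers. A hypothetical sketch (file name and values invented for illustration, reusing the addresses and the "lar" domain from this thread):

# /etc/systemd/network/10-eth0.network (hypothetical)
[Match]
Name=eth0

[Network]
DHCP=yes
DNS=192.168.0.250
# "~" marks a routing domain: lookups below .lar prefer this link's DNS
Domains=~lar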

poettering (Member) commented Apr 25, 2017

> Apr 25 07:40:06 boi systemd-resolved[5221]: Timeout reached on transaction 29131.
> Apr 25 07:40:06 boi systemd-resolved[5221]: Retrying transaction 29131.
> Apr 25 07:40:06 boi systemd-resolved[5221]: Switching to DNS server 192.168.0.250 for interface eth0.

This is where the server switches, and the lines before tell you why: the DNS server didn't respond to our query with transaction ID 29131. Why it didn't respond isn't known: somehow no UDP response packet was received. This could be because the query or the response packet simply got dropped on the way, or because the server refused to reply... Either way, resolved will retry but use a different DNS server, in the hope that it works better.

poettering (Member) commented Apr 25, 2017

> Apr 25 07:40:11 boi systemd-resolved[5221]: Timeout reached on transaction 47737.
> Apr 25 07:40:11 boi systemd-resolved[5221]: Retrying transaction 47737.
> Apr 25 07:40:11 boi systemd-resolved[5221]: Switching to DNS server 8.8.8.8 for interface eth0.

And here the same thing happens when it switches back: the response for transaction 47737 wasn't received either, hence resolved tries the other server again, switching back.

diego-treitos commented Apr 25, 2017

> This is where the server switches, and the lines before tell you why: the DNS server didn't respond to our query with transaction ID 29131. Why it didn't respond isn't known: somehow no UDP response packet was received. This could be because the query or the response packet simply got dropped on the way, or because the server refused to reply... Either way, resolved will retry but use a different DNS server, in the hope that it works better.

Yes, I see that. And precisely because it is using UDP, it is more likely that some packets get dropped and the DNS server switches.
Surely you see the advantages of the configuration I have in place. In networks like those of small companies, you may push those nameservers via DHCP to all computers in your network so they have resolution for local and external domains. If for some reason the local DNS goes down, all your computers can still resolve internet domains. In other words, your local DNS is much more likely to fail than the Google DNS, so this works as a strong failover.

With the current systemd implementation you lose that priority in resolving names, as it works more like round-robin; I understand the advantages of that in many scenarios (quick DNS failover switching).

I think it would be great to have some configuration options for this, like:

  • Choose between round-robin mode or prioritized mode
  • Number of attempts before switching to the next nameserver

Or even to periodically check the primary nameserver's availability so you can go back to using it as soon as possible.

diego-treitos commented Apr 25, 2017

BTW, odd thing is that it looks easier to switch to the external nameserver, when this is never able to resolve the local domains.

poettering (Member) commented Apr 25, 2017

> BTW, odd thing is that it looks easier to switch to the external nameserver, when this is never able to resolve the local domains.

Not sure I grok what you are trying to say? Note that if a DNS lookup results in a NODATA or NXDOMAIN reply, that's considered final, and no other DNS server is tried. Again, defining multiple DNS servers is not a way to merge zones; it's a way to deal with unreliable servers. The assumption is always that all configured DNS servers provide the same dataset.

So I think I grok what you are trying to do, but quite frankly, I think that even without resolved involved, this scheme is not reliable, and basically just benefits from a specific implementation detail of nss-dns/glibc. You are merging two concepts in what you are trying to do: fallback due to unreliable servers, and "merging" of zones. And I think for the latter it would be better to do proper per-domain request routing, for which an RFE is filed as #5573, for example.

thomasleplus commented May 27, 2017

I have a similar situation to @diego-treitos. My company has a single internal DNS server, so our DHCP server provides it as the primary DNS and OpenDNS as the secondary. If any request to our DNS fails, systemd will switch to OpenDNS and I lose the ability to connect to internal servers. And since OpenDNS doesn't fail, I never switch back to our DNS unless I disconnect and reconnect my network.

I agree that the proper solution would be having a reliable DNS server or, even better, two internal servers for redundancy. But while I try to convince our sysadmins of that, IMHO it would be nice to have an option.

diego-treitos commented May 29, 2017

I agree with that. I know that this may not be a direct problem with systemd, but this service is being used to replace a previous one, so I think it would be nice if it could work just like the service it is replacing.

chrisisbd commented Jun 7, 2017

Yes, I agree that this is a problem. I have just upgraded a system to Ubuntu 17.04 and what used to work in 16.04 now no longer works. We need a way to say that the second DNS server is only to be used if the first one fails; the first one should always be tried first.

chrisisbd commented Jun 7, 2017

Here's my output after adding the debug logging; it doesn't seem to make much sense:

Jun 7 10:36:28 t470 systemd-resolved[2161]: Using system hostname 't470'.
Jun 7 10:36:28 t470 systemd-resolved[2161]: New scope on link *, protocol dns, family *
Jun 7 10:36:28 t470 systemd-resolved[2161]: Found new link 3/wlp4s0
Jun 7 10:36:28 t470 systemd-resolved[2161]: Found new link 2/enp0s31f6
Jun 7 10:36:28 t470 systemd-resolved[2161]: Found new link 1/lo
Jun 7 10:36:28 t470 systemd-resolved[2161]: Sent message type=method_call sender=n/a destination=org.freedesktop.DBus object=/org/freedesktop/DBus interface=org.freedesktop.DBus member=Hello cookie=1 reply_cookie=0 error=n/a
Jun 7 10:36:28 t470 systemd-resolved[2161]: Got message type=method_return sender=org.freedesktop.DBus destination=:1.283 object=n/a interface=n/a member=n/a cookie=1 reply_cookie=1 error=n/a
Jun 7 10:36:28 t470 systemd-resolved[2161]: Sent message type=method_call sender=n/a destination=org.freedesktop.DBus object=/org/freedesktop/DBus interface=org.freedesktop.DBus member=RequestName cookie=2 reply_cookie=0 error=n/a
Jun 7 10:36:28 t470 systemd-resolved[2161]: Got message type=method_return sender=org.freedesktop.DBus destination=:1.283 object=n/a interface=n/a member=n/a cookie=4 reply_cookie=2 error=n/a
Jun 7 10:36:28 t470 systemd-resolved[2161]: Sent message type=method_call sender=n/a destination=org.freedesktop.DBus object=/org/freedesktop/DBus interface=org.freedesktop.DBus member=AddMatch cookie=3 reply_cookie=0 error=n/a
Jun 7 10:36:28 t470 systemd-resolved[2161]: Got message type=method_return sender=org.freedesktop.DBus destination=:1.283 object=n/a interface=n/a member=n/a cookie=5 reply_cookie=3 error=n/a
Jun 7 10:36:28 t470 systemd-resolved[2161]: Got message type=signal sender=org.freedesktop.DBus destination=:1.283 object=/org/freedesktop/DBus interface=org.freedesktop.DBus member=NameAcquired cookie=2 reply_cookie=0 error=n/a
Jun 7 10:36:28 t470 systemd-resolved[2161]: Got message type=signal sender=org.freedesktop.DBus destination=:1.283 object=/org/freedesktop/DBus interface=org.freedesktop.DBus member=NameAcquired cookie=3 reply_cookie=0 error=n/a
Jun 7 10:37:38 t470 systemd-resolved[2161]: Got DNS stub UDP query packet for id 1936
Jun 7 10:37:38 t470 systemd-resolved[2161]: Looking up RR for esprimo.zbmc.eu IN A.
Jun 7 10:37:38 t470 systemd-resolved[2161]: Switching to fallback DNS server 8.8.8.8.
Jun 7 10:37:38 t470 systemd-resolved[2161]: Cache miss for esprimo.zbmc.eu IN A
Jun 7 10:37:38 t470 systemd-resolved[2161]: Transaction 53812 for <esprimo.zbmc.eu IN A> scope dns on /.
Jun 7 10:37:38 t470 systemd-resolved[2161]: Using feature level UDP+EDNS0+DO+LARGE for transaction 53812.
Jun 7 10:37:38 t470 systemd-resolved[2161]: Using DNS server 8.8.8.8 for transaction 53812.
Jun 7 10:37:38 t470 systemd-resolved[2161]: Sending query packet with id 53812.
Jun 7 10:37:38 t470 systemd-resolved[2161]: Processing query...
Jun 7 10:37:39 t470 systemd-resolved[2161]: Processing incoming packet on transaction 53812.
Jun 7 10:37:39 t470 systemd-resolved[2161]: Verified we get a response at feature level UDP+EDNS0+DO from DNS server 8.8.8.8.
Jun 7 10:37:39 t470 systemd-resolved[2161]: Added NXDOMAIN cache entry for esprimo.zbmc.eu IN ANY 1799s
Jun 7 10:37:39 t470 systemd-resolved[2161]: Transaction 53812 for <esprimo.zbmc.eu IN A> on scope dns on / now complete with from network (unsigned).
Jun 7 10:37:39 t470 systemd-resolved[2161]: Sending response packet with id 1936 on interface 1/AF_INET.
Jun 7 10:37:39 t470 systemd-resolved[2161]: Freeing transaction 53812.
Jun 7 10:37:39 t470 systemd-resolved[2161]: Got DNS stub UDP query packet for id 1919
Jun 7 10:37:39 t470 systemd-resolved[2161]: Looking up RR for esprimo IN A.
Jun 7 10:37:39 t470 systemd-resolved[2161]: Sending response packet with id 1919 on interface 1/AF_INET.
Jun 7 10:37:39 t470 systemd-resolved[2161]: Processing query...

So why does it switch to using 8.8.8.8? It doesn't seem to have even tried 192.168.1.2.

poettering (Member) commented Jun 7, 2017

@chrisisbd The "Switching to fallback DNS server 8.8.8.8." message indicates that you have no DNS servers configured at all, in which case resolved will use compiled-in fallback servers, because it tries hard to just work even on a locally misconfigured system.
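If relying on the compiled-in fallback is unwanted, global servers can be pinned in resolved.conf; a minimal sketch, with the address taken from this thread:

# /etc/systemd/resolved.conf
[Resolve]
DNS=192.168.1.2
# An empty FallbackDNS= disables the compiled-in fallback server list.
FallbackDNS=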

chrisisbd commented Jun 7, 2017

No, I have a working DNS server on the LAN which (when I use it from Xubuntu 16.04 systems) works perfectly.

The relevant part from 'systemd-resolve --status' is:

Link 3 (wlp4s0)
Current Scopes: DNS
LLMNR setting: yes
MulticastDNS setting: no
DNSSEC setting: no
DNSSEC supported: no
DNS Servers: 192.168.1.2
             8.8.8.8
DNS Domain: zbmc.eu

Most of the time local names resolve OK on the 17.04 system too, but it (randomly?) falls back to using the 8.8.8.8 server for no obvious reason.

amazon750 commented Jun 8, 2017

Hi Lennart, thanks for all of your work so far. I'm trying to keep using systemd, but you can add me to the list of people for whom the old behaviour seemed to be standardised and useful, and the new behaviour seems like a regression.

> the assumption is always that all DNS servers configured provide the same dataset.

That assumption doesn't seem universal. I too have local names that aren't in public DNS, and some local overrides for external names, neither of which work if the failover happens (I only have a secondary server listed for the same reason as these other fellas: to keep internet access working more reliably for the rest of the local users if the primary fails). Under the old system, with the same resolv.conf and the same primary DNS server, things worked as I designed nearly 100% of the time. Now, with systemd, it's become quite unreliable. I hadn't needed to do per-domain request routing before, but I'd be fine with that solution. I also like the suggestion of a switch to choose which behaviour the local admin prefers. Anything would be better; I've been reduced to editing /etc/hosts to relieve some frustration, which I haven't otherwise done in years.

> And I think for the latter it would be better to do proper per-domain request routing, for which an RFE is filed as #5573, for example.

Actually, on thinking about it further, that isn't as good. I would still prefer to use my internal DNS as primary for everything, and have it forward requests that it can't answer. Then again, maybe my preference is a bad practice, and won't be supported. But as mentioned, this all used to work, and now it doesn't. If that's by design and won't be changed, that's unfortunate.

lifeboy commented Jun 23, 2017

This is a problem, @poettering. The behaviour is a major change from the expected way and doesn't work in practice. If I specify 3 nameservers, the expectation that the first is always queried first is settled. You can't change that now unilaterally.

Consider this scenario:

I have a VPN connection to a development environment where I have VMs running various tests. On the gateway of that cluster I run a DNS forwarder (dnsmasq) on pfSense (192.168.121.1). Here I override public DNS entries to resolve to servers on the LAN. This is not uncommon, and similar scenarios exist in many corporate environments. In addition to overriding existing public DNS entries, I also add my own in-house entries for my test servers.
Now, in addition to this development cluster, we run various production clusters on a similar basis (192.168.0.1). Again, a DNS forwarder allows the resolution of a domain to a LAN address instead of the public address.

Since we don't work in one location, and precisely because of that use VPNs to connect to the various clusters, we need the expected behaviour: always try to resolve in this order:
192.168.121.1
192.168.0.1
8.8.8.8

What happens with systemd-resolved is this:
Try to resolve abc.com from 192.168.121.1. It resolves. Open tools and work on servers.
In the course of time, some entry for xyz.com doesn't resolve from 192.168.121.1, so resolved switches to 192.168.0.1, where it does resolve, and xyz.com is now accessible. However, quite soon after that, abc.com is not found any more, because 192.168.0.1 doesn't have records for abc.com.

The only way to restore this is to clear the DNS cache and restart systemd-resolved.

This is not acceptable, and at the least we need a way to prevent this automatic jumping to a DNS server lower down in the priority list.
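
For completeness, a sketch of that reset, assuming the systemd-resolve tool of this era (newer versions use "resolvectl"; sending SIGUSR2 to the daemon also flushes the cache):

# Drop cached records:
systemd-resolve --flush-caches
# Restarting also forgets the remembered "current server" choice:
systemctl restart systemd-resolved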

mourednik commented Jun 25, 2017

Hey guys. The only fix appears to be "install Linux without systemd" or install BSD.

I'm not trolling. This is not a joke.

@systemd systemd locked and limited conversation to collaborators Jun 26, 2017

@systemd systemd unlocked this conversation Jul 9, 2017

keszybz (Member) commented Jul 9, 2017

Hm, we could allow the DNS= configuration to specify two tiers of servers (e.g. with the syntax DNS=1.2.3.4 -8.8.8.8 -8.8.4.4), where the minus-prefixed servers would only be used if the non-minus-prefixed servers fail. I'm not sure about the details; the way I think it could work would be to first round-robin on the first-tier servers, and then fall back to the second tier once all of the first-tier servers have accumulated enough failures, maybe 5 out of the last 10 queries. And after some timeout, let's say 10 minutes, we should switch back.

Of course such a setup is not useful for merging zones (as @poettering wrote) in any reliable way, but it makes it easier to achieve "soft failure", where some local names stop working but the internet is not fully broken when the local nameserver goes down. Also, thanks to automatic switching back after a delay, things would "unbreak" automatically.
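
To make the proposal concrete, a purely hypothetical resolved.conf using this suggested syntax might read as follows; note the "-" tier prefix is only a proposal in this thread, not an implemented option:

# HYPOTHETICAL: proposed two-tier syntax, not valid in any released systemd
[Resolve]
DNS=192.168.0.250 -8.8.8.8 -8.8.4.4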

lifeboy commented Jul 14, 2017

@keszybz keszybz removed the not-a-bug label Jul 15, 2017

keszybz (Member) commented Jul 15, 2017

This was already explained above (#5755 (comment)), but I'll try again:

  • we rotate among servers when the current server is not responding, to provide reliability.
  • we remember which server is "good" so that there's no initial delay.

Contrary to what you wrote, DNS clients do not cache answers in general. Actually, when programs are short-lived, they cannot cache answers even if they wanted to; every time a program is restarted it starts with a clean slate. The place where caching is performed is inside systemd-resolved (or in another external cache, like nscd, sssd, etc., but with systemd-resolved running the idea is that you don't need those).

With DNSSEC, the delay from a nonresponding name server becomes even more noticeable. We might want to adjust caching details, but caching is a central feature of systemd-resolved and it's not going away (both the memory of the "last good" server, and previously queried resource records). So if you want something to change to accommodate your use case, help design a solution like the one proposed (#5755 (comment)) so that it works for you.

@keszybz keszybz reopened this Jul 15, 2017

lifeboy commented Jul 16, 2017

I think @keszybz's workaround is not a good idea. It's still not a solution that keeps the established behaviour and adds strange new "features" for those that wish to enable them. Why does @poettering insist on breaking things that work just fine?

I'm being forced off systemd more and more, and I now see that's a good thing. The more people move away, the better.

keszybz (Member) commented Jul 16, 2017

Well, we keep telling you why, and you keep asking.

poettering (Member) commented Jul 17, 2017

@lifeboy there are two conflicting needs here:

  1. You want DNS server A to always be queried first and DNS server B second, for every single request, so that A's answer can differ from B's, and B is only used if A doesn't respond.

  2. What is actually implemented right now tries to be smart and reuses server B immediately if a previous lookup didn't get a timely answer from A. In order to make the system react quickly and in a snappy way we optimise things, learn from previous lookups, and try to avoid making the same mistake continuously, which would be to keep contacting server A when it isn't responsive.

Now, these two needs are directly conflicting: you want resolved to always start from the beginning of the list; we want it to learn from previous lookups. I am pretty sure that item 2 is the better choice, in particular when DNSSEC is used, where lookups become increasingly slow, and we really don't want to waste time contacting servers we already know are unresponsive.

I am not convinced that changing things to implement your option 1 is really the way forward, simply because it seriously hampers the usefulness of defining fallback servers: if you are in need of one, you always have to wait for the first one's full timeout, on every single request. A good way to do fallbacks, I think, is to expose similar performance and behaviour where we can, to make falling back cost as little as possible.

That said, resolved's behaviour is indeed different from the traditional libc resolver's (though primarily due to the fact that glibc can't really do better, since processes don't share system-wide state between lookups; every process runs its own DNS client stack). Hence I'd be willing to add a compat option of some kind (which could even be enabled by default for all DNS servers we learn from /etc/resolv.conf, as opposed to from NM directly), to get the older, simpler and less smart behaviour you are asking for.

I hope this makes sense.

diego-treitos commented Jul 17, 2017

Well, I do not have any conflict in my needs because I only have one:

  1. To have a DNS system that works just like it always worked.

Regarding the "smart" system, I did not experience that smartness in any way. In my experience the secondary DNS server ends up always being used. I ran tests and my local DNS (the primary) works perfectly fine: I did stress tests of hundreds of requests per second without a single failure. However, on my 2 computers with systemd-resolved, the secondary DNS is selected within a few minutes of a restart and it never goes back to the primary. The current implementation is not reliable.

So from what I see, this implementation adds features that nobody asked for, breaks backwards compatibility with what had worked reliably (and securely) for years, and does not even do it properly.

For now the obvious solution is to not use systemd-resolved. When/if that simpler behaviour is implemented, I will take a look at it, although I am not sure why I would use it instead of the traditional version.

lifeboy commented Jul 17, 2017

@bernux

bernux commented Sep 7, 2017

This behaviour gives us headaches here.
We have 3 DNS servers pushed by DHCP: 2 internal ones, and one external one that doesn't resolve internal records and is only there for emergencies.
On my desktop the DNS was switched 73 times in 2h30; I know because I wrote a script that checks whether a switch has happened and restarts systemd-resolved.
Maybe it works like it should in some circumstances, but it fails royally in others.
All I want now is to disable systemd-resolved.
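(Not bernux's actual script, but a rough reconstruction of such a watchdog; it assumes the "Current DNS Server" field that systemd-resolve --status prints in this systemd version, and PRIMARY is an example address:)

#!/bin/sh
# Restart systemd-resolved whenever it has switched away from the primary server.
PRIMARY=192.168.0.1
CURRENT=$(systemd-resolve --status | awk '/Current DNS Server/ { print $4 }')
if [ -n "$CURRENT" ] && [ "$CURRENT" != "$PRIMARY" ]; then
    systemctl restart systemd-resolved
fi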


@bernux

bernux commented Sep 7, 2017

@lifeboy solved my problem too

@ghost

ghost commented Sep 12, 2017

So if your DNS goes down or is misconfigured, systemd will silently fall back to Google's DNS servers? Does this "remembering" also remember the fallback? Will it keep using the fallback if the other servers come back up later? People might not even know they are sending DNS queries to Google when this happens.

@chrisisbd

chrisisbd commented Sep 12, 2017

It will stay on the backup DNS even if/when your local DNS comes back; that's the problem.

I.e. it's no longer possible to even have a 'backup' DNS server: you can no longer designate one server to be used by default, with a backup to fall back on if the main one fails.

@poettering

Member

poettering commented Sep 12, 2017

So if your DNS goes down or is misconfigured, systemd will silently fall back to Google's DNS servers?

No. It won't. If any DNS configuration is configured at all, it is used, regardless of whether it actually works. Only if no DNS configuration exists at all is the default DNS configuration specified at systemd build time used, via the DNS_SERVERS meson build parameter. We encourage distributions to set these servers to whatever they like, but many just leave them at 8.8.8.8. If you don't like that, please politely try to convince your distribution to change them to better-suited servers. Note that these fallback servers are used exclusively if no DNS configuration exists at all, and resolved immediately switches to whatever is configured as soon as something is configured again.

This is exactly the same, btw, as for the NTP servers used by timesyncd: the built-ins are picked at build time, and we encourage distros to set them to whatever is appropriate for them. Some do, others don't. If you don't like the choice your distro made there, then please try to convince them to use something else, and tell them what. Also, exactly as for DNS, these built-in fallback NTP servers are only used if no other configuration was made, and timesyncd immediately stops using them once you configure something.

Both DNS and NTP may be sourced from DHCP, btw, and are by default if you use networkd or NM.

Does this "remembering" also remember the fallback? Will it keep using the fallback if the other servers come back up later? People might not even know they are sending DNS queries to Google when this happens.

You are mixing up two unrelated things here: fallback servers (which are used only when no configuration exists at all), and the fact that resolved keeps using the DNS server it previously had success with (specifically, the first one in the current configuration that replied reliably) instead of starting every lookup over with servers it already knows are not responding reliably. The latter logic applies unconditionally; but when the configuration is replaced (or we change from configuration to no configuration, and thus to or away from the fallback DNS servers), we of course immediately stop using any DNS server no longer on the list.
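(To make the two mechanisms concrete, this is roughly how they are configured in /etc/systemd/resolved.conf; DNS= and FallbackDNS= are real options there, and the addresses are examples:)

[Resolve]
# Servers used for lookups, in addition to any learned per-link
DNS=192.168.0.1
# Used only when no other DNS configuration exists at all;
# overrides the DNS_SERVERS default compiled in at build time
FallbackDNS=192.168.0.1 8.8.8.8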

@poettering

Member

poettering commented Sep 12, 2017

It will stay on the backup DNS even if/when your local DNS comes back; that's the problem.

No it won't, and no, that's not the problem.

The problem is that all configured DNS servers are assumed to be equivalent, but in some people's configurations they aren't. If multiple DNS servers are configured and one of them, for whatever reason, doesn't respond, both the built-in glibc resolver and resolved switch to the next configured server. Now, because the glibc resolver doesn't maintain state between individual lookups, on the next lookup it starts again from the first DNS server, even though that server was unreliable the first time. resolved is smarter there, and keeps using the working server for subsequent lookups, until that one fails and it moves on to the next, and so on. That resolved does this is a good thing: it deals nicely with failure and ensures that lookups stay quick, because we use what we learned. However, it conflicts with setups built on the assumption that each lookup forgets all state and starts from the beginning of the list again.

@chrisisbd

chrisisbd commented Sep 12, 2017

It will stay on the backup DNS even if/when your local DNS comes back; that's the problem.

No it won't, and no, that's not the problem.

The problem is that all configured DNS servers are assumed to be equivalent, but in some people's configurations they aren't.

Well, alright, but the result is the same! The systemd resolver doesn't recognise that there is such a thing as a secondary/backup DNS. I don't think this makes resolved 'smarter'; for many people it makes it less smart.

@systemd systemd deleted a comment from lifeboy Sep 12, 2017

@ghost

ghost commented Sep 12, 2017

@poettering From my perspective they are related, and as I'm sure you know, there are many people like myself who would prefer not to send things to certain places. It is important to me that my configuration is respected, broken or not, and I'm glad that is the case. Control over my computer is something I value highly. Thank you for the detailed explanation.

My personal concerns aside, this behaviour does seem to be a problem for those with what I would call a common assumption. I think an option for the classic DNS resolution method would be well received by the community.

To possibly expand upon your new and improved process: what if resolved checked whether the previously-down DNS has come back up, and switched back to it when able? I understand that doing this on every request completely defeats the purpose, but how about at other times? Maybe periodically? The frustration arises when the "DNS switch" happens and, for some, the pain never goes away because of the "smartness". A little more smartness would go a long way, and if you're maintaining the list of servers in a stateful way, I think this is possible.
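(resolved has no built-in periodic re-probe like this, as far as I know; a blunt workaround in that spirit would be a systemd timer that resets its server state every so often. A sketch, with hypothetical unit names:)

# /etc/systemd/system/resolved-reset.service
[Unit]
Description=Reset systemd-resolved server state

[Service]
Type=oneshot
ExecStart=/bin/systemctl restart systemd-resolved

# /etc/systemd/system/resolved-reset.timer
[Unit]
Description=Periodically reset systemd-resolved server state

[Timer]
OnBootSec=30min
OnUnitActiveSec=30min

[Install]
WantedBy=timers.target

Enable it with "systemctl enable --now resolved-reset.timer".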

@leonelwilliams

leonelwilliams commented Sep 12, 2017

It seems to me that things would be much easier if one used a (sub)domain one owns under an ICANN TLD, with public nameservers, instead of making up one's own (e.g. .local/.internal). That works with all configured DNS resolvers without fiddling around.

@jnye

jnye commented Sep 12, 2017

The Linux man page for resolv.conf(5) says the servers should be tried in order. There is a rotate option available, but it sounds like most people complaining here don't use it. Without the rotate option, the man page says the same order is tried again on every lookup.
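(For comparison, the classic glibc behaviour is all driven by /etc/resolv.conf; timeout, attempts and rotate are documented options there, and the addresses below are examples:)

# First server is tried first on every single lookup;
# the second is consulted only after a timeout or failure.
nameserver 192.168.0.1
nameserver 8.8.8.8
# How quickly glibc gives up on a server and retries:
options timeout:2 attempts:2
# "options rotate" would instead spread queries round-robin.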

@ryanaslett

ryanaslett commented Sep 12, 2017

Is there, or could there be, a configurable threshold for deciding that server A is "unresponsive" and that the switch to server B should be made? The determination of whether a server can handle requests appears far too fragile, switching to the next server at the first sign of trouble. Given that DNS runs over UDP, a single failed response or timeout is not grounds to conclude that a server is unresponsive.
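(As far as I can tell there is no such threshold knob in this version; what you can do is watch the behaviour, for example:)

# Per-link DNS state, including the server resolved is currently pinned to
systemd-resolve --status

# With debug logging enabled as described earlier in this thread,
# follow the switches and the context that triggered them:
journalctl -u systemd-resolved -f | grep -i switching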

@tebruno99

tebruno99 commented Sep 12, 2017

I have a local DNS server that hosts my public names internally, so that my traffic doesn't go out my router and right back in through my public port to reach my own locally hosted website (which doesn't work on Comcast, btw). This is a major issue for me, since the 2nd DNS server in my list is the public 8.8.8.8, just in case my internal one isn't working and I want to use Google to find out why. I often restart my internal DNS while making changes, and this issue has hit me several times; it takes down all my internal services, because I can't loop back through the public IP from my LAN.

Primary/backup, not a dumb list, has always been what I was taught and how I expect the local resolver to work.

@kroeckx

kroeckx commented Sep 12, 2017

I expect the first server to work almost all the time. Reasons it might not answer include a dropped packet, the upstream server it queries not replying, and so on. The problem might not be that the server isn't working, just some external problem.

In case the domain you're trying to look up is having problems (or the internet is down), you might try all the servers, have each fail, and then switch to some default that also doesn't work, which does not seem like something you want.

So I would hope that it retries the servers after some time to see if they have come back up. To me the list is, at the very least, a preferred order for where to send requests.

But I also want to add that I expect all servers in that file to have the same view of DNS; it should not be that one can resolve what is behind the VPN and another cannot. That packet too can be dropped, the next server tried, and you'll get the wrong result.

@kroeckx

kroeckx commented Sep 12, 2017

Since glibc doesn't do any DNSSEC checking, all the IP addresses in my resolv.conf have become addresses of servers I run myself, which do validate DNSSEC. So I would really like to avoid any fallback to some default server over which I have no control.
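(One way to spot-check that a resolver in resolv.conf really validates DNSSEC; example.org and the address are examples:)

# Query the server directly and check that the "ad" (authenticated
# data) flag is set in the reply:
dig +dnssec example.org @192.168.0.1 | grep flags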

@darkstar

darkstar commented Sep 12, 2017

Maybe it just boils down to a timeout that makes systemd-resolved switch to the next server quicker than the usual resolver does. That could explain why some people are seeing a switch to the second server even though lookups with dig/nslookup work just fine. If so, it can probably be fixed or worked around.

But I think most people miss the fact that, as was already stated, the DNS servers are supposed to be exactly equivalent. And if they are equivalent, it doesn't matter which one is chosen. If you want reliable DNS, you have to provide one or more reliable DNS servers. Or provide only one server (and accept the occasional disruption when a single lookup fails) and let that server handle the forwarding to 8.8.8.8.
Yes, it might be annoying for home users who have only one DNS server. But there are already plenty of other options that can help in such a setting (nscd, sssd, etc.), and there's no reason not to use them instead of systemd-resolved.
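(A minimal sketch of that "one local forwarder" setup using dnsmasq; server= and listen-address= are real dnsmasq options, the upstream address is an example, and unbound etc. work equally well:)

# /etc/dnsmasq.conf
# Answer local names (from /etc/hosts) and forward everything else upstream:
server=8.8.8.8
# Only serve clients on the loopback interface:
listen-address=127.0.0.1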

@Harleqin

Harleqin commented Sep 12, 2017

Where does the assumption that all DNS servers are supposed to be equivalent come from?

You see, they are not.

@mthorpe7

mthorpe7 commented Sep 13, 2017

@Harleqin - it comes from RFC 1034 and RFC 1035:

The strategy is to cycle around all of the addresses for all of the servers with a timeout between each transmission.

It's fairly explicit that the resolver may determine the order:

To complete initialization of SLIST, the resolver attaches whatever history information it has to the each address in SLIST. This will usually consist of some sort of weighted averages for the response time of the address, and the batting average of the address (i.e., how often the address responded at all to the request).

@systemd systemd locked and limited conversation to collaborators Sep 13, 2017

@systemd systemd deleted a comment from tebruno99 Sep 13, 2017

@systemd systemd deleted a comment from ctrix Sep 13, 2017

@systemd systemd deleted a comment from leonelwilliams Sep 13, 2017

@poettering

Member

poettering commented Sep 13, 2017

Sorry, but given how the quality of discussion has degraded, and how many inappropriate comments I have had to delete, I have now locked this issue. I will unlock it again in a few days when things have quieted down. Thank you for understanding.
