systemd-resolved does not keep the order of the DNS servers #5755
Comments
|
resolved will always begin with the first configured DNS server, and switch to any other only after failures to contact it. If you turn on debug logging in resolved (by setting the SYSTEMD_LOG_LEVEL=debug env var for it), then you'll see the precise reason it switched over. Switching over can have many reasons: the IP route to the destination is missing, the server might simply not respond at all, or respond only with an error... To turn on debug logging, use "systemctl edit systemd-resolved", then write the two lines "[Service]" and "Environment=SYSTEMD_LOG_LEVEL=debug",
and issue "systemctl restart systemd-resolved", then watch the output with "journalctl -u systemd-resolved -f", and look for the lines announcing the switch and the context before it. I am pretty sure the output you'll see then will explain enough, hence I am closing this now. Feel free to reopen if it doesn't. |
poettering
closed this
Apr 24, 2017
poettering
added
not-a-bug
resolve
labels
Apr 24, 2017
diego-treitos
commented
Apr 25, 2017
•
|
I cannot reopen the issue, I am afraid. Regarding the issue, first of all, what I said is that I would expect
The primary DNS works just fine. I never see it offline. Actually, when
So here we see that despite the primary DNS server being available, I've never had any problems with any of them until I upgraded them the new Ubuntu version that uses Just in case I enabled the debug as you requested and this is what I see for the requests:
And this is what I see when it switches (Added >>> to easily see the switch):
It looks like it switches each time it has a problem resolving a record, and then it keeps using that name server for subsequent requests. |
This is what happens. However, in contrast to classic nss-dns we have memory: when we noticed that a DNS server didn't respond or returned some failure, or for some other reason wasn't working for us, and we skip to the next, then we remember that and the next lookup is attempted with the new one. If that one fails too, then we'll skip to the next one and the next one and so on, until we reach the end of the list and start from the beginning of the list again. This behaviour has the big advantage that we can build on what we learnt about a DNS server before, and don't waste the same timeout on a DNS server for each lookup should it not respond. Or to say this differently: If you specify multiple DNS servers, then that's not a way to merge DNS zones or so. It's simply a way to define alternative servers should the first DNS server not work correctly. If you want to route lookups in specific zones to specific DNS servers, then resolved doesn't really offer a nice way for that. A hack is to define multiple interfaces however, and configure different DNS servers and domains for them. |
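The "memory" described in the comment above can be sketched in a few lines of Python. This is only an illustration of the idea, not resolved's actual code: the class name `StickyServerList` and the `query` callback (which returns `None` when a server does not answer) are hypothetical.

```python
from typing import Callable, List, Optional


class StickyServerList:
    """Remembers the last server that worked and starts the next lookup there."""

    def __init__(self, servers: List[str]) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        self.servers = list(servers)
        self.current = 0  # index of the server the next lookup starts with

    def lookup(self, name: str,
               query: Callable[[str, str], Optional[str]]) -> Optional[str]:
        # Try each server at most once per lookup, starting from the remembered one.
        for _attempt in range(len(self.servers)):
            server = self.servers[self.current]
            answer = query(server, name)
            if answer is not None:
                return answer  # success: this server stays current
            # Failure: move on, wrapping around to the start of the list.
            self.current = (self.current + 1) % len(self.servers)
        return None  # every server failed for this lookup
```

Note that nothing in this sketch ever moves back to the first server on its own; it only returns there after the servers behind it have failed, which is the behaviour discussed in the rest of this thread.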
This is where the server switches, and the lines before tell you why: the DNS server didn't respond to our query with transaction ID 29131. Why it didn't respond isn't known: somehow no UDP response packet was received. This could be because the query or the response packet simply got dropped on the way, or because the server refused to reply... Either way, resolved will retry but use a different DNS server, in the hope that works better. |
and here the same thing, when it switches back: the response for transaction 47747 wasn't received either, hence resolved tries the other server again, switching back. |
diego-treitos
commented
Apr 25, 2017
•
Yes I see that. And precisely because it is using UDP, it is easier for some packets to get dropped and for the DNS server to switch. With the current systemd implementation you lose that priority in resolving names as it works more like a round-robin, and I understand the advantages of that in many scenarios (quick DNS failover switch). I think it would be great to have some configuration options on this like:
Or even to periodically check for primary nameserver availability so you can go back to use it asap. |
diego-treitos
commented
Apr 25, 2017
|
BTW, odd thing is that it looks easier to switch to the external
nameserver, when this is never able to resolve the local domains.
On 25/04/2017 at 12:21 PM, "Lennart Poettering" <notifications@github.com>
wrote:
…
|
Not sure I grok what you are trying to say? Note that if a DNS lookup results in a NODATA or NXDOMAIN reply, then that's considered final, and no other DNS server is tried. Again, defining multiple DNS servers is not a way to merge zones, it's a way to deal with unreliable servers; the assumption is always that all DNS servers configured provide the same dataset. So I think I grok what you are trying to do, but quite frankly, I think that even without resolved involved, this scheme is not reliable, and basically just takes advantage of a specific implementation detail of nss-dns/glibc. You are merging two concepts in what you are trying to do: fallback due to unreliable servers, and "merging" of zones. And I think for the latter it would be better to do proper per-domain request routing, for which an RFE is filed in #5573, for example. |
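To illustrate why NXDOMAIN and NODATA being final matters for "zone merging" setups, here is a small Python sketch. It assumes a hypothetical `query` callback that returns an `(Outcome, data)` pair; the `Outcome` enum and the `resolve` function are made up for this example and are not part of any real resolver API.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Outcome(Enum):
    ANSWER = "answer"
    NXDOMAIN = "nxdomain"        # the server says the name does not exist
    NODATA = "nodata"            # the name exists, but not with this record type
    NO_RESPONSE = "no_response"  # timeout or dropped packet


def resolve(servers: List[str], start: int, name: str,
            query: Callable[[str, str], Tuple[Outcome, Optional[str]]]
            ) -> Tuple[Outcome, Optional[str], int]:
    """Return (outcome, data, index of the server that ended the lookup)."""
    index = start
    for _attempt in range(len(servers)):
        outcome, data = query(servers[index], name)
        # Any real reply is final, including negative ones such as NXDOMAIN.
        if outcome is not Outcome.NO_RESPONSE:
            return outcome, data, index
        # Only a missing reply makes us try the next server.
        index = (index + 1) % len(servers)
    return Outcome.NO_RESPONSE, None, index
```

If the public server happens to be the current one, a lookup for an internal-only name ends with its NXDOMAIN and the internal server is never asked, which matches the symptom reported above.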
thomasleplus
commented
May 27, 2017
|
I have a similar situation to @diego-treitos. My company has a single internal DNS and so our DHCP server provides it as primary DNS, and OpenDNS as secondary. If any request to our DNS fails, systemd will switch to OpenDNS and I lose the ability to connect to internal servers. And since OpenDNS doesn't fail, I am never switching back to our DNS unless I disconnect and reconnect my network. I agree that the proper solution would be having a reliable DNS server or, even better, two internal servers for redundancy. But while I try to convince our sysadmins of that, IMHO it would be nice to have an option. |
diego-treitos
commented
May 29, 2017
|
I agree with that. I know that this may not be a direct problem with systemd, but this service is being used to replace a previous one, so I think it would be nice if it could work just like the service it is replacing. |
chrisisbd
commented
Jun 7, 2017
|
Yes, I agree that this is a problem. I have just upgraded a system to Ubuntu 17.04 and what used to work in 16.04 no longer works. We need a way to say that the second DNS is only to be used if the first one fails; the first one should always be tried first. |
chrisisbd
commented
Jun 7, 2017
|
Here's my output after adding the debug logging; it doesn't seem to make much sense:
Jun 7 10:36:28 t470 systemd-resolved[2161]: Using system hostname 't470'.
So why does it switch to using 8.8.8.8? It doesn't seem to have even tried 192.168.1.2. |
|
@chrisisbd The "Switching to fallback DNS server 8.8.8.8." message indicates that you have no DNS servers configured at all, in which case resolved will use compiled-in fallback servers because it tries hard to just work also if you have a locally misconfigured system |
chrisisbd
commented
Jun 7, 2017
|
No, I have a working DNS on the LAN which (when I use it from Xubuntu 16.04 systems) works perfectly. The relevant part from 'systemd-resolve --status' is:- Link 3 (wlp4s0) Most of the time local names resolve OK on the 17.04 system too but it (randomly?) falls back to using the 8.8.8.8 server for no obvious reason. |
amazon750
commented
Jun 8, 2017
|
Hi Lennart, thanks for all of your work so far. I'm trying to keep using systemd, but you can add me to the list of people for whom the old behaviour seemed to be standardised and useful, and the new behaviour seems like a regression.
That assumption doesn't seem universal. I too have local names that aren't in public DNS, and some local overrides for external names, neither of which work if the failover happens (I only have a secondary server listed for the same reason as these other fellas: to keep internet access working more reliably for the rest of the local users if the primary fails). Under the old system, with the same resolv.conf and the same primary DNS server, things worked as I designed nearly 100% of the time. Now, with systemd, it's become quite unreliable. I hadn't needed to do per-domain request routing before, but I'd be fine with that solution. I also like the suggestion of a switch to choose which behaviour the local admin prefers. Anything would be better, I've been reduced to editing /etc/hosts to relieve some frustration, which I haven't otherwise done in years.
Actually, on thinking about it further, that isn't as good. I would still prefer to use my internal DNS as primary for everything, and have it forward requests that it can't answer. Then again, maybe my preference is a bad practice, and won't be supported. But as mentioned, this all used to work, now it doesn't. If that's by design and won't be changed, that's unfortunate. |
lifeboy
commented
Jun 23, 2017
|
This is a problem, @poettering. The behaviour is a major change from the expected way and doesn't work in practice. If I specify 3 nameservers, the expectation that the first is always queried first is settled. You can't change that now unilaterally. Consider this scenario: I have a VPN connection to a development environment where I have VMs running various tests. On the gateway of that cluster I run a DNS forwarder (dnsmasq) on pfsense (192.168.121.1). Here I override the public DNS to resolve to a server on the LAN. This is not uncommon, and similar scenarios exist in many corporate environments. In addition to overriding existing public DNS entries, I also add my own in-house entries for my test servers. Since we don't work in one location, which is precisely why we use VPN to connect to the various clusters, we need the expected behaviour: Always try to resolve in this order: What happens with systemd-resolved is this: The only way to restore this is to clear the dns cache and restart systemd-resolved. This is not acceptable and at the least we need a way to prevent this automatic jumping to a dns server lower down in the priority list. |
mourednik
commented
Jun 25, 2017
•
|
Hey guys. The only fix appears to be "install Linux without systemd" or install BSD. I'm not trolling. This is not a joke. |
poettering
locked and limited conversation to collaborators
Jun 26, 2017
johndoe31415
referenced this issue
Jun 28, 2017
Open
systemd-resolved does not use DNS for local resolution #6224
keszybz
unlocked this conversation
Jul 9, 2017
|
Hm, we could allow DNS= configuration to specify two tiers of servers (e.g. with the syntax Of course such a setup is not useful for merging zones (as @poettering wrote) in any reliable way, but it makes it easier to achieve "soft failure", where some local names stop working but the internet is not fully broken when the local nameserver goes down. Also, thanks to automatic switching back after a delay, things would "unbreak" automatically. |
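The two-tier idea with automatic switching back could look roughly like the following Python sketch. Everything here is hypothetical: the class `TieredServers`, the `retry_after` parameter and the injectable `clock` are assumptions made for illustration, and the sketch says nothing about what configuration syntax such a feature would use.

```python
import time
from typing import Callable, List, Optional


class TieredServers:
    """Prefer the primary tier; use the fallback tier for a while after a failure."""

    def __init__(self, primary: List[str], fallback: List[str],
                 retry_after: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.primary = list(primary)
        self.fallback = list(fallback)
        self.retry_after = retry_after  # seconds before the primary tier is retried
        self.clock = clock
        self.demoted_at: Optional[float] = None  # when we gave up on the primary tier

    def candidates(self) -> List[str]:
        demoted = (self.demoted_at is not None
                   and self.clock() - self.demoted_at < self.retry_after)
        if demoted:
            return self.fallback + self.primary
        # Delay expired (or never demoted): the primary tier goes first again.
        self.demoted_at = None
        return self.primary + self.fallback

    def lookup(self, name: str,
               query: Callable[[str, str], Optional[str]]) -> Optional[str]:
        for server in self.candidates():
            answer = query(server, name)
            if answer is not None:
                if server in self.fallback and self.demoted_at is None:
                    self.demoted_at = self.clock()  # remember when we fell back
                return answer
        return None
```

With this shape, a failing primary costs one timeout per `retry_after` window rather than one per lookup, and local names start resolving again on their own once the window expires.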
lifeboy
commented
Jul 14, 2017
|
I don't get why you would want to switch nameservers in the first place.
DNS clients cache the answers (as they should), so it's only the first
lookup of a record that would possibly be somewhat slower. The point is
this: If a list of servers is specified, the default should be to always
stick to the list order. This doesn't break anything. If you want to add new
functionality, then add a flag to enable that (serverrotate=yes or
something similar).
…
|
keszybz
removed
the
not-a-bug
label
Jul 15, 2017
|
This was already explained above (#5755 (comment)), but I'll try again:
Contrary to what you wrote, DNS clients do not cache answers in general. Actually, when programs are short-lived, they cannot cache answers even if they wanted to; every time a program is restarted it starts with a clean slate. The place where caching is performed is inside of systemd-resolved (or in another external cache, like nscd, sssd, etc., but with systemd-resolved running the idea is that you don't need those). With DNSSEC, the delay from a nonresponding name server becomes even more noticeable. We might want to adjust caching details, but it's a central feature of systemd-resolved functionality and it's not going away. (Both the memory of the "last good" server, and previously queried resource records). So if you want something to change to accommodate your use case, help design the solution like proposed (#5755 (comment)) so that it works for you. |
keszybz
reopened this
Jul 15, 2017
lifeboy
commented
Jul 16, 2017
|
I think @keszybz's workaround is not a good idea. It's still not a solution that keeps the established behaviour and adds strange new "features" for those that wish to enable them. Why does @poettering insist on breaking things that work just fine? I'm being forced off systemd more and more and I now see that's a good thing. The more people move away, the better. |
|
Well, we keep telling you why, and you keep asking. |
|
@lifeboy there are two conflicting needs here:
1. You want that DNS server A is always queried first, and DNS server B second, for every single request, so that A's answer can be different than B's, and B is only used if A doesn't respond.
2. What is actually implemented right now tries to be smart and reuses server B immediately if a previous lookup didn't get a timely answer from A. In order to make the system react quickly and in a snappy way we optimise things, and learn from previous lookups, and try to avoid making the same mistake continuously, which would be to keep contacting server A which isn't responsive.
Now, these two needs are directly conflicting: you want resolved to always start from the beginning, we want that we learn from previous lookups. I am pretty sure that item 2 is the better choice though, in particular when DNSSEC is used where lookups become increasingly slow, and we really don't want to waste time contacting servers we already know are unresponsive. I am not convinced changing things to implement your option 1 is really the way forward though, simply because this seriously hampers the usefulness of defining fallback servers: if you are in need of one you always have to wait for the first one's full timeout, on every single request. A good way to do fallbacks I think however is to expose similar performance and behaviour if we can, to make the fallback cost as little as possible. That said, resolved's behaviour is indeed different from traditional libc's resolver (though primarily due to the fact that glibc can't really do it better since they don't share system-wide state between lookups, but every process runs its own DNS client stack). Hence I'd be willing to add a compat option of some kind (which could even be enabled by default for all DNS servers we learn from /etc/resolv.conf as opposed to NM directly), to get the older, simpler and less smart version you are asking for. I hope this makes sense. |
diego-treitos
commented
Jul 17, 2017
•
|
Well, I do not have any conflict in my needs because I only have one.
Regarding the "smart" system, I did not experience that smartness in any way. In my experience the secondary DNS is always being used. I did tests and my local DNS (primary) works perfectly fine. I did stress tests of hundreds of requests per second and not a single failure. However, on my 2 computers where I have systemd-resolved, the secondary DNS is selected a few minutes after restart and it never goes back to the primary. The current implementation is not reliable. So from what I see, this implementation adds features that nobody asked for, breaking backwards compatibility with what was working reliably (and securely) for years, and it does not even do it properly. For now the obvious solution is to not use systemd-resolved. When/If that simpler solution is implemented, I will take a look at it, although I am not sure why I would use it instead of the traditional version. |
lifeboy
commented
Jul 17, 2017
•
|
On 17 July 2017 at 18:53, Lennart Poettering ***@***.***> wrote:
> @lifeboy there are two conflicting needs here:
> 1. You want that DNS server A is always queried first, and DNS server B
> second, for every single request, so that A's answer can be different than
> B's, and B is only used if A doesn't respond.
> 2. What is actually implemented right now tries to be smart and reuses
> server B immediately if a previous lookup didn't get a timely answer from
> A. In order to make the system react quickly and in a snappy way we
> optimise things, and learn from previous lookups, and try to avoid to make
> the same mistake continuously, which would be to keep contacting server A
> which isn't responsive.
> Now, these two needs are directly conflicting: you want resolved to always
> start from the beginning, we want that we learn from previous lookups. I am
> pretty sure that item 2 is the better choice though, in particular when
> DNSSEC is used where lookups become increasingly slow, and we really don't
> want to waste time contacting servers we already know are unresponsive.
> I am not convinced changing things to implement your option 1 is really
> the way forward though, simply because this seriously hampers the
> usefulness of defining fallback servers: if you are in need of one you
> always have to wait for the first one's full timeout, on every single
> request. A good way to do fallbacks I think however is to expose similar
> performance and behaviour if we can, to make the fallback cost as little as
> possible.
> That said, resolved's behaviour *is* indeed different from traditional
> libc's resolver (though primarily due to the fact that glibc can't really
> do it better since they don't share system-wide state between lookups, but
> every process runs its own DNS client stack). Hence I'd be willing to add a
> compat option of some kind (which could even be enabled by default for all
> DNS servers we learn from /etc/resolv.conf as opposed to NM directly), to
> get the older, simpler and less smart version you are asking for.
I think that would be a good way forward on this, yes. Nameserver
failures, especially on a LAN where different records are inserted than are
available in the public DNS servers (I think you call this zone merging),
are very rare. Providing for this just to do a faster lookup in the
rare event that a nameserver fails is not productive in the real world. I
think on every corporate LAN I have worked on, some form of the "zone
merging" is being used, at least in our part of the world.
> I hope this makes sense.
What you have written makes sense and has previously too. The problem is
that there doesn't seem to be a way to recreate the desired behaviour in any
way with the systemd-resolved functionality the way it is now.
regards
Roland
|
bernux
commented
Sep 7, 2017
|
This behaviour gives us headache here. |
lifeboy
commented
Sep 7, 2017
|
On 7 September 2017 at 17:33, Bernie Noel ***@***.***> wrote:
> This behaviour gives us headaches here.
> We have 3 DNS servers pushed by DHCP: 2 internal and one external, which
> doesn't resolve internal records and is just here for emergencies.
> On my desktop (in 2h30) the DNS server has been switched 73 times; I wrote
> a script that checks whether a switch has happened and restarts
> systemd-resolved.
> Maybe it works like it should in some circumstances but it fails royally in
> others.
> All I want, now, is to disable systemd-resolved.
Then do so. I simply gave up and removed systemd-resolved and enabled
dnsmasq to resolve dns for me. Problem solved.
|
bernux
commented
Sep 7, 2017
|
@lifeboy solved my problem too |
HackerMan69
commented
Sep 12, 2017
|
So if your DNS goes down or is misconfigured, systemd will silently fall back to Google's DNS servers? Does this "remembering" also remember the fallback? Will it use the fallback if other servers come up at a later time? People might not even know they are sending DNS queries to Google when this happens. |
chrisisbd
commented
Sep 12, 2017
|
It will stay using the backup DNS even if/when your local DNS comes back, that's the problem. I.e. it's no longer possible to even have a 'backup' DNS server. You can no longer designate one server as the one to be used by default with a backup one to use if the main one fails. |
No. It won't. If any DNS configuration is configured at all, it is used, regardless of whether it actually works or not. If no DNS configuration exists at all, then the default DNS configuration specified at systemd build time is used, using the DNS_SERVERS meson build parameters. We encourage distributions to set these servers to whatever they like, but many just leave it at 8.8.8.8. If you don't like that please politely try to convince your distribution to change them to better suited servers. Note that these fallback servers are exclusively used if no DNS configuration exists at all, and resolved immediately switches to whatever is configured as soon as something is configured again. This is exactly the same btw as it is for NTP servers for timesyncd: the built-in is picked at build time, and we encourage distros to set them to whatever is appropriate for them. Some do, others don't. If you don't like the choice your distro made there, then please try to convince them to use something else and tell them what. Also, exactly as for DNS: these built-in fallback NTP servers are only used if no other configuration was made, and timesyncd immediately stops using them if you configure something. Both DNS and NTP may be sourced from DHCP btw, and are by default if you use networkd or NM.
You are mixing up two unrelated things here: fallback servers (which are used in case no configuration exists at all), and the fact that resolved continues to use DNS servers that it previously had success with (or specifically the first in the current configuration that replied reliably), instead of always beginning all lookups again with DNS servers it already knows are not responding reliably. The latter logic applies unconditionally, but when configuration is replaced (or we change from configuration to no configuration and thus to or away from the fallback DNS servers), we of course immediately stop using any DNS server no longer on the list to be used. |
No it won't, and no it's not the problem. The problem is that all configured DNS servers are assumed to be equivalent but in some people's configuration they aren't. If multiple DNS servers are configured, and one for some reason whatsoever doesn't respond, both the built-in glibc resolver and resolved switch to the next DNS server configured. Now, because the glibc resolver doesn't maintain state between individual lookups, on the next lookup it will start again from the first DNS server, even though it wasn't reliable the first time. resolved is smarter there, and continues to use the working DNS server for subsequent lookups, until one of them fails and it switches on to the next one and so on. That resolved does that is a good thing, since it deals nicely with failure, and ensures that lookups remain quick and we use what we learned. However, it conflicts with setups which are built on the assumption that each lookup forgets all state and starts from the beginning of the list again. |
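The cost argument in the comment above can be made concrete with a toy Python model that counts how many servers are contacted over a series of lookups. It is a deliberate simplification: the function names are made up, a server is either always down or always up, and each contact with a dead server stands in for one full timeout.

```python
from typing import List, Set


def attempts_stateless(servers: List[str], down: Set[str], lookups: int) -> int:
    """glibc-style: every lookup starts again at the first server."""
    total = 0
    for _lookup in range(lookups):
        for server in servers:
            total += 1
            if server not in down:
                break  # got an answer, this lookup is done
    return total


def attempts_sticky(servers: List[str], down: Set[str], lookups: int) -> int:
    """resolved-style: every lookup starts at the last server that worked."""
    total = 0
    current = 0
    for _lookup in range(lookups):
        for _attempt in range(len(servers)):
            total += 1
            if servers[current] not in down:
                break  # got an answer, keep this server for the next lookup
            current = (current + 1) % len(servers)
    return total
```

With two servers, the first one dead, and ten lookups, the stateless strategy contacts the dead server ten times while the sticky one contacts it once. The model does not capture the other side of the trade-off raised in this thread, namely that a single dropped UDP packet also moves the sticky strategy away from a healthy first server.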
chrisisbd
commented
Sep 12, 2017
Well, alright, but the result is the same! The systemd resolver doesn't recognise that there is such a thing as a secondary/backup DNS. I don't think this makes resolved 'smarter'; for many people this makes it less smart. |
poettering
deleted a comment from
lifeboy
Sep 12, 2017
HackerMan69
commented
Sep 12, 2017
•
|
@poettering From my perspective they are related, and as I'm sure you know there are many people like myself who would prefer not to send things to certain places. It is important to me that my configuration is respected, broken or not. I'm glad this is the case. Control over my computer is something I hold as high value. Thank you for the detailed explanation. My personal concerns aside, this behavior does seem to be a problem for those with what I would say is a common assumption. I think an option for the classic DNS resolution method would be well received by the community. To possibly expand upon your new and improved process, what if resolved checked to see whether the previously down DNS server has come back up and then switched back to it when able? I understand doing this during each request completely defeats the purpose, but how about other times? Maybe periodically? The frustration arises when the "DNS switch" happens and, for some, the pain never goes away due to the "smartness". A little more smartness would go a long way, and if you're maintaining the list of servers in a stateful way I think this is possible. |
leonelwilliams
commented
Sep 12, 2017
•
|
It seems to me that things would be much easier if one used a (sub)domain one owned in a ICANN TLD with public nameservers instead of making up your own (e.g. |
jnye
commented
Sep 12, 2017
•
|
The Linux man page for resolv.conf(5) says they should be tried in order. There is a |
ryanaslett
commented
Sep 12, 2017
|
Is there, or could there be, a configurable threshold by which the determination is made that server A is "unresponsive" and that the switch should be made to server B? It appears that the determination of whether or not a server is capable of handling requests is far too fragile, and far too likely to switch to the next server at the first sign of trouble, which, given that DNS is on UDP, one cannot make the assertion that a single failed response or timeout is grounds to establish that a server is unresponsive. |
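As a sketch of what such a threshold might look like, the following Python class only moves to the next server after several consecutive failures. The name `ThresholdSwitcher` and the `max_failures` parameter are hypothetical and do not correspond to any existing resolved option.

```python
from typing import List


class ThresholdSwitcher:
    """Switch servers only after `max_failures` consecutive failed queries."""

    def __init__(self, servers: List[str], max_failures: int = 3) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        if max_failures < 1:
            raise ValueError("max_failures must be at least 1")
        self.servers = list(servers)
        self.max_failures = max_failures
        self.current = 0   # index of the server in use
        self.failures = 0  # consecutive failures seen on the current server

    def record(self, success: bool) -> str:
        """Record the outcome of one query; return the server to use next."""
        if success:
            self.failures = 0  # one good reply clears the failure streak
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.current = (self.current + 1) % len(self.servers)
                self.failures = 0
        return self.servers[self.current]
```

A single lost UDP packet then no longer causes a switch, at the price of up to `max_failures` timeouts before failover when the server really is down.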
tebruno99
commented
Sep 12, 2017
•
|
I have a local DNS server that hosts my public names internally so that my traffic doesn't go out my router right back into my public port to connect to my own locally hosted website (which doesn't work on Comcast btw). This is a major issue for me since the 2nd DNS server in my list is the public 8.8.8.8, just in case my internal one isn't working and I want to use Google to find out why. I often restart my internal DNS as I make changes and have had this issue several times, which disables all my internal services since I can't loop back through the public IP from my LAN. Primary/Backup, not a dumb list, has always been what I was taught and how I expect the local resolver to work. |
kroeckx
commented
Sep 12, 2017
|
I expect the first to work almost all the time. Reasons it might not answer are that some packet got dropped, the DNS server it's querying doesn't reply, and so on. The problem might not be that the server isn't working, just some external problem. In case the domain you're trying to look up is having problems (or the internet is down), you might try all servers, have each fail, and then switch to some default that also doesn't work, which does not seem like something you want. So I would hope that it would retry the servers after some time to see if they come back up. The list is at least a preferred order of where to send the request to for me. But I also want to add that I expect all servers in that file to have the same view of DNS, and not that one can return something for what is behind the VPN and the other not. This packet can also be dropped and so the next one can be tried and you'll get the wrong result. |
kroeckx
commented
Sep 12, 2017
|
Since glibc doesn't do any checking of DNSSEC, all the IP addresses in my resolv.conf have become the addresses of servers I run myself and that check DNSSEC. So I would really like to avoid a fallback to some default server over which I have no control. |
darkstar
commented
Sep 12, 2017
|
Maybe it just boils down to a timeout that makes systemd-resolved switch to the next server quicker than the usual resolver. That could explain why some people are seeing a switch to the second server even though lookups with dig/nslookup work just fine. If so, it can probably be fixed or worked around. But I think most people don't understand the fact that, as was already stated, the DNS servers are supposed to be exactly equivalent. And if they are equivalent, then it doesn't matter which one you choose. If you want reliable DNS, you have to provide one or more reliable DNS servers. Or, provide only one server (and deal with the occasional disruption if a single lookup fails) and let that server handle the forwarding to 8.8.8.8. |
Harleqin
commented
Sep 12, 2017
|
Where does the assumption that all DNS servers are supposed to be equivalent come from? You see, they are not. |
mthorpe7
commented
Sep 13, 2017
|
@Harleqin - it comes from RFC 1034 and RFC 1035:
It's fairly explicit that the resolver can determine the order:
|
poettering
locked and limited conversation to collaborators
Sep 13, 2017
poettering
deleted a comment from
tebruno99
Sep 13, 2017
poettering
deleted a comment from
ctrix
Sep 13, 2017
poettering
deleted a comment from
leonelwilliams
Sep 13, 2017
|
Sorry, but given how the quality of discussion has degraded and the number of inappropriate comments I had to delete has increased I have now locked this issue. I will unlock this again in a few days when things have become quieter. Thank you for understanding. |
diego-treitos commented Apr 18, 2017
Submission type
NOTE: Do not submit anything other than bug reports or RFEs via the issue tracker!
systemd version the issue has been seen with
Version 232
NOTE: Do not submit bug reports about anything but the two most recently released systemd versions upstream!
Used distribution
Ubuntu 17.04
In case of bug report: Expected behaviour you didn't see
When having 2 nameservers like:
192.168.0.1
8.8.8.8
Defined in that order in /etc/resolv.conf, I would expect to have the same behaviour as in resolv.conf: first use 192.168.0.1 and, if for some reason it is not available, use 8.8.8.8.
I am seeing that systemd-resolved is switching nameservers randomly.
In case of bug report: Unexpected behaviour you saw
Random nameserver use
In case of bug report: Steps to reproduce the problem
just have 2 nameservers and use systemd-resolved service