Check/test issues like message too big and truncation #45
@miekg I believe the behavior I'm seeing relates to this issue. I am running skydns in a docker container, which communicates successfully with etcd running in another docker container on a different server in my LAN. Skydns is included in my project via git submodule, currently at 24a11ee (Wed Jul 23 07:39:18 2014 +0100), which includes the 'Set the TC bit when the message gets too large' commit you mentioned in #69.

Simple queries are successful:

```
~ # dig @172.17.0.2 hostname.subdomain.tld +short
10.1.2.3
~ #
```

Wildcard queries with large results are truncated, but the TCP retry hangs until dig times out:

```
~ # dig @172.17.0.2 *.tld
;; Truncated, retrying in TCP mode.

; <<>> DiG 9.9.5-3-Ubuntu <<>> @172.17.0.2 *.tld
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
~ #
~ # dig @172.17.0.2 *.subdomain.tld
;; Truncated, retrying in TCP mode.

; <<>> DiG 9.9.5-3-Ubuntu <<>> @172.17.0.2 *.tld
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
~ #
```

For one such query, the traffic on docker0 of the skydns container host looks like this (tcpdump output edited for legibility):
I can tell that both 53/udp and 53/tcp are responsive, but it looks like skydns doesn't attempt to resolve the query against etcd when it is submitted via 53/tcp. It accepts the tcp handshake and acknowledges the request segment, then does nothing. I am happy to supply additional detail as needed. Do you have any wisdom or intuition to share on this? |
Try some simple debugging with println? The entry point in skydns should be ServeDNS in server.go, line 79. I have a small hunch that I might be setting the TC bit on TCP queries, but even that should at least get you a reply. |
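The println suggestion above can be sketched as a thin wrapper around the handler entry point. This is an illustrative, dependency-free model rather than skydns's actual code: the real ServeDNS takes a `dns.ResponseWriter` and `*dns.Msg` from the miekg/dns package, and `handler`/`withTrace` here are hypothetical names.

```go
package main

import "fmt"

// handler models skydns's ServeDNS entry point (server.go) with a
// plain function type so the sketch stays dependency-free.
type handler func(query string) (answer string)

// withTrace wraps a handler with the kind of println debugging
// suggested above: log what arrives and what goes back, so a hang
// shows whether a TCP query ever reaches the resolver logic at all.
func withTrace(h handler) handler {
	return func(query string) string {
		fmt.Printf("ServeDNS: got query %q\n", query)
		answer := h(query)
		fmt.Printf("ServeDNS: replying %q\n", answer)
		return answer
	}
}

func main() {
	h := withTrace(func(q string) string { return "10.1.2.3" })
	_ = h("hostname.subdomain.tld.")
}
```

If the "got query" line never prints for TCP requests, the query is stalling before the resolver logic, which matches the tcpdump observation above.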
To be sure: |
Pushed 20cbd0b, which no longer sets a low bufsize for TCP queries, but that should be unrelated to this. |
Thanks for the swift response.
With the container as described yesterday, it fails:
During the TCP dig, traffic on docker0 is thus:
I've pulled. I'll try rebuilding with some println debugging, per your suggestion. |
So basically TCP never works in your setup? If you run another TCP service in this setup, does that one work?
|
Well, I'm running sshd in the skydns container for convenience during development, and that works fine:
|
OK. A println in ServeDNS should tell you more. Or try debugging with gdb.
|
Also, we should add the OPT RR even when we truncate. I think there needs to be support for this in the Go DNS base library. |
Maybe this is related: kelseyhightower/confd#285 |
Not entirely sure if this is related, but we saw this issue: spotify/helios#899 |
[ Quoting notifications@github.com in "Re: [skynetservices/skydns] Check/t..." ]
"When this type of response reaches skydns, skydns blows up and doesn't tell the client about the error."

Yes, that is a bug that needs to be fixed. I thought I propagated this error back.

/Miek

Miek Gieben |
TL;DR When two DNS servers don't work, add one more!

When running some integration tests with HeliosSoloDeployment on Docker hosts that use a local unbound instance as their DNS resolver (i.e. specified in `/etc/resolv.conf` on the Docker host), we saw test failures due to failed SRV queries to skydns. Skydns is running in the solo container and forwards DNS queries it doesn't know about to nameservers specified in `/etc/resolv.conf` via logic in `start.sh`. The skydns error output from the helios solo container spawned by HeliosSoloDeployment looked like:

```
skydns: failure to forward request "dns: failed to unpack truncated message"
```

Our guess is that large UDP responses from the upstream unbound have the "Message Truncated" DNS flag set. When this type of response reaches skydns, skydns blows up and doesn't tell the client about the error. The client times out without retrying in TCP mode; it would have retried if it had received an error message from skydns.

Running `dig` against skydns works. We think this is because `dig` adds an OPT record to its query that sets "udp payload size: 4096". Here are outstanding issues in skydns that seem related:

* skynetservices/skydns#242
* skynetservices/skydns#45

Solution: we start an unbound instance in the solo container and have it forward DNS queries via UDP to the upstream skydns in the same container. Unbound adds the OPT section that makes everything work. Things are fixed. :) We admit this is super funky, and it only works for UDP packets up to 4096 bytes, the default set by unbound in OPT. Much thanks to @gimaker for helping and suggesting unbound inside the container.
@miekg Thanks for replying. A confirmation that this is indeed the issue and a fix would be amazing. |
There is a definite ongoing problem with SkyDNS that really upsets Ruby's Resolv::DNS class. Ruby sends no EDNS0 information about UDP message size, so according to RFC 1035 the response UDP message must be 512 bytes only. It does, however, honour retry over TCP if it receives a UDP message with the TC (Truncated) flag set, and it does this automatically.

The ongoing problems are:

1. SkyDNS returns a larger UDP response irrespective of the fact that no EDNS0 tags are in the request. This is not standards compliant, but if we assume we are only interested in the flags and various information at the beginning of the message, it shouldn't be too much of a problem.
2. Ruby, in lib/resolv.rb, insists on reading all answers claimed by the ANCOUNT field even when sent a fragment. It seems that SkyDNS sends a count matching all records it knows about, not the number it managed to fit in the response.

In order for Ruby to work at all with the truncated response and move on to the TCP behaviour, it needs both the 512-byte limit to be observed and the ANCOUNT to reflect the number of answers in the message itself. I posted a workaround to the ruby:trunk bug tracker yesterday, https://bugs.ruby-lang.org/issues/13513, with a monkey patch that works around this problem, but really the problem looks to stem from SkyDNS behaviour. |
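To make the parsing failure concrete, here is a minimal sketch of decoding the fixed DNS header per RFC 1035 §4.1.1 (hypothetical helper names; this is neither Ruby's nor SkyDNS's actual code). A client that trusts ANCOUNT on a message whose answers were cut off will try to read records that were never sent, which is the Resolv behaviour described above.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Header mirrors a few fields of the fixed 12-byte DNS header
// (RFC 1035 §4.1.1).
type Header struct {
	ID      uint16
	TC      bool   // truncation flag (bit 0x0200 of the flags word)
	ANCount uint16 // answer records the sender claims are present
}

// parseHeader decodes the first 12 bytes of a DNS message. A naive
// client that reads ANCount answers from a truncated message will
// run off the end of the packet.
func parseHeader(msg []byte) (Header, error) {
	if len(msg) < 12 {
		return Header{}, fmt.Errorf("short message: %d bytes", len(msg))
	}
	flags := binary.BigEndian.Uint16(msg[2:4])
	return Header{
		ID:      binary.BigEndian.Uint16(msg[0:2]),
		TC:      flags&0x0200 != 0,
		ANCount: binary.BigEndian.Uint16(msg[6:8]),
	}, nil
}

func main() {
	// A response header claiming 30 answers with TC set: a compliant
	// client should fall back to TCP, not parse 30 answers.
	raw := []byte{0x12, 0x34, 0x82, 0x00, 0x00, 0x01, 0x00, 0x1e, 0, 0, 0, 0}
	h, err := parseHeader(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("TC=%v ANCount=%d\n", h.TC, h.ANCount)
}
```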
[ Quoting <notifications@github.com> in "Re: [skynetservices/skydns] Check/t..." ]
There is a definite ongoing problem with SkyDNS that really upsets Ruby's Resolv::DNS class.
I looked at the ruby code a long time back and it was really basic at the
time.
Ruby sends no EDNS0 information about UDP message size so according to the rfc1035 the response UDP message must be 512 bytes only. It does, however, honour retry over TCP if it receives a UDP message with the TC (Truncated) flag set and it does this automatically.
OK, that's good
The ongoing problems are.
SkyDNS returns a larger UDP response irrespective of the fact no EDNS0 tags are in the request, not standards compliant but if we assume we are only interested in the flags and various information at the beginning of the message this shouldn't be too much of a problem.
technically, the client's buffer is only so large; although SkyDNS shouldn't do this,
it's not the end of the world.
Ruby, in lib/resolv.rb, insists on reading all answers claimed by the ANCOUNT field provided even when sent a fragment. It seems that SkyDNS sends a count matching all records it knows about and not the amount that it managed to Fit in the response.
That is a bug in both SkyDNS and ruby (and for ruby it might also be a resource
DoS)
In order for Ruby to work at all with the fragmented response and move on to the TCP behaviour it needs both the 512 limit to be observed and the ANCOUNT to reflect the number of answers in the message itself.
I posted a workaround to the ruby:trunk bugtacker here yesterday https://bugs.ruby-lang.org/issues/13513 with a Monkey patch that works around this problem but really the problem looks to stem from SkyDNS behaviour.
I believe CoreDNS has better truncation behavior.
/Miek
--
Miek Gieben
|
The Ruby default behaviour, when you consider it closer, isn't buggy. Let's consider the expectation from a truncated response with standard (non-EDNS) UDP. The server should send a response within only a 512-byte UDP section of the frame, with each count field recording exactly the records the message includes. It sets TC to indicate that the response was truncated and there is more, but the error code is NO error, because the response is legitimate for some purposes. It's actually up to the application to determine whether the truncated UDP response is suitable and to use it. The RFCs state that mail servers must NOT use truncated responses containing MX records, for good reason, but beyond that the response may be suitable, and it should stand alone as a properly formatted response with the correct counts in the respective fields for all the queries and answers given.

Now, many libraries like Ruby's standard library take the TC field as an indication that they should, for the convenience of the developer, proceed to using TCP to make the query. This does not invalidate the necessity for correctness in the UDP response for applications that deem the truncated response appropriate. Ruby's approach may lack robustness against malformed responses, but if the responses are RFC compliant then there is no problem. |
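The server-side fix implied above, keeping ANCOUNT consistent with what actually fits in 512 bytes, can be sketched as follows. This is an illustrative model (answers as raw byte slices, header size only), not skydns code; `truncateAnswers` is a hypothetical name.

```go
package main

import "fmt"

// truncateAnswers drops whole answer records until the message fits
// the client's buffer and reports whether truncation happened. The
// caller would then set TC and write ANCOUNT = len(kept), so the
// count matches the records actually present. A real server would
// also account for the question section, authority/additional
// records, and name compression.
func truncateAnswers(answers [][]byte, limit int) (kept [][]byte, truncated bool) {
	size := 12 // fixed DNS header
	for _, rr := range answers {
		if size+len(rr) > limit {
			return kept, true
		}
		size += len(rr)
		kept = append(kept, rr)
	}
	return kept, false
}

func main() {
	// Three 200-byte answers against the 512-byte non-EDNS0 limit:
	// 12 + 200 + 200 = 412 fits, adding a third record would not.
	rr := make([]byte, 200)
	kept, tc := truncateAnswers([][]byte{rr, rr, rr}, 512)
	fmt.Printf("kept=%d TC=%v\n", len(kept), tc)
}
```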
Additionally, the issue of UDP frames larger than 512 bytes for non-EDNS0 responses could cause odd problems for client libraries that only read 512 bytes: the UDP section may well contain all the stated results with no TC set, because everything fits, but the client system won't be able to read the results. |
[ Quoting <notifications@github.com> in "Re: [skynetservices/skydns] Check/t..." ]
Additionally the issue around UDP frames larger than 512 bytes for non EDNS0 responses could cause odd problems for client libraries that only read 512 bytes since the UDP section may well contain all the stated results and no TC set because it fits but the client system won't be able to read the results.
This could cause all sorts of odd problems for such clients when they encounter responses that sit within that area of a larger than 512 byte UDP message and one that SkyDNS hasn't needed to truncate.
I don't think the DNS standard actually says anything about a client *using* a
truncated response, except for seeing the error.
Could I ask you to check CoreDNS? It is (or should be) more robust in this regard,
and I'm spending most of my spare time on that nowadays.
|
I could test with CoreDNS, but I can't impose that change here. On the subject of truncation, the relevant bits are in RFC 1123, section 6.1.3.2. The suggestion is that the resolver SHOULD switch and retry with TCP on a truncated response, but the discussion then states.
In practice I think this is more likely to happen with less common resolver implementations, or with resolvers that defer the decision to try TCP to the application author. |
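The RFC 1123 §6.1.3.2 guidance referenced above — a truncated answer may still be usable, but a mailer must never act on a truncated MX response — reduces to a small predicate. A hedged sketch with hypothetical names:

```go
package main

import "fmt"

// usableWhenTruncated applies the RFC 1123 §6.1.3.2 guidance: an
// application may use a truncated answer if it can tolerate missing
// records, but a mailer must never act on a truncated MX response,
// since that could lead to mail loops. Retrying over TCP is always
// the safe alternative.
func usableWhenTruncated(qtype string, tc bool) bool {
	if !tc {
		return true // complete answer, always usable
	}
	return qtype != "MX"
}

func main() {
	fmt.Println(usableWhenTruncated("A", true))  // an app may still use it
	fmt.Println(usableWhenTruncated("MX", true)) // a mailer must retry over TCP
}
```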
[ Quoting <notifications@github.com> in "Re: [skynetservices/skydns] Check/t..." ]
I could test with CoreDNS but I can't impose that change here.
Ack. Happy to take a PR for SkyDNS.
On the subject of Truncation the relevant bits are in rfc1123 section 6.1.3.2.
The suggestion is that the resolver SHOULD switch and retry with TCP on a truncated response but the discussion then states.
"Whether it is possible to use a truncated answer
depends on the application. A mailer must not use a
truncated MX response, since this could lead to mail
loops.
Responsible practices can make UDP suffice in the vast
majority of cases. Name servers must use compression
in responses. Resolvers must differentiate truncation
of the Additional section of a response (which only
loses extra information) from truncation of the Answer
section (which for MX records renders the response
unusable by mailers). Database administrators should
list only a reasonable number of primary names in lists
of name servers, MX alternatives, etc."
Interesting. Thanks for the RFC reference
/Miek
--
Miek Gieben
|
Right now it is assumed the requester's buffer is large enough. We should explicitly check for this in the code.