Skip to content

Conversation

@syvb
Copy link
Contributor

@syvb syvb commented Sep 1, 2021

When a DNS server claims to be unable or unwilling to handle a request, instead of passing that refusal along to the client, just treat it as any other error trying to connect to the DNS server. This prevents DNS requests from failing based on if a server can respond with a transient error before another server is able to give an actual response. DNS requests only failing sometimes is really hard to find the cause of (such as in #1033).

@bradfitz
Copy link
Member

Ideally this would have some tests. I don't know DNS well enough to approve this as is. @sailorfrag, @danderson, do you?

@syvb syvb force-pushed the dont-forward-transient-errors branch 2 times, most recently from cf1950e to b2c24c4 Compare September 16, 2021 16:01
@sailorfrag
Copy link
Member

sailorfrag commented Sep 16, 2021

As currently written, this will turn various errors returned by a server into timeouts (the errors channel appears to log but not send a response packet). In the event that we're racing multiple servers (either within Tailscale or the OS is racing Tailscale DNS with other DNS servers), that's arguably preferable (with a stronger argument for soft errors).

Better would be to return one of the errors (the first one?) back to the client as soon as we've exhausted the resolvers we're racing against each other. That would be a much more involved change.

I think this is a good idea, but without the extra part I'd want to be conservative about which error codes we consume. Server failure seems good, I'm less sure about the others (e.g. refused seems like a policy decision and perhaps queries for that domain should never have been configured to go to that server).

When a DNS server claims to be unable or unwilling to handle a request,
instead of passing that refusal along to the client, just treat it as
any other error trying to connect to the DNS server. This prevents DNS
requests from failing based on if a server can respond with a transient
error before another server is able to give an actual response. DNS
requests only failing *sometimes* is really hard to find the cause of
(tailscale#1033).

Signed-off-by: Smitty <me@smitop.com>
@syvb syvb force-pushed the dont-forward-transient-errors branch from b2c24c4 to 8ff154d Compare September 19, 2021 00:34
@syvb
Copy link
Contributor Author

syvb commented Sep 19, 2021

I've added some tests for the rcode parsing logic and changed it to only act on SERVFAIL responses.

Also, GitHub seems to require approval for running GH Actions again on this PR (which is weird, because they were approved on this PR before and I'm not a first-time contributor)

@sailorfrag
Copy link
Member

OK I think this should wait until 1.16 is released before merging so we don't make extra DNS changes close to release. I think DNS error handling needs a lot more attention but this feels like it's a start.

@sailorfrag sailorfrag merged commit b382161 into tailscale:main Oct 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants