-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
net/dns: attempt to reload interface nameservers on SERVFAIL errors #12547
Conversation
cb5bfa6
to
d92e9cc
Compare
return | ||
} | ||
if limiter.Allow() { | ||
m.logf("DNS resolution failed due to missing upstream nameservers. Recompiling DNS configuration.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool idea for a v2: add a client_metric to see how prevalent this is over time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In corp, if we make a [] nameservers result an error, then we get a health warning notification BTW... The whole crux of this is that we have no way of telling if [] is really an error, or if we were just too quick with our network setup.
If we waited until NetworkMonitor called us back with a 'viable' interface, we could probably get around this (just based on the ordering I see in the logs) - but that requires a considerable amount of swift->go plumbing that is very apple specific.
I say let's wait until @raggi is back in case there is anything fundamentally wrong with the approach. But this LGTM. |
net/dns/manager.go
Outdated
|
||
// This will recompile the DNS config, which in turn will requery the system | ||
// DNS settings. | ||
m.resolver.SetServFailRecovery(func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I reading this correctly that if a user has a piece of software that happens to regularly poll a bad upstream server, we'll now reconfigure DNS every 5s? Could we limit this to only occur when there are no known upstream DNS servers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is only called when we try a query and internally return SERVFAIL because we have no nameservers. Clarified the naming and the comments.
// This will recompile the DNS config, which in turn will requery the system | ||
// DNS settings. | ||
m.resolver.SetServFailRecovery(func() { | ||
if m.config == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we not need some kind of synchronization here? if m.config is changing from local goroutines, and being called by the resolver I'd expect we'd need a lock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this needs synchronization
net/dns/manager.go
Outdated
@@ -94,6 +116,7 @@ func (m *Manager) Set(cfg Config) error { | |||
|
|||
rcfg, ocfg, err := m.compileConfig(cfg) | |||
if err != nil { | |||
m.config = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set m.config = nil
at the start of the func, rather than clearing it in each error return path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
net/dns/resolver/forwarder.go
Outdated
// This should attempt to properly (re)set the upstream resolvers. This is called | ||
// on every SERVFAIL error and the callee is expected to properly rate limit whatever | ||
// actions this takes. | ||
servFailRecovery |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
servFailRecovery | |
missingUpstreamRecovery func() |
SERVFAIL
is a DNS protocol thing, missing upstream resolvers is simply one possible way that happens, so making it about recovering from a missing upstream seems more descriptive
Also a type name on its own in a struct means the struct embeds that other type inside it, I don't think that's what you want here. That's the name you want and then the type will be func()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't made it to that chapter of "Mastering Go" yet... :) thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem. The type name in a struct is kind of like the closest thing Go has to Java class forwarder extends servFailRecovery
or C++ class forwarder : public servFailRecovery
, if those examples help at all. It's not an exact match but a similar idea, and not commonly used.
net/dns/resolver/forwarder.go
Outdated
// Attempt to recompile the DNS configuration the SERVFAIL. | ||
// If we are being asked to forward queries and we have no | ||
// nameservers, the network is in a bad state. | ||
f.servFailRecovery() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like kind of a late time to attempt to catch the error. Is it possible to notice that the OS returned an empty resolver list right away and start retrying then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It not ideal, but the crux of the problem is our inability to determine whether or not no resolvers is real, or an error, or a race condition.
Ideally, we fix the timing, which may be possible by re-plumbing netMon using higher level APIs NWNetworkMonitor which generates events which seem to happen at the right times, but that's a lot of tricky C-go plumbing which requires care and feeding - and is specific to Apple - and will probably break more things int the short term than it fixes. It's on my radar though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that sounds reasonable enough as a workaround for now and we can do better in future.
d92e9cc
to
cf9d058
Compare
Fixes tailscale/corp#20677 Replaces the original attempt to rectify this (by injecting a netMon event) which was both heavy handed, and missed cases where the netMon event was "minor". On apple platforms, the fetching the interface's nameservers can and does return an empty list in certain situations. Apple's API in particular is very limiting here. The header hints at notifications for dns changes which would let us react ahead of time, but it's all private APIs. To avoid remaining in the state where we end up with no nameservers but we absolutely need them, we'll react to a lack of upstream nameservers by attempting to re-query the OS. We'll rate limit this to space out the attempts. It seems relatively harmless to attempt a reconfig every 5 seconds (triggered by an incoming query) if the network is in this broken state. Missing nameservers might possibly be a persistent condition (vs a transient error), but that would also imply that something out of our control is badly misconfigured. Tested by randomly returning [] for the nameservers. When switching between Wifi networks, or cell->wifi, this will randomly trigger the bug, and we appear to reliably heal the DNS state. Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
cf9d058
to
6be6595
Compare
Fixes tailscale/corp#20677 Replaces the original attempt to rectify this (by injecting a netMon event) which was both heavy handed, and missed cases where the netMon event was "minor". On apple platforms, the fetching the interface's nameservers can and does return an empty list in certain situations. Apple's API in particular is very limiting here. The header hints at notifications for dns changes which would let us react ahead of time, but it's all private APIs. To avoid remaining in the state where we end up with no nameservers but we absolutely need them, we'll react to a lack of upstream nameservers by attempting to re-query the OS. We'll rate limit this to space out the attempts. It seems relatively harmless to attempt a reconfig every 5 seconds (triggered by an incoming query) if the network is in this broken state. Missing nameservers might possibly be a persistent condition (vs a transient error), but that would also imply that something out of our control is badly misconfigured. Tested by randomly returning [] for the nameservers. When switching between Wifi networks, or cell->wifi, this will randomly trigger the bug, and we appear to reliably heal the DNS state. Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
Fixes tailscale/corp#20677 Replaces the original attempt to rectify this (by injecting a netMon event) which was both heavy handed, and missed cases where the netMon event was "minor". On apple platforms, the fetching the interface's nameservers can and does return an empty list in certain situations. Apple's API in particular is very limiting here. The header hints at notifications for dns changes which would let us react ahead of time, but it's all private APIs. To avoid remaining in the state where we end up with no nameservers but we absolutely need them, we'll react to a lack of upstream nameservers by attempting to re-query the OS. We'll rate limit this to space out the attempts. It seems relatively harmless to attempt a reconfig every 5 seconds (triggered by an incoming query) if the network is in this broken state. Missing nameservers might possibly be a persistent condition (vs a transient error), but that would also imply that something out of our control is badly misconfigured. Tested by randomly returning [] for the nameservers. When switching between Wifi networks, or cell->wifi, this will randomly trigger the bug, and we appear to reliably heal the DNS state. Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
Fixes tailscale/corp#20677 Replaces the original attempt to rectify this (by injecting a netMon event) which was both heavy handed, and missed cases where the netMon event was "minor". On apple platforms, the fetching the interface's nameservers can and does return an empty list in certain situations. Apple's API in particular is very limiting here. The header hints at notifications for dns changes which would let us react ahead of time, but it's all private APIs. To avoid remaining in the state where we end up with no nameservers but we absolutely need them, we'll react to a lack of upstream nameservers by attempting to re-query the OS. We'll rate limit this to space out the attempts. It seems relatively harmless to attempt a reconfig every 5 seconds (triggered by an incoming query) if the network is in this broken state. Missing nameservers might possibly be a persistent condition (vs a transient error), but that would also imply that something out of our control is badly misconfigured. Tested by randomly returning [] for the nameservers. When switching between Wifi networks, or cell->wifi, this will randomly trigger the bug, and we appear to reliably heal the DNS state. Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
Fixes tailscale/corp#20677
Replaces the original attempt to rectify this (by injecting a netMon event) which was both heavy handed, and missed cases where the netMon event was "minor".
On apple platforms, fetching the interface's nameservers can and does return an empty list in certain situations. Apple's API in particular is very limiting here. The header hints at notifications for dns changes which would let us react ahead of time, but it's all private APIs.
To avoid remaining in the state where we end up with no nameservers but we absolutely need them, we'll react to a lack of upstream nameservers by attempting to re-query the OS.
We'll rate limit this to space out the attempts. It seems relatively harmless to attempt a reconfig every 5 seconds (triggered by an incoming query) if the network is in this broken state.
Missing nameservers might possibly be a persistent condition (vs a transient error), but that would also imply that something out of our control is badly misconfigured.
Tested by randomly returning [] for the nameservers. When switching between Wifi networks, or cell->wifi, this will randomly trigger the bug, and we appear to reliably heal the DNS config after the first failed query.