Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/dns: attempt to reload interface nameservers on SERVFAIL errors #12547

Merged
merged 1 commit into from
Jun 25, 2024

Conversation

barnstar
Copy link
Member

Fixes tailscale/corp#20677

Replaces the original attempt to rectify this (by injecting a netMon event) which was both heavy handed, and missed cases where the netMon event was "minor".

On apple platforms, fetching the interface's nameservers can and does return an empty list in certain situations. Apple's API in particular is very limiting here. The header hints at notifications for dns changes which would let us react ahead of time, but it's all private APIs.

To avoid remaining in the state where we end up with no nameservers but we absolutely need them, we'll react to a lack of upstream nameservers by attempting to re-query the OS.

We'll rate limit this to space out the attempts. It seems relatively harmless to attempt a reconfig every 5 seconds (triggered by an incoming query) if the network is in this broken state.

Missing nameservers might possibly be a persistent condition (vs a transient error), but that would also imply that something out of our control is badly misconfigured.

Tested by randomly returning [] for the nameservers. When switching between Wifi networks, or cell->wifi, this will randomly trigger the bug, and we appear to reliably heal the DNS config after the first failed query.

return
}
if limiter.Allow() {
m.logf("DNS resolution failed due to missing upstream nameservers. Recompiling DNS configuration.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool idea for a v2: add a client_metric to see how prevalent this is over time

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In corp, if we make a [] nameservers result an error, then we get a health warning notification BTW... The whole crux of this is that we have no way of telling if [] is really an error, or if we were just too quick with our network setup.

If we waited until NetworkMonitor called us back with a 'viable' interface, we could probably get around this (just based on the ordering I see in the logs) - but that requires a considerable amount of swift->go plumbing that is very apple specific.

@agottardo
Copy link
Contributor

I say let's wait until @raggi is back in case there is anything fundamentally wrong with the approach. But this LGTM.

@barnstar barnstar requested a review from sailorfrag June 21, 2024 18:13

// This will recompile the DNS config, which in turn will requery the system
// DNS settings.
m.resolver.SetServFailRecovery(func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I reading this correctly that if a user has a piece of software that happens to regularly poll a bad upstream server, we'll now reconfigure DNS every 5s? Could we limit this to only occur when there are no known upstream DNS servers?

Copy link
Member Author

@barnstar barnstar Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is only called when we try a query and internally return SERVFAIL because we have no nameservers. Clarified the naming and the comments.

// This will recompile the DNS config, which in turn will requery the system
// DNS settings.
m.resolver.SetServFailRecovery(func() {
if m.config == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we not need some kind of synchronization here? if m.config is changing from local goroutines, and being called by the resolver I'd expect we'd need a lock?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@raggi raggi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this needs synchronization

@@ -94,6 +116,7 @@ func (m *Manager) Set(cfg Config) error {

rcfg, ocfg, err := m.compileConfig(cfg)
if err != nil {
m.config = nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set m.config = nil at the start of the func, rather than clearing it in each error return path?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// This should attempt to properly (re)set the upstream resolvers. This is called
// on every SERVFAIL error and the callee is expected to properly rate limit whatever
// actions this takes.
servFailRecovery
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
servFailRecovery
missingUpstreamRecovery func()

SERVFAIL is a DNS protocol thing, missing upstream resolvers is simply one possible way that happens, so making it about recovering from a missing upstream seems more descriptive

Also a type name on its own in a struct means the struct embeds that other type inside it, I don't think that's what you want here. That's the name you want and then the type will be func().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't made it to that chapter of "Mastering Go" yet... :) thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. The type name in a struct is kind of like the closest thing Go has to Java class forwarder extends servFailRecovery or C++ class forwarder : public servFailRecovery, if those examples help at all. It's not an exact match but a similar idea, and not commonly used.

// Attempt to recompile the DNS configuration the SERVFAIL.
// If we are being asked to forward queries and we have no
// nameservers, the network is in a bad state.
f.servFailRecovery()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like kind of a late time to attempt to catch the error. Is it possible to notice that the OS returned an empty resolver list right away and start retrying then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It not ideal, but the crux of the problem is our inability to determine whether or not no resolvers is real, or an error, or a race condition.

Ideally, we fix the timing, which may be possible by re-plumbing netMon using higher level APIs NWNetworkMonitor which generates events which seem to happen at the right times, but that's a lot of tricky C-go plumbing which requires care and feeding - and is specific to Apple - and will probably break more things int the short term than it fixes. It's on my radar though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that sounds reasonable enough as a workaround for now and we can do better in future.

Fixes tailscale/corp#20677

Replaces the original attempt to rectify this (by injecting a netMon
event) which was both heavy handed, and missed cases where the
netMon event was "minor".

On apple platforms, the fetching the interface's nameservers can
and does return an empty list in certain situations.   Apple's API
in particular is very limiting here.  The header hints at notifications
for dns changes which would let us react ahead of time, but it's all
private APIs.

To avoid remaining in the state where we end up with no
nameservers but we absolutely need them, we'll react
to a lack of upstream nameservers by attempting to re-query
the OS.

We'll rate limit this to space out the attempts.   It seems relatively
harmless to attempt a reconfig every 5 seconds (triggered
by an incoming query) if the network is in this broken state.

Missing nameservers might possibly be a persistent condition
(vs a transient error), but that would  also imply that something
out of our control is badly misconfigured.

Tested by randomly returning [] for the nameservers.   When switching
between Wifi networks, or cell->wifi, this will randomly trigger
the bug, and we appear to reliably heal the DNS state.

Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
@barnstar barnstar merged commit 27033c6 into main Jun 25, 2024
41 of 48 checks passed
@barnstar barnstar deleted the jonathan/missing_resolvers branch June 25, 2024 18:56
barnstar added a commit that referenced this pull request Jul 2, 2024
Fixes tailscale/corp#20677

Replaces the original attempt to rectify this (by injecting a netMon
event) which was both heavy handed, and missed cases where the
netMon event was "minor".

On apple platforms, the fetching the interface's nameservers can
and does return an empty list in certain situations. Apple's API
in particular is very limiting here. The header hints at notifications
for dns changes which would let us react ahead of time, but it's all
private APIs.

To avoid remaining in the state where we end up with no
nameservers but we absolutely need them, we'll react
to a lack of upstream nameservers by attempting to re-query
the OS.

We'll rate limit this to space out the attempts. It seems relatively
harmless to attempt a reconfig every 5 seconds (triggered
by an incoming query) if the network is in this broken state.

Missing nameservers might possibly be a persistent condition
(vs a transient error), but that would also imply that something
out of our control is badly misconfigured.

Tested by randomly returning [] for the nameservers. When switching
between Wifi networks, or cell->wifi, this will randomly trigger
the bug, and we appear to reliably heal the DNS state.

Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
barnstar added a commit that referenced this pull request Jul 2, 2024
Fixes tailscale/corp#20677

Replaces the original attempt to rectify this (by injecting a netMon
event) which was both heavy handed, and missed cases where the
netMon event was "minor".

On apple platforms, the fetching the interface's nameservers can
and does return an empty list in certain situations. Apple's API
in particular is very limiting here. The header hints at notifications
for dns changes which would let us react ahead of time, but it's all
private APIs.

To avoid remaining in the state where we end up with no
nameservers but we absolutely need them, we'll react
to a lack of upstream nameservers by attempting to re-query
the OS.

We'll rate limit this to space out the attempts. It seems relatively
harmless to attempt a reconfig every 5 seconds (triggered
by an incoming query) if the network is in this broken state.

Missing nameservers might possibly be a persistent condition
(vs a transient error), but that would also imply that something
out of our control is badly misconfigured.

Tested by randomly returning [] for the nameservers. When switching
between Wifi networks, or cell->wifi, this will randomly trigger
the bug, and we appear to reliably heal the DNS state.

Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
knyar pushed a commit that referenced this pull request Jul 2, 2024
Fixes tailscale/corp#20677

Replaces the original attempt to rectify this (by injecting a netMon
event) which was both heavy handed, and missed cases where the
netMon event was "minor".

On apple platforms, the fetching the interface's nameservers can
and does return an empty list in certain situations. Apple's API
in particular is very limiting here. The header hints at notifications
for dns changes which would let us react ahead of time, but it's all
private APIs.

To avoid remaining in the state where we end up with no
nameservers but we absolutely need them, we'll react
to a lack of upstream nameservers by attempting to re-query
the OS.

We'll rate limit this to space out the attempts. It seems relatively
harmless to attempt a reconfig every 5 seconds (triggered
by an incoming query) if the network is in this broken state.

Missing nameservers might possibly be a persistent condition
(vs a transient error), but that would also imply that something
out of our control is badly misconfigured.

Tested by randomly returning [] for the nameservers. When switching
between Wifi networks, or cell->wifi, this will randomly trigger
the bug, and we appear to reliably heal the DNS state.

Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants