No way to refresh DNS information leading to indefinite network failures #41570

Closed
jonhoo opened this Issue Apr 26, 2017 · 10 comments

Contributor

jonhoo commented Apr 26, 2017

Consider the following simple network client:

fn main() {
    use std::thread;
    use std::time::Duration;
    use std::net::TcpStream;

    loop {
        match TcpStream::connect("google.com:80") {
            Ok(_) => {
                println!("connected");
                break;
            }
            Err(e) => {
                println!("failed: {:?}", e);
            }
        }
        thread::sleep(Duration::from_secs(1));
    }
}

This works fine if you run it while your internet connection is up and running. However, if you kill your network connection, it (obviously) does not. What is interesting is if you launch the program while your internet is offline (and crucially, while /etc/resolv.conf does not contain any nameservers), and then connect to the internet again. I would expect the program to eventually say "connected", however this is not the case.

This had me puzzled for a while, until I stumbled on this old issue on the Pidgin bug tracker. It turns out that the set of nameservers available when the program starts is cached and never automatically re-read. Instead, res_init must be called manually to refresh the nameserver list. Unfortunately, as far as I can tell, there is no way in Rust to call res_init, and thus the above program simply cannot be made to work in the presence of network failures.
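
To make the failure condition concrete, here is a minimal sketch (not what glibc does internally) that pulls the `nameserver` entries out of resolv.conf-style text. The function name `parse_nameservers` is hypothetical, and entries that do not parse as a plain IP address are simply skipped. An empty result corresponds to the offline state described above, where the resolver has nothing to query.

```rust
use std::net::IpAddr;

/// Extracts the addresses from `nameserver` lines of resolv.conf-style text.
/// Comment lines (starting with `#` or `;`) and unparsable entries are skipped.
fn parse_nameservers(contents: &str) -> Vec<IpAddr> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.starts_with('#') && !line.starts_with(';'))
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            match (fields.next(), fields.next()) {
                // Only `nameserver <address>` lines are of interest here.
                (Some("nameserver"), Some(addr)) => addr.parse::<IpAddr>().ok(),
                _ => None,
            }
        })
        .collect()
}
```

A process that captured the empty list at startup and never re-read the file would keep failing even after a network manager later writes real `nameserver` lines.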

It's not entirely clear what the "right" fix here is: we could simply provide a way to call res_init, or we could do something more fancy like a special connect_uncached that does it for you. Regardless, this seems like a fairly unfortunate shortcoming.

Contributor

jonhoo commented Apr 26, 2017

Seems like a lot of big programs have gone through the pain of re-discovering this issue. Here's Mozilla Firefox from 14 years ago. And more recently, Chef (and Ruby).

Contributor

jonhoo commented Apr 26, 2017

An interesting decision from that Mozilla bug report is:

it calls res_init if gethostbyname (or getaddrinfo) fails

That seems pretty reasonable, and maybe something that Rust could do too? Specifically, we should probably do this in lookup_host in sys_common/net.rs, or alternatively in the resolve_socket_addr used in the impl of ToSocketAddrs for str. We'd need res_init to be exposed by libc though...
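
A rough sketch of that "refresh on failure" idea, written generically so that it does not depend on any particular libc binding. The names `lookup_with_refresh` and `std_resolve` are hypothetical, and `refresh` is only a stand-in for whatever reloads the resolver state (for example a call to res_init). This is an illustration of the pattern, not the code that would go into lookup_host.

```rust
use std::io;
use std::net::{SocketAddr, ToSocketAddrs};

/// Runs `resolve` once. If it fails, calls `refresh` and returns the
/// original error, so that the caller's *next* attempt sees the reloaded
/// resolver configuration.
fn lookup_with_refresh<R, F>(mut resolve: R, mut refresh: F) -> io::Result<Vec<SocketAddr>>
where
    R: FnMut() -> io::Result<Vec<SocketAddr>>,
    F: FnMut(),
{
    match resolve() {
        Ok(addrs) => Ok(addrs),
        Err(e) => {
            // The lookup failed: reload state before reporting the error.
            refresh();
            Err(e)
        }
    }
}

/// Resolver backed by the standard library (hypothetical helper).
fn std_resolve(host: &str) -> io::Result<Vec<SocketAddr>> {
    host.to_socket_addrs().map(|addrs| addrs.collect())
}
```

With something like this in place, a retry loop such as the one in the original report would fail once more after the network comes back, trigger the refresh, and then succeed on the following iteration. The refresh only runs on the failure path, so lookups that already work are not affected.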

Contributor

jonhoo commented Apr 27, 2017

Opened a PR to libc over at rust-lang/libc#585

@jonhoo jonhoo referenced this issue Apr 27, 2017

Merged

Add res_init #585

Member

alexcrichton commented Apr 27, 2017

Sounds like a reasonable solution to me! (calling res_init on failure)

Thanks for looking into this @jonhoo!

Contributor

jonhoo commented Apr 27, 2017

Do you think it'd be better to add this behavior into lookup_host, or in the higher-level resolve_socket_addr?

Member

alexcrichton commented Apr 27, 2017

Nah I think throwing it into lookup_host is fine, that's already a mega "convenience" api

jonhoo added a commit to jonhoo/rust that referenced this issue Apr 27, 2017

Reload nameserver information on lookup failure
As discussed in rust-lang#41570, UNIX systems often cache the contents of
/etc/resolv.conf, which can cause lookup failures to persist even after
a network connection becomes available. This patch modifies lookup_host
to force a reload of the nameserver entries following a lookup failure.
This is in line with what many C programs already do (see rust-lang#41570 for
details). On systems with nscd, this should not be necessary, but not
all systems run nscd.

Depends on rust-lang/libc#585.

bors added a commit that referenced this issue May 4, 2017

Auto merge of #41582 - jonhoo:reread-nameservers-on-lookup-fail, r=alexcrichton

Reload nameserver information on lookup failure

As discussed in #41570, UNIX systems often cache the contents of `/etc/resolv.conf`, which can cause lookup failures to persist even after a network connection becomes available. This patch modifies lookup_host to force a reload of the nameserver entries following a lookup failure. This is in line with what many C programs already do (see #41570 for details). On systems with nscd, this should not be necessary, but not all systems run nscd.

Fixes #41570.
Depends on rust-lang/libc#585.

r? @alexcrichton

jonhoo added a commit to jonhoo/rust that referenced this issue May 5, 2017

Reload nameserver information on lookup failure
As discussed in rust-lang#41570, UNIX systems often cache the contents of
/etc/resolv.conf, which can cause lookup failures to persist even after
a network connection becomes available. This patch modifies lookup_host
to force a reload of the nameserver entries following a lookup failure.
This is in line with what many C programs already do (see rust-lang#41570 for
details). On systems with nscd, this should not be necessary, but not
all systems run nscd.

Introduces an std linkage dependency on libresolv on macOS/iOS (which
also makes it necessary to update run-make/tools.mk).

Fixes rust-lang#41570.
Depends on rust-lang/libc#585.

@bors bors closed this in #41582 May 6, 2017


jan-hudec commented May 23, 2017

Does anybody have a link for the upstream bug?

Because programs, or even the Rust runtime, are definitely not supposed to do this. res_init() is a GNU LibC implementation-specific function (OK, shared with BSD LibC, but not part of any standard), while getaddrinfo() is POSIX. So use of getaddrinfo() can't depend on the user fiddling with res_init(). And the specification definitely does not say that lookups are expected to stop working if the network connection changes after the program has started.

So either:

  • The user is never supposed to change /etc/resolv.conf at runtime, and all programs that do so should provide a Name Service Switch module, or a DNS proxy, to take care of this, so it is a bug in the DHCP client and Network-Manager, or
  • Changing /etc/resolv.conf is supposed to happen, and then it is a bug in GNU LibC that it does not notice the change.

Contributor

jonhoo commented May 23, 2017

@jan-hudec see #41582 for some further discussion. This is a bug in glibc (other libc implementations do not have this problem as they either do not cache, or they flush the cache when the set of nameservers changes). It is reported upstream at https://sourceware.org/bugzilla/show_bug.cgi?id=984, but it seems unlikely that a fix will land any time soon.

I would argue strongly against your first point above (further indicating that this is a bug): /etc/resolv.conf can change for many reasons, many of which are not related to the user's actions. For example, the Arch Linux netctl network manager, and many other network managers, modify /etc/resolv.conf in response to network state changes through resolvconf. Yet they have no way of indicating this change to every running application. It is also not feasible to tell everyone to start using NSS, or to run their own DNS proxy (I run neither on my machine, and would not like to).


jan-hudec commented May 23, 2017

Oh, that's why I haven't seen the issue for ages—Debian carries a fix for it.

oconnor663 added a commit to keybase/client that referenced this issue Jul 18, 2017

call libc::res_init() in response to DNS failures
Go's DNS resolution often defers to the libc implementation, and glibc's
resolver has a serious bug: https://sourceware.org/bugzilla/show_bug.cgi?id=984
It will cache the contents of /etc/resolv.conf, which can put the client
in a state where all DNS requests fail forever after a network change.
The conditions where Go calls into libc are complicated and
platform-specific, and the resolver cache involves thread-local state,
so repros tend to be inconsistent. But when you hit this on your laptop
on the subway or whatever, the effect is that everything is broken until
you restart the process.

One way to fix this would be to force using the pure-Go resolver
(net.DefaultResolver.PreferGo = true), which refreshes /etc/resolv.conf
every 5 seconds. I'm wary of doing that, because the Go devs went
through an enormous amount of trouble to enable cgo fallback, for
various platform- and environment-specific reasons. See all the comments
in net/conf.go::initConfVal() and net/conf.go::hostLookupOrder() in the
standard library.

Instead, we're trying the same workaround that the Rust standard library
chose, where we call libc::res_init() after DNS failures. See
rust-lang/rust#41570. The downside here is
that we have to remember to do this after we make network calls, and
that we have to use cgo in the build, but the upside is that it should
never break a DNS environment that was working before.
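
For comparison, the periodic-refresh alternative mentioned in the commit message above can be sketched in Rust as a small time-based cache. This is only an illustration of the idea, not Go's resolver code; the type name `Reloading` and its API are hypothetical, and `load` stands in for whatever re-reads /etc/resolv.conf.

```rust
use std::time::{Duration, Instant};

/// Caches a loaded value and reloads it once it is older than `max_age`.
struct Reloading<T, L: FnMut() -> T> {
    load: L,
    max_age: Duration,
    cached: Option<(Instant, T)>,
}

impl<T, L: FnMut() -> T> Reloading<T, L> {
    fn new(load: L, max_age: Duration) -> Self {
        Reloading { load, max_age, cached: None }
    }

    /// Returns the cached value, reloading first if it is missing or stale.
    fn get(&mut self) -> &T {
        let stale = match &self.cached {
            Some((loaded_at, _)) => loaded_at.elapsed() >= self.max_age,
            None => true,
        };
        if stale {
            let value = (self.load)();
            self.cached = Some((Instant::now(), value));
        }
        &self.cached.as_ref().expect("populated above").1
    }
}
```

The trade-off compared with refreshing on failure is that a timer-based cache picks up changes even when lookups are still succeeding against an outdated nameserver, at the cost of re-reading the configuration on a schedule whether or not anything changed.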


Contributor

keeperofdakeys commented Sep 4, 2017

For future reference, this was finally fixed in a recent glibc release. Though this workaround will probably need to be in place for a while longer.
