New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reload nameserver information on lookup failure #41582

Merged
merged 1 commit into from May 6, 2017

Conversation

Projects
None yet
10 participants
@jonhoo
Copy link
Contributor

jonhoo commented Apr 27, 2017

As discussed in #41570, UNIX systems often cache the contents of /etc/resolv.conf, which can cause lookup failures to persist even after a network connection becomes available. This patch modifies lookup_host to force a reload of the nameserver entries following a lookup failure. This is in line with what many C programs already do (see #41570 for details). On systems with nscd, this should not be necessary, but not all systems run nscd.

Fixes #41570.
Depends on rust-lang/libc#585.

r? @alexcrichton

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Apr 27, 2017

I wonder, are there performance implications about this? E.g. does a bunch of failing queries now take much longer?

@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented Apr 27, 2017

@alexcrichton yes, I believe that that is true. This is why I first proposed in #41570 that we might instead want a connect_uncached, or some other way of converting from str to SocketAddr. That said, with this approach, applications for which this is a problem can manually call getaddrinfo, and then use the resulting SocketAddr. Such a workaround does not exist if you do want this behavior, because you can't force the cache to be ignored except by calling res_init.

// The lookup failure could be caused by using a stale /etc/resolv.conf.
// See https://github.com/rust-lang/rust/issues/41570.
// We therefore force a reload of the nameserver information.
c::res_init();

This comment has been minimized.

@retep998

retep998 Apr 27, 2017

Member

This code is in sys_common so it is compiled on all platforms, but I don't see a res_init on Windows.

This comment has been minimized.

@jonhoo

jonhoo Apr 27, 2017

Contributor

That's what the if cfg!(unix) above was intended for?

This comment has been minimized.

@jonhoo

jonhoo Apr 27, 2017

Contributor

Ohh, of course, it's still compiled. I need #[cfg(..)] {} instead, right?

@jonhoo jonhoo referenced this pull request Apr 30, 2017

Merged

Add res_init #585

@arielb1 arielb1 added the T-libs label May 2, 2017

@arielb1

This comment has been minimized.

Copy link
Contributor

arielb1 commented May 2, 2017

@alexcrichton - are you looking at this PR? Friendly ping to keep this on your radar.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 2, 2017

@arielb1 ah yes I am, mostly just caught up in travel!

@jonhoo would you mind benchmarking the effect of calling ::res_init on failed queries? Just to get an idea of how much slower it is.

@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented May 2, 2017

@alexcrichton I'd say pretty negligible:

#![feature(test)]
extern crate libc;
extern crate test;

fn main() {}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;
    use std::net::ToSocketAddrs;

    #[bench]
    fn bench_plain(b: &mut Bencher) {
        let addr = ("google.com", 80);
        b.iter(|| addr.to_socket_addrs().map(|a| a.count()).unwrap_or(0));
    }

    #[bench]
    fn bench_reinit(b: &mut Bencher) {
        let addr = ("google.com", 80);
        b.iter(|| {
                   addr.to_socket_addrs().map(|a| a.count()).unwrap_or(0);
                   unsafe { libc::res_init() };
               });
    }
}
$ cargo +nightly bench
    Finished release [optimized] target(s) in 0.0 secs
     Running target/release/deps/res_init_bench-94665f8bc2ebf0bc

running 3 tests
test tests::bench_plain  ... bench:   4,419,043 ns/iter (+/- 331,978)
test tests::bench_reinit ... bench:   4,420,291 ns/iter (+/- 297,022)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

$ sudo netctl stop-all
$ cargo +nightly bench
    Finished release [optimized] target(s) in 0.0 secs
     Running target/release/deps/res_init_bench-94665f8bc2ebf0bc

running 3 tests
test tests::bench_plain  ... bench:     267,152 ns/iter (+/- 18,776)
test tests::bench_reinit ... bench:     267,891 ns/iter (+/- 9,855)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured
@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 3, 2017

@jonhoo awesome thanks for testing!

@jonhoo jonhoo force-pushed the jonhoo:reread-nameservers-on-lookup-fail branch from 65fcd0d to 1309652 May 4, 2017

@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented May 4, 2017

@alexcrichton I believe that with rust-lang/libc#585 merged, 1309652 should now work fine. What's the process for building with a more recent libc?

@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented May 4, 2017

@alexcrichton not sure if this is the right way to do it, but I bumped the submodule revision for liblibc, and everything seems to be working as it should. r?

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 4, 2017

@bors: r+

Looks good! Let's see what CI says

@bors

This comment has been minimized.

Copy link
Contributor

bors commented May 4, 2017

📌 Commit db36fc8 has been approved by alexcrichton

@bors

This comment has been minimized.

Copy link
Contributor

bors commented May 4, 2017

⌛️ Testing commit db36fc8 with merge 31b2a0a...

bors added a commit that referenced this pull request May 4, 2017

Auto merge of #41582 - jonhoo:reread-nameservers-on-lookup-fail, r=al…
…excrichton

Reload nameserver information on lookup failure

As discussed in #41570, UNIX systems often cache the contents of `/etc/resolv.conf`, which can cause lookup failures to persist even after a network connection becomes available. This patch modifies lookup_host to force a reload of the nameserver entries following a lookup failure. This is in line with what many C programs already do (see #41570 for details). On systems with nscd, this should not be necessary, but not all systems run nscd.

Fixes #41570.
Depends on rust-lang/libc#585.

r? @alexcrichton
@bors

This comment has been minimized.

Copy link
Contributor

bors commented May 4, 2017

💔 Test failed - status-travis

@TimNN

This comment has been minimized.

Copy link
Contributor

TimNN commented May 4, 2017

Looks legitimate (Edit: this is on macOS, eg: https://travis-ci.org/rust-lang/rust/jobs/228616549):

[00:02:40]   = note: Undefined symbols for architecture x86_64:
[00:02:40]             "_res_9_init", referenced from:
[00:02:40]                 std::net::lookup_host::hd3d9eaad793f8abc in std-438eba4cd7d88a45.0.o
[00:02:40]           ld: symbol(s) not found for architecture x86_64
[00:02:40]           clang: error: linker command failed with exit code 1 (use -v to see invocation)
@kennytm

This comment has been minimized.

Copy link
Member

kennytm commented May 4, 2017

Problem on libc side. Using res_init on macOS requires linking with -lresolv also.

@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented May 4, 2017

@kennytm yeah, that's what I was worried about. @alexcrichton, looks like we can't rely on std to pull in resolv. How do you want us to fix this? Add linking with resolv on macOS in libc?

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 4, 2017

Ah yeah sorry to be clear I think adding linkage directives to libstd is ok for now, I think we just want to avoid modifying libc for now due to the impact it'll have.

@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented May 4, 2017

@alexcrichton how would I go about doing that?

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 4, 2017

I think the best way would likely be to modify libc-shim's build script in this repo to print out rustc-link-lib=resolv on OSX perhaps?

@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented May 4, 2017

Ah, okay -- gave that a shot in 912da9b.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 4, 2017

Looks pretty good to me, although on second though I think this may actually be best in src/libstd/build.rs as the function's actually used in libstd, sorry about that! While you're at it as well, could you squash the commits into one?

@jonhoo jonhoo force-pushed the jonhoo:reread-nameservers-on-lookup-fail branch from 912da9b to 43acba5 May 4, 2017

@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented May 4, 2017

Done in 43acba5

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 4, 2017

@bors: r+

Thanks!

@Mark-Simulacrum

This comment has been minimized.

Copy link
Member

Mark-Simulacrum commented May 5, 2017

macOS failed again with the same error, I think, though not sure:

[01:28:35]   "_res_9_init", referenced from:
[01:28:35]       std::net::lookup_host::hd3d9eaad793f8abc in libbar.a(libbar.0.o)
@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented May 5, 2017

That's so strange... The log even says that it's linking with libresolv. At this point I think I'd need someone running macOS to take a look and track this symbol down? Debugging this is really tricky without direct access to a macOS environment. The only thing that comes to mind is that the symbol may actually be called res_9_init (note the lack of the _ prefix) in recent macOS deployments? If so, I guess we'd need to update libc too @alexcrichton ? Could someone running macOS take a stab at confirming this locally?

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 5, 2017

Oh -lresolv is correct but both of these tests are performing manual linking, so the libs linked just need to be updated (they're run-make tests I believe)

Reload nameserver information on lookup failure
As discussed in #41570, UNIX systems often cache the contents of
/etc/resolv.conf, which can cause lookup failures to persist even after
a network connection becomes available. This patch modifies lookup_host
to force a reload of the nameserver entries following a lookup failure.
This is in line with what many C programs already do (see #41570 for
details). On systems with nscd, this should not be necessary, but not
all systems run nscd.

Introduces an std linkage dependency on libresolv on macOS/iOS (which
also makes it necessary to update run-make/tools.mk).

Fixes #41570.
Depends on rust-lang/libc#585.

@jonhoo jonhoo force-pushed the jonhoo:reread-nameservers-on-lookup-fail branch from 43acba5 to 68ae617 May 5, 2017

@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented May 5, 2017

Updated src/test/run-make/tools.mk, squashed, and pushed in 68ae617.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 5, 2017

@bors: r+

@bors

This comment has been minimized.

Copy link
Contributor

bors commented May 5, 2017

📌 Commit 68ae617 has been approved by alexcrichton

@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented May 5, 2017

I'll take that as you agreeing that 68ae617#diff-bff523d3aff3a86f367f9d199a559b71R75 is the right change to make :p

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 5, 2017

Indeed!

@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented May 5, 2017

Hmm, seems like @bors is taking a while to pick this up -- @alexcrichton ?

@Mark-Simulacrum

This comment has been minimized.

Copy link
Member

Mark-Simulacrum commented May 5, 2017

You are currently 4th in the queue: https://buildbot2.rust-lang.org/homu/queue/rust. It can take a while, especially now--we're actively trying to land a patch that fixes nightly, so that has higher priority so whenever we retry it, it jumps to the top.

@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented May 5, 2017

Ah, thanks! Didn't know there was a way of looking at the queue. Never mind me :)

frewsxcv added a commit to frewsxcv/rust that referenced this pull request May 5, 2017

Rollup merge of rust-lang#41582 - jonhoo:reread-nameservers-on-lookup…
…-fail, r=alexcrichton

Reload nameserver information on lookup failure

As discussed in rust-lang#41570, UNIX systems often cache the contents of `/etc/resolv.conf`, which can cause lookup failures to persist even after a network connection becomes available. This patch modifies lookup_host to force a reload of the nameserver entries following a lookup failure. This is in line with what many C programs already do (see rust-lang#41570 for details). On systems with nscd, this should not be necessary, but not all systems run nscd.

Fixes rust-lang#41570.
Depends on rust-lang/libc#585.

r? @alexcrichton

bors added a commit that referenced this pull request May 5, 2017

Auto merge of #41773 - frewsxcv:rollup, r=frewsxcv
Rollup of 9 pull requests

- Successful merges: #41064, #41307, #41512, #41582, #41678, #41722, #41734, #41761, #41763
- Failed merges:

@bors bors merged commit 68ae617 into rust-lang:master May 6, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@bors

This comment has been minimized.

Copy link
Contributor

bors commented May 6, 2017

⌛️ Testing commit 68ae617 with merge 42a4f37...

@retep998

This comment has been minimized.

Copy link
Member

retep998 commented May 6, 2017

Is bors drunk? This PR was just merged in a rollup, so why is bors trying to test it?

@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented May 11, 2017

It is worth noting that if https://sourceware.org/bugzilla/show_bug.cgi?id=984 ever lands, we might want to revert this change.

kennytm added a commit to kennytm/rust-ios-android that referenced this pull request May 21, 2017

// The lookup failure could be caused by using a stale /etc/resolv.conf.
// See https://github.com/rust-lang/rust/issues/41570.
// We therefore force a reload of the nameserver information.
c::res_init();

This comment has been minimized.

@tamird

tamird May 23, 2017

Contributor

Doesn't this still result in surprising behaviour if e.g. the contents of /etc/resolv.conf change without the old resolver becoming unusable?

For instance, if I change my DNS resolver without making the old resolver unreachable, I'll never hit this error and any running rust applications will continue to use the old resolver...indefinitely.

This comment has been minimized.

@jonhoo

jonhoo May 23, 2017

Contributor

Yes. Though if the resolution happens successfully, what is the problem? It's also quite hard to get around that particular case. We could always call res_init, but that seems a little wasteful. The real solution to this is to fix libc (most libcs do not have this problem — glibc is the major exception). Applications that want to be robust against this could always call libc::res_init directly though of course.

This comment has been minimized.

@tamird

tamird May 25, 2017

Contributor

Though if the resolution happens successfully, what is the problem?

Playing devil's advocate, "successful" doesn't imply "correct".

We could always call res_init, but that seems a little wasteful.

How wasteful? Perhaps this is worth measuring.

The real solution to this is to fix libc (most libcs do not have this problem — glibc is the major exception).

What do you mean? What would "fixing" libc look like? What do other libcs do in contrast to glibc?

This comment has been minimized.

@jonhoo

jonhoo May 25, 2017

Contributor

Though if the resolution happens successfully, what is the problem?

Playing devil's advocate, "successful" doesn't imply "correct".

True, though that sounds like a very weird setup indeed. One in which you can connect using the resolution information from the old server, but you need to instead connect to the server provided by a new resolver?

We could always call res_init, but that seems a little wasteful.

How wasteful? Perhaps this is worth measuring.

I did some benchmarks above (#41582 (comment)), and it's not terrible (especially because it doesn't require a syscall), but if we can avoid doing something...

The real solution to this is to fix libc (most libcs do not have this problem — glibc is the major exception).

What do you mean? What would "fixing" libc look like? What do other libcs do in contrast to glibc?

No other libcs have this issue. Some of them don't cache /etc/resolv.conf, some integrate with NSS or similar services, which know when the cache should be flushed. I haven't looked into it too carefully. It is unclear what the "right" solution is given that glibc wants to be both fast (i.e., don't do a file read on every connect), and not rely on other services (like NSS).

@oconnor663

This comment has been minimized.

Copy link
Contributor

oconnor663 commented Jul 14, 2017

The glibc man page for res_init notes:

The traditional resolver interfaces such as res_init() and res_query() use some static (global) state stored in the _res structure, rendering these functions non-thread-safe.

Is it safe to call res_init like we're doing now? If not, what are the options for making it safer? We could take a global lock, but I don't think that would help if e.g. we're linking against code in other languages that doesn't know about our lock. The same man page notes that there are more recent functions like res_ninit that can use per-thread state. Those might be safer if they're widespread enough to depend on them.

@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented Jul 14, 2017

That's a good point, though through some digging it appears as though res_init is in fact thread-safe. It's unfortunately somewhat tricky to use res_ninit and have it affect the same structure as is used by glibc. There's some discussion of it further down in the glibc man page:

In glibc, when you link with -lpthread, such a per-thread resolver state is already present. It can be accessed using _res', which has been redefined as a macro, in a similar way to what has been done for the errno' and h_errno' variables. This per-thread resolver state is also used for the gethostby*' family of functions, which means that for example `gethostbyname_r' is now fully thread-safe and re-entrant.

This suggests that we could (and probably should) use res_ninit, but we'd need to figure out which per-thread symbol to use. The aliasing magic seems to reside here, but I can't quite tell how it works to provide a per-thread _res symbol through a macro?

@oconnor663

This comment has been minimized.

Copy link
Contributor

oconnor663 commented Jul 17, 2017

@jonhoo should we file an issue somewhere for this? I'm not sure I would call it "unsound" so much as "maybe unsound under certain very specific conditions and versions of glibc" :p

@jonhoo

This comment has been minimized.

Copy link
Contributor

jonhoo commented Jul 17, 2017

Well, https://sourceware.org/bugzilla/show_bug.cgi?id=984 is now marked as RESOLVED, so it could be that we can now revert this commit entirely (though I haven't tested). I don't know of a version of glibc where our use is unsound, though you may be right that there is one.

@oconnor663

This comment has been minimized.

Copy link
Contributor

oconnor663 commented Jul 17, 2017

Existing versions of glibc that don't contain the fix will be around for years though. Probably we can never revert this workaround? :(

@oconnor663

This comment has been minimized.

Copy link
Contributor

oconnor663 commented Aug 1, 2017

@jonhoo I'm close to convinced that we've actually started doing something unsafe here. See #43592.

Edit: A fix has landed: #44965

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment