DNS load balancing with Weave is biased by getaddrinfo's sorting #1245

Closed
2opremio opened this issue Jul 30, 2015 · 10 comments
@2opremio
Contributor

Weave probably cannot do much about this one, but it's a good example of how unreliable load balancing based on plain A records is, following on from #1213.

If hostname foo maps to three IPs in WeaveDNS, the user would expect connections to hostname foo to be load-balanced randomly and uniformly across those IPs. However, due to getaddrinfo's A-record sorting (see the gai.conf manpage for details), that's not the case.
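
As a rough way to see this bias directly, here is a minimal sketch (not part of Weave; it assumes a name foo with several A records, such as the one set up below) that calls getaddrinfo repeatedly and tallies which address comes back first:

import socket
from collections import Counter

first_seen = Counter()
for _ in range(100):
    # getaddrinfo applies glibc's RFC 3484/6724 address sorting before returning
    infos = socket.getaddrinfo("foo", 4567, socket.AF_INET, socket.SOCK_STREAM)
    first_seen[infos[0][4][0]] += 1   # address of the preferred (first) result

for ip, count in first_seen.items():
    print("%s chosen first %d/100 times" % (ip, count))

With uniform load balancing each address would come first roughly a third of the time; with the sorting described above the counts are heavily skewed.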

For instance, getaddrinfo always prioritizes the local IP. Here's an example:

I launch 3 containers with the same A-record (foo) in WeaveDNS.

$ weave launch-router
$ weave launch-proxy --hostname-match '([^-]+)-[0-9]+'
$ export DOCKER_HOST=unix:///var/run/weave.sock
$ docker run -d --name=foo-1 ubuntu sh -c 'while true; do echo "I am foo-1" | nc -l -p 4567 ; done'
$ docker run -d --name=foo-2 ubuntu sh -c 'while true; do echo "I am foo-2" | nc -l -p 4567 ; done'
$ docker run -d --name=foo-3 ubuntu sh -c 'while true; do echo "I am foo-3" | nc -l -p 4567 ; done'
$ weave dns-lookup foo
10.32.0.1
10.32.0.3
10.32.0.2
$ docker run -ti ubuntu nc 10.32.0.1 4567
I am foo-1
$ docker run -ti ubuntu nc 10.32.0.2 4567
I am foo-2
$ docker run -ti ubuntu nc 10.32.0.3 4567
I am foo-3

Now, if I connect to foo from an external container, access seems to be random:

$ docker run -ti ubuntu nc foo 4567
I am foo-2
$ docker run -ti ubuntu nc foo 4567
I am foo-2
$ docker run -ti ubuntu nc foo 4567
I am foo-3
$ docker run -ti ubuntu nc foo 4567
I am foo-2
$ docker run -ti ubuntu nc foo 4567
I am foo-3
$ docker run -ti ubuntu nc foo 4567
I am foo-2
$ docker run -ti ubuntu nc foo 4567
I am foo-2
$ docker run -ti ubuntu nc foo 4567
I am foo-2
$ docker run -ti ubuntu nc foo 4567
I am foo-2
$ docker run -ti ubuntu nc foo 4567
I am foo-2
$ docker run -ti ubuntu nc foo 4567
I am foo-3
$ docker run -ti ubuntu nc foo 4567
I am foo-3
$ docker run -ti ubuntu nc foo 4567
I am foo-3
$ docker run -ti ubuntu nc foo 4567
I am foo-1

However, if I connect to foo from inside the containers named foo, the local container is always favored:

$ docker exec -ti foo-1 nc foo 4567
I am foo-1
$ docker exec -ti foo-1 nc foo 4567
I am foo-1
$ docker exec -ti foo-1 nc foo 4567
I am foo-1
$ docker exec -ti foo-1 nc foo 4567
I am foo-1
$ docker exec -ti foo-1 nc foo 4567
I am foo-1
$ docker exec -ti foo-1 nc foo 4567
I am foo-1
$ docker exec -ti foo-1 nc foo 4567
I am foo-1
$ docker exec -ti foo-1 nc foo 4567
I am foo-1
$ docker exec -ti foo-2 nc foo 4567
I am foo-2
$ docker exec -ti foo-2 nc foo 4567
I am foo-2
$ docker exec -ti foo-2 nc foo 4567
I am foo-2
$ docker exec -ti foo-2 nc foo 4567
I am foo-2
$ docker exec -ti foo-2 nc foo 4567
I am foo-2
$ docker exec -ti foo-2 nc foo 4567
I am foo-2
$ docker exec -ti foo-3 nc foo 4567
I am foo-3
$ docker exec -ti foo-3 nc foo 4567
I am foo-3
$ docker exec -ti foo-3 nc foo 4567
I am foo-3
$ docker exec -ti foo-3 nc foo 4567
I am foo-3
$ docker exec -ti foo-3 nc foo 4567
I am foo-3

This is because Ubuntu uses the OpenBSD variant of nc, which uses getaddrinfo for name resolution:

$ docker run -ti ubuntu sh -c 'ls -l $(which nc); ls -l /etc/alternatives/nc; nc -v' 
lrwxrwxrwx 1 root root 20 Jun 30 10:25 /bin/nc -> /etc/alternatives/nc
lrwxrwxrwx 1 root root 15 Jun 30 10:25 /etc/alternatives/nc -> /bin/nc.openbsd
This is nc from the netcat-openbsd package. An alternative nc is available
in the netcat-traditional package.
usage: nc [-46bCDdhjklnrStUuvZz] [-I length] [-i interval] [-O length]
          [-P proxy_username] [-p source_port] [-q seconds] [-s source]
          [-T toskeyword] [-V rtable] [-w timeout] [-X proxy_protocol]
          [-x proxy_address[:port]] [destination] [port]

When using netcat-traditional (which uses gethostbyname), access to foo is random, even when the local container is associated with a foo A record:

$ docker exec -ti foo-1 apt-get update
[...]
$ docker exec -ti foo-1 apt-get install netcat-traditional
[...]
$ docker exec -ti foo-1 nc.traditional foo 4567
I am foo-2
$ docker exec -ti foo-1 nc.traditional foo 4567
I am foo-3
$ docker exec -ti foo-1 nc.traditional foo 4567
I am foo-3
$ docker exec -ti foo-1 nc.traditional foo 4567
I am foo-1
$ docker exec -ti foo-1 nc.traditional foo 4567
I am foo-3
$ docker exec -ti foo-1 nc.traditional foo 4567
I am foo-2
$ docker exec -ti foo-1 nc.traditional foo 4567
I am foo-1
$ docker exec -ti foo-1 nc.traditional foo 4567
I am foo-2
$ docker exec -ti foo-1 nc.traditional foo 4567
I am foo-3
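
The same contrast can be seen without nc at all. A minimal Python sketch (assuming it runs inside one of the foo containers): socket.getaddrinfo goes through glibc's sorting, while socket.gethostbyname_ex does a gethostbyname-style lookup and keeps the order the resolver returned.

import socket

# glibc-sorted order: the local container's own IP tends to end up first
sorted_addrs = [info[4][0] for info in
                socket.getaddrinfo("foo", 4567, socket.AF_INET, socket.SOCK_STREAM)]

# gethostbyname-style lookup: roughly the order WeaveDNS returned (shuffled)
raw_addrs = socket.gethostbyname_ex("foo")[2]

print("getaddrinfo order:  ", sorted_addrs)
print("gethostbyname order:", raw_addrs)
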
@2opremio 2opremio changed the title DNS loadbalancing with Weave is biased by getaddrinfo's sorting DNS load balancing with Weave is biased by getaddrinfo's sorting Jul 30, 2015
@rade
Member

rade commented Jul 30, 2015

I reckon we should just document the behaviour. It's not actually completely unreasonable, i.e. it favours local servers over others, right?

@rade rade added this to the current milestone Jul 30, 2015
@2opremio
Contributor Author

@rade Favoring the local server indeed seems reasonable, but I only used it as an example of things not working as expected. I guess we should add a chore to go through RFC 3484 in detail (which I haven't done) and figure out whether that's the only impact of getaddrinfo's sorting.

@tomwilkie tomwilkie self-assigned this Aug 4, 2015
@tomwilkie
Contributor

I had a bit-more-than-a-quick look through the algorithm and AFAICT we can't influence the sort order - it seems to be based purely on the IP addresses returned.

I considered removing the shuffle code from the DNS server (as it's not useful), but in the case of no IPs being considered 'local' it might actually do what we expect, since the algorithm's behaviour is to preserve the original order as a last resort. This needs testing.

Assuming the above is true, and no one finds any new information, we will document this behaviour and close this issue.
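
For reference, the shuffle being discussed amounts to something like this sketch (hypothetical, not WeaveDNS's actual code): return the registered A records in a fresh random order on every query, which only helps clients whose resolver preserves that order (getaddrinfo's last-resort rule, or gethostbyname).

import random

def answer_records(records):
    # records: the IPs registered for one name, e.g. ["10.32.0.1", "10.32.0.2", "10.32.0.3"]
    shuffled = list(records)
    random.shuffle(shuffled)   # new order for every query
    return shuffled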

@tomwilkie
Contributor

Yes, for non-'local' IPs the shuffling is useful - with 3 containers on host2, from host1 you get random results:

vagrant@host1:~$ docker run gliderlabs/alpine ping -nq -W 1 -c 1 foo.weave.local
PING foo.weave.local (10.40.0.2): 56 data bytes

--- foo.weave.local ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 1.276/1.276/1.276 ms
vagrant@host1:~$ docker run gliderlabs/alpine ping -nq -W 1 -c 1 foo.weave.local
PING foo.weave.local (10.40.0.1): 56 data bytes

--- foo.weave.local ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 1.276/1.276/1.276 ms
vagrant@host1:~$ docker run gliderlabs/alpine ping -nq -W 1 -c 1 foo.weave.local
PING foo.weave.local (10.40.0.0): 56 data bytes

--- foo.weave.local ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 1.433/1.433/1.433 ms
vagrant@host1:~$ docker run gliderlabs/alpine ping -nq -W 1 -c 1 foo.weave.local
PING foo.weave.local (10.40.0.0): 56 data bytes

--- foo.weave.local ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 1.385/1.385/1.385 ms

From host2 you always get 10.40.0.2:

vagrant@host2:~$ docker run gliderlabs/alpine ping -nq -W 1 -c 1 foo.weave.local
PING foo.weave.local (10.40.0.2): 56 data bytes

--- foo.weave.local ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.368/0.368/0.368 ms
vagrant@host2:~$ docker run gliderlabs/alpine ping -nq -W 1 -c 1 foo.weave.local
PING foo.weave.local (10.40.0.2): 56 data bytes

--- foo.weave.local ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.240/0.240/0.240 ms
vagrant@host2:~$ docker run gliderlabs/alpine ping -nq -W 1 -c 1 foo.weave.local
PING foo.weave.local (10.40.0.2): 56 data bytes

--- foo.weave.local ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.222/0.222/0.222 ms
vagrant@host2:~$ docker run gliderlabs/alpine ping -nq -W 1 -c 1 foo.weave.local
PING foo.weave.local (10.40.0.2): 56 data bytes

--- foo.weave.local ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.234/0.234/0.234 ms
vagrant@host2:~$ docker run gliderlabs/alpine ping -nq -W 1 -c 1 foo.weave.local
PING foo.weave.local (10.40.0.2): 56 data bytes

--- foo.weave.local ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.205/0.205/0.205 ms

@2opremio
Contributor Author

Also relevant: http://www.zytrax.com/books/dns/ch9/rr.html#services.

RFC 6724 (and its predecessor RFC 3484) nominally defines address selection for IPv6 but is also applicable to IPv4 (especially, but not exclusively, in dual stack operations) depending on the LIBC implementation being used. Certainly glibc (GNU libc used on most *nix systems) implements the RFC features for address selection (many thanks to Dennis Leeuw for the heads-up on this issue). From GLIBC version 2.5 to at least GLIBC 2.16 (GLIBC uses a bizarrely exotic numbering system) the RFC 3484 definition was implemented meaning that certain IPv4 addresses (depending on the precise IPv4 set) can be excluded from ever appearing first in the address list thus defeating any load balancing attempts by authoritative servers. This behavior can be controlled by use of a /etc/gai.conf file (see man gai.conf for details). RFC 6724 restored the supremacy of the DNS sourced list (all other things being equal) though at this time (February 2014) it is not known if the RFC 6724 behavior has been implemented in any GLIBC release (the earliest would have been 2.17). In summary, depending on the implementation of RFC 3484/6724 the client can have a definitive (and final) effect on the ordering of IP (IPv4/IPv6) addresses irrespective of any attempts by the authoritative domain owner to change this.

So it seems that if glibc respected RFC 6724, we wouldn't have this problem.

@rade
Member

rade commented Jan 5, 2017

It occurs to me that we could get our DNS to resolve special names "blah.single.weave.works" (or something like that) to a single random record chosen from "blah.weave.works".
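
A hypothetical sketch of that idea (the ".single." label and the record store are made up for illustration): when the queried name carries the special label, answer with one randomly chosen record from the base name's set instead of the whole set.

import random

RECORDS = {"foo.weave.local": ["10.32.0.1", "10.32.0.2", "10.32.0.3"]}

def resolve(name):
    if ".single." in name:
        base = name.replace(".single.", ".", 1)
        addrs = RECORDS.get(base, [])
        return [random.choice(addrs)] if addrs else []
    return list(RECORDS.get(name, []))

print(resolve("foo.single.weave.local"))  # one record, e.g. ['10.32.0.2']
print(resolve("foo.weave.local"))         # all three records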

@bboreham
Contributor

bboreham commented Jan 5, 2017

@rade similar: hashicorp/consul#1481

@SpComb

SpComb commented Mar 12, 2018

Here's an example of how this behaves... the behavior depends on the exact layout of the bits in the client/server container addresses: https://gist.github.com/SpComb/c509bd064bc75151e6b41e8bc949d13f

For multiple server addresses in the same subnet as the client, glibc seems to use a longest-prefix match to sort the destination addresses.
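
A minimal sketch of that rule (not glibc's actual implementation; the addresses are made up): rank candidate destinations by how many leading bits they share with the client's own address, so the "closest" address always sorts first.

import ipaddress

def common_prefix_len(a, b):
    # number of leading bits shared by two IPv4 addresses
    xor = int(ipaddress.ip_address(a)) ^ int(ipaddress.ip_address(b))
    return 32 - xor.bit_length()

client = "10.32.0.17"                                   # hypothetical client address
candidates = ["10.32.0.1", "10.40.0.2", "10.32.128.9"]  # hypothetical A records
ranked = sorted(candidates, key=lambda ip: common_prefix_len(client, ip), reverse=True)
print(ranked)   # ['10.32.0.1', '10.32.128.9', '10.40.0.2'] - same winner every time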

@SpComb

SpComb commented Mar 12, 2018

WTF GitHub, I did not unassign anyone... must be some accidental keyboard hotkey.

Anyway, the best workaround I can figure out for this would indeed be to return only a single randomly chosen A record for such round-robin service names... in the case of servers going down, that would rely on the client applications retrying resolution and connecting to a different IP (in addition to health checks removing the broken servers from the DNS pool).
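
What that asks of clients looks roughly like this sketch (plain TCP, hypothetical helper; the service name is just an example): re-resolve on every attempt and move on when the chosen server is unreachable.

import socket

def connect_with_retry(name, port, attempts=5, timeout=2.0):
    last_err = None
    for _ in range(attempts):
        ip = socket.gethostbyname(name)   # fresh single-record answer each time
        try:
            return socket.create_connection((ip, port), timeout=timeout)
        except OSError as err:
            last_err = err                # server may be down; resolve again and retry
    raise last_err

# conn = connect_with_retry("foo.weave.local", 4567)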
