New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Travis Mac tests fail intermittently due to environmental issues #393

Closed
ark3 opened this Issue Jan 2, 2018 · 4 comments

Comments

2 participants
@ark3
Contributor

ark3 commented Jan 2, 2018

DNS requests from the Travis Mac environment have domain names that vary during a single Telepresence run. For example, one request will be for hellotelepresence10.jobs.macstadium.blah and the next will be for hellotelepresence11.macstadium.blah. Once Telepresence captures the first request's domain for automatic stripping, the second request yields an NXDOMAIN, which causes either the Telepresence connection or the test to fail.

@ark3 ark3 added the infrastructure label Jan 2, 2018

@exarkun

This comment has been minimized.

Contributor

exarkun commented Jan 8, 2018

Here's a random thing from OS X:

$ ipconfig getpacket en0 | grep jobs
domain_name (string): jobs.macstadium-us-se-1.travisci.net

Here's my initial hypothesis. The resolver starts with the full domain as a suffix. Telepresence gets hellotelepresenceN.jobs.macstadium-us-se-1.travisci.net and is prepared to return a response. However, the client times out first. The client moves on to trying a query with one fewer "dots": hellotelepresenceN.macstadium-us-se-1.travisci.net. Telepresence doesn't recognize this because it believes it already has the suffix figured out, tries to do a regular name lookup, fails, returns the failure to the client (which accidentally doesn't timeout before all this happens), and the client things things have failed.

An instance of this failure I can find on travis has this blurb in its logs:


  12.4 TL | [35] Capturing: (['python3', '-c', "import socket; socket.gethostbyname('hellotelepresence10')"],)...
  12.4 24 | c : DNS request from ('10.182.9.190', 50441) to None: 74 bytes
  12.4 10 | 2018-01-02T17:01:35+0000 [stdout#info] Set DNS suffix we filter out to: [b'jobs', b'macstadium-us-se-1', b'travisci', b'net']
  12.4 10 | 2018-01-02T17:01:35+0000 [stdout#info] Result for b'hellotelepresence10.jobs.macstadium-us-se-1.travisci.net' is ['127.0.0.1']
  12.5 TL | [35] captured.
http request failed, sleeping before retry (Command '['kubectl', 'run', '--attach', 'testing-f1c9efa-164-44151329994202-14022', '--quiet', '--rm', '--image=alpine', '--restart', 'Never', '--namespace', 'default', '--command', '--', 'wget', '-q', '-O-', '-T', '3', 'http://testing-f1c9efa-152-63478922843933-14022:70/__init__.py']' returned non-zero exit status 1.)
  12.9 24 | c : DNS request from ('10.182.9.190', 59359) to None: 54 bytes
  13.0 10 | 2018-01-02T17:01:36+0000 [stdout#info] A query: b'36.courier-push-apple.com.akadns.net'
  13.1 10 | 2018-01-02T17:01:36+0000 [stdout#info] Result for b'36.courier-push-apple.com.akadns.net' is ['17.249.230.5', '17.249.92.23', '17.249.92.19', '17.249.230.8', '17.249.92.33', '17.249.230.38', '17.249.230.12', '17.249.230.9']
  13.5 24 | c : DNS request from ('10.182.9.190', 56097) to None: 54 bytes
  13.5 TL | [36] Capturing: (['python3', '-c', "import socket; socket.gethostbyname('hellotelepresence11')"],)...
  13.5 10 | 2018-01-02T17:01:36+0000 [stdout#info] A query: b'22.courier-push-apple.com.akadns.net'
  13.5 24 | c : DNS request from ('10.182.9.190', 54914) to None: 69 bytes
  13.6 10 | 2018-01-02T17:01:37+0000 [stdout#info] Result for b'22.courier-push-apple.com.akadns.net' is ['17.249.230.33', '17.249.230.24', '17.249.230.29', '17.249.230.8', '17.249.230.15', '17.249.92.33', '17.249.230.38', '17.249.230.18']
  13.6 10 | 2018-01-02T17:01:37+0000 [stdout#info] A query: b'hellotelepresence11.macstadium-us-se-1.travisci.net'
  13.7 10 | 2018-01-02T17:01:37+0000 [stdout#info] [Failure instance: Traceback: <class 'socket.gaierror'>: [Errno -2] Name does not resolve
  13.7 10 | 2018-01-02T17:01:37+0000 [stdout#info] /usr/lib/python3.6/threading.py:916:_bootstrap_inner
  13.7 10 | 2018-01-02T17:01:37+0000 [stdout#info] /usr/lib/python3.6/threading.py:864:run
  13.7 10 | 2018-01-02T17:01:37+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/_threads/_threadworker.py:46:work
  13.7 10 | 2018-01-02T17:01:37+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/_threads/_team.py:190:doWork
  13.7 10 | 2018-01-02T17:01:37+0000 [stdout#info] --- <exception caught here> ---
  13.7 10 | 2018-01-02T17:01:37+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/python/threadpool.py:250:inContext
  13.7 10 | 2018-01-02T17:01:37+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/python/threadpool.py:266:<lambda>
  13.7 10 | 2018-01-02T17:01:37+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/python/context.py:122:callWithContext
Traceback (most recent call last):
  File "<string>", line 1, in <module>
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

The first request is for hellotelepresence10.jobs.macstadium-us-se-1.travisci.net and the second is for hellotelepresence11.macstadium-us-se-1.travisci.net. This is not supportive of my hypothesis above since it's two different names that are looked up with two different suffixes. Also, looking through resolver documentation, I wasn't able to find any which supported the existence of the behavior described at the top.

@exarkun

This comment has been minimized.

Contributor

exarkun commented Jan 18, 2018

And here's a Travis Linux-flavored example of what seems likely to be essentially the same problem:

 6.4 TL | [25] Capturing: (['python3', '-c', "import socket; socket.gethostbyname('hellotelepresence0')"],)...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
socket.gaierror: [Errno -2] Name or service not known
   6.7 TL | [26] Capturing: (['python3', '-c', "import socket; socket.gethostbyname('hellotelepresence1')"],)...
   6.7 12 | Listening...
   6.7 12 | 2018-01-18T13:32:16+0000 [-] Loading ./forwarder.py...
   6.7 12 | 2018-01-18T13:32:16+0000 [-] SOCKSv5Factory starting on 9050
   6.7 12 | 2018-01-18T13:32:16+0000 [socks.SOCKSv5Factory#info] Starting factory <socks.SOCKSv5Factory object at 0x7ff909a3e438>
   6.7 12 | 2018-01-18T13:32:16+0000 [-] /etc/resolv.conf changed, reparsing
   6.7 12 | 2018-01-18T13:32:16+0000 [-] Resolver added ('10.3.240.10', 53) to server list
   6.7 12 | 2018-01-18T13:32:16+0000 [-] DNSDatagramProtocol starting on 9053
   6.7 12 | 2018-01-18T13:32:16+0000 [-] Starting protocol <twisted.names.dns.DNSDatagramProtocol object at 0x7ff90b57cf28>
   6.7 12 | 2018-01-18T13:32:16+0000 [-] Loaded.
   6.7 12 | 2018-01-18T13:32:16+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.9.0 (/usr/bin/python3.6 3.6.1) starting up.
   6.7 12 | 2018-01-18T13:32:16+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
socket.gaierror: [Errno -2] Name or service not known
   6.8 24 | Starting sshuttle proxy.
   6.9 TL | [27] Capturing: (['python3', '-c', "import socket; socket.gethostbyname('hellotelepresence2')"],)...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
socket.gaierror: [Errno -2] Name or service not known
   7.0 24 | firewall manager: Starting firewall with Python version 3.5.4
   7.0 24 | firewall manager: ready method name nat.
   7.0 24 | IPv6 enabled: False
   7.0 24 | UDP enabled: False
   7.0 24 | DNS enabled: True
   7.0 24 | TCP redirector listening on ('127.0.0.1', 12300).
   7.0 24 | DNS listening on ('127.0.0.1', 12300).
   7.0 24 | Starting client with Python version 3.5.4
   7.0 24 | c : connecting to server...
   7.0 13 | Handling connection for 34545
   7.0 TL | [28] Capturing: (['python3', '-c', "import socket; socket.gethostbyname('hellotelepresence3')"],)...
   7.1 24 | Warning: Permanently added '[localhost]:34545' (ECDSA) to the list of known hosts.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
socket.gaierror: [Errno -2] Name or service not known
   7.2 TL | [29] Capturing: (['python3', '-c', "import socket; socket.gethostbyname('hellotelepresence4')"],)...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
socket.gaierror: [Errno -2] Name or service not known
   7.3 TL | [30] Capturing: (['python3', '-c', "import socket; socket.gethostbyname('hellotelepresence5')"],)...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
socket.gaierror: [Errno -2] Name or service not known
   7.5 TL | [31] Capturing: (['python3', '-c', "import socket; socket.gethostbyname('hellotelepresence6')"],)...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
socket.gaierror: [Errno -2] Name or service not known
   7.6 TL | [32] Capturing: (['python3', '-c', "import socket; socket.gethostbyname('hellotelepresence7')"],)...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
socket.gaierror: [Errno -2] Name or service not known
   7.8 TL | [33] Capturing: (['python3', '-c', "import socket; socket.gethostbyname('hellotelepresence8')"],)...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
socket.gaierror: [Errno -2] Name or service not known
   7.8 24 | Starting server with Python version 3.6.1
   7.8 24 |  s: latency control setting = True
   7.8 24 |  s: available routes:
   7.8 24 |  s:   2/10.0.3.0/24
   7.8 24 | c : Connected.
   7.8 24 | firewall manager: setting up.
   7.8 24 | >> iptables -t nat -N sshuttle-12300
   7.8 24 | >> iptables -t nat -F sshuttle-12300
   7.9 24 | >> iptables -t nat -I OUTPUT 1 -j sshuttle-12300
   7.9 24 | >> iptables -t nat -I PREROUTING 1 -j sshuttle-12300
   7.9 24 | >> iptables -t nat -A sshuttle-12300 -j RETURN --dest 127.0.0.1/32 -p tcp
   7.9 24 | >> iptables -t nat -A sshuttle-12300 -j REDIRECT --dest 10.0.3.0/24 -p tcp --to-ports 12300 -m ttl ! --ttl 42
   7.9 24 | >> iptables -t nat -A sshuttle-12300 -j REDIRECT --dest 10.3.240.0/20 -p tcp --to-ports 12300 -m ttl ! --ttl 42
   7.9 24 | >> iptables -t nat -A sshuttle-12300 -j REDIRECT --dest 169.254.169.254/32 -p udp --dport 53 --to-ports 12300 -m ttl ! --ttl 42
   7.9 24 | >> iptables -t nat -A sshuttle-12300 -j REDIRECT --dest 224.0.0.252/32 -p udp --dport 5355 --to-ports 12300 -m ttl ! --ttl 42
   7.9 24 | conntrack v1.4.1 (conntrack-tools): 0 flow entries have been deleted.
   7.9 TL | [34] Capturing: (['python3', '-c', "import socket; socket.gethostbyname('hellotelepresence9')"],)...
   7.9 24 | c : DNS request from ('10.20.0.245', 40065) to None: 66 bytes
   7.9 12 | 2018-01-18T13:32:18+0000 [stdout#info] Set DNS suffix we filter out to: [b'c', b'eco-emissary-99515', b'internal']
   7.9 12 | 2018-01-18T13:32:18+0000 [stdout#info] Result for b'hellotelepresence9.c.eco-emissary-99515.internal' is ['127.0.0.1']
   8.0 TL | [34] captured.
   9.0 TL | [35] Capturing: (['python3', '-c', "import socket; socket.gethostbyname('hellotelepresence10')"],)...
   9.0 24 | c : DNS request from ('10.20.0.245', 43829) to None: 53 bytes
   9.0 12 | 2018-01-18T13:32:19+0000 [stdout#info] A query: b'hellotelepresence10.google.internal'
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] [Failure instance: Traceback: <class 'socket.gaierror'>: [Errno -2] Name does not resolve
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/threading.py:916:_bootstrap_inner
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/threading.py:864:run
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/_threads/_threadworker.py:46:work
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/_threads/_team.py:190:doWork
   9.1 24 | c : DNS request from ('10.20.0.245', 42745) to None: 37 bytes
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] --- <exception caught here> ---
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/python/threadpool.py:250:inContext
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/python/threadpool.py:266:<lambda>
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/python/context.py:122:callWithContext
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/python/context.py:85:callWithContext
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] ./forwarder.py:31:resolve
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] ]
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] A query: b'hellotelepresence10'
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] [Failure instance: Traceback: <class 'socket.gaierror'>: [Errno -2] Name does not resolve
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/threading.py:916:_bootstrap_inner
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/threading.py:864:run
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/_threads/_threadworker.py:46:work
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/_threads/_team.py:190:doWork
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] --- <exception caught here> ---
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/python/threadpool.py:250:inContext
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/python/threadpool.py:266:<lambda>
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/python/context.py:122:callWithContext
Traceback (most recent call last):
  File "<string>", line 1, in <module>
socket.gaierror: [Errno -2] Name or service not known
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] /usr/lib/python3.6/site-packages/twisted/python/context.py:85:callWithContext
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] ./forwarder.py:31:resolve
   9.1 12 | 2018-01-18T13:32:19+0000 [stdout#info] ]
   9.1 24 | c : DNS request from ('10.20.0.245', 36342) to None: 42 bytes
   9.1 24 | c : DNS request from ('10.20.0.245', 36342) to None: 42 bytes
Looks like there's a bug in our code. Sorry about that!
Here's the traceback:
Traceback (most recent call last):
  File "/home/travis/build/datawire/telepresence/telepresence/cli.py", line 73, in call_f
    return f(*args, **kwargs)
  File "/home/travis/build/datawire/telepresence/telepresence/main.py", line 493, in go
    runner, remote_info, args, env, subprocesses, socks_port, ssh
  File "/home/travis/build/datawire/telepresence/telepresence/local.py", line 154, in run_local_command
    connect_sshuttle(runner, remote_info, args, subprocesses, env, ssh)
  File "/home/travis/build/datawire/telepresence/telepresence/vpn.py", line 241, in connect_sshuttle
    get_hellotelepresence()
  File "/home/travis/build/datawire/telepresence/telepresence/vpn.py", line 230, in get_hellotelepresence
    format(next(counter))
  File "/home/travis/build/datawire/telepresence/telepresence/runner.py", line 95, in get_output
    result = str(check_output(*args, **kwargs).strip(), "utf-8")
  File "/opt/python/3.5.4/lib/python3.5/subprocess.py", line 316, in check_output
    **kwargs).stdout
  File "/opt/python/3.5.4/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['python3', '-c', "import socket; socket.gethostbyname('hellotelepresence10')"]' returned non-zero exit status 1

Notice the query b'hellotelepresence9.c.eco-emissary-99515.internal' followed by the query b'hellotelepresence10.google.internal'. This example is clearly not following a pattern consistent with my hypothesis from my previous message (not that Linux and OSX problems necessarily have to have the same underlying cause; Travis could (and probably does) have completely different networks and network configurations for these two environments, for all we know).

@exarkun

This comment has been minimized.

Contributor

exarkun commented Jan 18, 2018

https://cloud.google.com/compute/docs/internal-dns says some things about GCE DNS configuration which seem to align with the observed Linux Travis behavior. Note the two queries from above seem to fit perfectly with the search c.[PROJECT_ID].internal google.internal. line in the sample resolv.conf (if we assume that "eco-emissary-99515" is somehow the applicable project id...).

@exarkun

This comment has been minimized.

Contributor

exarkun commented Jan 18, 2018

An inspection of the CircleCI Linux build environment reveals this, for what it's worth:

$ cat /etc/resolv.conf        
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 169.254.169.254
search c.circleci-picard-mahmood.internal google.internal

@exarkun exarkun self-assigned this Feb 7, 2018

@plombardi89 plombardi89 added this to Completed in Roadmap Feb 20, 2018

@plombardi89 plombardi89 moved this from Completed to macOS in Roadmap Feb 20, 2018

@rhs rhs added this to Tech Debt in Buckets Mar 8, 2018

@exarkun exarkun closed this in #450 Mar 9, 2018

Roadmap automation moved this from macOS to Completed Mar 9, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment