vpn-tcp method fails when entries present in /etc/resolver on macOS #986

Open
kunickiaj opened this issue Apr 5, 2019 · 5 comments

@kunickiaj commented Apr 5, 2019

What were you trying to do?

Use the vpn-tcp method for any telepresence command, e.g. telepresence --run-shell.

What did you expect to happen?

Shell comes up connected.

What happened instead?

telepresence failed, attaching full log via gist.
https://gist.github.com/kunickiaj/3d40b53a4311bf904e07672f69301ad2

I also have dnsmasq installed, along with a couple of resolver files in /etc/resolver that forward several domains to dnsmasq for use with minikube.

When I remove all entries from /etc/resolver (whether or not dnsmasq is running), telepresence seems to behave as expected. I'm not sure why these entries are interfering with telepresence's checks as they're for specific domains. All other domains should go to the upstream proxy, bypassing dnsmasq altogether.

Example entries in /etc/resolver:

`/etc/resolver/streamsets.dev`

```
domain streamsets.dev
port 53535
```

`/etc/resolver/streamsets.net`

```
nameserver 172.31.xxx.xxx
domain streamsets.net
```

### Automatically included information

Command line: `['/usr/local/bin/telepresence', '--verbose', '--run-shell']`
Version: `0.98`
Python version: `3.6.6 (default, Oct  4 2018, 20:50:27) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.2)]`
kubectl version: `Client Version: v1.14.0 // Server Version: v1.12.6-gke.10`
oc version: `(error: Command '['oc', 'version']' returned non-zero exit status 1.)`
OS: `Darwin streamsets-adam-1726337.local 18.5.0 Darwin Kernel Version 18.5.0: Mon Mar 11 20:40:32 PDT 2019; root:xnu-4903.251.3~3/RELEASE_X86_64 x86_64`

```
Traceback (most recent call last):
  File "/usr/local/bin/telepresence/telepresence/cli.py", line 130, in crash_reporting
    yield
  File "/usr/local/bin/telepresence/telepresence/main.py", line 77, in main
    runner, remote_info, env, socks_port, ssh, mount_dir, pod_info
  File "/usr/local/bin/telepresence/telepresence/outbound/setup.py", line 73, in launch
    runner_, remote_info, command, args.also_proxy, env, ssh
  File "/usr/local/bin/telepresence/telepresence/outbound/local.py", line 121, in launch_vpn
    connect_sshuttle(runner, remote_info, also_proxy, ssh)
  File "/usr/local/bin/telepresence/telepresence/outbound/vpn.py", line 295, in connect_sshuttle
    raise RuntimeError("vpn-tcp tunnel did not connect")
RuntimeError: vpn-tcp tunnel did not connect
```


Logs:

```
45.2 TEL | (proxy checking local liveness)
45.2 20 | debug2: channel 1: read<=0 rfd 6 len 0
45.2 20 | debug2: channel 1: read failed
45.2 20 | debug2: channel 1: chan_shutdown_read (i0 o0 sock 6 wfd 6 efd -1 [closed])
45.2 20 | debug2: channel 1: input open -> drain
45.2 20 | debug2: channel 1: ibuf empty
45.2 20 | debug2: channel 1: send eof
45.2 20 | debug2: channel 1: input drain -> closed
45.3 15 | 2019-04-05T22:53:32+0000 [Poll#info] Checkpoint
45.3 20 | debug2: channel 1: rcvd eof
45.3 20 | debug2: channel 1: output open -> drain
45.3 20 | debug2: channel 1: obuf empty
45.3 20 | debug2: channel 1: chan_shutdown_write (i3 o1 sock 6 wfd 6 efd -1 [closed])
45.3 20 | debug2: channel 1: output drain -> closed
45.3 20 | debug2: channel 1: rcvd close
45.3 20 | debug2: channel 1: send close
45.3 20 | debug2: channel 1: is dead
45.3 20 | debug2: channel 1: garbage collecting
45.3 20 | debug1: channel 1: free: 127.0.0.1, nchannels 2
```

@ark3 (Contributor) commented Apr 8, 2019

Sorry about that crash, and thank you for filing this issue with detailed information.

It looks like either the very first DNS request after sshuttle is launched blocks for 30 seconds, or the second one blocks and doesn't make it to the sshuttle process. Subsequent queries get through, so it's clear that sshuttle is working. However, the 30-second wait is long enough for Telepresence to give up and crash.

We need to put a timeout on those DNS requests and maybe verify that some minimum number of attempts have been made before giving up.
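
For illustration only (this is not the Telepresence source; the function name and thresholds are assumptions), the kind of guard being described might look like this:

```python
# Hypothetical sketch: keep probing until a deadline, but only give up after a
# minimum number of attempts, so one slow or lost lookup can't end the session.
import socket
import time

def wait_for_tunnel(min_attempts=5, deadline_seconds=60):
    attempts = 0
    start = time.monotonic()
    while True:
        attempts += 1
        try:
            # hellotelepresence-style probe name; illustrative only
            socket.gethostbyname("hellotelepresence{}".format(attempts))
            return  # a probe resolved, so the tunnel's DNS override is working
        except socket.gaierror:
            pass  # this probe failed; keep trying
        if attempts >= min_attempts and time.monotonic() - start > deadline_seconds:
            raise RuntimeError("vpn-tcp tunnel did not connect")
        time.sleep(0.5)
```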

@ark3 ark3 added the bug label Apr 8, 2019

@ark3 ark3 added this to To do in Tel Tracker via automation Apr 8, 2019

@kunickiaj (Author) commented Apr 8, 2019

How does it choose what to resolve for the checks?

30s is a long time, so I was wondering whether the check is looking up names that fall under one of the extra resolvers, names that can't be resolved from the cluster.

@ark3 (Contributor) commented Apr 8, 2019

It alternates between resolving hellotelepresence## and hellotelepresence##.a.sanity.check.telepresence.io where ## is an incrementing number. The latter is supposed to fail, not time out, i.e. the DNS servers for telepresence.io return NXDOMAIN for those requests. There's a good chance it's the latter that is tripping things up.
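
To make that concrete, here is a rough sketch (not the actual Telepresence code; the probe names are taken from the description above) of the alternating lookups:

```python
# Sketch of the alternating DNS probes described above.
import socket

def probe_pair(counter):
    base = "hellotelepresence{}".format(counter)
    for name in (base, base + ".a.sanity.check.telepresence.io"):
        try:
            socket.gethostbyname(name)
            print(name, "resolved")
        except socket.gaierror:
            # For the .a.sanity.check.telepresence.io name this is the expected
            # outcome: the telepresence.io DNS servers answer NXDOMAIN quickly.
            print(name, "did not resolve")
```

The second lookup only works as a fast sanity check if the NXDOMAIN answer comes back promptly; a resolver that swallows or delays it would make the probe hang instead of fail.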

@ark3 ark3 moved this from To do to In progress in Tel Tracker Apr 16, 2019

@ark3 (Contributor) commented Apr 16, 2019

Could you try running this modified Telepresence? It is Telepresence 0.98 plus a couple of fixes, including something that may avoid your issue.

tel-0.98-adh.gz

Assuming you installed Telepresence using Homebrew, you'll need to place this binary in /usr/local/Cellar/telepresence/0.98/bin and call it using the full path. Something like this should work:

```
$ wget https://github.com/telepresenceio/telepresence/files/3086947/tel-0.98-adh.gz
$ gunzip tel-0.98-adh.gz
$ chmod a+x tel-0.98-adh
$ mv tel-0.98-adh /usr/local/Cellar/telepresence/0.98/bin/
$ /usr/local/Cellar/telepresence/0.98/bin/tel-0.98-adh --run curl -svk https://kubernetes.default/api
```

To be specific, this version puts timeouts around the DNS lookups Telepresence does to validate initial VPN connectivity. This way, one slow lookup won't wreck the entire session.
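
As a rough illustration of that approach (a sketch, not the actual patch; the helper name is invented), a blocking lookup can be given a timeout by running it in a worker thread:

```python
import socket
from concurrent.futures import ThreadPoolExecutor, TimeoutError as LookupTimeout

_pool = ThreadPoolExecutor(max_workers=4)  # reused so a stuck lookup doesn't block shutdown

def resolve_with_timeout(name, timeout=5.0):
    """Return True if `name` resolves within `timeout` seconds, else False."""
    future = _pool.submit(socket.gethostbyname, name)
    try:
        future.result(timeout=timeout)
        return True
    except LookupTimeout:
        return False  # the lookup is still hanging in the worker; treat it as a failed attempt
    except socket.gaierror:
        return False  # NXDOMAIN or other resolution error
```

With something like this, a timed-out lookup counts as just another failed attempt instead of stalling the connectivity check for the full 30 seconds.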

@kunickiaj (Author) commented Apr 16, 2019
