
tcp-vpn method fails when entries present in /etc/resolver on macOS #986

Closed
kunickiaj opened this issue Apr 5, 2019 · 6 comments
Labels: bug (Something isn't working), stale (Issue is stale and will be closed)

kunickiaj commented Apr 5, 2019

What were you trying to do?

Use method tcp-vpn for any telepresence command, e.g. `telepresence --run-shell`

What did you expect to happen?

Shell comes up connected.

What happened instead?

telepresence failed, attaching full log via gist.
https://gist.github.com/kunickiaj/3d40b53a4311bf904e07672f69301ad2

I am also running dnsmasq, with a couple of resolvers in /etc/resolver for several domains set to forward to dnsmasq for use with minikube.

When I remove all entries from /etc/resolver (whether or not dnsmasq is running), telepresence seems to behave as expected. I'm not sure why these entries are interfering with telepresence's checks as they're for specific domains. All other domains should go to the upstream proxy, bypassing dnsmasq altogether.

Example entries in /etc/resolver:

`/etc/resolver/streamsets.dev`

```
domain streamsets.dev
port 53535
```

`/etc/resolver/streamsets.net`

```
nameserver 172.31.xxx.xxx
domain streamsets.net
```

### Automatically included information

Command line: `['/usr/local/bin/telepresence', '--verbose', '--run-shell']`
Version: `0.98`
Python version: `3.6.6 (default, Oct  4 2018, 20:50:27) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.2)]`
kubectl version: `Client Version: v1.14.0 // Server Version: v1.12.6-gke.10`
oc version: `(error: Command '['oc', 'version']' returned non-zero exit status 1.)`
OS: `Darwin streamsets-adam-1726337.local 18.5.0 Darwin Kernel Version 18.5.0: Mon Mar 11 20:40:32 PDT 2019; root:xnu-4903.251.3~3/RELEASE_X86_64 x86_64`

```
Traceback (most recent call last):
  File "/usr/local/bin/telepresence/telepresence/cli.py", line 130, in crash_reporting
    yield
  File "/usr/local/bin/telepresence/telepresence/main.py", line 77, in main
    runner, remote_info, env, socks_port, ssh, mount_dir, pod_info
  File "/usr/local/bin/telepresence/telepresence/outbound/setup.py", line 73, in launch
    runner_, remote_info, command, args.also_proxy, env, ssh
  File "/usr/local/bin/telepresence/telepresence/outbound/local.py", line 121, in launch_vpn
    connect_sshuttle(runner, remote_info, also_proxy, ssh)
  File "/usr/local/bin/telepresence/telepresence/outbound/vpn.py", line 295, in connect_sshuttle
    raise RuntimeError("vpn-tcp tunnel did not connect")
RuntimeError: vpn-tcp tunnel did not connect
```


Logs:

```
45.2 TEL | (proxy checking local liveness)
45.2 20 | debug2: channel 1: read<=0 rfd 6 len 0
45.2 20 | debug2: channel 1: read failed
45.2 20 | debug2: channel 1: chan_shutdown_read (i0 o0 sock 6 wfd 6 efd -1 [closed])
45.2 20 | debug2: channel 1: input open -> drain
45.2 20 | debug2: channel 1: ibuf empty
45.2 20 | debug2: channel 1: send eof
45.2 20 | debug2: channel 1: input drain -> closed
45.3 15 | 2019-04-05T22:53:32+0000 [Poll#info] Checkpoint
45.3 20 | debug2: channel 1: rcvd eof
45.3 20 | debug2: channel 1: output open -> drain
45.3 20 | debug2: channel 1: obuf empty
45.3 20 | debug2: channel 1: chan_shutdown_write (i3 o1 sock 6 wfd 6 efd -1 [closed])
45.3 20 | debug2: channel 1: output drain -> closed
45.3 20 | debug2: channel 1: rcvd close
45.3 20 | debug2: channel 1: send close
45.3 20 | debug2: channel 1: is dead
45.3 20 | debug2: channel 1: garbage collecting
45.3 20 | debug1: channel 1: free: 127.0.0.1, nchannels 2
```

ark3 commented Apr 8, 2019

Sorry about that crash, and thank you for filing this issue with detailed information.

It looks like either the very first DNS request after sshuttle is launched blocks for 30 seconds, or the second one blocks and doesn't make it to the sshuttle process. Subsequent queries get through, so it's clear that sshuttle is working. However, the 30-second wait is long enough for Telepresence to give up and crash.

We need to put a timeout on those DNS requests and maybe verify that some minimum number of attempts have been made before giving up.
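
A minimal sketch of that idea, using only the Python standard library. This is not Telepresence's actual code; the names `lookup_with_timeout` and `dns_probe_ok` are made up for illustration:

```
import socket
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def lookup_with_timeout(name, timeout=5.0):
    """Resolve `name`, but give up after `timeout` seconds instead of blocking."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(socket.gethostbyname, name).result(timeout=timeout)
    except (FutureTimeout, socket.gaierror):
        return None  # a slow or failed lookup counts as a miss, not a crash
    finally:
        pool.shutdown(wait=False)  # don't sit around waiting for a lookup that never returns

def dns_probe_ok(attempts=10, timeout=5.0):
    """Succeed as soon as one probe resolves; fail only after `attempts` bounded tries."""
    return any(
        lookup_with_timeout("hellotelepresence{}".format(i), timeout) is not None
        for i in range(attempts)
    )
```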

ark3 added the "bug" label Apr 8, 2019
kunickiaj (Author) commented

How does it choose what to resolve for the checks?

30s is a long time, so I was wondering if it's querying names that fall under one of the extra resolvers, names that can't be resolved from the cluster.

ark3 commented Apr 8, 2019

It alternates between resolving hellotelepresence## and hellotelepresence##.a.sanity.check.telepresence.io where ## is an incrementing number. The latter is supposed to fail, not time out, i.e. the DNS servers for telepresence.io return NXDOMAIN for those requests. There's a good chance it's the latter that is tripping things up.
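
Roughly, each round of the check looks like the sketch below (illustrative only, not the real implementation; `probe_round` is a made-up name). The key detail is that the second lookup is expected to come back quickly with NXDOMAIN (a `socket.gaierror`); a resolver that hangs on it instead can stall the whole check:

```
import socket

def probe_round(counter):
    """One round of the sanity check."""
    positive = "hellotelepresence{}".format(counter)
    negative = "hellotelepresence{}.a.sanity.check.telepresence.io".format(counter)

    try:
        socket.gethostbyname(positive)  # should resolve once sshuttle is capturing DNS
        resolved = True
    except socket.gaierror:
        resolved = False

    try:
        socket.gethostbyname(negative)  # should fail fast with NXDOMAIN, never hang
        hijacked = True                 # something answered a name that must not exist
    except socket.gaierror:
        hijacked = False

    return resolved and not hijacked

# Keep probing with an incrementing counter until one round passes, or give up.
if not any(probe_round(i) for i in range(30)):
    raise RuntimeError("vpn-tcp tunnel did not connect")
```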

ark3 commented Apr 16, 2019

Could you try running this modified Telepresence? It is Telepresence 0.98 plus a couple of fixes, including something that may avoid your issue.

tel-0.98-adh.gz

Assuming you installed Telepresence using Homebrew, you'll need to place this binary in /usr/local/Cellar/telepresence/0.98/bin and call it using the full path. Something like this should work:

```
$ wget https://github.com/telepresenceio/telepresence/files/3086947/tel-0.98-adh.gz
$ gunzip tel-0.98-adh.gz
$ chmod a+x tel-0.98-adh
$ mv tel-0.98-adh /usr/local/Cellar/telepresence/0.98/bin/
$ /usr/local/Cellar/telepresence/0.98/bin/tel-0.98-adh --run curl -svk https://kubernetes.default/api
```

To be specific, this version puts timeouts around the DNS lookups Telepresence does to validate initial VPN connectivity. This way, one slow lookup won't wreck the entire session.

kunickiaj commented Apr 16, 2019 via email

stale bot commented Mar 16, 2021

This issue has been automatically marked as stale because it has not had recent activity.
Issue Reporter: Is this still a problem? If not, please close this issue.
Developers: Do you need more information? Is this a duplicate? What's the next step?

stale bot added the "stale" label Mar 16, 2021
stale bot closed this as completed Apr 16, 2021