telepresence --docker-run fails from an ec2 instance in a VPC #462

Closed
blak3mill3r opened this Issue Feb 21, 2018 · 10 comments

blak3mill3r commented Feb 21, 2018

I'm trying, without success, to run telepresence --docker-run ... from an EC2 instance inside our VPC.

I've spent a lot of time narrowing down what's wrong by trial and error.

A pretty simple failing case: telepresence --docker-run -it debian:latest /bin/bash

In telepresence.log I can see that it is trying to resolve a strange DNS name:

9.4 42 | ssh: Could not resolve hostname ip-172-17-0-1.e: Name does not resolve^M

I have a feeling this might be related to EC2's DNS server behavior, where certain names resolve differently inside the AWS network: DNS names that are public and resolve to a public IP from outside will resolve to a private IP when the query comes from inside EC2.
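
For illustration, here is roughly what that split-horizon behavior looks like (the hostname and addresses below are placeholders, not from our cluster):

# From a machine outside AWS, the public name resolves to the public IP:
dig +short ec2-203-0-113-10.compute-1.amazonaws.com
# 203.0.113.10

# From an instance inside the VPC (using the Amazon-provided resolver),
# the same name resolves to the instance's private IP:
dig +short ec2-203-0-113-10.compute-1.amazonaws.com
# 172.31.20.5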

Is anyone else trying to do this, i.e. telepresence --docker-run from an EC2 instance?

Interestingly, telepresence --method=vpn-tcp works fine from this EC2 instance.

Sometimes when I run the above, telepresence hangs forever. In the log I can see it waiting on 20.7 TL | [54] Running: (['docker', 'run', '--network=container:telepresence-1519197605-8019295-17852', '--rm', 'datawire/telepresence-local:0.75', 'wait'],)..., which never completes because no container named telepresence-1519197605-8019295-17852 is running.

The EC2 instance is Ubuntu 16.04 LTS from an official AMI (ami-0b383171, from https://cloud-images.ubuntu.com/locator/ec2/) that I launched earlier today. I installed docker-ce from the Docker apt repo and telepresence 0.75 via apt, all today.

It does launch the telepresence deployment on the remote end, and the logs from that pod look normal.

The "Would you like to file an issue in our issue tracker?" prompt does not work from the EC2 instance (presumably because it cannot launch a web browser).

Here's the part that I'm pretty certain is relevant:

   8.9 41 | Starting sshuttle proxy.
   9.1 41 | firewall manager: Starting firewall with Python version 3.6.1
   9.1 41 | firewall manager: ready method name nat.
   9.1 41 | IPv6 enabled: False
   9.1 41 | UDP enabled: False
   9.1 41 | DNS enabled: True
   9.1 41 | TCP redirector listening on ('127.0.0.1', 12300).
   9.1 41 | DNS listening on ('127.0.0.1', 12300).
   9.1 41 | Starting client with Python version 3.6.1
   9.1 41 | c : connecting to server...
   9.1 41 | ssh: Could not resolve hostname ip-172-17-0-1.e: Name does not resolve
   9.1 41 | c : fatal: failed to establish ssh session (2)
   9.1 41 |    0.5 TL | Main process (['sshuttle-telepresence', '-v', '--dns', '--method', 'nat', '-e', 'ssh -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -F /dev/null', '--to-ns', '127.0.0.1:9053', '-r', 'telepresence@ip-172-17-0-1.e:40966', '100.96.2.0/24', '100.64.0.0/13', '100.96.6.0/24', '100.96.7.0/24', '100.96.8.0/24', '100.96.3.0/24', '100.96.0.0/24']) exited with code 99.
   9.2 41 | [INFO  tini (1)] Main child exited normally (with status '99')
  19.5 42 | Failed to connect to proxy in remote cluster.
  19.5 42 | [INFO  tini (1)] Main child exited normally (with status '1')
  19.6 TL | [42] exit 1.

Would you like to file an issue in our issue tracker? We'd really appreciate the help improving our product. [Y/n]: y

I'm trying to figure out what ip-172-17-0-1.e is.

I thought perhaps it was connected to the Kubernetes cluster's masters ELB, but I believe I've ruled that out: our VPC subnets do not overlap with 172.17.0.0/16, so I am pretty perplexed as to what ip-172-17-0-1.e is. It is not the address of a Kubernetes node or of any EC2 instance or elastic network interface in our account. Is it something virtual inside Kubernetes, or inside Docker?
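
One way to poke at it (just a sketch; the answer depends entirely on which resolver the instance is using) is to reverse-resolve the address and see what name comes back:

# Reverse lookup of the mystery address against the instance's current resolver;
# a name like ip-172-17-0-1.ec2.internal would explain the truncated "ip-172-17-0-1.e":
dig +short -x 172.17.0.1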

I've tried a bunch of things (different versions of docker-ce, different versions of telepresence, installing from source, removing the vanilla sshuttle apt package); nothing seems to work, although I've seen at least two different failure symptoms (one hanging forever, one crashing). I'll attach a log from each of these.

blak3mill3r commented Feb 21, 2018

   0.0 TL | Telepresence launched at Wed Feb 21 08:04:27 2018
   0.0 TL |   ['/usr/bin/telepresence', '--docker-run', '-it', 'debian:latest', '/bin/bash']
   0.0 TL | Scout info: {'application': 'telepresence', 'latest_version': '0.72', 'notices': []}
   0.0 TL | Context: kube.us-east-1.iris.tv, namespace: default, kubectl_command: kubectl
   0.0 TL | [1] Capturing: (['kubectl', '--context', 'kube.us-east-1.iris.tv', 'cluster-info'],)...
   0.3 TL | [1] captured.
   0.3 TL | [2] Capturing: (['ssh', '-V'],)...
   0.3 TL | [2] captured.
   0.3 TL | [3] Capturing: (['which', 'torsocks'],)...
   0.3 TL | [3] captured.
   0.3 TL | [4] Capturing: (['which', 'sshfs'],)...
   0.3 TL | [4] captured.
   0.3 TL | [5] Capturing: (['kubectl', '--context', 'kube.us-east-1.iris.tv', '--namespace', 'default', 'delete', '--ignore-not-found', 'all', '--selector=telepresence=1347227a-0441-4115-83dd-12aba08ba6b9'],)...
   0.5 TL | [5] captured.
   0.5 TL | [6] Capturing: (['kubectl', '--context', 'kube.us-east-1.iris.tv', '--namespace', 'default', 'run', '--restart=Always', '--limits=cpu=100m,memory=256Mi', '--requests=cpu=25m,memory=64Mi', 'telepresence-1519200267-344957-21985', '--image=datawire/telepresence-k8s:0.75', '--labels=telepresence=1347227a-0441-4115-83dd-12aba08ba6b9'],)...
   0.6 TL | [6] captured.
   0.6 TL | [7] Capturing: (['kubectl', '--context', 'kube.us-east-1.iris.tv', '--namespace', 'default', 'get', 'deployment', '-o', 'json', '--export', '--selector=telepresence=1347227a-0441-4115-83dd-12aba08ba6b9'],)...
   0.8 TL | [7] captured.
   0.8 TL | Expected metadata for pods: {'creationTimestamp': None, 'labels': {'telepresence': '1347227a-0441-4115-83dd-12aba08ba6b9'}}
   0.8 TL | [8] Capturing: (['kubectl', '--context', 'kube.us-east-1.iris.tv', '--namespace', 'default', 'get', 'pod', '-o', 'json', '--export'],)...
   0.9 TL | [8] captured.
   0.9 TL | Checking {'telepresence': '1347227a-0441-4115-83dd-12aba08ba6b9', 'pod-template-hash': '1774008260'} (phase Pending)...
   0.9 TL | Looks like we've found our pod!
   0.9 TL | [9] Capturing: (['kubectl', '--context', 'kube.us-east-1.iris.tv', '--namespace', 'default', 'get', 'pod', 'telepresence-1519200267-344957-21985-5cc844d6b4-z8lpn', '-o', 'json'],)...
   1.0 TL | [9] captured.
   1.2 TL | [10] Capturing: (['kubectl', '--context', 'kube.us-east-1.iris.tv', '--namespace', 'default', 'get', 'pod', 'telepresence-1519200267-344957-21985-5cc844d6b4-z8lpn', '-o', 'json'],)...
   1.4 TL | [10] captured.
   1.6 TL | [11] Capturing: (['kubectl', '--context', 'kube.us-east-1.iris.tv', '--namespace', 'default', 'get', 'pod', 'telepresence-1519200267-344957-21985-5cc844d6b4-z8lpn', '-o', 'json'],)...
   1.7 TL | [11] captured.
   2.0 TL | [12] Capturing: (['kubectl', '--context', 'kube.us-east-1.iris.tv', '--namespace', 'default', 'get', 'pod', 'telepresence-1519200267-344957-21985-5cc844d6b4-z8lpn', '-o', 'json'],)...
   2.1 TL | [12] captured.
   2.1 TL | [13] Launching: (['kubectl', '--context', 'kube.us-east-1.iris.tv', '--namespace', 'default', 'logs', '-f', 'telepresence-1519200267-344957-21985-5cc844d6b4-z8lpn', '--container', 'telepresence-1519200267-344957-21985'],)...
   2.1 TL | [14] Launching: (['kubectl', '--context', 'kube.us-east-1.iris.tv', '--namespace', 'default', 'port-forward', 'telepresence-1519200267-344957-21985-5cc844d6b4-z8lpn', '44442:8022'],)...
   2.1 TL | [15] Capturing: (['ip', 'addr', 'show', 'dev', 'docker0'],)...
   2.1 TL | [15] captured.
   2.1 TL | [16] Launching: (['socat', 'TCP4-LISTEN:44442,bind=172.17.0.1,reuseaddr,fork', 'TCP4:127.0.0.1:44442'],)...
   2.1 TL | [17] Running: (['ssh', '-F', '/dev/null', '-q', '-oStrictHostKeyChecking=no', '-oUserKnownHostsFile=/dev/null', '-p', '44442', 'telepresence@localhost', '/bin/true'],)...
   2.2 TL | [17] exit 255.
   2.4 TL | [18] Running: (['ssh', '-F', '/dev/null', '-q', '-oStrictHostKeyChecking=no', '-oUserKnownHostsFile=/dev/null', '-p', '44442', 'telepresence@localhost', '/bin/true'],)...
   2.4 TL | [18] exit 255.
   2.5 14 | Forwarding from 127.0.0.1:44442 -> 8022
   2.7 TL | [19] Running: (['ssh', '-F', '/dev/null', '-q', '-oStrictHostKeyChecking=no', '-oUserKnownHostsFile=/dev/null', '-p', '44442', 'telepresence@localhost', '/bin/true'],)...
   2.7 14 | Handling connection for 44442
   3.3 TL | [19] ran.
   3.3 TL | [20] Capturing: (['kubectl', '--context', 'kube.us-east-1.iris.tv', '--namespace', 'default', 'exec', 'telepresence-1519200267-344957-21985-5cc844d6b4-z8lpn', '--container', 'telepresence-1519200267-344957-21985', 'env'],)...
   3.6 TL | [20] captured.
   3.6 TL | [21] Running: (['sudo', 'sshfs', '-p', '44442', '-F', '/dev/null', '-o', 'StrictHostKeyChecking=no', '-o', 'UserKnownHostsFile=/dev/null', '-o', 'allow_other', 'telepresence@localhost:/', '/tmp/tmprltxyh54'],)...
   3.6 14 | Handling connection for 44442
   4.1 TL | [21] ran.
   4.1 TL | [22] Capturing: (['kubectl', '--context', 'kube.us-east-1.iris.tv', '--namespace', 'default', 'exec', '--container=telepresence-1519200267-344957-21985', 'telepresence-1519200267-344957-21985-5cc844d6b4-z8lpn', '--', 'python3', '-c', '\nimport socket, sys, json\n\nresult = []\nfor host in sys.argv[1:]:\n    result.append(socket.gethostbyname(host))\nsys.stdout.write(json.dumps(result))\nsys.stdout.flush()\n'],)...
   5.2 TL | [22] captured.
   5.2 TL | [23] Capturing: (['kubectl', 'get', 'nodes', '-o', 'json'],)...
   5.3 TL | [23] captured.
   5.3 TL | [24] Capturing: (['kubectl', 'get', 'services', '-o', 'json'],)...
   5.4 TL | [24] captured.
   5.4 TL | [25] Running: (['kubectl', 'create', 'service', 'clusterip', 'telepresence-1519200273-333588-21985', '--tcp=3000'],)...
   5.6 25 | service "telepresence-1519200273-333588-21985" created
   5.6 TL | [25] ran.
   5.6 TL | [26] Running: (['kubectl', 'create', 'service', 'clusterip', 'telepresence-1519200273-4855056-21985', '--tcp=3000'],)...
   5.7 26 | service "telepresence-1519200273-4855056-21985" created
   5.7 TL | [26] ran.
   5.7 TL | [27] Running: (['kubectl', 'create', 'service', 'clusterip', 'telepresence-1519200273-6562274-21985', '--tcp=3000'],)...
   5.9 27 | service "telepresence-1519200273-6562274-21985" created
   5.9 TL | [27] ran.
   5.9 TL | [28] Running: (['kubectl', 'create', 'service', 'clusterip', 'telepresence-1519200273-8204556-21985', '--tcp=3000'],)...
   6.1 28 | service "telepresence-1519200273-8204556-21985" created
   6.1 TL | [28] ran.
   6.1 TL | [29] Running: (['kubectl', 'create', 'service', 'clusterip', 'telepresence-1519200274-0112665-21985', '--tcp=3000'],)...
   6.3 29 | service "telepresence-1519200274-0112665-21985" created
   6.3 TL | [29] ran.
   6.3 TL | [30] Running: (['kubectl', 'create', 'service', 'clusterip', 'telepresence-1519200274-1723058-21985', '--tcp=3000'],)...
   6.4 30 | service "telepresence-1519200274-1723058-21985" created
   6.4 TL | [30] ran.
   6.4 TL | [31] Running: (['kubectl', 'create', 'service', 'clusterip', 'telepresence-1519200274-365355-21985', '--tcp=3000'],)...
   6.6 31 | service "telepresence-1519200274-365355-21985" created
   6.6 TL | [31] ran.
   6.6 TL | [32] Capturing: (['kubectl', 'get', 'services', '-o', 'json'],)...
   6.7 TL | [32] captured.
   6.7 TL | [33] Running: (['kubectl', 'delete', 'service', 'telepresence-1519200273-333588-21985'],)...
   6.9 33 | service "telepresence-1519200273-333588-21985" deleted
   6.9 TL | [33] ran.
   6.9 TL | [34] Running: (['kubectl', 'delete', 'service', 'telepresence-1519200273-4855056-21985'],)...
   7.0 13 | Listening...
   7.0 13 | 2018-02-21T08:04:34+0000 [-] Loading ./forwarder.py...
   7.0 13 | 2018-02-21T08:04:34+0000 [-] SOCKSv5Factory starting on 9050
   7.0 13 | 2018-02-21T08:04:34+0000 [socks.SOCKSv5Factory#info] Starting factory <socks.SOCKSv5Factory object at 0x7f14dff082e8>
   7.0 13 | 2018-02-21T08:04:34+0000 [-] /etc/resolv.conf changed, reparsing
   7.0 13 | 2018-02-21T08:04:34+0000 [-] Resolver added ('100.64.0.10', 53) to server list
   7.0 13 | 2018-02-21T08:04:34+0000 [-] DNSDatagramProtocol starting on 9053
   7.0 13 | 2018-02-21T08:04:34+0000 [-] Starting protocol <twisted.names.dns.DNSDatagramProtocol object at 0x7f14dfe995c0>
   7.0 13 | 2018-02-21T08:04:34+0000 [-] Loaded.
   7.0 13 | 2018-02-21T08:04:34+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.9.0 (/usr/bin/python3.6 3.6.1) starting up.
   7.0 13 | 2018-02-21T08:04:34+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
   7.1 34 | service "telepresence-1519200273-4855056-21985" deleted
   7.1 TL | [34] ran.
   7.1 TL | [35] Running: (['kubectl', 'delete', 'service', 'telepresence-1519200273-6562274-21985'],)...
   7.3 35 | service "telepresence-1519200273-6562274-21985" deleted
   7.3 TL | [35] ran.
   7.3 TL | [36] Running: (['kubectl', 'delete', 'service', 'telepresence-1519200273-8204556-21985'],)...
   7.5 36 | service "telepresence-1519200273-8204556-21985" deleted
   7.5 TL | [36] ran.
   7.5 TL | [37] Running: (['kubectl', 'delete', 'service', 'telepresence-1519200274-0112665-21985'],)...
   7.7 37 | service "telepresence-1519200274-0112665-21985" deleted
   7.7 TL | [37] ran.
   7.7 TL | [38] Running: (['kubectl', 'delete', 'service', 'telepresence-1519200274-1723058-21985'],)...
   7.8 38 | service "telepresence-1519200274-1723058-21985" deleted
   7.8 TL | [38] ran.
   7.8 TL | [39] Running: (['kubectl', 'delete', 'service', 'telepresence-1519200274-365355-21985'],)...
   8.0 39 | service "telepresence-1519200274-365355-21985" deleted
   8.0 TL | [39] ran.
   8.0 TL | [40] Launching: (['docker', 'run', '--rm', '--privileged', '--name=telepresence-1519200271-9928365-21985', 'datawire/telepresence-local:0.75', 'proxy', '{"expose_ports": [], "cidrs": ["100.96.2.0/24", "100.96.0.0/24", "100.64.0.0/13", "100.96.6.0/24", "100.96.8.0/24", "100.96.7.0/24", "100.96.3.0/24"], "port": 44442}'],)...
   8.0 TL | [41] Running: (['docker', 'run', '--network=container:telepresence-1519200271-9928365-21985', '--rm', 'datawire/telepresence-local:0.75', 'wait'],)...
   8.3 40 | [INFO  tini (1)] Spawned child process 'python3' with pid '8'
   8.4 41 | [INFO  tini (1)] Spawned child process 'python3' with pid '7'
   8.4 40 |    0.0 TL | Telepresence launched at Wed Feb 21 08:04:36 2018
   8.4 40 |    0.0 TL |   ['/usr/bin/entrypoint.py', 'proxy', '{"expose_ports": [], "cidrs": ["100.96.2.0/24", "100.96.0.0/24", "100.64.0.0/13", "100.96.6.0/24", "100.96.8.0/24", "100.96.7.0/24", "100.96.3.0/24"], "port": 44442}']
   8.4 40 |    0.0 TL | Everything launched. Waiting to exit...
   8.6 40 | Starting sshuttle proxy.
   8.8 40 | firewall manager: Starting firewall with Python version 3.6.1
   8.9 40 | firewall manager: ready method name nat.
   8.9 40 | IPv6 enabled: False
   8.9 40 | UDP enabled: False
   8.9 40 | DNS enabled: True
   8.9 40 | TCP redirector listening on ('127.0.0.1', 12300).
   8.9 40 | DNS listening on ('127.0.0.1', 12300).
   8.9 40 | Starting client with Python version 3.6.1
   8.9 40 | c : connecting to server...
   8.9 40 | ssh: Could not resolve hostname ip-172-17-0-1.e: Name does not resolve
   8.9 40 | c : fatal: failed to establish ssh session (2)
   8.9 40 |    0.5 TL | Main process (['sshuttle-telepresence', '-v', '--dns', '--method', 'nat', '-e', 'ssh -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -F /dev/null', '--to-ns', '127.0.0.1:9053', '-r', 'telepresence@ip-172-17-0-1.e:44442', '100.96.2.0/24', '100.96.0.0/24', '100.64.0.0/13', '100.96.6.0/24', '100.96.8.0/24', '100.96.7.0/24', '100.96.3.0/24']) exited with code 99.
   9.0 40 | [INFO  tini (1)] Main child exited normally (with status '99')
  19.3 41 | Failed to connect to proxy in remote cluster.
  19.3 41 | [INFO  tini (1)] Main child exited normally (with status '1')
  19.4 TL | [41] exit 1.
  28.3 TL | [42] Running: (['docker', 'stop', '--time=1', 'telepresence-1519200271-9928365-21985'],)...
  28.3 TL | [42] exit 1.
  28.3 TL | [43] Capturing: (['kubectl', '--context', 'kube.us-east-1.iris.tv', '--namespace', 'default', 'delete', '--ignore-not-found', 'all', '--selector=telepresence=1347227a-0441-4115-83dd-12aba08ba6b9'],)...
  28.4 42 | Error response from daemon: No such container: telepresence-1519200271-9928365-21985
  31.7 TL | [43] captured.

blak3mill3r commented Feb 21, 2018

Ah, 172.17.0.1 is the local inet address of the docker0 virtual network interface...
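
On a stock Docker install this is easy to confirm on the host (output abbreviated):

# The Docker bridge owns 172.17.0.1 by default:
ip -4 addr show dev docker0
# ... inet 172.17.0.1/16 scope global docker0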

blak3mill3r commented Feb 21, 2018

Okay, I've figured it out...

I'm not going to try to explain it in detail because I'm tired, but for others who run into this snag...

Just do this:

# Append the override as root (sudo applies to the write, not just to echo), then re-run the DHCP client:
echo 'supersede domain-name-servers 8.8.8.8, 8.8.4.4;' | sudo tee -a /etc/dhcp/dhclient.conf
sudo dhclient

That will override Amazon's default settings and cause this instance to use Google's public DNS servers instead. This prevents the "magic" I mentioned above (resolving internal names differently inside the VPC).
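
A quick sanity check (exact contents will vary) is that the resolver list picks up the override once dhclient has re-run:

# /etc/resolv.conf should now list the Google resolvers instead of the VPC resolver:
cat /etc/resolv.conf
# nameserver 8.8.8.8
# nameserver 8.8.4.4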

Then telepresence --docker-run -it debian:latest /bin/bash works perfectly and so does the thing I was really trying to do. 👍

richarddli self-assigned this Feb 21, 2018

richarddli commented Feb 21, 2018

Thanks for tracking this down! I'm going to re-open this issue so that we can at least include it in the documentation somewhere.

richarddli reopened this Feb 21, 2018

plombardi89 added this to Features in Roadmap Feb 21, 2018

richarddli added a commit that referenced this issue Feb 21, 2018

richarddli added a commit that referenced this issue Feb 21, 2018

richarddli commented Feb 21, 2018

Added to the troubleshooting documentation.

richarddli closed this Feb 21, 2018

Roadmap automation moved this from Features to Completed Feb 21, 2018

pkim-auro commented Mar 21, 2018

Hi, I am trying to run telepresence on an AWS EC2 instance in a VPC and am seeing what appear to be the same symptoms: it either hangs or errors out. However, after I tried the suggested fix of adding an entry to dhclient.conf, the Kubernetes cluster name does not resolve on the following attempt (straight from the tutorial):

telepresence --docker-run -i -t alpine /bin/sh

Note: the Kubernetes cluster is set up in a different VPC (with kops 1.9.4), and while I am not sure that is relevant, I am going to spin up a new VM in the same VPC to run telepresence and test.

Any thoughts would be appreciated.

ark3 commented Mar 21, 2018

You may be running into an issue we had with Kubernaut in the past (#320 and #391), where running kubectl in a Telepresence shell or container would not work.

The kubectl command needs to resolve the DNS name of the master. Under Telepresence, the master resolved to an internal IP address, but kubectl could not reach that internal address because Telepresence was not forwarding it to the cluster. The workaround was either to add an --also-proxy option specifying that name or to use an IP address instead of a name for the master info. A third possibility is to forward all traffic to the cluster, as happens with the inject-tcp method and probably should happen for the container method as well.
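
For example, a rough sketch of the --also-proxy variant (the hostname is a placeholder; use your cluster's API server name, e.g. from kubectl config view):

# Ask Telepresence to also route traffic for the master's name through the cluster:
telepresence --also-proxy api.kube.example.com --docker-run -i -t alpine /bin/sh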

ark3 commented Mar 22, 2018

From @pkim-auro:

pkim@infra-build:~$ docker run --rm -it --entrypoint route datawire/telepresence-local:0.75
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         ip-172-17-0-1.e 0.0.0.0         UG    0      0        0 eth0
172.17.0.0      *               255.255.0.0     U     0      0        0 eth0

Let's try having Telepresence run route -n instead of route to see whether that avoids the truncated hostname issue and works without further configuration.
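
For comparison, the same check with numeric output, which skips the reverse-DNS lookup and avoids the truncation (the output below is what I'd expect, not captured from a live run):

docker run --rm -it --entrypoint route datawire/telepresence-local:0.75 -n
# Kernel IP routing table
# Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
# 0.0.0.0         172.17.0.1      0.0.0.0         UG    0      0        0 eth0
# 172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth0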

ark3 commented Mar 26, 2018

Looks like release 0.76 fixes this for @pkim-auro. Closing.

ark3 closed this Mar 26, 2018
