Crash when running telepresence --swap-deployment & --docker-run #464

Open
p8952 opened this Issue Feb 21, 2018 · 4 comments

p8952 commented Feb 21, 2018

What were you trying to do?

Swap a deployment running in Minikube with one running locally.

What did you expect to happen?

Telepresence would stop the running pod created by the deployment and replace it with one running locally.

What happened instead?

Telepresence crashed.

Automatically included information

Command line: ['/usr/bin/telepresence', '--swap-deployment', 'sample-service', '--docker-run', '--rm', '-it', '--volume', '/home/peter/Projects/drum-microservices/src/daemon/services/sample-service/../../shared/modules/:/usr/shared/modules/', '--volume', '/home/peter/Projects/drum-microservices/src/daemon/services/sample-service:/usr/src/drum-app/', 'node:8-alpine', 'sh', '-c', 'cd /usr/src/drum-app && sh']
Version: 0.75
Python version: 3.6.4 (default, Feb 1 2018, 11:06:09) [GCC 7.2.1 20170915 (Red Hat 7.2.1-2)]
kubectl version: Client Version: v1.9.3
oc version: (error: [Errno 2] No such file or directory: 'oc': 'oc')
OS: Linux localhost.localdomain 4.15.3-300.fc27.x86_64 #1 SMP Tue Feb 13 17:02:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Traceback:

Traceback (most recent call last):
  File "/usr/share/telepresence/libexec/lib64/python3.6/site-packages/telepresence/cli.py", line 73, in call_f
    return f(*args, **kwargs)
  File "/usr/share/telepresence/libexec/lib64/python3.6/site-packages/telepresence/main.py", line 489, in go
    ssh,
  File "/usr/share/telepresence/libexec/lib64/python3.6/site-packages/telepresence/container.py", line 133, in run_docker_command
    TELEPRESENCE_LOCAL_IMAGE, "wait"
  File "/usr/share/telepresence/libexec/lib64/python3.6/site-packages/telepresence/runner.py", line 84, in check_call
    raise CalledProcessError(retcode, args)
subprocess.CalledProcessError: Command '(['docker', 'run', '--network=container:telepresence-1519227879-1176856-27548', '--rm', 'datawire/telepresence-local:0.75', 'wait'],)' returned non-zero exit status 1.

Logs:

hon version 3.6.1
  13.4 26 | firewall manager: ready method name nat.
  13.4 26 | IPv6 enabled: False
  13.4 26 | UDP enabled: False
  13.4 26 | DNS enabled: True
  13.4 26 | TCP redirector listening on ('127.0.0.1', 12300).
  13.4 26 | DNS listening on ('127.0.0.1', 12300).
  13.4 26 | Starting client with Python version 3.6.1
  13.4 26 | c : connecting to server...
  14.0 26 |    1.1 TL | A subprocess (['ssh', '-N', '-oServerAliveInterval=1', '-oServerAliveCountMax=10', '-F', '/dev/null', '-q', '-oStrictHostKeyChecking=no', '-oUserKnownHostsFile=/dev/null', '-p', '37249', 'telepresence@172.17.0.1', '-R', '*:4000:127.0.0.1:4000']) died with code 255, killed all processes...
  14.0 26 | Proxy to Kubernetes exited. This is typically due to a lost connection.
  14.1 26 | [INFO  tini (1)] Main child exited normally (with status '3')
  24.4 27 | Failed to connect to proxy in remote cluster.
  24.4 27 | [INFO  tini (1)] Main child exited normally (with status '1')
  24.6 TL | [27] exit 1.

p8952 commented Feb 21, 2018

It seems that starting the Telepresence container locally fails with exit code 125 the first time it is invoked, with exit code 1 the second time (which causes the crash reported here), and with exit code 125 on all subsequent attempts. The exit code of 125 appears to be because the sshuttle container is not yet running.

Command '(['docker', 'run', '--network=container:telepresence-1519229501-9802337-3360', '--rm', 'datawire/telepresence-local:0.75', 'wait'],)' returned non-zero exit status 125.
Command '(['docker', 'run', '--network=container:telepresence-1519229501-9802337-3360', '--rm', 'datawire/telepresence-local:0.75', 'wait'],)' returned non-zero exit status 1.
Command '(['docker', 'run', '--network=container:telepresence-1519229501-9802337-3360', '--rm', 'datawire/telepresence-local:0.75', 'wait'],)' returned non-zero exit status 125.
Command '(['docker', 'run', '--network=container:telepresence-1519229501-9802337-3360', '--rm', 'datawire/telepresence-local:0.75', 'wait'],)' returned non-zero exit status 125.
Command '(['docker', 'run', '--network=container:telepresence-1519229501-9802337-3360', '--rm', 'datawire/telepresence-local:0.75', 'wait'],)' returned non-zero exit status 125.
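
For reference, `docker run` reserves exit status 125 for failures of the Docker command itself (for example, a `--network=container:...` target that is not running), 126 and 127 for a container command that cannot be invoked or found, and otherwise passes through the exit status of the process inside the container. Below is a minimal Python sketch of that distinction. The helper name `describe_docker_run_exit` is hypothetical and not part of Telepresence.

```python
def describe_docker_run_exit(code: int) -> str:
    """Hypothetical helper: explain a `docker run` exit status."""
    if code == 0:
        return "container command succeeded"
    if code == 125:
        return "docker itself failed before the container command ran"
    if code == 126:
        return "container command could not be invoked"
    if code == 127:
        return "container command was not found"
    # Any other value is passed through from the process in the container.
    return f"container command exited with status {code}"


# Exit statuses in the order reported above: first, second, later attempts.
for status in (125, 1, 125):
    print(status, "->", describe_docker_run_exit(status))
```

Read this way, the status 1 on the second attempt would come from the `wait` command inside the `telepresence-local` container rather than from Docker, which seems consistent with the "Failed to connect to proxy in remote cluster" line in the logs.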

p8952 commented Feb 21, 2018

This looks like the same symptom as #395, although with a different root cause, as I'm not running a VPN or anything unusual on the network side.

/ # ssh -vvv -N -oServerAliveInterval=1 -oServerAliveCountMax=10 -F /dev/null -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 36383 telepresence@172.17.0.1 -R '*:4000:127.0.0.1:4000'
OpenSSH_7.5p1-hpn14v4, LibreSSL 2.5.5
debug1: Reading configuration data /dev/null
debug2: resolving "172.17.0.1" port 36383
debug2: ssh_connect_direct: needpriv 0
debug1: Connecting to 172.17.0.1 [172.17.0.1] port 36383.
debug1: connect to address 172.17.0.1 port 36383: Host is unreachable
ssh: connect to host 172.17.0.1 port 36383: Host is unreachable

/ # route | fgrep default
default         172.17.0.1      0.0.0.0         UG    0      0        0 eth0

/ # nc -vvv 172.17.0.1 12345
nc: 172.17.0.1 (172.17.0.1:12345): Host is unreachable
sent 0, rcvd 0

p8952 commented Feb 21, 2018

It looks like something in the default firewall rules applied in Fedora 27 blocks this. If I disable firewalld, the host becomes reachable (the error changes from Host is unreachable to Connection refused):

/ # ssh -vvv -N -oServerAliveInterval=1 -oServerAliveCountMax=10 -F /dev/null -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 36383 telepresence@172.17.0.1 -R '*:4000:127.0.0.1:4000'
OpenSSH_7.5p1-hpn14v4, LibreSSL 2.5.5
debug1: Reading configuration data /dev/null
debug2: resolving "172.17.0.1" port 36383
debug2: ssh_connect_direct: needpriv 0
debug1: Connecting to 172.17.0.1 [172.17.0.1] port 36383.
debug1: connect to address 172.17.0.1 port 36383: Connection refused
ssh: connect to host 172.17.0.1 port 36383: Connection refused
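
The difference between the two outcomes can also be checked without `ssh` or `nc`. The following is a minimal Python sketch, assuming only the standard library. The helper name `probe_tcp` is hypothetical, and the address and port are the ones from the output above.

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Hypothetical helper: classify the result of one TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timed out"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on this port.
        return "refused"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            # No route to the host, or a firewall rejected the packets.
            return "unreachable"
        return f"error: {exc}"


if __name__ == "__main__":
    # Address and port taken from the ssh output above.
    print(probe_tcp("172.17.0.1", 36383))
```

A result of "unreachable" corresponds to the firewalld case, while "refused" means the host could be reached but no server was listening on that port at the time.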

richarddli added this to Reliability in T Roadmap (v2) Feb 21, 2018


ark3 commented Feb 21, 2018

The summary is that getting Connection refused or Host is unreachable from SSH usually indicates a network configuration issue, such as a conflict with a VPN or a firewall. We should document the reasons and workarounds (e.g., https://www.telepresence.io/reference/limitations#ec2) and have Telepresence point to that documentation when it detects this sort of failure.
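
One way the detection could look is sketched below in Python. This is an illustration only, not Telepresence's implementation: the helper name `network_hint`, the marker list, and the hint wording are assumptions, and the URL is the example link from this comment.

```python
from typing import Optional

# Example documentation link from the comment above (assumed target page).
LIMITATIONS_URL = "https://www.telepresence.io/reference/limitations#ec2"

# SSH error fragments that suggest a network configuration problem.
NETWORK_ERRORS = ("Connection refused", "Host is unreachable")


def network_hint(ssh_output: str) -> Optional[str]:
    """Hypothetical helper: return a troubleshooting hint, or None."""
    for marker in NETWORK_ERRORS:
        if marker in ssh_output:
            return (
                f"SSH reported '{marker}'. This usually indicates a network "
                "configuration issue such as a VPN or firewall conflict. "
                f"See {LIMITATIONS_URL}"
            )
    return None


sample = "ssh: connect to host 172.17.0.1 port 36383: Host is unreachable"
print(network_hint(sample))
```

Matching on the error text keeps the sketch simple. A real implementation would need access to the captured SSH output, which the logs above show being suppressed by the `-q` flag.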

rhs added this to Error Feedback in Buckets Mar 8, 2018

ark3 added this to To Do in Container UX via automation Apr 16, 2018

ark3 added this to Container UX in Blobs Aug 1, 2018
