Tel does not always notice when critical processes die during startup #590

Closed
ark3 opened this issue Apr 12, 2018 · 4 comments
ark3 (Contributor) commented Apr 12, 2018

Example trace:

   3.2 TEL | BEGIN SPAN main.py:127(connect)
   3.2 TEL | [15] Launching: kubectl --context ctx --namespace default logs -f gather-3569791658-ks07g --container gather
   3.2 TEL | CMD: kubectl --context ctx --namespace default logs -f gather-3569791658-ks07g --container gather
   3.2 TEL | [16] Launching: kubectl --context ctx --namespace default port-forward gather-3569791658-ks07g 63319:8022
   3.2 TEL | CMD: kubectl --context ctx --namespace default port-forward gather-3569791658-ks07g 63319:8022
   3.2 TEL | [17] Running: ssh -F /dev/null -q -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 63319 telepresence@localhost /bin/true
   3.2 TEL | CMD: ssh -F /dev/null -q -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 63319 telepresence@localhost /bin/true
   3.3 TEL | [17] exit 255 in 0.03 secs.
   3.5 TEL | [18] Running: ssh -F /dev/null -q -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 63319 telepresence@localhost /bin/true
   3.5 TEL | CMD: ssh -F /dev/null -q -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 63319 telepresence@localhost /bin/true
   3.5 TEL | [18] exit 255 in 0.03 secs.
   3.8  16 | error: error upgrading connection: error dialing backend: EOF
   3.8 TEL | [16] exit 1
   3.8 TEL | [19] Running: ssh -F /dev/null -q -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 63319 telepresence@localhost /bin/true
   3.8 TEL | CMD: ssh -F /dev/null -q -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 63319 telepresence@localhost /bin/true
   3.8 TEL | [19] exit 255 in 0.03 secs.

Once the port forward has died, nothing else will work, yet we don't notice for some reason. Thus Tel hangs for a while, then yields a scary crash (RuntimeError: SSH isn't starting.).

This happens to be a silly intermittent failure; simply trying again works.

ark3 added the t:bug and crash instead of error labels Apr 12, 2018
ark3 (Contributor, Author) commented May 15, 2018

To clarify, we don't notice until we check, and we only check for dead processes once everything is up and running, in cleanup.py:wait_for_exit(...). Checking while things are coming up would allow us to avoid waiting for timeouts and fail much more quickly. Or even better, we could retry!
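
For illustration, a minimal sketch of what that early check might look like, assuming the runner keeps subprocess.Popen handles for its critical children. The names and signature here are hypothetical, not Telepresence's actual API:

    import time

    class StartupFailure(Exception):
        """Raised when a critical child process exits while startup is still in progress."""

    def wait_until_ready(probe, critical_processes, timeout=120, interval=0.25):
        """Poll `probe` until it returns True, but fail fast if any critical
        child process (kubectl port-forward, sshuttle, ...) has already exited.

        `probe` is a zero-argument callable, e.g. the ssh /bin/true check;
        `critical_processes` is a list of subprocess.Popen handles.
        """
        deadline = time.time() + timeout
        while time.time() < deadline:
            for proc in critical_processes:
                code = proc.poll()  # None while still running, exit code otherwise
                if code is not None:
                    # No point waiting out the full timeout; the caller can
                    # report this immediately or decide to retry.
                    raise StartupFailure(
                        "critical process %r exited with code %d during startup"
                        % (proc.args, code))
            if probe():
                return
            time.sleep(interval)
        raise StartupFailure("startup probe did not succeed within %d seconds" % timeout)

With a check like this, the trace above would presumably fail at around 3.8s when the port-forward exits, instead of timing out much later with RuntimeError: SSH isn't starting.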

ark3 changed the title from "Tel does not always notice when kubectl port-forward dies" to "Tel does not always notice when critical processes die during startup" May 22, 2018
ark3 (Contributor, Author) commented May 22, 2018

This is a problem for any critical process, not just port-forward. For example, if sshuttle crashes during startup, Telepresence sits through a long timeout instead of noticing immediately.

   6.4 TEL | [35] Launching: sshuttle-telepresence -v --dns --method auto -e 'ssh -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -F /dev/null' --to-ns 127.0.0.1:9053 -r telepresence@localhost:50158 52.5.182.176/32 10.0.127.0/24 52.23.126.223 10.0.125.0/24 10.3.240.0/20 10.0.126.0/24 10.0.12.0/24 52.44.53.64/32
   6.4 TEL | BEGIN SPAN vpn.py:300(connect_sshuttle,sshuttle-wait)
   6.4 TEL | [36] Capturing: python3 -c 'import socket; socket.gethostbyname('"'"'hellotelepresence0'"'"')'
   6.5  36 | Traceback (most recent call last):
   6.5  36 |   File "<string>", line 1, in <module>
   6.5  36 | socket.gaierror: [Errno 8] nodename nor servname provided, or not known
   6.5 TEL | [36] exit 1 in 0.07 secs.
   6.5  35 | Starting sshuttle proxy.
   6.6  35 | Traceback (most recent call last):
   6.6  35 |   File "/Users/distiller/project/virtualenv/bin/sshuttle-telepresence", line 4, in <module>
   6.6  35 |     import re
   6.6  35 |   File "/Users/distiller/project/virtualenv/lib/python3.6/re.py", line 142, in <module>
   6.6  35 |     class RegexFlag(enum.IntFlag):
   6.6  35 | AttributeError: module 'enum' has no attribute 'IntFlag'
   6.6  35 | fatal: ['sudo', '-p', '[local sudo] Password: ', 'PYTHONPATH=/Users/distiller/project/virtualenv/lib/python3.6/site-packages', '--', '/Users/distiller/project/virtualenv/bin/python3.6', '/Users/distiller/project/virtualenv/bin/sshuttle-telepresence', '-v', '--method', 'auto', '--firewall'] returned 1
   6.6 TEL | [35] exit 99

but Telepresence doesn't really notice, instead looping like this:

   6.6 TEL | [37] Capturing: python3 -c 'import socket; socket.gethostbyname('"'"'hellotelepresence1'"'"')'
   6.7  37 | Traceback (most recent call last):
   6.7  37 |   File "<string>", line 1, in <module>
   6.7  37 | socket.gaierror: [Errno 8] nodename nor servname provided, or not known
   6.7 TEL | [37] exit 1 in 0.07 secs.

...

  26.5 TEL | [112] Capturing: python3 -c 'import socket; socket.gethostbyname('"'"'hellotelepresence86'"'"')'
  26.6 112 | Traceback (most recent call last):
  26.6 112 |   File "<string>", line 1, in <module>
  26.6 112 | socket.gaierror: [Errno 8] nodename nor servname provided, or not known
  26.6 TEL | [112] exit 1 in 0.07 secs.

ark3 self-assigned this May 23, 2018
ark3 added this to To do in Tel Tracker via automation Jun 11, 2018
ark3 (Contributor, Author) commented Aug 1, 2018

One way to fix this would be to add a done callback to runner.launch(...) and use that flag in other parts of the code. What currently happens in wait_for_exit(...) generically for all launched processes could then happen in a more nuanced manner, per subprocess.
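
As a rough sketch of that idea (with made-up names; the real runner.launch(...) signature may differ), the launcher could spawn a watcher thread that fires a callback when the child exits, and startup code could consult the resulting flag:

    import subprocess
    import threading

    def launch(args, on_done=None):
        """Start a subprocess and call `on_done(process)` from a watcher thread
        as soon as it exits (a stand-in for adding the callback to runner.launch)."""
        process = subprocess.Popen(args)
        if on_done is not None:
            def watch():
                process.wait()
                on_done(process)
            threading.Thread(target=watch, daemon=True).start()
        return process

    # Hypothetical usage: set a flag when the port-forward dies so startup code
    # can react right away instead of discovering it later in wait_for_exit(...).
    port_forward_died = threading.Event()
    port_forward = launch(
        ["kubectl", "port-forward", "gather-3569791658-ks07g", "63319:8022"],
        on_done=lambda p: port_forward_died.set(),
    )

The SSH wait loop and the sshuttle hellotelepresence probe loop could then check that flag on each iteration and abort, or retry, immediately.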

ark3 (Contributor, Author) commented Dec 5, 2018

Fixed by #853; I botched the commit message.

ark3 closed this as completed Dec 5, 2018
Tel Tracker automation moved this from To do to Done Dec 5, 2018