Tel does not always notice when critical processes die during startup #590

Closed
ark3 opened this issue Apr 12, 2018 · 4 comments
ark3 (Contributor) commented Apr 12, 2018

Example trace:

   3.2 TEL | BEGIN SPAN main.py:127(connect)
   3.2 TEL | [15] Launching: kubectl --context ctx --namespace default logs -f gather-3569791658-ks07g --container gather
   3.2 TEL | CMD: kubectl --context ctx --namespace default logs -f gather-3569791658-ks07g --container gather
   3.2 TEL | [16] Launching: kubectl --context ctx --namespace default port-forward gather-3569791658-ks07g 63319:8022
   3.2 TEL | CMD: kubectl --context ctx --namespace default port-forward gather-3569791658-ks07g 63319:8022
   3.2 TEL | [17] Running: ssh -F /dev/null -q -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 63319 telepresence@localhost /bin/true
   3.2 TEL | CMD: ssh -F /dev/null -q -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 63319 telepresence@localhost /bin/true
   3.3 TEL | [17] exit 255 in 0.03 secs.
   3.5 TEL | [18] Running: ssh -F /dev/null -q -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 63319 telepresence@localhost /bin/true
   3.5 TEL | CMD: ssh -F /dev/null -q -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 63319 telepresence@localhost /bin/true
   3.5 TEL | [18] exit 255 in 0.03 secs.
   3.8  16 | error: error upgrading connection: error dialing backend: EOF
   3.8 TEL | [16] exit 1
   3.8 TEL | [19] Running: ssh -F /dev/null -q -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 63319 telepresence@localhost /bin/true
   3.8 TEL | CMD: ssh -F /dev/null -q -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -p 63319 telepresence@localhost /bin/true
   3.8 TEL | [19] exit 255 in 0.03 secs.

Once the port forward has died, nothing else will work, yet we don't notice for some reason. Thus Tel hangs for a while, then yields a scary crash (RuntimeError: SSH isn't starting.).

This happens to be a silly intermittent failure; simply trying again works.

ark3 added the t:bug and crash instead of error labels Apr 12, 2018
ark3 (Contributor, Author) commented May 15, 2018

To clarify, we don't notice until we check, and we only check for dead processes once everything is up and running, in cleanup.py:wait_for_exit(...). Checking while things are coming up would allow us to avoid waiting for timeouts and fail much more quickly. Or even better, we could retry!
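
For illustration, a minimal sketch of what that early check might look like, assuming the runner keeps subprocess.Popen handles for its critical children. The names and signature here are hypothetical, not Telepresence's actual API:

    import time

    class StartupFailure(Exception):
        """Raised when a critical child process exits while startup is still in progress."""

    def wait_until_ready(probe, critical_processes, timeout=120, interval=0.25):
        """Poll `probe` until it returns True, but fail fast if any critical
        child process (kubectl port-forward, sshuttle, ...) has already exited.

        `probe` is a zero-argument callable, e.g. the ssh /bin/true check;
        `critical_processes` is a list of subprocess.Popen handles.
        """
        deadline = time.time() + timeout
        while time.time() < deadline:
            for proc in critical_processes:
                code = proc.poll()  # None while still running, exit code otherwise
                if code is not None:
                    # No point waiting out the full timeout; the caller can
                    # report this immediately or decide to retry.
                    raise StartupFailure(
                        "critical process %r exited with code %d during startup"
                        % (proc.args, code))
            if probe():
                return
            time.sleep(interval)
        raise StartupFailure("startup probe did not succeed within %d seconds" % timeout)

With a check like this, the trace above would presumably fail at around 3.8s when the port-forward exits, instead of timing out much later with RuntimeError: SSH isn't starting.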

ark3 changed the title from "Tel does not always notice when kubectl port-forward dies" to "Tel does not always notice when critical processes die during startup" May 22, 2018
ark3 (Contributor, Author) commented May 22, 2018

This is a problem for any critical process, not just port-forward. For example, if sshuttle crashes during startup, Telepresence sits through a long timeout instead of noticing immediately.

   6.4 TEL | [35] Launching: sshuttle-telepresence -v --dns --method auto -e 'ssh -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -F /dev/null' --to-ns 127.0.0.1:9053 -r telepresence@localhost:50158 52.5.182.176/32 10.0.127.0/24 52.23.126.223 10.0.125.0/24 10.3.240.0/20 10.0.126.0/24 10.0.12.0/24 52.44.53.64/32
   6.4 TEL | BEGIN SPAN vpn.py:300(connect_sshuttle,sshuttle-wait)
   6.4 TEL | [36] Capturing: python3 -c 'import socket; socket.gethostbyname('"'"'hellotelepresence0'"'"')'
   6.5  36 | Traceback (most recent call last):
   6.5  36 |   File "<string>", line 1, in <module>
   6.5  36 | socket.gaierror: [Errno 8] nodename nor servname provided, or not known
   6.5 TEL | [36] exit 1 in 0.07 secs.
   6.5  35 | Starting sshuttle proxy.
   6.6  35 | Traceback (most recent call last):
   6.6  35 |   File "/Users/distiller/project/virtualenv/bin/sshuttle-telepresence", line 4, in <module>
   6.6  35 |     import re
   6.6  35 |   File "/Users/distiller/project/virtualenv/lib/python3.6/re.py", line 142, in <module>
   6.6  35 |     class RegexFlag(enum.IntFlag):
   6.6  35 | AttributeError: module 'enum' has no attribute 'IntFlag'
   6.6  35 | fatal: ['sudo', '-p', '[local sudo] Password: ', 'PYTHONPATH=/Users/distiller/project/virtualenv/lib/python3.6/site-packages', '--', '/Users/distiller/project/virtualenv/bin/python3.6', '/Users/distiller/project/virtualenv/bin/sshuttle-telepresence', '-v', '--method', 'auto', '--firewall'] returned 1
   6.6 TEL | [35] exit 99

but Telepresence doesn't really notice, instead looping like this:

   6.6 TEL | [37] Capturing: python3 -c 'import socket; socket.gethostbyname('"'"'hellotelepresence1'"'"')'
   6.7  37 | Traceback (most recent call last):
   6.7  37 |   File "<string>", line 1, in <module>
   6.7  37 | socket.gaierror: [Errno 8] nodename nor servname provided, or not known
   6.7 TEL | [37] exit 1 in 0.07 secs.

...

  26.5 TEL | [112] Capturing: python3 -c 'import socket; socket.gethostbyname('"'"'hellotelepresence86'"'"')'
  26.6 112 | Traceback (most recent call last):
  26.6 112 |   File "<string>", line 1, in <module>
  26.6 112 | socket.gaierror: [Errno 8] nodename nor servname provided, or not known
  26.6 TEL | [112] exit 1 in 0.07 secs.

ark3 self-assigned this May 23, 2018
ark3 added this to To do in Tel Tracker via automation Jun 11, 2018
ark3 (Contributor, Author) commented Aug 1, 2018

One way to fix this would be to add a done callback to runner.launch(...) and use that flag in other parts of the code. What currently happens in wait_for_exit(...) generically for all launched processes could then happen in a more nuanced manner, per subprocess.
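
As a rough sketch of that idea (with made-up names; the real runner.launch(...) signature may differ), the launcher could spawn a watcher thread that fires a callback when the child exits, and startup code could consult the resulting flag:

    import subprocess
    import threading

    def launch(args, on_done=None):
        """Start a subprocess and call `on_done(process)` from a watcher thread
        as soon as it exits (a stand-in for adding the callback to runner.launch)."""
        process = subprocess.Popen(args)
        if on_done is not None:
            def watch():
                process.wait()
                on_done(process)
            threading.Thread(target=watch, daemon=True).start()
        return process

    # Hypothetical usage: set a flag when the port-forward dies so startup code
    # can react right away instead of discovering it later in wait_for_exit(...).
    port_forward_died = threading.Event()
    port_forward = launch(
        ["kubectl", "port-forward", "gather-3569791658-ks07g", "63319:8022"],
        on_done=lambda p: port_forward_died.set(),
    )

The SSH wait loop and the sshuttle hellotelepresence probe loop could then check that flag on each iteration and abort, or retry, immediately.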

ark3 (Contributor, Author) commented Dec 5, 2018

Fixed by #853; I botched the commit message.

ark3 closed this as completed Dec 5, 2018
Tel Tracker automation moved this from To do to Done Dec 5, 2018