Telepresence session dies due to idle connection #573

Closed
plombardi89 opened this Issue Apr 9, 2018 · 12 comments

@plombardi89
Contributor

plombardi89 commented Apr 9, 2018

Telepresence sometimes drops its network connection, and we don't yet know why. We need to do some discovery to get a rough idea of what's causing this behavior and then figure out what to do from there.

The log file gist is available here: https://gist.github.com/plombardi89/05e2c1909745a314d48f1321b0b33df6

@ark3

Contributor

ark3 commented Apr 9, 2018

Quick note: it looks like the port-forward exited.

2169.7 TL | A subprocess (['kubectl', '--context', 'telepresence-admin-context', '--namespace', 'tele-v7', 'port-forward', 'mydeployment-6d8bd68f99-ps6x8', '56007:8022']) died with code 0, killed all processes...

@plombardi89

Contributor

plombardi89 commented Apr 9, 2018

I've noticed port-forward failing before when talking to Ambassador's admin UI. It could be the network here...

@plombardi89

Contributor

plombardi89 commented Apr 10, 2018

  1. Set up a Telepresence connection and see what happens if I do nothing.
  2. Set up a Telepresence connection and see what happens if I do DNS queries that talk to the cluster.
  3. Set up a Telepresence connection and see what happens if I do not do DNS queries that talk to the cluster.
  4. Set up a Telepresence connection that pumps traffic through the connection.

See what happens over a period of time and try to reproduce.

@plombardi89

Contributor

plombardi89 commented Apr 17, 2018

  1. It terminates right around the 2-minute mark.
  2. Skipped in favor of test 4, since that has to do a DNS query through the cluster anyway.
  3. Skipped because I'm not quite sure what this is testing...
  4. I telepresenced a shell with telepresence --run-shell, then ran watch curl httpbin.org/uuid and did not see the connection drop.

Seems like it's a lack of traffic over the API server port-forward connection.

@ark3

Contributor

ark3 commented Apr 17, 2018

I thought we had some sort of keep-alive set on one of the ssh sessions. We should double-check that. Setting up a looping curl of the API server (or similar) from within Telepresence is also worth trying.
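
For reference, OpenSSH's built-in application-level keep-alive is controlled by the ServerAliveInterval and ServerAliveCountMax options. A minimal sketch of wiring those into an ssh invocation, assuming a wrapper like the one Telepresence uses for its subprocesses (the function name and parameters here are hypothetical, not the actual code):

    import subprocess

    def ssh_with_keepalive(host, port, extra_args, interval=30, count_max=3):
        """Run ssh with OpenSSH's built-in keep-alive probes enabled."""
        cmd = [
            "ssh",
            # Send an application-level probe every `interval` seconds...
            "-o", "ServerAliveInterval=%d" % interval,
            # ...and give up after `count_max` unanswered probes.
            "-o", "ServerAliveCountMax=%d" % count_max,
            "-p", str(port),
            host,
        ] + list(extra_args)
        return subprocess.Popen(cmd)

If these options are already set and the session still dies, the idle timeout is presumably happening on a connection other than the ssh one (e.g. the kubectl port-forward itself).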

@ark3

Contributor

ark3 commented Apr 18, 2018

Can you try your test 1 again, please? I'm curious about these three variants:

  1. telepresence (the original)
  2. telepresence --expose 5000
  3. telepresence --method inject-tcp

I predict that variant 1 will drop as before, but variants 2 and 3 will survive.

@plombardi89

Contributor

plombardi89 commented Apr 25, 2018

As requested:

plombardi@philbox ~> telepresence --method vpn-tcp --run-shell
Starting proxy with method 'vpn-tcp', which has the following limitations: All processes are affected, only one telepresence can run per machine, and you can't use other VPNs. You may need to add cloud hosts with --also-proxy. For a full list of method limitations see https://telepresence.io/reference/methods.html
Volumes are rooted at $TELEPRESENCE_ROOT. See https://telepresence.io/howto/volumes.html for details.

No traffic is being forwarded from the remote Deployment to your local machine. You can use the --expose option to specify which ports you want to forward.

Guessing that Services IP range is 172.31.192.0/20. Services started after this point will be inaccessible if are outside this range; restart telepresence if you can't access a new Service.

@telepresence-admin-context|bash-4.4$ Proxy to Kubernetes exited. This is typically due to a lost connection.
exit
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/share/telepresence/libexec/lib64/python3.6/site-packages/telepresence/remote.py", line 293, in cleanup
    sudo_prefix + ["fusermount", "-z", "-u", mount_dir]
  File "/usr/share/telepresence/libexec/lib64/python3.6/site-packages/telepresence/runner.py", line 239, in check_call
    track, "Running", "ran", out_cb, err_cb, args, **kwargs
  File "/usr/share/telepresence/libexec/lib64/python3.6/site-packages/telepresence/runner.py", line 230, in run_command
    raise CalledProcessError(retcode, args)
subprocess.CalledProcessError: Command '['fusermount', '-z', '-u', '/tmp/tmpoeev80hp']' returned non-zero exit status 1.
plombardi@philbox ~> telepresence --expose 5000
Starting proxy with method 'vpn-tcp', which has the following limitations: All processes are affected, only one telepresence can run per machine, and you can't use other VPNs. You may need to add cloud hosts with --also-proxy. For a full list of method limitations see https://telepresence.io/reference/methods.html
Volumes are rooted at $TELEPRESENCE_ROOT. See https://telepresence.io/howto/volumes.html for details.

Forwarding remote port 5000 to local port 5000.

@telepresence-admin-context|bash-4.4$ Proxy to Kubernetes exited. This is typically due to a lost connection.
exit
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/share/telepresence/libexec/lib64/python3.6/site-packages/telepresence/remote.py", line 293, in cleanup
    sudo_prefix + ["fusermount", "-z", "-u", mount_dir]
  File "/usr/share/telepresence/libexec/lib64/python3.6/site-packages/telepresence/runner.py", line 239, in check_call
    track, "Running", "ran", out_cb, err_cb, args, **kwargs
  File "/usr/share/telepresence/libexec/lib64/python3.6/site-packages/telepresence/runner.py", line 230, in run_command
    raise CalledProcessError(retcode, args)
subprocess.CalledProcessError: Command '['fusermount', '-z', '-u', '/tmp/tmp8_4ypfok']' returned non-zero exit status 1.
plombardi@philbox ~> telepresence -m inject-tcp
Starting proxy with method 'inject-tcp', which has the following limitations: Go programs, static binaries, suid programs, and custom DNS implementations are not supported. For a full list of method limitations see https://telepresence.io/reference/methods.html
Volumes are rooted at $TELEPRESENCE_ROOT. See https://telepresence.io/howto/volumes.html for details.

No traffic is being forwarded from the remote Deployment to your local machine. You can use the --expose option to specify which ports you want to forward.

@telepresence-admin-context|bash-4.4$ Proxy to Kubernetes exited. This is typically due to a lost connection.
exit
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/share/telepresence/libexec/lib64/python3.6/site-packages/telepresence/remote.py", line 293, in cleanup
    sudo_prefix + ["fusermount", "-z", "-u", mount_dir]
  File "/usr/share/telepresence/libexec/lib64/python3.6/site-packages/telepresence/runner.py", line 239, in check_call
    track, "Running", "ran", out_cb, err_cb, args, **kwargs
  File "/usr/share/telepresence/libexec/lib64/python3.6/site-packages/telepresence/runner.py", line 230, in run_command
    raise CalledProcessError(retcode, args)
subprocess.CalledProcessError: Command '['fusermount', '-z', '-u', '/tmp/tmp5yssmpl2']' returned non-zero exit status 1.
@ark3

Contributor

ark3 commented Apr 25, 2018

My prediction was incorrect. Looks like the ssh-based keep-alive stuff isn't going to do the job in some cases.

@ark3

Contributor

ark3 commented Apr 27, 2018

Looks like we'll have to implement some sort of background traffic to keep the connection from going idle. I don't understand why ssh's keep-alive stuff is insufficient. In any case, this will have to be a method-specific implementation.

  • vpn-tcp: access the API server from within the Telepresence process (a sketch of this variant follows below)
  • inject-tcp: spin up a torsocksed subprocess to access the API server
  • container: add a background process to the network container that accesses the API server

Not sure how we're going to test this...
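
To make the vpn-tcp bullet concrete, here is a minimal sketch: a daemon thread inside the Telepresence process that touches the API server on an interval, so the kubectl connection never goes idle. It uses `kubectl get --raw /version` as a cheap read-only request; the function name and interval are illustrative placeholders, not the actual implementation:

    import subprocess
    import threading
    import time

    def keep_cluster_connection_alive(context, interval=30):
        """Touch the API server periodically so the tunnel never goes idle."""
        def loop():
            while True:
                # Any cheap read-only round trip to the API server will do;
                # /version is tiny and needs no special permissions.
                subprocess.run(
                    ["kubectl", "--context", context, "get", "--raw", "/version"],
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                )
                time.sleep(interval)

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

The inject-tcp and container variants would be the same loop run under torsocks or inside the network container, respectively.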

@ark3 ark3 assigned ark3 and unassigned plombardi89 Apr 27, 2018

@ark3 ark3 added enhancement and removed exploration labels Apr 27, 2018

@ark3

Contributor

ark3 commented Apr 27, 2018

See also #355 and others.

@ark3 ark3 changed the title from Telepresence VPN drops connection randomly to Telepresence session dies due to idle connection May 7, 2018

@ark3

Contributor

ark3 commented May 7, 2018

I've modified the issue title to reflect the particulars of what I'm trying to fix here. Given the goal of avoiding various idle connection timeouts, my comment above suggests method-specific ways to have the user's machine periodically access the cluster.

Another approach would be to have the proxy pod periodically access a trivial service running within the Telepresence command itself. Besides creating the desired periodic traffic, this approach would allow the proxy to notice when the client has gone away, facilitating a portion of the work for #260. This would cost a randomly-chosen local port, which doesn't seem like a big deal.
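
A minimal sketch of that trivial service, assuming a plain TCP listener bound to a random local port; how the port gets advertised to the proxy pod and the pod-side polling loop are omitted, and all names here are hypothetical:

    import socket
    import threading

    def start_liveness_listener():
        """Open a throwaway TCP service on a random local port for the pod to poll."""
        server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        server.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
        server.listen(5)
        port = server.getsockname()[1]

        def serve():
            while True:
                conn, _addr = server.accept()
                conn.sendall(b"ok\n")  # any traffic at all keeps the link warm
                conn.close()

        threading.Thread(target=serve, daemon=True).start()
        return port  # this port would be advertised to the proxy pod

Because the pod initiates the polling, it can also detect a client that has stopped answering, which is the #260 tie-in.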

@ark3

Contributor

ark3 commented May 14, 2018

Stuff like the above will tackle the issue of port-forward getting dropped due to an idle connection. However, because of #598, we must also make sure the kubectl logs connection does not go idle. That is more of an issue with the inject-tcp method, as the vpn-tcp method tends to spew DNS lookup noise into the log.
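
One way to keep that log stream busy, sketched under the assumption that we can run a heartbeat loop in the remote proxy pod (the interval and message format are placeholders): emit a periodic line so kubectl logs -f always carries traffic, even when inject-tcp generates no DNS noise of its own.

    import sys
    import time

    def log_heartbeat(interval=30):
        """Emit a log line on an interval so the logs stream carries traffic."""
        while True:
            print("keepalive %f" % time.time(), file=sys.stderr, flush=True)
            time.sleep(interval)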

@ark3 ark3 closed this in #640 May 15, 2018

ark3 added a commit that referenced this issue May 15, 2018

Merge pull request #640 from datawire/avoid-idle
Avoid session dying due to idle connection
Fixes #573