Telepresence exits when attaching to process in Goland #1003

Closed
the-gigi opened this issue Apr 21, 2019 · 3 comments

@the-gigi
Contributor

commented Apr 21, 2019

I run a local minikube cluster and want to debug a Go microservice locally. I managed to get everything working: telepresence created the proxy, and I was able to hit the cluster and see my modified service getting called. So far so good.

$ telepresence --swap-deployment link-manager --run go run main.go
T: How Telepresence uses sudo: https://www.telepresence.io/reference/install#dependencies
T: Invoking sudo. Please enter your sudo password.
Password:
T: Starting proxy with method 'vpn-tcp', which has the following limitations: All processes are affected, only one telepresence can run per machine, and you can't use other VPNs. You may need to add cloud hosts and headless services with --also-proxy.
T: For a full list of method limitations see https://telepresence.io/reference/methods.html
T: Volumes are rooted at $TELEPRESENCE_ROOT. See https://telepresence.io/howto/volumes.html for details.
T: Starting network proxy to cluster by swapping out Deployment link-manager with a proxy
T: Forwarding remote port 8080 to local port 8080.

T: Guessing that Services IP range is 10.96.0.0/12. Services started after this point will be inaccessible if are outside this range; restart telepresence if you can't access a new Service.
T: Setup complete. Launching your command.
2019/04/20 01:17:06 DB host: 10.100.193.162 DB port: 5432
2019/04/20 01:17:06 Listening on port 8080...

But then I tried to attach to the running process with Goland, and that seems to kill telepresence.

T: Your process has exited.
T: Exit cleanup in progress
T: Swapping Deployment link-manager back to its original state

Here is the relevant output of ps aux before trying to attach to the process:

$ ps aux | rg -v rg | rg "go run main.go"
the-gigi      91302   0.0  0.1  4402496  17992 s004  S+    2:07AM   0:01.52 go run main.go
the-gigi      90795   0.0  0.2  4391716  34836 s004  S+    2:04AM   0:03.07 /usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/Resources/Python.app/Contents/MacOS/Python /usr/local/bin/telepresence --swap-deployment link-manager --expose 8080 --run go run main.go

After trying to attach to the local service (process 91302), the telepresence process (90795) is gone, but the local service is still running.

$ ps aux | rg -v rg | rg "go run main.go"
the-gigi      91302   0.0  0.1  4402500  16164 s004  SX    2:07AM   0:01.54 go run main.go

How can I make Goland attach to my local service launched by telepresence without killing telepresence?

@the-gigi changed the title from "Telepresence exists when attaching to process in Goland" to "Telepresence exits when attaching to process in Goland" Apr 21, 2019

@LukeShu

Contributor

commented Apr 22, 2019

wait(3p) returns not just when the process exits, but also when the process is stopped/paused (for example by a debugger). It seems we aren't checking WIFEXITED/WIFSIGNALED, and we need to.
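
To illustrate the point about WIFEXITED/WIFSIGNALED, here is a minimal Python sketch (not telepresence code) that decodes a raw wait status. It assumes a POSIX system and uses `os.waitpid` with `os.WUNTRACED`, which is one way to make a stop visible to the parent; a debugger attaching via ptrace may surface differently. The names `describe_status` and `demo` are made up for this example.

```python
import os
import signal
import time


def describe_status(status):
    """Classify a raw wait status as returned by os.waitpid()."""
    if os.WIFEXITED(status):
        return ("exited", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        return ("signaled", os.WTERMSIG(status))
    if os.WIFSTOPPED(status):
        # The child is only paused; it has NOT terminated.
        return ("stopped", os.WSTOPSIG(status))
    return ("unknown", status)


def demo():
    """Stop a child, observe the stop, then kill it and observe the death."""
    pid = os.fork()
    if pid == 0:
        # Child: idle until the parent kills it.
        while True:
            time.sleep(1)

    # Parent: pause the child, then wait for the stop to be reported.
    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)
    stopped = describe_status(status)

    # SIGKILL is delivered even to a stopped process.
    os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)
    killed = describe_status(status)
    return stopped, killed


if __name__ == "__main__":
    print(demo())
```

Only the "exited" and "signaled" cases mean the child is gone. A caller that treats every return from wait as an exit will mistake a paused child for a dead one.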

@ark3 added the bug label Apr 22, 2019

@ark3 added this to To do in Tel Tracker via automation Apr 22, 2019

@the-gigi

Contributor Author

commented Apr 22, 2019

I investigated this a little bit. Here are my findings and some sample code. I'll open a PR later.

  1. telepresence monitors the status of the process using Popen.poll(). poll() returns None while the process is running and the exit code when the process exits, but it also returns 0 when a debugger (at least the Goland/delve debugger) attaches to the process.

  2. The 0 result from poll() makes telepresence think the process under the debugger exited and it shuts down everything.

  3. I considered using psutil to check the real status of the process, but @ark3 said that it's best to avoid depending on psutil with its C code.

  4. Instead I used ps -o stat= -p {pid} to find the status.

  5. When the debugger attaches to a process, the process does not terminate cleanly: it remains a zombie even after it exits. I'm not sure why. The important thing is that this state can be detected and treated by telepresence just like a normal exit.

Here is a little Go program (z.go) I used to test all that. It runs for 30 seconds and every 5 seconds prints its pid and ZZZ... to the screen. That gives enough time to attach the debugger.

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	for i := 0; i < 6; i++ {
		fmt.Println(os.Getpid(), "ZZZ...")
		time.Sleep(time.Second * 5)
	}
}

The following Python program launches the z Go program and then waits for it to finish OR become a zombie. The main loop is very similar to the loop in the Runner.wait_for_exit() method, which calls poll() repeatedly until the return code is not None.

import os
import subprocess
import time


def wait_for_pid(pid):
    """Wait for the process to disappear or to become a zombie.

    When the process is really gone, sending signal 0 raises OSError.
    When the process is still around but is a zombie, ps reports a state
    starting with 'Z'.

    In both cases the function returns; otherwise it keeps polling.
    """
    while True:
        try:
            os.kill(pid, 0)  # signal 0 only checks that the pid exists
            status = subprocess.check_output(
                ['ps', '-o', 'stat=', '-p', str(pid)]).strip()
            # ps may append flags such as '+', so match on the prefix
            if status.startswith(b'Z'):
                return
        except (OSError, subprocess.CalledProcessError):
            # the pid vanished, possibly between os.kill() and ps
            return
        time.sleep(1)


def main():
    p = subprocess.Popen('./z')
    code = None
    while code is None:
        code = p.poll()
        time.sleep(0.1)
    if code == 0:
        wait_for_pid(p.pid)

    print('Process exited with code', code)


if __name__ == '__main__':
    main()
@the-gigi

Contributor Author

commented Apr 23, 2019

@ark3 noticed that the ps-based solution doesn't work on busybox (it could possibly work with some changes) and suggested using a separate thread that does a blocking p.wait() and then sets the code and the quitting flag. I like this suggestion, and it actually simplifies a lot because the main loop now waits only for the quitting flag. The waiting code is simpler too (no need for those nasty os.kill(pid, 0) calls). Here is my sample code modified to use this approach:

import subprocess
import time
from threading import Thread

quitting = False


def main():
    def wait(p):
        global quitting
        nonlocal code
        code = p.wait()
        quitting = True

    code = None
    p = subprocess.Popen('./z')
    # Thread.start() returns None, so there is no handle worth keeping here
    Thread(target=wait, args=(p,)).start()
    while not quitting:
        time.sleep(0.5)

    print('Process exited with code', code)


if __name__ == '__main__':
    main()
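
For comparison, here is a variant of the same waiter-thread idea that swaps the global flag for a `threading.Event`. This is only a sketch under my own assumptions, not what the PR does: `run_and_wait` and `poll_interval` are hypothetical names, and the child command in the usage line is a stand-in for `./z`.

```python
import subprocess
import sys
from threading import Event, Thread


def run_and_wait(argv, poll_interval=0.5):
    """Run argv and return its exit code once a waiter thread reports it."""
    done = Event()
    result = {}

    def waiter(proc):
        # Blocking wait; returns only when the child has really terminated.
        result["code"] = proc.wait()
        done.set()

    proc = subprocess.Popen(argv)
    Thread(target=waiter, args=(proc,), daemon=True).start()

    # Event.wait(timeout) returns False on timeout and True once set,
    # so the main loop can do periodic work between checks.
    while not done.wait(poll_interval):
        pass
    return result["code"]


if __name__ == "__main__":
    code = run_and_wait([sys.executable, "-c", "import sys; sys.exit(3)"])
    print("Process exited with code", code)
```

The Event avoids the global/nonlocal bookkeeping and lets the loop wake up as soon as the child exits instead of finishing a full sleep first.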

Updated PR coming soon

@ark3 ark3 closed this in #1005 May 31, 2019

Tel Tracker automation moved this from To do to Done May 31, 2019
