1.1.0 causes multiprocessing workers to hang on exit #255

reticulatedpines · 2023-04-11T03:58:56Z

Possible regression from 1.0.0, it seems that 1.1.0 doesn't let multiprocessing workers exit (even though they join). This means the main script also refuses to exit. Only tested on Linux (Debian Bullseye, and Fedora 37, identical behaviour).

I don't have much experience with vncdotool, multiprocessing or twisted, so I could easily be missing something obvious here. I do have a simple repro script though so it's hopefully easy to test. I tested in a venv with only these listed by freeze:

attrs==22.2.0
Automat==22.10.0
constantly==15.1.0
hyperlink==21.0.0
idna==3.4
incremental==22.10.0
Pillow==9.5.0
pycryptodomex==3.17
six==1.16.0
Twisted==22.10.0
typing-extensions==4.5.0
vncdotool==1.1.0
zope.interface==6.0

I didn't control for system packages, please let me know if that's relevant.

vncdotool version
1.1.0

VNC server and version
None required.

Steps to reproduce
Make a python3 venv, install vncdotool 1.1.0. Run the following repro script, it will fail to exit. Or fail to repro, let's find out :)
Downgrade to 1.0.0 and it should exit.

#!/usr/bin/env python3

import sys
import multiprocessing

import twisted
import vncdotool
from vncdotool import api

def test_worker():
    return

def main():
    print("vncdotool version: %s" % vncdotool.__version__)
    print("twisted version: %s" % twisted.__version__)

    with vncdotool.api.connect(":1234") as v:
        # no server required, ConnectionRefused still triggers the hang
        pass

    worker = multiprocessing.Process(target=test_worker)
    worker.start()
    worker.join()
    print("please can we exit now?")
    exit(0)
    # this hangs for me, with vncdotool==1.1.0, twisted==22.10.0
    # hangs with 1.1.0, 21.2.0
    # exits okay with 1.0.0 + 22.10.0
    # exits okay with 1.0.0 + 21.2.0

if __name__ == "__main__":
    # setting spawn start method here doesn't prevent the problem
    multiprocessing.set_start_method("spawn")

    main()

Expected result
Above repro script should exit.

Which erroneous result did you get instead
Script hangs just after final print statement.

Additional information
Possibly related to #188
But that lists using spawn start method as a workaround, and that doesn't help me.

The text was updated successfully, but these errors were encountered:

pmhahn · 2023-04-11T07:38:43Z

Using vncdotool.api creates a background-daemon-thread, which runs the Twisted Reactor: If not explicitly terminated by calling api.shutdown() it will keep running. While it is marked as a daemon thread, which should not prevent Python from terminating, but mixing threading with multi-processing leads to all kinds of obscure issues.

gdb -p 117319 --batch -ex 'thread apply all py-bt'

Thread 3 (Thread 0x7f3fa3fff700 (LWP 117322) "python3"):
Traceback (most recent call first):
  File "/usr/lib/python3.9/threading.py", line 312, in wait
    waiter.acquire()
  File "/usr/lib/python3.9/queue.py", line 171, in get
    self.not_empty.wait()
  File "vncdotool/venv/lib/python3.9/site-packages/twisted/_threads/_threadworker.py", line 46, in work
    for task in iter(queue.get, _stop):
  File "/usr/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.9/threading.py", line 912, in _bootstrap
    self._bootstrap_inner()

Thread 2 (Thread 0x7f3fa8b2d700 (LWP 117321) "python3"):
Traceback (most recent call first):
  File "vncdotool/venv/lib/python3.9/site-packages/twisted/internet/epollreactor.py", line 227, in doPoll
    l = self._poller.poll(timeout, len(self._selectables))
  File "vncdotool/venv/lib/python3.9/site-packages/twisted/internet/base.py", line 1331, in mainLoop
    reactorBaseSelf.doIteration(t)
  File "vncdotool/venv/lib/python3.9/site-packages/twisted/internet/base.py", line 1318, in run
    self.mainLoop()
  File "/usr/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.9/threading.py", line 912, in _bootstrap
    self._bootstrap_inner()

Thread 1 (Thread 0x7f3faab4c740 (LWP 117319) "python3"):
Traceback (most recent call first):
  File "/usr/lib/python3.9/threading.py", line 1428, in _shutdown
    lock.acquire()

Looking at the threads and their .name attribute:

is the main thread, which is calling exit()
is Twisted, the main reactor by vncdotool.api.connect()
is PoolThread-twisted.internet.reactor-0 from twisted/python/threadpool.py; According to the code it inherits daemon from its parent thread 2.

PS: As stated above calling api.shutdown() before sys.exit(0) mitigates the issue, but we should find a real fix.

reticulatedpines · 2023-04-11T15:31:29Z

Ah, great, thanks for the info and quick workaround! I didn't know where to even start debugging twisted + multiprocessing. I can confirm that shutdown() lets it exit cleanly for me here, too. I thought it was weird that join() still succeeded, makes sense now.

It's curious that 1.0.0 doesn't exhibit this behaviour, I've run 100s of tests and it's been consistent, so I'd guess not a race or similar. It's beyond my ability to diagnose, so I'll leave that up to you.

Possibly docstring for api.connect() could mention shutdown() as being wise to call before attempting to exit? I don't know if that's an appropriate place - I would have seen it, but who knows if I'm typical.

Maybe the api object could be made into a context manager too, then there's an obvious place to automatically call shutdown(), though you'd have to nest two contexts just to get a client which is kind of ugly. Since nobody is using api as a context manager currently it shouldn't break existing code.

    with vncdotool.api as a:
        with a.connect(":1234") as client:
            client.stuff()
        # client teardown happens here
    # api teardown here

I don't love it, but I guess it's not terrible looking.

Less likely to be appropriate, shutdown() could implicitly occur on leaving the context manager block for the connect()? Perhaps that's too surprising a side effect, since it's the client that leaves scope, not the api. More likely to break existing code.

Possibly, use atexit: https://docs.python.org/3/library/atexit.html
I have no experience using this, it reads like conceptually the right thing to do. Register the handler to call shutdown() whenever the main reactor is started. It's stdlib which is nice.

reticulatedpines · 2023-04-11T17:19:26Z

Sadly for me, vncdotool.api.shutdown() doesn't help me in the larger code I was working on. Perhaps I found two different hang bugs. I'll try and work up another clean repro case.

reticulatedpines · 2023-04-14T12:48:58Z

Found my mistake: in the more complex code, I had multiple processes each handling a VNC connection. I'd missed adding api.shutdown() to one exit route. Your workaround is still effective.

pmhahn added the bug label Apr 24, 2023

pevogam mentioned this issue Aug 23, 2024

Programs using api.connect hang unless api.shutdown is explicitly called before exit #288

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1.1.0 causes multiprocessing workers to hang on exit #255

1.1.0 causes multiprocessing workers to hang on exit #255

reticulatedpines commented Apr 11, 2023

pmhahn commented Apr 11, 2023

reticulatedpines commented Apr 11, 2023

reticulatedpines commented Apr 11, 2023

reticulatedpines commented Apr 14, 2023

1.1.0 causes multiprocessing workers to hang on exit #255

1.1.0 causes multiprocessing workers to hang on exit #255

Comments

reticulatedpines commented Apr 11, 2023

pmhahn commented Apr 11, 2023

reticulatedpines commented Apr 11, 2023

reticulatedpines commented Apr 11, 2023

reticulatedpines commented Apr 14, 2023