libzmq assertion at exit under Python 3 #173

Closed
jgallagher opened this issue Feb 16, 2012 · 7 comments

@jgallagher
Contributor

The following code works fine (i.e., exits cleanly) under Python 2, but under Python 3.2 with pyzmq-2.1.11 it prints:

stopping
start returned
Assertion failed: get_load () == 0 (poller_base.cpp:32)
Abort trap: 6

The script:
from zmq.eventloop.ioloop import IOLoop
from zmq.eventloop.zmqstream import ZMQStream
import zmq

def stop_loop(stream, loop):
    print("stopping")
    stream.close()
    loop.stop()

loop = IOLoop()
ctx = zmq.Context()
socket = ctx.socket(zmq.REQ)
socket.linger = 0
socket.connect('tcp://127.0.0.1:12345') # nothing listening
stream = ZMQStream(socket, loop)
stop_func = lambda x, y: stop_loop(stream, loop)
stream.on_send(stop_func)  # stop the loop once the queued message has been sent
loop.add_callback(lambda: stream.send(b"abc"))
loop.start()
print("start returned")

It's possible I'm doing something wrong that just happens to work under Python 2, but I thought this might be a bug in pyzmq. Could someone with a little more experience take a look?

@minrk
Member

minrk commented Feb 16, 2012

Interesting. I can confirm the issue with Python 3.2.2 on OSX 10.7.3, and pyzmq master. Asserts in libzmq generally mean bugs in libzmq itself, but it is certainly strange that pyzmq could hit it differently in Python 2 and 3.

Adding an explicit ctx.term() at the end of the script seems to reliably prevent the assert, so maybe it is a thread cleanup issue at shutdown.
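
For reference, a sketch of that workaround applied to the tail of the script in the original report (only the final lines are shown; everything above them is unchanged):

loop.start()
print("start returned")
ctx.term()  # explicit termination lets libzmq finish its cleanup before the interpreter exits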

@jgallagher
Contributor Author

I tried to dig into this a little this evening without much success. The assertion is thrown because the poller still has one file descriptor that was never removed (i.e., add_fd was called one more time than rm_fd). On a successful run (with ctx.term()), the kqueue_t destructor joins what I believe is the context's IO thread, and before that join returns the IO thread removes the final fd. On a failed run, the join returns immediately, as though the IO thread had died somehow without ever removing one of its fds.

Should I post this as an issue over on libzmq instead?

@jgallagher
Contributor Author

I think I tracked the problem down to this change in Python 3.2: http://hg.python.org/cpython/rev/c892b0321d23

I believe this is the approximate sequence of events:

  1. A message is created at message.pyx:253 with the "free function" assigned to free_python_msg (I think on the IO thread)
  2. Python interpreter goes into shutdown mode.
  3. libzmq calls the free function associated with the message (i.e., free_python_msg).
  4. Because of the Python change linked above, when free_python_msg tries to acquire the GIL, Python forcefully exits the thread.
  5. The remaining cleanup libzmq needed to do on that thread never happens, eventually leading to the assertion.

I have no idea how to get started fixing this, but it definitely appears to be a Python/pyzmq issue and not a libzmq one. I suspect this behavior (calling PyThread_exit_thread() when a thread tries to acquire the GIL during shutdown) could cause all kinds of unclean shutdown issues while libzmq threads are running.

minrk added a commit to minrk/pyzmq that referenced this issue Feb 18, 2012
which used copy=False by default.

It doesn't make sense to use different defaults than Socket.send,
and non-copying sends can cause problems on shutdown without
manually terminating the context in Python 3.

ref: zeromq#173
@minrk
Member

minrk commented Feb 18, 2012

Thanks for tracking that down! Your analysis is correct: the free function is called from the IO thread, and that is what causes the problem. This is unavoidable[1] for non-copying sends, but fortunately it only affects non-copying sends, and only those that will be discarded at termination because of LINGER combined with a failure to clean up the context in your own code. Understanding that, here's a simpler repro script:

import zmq
ctx = zmq.Context.instance()
socket = ctx.socket(zmq.REQ)
socket.linger = 0
socket.connect('tcp://127.0.0.1:12345') # nothing listening
socket.send(b'abc', copy=False)

# explicit close&term prior to Python exit eliminates the issue
# ctx.destroy()

print("begin cleanup")
# if destroy was not called, this will crash in Python3

ZMQStream does non-copying sends by default for some reason, which is a bad choice, so I am going to revert that in PR #172. For now, the answer is going to be: you must use explicit cleanup if you are doing non-copying sends and might discard them at shutdown due to LINGER. In your sample script, that amounts to a simple ctx.term() at the end to ensure clean shutdown.
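
To make that concrete, here is a minimal sketch of both mitigations against the simpler repro script above (nothing is listening on the port; the address is arbitrary):

import zmq

ctx = zmq.Context.instance()
socket = ctx.socket(zmq.REQ)
socket.linger = 0
socket.connect('tcp://127.0.0.1:12345') # nothing listening

# Either avoid the non-copying send entirely (a copying send never hands a
# Python free-function to the io_thread)...
socket.send(b'abc', copy=True)

# ...or, if you must send with copy=False, clean up explicitly so the
# lingering message is discarded while the interpreter is still running:
socket.close()
ctx.term()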

I'm not sure this is a huge deal, because it only affects already exiting processes, but there have been other places where grabbing the GIL from the io_thread has caused weirdness.

[1] The only solution I have in my head is for the free_fn to actually be called from yet another thread that pyzmq creates, and the io_thread's callback somehow GIL-lessly puts the necessary info in a queue and notifies (or spawns) the thread, which grabs the GIL and deletes the object. That should avoid this problem, because it doesn't really matter that the free_fn is aborted at shutdown, it only matters that the io_thread is killed, which will not happen if it never grabs the GIL. But that means I would have to look up how to spawn and communicate with a C-thread in a platform-independent way, which isn't going to happen soon.
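
A rough sketch of the shape of that idea, in Python only for illustration (in the real thing the notification from the io_thread would have to happen without the GIL, e.g. a C-level pipe or socket write, and none of these names are pyzmq API):

import threading
import queue

class Reaper(threading.Thread):
    """Hypothetical cleanup thread: references are dropped here, with this
    thread holding the GIL, so the zmq io_thread never has to acquire it."""
    def __init__(self):
        super().__init__()
        self.daemon = True
        self._pending = queue.Queue()
        self.start()

    def notify(self, obj):
        # In the real design this call would need to be GIL-free, since the
        # io_thread cannot safely touch Python during interpreter shutdown.
        self._pending.put(obj)

    def run(self):
        while True:
            obj = self._pending.get()
            del obj  # drop the reaper's reference on this thread, GIL held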

@jgallagher
Contributor Author

Thanks for the explanation. I'm surprised that the stream.close() in the original example doesn't take care of the problem. If stream.close() executes, why is there still a message hanging around to be cleaned up at ctx.term()?

@minrk
Member

minrk commented Feb 18, 2012

That's because close is an asynchronous event. It starts the LINGER countdown, but does not ensure that the messages are cleaned up, because that happens in the io_threads. The message will be discarded very soon after the socket is closed, but not before close() returns. I find that a time.sleep(0.01) after close is enough to prevent the crash, but in the sample code, the application is exiting as fast as it can after socket close, which is too fast for the reaper.
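
In other words, for the original script something like this also avoids the crash (a sketch; 0.01s is simply the delay that happened to be enough here):

import time

def stop_loop(stream, loop):
    print("stopping")
    stream.close()
    time.sleep(0.01)  # give the io_thread a moment to discard the lingering message
    loop.stop()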

@minrk
Member

minrk commented Sep 16, 2013

closed by #408

@minrk minrk closed this as completed Sep 16, 2013