libzmq assertion at exit under Python 3 #173

Closed
jgallagher opened this issue Feb 16, 2012 · 7 comments

@jgallagher
Contributor

The following code works fine (i.e., exits cleanly) under Python 2, but under Python 3.2 with pyzmq-2.1.11 it prints:

stopping
start returned
Assertion failed: get_load () == 0 (poller_base.cpp:32)
Abort trap: 6

The script:
from zmq.eventloop.ioloop import IOLoop
from zmq.eventloop.zmqstream import ZMQStream
import zmq

def stop_loop(stream, loop):
    print("stopping")
    stream.close()
    loop.stop()

loop = IOLoop()
ctx = zmq.Context()
socket = ctx.socket(zmq.REQ)
socket.linger = 0
socket.connect('tcp://127.0.0.1:12345') # nothing listening
stream = ZMQStream(socket, loop)
stop_func = lambda x, y: stop_loop(stream, loop)
stream.on_send(stop_func)  # stop the loop once the queued message has been sent
loop.add_callback(lambda: stream.send(b"abc"))
loop.start()
print("start returned")

It's possible I'm doing something wrong that just happens to work under Python 2, but I thought this might be a bug in pyzmq. Could someone with a little more experience take a look?

@minrk
Member

minrk commented Feb 16, 2012

Interesting. I can confirm the issue with Python 3.2.2 on OSX 10.7.3, and pyzmq master. Asserts in libzmq generally mean bugs in libzmq itself, but it is certainly strange that pyzmq could hit it differently in Python 2 and 3.

Adding an explicit ctx.term() at the end of the script seems to reliably prevent the assert, so maybe it is a thread cleanup issue at shutdown.
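
For reference, a sketch of that workaround applied to the tail of the script in the original report (only the final lines are shown; everything above them is unchanged):

loop.start()
print("start returned")
ctx.term()  # explicit termination lets libzmq finish its cleanup before the interpreter exits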

@jgallagher
Contributor Author

I tried to dig into this a little this evening without much success. The assertion is thrown because the poller still has one file descriptor that was never removed (i.e., add_fd was called one more time than rm_fd). On a successful run (with ctx.term()), the kqueue_t destructor joins what I believe is the context's IO thread, and before that join returns the IO thread removes the final fd. On a failed run, the join returns immediately, as though the IO thread had died somehow without ever removing one of its fds.

Should I post this as an issue over on libzmq instead?

@jgallagher
Contributor Author

I think I tracked the problem down to this change in Python 3.2: http://hg.python.org/cpython/rev/c892b0321d23

I believe this is the approximate sequence of events:

  1. A message is created at message.pyx:253 with the "free function" assigned to free_python_msg (I think on the IO thread)
  2. Python interpreter goes into shutdown mode.
  3. libzmq calls the free function associated with the message (i.e., free_python_msg).
  4. Because of the Python change linked above, when free_python_msg tries to acquire the GIL, Python forcefully exits the thread.
  5. The remaining cleanup libzmq needed to do on that thread never happens, eventually leading to the assertion.

I have no idea how to get started fixing this, but it definitely appears to be a Python/pyzmq issue and not a libzmq one. I suspect this behavior (calling PyThread_exit_thread() when a thread tries to acquire the GIL during shutdown) could cause all kinds of unclean shutdown issues while libzmq threads are running.

minrk added a commit to minrk/pyzmq that referenced this issue Feb 18, 2012
which used copy=False by default.

It doesn't make sense to use different defaults than Socket.send,
and non-copying sends can cause problems on shutdown without
manually terminating the context in Python 3.

ref: zeromq#173
@minrk
Member

minrk commented Feb 18, 2012

Thanks for tracking that down! Your analysis is correct: the free function is called from the IO thread, and that is what causes the problem. This is unavoidable[1] for non-copying sends, but fortunately it only affects non-copying sends, and only those that will be discarded at termination because of LINGER combined with a failure to clean up the context in your own code. Understanding that, here's a simpler repro script:

import zmq
ctx = zmq.Context.instance()
socket = ctx.socket(zmq.REQ)
socket.linger = 0
socket.connect('tcp://127.0.0.1:12345') # nothing listening
socket.send(b'abc', copy=False)

# explicit close&term prior to Python exit eliminates the issue
# ctx.destroy()

print("begin cleanup")
# if destroy was not called, this will crash in Python3

ZMQStream does non-copying sends by default for some reason, which is a bad choice, so I am going to revert that in PR #172. For now, the answer is going to be: you must use explicit cleanup if you are doing non-copying sends and might discard them at shutdown due to LINGER. In your sample script, that amounts to a simple ctx.term() at the end to ensure clean shutdown.
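
To make that concrete, here is a minimal sketch of both mitigations against the simpler repro script above (nothing is listening on the port; the address is arbitrary):

import zmq

ctx = zmq.Context.instance()
socket = ctx.socket(zmq.REQ)
socket.linger = 0
socket.connect('tcp://127.0.0.1:12345') # nothing listening

# Either avoid the non-copying send entirely (a copying send never hands a
# Python free-function to the io_thread)...
socket.send(b'abc', copy=True)

# ...or, if you must send with copy=False, clean up explicitly so the
# lingering message is discarded while the interpreter is still running:
socket.close()
ctx.term()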

I'm not sure this is a huge deal, because it only affects already exiting processes, but there have been other places where grabbing the GIL from the io_thread has caused weirdness.

[1] The only solution I have in my head is for the free_fn to actually be called from yet another thread that pyzmq creates, and the io_thread's callback somehow GIL-lessly puts the necessary info in a queue and notifies (or spawns) the thread, which grabs the GIL and deletes the object. That should avoid this problem, because it doesn't really matter that the free_fn is aborted at shutdown, it only matters that the io_thread is killed, which will not happen if it never grabs the GIL. But that means I would have to look up how to spawn and communicate with a C-thread in a platform-independent way, which isn't going to happen soon.
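
A rough sketch of the shape of that idea, in Python only for illustration (in the real thing the notification from the io_thread would have to happen without the GIL, e.g. a C-level pipe or socket write, and none of these names are pyzmq API):

import threading
import queue

class Reaper(threading.Thread):
    """Hypothetical cleanup thread: references are dropped here, with this
    thread holding the GIL, so the zmq io_thread never has to acquire it."""
    def __init__(self):
        super().__init__()
        self.daemon = True
        self._pending = queue.Queue()
        self.start()

    def notify(self, obj):
        # In the real design this call would need to be GIL-free, since the
        # io_thread cannot safely touch Python during interpreter shutdown.
        self._pending.put(obj)

    def run(self):
        while True:
            obj = self._pending.get()
            del obj  # drop the reaper's reference on this thread, GIL held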

@jgallagher
Contributor Author

Thanks for the explanation. I'm surprised that the stream.close() in the original example doesn't take care of the problem. If stream.close() executes, why is there still a message hanging around to be cleaned up at ctx.term()?

@minrk
Member

minrk commented Feb 18, 2012

That's because close is an asynchronous event. It starts the LINGER countdown, but does not ensure that the messages are cleaned up, because that happens in the io_threads. The message will be discarded very soon after the socket is closed, but not before close() returns. I find that a time.sleep(0.01) after close is enough to prevent the crash, but in the sample code, the application is exiting as fast as it can after socket close, which is too fast for the reaper.
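
In other words, for the original script something like this also avoids the crash (a sketch; 0.01s is simply the delay that happened to be enough here):

import time

def stop_loop(stream, loop):
    print("stopping")
    stream.close()
    time.sleep(0.01)  # give the io_thread a moment to discard the lingering message
    loop.stop()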

@minrk
Member

minrk commented Sep 16, 2013

closed by #408

@minrk minrk closed this as completed Sep 16, 2013