zeromq can hang indefinitely in zmq::ctx_t::terminate() despite LINGER=0 #2586
Comments
@tcwalther `LINGER=0` is not enough; you have to close all open sockets prior to terminating the context. From the documentation: "After interrupting all blocking calls, `zmq_ctx_destroy()` shall block until the following conditions are satisfied: [...]" Are you sure you have closed all sockets before trying to destroy the context?
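A minimal pyzmq sketch of the shutdown order described above (the socket type and variable names are arbitrary, for illustration only):

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
try:
    pass  # ... use the socket ...
finally:
    sock.close(linger=0)  # close every socket first ...
    ctx.term()            # ... so term() has nothing left to wait for
```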
@bjovke thanks for your reply. This would mean that I would have to further debug this in my `Connection` class:

```python
import asyncio
import zmq
import zmq.asyncio

class Connection:
    def __init__(self, max_n_sockets=50):
        self.context = zmq.asyncio.Context()
        self.context.setsockopt(zmq.LINGER, 0)
        self.context.set(zmq.BLOCKY, False)
        self.sockets = asyncio.Queue()
        self.used_sockets = 0
        self.max_n_sockets = max_n_sockets

    def __del__(self):
        self.context.destroy(linger=0)

    # below here, up to `max_n_sockets` sockets are created
    # ....
```

This is where I guess I'll have to continue trying to reproduce this in my Python debug build to see the full Python stack trace, and thus understand why the socket has not been closed. As a last question: is there any way that I can configure ZeroMQ to avoid this behaviour? Having a program hang indefinitely is really not in my interest.
@tcwalther Make sure that you're not calling [...]
Yes, thanks for the advice. It's one of the reasons I chose the [...]. Right now, I'm stuck trying to recreate the race condition in an environment where I can backtrace the garbage collection. It would be great if ZeroMQ had a built-in way to avoid this deadlock from the start; it's easy to detect if a program has crashed, but hard to detect if a program is hanging. Note, also, that if I send a [...]
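One way to bound the damage while the root cause is unknown is to give teardown a deadline. A rough sketch; `destroy_with_timeout` is a hypothetical helper, not part of pyzmq, and it does not unblock a stuck `zmq_ctx_term()` — it only limits how long the caller waits on it:

```python
import threading

def destroy_with_timeout(context, timeout=5.0):
    """Run context.destroy(linger=0) in a daemon thread and stop waiting
    after `timeout` seconds (hypothetical helper, not a pyzmq API)."""
    t = threading.Thread(target=lambda: context.destroy(linger=0), daemon=True)
    t.start()
    t.join(timeout)
    return not t.is_alive()  # True if destroy() finished in time
```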
SIGTERM probably unblocks [...]
Thanks for your reply, @bjovke. I don't think that it's a [...]
Sorry, to ask the question more precisely: is it possible that ZeroMQ is indeed not waiting on a socket? There doesn't seem to be a thread that is currently inside a socket. If not, what is it waiting for?

As another point of information: I just went through all 48 sockets in the context. In frame 4, thread 1 (see stack above), I printed [...]. Going through the output, I see that all have [...]. Here is the output:
ZeroMQ creates an I/O thread for each created socket. From that point on the thread is autonomous in receiving/sending messages. Try to find all threads when there's an infinite wait on context destroy. Besides the main thread, one thread is the reaper and the other ones are I/O threads. The I/O threads which still exist are the ones blocking your program.
Also, I've noticed in your last output [...]
The Python library closes all sockets before calling term, see: https://github.com/zeromq/pyzmq/blob/master/zmq/backend/cython/context.pyx#L236
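Roughly, the teardown behind that link amounts to "close every socket the context still tracks, then call `term()`". A loose paraphrase only, not the actual Cython source; the `_sockets` attribute name here is illustrative:

```python
def destroy(self, linger=None):
    # (method on the Context class -- loose paraphrase, not pyzmq's real code)
    # Close every socket the context still tracks, then terminate the context.
    for sock in list(self._sockets):   # illustrative attribute name
        if not sock.closed:
            sock.close(linger=linger)
    self.term()
```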
I know that, but just as a test you can try to close them yourself.
By default there is only one I/O thread, not one per socket.
@bluca Yes, you're right, where has my brain gone to...
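For reference, the size of the context-wide I/O thread pool is configured on the context; a small sketch using pyzmq's standard API (the values here are arbitrary):

```python
import zmq

# Default is a single I/O thread shared by all sockets of this context.
ctx = zmq.Context(io_threads=1)

# Equivalently, the option can be changed before any sockets are created:
ctx.set(zmq.IO_THREADS, 2)
```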
Looking at the stack traces, I don't see any [...]. If I'm not mistaken, the issue seems related to #1279 - the backtrace looks identical, and the fact that it also happened in 1 out of 10 cases in said bug report equally corresponds to the behaviour I have here. I don't think I'm using [...]
the I/O thread uses epoll by default rather than select/poll (it's configurable)
Ok, whatever it is, [...]
Update: all backtraces before this point were taken in the same Docker container, with the same process being debugged; from this comment onward, I was unfortunately forced to start a new Docker container with a new process. As such, there may be small differences between previous backtraces and the information from here on. Original comment below.

Let's see if I can add more information.
Does that help in any way? I don't know which socket [...]. Given an [...]
@tcwalther If you're cloning your process and using the same ZMQ sockets from different processes, then this could be the cause of your troubles.
@tcwalther On second thought, [...]
Yes, that's how a pthread backtrace will start with [...]
@bluca Ah, you're back after being scared that GitHub is dead and all data is lost. :)
@tcwalther For the socket count, you said that you have 21 sockets, but in the trace for thread 2 you have [...]
@bjovke oh yes, indeed. I had to restart my machine, and thus lost my old Docker container (with the state of 24 sockets) right before my last comment (I've updated the comment accordingly). As such, all previous backtraces were from the same process, but from the comment 2 hours ago onwards, I was forced to debug a new process in a new container. In the new process, [...]
@tcwalther But this still does not explain the blocking wait of [...]
@tcwalther On second thought, I think there should be one extra FD, created internally by ZMQ and added to the polling set exactly for the purpose of unblocking the [...]
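The general wake-up mechanism being described can be illustrated with a plain pipe registered in an epoll set. This is a simplified, Linux-only sketch of the idea, not libzmq's actual mailbox code:

```python
import os
import select

r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN)

# Another thread would write one byte to interrupt a blocking epoll wait:
os.write(w, b"\0")

events = ep.poll(timeout=1.0)  # returns because the pipe became readable
os.read(r, 1)                  # drain the wake-up byte
```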
How would you recommend going forward then? Is there other information you'd like me to provide?
@tcwalther Well, the "easiest" way is to debug ZMQ and just follow the code of [...]
@tcwalther Even then it will be hard to catch it because you said that it happens every 10th time.
@tcwalther There's another way: attach a debugger to the Python process, with the source code of the exact same version of ZMQ you're using, and with that ZMQ compiled with debug symbols. That way you can break into the blocking wait and inspect the values of all variables.
@bjovke I am already doing that. All the backtraces and variable inspections pasted above come from this.
@tcwalther Good then. I was under the impression that you had very limited means to analyze the process. It's almost impossible for anyone else to run your complete software and try to catch an issue that occurs on every 10th execution.
@tcwalther I have a similar issue in C++, which I've described here: zeromq/cppzmq#139. I see in your Python code that you call [...]
@ovanes interesting - pyzmq doesn't return a value for the [...]. In the end, I solved it by using a patched Python context object. This may not help you in your C++ code, but maybe it gives you some inspiration. Have a look at the solution here: zeromq/pyzmq#1003 (comment)
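The linked workaround is not reproduced here, but the general shape of such a patched context is roughly the following sketch. The class, method, and attribute names are made up for illustration, and this is only one way to guarantee that no socket outlives the context:

```python
import zmq

class TrackingContext(zmq.Context):
    """Hypothetical sketch: remember every socket handed out and close it
    before terminating, so term() never waits on a forgotten socket."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._tracked = []

    def socket(self, socket_type):
        s = super().socket(socket_type)
        self._tracked.append(s)
        return s

    def close_all_and_term(self):
        for s in self._tracked:
            if not s.closed:
                s.close(linger=0)
        self._tracked.clear()
        self.term()
```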
This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 56 days. Thank you for your contributions.
I've run into this issue again, but this time using regular contexts, not asyncio. I can reproduce this regression in PyZMQ 18.1, and it doesn't happen in 18.0.x. Again, it's very hard to isolate the bug into a small example script. Were there any significant changes in 18.1 that could lead to ZeroMQ not terminating on exit?
I have an edge-case in a Python program where ZeroMQ can hang indefinitely. I have an extremely hard time isolating the bug; the most reliable way for me to reproduce it is to repeatedly run our entire test suite until that suite hangs (every 10th time on average). I've already posted a corresponding bug report at zeromq/pyzmq#1003, but I think I've found out a bit more now, which hopefully justifies creating an issue here.

The basic problem is that I have created contexts and sockets all with `LINGER=0` and `BLOCKY=False`, and still, sometimes it hangs forever in `zmq::ctx_t::terminate()`. I am using zeromq 4.2.1. Looking at the GDB output, I can see that it hangs in https://github.com/zeromq/libzmq/blob/v4.2.1/src/ctx.cpp#L194. Here is the GDB output: [...] And the first 30 frames of the stack trace: [...]

Unfortunately, I have not yet been able to reproduce this bug in a debug build of Python, hence the nondescriptive stack trace in the Python part.

Looking at the `terminate()` function, I don't understand the rationale of letting `int rc = term_mailbox.recv (&cmd, -1)` wait forever. It is my understanding that I specifically set `LINGER=0` to not have this behaviour. What am I missing?

Update: I thought it might be useful to include thread information. There are 11 threads in total; the backtrace above is from thread 1. Threads 4-11 are from Tensorflow and can probably be neglected. I don't know what threads 2 and 3 are good for. Given that `term_mailbox.recv` waits for the reaper thread to terminate all sockets (according to the comment above said line of code), I wonder which one that would be. The backtraces of threads 4 to 11 are identical: [...]

Thread 2 and 3's backtraces are identical, too, but their `*this` variable in frame 1 differs significantly: [...]
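For context, the `LINGER`/`BLOCKY` setup the report refers to looks roughly like this in pyzmq (a minimal sketch; the socket type and names are arbitrary):

```python
import zmq

ctx = zmq.Context()
ctx.set(zmq.BLOCKY, False)       # ask term() not to block on lingering sockets
sock = ctx.socket(zmq.PUSH)
sock.setsockopt(zmq.LINGER, 0)   # drop unsent messages when the socket closes
# ... use the socket ...
sock.close()
ctx.term()                       # expected to return promptly; the bug is that it sometimes never does
```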