Out-of-order message and slow performance when closing and reconnecting socket #3057
Comments
zmq_close is asynchronous. You cannot expect ordering between different connections; it just doesn't make sense. Use the same socket if you require ordering between messages, or implement reordering in your application using a monotonically increasing sequence number.
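A minimal sketch of the reordering suggested above, assuming the sender prepends an 8-byte sequence number as an extra frame. The names `ReorderBuffer`, `pack_seq`, and `unpack_seq` are hypothetical and not part of libzmq or pyzmq; the receiver simply holds back any message that arrives ahead of its turn.

```python
import struct


def pack_seq(seq):
    """Encode a sequence number as an 8-byte big-endian frame (assumed wire format)."""
    return struct.pack("!Q", seq)


def unpack_seq(frame):
    """Decode a frame produced by pack_seq()."""
    return struct.unpack("!Q", frame)[0]


class ReorderBuffer:
    """Release payloads in sequence order, holding back any that arrive early."""

    def __init__(self, first_seq=0):
        self._next = first_seq   # next sequence number to deliver
        self._pending = {}       # seq -> payload, for messages that arrived early

    def push(self, seq, payload):
        """Accept one message and return the payloads that are now deliverable."""
        if seq >= self._next:
            # Keep the first copy; ignore duplicates of a pending message.
            self._pending.setdefault(seq, payload)
        ready = []
        while self._next in self._pending:
            ready.append(self._pending.pop(self._next))
            self._next += 1
        return ready
```

On the receiving side, each message returned by `recv_multipart()` would be split into its sequence frame and payload and passed to `push()`. Note that the buffer waits indefinitely for a missing sequence number, so a real application would also need a policy for messages that never arrive.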
I think I'm OK with the async ordering (it's unexpected, but I'll accept that). I think this is really a question about the close() call. Can you tell me what actually happens when you call close()? Since the connect/send/close calls are all async, is it possible for the close() to happen before the connection is established, so that the send is lost?
Yes, depending on the linger value (EDIT: and of course on external contingencies like the network stack, kernel, thread scheduling, etc.). To test it, simply connect to a non-existent endpoint and close the socket.
In general though, constantly creating and destroying sockets is an anti-pattern: these are configuration code paths, so performance will be terrible, and they are generally async operations.
I don't usually disconnect/reconnect the socket like my test case. I was debugging why a message would get dropped upon one reconnect call. The sequence is really:
In some very difficult-to-reproduce cases, either send_multipart() message would occasionally get dropped. This happens even if I set the linger for the socket to 5000 (5 seconds), which I figured would be enough time to send before the socket is closed. The server should not be busy, so I'd think it should at least make a good attempt to deliver, which is why this seems odd to me. I only peeked a little into the code, and it seems zmq_close() only sets a flag and does not wait for anything. Do you know what happens to the send_multipart() calls once the flag is set, even with linger set? Since I'm reconnecting to the same socket/URL, is it possible that the earlier or later message would be dropped? I ended up putting a sleep after socket.close() and my code is now more reliable. I just want to understand more about what's going on, since I don't like putting a random sleep everywhere this can be an issue.
Multipart messages are atomic: either all parts are sent or none are.
The problem is that the server I'm connecting to can be down, so setting a linger of -1 would hang the process forever.
Then change your application protocol to send a receipt message back instead of relying on random sleeps.
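One way the receipt idea could look on the client side, as a sketch rather than a definitive implementation. It assumes a socket type that can receive a reply (for example REQ talking to a REP server) and a server that answers every message with a single `b"OK"` frame; the function name `send_with_receipt` and the `b"OK"` receipt are hypothetical. It relies only on the pyzmq socket methods `send_multipart()`, `poll()`, and `recv_multipart()`.

```python
def send_with_receipt(sock, frames, timeout_ms=5000, receipt=b"OK"):
    """Send a multipart message and wait up to timeout_ms for the server's receipt.

    Returns True only if the expected receipt arrived in time. `sock` is
    expected to behave like a pyzmq socket that can both send and receive.
    """
    sock.send_multipart(frames)
    # poll() returns 0 when no reply became readable within the timeout.
    if sock.poll(timeout_ms) == 0:
        return False
    reply = sock.recv_multipart()
    return reply == [receipt]
```

With a confirmed receipt, the socket can be closed without depending on linger to flush the message. If the call returns False (for instance because the server is down), the caller knows the message was not acknowledged and can close with a linger of 0 and retry later instead of sleeping for an arbitrary time.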
Issue description
I have posted this on pyzmq, but it may be an issue in the library itself (zeromq/pyzmq#1171). Essentially, I have an application that connects, sends a message, and closes the socket. It can then reconnect, send another message, and close the socket again. I've seen messages dropped occasionally, even with linger set to what I think is a reasonable value.
Environment
Minimal test code / Steps to reproduce the issue
What's the actual result? (include assertion message & call stack if applicable)
I found two issues with the results. The first is that messages can appear out of order:
According to @minrk, this can be due to close() being async (it doesn't really close the socket immediately), which leaves multiple sockets connected at the same time. I find this behavior somewhat confusing. Is there some way to actually force the close() and linger to happen synchronously?
I should emphasize that in my actual code there shouldn't be that many connections (with close() happening in parallel). It's typically only a few, but the problem is very difficult to reproduce, so I had to write this simple case.
The second problem, which is more serious, is that the code can hang. It seems that a linger of 1000 (1 second) is not sufficient. Messages are dropped, so the main thread (recv_multipart()) cannot receive the same number of messages that were sent. If I raise the linger to 5000, I can get through with no hanging, but the output can sometimes stutter for a few seconds, which I didn't expect.
What's the expected result?
I'd expect a more predictable close(), or an alternative that makes sure the linger happens before close() returns. I understand that linger can drop messages, but since the alternative can be hanging (when the other endpoint goes away), I find myself always trying to set an appropriate linger. With the close() behavior being unpredictable, though, it's hard for me to choose the correct value.