Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem: HWM in ZMQ_DGRAM socket does not respect multipart message #3268

Closed
gabm opened this issue Oct 8, 2018 · 7 comments
Closed

Problem: HWM in ZMQ_DGRAM socket does not respect multipart message #3268

gabm opened this issue Oct 8, 2018 · 7 comments

Comments

@gabm
Copy link
Contributor

gabm commented Oct 8, 2018

Please use this template for reporting suspected bugs or requests for help.

Issue description

When sending a lot of messages over a ZMQ_DGRAM socket, I hit an assertion in: Assertion failed: check () (/home/gabm/zeromq/src/msg.cpp:347).

Environment

  • libzmq version (commit hash if unreleased):
  • OS:

Minimal test code / Steps to reproduce the issue

    auto ctx = zmq_ctx_new();
    auto sender = zmq_socket(ctx, ZMQ_DGRAM);

    int hwm = 3;
    zmq_setsockopt(sender, ZMQ_SNDHWM, &hwm, sizeof(hwm));
    zmq_bind(sender, "udp://127.0.0.1:5556");

    std::string remote = "127.0.0.1:5555";
    for (int i = 0; i < 100000; i++) {
        zmq_send(sender, remote.c_str(), remote.length(), ZMQ_SNDMORE);

        zmq_msg_t msg;
        zmq_msg_init_size(&msg, 8000);
        zmq_msg_send(&msg, sender, 0);
        zmq_msg_close(&msg);
    }

    zmq_close(sender);
    zmq_ctx_destroy(ctx);

What's the actual result? (include assertion message & call stack if applicable)

The callstack is:

1  raise                                                         0x7ffff7705d7f 
2  abort                                                         0x7ffff76f0672 
3  zmq::zmq_abort                            err.cpp         88  0x7ffff7d2170e 
4  zmq::msg_t::size                          msg.cpp         347 0x7ffff7d2ce6e 
5  zmq::udp_engine_t::out_event              udp_engine.cpp  380 0x7ffff7d80db6 
6  zmq::epoll_t::loop                        epoll.cpp       202 0x7ffff7d2085c 
7  zmq::worker_poller_base_t::worker_routine poller_base.cpp 139 0x7ffff7d3d50d 
8  thread_routine                            thread.cpp      182 0x7ffff7d6736e 
9  start_thread                                                  0x7ffff7bd3a9d 
10 clone                                                         0x7ffff77c9a43 

So it happens in udp_engine.cpp at line 380. The rc value of the second pull_msg call is -1.

What's the expected result?

that it works 😄

Cause

Digging a bit deeper I found out that it happens when hitting the HWM. I started my program a very small HWM and it happens immediatly. The DGRAM socket interface requires two messages to be sent: the destination address and the body message. I think that one of these two gets discarded when hitting the HWM, the other survives. Then when processing the messages in the udp_engine, we cannot assume to get "both messages", because one might be dropped due to the HWM.

@gabm
Copy link
Contributor Author

gabm commented Oct 8, 2018

I tried a naive fix like this:

    msg_t group_msg;
    int rc = _session->pull_msg(&group_msg);
    errno_assert(rc == 0 || (rc == -1 && errno == EAGAIN));
    
    // if poll failed or the message is not the _first_ part of a multi-part message
    // then return
    if (rc != 0 || !(group_msg.flags () & msg_t::more)) {
        reset_pollout(_handle);
        return;
    }
    
    msg_t body_msg;
    rc = _session->pull_msg(&body_msg);
    errno_assert(rc == 0 || (rc == -1 && errno == EAGAIN));
    
    // if poll failed or the message is not the _final_ part of a multi-part message
    // then return
    if (rc != 0 || (group_msg.flags () & msg_t::more)) {
        reset_pollout(_handle);
        return;
    }

But now everything seems to hang when I hit the HWM, so I'm still digging, not sure what happens ...

/EDIT: now it is blocking when we hit the HWM...

@bluca
Copy link
Member

bluca commented Oct 8, 2018

dgram socket does not and will not support multipart messages

@gabm
Copy link
Contributor Author

gabm commented Oct 8, 2018

dgram socket does not and will not support multipart messages

Yes, true... bit isn't the interface something like this?:

zmq_send(sender, remote.c_str(), remote.length(), ZMQ_SNDMORE);

zmq_msg_t msg;
zmq_msg_init_size(&msg, 8000);
zmq_msg_send(&msg, sender, 0);
zmq_msg_close(&msg);

found it here as well: https://github.com/zeromq/libzmq/blob/master/tests/test_dgram.cpp

@bluca
Copy link
Member

bluca commented Oct 8, 2018

ah that's the address, never mind then

@gabm
Copy link
Contributor Author

gabm commented Oct 8, 2018

aaah sorry, of course..

@gabm
Copy link
Contributor Author

gabm commented Oct 8, 2018

Aaah I see, I got confused because of a wrong return value...

When hitting the HWM, then the errno was EINVAL... The reason is because the flag that indicates the multi-part state is handle wrong in the dgram...

I'll provide a PR

bluca pushed a commit that referenced this issue Oct 8, 2018
…ven if the send fails (#3270)

* ZMQ_DGRAM: flip more flag after successful send

In the dgram socket we have a flag that indicates the next expected message type to ensure that always a pair of "address" + "body" messages gets sent. The first one MUST have the sendmore flag, the second MUST NOT.

In case the message does not get sent because of HWM full, then the function returns EAGAIN as it should. But unfortunately the next expected message type-flag gets flipped as well. When the socket_base::send function now tries to resend the message, it became the wrong message type... If you don't stop sending pairs of messages here (like me) then the next message that gets through will be of the wrong type, which in turn crashes the udp_engine function as described in #3268
@bluca
Copy link
Member

bluca commented Oct 8, 2018

Fixed by #3270
Thanks for the PR

@bluca bluca closed this as completed Oct 8, 2018
MohammadAlTurany pushed a commit to FairRootGroup/libzmq that referenced this issue Jan 28, 2019
…ven if the send fails (zeromq#3270)

* ZMQ_DGRAM: flip more flag after successful send

In the dgram socket we have a flag that indicates the next expected message type to ensure that always a pair of "address" + "body" messages gets sent. The first one MUST have the sendmore flag, the second MUST NOT.

In case the message does not get sent because of HWM full, then the function returns EAGAIN as it should. But unfortunately the next expected message type-flag gets flipped as well. When the socket_base::send function now tries to resend the message, it became the wrong message type... If you don't stop sending pairs of messages here (like me) then the next message that gets through will be of the wrong type, which in turn crashes the udp_engine function as described in zeromq#3268
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants