
Hang or assert during zmq_ctx_term after zmq_socket_monitor was used #1279

Closed

oliora opened this issue Dec 18, 2014 · 20 comments

Comments

@oliora

oliora commented Dec 18, 2014

If a monitor is used and sockets are closed in a specific order, ZMQ hangs or fails an assertion during zmq_ctx_term. The bug appears rarely, roughly 1 in 10 executions. Please see the code comments for more details.
Code demonstrating the problem:

#include <chrono>   // for std::chrono::seconds
#include <thread>   // for std::this_thread::sleep_for
#include <zmq.h>

static void close_zero_linger(void *socket)
{
    int linger = 0;
    zmq_setsockopt(socket, ZMQ_LINGER, &linger, sizeof(linger));
    zmq_close(socket);
}

static void bounce(void *client, void *server)
{
    char send_buf[] = "Hello";
    char recv_buf[sizeof(send_buf) / sizeof(send_buf[0])];

    zmq_send(client, send_buf, sizeof(send_buf), 0);
    zmq_recv(server, recv_buf, sizeof(recv_buf), 0);
    zmq_send(server, send_buf, sizeof(send_buf), 0);
    zmq_recv(client, recv_buf, sizeof(recv_buf), 0);
}

int main(void)
{
    void *ctx = zmq_ctx_new();

    void *client = zmq_socket(ctx, ZMQ_DEALER);
    void *server = zmq_socket(ctx, ZMQ_DEALER);

    zmq_socket_monitor(client, "inproc://monitor-client", ZMQ_EVENT_ALL);
    void *client_mon = zmq_socket(ctx, ZMQ_PAIR);
    zmq_socket_monitor(server, "inproc://server-client", ZMQ_EVENT_ALL);
    void *server_mon = zmq_socket(ctx, ZMQ_PAIR);

    zmq_connect(client_mon, "inproc://monitor-client");
    zmq_connect(server_mon, "inproc://server-client");

    zmq_bind(server, "tcp://127.0.0.1:9998");
    zmq_connect(client, "tcp://127.0.0.1:9998");
    bounce(client, server);

    // Without the sleep (or with a small duration) there is no deadlock.
    std::this_thread::sleep_for(std::chrono::seconds(1));

    close_zero_linger(client_mon);  // Note the order of closing!
    close_zero_linger(client_mon);  // (sic: client_mon is closed twice)
    close_zero_linger(server);      // Only this order leads to the deadlock/assert.
    close_zero_linger(client);      // Any different order works fine.

    zmq_ctx_term(ctx);              // Deadlock or assert!!!
}

Platform: Windows 7 64bit (real) and Windows Server 2008 R2 64bit (VM)
Compiler: MS Visual Studio 2013
ZMQ: 4.0.5, built from sources with the same compiler
Code built in the Release-x86 configuration.
Assertion: mailbox.cpp line 82 (function zmq::mailbox_t::recv)
Call stack:

    KernelBase.dll!_RaiseException@16() Unknown
>   libzmq.dll!zmq::mailbox_t::recv(zmq::command_t * cmd_, int timeout_) Line 82    C++
    libzmq.dll!zmq::socket_base_t::process_commands(int timeout_, bool throttle_) Line 915  C++
    libzmq.dll!zmq::socket_base_t::send(zmq::msg_t * msg_, int flags_) Line 736 C++
    libzmq.dll!zmq::socket_base_t::monitor_event(zmq_event_t event_, const std::basic_string<char,std::char_traits<char>,std::allocator<char> > & addr_) Line 1261  C++
    libzmq.dll!zmq::tcp_listener_t::close() Line 125    C++
    libzmq.dll!zmq::tcp_listener_t::process_term(int linger_) Line 76   C++
    libzmq.dll!zmq::object_t::process_command(zmq::command_t & cmd_) Line 138   C++
    libzmq.dll!zmq::io_thread_t::in_event() Line 74 C++
    libzmq.dll!zmq::select_t::loop() Line 185   C++
    libzmq.dll!thread_routine(void * arg_) Line 35  C++
    [External Code] 

I cannot reproduce the issue on the OS X platform; it looks like it's Windows-specific.

UPD: added information about the assert and the call stack.

@xaqq
Member

xaqq commented Dec 18, 2014

Hello,
I can only confirm it works fine on Linux, with zmq 4.0.6, 4.1, and 4.2 (current dev).

Can you try with the latest version?

@oliora
Author

oliora commented Dec 18, 2014

Same with ZMQ from master (4.2?).
Interestingly, the app hangs only if run from cmd; if run from Explorer it works fine.

@rodgert
Contributor

rodgert commented Dec 18, 2014

This was discovered in a test case for https://github.com/zeromq/azmq

It works fine on Linux and OSX, but fails as Andrey describes on Windows.


@oliora
Author

oliora commented Dec 18, 2014

Looks like the problem is tightly tied to the order in which the sockets are closed. It must be either:

  1. close client monitor, close server, close client; or
  2. close server monitor, then close server and client in any order.

Otherwise the lib doesn't hang.

@oliora oliora changed the title Hang in zmq_ctx_term after zmq_socket_monitor was used Hang or assert during zmq_ctx_term after zmq_socket_monitor was used Jan 26, 2015
@AwesomeDolphin

I'm able to use that test program to trigger a hang on OS X. The stack trace is:

(lldb) proc att --name zmqtest
Process 3636 stopped
Executable module set to "/Users/khenrick/zmqtest/./zmqtest".
Architecture set to: x86_64-apple-macosx.
(lldb) bt all
* thread #1: tid = 0x74bd4, 0x00007fff971e45c2 libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff971e45c2 libsystem_kernel.dylib`poll + 10
    frame #1: 0x0000000108c0e4f7 zmqtest`zmq::signaler_t::wait(this=0x00007f84a2c04e40, timeout_=-1) + 119 at signaler.cpp:222
    frame #2: 0x0000000108beba4e zmqtest`zmq::mailbox_t::recv(this=0x00007f84a2c04de0, cmd_=0x00007fff5703c368, timeout_=-1) + 126 at mailbox.cpp:72
    frame #3: 0x0000000108bc642b zmqtest`zmq::ctx_t::terminate(this=0x00007f84a2c04d30) + 3435 at ctx.cpp:152
    frame #4: 0x0000000108c505b1 zmqtest`zmq_ctx_term(ctx_=0x00007f84a2c04d30) + 81 at zmq.cpp:171
    frame #5: 0x0000000108c7123f zmqtest`main + 351 at zmqtest.cc:49
    frame #6: 0x00007fff8d6b05c9 libdyld.dylib`start + 1

  thread #2: tid = 0x74bd6, 0x00007fff971e421a libsystem_kernel.dylib`kevent + 10
    frame #0: 0x00007fff971e421a libsystem_kernel.dylib`kevent + 10
    frame #1: 0x0000000108be93a5 zmqtest`zmq::kqueue_t::loop(this=0x00007f84a2c05280) + 293 at kqueue.cpp:170
    frame #2: 0x0000000108be9245 zmqtest`zmq::kqueue_t::worker_routine(arg_=0x00007f84a2c05280) + 21 at kqueue.cpp:211
    frame #3: 0x0000000108c35c4c zmqtest`thread_routine(arg_=0x00007f84a2c052d8) + 284 at thread.cpp:81
    frame #4: 0x00007fff95189268 libsystem_pthread.dylib`_pthread_body + 131
    frame #5: 0x00007fff951891e5 libsystem_pthread.dylib`_pthread_start + 176
    frame #6: 0x00007fff9518741d libsystem_pthread.dylib`thread_start + 13

  thread #3: tid = 0x74bd7, 0x00007fff971e421a libsystem_kernel.dylib`kevent + 10
    frame #0: 0x00007fff971e421a libsystem_kernel.dylib`kevent + 10
    frame #1: 0x0000000108be93a5 zmqtest`zmq::kqueue_t::loop(this=0x00007f84a2c05590) + 293 at kqueue.cpp:170
    frame #2: 0x0000000108be9245 zmqtest`zmq::kqueue_t::worker_routine(arg_=0x00007f84a2c05590) + 21 at kqueue.cpp:211
    frame #3: 0x0000000108c35c4c zmqtest`thread_routine(arg_=0x00007f84a2c055e8) + 284 at thread.cpp:81
    frame #4: 0x00007fff95189268 libsystem_pthread.dylib`_pthread_body + 131
    frame #5: 0x00007fff951891e5 libsystem_pthread.dylib`_pthread_start + 176
    frame #6: 0x00007fff9518741d libsystem_pthread.dylib`thread_start + 13
(lldb) 

The OS X version I'm running on is:

ProductName:    Mac OS X
ProductVersion: 10.10.2
BuildVersion:   14C109

Using zmq 4.0.5

@hitstergtd
Member

@somdoron - thoughts? I know it is quite old, but I have managed to reproduce this issue on OS X.

@hitstergtd
Member

In my case it is a reproducible hang on every execution, not 1 in 10.

@hitstergtd
Member

After further investigation, it seems that if the second call to close_zero_linger(client_mon) is changed to close_zero_linger(server_mon), the hang disappears. I am not sure why this is the case, even when ZMQ_BLOCKY is disabled with zmq_ctx_set(..). Furthermore, monitor sockets should have their linger flag turned off.
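For reference, a sketch of the close sequence that avoids the hang for me (the repro from the first comment, with the duplicated client_mon close replaced by server_mon; the ZMQ_BLOCKY line needs libzmq 4.2+ and is shown only to note it makes no difference here):

// Variant of the close sequence from the first comment. Replacing the
// duplicated client_mon close with server_mon makes the hang disappear.
zmq_ctx_set(ctx, ZMQ_BLOCKY, 0);    // disabling blocky termination alone does NOT help

close_zero_linger(client_mon);
close_zero_linger(server_mon);      // was: a second close of client_mon
close_zero_linger(server);
close_zero_linger(client);

zmq_ctx_term(ctx);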

@somdoron
Member

somdoron commented May 1, 2016

I think the problem is that the internal monitor socket is accessed from multiple threads: tcp_listener_t::close accesses the monitor socket from the background I/O thread, while closing the monitor socket happens from the user thread. This is an internal libzmq problem which will not be easy to solve.

I suggest stopping the monitoring before disposing of any socket, by calling zmq_socket_monitor with NULL as the address. It might solve the problem, but I'm not sure.
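A minimal sketch of that teardown against the repro in the first comment (untested; whether it actually avoids the hang is exactly what I'm unsure about):

// Stop monitoring first: passing NULL as the address detaches the monitor
// from the socket, so later closes no longer try to send events to it.
zmq_socket_monitor(client, NULL, 0);
zmq_socket_monitor(server, NULL, 0);

// After that, the close order of the remaining sockets should not matter.
close_zero_linger(client_mon);
close_zero_linger(server_mon);
close_zero_linger(server);
close_zero_linger(client);

zmq_ctx_term(ctx);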

@somdoron
Member

somdoron commented May 1, 2016

Some more thoughts: the monitored socket must be closed before the monitor socket, because sending the events is a blocking operation; that is why it sometimes blocks (there is no socket on the other side).

Alternatively, we can stop the monitoring by calling zmq_socket_monitor with NULL as the address, then close the monitor socket and then the socket itself (the order of the last two does not matter once monitoring is stopped).

The last option is to fix the library code to send the event without blocking (just adding the flag), here:

https://github.com/zeromq/libzmq/blob/master/src/socket_base.cpp#L1732
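Roughly what that change might look like (a paraphrase of the event-sending code behind the link, not an exact diff; monitor_socket and msg are the names used by socket_base_t::monitor_event, as seen in the call stacks above):

// In zmq::socket_base_t::monitor_event: send the event with ZMQ_DONTWAIT
// so the I/O thread cannot stall when nothing is connected to (or reading
// from) the monitor endpoint. Dropped events are the trade-off.
zmq_sendmsg(monitor_socket, &msg, ZMQ_DONTWAIT);  // was: flags 0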

@hitstergtd
Member

@somdoron
I am able to close the monitor sockets before the monitored sockets, reproducibly, without causing hangs.

As to your response on the internal sockets, that is an interesting thought and would require further investigation. I think it would be good to commit this test case into the repository, in the hope that we can fix it in the near future.

@hitstergtd
Member

I will post a DTrace capture of this hang; hopefully it can shed some more light.

@hoditohod
Contributor

I bumped into the same problem (assert or hang at context termination), but in a slightly different scenario: I have a single-threaded C++ (zmq.hpp) message broker that uses the monitor for diagnostic logs. In the unit tests, test clients use their own zmq context, and at test termination the destruction of the production code's context asserts or hangs.

Minimal code example that demonstrates the effect 8/10 times (very similar to the monitor example):

#include <iostream>
#include <vector>
#include <chrono>
#include <zmq.hpp>
#include <zmq_addon.hpp>

using namespace std::chrono_literals;

struct MonitorEvent {
    uint16_t event;
    int32_t  value;
} __attribute__((__packed__));

int main()
{
    zmq::context_t ctx;
    zmq::context_t testctx;

    // sockets
    zmq::socket_t server(ctx, zmq::socket_type::dealer);
    zmq::socket_t client(testctx, zmq::socket_type::dealer);    //if testctx is replaced with ctx everything's fine

    // monitor
    zmq_socket_monitor (static_cast<void*>(server), "inproc://monitor-server", ZMQ_EVENT_ALL);
    zmq::socket_t server_mon(ctx, zmq::socket_type::pair);
    server_mon.connect("inproc://monitor-server");

    // connect and send/receive a message
    server.bind("tcp://*:9998");
    client.connect("tcp://127.0.0.1:9998");

    zmq::multipart_t msg1;
    msg1.addstr("hello");
    msg1.send(client);

    zmq::multipart_t msg2(server);  //recv
    std::cout << msg2.popstr() << std::endl;

    // read events until no more for 1 sec
    while(true)
    {
        std::vector<zmq::pollitem_t> pollset = { { static_cast<void*>(server_mon), 0, ZMQ_POLLIN, 0 } };

        if (zmq::poll(pollset, 1s) <= 0)
            break;

        if (pollset[0].revents & ZMQ_POLLIN)
        {
            zmq::multipart_t msg(server_mon);
            auto event = msg.poptyp<MonitorEvent>();
            auto name = msg.popstr();

            std::cout << "endpoint: " << name << ", event: " << event.event << ", value: " << event.value << std::endl;
        }
    }

    return 0;
}

Here's the assert I get:

Assertion failed: pfd.revents & POLLIN (src/signaler.cpp:243)
#0  0x4188a0 in ./repro (zmq::zmq_abort(char const*)+0x9)
#1  0x423199 in ./repro (zmq::signaler_t::wait(int)+0x1b7)
#2  0x4190d7 in ./repro (zmq::mailbox_t::recv(zmq::command_t*, int)+0x132)
#3  0x424f72 in ./repro (zmq::socket_base_t::process_commands(int, bool)+0x50)
#4  0x4254f0 in ./repro (zmq::socket_base_t::send(zmq::msg_t*, int)+0x2ad)
#5  0x40b1fd in ./repro (zmq_msg_send+0x3d)
#6  0x425d5d in ./repro (zmq::socket_base_t::monitor_event(int, long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x61)
#7  0x42f601 in ./repro (zmq::tcp_listener_t::close()+0x3a)
#8  0x42fc5a in ./repro (zmq::tcp_listener_t::process_term(int)+0x23)
#9  0x418c13 in ./repro (zmq::io_thread_t::in_event()+0x63)
#10  0x418353 in ./repro (zmq::epoll_t::loop()+0xee)
#11  0x4300fe in ./repro (thread_routine+0x45)
#12  0x7f242e453675 in /lib/x86_64-linux-gnu/libpthread.so.0 (start_thread+0xca)
#13  0x7f242d4c9bf3 in /lib/x86_64-linux-gnu/libc.so.6 (clone+0x6d)
#14  0x6d in ? (+0x6d)

Here's the backtrace in case of a hang:

(gdb) thread apply all bt

Thread 3 (Thread 0x7f560c008700 (LWP 73166)):
#0  0x00007f560cb33153 in epoll_wait ()
    at ../sysdeps/unix/syscall-template.S:84
#1  0x0000000000418347 in zmq::epoll_t::loop (this=0x18a8780)
    at src/epoll.cpp:159
#2  0x0000000000430055 in thread_routine (arg_=0x18a8800) at src/thread.cpp:100
#3  0x00007f560dabc6fa in start_thread (arg=0x7f560c008700)
    at pthread_create.c:333
#4  0x00007f560cb32b5d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7f560c809700 (LWP 73165)):
#0  0x00007f560cb33153 in epoll_wait ()
    at ../sysdeps/unix/syscall-template.S:84
#1  0x0000000000418347 in zmq::epoll_t::loop (this=0x18a7ff0)
    at src/epoll.cpp:159
#2  0x0000000000430055 in thread_routine (arg_=0x18a8070) at src/thread.cpp:100
#3  0x00007f560dabc6fa in start_thread (arg=0x7f560c809700)
    at pthread_create.c:333
#4  0x00007f560cb32b5d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

---Type <return> to continue, or q <return> to quit---
Thread 1 (Thread 0x7f560e130740 (LWP 73164)):
#0  0x00007f560cb26e8d in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00000000004231da in poll (__timeout=-1, __nfds=1, __fds=0x7ffdee34eb10)
    at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  zmq::signaler_t::wait (this=this@entry=0x18a4d28, 
    timeout_=timeout_@entry=-1) at src/signaler.cpp:222
#3  0x0000000000419052 in zmq::mailbox_t::recv (this=this@entry=0x18a4cc0, 
    cmd_=cmd_@entry=0x7ffdee34ebc0, timeout_=timeout_@entry=-1)
    at src/mailbox.cpp:81
#4  0x000000000041494d in zmq::ctx_t::terminate (this=0x18a4c20)
    at src/ctx.cpp:192
#5  0x0000000000408ebc in zmq::context_t::~context_t (this=<optimized out>, 
    __in_chrg=<optimized out>)
    at include/zmq.hpp:413
#6  0x00000000004081cf in main ()
    at test/repro.cpp:17

If I modify the test program to use a single context (as in the monitor example on the manpage), it runs without any issue.

Platform: Ubuntu 16.04 64bit
Compiler: gcc 5.3.1-14ubuntu2.1
ZMQ: git latest 2016-aug-12


I also checked the code in the first comment (it closes the client monitor twice!), and it hangs 10/10. If I modify the close order of the first two to server_mon, client_mon, or to client_mon, server_mon, the hang disappears and the app exits properly after 1 sec.

@hoditohod
Contributor

As a workaround I'm using the following destruction order in the code:

  • delete monitored sockets (server in example)
  • sleep a bit (100ms)
  • delete monitoring socket (server_mon)

This seems to be stable: no assert or hang for a few hundred test iterations.
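In code, the teardown order looks roughly like this (a sketch against my zmq.hpp example above; it needs <thread> and <chrono>, and the 100 ms pause is just what proved stable for me, not a guaranteed bound):

// 1. close the monitored socket first
server.close();

// 2. give the I/O thread a moment to emit the final monitor events
std::this_thread::sleep_for(std::chrono::milliseconds(100));

// 3. only then close the monitoring socket
server_mon.close();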

@hirenshah05

Is this still an issue? I am seeing the same issue when I use ZmqMonitor.

@bluca
Member

bluca commented Mar 1, 2017

I think there was a fix recently, but I'm writing from my phone so I can't look it up. Are you building from the libzmq master branch?

@hoditohod
Contributor

I re-ran the example code I previously attached, and the issue is no longer reproducible. The test program always exits correctly. The issue it triggered is fixed.

Env: Ubuntu 16.04 64bit, libzmq 4.2.2 stable release

@bluca
Member

bluca commented Mar 1, 2017

Great, thanks for confirming, closing now

@bluca bluca closed this as completed Mar 1, 2017
@hirenshah05

Thanks for the prompt reply.

@bluca
Member

bluca commented Mar 1, 2017

@hoditohod - if you have time, any chance you could please turn that repro code into a unit test? It would be very handy for avoiding regressions in the future.
