Problems forking a program with open zmq sockets #3223

Closed
jcaden opened this issue Aug 16, 2018 · 15 comments

jcaden commented Aug 16, 2018

Issue description

I have a program that creates a DEALER socket using the C API and connects to a remote host. At some point, when a particular message is received, the process forks; the child tries to cleanly close the previous zmq socket and open a new one, creating a new context to handle this new connection.

The program has just one active thread (apart from the zmq ones), and the fork is done from that thread. I know that zmq threads are not cloned into the child process; only the forking thread is. That is why I'm creating a new context: to restart the zmq threads in the child process.
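
A related consequence of fork, shown as a minimal sketch rather than libzmq code: besides dropping every thread except the caller, fork gives the child copies of all open descriptors, and an inherited epoll descriptor refers to the same kernel epoll instance as the parent's. The function name is hypothetical, the sketch assumes Linux, and nothing in this report confirms that this is what triggers the abort.

```c
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the errno the parent gets from EPOLL_CTL_MOD
 * after a forked child has deleted the same registration, 0 if the call
 * unexpectedly succeeds, or -1 if the setup itself fails. */
int parent_mod_errno_after_child_del(void)
{
    int p[2] = { -1, -1 };
    int result = -1;
    int ep = epoll_create1(0);
    if (ep < 0)
        return -1;
    if (pipe(p) < 0) {
        close(ep);
        return -1;
    }

    struct epoll_event ev;
    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;
    ev.data.fd = p[0];
    if (epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev) == 0) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: `ep` is a copy of the descriptor, but it names the same
             * epoll instance, so this DEL is visible to the parent too. */
            _exit(epoll_ctl(ep, EPOLL_CTL_DEL, p[0], &ev) == 0 ? 0 : 1);
        }
        int status = 0;
        if (pid > 0 && waitpid(pid, &status, 0) == pid
            && WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            /* Parent: the registration it added itself is gone. */
            ev.events = EPOLLIN | EPOLLOUT;
            result = epoll_ctl(ep, EPOLL_CTL_MOD, p[0], &ev) < 0 ? errno : 0;
        }
    }
    close(p[0]);
    close(p[1]);
    close(ep);
    return result;
}
```

On Linux the function returns ENOENT, the same errno class as the "No such file or directory" aborts: the child's EPOLL_CTL_DEL removed the entry the parent later tries to modify.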

The code works most of the time, but occasionally the child process aborts in a zmq thread. This is the stack trace:

#5  0x00007f44715df8e8 in abort () from /lib64/libc.so.6
#6  0x00007f447438a769 in zmq::zmq_abort (errmsg_=<optimized out>) at src/err.cpp:84
#7  0x00007f447438a260 in zmq::epoll_t::set_pollout (this=<optimized out>, handle_=<optimized out>) at src/epoll.cpp:119
#8  0x00007f447438b199 in zmq::io_object_t::set_pollout (this=this@entry=0x7f4440000ec0, handle_=<optimized out>) at src/io_object.cpp:85
#9  0x00007f44743aaf3f in zmq::stream_engine_t::restart_output (this=0x7f4440000ec0) at src/stream_engine.cpp:411
#10 0x00007f44743a0f1e in zmq::session_base_t::read_activated (this=0x23ea100, pipe_=0x2431100) at src/session_base.cpp:264
#11 0x00007f447438b3ac in zmq::io_thread_t::in_event (this=0x2399cc0) at src/io_thread.cpp:83
#12 0x00007f447438a3ee in zmq::epoll_t::loop (this=0x2399f40) at src/epoll.cpp:176
#13 0x00007f44743b38a0 in thread_routine (arg_=0x2399fc0) at src/thread.cpp:96
#14 0x00007f447853ede5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f44716a130d in clone () from /lib64/libc.so.6

Sometimes the same code fails with a different error:

#5  0x00007f1bc3bdd8e8 in abort () from /lib64/libc.so.6
#6  0x00007f1bc6988769 in zmq::zmq_abort (errmsg_=errmsg_@entry=0x7f1bc3d28290 "Socket operation on non-socket") at src/err.cpp:84
#7  0x00007f1bc69b08e0 in zmq::tcp_connecter_t::connect (this=this@entry=0x7f1b880008c0) at src/tcp_connecter.cpp:337
#8  0x00007f1bc69b0a1c in zmq::tcp_connecter_t::out_event (this=0x7f1b880008c0) at src/tcp_connecter.cpp:126
#9  0x00007f1bc69883ca in zmq::epoll_t::loop (this=0x14303b0) at src/epoll.cpp:172
#10 0x00007f1bc69b18a0 in thread_routine (arg_=0x1430430) at src/thread.cpp:96
#11 0x00007f1bcab3cde5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f1bc3c9f30d in clone () from /lib64/libc.so.6
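
"Socket operation on non-socket" is the message for ENOTSOCK. One way a previously valid socket descriptor can start producing it, offered as an assumption rather than a diagnosis of this trace, is that the descriptor was closed elsewhere and its number was then reused by a non-socket file. A minimal POSIX sketch with a hypothetical function name:

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns the errno from getsockopt() on a stale
 * descriptor number that now names a pipe, 0 if the call unexpectedly
 * succeeds, or -1 if the setup fails. */
int stale_fd_getsockopt_errno(void)
{
    int s = socket(AF_UNIX, SOCK_STREAM, 0);
    if (s < 0)
        return -1;

    /* Some other part of the program closes the socket, while this code
     * keeps using the old number. */
    int stale = s;
    close(s);

    /* New descriptors get the lowest free number, so the pipe's read end
     * reuses the number that used to be the socket. */
    int p[2];
    if (pipe(p) < 0)
        return -1;
    if (p[0] != stale) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    /* A socket-only call on what is now a pipe fails with ENOTSOCK. */
    int so_error = 0;
    socklen_t len = sizeof so_error;
    int rc = getsockopt(stale, SOL_SOCKET, SO_ERROR, &so_error, &len);
    int err = rc < 0 ? errno : 0;

    close(p[0]);
    close(p[1]);
    return err;
}
```

The call returns ENOTSOCK, which glibc reports as "Socket operation on non-socket".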

Environment

  • libzmq version 4.1.4
  • OS: Linux

bluca (Member) commented Aug 16, 2018

Some issues related to fork flags have been fixed; check the latest release.

jcaden (Author) commented Aug 17, 2018

I'll try the latest release. Thank you.

jcaden (Author) commented Aug 17, 2018

I tested with the latest release I can see on GitHub, 4.2.5, and the problem is still reproducible:

#5  0x00007fca135668e8 in abort () from /lib64/libc.so.6
#6  0x00007fca16322779 in zmq::zmq_abort (errmsg_=<optimized out>) at src/err.cpp:87
#7  0x00007fca1632386a in zmq::unblock_socket (s_=<optimized out>) at src/ip.cpp:124
#8  0x00007fca163522a3 in zmq::tcp_connecter_t::open (this=this@entry=0x7fc9d80008c0) at src/tcp_connecter.cpp:306
#9  0x00007fca163529ee in zmq::tcp_connecter_t::start_connecting (this=0x7fc9d80008c0) at src/tcp_connecter.cpp:189
#10 0x00007fca1632d3f6 in zmq::object_t::process_command (this=0x7fc9d80008c0, cmd_=...) at src/object.cpp:87
#11 0x00007fca16323744 in zmq::io_thread_t::in_event (this=0x24d6680) at src/io_thread.cpp:88
#12 0x00007fca16321b1e in zmq::epoll_t::loop (this=0x24cb760) at src/epoll.cpp:197
#13 0x00007fca16353c2d in thread_routine (arg_=0x24cb7b8) at src/thread.cpp:181
#14 0x00007fca1a4efde5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fca1362830d in clone () from /lib64/libc.so.6

#4  0x00007fca135668e8 in abort () from /lib64/libc.so.6
#5  0x00007fca16322779 in zmq::zmq_abort (errmsg_=errmsg_@entry=0x7fca136ae0c7 "No such file or directory") at src/err.cpp:87
#6  0x00007fca16322542 in zmq::epoll_t::rm_fd (this=0x2496340, handle_=<optimized out>) at src/epoll.cpp:103
#7  0x00007fca16323279 in zmq::io_object_t::rm_fd (this=this@entry=0x7fc9d0000f30, handle_=<optimized out>) at src/io_object.cpp:70
#8  0x00007fca1634a5ba in zmq::stream_engine_t::unplug (this=this@entry=0x7fc9d0000f30) at src/stream_engine.cpp:272
#9  0x00007fca1634a735 in zmq::stream_engine_t::error (this=0x7fc9d0000f30, reason=zmq::stream_engine_t::timeout_error) at src/stream_engine.cpp:984
#10 0x00007fca163359e9 in zmq::poller_base_t::execute_timers (this=this@entry=0x2496340) at src/poller_base.cpp:98
#11 0x00007fca16321a38 in zmq::epoll_t::loop (this=0x2496340) at src/epoll.cpp:165
#12 0x00007fca16353c2d in thread_routine (arg_=0x2496398) at src/thread.cpp:181
#13 0x00007fca1a4efde5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fca1362830d in clone () from /lib64/libc.so.6

I also tried modifying zmq so that it does not abort in these cases, and I haven't found any issues so far.

bluca (Member) commented Aug 17, 2018

Can you provide a working example to reproduce the issue?

jcaden (Author) commented Aug 17, 2018

I'll try to extract something simple that reproduces the issue.

jcaden (Author) commented Aug 22, 2018

I'm trying to provide a simple example that shows the issue, but I'm not able to reproduce it, even with the same versions that I'm using in the real case.

bluca (Member) commented Aug 31, 2018

@jcaden any chance of coming up with a test case?

jcaden (Author) commented Sep 1, 2018

I have a program that is quite similar to the real case, but I can't reproduce the problem with it. This week I couldn't work on that but I'll try to get some time next week.

jcaden (Author) commented Sep 10, 2018

I can't reproduce it with a simple test. I can come back to this ticket later (or open a new one if you prefer) if I manage to create a test program that reproduces the issue. Thank you for your time.

dit8 commented Jul 30, 2019

I'm getting the same issue (src/epoll.cpp:119)
signal 6

stale bot commented Jul 29, 2020

This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 56 days. Thank you for your contributions.

@stale stale bot added the stale label Jul 29, 2020
@stale stale bot closed this as completed Sep 26, 2020

sainath40 commented Oct 12, 2022

I am hitting this issue in an environment that uses libzmq 4.2.5. Here is the backtrace:

Program terminated with signal 6, Aborted.
#0 0x00007f89a767f3d7 in raise () from /lib64/libc.so.6

Thread 1 (Thread 0x7f89a0a14700 (LWP 15)):
#0 0x00007f89a767f3d7 in raise () from /lib64/libc.so.6
#1 0x00007f89a7680ac8 in abort () from /lib64/libc.so.6
#2 0x00000000005ea519 in zmq::zmq_abort(char const*) ()
#3 0x000000000061762f in zmq::stream_engine_t::restart_output() ()
#4 0x000000000061529f in zmq::session_base_t::read_activated(zmq::pipe_t*) ()
#5 0x00000000005ea8ac in zmq::io_thread_t::in_event() ()
#6 0x00000000005e9d2e in zmq::epoll_t::loop() ()
#7 0x0000000000600a96 in thread_routine ()
#8 0x00007f89a8b5bea5 in start_thread () from /lib64/libpthread.so.0
#9 0x00007f89a77479fd in clone () from /lib64/libc.so.6

pthreadself commented Oct 24, 2022

Same here.

Thread 1 (Thread 0x7fc810fd4700 (LWP 30725)):
#0 0x00007fc818ad01d7 in raise () from /lib64/libc.so.6
#1 0x00007fc818ad18c8 in abort () from /lib64/libc.so.6
#2 0x00007fc8163a1569 in zmq::zmq_abort (errmsg_=) at src/err.cpp:88
#3 0x00007fc8163a10bf in zmq::epoll_t::set_pollout (this=, handle_=) at src/epoll.cpp:145
#4 0x00007fc8163a2079 in zmq::io_object_t::set_pollout (this=this@entry=0x396a000, handle_=) at src/io_object.cpp:85
#5 0x00007fc8163cf0a2 in set_pollout (this=0x396a000) at src/stream_engine_base.hpp:126
#6 zmq::stream_engine_base_t::restart_output (this=0x396a000) at src/stream_engine_base.cpp:404
#7 0x00007fc8163a253c in zmq::io_thread_t::in_event (this=0x2b7c0e0) at src/io_thread.cpp:91
#8 0x00007fc8163a0d4e in zmq::epoll_t::loop (this=0x2b7c000) at src/epoll.cpp:206
#9 0x00007fc8163d5c68 in thread_routine (arg_=0x2b7c058) at src/thread.cpp:257
#10 0x00007fc81e39edc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fc818b9273d in clone () from /lib64/libc.so.6

errno is 2 and the error message is "no such file".
The pe->fd passed to epoll_ctl is a Unix domain socket fd.

pthreadself commented

After hours of debugging, I found the cause: this pe->fd was (wrongly) closed by user code.
It's not libzmq's fault.
I suggest that anyone experiencing similar problems use strace to trace their program's behaviour.
For example, if this pe->fd is closed (or manipulated in any other way) by your own code, then you have a problem.
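
The errno 2 (ENOENT) in that explanation can be reproduced without libzmq. epoll_ctl reports ENOENT for EPOLL_CTL_MOD when the descriptor is valid but not registered with the epoll instance, which is the state reached when a registered descriptor is closed behind the poller's back and its number is reused. The sketch below assumes Linux and uses a hypothetical function name; it illustrates the failure mode and is not libzmq's code.

```c
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns the errno from EPOLL_CTL_MOD on a descriptor
 * number that was closed and then reused, 0 if the call unexpectedly
 * succeeds, or -1 if the setup fails. */
int stale_fd_epoll_mod_errno(void)
{
    int sv[2] = { -1, -1 };
    int result = -1;
    int ep = epoll_create1(0);
    if (ep < 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        close(ep);
        return -1;
    }

    /* The poller registers one end of a Unix domain socket pair. */
    struct epoll_event ev;
    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;
    ev.data.fd = sv[0];
    if (epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev) == 0) {
        /* "User code" closes the descriptor. With no duplicates left, the
         * kernel drops it from the epoll interest list. */
        int stale = sv[0];
        close(stale);

        /* The next descriptor created takes the lowest free number, so the
         * stale number becomes valid again but names a different file. */
        int reused = dup(sv[1]);
        if (reused == stale) {
            /* The poller, unaware of the close, asks for write readiness. */
            ev.events = EPOLLIN | EPOLLOUT;
            result = epoll_ctl(ep, EPOLL_CTL_MOD, stale, &ev) < 0 ? errno : 0;
        }
        if (reused >= 0)
            close(reused);
    } else {
        close(sv[0]);
    }
    close(sv[1]);
    close(ep);
    return result;
}
```

The function returns ENOENT. If the number had not been reused, the same epoll_ctl call would fail with EBADF instead, so a trace that shows ENOENT suggests the descriptor number was handed out again after the stray close.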

zhonxinya commented

Hi pthreadself,
I have encountered a similar problem. Do you have an easy way to reproduce this issue?
I close the socket and destroy the context after the client receives a message from the server, and there is only one place where the socket is closed.
When you say "I found that the reason is that this pe->fd was (wrongfully) closed by user code", do you mean that you used the wrong API to close the socket?
[image attachment]
