Problem: UDP engine aborts on networking-related errors from socket syscalls #2862

Closed
ItsNayabSD opened this issue Dec 13, 2017 · 15 comments

Comments

@ItsNayabSD

Description

I have a DISH socket bound to a multicast address, receiving messages in a while loop.
It works fine while the Ethernet interface is up.
After running sudo ifconfig enp7s0 0.0.0.0, I observed the following:

No such device (src/udp_engine.cpp:142)
Aborted (core dumped)

I use this DISH socket in an application where the IP address of the interface often becomes 0.0.0.0.
Is there any way to exit the while loop safely instead of core dumping the entire application?

Environment

  • Version: Zeromq 4.2.1
  • OS: Ubuntu 16.04
@bluca
Member

bluca commented Dec 13, 2017

I guess the UDP engine needs the same hardening the TCP one got a few months ago, against network errors. That's a setsockopt failing. PRs welcome.
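
To illustrate the kind of hardening suggested here, the following is a minimal sketch. It assumes the failing call is an IPv4 multicast join; the thread only says it is a setsockopt. The helper name join_multicast_group is hypothetical and not part of libzmq. It hands the errno back to the caller instead of aborting:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cerrno>
#include <cstring>

// Hypothetical helper (not libzmq code): joins an IPv4 multicast group and
// returns 0 on success or the errno value on failure, so the caller decides
// what to do next instead of the process aborting.
int join_multicast_group (int fd, const char *group, const char *iface_addr)
{
    ip_mreq mreq;
    std::memset (&mreq, 0, sizeof mreq);

    // inet_pton returns 1 only when the text is a valid IPv4 address.
    if (inet_pton (AF_INET, group, &mreq.imr_multiaddr) != 1
        || inet_pton (AF_INET, iface_addr, &mreq.imr_interface) != 1)
        return EINVAL;

    if (setsockopt (fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq)
        == -1)
        return errno; // for example ENODEV ("No such device")

    return 0;
}
```

A caller could then log the error and try the join again later rather than taking the whole process down.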

@ItsNayabSD
Author

ItsNayabSD commented Dec 14, 2017

I am not much into C++. :(
I also see the core dump for an interface that has a static IP but is not connected to the network.

Is there any condition I could check to break out of the loop?
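
One check an application could make, sketched below under the assumption of a Linux/POSIX system that provides getifaddrs, is whether the interface currently holds a non-zero IPv4 address. The function name interface_has_ipv4 is made up for this illustration. The check is inherently racy, and it would not catch an interface that has a static address but no connectivity, so it is at best a mitigation:

```cpp
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string>

// Hypothetical helper: returns true if the named interface currently has a
// non-zero IPv4 address, false otherwise (including when the lookup fails).
bool interface_has_ipv4 (const std::string &name)
{
    ifaddrs *list = nullptr;
    if (getifaddrs (&list) == -1)
        return false;

    bool found = false;
    for (ifaddrs *it = list; it != nullptr; it = it->ifa_next) {
        // Entries without an address, or with a non-IPv4 address, are skipped.
        if (it->ifa_addr == nullptr || it->ifa_addr->sa_family != AF_INET)
            continue;
        if (name != it->ifa_name)
            continue;
        const sockaddr_in *sin =
          reinterpret_cast<const sockaddr_in *> (it->ifa_addr);
        if (sin->sin_addr.s_addr != htonl (INADDR_ANY)) {
            found = true;
            break;
        }
    }
    freeifaddrs (list);
    return found;
}
```

The receive loop could call this periodically and stop when it returns false, keeping in mind that the address can still disappear between the check and the next socket operation.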

@ItsNayabSD
Author

ItsNayabSD commented Dec 19, 2017

Hi,

I had to add the following route

route add -net 224.0.0.0 netmask 240.0.0.0 eth0

so that the radio socket has a route for multicast traffic. The core dump no longer happens. :)

Thanks..

@bluca
Member

bluca commented Dec 19, 2017

Great, happy you found a workaround and thanks for sharing it.

I'll reopen and retitle the issue, as the UDP implementation should be hardened anyway before it can be declared stable.

@bluca bluca reopened this Dec 19, 2017
@bluca bluca changed the title from "[Help] zeromq: Aborted (device not found). How to return safely when interface is down" to "Problem: UDP engine aborts on networking-related errors from socket syscalls" Dec 19, 2017
@ItsNayabSD
Author

I can also reproduce the crash with v4.2.3, though in a different scenario.
This time GDB prints:

(gdb) bt
#0  0xb6bf8424 in __GI_raise (sig=sig@entry=6) at libpthread/nptl/sysdeps/unix/sysv/linux/raise.c:67
#1  0xb6bf27f0 in __GI_abort () at libc/stdlib/abort.c:89
#2  0xb6ed0e14 in zmq::zmq_abort (errmsg_=errmsg_@entry=0xb6c07210 <mylock> "") at src/err.cpp:87
#3  0xb6f07744 in zmq::udp_engine_t::out_event (this=<optimized out>) at src/udp_engine.cpp:285
#4  0xb6f06ca4 in zmq::udp_engine_t::restart_output (this=0x2061d0) at src/udp_engine.cpp:307
#5  0xb6eeea08 in zmq::session_base_t::read_activated (this=0x1fddd8, pipe_=0xb6edd454 <zmq::object_t::process_command(zmq::command_t&)+220>) at src/session_base.cpp:288
#6  0xb6ed1ea4 in zmq::io_thread_t::in_event (this=0x1fb2e8) at src/io_thread.cpp:85
#7  0xb6ed05e8 in zmq::epoll_t::loop (this=0x1fb808) at src/epoll.cpp:188
#8  0xb6f049a8 in thread_routine (arg_=0x1fb854) at src/thread.cpp:109
#9  0xb6f33b04 in start_thread (arg=0xb66f1520) at libpthread/nptl/pthread_create.c:297
#10 0xb6bf7b44 in clone () at libpthread/nptl/sysdeps/unix/sysv/linux/arm/../../../../../../../libc/sysdeps/linux/arm/clone.S:126
#11 0xb6bf7b44 in clone () at libpthread/nptl/sysdeps/unix/sysv/linux/arm/../../../../../../../libc/sysdeps/linux/arm/clone.S:126
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) up 3
#3  0xb6f07744 in zmq::udp_engine_t::out_event (this=<optimized out>) at src/udp_engine.cpp:285
289	        errno_assert (rc != -1);
(gdb) p errno
Cannot find thread-local variables on this target
(gdb) p strerror(errno)
Cannot find thread-local variables on this target

The root cause of the crash is that there is no gateway entry. We had to add a dummy gateway entry to the routing table.

@simias
Contributor

simias commented May 14, 2018

I'd like to look into that but I'm not sure how to report the error to the caller given that the out_event method returns void. I tried to look at the TCP implementation to figure out how it's handled there but the code seems so wildly different that I couldn't really figure out if anything was applicable for UDP.

What's the correct way of dealing with this error? I think returning an error in zmq_send would make the most sense but given the threading going on it's not obvious to me how that would be done. Maybe the error should be saved and returned on subsequent calls?
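
The "save the error and return it on a later call" idea could look roughly like the sketch below. This is hypothetical, not libzmq code: deferred_error_t, record and take are invented names, and it assumes the I/O thread writes the slot while the application thread reads it.

```cpp
#include <atomic>

// Hypothetical one-slot mailbox for an errno value shared between threads.
class deferred_error_t
{
  public:
    // Called from the I/O thread when a send fails. A later failure
    // overwrites an earlier one, so only the most recent error is kept.
    void record (int err) { _last.store (err, std::memory_order_release); }

    // Called from the application thread, for example at the start of the
    // next send. Returns the stored errno (0 if none) and clears the slot.
    int take () { return _last.exchange (0, std::memory_order_acq_rel); }

  private:
    std::atomic<int> _last{0};
};
```

With this shape the error surfaces one call late, which is the trade-off raised in the comment above.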

@bluca
Member

bluca commented May 14, 2018

With TCP, on recoverable/temporary errors the I/O thread engine simply tries again later. Can the UDP engine do that too?

@simias
Contributor

simias commented May 14, 2018

I'd have to look into that. In the case of UDP I wonder if it makes a lot of sense though, given the best effort nature of the protocol (especially in multicast). After all even if the kernel manages to send the packet you never have any guarantee that it'll ever reach its destination.

What happens if the messages pile up with TCP? I assume they're eventually just dropped?

In my case the error returned by the sendto is EADDRNOTAVAIL, I don't know if it should really be considered recoverable or temporary. I think in TCP the error would be caught earlier during the connect call which obviously has no counterpart for UDP.
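
Whether a given errno counts as temporary is a policy decision. As a sketch only, one way to encode such a policy is a small predicate. The function name and the particular list of error codes are assumptions made for illustration, not the list libzmq uses:

```cpp
#include <cerrno>

// Hypothetical policy: errno values from sendto that are treated as
// network conditions (tolerated) rather than programming errors (fatal).
bool is_transient_send_error (int err)
{
    switch (err) {
        case EADDRNOTAVAIL: // the case reported in this thread
        case ENETDOWN:
        case ENETUNREACH:
        case EHOSTUNREACH:
        case ENODEV:
        case ENOBUFS:
            return true;
        default:
            // Everything else, e.g. EBADF or EFAULT, points at a bug.
            return false;
    }
}
```

An engine could keep asserting on anything for which this returns false while tolerating the rest.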

@bluca
Member

bluca commented May 14, 2018

With TCP I think the messages will fill the queue, and what happens depends on the HWM settings at that point

@simias
Contributor

simias commented May 14, 2018

Now that I think about it even the calls to bind() and other syscalls in zmq::udp_engine_t::plug ought to report an error somehow instead of aborting. They seem less likely to fail "spuriously" but still.

@bluca
Member

bluca commented May 14, 2018

The way to report status on the handshake and related statuses is via socket monitor events, if they happen in the I/O thread

@simias
Contributor

simias commented May 14, 2018

Ah yeah, that would work. Do you think it could be used to handle UDP send errors as well, or would that be inappropriate?

@bluca
Member

bluca commented May 14, 2018

IMHO that would be way too much traffic, and as you said UDP is unreliable by nature

@ItsNayabSD
Author

We upgraded the package and I can still see the crash when zmq_send fails. :(

@simias
Contributor

simias commented Feb 7, 2019

Yeah, I hit that again. For the moment I still have an ugly hack where I comment out the assert in zmq::udp_engine_t::out_event to ignore the failure. Obviously it's not great...

I'd be interested in implementing a cleaner solution but I'm still unsure what to do. I tried taking inspiration from the TCP code but (as mentioned in previous comments) I don't really think it makes sense to retry when sendto fails. That being said, completely ignoring the error and not reporting it to the sender also seems like a poor idea.
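
A middle ground between retrying and silently ignoring might be to drop the datagram but keep a count and the last errno so the application can inspect them. The following is a hypothetical sketch using plain POSIX sendto; udp_sender_t and its members are invented for illustration and are not libzmq API:

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <atomic>
#include <cerrno>
#include <cstddef>

// Hypothetical best-effort sender: a failed sendto drops the datagram,
// bumps a counter and remembers the errno instead of aborting.
struct udp_sender_t
{
    int fd;
    std::atomic<unsigned long> dropped{0};
    std::atomic<int> last_errno{0};

    explicit udp_sender_t (int fd_) : fd (fd_) {}

    // Returns true if the kernel accepted the datagram, false if it was
    // dropped. No retry is attempted, matching UDP's best-effort nature.
    bool send (const void *data,
               size_t size,
               const sockaddr *dest,
               socklen_t dest_len)
    {
        const ssize_t rc = sendto (fd, data, size, 0, dest, dest_len);
        if (rc == -1) {
            last_errno.store (errno);
            dropped.fetch_add (1);
            return false;
        }
        return true;
    }
};
```

The counters could be polled by the application or logged periodically, which avoids generating one notification per failed packet.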

atomashpolskiy added a commit to atomashpolskiy/libzmq that referenced this issue Aug 22, 2019
bluca pushed a commit that referenced this issue Aug 22, 2019
…2862 (#3638)

* UDP engine aborts on networking-related errors from socket syscalls #2862

* Add relicense statement
@bluca bluca closed this as completed Aug 22, 2019
atomashpolskiy added a commit to atomashpolskiy/libzmq that referenced this issue Aug 24, 2019
atomashpolskiy added a commit to atomashpolskiy/libzmq that referenced this issue Aug 25, 2019
…eromq#2862 (revert changes in error list in zmq::assert_success_or_recoverable)
bluca pushed a commit that referenced this issue Aug 25, 2019
#2862 (#3640)

* UDP engine aborts on networking-related errors from socket syscalls #2862
3 participants