PUB crash when SUB exceeded SNDHWM #2942
Comments
I can't seem to reproduce the crash. Did you build the library with or without the draft APIs? |
@bluca Oh... I built libzmq-4.2.3 without any configuration arguments. I guess low zmq_setsockopt(sub, ZMQ_SNDHWM, (void *)10, 0); |
Here's my gdb backtrace:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
4.2.3
[New Thread 0x7ffff6abc700 (LWP 23970)]
[New Thread 0x7ffff62bb700 (LWP 23972)]
Assertion failed: erased == 1 (src/mtrie.cpp:297)
Thread 1 "a.out" received signal SIGABRT, Aborted.
0x00007ffff77b8428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007ffff77b8428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007ffff77ba02a in __GI_abort () at abort.c:89
#2 0x00007ffff7b73539 in zmq::zmq_abort (errmsg_=errmsg_@entry=0x7ffff7bb75c6 "erased == 1") at src/err.cpp:87
#3 0x00007ffff7b7de6f in zmq::mtrie_t::rm_helper (this=this@entry=0xa34380, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:297
#4 0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0xa34360, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#5 0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0xa34340, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#6 0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0xa34320, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#7 0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0xa34300, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#8 0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0x61c7c0, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#9 0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0x960c50, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#10 0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0x61f020, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#11 0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0x617ce8, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#12 0x00007ffff7b7e235 in zmq::mtrie_t::rm (this=this@entry=0x617ce8, prefix_=<optimized out>, size_=<optimized out>, pipe_=<optimized out>) at src/mtrie.cpp:288
#13 0x00007ffff7baddc9 in zmq::xpub_t::xread_activated (this=0x617740, pipe_=0xade9b0) at src/xpub.cpp:115
#14 0x00007ffff7b92f0f in zmq::socket_base_t::process_commands (this=this@entry=0x617740, timeout_=timeout_@entry=0, throttle_=throttle_@entry=false) at src/socket_base.cpp:1378
#15 0x00007ffff7b95829 in zmq::socket_base_t::connect (this=0x617740, addr_=0x7fffffffe0a0 "tcp://127.0.0.1:40429") at src/socket_base.cpp:709
#16 0x0000000000400b53 in main () |
No change. The program just spins forever in one of the getsockopt_events_within_many_subscriptions loops |
I am running with libsodium:
It doesn't make much sense - unless you are using CURVE, and in that example you are not, sodium or tweetnacl will make no difference at all. Are you sure you don't have multiple versions of the library lying around? Maybe you are building with one but running with an old one, or vice versa? |
I'm sorry my comment about Here's the comment:
|
Yes, there's only 1 libzmq on my system. I want to explain the environment difference between us. I downloaded zeromq-4.2.3 from zeromq-4.2.3.tar.gz. I built it with
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04 LTS
Release: 16.04
Codename: xenial
$ uname -a
Linux fantine 4.4.0-104-generic #127-Ubuntu SMP Mon Dec 11 12:16:42 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
You are not running ldconfig - are you sure there are no other libs? What does |
Oh, I ran The
|
Anyways, I'm still researching my crash. I found a symptom that |
I found the condition. A If the dropping is a normal behavior of |
This reminds me of #2252, which I reported 14 months ago. At that time, you pointed to dropped subscriptions due to the HWM. I couldn't reproduce it without PyZMQ, so I tried to fix it in PyZMQ and gave up. Essentially, this issue is the same as that one, but this time I can reproduce it without PyZMQ. @bluca I want to make it reproducible in your environment as well. I'll change the subject to "PUB crash when SUB exceeded SNDHWM". |
Use #2942 (comment) instead. |
Or this? (more duplicated subscriptions, lowest
#include "zmq.h"
#include <stdio.h>

void getsockopt_events_within_many_subscriptions(void* sub)
{
    char opt[256];
    size_t opt_len = 256;
    char topic[8];
    int n = 100;
    for (int i = 0; i < n; ++i)
    {
        for (int j = 0; j < n; ++j)
        {
            sprintf(topic, "%08x", j);
            zmq_setsockopt(sub, ZMQ_SUBSCRIBE, &topic, 8);
            zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);
        }
        for (int j = 0; j < n; ++j)
        {
            sprintf(topic, "%08x", j);
            zmq_setsockopt(sub, ZMQ_UNSUBSCRIBE, &topic, 8);
            zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);
        }
    }
}

int main()
{
    printf("%d.%d.%d\n", ZMQ_VERSION_MAJOR, ZMQ_VERSION_MINOR, ZMQ_VERSION_PATCH);
    void *context = zmq_ctx_new();
    void *pub = zmq_socket(context, ZMQ_PUB);
    void *sub;
    char addr[256]; size_t addr_len = 256;
    char opt[256]; size_t opt_len = 256;
    int lowest_hwm = 1;
    zmq_bind(pub, "tcp://127.0.0.1:*");
    zmq_getsockopt(pub, ZMQ_LAST_ENDPOINT, addr, &addr_len);
    zmq_setsockopt(pub, ZMQ_RCVHWM, &lowest_hwm, sizeof(lowest_hwm));
    for (int i = 0; i < 100; ++i)
    {
        sub = zmq_socket(context, ZMQ_SUB);
        zmq_setsockopt(sub, ZMQ_SNDHWM, &lowest_hwm, sizeof(lowest_hwm));
        zmq_connect(sub, addr);
        zmq_getsockopt(pub, ZMQ_EVENTS, opt, &opt_len);
        getsockopt_events_within_many_subscriptions(sub);
    }
}
|
Here's another variant. I should make a code to drop
#include "zmq.h"
#include <stdio.h>

int main()
{
    printf("%d.%d.%d\n", ZMQ_VERSION_MAJOR, ZMQ_VERSION_MINOR, ZMQ_VERSION_PATCH);
    void *context = zmq_ctx_new();
    void *pub = zmq_socket(context, ZMQ_PUB);
    void *sub;
    char addr[256]; size_t addr_len = 256;
    char opt[256]; size_t opt_len = 256;
    char topic[8];
    int lowest_hwm = 1;
    int n = 10000;
    zmq_setsockopt(pub, ZMQ_RCVHWM, &lowest_hwm, sizeof(lowest_hwm));
    zmq_bind(pub, "tcp://127.0.0.1:*");
    zmq_getsockopt(pub, ZMQ_LAST_ENDPOINT, addr, &addr_len);
    for (int i = 0; i < 100; ++i)
    {
        sub = zmq_socket(context, ZMQ_SUB);
        zmq_setsockopt(sub, ZMQ_SNDHWM, &lowest_hwm, sizeof(lowest_hwm));
        for (int j = 0; j < n; ++j)
        {
            sprintf(topic, "%08x", j);
            zmq_setsockopt(sub, ZMQ_SUBSCRIBE, &topic, 8);
        }
        zmq_connect(sub, addr);
        zmq_getsockopt(pub, ZMQ_EVENTS, opt, &opt_len);
        zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);
        for (int j = 0; j < n; ++j)
        {
            sprintf(topic, "%08x", j);
            zmq_setsockopt(sub, ZMQ_UNSUBSCRIBE, &topic, 8);
            zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);
        }
    }
}
|
Both programs work fine for me on both 4.2.2 and latest master - this is on Debian 9 x86_64 |
I tried to run this (with VS), but get the following error at the end of getsockopt_events_within_many_subscriptions:
It's in my branch https://github.com/sigiesec/libzmq/tree/fix-issue-2943 |
"topic" is an array, so I think the code shouldn't pass its address - gcc is more forgiving |
@bluca it's not a compiler error, but a run-time error |
Does it also succeed? I've switched the connection topology to
#include "zmq.h"
#include <stdio.h>

int main()
{
    printf("%d.%d.%d\n", ZMQ_VERSION_MAJOR, ZMQ_VERSION_MINOR, ZMQ_VERSION_PATCH);
    int hwm = 1;
    int n = 10000;
    char addr[256]; size_t addr_len = 256;
    char opt[256]; size_t opt_len = 256;
    char topic[8];
    void *context = zmq_ctx_new();
    void *pub, *sub;
    pub = zmq_socket(context, ZMQ_PUB);
    zmq_setsockopt(pub, ZMQ_RCVHWM, &hwm, sizeof(hwm));
    for (int i = 0; i < 100; ++i)
    {
        sub = zmq_socket(context, ZMQ_SUB);
        zmq_setsockopt(sub, ZMQ_SNDHWM, &hwm, sizeof(hwm));
        zmq_bind(sub, "tcp://127.0.0.1:*");
        zmq_getsockopt(sub, ZMQ_LAST_ENDPOINT, addr, &addr_len);
        for (int j = 0; j < n; ++j)
        {
            sprintf(topic, "%08x", j);
            zmq_setsockopt(sub, ZMQ_SUBSCRIBE, &topic, 8);
        }
        zmq_connect(pub, addr);
        for (int j = 0; j < n; ++j)
        {
            sprintf(topic, "%08x", j);
            zmq_setsockopt(sub, ZMQ_UNSUBSCRIBE, &topic, 8);
            zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);
        }
    }
}
|
Yeah they all succeed, last one included, apart from the very first one which seems to just run forever. |
Is there any more information that you can give? Have you tried with the latest libzmq master? Using any non-standard compiler/compiler flag? Have you tried using just the packaged library from Ubuntu? |
Okay, I should stop editing the code. I'm trying to reproduce the crash on Travis CI to rule out any environment differences. I tried it with |
It reproduces on Travis CI! Can you check the crash result and the script in CI?
|
I noticed that topic is too small: it must have size 9 to store an 8-byte string plus the terminating NUL. When I change this, I get a similar assertion in VS:
|
@sigiesec Oh, thank you for pointing that out! I updated the code at https://github.com/sublee/zmq-pubsub-crash. |
@bluca Could this be a bug in mtrie? I wanted to add unit tests for that anyway, since it is not well covered by the existing tests. @sublee I also have the test in a branch, where I have migrated the tests to Unity: https://github.com/sigiesec/libzmq/tree/fix-issue-2942 |
@sigiesec Thank you! There's a little mistake. It is the issue How do you think how we can fix this crash? I think if |
@bluca I don't think this is stack corruption. There are several pieces of evidence:
I didn't have a crash with
|
It is a libzmq issue. PyZMQ is not responsible for fixing it. The issue is reported at zeromq/libzmq#2942.
@bluca Can we exclude the topic length from the suspects? Is there any risk if we remove the assertion of |
I think at the moment the assertion can be removed. If the mtrie is completely empty, e.g., rm simply returns false. |
Sorry, but I'm not a huge fan of removing error checks - at the very least we need to understand what condition it was added to guard against, why it is wrong, and how to relax it. And possibly with tests - how is the unit test for mtrie coming along? Especially in this case - compiling with address sanitizer shows some stack smashing is going on:
Use -fsanitize=address with a decently recent version of GCC to see it |
@bluca I'm not doubtful of The So I suggest removing the assertion simply. Or we can check the pipe's subscription existence before processing |
@sigiesec Thank you for the idea! I'll send you a PR soon. |
Please don't - again:
Error checks must not be removed casually, they are there for a reason |
I am doubtful - the topic is 8 bytes, so why is your example requiring a larger array? It doesn't make any sense |
@bluca Okay, I won't remove the assertion. But is it okay that checking existence before |
I'm sorry for trying to remove the assertion hastily. But I want to move the conversation onto unpaired I think |
I've tried removing the assert Unhandled exception at 0x00007FFC3305DE0E (ntdll.dll) in test_1.exe: 0xC0000005: Access violation writing location 0x0000000000000024. It's because of simultaneous write access to family_entries map from multiple threads.
In Windows debug builds there's a mutex inside std::map which is locked on writes. I'm not sure what this is now... Regardless of that, this fixes the crash without changing anything in libzmq, adding delays around
It seems that the issue is with During the actual assert the
|
With a 1 millisecond delay on Windows, the unhandled exception still happens sometimes, but with a 10 millisecond delay everything works fine. |
@bjovke Thank you for inspecting this issue.
When I appended some logs to debug it, I didn't see duplicated
If you add some logging, you will see the same behavior. |
Well I didn't say that ZMQ_UNSUBSCRIBE messages are duplicates. Anyway, this should not be happening in any case. |
@bjovke Oh, I misunderstood. Anyways, int hwm = 0;
I don't agree with your guess. Here are counterexamples: This crash is not limited to the example code. My production servers are currently hitting it (I'll work around it with an unlimited HWM at the next maintenance window). The servers are distributed across multiple machines. The crash happens on 2 types of events:
A fresh cluster has same |
Yes, you're right about the network latency, I've removed that text from my comment. Everything works fine with hwm = 0 for some time and I get an already mentioned unhandled exception on Windows. But this is not related to this assert issue. Either the logic for detecting cases like this needs to be added to the code or it might be enough just to remove the assert. It needs to be investigated. |
I prefer removing the assertion and making Before we fix this issue, we must warn PUB/SUB pattern users about the crash. The |
@bluca Maybe I find some time tomorrow to add sufficient tests, so that we can discuss consistency of mtrie behaviour. At the moment my impression is that the assertion is too strict within mtrie, but it may well be worth an assertion at the call site. I did not dig into the larger picture yet. |
The assert is probably there in order to make sure that unsubscribe from a SUB actually deleted a corresponding pipe from But in this case SUB is sending unsubscribe to a subscription which never reached the PUB, thus the subscription is non-existent on PUB side. I guess then there will be no dangling pipes in the PUB? As far as I've looked at the code, the There's this code in
If the assert is removed
But, then again, this is definitely caused by I personally think that this needs to be fixed on both the PUB and the SUB, by removing assert on the PUB and fixing the |
@bjovke Thank you for agreeing with my hypothesis! I saw |
@bluca I'm looking forward to your opinion on my hypothesis: successful |
@sublee please try again from latest master |
@bluca Okay, I'll see. |
@bluca There's no more crash. Finally, this issue seems to be fixed! Thank you guys so much for the hard work to fix it. |
Issue description
When all of these conditions are satisfied, the assertion failure from mtrie.cpp occurs:
- PUB socket and many SUB sockets.
- SUB socket subscribe/unsubscribe many prefixes.
- zmq_getsockopt() with ZMQ_EVENTS for SUB sockets.

Environment

Minimal test code / Steps to reproduce the issue
To reproduce this crash, we should prepare a PUB socket and many SUB sockets.
We will call this sequence (pseudo-code): pub.connect(sub) or sub.connect(pub); pub.getsockopt(ZMQ_EVENTS); sub.subscribe(prefix); sub.getsockopt(ZMQ_EVENTS); sub.unsubscribe(prefix); sub.getsockopt(ZMQ_EVENTS). There will be many prefixes to subscribe/unsubscribe.
Calling getsockopt(ZMQ_EVENTS) after SUB's SUBSCRIBE/UNSUBSCRIBE, or PUB's zmq_connect() will produce a crash due to the assertion failure in mtrie_t::rm_helper.
You can switch PUB<->SUB connection topology by the pub_to_sub variable.

What's the actual result? (include assertion message & call stack if applicable)

What's the expected result?
When SUB sockets connect to the PUB socket, this crash doesn't happen.