New invalid argument error in libzmq, mutex.hpp:142 #2991

Closed
nightlark opened this issue Mar 14, 2018 · 33 comments

Comments

@nightlark

Issue description

One of our unit test suites finishes all of its test cases (which start up and shut down zmq probably at least 50 times), but zmq then reports an invalid argument error after Boost prints the number of test cases that passed. We first observed this issue on March 9th, but we cache builds of dependencies, and the cached version on our develop branch was last updated on February 27. Switching our builds to the last tagged release (v4.2.3) makes the error go away.

The output for the zmq error: Invalid argument (/home/travis/build/GMLC-TDC/HELICS-src/libzmq/src/mutex.hpp:142)

Environment

Travis CI Linux VMs using gcc-6, gcc-4.9, and clang-3.6

  • libzmq version (commit hash if unreleased): One of the commits between February 27 and March 9.
  • OS: Linux

Minimal test code / Steps to reproduce the issue

  1. Checkout and build GMLC-TDC/HELICS-src develop branch
  2. Run key-tests test suite. I haven't narrowed it down to a specific zeromq commit, or line in our test suite. If that changes I'll provide an update here.

What's the actual result? (include assertion message & call stack if applicable)

https://travis-ci.org/GMLC-TDC/HELICS-src/jobs/353188544 - see error at the bottom

What's the expected result?

https://travis-ci.org/GMLC-TDC/HELICS-src/jobs/353393510

@bluca
Member

bluca commented Mar 14, 2018

It's fixed in the latest master

@bluca
Member

bluca commented Mar 15, 2018

@nightlark did you have any chance to try the latest master?

@nightlark
Author

I tried the latest commit that was made to master yesterday, and it still had the issue. I'm rerunning the build with the commit made today.

@bluca
Member

bluca commented Mar 15, 2018

Today's changes won't affect that. Could you try with this commit? 9bd2d3f

@nightlark
Author

nightlark commented Mar 15, 2018

I'm running a build with that commit now. Pretty much what I found yesterday from the merged PR commits is:

2960 - works
2967 - works
2970 - works
2971 - doesn't work
2973 - doesn't work
2974 - works (merge for commit 9bd2d3f)
2976 - doesn't work (merge for reverting 2974)
2956 - doesn't work

@sigiesec
Member

@bluca You probably assume that this is related to the mutex in random.cpp, right? Maybe that isn't the case at all. It would be good to have some more information, i.e. the stack traces of all threads when the assertion occurs.

@nightlark
Author

Commit 9bd2d3f works. I need to figure out how to get a stack trace out of travis.

@sigiesec
Member

Okay, then it probably is related to the destruction order somehow. I had a look at where context_t is used in your code, but did not find anything suspicious on a quick look.
I think it will be quite complex to get a stack trace out of travis.

Maybe you can reproduce it locally within gdb?

@nightlark
Author

I'll see if I can update my Linux vm and get a stack trace later today or tomorrow.

@nightlark
Author

I was able to reproduce it locally within gdb.

Program received signal SIGABRT, Aborted.
0x00002aaaacb9e1f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56

Thread 2 (Thread 0x2aaaad12c700 (LWP 158139)):

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00002aaaac1dc15c in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>)
    at /tmp/mockbuild/spack-stage/spack-stage-P3jF0z/spack-build/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:864
#2  std::condition_variable::wait (this=<optimized out>, __lock=...)
    at /builddir/build/BUILD/gccspack/spack/var/spack/stage/gcc-4.9.3-34vmmvov5ct2riknhkfcmxieffb5pyex/gcc-4.9.3/libstdc++-v3/src/c++11/condition_variable.cc:52
#3  0x000000000062a1f4 in BlockingQueue<std::pair<int, std::string> >::pop() ()
#4  0x00000000006280e4 in helics::LoggingCore::processingLoop() ()
#5  0x00000000006309bb in void std::_Mem_fn<void (helics::LoggingCore::*)()>::operator()<, void>(helics::LoggingCore*) const ()
#6  0x00000000006308cf in void std::_Bind_simple<std::_Mem_fn<void (helics::LoggingCore::*)()> (helics::LoggingCore*)>::_M_invoke<0ul>(std::_Index_tuple<0ul>) ()
#7  0x000000000063076d in std::_Bind_simple<std::_Mem_fn<void (helics::LoggingCore::*)()> (helics::LoggingCore*)>::operator()() ()
#8  0x0000000000630686 in std::thread::_Impl<std::_Bind_simple<std::_Mem_fn<void (helics::LoggingCore::*)()> (helics::LoggingCore*)> >::_M_run() ()
#9  0x00002aaaac1dfdc0 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>)
    at /builddir/build/BUILD/gccspack/spack/var/spack/stage/gcc-4.9.3-34vmmvov5ct2riknhkfcmxieffb5pyex/gcc-4.9.3/libstdc++-v3/src/c++11/thread.cc:84
#10 0x00002aaaac954e25 in start_thread (arg=0x2aaaad12c700) at pthread_create.c:308
#11 0x00002aaaacc6134d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 1 (Thread 0x2aaaaab19940 (LWP 158132)):

#0  0x00002aaaacb9e1f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00002aaaacb9f8e8 in __GI_abort () at abort.c:90
#2  0x00002aaaab2082be in zmq::zmq_abort(char const*) () from /g/g19/usrname/helics/zmq-dep/lib64/libzmq.so.5
#3  0x00002aaaab21f3cc in zmq::random_close() () from /g/g19/usrname/helics/zmq-dep/lib64/libzmq.so.5
#4  0x00002aaaab1fae52 in zmq::ctx_t::~ctx_t() () from /g/g19/usrname/helics/zmq-dep/lib64/libzmq.so.5
#5  0x00002aaaab1fb4e9 in zmq::ctx_t::terminate() () from /g/g19/usrname/helics/zmq-dep/lib64/libzmq.so.5
#6  0x00002aaaab2481ad in zmq_ctx_term () from /g/g19/usrname/helics/zmq-dep/lib64/libzmq.so.5
#7  0x0000000000646027 in zmq::context_t::close() ()
#8  0x0000000000645ffc in zmq::context_t::~context_t() ()
#9  0x0000000000646940 in std::default_delete<zmq::context_t>::operator()(zmq::context_t*) const ()
#10 0x00000000006462f9 in std::unique_ptr<zmq::context_t, std::default_delete<zmq::context_t> >::~unique_ptr() ()
#11 0x0000000000645dfe in zmqContextManager::~zmqContextManager() ()
#12 0x0000000000645e3e in zmqContextManager::~zmqContextManager() ()
#13 0x0000000000647716 in std::_Sp_counted_ptr<zmqContextManager*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
#14 0x00000000004197d0 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
#15 0x00000000004181cf in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() ()
#16 0x00000000006121b0 in std::__shared_ptr<zmqContextManager, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() ()
#17 0x00000000006121ca in std::shared_ptr<zmqContextManager>::~shared_ptr() ()
#18 0x0000000000647612 in std::pair<std::string const, std::shared_ptr<zmqContextManager> >::~pair() ()
#19 0x000000000064763c in void __gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<std::string const, std::shared_ptr<zmqContextManager> > > >::destroy<std::pair<std::string const, std::shared_ptr<zmqContextManager> > >(std::pair<std::string const, std::shared_ptr<zmqContextManager> >*) ()
#20 0x000000000064748d in std::enable_if<std::__and_<std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<std::string const, std::shared_ptr<zmqContextManager> > > > >::__destroy_helper<std::pair<std::string const, std::shared_ptr<zmqContextManager> > >::type>::value, void>::type std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<std::string const, std::shared_ptr<zmqContextManager> > > > >::_S_destroy<std::pair<std::string const, std::shared_ptr<zmqContextManager> > >(std::allocator<std::_Rb_tree_node<std::pair<std::string const, std::shared_ptr<zmqContextManager> > > >&, std::pair<std::string const, std::shared_ptr<zmqContextManager> >*) ()
#21 0x0000000000647398 in void std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<std::string const, std::shared_ptr<zmqContextManager> > > > >::destroy<std::pair<std::string const, std::shared_ptr<zmqContextManager> > >(std::allocator<std::_Rb_tree_node<std::pair<std::string const, std::shared_ptr<zmqContextManager> > > >&, std::pair<std::string const, std::shared_ptr<zmqContextManager> >*) ()
#22 0x0000000000647095 in std::_Rb_tree<std::string, std::pair<std::string const, std::shared_ptr<zmqContextManager> >, std::_Select1st<std::pair<std::string const, std::shared_ptr<zmqContextManager> > >, std::less<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<zmqContextManager> > > >::_M_destroy_node(std::_Rb_tree_node<std::pair<std::string const, std::shared_ptr<zmqContextManager> > >*) ()
#23 0x0000000000646b1d in std::_Rb_tree<std::string, std::pair<std::string const, std::shared_ptr<zmqContextManager> >, std::_Select1st<std::pair<std::string const, std::shared_ptr<zmqContextManager> > >, std::less<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<zmqContextManager> > > >::_M_erase(std::_Rb_tree_node<std::pair<std::string const, std::shared_ptr<zmqContextManager> > >*) ()
#24 0x00000000006464da in std::_Rb_tree<std::string, std::pair<std::string const, std::shared_ptr<zmqContextManager> >, std::_Select1st<std::pair<std::string const, std::shared_ptr<zmqContextManager> > >, std::less<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<zmqContextManager> > > >::~_Rb_tree() ()
#25 0x00000000006476dc in std::map<std::string, std::shared_ptr<zmqContextManager>, std::less<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<zmqContextManager> > > >::~map() ()
#26 0x00002aaaacba1a69 in __run_exit_handlers (status=0, listp=0x2aaaacf256c8 <__exit_funcs>,
    run_list_atexit=run_list_atexit@entry=true) at exit.c:77
#27 0x00002aaaacba1ab5 in __GI_exit (status=<optimized out>) at exit.c:99
#28 0x00002aaaacb8ac0c in __libc_start_main (main=0x4144b5 <main>, argc=1, ubp_av=0x7fffffffbda8, init=<optimized out>,
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffbd98) at ../csu/libc-start.c:308
#29 0x0000000000414369 in _start ()

@bluca
Member

bluca commented Mar 16, 2018

What's zmqContextManager ?

@nightlark
Author

It's a singleton wrapper around a pointer to the zmq context. Maybe there's some other things that should be getting done before the zmq context is released?

https://github.com/GMLC-TDC/HELICS-src/blob/master/src/helics/common/zmqContextManager.cpp
https://github.com/GMLC-TDC/HELICS-src/blob/master/src/helics/common/zmqContextManager.h

@sigiesec
Member

Is this cleanup done by a function explicitly registered by atexit? From the stack trace, it appears so (__run_exit_handlers). This cleanup function may only be registered after the first zmq context is created, since this triggers a library initialization, and the cleanup must be done in the reverse order.
Unfortunately, this libzmq library init/cleanup cannot be controlled explicitly at the moment.
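To illustrate the ordering constraint described above, here is a minimal sketch. It does not use libzmq. The `ExitHandlers` class and the log strings are hypothetical stand-ins that model only the last-in, first-out rule that `std::atexit` handlers and static-object destructors follow, so the effect can be observed without exiting the process.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the process exit-handler list. Real exit
// handlers (std::atexit, static destructors) run in reverse order of
// registration; this class models only that LIFO rule.
class ExitHandlers {
public:
    void add(std::function<void()> fn) { handlers_.push_back(std::move(fn)); }

    // Runs the handlers last-registered-first, as exit() would.
    void run() {
        while (!handlers_.empty()) {
            auto fn = std::move(handlers_.back());
            handlers_.pop_back();
            fn();
        }
    }

private:
    std::vector<std::function<void()>> handlers_;
};

// Returns the order in which the two cleanup steps run, depending on
// whether the application registered its cleanup before the library
// initialised itself (i.e. before the first context was created).
std::vector<std::string> shutdown_order(bool app_registers_first) {
    ExitHandlers exit_handlers;
    std::vector<std::string> log;
    auto app_cleanup = [&log] { log.push_back("app: terminate context"); };
    auto lib_cleanup = [&log] { log.push_back("lib: destroy global state"); };

    if (app_registers_first) {
        exit_handlers.add(app_cleanup);  // registered too early
        exit_handlers.add(lib_cleanup);  // library init happens afterwards
    } else {
        exit_handlers.add(lib_cleanup);  // first context already created
        exit_handlers.add(app_cleanup);
    }
    exit_handlers.run();
    return log;
}
```

With `shutdown_order(true)` the library state is destroyed first, so the later context termination would touch state that no longer exists. That is the situation the stack trace suggests. With `shutdown_order(false)` the context is terminated while the library state is still alive.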

@bluca
Member

bluca commented Mar 19, 2018

@sigiesec I think this fix is causing more troubles than the original bug it fixed :-( that's on me, as I started this whole global init/deinit rigamarole.

I think we should stop doing this with libsodium. The API does say it's not thread safe, but nobody ever reported an issue with it in all these years.

The original issue is that with Tweetnacl there was a file descriptor leak: #2632 which is nasty as at some point it causes long running processes which constantly create&destroy contexts (an annoying anti-pattern, but folks use it) to crash when the file descriptor limit is reached.

We don't recommend using Tweetnacl in production anyway, so anybody who faces this initialisation problem can just build with libsodium instead (they should anyway).
And even with Tweetnacl, I've recently implemented support for the new getrandom() system call, which completely removes the need to do any initialisation (there's no need to open /dev/urandom anymore). This is available with recent kernels and glibc versions, which will start to become common this year (the new Ubuntu LTS will have it, for example). So it won't be a problem for much longer.
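For readers unfamiliar with getrandom(), the sketch below shows the call on its own. It is not libzmq's code, and the helper name `fill_random` is made up for illustration. It assumes Linux with a glibc that provides `<sys/random.h>`. The point is that no file descriptor is opened or kept around, so there is nothing to initialise or clean up globally.

```cpp
#include <sys/random.h>  // getrandom(); assumes Linux with a recent glibc

#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: fills `out` with `n` random bytes from the kernel.
// Returns false if getrandom() fails for a reason other than EINTR.
bool fill_random(std::vector<std::uint8_t> &out, std::size_t n) {
    out.assign(n, 0);
    std::size_t done = 0;
    while (done < n) {
        // flags = 0: block until the kernel entropy pool is initialised.
        ssize_t rc = getrandom(out.data() + done, n - done, 0);
        if (rc < 0) {
            if (errno == EINTR)
                continue;  // interrupted by a signal, retry
            return false;
        }
        done += static_cast<std::size_t>(rc);
    }
    return true;
}
```

Because the function holds no state between calls, there is no open/close pair whose ordering relative to application exit handlers could go wrong.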

What do you think?

@sigiesec
Member

@bluca I am not completely sure what exactly you are referring to by "this fix". Are you suggesting reverting #2636?

If that's the case, I tend to agree :)

This is still not a clean solution. A clean AND easy-to-use solution would be to provide zmq_global_init and zmq_global_cleanup functions, which must be explicitly called by the application and by default initialize/cleanup libsodium, but can be configured (at compile time or via a parameter at runtime) not to initialize/cleanup libsodium implicitly. That configuration would be needed if the application, or another library libfoo used by the application, also depends on libsodium and uses the same instance of the library. The same would need to be done in libfoo, and only the application would call libsodium's init and cleanup functions. This is a general problem whenever a library with multiple incoming dependencies requires initialization and cleanup.

But as you said, since no one ever reported an issue with this in years, this probably has no priority. This is probably because an application usually initializes a single zmq context, or at least the first zmq context is initialized from the main thread, and probably does not use libsodium other than through libzmq.

Another "solution" might be to completely forbid static builds of libzmq, but this would break many existing uses (including ours).

@bluca
Member

bluca commented Mar 19, 2018

Not completely reverting - just removing if defined (ZMQ_USE_LIBSODIUM) so that it applies only to older Linux using Tweetnacl, which should be a minority of use cases or nil.

I really don't like the idea of adding more mandatory APIs - it makes using the library more difficult and convoluted. It's also a backward-incompatible change so we can't do it anyway :-/
Same for static builds, as you said it would break a lot of applications - the license was specifically amended to allow for static linking so it's a very popular feature.

@sigiesec
Member

Ok, but you refer to part of the changes from #2636. If you create a PR, I will give it another look to be sure we talk about the same thing.

Hm yes, maybe it could be done in a way that is not mandatory. rabbitmq-c has a similar problem with its use of openssl, but the solution there is also not completely clean. It lets the application optionally initialize openssl explicitly, and in that case it skips its own later implicit initialization of openssl. libprotobuf has an optional cleanup function. If it is not explicitly called, some tools might report memory leaks. But I am not sure if these cases are comparable.
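A rough sketch of the "optional explicit init" pattern described above. Every name here (`foo::global_init`, `foo::context_created`, and so on) is hypothetical. This is neither libzmq's nor rabbitmq-c's API, and the "backend" is just a flag standing in for something like a crypto library's global state.

```cpp
#include <mutex>

namespace foo {  // hypothetical library, for illustration only

namespace detail {
std::mutex mtx;
bool backend_ready = false;  // stands in for a backend's global state
bool app_owns_init = false;  // true once global_init() has been called
int live_contexts = 0;
int backend_inits = 0;       // counts real initialisations

void ensure_backend() {
    if (!backend_ready) {
        backend_ready = true;
        ++backend_inits;
    }
}
}  // namespace detail

// Optional: the application takes ownership of backend init/cleanup.
void global_init() {
    std::lock_guard<std::mutex> guard(detail::mtx);
    detail::ensure_backend();
    detail::app_owns_init = true;
}

// Optional counterpart; only has an effect after global_init().
void global_cleanup() {
    std::lock_guard<std::mutex> guard(detail::mtx);
    if (detail::app_owns_init) {
        detail::backend_ready = false;
        detail::app_owns_init = false;
    }
}

// Called internally whenever a context is created: implicit init.
void context_created() {
    std::lock_guard<std::mutex> guard(detail::mtx);
    detail::ensure_backend();
    ++detail::live_contexts;
}

// Called internally when a context is destroyed: implicit cleanup is
// skipped if the application owns the backend lifetime.
void context_destroyed() {
    std::lock_guard<std::mutex> guard(detail::mtx);
    --detail::live_contexts;
    if (detail::live_contexts == 0 && !detail::app_owns_init)
        detail::backend_ready = false;
}

bool is_ready() {
    std::lock_guard<std::mutex> guard(detail::mtx);
    return detail::backend_ready;
}

int backend_init_count() {
    std::lock_guard<std::mutex> guard(detail::mtx);
    return detail::backend_inits;
}

}  // namespace foo
```

Applications that never call `global_init()` keep the implicit behaviour, so the API stays backward compatible. Applications that do call it decide when the backend is torn down, regardless of how many contexts come and go in between.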

bluca added a commit to bluca/libzmq that referenced this issue Mar 19, 2018
Solution: restrict it only to the original issue zeromq#2632, Tweetnacl on
*NIX when using /dev/urandom, ie: without the new Linux getrandom()
syscall.

Existing applications might use atexit to register cleanup functions
(like CZMQ does), and the current change as-is imposes an ordering
that did not exist before - the context MUST be created BEFORE
registering the cleanup with atexit. This is a backward incompatible
change that is reported to cause aborts in some applications.

Although libsodium's documentation says that its initialisation APIs
is not thread-safe, nobody has ever reported an issue with it, so
avoiding the global init/deinit in the libsodium case is the less
risky option we have.

Tweetnacl users on Windows and on Linux with getrandom (glibc 2.25 and
Linux kernel 3.17) are not affected by the original issue.

Fixes zeromq#2991
@nightlark
Author

After trying with the changes in #3001, I'm still seeing the same mutex error. Is there a CMake option that can be used to tell it not to use the global random init/deinit? Something that achieves the same effect as overriding the value of ZMQ_HAVE_THREADSAFE_STATIC_LOCAL_INIT might work.

@bluca
Member

bluca commented Mar 19, 2018

There is no option and you shouldn't define anything new - are you using libsodium or tweetnacl?

@nightlark
Author

The build for the zmq library says it is using tweetnacl, but I don't think we are using any encryption in our code.

@bluca
Member

bluca commented Mar 19, 2018

Switch to libsodium - tweetnacl shouldn't be used anywhere but in development

@nightlark
Author

Thanks! libsodium doesn't have the mutex problem -- since we aren't using tweetnacl or libsodium, would ENABLE_CURVE=OFF disable both of them?

@bluca
Member

bluca commented Mar 19, 2018

Yes that's correct

@SergueiEK

SergueiEK commented Jul 10, 2019

Hello
Using zeromq 4.3.1 I got the exact same error. Is this a regression, or has the issue never been fixed? Here is how it looks in valgrind:

Invalid argument (src/mutex.hpp:142)
==400901==
==400901== Process terminating with default action of signal 6 (SIGABRT): dumping core
==400901== at 0xA1C9277: raise (in /usr/lib64/libc-2.17.so)
==400901== by 0xA1CA967: abort (in /usr/lib64/libc-2.17.so)
==400901== by 0xF656DAA: zmq::zmq_abort(char const*) (err.cpp:88)
==400901== by 0xF6842E3: lock (mutex.hpp:142)
==400901== by 0xF6842E3: scoped_lock_t (mutex.hpp:180)
==400901== by 0xF6842E3: manage_random(bool) (random.cpp:142)
==400901== by 0xF65FD69: zmq::ctx_t::~ctx_t() (ctx.cpp:120)
==400901== by 0xF660300: zmq::ctx_t::terminate() (ctx.cpp:202)
==400901== by 0xF6AA929: zmq_ctx_term (zmq.cpp:156)
==400901== by 0xF41131F: zsys_shutdown (zsys.c:247)
==400901== by 0xA1CCBD8: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==400901== by 0xA1CCC26: exit (in /usr/lib64/libc-2.17.so)
==400901== by 0xA1B544B: (below main) (in /usr/lib64/libc-2.17.so)

@bluca
Member

bluca commented Jul 10, 2019

It cannot be fixed. If your Linux distro doesn't have getrandom(), either use libsodium (you should do that anyway if running in production) or disable curve entirely.

@SergueiEK

Thank you for the clear answer!

@nightlark
Author

nightlark commented Jul 10, 2019

I came across this error recently again — apparently the copy of zmq distributed by homebrew on macOS uses tweetnacl. So macOS users may want to compile zmq themselves.

@bluca
Member

bluca commented Jul 10, 2019

That's not good. This is the formula: https://github.com/Homebrew/homebrew-core/blob/master/Formula/zeromq.rb are you able to test a change and send them a PR? I don't have OSX available

@nightlark
Author

I think I can. Should the default package be getting compiled with libsodium (or curve disabled)?

@bluca
Member

bluca commented Jul 10, 2019

I think defaulting to libsodium is a good idea, but not sure what the policy guidelines for homebrew are

lerwys added a commit to lnls-dig/bpm-app that referenced this issue Feb 27, 2020
lerwys added a commit to lnls-dig/halcs that referenced this issue Jul 1, 2020
This is needed for libzmq 4.2.5, so we use
libsodium and avoid issues zeromq/libzmq#2991
and zeromq/czmq#1299
nightlark added a commit to nightlark/homebrew-core that referenced this issue Dec 17, 2020
Using TweetNaCl can result in shutdown errors from libzmq, and it
is not recommended for "production" use. libsodium is the default
recommended by the libzmq maintainers as per the discussion in:
zeromq/libzmq#2991 (comment)
BrewTestBot pushed a commit to Homebrew/homebrew-core that referenced this issue Dec 17, 2020
Using TweetNaCl can result in shutdown errors from libzmq, and it
is not recommended for "production" use. libsodium is the default
recommended by the libzmq maintainers as per the discussion in:
zeromq/libzmq#2991 (comment)

Closes #67039.

Signed-off-by: Sean Molenaar <1484494+SMillerDev@users.noreply.github.com>
Signed-off-by: BrewTestBot <1589480+BrewTestBot@users.noreply.github.com>
@nightlark
Copy link
Author

The zeromq bottle on Homebrew now uses libsodium by default.

@themightyoarfish

I'm still seeing this problem with 4.3.4.

@themightyoarfish

When rebuilding with libsodium (sudo apt -y install libsodium-dev) the error seems to go away.
