
Crash when reallocating in rcl_clock_remove_jump_callback #938

Closed
maxlein opened this issue Dec 6, 2019 · 14 comments · Fixed by #954
Labels
bug Something isn't working

Comments

@maxlein

maxlein commented Dec 6, 2019

I don't really know how best to describe this.
My robot runs 24/7 with about 7 nodes (driver, laser, imu, localization, navigation, etc.).

Somehow the localization_node from slam_toolbox dies after ~1 to 3 hours; no other node does.
It seems obvious that the toolbox is doing something wrong, but the stack traces say otherwise.
-> toolbox issue
It crashes in the executor, the tf subscriber, and message_filter ( issue )

Maybe one of you has an idea what's going on here.
I am a bit clueless.

Bug report

Required Info:

  • Operating System:
    • Ubuntu 18.04, Kernel 5.0, Core i5, 8GB RAM
  • Installation type:
    • binaries
  • Version or commit hash:
    • Eloquent
  • DDS implementation:
    • CycloneDDS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

Some setup for getting slam_toolbox working is needed.

I am running a "standard" robot stack.
No simulation, real hardware.

scan->base_link->odom->map
localization_node of slam_toolbox needs to run; after some time (1 to 3 hours) it crashes.

Expected behavior

No node crashes.

Actual behavior

localization_node crashes with different errors

@jacobperron
Member

Based on your latter two stack traces it looks like the double-free is coming from rcl.

trace
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f66f8bc4380 (LWP 23341))]
(gdb) backtrace
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f66f6c60801 in __GI_abort () at abort.c:79
#2  0x00007f66f6ca9897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f66f6dd6b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007f66f6cb090a in malloc_printerr (str=str@entry=0x7f66f6dd8828 "double free or corruption (fasttop)") at malloc.c:5350
#4  0x00007f66f6cb31d5 in _int_free (av=0x7f66f700bc40 , p=0x5606adb31200, have_lock=) at malloc.c:4230
#5  0x00007f66f6cb579e in _int_realloc (av=av@entry=0x7f66f700bc40 , oldp=oldp@entry=0x5606adb2fc40, oldsize=oldsize@entry=5616, nb=nb@entry=5568) at malloc.c:4650
#6  0x00007f66f6cb8f9b in __GI___libc_realloc (oldmem=0x5606adb2fc50, bytes=5560) at malloc.c:3230
#7  0x00007f66f6562349 in rcl_clock_remove_jump_callback () from /opt/ros/eloquent/lib/librcl.so
#8  0x00007f66f6563d62 in rcl_timer_fini () from /opt/ros/eloquent/lib/librcl.so
#9  0x00007f66f7d29a92 in ?? () from /opt/ros/eloquent/lib/librclcpp.so
#10 0x00007f66f7cb05d2 in rclcpp::memory_strategies::allocator_memory_strategy::AllocatorMemoryStrategy >::remove_null_handles(rcl_wait_set_t*) () from /opt/ros/eloquent/lib/librclcpp.so
#11 0x00007f66f7ca2327 in rclcpp::executor::Executor::wait_for_work(std::chrono::duration >) () from /opt/ros/eloquent/lib/librclcpp.so
#12 0x00007f66f7ca2775 in rclcpp::executor::Executor::get_next_executable(rclcpp::executor::AnyExecutable&, std::chrono::duration >) () from /opt/ros/eloquent/lib/librclcpp.so
#13 0x00007f66f7ca66e0 in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/eloquent/lib/librclcpp.so
#14 0x00007f66f7ca33e2 in rclcpp::spin(std::shared_ptr) () from /opt/ros/eloquent/lib/librclcpp.so

In particular, the crash occurs in the function rcl_clock_remove_jump_callback() during a realloc. And since there's only one reallocation happening in that function, it must be this line:

https://github.com/ros2/rcl/blob/024c3ee50d9b79a85665efe1c191cc1225026519/rcl/src/rcl/time.c#L454-L456

Though nothing about the implementation stands out as wrong to me. It's possible there's a race between threads at the rclcpp layer.
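To illustrate the failure mode: the removal path above compacts a heap-allocated callback array and shrinks it with realloc. If two threads reach that realloc on the same clock concurrently, both can free/move the same block, which matches the "double free or corruption" abort in the backtrace. The sketch below uses hypothetical names (this is not the actual rcl API) and shows the pattern with a mutex serializing the add/remove paths:

```cpp
// Minimal sketch of a realloc-managed jump-callback array, with a mutex
// guarding it. Hypothetical names; not the real rcl implementation.
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <mutex>

using JumpCallback = void (*)();

struct Clock {
  JumpCallback * callbacks = nullptr;
  size_t size = 0;
  std::mutex mutex;  // serializes add/remove; without it, concurrent
                     // reallocs on the same array can double-free
};

bool add_jump_callback(Clock & clock, JumpCallback cb) {
  std::lock_guard<std::mutex> lock(clock.mutex);
  auto * grown = static_cast<JumpCallback *>(
    std::realloc(clock.callbacks, (clock.size + 1) * sizeof(JumpCallback)));
  if (!grown) {
    return false;
  }
  clock.callbacks = grown;
  clock.callbacks[clock.size++] = cb;
  return true;
}

bool remove_jump_callback(Clock & clock, JumpCallback cb) {
  std::lock_guard<std::mutex> lock(clock.mutex);
  for (size_t i = 0; i < clock.size; ++i) {
    if (clock.callbacks[i] != cb) {
      continue;
    }
    // Shift the tail down, then shrink -- the realloc seen in the backtrace.
    std::memmove(
      &clock.callbacks[i], &clock.callbacks[i + 1],
      (clock.size - i - 1) * sizeof(JumpCallback));
    --clock.size;
    if (clock.size == 0) {
      std::free(clock.callbacks);
      clock.callbacks = nullptr;
    } else {
      auto * shrunk = static_cast<JumpCallback *>(
        std::realloc(clock.callbacks, clock.size * sizeof(JumpCallback)));
      if (shrunk) {
        clock.callbacks = shrunk;
      }
    }
    return true;
  }
  return false;
}
```

The key point is that realloc is not thread-safe with respect to the same allocation: each caller must see a consistent pointer/size pair, which only a lock (or equivalent synchronization) guarantees.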

@maxlein
Author

maxlein commented Dec 11, 2019

So I have now throttled the scan rate from 25 Hz to 12.5 Hz, and it has been running for about 10 hours.
As I never reached 10 hours before, it looks like it's at least reducing the error frequency.

@ivanpauno
Member

It sounds like this:

// *INDENT-OFF*
// create shared_ptr that removes the callback automatically when all copies are destructed
// TODO(dorezyuk) UB, if the clock leaves scope before the JumpHandler
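To make the TODO concrete: if the shared_ptr's custom deleter calls back into the clock to unregister the callback, and the deleter holds a raw pointer to the clock, then destroying the clock before the last JumpHandler copy leaves the deleter dangling. A minimal illustration (hypothetical types, not rclcpp's actual API) of the hazard and one way to make the deleter safe, using a weak_ptr:

```cpp
// Illustration of the lifetime hazard in the TODO above. Hypothetical
// types; not the real rclcpp Clock/JumpHandler API.
#include <cassert>
#include <memory>

struct JumpHandler {};

struct Clock : std::enable_shared_from_this<Clock> {
  int registered = 0;

  std::shared_ptr<JumpHandler> create_jump_handler() {
    ++registered;
    // Capture a weak_ptr, not `this`: if the deleter captured a raw
    // pointer and the clock was destroyed first, the unregister call
    // would be the UB the TODO warns about.
    std::weak_ptr<Clock> weak_clock = weak_from_this();
    return std::shared_ptr<JumpHandler>(
      new JumpHandler(),
      [weak_clock](JumpHandler * handler) {
        if (auto clock = weak_clock.lock()) {
          --clock->registered;  // safe: the clock is still alive
        }
        delete handler;  // always release the handler itself
      });
  }
};
```

With the weak_ptr, destroying the clock before the handler simply makes the unregister step a no-op instead of a use-after-free.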

@ivanpauno ivanpauno added the bug Something isn't working label Dec 12, 2019
@maxlein
Author

maxlein commented Dec 16, 2019

Now I also tried amcl, and it resulted in a segfault after ~5 hours.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f926aacfd40 in ?? ()
[Current thread is 1 (Thread 0x7f9294cbf700 (LWP 5894))]
(gdb) bt
#0  0x00007f926aacfd40 in ?? ()
#1  0x00007f92a0ae25d2 in rclcpp::memory_strategies::allocator_memory_strategy::AllocatorMemoryStrategy<std::allocator<void> >::remove_null_handles(rcl_wait_set_t*) () from /opt/ros/eloquent/lib/librclcpp.so
#2  0x00007f92a0ad4327 in rclcpp::executor::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/eloquent/lib/librclcpp.so
#3  0x00007f92a0ad4775 in rclcpp::executor::Executor::get_next_executable(rclcpp::executor::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/eloquent/lib/librclcpp.so
#4  0x00007f92a0ad86e0 in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/eloquent/lib/librclcpp.so
#5  0x00007f929dba0faa in nav2_util::NodeThread::<lambda()>::operator() (__closure=0x56132f4a3638) at /home/kin/playground_ws/src/navigation2/nav2_util/src/node_thread.cpp:29
#6  std::__invoke_impl<void, nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > (__f=...) at /usr/include/c++/7/bits/invoke.h:60
#7  std::__invoke<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > (__fn=...) at /usr/include/c++/7/bits/invoke.h:95
#8  std::thread::_Invoker<std::tuple<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > >::_M_invoke<0> (this=0x56132f4a3638) at /usr/include/c++/7/thread:234
#9  std::thread::_Invoker<std::tuple<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > >::operator() (this=0x56132f4a3638) at /usr/include/c++/7/thread:243
#10 std::thread::_State_impl<std::thread::_Invoker<std::tuple<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > > >::_M_run(void) (this=0x56132f4a3630)

@ivanpauno
Member

@maxlein That error seems unrelated to the previous one.
If you have a simple example for reproducing the error, it would help in investigating the issue.

About the first one, check my previous comment #938 (comment).
I'm pretty sure the error comes from that, though I don't have time to patch it now.

@maxlein maxlein changed the title Strange behavior in slam_toolbox, seg faults, double free or corruption, etc.. Crash when reallocating in rcl_clock_remove_jump_callback Dec 19, 2019
@maxlein
Author

maxlein commented Dec 19, 2019

I don't feel I have the rcl knowledge to fix this; I might break more than I fix ;)
But is there anyone who can look at this?
I see this as a major bug if the system is not stable.
In just the last few hours, amcl and the nav2 controller also crashed with this error.

So either it happens only on my system, or I am the only person who uses ROS 2 for longer periods of time?

@ivanpauno
Member

@maxlein #954 should fix the original issue that you reported (the one related to the jump callback).
Can you check if that PR solves the problem?

@maxlein
Author

maxlein commented Dec 20, 2019

Thanks! This will take time to test, as I have no source build on the robot yet ...

@ivanpauno
Member

ivanpauno commented Dec 23, 2019

Reopening, until @maxlein confirms that #954 fixed the issue.

@ivanpauno
Member

@maxlein Friendly ping. Are you going to be able to test this?
If not, I will proceed to close this, and we can reopen later if needed.

@ivanpauno ivanpauno removed their assignment Jan 9, 2020
@maxlein
Author

maxlein commented Jan 10, 2020

The problem is that I can't reproduce this bug on my dev notebook, although it is nearly the same system...
It differs only in using simulated hardware/sensors.

Is there a date when this fix will be released in a deb pkg?

@ivanpauno
Member

The problem is that I can't reproduce this bug on my dev notebook, although it is nearly the same system...
It differs only in using simulated hardware/sensors.

Is there a date when this fix will be released in a deb pkg?

The problem is that the fix breaks ABI ...
Usually, we try to avoid breaking ABI unless there is no other option.

If that PR is really fixing something, I can ask for it to be backported and discuss whether breaking ABI is worth it.
But without confirmation, I'm not convinced we should include it in a patch release.
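For context on why a fix like this can break ABI: if the patch adds a field (e.g. synchronization state) to a struct that downstream binaries allocate themselves, code compiled against the old header still uses the old, smaller size. A small sketch with illustrative structs (not the real rcl types):

```cpp
// Why adding a member breaks ABI: binaries built against the v1 layout
// allocate sizeof(clock_v1) bytes, so a library updated to the v2 layout
// would read and write past the end of that allocation.
// Illustrative structs only; not the actual rcl types.
#include <cassert>
#include <cstddef>

struct clock_v1 {       // layout downstream binaries were compiled against
  void * callbacks;
  size_t num_callbacks;
};

struct clock_v2 {       // patched layout with an extra field for the fix
  void * callbacks;
  size_t num_callbacks;
  void * sync_impl;     // hypothetical new member added by the fix
};
```

Because the size and offsets change, every downstream binary would need a rebuild, which is why such fixes are normally held for the next release rather than a patch release.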

Do you have a way to test the fix without a deb pkg?

@maxlein
Author

maxlein commented Jan 10, 2020

I will have time to set up a source build on the robot in about a month at the earliest.
And I am not sure I can reproduce this: a few days ago, the crashes disappeared without any significant changes, which is a bit worrying to me.
But I will keep watching this.

@ivanpauno
Member

I will have time to set up a source build on the robot in about a month at the earliest.
And I am not sure I can reproduce this: a few days ago, the crashes disappeared without any significant changes, which is a bit worrying to me.
But I will keep watching this.

Sounds good.
I'm going to close this to avoid hitting it again in future ros2 triage meetings.
But of course, if you are still able to reproduce the problem, please reopen the issue.

Thanks @maxlein!

nnmm pushed a commit to ApexAI/rclcpp that referenced this issue Jul 9, 2022
…#938)

* Handle RCL_WARN_UNUSED
* Avoid duplicate type names

Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>