
Crash when reallocating in rcl_clock_remove_jump_callback #938

Closed
maxlein opened this issue Dec 6, 2019 · 14 comments · Fixed by #954
Labels
bug Something isn't working

Comments

@maxlein

maxlein commented Dec 6, 2019

I don't really know how best to describe this.
My robot runs 24/7 with about 7 nodes (driver, laser, imu, localization, navigation, etc.).

Somehow the localization_node from slam_toolbox dies after ~1 to 3 hours; no other node does.
It seems obvious that the toolbox is doing something wrong, but the stack traces say otherwise.
-> toolbox issue
It crashes in the executor, the tf subscriber, and message_filter ( issue )

Maybe one of you has an idea what's going on here.
I am a bit clueless.

Bug report

Required Info:

  • Operating System:
    • Ubuntu 18.04, Kernel 5.0, Core i5, 8GB RAM
  • Installation type:
    • binaries
  • Version or commit hash:
    • Eloquent
  • DDS implementation:
    • CycloneDDS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

Some setup for getting slam_toolbox working is needed.

I am running a "standard" robot stack.
No simulation, real hardware.

scan->base_link->odom->map
localization_node of slam_toolbox needs to run; after some time (1 to 3 hours) it crashes.

Expected behavior

No node crashes.

Actual behavior

localization_node crashes with different errors

@jacobperron
Member

Based on your latter two stack traces it looks like the double-free is coming from rcl.

trace
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f66f8bc4380 (LWP 23341))]
(gdb) backtrace
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f66f6c60801 in __GI_abort () at abort.c:79
#2  0x00007f66f6ca9897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f66f6dd6b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007f66f6cb090a in malloc_printerr (str=str@entry=0x7f66f6dd8828 "double free or corruption (fasttop)") at malloc.c:5350
#4  0x00007f66f6cb31d5 in _int_free (av=0x7f66f700bc40 , p=0x5606adb31200, have_lock=) at malloc.c:4230
#5  0x00007f66f6cb579e in _int_realloc (av=av@entry=0x7f66f700bc40 , oldp=oldp@entry=0x5606adb2fc40, oldsize=oldsize@entry=5616, nb=nb@entry=5568) at malloc.c:4650
#6  0x00007f66f6cb8f9b in __GI___libc_realloc (oldmem=0x5606adb2fc50, bytes=5560) at malloc.c:3230
#7  0x00007f66f6562349 in rcl_clock_remove_jump_callback () from /opt/ros/eloquent/lib/librcl.so
#8  0x00007f66f6563d62 in rcl_timer_fini () from /opt/ros/eloquent/lib/librcl.so
#9  0x00007f66f7d29a92 in ?? () from /opt/ros/eloquent/lib/librclcpp.so
#10 0x00007f66f7cb05d2 in rclcpp::memory_strategies::allocator_memory_strategy::AllocatorMemoryStrategy >::remove_null_handles(rcl_wait_set_t*) () from /opt/ros/eloquent/lib/librclcpp.so
#11 0x00007f66f7ca2327 in rclcpp::executor::Executor::wait_for_work(std::chrono::duration >) () from /opt/ros/eloquent/lib/librclcpp.so
#12 0x00007f66f7ca2775 in rclcpp::executor::Executor::get_next_executable(rclcpp::executor::AnyExecutable&, std::chrono::duration >) () from /opt/ros/eloquent/lib/librclcpp.so
#13 0x00007f66f7ca66e0 in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/eloquent/lib/librclcpp.so
#14 0x00007f66f7ca33e2 in rclcpp::spin(std::shared_ptr) () from /opt/ros/eloquent/lib/librclcpp.so

In particular, the crash occurs in the function rcl_clock_remove_jump_callback() during a realloc. And since there's only one reallocation happening in that function, it must be this line:

https://github.com/ros2/rcl/blob/024c3ee50d9b79a85665efe1c191cc1225026519/rcl/src/rcl/time.c#L454-L456

Though nothing about the implementation stands out as wrong to me. It's possible there's a race between threads at the rclcpp layer.
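To illustrate the failure mode: the removal path above compacts a heap-allocated callback array and shrinks it with realloc. If two threads reach that realloc on the same clock concurrently, both can free/move the same block, which matches the "double free or corruption" abort in the backtrace. The sketch below uses hypothetical names (this is not the actual rcl API) and shows the pattern with a mutex serializing the add/remove paths:

```cpp
// Minimal sketch of a realloc-managed jump-callback array, with a mutex
// guarding it. Hypothetical names; not the real rcl implementation.
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <mutex>

using JumpCallback = void (*)();

struct Clock {
  JumpCallback * callbacks = nullptr;
  size_t size = 0;
  std::mutex mutex;  // serializes add/remove; without it, concurrent
                     // reallocs on the same array can double-free
};

bool add_jump_callback(Clock & clock, JumpCallback cb) {
  std::lock_guard<std::mutex> lock(clock.mutex);
  auto * grown = static_cast<JumpCallback *>(
    std::realloc(clock.callbacks, (clock.size + 1) * sizeof(JumpCallback)));
  if (!grown) {
    return false;
  }
  clock.callbacks = grown;
  clock.callbacks[clock.size++] = cb;
  return true;
}

bool remove_jump_callback(Clock & clock, JumpCallback cb) {
  std::lock_guard<std::mutex> lock(clock.mutex);
  for (size_t i = 0; i < clock.size; ++i) {
    if (clock.callbacks[i] != cb) {
      continue;
    }
    // Shift the tail down, then shrink -- the realloc seen in the backtrace.
    std::memmove(
      &clock.callbacks[i], &clock.callbacks[i + 1],
      (clock.size - i - 1) * sizeof(JumpCallback));
    --clock.size;
    if (clock.size == 0) {
      std::free(clock.callbacks);
      clock.callbacks = nullptr;
    } else {
      auto * shrunk = static_cast<JumpCallback *>(
        std::realloc(clock.callbacks, clock.size * sizeof(JumpCallback)));
      if (shrunk) {
        clock.callbacks = shrunk;
      }
    }
    return true;
  }
  return false;
}
```

The key point is that realloc is not thread-safe with respect to the same allocation: each caller must see a consistent pointer/size pair, which only a lock (or equivalent synchronization) guarantees.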

@maxlein
Author

maxlein commented Dec 11, 2019

So I have now throttled the scan rate from 25 Hz to 12.5 Hz, and it has been running for about 10 hours.
As I never reached 10 hours before, it looks like it's at least reducing the error frequency.

@ivanpauno
Member

It sounds like this:

// *INDENT-OFF*
// create shared_ptr that removes the callback automatically when all copies are destructed
// TODO(dorezyuk) UB, if the clock leaves scope before the JumpHandler
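To make the TODO concrete: if the shared_ptr's custom deleter calls back into the clock to unregister the callback, and the deleter holds a raw pointer to the clock, then destroying the clock before the last JumpHandler copy leaves the deleter dangling. A minimal illustration (hypothetical types, not rclcpp's actual API) of the hazard and one way to make the deleter safe, using a weak_ptr:

```cpp
// Illustration of the lifetime hazard in the TODO above. Hypothetical
// types; not the real rclcpp Clock/JumpHandler API.
#include <cassert>
#include <memory>

struct JumpHandler {};

struct Clock : std::enable_shared_from_this<Clock> {
  int registered = 0;

  std::shared_ptr<JumpHandler> create_jump_handler() {
    ++registered;
    // Capture a weak_ptr, not `this`: if the deleter captured a raw
    // pointer and the clock was destroyed first, the unregister call
    // would be the UB the TODO warns about.
    std::weak_ptr<Clock> weak_clock = weak_from_this();
    return std::shared_ptr<JumpHandler>(
      new JumpHandler(),
      [weak_clock](JumpHandler * handler) {
        if (auto clock = weak_clock.lock()) {
          --clock->registered;  // safe: the clock is still alive
        }
        delete handler;  // always release the handler itself
      });
  }
};
```

With the weak_ptr, destroying the clock before the handler simply makes the unregister step a no-op instead of a use-after-free.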

@ivanpauno ivanpauno added the bug Something isn't working label Dec 12, 2019
@maxlein
Author

maxlein commented Dec 16, 2019

Now I also tried amcl, and it resulted in a segfault after ~5 hours.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f926aacfd40 in ?? ()
[Current thread is 1 (Thread 0x7f9294cbf700 (LWP 5894))]
(gdb) bt
#0  0x00007f926aacfd40 in ?? ()
#1  0x00007f92a0ae25d2 in rclcpp::memory_strategies::allocator_memory_strategy::AllocatorMemoryStrategy<std::allocator<void> >::remove_null_handles(rcl_wait_set_t*) () from /opt/ros/eloquent/lib/librclcpp.so
#2  0x00007f92a0ad4327 in rclcpp::executor::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/eloquent/lib/librclcpp.so
#3  0x00007f92a0ad4775 in rclcpp::executor::Executor::get_next_executable(rclcpp::executor::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/eloquent/lib/librclcpp.so
#4  0x00007f92a0ad86e0 in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/eloquent/lib/librclcpp.so
#5  0x00007f929dba0faa in nav2_util::NodeThread::<lambda()>::operator() (__closure=0x56132f4a3638) at /home/kin/playground_ws/src/navigation2/nav2_util/src/node_thread.cpp:29
#6  std::__invoke_impl<void, nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > (__f=...) at /usr/include/c++/7/bits/invoke.h:60
#7  std::__invoke<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > (__fn=...) at /usr/include/c++/7/bits/invoke.h:95
#8  std::thread::_Invoker<std::tuple<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > >::_M_invoke<0> (this=0x56132f4a3638) at /usr/include/c++/7/thread:234
#9  std::thread::_Invoker<std::tuple<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > >::operator() (this=0x56132f4a3638) at /usr/include/c++/7/thread:243
#10 std::thread::_State_impl<std::thread::_Invoker<std::tuple<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > > >::_M_run(void) (this=0x56132f4a3630)

@ivanpauno
Member

@maxlein That error seems unrelated to the previous one.
If you have a simple example for reproducing the error, it would help in investigating the issue.

About the first one, check my previous comment #938 (comment).
I'm pretty sure the error comes from that, though I don't have time to patch it now.

@maxlein maxlein changed the title Strange behavior in slam_toolbox, seg faults, double free or corruption, etc.. Crash when reallocating in rcl_clock_remove_jump_callback Dec 19, 2019
@maxlein
Author

maxlein commented Dec 19, 2019

I don't feel I have the rcl knowledge to fix this; I might break more than I fix ;)
But is there anyone who can look at this?
I see this as a major bug if the system is not stable.
In just the last few hours, amcl and the nav2 controller also crashed with this error.

So either it happens only on my system, or I am the only person who uses ROS 2 for longer periods of time?

@ivanpauno
Member

@maxlein #954 should fix the original issue that you reported (the one related to the jump callback).
Can you check if that PR solves the problem?

@maxlein
Author

maxlein commented Dec 20, 2019

Thanks! This will take time to test, as I have no source build on the robot yet ...

@ivanpauno
Member

ivanpauno commented Dec 23, 2019

Reopening, until @maxlein confirms that #954 fixed the issue.

@ivanpauno
Member

@maxlein Friendly ping. Are you going to be able to test this?
If not, I will proceed to close this, and we can reopen later if needed.

@ivanpauno ivanpauno removed their assignment Jan 9, 2020
@maxlein
Author

maxlein commented Jan 10, 2020

The problem is that I can't reproduce this bug on my dev notebook, although it is nearly the same system...
It differs only in using simulated hardware/sensors.

Is there a date when this fix will be released in a deb pkg?

@ivanpauno
Member

The problem is that I can't reproduce this bug on my dev notebook, although it is nearly the same system...
It differs only in using simulated hardware/sensors.

Is there a date when this fix will be released in a deb pkg?

The problem is that the fix breaks ABI ...
Usually, we try to avoid breaking ABI unless there is no other option.

If that PR is really fixing something, I can ask for it to be backported and discuss whether breaking ABI is worth it.
But without confirmation, I'm not convinced we should include it in a patch release.
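For context on why a fix like this can break ABI: if the patch adds a field (e.g. synchronization state) to a struct that downstream binaries allocate themselves, code compiled against the old header still uses the old, smaller size. A small sketch with illustrative structs (not the real rcl types):

```cpp
// Why adding a member breaks ABI: binaries built against the v1 layout
// allocate sizeof(clock_v1) bytes, so a library updated to the v2 layout
// would read and write past the end of that allocation.
// Illustrative structs only; not the actual rcl types.
#include <cassert>
#include <cstddef>

struct clock_v1 {       // layout downstream binaries were compiled against
  void * callbacks;
  size_t num_callbacks;
};

struct clock_v2 {       // patched layout with an extra field for the fix
  void * callbacks;
  size_t num_callbacks;
  void * sync_impl;     // hypothetical new member added by the fix
};
```

Because the size and offsets change, every downstream binary would need a rebuild, which is why such fixes are normally held for the next release rather than a patch release.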

Do you have a way to test the fix without a deb pkg?

@maxlein
Author

maxlein commented Jan 10, 2020

I will have time to set up a source build on the robot in about a month at the earliest.
And I am not sure I can reproduce this: a few days ago, the crashes disappeared without any significant changes, which is a bit worrying to me.
But I will keep watching this.

@ivanpauno
Member

I will have time to set up a source build on the robot in about a month at the earliest.
And I am not sure I can reproduce this: a few days ago, the crashes disappeared without any significant changes, which is a bit worrying to me.
But I will keep watching this.

Sounds good.
I'm going to close this to avoid hitting it again in future ros2 triage meetings.
But of course, if you are still able to reproduce the problem, please reopen the issue.

Thanks @maxlein!

nnmm pushed a commit to ApexAI/rclcpp that referenced this issue Jul 9, 2022
…#938)

* Handle RCL_WARN_UNUSED
* Avoid duplicate type names

Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>