Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rclcpp clock problems #1144

Closed
charlotteheggem opened this issue May 28, 2020 · 7 comments
Closed

Rclcpp clock problems #1144

charlotteheggem opened this issue May 28, 2020 · 7 comments

Comments

@charlotteheggem
Copy link

  • Operating System: Ubuntu 18.04.4
  • Installation type: Eloquent binaries
  • Version or commit hash: 0.8.4-1bionic.20200218.183043
  • DDS implementation: FastRTPS (also tested for Cyclone DDS)
  • Client library (if applicable): rclcpp

Hi,

We have experienced problems when running Navigation2, and after debugging with GDB it seems to be due to rclcpp. (ros-navigation/navigation2#1727). We are using binary packages for ROS Eloquent and rclcpp. Navigation2 is installed from source from the Eloquent branch. We are observing the same problem on two different computers. This is really bothering us, as these errors occur every time we are trying to run Navigation. Does anyone have any suggestions?
We have already tried to remove the binaries of rclcpp, and install them again. This did not help. The following tracebacks are from different runs of Navigation2.

traceback 1:

[INFO] [local_costmap.local_costmap_rclcpp_node]: Message Filter dropping message: frame 'scan_2' at time 1590649370.291 for reason 'Unknown'
double free or corruption (fasttop)
Thread 12 "controller_serv" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffe5be8700 (LWP 11290)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) backtrace
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff492c801 in __GI_abort () at abort.c:79
#2 0x00007ffff4975897 in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0x7ffff4aa2b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff497c90a in malloc_printerr (
str=str@entry=0x7ffff4aa4828 "double free or corruption (fasttop)")
at malloc.c:5350
#4 0x00007ffff497f1d5 in _int_free (av=0x7ffff4cd7c40 <main_arena>,
p=0x555555c311d0, have_lock=) at malloc.c:4230
#5 0x00007ffff498179e in _int_realloc (
av=av@entry=0x7ffff4cd7c40 <main_arena>, oldp=oldp@entry=0x555555c2c010,
oldsize=oldsize@entry=20976, nb=nb@entry=20928) at malloc.c:4650
#6 0x00007ffff4984f9b in __GI___libc_realloc (oldmem=0x555555c2c020,
bytes=20920) at malloc.c:3230
#7 0x00007ffff649e359 in rcl_clock_remove_jump_callback ()
from /opt/ros/eloquent/lib/librcl.so
#8 0x00007ffff649fd72 in rcl_timer_fini ()
from /opt/ros/eloquent/lib/librcl.so
#9 0x00007ffff796aac2 in ?? () from /opt/ros/eloquent/lib/librclcpp.so
#10 0x00007ffff796ae59 in rclcpp::TimerBase::~TimerBase() ()
from /opt/ros/eloquent/lib/librclcpp.so
#11 0x00007ffff31f8259 in ?? () from /opt/ros/eloquent/lib/libtf2_ros.so
#12 0x00007ffff31f8722 in tf2_ros::CreateTimerROS::remove(unsigned long const&)

traceback 2:

[Thread 0x7fff79f98700 (LWP 17801) exited]
double free or corruption (fasttop)
Thread 36 "planner_server" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff7b7fe700 (LWP 17768)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) backtrace
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff6844801 in __GI_abort () at abort.c:79
#2 0x00007ffff688d897 in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0x7ffff69bab9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff689490a in malloc_printerr (
str=str@entry=0x7ffff69bc828 "double free or corruption (fasttop)")
at malloc.c:5350
#4 0x00007ffff68971d5 in _int_free (av=0x7ffff6befc40 <main_arena>,
p=0x555556679d90, have_lock=) at malloc.c:4230
#5 0x00007ffff689979e in _int_realloc (
av=av@entry=0x7ffff6befc40 <main_arena>, oldp=oldp@entry=0x555556675eb0,
oldsize=oldsize@entry=16128, nb=nb@entry=16096) at malloc.c:4650
#6 0x00007ffff689cf9b in __GI___libc_realloc (oldmem=0x555556675ec0,
bytes=16080) at malloc.c:3230
#7 0x00007ffff5242359 in rcl_clock_remove_jump_callback ()
from /opt/ros/eloquent/lib/librcl.so
#8 0x00007ffff5243d72 in rcl_timer_fini ()
from /opt/ros/eloquent/lib/librcl.so
#9 0x00007ffff76e9ac2 in rclcpp::TimerBase::<lambda(rcl_timer_t*)>::operator()(rcl_timer_t *) (__closure=0x7fff8c0dad20, timer=0x7fff8c106520)
at ./src/rclcpp/timer.cpp:43
#10 0x00007ffff76e9e59 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fff8c0dad10) at /usr/include/c++/7/bits/shared_ptr_base.h:154

traceback 3:

malloc_consolidate(): invalid chunk size

Thread 10 "controller_serv" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffd75f5700 (LWP 13731)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) backtrace
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff492c801 in __GI_abort () at abort.c:79
#2 0x00007ffff4975897 in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0x7ffff4aa2b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff497c90a in malloc_printerr (
str=str@entry=0x7ffff4aa43f0 "malloc_consolidate(): invalid chunk size")
at malloc.c:5350
#4 0x00007ffff497cbae in malloc_consolidate (
av=av@entry=0x7ffff4cd7c40 <main_arena>) at malloc.c:4441
#5 0x00007ffff49807d8 in _int_malloc (
av=av@entry=0x7ffff4cd7c40 <main_arena>, bytes=bytes@entry=2993)
at malloc.c:3703
#6 0x00007ffff49818f5 in _int_realloc (
av=av@entry=0x7ffff4cd7c40 <main_arena>, oldp=oldp@entry=0x555555c48ba0,
oldsize=oldsize@entry=2976, nb=nb@entry=3008) at malloc.c:4570
#7 0x00007ffff4984f9b in __GI___libc_realloc (oldmem=0x555555c48bb0,
bytes=3000) at malloc.c:3230
#8 0x00007ffff649e150 in rcl_clock_add_jump_callback ()
from /opt/ros/eloquent/lib/librcl.so
#9 0x00007ffff649ed6c in rcl_timer_init ()
from /opt/ros/eloquent/lib/librcl.so
#10 0x00007ffff796b7c5 in rclcpp::TimerBase::TimerBase(std::shared_ptrrclcpp::Clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::shared_pt

traceback 4:

[INFO] [global_costmap.global_costmap_rclcpp_node]: Message Filter dropping message: frame 'scan' at time 1590674529.213 for reason 'Unknown'

[rcutils|error_handling.c:107] rcutils_set_error_state()
This error state is being overwritten:

'jump callback was not found, at /tmp/binarydeb/ros-eloquent-rcl-0.8.4/src/rcl/time.c:445'

with this new error message:

'callback/user_data are already added to this clock, at /tmp/binarydeb/ros-eloquent-rcl-0.8.4/src/rcl/time.c:403'

rcutils_reset_error() should be called after error handling to avoid this.
<<<
[ERROR] [rclcpp]: Couldn't initialize rcl timer handle: callback/user_data are already added to this clock, at /tmp/binarydeb/ros-eloquent-rcl-0.8.4/src/rcl/time.c:403
Thread 24 "planner_server" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffa27fc700 (LWP 18617)]
0x00007ffff5243eb1 in rcl_timer_is_canceled ()
from /opt/ros/eloquent/lib/librcl.so
(gdb) backtrace
#0 0x00007ffff5243eb1 in rcl_timer_is_canceled ()
from /opt/ros/eloquent/lib/librcl.so
#1 0x00007ffff52460ca in rcl_wait () from /opt/ros/eloquent/lib/librcl.so
#2 0x00007ffff7662331 in rclcpp::executor::Executor::wait_for_work (
this=this@entry=0x555555d28108, timeout=timeout@entry=...)
at ./src/rclcpp/executor.cpp:470
#3 0x00007ffff76627a5 in rclcpp::executor::Executor::get_next_executable (
this=this@entry=0x555555d28108, any_executable=...,
timeout=timeout@entry=...) at ./src/rclcpp/executor.cpp:573
#4 0x00007ffff7666710 in rclcpp::executors::SingleThreadedExecutor::spin (
this=0x555555d28108)
at ./src/rclcpp/executors/single_threaded_executor.cpp:35
#5 0x00007ffff62fd43a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<nav2_util::NodeThread::NodeThread(std::shared_ptrrclcpp::node_interfaces::NodeBaseInterface)::{lambda()#1}> > >::_M_run() ()
from /home/ninamwa/navigation2/install/nav2_util/lib/libnav2_util_core.so
#6 0x00007ffff6eca6df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007ffff73af6db in start_thread (arg=0x7fffa27fc700)
at pthread_create.c:463
#8 0x00007ffff692588f in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@clalancette
Copy link
Contributor

It is worthwhile to test out this branch: https://github.com/ros2/rclcpp/tree/clalancette/global-clock-lock . I'll be curious to know if that solves your problem.

@fujitatomoya
Copy link
Collaborator

traceback 1, 2 and 3 comes from multi-thread access to clock->jump_callbacks, related to #1056.

@ninamwa
Copy link

ninamwa commented Jun 4, 2020

Hi! We did check out the branch a week ago but it did not solve the problem. After the comment this morning about multi threads, we did a proper cleanup in some old threads. It still did not work with the debian rclcpp eloquent package, but the global-clock-lock branch worked like a charm! Thank you very much. We have been running Navigation2 all day without a problem! 🙂

@SteveMacenski
Copy link
Collaborator

@clalancette I think we should reopen this ticket until that PR lands in master

@clalancette
Copy link
Contributor

@clalancette I think we should reopen this ticket until that PR lands in master

Just to be clear, there is a fix for this on master (and will be in Foxy). The problem with that fix is that it breaks ABI, so that particular patch isn't eligible for Eloquent.

The global-clock-lock branch is an attempt to fix this in an ABI-compatible way for Eloquent. It does exactly what it says; it creates a global std::mutex that gets locked every time we manipulate the the underlying rcl clock structures. Being global, the downside to it is that it is shared across all Nodes in a process, so in a component-process, this can hurt.

#1056 is an attempt to mitigate some of that while still being ABI compatible for Eloquent, but I ran into some testing bugs with it that I haven't had time to look at it.

So given all of that, do you still want to keep this open? Personally I think just having the open PR for Eloquent is enough, but we can open this back up if you want to track the fix for Eloquent with this ticket.

@SteveMacenski
Copy link
Collaborator

SteveMacenski commented Jun 4, 2020

Got it, thanks. If its been merged into master, I'm OK with closing it.

@ControlLI
Copy link

Could any one can tell me, is that problem dealed in galactic branch? It seems I meet the same problem in my robot with galactic branch!
Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants