
rclcpp_lifecycle::LifecycleNode::get_current_state is not thread safe #1746

Closed
igorbanfi opened this issue Aug 9, 2021 · 7 comments · Fixed by #1756
@igorbanfi

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Installation type:
    • binaries
  • Version or commit hash:
    • foxy
  • DDS implementation:
    • Fast DDS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

In the code below, get_current_state is called from two threads. After a random number of iterations, the error `Error in state! Internal state_handle is NULL.` is thrown when calling `id()` on the returned state.

The problem disappears if only one thread calls the get_current_state function.

#include <chrono>
#include <iostream>
#include <memory>
#include <thread>
#include "lifecycle_msgs/msg/transition.hpp"
#include "rclcpp/rclcpp.hpp"
#include "rclcpp/publisher.hpp"
#include "rclcpp_lifecycle/lifecycle_node.hpp"
#include "rcutils/logging_macros.h"

/// Get current node state in a loop
void checkState(std::shared_ptr<rclcpp_lifecycle::LifecycleNode> node) {

   // infinite loop
   while (1) {
      // This call fails
      node->get_current_state().id();
   }
}

int main(int argc, char * argv[])
{
   setvbuf(stdout, NULL, _IONBF, BUFSIZ);

   rclcpp::init(argc, argv);

   std::shared_ptr<rclcpp_lifecycle::LifecycleNode> lc_node =
           std::make_shared<rclcpp_lifecycle::LifecycleNode>("lc_talker");

   lc_node->trigger_transition(lifecycle_msgs::msg::Transition::TRANSITION_CONFIGURE);

   // Created multiple threads
   std::thread thread_object_1(checkState, lc_node);
   std::thread thread_object_2(checkState, lc_node);

   // wait for both threads (they loop forever, so join() never returns)
   thread_object_1.join();
   thread_object_2.join();

   rclcpp::shutdown();

   return 0;
}
@devrite
Contributor

devrite commented Aug 9, 2021

This is true. get_current_state queries the current state from the impl, assigns it to a member variable, and returns a reference to it.

const State &
get_current_state()
{
  current_state_ = State(state_machine_.current_state);
  return current_state_;
}
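The race in the snippet above can be shown without ROS at all. The following standalone sketch contrasts the cached-member pattern with a by-value return; the `Machine` class, `internal_id_`, and the state labels are made up for illustration and are not the real rclcpp_lifecycle types:

```cpp
#include <mutex>
#include <string>

struct State {
  int id;
  std::string label;
};

class Machine {
public:
  // Buggy pattern from the issue: every call overwrites current_state_,
  // while another thread may still be reading through the reference
  // returned earlier (in rclcpp_lifecycle, the assignment frees
  // state_handle_ out from under that reader).
  const State & get_current_state_unsafe() {
    current_state_ = State{internal_id_, "active"};
    return current_state_;
  }

  // By-value pattern: each caller gets its own copy, so there is no
  // shared temporary to race on (reading the internal state still
  // needs a lock if writers can run concurrently).
  State get_current_state_safe() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return State{internal_id_, "active"};
  }

private:
  mutable std::mutex mutex_;
  int internal_id_ = 3;
  State current_state_;
};
```

Returning by value changes the public signature from `const State &` to `State`, which is the API change discussed further down in this thread.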

Solution (if it is safe to query the internal state from multiple threads): do not assign to the member variable; just return the constructed value.

But since a state may change in parallel (outside the service call, e.g. processed by some callback) with other threads querying it, there should probably be locks.

@fujitatomoya fujitatomoya added the bug Something isn't working label Aug 9, 2021
@fujitatomoya
Collaborator

This is a bug; I confirmed that the core crash happens on mainline.

Do not set the local var, just return the constructed value.

For doing this, I think we need to change the API to return the object by value.

But since a state may change in parallel (outside the service call, e.g. processed by some callback) with other threads querying it, there should probably be locks.

Right. I believe that state_machine_ needs to be protected by a mutex lock. (Note that we need to release the lock when calling the user callback.)
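The locking pattern described above can be sketched in plain C++. This is a simplified illustration, not the actual rclcpp_lifecycle implementation: `LifecycleImpl`, `trigger_transition`, and the integer state codes are hypothetical stand-ins. The key point is releasing the mutex while the user callback runs, so the callback can itself query the state without deadlocking:

```cpp
#include <functional>
#include <mutex>

class LifecycleImpl {
public:
  // Drive a transition: hold the mutex for pre/post bookkeeping, but
  // unlock it around the user callback.
  int trigger_transition(const std::function<void(int)> & user_cb) {
    std::unique_lock<std::mutex> lock(state_machine_mutex_);
    current_state_ = kTransitioning;   // pre: mark transition started
    int snapshot = current_state_;     // copy while still holding the lock
    lock.unlock();                     // never hold the lock in user code
    user_cb(snapshot);                 // may call get_current_state()
    lock.lock();                       // post: reacquire and finish
    current_state_ = kActive;
    return current_state_;
  }

  int get_current_state() {
    std::lock_guard<std::mutex> lock(state_machine_mutex_);
    return current_state_;
  }

  static constexpr int kTransitioning = 1;
  static constexpr int kActive = 2;

private:
  std::mutex state_machine_mutex_;
  int current_state_ = 0;
};
```

Note that passing a snapshot taken under the lock (rather than reading the member inside the callback) keeps the unlocked region free of unsynchronized access to `current_state_`.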

fujitatomoya added a commit to fujitatomoya/ros2_test_prover that referenced this issue Aug 11, 2021
@devrite
Contributor

devrite commented Aug 11, 2021

Seems reasonable to unlock while running the user callback. That requires that pre, cb, and post do not need to form one atomic operation; if so, an unlock can simply be done. Otherwise, if no other changing/writing calls from other threads are allowed between releasing the lock after pre and reacquiring it for cb or post, we can only unlock if other writing API calls recognize an ongoing atomic transaction (set up by code in pre). Such calls would then either abort with an "ongoing transaction" error, or possibly wait on a condition variable.
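The "recognize an ongoing atomic transaction" idea above can be sketched with an atomic flag. This is a hypothetical illustration (the `TransitionGuard` name and API are made up): writers that find a transition already in flight fail fast instead of interleaving with the pre/cb/post sequence.

```cpp
#include <atomic>

class TransitionGuard {
public:
  // Returns true iff this thread acquired the transition slot;
  // a second writer gets false ("ongoing transaction" error path).
  bool try_begin() {
    bool expected = false;
    return in_transition_.compare_exchange_strong(expected, true);
  }

  // Release the slot after the post step completes.
  void end() { in_transition_.store(false); }

private:
  std::atomic<bool> in_transition_{false};
};
```

A condition-variable variant would have the second writer wait for `end()` instead of returning an error; which behavior is right depends on the desired API semantics.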

The other option is to provide an API/impl to the callback that does not need to lock, but I guess that would be a breaking change.

@fujitatomoya
Collaborator

fujitatomoya commented Aug 25, 2021

(Rephrasing more precisely)

right. I believe that state_machine_ needs to be protected by mutex lock.

This is still correct, but it is not exactly the direct reason for the coredump; we can see the following backtrace.

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f6558cb6859 in __GI_abort () at abort.c:79
#2  0x00007f6558d213ee in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f6558e4b285 "%s\n")
    at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007f6558d2947c in malloc_printerr (str=str@entry=0x7f6558e4d1e0 "munmap_chunk(): invalid pointer") at malloc.c:5347
#4  0x00007f6558d296cc in munmap_chunk (p=<optimized out>) at malloc.c:2830
#5  0x00007f65590c0029 in __default_deallocate () from /root/ros2_ws/colcon_ws/install/rcutils/lib/librcutils.so
#6  0x00007f6558c898f7 in rcl_lifecycle_state_fini () from /root/ros2_ws/colcon_ws/install/rcl_lifecycle/lib/librcl_lifecycle.so
#7  0x00007f6559ee0132 in rclcpp_lifecycle::State::reset (this=0x55a9101de3a0)
    at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_lifecycle/src/state.cpp:157
#8  0x00007f6559edfd27 in rclcpp_lifecycle::State::operator= (this=0x55a9101de3a0, rhs=...)
    at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_lifecycle/src/state.cpp:97
#9  0x00007f6559ea191c in rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::get_current_state (this=0x55a9101de310)
    at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_lifecycle/src/lifecycle_node_interface_impl.hpp:338
#10 0x00007f6559e9db4c in rclcpp_lifecycle::LifecycleNode::get_current_state (this=0x55a9101c4440)
    at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_lifecycle/src/lifecycle_node.cpp:525
#11 0x000055a90f5f978f in checkState (node=std::shared_ptr<class rclcpp_lifecycle::LifecycleNode> (use count 3, weak count 1) = {...})
    at /root/ros2_ws/colcon_ws/src/ros2/ros2_test_prover/prover_rclcpp/src/rclcpp_1746.cpp:18
#12 0x000055a90f5fd1f2 in std::__invoke_impl<void, void (*)(std::shared_ptr<rclcpp_lifecycle::LifecycleNode>), std::shared_ptr<rclcpp_lifecycle::LifecycleNode> > (__f=@0x55a91023efd8: 0x55a90f5f976b <checkState(std::shared_ptr<rclcpp_lifecycle::LifecycleNode>)>)
    at /usr/include/c++/9/bits/invoke.h:60
#13 0x000055a90f5fd0b5 in std::__invoke<void (*)(std::shared_ptr<rclcpp_lifecycle::LifecycleNode>), std::shared_ptr<rclcpp_lifecycle::LifecycleNode> > (__fn=@0x55a91023efd8: 0x55a90f5f976b <checkState(std::shared_ptr<rclcpp_lifecycle::LifecycleNode>)>)
    at /usr/include/c++/9/bits/invoke.h:95
#14 0x000055a90f5fcfad in std::thread::_Invoker<std::tuple<void (*)(std::shared_ptr<rclcpp_lifecycle::LifecycleNode>), std::shared_ptr<rclcpp_lifecycle::LifecycleNode> > >::_M_invoke<0ul, 1ul> (this=0x55a91023efc8) at /usr/include/c++/9/thread:244
#15 0x000055a90f5fce96 in std::thread::_Invoker<std::tuple<void (*)(std::shared_ptr<rclcpp_lifecycle::LifecycleNode>), std::shared_ptr<rclcpp_lifecycle::LifecycleNode> > >::operator() (this=0x55a91023efc8) at /usr/include/c++/9/thread:251
#16 0x000055a90f5fcd82 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(std::shared_ptr<rclcpp_lifecycle::LifecycleNode>), std::shared_ptr<rclcpp_lifecycle::LifecycleNode> > > >::_M_run (this=0x55a91023efc0) at /usr/include/c++/9/thread:195
#17 0x00007f6558f99de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#18 0x00007f6558e8c609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#19 0x00007f6558db3293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Protecting state_machine_ with a mutex does solve the invalid pointer problem, but State::state_handle_ still has a race condition without a mutex lock.

@fujitatomoya
Collaborator

Addressed the race condition with #1756.

Do not set the local var, just return the constructed value.
for doing this, I think we need to change the API to return the object.

Besides, we probably want to do this for user experience? Any opinions?

@alsora
Collaborator

alsora commented Sep 9, 2021

+1 on this.
Thank you @fujitatomoya for the work!

I also noticed the crash when concurrently invoking a lifecycle node state transition while the node was checking the current state.

@clalancette clalancette self-assigned this Sep 23, 2021
@fujitatomoya
Collaborator

Protecting state_machine_ with a mutex does solve the invalid pointer problem, but State::state_handle_ still has a race condition without a mutex lock.

453bfa8 resolves this race condition, confirmed with https://github.com/fujitatomoya/ros2_test_prover/blob/master/prover_rclcpp/src/rclcpp_1746.cpp
