Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply a fix for LifecycleNode::get_current_state thread safety which is equivalent to #1756 into Humble #2172

Closed
schornakj opened this issue Apr 25, 2023 · 5 comments
Labels
backlog bug Something isn't working

Comments

@schornakj
Copy link

schornakj commented Apr 25, 2023

There is a thread safety bug in LifecycleNode which was originally documented in #1746 and fixed in #1756. However, the way the fix was implemented involves a breaking ABI change, and since it was merged after the Humble release it cannot be directly backported to Humble.

#1746 is actually a quite severe bug, especially for systems which are using ros2_control's ControllerManager, which is tightly integrated with the lifecycle node concept. My context here is that I work with several robots which need to stay up and continuously running 24/7, and every few days I see a segfault like the ones posted in ros-controls/ros2_control#979. From my perspective the fix for this certainly seems like something that should be backported into the LTS release as well.

From reading the discussion in #1756 and related PRs, one reason why a fix took so long to get merged and involved an ABI break is that the work to implement the fix uncovered several other bugs in LifecycleNode, and the fixes for those bugs needed to get merged as a precondition for merging #1756. If I want to apply an equivalent fix to Humble, I'll need to follow a different path since I need to preserve ABI compatibility. Does anyone who's already familiar with these libraries have ideas for how to approach this?

@clalancette
Copy link
Contributor

I'll leave a few comments:

  • Backporting Cleanup the lifecycle implementation #2027 (one of the underlying PRs) is not going to be possible, as it breaks API and ABI. So skip this one.
  • Backporting Make lifecycle impl get_current_state() const. #2031 (one of the underlying PRs) breaks ABI, so we can't backport it. This one is minor anyway.
  • That leaves backporting Bugfix 20210810 get current state #1756 itself. If you go back to my original comments from Bugfix 20210810 get current state #1756 (comment), my major concern here was returning an internal variable member from get_current_state. We aren't going to be able to fix that one because of ABI problems, but it may be enough to just take the rest of that PR as it was at the time and leave the problematic get_current_state as it is.
  • Note that even with the previous bullet point, we still have to maintain ABI in humble. So, for instance, we can't add the new state_handle_mutex_ member to the State class or the state_machine_mutex_ to the LifecycleNodeInterface class. One trick I've used in the past to get around this is to introduce a global map with a pointer to the class as the key and the mutex you want to use as the value. Then when you want to lock in a function, you look up the mutex using this and then lock it. Note that the global map also needs its own mutex to avoid multiple threads accessing it at once.

I hope this helps point in the correct direction.

@clalancette clalancette added bug Something isn't working backlog labels Apr 27, 2023
@schornakj
Copy link
Author

Note that even with the previous bullet point, we still have to maintain ABI in humble. So, for instance, we can't add the new state_handle_mutex_ member to the State class or the state_machine_mutex_ to the LifecycleNodeInterface class. One trick I've used in the past to get around this is to introduce a global map with a pointer to the class as the key and the mutex you want to use as the value. Then when you want to lock in a function, you look up the mutex using this and then lock it. Note that the global map also needs its own mutex to avoid multiple threads accessing it at once.

The bulk of the changes in #1756 are to lifecycle_node_interface_impl.hpp and lifecycle_node_interface_impl.cpp, and I'm wondering if the LifecycleNodeInterfaceImpl class which is defined in those files is really a part of the public interface? It's used as a private unique_ptr member with a forward declaration here in LifecycleNode, and both the header and source file for LifecycleNodeInterfaceImpl are in rclcpp_lifecycle/src. That seems like a textbook example of the pimpl pattern commonly used to allow making internal changes to a class while preserving ABI compatibility. @clalancette am I thinking along the right track here?

@clalancette
Copy link
Contributor

The bulk of the changes in #1756 are to lifecycle_node_interface_impl.hpp and lifecycle_node_interface_impl.cpp, and I'm wondering if the LifecycleNodeInterfaceImpl class which is defined in those files is really a part of the public interface? It's used as a private unique_ptr member with a forward declaration here in LifecycleNode, and both the header and source file for LifecycleNodeInterfaceImpl are in rclcpp_lifecycle/src. That seems like a textbook example of the pimpl pattern commonly used to allow making internal changes to a class while preserving ABI compatibility. @clalancette am I thinking along the right track here?

You are correct w.r.t. state_machine_mutex_ in the LifecycleNodeInterfaceImpl class; thanks for pointing that out. That part doesn't have an ABI problem, since it is all implementation.

However, we still have the ABI problem with state_handle_mutex_ in the State class. If we can figure out a way to preserve ABI there (ala my previous comments), then the rest of #1756 should be backportable.

@schornakj
Copy link
Author

@clalancette I followed your suggested strategy for protecting state_handle_ in the State class and pushed up #2183. Let me know what you think!

@schornakj
Copy link
Author

#2183 got merged, so I'll close this as completed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlog bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants