Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix bug mentioned in #3958 && a futhuer suggestion for child-LifecycleNode mechanism design #3971

Closed
GoesM opened this issue Nov 20, 2023 · 2 comments

Comments

@GoesM
Copy link
Contributor

GoesM commented Nov 20, 2023

Bug report

Required Info:

  • Operating System:
    • all
  • ROS2 Version:
    • all
  • Version or commit hash:
    • newest
  • DDS implementation:

Steps to reproduce issue


Expected behavior

expected behavior what the degisner wish should be like

costmap_ros_->deactivate() works during planner/controller->on_deactivate(), as well as costmap_ros_->cleanup() works during planner/controller->on_cleanup()

.../nav2_planner/src/planner_server.cpp#L224 and L257, and similar code in nav2_controller

Actual behavior

As an individual LifecycleNode, costmap_ros_ may react to rcl_preshutdown randomly, and always be closed much earlier than planner/controller->deactivate()

it would lead to many unexpected crashes like bug mentioned in #3958 :

during shutdown_period, costmap_ros_ may react to rcl_preshutdown earlier, so that planner/controller may try to access the costmap_ros_ though it has been a nullptr. As a result it would lead to a crash during shutdown period.

such crashes may lead to bad situation that :

program shut down in an unplanned manner and result in residual processes.


Additional information

costmap is set as a Nav2::util LifecycleNode : ../nav2_costmap_2d/include/nav2_costmap_2d/costmap_2d_ros.hpp#L 73

all LifecycleNode would react to Ctrl+C ../nav2_util/src/lifecycle_node.cpp#L 90

all LifecycleNode would react to rcl_preshutdown randomly, and similar issue has been discussed before : ../ros2/rclcpp/issues/2096

this explanation should have been clear enough: ../nav2_planner/src/planner_server.cpp #L 214-L225

the bug happening in ISSUE: #3958 is caused by this.

because during shutdown_period, costmap_ros_ may react rcl_preshutdown earlier, so that planner/controller may try to access the costmap_ros_ though it has been a nullptr.


our provided solution

it seems that costmap is only used as a child-thread of planner/controller, ../nav2_planner/src/planner_server.cpp#L 89

so can we make it do nothing when reacting to Ctrl+C ?

doing so, its lifecycle would be completely bond to planner/controller's lifecycle, and also going to death in order.

//in file `costmap_2d_ros.hpp`
  /**
   * @brief override on_rcl_preshutdown() as empty
   * [reason] costmap only could be created by its parents like planner/controller_server
   * [reason] costmap may react to ctrl+C earlier than its parents to cause nullptr-accessed
   * so the costmap's reaction must be later than its parent to avoid nullptr-accessed
   * 
   * thus, it's a more perfect way to override on_rcl_preshutdown() as empty function
   * and its deactivate must be joint in planner/controller_server->on_deactivate()
   * and its cleanup must be joint in planner/controller_server->on_cleanup()
   * 
   * !!! a mention !!!
   * if any other place use this class (Costmap2DROS)
   * it's necessary to let costmap->deactivate() joint in parent->deactivate()
   * and let costmap->cleanup() joint in parent->cleanup() 
   */
  void on_rcl_preshutdown() override{
    //do nothing
    return ;
  }

because the costmp_ros_ is just a thread created by planner/controller,

whether program shutdown correctly or not, or even dead unexceptionably, the child-thread could be closed all the time since parent-LifecycleNode sub-process dead.

Thus there's no need to worry whether it would be still alive after whole program exit finally.

in addition

if our solution being adopted, many checks could be removed :

these double checks are also designed to face to bugs in situation that costmap_ros_ may be closed earlier than planner/controller. After our code modification, these checks are useless anymore.

what's more, our solution performs perfectly during our local tests, much more than our past solutions mentioned in #3958



Further Suggestion

We specially name nodes like costmap_ros_ as child-LifecycleNode, which is born from their parent-LifecycleNode like planner_server and controller_server.

our suggestion is:
Could we set up a mechanism specially for such situations like child-LifecycleNode, to make sure they're completely controlled by their parent?

these child-LifecycleNode should have some basic specialist like:

  • they must be created by a parent-LifecycleNode
  • they must be created as a thread but not a sub-process

such mechanism could be like:

  • child's->activate should be joint into its their parent's on_configure
  • child->deactivate() must be joint into its parent's on_deactivate
  • child->cleanup() must be joint into its parent's on_cleanup
  • and so on

Perhaps such mechanism could be used for any program in ROS2 ?

@GoesM GoesM mentioned this issue Nov 20, 2023
7 tasks
@GoesM
Copy link
Contributor Author

GoesM commented Nov 20, 2023

we've created a new PR #3972 to fix the bug.

@SteveMacenski
Copy link
Member

Discussing inline in PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants