Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve component_manager_isolated shutdown #2085

Merged
merged 3 commits into from
Jan 23, 2023

Conversation

mjcarroll
Copy link
Member

This eliminates a potential hang when the isolated container is being shutdown via the rclcpp::SignalHandler.

Previously, when the SignalHandler triggers the shutdown of the rest of the system, the ComponentManagerIsolated object would cancel and destroy the executor associated with each loaded component.

Part of the cancel process checks if the executor is currently spinning as there may have been a race condition where cancel_executor is called before the thread is created. While I didn't personally observe this condition, the comment indicates that it may happen.

    while (!executor_wrapper.executor->is_spinning()) {
      // This is an arbitrarily small delay to avoid busy looping
      rclcpp::sleep_for(std::chrono::milliseconds(1));
    }

The issue is that there is a race where the executor is already shutdown/cancelled (so is_spinning is always false), causing an infinite loop.

This code introduces a guard atomic to check that the loop has been started before continuing to check the is_spinning as well as checking the overall ROS context for shutdown.

Closes #2083

Signed-off-by: Michael Carroll michael@openrobotics.org

@mjcarroll
Copy link
Member Author

@clalancette Now that I have this implemented, one open question is that if the is_spinning logic is still required with this atomic flag?

This eliminates a potential hang when the isolated container is being
shutdown via the rclcpp SignalHandler

Co-authored-by: Chris Lalancette <clalancette@openrobotics.org>
Signed-off-by: Michael Carroll <michael@openrobotics.org>
@mjcarroll mjcarroll force-pushed the mjcarroll/isolated_container_shutdown branch from 0dda81c to 8017ca7 Compare January 19, 2023 18:00
@alsora
Copy link
Collaborator

alsora commented Jan 19, 2023

@mjcarroll I think that the is_spinning check is still required and I'm actually not sure if you really need the atomic variable or the check on the context is sufficient.

Without the is_spinning there may be a data race where the thread starts (atomic becomes true) but then it's pre-empted before starting to spin.

@mjcarroll
Copy link
Member Author

I think that the is_spinning check is still required and I'm actually not sure if you really need the atomic variable or the check on the context is sufficient.

Chris and I looked at this together, and we thought that the safest thing was to include both. The context check by itself was enough to prevent it from hanging on my local computer, but that does not guarantee that it is completely safe.

The issue stems from a little bit of under-documentation in the way that the executor's cancel works. The question is, should an executor be allowed to spin after cancel is called, and what is the state when cancel is called before the first spin. In this case, this guard exists to prevent the cancel from being called before the first spin.

A potential alternative would be to spin_once, set the atomic, and then spin, to guarantee that the spin has happened before the cancel.

The original fix here was to delay until the executor is spinning, but we ended up in a state where the executor is not spinning so the loop cannot be broken.

Signed-off-by: Michael Carroll <michael@openrobotics.org>
Copy link
Collaborator

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@fujitatomoya
Copy link
Collaborator

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

Copy link
Contributor

@clalancette clalancette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only left suggestions to clean up comments a bit. Otherwise, this looks good to me.

…isolated.hpp


Signed-off-by: Michael Carroll <michael@openrobotics.org>

Co-authored-by: Chris Lalancette <clalancette@openrobotics.org>
@fujitatomoya
Copy link
Collaborator

CI(Linux only):

  • Linux Build Status

@mjcarroll mjcarroll merged commit ab71df3 into rolling Jan 23, 2023
@delete-merged-branch delete-merged-branch bot deleted the mjcarroll/isolated_container_shutdown branch January 23, 2023 15:46
alsora pushed a commit to irobot-ros/rclcpp that referenced this pull request Apr 29, 2023
This eliminates a potential hang when the isolated container is being
shutdown via the rclcpp SignalHandler.

Signed-off-by: Michael Carroll <michael@openrobotics.org>
Co-authored-by: Chris Lalancette <clalancette@openrobotics.org>
alsora pushed a commit to irobot-ros/rclcpp that referenced this pull request Apr 29, 2023
This eliminates a potential hang when the isolated container is being
shutdown via the rclcpp SignalHandler.

Signed-off-by: Michael Carroll <michael@openrobotics.org>
Co-authored-by: Chris Lalancette <clalancette@openrobotics.org>
alsora pushed a commit to irobot-ros/rclcpp that referenced this pull request May 3, 2023
This eliminates a potential hang when the isolated container is being
shutdown via the rclcpp SignalHandler.

Signed-off-by: Michael Carroll <michael@openrobotics.org>
Co-authored-by: Chris Lalancette <clalancette@openrobotics.org>
alsora pushed a commit to irobot-ros/rclcpp that referenced this pull request May 3, 2023
This eliminates a potential hang when the isolated container is being
shutdown via the rclcpp SignalHandler.

Signed-off-by: Michael Carroll <michael@openrobotics.org>
Co-authored-by: Chris Lalancette <clalancette@openrobotics.org>
@tonynajjar
Copy link
Contributor

tonynajjar commented Aug 8, 2023

I spent 2 hours debugging this on humble before finally finding this fix 😄 Thanks for the fix! Can you please backport to humble? I cherrypicked it locally and it worked well

tonynajjar pushed a commit to pixel-robotics/rclcpp that referenced this pull request Aug 8, 2023
This eliminates a potential hang when the isolated container is being
shutdown via the rclcpp SignalHandler.

Signed-off-by: Michael Carroll <michael@openrobotics.org>
Co-authored-by: Chris Lalancette <clalancette@openrobotics.org>
@fujitatomoya
Copy link
Collaborator

@mjcarroll @clalancette

adding explicit default constructor should be fine to keep the ABI? i think that is okay but need to be sure...

@clalancette
Copy link
Contributor

adding explicit default constructor should be fine to keep the ABI? i think that is okay but need to be sure...

I think that is OK, but the addition of the atomic_bool to the inner struct definitely breaks ABI. Someone will have to find another way to backport this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

it takes too long to terminate when using component_container_isolated
5 participants