
Lifecycle destructor calls shutdown while in shuttingdown intermediate state #2520

Closed
oysstu opened this issue May 2, 2024 · 45 comments
Labels: bug (Something isn't working)

@oysstu
Contributor

oysstu commented May 2, 2024

Bug report

Required Info:

  • Operating System: Ubuntu 24.04
  • Installation type: Both
  • Version or commit hash: iron latest
  • DDS implementation: default
  • Client library (if applicable): N/A

Steps to reproduce issue

Shutdown transition was added to the lifecycle node destructor in #2450 (iron backport: #2490). This currently triggers shutdown if the node state is anything other than finalized. Would it make sense for this to check whether the node is in one of the primary states before sending the shutdown transition? I've seen the following warning quite a bit when terminating a launched node using ctrl+c.

Unable to start transition 7 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368

I'm not sure whether this is related to the publisher being invalid, but the node is clearly in the shuttingdown state, not one of the primary states.

Expected behavior

Only attempt transition in primary states unconfigured, inactive, and active.

Actual behavior

The shutdown transition is attempted even in intermediate states (e.g., shuttingdown).

See the following destructor snippet:

if (LifecycleNode::get_current_state().id() !=
  lifecycle_msgs::msg::State::PRIMARY_STATE_FINALIZED)
{
  auto ret = rclcpp_lifecycle::node_interfaces::LifecycleNodeInterface::CallbackReturn::ERROR;
  auto finalized = LifecycleNode::shutdown(ret);
  if (finalized.id() != lifecycle_msgs::msg::State::PRIMARY_STATE_FINALIZED ||
    ret != rclcpp_lifecycle::node_interfaces::LifecycleNodeInterface::CallbackReturn::SUCCESS)
  {
    RCLCPP_WARN(
      rclcpp::get_logger("rclcpp_lifecycle"),
      "Shutdown error in destruction of LifecycleNode: final state(%s)",
      finalized.label().c_str());
  }
}
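
For illustration, here is a minimal sketch of the proposed guard; the is_primary_state helper is hypothetical and not part of rclcpp_lifecycle, and this is not the actual patch:

// Hypothetical helper: true only for the three primary states from which
// a shutdown transition can be triggered.
bool is_primary_state(uint8_t state_id)
{
  return
    state_id == lifecycle_msgs::msg::State::PRIMARY_STATE_UNCONFIGURED ||
    state_id == lifecycle_msgs::msg::State::PRIMARY_STATE_INACTIVE ||
    state_id == lifecycle_msgs::msg::State::PRIMARY_STATE_ACTIVE;
}

// In the dtor, instead of only checking against PRIMARY_STATE_FINALIZED:
const auto current_id = LifecycleNode::get_current_state().id();
if (is_primary_state(current_id)) {
  auto ret = rclcpp_lifecycle::node_interfaces::LifecycleNodeInterface::CallbackReturn::ERROR;
  auto finalized = LifecycleNode::shutdown(ret);
  // ... warn on failure as in the snippet above ...
} else if (current_id != lifecycle_msgs::msg::State::PRIMARY_STATE_FINALIZED) {
  // Intermediate state (e.g., shuttingdown): skip the transition and
  // optionally warn instead.
}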

@fujitatomoya fujitatomoya self-assigned this May 2, 2024
@fujitatomoya
Collaborator

Would it make sense for this to check whether the node is in one of the primary states before sending the shutdown transition?

I think this makes sense; we cannot change the state while it is in a transition state anyway.
If the node is not in a primary state, we can print a warning that it cannot be shut down during the destructor.

Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368

The lifecycle node com interface cannot have been finalized yet (https://github.com/ros2/rcl/blob/f19067726b9bc2d0b90ac0d0f246d1d7821d41ee/rcl_lifecycle/src/rcl_lifecycle.c#L264), since std::unique_ptr<LifecycleNodeInterfaceImpl> impl_ is not destroyed until after LifecycleNode's dtor body runs.

@fujitatomoya
Collaborator

#2522 reverts #2450; I will retry this once the Jazzy release has settled.

@fujitatomoya
Collaborator

Just posting what needs to be done after the Jazzy release.

@fujitatomoya
Collaborator

Continuing from #2520 (comment), I took a quick look at the implementation.

I've seen the following warning quite a bit when terminating a launched node using ctrl+c.

This could be a different problem, related to destruction order. @oysstu, can you provide a minimal example that reproduces the issue?

Unable to start transition 7 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368

This is NOT because the node is in a transition state; it is because the context is already destroyed before the lifecycle node dtor runs.

This is the same problem reported in #2522 (comment). It can be observed with rclcpp_lifecycle/test_lifecycle_publisher, and it is a bug in the test code: rclcpp::shutdown() (TearDown) is called before LifecycleNode::~LifecycleNode, so the context is invalid in the node dtor, and #2450 generates the error because it tries to publish the state after the transition.

First we need to fix this: #2527.

@oysstu
Contributor Author

oysstu commented May 11, 2024

I'm traveling for the next couple of weeks, but I'll try to follow up as best I can. I've now noticed that the rclcpp context is shut down by the signal handler, and not by me after the executor stops spinning. See the following output:

^C[WARNING] [launch]: user interrupted with ctrl-c (SIGINT)
[WARNING] [launch_ros.actions.lifecycle_node]: Abandoning wait for the '/namespace/node/change_state' service response, due to shutdown.
[node_exec-1] [INFO] [1715459945.276584858] [rclcpp]: signal_handler(signum=2)
[node_exec-1] [DEBUG] [1715459945.276609123] [rclcpp]: signal_handler(): notifying deferred signal handler
[node_exec-1] [DEBUG] [1715459945.276634450] [rclcpp]: deferred_signal_handler(): woken up due to SIGINT/SIGTERM or uninstall
[node_exec-1] [DEBUG] [1715459945.276641853] [rclcpp]: deferred_signal_handler(): shutting down
[node_exec-1] [DEBUG] [1715459945.276674554] [rclcpp]: deferred_signal_handler(): shutting down rclcpp::Context @ 0x5ef029a5b700, because it had shutdown_on_signal == true
[node_exec-1] [DEBUG] [1715459945.276678071] [rcl]: Shutting down ROS client library, for context at address: 0x5ef029a5ba40
[node_exec-1] [DEBUG] [1715459945.276689672] [rcl]: Finalizing publisher
[node_exec-1] [DEBUG] [1715459945.276920120] [rcl]: Publisher finalized
[node_exec-1] [DEBUG] [1715459945.276944746] [rclcpp]: deferred_signal_handler(): waiting for SIGINT/SIGTERM or uninstall
[ERROR] [node_exec-1]: process[node_exec-1] failed to terminate '5' seconds after receiving 'SIGINT', escalating to 'SIGTERM'
[INFO] [node_exec-1]: sending signal 'SIGTERM' to process[node_exec-1]
[node_exec-1] [INFO] [1715459950.335149560] [rclcpp]: signal_handler(signum=15)
[node_exec-1] [DEBUG] [1715459950.335167252] [rclcpp]: signal_handler(): notifying deferred signal handler
[node_exec-1] [DEBUG] [1715459950.335197609] [rclcpp]: deferred_signal_handler(): woken up due to SIGINT/SIGTERM or uninstall
[node_exec-1] [DEBUG] [1715459950.335203910] [rclcpp]: deferred_signal_handler(): shutting down
[node_exec-1] [DEBUG] [1715459950.335207287] [rclcpp]: deferred_signal_handler(): waiting for SIGINT/SIGTERM or uninstall
[node_exec-1] [ERROR] [1715459954.328207103] [namespace.node]: Failed to finish transition 1. Current state is now: unconfigured (Could not publish transition: publisher's context is invalid, at /tmp/makepkg/ros2-iron-base/src/ros2/src/ros2/rcl/rcl/src/rcl/publisher.c:423, at /tmp/makepkg/ros2-iron-base/src/ros2/src/ros2/rcl/rcl_lifecycle/src/rcl_lifecycle.c:368)

Shutdown is called from here:

context_ptr->shutdown("signal handler");

@oysstu
Contributor Author

oysstu commented May 12, 2024

Some possible solutions come to mind.

  1. Check whether the context/publisher is still valid in the destructor, and skip the transition if it is invalid.
  2. Register a shutdown callback on the context in the lifecycle node constructor and handle the transition there; remove the callback in the destructor. This is something I've previously used for shutdown transitioning (see the sketch below).
  3. A combination of the two, to ensure the transition happens both when the context is shut down first and when the destructor is called first.
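
For illustration, a minimal sketch of option 2. MyLifecycleNode and the callback body are made up for this example; the context callback API itself (add_pre_shutdown_callback / remove_pre_shutdown_callback and rclcpp::PreShutdownCallbackHandle) does exist in rclcpp:

#include "lifecycle_msgs/msg/state.hpp"
#include "rclcpp_lifecycle/lifecycle_node.hpp"

// Illustrative subclass: the pre-shutdown callback drives the shutdown
// transition while the context (and the lifecycle state publisher) is
// still valid; the destructor deregisters the callback again.
class MyLifecycleNode : public rclcpp_lifecycle::LifecycleNode
{
public:
  explicit MyLifecycleNode(const std::string & name)
  : rclcpp_lifecycle::LifecycleNode(name)
  {
    auto context = get_node_base_interface()->get_context();
    pre_shutdown_handle_ = context->add_pre_shutdown_callback(
      [this]() {
        if (get_current_state().id() !=
          lifecycle_msgs::msg::State::PRIMARY_STATE_FINALIZED)
        {
          shutdown();
        }
      });
  }

  ~MyLifecycleNode() override
  {
    // Deregister so the callback cannot fire after this node is gone.
    get_node_base_interface()->get_context()->remove_pre_shutdown_callback(
      pre_shutdown_handle_);
  }

private:
  rclcpp::PreShutdownCallbackHandle pre_shutdown_handle_;
};

Note that capturing this raw still leaves the destructor/callback race discussed later in the thread; a weak_ptr variant is sketched further down.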

@fujitatomoya
Collaborator

I believe that whether the context is valid or invalid, we should clean up the lifecycle node and finalize it to avoid leaving devices in an unknown state; that is the original issue #997.

Adding a shutdown callback on the context would work, but if the user also adds a pre-shutdown callback on the context, the callbacks might not run in the expected order. (We might call LifecycleNode::shutdown before the user's cleanup callbacks.)

I think we can call LifecycleNode::shutdown in the destructor, but make publishing the state best effort? (Not an error, but a warning.) Any thoughts?

@oysstu
Contributor Author

oysstu commented May 13, 2024

Adding a shutdown callback on the context would work, but if the user also adds a pre-shutdown callback on the context, the callbacks might not run in the expected order. (We might call LifecycleNode::shutdown before the user's cleanup callbacks.)

Assuming that the shutdown callbacks are called in the reverse order of registration, the callback registered in the constructor would be the last one to be called relative to the lifetime of the node (i.e., any references to the node would be obtained by the user after the constructor). If the shutdown callback handles the shutdown transition on rclcpp::shutdown, and the node destructor handles it otherwise, aren't all cases covered? (Assuming the node destructor does nothing if the context/publisher is invalid.)

I think we can call LifecycleNode::shutdown in the destructor, but make publishing the state best effort? (Not an error, but a warning.) Any thoughts?

Hmm, I don't know enough about the lifecycle QoS and its intended design to comment on this.

@fujitatomoya
Collaborator

Assuming that the shutdown callbacks are called in the reverse order of registration,

This is true.

I was thinking that a user might do something like hold pointers to the nodes and reset those pointers in the context's pre-shutdown hook to clean everything up, since the context is shutting down. Then this LifecycleNode system callback would be accessing invalid memory? That was my concern off the top of my head...

Maybe there are more ideas; let's keep this open for the fix! Thanks for iterating. I will try to bring this issue up at a WG meeting.

@oysstu
Contributor Author

oysstu commented May 14, 2024

I was thinking that a user might do something like hold pointers to the nodes and reset those pointers in the context's pre-shutdown hook to clean everything up, since the context is shutting down. Then this LifecycleNode system callback would be accessing invalid memory?

I've used a weak_ptr to refer to the node without extending its lifetime. This only works when the node is managed by a shared_ptr, though. If the lifecycle interface were instead defined in a NodeLifecycleInterface class and held in the node as a shared_ptr, this could be guaranteed (i.e., similar to the other interfaces, such as NodeBaseInterface).

I've also deregistered the context shutdown callback in the lifecycle destructor, which should avoid dangling callbacks. That is good if an application creates and deletes a bunch of nodes. This alone is not sufficient, though, since there could be a race condition between the destructor and the shutdown callback (which is why a shared_ptr/weak_ptr is needed).
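
For illustration, a minimal sketch of that weak_ptr approach, assuming the node is owned by a shared_ptr (the setup code, e.g. its placement in main() after rclcpp::init, is illustrative):

#include "lifecycle_msgs/msg/state.hpp"
#include "rclcpp_lifecycle/lifecycle_node.hpp"

// The callback holds only a weak_ptr, so it never touches a node whose
// destructor has already run.
auto node = std::make_shared<rclcpp_lifecycle::LifecycleNode>("my_node");
auto context = node->get_node_base_interface()->get_context();
std::weak_ptr<rclcpp_lifecycle::LifecycleNode> weak_node = node;
auto handle = context->add_pre_shutdown_callback(
  [weak_node]() {
    if (auto alive = weak_node.lock()) {
      // Node still exists: finish the transition while the context is valid.
      if (alive->get_current_state().id() !=
        lifecycle_msgs::msg::State::PRIMARY_STATE_FINALIZED)
      {
        alive->shutdown();
      }
    }
    // If lock() fails, the node has already been destroyed; nothing to do.
  });
// Whoever owns the node should still call
// context->remove_pre_shutdown_callback(handle) when the node goes away,
// so dead callbacks do not accumulate.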

@fujitatomoya
Collaborator

@oysstu just FYI,

Only attempt transition in primary states unconfigured, inactive, and active.
The shutdown transition is attempted even in intermediate states (e.g., shuttingdown).

This original problem should now be addressed in the LifecycleNode class.

The signal case from #2520 (comment) can be discussed further here (i.e., the case where the context is shut down gracefully by the deferred signal thread).

@g-arjones

I believe that whether the context is valid or invalid, we should clean up the lifecycle node and finalize it to avoid leaving devices in an unknown state; that is the original issue #997.

As demonstrated in #2553, that issue cannot be addressed from the dtor, because the subclasses' dtors are called before the base class's and it's illegal to call their methods after they have run. So I think the proper way to address it would be to use the context's shutdown callback, as suggested by @oysstu, which is what we have been using (and which is completely broken now with that change).

@fujitatomoya
Collaborator

I think the proper way to address it would be to use the context's shutdown callback (#2520 (comment)), which is what we have been using (and which is completely broken now with that change)

@g-arjones thanks for the information and the issue. In that case, what about when the context is still valid? Can we just leave the device or sensor in an unknown state until the context is destroyed?

@g-arjones

g-arjones commented Jun 7, 2024

I think that's the best that can be done by the base class (that is actually what the author of the feature request was referring to, since he mentioned CTRL+C). If more fine-grained control of the state is required, then applications must handle the transitions themselves.

Calling a subclass method from a base class destructor is just not possible in C++ (which also explains #2554).

@oysstu
Contributor Author

oysstu commented Jun 7, 2024

@g-arjones thanks for the information and the issue. In that case, what about when the context is still valid? Can we just leave the device or sensor in an unknown state until the context is destroyed?

The C++ RAII paradigm would suggest that since the subclass initializes the resource, it is up to the subclass dtor to release the resource properly. At a more distributed level, we use a manager node to handle transitioning when not terminating through exceptions or CTRL+C (e.g., something like the nav2 lifecycle manager). For us, the nominal shutdown pattern is deactivate-cleanup-shutdown, unless one of the intermediate transitions fails, in which case the direct shutdown transition is used (e.g., active-shutdown).
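
For illustration, a rough sketch of that nominal sequence as a manager node might drive it over the standard lifecycle change_state service (node and service names are placeholders; this is not the nav2 lifecycle manager implementation):

#include "lifecycle_msgs/msg/transition.hpp"
#include "lifecycle_msgs/srv/change_state.hpp"
#include "rclcpp/rclcpp.hpp"

// Ask a managed node to perform one transition; returns true on success.
bool change_state(
  const rclcpp::Node::SharedPtr & manager,
  const rclcpp::Client<lifecycle_msgs::srv::ChangeState>::SharedPtr & client,
  uint8_t transition_id)
{
  auto request = std::make_shared<lifecycle_msgs::srv::ChangeState::Request>();
  request->transition.id = transition_id;
  auto future = client->async_send_request(request);
  if (rclcpp::spin_until_future_complete(manager, future) !=
    rclcpp::FutureReturnCode::SUCCESS)
  {
    return false;
  }
  return future.get()->success;
}

// Nominal path: deactivate -> cleanup -> shutdown. On any failure, fall back
// to the direct shutdown transition; a real manager would pick the shutdown
// transition matching the node's current state (active/inactive/unconfigured).
void shutdown_managed_node(
  const rclcpp::Node::SharedPtr & manager,
  const rclcpp::Client<lifecycle_msgs::srv::ChangeState>::SharedPtr & client)
{
  using lifecycle_msgs::msg::Transition;
  if (!(change_state(manager, client, Transition::TRANSITION_DEACTIVATE) &&
    change_state(manager, client, Transition::TRANSITION_CLEANUP) &&
    change_state(manager, client, Transition::TRANSITION_UNCONFIGURED_SHUTDOWN)))
  {
    change_state(manager, client, Transition::TRANSITION_ACTIVE_SHUTDOWN);
  }
}

The client would be created as, e.g., manager->create_client<lifecycle_msgs::srv::ChangeState>("/namespace/node/change_state").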

@g-arjones

Same here 👍

@g-arjones

Just to expand on that: since it's the subclasses that initialize and handle devices and sensor states, they are the ones that should be responsible for ensuring the states are not left unknown (in their dtors and/or on_shutdown).

@fujitatomoya
Collaborator

@oysstu @g-arjones @gabrielfpacheco thanks for the discussion and opinions.

All comments here make sense to me.

Calling a subclass method from a base class destructor is just not possible in C++ (which also explains #2554).

This is also true.

Just to expand on that, since it's the subclasses that are initializing/handling devices and sensor states they are the ones that should be responsible for ensuring the states are not unknown (in their dtors and/or on_shutdown)

Hmm, so that is the user's responsibility, and the base class never calls shutdown.

I am not sure at this moment. I am okay with rolling back everything, but before that I will bring it up with the other maintainers.

@g-arjones

I am okay with rolling back everything, but before that I will bring it up with the other maintainers.

Thank you, I would really appreciate that. This change created a lot of problems for us.

@fujitatomoya
Collaborator

Yeah, sorry that happened. I really appreciate your feedback and comments.

@g-arjones

No reason to apologize. Keep up the good work! 👍

@fujitatomoya
Collaborator

fujitatomoya commented Jun 10, 2024

@oysstu @g-arjones @gabrielfpacheco
CC: @mjcarroll @wjwwood @clalancette

So I have come to a conclusion: I will roll back calling shutdown in the LifecycleNode dtor, including all backports.

The reason is pretty much what we discussed here: we are generating another complication (hard to anticipate, and one the user cannot avoid: #2554) to address the one issue (#2553 (comment)). Until we can figure out more fine-grained control for user applications, we should roll back the behavior so that the user can be responsible for lifecycle management.

So at this moment it is the user application's responsibility to make sure the device or sensor is shut down, to avoid leaving them in an unknown state (either by calling shutdown in the subclass or by using a context shutdown callback).

LifecycleNode could register a shutdown callback on the context in the lifecycle node constructor and handle the shutdown there, but I think this only works if the node is managed by a shared_ptr, so that the context can use a weak_ptr to check whether the object is still valid. I think it would be simpler to leave this to the user application instead of providing something that sometimes works but sometimes does not.

As a follow-up, I guess something we can improve is a doc section saying that the user needs to call shutdown if necessary, and a warning message: check the current state in the dtor and, if the node is not shut down, warn the user that the LifecycleNode was not shut down, to let the user know.

What do you think? Any concerns or thoughts?

Again, thanks for pointing out my oversight and for sharing!

@g-arjones

@fujitatomoya Thank you for the update 👍

I think it would be simpler to leave this to the user application instead of providing something that sometimes works but sometimes does not.

Yeah, that's what makes the most sense to me too.

I guess something we can improve is a doc section saying that the user needs to call shutdown if necessary

Definitely...

and a warning message: check the current state in the dtor and, if the node is not shut down, warn the user that the LifecycleNode was not shut down, to let the user know.

Yeah, I guess that's fine. I'm just not sure which log level that should use, since for many applications not finalizing a node before terminating the process (or destroying its instance) is perfectly fine, and it is even common practice in many examples and tutorials.

@fujitatomoya
Collaborator

I'm just not sure which log level that should use, since for many applications not finalizing a node before terminating the process (or destroying its instance) is perfectly fine, and it is even common practice in many examples and tutorials.

I guess DEBUG. I am not sure whether any user application intentionally exits without shutting down, but I understand your concern that a warning could be surprising for users all of a sudden.

@oysstu
Contributor Author

oysstu commented Jun 10, 2024

I think this is the correct move right now. The context shutting down before the node is cleaned up points to either a programming/logic error or something exceptional happening. It might be best to leave it to the user to decide the actions to take.

LifecycleNode could register a shutdown callback on the context in the lifecycle node constructor and handle the shutdown there, but I think this only works if the node is managed by a shared_ptr, so that the context can use a weak_ptr to check whether the object is still valid. I think it would be simpler to leave this to the user application instead of providing something that sometimes works but sometimes does not.

As I mentioned, it's possible to do this if the lifecycle functionality were implemented in a base interface held as a shared pointer in the lifecycle class. I believe implementing this would enable some generic programming: treating the lifecycle node like a regular node through its base interfaces, while having the additional lifecycle interface for lifecycle-specific functionality.

Right now, the lifecycle API is a weird mix of functional programming and inheritance. I think it would be better to have the lifecycle class intended for use with inheritance and to disallow the direct callback functions. If the lifecycle functionality were implemented in a base interface, it could be added to any node, for users who prefer the functional style.
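
For illustration, a rough sketch of what such an interface might look like; this is entirely hypothetical, as no NodeLifecycleInterface exists in rclcpp_lifecycle today:

// Hypothetical NodeLifecycleInterface, composed into a node as a shared_ptr
// member, mirroring NodeBaseInterface and friends. Not a real rclcpp API.
class NodeLifecycleInterface
{
public:
  virtual ~NodeLifecycleInterface() = default;
  virtual const rclcpp_lifecycle::State & get_current_state() = 0;
  virtual const rclcpp_lifecycle::State & trigger_transition(uint8_t transition_id) = 0;
};

// A node (lifecycle or regular) would then expose something like:
//   std::shared_ptr<NodeLifecycleInterface> get_node_lifecycle_interface();
// so a context shutdown callback could hold a weak_ptr to just this
// interface, independently of how the node object itself is owned.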

@gabrielfpacheco

I appreciate your effort in looking into this, @fujitatomoya.

I think it would be simpler to leave this to the user application instead of providing something that sometimes works but sometimes does not.

It makes sense to me that the user is the one responsible for properly finalizing the node, since this may be application-specific.

Rolling back and improving the documentation seems like the right move to me as well.

@fujitatomoya
Collaborator

@oysstu @g-arjones @gabrielfpacheco thanks for the quick response.

I will start reverting; I will follow up with docs and a debug print after the reverts.

@fujitatomoya
Collaborator

See #2520 (comment); all related PRs have been rolled back and merged.

I will go ahead and close this issue; feel free to reopen if I missed anything.

@fujitatomoya
Collaborator

@oysstu @g-arjones @gabrielfpacheco just FYI, this is now closed.

@g-arjones

@fujitatomoya Thanks a lot! Any idea when the PR to rosdistro will be opened?

@fujitatomoya
Collaborator

That is something I am not sure about. You are talking about Humble, right? CC: @audrow

@g-arjones

You are talking about Humble, right?

Exactly...

@fujitatomoya
Collaborator

@audrow @clalancette do we have any plan for a Humble sync or patch?

fujitatomoya added a commit that referenced this issue Jun 12, 2024
  Currently it is the user application's responsibility to manage all state control.
  See #2520 for more details.

Signed-off-by: Tomoya Fujita <Tomoya.Fujita@sony.com>
@audrow
Member

audrow commented Jun 18, 2024

@audrow @clalancette do we have any plan for a Humble sync or patch?

I'll be doing syncs every two weeks or so. (A sync is in progress right now.)

As for a patch, I'm going to aim for the week of July 22nd. Let me know if I should do a patch sooner.

@fujitatomoya
Collaborator

@audrow thanks for sharing the plan.
CC: @g-arjones @gabrielfpacheco

@g-arjones

I would really appreciate it if this could be released sooner. It's an important bug fix (kind of a critical one, at least for us). Is that feasible?

@ros-discourse

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/new-packages-for-humble-hawksbill-2024-06-20/38277/2

@g-arjones

It looks like rclcpp wasn't included, unfortunately. Still broken...

@fujitatomoya
Collaborator

@g-arjones I had a chat with @audrow; that was just a sync, so community packages only. The next patch is currently scheduled for the middle of July.

@g-arjones

I would have expected something like this to have a higher priority, but I guess we will have to wait. Thank you for the notice!
