
Lifecycle destructor calls shutdown while in shuttingdown intermediate state #2520

Closed
oysstu opened this issue May 2, 2024 · 45 comments
Labels: bug (Something isn't working)

@oysstu
Contributor

oysstu commented May 2, 2024

Bug report

Required Info:

  • Operating System: Ubuntu 24.04
  • Installation type: Both
  • Version or commit hash: iron latest
  • DDS implementation: default
  • Client library (if applicable): N/A

Steps to reproduce issue

Shutdown transition was added to the lifecycle node destructor in #2450 (iron backport: #2490). This currently triggers shutdown if the node state is anything other than finalized. Would it make sense for this to check whether the node is in one of the primary states before sending the shutdown transition? I've seen the following warning quite a bit when terminating a launched node using ctrl+c.

Unable to start transition 7 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368

I'm not sure whether this is related to the publisher being invalid, but the node is clearly in the shuttingdown state, not one of the primary states.

Expected behavior

Only attempt transition in primary states unconfigured, inactive, and active.

Actual behavior

The shutdown transition is attempted even in intermediate states (e.g., shuttingdown).

See the following destructor snippet:

if (LifecycleNode::get_current_state().id() !=
  lifecycle_msgs::msg::State::PRIMARY_STATE_FINALIZED)
{
  auto ret = rclcpp_lifecycle::node_interfaces::LifecycleNodeInterface::CallbackReturn::ERROR;
  auto finalized = LifecycleNode::shutdown(ret);
  if (finalized.id() != lifecycle_msgs::msg::State::PRIMARY_STATE_FINALIZED ||
    ret != rclcpp_lifecycle::node_interfaces::LifecycleNodeInterface::CallbackReturn::SUCCESS)
  {
    RCLCPP_WARN(
      rclcpp::get_logger("rclcpp_lifecycle"),
      "Shutdown error in destruction of LifecycleNode: final state(%s)",
      finalized.label().c_str());
  }
}
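
For illustration, here is a minimal sketch of the proposed guard; the is_primary_state helper is hypothetical and not part of rclcpp_lifecycle, and this is not the actual patch:

// Hypothetical helper: true only for the three primary states from which
// a shutdown transition can be triggered.
bool is_primary_state(uint8_t state_id)
{
  return
    state_id == lifecycle_msgs::msg::State::PRIMARY_STATE_UNCONFIGURED ||
    state_id == lifecycle_msgs::msg::State::PRIMARY_STATE_INACTIVE ||
    state_id == lifecycle_msgs::msg::State::PRIMARY_STATE_ACTIVE;
}

// In the dtor, instead of only checking against PRIMARY_STATE_FINALIZED:
const auto current_id = LifecycleNode::get_current_state().id();
if (is_primary_state(current_id)) {
  auto ret = rclcpp_lifecycle::node_interfaces::LifecycleNodeInterface::CallbackReturn::ERROR;
  auto finalized = LifecycleNode::shutdown(ret);
  // ... warn on failure as in the snippet above ...
} else if (current_id != lifecycle_msgs::msg::State::PRIMARY_STATE_FINALIZED) {
  // Intermediate state (e.g., shuttingdown): skip the transition and
  // optionally warn instead.
}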

@fujitatomoya fujitatomoya self-assigned this May 2, 2024
@fujitatomoya
Collaborator

Would it make sense for this to check whether the node is in one of the primary states before sending the shutdown transition?

I think this makes sense; we cannot change the state while it is in a transition state anyway.
If the node is not in a primary state, we can print a warning that it cannot be shut down during the destructor.

Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368

The lifecycle node com interface cannot have been finalized yet (https://github.com/ros2/rcl/blob/f19067726b9bc2d0b90ac0d0f246d1d7821d41ee/rcl_lifecycle/src/rcl_lifecycle.c#L264), since std::unique_ptr<LifecycleNodeInterfaceImpl> impl_ is not destroyed until after LifecycleNode's dtor body runs.

@fujitatomoya
Collaborator

#2522 reverts #2450; I will retry this once the Jazzy release has settled.

@fujitatomoya
Collaborator

Just posting what needs to be done after the Jazzy release.

@fujitatomoya
Collaborator

Continuing from #2520 (comment), I took a quick look at the implementation.

I've seen the following warning quite a bit when terminating a launched node using ctrl+c.

This could be a different problem, related to destruction order. @oysstu, can you provide a minimal example that reproduces the issue?

Unable to start transition 7 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368

This is NOT because the node is in a transition state; it is because the context is already destroyed before the lifecycle node dtor runs.

This is the same problem reported in #2522 (comment). It can be observed with rclcpp_lifecycle/test_lifecycle_publisher, and it is a bug in the test code: rclcpp::shutdown() (TearDown) is called before LifecycleNode::~LifecycleNode, so the context is invalid in the node dtor, and #2450 generates the error because it tries to publish the state after the transition.

First we need to fix this: #2527.

@oysstu
Contributor Author

oysstu commented May 11, 2024

I'm traveling for the next couple of weeks, but I'll try to follow up as best I can. I've now noticed that the rclcpp context is shut down by the signal handler, and not by me after the executor stops spinning. See the following output:

^C[WARNING] [launch]: user interrupted with ctrl-c (SIGINT)
[WARNING] [launch_ros.actions.lifecycle_node]: Abandoning wait for the '/namespace/node/change_state' service response, due to shutdown.
[node_exec-1] [INFO] [1715459945.276584858] [rclcpp]: signal_handler(signum=2)
[node_exec-1] [DEBUG] [1715459945.276609123] [rclcpp]: signal_handler(): notifying deferred signal handler
[node_exec-1] [DEBUG] [1715459945.276634450] [rclcpp]: deferred_signal_handler(): woken up due to SIGINT/SIGTERM or uninstall
[node_exec-1] [DEBUG] [1715459945.276641853] [rclcpp]: deferred_signal_handler(): shutting down
[node_exec-1] [DEBUG] [1715459945.276674554] [rclcpp]: deferred_signal_handler(): shutting down rclcpp::Context @ 0x5ef029a5b700, because it had shutdown_on_signal == true
[node_exec-1] [DEBUG] [1715459945.276678071] [rcl]: Shutting down ROS client library, for context at address: 0x5ef029a5ba40
[node_exec-1] [DEBUG] [1715459945.276689672] [rcl]: Finalizing publisher
[node_exec-1] [DEBUG] [1715459945.276920120] [rcl]: Publisher finalized
[node_exec-1] [DEBUG] [1715459945.276944746] [rclcpp]: deferred_signal_handler(): waiting for SIGINT/SIGTERM or uninstall
[ERROR] [node_exec-1]: process[node_exec-1] failed to terminate '5' seconds after receiving 'SIGINT', escalating to 'SIGTERM'
[INFO] [node_exec-1]: sending signal 'SIGTERM' to process[node_exec-1]
[node_exec-1] [INFO] [1715459950.335149560] [rclcpp]: signal_handler(signum=15)
[node_exec-1] [DEBUG] [1715459950.335167252] [rclcpp]: signal_handler(): notifying deferred signal handler
[node_exec-1] [DEBUG] [1715459950.335197609] [rclcpp]: deferred_signal_handler(): woken up due to SIGINT/SIGTERM or uninstall
[node_exec-1] [DEBUG] [1715459950.335203910] [rclcpp]: deferred_signal_handler(): shutting down
[node_exec-1] [DEBUG] [1715459950.335207287] [rclcpp]: deferred_signal_handler(): waiting for SIGINT/SIGTERM or uninstall
[node_exec-1] [ERROR] [1715459954.328207103] [namespace.node]: Failed to finish transition 1. Current state is now: unconfigured (Could not publish transition: publisher's context is invalid, at /tmp/makepkg/ros2-iron-base/src/ros2/src/ros2/rcl/rcl/src/rcl/publisher.c:423, at /tmp/makepkg/ros2-iron-base/src/ros2/src/ros2/rcl/rcl_lifecycle/src/rcl_lifecycle.c:368)

Shutdown is called from here:

context_ptr->shutdown("signal handler");

@oysstu
Contributor Author

oysstu commented May 12, 2024

Some possible solutions come to mind.

  1. Check whether the context/publisher is still valid in the destructor, and skip the transition if it is invalid.
  2. Register a shutdown callback on the context in the lifecycle node constructor and handle the transition there; remove the callback in the destructor. This is something I've previously used for shutdown transitioning (see the sketch below).
  3. A combination of the two, to ensure the transition happens both when the context is shut down first and when the destructor is called first.
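
For illustration, a minimal sketch of option 2. MyLifecycleNode and the callback body are made up for this example; the context callback API itself (add_pre_shutdown_callback / remove_pre_shutdown_callback and rclcpp::PreShutdownCallbackHandle) does exist in rclcpp:

#include "lifecycle_msgs/msg/state.hpp"
#include "rclcpp_lifecycle/lifecycle_node.hpp"

// Illustrative subclass: the pre-shutdown callback drives the shutdown
// transition while the context (and the lifecycle state publisher) is
// still valid; the destructor deregisters the callback again.
class MyLifecycleNode : public rclcpp_lifecycle::LifecycleNode
{
public:
  explicit MyLifecycleNode(const std::string & name)
  : rclcpp_lifecycle::LifecycleNode(name)
  {
    auto context = get_node_base_interface()->get_context();
    pre_shutdown_handle_ = context->add_pre_shutdown_callback(
      [this]() {
        if (get_current_state().id() !=
          lifecycle_msgs::msg::State::PRIMARY_STATE_FINALIZED)
        {
          shutdown();
        }
      });
  }

  ~MyLifecycleNode() override
  {
    // Deregister so the callback cannot fire after this node is gone.
    get_node_base_interface()->get_context()->remove_pre_shutdown_callback(
      pre_shutdown_handle_);
  }

private:
  rclcpp::PreShutdownCallbackHandle pre_shutdown_handle_;
};

Note that capturing this raw still leaves the destructor/callback race discussed later in the thread; a weak_ptr variant is sketched further down.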

@fujitatomoya
Collaborator

I believe that whether the context is valid or invalid, we should clean up the lifecycle node and finalize it to avoid leaving devices in an unknown state; that is the original issue #997.

Adding a shutdown callback on the context would work, but if the user also adds a pre-shutdown callback on the context, the callbacks might not run in the expected order. (We might call LifecycleNode::shutdown before the user's cleanup callbacks.)

I think we can call LifecycleNode::shutdown in the destructor, but make publishing the state best effort? (Not an error, but a warning.) Any thoughts?

@oysstu
Contributor Author

oysstu commented May 13, 2024

Adding a shutdown callback on the context would work, but if the user also adds a pre-shutdown callback on the context, the callbacks might not run in the expected order. (We might call LifecycleNode::shutdown before the user's cleanup callbacks.)

Assuming that the shutdown callbacks are called in the reverse order of registration, the callback registered in the constructor would be the last one to be called relative to the lifetime of the node (i.e., any references to the node would be obtained by the user after the constructor). If the shutdown callback handles the shutdown transition on rclcpp::shutdown, and the node destructor handles it otherwise, aren't all cases covered? (Assuming the node destructor does nothing if the context/publisher is invalid.)

I think we can call LifecycleNode::shutdown in the destructor, but make publishing the state best effort? (Not an error, but a warning.) Any thoughts?

Hmm, I don't know enough about the lifecycle QoS and its intended design to comment on this.

@fujitatomoya
Collaborator

Assuming that the shutdown callbacks are called in the reverse order of registration,

This is true.

I was thinking that a user might do something like hold pointers to the nodes and reset those pointers in the context's pre-shutdown hook to clean everything up, since the context is shutting down. Then this LifecycleNode system callback would be accessing invalid memory? That was my concern off the top of my head...

Maybe there are more ideas; let's keep this open for the fix! Thanks for iterating. I will try to bring this issue up at a WG meeting.

@oysstu
Contributor Author

oysstu commented May 14, 2024

I was thinking that a user might do something like hold pointers to the nodes and reset those pointers in the context's pre-shutdown hook to clean everything up, since the context is shutting down. Then this LifecycleNode system callback would be accessing invalid memory?

I've used a weak_ptr to refer to the node without extending its lifetime. This only works when the node is managed by a shared_ptr, though. If the lifecycle interface were instead defined in a NodeLifecycleInterface class and held in the node as a shared_ptr, this could be guaranteed (i.e., similar to the other interfaces, such as NodeBaseInterface).

I've also deregistered the context shutdown callback in the lifecycle destructor, which should avoid dangling callbacks. That is good if an application creates and deletes a bunch of nodes. This alone is not sufficient, though, since there could be a race condition between the destructor and the shutdown callback (which is why a shared_ptr/weak_ptr is needed).
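
For illustration, a minimal sketch of that weak_ptr approach, assuming the node is owned by a shared_ptr (the setup code, e.g. its placement in main() after rclcpp::init, is illustrative):

#include "lifecycle_msgs/msg/state.hpp"
#include "rclcpp_lifecycle/lifecycle_node.hpp"

// The callback holds only a weak_ptr, so it never touches a node whose
// destructor has already run.
auto node = std::make_shared<rclcpp_lifecycle::LifecycleNode>("my_node");
auto context = node->get_node_base_interface()->get_context();
std::weak_ptr<rclcpp_lifecycle::LifecycleNode> weak_node = node;
auto handle = context->add_pre_shutdown_callback(
  [weak_node]() {
    if (auto alive = weak_node.lock()) {
      // Node still exists: finish the transition while the context is valid.
      if (alive->get_current_state().id() !=
        lifecycle_msgs::msg::State::PRIMARY_STATE_FINALIZED)
      {
        alive->shutdown();
      }
    }
    // If lock() fails, the node has already been destroyed; nothing to do.
  });
// Whoever owns the node should still call
// context->remove_pre_shutdown_callback(handle) when the node goes away,
// so dead callbacks do not accumulate.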

@fujitatomoya
Collaborator

@oysstu just FYI,

Only attempt transition in primary states unconfigured, inactive, and active.
The shutdown transition is attempted even in intermediate states (e.g., shuttingdown).

This original problem should now be addressed in the LifecycleNode class.

The signal case from #2520 (comment) can be discussed further here (i.e., the case where the context is shut down gracefully by the deferred signal thread).

@g-arjones

I believe that whether the context is valid or invalid, we should clean up the lifecycle node and finalize it to avoid leaving devices in an unknown state; that is the original issue #997.

As demonstrated in #2553, that issue cannot be addressed from the dtor, because the subclasses' dtors are called before the base class's and it's illegal to call their methods after they have run. So I think the proper way to address it would be to use the context's shutdown callback, as suggested by @oysstu, which is what we have been using (and which is completely broken now with that change).

@fujitatomoya
Collaborator

I think the proper way to address it would be to use the context's shutdown callback (#2520 (comment)), which is what we have been using (and which is completely broken now with that change)

@g-arjones thanks for the information and the issue. In that case, what about when the context is still valid? Can we just leave the device or sensor in an unknown state until the context is destroyed?

@g-arjones

g-arjones commented Jun 7, 2024

I think that's the best that can be done by the base class (that is actually what the author of the feature request was referring to, since he mentioned CTRL+C). If more fine-grained control of the state is required, then applications must handle the transitions themselves.

Calling a subclass method from a base class destructor is just not possible in C++ (which also explains #2554).

@oysstu
Contributor Author

oysstu commented Jun 7, 2024

@g-arjones thanks for the information and the issue. In that case, what about when the context is still valid? Can we just leave the device or sensor in an unknown state until the context is destroyed?

The C++ RAII paradigm would suggest that since the subclass initializes the resource, it is up to the subclass dtor to release the resource properly. At a more distributed level, we use a manager node to handle transitioning when not terminating through exceptions or CTRL+C (e.g., something like the nav2 lifecycle manager). For us, the nominal shutdown pattern is deactivate-cleanup-shutdown, unless one of the intermediate transitions fails, in which case the direct shutdown transition is used (e.g., active-shutdown).
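
For illustration, a rough sketch of that nominal sequence as a manager node might drive it over the standard lifecycle change_state service (node and service names are placeholders; this is not the nav2 lifecycle manager implementation):

#include "lifecycle_msgs/msg/transition.hpp"
#include "lifecycle_msgs/srv/change_state.hpp"
#include "rclcpp/rclcpp.hpp"

// Ask a managed node to perform one transition; returns true on success.
bool change_state(
  const rclcpp::Node::SharedPtr & manager,
  const rclcpp::Client<lifecycle_msgs::srv::ChangeState>::SharedPtr & client,
  uint8_t transition_id)
{
  auto request = std::make_shared<lifecycle_msgs::srv::ChangeState::Request>();
  request->transition.id = transition_id;
  auto future = client->async_send_request(request);
  if (rclcpp::spin_until_future_complete(manager, future) !=
    rclcpp::FutureReturnCode::SUCCESS)
  {
    return false;
  }
  return future.get()->success;
}

// Nominal path: deactivate -> cleanup -> shutdown. On any failure, fall back
// to the direct shutdown transition; a real manager would pick the shutdown
// transition matching the node's current state (active/inactive/unconfigured).
void shutdown_managed_node(
  const rclcpp::Node::SharedPtr & manager,
  const rclcpp::Client<lifecycle_msgs::srv::ChangeState>::SharedPtr & client)
{
  using lifecycle_msgs::msg::Transition;
  if (!(change_state(manager, client, Transition::TRANSITION_DEACTIVATE) &&
    change_state(manager, client, Transition::TRANSITION_CLEANUP) &&
    change_state(manager, client, Transition::TRANSITION_UNCONFIGURED_SHUTDOWN)))
  {
    change_state(manager, client, Transition::TRANSITION_ACTIVE_SHUTDOWN);
  }
}

The client would be created as, e.g., manager->create_client<lifecycle_msgs::srv::ChangeState>("/namespace/node/change_state").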

@g-arjones

Same here 👍

@g-arjones

Just to expand on that: since it's the subclasses that initialize and handle devices and sensor states, they are the ones that should be responsible for ensuring the states are not left unknown (in their dtors and/or on_shutdown).

@fujitatomoya
Collaborator

@oysstu @g-arjones @gabrielfpacheco thanks for the discussion and opinions.

All comments here make sense to me.

Calling a subclass method from a base class destructor is just not possible in C++ (which also explains #2554).

This is also true.

Just to expand on that, since it's the subclasses that are initializing/handling devices and sensor states they are the ones that should be responsible for ensuring the states are not unknown (in their dtors and/or on_shutdown)

Hmm, so that is the user's responsibility, and the base class never calls shutdown.

I am not sure at this moment. I am okay with rolling back everything, but before that I will bring it up with the other maintainers.

@g-arjones

I am okay with rolling back everything, but before that I will bring it up with the other maintainers.

Thank you, I would really appreciate that. This change created a lot of problems for us.

@fujitatomoya
Collaborator

Yeah, sorry that happened. I really appreciate your feedback and comments.

@g-arjones

No reason to apologize. Keep up the good work! 👍

@fujitatomoya
Collaborator

fujitatomoya commented Jun 10, 2024

@oysstu @g-arjones @gabrielfpacheco
CC: @mjcarroll @wjwwood @clalancette

So I have come to a conclusion: I will roll back calling shutdown in the LifecycleNode dtor, including all backports.

The reason is pretty much what we discussed here: we are generating another complication (hard to anticipate, and one the user cannot avoid: #2554) to address the one issue (#2553 (comment)). Until we can figure out more fine-grained control for user applications, we should roll back the behavior so that the user can be responsible for lifecycle management.

So at this moment it is the user application's responsibility to make sure the device or sensor is shut down, to avoid leaving them in an unknown state (either by calling shutdown in the subclass or by using a context shutdown callback).

LifecycleNode could register a shutdown callback on the context in the lifecycle node constructor and handle the shutdown there, but I think this only works if the node is managed by a shared_ptr, so that the context can use a weak_ptr to check whether the object is still valid. I think it would be simpler to leave this to the user application instead of providing something that sometimes works but sometimes does not.

As a follow-up, I guess something we can improve is a doc section saying that the user needs to call shutdown if necessary, and a warning message: check the current state in the dtor and, if the node is not shut down, warn the user that the LifecycleNode was not shut down, to let the user know.

What do you think? Any concerns or thoughts?

Again, thanks for pointing out my oversight and for sharing!

@g-arjones

@fujitatomoya Thank you for the update 👍

I think it would be simpler to leave this to the user application instead of providing something that sometimes works but sometimes does not.

Yeah, that's what makes the most sense to me too.

I guess something we can improve is a doc section saying that the user needs to call shutdown if necessary

Definitely...

and a warning message: check the current state in the dtor and, if the node is not shut down, warn the user that the LifecycleNode was not shut down, to let the user know.

Yeah, I guess that's fine. I'm just not sure which log level that should use, since for many applications not finalizing a node before terminating the process (or destroying its instance) is perfectly fine, and it is even common practice in many examples and tutorials.

@fujitatomoya
Collaborator

I'm just not sure which log level that should use, since for many applications not finalizing a node before terminating the process (or destroying its instance) is perfectly fine, and it is even common practice in many examples and tutorials.

I guess DEBUG. I am not sure whether any user application intentionally exits without shutting down, but I understand your concern that a warning could be surprising for users all of a sudden.

@oysstu
Contributor Author

oysstu commented Jun 10, 2024

I think this is the correct move right now. The context shutting down before the node is cleaned up points to either a programming/logic error or something exceptional happening. It might be best to leave it to the user to decide the actions to take.

LifecycleNode could register a shutdown callback on the context in the lifecycle node constructor and handle the shutdown there, but I think this only works if the node is managed by a shared_ptr, so that the context can use a weak_ptr to check whether the object is still valid. I think it would be simpler to leave this to the user application instead of providing something that sometimes works but sometimes does not.

As I mentioned, it's possible to do this if the lifecycle functionality were implemented in a base interface held as a shared pointer in the lifecycle class. I believe implementing this would enable some generic programming: treating the lifecycle node like a regular node through its base interfaces, while having the additional lifecycle interface for lifecycle-specific functionality.

Right now, the lifecycle API is a weird mix of functional programming and inheritance. I think it would be better to have the lifecycle class intended for use with inheritance and to disallow the direct callback functions. If the lifecycle functionality were implemented in a base interface, it could be added to any node, for users who prefer the functional style.
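
For illustration, a rough sketch of what such an interface might look like; this is entirely hypothetical, as no NodeLifecycleInterface exists in rclcpp_lifecycle today:

// Hypothetical NodeLifecycleInterface, composed into a node as a shared_ptr
// member, mirroring NodeBaseInterface and friends. Not a real rclcpp API.
class NodeLifecycleInterface
{
public:
  virtual ~NodeLifecycleInterface() = default;
  virtual const rclcpp_lifecycle::State & get_current_state() = 0;
  virtual const rclcpp_lifecycle::State & trigger_transition(uint8_t transition_id) = 0;
};

// A node (lifecycle or regular) would then expose something like:
//   std::shared_ptr<NodeLifecycleInterface> get_node_lifecycle_interface();
// so a context shutdown callback could hold a weak_ptr to just this
// interface, independently of how the node object itself is owned.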

@gabrielfpacheco

I appreciate your effort in looking into this, @fujitatomoya.

I think it would be simpler to leave this to the user application instead of providing something that sometimes works but sometimes does not.

It makes sense to me that the user is the one responsible for properly finalizing the node, since this may be application-specific.

Rolling back and improving the documentation seems like the right move to me as well.

@fujitatomoya
Collaborator

@oysstu @g-arjones @gabrielfpacheco thanks for the quick response.

I will start reverting; I will follow up with docs and a debug print after the reverts.

@fujitatomoya
Collaborator

See #2520 (comment); all related PRs have been rolled back and merged.

I will go ahead and close this issue; feel free to reopen if I missed anything.

@fujitatomoya
Collaborator

@oysstu @g-arjones @gabrielfpacheco just FYI, this is now closed.

@g-arjones

@fujitatomoya Thanks a lot! Any idea when the PR to rosdistro will be opened?

@fujitatomoya
Collaborator

That is something I am not sure about. You are talking about Humble, right? CC: @audrow

@g-arjones

You are talking about Humble, right?

Exactly...

@fujitatomoya
Collaborator

@audrow @clalancette do we have any plan for a Humble sync or patch?

fujitatomoya added a commit that referenced this issue Jun 12, 2024
  Currently it is the user application's responsibility to manage all state control.
  See #2520 for more details.

Signed-off-by: Tomoya Fujita <Tomoya.Fujita@sony.com>
@audrow
Member

audrow commented Jun 18, 2024

@audrow @clalancette do we have any plan for a Humble sync or patch?

I'll be doing syncs every two weeks or so. (A sync is in progress right now.)

As for a patch, I'm going to aim for the week of July 22nd. Let me know if I should do a patch sooner.

@fujitatomoya
Collaborator

@audrow thanks for sharing the plan.
CC: @g-arjones @gabrielfpacheco

@g-arjones

I would really appreciate it if this could be released sooner. It's an important bug fix (kind of a critical one, at least for us). Is that feasible?

@ros-discourse

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/new-packages-for-humble-hawksbill-2024-06-20/38277/2

@g-arjones

It looks like rclcpp wasn't included, unfortunately. Still broken...

@fujitatomoya
Collaborator

@g-arjones I had a chat with @audrow; that was just a sync, so community packages only. The next patch is currently scheduled for the middle of July.

@g-arjones

I would have expected something like this to have a higher priority, but I guess we will have to wait. Thank you for the notice!
