
Nodelet unloading broken #50

Closed
dseifert opened this issue Oct 19, 2016 · 4 comments

@dseifert
Contributor

dseifert commented Oct 19, 2016

I noticed an issue with nodelets not being properly unloaded. Consider a launch file that starts n nodelets (with n > 9; I recommend n = 20) in the same nodelet manager:

<launch>
    <node pkg="nodelet" type="nodelet" name="nodelet_manager" output="screen" args="manager" />

    <node pkg="nodelet" type="nodelet" name="n01" output="screen" args="load nodelet_tutorial_math/Plus nodelet_manager" />
[start n02 etc here, too]
</launch>

Once these 20 nodelets are up and running, press Ctrl-C. One would expect the shutdown process to be quite linear, with all 20 nodelets and the manager being shut down quickly. However, it seems the first 9 nodelets are shut down very quickly, while the other n-9 nodelets are shut down at a rate of one per second. Give this a try (on Indigo, Jade or Kinetic).

Upon delving into the code, I found out what is happening.

First, the perception of 9 nodelets quitting quickly is wrong. As roslaunch uses 10 process killers in parallel, up to 10 nodelets will be announced immediately after hitting Ctrl-C (giving the wrong impression), but what actually happens is that the first n-9 nodelets take 1 second each to shut down (serially). As soon as those n-9 nodelets (or rather their loader nodes) are taken down, the 10th process killer slot is filled with the nodelet manager, and sending SIGINT to the nodelet manager seems to just tear down all remaining nodelets immediately. I am not sure whether this is the intended behaviour; is it?

Second, the main issue here is that the entire nodelet unloading process seems to be broken because bond status messages are not processed during unloading. When a nodelet loader process is signaled to quit, it calls the unload service of the nodelet manager. This in turn tries to break the bond, which involves sending a bond status message "active = False" to the nodelet loader process and waiting for it to also send "active = False". However, if you look at nodelet.cpp you will see that no spinner is active: ros::spinOnce was last called before starting the unload procedure, so no further messages (including bond status messages) are received or published. Hence the 1-second timeout for breaking the bond kicks in.
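To make this concrete, here is a simplified sketch of the problematic shape. This is not the actual nodelet.cpp code, only the structure of the loader's unload path; the service name and message type are the real nodelet ones.

// Simplified sketch; not the actual nodelet.cpp code.
#include <string>
#include <ros/ros.h>
#include <nodelet/NodeletUnload.h>

bool requestUnload(const std::string& manager, const std::string& name)
{
  ros::NodeHandle nh;
  ros::ServiceClient client =
      nh.serviceClient<nodelet::NodeletUnload>(manager + "/unload_nodelet");

  nodelet::NodeletUnload srv;
  srv.request.name = name;

  // This call blocks until the manager has finished breaking the bond.
  // With no spinner running in this process, our side never processes the
  // manager's "active = False" bond status, so the manager waits out the
  // bond timeout (about 1 second) before replying.
  return client.call(srv) && srv.response.success;
}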

I tried to fix this quickly by introducing an AsyncSpinner. This improved the required shutdown time (by roughly n-9 seconds), but it triggered a new problem: for every nodelet I now get an error like /nodelet_manager: Failed to find nodelet with name '/n01' to unload. This, third, seems to be caused by LoaderROS::unload() being called twice: first by the service call, and then again when the bond is broken, via the broken callback set up in loader.cpp:serviceLoad().
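For reference, a minimal sketch of the quick fix, assuming this shape of the loader's main loop (it is not the exact patch, and the node name is made up): a background AsyncSpinner keeps servicing callbacks, so the bond status messages are still exchanged while the blocking unload request is in flight.

// Minimal sketch of the AsyncSpinner quick fix; assumed shape, not the exact patch.
#include <signal.h>
#include <ros/ros.h>

volatile sig_atomic_t g_request_shutdown = 0;

void onSigint(int)
{
  g_request_shutdown = 1;
}

int main(int argc, char** argv)
{
  ros::init(argc, argv, "nodelet_loader", ros::init_options::NoSigintHandler);
  signal(SIGINT, onSigint);

  // Without this spinner nothing services the bond subscription once the
  // loop below exits, and the unload request blocks for the bond timeout.
  ros::AsyncSpinner spinner(1);
  spinner.start();

  // ... ask the manager to load the nodelet here ...

  while (ros::ok() && !g_request_shutdown)
    ros::WallDuration(0.1).sleep();

  // ... call the manager's unload_nodelet service here; the spinner keeps
  // the bond heartbeats flowing, so the bond breaks immediately ...

  return 0;
}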

@mikaelarguedas
Member

mikaelarguedas commented Oct 20, 2016

Thanks for the detailed report.
I tested your fix for 2 and 3; it is a great improvement. It seems that sometimes the nodelet manager is torn down before all nodelets are fully unloaded; I am investigating right now.

Regarding point one: I see how it can be confusing. The message "Unloading nodelet <nodelet_name> from manager <nodelet_manager_name>" means that a request to unload the nodelet has been sent; it doesn't mean that the nodelet is actually unloading or fully unloaded.
The one actually providing this information is

ROS_DEBUG ("Done unloading nodelet %s", name.c_str ());

Do you think the verbosity level of this message should be changed to be more transparent to the user?

@dseifert
Contributor Author

As far as I could see, the manager is torn down before all nodelets because roslaunch has 10 process killers running in parallel. Once it gets to the manager and the manager receives the SIGINT signal, it just stops without properly clearing the remaining nodelets. My guess would be that nodelet.cpp should install a signal handler as well for the "manager" case and initiate a shutdown of all nodelets from that side. Or maybe I am reading this wrong.
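A hypothetical sketch of that suggestion, assuming a much simplified manager main(): it is not existing nodelet code; Loader::listLoadedNodelets() and Loader::unload() are the real Loader API, everything else is made up for illustration.

// Hypothetical sketch, not existing nodelet code: a custom SIGINT handler
// for the "manager" case so loaded nodelets are destructed before exit.
#include <signal.h>
#include <string>
#include <vector>

#include <ros/ros.h>
#include <nodelet/loader.h>

volatile sig_atomic_t g_shutdown_requested = 0;

void onSigint(int)
{
  // Only set a flag here; the real work happens in the main loop.
  g_shutdown_requested = 1;
}

int main(int argc, char** argv)
{
  // NoSigintHandler so that our handler, not roscpp's default, owns SIGINT.
  ros::init(argc, argv, "manager", ros::init_options::NoSigintHandler);
  signal(SIGINT, onSigint);

  nodelet::Loader loader;

  ros::AsyncSpinner spinner(1);
  spinner.start();

  while (ros::ok() && !g_shutdown_requested)
    ros::WallDuration(0.1).sleep();

  // Explicitly unload all remaining nodelets so their destructors are
  // called before the node goes down.
  std::vector<std::string> names = loader.listLoadedNodelets();
  for (size_t i = 0; i < names.size(); ++i)
    loader.unload(names[i]);

  ros::shutdown();
  return 0;
}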

As for the verbosity: maybe it should be the same as the "Unloading nodelet from manager" message so the two are paired (start & end).

dseifert pushed a commit to dseifert/nodelet_core that referenced this issue Oct 30, 2016
dseifert pushed a commit to dseifert/nodelet_core that referenced this issue Oct 30, 2016
@dseifert
Contributor Author

@mikaelarguedas re point 1: is it the intended behaviour that, when quitting the nodelet manager, the remaining running nodelets are not destructed? If not, I'd like to open a new issue for it in order to get #51 merged, as this is a slightly different issue.

To test this, prepare the following:

  1. Add a destructor to nodelet_core/test_nodelet/src/plus.cpp that has a ROS_INFO (or higher) output (a minimal sketch follows after the launch file).
  2. Create this launch file:
<launch>
  <node pkg="nodelet" type="nodelet" name="nodelet_manager"  args="manager" output="screen"/>

  <node pkg="nodelet" type="nodelet" name="n1" output="screen" args="load test_nodelet/Plus nodelet_manager" />
</launch>
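For step 1, a hypothetical minimal version of the Plus nodelet with such a destructor. It is modelled on, but not identical to, test_nodelet/src/plus.cpp; the only point is the ROS_INFO in the destructor, which tells you whether the nodelet was actually destructed.

// Hypothetical minimal nodelet for step 1; not the actual plus.cpp.
#include <nodelet/nodelet.h>
#include <pluginlib/class_list_macros.h>
#include <ros/ros.h>

namespace test_nodelet
{
class Plus : public nodelet::Nodelet
{
public:
  virtual ~Plus()
  {
    // If this never prints on shutdown, the nodelet was not destructed.
    ROS_INFO("Plus nodelet destructor called");
  }

private:
  virtual void onInit()
  {
    NODELET_INFO("Plus nodelet initialized");
  }
};
}  // namespace test_nodelet

PLUGINLIB_EXPORT_CLASS(test_nodelet::Plus, nodelet::Nodelet)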

Now, perform these tests:

  • Test 1: run the launch file; use ps xa | grep nodelet to figure out the process ID of the nodelet load command and kill -SIGINT it ... you will see that the destructor is called
  • Test 2: run the launch file, use ps xa | grep nodelet to figure out the process ID of the nodelet manager command and kill -SIGINT it ... you will see that the destructor is NOT called

mikaelarguedas pushed a commit that referenced this issue Nov 1, 2016
* add test whether bond breaking on unload works (tests #50)

* disable callback for broken bond when we are breaking it

This avoids the nodelet::LoaderROS::unload() method being called
twice for the same nodelet, which caused an error output.

* use AsyncSpinner for nodelet load in order for the shutdown procedure to work

During shutdown, the bonds still need to communicate their status in order
for the nodelet to properly/cleanly/quickly unload. This requires the node
to spin.

* add test whether LoaderROS::unload() is called twice (tests #50)
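A rough sketch of the idea in the second bullet above, under the assumption that the manager keeps a map from nodelet name to its bond (this is the assumed shape, not the exact loader.cpp change): before the unload path breaks the bond itself, the broken-bond callback is replaced with a no-op so that breaking the bond does not re-enter unload().

// Rough sketch of disabling the broken-bond callback before breaking the
// bond on purpose; assumed shape, not the exact loader.cpp patch.
#include <map>
#include <string>
#include <boost/shared_ptr.hpp>
#include <bondcpp/bond.h>

typedef std::map<std::string, boost::shared_ptr<bond::Bond> > BondMap;

void doNothing() {}

bool unloadNodelet(const std::string& name, BondMap& bonds)
{
  BondMap::iterator it = bonds.find(name);
  if (it != bonds.end())
  {
    // We are breaking the bond intentionally; without this, the broken
    // callback would call the unload path a second time and report
    // "Failed to find nodelet with name ... to unload".
    it->second->setBrokenCallback(doNothing);
    it->second->breakBond();
    bonds.erase(it);
  }

  // ... actually remove the nodelet from the manager here ...
  return true;
}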
@mikaelarguedas
Member

Closing this in favor of #55.
