
Nodelet unloading broken #50

Closed
dseifert opened this issue Oct 19, 2016 · 4 comments

@dseifert
Contributor

dseifert commented Oct 19, 2016

I noticed an issue with nodelets not being properly unloaded. Consider a launch file that starts n nodelets (with n > 9; I recommend n = 20) in the same nodelet manager:

<launch>
    <node pkg="nodelet" type="nodelet" name="nodelet_manager" output="screen" args="manager" />

    <node pkg="nodelet" type="nodelet" name="n01" output="screen" args="load nodelet_tutorial_math/Plus nodelet_manager" />
[start n02 etc here, too]
</launch>

Once these 20 nodelets are up and running, press Ctrl-C. One would expect the shutdown process to be quite linear, with all 20 nodelets and the manager being shut down quickly. However, it seems the first 9 nodelets are shut down very quickly, while the other n-9 nodelets are shut down at a rate of one per second. Give this a try (on Indigo, Jade or Kinetic).

Upon delving into the code, I found out what is happening.

First, the perception of 9 nodelets quitting quickly is wrong. As roslaunch uses 10 process killers in parallel, up to 10 nodelets will be announced immediately after hitting Ctrl-C (giving the wrong impression), but what actually happens is that the first n-9 nodelets take 1 second each to shut down (serially). As soon as those n-9 nodelets (or rather their loader nodes) are taken down, the 10th process killer slot is filled with the nodelet manager, and sending SIGINT to the nodelet manager seems to just tear down all remaining nodelets immediately. I am not sure whether this is the intended behaviour; is it?

Second, the main issue here is that the entire nodelet unloading process seems to be broken because bond status messages are not processed during unloading. When a nodelet loader process is signaled to quit, it calls the unload service of the nodelet manager. This in turn tries to break the bond, which involves sending a bond status message "active = False" to the nodelet loader process and waiting for it to also send "active = False". However, if you look at nodelet.cpp you will see that no spinner is active: ros::spinOnce was last called before starting the unload procedure, so no further messages (including bond status messages) are received or published. Hence the 1-second timeout for breaking the bond kicks in.
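To make this concrete, here is a simplified sketch of the problematic shape. This is not the actual nodelet.cpp code, only the structure of the loader's unload path; the service name and message type are the real nodelet ones.

// Simplified sketch; not the actual nodelet.cpp code.
#include <string>
#include <ros/ros.h>
#include <nodelet/NodeletUnload.h>

bool requestUnload(const std::string& manager, const std::string& name)
{
  ros::NodeHandle nh;
  ros::ServiceClient client =
      nh.serviceClient<nodelet::NodeletUnload>(manager + "/unload_nodelet");

  nodelet::NodeletUnload srv;
  srv.request.name = name;

  // This call blocks until the manager has finished breaking the bond.
  // With no spinner running in this process, our side never processes the
  // manager's "active = False" bond status, so the manager waits out the
  // bond timeout (about 1 second) before replying.
  return client.call(srv) && srv.response.success;
}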

I tried to fix this quickly by introducing an AsyncSpinner. This improved the required shutdown time (by roughly n-9 seconds), but it triggered a new problem: for every nodelet I now get an error like /nodelet_manager: Failed to find nodelet with name '/n01' to unload. This, third, seems to be caused by LoaderROS::unload() being called twice: first by the service call, and then again when the bond is broken, via the broken callback set up in loader.cpp:serviceLoad().
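For reference, a minimal sketch of the quick fix, assuming this shape of the loader's main loop (it is not the exact patch, and the node name is made up): a background AsyncSpinner keeps servicing callbacks, so the bond status messages are still exchanged while the blocking unload request is in flight.

// Minimal sketch of the AsyncSpinner quick fix; assumed shape, not the exact patch.
#include <signal.h>
#include <ros/ros.h>

volatile sig_atomic_t g_request_shutdown = 0;

void onSigint(int)
{
  g_request_shutdown = 1;
}

int main(int argc, char** argv)
{
  ros::init(argc, argv, "nodelet_loader", ros::init_options::NoSigintHandler);
  signal(SIGINT, onSigint);

  // Without this spinner nothing services the bond subscription once the
  // loop below exits, and the unload request blocks for the bond timeout.
  ros::AsyncSpinner spinner(1);
  spinner.start();

  // ... ask the manager to load the nodelet here ...

  while (ros::ok() && !g_request_shutdown)
    ros::WallDuration(0.1).sleep();

  // ... call the manager's unload_nodelet service here; the spinner keeps
  // the bond heartbeats flowing, so the bond breaks immediately ...

  return 0;
}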

@mikaelarguedas
Member

mikaelarguedas commented Oct 20, 2016

Thanks for the detailed report.
I tested your fix for 2 and 3; it is a great improvement. It seems that sometimes the nodelet manager is torn down before all nodelets are fully unloaded; I am investigating right now.

Regarding point one: I see how it can be confusing. The message "Unloading nodelet <nodelet_name> from manager <nodelet_manager_name>" means that a request to unload the nodelet has been sent; it doesn't mean that the nodelet is actually unloading or fully unloaded.
The one actually providing this information is

ROS_DEBUG ("Done unloading nodelet %s", name.c_str ());

Do you think the verbosity level of this message should be changed to be more transparent to the user?

@dseifert
Contributor Author

As far as I could see, the manager is torn down before all nodelets because roslaunch has 10 process killers running in parallel. Once it gets to the manager and the manager receives the SIGINT signal, it just stops without properly clearing the remaining nodelets. My guess would be that nodelet.cpp should install a signal handler as well for the "manager" case and initiate a shutdown of all nodelets from that side. Or maybe I am reading this wrong.
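A hypothetical sketch of that suggestion, assuming a much simplified manager main(): it is not existing nodelet code; Loader::listLoadedNodelets() and Loader::unload() are the real Loader API, everything else is made up for illustration.

// Hypothetical sketch, not existing nodelet code: a custom SIGINT handler
// for the "manager" case so loaded nodelets are destructed before exit.
#include <signal.h>
#include <string>
#include <vector>

#include <ros/ros.h>
#include <nodelet/loader.h>

volatile sig_atomic_t g_shutdown_requested = 0;

void onSigint(int)
{
  // Only set a flag here; the real work happens in the main loop.
  g_shutdown_requested = 1;
}

int main(int argc, char** argv)
{
  // NoSigintHandler so that our handler, not roscpp's default, owns SIGINT.
  ros::init(argc, argv, "manager", ros::init_options::NoSigintHandler);
  signal(SIGINT, onSigint);

  nodelet::Loader loader;

  ros::AsyncSpinner spinner(1);
  spinner.start();

  while (ros::ok() && !g_shutdown_requested)
    ros::WallDuration(0.1).sleep();

  // Explicitly unload all remaining nodelets so their destructors are
  // called before the node goes down.
  std::vector<std::string> names = loader.listLoadedNodelets();
  for (size_t i = 0; i < names.size(); ++i)
    loader.unload(names[i]);

  ros::shutdown();
  return 0;
}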

As for the verbosity: maybe it should be the same as the "Unloading nodelet from manager" message so the two are paired (start & end).

dseifert pushed a commit to dseifert/nodelet_core that referenced this issue Oct 30, 2016
dseifert pushed a commit to dseifert/nodelet_core that referenced this issue Oct 30, 2016
@dseifert
Contributor Author

@mikaelarguedas re point 1: is it the intended behaviour that, when quitting the nodelet manager, the remaining running nodelets are not destructed? If not, I'd like to open a new issue for it in order to get #51 merged, as this is a slightly different issue.

To test this, prepare the following:

  1. Add a destructor to nodelet_core/test_nodelet/src/plus.cpp that has a ROS_INFO (or higher) output (a minimal sketch follows after the launch file).
  2. Create this launch file:
<launch>
  <node pkg="nodelet" type="nodelet" name="nodelet_manager"  args="manager" output="screen"/>

  <node pkg="nodelet" type="nodelet" name="n1" output="screen" args="load test_nodelet/Plus nodelet_manager" />
</launch>
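For step 1, a hypothetical minimal version of the Plus nodelet with such a destructor. It is modelled on, but not identical to, test_nodelet/src/plus.cpp; the only point is the ROS_INFO in the destructor, which tells you whether the nodelet was actually destructed.

// Hypothetical minimal nodelet for step 1; not the actual plus.cpp.
#include <nodelet/nodelet.h>
#include <pluginlib/class_list_macros.h>
#include <ros/ros.h>

namespace test_nodelet
{
class Plus : public nodelet::Nodelet
{
public:
  virtual ~Plus()
  {
    // If this never prints on shutdown, the nodelet was not destructed.
    ROS_INFO("Plus nodelet destructor called");
  }

private:
  virtual void onInit()
  {
    NODELET_INFO("Plus nodelet initialized");
  }
};
}  // namespace test_nodelet

PLUGINLIB_EXPORT_CLASS(test_nodelet::Plus, nodelet::Nodelet)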

Now, perform these tests:

  • Test 1: run the launch file; use ps xa | grep nodelet to figure out the process ID of the nodelet load command and kill -SIGINT it ... you will see that the destructor is called
  • Test 2: run the launch file, use ps xa | grep nodelet to figure out the process ID of the nodelet manager command and kill -SIGINT it ... you will see that the destructor is NOT called

mikaelarguedas pushed a commit that referenced this issue Nov 1, 2016
* add test whether bond breaking on unload works (tests #50)

* disable callback for broken bond when we are breaking it

This avoids the nodelet::LoaderROS::unload() method being called
twice for the same nodelet, which caused an error output.

* use AsyncSpinner for nodelet load in order for the shutdown procedure to work

During shutdown, the bonds still need to communicate their status in order
for the nodelet to properly/cleanly/quickly unload. This requires the node
to spin.

* add test whether LoaderROS::unload() is called twice (tests #50)
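A rough sketch of the idea in the second bullet above, under the assumption that the manager keeps a map from nodelet name to its bond (this is the assumed shape, not the exact loader.cpp change): before the unload path breaks the bond itself, the broken-bond callback is replaced with a no-op so that breaking the bond does not re-enter unload().

// Rough sketch of disabling the broken-bond callback before breaking the
// bond on purpose; assumed shape, not the exact loader.cpp patch.
#include <map>
#include <string>
#include <boost/shared_ptr.hpp>
#include <bondcpp/bond.h>

typedef std::map<std::string, boost::shared_ptr<bond::Bond> > BondMap;

void doNothing() {}

bool unloadNodelet(const std::string& name, BondMap& bonds)
{
  BondMap::iterator it = bonds.find(name);
  if (it != bonds.end())
  {
    // We are breaking the bond intentionally; without this, the broken
    // callback would call the unload path a second time and report
    // "Failed to find nodelet with name ... to unload".
    it->second->setBrokenCallback(doNothing);
    it->second->breakBond();
    bonds.erase(it);
  }

  // ... actually remove the nodelet from the manager here ...
  return true;
}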
@mikaelarguedas
Member

Closing this in favor of #55.
