
LoadComposableNodes action can cause launch process to hang forever #171

Closed
jacobperron opened this issue Aug 14, 2020 · 7 comments · Fixed by #174
Labels
bug Something isn't working

Comments

@jacobperron
Member

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Installation type:
    • Any
  • Version or commit hash:
    • Any
  • DDS implementation:
    • N/A
  • Client library (if applicable):
    • N/A

Steps to reproduce issue

Run this intentionally broken mock component container:

import time
import threading

from composition_interfaces.srv import LoadNode
import rclpy
import rclpy.context
import rclpy.executors
import rclpy.node


class MockComponentContainer(rclpy.node.Node):

    def __init__(self):
        # List of LoadNode requests received
        self.requests = []

        self._context = rclpy.context.Context()
        rclpy.init(context=self._context)

        super().__init__('mock_component_container', context=self._context)

        self.load_node_service = self.create_service(LoadNode, '~/_container/load_node', self.load_node_callback)

        self._executor = rclpy.executors.SingleThreadedExecutor(context=self._context)

        # Start spinning in a thread
        self._thread = threading.Thread(target=rclpy.spin, args=(self, self._executor), daemon=True)
        self._thread.start()

    def load_node_callback(self, request, response):
        print(f'Request received: {request}')
        self.requests.append(request)
        response.success = True
        response.full_node_name = f'{request.node_namespace}/{request.node_name}'
        response.unique_id = len(self.requests)
        # BROKEN HERE (forgot to return response)

    def ok(self):
        return rclpy.ok(context=self._context)


def main():
    container = MockComponentContainer()
    while container.ok():
        time.sleep(1)


if __name__ == '__main__':
    main()

Launch a LoadComposableNodes action pointing to the broken container:

import launch

from launch_ros.actions import LoadComposableNodes
from launch_ros.descriptions import ComposableNode


def generate_launch_description():
    return launch.LaunchDescription([
        LoadComposableNodes(
            target_container='/mock_component_container',
            composable_node_descriptions=[
                ComposableNode(
                    package='demo_nodes_cpp',
                    plugin='demo_nodes_cpp::Talker',
                )
            ]
        )
    ])

Expected behavior

The LoadComposableNodes action fails to load the node, but fails gracefully (without hanging).

Actual behavior

The launch process hangs forever. Even sending SIGINT or SIGQUIT has no effect; we have to SIGKILL the launch process.

Additional information

After some debugging, it appears that we get stuck in this synchronous service call:

response = self.__rclpy_load_node_client.call(request)

So the future being tracked by the launch context never completes, and launch waits indefinitely.

Perhaps there are two bugs here:

  1. Launch should probably not wait indefinitely for all futures to complete during shutdown. Rather, it should time out and cancel these hung futures.
  2. We can fix the LoadComposableNodes action so that it makes an asynchronous service call and monitors the launch shutdown state (similar to the LifecycleNode action); see the sketch below.
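
As a rough illustration of the second point, a non-blocking call could look something like the sketch below. This is only a sketch: load_node_nonblocking and shutdown_event are hypothetical names, and client.call_async is the only actual rclpy API used. It also assumes the client's node is being spun in a separate executor thread so the future can complete.

import time


def load_node_nonblocking(client, request, shutdown_event, timeout=10.0):
    """Return the service response, or None if launch shut down or the call timed out."""
    # call_async returns an rclpy Future immediately instead of blocking
    future = client.call_async(request)
    deadline = time.monotonic() + timeout
    while not future.done():
        # Abandon the request if launch is shutting down or we have waited too
        # long, so a misbehaving container cannot hang the launch process.
        if shutdown_event.is_set() or time.monotonic() > deadline:
            future.cancel()
            return None
        time.sleep(0.1)
    return future.result()
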
@jacobperron
Member Author

@wjwwood @ivanpauno What do you think about the first point I listed above?

Launch should probably not wait indefinitely for all futures to complete during shutdown. Rather, it should probably timeout and cancel these hung futures.

If you agree, I can look into patching the launch service so that it has a timeout for waiting on futures during shutdown. Alternatively, we can at least add back a debug log telling the user that launch is still waiting on futures, but I don't think this is that useful, especially if we can't tell where the futures came from.
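
To make that concrete, something like the following could bound the wait during shutdown. This is purely illustrative: pending_futures and SHUTDOWN_FUTURE_TIMEOUT are assumed names, not existing launch internals.

import asyncio

SHUTDOWN_FUTURE_TIMEOUT = 60.0  # illustrative bound on how long shutdown waits


async def wait_for_futures_on_shutdown(pending_futures):
    if not pending_futures:
        return
    # Wait up to the timeout, then cancel whatever never completed so that
    # shutdown can proceed instead of hanging forever.
    done, pending = await asyncio.wait(pending_futures, timeout=SHUTDOWN_FUTURE_TIMEOUT)
    for future in pending:
        future.cancel()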

@ivanpauno
Member

If you agree, I can look into patching the launch service so that it has a timeout for waiting on futures during shutdown. Alternatively, we can at least add back a debug log telling the user that launch is still waiting on futures, but I don't think this is that useful, especially if we can't tell where the futures came from.

The problem with timeouts is defining how much is "too much".
I think it's reasonable to set some maximum, e.g. 1 minute.
It would help us avoid shooting ourselves in the foot.

Alternatively, we can at least add back a debug log telling the user that launch is still waiting on futures

Adding back the debug logs is surely a good idea.

@jacobperron
Member Author

The other option is to not even have a timeout; if launch is shutting down, just cancel all pending futures.

@ivanpauno
Member

The other option is to not even have a timeout; if launch is shutting down, just cancel all pending futures.

I think there is a reason why we're not doing that, but I don't recall it exactly.
Maybe @wjwwood or @hidmic remembers it.

@wjwwood
Member

wjwwood commented Aug 17, 2020

Launch should probably not wait indefinitely for all futures to complete during shutdown. Rather, it should probably timeout and cancel these hung futures.

I don't know offhand, but I'd say it would be better if they were not canceled and could instead just shut down cleanly. That being said, having it error out at some point rather than hanging might be OK. But there might be cases where a long shutdown is needed, like when you're waiting on a file to close or something.

@jacobperron
Member Author

But there might be cases where a long shutdown is needed, like you're waiting on a file to close or something.

That's a good point.

I'll fix the issue in LoadComposableNodes and leave the launch shutdown logic as-is for now. If other users run into this, maybe we can add a launch option for a shutdown timeout (or consider letting the SIGQUIT signal cancel any outstanding futures).
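
For the SIGQUIT idea, the handler could look roughly like this. It is only a sketch under the assumption that launch exposes its asyncio event loop and a collection of pending futures (pending_futures is an invented name); launch has its own signal handling, so the real wiring would differ.

import signal


def install_sigquit_cancel(loop, pending_futures):
    # Register the handler on the asyncio event loop so the callback runs in
    # the loop thread, where cancelling asyncio futures is safe.
    def _on_sigquit():
        for future in pending_futures:
            if not future.done():
                future.cancel()
    loop.add_signal_handler(signal.SIGQUIT, _on_sigquit)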

@wjwwood
Member

wjwwood commented Aug 17, 2020

consider letting the SIGQUIT signal cancel any outstanding futures

This seems reasonable to me.

jacobperron added a commit that referenced this issue Aug 18, 2020
Fixes #171

By asynchronously waiting for the service response, we can monitor if launch is shutting down and abandon the request so we don't block the shutdown process.

Signed-off-by: Jacob Perron <jacob@openrobotics.org>
jacobperron added a commit that referenced this issue Aug 28, 2020
* Asynchronously wait for load node service response

Fixes #171

By asynchronously waiting for the service response, we can monitor if launch is shutting down and abandon the request so we don't block the shutdown process.

Signed-off-by: Jacob Perron <jacob@openrobotics.org>

* Add back debug log

It was accidentally removed.

Signed-off-by: Jacob Perron <jacob@openrobotics.org>
jacobperron added further commits that referenced this issue on Nov 19–20, 2020, May 25, 2021, and May 27, 2021, all with the same commit message as above.
hayato-m126 pushed a commit to tier4/launch_ros that referenced this issue May 19, 2021, with the same commit message (Fixes ros2#171).