
check for shutdown while waiting for a service response to avoid hang during shutdown #104

Merged

wjwwood merged 2 commits into master from fix_lifecycle_hang_on_shutdown_race on Dec 14, 2019

Conversation

wjwwood
Member

@wjwwood wjwwood commented Dec 10, 2019

Should fix #102
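A rough sketch of the pattern named in the PR title: instead of blocking indefinitely on the service response, wake up periodically and check for shutdown. All names here are illustrative stand-ins, not the actual launch_ros implementation:

```python
import threading
from concurrent.futures import Future

# Illustrative sketch only: poll the response future with a short timeout
# so a shutdown signal can interrupt the wait instead of hanging forever.
def wait_for_response(response_future: Future,
                      shutdown_event: threading.Event,
                      poll_sec: float = 1.0):
    """Wait for the service response, but bail out if shutdown happens first."""
    while not response_future.done():
        # event.wait() returns True as soon as shutdown is signaled,
        # and False after poll_sec elapses with no shutdown.
        if shutdown_event.wait(timeout=poll_sec):
            return None  # shutting down: abandon the call instead of hanging
    return response_future.result()
```

With this shape, a launch shutdown sets the event and the waiting thread exits its loop within one poll interval.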

… during shutdown

Signed-off-by: William Woodall <william@osrfoundation.org>
@wjwwood wjwwood added the bug (Something isn't working) and in review labels Dec 10, 2019
@wjwwood wjwwood self-assigned this Dec 10, 2019
@wjwwood
Member Author

wjwwood commented Dec 10, 2019

CI up to lifecycle, which is where the problem occurred. I'll run broader CI afterwards if it passes:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@pbaughman
Contributor

@wjwwood I'm away from the office right now, but I'll test this in the setup that hangs as soon as I'm in tomorrow morning.

@wjwwood
Member Author

wjwwood commented Dec 10, 2019

Full CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@pbaughman
Contributor

pbaughman commented Dec 11, 2019

@wjwwood A few new stack traces:

[INFO] [lifecycle_talker-1]: process has finished cleanly [pid 9573]
Future exception was never retrieved
future: <Future finished exception=AttributeError("'LifecycleNode' object has no attribute '_LifecycleNode___logger'",)>
Traceback (most recent call last):
  File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/pete.baughman/gc/apex_ws/build/launch_ros/launch_ros/actions/lifecycle_node.py", line 99, in _call_change_state
    self.___logger.warning(
AttributeError: 'LifecycleNode' object has no attribute '_LifecycleNode___logger'

Followed by (and possibly the cause of) the following:

Exception ignored in: <coroutine object LaunchService._process_one_event at 0x7f286fff4308>
Traceback (most recent call last):
  File "/home/pete.baughman/gc/apex_ws/build/launch/launch/launch_service.py", line 286, in _process_one_event
    next_event = await self.__context._event_queue.get()
  File "/usr/lib/python3.6/asyncio/queues.py", line 169, in get
    getter.cancel()  # Just in case getter is not done yet.
  File "/usr/lib/python3.6/asyncio/base_events.py", line 591, in call_soon
    self._check_closed()
  File "/usr/lib/python3.6/asyncio/base_events.py", line 377, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

Looks like a simple typo: too many underscores.

Edit: Weirdly, the same issue is on line 81, but nobody ever noticed?
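For anyone puzzled by the mangled attribute name in the traceback: inside a class body, Python mangles any attribute with two or more leading underscores by prepending `_ClassName`, so the extra underscore in `self.___logger` looks up `_LifecycleNode___logger`, which was never set. A minimal illustration (toy class, not the real launch_ros one):

```python
# Toy class illustrating the name mangling behind the AttributeError above.
class LifecycleNode:
    def __init__(self):
        self.__logger = 'a logger'  # stored as _LifecycleNode__logger

    def ok(self):
        return self.__logger  # mangles to _LifecycleNode__logger: found

    def broken(self):
        # Three underscores mangle to _LifecycleNode___logger, which
        # was never assigned, hence the AttributeError in the traceback.
        return self.___logger


node = LifecycleNode()
print(node.ok())  # a logger
try:
    node.broken()
except AttributeError as exc:
    print(exc)  # ... has no attribute '_LifecycleNode___logger'
```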

@pbaughman
Contributor

pbaughman commented Dec 11, 2019

@wjwwood Even after fixing the triple-underscore logger calls, I still see:

[INFO] [lifecycle_talker-1]: process has finished cleanly [pid 11072]
Future exception was never retrieved
future: <Future finished exception=InvalidHandle('Tried to use a handle that has been destroyed.',)>
Traceback (most recent call last):
  File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/pete.baughman/gc/apex_ws/build/launch_ros/launch_ros/actions/lifecycle_node.py", line 79, in _call_change_state
    while not self.__rclpy_change_state_client.wait_for_service(timeout_sec=1.0):
  File "/home/pete.baughman/gc/apex_ws/install/rclpy/lib/python3.6/site-packages/rclpy/client.py", line 160, in wait_for_service
    while self.context.ok() and not self.service_is_ready() and timeout_sec > 0.0:
  File "/home/pete.baughman/gc/apex_ws/install/rclpy/lib/python3.6/site-packages/rclpy/client.py", line 143, in service_is_ready
    with self.handle as capsule:
  File "/home/pete.baughman/gc/apex_ws/install/rclpy/lib/python3.6/site-packages/rclpy/handle.py", line 146, in __enter__
    return self._get_capsule()
  File "/home/pete.baughman/gc/apex_ws/install/rclpy/lib/python3.6/site-packages/rclpy/handle.py", line 128, in _get_capsule
    raise InvalidHandle('Tried to use a handle that has been destroyed.')
rclpy.handle.InvalidHandle: Tried to use a handle that has been destroyed.

And then a little later:

Exception ignored in: <coroutine object LaunchService._process_one_event at 0x7f29034d7c50>
Traceback (most recent call last):
  File "/home/pete.baughman/gc/apex_ws/build/launch/launch/launch_service.py", line 286, in _process_one_event
    next_event = await self.__context._event_queue.get()
  File "/usr/lib/python3.6/asyncio/queues.py", line 169, in get
    getter.cancel()  # Just in case getter is not done yet.
  File "/usr/lib/python3.6/asyncio/base_events.py", line 591, in call_soon
    self._check_closed()
  File "/usr/lib/python3.6/asyncio/base_events.py", line 377, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
Task was destroyed but it is pending!

Note that this is a 'sometimes' stack trace, not an all-the-time stack trace.

@wjwwood
Member Author

wjwwood commented Dec 11, 2019

Edit: Weirdly, the same issue is on line 81, but nobody ever noticed?

It was just never executed since it's a rare race condition.


For the latest traceback (the first of the two in the latest comment), that's not even part of the modified code... It happens before it on line 79:

File "/home/pete.baughman/gc/apex_ws/build/launch_ros/launch_ros/actions/lifecycle_node.py", line 79, in _call_change_state
while not self.__rclpy_change_state_client.wait_for_service(timeout_sec=1.0):

I think that's likely the other issue where (rclcpp_/rclpy_)shutdown causes functions to fail afterwards.

For that one, I think we should just catch rclpy.handle.InvalidHandle and pass on it.
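The "catch and pass" idea could look something like the sketch below. `InvalidHandle` and the client are stand-ins here so the snippet is self-contained; the real exception is `rclpy.handle.InvalidHandle` and the real loop lives in `_call_change_state`:

```python
import threading


class InvalidHandle(Exception):
    """Stand-in for rclpy.handle.InvalidHandle."""


class DestroyedClient:
    """Stand-in for an rclpy client whose underlying handle is gone."""

    def wait_for_service(self, timeout_sec=1.0):
        raise InvalidHandle('Tried to use a handle that has been destroyed.')


def call_change_state(client, shutdown_event):
    """Return True if the call could proceed, False if shutdown got in the way."""
    try:
        while not client.wait_for_service(timeout_sec=1.0):
            if shutdown_event.is_set():
                return False  # launch is shutting down; stop waiting
        # ... issue the change-state request here ...
        return True
    except InvalidHandle:
        # The context was shut down and the handle destroyed mid-wait;
        # treat it like a normal shutdown instead of crashing.
        return False


print(call_change_state(DestroyedClient(), threading.Event()))  # False
```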

Signed-off-by: William Woodall <william@osrfoundation.org>
@wjwwood
Member Author

wjwwood commented Dec 11, 2019

I fixed the typos in 398cf8c

@pbaughman
Contributor

@wjwwood Do you propose we deal with the invalid handle in another PR or in this one? Or, as a third option, not worry about it at all?

@wjwwood
Member Author

wjwwood commented Dec 11, 2019

If it's causing a problem, we can go ahead and fix it, I guess, but it would be good to know whether the original problem (a hang, not a crash) is fixed, and that this fix doesn't cause the crashes you're seeing.

@pbaughman
Contributor

I'm pretty confident that the original issue (hang) is resolved. I could make it happen in 10 minutes before your change, and I let your change run for about 90 minutes without hanging.

Contributor

@hidmic hidmic left a comment


I wonder if ROSSpecificLaunchStartup should somehow provide this feature; any future ROS service calls made from within launch will run into the same problem, no?

@wjwwood
Member Author

wjwwood commented Dec 11, 2019

I wonder if ROSSpecificLaunchStartup should somehow provide this feature; any future ROS service calls made from within launch will run into the same problem, no?

Maybe so, but I think this is the only case so far. In general you need to think about these issues, as whoever wrote this (probably me) was already thinking about them with the wait_for_service() call, which checks for shutdown. It might help to refactor this later to avoid code duplication, though.
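One possible shape for that later refactor: a single shutdown-aware polling helper that both the wait_for_service() loop and the response wait could share. This is a hypothetical helper, not an existing launch or rclpy API:

```python
import threading


def poll_until(predicate, shutdown_event, timeout_sec=1.0):
    """Hypothetical shared helper: poll `predicate` until it returns True,
    giving up as soon as `shutdown_event` is set.

    Returns True if the predicate succeeded, False on shutdown.
    """
    while not predicate():
        # Sleeps up to timeout_sec, but wakes immediately on shutdown.
        if shutdown_event.wait(timeout=timeout_sec):
            return False
    return True
```

Both call sites would then express "wait for X, unless launch is shutting down" in one place instead of duplicating the loop.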

@wjwwood
Member Author

wjwwood commented Dec 11, 2019

New CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS (I canceled this because we're backed up on the farm and it was ok before)
  • Windows Build Status

@pbaughman
Contributor

@wjwwood Any update on this?

@wjwwood
Member Author

wjwwood commented Dec 13, 2019

Running CI again:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@wjwwood
Member Author

wjwwood commented Dec 14, 2019

Comparing the CI results to nightly, e.g. https://ci.ros2.org/view/nightly/job/nightly_linux_release/1389/testReport/, I've manually verified that all the failures were preexisting.

macOS is hung in the ros2cli tests (known issue), so I'm going to ignore that one.

Despite these issues, I'm going to merge this if I can find a review, so that some downstream users can bring the patch in.

Member

@jacobperron jacobperron left a comment


LGTM

@wjwwood wjwwood merged commit c37340d into master Dec 14, 2019
@delete-merged-branch delete-merged-branch bot deleted the fix_lifecycle_hang_on_shutdown_race branch December 14, 2019 02:39
ivanpauno pushed a commit that referenced this pull request Jan 17, 2020
… during shutdown (#104)

* check for shutdown while waiting for a service response to avoid hang during shutdown

Signed-off-by: William Woodall <william@osrfoundation.org>

* fix typo in logger call

Signed-off-by: William Woodall <william@osrfoundation.org>
ivanpauno pushed a commit that referenced this pull request Jan 17, 2020
… during shutdown (#104)

* check for shutdown while waiting for a service response to avoid hang during shutdown

Signed-off-by: William Woodall <william@osrfoundation.org>

* fix typo in logger call

Signed-off-by: William Woodall <william@osrfoundation.org>
Signed-off-by: ivan <ivanpauno@gmail.com>
ivanpauno added a commit that referenced this pull request Jan 21, 2020
* Add pid to launch_ros node name as suffix

Signed-off-by: Brian Ezequiel Marchi <brian.marchi65@gmail.com>
Signed-off-by: ivan <ivanpauno@gmail.com>

* Pass the node-name attribute through the substitution parser (#101)

Signed-off-by: ivan <ivanpauno@gmail.com>

* Fix push-ros-namespace in xml/yaml launch files (#100)

Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: ivan <ivanpauno@gmail.com>

* Maintain order of parameters regarding name and from (#99)

Signed-off-by: Brian Ezequiel Marchi <brian.marchi65@gmail.com>
Signed-off-by: ivan <ivanpauno@gmail.com>

* Use imperative mood in constructor docstrings. (#103)

* Use imperative mood in constructor docstrings.

Fixes D401 in pycodestyle 5.0.0 and flake8.

Signed-off-by: Steven! Ragnarök <steven@nuclearsandwich.com>

* Use imperative mood in docstring.

Signed-off-by: Steven! Ragnarök <steven@nuclearsandwich.com>

* Use imperative mood in docstrings.

Signed-off-by: Steven! Ragnarök <steven@nuclearsandwich.com>
Signed-off-by: ivan <ivanpauno@gmail.com>

* Fix misleading deprecated warnings when using launch arguments (#106)

Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: ivan <ivanpauno@gmail.com>

* check for shutdown while waiting for a service response to avoid hang during shutdown (#104)

* check for shutdown while waiting for a service response to avoid hang during shutdown

Signed-off-by: William Woodall <william@osrfoundation.org>

* fix typo in logger call

Signed-off-by: William Woodall <william@osrfoundation.org>
Signed-off-by: ivan <ivanpauno@gmail.com>

* Fix frontend topic remapping (#111)

* Add frontend remap test

Signed-off-by: Jacob Perron <jacob@openrobotics.org>

* Pass data_type parameter to remap entity

This resolves an issue where frontend remaps are not parsed.

Signed-off-by: Jacob Perron <jacob@openrobotics.org>
Signed-off-by: ivan <ivanpauno@gmail.com>

Co-authored-by: Brian Marchi <brian.marchi65@gmail.com>
Co-authored-by: Grey <grey@openrobotics.org>
Co-authored-by: Steven! Ragnarök <nuclearsandwich@users.noreply.github.com>
Co-authored-by: William Woodall <william+github@osrfoundation.org>
Co-authored-by: Jacob Perron <jacob@openrobotics.org>
Successfully merging this pull request may close these issues.

lifecycle_node causing launch to hang at shutdown
4 participants