
Expressly destroy a node's objects before the node. #456

Merged
merged 5 commits into master from nuclearsandwich/destroy-dependents-first on Nov 13, 2019

Conversation

nuclearsandwich
Member

This seems to reduce hangs during test runs described in
ros2/build_farmer#248.
Another few hours of testing will convince me to strengthen that statement.

The handles corresponding to the destroyed objects should be getting
destroyed explicitly when self.handle.destroy() is called below. It
seems however that when running with Fast-RTPS it's possible to get into
a state where multiple threads are waiting on futexes and none can move
forward. The rclpy context of this apparent deadlock is while clearing
a node's list of publishers or services (possibly others, although
publishers and services were the only ones observed).

I consider this patch to be a workaround rather than a fix; I'm not particularly proud of how little I understand about why the existing destruction handling is insufficient.
I think there may either be a race condition between the rcl/rmw layer
and the rmw implementation layer which is being tripped by the
haphazardness of Python's garbage collector or there is a logical
problem with the handle destruction ordering in rclpy that only
Fast-RTPS is sensitive to. A further possibility is to play around with pausing the garbage collector and triggering it at explicit points to see if that changes results at all.
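The shape of the workaround can be sketched with a toy model (hypothetical class and attribute names, not the actual rclpy API): each dependent object is destroyed explicitly, in a deterministic order, before the node's own handle, instead of leaving the ordering to Python's garbage collector.

```python
# Toy sketch of "destroy dependents first" (hypothetical names, not the
# real rclpy Node/Handle machinery): children are torn down
# deterministically before the parent handle.
destroyed = []

class Handle:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def destroy(self):
        if self.alive:
            self.alive = False
            destroyed.append(self.name)

class Node:
    def __init__(self):
        self.handle = Handle('node')
        self.publishers = [Handle('pub1'), Handle('pub2')]
        self.services = [Handle('srv1')]

    def destroy_node(self):
        # Expressly destroy the node's objects before the node itself.
        while self.publishers:
            self.publishers.pop().destroy()
        while self.services:
            self.services.pop().destroy()
        self.handle.destroy()

node = Node()
node.destroy_node()
assert destroyed[-1] == 'node'  # the node handle goes last
```

The point of the sketch is only the ordering guarantee: once `destroy_node()` returns, nothing about teardown order is left to the whims of garbage collection.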

@nuclearsandwich nuclearsandwich added the in progress Actively being worked on (Kanban column) label Nov 7, 2019
@nuclearsandwich nuclearsandwich self-assigned this Nov 7, 2019
@nuclearsandwich nuclearsandwich force-pushed the nuclearsandwich/destroy-dependents-first branch from 253f18e to 30600a1 Compare November 7, 2019 22:56
@nuclearsandwich
Member Author

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@mjcarroll
Member

This doesn't seem outrageous. It does seem like there may be a more elegant solution, but it probably requires some arcane knowledge of Python garbage collection's interactions with C extensions. Unless @sloretz has any immediate thoughts, I'm good with merging this + adding a tracking issue for a longer-term solution.

@dirk-thomas
Member

The new logic seems to remove items from the collections while iterating over them, which shouldn't be done. The loops should be changed into `while self._[name]:` to avoid that.
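The hazard being pointed out can be demonstrated in isolation (a generic Python example, not rclpy code): removing elements from a list while a `for` loop iterates over it silently skips elements, whereas draining the list with a `while` loop visits every item.

```python
# Mutating a list while iterating over it skips elements, because the
# iterator's index keeps advancing while items shift left.
buggy = ['a', 'b', 'c', 'd']
for item in buggy:
    buggy.remove(item)
assert buggy == ['b', 'd']  # 'b' and 'd' were never visited

# The suggested fix: drain the list with a while loop instead.
fixed = ['a', 'b', 'c', 'd']
while fixed:
    item = fixed.pop()
    # destruction of `item` would happen here
assert fixed == []  # every item was processed
```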

@nuclearsandwich
Member Author

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@nuclearsandwich nuclearsandwich force-pushed the nuclearsandwich/destroy-dependents-first branch from 97d3ccf to 0337867 Compare November 8, 2019 13:11
@nuclearsandwich
Member Author

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

rclpy/rclpy/node.py (outdated review thread, resolved)
As pointed out by Shane, pop()ing each item from the list before
passing it to the .destroy_ITEM() method prevents it from being
destroyed, as the individual methods first check that the item is present
in the list and then remove it before continuing to destroy it.

Signed-off-by: Steven! Ragnarök <steven@nuclearsandwich.com>
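The interaction described in this commit can be illustrated with a toy version (hypothetical names, not the real `destroy_publisher()` implementation): if the caller pops the item out of the node's list first, the destroy method's membership check fails and the item is never actually destroyed.

```python
# Toy model of the bug (hypothetical names): destroy_item() only
# destroys items it can first find and remove from the node's own list.
class Node:
    def __init__(self):
        self.items = ['pub']
        self.destroyed = []

    def destroy_item(self, item):
        if item not in self.items:
            return False  # unknown item: silently a no-op
        self.items.remove(item)
        self.destroyed.append(item)
        return True

# Buggy caller: pop() first, so the membership check fails.
node = Node()
assert node.destroy_item(node.items.pop()) is False
assert node.destroyed == []  # nothing was actually destroyed

# Fixed caller: read the item and let destroy_item() do the removal.
node = Node()
while node.items:
    assert node.destroy_item(node.items[0]) is True
assert node.destroyed == ['pub']
```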
Contributor
@sloretz sloretz left a comment

LGTM as a temporary workaround, though I don't know what causes the hang or how this works around it.

@nuclearsandwich
Member Author

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@nuclearsandwich
Member Author

> I don't know what causes the hang or how this works around it.

I've also been dissatisfied with this solution. After discussion with @sloretz I've started another experiment running with Python's GC disabled before 1461 and enabled after self.handle.destroy(). That has also seemingly stopped the hang from occurring over the last ~50 test runs which means we might need to look more closely at the handle destruction logic.
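The experiment described above follows a standard pattern (a sketch under assumed names, not the actual rclpy diff): pause Python's cyclic garbage collector around the critical destruction so a collection cannot run concurrently with teardown, then restore it afterwards.

```python
import gc

def destroy_with_gc_paused(destroy_fn):
    """Run a destruction callback with Python's cyclic GC paused (sketch)."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        # e.g. the self.handle.destroy() call in the experiment above
        destroy_fn()
    finally:
        # Restore the collector even if destruction raises.
        if was_enabled:
            gc.enable()

calls = []
destroy_with_gc_paused(lambda: calls.append('destroyed'))
assert calls == ['destroyed']
assert gc.isenabled()  # collector restored after the critical section
```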

@nuclearsandwich
Member Author

I ran experimental tests most of the day Saturday, and I was able to reproduce a hang using this branch of rclpy with rmw_cyclonedds_cpp as well, although it hangs in a slightly different place.

In this gdb log, thread 2 is stuck during capsule destruction in a futex down in cyclonedds.

@nuclearsandwich
Member Author

@hidmic and I dove into this today and we couldn't pinpoint a reason for the hang. It is still possible to hang even with this PR, although the incidence rate is lower.

We did a couple of experiments trying to reproduce the hang outside of launch tests using multiple threads and rclpy but were unable to do so.

I do have a bit more detail from the current reproduction, since the destruction doesn't occur as part of a garbage collection. In Python, we're inside Handle.__destroy_self:

```python
 192        def __destroy_self(self):
 193            with self.__rlock:
 194                # Calls pycapsule destructor
>195                _rclpy_capsule.rclpy_pycapsule_destroy(self.__capsule)
 196                # Call post-destroy callbacks
 197                while self.__destroy_callbacks:
 198                    self.__destroy_callbacks.pop()(self)
 199                # get rid of references to other handles to break reference cycles
 200                del self.__required_handles
```
and in the C context we can see the pycapsule destructor calling rcl_publisher_fini, which is carrying us into Fast-RTPS internals where we're stuck waiting for a lock:

```
#0  0x00007f40703f19f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x1f77a04) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x1f779b0, cond=0x1f779d8) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x1f779d8, mutex=0x1f779b0) at pthread_cond_wait.c:655
#3  0x00007f40632a5951 in eprosima::fastrtps::TimedConditionVariable::wait<std::timed_mutex>(std::unique_lock<std::timed_mutex>&, std::function<bool ()>) (this=0x1f779d8, lock=..., predicate=...)
    at /home/osrf/debug_ws/src/eProsima/Fast-RTPS/include/fastrtps/rtps/resources/../../utils/TimedConditionVariable.hpp:79
#4  0x00007f406329fd0e in eprosima::fastrtps::rtps::ResourceEvent::unregister_timer (this=0x1f779a0, event=0x20deb20) at /home/osrf/debug_ws/src/eProsima/Fast-RTPS/src/cpp/rtps/resources/ResourceEvent.cpp:101
#5  0x00007f40632a8962 in eprosima::fastrtps::rtps::TimedEvent::~TimedEvent (this=0x1797100, __in_chrg=<optimized out>) at /home/osrf/debug_ws/src/eProsima/Fast-RTPS/src/cpp/rtps/resources/TimedEvent.cpp:42
#6  0x00007f40632a89a0 in eprosima::fastrtps::rtps::TimedEvent::~TimedEvent (this=0x1797100, __in_chrg=<optimized out>) at /home/osrf/debug_ws/src/eProsima/Fast-RTPS/src/cpp/rtps/resources/TimedEvent.cpp:44
#7  0x00007f4063367911 in eprosima::fastrtps::PublisherImpl::~PublisherImpl (this=0x20cef00, __in_chrg=<optimized out>) at /home/osrf/debug_ws/src/eProsima/Fast-RTPS/src/cpp/publisher/PublisherImpl.cpp:105
#8  0x00007f40633679ec in eprosima::fastrtps::PublisherImpl::~PublisherImpl (this=0x20cef00, __in_chrg=<optimized out>) at /home/osrf/debug_ws/src/eProsima/Fast-RTPS/src/cpp/publisher/PublisherImpl.cpp:115
#9  0x00007f406335a689 in eprosima::fastrtps::ParticipantImpl::removePublisher (this=0x1f16450, pub=0x18f46c0) at /home/osrf/debug_ws/src/eProsima/Fast-RTPS/src/cpp/participant/ParticipantImpl.cpp:91
#10 0x00007f406334e85b in eprosima::fastrtps::Domain::removePublisher (pub=0x18f46c0) at /home/osrf/debug_ws/src/eProsima/Fast-RTPS/src/cpp/Domain.cpp:116
#11 0x00007f4063bf99e3 in rmw_fastrtps_shared_cpp::__rmw_destroy_publisher (identifier=0x7f406427fc17 "rmw_fastrtps_cpp", node=0x191e3b0, publisher=0x162e0d0)
    at /home/osrf/debug_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_publisher.cpp:66
#12 0x00007f4064274f8f in rmw_destroy_publisher (node=0x191e3b0, publisher=0x162e0d0) at /home/osrf/debug_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_publisher.cpp:295
#13 0x00007f406a533f34 in rmw_destroy_publisher (v2=0x191e3b0, v1=0x162e0d0) at /home/osrf/debug_ws/src/ros2/rmw_implementation/rmw_implementation/src/functions.cpp:285
#14 0x00007f406a758cc2 in rcl_publisher_fini (publisher=0x7f4064e68720, node=0x7f4064e686f0) at /home/osrf/debug_ws/src/ros2/rcl/rcl/src/rcl/publisher.c:231
#15 0x00007f406b231336 in _rclpy_destroy_publisher (pyentity=<PyCapsule at remote 0x7f40600a86f0>) at /home/osrf/debug_ws/src/ros2/rclpy/rclpy/src/rclpy/_rclpy.c:1259
#16 0x00007f4067599d78 in rclpy_pycapsule_destroy (_unused_self=<module at remote 0x7f406b4740e8>, args=(<PyCapsule at remote 0x7f40600a86f0>,))
    at /home/osrf/debug_ws/src/ros2/rclpy/rclpy/src/rclpy/_rclpy_pycapsule.c:100
```

This PR still improves matters, but it's not a full solution.

@nuclearsandwich
Member Author

In discussion yesterday we've decided to merge this as an incremental improvement. I'll leave ros2/build_farmer#248 open or create a more focused issue to continue investigation.

@nuclearsandwich nuclearsandwich merged commit 39b9a57 into master Nov 13, 2019
@delete-merged-branch delete-merged-branch bot deleted the nuclearsandwich/destroy-dependents-first branch November 13, 2019 14:06
jaisontj pushed a commit to aws-ros-dev/rclpy that referenced this pull request Nov 19, 2019
suab321321 pushed a commit to suab321321/rclpy that referenced this pull request Jan 31, 2020