Tests hanging during ROS 2 CI tests #248

nuclearsandwich · 2019-11-01T19:33:28Z

Over the past week I've seen several instances of a CI run hanging during tests of ros2cli packages.

ci_windows#8499 hung during ros2topic
ci_windows#8505 hung during ros2param
nightly_osx_extra_rmw_release#560 hung during ros2topic

I didn't think to screengrab the processmonitor output during the last Windows one I cleaned up but I did grab the procline of a process which seemed to be endlessly spawning and killing python processes: c:\python37\python.exe setup.py pytest egg_info --egg-base C:\J\workspace\ci_windows\ws\build\ros2topic. On macOS where I'm a bit more comfortable with the tooling there doesn't appear to be a cycling of spawing and dying processes but a static tree with one zombie process (37262 in the example pstree below).

I couldn't get a procline off of the zombie process. Attempting to SIGCHLD its parent yielded no result, I eventually escalated to SIGTERM before sending SIGTERM to the colcon process (84891) which was enough to de-hang the job (although it failed since colcon test was terminated) but the 35227 tree stayed behind now with all child processes zombiefied. I eventually had to issue a SIGKILL to 35227 which removed it and allowed its children to get reaped by init. I haven't identified a patient 0 and tests are passing most of the time. I also haven't seen this happen on Linux platforms thus far.

pstree 717
-+= 00717 jenkins /usr/bin/java -Djava.awt.headless=true -jar /Users/jenkins/jenkins-agent/steven-agent.jar -jnlpUrl https://ci.ros2.org/computer/mac_mojave_lore/slave-agent.jnlp -secret 6e1eb6416b70a634000f26198aacf432d597e2e95234eb2441de0b60e5827818
 \-+- 08348 jenkins /bin/sh -xe /var/folders/tm/5tw_fjn13737cx92skl8xjc00000gp/T/jenkins6149197211335807067.sh
   \-+- 08350 jenkins /usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/Resources/Python.app/Contents/MacOS/Python -u run_ros2_batch.py --do-venv --force-ansi-color --workspace-path /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release --ignore-rmw rmw_connext_cpp --repo-file-url https://raw.githubusercontent.com/ros2/ros2/mas
     \-+- 84888 jenkins /bin/sh -c . ../venv/bin/activate && /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/colcon test --base-paths "src" --build-base "build" --install-base "install" --event-handlers console_direct+ --executor sequential --retest-until-pass 10
       \-+- 84891 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/colcon test --base-paths src --build-base build --install-base install --event-handlers console_direct+ --executor sequential --retest-until-pass 10
         \-+- 35227 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python setup.py pytest egg_info --egg-base /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/build/ros2topic
           |--- 37166 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/talker_node.py --ros-args
           |--- 37167 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/listener_node.py --ros-args -r chatter:=chit_chatter
           |--- 37168 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/talker_node.py --ros-args -r chatter:=_hidden_chatter
           |--- 37169 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/Arrays --ros-args -r __node:=array_repeater -r /array_repeater/output:=/arrays
           |--- 37170 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/Defaults --ros-args -r __node:=defaults_repeater -r /defaults_repeater/output:=/defaults
           |--- 37171 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/BoundedSequences --ros-args -r __node:=bounded_sequences_repeater -r /bounded_sequences_repeater/output:=/bou
           |--- 37172 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/UnboundedSequences --ros-args -r __node:=unbounded_sequences_repeater -r /unbounded_sequences_repeater/output
           |--- 37173 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/controller_node.py --ros-args
           \--- 37262 jenkins (python3.7)

The text was updated successfully, but these errors were encountered:

nuclearsandwich · 2019-11-04T15:18:07Z

More hanging builds came in over the weekend.

nightly_win_deb#1426 hung during ros2service
nightly_windows_extra_rmw_release#528 hung during ros2action
nightly_osx_extra_rmw_rel#564 hung during ros2topic
nightly_linux_extra_rmw_release#515 hung during ros2topic
nightly_windows_extra_rmw_release#530 hung during ros2topic

On Linux I was able to strace the pytest process which isn't reaping its children (/home/jenkins-agent/workspace/nightly_linux_extra_rmw_release/venv/bin/python setup.py pytest egg_info --egg-base /home/jenkins-agent/workspace/nightly_linux_extra_rmw_release/ws/build/ros2topic) and it was blocked on a futex wait.

I didn't see anything else that yielded solid breadcrumbs. It's possible that ros2/launch#279 is not playing well with an update to pytest (5.2.2 was released 24 October). Now that we've seen the problem on Linux I can try and reproduce it locally.

nuclearsandwich · 2019-11-04T20:41:18Z

I was able to reproduce the issue in a local container and gdb one stuck process.

I did capture some gdb output in this gist: https://gist.github.com/nuclearsandwich/c845d073fa854ce020c8d42453d00faa

I don't have a good instinct on how to guarantee expression and minimize the reproduction but it does look like the hang is in rclpy when used with Fast-RTPS and the test usage is just hitting it.

nuclearsandwich · 2019-11-06T21:31:49Z

Yesterday was light on failures. Today is heavy.

nightly_linux_debug#1368 during ros2topic.
nightly_linux_release#1355 during ros2topic
nightly_linux_coverage#669 during ros2topic
nightly_windows_extra_rmw_rel#533 during ros2action
ci_linux#8612 during ros2topic
ci_windows#8533 during ros2topic

nuclearsandwich · 2019-11-06T21:59:26Z

I've had a few more reproductions locally and all of them have backtraces near identical to the one I captured. I have been running the tests in a loop for half a day and been unable to reproduce them with cyclonedds, the only rmw implementation besides fastrtps I have available in the repro container.

I don't know whether there is an issue with Fast-RTPS itself or how we're using it. It's starting to remind me a lot of ros2/launch#89

Could it be that we need to explicitly destroy publishers before clearing them and destroying the node like this?

nuclearsandwich · 2019-11-07T15:35:37Z

nightly_linux-armhf_extra_rmw_release#282 during ros2topic
nightly_osx_extra_rmw_release#567 during ros2topic
nightly_win_extra_rmw_rel#534

This seems to reduce hangs during test runs described in ros2/build_farmer#248. The handles corresponding to the destroyed objects *should* be getting destroyed explicitly when self.handle.destroy() is called below. It seems however that when running with Fast-RTPS it's possible to get into a state where multiple threads are waiting on futexes and none can move forward. The rclpy context of this apparent deadlock is while clearing a node's list of publishers or services (possibly others, although publishers and services were the only ones observed). I consider this patch to be a workaround rather than a fix. I think there may either be a race condition between the rcl/rmw layer and the rmw implementation layer which is being tripped by the haphazardness of Python's garbage collector or there is a logical problem with the handle destruction ordering in rclpy that only Fast-RTPS is sensitive to.

This seems to reduce hangs during test runs described in ros2/build_farmer#248. The handles corresponding to the destroyed objects *should* be getting destroyed explicitly when self.handle.destroy() is called below. It seems however that when running with Fast-RTPS it's possible to get into a state where multiple threads are waiting on futexes and none can move forward. The rclpy context of this apparent deadlock is while clearing a node's list of publishers or services (possibly others, although publishers and services were the only ones observed). I consider this patch to be a workaround rather than a fix. I think there may either be a race condition between the rcl/rmw layer and the rmw implementation layer which is being tripped by the haphazardness of Python's garbage collector or there is a logical problem with the handle destruction ordering in rclpy that only Fast-RTPS is sensitive to. Signed-off-by: Steven! Ragnarök <steven@nuclearsandwich.com>

* Expressly destroy a node's objects before the node. This seems to reduce hangs during test runs described in ros2/build_farmer#248. The handles corresponding to the destroyed objects *should* be getting destroyed explicitly when self.handle.destroy() is called below. It seems however that when running with Fast-RTPS it's possible to get into a state where multiple threads are waiting on futexes and none can move forward. The rclpy context of this apparent deadlock is while clearing a node's list of publishers or services (possibly others, although publishers and services were the only ones observed). I consider this patch to be a workaround rather than a fix. I think there may either be a race condition between the rcl/rmw layer and the rmw implementation layer which is being tripped by the haphazardness of Python's garbage collector or there is a logical problem with the handle destruction ordering in rclpy that only Fast-RTPS is sensitive to. * Don't pre-emptively remove items from Node lists. As pointed out by Shane, pop()ing each item from the list before passing it to the .destroy_ITEM() method prevents it from being destroyed as the individual methods first check that the item is present in the list then remove it before continuing to destroy it. Signed-off-by: Steven! Ragnarök <steven@nuclearsandwich.com>

sloretz · 2019-11-19T17:46:24Z

I had a look into the hang and managed to reproduce it by testing just test_cli.py in ros2topic on Bionic with the latest ROS 2 sources and a patch to rclpy. It looks like some threads have deadlocked in Fast-RTPS and I made some notes, but unfortunately I accidentally killed the process I was debugging so I don't have all of the tracebacks.

patch to rclpy to increase hang frequency

diff --git a/rclpy/rclpy/node.py b/rclpy/rclpy/node.py
index 94afaf4..17c2d03 100644
--- a/rclpy/rclpy/node.py
+++ b/rclpy/rclpy/node.py
@@ -1463,18 +1463,26 @@ class Node:
 
         # Destroy dependent items eagerly to work around a possible hang
         # https://github.com/ros2/build_cop/issues/248
-        while self.__publishers:
-            self.destroy_publisher(self.__publishers[0])
-        while self.__subscriptions:
-            self.destroy_subscription(self.__subscriptions[0])
-        while self.__clients:
-            self.destroy_client(self.__clients[0])
-        while self.__services:
-            self.destroy_service(self.__services[0])
-        while self.__timers:
-            self.destroy_timer(self.__timers[0])
-        while self.__guards:
-            self.destroy_guard_condition(self.__guards[0])
+        # while self.__publishers:
+        #     self.destroy_publisher(self.__publishers[0])
+        # while self.__subscriptions:
+        #     self.destroy_subscription(self.__subscriptions[0])
+        # while self.__clients:
+        #     self.destroy_client(self.__clients[0])
+        # while self.__services:
+        #     self.destroy_service(self.__services[0])
+        # while self.__timers:
+        #     self.destroy_timer(self.__timers[0])
+        # while self.__guards:
+        #     self.destroy_guard_condition(self.__guards[0])
+
+        self.__publishers.clear()
+        self.__subscriptions.clear()
+        self.__clients.clear()
+        self.__services.clear()
+        self.__timers.clear()
+        self.__guards.clear()
+
         self.handle.destroy()
         self._wake_executor()

This is the process tree at the time of deadlock. The deadlock occured in 15469, the python3 setup.py pytest process.

sloretz  15217     1  0 14:29 pts/0    00:00:05 /usr/bin/python3 /usr/local/bin/colcon test --retest-until-fail 20 --event-handlers console_direct+ --packages-select ros2topic
  sloretz  15469 15217  0 14:29 pts/0    00:00:05 /usr/bin/python3 setup.py pytest egg_info --egg-base /home/sloretz/ws/ros2/build/ros2topic
    sloretz  15707 15469  0 14:30 pts/0    00:00:03 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/talker_node.py --ros-args
    sloretz  15708 15469  0 14:30 pts/0    00:00:02 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/listener_node.py --ros-args -r chatter:=chit_chatter
    sloretz  15709 15469  0 14:30 pts/0    00:00:03 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/talker_node.py --ros-args -r chatter:=_hidden_chatter
    sloretz  15710 15469  0 14:30 pts/0    00:00:04 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/Arrays --ros-args -r __node:=array_repeater -r /array_repeater/output:=/arrays
    sloretz  15711 15469  0 14:30 pts/0    00:00:03 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/Defaults --ros-args -r __node:=defaults_repeater -r /defaults_repeater/output:=/defaults
    sloretz  15712 15469  0 14:30 pts/0    00:00:03 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/BoundedSequences --ros-args -r __node:=bounded_sequences_repeater -r /bounded_sequences_repeater/output:=/bounded_seq
    sloretz  15713 15469  0 14:30 pts/0    00:00:03 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/UnboundedSequences --ros-args -r __node:=unbounded_sequences_repeater -r /unbounded_sequences_repeater/output:=/unbou
    sloretz  15714 15469  0 14:30 pts/0    00:00:03 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/controller_node.py --ros-args

sloretz  15681     1  0 14:30 pts/0    00:00:02 /usr/bin/python3 /home/sloretz/ws/ros2/install/ros2cli/bin/_ros2_daemon --rmw-implementation rmw_opensplice_cpp --ros-domain-id 101

There were seven threads: two from python (main thread + extra thread spawned by launch_testing) and 5 created by Fast-RTPS. The main thread was blocked as a consequence of a deadlock in two threads created by Fast-RTPS.

Beginning of main thread (thread 1) traceback

sadly this is the only traceback I have

  #0  0x00007fd4937d79f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x1c7d178) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x1c7d128, cond=0x1c7d150) at pthread_cond_wait.c:502
  #2  __pthread_cond_wait (cond=0x1c7d150, mutex=0x1c7d128) at pthread_cond_wait.c:655
  #3  0x00007fd486c5d573 in eprosima::fastrtps::TimedConditionVariable::wait<std::timed_mutex>(std::unique_lock<std::timed_mutex>&, std::function<bool ()>) (this=0x1c7d150, lock=..., predicate=...)
      at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/include/fastrtps/rtps/resources/../../utils/TimedConditionVariable.hpp:79
  #4  0x00007fd486c57913 in eprosima::fastrtps::rtps::ResourceEvent::unregister_timer (this=0x1c7d118, event=0x138ff20) at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/src/cpp/rtps/resources/ResourceEvent.cpp:101
  #5  0x00007fd486c60584 in eprosima::fastrtps::rtps::TimedEvent::~TimedEvent (this=0x1ec8e80, __in_chrg=<optimized out>) at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/src/cpp/rtps/resources/TimedEvent.cpp:42
  #6  0x00007fd486c605c2 in eprosima::fastrtps::rtps::TimedEvent::~TimedEvent (this=0x1ec8e80, __in_chrg=<optimized out>) at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/src/cpp/rtps/resources/TimedEvent.cpp:44
  #7  0x00007fd486d1f951 in eprosima::fastrtps::PublisherImpl::~PublisherImpl (this=0x1cf1910, __in_chrg=<optimized out>) at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/src/cpp/publisher/PublisherImpl.cpp:105
  #8  0x00007fd486d1fa2c in eprosima::fastrtps::PublisherImpl::~PublisherImpl (this=0x1cf1910, __in_chrg=<optimized out>) at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/src/cpp/publisher/PublisherImpl.cpp:115
  #9  0x00007fd486d1296b in eprosima::fastrtps::ParticipantImpl::removePublisher (this=0x1c7ca20, pub=0x1390000) at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/src/cpp/participant/ParticipantImpl.cpp:91
  #10 0x00007fd486d06cfd in eprosima::fastrtps::Domain::removePublisher (pub=0x1390000) at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/src/cpp/Domain.cpp:116
  #11 0x00007fd4874b156f in rmw_fastrtps_shared_cpp::__rmw_destroy_publisher (identifier=0x7fd487b35357 "rmw_fastrtps_cpp", node=0x17ab380, publisher=0x1860fe0) at /home/sloretz/ws/ros2/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_publisher.cpp:66
  #12 0x00007fd487b2aad9 in rmw_destroy_publisher (node=0x17ab380, publisher=0x1860fe0) at /home/sloretz/ws/ros2/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_publisher.cpp:295
  #13 0x00007fd48d903f44 in rmw_destroy_publisher (v2=0x17ab380, v1=0x1860fe0) at /home/sloretz/ws/ros2/src/ros2/rmw_implementation/rmw_implementation/src/functions.cpp:285
  #14 0x00007fd48db28ce2 in rcl_publisher_fini (publisher=0x7fd488356300, node=0x7fd4883562f0) at /home/sloretz/ws/ros2/src/ros2/rcl/rcl/src/rcl/publisher.c:231

Here are the notes I took. Thread 1 is blocked because allow_to_delete_ never becomes true. This should happen during ResourceEvent::run_io_service(), but that service is being handled by thread 3 which has deadlocked with thread 5. Thread 3 is stuck inside io_service_.poll() because it is in some callback that is waiting for StatefulReader(0x1cae400)->mp_mutex while trying to call matched_writer_remove(). Additionally thread 3 acquired a lock on PDP(0x1e6c6d0)->mp_mutex in check_remote_participant_liveliness(). The lock thread 3 is waiting on is held by thread 5 which acquired the lock in StatefulReader(0x1cae400)::processDataMessage. Thread 5 won't release it because it's trying to acquire PDP(0x1e6c6d0)->mp_mutex in PDP::removeWriterProxyData().

In short, thread 3 has the PDP mutex but wants the StatefulReader mutex, while thread 5 has the StatefulReader mutex but wants the PDP mutex. Thread 1 is blocked because it needs thread 3 to set a flag to unblock it.

I patched Fast-RTPS to avoid this deadlock and haven't seen the hang running the same test over and over all night. I don't know how the test got into this situation (maybe there's still a bug in the order rclpy destructs stuff), and I don't know much about the Fast-RTPS codebase, but the patch seems reasonable-ish to me.

Patch to Fast-RTPS v1.9.2

diff --git a/src/cpp/rtps/builtin/discovery/participant/PDP.cpp b/src/cpp/rtps/builtin/discovery/participant/PDP.cpp
index f766c5ea0..e090a533f 100644
--- a/src/cpp/rtps/builtin/discovery/participant/PDP.cpp
+++ b/src/cpp/rtps/builtin/discovery/participant/PDP.cpp
@@ -996,7 +996,7 @@ CDRMessage_t PDP::get_participant_proxy_data_serialized(Endianness_t endian)
 void PDP::check_remote_participant_liveliness(
         ParticipantProxyData* remote_participant)
 {
-    std::lock_guard<std::recursive_mutex> guard(*this->mp_mutex);
+    std::unique_lock<std::recursive_mutex> guard(*this->mp_mutex);     if(GUID_t::unknown() != remote_participant->m_guid)
     {
@@ -1007,7 +1007,9 @@ void PDP::check_remote_participant_liveliness(
                 std::chrono::microseconds(TimeConv::Duration_t2MicroSecondsInt64(remote_participant->m_leaseDuration));
         if (now > real_lease_tm)
         {
+            guard.unlock();
             remove_remote_participant(remote_participant->m_guid, ParticipantDiscoveryInfo::DROPPED_PARTICIPANT);
+            guard.lock();
             return;
         }

* Expressly destroy a node's objects before the node. This seems to reduce hangs during test runs described in ros2/build_farmer#248. The handles corresponding to the destroyed objects *should* be getting destroyed explicitly when self.handle.destroy() is called below. It seems however that when running with Fast-RTPS it's possible to get into a state where multiple threads are waiting on futexes and none can move forward. The rclpy context of this apparent deadlock is while clearing a node's list of publishers or services (possibly others, although publishers and services were the only ones observed). I consider this patch to be a workaround rather than a fix. I think there may either be a race condition between the rcl/rmw layer and the rmw implementation layer which is being tripped by the haphazardness of Python's garbage collector or there is a logical problem with the handle destruction ordering in rclpy that only Fast-RTPS is sensitive to. * Don't pre-emptively remove items from Node lists. As pointed out by Shane, pop()ing each item from the list before passing it to the .destroy_ITEM() method prevents it from being destroyed as the individual methods first check that the item is present in the list then remove it before continuing to destroy it. Signed-off-by: Steven! Ragnarök <steven@nuclearsandwich.com>

nuclearsandwich · 2019-11-20T21:46:23Z

I cherry plucked Shane's branch onto Fast-RTPS 1.9.x and kicked off the ROS 2 test suite since that change may be part of the Eloquent release and we should keep a sharp eye on regressions where lock code is changing.

Linux
Linux-aarch64
macOS
Windows

sloretz · 2019-11-21T01:34:48Z

Linux test failures test_communication.TestPublisherSubscriberAfterShutdown.test_processes_finished_gracefully[WStrings] are all connext + cyclonedds related, but it isn't obvious from their name

nuclearsandwich · 2019-12-11T22:24:35Z

We were good for a while but these hangs seem back with a vengeance today on Linux and Linux ARM64. Curiously some of the builds exhibit the issue when Fast-RTPS is disabled (testing opensplice changes) but it did occur with fast-rtps on a debug nightly as well.

nuclearsandwich · 2019-12-11T23:01:21Z

I am out of steam for the day. There's a private, internal thread with more info that when I come back online tomorrow I'll collate here.

nuclearsandwich · 2019-12-28T00:06:02Z

The Linux nightlies have been very consistently hitting this issue. I've tried several times to reproduce it on a smaller scale but haven't been able to do so either on the buildfarm or locally.

Looking into the hung processes on Linux, each of them that I've sampled looks similar to this one:

(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f29bf803f40 (LWP 7361) "python" 0x00007f29bf0026d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x362c7e0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  2    Thread 0x7f29b1003700 (LWP 7365) "python" 0x00007f29bf0026d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x362c7e0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  3    Thread 0x7f29b0802700 (LWP 7438) "python" 0x00007f29bf0026d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x362c7e0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  4    Thread 0x7f29aaffd700 (LWP 7518) "python" 0x00007f29bf0026d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x362c7e0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  5    Thread 0x7f29aa7fc700 (LWP 7604) "python" 0x00007f29bf0026d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x362c7e0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  6    Thread 0x7f29a889f700 (LWP 7691) "python" 0x00007f29bf0026d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f2990003bb0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205

which has five threads blocked on the futex 0x362c7e0 and one blocked on the futex 0x7f2990003bb0. I don't have any info on the relationship between the futexes. There are no active child processes at the time of debugging. Everything I've read about digging into the contended futexes has suggested that systemtap is the next step but the debugging info for kernels in AWS isn't provided so the next steps are either to reproduce the issue on a debuggable kernel or build the debug info for the kernel we're running in production. I've been continuing to iterate on the former process and will probably try the latter before the end of the day.

Logs from GDB that include the python backtraces and C backtraces for each thread are here.

sloretz · 2020-01-13T22:56:00Z

This issue seems to be very hard to reproduce, so I'm looking at code changes instead. Assuming the hang is caused by a ros2 code change and not a CI infrastructure change, these are the changes that were made between the last nightly_linux_debug that didn't hang, and the first one that did.

sloretz · 2020-01-13T23:30:05Z

Trying to reproduce using ci_linux

https://ci.ros2.org/job/ci_linux/9012/
- Control job, I'd expect this to hang. It's using the exact repos file from nightly_linux_debug 1403. If it doesn't hang then the job below means nothing
- edit: hung 🎉
https://ci.ros2.org/job/ci_linux/9013/
- Job with fastrtps version pinned to what it was before the nightlies started hanging. If this does not hang and 9012 does hang, then the fastrtps PRs should be investigated as a possible cause of the hang.
- edit: Did not hang 👀

sloretz · 2020-01-14T02:49:47Z

A couple more tests

https://ci.ros2.org/job/ci_linux/9015/
- This is using all repos the same version as nightly_linux_debug 1402 except fastrtps which is using the version used by nightly_linux_debug 1403. If the fastrtps changes are responsible for the ros2topic hang, then I would expect this to hang too.
- edit: did not hang
https://ci.ros2.org/job/ci_linux/9016/
- A control job with the repos file from 1402. If this hangs, then it's probably a CI infrastructure change rather than a code change. If this does not hang, and the job above does not hang, then the cause of the hang probably requires changes in multiple repos. If this does not hang, and the job above does hang, then the fastrtps changes are likely the only cause of the hang.
- edit: did not hang

sloretz · 2020-01-14T18:01:18Z

https://ci.ros2.org/job/ci_linux/9020/
- ament_package, fastrtps, rcl_logging from 1403, all else from 1402
- edit: did not hang

sloretz · 2020-01-14T21:30:26Z

https://ci.ros2.org/job/ci_linux/9023/
- fastrtps, rclpy, rmw_connext, and ros2cli from 1403, everything else from 1402
- edit: hung

sloretz · 2020-01-15T02:12:55Z

https://ci.ros2.org/job/ci_linux/9030/
- fastrtps, ros2cli from 1403, everything else from 1402
- edit: did not hang
https://ci.ros2.org/job/ci_linux/9031/
- fastrtps, rclpy, ros2cli from 1403, everything else from 1402
- edit: did not hang
https://ci.ros2.org/job/ci_linux/9032/
- fastrtps, rmw_connext from 1403, everything else from 1402
- edit: hung! 🎉
https://ci.ros2.org/job/ci_linux/9033/
- fastrtps, rclpy from 1403, everything else from 1402
- edit: did not hang

sloretz · 2020-01-15T16:40:30Z

https://ci.ros2.org/job/ci_linux/9040/
- fastrtps rmw_connext from 1403, all else from 1402, building all, testing only ros2topic
- edit: did not hang
https://ci.ros2.org/job/ci_linux/9041/
- fastrtps rmw_connext from 1403, all else from 1402, building only up to ros2topic, testing only ros2topic
- edit: did not hang
https://ci.ros2.org/job/ci_linux/9042/
- fastrtps rmw_connext from 1403, all else from 1402, building only up to ros2topic, testing all up to ros2topic
- edit: did not hang

sloretz · 2020-01-15T17:47:06Z

https://ci.ros2.org/job/ci_linux/9045/
- fastrtps rmw_connext from 1403, all else from 1402, building all, testing all up to ros2topic
- edit: did not hang
https://ci.ros2.org/job/ci_linux/9046/
- Repeat of 9032 to check if this is consistently hanging since none of the tests in my last comment hung. fastrtps rmw_connext from 1403, everything else from 1402. Building and testing all.
- edit. hung, possible fix in Release loop lock before waiting for it to do work launch#369

nuclearsandwich · 2020-02-26T22:42:00Z

Many builds needed to be aborted today. @clalancette and @mabelzhang are going to dive into this one next week.

brawner · 2020-02-28T22:20:40Z

There are a couple other issues with ros2cli tests. I would highly doubt they impact this issue, but they might.

ros2/ros2cli#456
ros2/ros2cli#457

ivanpauno · 2020-03-02T14:19:04Z

I cleaned up a bunch of hanging nightlies:

nuclearsandwich · 2020-04-15T21:54:21Z

I've opened ros2/ros2cli#489 to skip these tests on Windows until we can identify and mitigate the problem.

nuclearsandwich · 2020-04-15T21:56:21Z

I noticed a similar hang/block coming from the test_launch_ros package today in the build here: nightly_windows_rel#1521 (internal discussion link). This is the first time I've seen the behavior without launch_testing but with launch.

nuclearsandwich · 2020-04-16T04:22:21Z

Here's another hang from test_launch_ros https://ci.ros2.org/job/ci_windows/10115

Other CLI tests are skipped on Windows since #489. To be reverted when ros2/build_farmer#248 is resolved. Signed-off-by: Jacob Perron <jacob@openrobotics.org>

rotu · 2020-04-17T06:49:13Z

I very much think https://ci.ros2.org/job/nightly_osx_debug/1501/ is multiprocessing-related and I suspect it's this little caveat:

Changed in version 3.8: On macOS, the spawn start method is now the default. The fork start method should be considered unsafe as it can lead to crashes of the subprocess. See bpo-33725.

Another part of this mystery is why does _wait_for_ready_callbacks get called? I think a crash when tokenizing triggered the atexit handler on shutdown, which was registered by a previous task on the worker thread (@dirk-thomas just removed it in ros2/launch#401 👍). Seeing as there is a fatal error "Fatal Python error: initsite: Failed to import the site module" reported many times in this run, I think that was the crash during tokenization.

sloretz · 2020-04-17T16:34:32Z

Seeing as there is a fatal error "Fatal Python error: initsite: Failed to import the site module" reported many times in this run, I think that was the crash during tokenization.

Just noting site is imported during initialization (unless the python process is invoked with -S), so a fatal error there might suggest something has gone wrong before any python code in our control has had a chance to execute.

Fixes #480 The actual tests are the same, except with the use of launch_testing we ensure the CLI daemon is restarted between tests. This follows a similar pattern as the other ros2cli tests. In addition to converting to launch tests, this change also runs the tests for all RMW implementations. For now, we are skipping tests on Windows. Other CLI tests are skipped on Windows since #489. To be reverted when ros2/build_farmer#248 is resolved. Signed-off-by: Jacob Perron <jacob@openrobotics.org>

nuclearsandwich · 2020-04-25T14:10:07Z

https://ci.ros2.org/job/nightly_osx_release/1626/ has been stuck in test_launch_ros for more than 24 hours. I'm going to abort it.

sloretz · 2020-05-01T20:36:29Z

Another test_launch_ros hang https://ci.ros2.org/job/nightly_osx_extra_rmw_release/752/console

chapulina · 2020-05-05T17:18:54Z

🧑‍🌾 Another test_launch_ros hang: https://ci.ros2.org/job/ci_osx/8638/console

17:21:10 test/test_copyright.py .                                                 [  1%]
17:21:10 test/test_flake8.py .                                                    [  3%]
17:21:10 test/test_pep257.py .                                                    [  4%]
17:21:10 test/test_launch_ros/test_normalize_parameters.py .....................  [ 39%]
17:21:10 test/test_launch_ros/test_normalize_remap_rule.py .....                  [ 47%]
// hang

ivanpauno · 2020-05-06T16:02:43Z

Another one https://ci.ros2.org/job/nightly_osx_debug/1616.

rotu · 2020-05-06T16:55:58Z

Any way to get a copy of log/latest_test/events.log from that run?

rotu · 2020-05-06T17:12:29Z

I wonder if it has something to do with the many resource leaks in osrf_pycommon osrf/osrf_pycommon#62

rotu · 2020-05-06T21:38:52Z

May also be leaking of open file descriptors in launch_yaml.Parser ros2/launch#415

rotu · 2020-05-07T00:40:42Z

This may also be a contributing cause: ros2/launch#416

clalancette · 2021-02-03T20:34:34Z

We aren't using this repository anymore for buildfarm issues, so I'm going to archive it. Thus I'm closing out this issue. If you continue to have problems, please report another bug against https://github.com/ros2/ros2. Thank you.

nuclearsandwich added the bug Something isn't working label Nov 1, 2019

nuclearsandwich self-assigned this Nov 6, 2019

nuclearsandwich mentioned this issue Nov 7, 2019

Expressly destroy a node's objects before the node. ros2/rclpy#456

Merged

nuclearsandwich mentioned this issue Nov 11, 2019

Add verbose info for topic list ros2/ros2cli#351

Merged

sloretz mentioned this issue Nov 19, 2019

Fix deadlock between PDP and StatefulReader eProsima/Fast-DDS#871

Merged

claireyywang mentioned this issue Dec 4, 2019

[ros2param] Provide tests for ros2param verbs ros2/ros2cli#389

Open

hidmic mentioned this issue Dec 17, 2019

CycloneDDS hangs on exit in Windows ros2/rmw_cyclonedds#85

Closed

clalancette assigned mabelzhang Feb 26, 2020

nuclearsandwich mentioned this issue Apr 16, 2020

Bump to the latest commit on ros2_tracing to fix deprecation. ros2/ros2#899

Merged

This was referenced Apr 16, 2020

Skip CLI tests on Windows until we resolve the blocking/hanging isuse. ros2/ros2cli#489

Merged

Fix cli tests on Windows and re-enable them ros2/ros2cli#490

Open

jacobperron added a commit to ros2/ros2cli that referenced this issue Apr 16, 2020

Skip tests on Windows

998e845

Other CLI tests are skipped on Windows since #489. To be reverted when ros2/build_farmer#248 is resolved. Signed-off-by: Jacob Perron <jacob@openrobotics.org>

ivanpauno mentioned this issue Apr 16, 2020

Switch to one participant per context model ros2/rmw_cyclonedds#145

Merged

jacobperron added a commit to ros2/ros2cli that referenced this issue Apr 16, 2020

Skip tests on Windows

912b1ba

Other CLI tests are skipped on Windows since #489. To be reverted when ros2/build_farmer#248 is resolved. Signed-off-by: Jacob Perron <jacob@openrobotics.org>

nuclearsandwich mentioned this issue Apr 17, 2020

Switch cyclone from master to its 0.5.x release branch ros2/ros2#894

Merged

hidmic mentioned this issue Jun 15, 2020

Make use of ros_testing to test policy generation. ros2/sros2#214

Merged

hidmic mentioned this issue Dec 10, 2020

Shutdown hangs on SIGINT ros2/launch#473

Closed

hidmic mentioned this issue Dec 23, 2020

Handle signals within the asyncio loop. ros2/launch#476

Merged

clalancette closed this as completed Feb 3, 2021

Tests hanging during ROS 2 CI tests #248

Tests hanging during ROS 2 CI tests #248

Comments

nuclearsandwich commented Nov 1, 2019 • edited Loading

nuclearsandwich commented Nov 4, 2019 • edited Loading

nuclearsandwich commented Nov 4, 2019

nuclearsandwich commented Nov 6, 2019

nuclearsandwich commented Nov 6, 2019

nuclearsandwich commented Nov 7, 2019

sloretz commented Nov 19, 2019 • edited Loading

nuclearsandwich commented Nov 20, 2019

sloretz commented Nov 21, 2019

nuclearsandwich commented Dec 11, 2019

nuclearsandwich commented Dec 11, 2019

nuclearsandwich commented Dec 28, 2019

sloretz commented Jan 13, 2020 • edited Loading

sloretz commented Jan 13, 2020 • edited Loading

sloretz commented Jan 14, 2020 • edited Loading

sloretz commented Jan 14, 2020 • edited Loading

sloretz commented Jan 14, 2020 • edited Loading

sloretz commented Jan 15, 2020 • edited Loading

sloretz commented Jan 15, 2020 • edited Loading

sloretz commented Jan 15, 2020 • edited Loading

nuclearsandwich commented Feb 26, 2020 • edited Loading

brawner commented Feb 28, 2020

ivanpauno commented Mar 2, 2020

nuclearsandwich commented Apr 15, 2020

nuclearsandwich commented Apr 15, 2020

nuclearsandwich commented Apr 16, 2020

rotu commented Apr 17, 2020

sloretz commented Apr 17, 2020

nuclearsandwich commented Apr 25, 2020

sloretz commented May 1, 2020

chapulina commented May 5, 2020

ivanpauno commented May 6, 2020

rotu commented May 6, 2020

rotu commented May 6, 2020

rotu commented May 6, 2020

rotu commented May 7, 2020

clalancette commented Feb 3, 2021

nuclearsandwich commented Nov 1, 2019 •

edited

Loading

nuclearsandwich commented Nov 4, 2019 •

edited

Loading

sloretz commented Nov 19, 2019 •

edited

Loading

sloretz commented Jan 13, 2020 •

edited

Loading

sloretz commented Jan 13, 2020 •

edited

Loading

sloretz commented Jan 14, 2020 •

edited

Loading

sloretz commented Jan 14, 2020 •

edited

Loading

sloretz commented Jan 14, 2020 •

edited

Loading

sloretz commented Jan 15, 2020 •

edited

Loading

sloretz commented Jan 15, 2020 •

edited

Loading

sloretz commented Jan 15, 2020 •

edited

Loading

nuclearsandwich commented Feb 26, 2020 •

edited

Loading