No callback when AUTOMATIC liveliness is lost #384

hidmic · 2020-05-15T17:44:16Z

Bug report

Required Info:

Operating System:
- Ubuntu 20.04 AMD64 & ARM64
Installation type:
- Binaries and from source
Version or commit hash:
- 1.0.0
Client library (if applicable):
- rclcpp & rclpy

Steps to reproduce issue

Try out the liveliness QoS demo using rmw_fastrtps_cpp . Default liveliness policy is already AUTOMATIC, but you may enforce it using the --policy option:

ros2 run quality_of_service_demo_py liveliness 1000 --kill-publisher-after 2000 --policy AUTOMATIC

Expected behavior

Liveliness change events are received and printed on screen.

Actual behavior

No liveliness change event is reported.

Additional information

The talker node in this demo is explicitly destroyed after some time. This issue does not show itself when using other RMW implementations like CycloneDDS, so I suspect that either the underlying DDS data writer is not getting destroyed or Fast-RTPS keeps asserting liveliness even after the DDS data writer is gone.

The text was updated successfully, but these errors were encountered:

hidmic · 2020-05-15T17:44:33Z

@MiguelCompany @richiprosima for awareness.

sloretz · 2020-06-11T23:43:12Z

@richiware friendly ping :) I've moved this to the backlog, meaning we're not likely to look at it again anytime soon. Would this be something eProsima could work on?

richiware · 2020-07-02T09:43:03Z

CC: @MiguelCompany

MiguelCompany · 2020-07-06T10:29:43Z

@hidmic @sloretz

I could reproduce the issue and a fix is on the way
With that fix, and following the steps indicated on this issue, two liveliness events are shown in the output
With CycloneDDS only the liveliness acquired assertion is triggered, no event is triggered when liveliness is lost
What bothers me is that there is a system test for this, so this problem should have been previously detected.

hidmic · 2020-07-06T15:35:30Z

With CycloneDDS only the liveliness acquired assertion is triggered, no event is triggered when liveliness is lost

Uh. That didn't use to be the case. @eboasson mind to take a look ?

What bothers me is that there is a system test for this, so this problem should have been previously detected.

Could it be related to the fact that in that test both nodes exist in the same process space and use the same participant?

eboasson · 2020-07-06T21:02:44Z

@hidmic I too find this quite surprising, I’ll have a look asap.

eboasson · 2020-07-07T07:45:25Z

@hidmic this:

so I suspect that either the underlying DDS data writer is not getting destroyed

I believe to be the case, based on Cyclone DDS trace output and a bit of tracing using the debugger. I don't know why exactly, the python code does make it to the invocation of _rclpy_capsule.rclpy_pycapsule_destroy on the talker node, but rmw_destroy_publisher is not called (macOS 10.15.5, ROS 2 master built from this morning's sources).

With rmw_destroy_publisher not getting called, not signalling a liveliness lost event is self-evidently correct. So I'm pretty sure that means there's an incorrect assumption in the script or some issue in rclpy or further down the stack in ROS 2 code.

Somewhat stripped debugger output:

# lldb -- python3 install/lib/quality_of_service_demo_py/liveliness 1000 --kill-publisher-after 2000 --policy AUTOMATIC
(lldb) target create "python3"
Current executable set to 'python3' (x86_64).
(lldb) settings set -- target.run-args  "install/lib/quality_of_service_demo_py/liveliness" "1000" "--kill-publisher-after" "2000" "--policy" "AUTOMATIC"
(lldb) command source xx.lldb
Executing commands in '/Users/erik/ros2_ws/xx.lldb'.
(lldb) env DYLD_LIBRARY_PATH=/Users/erik/ros2_ws/install/lib
(lldb) br se -n rmw_create_node -s librmw_cyclonedds_cpp.dylib
Breakpoint 1: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br com add 1
(lldb) br se -n rmw_destroy_node -s librmw_cyclonedds_cpp.dylib
Breakpoint 2: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br com add 2
(lldb) br se -n rmw_publish -s librmw_cyclonedds_cpp.dylib
Breakpoint 3: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br com add 3
(lldb) br se -n rmw_create_subscription -s librmw_cyclonedds_cpp.dylib
Breakpoint 4: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br com add 4
(lldb) br se -n rmw_destroy_publisher -s librmw_cyclonedds_cpp.dylib
Breakpoint 5: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br com add 5
(lldb) br se -n rmw_destroy_subscription -s librmw_cyclonedds_cpp.dylib
Breakpoint 6: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br com add 6
(lldb) br se -n main
Breakpoint 7: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) r
Process 86199 launched: '/usr/local/opt/python@3.8/bin/python3' (x86_64)
(lldb) proc handle SIGTERM -p true -s false
NAME         PASS   STOP   NOTIFY
===========  =====  =====  ======
SIGTERM      true   false  true 
(lldb) c
error: Process is running.  Use 'process interrupt' to pause execution.
Process 86199 stopped
* thread #2, stop reason = exec
    frame #0: 0x0000000100006000 dyld`_dyld_start
dyld`_dyld_start:
->  0x100006000 <+0>: popq   %rdi
    0x100006001 <+1>: pushq  $0x0
    0x100006003 <+3>: movq   %rsp, %rbp
    0x100006006 <+6>: andq   $-0x10, %rsp
Target 0: (Python) stopped.
(lldb) c
Process 86199 resuming
1 location added to breakpoint 1
1 location added to breakpoint 2
1 location added to breakpoint 3
1 location added to breakpoint 4
1 location added to breakpoint 5
1 location added to breakpoint 6
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x00000001040da50a librmw_cyclonedds_cpp.dylib`::rmw_create_node(context=0x000000010064e4e0, name="qos_listener", namespace_="/", domain_id=18446744073709551615, localhost_only=false) at rmw_node.cpp:1221:3
    frame #1: 0x0000000103c14259 librmw_implementation.dylib`::rmw_create_node(v5=0x000000010064e4e0, v4="qos_listener", v3="/", v2=18446744073709551615, v1=false) at functions.cpp:188:1
    frame #2: 0x00000001039f5ea7 librcl.dylib`rcl_node_init(node=0x00000001027c4fc0, name="qos_listener", namespace_="", context=0x00000001027582f0, options=0x00007ffeefbfc7a8) at node.c:268:33
    frame #3: 0x00000001039b4afc _rclpy.cpython-38-darwin.so`rclpy_create_node_impl(_unused_self=0x00000001039671d0, args=0x00000001027c70a0) at _rclpy.c:799:9

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbf6c28, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040da8f3 librmw_cyclonedds_cpp.dylib`::rmw_create_node(context=0x000000010064e4e0, name="qos_listener", namespace_="/", domain_id=18446744073709551615, localhost_only=false) at rmw_node.cpp:1266:23
    frame #2: 0x0000000103c14259 librmw_implementation.dylib`::rmw_create_node(v5=0x000000010064e4e0, v4="qos_listener", v3="/", v2=18446744073709551615, v1=false) at functions.cpp:188:1
    frame #3: 0x00000001039f5ea7 librcl.dylib`rcl_node_init(node=0x00000001027c4fc0, name="qos_listener", namespace_="", context=0x00000001027582f0, options=0x00007ffeefbfc7a8) at node.c:268:33

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbf2168, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040dc05b librmw_cyclonedds_cpp.dylib`::rmw_create_publisher(node=0x000000010045a820, type_supports=0x0000000103ab3078, topic_name="/rosout", qos_policies=0x00007ffeefbf5be0, publisher_options=0x00007ffeefbf5c60) at rmw_node.cpp:1923:23
    frame #2: 0x0000000103c144bd librmw_implementation.dylib`::rmw_create_publisher(v5=0x000000010045a820, v4=0x0000000103ab3078, v3="/rosout", v2=0x00007ffeefbf5be0, v1=0x00007ffeefbf5c60) at functions.cpp:217:1
    frame #3: 0x00000001039f836b librcl.dylib`rcl_publisher_init(publisher=0x00007ffeefbf5c80, node=0x00000001027c4fc0, type_support=0x0000000103ab3078, topic_name="/rosout", options=0x00007ffeefbf5be0) at publisher.c:166:33

(lldb)  c
Process 86199 resuming

.....

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 4.1
  * frame #0: 0x00000001040dcd46 librmw_cyclonedds_cpp.dylib`::rmw_create_subscription(node=0x000000010045a820, type_supports=0x0000000105ccb348, topic_name="/qos_liveliness_chatter", qos_policies=0x00007ffeefbfd610, subscription_options=0x00007ffeefbfd690) at rmw_node.cpp:2182:3
    frame #1: 0x0000000103c14c6d librmw_implementation.dylib`::rmw_create_subscription(v5=0x000000010045a820, v4=0x0000000105ccb348, v3="/qos_liveliness_chatter", v2=0x00007ffeefbfd610, v1=0x00007ffeefbfd690) at functions.cpp:311:1
    frame #2: 0x00000001039fed93 librcl.dylib`rcl_subscription_init(subscription=0x0000000104d90510, node=0x00000001027c4fc0, type_support=0x0000000105ccb348, topic_name="qos_liveliness_chatter", options=0x00007ffeefbfd610) at subscription.c:161:36
    frame #3: 0x00000001039abb56 _rclpy.cpython-38-darwin.so`rclpy_create_subscription(_unused_self=0x00000001039671d0, args=0x0000000105b2f310) at _rclpy.c:2415:19

(lldb)  c
Process 86199 resuming

.....
[INFO] [1594106275.696693000] [qos_listener]: Subscription created
.....

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x00000001040da50a librmw_cyclonedds_cpp.dylib`::rmw_create_node(context=0x000000010064e4e0, name="qos_talker", namespace_="/", domain_id=18446744073709551615, localhost_only=false) at rmw_node.cpp:1221:3
    frame #1: 0x0000000103c14259 librmw_implementation.dylib`::rmw_create_node(v5=0x000000010064e4e0, v4="qos_talker", v3="/", v2=18446744073709551615, v1=false) at functions.cpp:188:1
    frame #2: 0x00000001039f5ea7 librcl.dylib`rcl_node_init(node=0x0000000104d90560, name="qos_talker", namespace_="", context=0x00000001027582f0, options=0x00007ffeefbfc7a8) at node.c:268:33
    frame #3: 0x00000001039b4afc _rclpy.cpython-38-darwin.so`rclpy_create_node_impl(_unused_self=0x00000001039671d0, args=0x0000000105b7ba60) at _rclpy.c:799:9

(lldb)  c
Process 86199 resuming

.....

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106275.809866000] [qos_talker]: Talker starting up
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x0000000100479e10, ros_message=0x00000001005703f0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x0000000100479e10, v2=0x00000001005703f0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbfa638, ros_message=0x00000001005703f0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfd5a0, severity=20, name="qos_talker", timestamp=1594106275809866000, format="%s", args=0x00007ffeefbfd540) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbf9d98, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040dc05b librmw_cyclonedds_cpp.dylib`::rmw_create_publisher(node=0x00000001006a9d20, type_supports=0x0000000105ccb348, topic_name="/qos_liveliness_chatter", qos_policies=0x00007ffeefbfd828, publisher_options=0x00007ffeefbfd8a8) at rmw_node.cpp:1923:23
    frame #2: 0x0000000103c144bd librmw_implementation.dylib`::rmw_create_publisher(v5=0x00000001006a9d20, v4=0x0000000105ccb348, v3="/qos_liveliness_chatter", v2=0x00007ffeefbfd828, v1=0x00007ffeefbfd8a8) at functions.cpp:217:1
    frame #3: 0x00000001039f836b librcl.dylib`rcl_publisher_init(publisher=0x0000000104d905f0, node=0x0000000104d90560, type_support=0x0000000105ccb348, topic_name="qos_liveliness_chatter", options=0x00007ffeefbfd828) at publisher.c:166:33

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106275.830094000] [Listener]: QoSLivelinessChangedInfo(alive_count=1, not_alive_count=0, alive_count_change=1, not_alive_count_change=0)
[INFO] [1594106276.330527000] [qos_talker]: Publishing: Talker says 0
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x0000000100479e10, ros_message=0x0000000100487b90, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x0000000100479e10, v2=0x0000000100487b90, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x0000000100487b90, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_talker", timestamp=1594106276330527000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010072daa0, ros_message=0x000000010072d070, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x000000010072daa0, v2=0x000000010072d070, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x0000000104d905f0, ros_message=0x000000010072d070, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039ad285 _rclpy.cpython-38-darwin.so`rclpy_publish(_unused_self=0x00000001039671d0, args=0x000000010393d100) at _rclpy.c:1814:19

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106276.353051000] [qos_listener]: I heard: Talker says 0
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x00000001007278f0, ros_message=0x0000000100487cf0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x00000001007278f0, v2=0x0000000100487cf0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x0000000100487cf0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_listener", timestamp=1594106276353051000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106276.832460000] [qos_talker]: Publishing: Talker says 1
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x0000000100479e10, ros_message=0x000000010047e4c0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x0000000100479e10, v2=0x000000010047e4c0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x000000010047e4c0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_talker", timestamp=1594106276832460000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010072daa0, ros_message=0x00000001006b23d0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x000000010072daa0, v2=0x00000001006b23d0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x0000000104d905f0, ros_message=0x00000001006b23d0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039ad285 _rclpy.cpython-38-darwin.so`rclpy_publish(_unused_self=0x00000001039671d0, args=0x000000010393d100) at _rclpy.c:1814:19

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106276.854997000] [qos_listener]: I heard: Talker says 1
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x00000001007278f0, ros_message=0x0000000100487b90, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x00000001007278f0, v2=0x0000000100487b90, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x0000000100487b90, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_listener", timestamp=1594106276854997000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106277.332550000] [qos_talker]: Publishing: Talker says 2
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x0000000100479e10, ros_message=0x000000010047e4c0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x0000000100479e10, v2=0x000000010047e4c0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x000000010047e4c0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_talker", timestamp=1594106277332550000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010072daa0, ros_message=0x00000001006b23d0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x000000010072daa0, v2=0x00000001006b23d0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x0000000104d905f0, ros_message=0x00000001006b23d0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039ad285 _rclpy.cpython-38-darwin.so`rclpy_publish(_unused_self=0x00000001039671d0, args=0x000000010393d100) at _rclpy.c:1814:19

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106277.356839000] [qos_listener]: I heard: Talker says 2
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x00000001007278f0, ros_message=0x0000000100731f70, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x00000001007278f0, v2=0x0000000100731f70, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x0000000100731f70, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_listener", timestamp=1594106277356839000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106277.832373000] [qos_talker]: Publishing: Talker says 3
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x0000000100479e10, ros_message=0x000000010047e4c0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x0000000100479e10, v2=0x000000010047e4c0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x000000010047e4c0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_talker", timestamp=1594106277832373000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010072daa0, ros_message=0x000000010072ff70, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x000000010072daa0, v2=0x000000010072ff70, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x0000000104d905f0, ros_message=0x000000010072ff70, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039ad285 _rclpy.cpython-38-darwin.so`rclpy_publish(_unused_self=0x00000001039671d0, args=0x000000010393d100) at _rclpy.c:1814:19

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbfb248, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040e5157 librmw_cyclonedds_cpp.dylib`destroy_service(node=0x00000001006a9d20, service=0x00000001006ae5a0) at rmw_node.cpp:3702:23
    frame #2: 0x00000001040e52fd librmw_cyclonedds_cpp.dylib`::rmw_destroy_service(node=0x00000001006a9d20, service=0x00000001006ae5a0) at rmw_node.cpp:3771:10
    frame #3: 0x0000000103c15574 librmw_implementation.dylib`::rmw_destroy_service(v2=0x00000001006a9d20, v1=0x00000001006ae5a0) at functions.cpp:420:1

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbfb248, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040e5157 librmw_cyclonedds_cpp.dylib`destroy_service(node=0x00000001006a9d20, service=0x0000000100569310) at rmw_node.cpp:3702:23
    frame #2: 0x00000001040e52fd librmw_cyclonedds_cpp.dylib`::rmw_destroy_service(node=0x00000001006a9d20, service=0x0000000100569310) at rmw_node.cpp:3771:10
    frame #3: 0x0000000103c15574 librmw_implementation.dylib`::rmw_destroy_service(v2=0x00000001006a9d20, v1=0x0000000100569310) at functions.cpp:420:1

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbfb248, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040e5157 librmw_cyclonedds_cpp.dylib`destroy_service(node=0x00000001006a9d20, service=0x000000010047c530) at rmw_node.cpp:3702:23
    frame #2: 0x00000001040e52fd librmw_cyclonedds_cpp.dylib`::rmw_destroy_service(node=0x00000001006a9d20, service=0x000000010047c530) at rmw_node.cpp:3771:10
    frame #3: 0x0000000103c15574 librmw_implementation.dylib`::rmw_destroy_service(v2=0x00000001006a9d20, v1=0x000000010047c530) at functions.cpp:420:1

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbfb248, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040e5157 librmw_cyclonedds_cpp.dylib`destroy_service(node=0x00000001006a9d20, service=0x0000000102504ab0) at rmw_node.cpp:3702:23
    frame #2: 0x00000001040e52fd librmw_cyclonedds_cpp.dylib`::rmw_destroy_service(node=0x00000001006a9d20, service=0x0000000102504ab0) at rmw_node.cpp:3771:10
    frame #3: 0x0000000103c15574 librmw_implementation.dylib`::rmw_destroy_service(v2=0x00000001006a9d20, v1=0x0000000102504ab0) at functions.cpp:420:1

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbfb248, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040e5157 librmw_cyclonedds_cpp.dylib`destroy_service(node=0x00000001006a9d20, service=0x0000000100486070) at rmw_node.cpp:3702:23
    frame #2: 0x00000001040e52fd librmw_cyclonedds_cpp.dylib`::rmw_destroy_service(node=0x00000001006a9d20, service=0x0000000100486070) at rmw_node.cpp:3771:10
    frame #3: 0x0000000103c15574 librmw_implementation.dylib`::rmw_destroy_service(v2=0x00000001006a9d20, v1=0x0000000100486070) at functions.cpp:420:1

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbfb248, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040e5157 librmw_cyclonedds_cpp.dylib`destroy_service(node=0x00000001006a9d20, service=0x00000001004878e0) at rmw_node.cpp:3702:23
    frame #2: 0x00000001040e52fd librmw_cyclonedds_cpp.dylib`::rmw_destroy_service(node=0x00000001006a9d20, service=0x00000001004878e0) at rmw_node.cpp:3771:10
    frame #3: 0x0000000103c15574 librmw_implementation.dylib`::rmw_destroy_service(v2=0x00000001006a9d20, v1=0x00000001004878e0) at functions.cpp:420:1

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106277.916607000] [qos_listener]: I heard: Talker says 3
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x00000001007278f0, ros_message=0x000000010048c3b0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x00000001007278f0, v2=0x000000010048c3b0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x000000010048c3b0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_listener", timestamp=1594106277916607000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.


..... no further output for several seconds .....


Process 86199 stopped and restarted: thread 2 received signal: SIGTERM
Process 86199 exited with status = 0 (0x00000000) Terminated due to signal 15
(lldb) q

MiguelCompany · 2020-07-08T08:14:51Z

What bothers me is that there is a system test for this, so this problem should have been previously detected.

@hidmic Looking at nightly CI results for test_quality_of_service package, it looks like the tests are not being executed, just linters being checked?

hidmic · 2020-07-08T14:59:17Z

Uh, indeed. That package's CMakeLists.txt doesn't make sense.

hidmic · 2020-07-08T15:57:46Z

@eboasson see ros2/system_tests#441. rmw_cyclonedds_cpp fails a few tests, so this issues may not be exclusive to rclpy.

eboasson · 2020-07-08T16:59:35Z

@hidmic the only test that fails with rmw_cyclonedds_cpp (locally on my mac) is the deadline one and with that one, it really is the test that is broken: if there is no publisher, there is no instance and no deadline missed, and the other one simply has an incorrect expectation (the test as written really only gives you 2x2 deadline missed events, not 5). (Incidentally, with rmw_fastrtps_cpp it fails in exactly the same way.)

Maybe I'm running the wrong test? I did merge in ros2/system_tests#441 first.

MiguelCompany · 2020-07-16T08:32:23Z

@hidmic Now that eProsima/Fast-DDS#1300 is merged, this should have been fixed.

eboasson · 2020-07-16T14:17:06Z

I don't see how it's going to work without changes to rclpy: I did another bit of digging and as far as I can tell, "handle.destroy" gets invoked after the 2s have passed, but it doesn't result in calling _rclpy_destroy_publisher, which is the function that calls rcl_publisher_fini to then invoke rmw_destroy_publisher. It doesn't look to me like there are other paths to get there.

With publisher.py's and node.pys definitions of destroy modified like this:

    def destroy(self):
        print(f'publisher {self.topic}: destroy handle')
        self.handle.destroy()

it is quite clear that the liveliness demo code indeed tries to delete the node and the publisher, but it doesn't end up in the corresponding C destructor code. Therefore, there is no way the RMW layer or the underlying middleware can be expected to signal liveliness lost:

$ lldb -- python3 build/lib/quality_of_service_demo_py/liveliness.py 1000 --kill-publisher-after 2000 --policy AUTOMATIC
(lldb) target create "python3"
Current executable set to 'python3' (x86_64).
(lldb) settings set -- target.run-args  "build/lib/quality_of_service_demo_py/liveliness.py" "1000" "--kill-publisher-after" "2000" "--policy" "AUTOMATIC"
(lldb) env DYLD_LIBRARY_PATH=/Users/erik/ros2_ws/install/lib                                                                                                         (lldb) br se -n _rclpy_destroy_publisher                                                                                                                             Breakpoint 1: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br se -n _rclpy_destroy_node_impl
Breakpoint 2: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br se -n _rclpy_destroy_node
Breakpoint 3: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) r
Process 58058 launched: '/usr/local/opt/python@3.8/bin/python3' (x86_64)
Process 58058 stopped
* thread #2, stop reason = exec
    frame #0: 0x0000000100006000 dyld`_dyld_start
dyld`_dyld_start:
->  0x100006000 <+0>: popq   %rdi
    0x100006001 <+1>: pushq  $0x0
    0x100006003 <+3>: movq   %rsp, %rbp
    0x100006006 <+6>: andq   $-0x10, %rsp
Target 0: (Python) stopped.
(lldb) c
Process 58058 resuming
1 location added to breakpoint 1
1 location added to breakpoint 2
1 location added to breakpoint 3
[INFO] [1594908877.051500000] [qos_listener]: Subscription created
[INFO] [1594908877.056994000] [qos_talker]: Talker starting up
[INFO] [1594908877.058669000] [Listener]: QoSLivelinessChangedInfo(alive_count=1, not_alive_count=0, alive_count_change=1, not_alive_count_change=0)
[INFO] [1594908877.564553000] [qos_talker]: Publishing: Talker says 0
[INFO] [1594908877.566461000] [qos_listener]: I heard: Talker says 0
[INFO] [1594908878.064578000] [qos_talker]: Publishing: Talker says 1
[INFO] [1594908878.066434000] [qos_listener]: I heard: Talker says 1
[INFO] [1594908878.564552000] [qos_talker]: Publishing: Talker says 2
[INFO] [1594908878.566384000] [qos_listener]: I heard: Talker says 2
[INFO] [1594908879.064283000] [qos_talker]: Publishing: Talker says 3
publisher parameter_events: destroy handle
publisher qos_liveliness_chatter: destroy handle
node: destroy handle
[INFO] [1594908879.068474000] [qos_listener]: I heard: Talker says 3
Process 58058 stopped
* thread #2, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007fff6a8f8882 libsystem_kernel.dylib`__psynch_cvwait + 10
libsystem_kernel.dylib`__psynch_cvwait:
->  0x7fff6a8f8882 <+10>: jae    0x7fff6a8f888c            ; <+20>
    0x7fff6a8f8884 <+12>: movq   %rax, %rdi
    0x7fff6a8f8887 <+15>: jmp    0x7fff6a8f6629            ; cerror_nocancel
    0x7fff6a8f888c <+20>: retq   
Target 0: (Python) stopped.
(lldb) q

hidmic · 2020-07-16T17:17:35Z

the only test that fails with rmw_cyclonedds_cpp (locally on my mac) is the deadline one and with that one, it really is the test that is broken: if there is no publisher, there is no instance and no deadline missed

@eboasson there is a publisher being instantiated. Nonetheless, I'd intuitively expect a subscriber with a given deadline QoS to complain if that deadline is missed regardless of ongoing publications or even publisher existence.

and the other one simply has an incorrect expectation (the test as written really only gives you 2x2 deadline missed events, not 5). (Incidentally, with rmw_fastrtps_cpp it fails in exactly the same way.)

True that. Fixing.

hidmic · 2020-07-16T17:20:28Z

Interestingly enough, my expectation does not match the DDS definition of deadline. And transitively, neither ROS definition.

hidmic · 2020-07-16T20:19:27Z

I don't see how it's going to work without changes to rclpy: I did another bit of digging and as far as I can tell, "handle.destroy" gets invoked after the 2s have passed, but it doesn't result in calling _rclpy_destroy_publisher.

@eboasson you're right, but the issue doesn't show itself when using FastDDS. Investigating.

eboasson · 2020-07-16T20:31:48Z

the only test that fails with rmw_cyclonedds_cpp (locally on my mac) is the deadline one and with that one, it really is the test that is broken: if there is no publisher, there is no instance and no deadline missed

@eboasson there is a publisher being instantiated. Nonetheless, I'd intuitively expect a subscriber with a given deadline QoS to complain if that deadline is missed regardless of ongoing publications or even publisher existence.

If I'm not mistaken, there is no publisher in the "test_deadline_no_publisher" test: while it does instantiate a QosTestPublisher, the constructor does next to nothing, leaving the creation of the publisher until later (setup_start).

Not generating deadline missed notifications if there is no instance is the way it is defined in the DDS spec, and I do think it is the right choice: in the DDS data model, there are keys identifying instances and deadline misses are tied to instances (where it actually informs you for which instance a deadline was missed). Naturally you don't want these notifications for all instances that might come into existence in the future.

The case where there are no keys (which ROS 2 uses) is treated by the spec as the limit case where there is at most 1 instance, and so also requires the presence of data. I happen to agree with that choice, but one could perfectly well argue that in that case there should be no concept of an instance. Doing that breaks various other things through nasty interactions, for example, the case of a transient/persistent topic: without an instance, you can't dispose data, without disposing data, you can't get rid of transient/persistent data. Of course one can work around such details by adding special cases, but there are too many special cases in the specs already ...

The other issue with generating deadline missed notifications without data being present, is that it is not unusual for a system to take a fair amount of time to initialise and then run at a sustained high rate. If you'd use a deadline that makes sense for the update rate, you'd be getting spurious deadline missed notifications during initialisation. Of course you could ignore those, or (in the DDS interface anyway) start with the notifications disabled, enabling them when the system becomes operational, but that has its own donwsides.

The usual OMG solution to such a problem is to make the behaviour configurable, but if you ask me, there are too many QoS settings as it is ...

hidmic · 2020-07-16T20:58:54Z

Found the issue with rclpy, but incidentally, a fix was merged 15 minutes ago (see ros2/rclpy#603). I can confirm this is no longer an issue for none of the current RMW implementations. Thanks @eboasson @MiguelCompany !

hidmic added the bug Something isn't working label May 15, 2020

sloretz added this to To do in Galactic via automation May 29, 2020

hidmic mentioned this issue Jun 2, 2020

Quality of Service "Liveliness" demo blocks ros2/demos#449

Closed

sloretz added the backlog label Jun 11, 2020

sloretz removed this from To do in Galactic Jun 11, 2020

hidmic mentioned this issue Jul 8, 2020

Run QoS tests. ros2/system_tests#441

Merged

hidmic closed this as completed Jul 16, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

No callback when AUTOMATIC liveliness is lost #384

No callback when AUTOMATIC liveliness is lost #384

hidmic commented May 15, 2020

hidmic commented May 15, 2020

sloretz commented Jun 11, 2020

richiware commented Jul 2, 2020

MiguelCompany commented Jul 6, 2020

hidmic commented Jul 6, 2020

eboasson commented Jul 6, 2020

eboasson commented Jul 7, 2020

MiguelCompany commented Jul 8, 2020

hidmic commented Jul 8, 2020

hidmic commented Jul 8, 2020

eboasson commented Jul 8, 2020

MiguelCompany commented Jul 16, 2020

eboasson commented Jul 16, 2020

hidmic commented Jul 16, 2020

hidmic commented Jul 16, 2020 •

edited

hidmic commented Jul 16, 2020 •

edited

eboasson commented Jul 16, 2020

hidmic commented Jul 16, 2020

No callback when AUTOMATIC liveliness is lost #384

No callback when AUTOMATIC liveliness is lost #384

Comments

hidmic commented May 15, 2020

Bug report

Steps to reproduce issue

Expected behavior

Actual behavior

Additional information

hidmic commented May 15, 2020

sloretz commented Jun 11, 2020

richiware commented Jul 2, 2020

MiguelCompany commented Jul 6, 2020

hidmic commented Jul 6, 2020

eboasson commented Jul 6, 2020

eboasson commented Jul 7, 2020

MiguelCompany commented Jul 8, 2020

hidmic commented Jul 8, 2020

hidmic commented Jul 8, 2020

eboasson commented Jul 8, 2020

MiguelCompany commented Jul 16, 2020

eboasson commented Jul 16, 2020

hidmic commented Jul 16, 2020

hidmic commented Jul 16, 2020 • edited

hidmic commented Jul 16, 2020 • edited

eboasson commented Jul 16, 2020

hidmic commented Jul 16, 2020

hidmic commented Jul 16, 2020 •

edited

hidmic commented Jul 16, 2020 •

edited