Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No callback when AUTOMATIC liveliness is lost #384

Closed
hidmic opened this issue May 15, 2020 · 18 comments
Closed

No callback when AUTOMATIC liveliness is lost #384

hidmic opened this issue May 15, 2020 · 18 comments
Labels
backlog bug Something isn't working

Comments

@hidmic
Copy link
Contributor

hidmic commented May 15, 2020

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04 AMD64 & ARM64
  • Installation type:
    • Binaries and from source
  • Version or commit hash:
    • 1.0.0
  • Client library (if applicable):
    • rclcpp & rclpy

Steps to reproduce issue

Try out the liveliness QoS demo using rmw_fastrtps_cpp . Default liveliness policy is already AUTOMATIC, but you may enforce it using the --policy option:

ros2 run quality_of_service_demo_py liveliness 1000 --kill-publisher-after 2000 --policy AUTOMATIC

Expected behavior

Liveliness change events are received and printed on screen.

Actual behavior

No liveliness change event is reported.

Additional information

The talker node in this demo is explicitly destroyed after some time. This issue does not show itself when using other RMW implementations like CycloneDDS, so I suspect that either the underlying DDS data writer is not getting destroyed or Fast-RTPS keeps asserting liveliness even after the DDS data writer is gone.

@hidmic hidmic added the bug Something isn't working label May 15, 2020
@hidmic
Copy link
Contributor Author

hidmic commented May 15, 2020

@MiguelCompany @richiprosima for awareness.

@sloretz sloretz added this to To do in Galactic via automation May 29, 2020
@sloretz sloretz removed this from To do in Galactic Jun 11, 2020
@sloretz
Copy link
Contributor

sloretz commented Jun 11, 2020

@richiware friendly ping :) I've moved this to the backlog, meaning we're not likely to look at it again anytime soon. Would this be something eProsima could work on?

@richiware
Copy link
Contributor

CC: @MiguelCompany

@MiguelCompany
Copy link
Contributor

@hidmic @sloretz

  1. I could reproduce the issue and a fix is on the way
  2. With that fix, and following the steps indicated on this issue, two liveliness events are shown in the output
  3. With CycloneDDS only the liveliness acquired assertion is triggered, no event is triggered when liveliness is lost
  4. What bothers me is that there is a system test for this, so this problem should have been previously detected.

@hidmic
Copy link
Contributor Author

hidmic commented Jul 6, 2020

With CycloneDDS only the liveliness acquired assertion is triggered, no event is triggered when liveliness is lost

Uh. That didn't use to be the case. @eboasson mind to take a look ?

What bothers me is that there is a system test for this, so this problem should have been previously detected.

Could it be related to the fact that in that test both nodes exist in the same process space and use the same participant?

@eboasson
Copy link

eboasson commented Jul 6, 2020

@hidmic I too find this quite surprising, I’ll have a look asap.

@eboasson
Copy link

eboasson commented Jul 7, 2020

@hidmic this:

so I suspect that either the underlying DDS data writer is not getting destroyed

I believe to be the case, based on Cyclone DDS trace output and a bit of tracing using the debugger. I don't know why exactly, the python code does make it to the invocation of _rclpy_capsule.rclpy_pycapsule_destroy on the talker node, but rmw_destroy_publisher is not called (macOS 10.15.5, ROS 2 master built from this morning's sources).

With rmw_destroy_publisher not getting called, not signalling a liveliness lost event is self-evidently correct. So I'm pretty sure that means there's an incorrect assumption in the script or some issue in rclpy or further down the stack in ROS 2 code.

Somewhat stripped debugger output:

# lldb -- python3 install/lib/quality_of_service_demo_py/liveliness 1000 --kill-publisher-after 2000 --policy AUTOMATIC
(lldb) target create "python3"
Current executable set to 'python3' (x86_64).
(lldb) settings set -- target.run-args  "install/lib/quality_of_service_demo_py/liveliness" "1000" "--kill-publisher-after" "2000" "--policy" "AUTOMATIC"
(lldb) command source xx.lldb
Executing commands in '/Users/erik/ros2_ws/xx.lldb'.
(lldb) env DYLD_LIBRARY_PATH=/Users/erik/ros2_ws/install/lib
(lldb) br se -n rmw_create_node -s librmw_cyclonedds_cpp.dylib
Breakpoint 1: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br com add 1
(lldb) br se -n rmw_destroy_node -s librmw_cyclonedds_cpp.dylib
Breakpoint 2: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br com add 2
(lldb) br se -n rmw_publish -s librmw_cyclonedds_cpp.dylib
Breakpoint 3: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br com add 3
(lldb) br se -n rmw_create_subscription -s librmw_cyclonedds_cpp.dylib
Breakpoint 4: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br com add 4
(lldb) br se -n rmw_destroy_publisher -s librmw_cyclonedds_cpp.dylib
Breakpoint 5: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br com add 5
(lldb) br se -n rmw_destroy_subscription -s librmw_cyclonedds_cpp.dylib
Breakpoint 6: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br com add 6
(lldb) br se -n main
Breakpoint 7: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) r
Process 86199 launched: '/usr/local/opt/python@3.8/bin/python3' (x86_64)
(lldb) proc handle SIGTERM -p true -s false
NAME         PASS   STOP   NOTIFY
===========  =====  =====  ======
SIGTERM      true   false  true 
(lldb) c
error: Process is running.  Use 'process interrupt' to pause execution.
Process 86199 stopped
* thread #2, stop reason = exec
    frame #0: 0x0000000100006000 dyld`_dyld_start
dyld`_dyld_start:
->  0x100006000 <+0>: popq   %rdi
    0x100006001 <+1>: pushq  $0x0
    0x100006003 <+3>: movq   %rsp, %rbp
    0x100006006 <+6>: andq   $-0x10, %rsp
Target 0: (Python) stopped.
(lldb) c
Process 86199 resuming
1 location added to breakpoint 1
1 location added to breakpoint 2
1 location added to breakpoint 3
1 location added to breakpoint 4
1 location added to breakpoint 5
1 location added to breakpoint 6
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x00000001040da50a librmw_cyclonedds_cpp.dylib`::rmw_create_node(context=0x000000010064e4e0, name="qos_listener", namespace_="/", domain_id=18446744073709551615, localhost_only=false) at rmw_node.cpp:1221:3
    frame #1: 0x0000000103c14259 librmw_implementation.dylib`::rmw_create_node(v5=0x000000010064e4e0, v4="qos_listener", v3="/", v2=18446744073709551615, v1=false) at functions.cpp:188:1
    frame #2: 0x00000001039f5ea7 librcl.dylib`rcl_node_init(node=0x00000001027c4fc0, name="qos_listener", namespace_="", context=0x00000001027582f0, options=0x00007ffeefbfc7a8) at node.c:268:33
    frame #3: 0x00000001039b4afc _rclpy.cpython-38-darwin.so`rclpy_create_node_impl(_unused_self=0x00000001039671d0, args=0x00000001027c70a0) at _rclpy.c:799:9

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbf6c28, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040da8f3 librmw_cyclonedds_cpp.dylib`::rmw_create_node(context=0x000000010064e4e0, name="qos_listener", namespace_="/", domain_id=18446744073709551615, localhost_only=false) at rmw_node.cpp:1266:23
    frame #2: 0x0000000103c14259 librmw_implementation.dylib`::rmw_create_node(v5=0x000000010064e4e0, v4="qos_listener", v3="/", v2=18446744073709551615, v1=false) at functions.cpp:188:1
    frame #3: 0x00000001039f5ea7 librcl.dylib`rcl_node_init(node=0x00000001027c4fc0, name="qos_listener", namespace_="", context=0x00000001027582f0, options=0x00007ffeefbfc7a8) at node.c:268:33

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbf2168, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040dc05b librmw_cyclonedds_cpp.dylib`::rmw_create_publisher(node=0x000000010045a820, type_supports=0x0000000103ab3078, topic_name="/rosout", qos_policies=0x00007ffeefbf5be0, publisher_options=0x00007ffeefbf5c60) at rmw_node.cpp:1923:23
    frame #2: 0x0000000103c144bd librmw_implementation.dylib`::rmw_create_publisher(v5=0x000000010045a820, v4=0x0000000103ab3078, v3="/rosout", v2=0x00007ffeefbf5be0, v1=0x00007ffeefbf5c60) at functions.cpp:217:1
    frame #3: 0x00000001039f836b librcl.dylib`rcl_publisher_init(publisher=0x00007ffeefbf5c80, node=0x00000001027c4fc0, type_support=0x0000000103ab3078, topic_name="/rosout", options=0x00007ffeefbf5be0) at publisher.c:166:33

(lldb)  c
Process 86199 resuming

.....

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 4.1
  * frame #0: 0x00000001040dcd46 librmw_cyclonedds_cpp.dylib`::rmw_create_subscription(node=0x000000010045a820, type_supports=0x0000000105ccb348, topic_name="/qos_liveliness_chatter", qos_policies=0x00007ffeefbfd610, subscription_options=0x00007ffeefbfd690) at rmw_node.cpp:2182:3
    frame #1: 0x0000000103c14c6d librmw_implementation.dylib`::rmw_create_subscription(v5=0x000000010045a820, v4=0x0000000105ccb348, v3="/qos_liveliness_chatter", v2=0x00007ffeefbfd610, v1=0x00007ffeefbfd690) at functions.cpp:311:1
    frame #2: 0x00000001039fed93 librcl.dylib`rcl_subscription_init(subscription=0x0000000104d90510, node=0x00000001027c4fc0, type_support=0x0000000105ccb348, topic_name="qos_liveliness_chatter", options=0x00007ffeefbfd610) at subscription.c:161:36
    frame #3: 0x00000001039abb56 _rclpy.cpython-38-darwin.so`rclpy_create_subscription(_unused_self=0x00000001039671d0, args=0x0000000105b2f310) at _rclpy.c:2415:19

(lldb)  c
Process 86199 resuming

.....
[INFO] [1594106275.696693000] [qos_listener]: Subscription created
.....

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x00000001040da50a librmw_cyclonedds_cpp.dylib`::rmw_create_node(context=0x000000010064e4e0, name="qos_talker", namespace_="/", domain_id=18446744073709551615, localhost_only=false) at rmw_node.cpp:1221:3
    frame #1: 0x0000000103c14259 librmw_implementation.dylib`::rmw_create_node(v5=0x000000010064e4e0, v4="qos_talker", v3="/", v2=18446744073709551615, v1=false) at functions.cpp:188:1
    frame #2: 0x00000001039f5ea7 librcl.dylib`rcl_node_init(node=0x0000000104d90560, name="qos_talker", namespace_="", context=0x00000001027582f0, options=0x00007ffeefbfc7a8) at node.c:268:33
    frame #3: 0x00000001039b4afc _rclpy.cpython-38-darwin.so`rclpy_create_node_impl(_unused_self=0x00000001039671d0, args=0x0000000105b7ba60) at _rclpy.c:799:9

(lldb)  c
Process 86199 resuming

.....

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106275.809866000] [qos_talker]: Talker starting up
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x0000000100479e10, ros_message=0x00000001005703f0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x0000000100479e10, v2=0x00000001005703f0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbfa638, ros_message=0x00000001005703f0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfd5a0, severity=20, name="qos_talker", timestamp=1594106275809866000, format="%s", args=0x00007ffeefbfd540) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbf9d98, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040dc05b librmw_cyclonedds_cpp.dylib`::rmw_create_publisher(node=0x00000001006a9d20, type_supports=0x0000000105ccb348, topic_name="/qos_liveliness_chatter", qos_policies=0x00007ffeefbfd828, publisher_options=0x00007ffeefbfd8a8) at rmw_node.cpp:1923:23
    frame #2: 0x0000000103c144bd librmw_implementation.dylib`::rmw_create_publisher(v5=0x00000001006a9d20, v4=0x0000000105ccb348, v3="/qos_liveliness_chatter", v2=0x00007ffeefbfd828, v1=0x00007ffeefbfd8a8) at functions.cpp:217:1
    frame #3: 0x00000001039f836b librcl.dylib`rcl_publisher_init(publisher=0x0000000104d905f0, node=0x0000000104d90560, type_support=0x0000000105ccb348, topic_name="qos_liveliness_chatter", options=0x00007ffeefbfd828) at publisher.c:166:33

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106275.830094000] [Listener]: QoSLivelinessChangedInfo(alive_count=1, not_alive_count=0, alive_count_change=1, not_alive_count_change=0)
[INFO] [1594106276.330527000] [qos_talker]: Publishing: Talker says 0
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x0000000100479e10, ros_message=0x0000000100487b90, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x0000000100479e10, v2=0x0000000100487b90, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x0000000100487b90, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_talker", timestamp=1594106276330527000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010072daa0, ros_message=0x000000010072d070, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x000000010072daa0, v2=0x000000010072d070, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x0000000104d905f0, ros_message=0x000000010072d070, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039ad285 _rclpy.cpython-38-darwin.so`rclpy_publish(_unused_self=0x00000001039671d0, args=0x000000010393d100) at _rclpy.c:1814:19

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106276.353051000] [qos_listener]: I heard: Talker says 0
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x00000001007278f0, ros_message=0x0000000100487cf0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x00000001007278f0, v2=0x0000000100487cf0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x0000000100487cf0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_listener", timestamp=1594106276353051000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106276.832460000] [qos_talker]: Publishing: Talker says 1
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x0000000100479e10, ros_message=0x000000010047e4c0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x0000000100479e10, v2=0x000000010047e4c0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x000000010047e4c0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_talker", timestamp=1594106276832460000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010072daa0, ros_message=0x00000001006b23d0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x000000010072daa0, v2=0x00000001006b23d0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x0000000104d905f0, ros_message=0x00000001006b23d0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039ad285 _rclpy.cpython-38-darwin.so`rclpy_publish(_unused_self=0x00000001039671d0, args=0x000000010393d100) at _rclpy.c:1814:19

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106276.854997000] [qos_listener]: I heard: Talker says 1
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x00000001007278f0, ros_message=0x0000000100487b90, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x00000001007278f0, v2=0x0000000100487b90, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x0000000100487b90, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_listener", timestamp=1594106276854997000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106277.332550000] [qos_talker]: Publishing: Talker says 2
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x0000000100479e10, ros_message=0x000000010047e4c0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x0000000100479e10, v2=0x000000010047e4c0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x000000010047e4c0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_talker", timestamp=1594106277332550000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010072daa0, ros_message=0x00000001006b23d0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x000000010072daa0, v2=0x00000001006b23d0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x0000000104d905f0, ros_message=0x00000001006b23d0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039ad285 _rclpy.cpython-38-darwin.so`rclpy_publish(_unused_self=0x00000001039671d0, args=0x000000010393d100) at _rclpy.c:1814:19

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106277.356839000] [qos_listener]: I heard: Talker says 2
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x00000001007278f0, ros_message=0x0000000100731f70, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x00000001007278f0, v2=0x0000000100731f70, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x0000000100731f70, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_listener", timestamp=1594106277356839000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106277.832373000] [qos_talker]: Publishing: Talker says 3
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x0000000100479e10, ros_message=0x000000010047e4c0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x0000000100479e10, v2=0x000000010047e4c0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x000000010047e4c0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_talker", timestamp=1594106277832373000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010072daa0, ros_message=0x000000010072ff70, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x000000010072daa0, v2=0x000000010072ff70, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x0000000104d905f0, ros_message=0x000000010072ff70, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039ad285 _rclpy.cpython-38-darwin.so`rclpy_publish(_unused_self=0x00000001039671d0, args=0x000000010393d100) at _rclpy.c:1814:19

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbfb248, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040e5157 librmw_cyclonedds_cpp.dylib`destroy_service(node=0x00000001006a9d20, service=0x00000001006ae5a0) at rmw_node.cpp:3702:23
    frame #2: 0x00000001040e52fd librmw_cyclonedds_cpp.dylib`::rmw_destroy_service(node=0x00000001006a9d20, service=0x00000001006ae5a0) at rmw_node.cpp:3771:10
    frame #3: 0x0000000103c15574 librmw_implementation.dylib`::rmw_destroy_service(v2=0x00000001006a9d20, v1=0x00000001006ae5a0) at functions.cpp:420:1

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbfb248, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040e5157 librmw_cyclonedds_cpp.dylib`destroy_service(node=0x00000001006a9d20, service=0x0000000100569310) at rmw_node.cpp:3702:23
    frame #2: 0x00000001040e52fd librmw_cyclonedds_cpp.dylib`::rmw_destroy_service(node=0x00000001006a9d20, service=0x0000000100569310) at rmw_node.cpp:3771:10
    frame #3: 0x0000000103c15574 librmw_implementation.dylib`::rmw_destroy_service(v2=0x00000001006a9d20, v1=0x0000000100569310) at functions.cpp:420:1

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbfb248, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040e5157 librmw_cyclonedds_cpp.dylib`destroy_service(node=0x00000001006a9d20, service=0x000000010047c530) at rmw_node.cpp:3702:23
    frame #2: 0x00000001040e52fd librmw_cyclonedds_cpp.dylib`::rmw_destroy_service(node=0x00000001006a9d20, service=0x000000010047c530) at rmw_node.cpp:3771:10
    frame #3: 0x0000000103c15574 librmw_implementation.dylib`::rmw_destroy_service(v2=0x00000001006a9d20, v1=0x000000010047c530) at functions.cpp:420:1

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbfb248, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040e5157 librmw_cyclonedds_cpp.dylib`destroy_service(node=0x00000001006a9d20, service=0x0000000102504ab0) at rmw_node.cpp:3702:23
    frame #2: 0x00000001040e52fd librmw_cyclonedds_cpp.dylib`::rmw_destroy_service(node=0x00000001006a9d20, service=0x0000000102504ab0) at rmw_node.cpp:3771:10
    frame #3: 0x0000000103c15574 librmw_implementation.dylib`::rmw_destroy_service(v2=0x00000001006a9d20, v1=0x0000000102504ab0) at functions.cpp:420:1

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbfb248, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040e5157 librmw_cyclonedds_cpp.dylib`destroy_service(node=0x00000001006a9d20, service=0x0000000100486070) at rmw_node.cpp:3702:23
    frame #2: 0x00000001040e52fd librmw_cyclonedds_cpp.dylib`::rmw_destroy_service(node=0x00000001006a9d20, service=0x0000000100486070) at rmw_node.cpp:3771:10
    frame #3: 0x0000000103c15574 librmw_implementation.dylib`::rmw_destroy_service(v2=0x00000001006a9d20, v1=0x0000000100486070) at functions.cpp:420:1

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x000000010045a510, ros_message=0x00007ffeefbfb248, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x00000001040e5157 librmw_cyclonedds_cpp.dylib`destroy_service(node=0x00000001006a9d20, service=0x00000001004878e0) at rmw_node.cpp:3702:23
    frame #2: 0x00000001040e52fd librmw_cyclonedds_cpp.dylib`::rmw_destroy_service(node=0x00000001006a9d20, service=0x00000001004878e0) at rmw_node.cpp:3771:10
    frame #3: 0x0000000103c15574 librmw_implementation.dylib`::rmw_destroy_service(v2=0x00000001006a9d20, v1=0x00000001004878e0) at functions.cpp:420:1

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.
[INFO] [1594106277.916607000] [qos_listener]: I heard: Talker says 3
(lldb)  bt 4
* thread #2, queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x00000001040dab84 librmw_cyclonedds_cpp.dylib`::rmw_publish(publisher=0x00000001007278f0, ros_message=0x000000010048c3b0, allocation=0x0000000000000000) at rmw_node.cpp:1474:3
    frame #1: 0x0000000103c1469c librmw_implementation.dylib`::rmw_publish(v3=0x00000001007278f0, v2=0x000000010048c3b0, v1=0x0000000000000000) at functions.cpp:242:1
    frame #2: 0x00000001039f8e8c librcl.dylib`rcl_publish(publisher=0x00007ffeefbf99e8, ros_message=0x000000010048c3b0, allocation=0x0000000000000000) at publisher.c:290:7
    frame #3: 0x00000001039f4166 librcl.dylib`rcl_logging_rosout_output_handler(location=0x00007ffeefbfc950, severity=20, name="qos_listener", timestamp=1594106277916607000, format="%s", args=0x00007ffeefbfc8f0) at logging_rosout.c:272:18

(lldb)  c
Process 86199 resuming

Command #2 'c' continued the target.


..... no further output for several seconds .....


Process 86199 stopped and restarted: thread 2 received signal: SIGTERM
Process 86199 exited with status = 0 (0x00000000) Terminated due to signal 15
(lldb) q

@MiguelCompany
Copy link
Contributor

What bothers me is that there is a system test for this, so this problem should have been previously detected.

@hidmic Looking at nightly CI results for test_quality_of_service package, it looks like the tests are not being executed, just linters being checked?

@hidmic
Copy link
Contributor Author

hidmic commented Jul 8, 2020

Uh, indeed. That package's CMakeLists.txt doesn't make sense.

@hidmic
Copy link
Contributor Author

hidmic commented Jul 8, 2020

@eboasson see ros2/system_tests#441. rmw_cyclonedds_cpp fails a few tests, so this issues may not be exclusive to rclpy.

@eboasson
Copy link

eboasson commented Jul 8, 2020

@hidmic the only test that fails with rmw_cyclonedds_cpp (locally on my mac) is the deadline one and with that one, it really is the test that is broken: if there is no publisher, there is no instance and no deadline missed, and the other one simply has an incorrect expectation (the test as written really only gives you 2x2 deadline missed events, not 5). (Incidentally, with rmw_fastrtps_cpp it fails in exactly the same way.)

Maybe I'm running the wrong test? I did merge in ros2/system_tests#441 first.

@MiguelCompany
Copy link
Contributor

@hidmic Now that eProsima/Fast-DDS#1300 is merged, this should have been fixed.

@eboasson
Copy link

I don't see how it's going to work without changes to rclpy: I did another bit of digging and as far as I can tell, "handle.destroy" gets invoked after the 2s have passed, but it doesn't result in calling _rclpy_destroy_publisher, which is the function that calls rcl_publisher_fini to then invoke rmw_destroy_publisher. It doesn't look to me like there are other paths to get there.

With publisher.py's and node.pys definitions of destroy modified like this:

    def destroy(self):
        print(f'publisher {self.topic}: destroy handle')
        self.handle.destroy()

it is quite clear that the liveliness demo code indeed tries to delete the node and the publisher, but it doesn't end up in the corresponding C destructor code. Therefore, there is no way the RMW layer or the underlying middleware can be expected to signal liveliness lost:

$ lldb -- python3 build/lib/quality_of_service_demo_py/liveliness.py 1000 --kill-publisher-after 2000 --policy AUTOMATIC
(lldb) target create "python3"
Current executable set to 'python3' (x86_64).
(lldb) settings set -- target.run-args  "build/lib/quality_of_service_demo_py/liveliness.py" "1000" "--kill-publisher-after" "2000" "--policy" "AUTOMATIC"
(lldb) env DYLD_LIBRARY_PATH=/Users/erik/ros2_ws/install/lib                                                                                                         (lldb) br se -n _rclpy_destroy_publisher                                                                                                                             Breakpoint 1: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br se -n _rclpy_destroy_node_impl
Breakpoint 2: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) br se -n _rclpy_destroy_node
Breakpoint 3: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) r
Process 58058 launched: '/usr/local/opt/python@3.8/bin/python3' (x86_64)
Process 58058 stopped
* thread #2, stop reason = exec
    frame #0: 0x0000000100006000 dyld`_dyld_start
dyld`_dyld_start:
->  0x100006000 <+0>: popq   %rdi
    0x100006001 <+1>: pushq  $0x0
    0x100006003 <+3>: movq   %rsp, %rbp
    0x100006006 <+6>: andq   $-0x10, %rsp
Target 0: (Python) stopped.
(lldb) c
Process 58058 resuming
1 location added to breakpoint 1
1 location added to breakpoint 2
1 location added to breakpoint 3
[INFO] [1594908877.051500000] [qos_listener]: Subscription created
[INFO] [1594908877.056994000] [qos_talker]: Talker starting up
[INFO] [1594908877.058669000] [Listener]: QoSLivelinessChangedInfo(alive_count=1, not_alive_count=0, alive_count_change=1, not_alive_count_change=0)
[INFO] [1594908877.564553000] [qos_talker]: Publishing: Talker says 0
[INFO] [1594908877.566461000] [qos_listener]: I heard: Talker says 0
[INFO] [1594908878.064578000] [qos_talker]: Publishing: Talker says 1
[INFO] [1594908878.066434000] [qos_listener]: I heard: Talker says 1
[INFO] [1594908878.564552000] [qos_talker]: Publishing: Talker says 2
[INFO] [1594908878.566384000] [qos_listener]: I heard: Talker says 2
[INFO] [1594908879.064283000] [qos_talker]: Publishing: Talker says 3
publisher parameter_events: destroy handle
publisher qos_liveliness_chatter: destroy handle
node: destroy handle
[INFO] [1594908879.068474000] [qos_listener]: I heard: Talker says 3
Process 58058 stopped
* thread #2, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007fff6a8f8882 libsystem_kernel.dylib`__psynch_cvwait + 10
libsystem_kernel.dylib`__psynch_cvwait:
->  0x7fff6a8f8882 <+10>: jae    0x7fff6a8f888c            ; <+20>
    0x7fff6a8f8884 <+12>: movq   %rax, %rdi
    0x7fff6a8f8887 <+15>: jmp    0x7fff6a8f6629            ; cerror_nocancel
    0x7fff6a8f888c <+20>: retq   
Target 0: (Python) stopped.
(lldb) q

@hidmic
Copy link
Contributor Author

hidmic commented Jul 16, 2020

the only test that fails with rmw_cyclonedds_cpp (locally on my mac) is the deadline one and with that one, it really is the test that is broken: if there is no publisher, there is no instance and no deadline missed

@eboasson there is a publisher being instantiated. Nonetheless, I'd intuitively expect a subscriber with a given deadline QoS to complain if that deadline is missed regardless of ongoing publications or even publisher existence.

and the other one simply has an incorrect expectation (the test as written really only gives you 2x2 deadline missed events, not 5). (Incidentally, with rmw_fastrtps_cpp it fails in exactly the same way.)

True that. Fixing.

@hidmic
Copy link
Contributor Author

hidmic commented Jul 16, 2020

Interestingly enough, my expectation does not match the DDS definition of deadline. And transitively, neither ROS definition.

@hidmic
Copy link
Contributor Author

hidmic commented Jul 16, 2020

I don't see how it's going to work without changes to rclpy: I did another bit of digging and as far as I can tell, "handle.destroy" gets invoked after the 2s have passed, but it doesn't result in calling _rclpy_destroy_publisher.

@eboasson you're right, but the issue doesn't show itself when using FastDDS. Investigating.

@eboasson
Copy link

the only test that fails with rmw_cyclonedds_cpp (locally on my mac) is the deadline one and with that one, it really is the test that is broken: if there is no publisher, there is no instance and no deadline missed

@eboasson there is a publisher being instantiated. Nonetheless, I'd intuitively expect a subscriber with a given deadline QoS to complain if that deadline is missed regardless of ongoing publications or even publisher existence.

If I'm not mistaken, there is no publisher in the "test_deadline_no_publisher" test: while it does instantiate a QosTestPublisher, the constructor does next to nothing, leaving the creation of the publisher until later (setup_start).

Not generating deadline missed notifications if there is no instance is the way it is defined in the DDS spec, and I do think it is the right choice: in the DDS data model, there are keys identifying instances and deadline misses are tied to instances (where it actually informs you for which instance a deadline was missed). Naturally you don't want these notifications for all instances that might come into existence in the future.

The case where there are no keys (which ROS 2 uses) is treated by the spec as the limit case where there is at most 1 instance, and so also requires the presence of data. I happen to agree with that choice, but one could perfectly well argue that in that case there should be no concept of an instance. Doing that breaks various other things through nasty interactions, for example, the case of a transient/persistent topic: without an instance, you can't dispose data, without disposing data, you can't get rid of transient/persistent data. Of course one can work around such details by adding special cases, but there are too many special cases in the specs already ...

The other issue with generating deadline missed notifications without data being present, is that it is not unusual for a system to take a fair amount of time to initialise and then run at a sustained high rate. If you'd use a deadline that makes sense for the update rate, you'd be getting spurious deadline missed notifications during initialisation. Of course you could ignore those, or (in the DDS interface anyway) start with the notifications disabled, enabling them when the system becomes operational, but that has its own donwsides.

The usual OMG solution to such a problem is to make the behaviour configurable, but if you ask me, there are too many QoS settings as it is ...

@hidmic
Copy link
Contributor Author

hidmic commented Jul 16, 2020

Found the issue with rclpy, but incidentally, a fix was merged 15 minutes ago (see ros2/rclpy#603). I can confirm this is no longer an issue for none of the current RMW implementations. Thanks @eboasson @MiguelCompany !

@hidmic hidmic closed this as completed Jul 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlog bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants