Not getting service responses reliably when using CycloneDDS #74

Open
hidmic opened this issue Nov 14, 2019 · 9 comments
Labels
bug Something isn't working

Comments

@hidmic
Contributor

hidmic commented Nov 14, 2019

Bug report

Required Info:

  • Operating System:
    • Ubuntu 18.04, both AMD64 and ARM64
  • Installation type:
    • Binaries and from source
  • Version or commit hash:
    • b92db52bb14ec097946bfe44d2b337359dddfbab (using 4089fee)
  • DDS implementation:
    • CycloneDDS
  • Client library (if applicable):
    • rclcpp only

Steps to reproduce issue

  1. On one terminal, run:
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 run demo_nodes_cpp add_two_ints_server
  2. On another terminal, run:
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 run demo_nodes_cpp add_two_ints_client_async

Expected behavior

A single service call takes place and succeeds, i.e. you get logs on each terminal:

[INFO] [add_two_ints_server]: Incoming request
a: 2 b: 3
[INFO] [add_two_ints_client]: Result of add_two_ints: 5

Actual behavior

The client hangs with no output, though the server does log, so it appears to have received the request.
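
For context, the async client follows the standard rclcpp pattern sketched below (condensed, not the exact demo_nodes_cpp source; API spellings follow current rclcpp and may differ slightly across distros). The hang manifests as spin_until_future_complete never returning because the response never arrives:

#include <chrono>
#include <memory>

#include "example_interfaces/srv/add_two_ints.hpp"
#include "rclcpp/rclcpp.hpp"

using namespace std::chrono_literals;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("add_two_ints_client");
  auto client = node->create_client<example_interfaces::srv::AddTwoInts>("add_two_ints");

  // wait_for_service() only tells us the server's request reader has been
  // discovered; it says nothing about the server having discovered this
  // client's response reader.
  while (!client->wait_for_service(1s)) {
    RCLCPP_INFO(node->get_logger(), "waiting for service to appear...");
  }

  auto request = std::make_shared<example_interfaces::srv::AddTwoInts::Request>();
  request->a = 2;
  request->b = 3;
  auto result = client->async_send_request(request);

  // If the server publishes the reply before it has discovered the client's
  // response reader, the (volatile) reply never reaches the client and this
  // call never completes -- the hang described above.
  if (rclcpp::spin_until_future_complete(node, result) == rclcpp::FutureReturnCode::SUCCESS) {
    RCLCPP_INFO(node->get_logger(), "Result of add_two_ints: %ld", result.get()->sum);
  }
  rclcpp::shutdown();
  return 0;
}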

Additional information

It does not happen with rclpy. It also does not happen when bringing up both nodes via launch, e.g.:

from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description():
    return LaunchDescription([
        Node(
            package='demo_nodes_cpp',
            node_executable='add_two_ints_server',
            output='screen'
        ),
        Node(
            package='demo_nodes_cpp',
            node_executable='add_two_ints_client_async',
            output='screen'
        )
    ])

So it's likely a race.

@wjwwood transferred this issue from ros2/rclcpp Dec 5, 2019
@hidmic
Contributor Author

hidmic commented Dec 6, 2019

I can still reproduce on master.

@dirk-thomas added the bug label Dec 6, 2019
@dirk-thomas
Member

I can confirm the problem.

My first guess is a race: the client discovers the server and sends its request, but when the server receives the request and wants to publish the reply, it hasn't discovered the client yet.

With a simple hack (adding a sleep of 500 ms into the server callback function) the client receives the response reliably.

I guess we might need something like a "wait_for_client", in the same way as we already offer a "wait_for_service".
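
For reference, the 500 ms hack amounts to something like the sketch below in the server callback (based loosely on the demo_nodes_cpp AddTwoInts server; illustrative only, not a proposed fix):

#include <chrono>
#include <memory>
#include <thread>

#include "example_interfaces/srv/add_two_ints.hpp"

void handle_add_two_ints(
  const std::shared_ptr<example_interfaces::srv::AddTwoInts::Request> request,
  std::shared_ptr<example_interfaces::srv::AddTwoInts::Response> response)
{
  // Delaying the reply gives endpoint discovery time to complete, which is
  // why the client then receives the response reliably -- it papers over the
  // race rather than fixing it.
  std::this_thread::sleep_for(std::chrono::milliseconds(500));
  response->sum = request->a + request->b;
}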

@eboasson
Collaborator

eboasson commented Dec 6, 2019

My first guess is a race: the client discovers the server and sends its request, but when the server receives the request and wants to publish the reply, it hasn't discovered the client yet.

Cyclone implements what the spec calls "stateful" readers and writers, which means the server's reader for the request topic won't accept any data until it has discovered the client's writer. "wait_for_service" waits until the server's reader for the request topic has been discovered; consequently, the client's writer will hold on to the request message until the service's reader acknowledges it. It is therefore pretty much impossible that the service is unaware of the existence of the client.

Reader and writer discovery do run independently in RTPS and so it is technically possible that the server has discovered the client's request writer and handled the request without having discovered the client's response reader yet, but that does seem unlikely.

I haven't had a chance to try to reproduce it yet (and this weekend it's not likely to happen either because of all kinds of obligations), but if I were to try, I'd probably enable Cyclone's tracing to get some more insight into what is going on. (Not for the faint of heart, allegedly … CYCLONEDDS_URI='finest</>cdds.log.${CYCLONEDDS_PID}' suffices. You have to squint a bit to figure out how that relates to proper XML 🙂.)

@hidmic
Contributor Author

hidmic commented Dec 12, 2019

Reader and writer discovery do run independently in RTPS and so it is technically possible that the server has discovered the client's request writer and handled the request without having discovered the client's response reader yet, but that does seem unlikely.

I'm no expert here. Having said that, I think that may be the case @eboasson.

The server sets up a response writer and a request reader first:

1576167310.543602 [0] add_two_in: new_writer(guid 8c561001:dc2ce137:5055b4d2:1303, (default).rr/add_two_intsReply/example_interfaces::srv::dds_::AddTwoInts_Response_)
1576167310.543645 [0] add_two_in: ref_participant(8c561001:dc2ce137:5055b4d2:1c1 @ 0x559c5a20f930 <- 8c561001:dc2ce137:5055b4d2:1303 @ 0x559c5a24ac84) user 20 builtin 15
1576167310.543658 [0] add_two_in: WRITER 8c561001:dc2ce137:5055b4d2:1303 QOS={topic=rr/add_two_intsReply,type=example_interfaces::srv::dds_::AddTwoInts_Response_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,durability_service=0.000000000:{0:1}:{-1:-1:-1},deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,lifespan=9223372036.854775807,user_data=0<>,ownership=0,ownership_strength=0,writer_data_lifecycle={1},entity_factory=1,ignorelocal=0}
1576167310.543665 [0] add_two_in: write_sample 0:0:0:3c2 #9: ST0 DCPSPublication/org::eclipse::cyclonedds::builtin::DCPSPublication:(blob)
1576167310.543668 [0] add_two_in: reader 8c561001:dc2ce137:5055b4d2:307
1576167310.543673 [0] add_two_in: match_writer_with_proxy_readers(wr 8c561001:dc2ce137:5055b4d2:1303) scanning all proxy readers
1576167310.543680 [0] add_two_in: match_writer_with_readers(wr 8c561001:dc2ce137:5055b4d2:1303) scanning all readers
1576167310.543687 [0] add_two_in: sedp: write for 8c561001:dc2ce137:5055b4d2:1303 via 8c561001:dc2ce137:5055b4d2:3c2
1576167310.543692 [0] add_two_in: write_sample 8c561001:dc2ce137:5055b4d2:3c2 #9: ST0 (null)/(null):(plist)
1576167310.543699 [0] add_two_in: new_reader(guid 8c561001:dc2ce137:5055b4d2:1404, (default).rq/add_two_intsRequest/example_interfaces::srv::dds_::AddTwoInts_Request_)
1576167310.543702 [0] add_two_in: ref_participant(8c561001:dc2ce137:5055b4d2:1c1 @ 0x559c5a20f930 <- 8c561001:dc2ce137:5055b4d2:1404 @ 0x559c5a24bd74) user 21 builtin 15
1576167310.543709 [0] add_two_in: READER 8c561001:dc2ce137:5055b4d2:1404 QOS={topic=rq/add_two_intsRequest,type=example_interfaces::srv::dds_::AddTwoInts_Request_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,user_data=0<>,ownership=0,time_based_filter=0.000000000,reader_data_lifecycle=9223372036.854775807:9223372036.854775807,reader_lifespan={0,9223372036.854775807},subscription_keys={0,{}},entity_factory=1,ignorelocal=0}
1576167310.543714 [0] add_two_in: write_sample 0:0:0:4c2 #11: ST0 DCPSSubscription/org::eclipse::cyclonedds::builtin::DCPSSubscription:(blob)
1576167310.543716 [0] add_two_in: reader 8c561001:dc2ce137:5055b4d2:207
1576167310.543719 [0] add_two_in: match_reader_with_proxy_writers(rd 8c561001:dc2ce137:5055b4d2:1404) scanning all proxy writers
1576167310.543721 [0] add_two_in: match_reader_with_writers(rd 8c561001:dc2ce137:5055b4d2:1404) scanning all writers
1576167310.543727 [0] add_two_in: sedp: write for 8c561001:dc2ce137:5055b4d2:1404 via 8c561001:dc2ce137:5055b4d2:4c2
1576167310.543730 [0] add_two_in: write_sample 8c561001:dc2ce137:5055b4d2:4c2 #8: ST0 (null)/(null):(plist)

When the client is brought up, it discovers the server in full:

1576167362.690709 [0] dq.builtin: write_sample 0:0:0:3c2 #9: ST0 DCPSPublication/org::eclipse::cyclonedds::builtin::DCPSPublication:(blob)
1576167362.690712 [0] dq.builtin: reader 1481001:ddd08e4f:524471cd:307
1576167362.690717 [0] dq.builtin: match_proxy_writer_with_readers(pwr 8c561001:dc2ce137:5055b4d2:1103) scanning all readers
1576167362.690722 [0] add_two_in: new_writer(guid 1481001:ddd08e4f:524471cd:503, (default).rr/add_two_ints_client/get_parametersReply/rcl_interfaces::srv::dds_::GetParameters_Response_)
1576167362.690728 [0] dq.builtin: SEDP ST0 8c561001:dc2ce137:5055b4d2:1303 reliable volatile writer: (default).rr/add_two_intsReply/example_interfaces::srv::dds_::AddTwoInts_Response_ NEW (as udp/239.255.0.1:7401 udp/192.168.9.2:38672 ssm=0) QOS={topic=rr/add_two_intsReply,type=example_interfaces::srv::dds_::AddTwoInts_Response_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,durability_service=0.000000000:{0:1}:{-1:-1:-1},deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,lifespan=9223372036.854775807,user_data=0<>,ownership=0,ownership_strength=0,writer_data_lifecycle={1},entity_factory=1,ignorelocal=0}
...
1576167362.690954 [0] dq.builtin: write_sample 0:0:0:4c2 #11: ST0 DCPSSubscription/org::eclipse::cyclonedds::builtin::DCPSSubscription:(blob)
1576167362.690957 [0] dq.builtin: reader 1481001:ddd08e4f:524471cd:207
1576167362.690961 [0] dq.builtin: match_proxy_reader_with_writers(prd 8c561001:dc2ce137:5055b4d2:1204) scanning all writers
1576167362.690974 [0] dq.builtin: SEDP ST0 8c561001:dc2ce137:5055b4d2:1404 reliable volatile reader: (default).rq/add_two_intsRequest/example_interfaces::srv::dds_::AddTwoInts_Request_ NEW (as udp/239.255.0.1:7401 udp/192.168.9.2:38672 ssm=0) QOS={topic=rq/add_two_intsRequest,type=example_interfaces::srv::dds_::AddTwoInts_Request_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,user_data=0<>,ownership=0,time_based_filter=0.000000000,reader_data_lifecycle=9223372036.854775807:9223372036.854775807,reader_lifespan={0,9223372036.854775807},subscription_keys={0,{}},entity_factory=1,ignorelocal=0}
1576167362.690981 [0] dq.builtin: write_sample 0:0:0:4c2 #12: ST0 DCPSSubscription/org::eclipse::cyclonedds::builtin::DCPSSubscription:(blob)
1576167362.690983 [0] dq.builtin: reader 1481001:ddd08e4f:524471cd:207
1576167362.690987 [0] dq.builtin: match_proxy_reader_with_writers(prd 8c561001:dc2ce137:5055b4d2:1404) scanning all writers

And then it sets up a response reader and a request writer:

1576167362.694616 [0] add_two_in: new_writer(guid 1481001:ddd08e4f:524471cd:1303, (default).rq/add_two_intsRequest/example_interfaces::srv::dds_::AddTwoInts_Request_)
1576167362.694622 [0] add_two_in: ref_participant(1481001:ddd08e4f:524471cd:1c1 @ 0x5582e7add940 <- 1481001:ddd08e4f:524471cd:1303 @ 0x5582e7b1cad4) user 20 builtin 15
1576167362.694631 [0] add_two_in: WRITER 1481001:ddd08e4f:524471cd:1303 QOS={topic=rq/add_two_intsRequest,type=example_interfaces::srv::dds_::AddTwoInts_Request_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,durability_service=0.000000000:{0:1}:{-1:-1:-1},deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,lifespan=9223372036.854775807,user_data=0<>,ownership=0,ownership_strength=0,writer_data_lifecycle={1},entity_factory=1,ignorelocal=0}
1576167362.694635 [0] add_two_in: write_sample 0:0:0:3c2 #18: ST0 DCPSPublication/org::eclipse::cyclonedds::builtin::DCPSPublication:(blob)
1576167362.694637 [0] add_two_in: reader 1481001:ddd08e4f:524471cd:307
1576167362.694641 [0] add_two_in: match_writer_with_proxy_readers(wr 1481001:ddd08e4f:524471cd:1303) scanning all proxy readers
1576167362.694644 [0] add_two_in:   proxy_reader_add_connection(wr 1481001:ddd08e4f:524471cd:1303 prd 8c561001:dc2ce137:5055b4d2:1404)
1576167362.694646 [0] add_two_in:   writer_add_connection(wr 1481001:ddd08e4f:524471cd:1303 prd 8c561001:dc2ce137:5055b4d2:1404) - ack seq 0
1576167362.694650 [0] add_two_in: setcover: all_addrs udp/239.255.0.1:7401 udp/192.168.9.2:38672
1576167362.694653 [0] add_two_in: reduced nlocs=2
1576167362.694657 [0] add_two_in:   loc  0 = udp/192.168.9.2:38672           1 { 0 }
1576167362.694659 [0] add_two_in:   loc  1 = udp/239.255.0.1:7401            1 { 0 }
1576167362.694660 [0] add_two_in:   best = 0
1576167362.694662 [0] add_two_in:   simple udp/192.168.9.2:38672
1576167362.694665 [0] add_two_in: rebuild_writer_addrset(1481001:ddd08e4f:524471cd:1303): udp/192.168.9.2:38672
1576167362.694670 [0] add_two_in: match_writer_with_readers(wr 1481001:ddd08e4f:524471cd:1303) scanning all readers
1576167362.694677 [0] add_two_in: sedp: write for 1481001:ddd08e4f:524471cd:1303 via 1481001:ddd08e4f:524471cd:3c2
1576167362.694680 [0] add_two_in: write_sample 1481001:ddd08e4f:524471cd:3c2 #9: ST0 (null)/(null):(plist)
1576167362.694685 [0] add_two_in: non-timed queue now has 1 items
1576167362.694704 [0] add_two_in: new_reader(guid 1481001:ddd08e4f:524471cd:1404, (default).rr/add_two_intsReply/example_interfaces::srv::dds_::AddTwoInts_Response_)
1576167362.694707 [0] add_two_in: ref_participant(1481001:ddd08e4f:524471cd:1c1 @ 0x5582e7add940 <- 1481001:ddd08e4f:524471cd:1404 @ 0x5582e7b1e6b4) user 21 builtin 15
1576167362.694734 [0] add_two_in: READER 1481001:ddd08e4f:524471cd:1404 QOS={topic=rr/add_two_intsReply,type=example_interfaces::srv::dds_::AddTwoInts_Response_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,user_data=0<>,ownership=0,time_based_filter=0.000000000,reader_data_lifecycle=9223372036.854775807:9223372036.854775807,reader_lifespan={0,9223372036.854775807},subscription_keys={0,{}},entity_factory=1,ignorelocal=0}
1576167362.694739 [0]        tev: xpack_addmsg 0x7fea58001340 0x5582e7b1e100 0(data(1481001:ddd08e4f:524471cd:3c2:#9/1)): niov 0 sz 0 => now niov 3 sz 228
1576167362.694744 [0] add_two_in: write_sample 0:0:0:4c2 #19: ST0 DCPSSubscription/org::eclipse::cyclonedds::builtin::DCPSSubscription:(blob)
1576167362.694746 [0] add_two_in: reader 1481001:ddd08e4f:524471cd:207
1576167362.694747 [0]         gc: gc 0x5582e7b1f630: not yet, shortsleep
1576167362.694749 [0] add_two_in: match_reader_with_proxy_writers(rd 1481001:ddd08e4f:524471cd:1404) scanning all proxy writers
1576167362.694753 [0] add_two_in:   reader 1481001:ddd08e4f:524471cd:1404 init_acknack_count = 0
1576167362.694755 [0] add_two_in:   reader_add_connection(pwr 8c561001:dc2ce137:5055b4d2:1303 rd 1481001:ddd08e4f:524471cd:1404)
1576167362.694756 [0]        tev: nn_xpack_send 228: 0x7fea5800134c:20 0x5582e7b1e1c8:36 0x5582e7b1dfbc:172 [ udp/192.168.9.2:38672 ]
1576167362.694758 [0] add_two_in:   proxy_writer_add_connection(pwr 8c561001:dc2ce137:5055b4d2:1303 rd 1481001:ddd08e4f:524471cd:1404) - out-of-sync
1576167362.694758 [0]        tev: traffic-xmit (1) 228
1576167362.694761 [0] add_two_in: match_reader_with_writers(rd 1481001:ddd08e4f:524471cd:1404) scanning all writers
1576167362.694768 [0] add_two_in: sedp: write for 1481001:ddd08e4f:524471cd:1404 via 1481001:ddd08e4f:524471cd:4c2
1576167362.694771 [0] add_two_in: write_sample 1481001:ddd08e4f:524471cd:4c2 #8: ST0 (null)/(null):(plist)

The server discovers the client's request writer first:

1576167362.694778 [0] dq.builtin: SEDP ST0 1481001:ddd08e4f:524471cd:1303 reliable volatile writer: (default).rq/add_two_intsRequest/example_interfaces::srv::dds_::AddTwoInts_Request_ NEW (as udp/239.255.0.1:7401 udp/192.168.9.2:37316 ssm=0) QOS={topic=rq/add_two_intsRequest,type=example_interfaces::srv::dds_::AddTwoInts_Request_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,durability_service=0.000000000:{0:1}:{-1:-1:-1},deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,lifespan=9223372036.854775807,user_data=0<>,ownership=0,ownership_strength=0,writer_data_lifecycle={1},entity_factory=1,ignorelocal=0}
1576167362.694798 [0] dq.builtin: write_sample 0:0:0:3c2 #18: ST0 DCPSPublication/org::eclipse::cyclonedds::builtin::DCPSPublication:(blob)
1576167362.694800 [0] dq.builtin: reader 8c561001:dc2ce137:5055b4d2:307
1576167362.694803 [0] dq.builtin: match_proxy_writer_with_readers(pwr 1481001:ddd08e4f:524471cd:1303) scanning all readers
1576167362.694826 [0] dq.builtin:   reader 8c561001:dc2ce137:5055b4d2:1404 init_acknack_count = 0
1576167362.694827 [0] dq.builtin:   reader_add_connection(pwr 1481001:ddd08e4f:524471cd:1303 rd 8c561001:dc2ce137:5055b4d2:1404)

The client then proceeds to send a request:

1576167362.694950 [0] add_two_in: write_sample 1481001:ddd08e4f:524471cd:1303 #1: ST0 rq/add_two_intsRequest/example_interfaces::srv::dds_::AddTwoInts_Request_:{1359775513681701906,1,{2,3}}

And the server receives and replies:

1576167362.695160 [0]     recvUC: INFODST(8c561001:dc2ce137:5055b4d2)
1576167362.695162 [0]     recvUC: INFOTS(1576167362.694938812)
1576167362.695171 [0]     recvUC: DATA(1481001:ddd08e4f:524471cd:1303 -> 8c561001:dc2ce137:5055b4d2:1404 #1 L(:1c1 11198.650423) msr_in_sync(8c561001:dc2ce137:5055b4d2:1404 out-of-sync to tlcatchup)data(application, vendor 1.16): 1481001:ddd08e4f:524471cd:1303 #1: ST0 rq/add_two_intsRequest/example_interfaces::srv::dds_::AddTwoInts_Request_:{1359775513681701906,1,{2,3}} =>8c561001:dc2ce137:5055b4d2:1404
1576167362.695190 [0]     recvUC: )
[INFO] [1576167362.695305532] [add_two_ints_server]: Incoming request
a: 2 b: 3
1576167362.695364 [0] add_two_in: write_sample 8c561001:dc2ce137:5055b4d2:403 #1: ST0 rt/rosout/rcl_interfaces::msg::dds_::Log_:{{1576167362,695305532},20,"add_two_ints_server","Incoming request
a: 2 b: 3","/home/michel/Workspaces/another_ros2_ws/src/ros2/demos/demo_nodes_cpp/src/services/add_two_ints_server.cpp","operator()",43}
1576167362.695472 [0] add_two_in: write_sample 8c561001:dc2ce137:5055b4d2:1303 #1: ST0 rr/add_two_intsReply/example_interfaces::srv::dds_::AddTwoInts_Response_:{1359775513681701906,1,{5}}

But it isn't until shortly after that the server discovers the client's response reader:

1576167362.699695 [0] dq.builtin: SEDP ST0 1481001:ddd08e4f:524471cd:1404 reliable volatile reader: (default).rr/add_two_intsReply/example_interfaces::srv::dds_::AddTwoInts_Response_ NEW (as udp/239.255.0.1:7401 udp/192.168.9.2:37316 ssm=0) QOS={topic=rr/add_two_intsReply,type=example_interfaces::srv::dds_::AddTwoInts_Response_,presentation=0:0:0,partition={},group_data=0<>,topic_data=0<>,durability=0,deadline=9223372036.854775807,latency_budget=0.000000000,liveliness=0:9223372036.854775807,reliability=1:1.000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,transport_priority=0,user_data=0<>,ownership=0,time_based_filter=0.000000000,reader_data_lifecycle=9223372036.854775807:9223372036.854775807,reader_lifespan={0,9223372036.854775807},subscription_keys={0,{}},entity_factory=1,ignorelocal=0}
1576167362.699707 [0] dq.builtin: write_sample 0:0:0:4c2 #19: ST0 DCPSSubscription/org::eclipse::cyclonedds::builtin::DCPSSubscription:(blob)
1576167362.699709 [0] dq.builtin: reader 8c561001:dc2ce137:5055b4d2:207
1576167362.699711 [0] dq.builtin: match_proxy_reader_with_writers(prd 1481001:ddd08e4f:524471cd:1404) scanning all writers
1576167362.699713 [0]         gc: gc 0x7f57b800be60: not yet, shortsleep
1576167362.699715 [0] dq.builtin:   proxy_reader_add_connection(wr 8c561001:dc2ce137:5055b4d2:1303 prd 1481001:ddd08e4f:524471cd:1404)
1576167362.699730 [0] dq.builtin:   writer_add_connection(wr 8c561001:dc2ce137:5055b4d2:1303 prd 1481001:ddd08e4f:524471cd:1404) - ack seq 1
1576167362.699733 [0] dq.builtin: setcover: all_addrs udp/239.255.0.1:7401 udp/192.168.9.2:37316
1576167362.699735 [0] dq.builtin: reduced nlocs=2
1576167362.699739 [0] dq.builtin:   loc  0 = udp/192.168.9.2:37316           1 { 0 }
1576167362.699742 [0] dq.builtin:   loc  1 = udp/239.255.0.1:7401            1 { 0 }
1576167362.699744 [0] dq.builtin:   best = 0
1576167362.699746 [0] dq.builtin:   simple udp/192.168.9.2:37316
1576167362.699749 [0] dq.builtin: rebuild_writer_addrset(8c561001:dc2ce137:5055b4d2:1303): udp/192.168.9.2:37316

I've updated both logs here. Would you mind checking whether my logic is correct? Maybe I'm missing something in the midst of the ack/heartbeat dance.

@eboasson
Collaborator

@hidmic your analysis is entirely correct. I'm a bit surprised that the dance works out this way, but it is definitely what happens.

@dirk-thomas's "wait_for_client" would solve this, but it is a bit of tricky operation in that it must gracefully deal with the client disappearing while the service is waiting. What I would find more elegant is for the client to first wait for the server to have discovered the response reader, and only then send the request. That way the server will never have to wait and is still certain that the response will go out properly.

I think that that is possible in principle: guarantee that the endpoint discovery data doesn't get acknowledged until after it has been processed completely, and then offer a way of doing "wait_for_acknowledgements" on the endpoint discovery writers. But I don't dare predict when that'll be available ...
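
For what it's worth, Cyclone's C API already offers this kind of wait for application writers through dds_wait_for_acks; the proposal above amounts to exposing something equivalent for the builtin SEDP discovery writers, which is not available today. A minimal sketch against an ordinary writer:

#include "dds/dds.h"

// Block until every matched reliable reader has acknowledged all data this
// writer has published so far, or until the timeout expires.  (Application
// writers only; the builtin discovery writers targeted by the proposal above
// are not reachable through this call.)
static bool wait_until_acknowledged(dds_entity_t writer, dds_duration_t timeout)
{
  return dds_wait_for_acks(writer, timeout) == DDS_RETCODE_OK;
}

// e.g. wait_until_acknowledged(request_writer, DDS_SECS(1));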

@hidmic
Contributor Author

hidmic commented Dec 17, 2019

What I would find more elegant is for the client to first wait for the server to have discovered the response reader, and only then send the request.

IMHO, client.wait_for_service() succeeding should imply that both the service server and the client have discovered each other or, in other words, that a request/response exchange may occur without risking either of the two messages getting lost. So I agree in principle with your idea, @eboasson.

But I don't dare predict when that'll be available ...

Services are core functionality in ROS 2, so I personally think we need a fix for this ASAP.

hidmic pushed a commit that referenced this issue Jun 17, 2020
* Block rmw_send_response if response reader unknown

The client checks using rmw_service_server_is_available whether the
request it sends will be delivered to the service, but that does not imply
that the (independent, as far as DDS is concerned) response reader of
the client has been discovered by the service.  Usually that will be the
case, but there is no guarantee.

Ideally DDS would offer an interface that allows checking the reverse
discovery, but that does not yet exist in either the specification or in
Cyclone.  This commit works around that by delaying publishing the
response until the number of request writers matches the number of
response readers.

Signed-off-by: Erik Boasson <eb@ilities.com>

* Change request headers to use rmw_request_id_t on the wire

Signed-off-by: Erik Boasson <eb@ilities.com>

* Precise check for matched client/service

Assign a unique identifier to each client/service on creation, add it
to the USER_DATA QoS of the reader and writer and use it for the request
ids.  This allows:

* rmw_service_server_is_available to only return true once it has
  discovered a reader/writer pair of a single service (rather than a
  reader from one service and a writer from another); and
* rmw_send_response to block until it has discovered the requesting
  client's response reader and to abandon the operation when the client
  has disappeared.

The USER_DATA is formatted in the same manner as the participant
USER_DATA, this uses the keys "serviceid" and "clientid".

This is still but a workaround for the lack of a mechanism in DDS to
ensure that the response reader has been discovered by the request
writer prior to sending the request.

Signed-off-by: Erik Boasson <eb@ilities.com>

* Address review comments

Signed-off-by: Erik Boasson <eb@ilities.com>

* Backwards compatibility

* Revert commit fb040c5 to retain the
  old wire representation;

* Embed the publication_handle of the request inside rmw_request_id_t,
  possible because reverting to the old wire representation frees up
  enough space, and use this in rmw_send_response to check for the
  presence of the client's reader;

* Clients and services without a client/service id in the reader/writer
  user data are treated as fully matched at all times.

* Replace ERROR by FAILURE because of windows.h

Signed-off-by: Erik Boasson <eb@ilities.com>

* Timeout rmw_send_response after waiting 100ms for discovery

The discovery will eventually result in the client's reader being known
or its writer no longer being known, so a timeout is not necessary for
correctness.  However, if it ever were to block for a longish
time (which is possible in the face of network failures), returning a
timeout to the caller is expected to result in less confusion.

Signed-off-by: Erik Boasson <eb@ilities.com>

* Make iterators "const auto &"

Signed-off-by: Erik Boasson <eb@ilities.com>

* Add TODO for eliminating rmw_send_response blocking

Signed-off-by: Erik Boasson <eb@ilities.com>
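
The guard described in the first commit above boils down to something like the following sketch (a hypothetical illustration of the approach, not the actual rmw_cyclonedds_cpp code, which additionally uses the client/service ids carried in USER_DATA):

#include <chrono>
#include <thread>

#include "dds/dds.h"

// Before publishing a response, wait (up to ~100 ms) until the response
// writer has matched at least as many readers as the request reader has
// matched writers, i.e. every client that can send a request can also
// receive its reply.
static bool response_readers_discovered(
  dds_entity_t request_reader, dds_entity_t response_writer)
{
  const auto deadline =
    std::chrono::steady_clock::now() + std::chrono::milliseconds(100);
  while (std::chrono::steady_clock::now() < deadline) {
    dds_publication_matched_status_t pub{};
    dds_subscription_matched_status_t sub{};
    if (dds_get_publication_matched_status(response_writer, &pub) != DDS_RETCODE_OK ||
      dds_get_subscription_matched_status(request_reader, &sub) != DDS_RETCODE_OK)
    {
      return false;
    }
    if (pub.current_count >= sub.current_count) {
      return true;  // every request writer has a counterpart response reader
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
  }
  return false;  // timed out; the caller can report a timeout to the user
}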
jacobperron pushed a commit that referenced this issue Jul 21, 2020 (same commit message as above)
jacobperron added a commit that referenced this issue Jul 21, 2020 (same commit message as above; Co-authored-by: eboasson <eb@ilities.com>)
@hidmic
Contributor Author

hidmic commented Oct 21, 2020

IIRC #187 worked around this problem. @eboasson, is there an ETA for a complete solution (e.g. a DDS-RPC-like implementation for robust discovery or similar)?

@hidmic
Contributor Author

hidmic commented Nov 20, 2020

Some re-design discussion took place in ros2/rmw_fastrtps#418.

@abhisheklolage

abhisheklolage commented Feb 5, 2024

Hi,
I'm reaching out regarding an issue I am seeing in a ROS 2 client-service setup: we're observing a memory leak whose rate increases as we increase the frequency of the client-service interactions. Essentially, the value reported by $(cat /proc/$PID/smaps | grep -i 'Pss:' | awk '{Total+=$2} END {print Total/1000" MB"}') increases over time. Is this related, or is there a separate known issue with ROS 2 clients/services on Cyclone DDS? Are there any workarounds?

System setup:

  • ROS2 Foxy from ros2/ros2@75b13d3
  • Cyclone DDS 0.7.0 / rmw-cyclonedds 0.7.11
  • Build running on arm64

Tagging @dirk-thomas for some insight on the memory leak.

Thanks!
