Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[demo_nodes_cpp] add_two_ints service is flakey #304

Open
jacobperron opened this issue Dec 14, 2018 · 15 comments
Open

[demo_nodes_cpp] add_two_ints service is flakey #304

jacobperron opened this issue Dec 14, 2018 · 15 comments
Labels
bug Something isn't working

Comments

@jacobperron
Copy link
Member

jacobperron commented Dec 14, 2018

Bug report

Required Info:

  • Operating System:
    • OSX High Sierra
  • Installation type:
    • Binary
  • Version or commit hash:
    • Crystal (pre-release)
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

In one terminal start the server:

ros2 run demo_nodes_cpp add_two_ints_server

In another terminal repeatedly run the client:

ros2 run demo_nodes_cpp add_two_ints_client

The same behavior also occurs for the async client:

ros2 run demo_nodes_cpp add_two_ints_client_async

Expected behavior

Server receives clients request and client gets the response.

Actual behavior

Sometimes (~%30 of the time) the server never gets the request and the client hangs (does not return).

Additional information

Cannot reproduce with OpenSplice and Connext, which leads me to believe that this is a Fast-RTPS issue.

@jacobperron jacobperron added the bug Something isn't working label Dec 14, 2018
@nuclearsandwich
Copy link
Member

nuclearsandwich commented Dec 14, 2018

I was unable to reproduce this using Fast-RTPS a from-source build on Bionic

Using

for i in $(seq 1 200); do
  printf "Run #$i:"
  ros2 run demo_nodes_cpp add_two_ints_client
done

returned successfully all 200 times with both the regular and async client.

@clalancette
Copy link
Contributor

I ran the same test as @nuclearsandwich with Connext on Xenial. There, out of 200 runs, I got exactly 1 run where the client hung forever, and 1 run where the client had to try several times to discover the server (the server remained running the entire time).

@jacobperron
Copy link
Member Author

59/200 where client hung. Fast-RTSP on High Sierra.

@apojomovsky
Copy link

I was able to reproduce the issue on Windows fat binaries both with FastRTPS and Connext using this .bat file:

@echo off
FOR /L %%d IN (1,1,200) DO  (
    echo Run %%d
    ros2 run demo_nodes_cpp add_two_ints_client
)

With FastRTPS it usually hang on the 5th - 8th iteration whereas with Connext it hang in the 15th - 17th

@cottsay
Copy link
Member

cottsay commented Dec 14, 2018

I haven't been able to repro this. I've tried:

  • 1000 iterations on Fedora from Source using FastRTPS
  • 200 iterations on 18.04 ARM64 from debs using FastRTPS or OpenSplice
  • 200 iterations on 18.04 ARM64 from source using FastRTPS or OpenSplice

@jacobperron
Copy link
Member Author

Also, running a Python server with a C++ client results in the same issue:

ros2 run demo_nodes_py add_two_ints_server

@jacobperron
Copy link
Member Author

I was able to reproduce with Fast-RTPS on Xenial by also making the CPU busy during the test (e.g. building a workspace with colcon). The client remains hung even after the CPU usage lowers.

This sounds likely related ros2/rmw_fastrtps#239

@tfoote
Copy link
Contributor

tfoote commented Jan 7, 2019

FYI for a more reproducible stress test I recommend the stress tool. It's available as the deb stress

@jacobperron
Copy link
Member Author

FYI for a more reproducible stress test I recommend the stress tool. It's available as the deb stress

Thanks, I discovered that soon after 👍

@tfoote
Copy link
Contributor

tfoote commented Jan 7, 2019

Completely separately it's just been submitted as a rosdep key too: ros/rosdistro#19870

@jacobperron
Copy link
Member Author

Potentially related: ros2/rmw_fastrtps#238 (review)

It appears that when the client hangs, it could be due to the number of requested subscribers being reported as zero at the rmw layer (even though there is a service server running). This doesn't explain the same issue with Connext though. Further investigation is required.

@mkhansenbot
Copy link

I'm not sure I follow the status of this thread. I see @nuclearsandwich said:

I was unable to reproduce this using Fast-RTPS a from-source build on Bionic

Do we have a table of which OS and rmw implementations have this problem?

The reason I ask is we see some behavior in the Navigation2 stack on Bionic that may be explained by this, it's not clear to me from this thread.

@mkhansenbot
Copy link

wait, now I see that ros2/examples#228 is the equivalent for Bionic (18.04).

So what's the status? Is this bug root caused?

@clalancette
Copy link
Contributor

The reason I ask is we see some behavior in the Navigation2 stack on Bionic that may be explained by this, it's not clear to me from this thread.

Could you give some more details about your problem?

So what's the status? Is this bug root caused?

I don't think so. If you look carefully at the reports above, it looks like it mostly happens with Fast-RTPS (or, at least, it was mostly tested with Fast-RTPS). There also seems to be some reports of it with Connext.

There is some possibility that this problem will be improved by the discovery changes that went into FastRTPS 1.7.1 (specifically eProsima/Fast-DDS#411). But that is not a root cause analysis, and as it is we can't upgrade Crystal to FastRTPS 1.7.1 since it is ABI incompatible (though it is API compatible).

@mkhansen-intel There are a couple of things you could try:

  1. Try running with opensplice and see if the situation improves for you. That will at least point to whether it is in the underlying RTPS/DDS implementation or in ROS 2 proper.
  2. Compile FastRTPS 1.7.1 and try it with Nav2 to see if it improves things. Again, the results of this test will help to narrow down whether it is in Fast-RTPS or in ROS 2. Also, if this improves things, this will give us even more motivation to backport the FastRTPS 1.7.1 changes into Crystal in an ABI-compatible way.

@cevans87
Copy link
Contributor

This issue also appears frequently (usually within 20 iterations) with Opensplice on arm64v8 Ubuntu 18.04. Specifically using build from https://ci.ros2.org/view/packaging/job/packaging_linux-aarch64/824/.

Additionally, I see the same behavior more frequently (usually within 5 iterations) with other demos such as minimal_service/minimal_client
terminal 1:
RMW_IMPLEMENTATION=rmw_opensplice_cpp ros2 run examples_rclcpp_minimal_service service_main
terminal 2:
while :; do RMW_IMPLEMENTATION=rmw_opensplice_cpp ros2 run examples_rclcpp_minimal_client client_main; done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

8 participants