Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parameter client bug: wait_for_service sometimes failed #660

Closed
huchhong opened this issue Mar 19, 2019 · 24 comments
Closed

parameter client bug: wait_for_service sometimes failed #660

huchhong opened this issue Mar 19, 2019 · 24 comments

Comments

@huchhong
Copy link

Bug report

Required Info:

  • Operating System:
    • ubuntu 16.04
  • Installation type:
    • from binary of crystal release
  • Version or commit hash:
  • DDS implementation:
  • Client library (if applicable):

Steps to reproduce issue

Yesterday, i wrote a simple test to get myself familiar with parameter client in ros2. My code is hosted at example. There is a luanch script which launchs a node to set parameter and another node to get parameter from first node using SyncParametersClient interaface.

I ran this launch script multiple times, and found that sometimes the second node will wait too long for service after the first node has launched. This seems not good to me.

In a more rare case, after the first node has launched, the second node never successfully wait_for_service. The log is given below:

➜  ~ ros2 launch param_test access_other_node_param_launch.py
[INFO] [launch]: process[set_and_get_parameters-1]: started with pid [38386]
[INFO] [launch]: process[access_other_node_parameters-2]: started with pid [38387]
[INFO] [access_other_node_parameters]: access parameters of other node
[INFO] [set_and_get_parameters]:
Parameter name: foo
Parameter value (integer): 2
Parameter name: baz
Parameter value (double): 1.450000
Parameter name: bar
Parameter value (string): hello
Parameter name: foobar
Parameter value (bool): true
[INFO] [access_other_node_parameters]: service not available, waiting again...
[INFO] [access_other_node_parameters]: service not available, waiting again...
[INFO] [access_other_node_parameters]: service not available, waiting again...

To reproduce this error, you can execute the launch file multiple times, maybe 10 or 20 times, and this should happen.

Expected behavior

Quick and successful response for wait_for_service in parameter client

Actual behavior

waiting too long or more worse, never succuess

@huchhong
Copy link
Author

No one check this issue? I have a similar issue #661, please check if you can reproduce my issue using my method. These two issues have stopped me from moving on to ros2.

@clalancette
Copy link
Contributor

clalancette commented Mar 20, 2019

There is some possibility that this issue will be solved by the Fast-RTPS backport for 1.7.2 we're going to do for the next Crystal patch release. It's worth a try, though it is going to be a bit complicated at the moment. If you want to try it before we have the next official patch release, you can do something like:

  1. Check out the entire ros2 workspace.
  2. Go into eProsima/Fast-RTPS, and checkout tag v1.7.2
  3. Backport Make sure to include the C++ headers used by these headers. rmw_fastrtps#256
  4. Backport Fixed race condition between taking sample and updating counter. rmw_fastrtps#264
  5. Rebuild
  6. Rerun your test

A report back whether this helps would be useful.

Another test you can run is to try out Opensplice, to see whether this is a problem in the RTPS/DDS layer or ROS 2. To do that, you should:

  1. Make sure you have installed opensplice: sudo apt-get install ros-crystal-rmw-opensplice-cpp
  2. Setup your environment for using opensplice: export RMW_IMPLEMENTATION=rmw_opensplice_cpp
  3. Rerun your test.

Again, a report on whether this helps would be useful.

@huchhong
Copy link
Author

ok, i will try these two tests you mentioned.

@huchhong
Copy link
Author

Here is my test result:

test1(update fastrpts) : not works. It still hangs in rare case. Worse is that wait_for_service never returns for timeout in this test when it failed. And @clalancette , you mentioned twice backport ros2/rmw_fastrtps#264, anything wrong?

test2: it works. After switching to openslice, the respone of wait_for_service is quick and always works.

@huchhong
Copy link
Author

I am trying navigation2 right now. There is fatal error when i launch nav2_bringup**_launch.py with openslice. the output is :

[INFO] [launch]: process[world_model-1]: started with pid [73952]
[INFO] [launch]: process[dwb_controller-2]: started with pid [73953]
[INFO] [launch]: process[navfn_planner-3]: started with pid [73954]
[INFO] [launch]: process[simple_navigator-4]: started with pid [73961]
[INFO] [launch]: process[mission_executor-5]: started with pid [73964]
[INFO] [simple_navigator]: Initializing
[INFO] [navfn_planner]: Initializing.
^C[WARNING] [launch.LaunchService]: user interrupted with ctrl-c (SIGINT)
[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [navfn_planner]: Shutting down.
[ERROR] [launch]: process[world_model-1] process has died [pid 73952, exit code -6, cmd '/home/huch/nav2_ws/install/nav2_world_model/lib/nav2_world_model/world_model __params:=/home/huch/nav2_ws/install/nav2_bringup/share/nav2_bringup/launch/nav2_params.yaml'].
[ERROR] [launch]: process[dwb_controller-2] process has died [pid 73953, exit code -6, cmd '/home/huch/nav2_ws/install/dwb_controller/lib/dwb_controller/dwb_controller __params:=/home/huch/nav2_ws/install/nav2_bringup/share/nav2_bringup/launch/nav2_params.yaml'].
[INFO] [launch]: process[navfn_planner-3]: process has finished cleanly
[INFO] [launch]: process[simple_navigator-4]: process has finished cleanly
[INFO] [launch]: process[mission_executor-5]: process has finished cleanly
[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [simple_navigator]: Shutting down
[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [local_costmap.local_costmap]: Using plugin "static_layer"
[INFO] [local_costmap.local_costmap]: Requesting the map...
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
  what():  could not create subscription: rcl node's context is invalid, at /tmp/binarydeb/ros-crystal-rcl-0.6.3/src/rcl/node.c:531
[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [global_costmap.global_costmap]: Using plugin "static_layer"
[INFO] [global_costmap.global_costmap]: Requesting the map...
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
  what():  could not create subscription: rcl node's context is invalid, at /tmp/binarydeb/ros-crystal-rcl-0.6.3/src/rcl/node.c:531

when with default fastrtps, things are ok.

what's wrong with openslice in nav2?

@clalancette
Copy link
Contributor

And @clalancette , you mentioned twice backport ros2/rmw_fastrtps#264, anything wrong?

Sorry, that was a typo. The second one should have been ros2/rmw_fastrtps#256; will update the original comment with that.

@clalancette
Copy link
Contributor

test1(update fastrpts) : not works. It still hangs in rare case. Worse is that wait_for_service never returns for timeout in this test when it failed. And @clalancette , you mentioned twice backport ros2/rmw_fastrtps#264, anything wrong?

So the question I still have; is this better, worse, or the same as with Fast-RTPS v1.7.0? That will give us some more information about whether v1.7.2 is going the right direction.

@huchhong
Copy link
Author

I would like to say v1.7.2 is at most the same as v1.7.0 in my test, if it is not worse. In v1.7.0, wait_for_service will return for timeout when failed, bu in 1.7.2, it will block in wait_for_service forever.

@clalancette
Copy link
Contributor

I would like to say v1.7.2 is at most the same as v1.7.0 in my test, if it is not worse. In v1.7.0, wait_for_service will return for timeout when failed, bu in 1.7.2, it will block in wait_for_service forever.

All right, thanks for the feedback. @richiware , it seems like this is probably a bug in Fast-RTPS (or rmw_fastrtps), but I haven't had time to reproduce myself. A look would be appreciated.

@huchhong
Copy link
Author

update: I found that it was actually block in 'get_parameters' instead of wait_for_service. And, i tried my test on another PC, I can reproduce timeout problem in v1.7.0 while can't reproduce blocking problem in v1.7.2. It is wired to me.

@clalancette
Copy link
Contributor

update: I found that it was actually block in 'get_parameters' instead of wait_for_service. And, i tried my test on another PC, I can reproduce timeout problem in v1.7.0 while can't reproduce blocking problem in v1.7.2. It is wired to me.

All right, thanks for the update. At this point, we are going to go ahead with updating Crystal to 1.7.2. Once that is out, we can retest with the official binaries and ensure that everything is working. I'll leave this open until then.

@dirk-thomas
Copy link
Member

Once that is out, we can retest with the official binaries and ensure that everything is working.

There is no benefit in waiting with testing until that happens. I will recommend to just build the latest version of FastRTPS and it's RMW packages from source and test it now. If the new version doesn't address the problem any solution would otherwise be delayed another month.

@clalancette
Copy link
Contributor

There is no benefit in waiting with testing until that happens. I will recommend to just build the latest version of FastRTPS and it's RMW packages from source and test it now. If the new version doesn't address the problem any solution would otherwise be delayed another month.

My point is that that was already done above, and we have two different results from two different machines. I haven't seen this particular problem myself, so my thought is that it is an environmental issue which will be resolved with the official binaries.

@clalancette
Copy link
Contributor

clalancette commented Mar 29, 2019

@huchhong Binaries of Fast-RTPS v1.7.2 and an updated rmw_fastrtps with the fixes are available. We would appreciate a retest of your problem with the binary packages; to get them, you'll have to do a few things:

  1. Edit /etc/apt/sources.list.d/ros2-latest so that it reads:
deb [arch=amd64,arm64] http://packages.ros.org/ros2-testing/ubuntu bionic main
  1. sudo apt-get update && sudo apt-get dist-upgrade
  2. Complete recompile of your application

The last step is very important as this update is an ABI break; it's probably easiest to completely remove your build, install, and log directories in your colcon workspace, then rebuild. A report of whether this update fixes your problem on all of your machines would help us decide whether we can close this issue or whether we need to look further. Thanks.

@huchhong
Copy link
Author

I have retested my problem according to your procedure. I make sure these things:

  1. ros2-testing is really updated
  2. source /opt/ros/crystal/install.zsh instead of my custom compiled version.
  3. recompile my application (rm build, install, log , then recompile)

On my previous failed PC, problem still exists that program would blocks in 'get_parameter' function just like in my custom compiled ros2 version.

@huchhong
Copy link
Author

I test my case on a virtual machine which never installed ros2 before. After i installed ros2 from ros2-testing source and compiled my code against ros2, i can reproduce this blocking problem.

@richiware
Copy link

Sorry for the delay, I will try to check this week and find a solution.

@clalancette
Copy link
Contributor

I test my case on a virtual machine which never installed ros2 before. After i installed ros2 from ros2-testing source and compiled my code against ros2, i can reproduce this blocking problem.

Thanks for doing the test again. Looks like there is still something of a problem.

Sorry for the delay, I will try to check this week and find a solution.

We appreciate you looking into it. Let us know when you have something that we can test.

@richiware
Copy link

I was not able to check it. We are working on the next release and I didn't have time. We have this issue tracked on our tracker system. We will address it at some point. Sorry.

@dirk-thomas
Copy link
Member

I just want to mention that it would be really good to address the problem in the upcoming release. The earlier we can test a potential fix the more likely will it make it into the upcoming release.

@LuisGP
Copy link

LuisGP commented Apr 26, 2019

Hi everybody,

We are not being able to reproduce this issue using a fresh installation of ROS2 Crystal patch 4.

@MiguelCompany
Copy link
Contributor

I think this may have been solved by eProsima/Fast-DDS#532.

@huchhong Could you test with the current master to check if the issue is still there?

@MiguelCompany
Copy link
Contributor

@clalancette This was reported for crystal and, though we know there is room for improvement on services, I think it can be closed.

@clalancette
Copy link
Contributor

All right, closing. If someone is still having an issue with this, please feel free to open another bug.

nnmm pushed a commit to ApexAI/rclcpp that referenced this issue Jul 9, 2022
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
DensoADAS pushed a commit to DensoADAS/rclcpp that referenced this issue Aug 5, 2022
…os2#660)

* Remove temporary directory platform-specific logic from test fixture

Signed-off-by: Emerson Knapp <eknapp@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants