
FastRTPS drops messages under stress #258

Closed
dejanpan opened this issue Mar 6, 2019 · 10 comments · Fixed by #264

dejanpan commented Mar 6, 2019

Bug report

Steps to reproduce issue

  1. Compile ros2 from https://github.com/ros2/ros2/blob/release-crystal-20190117/ros2.repos
  2. Compile this package https://github.com/ApexAI/experiments
  3. Run the test in https://github.com/ApexAI/experiments
  4. Stress the system with stress-ng --matrix 0

Expected behavior

Messages in this line should not get lost: https://github.com/ApexAI/experiments/blob/master/stress_test/src/stress_test.cpp#L77

Actual behavior

Messages in this line get lost: https://github.com/ApexAI/experiments/blob/master/stress_test/src/stress_test.cpp#L77

Additional information

  • At least FastRTPS 1.7.1 and 1.7.0 lose samples, especially when the system is under high load.
  • The problem does not occur with RTI Connext Pro.

The issue might be connected to #157.

@deeplearningrobotics and @pbaughman found the issue and created a reproducer.


@richiware

@dejanpan I've found a race condition. It occurs when FastRTPS notifies rmw_fastrtps of a new sample after rmw_fastrtps has taken a sample from FastRTPS but before rmw_fastrtps's internal counter (data_) has been updated. In that case FastRTPS's internal counter ends up differing from rmw_fastrtps's internal counter.

In this branch I've implemented a proposed fix: always use FastRTPS's internal counter. Can you test it on your side?


dejanpan commented Mar 8, 2019

@richiware we tested this internally. The fix does not appear to work (see below). Could you reproduce the issue on your side before the fix, and verify that it no longer occurs after the fix? If so, how did you test it?

How we tested:

pete.baughman@ade:~$ mkdir -p stress_test_ws/src
pete.baughman@ade:~$ cd stress_test_ws/src/

pete.baughman@ade:~/stress_test_ws/src$ git clone git@github.com:ApexAI/experiments.git
Cloning into 'experiments'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 9 (delta 0), reused 9 (delta 0), pack-reused 0
Receiving objects: 100% (9/9), done.

pete.baughman@ade:~/stress_test_ws/src$ git clone git@github.com:richiware/rmw_fastrtps.git
Cloning into 'rmw_fastrtps'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 2256 (delta 1), reused 6 (delta 1), pack-reused 2247
Receiving objects: 100% (2256/2256), 733.42 KiB | 5.35 MiB/s, done.
Resolving deltas: 100% (1574/1574), done.

pete.baughman@ade:~/stress_test_ws/src$ cd rmw_fastrtps/
pete.baughman@ade:~/stress_test_ws/src/rmw_fastrtps (master u=)$ git checkout hotfix/race_condition 
Branch 'hotfix/race_condition' set up to track remote branch 'hotfix/race_condition' from 'origin' by rebasing.
Switched to a new branch 'hotfix/race_condition'

pete.baughman@ade:~/stress_test_ws/src/rmw_fastrtps (hotfix/race_condition u=)$ cd ../../
pete.baughman@ade:~/stress_test_ws$ source /opt/ros2/crystal/setup.bash 
pete.baughman@ade:~/stress_test_ws$ colcon build
Starting >>> rmw_fastrtps_shared_cpp
Starting >>> stress_test
Finished <<< stress_test [7.61s]                                                                                 
Finished <<< rmw_fastrtps_shared_cpp [7.64s]
Starting >>> rmw_fastrtps_cpp
Starting >>> rmw_fastrtps_dynamic_cpp
Finished <<< rmw_fastrtps_cpp [8.55s]                                                                                 
Finished <<< rmw_fastrtps_dynamic_cpp [8.88s]                        

Summary: 4 packages finished [16.9s]
pete.baughman@ade:~/stress_test_ws$ source install/setup.bash 
pete.baughman@ade:~/stress_test_ws$ ./build/stress_test/stress_test_exe

This ran until I did "apt-get update" in another window (I was trying to install 'stress' because I was in a fresh ADE), and then it hung at 99 like it did before.

@dejanpan

@deeplearningrobotics, @pbaughman, @richiware: I also tested this today and found that the fix seems to work.

What I needed to do was compile the whole workspace from source. If I only compiled https://github.com/ros2/rmw_fastrtps, the test would still hang at 99. So this is what I did:

Term 1:

cd ~
mkdir ~/ros2_ws/src
cd ~/ros2_ws
wget https://gist.githubusercontent.com/dejanpan/cc83dff7692e41de5086db5ef69262bd/raw/95b3567691c2e0ce1785b706d8d8d010825bed97/patch1.repos
vcs import src < patch1.repos
cd src/ros2/rmw_fastrtps/
git remote add richiware https://github.com/richiware/rmw_fastrtps
git fetch
git checkout richiware/hotfix/race_condition
cd -
colcon build --symlink-install
source install/setup.bash
./build/stress_test/stress_test_exe

If you want to switch back to a broken version:

cd src/ros2/rmw_fastrtps/
git checkout a5675c45103f7c8312f4e991e60a346ca5fe4800
cd -
colcon build --symlink-install
source install/setup.bash
./build/stress_test/stress_test_exe

Term2:

stress-ng --matrix 0

The test has now been running for >1h and hasn't hung.

I think the problem is that rcl and rclcpp link against packages in the rmw_fastrtps repo, and if you do not compile from source the linking is not correct. @richiware could you confirm this?
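If the linking hypothesis above is right, one way to rule out stale linking without rebuilding everything is to rebuild the patched package plus all packages that depend on it (a sketch; --packages-above is a standard colcon flag, and the package name is taken from the build output earlier in this thread):

```shell
# Rebuild the patched package and every package that depends on it,
# so rcl/rclcpp and friends relink against the new rmw_fastrtps libraries.
colcon build --symlink-install --packages-above rmw_fastrtps_shared_cpp
source install/setup.bash
```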

@richiware

When I was testing and implementing the solution I always compiled the whole workspace. I'm a bit lost about the relationships between ROS 2 packages, but it sounds like my patch was not being applied.

@pbaughman

OK, I guess this is a build dependency of something else, so if you don't recompile the higher-level packages you don't get the new behavior. I tested again after rebuilding the whole workspace and it seems to be running successfully now.

@richiware

If you are happy with the solution, I can create a PR against rmw_fastrtps.

@richiware

@dejanpan Does the solution work for you? Can I create the PR?

dejanpan commented Mar 14, 2019

@richiprosima we confirmed that it works; I ran the test for 24h. Please create a PR. But I think we also need a test for it. Do you have an idea how to write such a test?

nuclearsandwich added a commit to ros2/ros2_documentation that referenced this issue Mar 14, 2019
Issue References:

*  ros2/rmw_fastrtps#257
*  ros2/rmw_fastrtps#258

Signed-off-by: Steven! Ragnarök <steven@nuclearsandwich.com>
nuclearsandwich added a commit to ros2/ros2_documentation that referenced this issue Mar 14, 2019
* Update the known issues section in Crystal.

Issue References:

*  ros2/rmw_fastrtps#257
*  ros2/rmw_fastrtps#258

Signed-off-by: Steven! Ragnarök <steven@nuclearsandwich.com>

* remove duplicate periods
@richiware

No idea. I don't know how to write a deterministic test. The error is erratic and depends on a sample being received after rmw_fastrtps calls the FastRTPS take function and before it updates its own internal counter.

@tfoote tfoote added the in progress Actively being worked on (Kanban column) label Mar 20, 2019
@clalancette clalancette removed the in progress Actively being worked on (Kanban column) label Mar 22, 2019