fix: 🐛 make force_torque_sensor_broadcaster wait for realtime_publisher #327

jaron-l · 2022-04-08T20:19:31Z

The referenced issue shows that some of the tests in test_force_torque_sensor_broadcaster are flaky. This is because the realtime_publisher thread doesn't have time to start before the test tries to publish a message. Any subscription test will fail as the message will not get published if the publishing thread hasn't started.

This is one example solution, but since it modifies the life-cycle behavior of force_torque_sensor_broadcaster, I'm not exactly aware of all of the consequences. I imagine that most nodes will wait for the broadcaster to move to the inactive state anyways so this shouldn't have a negative effect but I am open to feedback.

refs: #263

…er loop to be ready before finishing configure refs: ros-controls#263

destogl · 2022-04-08T21:16:20Z

Thank you for finding this! I have a few questions about possible solutions.

Would it make a difference if we modify the tests and add sleep there instead in the broadcaster's code?
Can we also expect issues in the real world with this? Or do we simply lose few messages at the beginning, if controller is activated directly after configuring?

jaron-l · 2022-04-08T22:30:32Z

Thank you for finding this! I have a few questions about possible solutions.

Would it make a difference if we modify the tests and add sleep there instead in the broadcaster's code?

Can we also expect issues in the real world with this? Or do we simply lose few messages at the beginning, if controller is activated directly after configuring?

We could just add a sleep to the test and that should work. I personally prefer synchronizing tests so there is no way they can fail but I couldn't find a way to do that from the test directly as the publisher is a protected member. If you think a sleep would be better, I'd be be happy to switch it. If there are only two threads in the test then it would seem very unlikely to fail as cpu time should go to the publisher thread.

You are correct. The real world affect is that messages would get dropped near the beginning of initialization.

destogl · 2022-04-11T12:25:55Z

We could just add a sleep to the test and that should work. I personally prefer synchronizing tests so there is no way they can fail but I couldn't find a way to do that from the test directly as the publisher is a protected member.

if you need to access protected members from tests, then you can use friending concept. (Check changed in #154 – or more specifically these lines to see how this is used.)

If you think a sleep would be better, I'd be be happy to switch it. If there are only two threads in the test then it would seem very unlikely to fail as cpu time should go to the publisher thread.

Direct sync, as you proposed, is a much better approach.

bmagyar · 2022-04-12T06:57:43Z

I wouldn't classify this as behaviour modification, it simply ensures better initialization in on_configure, it does not affect update() directly other than leaving us w an immediately usable publisher :D

perhaps a throttled warning in the lock-loop would be useful to detect cases of deadlocks?

destogl · 2022-04-12T14:38:25Z

I would still prefer direct sync in the tests for this. Just to make them more (100%?) stable.

destogl · 2022-04-12T16:42:23Z

I propose the following resolution of this issue:

Add here proposed approach to all controller that are using RealtimePublisher (for each instance of real-time publisher) + add ROS_WARN_THROTTLE_SKIP_FIRST writing out that we are waiting for lock to get available.

ros2_controllers/force_torque_sensor_broadcaster/src/force_torque_sensor_broadcaster.cpp

Lines 113 to 117 in d70e4ca

    
           while (!realtime_publisher_->trylock()) 
        
           { 
        
             std::this_thread::sleep_for(std::chrono::milliseconds(10)); 
        
           } 
        
           realtime_publisher_->unlock();

Investigate adding fix for this into RealtimePublisher itself, e.g., in the constructor if possible
Remove this temp fix from all controllers and make sure they are working properly with stable tests.

One more question before proceeding as suggested: Do we then actually acquire lock in the line 127 to set default values? @jaron-l can you please check this if you didn't yet?

jaron-l · 2022-04-12T21:38:40Z

One more question before proceeding as suggested: Do we then actually acquire lock in the line 127 to set default values? @jaron-l can you please check this if you didn't yet?

I'm not sure I know what you mean for setting default values but yes, the lock is acquired there and used for setting the message frame id.

I briefly investigated moving the wait to the real time publisher constructor and that seems like the best route; however, I hit a problem that I'm still investigating. I'll report back tomorrow.

jaron-l · 2022-04-13T13:14:21Z

Okay, I think I understand better what is going on. Hopefully this makes sense.

The root of the problem is that force_torque_sensor_broadcaster::update makes no guarantees that a message will get published, nor does it provide feedback when a message was successfully published. Having the wait right after the RealtimePublisher seems to make it far more likely; however, I think even my current fix doesn't eliminate the problem. To show this, I originally tested running the current version of the code 10,000 times without test failure, but when moving the waits to different locations, I found that the success rate sometimes drops, and indeed, even the original location of the wait produces failures very occasionally. It's hard to estimate the rate because it's low and very dependent on the hardware the test is run on. It likely has a higher failure rate on GitHub Actions VMs due to their internal load balancing, hence the frustration in tests failure in CI but not when people run it locally.

So if the publisher thread is running, why are messages getting dropped? It's because the the publisher periodically obtains a lock. If the publisher thread holds the lock, or turn_ is not set to REALTIME as this PR originally was trying to fix, then force_torque_sensor_broadcaster will not publish the message (code where the two if statements are evaluated to show this). So while the wait fixes one of those if statement checks, it doesn't guarantee that the message will get published.

Proposed solutions:

We could change test_force_torque_sensor_broadcaster.cpp::subscribe_and_get_message to loop between force_torque_sensor_broadcaster::update and test_force_torque_sensor_broadcaster::wait_for until they pass with a max loop count. That way the test doesn't proceed until the message is published and received successfully.
We could implement some sort of feedback into force_torque_sensor_broadcaster to respond if a message was published. This would likely change the return type of force_torque_sensor_broadcaster::update which would be a breaking change.
Any other idea someone else has.

As a side note, we could still implement the wait in the RealtimePublisher constructor. This did help the tests to pass far more likely (at least an order of magnitude on my hardware but it's probably relative) and doesn't conflict with any other solution.

destogl · 2022-04-14T12:14:20Z

Few more possibilities:

using concurrent queues on RealtimePublisher
check what realtime-group is doing: https://github.com/ros-realtime/reference-system

…caster - create a loop that checks if force_torque_sensor_broadcaster published before checking subscribed message refs: ros-controls#263

refs: ros-controls#263

jaron-l · 2022-04-14T20:03:48Z

My most recent commits show how solution 1 would work. I tested it and I couldn't get it to fail.

destogl

This looks great! Can you "roll-out" this solution to the IMUController and JTC?

…oint_state_broadcaster refs: ros-controls#263

jaron-l · 2022-04-21T13:33:49Z

This looks great! Can you "roll-out" this solution to the IMUController and JTC?

I ported changed to IMUController and joint_trajectory_broadcaster but held off on JTC per our discussion about the tests already failing.

destogl · 2022-04-21T20:01:05Z

@jaron-l sorry, I introduced wait_set again.

refs: ros-controls#263

jaron-l · 2022-04-22T00:10:43Z

@jaron-l sorry, I introduced wait_set again.

I went ahead and removed them. Should be good to go now.

…er (ros-controls#327) * fix: 🐛 make force_torque_sensor_broadcaster wait for realtime_publisher loop to be ready before finishing configure * test: ✅ check for published message in test_force_torque_sensor_broadcaster - create a loop that checks if force_torque_sensor_broadcaster published before checking subscribed message * refactor: ♻️ remove unnecessary includes * test: ✅ port realtime_publisher check to imu_sensor_broadcaster and joint_state_broadcaster * refactor: ♻️ remove unnecessary wait_for code from tests Co-authored-by: Denis Štogl <denis@stogl.de>

fix: 🐛 make force_torque_sensor_broadcaster wait for realtime_publish…

d70e4ca

…er loop to be ready before finishing configure refs: ros-controls#263

github-actions bot requested review from a10263790, anfemosa, aprotyas, bmagyar, destogl, rosterloh and TomoyaFujita2016 April 8, 2022 20:19

jaron-l added 2 commits April 14, 2022 19:58

test: ✅ check for published message in test_force_torque_sensor_broad…

5a2e885

…caster - create a loop that checks if force_torque_sensor_broadcaster published before checking subscribed message refs: ros-controls#263

refactor: ♻️ remove unnecessary includes

c38e409

refs: ros-controls#263

destogl requested changes Apr 19, 2022

View reviewed changes

test: ✅ port realtime_publisher check to imu_sensor_broadcaster and j…

398fa6c

…oint_state_broadcaster refs: ros-controls#263

Merge branch 'master' into issue263

d781672

destogl approved these changes Apr 21, 2022

View reviewed changes

refactor: ♻️ remove unnecessary wait_for code from tests

78ed332

refs: ros-controls#263

bmagyar approved these changes May 3, 2022

View reviewed changes

bmagyar merged commit 8bbccf2 into ros-controls:master May 3, 2022

jaron-l mentioned this pull request May 20, 2022

force_torque_sensor_broadcaster tests fail on Galactic Source Job in GitHub Workflows #263

Closed

jaron-l deleted the issue263 branch August 15, 2022 19:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: 🐛 make force_torque_sensor_broadcaster wait for realtime_publisher #327

fix: 🐛 make force_torque_sensor_broadcaster wait for realtime_publisher #327

jaron-l commented Apr 8, 2022

destogl commented Apr 8, 2022

jaron-l commented Apr 8, 2022

destogl commented Apr 11, 2022

bmagyar commented Apr 12, 2022

destogl commented Apr 12, 2022

destogl commented Apr 12, 2022

jaron-l commented Apr 12, 2022

jaron-l commented Apr 13, 2022

destogl commented Apr 14, 2022

jaron-l commented Apr 14, 2022

destogl left a comment •

edited

Loading

jaron-l commented Apr 21, 2022

destogl commented Apr 21, 2022

jaron-l commented Apr 22, 2022

fix: 🐛 make force_torque_sensor_broadcaster wait for realtime_publisher #327

fix: 🐛 make force_torque_sensor_broadcaster wait for realtime_publisher #327

Conversation

jaron-l commented Apr 8, 2022

destogl commented Apr 8, 2022

jaron-l commented Apr 8, 2022

destogl commented Apr 11, 2022

bmagyar commented Apr 12, 2022

destogl commented Apr 12, 2022

destogl commented Apr 12, 2022

jaron-l commented Apr 12, 2022

jaron-l commented Apr 13, 2022

destogl commented Apr 14, 2022

jaron-l commented Apr 14, 2022

destogl left a comment • edited Loading

Choose a reason for hiding this comment

jaron-l commented Apr 21, 2022

destogl commented Apr 21, 2022

jaron-l commented Apr 22, 2022

destogl left a comment •

edited

Loading