Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: 🐛 make force_torque_sensor_broadcaster wait for realtime_publisher #327

Merged
merged 6 commits into from
May 3, 2022

Conversation

jaron-l
Copy link
Member

@jaron-l jaron-l commented Apr 8, 2022

The referenced issue shows that some of the tests in test_force_torque_sensor_broadcaster are flaky. This is because the realtime_publisher thread doesn't have time to start before the test tries to publish a message. Any subscription test will fail as the message will not get published if the publishing thread hasn't started.

This is one example solution, but since it modifies the life-cycle behavior of force_torque_sensor_broadcaster, I'm not exactly aware of all of the consequences. I imagine that most nodes will wait for the broadcaster to move to the inactive state anyways so this shouldn't have a negative effect but I am open to feedback.

refs: #263

@destogl
Copy link
Member

destogl commented Apr 8, 2022

Thank you for finding this! I have a few questions about possible solutions.

  • Would it make a difference if we modify the tests and add sleep there instead in the broadcaster's code?
  • Can we also expect issues in the real world with this? Or do we simply lose few messages at the beginning, if controller is activated directly after configuring?

@jaron-l
Copy link
Member Author

jaron-l commented Apr 8, 2022

Thank you for finding this! I have a few questions about possible solutions.

  • Would it make a difference if we modify the tests and add sleep there instead in the broadcaster's code?
  • Can we also expect issues in the real world with this? Or do we simply lose few messages at the beginning, if controller is activated directly after configuring?

We could just add a sleep to the test and that should work. I personally prefer synchronizing tests so there is no way they can fail but I couldn't find a way to do that from the test directly as the publisher is a protected member. If you think a sleep would be better, I'd be be happy to switch it. If there are only two threads in the test then it would seem very unlikely to fail as cpu time should go to the publisher thread.

You are correct. The real world affect is that messages would get dropped near the beginning of initialization.

@destogl
Copy link
Member

destogl commented Apr 11, 2022

We could just add a sleep to the test and that should work. I personally prefer synchronizing tests so there is no way they can fail but I couldn't find a way to do that from the test directly as the publisher is a protected member.

if you need to access protected members from tests, then you can use friending concept. (Check changed in #154 – or more specifically these lines to see how this is used.)

If you think a sleep would be better, I'd be be happy to switch it. If there are only two threads in the test then it would seem very unlikely to fail as cpu time should go to the publisher thread.

Direct sync, as you proposed, is a much better approach.

@bmagyar
Copy link
Member

bmagyar commented Apr 12, 2022

I wouldn't classify this as behaviour modification, it simply ensures better initialization in on_configure, it does not affect update() directly other than leaving us w an immediately usable publisher :D

perhaps a throttled warning in the lock-loop would be useful to detect cases of deadlocks?

@destogl
Copy link
Member

destogl commented Apr 12, 2022

I would still prefer direct sync in the tests for this. Just to make them more (100%?) stable.

@destogl
Copy link
Member

destogl commented Apr 12, 2022

I propose the following resolution of this issue:

  1. Add here proposed approach to all controller that are using RealtimePublisher (for each instance of real-time publisher) + add ROS_WARN_THROTTLE_SKIP_FIRST writing out that we are waiting for lock to get available.
    while (!realtime_publisher_->trylock())
    {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    realtime_publisher_->unlock();
  2. Investigate adding fix for this into RealtimePublisher itself, e.g., in the constructor if possible
  3. Remove this temp fix from all controllers and make sure they are working properly with stable tests.

One more question before proceeding as suggested: Do we then actually acquire lock in the line 127 to set default values? @jaron-l can you please check this if you didn't yet?

@jaron-l
Copy link
Member Author

jaron-l commented Apr 12, 2022

One more question before proceeding as suggested: Do we then actually acquire lock in the line 127 to set default values? @jaron-l can you please check this if you didn't yet?

I'm not sure I know what you mean for setting default values but yes, the lock is acquired there and used for setting the message frame id.

I briefly investigated moving the wait to the real time publisher constructor and that seems like the best route; however, I hit a problem that I'm still investigating. I'll report back tomorrow.

@jaron-l
Copy link
Member Author

jaron-l commented Apr 13, 2022

Okay, I think I understand better what is going on. Hopefully this makes sense.

The root of the problem is that force_torque_sensor_broadcaster::update makes no guarantees that a message will get published, nor does it provide feedback when a message was successfully published. Having the wait right after the RealtimePublisher seems to make it far more likely; however, I think even my current fix doesn't eliminate the problem. To show this, I originally tested running the current version of the code 10,000 times without test failure, but when moving the waits to different locations, I found that the success rate sometimes drops, and indeed, even the original location of the wait produces failures very occasionally. It's hard to estimate the rate because it's low and very dependent on the hardware the test is run on. It likely has a higher failure rate on GitHub Actions VMs due to their internal load balancing, hence the frustration in tests failure in CI but not when people run it locally.

So if the publisher thread is running, why are messages getting dropped? It's because the the publisher periodically obtains a lock. If the publisher thread holds the lock, or turn_ is not set to REALTIME as this PR originally was trying to fix, then force_torque_sensor_broadcaster will not publish the message (code where the two if statements are evaluated to show this). So while the wait fixes one of those if statement checks, it doesn't guarantee that the message will get published.

Proposed solutions:

  1. We could change test_force_torque_sensor_broadcaster.cpp::subscribe_and_get_message to loop between force_torque_sensor_broadcaster::update and test_force_torque_sensor_broadcaster::wait_for until they pass with a max loop count. That way the test doesn't proceed until the message is published and received successfully.
  2. We could implement some sort of feedback into force_torque_sensor_broadcaster to respond if a message was published. This would likely change the return type of force_torque_sensor_broadcaster::update which would be a breaking change.
  3. Any other idea someone else has.

As a side note, we could still implement the wait in the RealtimePublisher constructor. This did help the tests to pass far more likely (at least an order of magnitude on my hardware but it's probably relative) and doesn't conflict with any other solution.

@destogl
Copy link
Member

destogl commented Apr 14, 2022

Few more possibilities:

…caster

- create a loop that checks if force_torque_sensor_broadcaster published before checking subscribed message

refs: ros-controls#263
@jaron-l
Copy link
Member Author

jaron-l commented Apr 14, 2022

My most recent commits show how solution 1 would work. I tested it and I couldn't get it to fail.

Copy link
Member

@destogl destogl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Can you "roll-out" this solution to the IMUController and JTC?

@jaron-l
Copy link
Member Author

jaron-l commented Apr 21, 2022

This looks great! Can you "roll-out" this solution to the IMUController and JTC?

I ported changed to IMUController and joint_trajectory_broadcaster but held off on JTC per our discussion about the tests already failing.

@destogl
Copy link
Member

destogl commented Apr 21, 2022

@jaron-l sorry, I introduced wait_set again.

@jaron-l
Copy link
Member Author

jaron-l commented Apr 22, 2022

@jaron-l sorry, I introduced wait_set again.

I went ahead and removed them. Should be good to go now.

@bmagyar bmagyar merged commit 8bbccf2 into ros-controls:master May 3, 2022
@jaron-l jaron-l deleted the issue263 branch August 15, 2022 19:56
mamueluth pushed a commit to StoglRobotics-forks/ros2_controllers that referenced this pull request Aug 26, 2022
…er (ros-controls#327)

* fix: 🐛 make force_torque_sensor_broadcaster wait for realtime_publisher loop to be ready before finishing configure
* test: ✅ check for published message in test_force_torque_sensor_broadcaster
- create a loop that checks if force_torque_sensor_broadcaster published before checking subscribed message
* refactor: ♻️ remove unnecessary includes
* test: ✅ port realtime_publisher check to imu_sensor_broadcaster and joint_state_broadcaster
* refactor: ♻️ remove unnecessary wait_for code from tests
Co-authored-by: Denis Štogl <denis@stogl.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants