fix: Fixed race condition in action server between is_ready and take #2250

Closed

Conversation

@jmachowinski (Contributor) commented Jul 28, 2023

…data and execute

Some background information: is_ready and take_data are guaranteed to be
called in sequence, without interruption from another thread.
While execute is running, another thread may also call is_ready.

The problem was that goal_request_ready_, cancel_request_ready_, result_request_ready_,
and goal_expired_ were read and written from both is_ready and execute.

This commit fixes that by using the mentioned variables only in is_ready and take_data;
execute now accesses only the pointer it is given and works on that.

Signed-off-by: Janosch Machowinski J.Machowinski@cellumation.com

Note: this fixes #2242.
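
To make the intended split concrete, here is a minimal sketch (plain C++ with made-up names, not the actual rclcpp_action classes): the readiness flag is read and cleared only inside is_ready()/take_data(), which the executor calls back to back, while execute() works purely on the data snapshot it is handed.

```cpp
#include <memory>
#include <mutex>

struct GoalRequestData {};  // placeholder for a taken goal request

class ServerWaitableSketch
{
public:
  bool is_ready()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    // The real server would poll the rcl wait set here.
    return goal_request_ready_;
  }

  std::shared_ptr<void> take_data()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    // The readiness flag is consumed here, never in execute().
    goal_request_ready_ = false;
    return std::static_pointer_cast<void>(std::make_shared<GoalRequestData>());
  }

  void execute(const std::shared_ptr<void> & data)
  {
    // Works only on the snapshot taken above; no shared flags are touched.
    auto request = std::static_pointer_cast<GoalRequestData>(data);
    (void)request;
  }

private:
  std::mutex mutex_;
  bool goal_request_ready_{true};
};

int main()
{
  ServerWaitableSketch server;
  if (server.is_ready()) {
    auto data = server.take_data();
    server.execute(data);
  }
}
```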

@jmachowinski (Contributor Author)

I'm still a bit puzzled about the is_ready() function. When will it be called in relation to the take_data function?
From the code it looks like it should be a sequence of is_ready, take_data, is_ready, etc., but this does not seem to be the case...

@jmachowinski jmachowinski marked this pull request as draft August 1, 2023 09:28
@jmachowinski (Contributor Author) commented Aug 1, 2023

Looks like is_ready may be called multiple times before take_data is called. Therefore this patch needs a rewrite.

@jmachowinski (Contributor Author)

@clalancette can you do a review of this? The exact description of the issue this is fixing can be found here: #2242 (comment)

@jmachowinski (Contributor Author)

@ros2/team this patch has been hanging here since July; is there anything I can do to speed this up?

@clalancette (Contributor) left a comment


First of all, I'm sorry that this took a long time to get to.

Second, thank you for the excellent explanation of the issue in #2242, along with the reproducer. That helped me to understand the problem.

I've left a few questions inline; please take a look.

rclcpp_action/src/client.cpp (outdated review thread, resolved)
Comment on lines 360 to 362
ret = rcl_action_take_feedback(
pimpl_->client_handle.get(), feedback_message.get());
Contributor


I'm not really sure that this is the direction we want to go in. This change makes ClientBase::is_ready no longer idempotent, and that seems like a huge change in semantics. Further, while the documentation for the is_ready virtual method doesn't explicitly say so, it implies that this should be idempotent.

Can you explain further why you moved all of this here, rather than just adding the lock? We have to take the lock both here and in ClientBase::take_data anyway.

Contributor Author


The reason for this is that I saw cases where two threads would call is_ready and then go on to take_data, but the second thread would trigger the exception, as there would only be one data item to take.
I am still puzzled about which code path triggers this, as the multithreaded executor protects the sequence of both calls by holding a mutex.
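
For illustration, a standalone toy program (plain C++, not rclcpp code) that reproduces the interleaving described above: both threads can observe is_ready() returning true before either has taken the single item, so the second take_data() finds nothing and throws.

```cpp
#include <chrono>
#include <iostream>
#include <mutex>
#include <optional>
#include <stdexcept>
#include <thread>

std::mutex m;
std::optional<int> item = 42;   // exactly one piece of data is ready

bool is_ready()
{
  std::lock_guard<std::mutex> lock(m);
  return item.has_value();
}

int take_data()
{
  std::lock_guard<std::mutex> lock(m);
  if (!item) {
    throw std::runtime_error("Taking data but nothing is ready");
  }
  int value = *item;
  item.reset();
  return value;
}

void worker()
{
  if (is_ready()) {            // both threads can pass this check ...
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    try {
      std::cout << "took " << take_data() << "\n";
    } catch (const std::exception & e) {
      std::cout << e.what() << "\n";   // ... but only one take can succeed
    }
  }
}

int main()
{
  std::thread t1(worker), t2(worker);
  t1.join();
  t2.join();
}
```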

Member


I've been working on reviewing this PR, and I think I agree with @clalancette, at the very least in that this changes the meaning/expectation of the is_ready function.

I do think that your approach is probably on the right track (I'm also working on a test to narrow down the core issue; like you, I'm a bit puzzled by the root cause), but I may propose a patch that changes how the code is laid out.

Comment on lines 283 to 286
execute_goal_request_received(
rcl_ret_t ret, rcl_action_goal_info_t goal_info, rmw_request_id_t request_header,
std::shared_ptr<void> message);
Contributor


Do we really need these server changes to fix the crash on the client? Put another way, could we split these server fixes into another PR?

Contributor Author


In theory they are independent and we could split them; would there be any advantage to doing this?

@fmauch commented Oct 18, 2023

Hi, I've run into #2242 on my test setup as well. I have an (intentionally) not-too-capable PC running a robot driver plus MoveIt, and I have it execute random collision-free motions forever (or for as long as possible). Basically, I want to try (and demonstrate) whether a ROS 2 based application can work reliably in a 24/7 setting.

We use a low-latency kernel and have our robot driver using FIFO scheduling with a very high priority, so this could definitely be one of the reasons why we stumbled upon this.

Without the fix from this PR the driver crashed after about 1-2 hours; with this fix applied (merged into current rolling) we have not had any crashes for 4 days now. I did not take an in-depth look into this since I am not very familiar with rclcpp's code base, but I hope that a "solves a severe problem for me" voice helps bring this PR forward.

@wjwwood wjwwood self-assigned this Oct 19, 2023
@jmachowinski (Contributor Author)

@clalancette @wjwwood I'm back from vacation; let's try to get this MR sorted next ;-)

@tommy-erko

Is there any review going on?
@jmachowinski can you fix the DCO check in the meantime (the sign-off in one commit message)?

@jmachowinski (Contributor Author)

@tommy-erko fixed

@jmachowinski (Contributor Author)

Putting a queue there could work, but it would result in a weird mix of behaviors of the events.
I also fear that rmw_wait would not wake up, or would wake up delayed, if we get the triggering wrong
in cases where waitables are ready.

As far as I understand the code (@wjwwood, correct me if this assumption is wrong!), for everything
except the conditionals we have some sort of interrupt system going on, where the flag in the wait set will
be set on consecutive calls to rmw_wait until the data is actually taken. Taking the data implicitly clears the flag.

Therefore I would suggest changing the behavior of the conditionals in the following ways:

  • Make them counting
  • Add an API to decrease the count
  • Don't touch them in RMW

If we decreased the count of the associated conditional every time take_data is called, this should solve all of our problems.
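
A rough sketch of what such a counting conditional could look like; CountingGuardCondition, trigger(), is_ready(), and acknowledge() are hypothetical names for illustration, not existing rcl/rmw API.

```cpp
#include <atomic>
#include <cstdint>

class CountingGuardCondition
{
public:
  void trigger()                      // called by whoever produces work
  {
    count_.fetch_add(1, std::memory_order_acq_rel);
  }

  bool is_ready() const               // polled by the wait set; does NOT clear
  {
    return count_.load(std::memory_order_acquire) > 0;
  }

  void acknowledge()                  // new API: called once per take_data()
  {
    auto current = count_.load(std::memory_order_acquire);
    while (current > 0 &&
      !count_.compare_exchange_weak(current, current - 1, std::memory_order_acq_rel))
    {
    }
  }

private:
  std::atomic<uint64_t> count_{0};
};

int main()
{
  CountingGuardCondition gc;
  gc.trigger();
  gc.trigger();
  gc.acknowledge();                    // one take_data() happened
  return gc.is_ready() ? 0 : 1;        // still ready: one trigger remains
}
```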

@tommy-erko

@jmachowinski I just tested the latest patch version with our system and got an error on the first run. Instead of the usual "Taking data from action client...", this time it is the "Taking data from action server but nothing is ready" version. I see it happens if
next_ready_event == std::numeric_limits<uint32_t>::max() (line 654 in client.cpp). What are the circumstances that can lead to this; could there be an issue on our side? We never got this error with the original "summer version" of this PR, or with the official unpatched version.

@jmachowinski (Contributor Author)

"Taking data from action client...", this time it is the "Taking data from action server but nothing is ready"

I messed up the error message in client.cpp:654

Interesting that you still get this error though; it means there is another way to trigger it that I am not aware of yet.
Anyway, there is no error on your side; the current state of the patch is known to not fully fix the issue. You should stay on the "summer version" for now.

@jmachowinski (Contributor Author)

@tommy-erko I just did a test with the current version on top of Iron, and it works without any issues for me.

Can you try this version? https://github.com/cellumation/rclcpp/tree/action_fix_final
It's the same as this one, just split into two commits.

Note: this version of the patch will expose errors in waitables and may introduce problems with interprocess communication.

@jmachowinski (Contributor Author)

@Barry-Xu-2018 FYI I had a longer Q&A session with @wjwwood on Discord. The outcome of it is that, by design, it is the responsibility of a waitable to trigger wakeups as long as it has data for processing. Therefore the patch from #2109 seems to be the wrong way to go, as it shifts that responsibility to the executor.

@Barry-Xu-2018 (Collaborator)

@jmachowinski

Thank you for sharing the results of the discussion.

The outcome of it is that, by design, it is the responsibility of a waitable to trigger wakeups as long as it has data for processing. Therefore the patch from #2109 seems to be the wrong way to go, as it shifts that responsibility to the executor.

While fixing this issue, I also had doubts about which module, rclcpp or rmw, should be responsible for addressing it.

In RMW, the criteria for determining whether guard_conditions/events and subscriptions/clients/services have a trigger condition are currently different. For subscriptions/clients/services, having data available in the buffer is considered a trigger condition, but guard_conditions/events have no 'data'.
Refer to:
https://github.com/ros2/rmw_fastrtps/blob/5b96503556302b47b4e4af96a44f6189355edf29/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp#L56-L116

For guard_conditions, the trigger condition will be cleared before leaving __rmw_wait(). The issue is that currently the RMW layer cannot know whether the trigger has been handled by the upper layers.

https://github.com/ros2/rmw_fastrtps/blob/5b96503556302b47b4e4af96a44f6189355edf29/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp#L306-L315

If we were to fix this issue in the RMW layer, we would need to design a new mechanism that allows the upper layers to inform RMW that a guard condition has been handled and can be cleared.
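
A heavily simplified illustration of that asymmetry (Subscription and GuardCondition here are stand-in types; see the linked rmw_fastrtps sources for the real implementation): subscription readiness persists until the data is actually taken, while a guard condition's trigger is consumed inside the wait call itself.

```cpp
#include <cstddef>

struct Subscription { std::size_t queued; };   // messages waiting to be taken
struct GuardCondition { bool triggered; };     // one-shot trigger flag

bool subscription_ready(const Subscription & sub)
{
  return sub.queued > 0;        // data stays queued until it is actually taken
}

bool guard_condition_ready(GuardCondition & gc)
{
  bool was_triggered = gc.triggered;
  gc.triggered = false;         // the trigger is cleared inside the wait call
  return was_triggered;         // the upper layers may never act on it
}

int main()
{
  Subscription sub{1};
  GuardCondition gc{true};
  bool s1 = subscription_ready(sub);    // true
  bool s2 = subscription_ready(sub);    // still true: data was not taken
  bool g1 = guard_condition_ready(gc);  // true, but the trigger is now gone
  bool g2 = guard_condition_ready(gc);  // false on the next wait
  return (s1 && s2 && g1 && !g2) ? 0 : 1;
}
```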

@jmachowinski (Contributor Author)

@Barry-Xu-2018 Is there another use case besides the waitables that I am not aware of?
For the waitables you can figure out whether you were processed or not: e.g. if take_data() was not called, you were not processed, and you need to retrigger your internal guard condition on the next call to add_to_wait_set.
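
A sketch of that retrigger rule, using hypothetical member names rather than the real rclcpp::Waitable interface: if data was signalled but take_data() never ran, the waitable re-raises its guard condition the next time it is added to a wait set, so the following rmw_wait still wakes up.

```cpp
#include <memory>

class MyWaitableSketch
{
public:
  void on_data_available()
  {
    data_ready_ = true;
    guard_condition_trigger();          // wake up the current rmw_wait
  }

  std::shared_ptr<void> take_data()
  {
    data_taken_ = true;
    data_ready_ = false;
    return nullptr;                     // real code would hand back the taken data
  }

  void add_to_wait_set(/* rcl_wait_set_t * wait_set */)
  {
    if (data_ready_ && !data_taken_) {
      // We were woken up before but never processed: retrigger so the
      // next rmw_wait still reports us as ready.
      guard_condition_trigger();
    }
    data_taken_ = false;
  }

private:
  void guard_condition_trigger() { /* would call rcl_trigger_guard_condition() */ }

  bool data_ready_{false};
  bool data_taken_{false};
};

int main()
{
  MyWaitableSketch w;
  w.on_data_available();   // data arrives, the wait set is woken
  w.add_to_wait_set();     // re-added without take_data(): the sketch retriggers
}
```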

@jmachowinski (Contributor Author) commented Dec 14, 2023

@clalancette @fujitatomoya @wjwwood I have now pushed the version that I would consider the correct fix for this issue.

If you are not in favor of reverting "Avoid losing waitable handles while using MultiThreadedExecut…" before a fix for issue #2012 is in place, I would remove it from this MR and just merge the first commit.

I'm going on vacation now and might not be responsive for the rest of the year.

@Barry-Xu-2018 (Collaborator)

@jmachowinski

Is there another use case besides the waitables that I am not aware of?
For the waitables you can figure out whether you were processed or not: e.g. if take_data() was not called, you were not processed, and you need to retrigger your internal guard condition on the next call to add_to_wait_set.

Apologies for the delayed response.
From the information I know (mentioned in #2250 (comment)), there shouldn't be any more.

@jmachowinski (Contributor Author)

@wjwwood @mjcarroll @fujitatomoya Ping; I would like to see this merged for Jazzy.

@firesurfer

Is there a reason for this PR not getting merged? If there is anything a user from the community can do to help get this merged, let me know!
Especially in combination with ros2control (which uses a multithreaded executor) and the default JointTrajectoryController, which is used as a basic building block for many other components, this issue can be encountered quite regularly.
So fixing this should be quite beneficial for many applications!

@jmachowinski (Contributor Author) commented Apr 2, 2024

@firesurfer This is now in the pipeline for merge. The plan is to merge #2420 and afterwards this PR. If you want to help out, you could retry this patch as soon as #2420 is merged, to verify that it is still working and that we have no regressions.

@jmachowinski (Contributor Author)

@firesurfer this version should be ready to merge now. Can you run it on your system and verify it with the latest version of rolling?

@firesurfer

As our stack is Iron based and has never been built with rolling, this might prove a bit tricky. I would be happy to test any backport of this PR as soon as it is available (I am not sure about ABI compatibility between rolling and Iron at the moment; if it's compatible I could just recompile rclcpp from rolling).

@jmachowinski (Contributor Author)

@firesurfer our stack is also Iron based, but it compiles without problems on top of rolling. Note: if you still use gazebo11 (classic) it gets tricky.

@fmauch commented Apr 5, 2024

Uh, I'll try to set up my demonstrator with that version as well. Can't promise, though.

@mjcarroll (Member)

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@wjwwood (Member) commented Apr 9, 2024

I have no idea what happened, but I tried pushing one commit to this branch and it reset the rolling branch on jmachowinski:rolling to be the same as ros2:rolling, and now it's not letting me push, and it closed this pull request 🙃. So I'm opening a new one: #2495

Successfully merging this pull request may close these issues:
ActionClient Crash: Taking data from action client but nothing is ready