-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove suspicious wait set locking code #119
Open
rotu
wants to merge
3
commits into
ros2:rolling
Choose a base branch
from
RoverRobotics-forks:patch-10
base: rolling
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 2 commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"try lock" is generally a tricky game: all a failure to lock proves is that the lock was held at the time of trying. By the time you start executing the next instruction it may already have been released. And I would argue that this gives you essentially the same behaviour as the original code.
In the original code there is the issue that the schedule:
T0: rmw_wait / set busy / do attach / wait
T1: delete X / clean_waitset_caches / skip waitset_detach
T0: wakeup / erase slots for non-triggering entities / clear busy
ends up with the (rmw) waitset entry cache containing a dangling pointer a deleted rmw entity [*]. What will happen most of the time is that a subsequent call to rmw_wait will notice a change in the set of pointers (because it does a memcmp, the fact that it is dangling is not really an issue, and by accident more than design, the events will squeeze through without dereferencing a dangling pointer, too).
The interesting bit is when we then continue as follows:
because the cache hasn't been invalidated and X is at the same address as X' (and presumably may moreover be in the same position in the arguments to rmw_wait) the cache checking logic can end up concluding that the cache is correct for the set being waited for, skipping the attach calls, and therefore not waiting for X' to trigger.
Your change to unconditionally call waitset_detach, blocking until a concurrent call to rmw_wait woke up, would fix that problem. But with this change, you're back to the original behaviour ... and thus get back the original problem.
But there may be a saving grace: it may (as my old comment noted) well be that ROS2 will never delete an entity while another thread is using it in rmw_wait. In that case, all this is a moot point. I think it is likely that this is the case, because at the RMW layer they're all simply pointers, not, e.g., weak references. (Cyclone doesn't make such assumptions in its API, but it exposes 31-bit random numbers as handles on its API instead of pointers. It doesn't guarantee anything about entity handle reuse, but it does use a decent PRNG and with the typical application having a few thousands handles at most, the likelihood of aliasing is pretty small.)
The other thing is that (as I mentioned before) the entire caching code exists because of a microbenchmark I once did. Perhaps it isn't worth it at all. If one simply rips it out, the problem disappears along with it.
(What would also work, and what I would consider doing if the caching is worth keeping, is to retain the busy flag, and depending on whether it is clear or set, either invalidate the cache or mark it as invalid/cleanup-deferred. And then, in rmw_wait, always do the "require reattach" path if it was invalid/cleanup-deferred.)
[*] It has been detached from the DDS waitset, that bit is not an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is not a mistake with how I'm locking here, and the code is correct as-is.
unique_lock and lock_guard are both raii objects. That is they continue to hold the lock until the object is destroyed. In the case of
clean_waitset_caches
, that's one iteration of the loop (no worry that the lock might be snatched from us duringwaitset_detach
) In the case ofrmw_wait
, that's until the function returns. Again, no worry that someone else might acquire the lock.I do think I can make this code much better by redoing how we cache, but that's a bigger PR.