New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Collision Monitor] Add a watchdog mechanism #3880
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so on reviewing this, something I don't quite understand is why introduce block_if_invalid
? If its too old / not updating, that's a pretty critical issue. What's the use-case where we wouldn't have this set to true?
If there isn't one and just trying to be general purpose, I think we can remove that parameter and just outright return false
when in an error state. Then we keep the getData change to return bool for logging the warning / stopping the system.
I think also the collision detector node should use that getData
bool to log the warning too. I'm not sure - should we somehow adjust the publication to note that to an external listener @tonynajjar?
Some tests are failing, usual bits on documentation updates! Great addition! Can't believe myself and Alexey missed that
To not change the current behavior. Maybe there are people using the CM on sensors with flaky rate. Or maybe with a dynamic array of sensors that can be enabled/disabled depending on the robot mission. But I agree that blocking on invalid should ideally be the default behavior if that's acceptable to change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for pointing-out this situation. Yes, I'd agree that we better to stop robot in case of broken one of the source: this is serious situation (unless developer intentionally tells that this is unreliable source, by setting block_if_invalid
or ignore_invalid
option).
Please check my comments below:
nav2_collision_monitor/include/nav2_collision_monitor/collision_detector_node.hpp
Outdated
Show resolved
Hide resolved
if (!source->getData(curr_time, collision_points)) { | ||
RCLCPP_WARN(get_logger(), "Invalid blocking source detected, stopping the robot"); | ||
Velocity stop_vel; | ||
stop_vel.tw = 0.0; | ||
stop_vel.x = 0.0; | ||
stop_vel.y = 0.0; | ||
Action robot_action{STOP, stop_vel, ""}; | ||
publishVelocity(robot_action); | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return will break the process()
logic: in the tail of this routine there are some things which we can't miss.
If we just return from this, the notify action state logic won't be started, and CM won't do the change notification. So, it is better to update CollisionMonitor::notifyActionState()
with ability to pass action_polygon
as nullptr
and report abnormal situation for that cases.
Additionally, we need to save action with 0-velocity to robot_action_prev_
to have notifications to work properly; and still publish polygons, since they might be in different frames moving while robot is being stopped.
I think, here is better to set main robot_action
local variable and go to process()
routine's tail, calling notifyActionState()
with null action_polygon
that will tell this routine to report abnormal behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See last commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to think, how to report the robot needs to stop, when the sources were outdated for Collision Detector. In the initial message, there is a detected polygons string array, but there is not ability what to do. So, I am thinking about publishing CollisionDetectorState.msg
with empty polygons[]
array, that will give the developer an information that something (not polygon) is causing robot to report the "collision". Or add one more bool variable into this message, as an alternative for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See last commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, the polygons[]
is empty when robot detected to "collide" (stop) due to incorrect source. However, in normal operation, it will be also empty. Sorry, as I've missed this initially. Thus, we still seem to need to add invalid source feedback to CollisionDetectorState.msg
(and CollisionMonitorState.msg
).
I am not sure, that negative timeout values won't cause a confusion. As from me, it is better to use |
I think that's acceptable to change. If they have flaky sensors, they can set the timeout to something outrageously high (or below). Disable/enable can be done through this API with Tony's recent work, no?
Sure, or |
Logging is an obvious improvement yes. I'm also not too sure about adjusting the publication; it wouldn't hurt but I can't think of satisfying semantics. Another alternative to modifying the msg is to publish this information as diagnostics; seems like a suitable place for old data from a source which is similar to a failing heartbeat. |
Actually going for |
Okay, I just pushed some changes:
If that looks fine to you, I will move forward with the next steps:
|
If we consider the 0.0 equality check dirty, I will change |
Regarding #3880 (comment) and #3880 (comment): if the code was changed to message the failed source, I think it would be fair to add the failed source name in |
I added the source_names feedback in the last commit as an experiment so we can discuss (but there is probably an overhead of doing this).
For the collision detector, new field
|
This pull request is in conflict. Could you fix it @doisyg? |
nav2_collision_monitor/include/nav2_collision_monitor/collision_monitor_node.hpp
Outdated
Show resolved
Hide resolved
Removed source feedback and rebased on latest. |
If that looks good to you, I quickly add a test and update documentation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Superficial stuff, but then I'm happy with it. @AlexeyMerzlyakov any other blockers?
// Fill collision_points array from different data sources | ||
for (std::shared_ptr<Source> source : sources_) { | ||
if (source->getEnabled()) { | ||
source->getData(curr_time, collision_points); | ||
if (!source->getData(curr_time, collision_points)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If invalid source was detected, please don't forget to report back to the state_msg
the specific source is invalid (in order to publish this message later).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a much more complicated approach with proper source feedback, but per @SteveMacenski, I agree it is best to leave it to another PR.
Effort archived here: https://github.com/doisyg/navigation2/tree/with_source_feedback
source->getData(curr_time, collision_points); | ||
if (!source->getData(curr_time, collision_points)) { | ||
action_polygon = nullptr; | ||
robot_action.polygon_name = "invalid source"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change this to robot_action.polygon_name = "invalid source: " + source->getName()
. If we have to stop due to different sources, robot_action.polygon_name
should be different to re-trigger the reporting later in notifyActionState()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
For now, without proper source tracing/feedback, that would not make sense as we are stopping at the detection of the first invalid source, so if we have multiple invalid, it will just be random which one we are reporting.
Alternative (complex) is here: https://github.com/doisyg/navigation2/tree/with_source_feedback
Better for another PR
if (robot_action.polygon_name == "invalid source") { | ||
RCLCPP_WARN( | ||
get_logger(), | ||
"Robot to stop due to invalid source"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please report - which source is causing robot to stop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above. See https://github.com/doisyg/navigation2/tree/with_source_feedback but better for another PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, the polygons[]
is empty when robot detected to "collide" (stop) due to incorrect source. However, in normal operation, it will be also empty. Sorry, as I've missed this initially. Thus, we still seem to need to add invalid source feedback to CollisionDetectorState.msg
(and CollisionMonitorState.msg
).
if (!sourceValid(data_->header.stamp, curr_time)) { | ||
return; | ||
// Don't block if source_timeout == 0.0 | ||
if (source_timeout_.seconds() == 0.0) { | ||
return true; | ||
} | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we have to face the same as reported in #3880 (comment) issue.
Let's check the case when source_timout_
is equal to 0.0
: this is the case, when we should ignore outdated source. However, how could we determine that the source is outdated?
In other words, during the execution sourceValid()
routine contains the following comparison inside:
const rclcpp::Duration dt = curr_time - source_time;
if (dt > source_timeout_) {
// Fail-case
}
which always will return false
. Thus, if we set 0-value of source_timeout
for specific source, this source will be always ignored, independently on which timeout we want to bring on, after the ignoring became to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gosh. I get it know. Nice catch. This is problematic, a refactor is needed. I don't see a way to solve it without re-introducing the additional bool parameter block_if_invalid
(or better name). Thoughts @SteveMacenski ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@doisyg why not just move this check inside of sourceValid
?
if (source_timeout_.seconds() != 0.0 && dt > source_timeout_) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That will push the problem down to the next part of getData
and we will have a tf error at some point :
https://github.com/doisyg/navigation2/blob/e54293f9b327174de619519317b6bb5bc0d2e282/nav2_collision_monitor/src/pointcloud.cpp#L89-L111
Though if we also don't block on tf error and just skip the data on source_timeout == 0.0
, we could add a check higher here (which will require a new source::getSourceTimeout method) and we should be fine: https://github.com/doisyg/navigation2/blob/e54293f9b327174de619519317b6bb5bc0d2e282/nav2_collision_monitor/src/collision_monitor_node.cpp#L389)
// Fill collision_points array from different data sources
for (std::shared_ptr<Source> source : sources_) {
if (source->getEnabled()) {
if (!source->getData(curr_time, collision_points) && source->getSourceTimeout() != 0.0) {
action_polygon = nullptr;
robot_action.polygon_name = "invalid source";
robot_action.action_type = STOP;
robot_action.req_vel.x = 0.0;
robot_action.req_vel.y = 0.0;
robot_action.req_vel.tw = 0.0;
break;
}
}
}
Pushing this now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we will have a tf error at some point
So thinking aloud:
- If we check if source is valid
sourceValid
with the check on timeout, then we either returnfalse
due to stale data with a timeout or... - We continue on allowing stale data without a timeout. Then:
- We try to TF transform it. As long as that data isn't insanely stale, that should be fine as long as its still in the TF buffer
- If its not in the TF buffer, then the
return false
is warranted since its literally impossible to use
That seems like a fine workflow. If we set no timeout, then it should work fine for as long as the data is in any way possibly actionable.
So then going to the node
code block you added, that would make sense to not STOP
in that case. We don't want to consider the data, but we also don't want to stop either. That seems sensible.
@AlexeyMerzlyakov please re-review and close your previous comments if they are outdated / fixed. Its hard to review this when its unclear how many of your items are still actionable / blocking or not. |
CI failed with
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, I approve pending @AlexeyMerzlyakov's rereview
Hopefully fixed CI. |
Looks good to me. Any other open topics from @AlexeyMerzlyakov, @doisyg? I know he's not fully tasked on ROS anymore, so if you think you resolved all his concerns, we can merge this to save him the time. If he has issues later, we can address them in the follow up PR doing the logging stuff |
All good to me, I am happy to support and iterate fast if this PR causes any unforeseen issue |
* Add block_if_invalid and allow sensor specific source_timeout * remove block_if_invalid param and make it default behavior * allow unblocking if data not yet published * log source name when invalid * getData return true if invalid AND source_timeout == 0.0 * fixed logic without source feedback * fix test * rebase artefact * format artefact * better log * move per sensor param source_timeout logic to source.cpp * fix ignore invalid source behavior * add source_timeout tests * no needed anymore * fix testSourceTimeoutOverride test --------- Co-authored-by: Guillaume Doisy <guillaume@dexory.com> Signed-off-by: gg <josho.wallace@gmail.com>
* Add block_if_invalid and allow sensor specific source_timeout * remove block_if_invalid param and make it default behavior * allow unblocking if data not yet published * log source name when invalid * getData return true if invalid AND source_timeout == 0.0 * fixed logic without source feedback * fix test * rebase artefact * format artefact * better log * move per sensor param source_timeout logic to source.cpp * fix ignore invalid source behavior * add source_timeout tests * no needed anymore * fix testSourceTimeoutOverride test --------- Co-authored-by: Guillaume Doisy <guillaume@dexory.com> (cherry picked from commit 0f72da2)
* Add block_if_invalid and allow sensor specific source_timeout * remove block_if_invalid param and make it default behavior * allow unblocking if data not yet published * log source name when invalid * getData return true if invalid AND source_timeout == 0.0 * fixed logic without source feedback * fix test * rebase artefact * format artefact * better log * move per sensor param source_timeout logic to source.cpp * fix ignore invalid source behavior * add source_timeout tests * no needed anymore * fix testSourceTimeoutOverride test --------- Co-authored-by: Guillaume Doisy <guillaume@dexory.com> (cherry picked from commit 0f72da2)
Basic Info
Description of contribution in a few bullet points
Submitting as a draft as there are a couple of points to discuss.
This PR adds a per source blocking watchdog mechanism, i.e. stop the robot if a source is not publishing at the expected rate (building on the existing
source_timeout
mechanism).block_if_invalid
(other name suggestions are welcome) either per source or globally. If set totrue
, a source triggering itssource_timeout
will make the robot stop.source_timeout
per source. I.e. having different value for each sensors. In order not break the current behavior, if not set, each source specificsource_timeout
are set by default to the globalsource_timeout
.getData
method to returnfalse
if a source is considered invalid ANDblock_if_invalid: true
. @AlexeyMerzlyakov, can you maybe check/have a look if the invalid conditions make sense ?block_if_invalid
?Description of documentation updates required from your changes
TBD once points above are resolved
Future work that may be required in bullet points
For Maintainers: