[Collision Monitor] Add a watchdog mechanism #3880

Merged
merged 15 commits into from Oct 31, 2023

doisyg
Collaborator

@doisyg doisyg commented Oct 13, 2023


Basic Info

Info | Please fill out this column
Ticket(s) this addresses |
Primary OS tested on | Ubuntu
Robotic platform tested on | Dexory's ARRI

Description of contribution in a few bullet points

Submitting as a draft as there are a couple of points to discuss.

This PR adds a per-source blocking watchdog mechanism, i.e. it stops the robot if a source is not publishing at the expected rate (building on the existing source_timeout mechanism).

  • It adds the parameter block_if_invalid (other name suggestions are welcome), either per source or globally. If set to true, a source triggering its source_timeout will make the robot stop.
  • To make it work with a set of sources/sensors of different frequencies, I made it possible to set the source_timeout parameter per source, i.e. to have a different value for each sensor. In order not to break the current behavior, each source-specific source_timeout defaults to the global source_timeout if not set.
  • I modified the getData method to return false if a source is considered invalid AND block_if_invalid: true. @AlexeyMerzlyakov, can you check whether the invalid conditions make sense?
  • I symmetrically modified the collision detector with the new parameters but did not change its behavior. Any thoughts on what we should do there regarding block_if_invalid?
  • Tests?
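
For readers following along, the per-source fallback described above can be sketched roughly as follows. This is a minimal, ROS-free illustration and not the PR's code: `ParamMap`, `resolveSourceTimeout`, the `<source_name>.source_timeout` key layout, and the 2.0 s built-in default are all assumptions made for the example.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the node's parameter storage:
// parameter name -> timeout value in seconds.
using ParamMap = std::map<std::string, double>;

// Resolve the timeout for one source. A per-source override
// ("<source_name>.source_timeout") wins; otherwise the global
// "source_timeout" is used; otherwise an assumed built-in default.
inline double resolveSourceTimeout(
  const ParamMap & params,
  const std::string & source_name,
  double builtin_default = 2.0)  // assumed default, for illustration only
{
  double global_timeout = builtin_default;
  const auto global_it = params.find("source_timeout");
  if (global_it != params.end()) {
    global_timeout = global_it->second;
  }

  const auto source_it = params.find(source_name + ".source_timeout");
  if (source_it != params.end()) {
    return source_it->second;
  }
  return global_timeout;
}
```

With this shape, a fast lidar can carry a tight timeout while a slower sensor keeps the global value, and configurations that never set a per-source value behave as before.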

Description of documentation updates required from your changes

TBD once points above are resolved


Future work that may be required in bullet points

For Maintainers:

  • Check that any new parameters added are updated in navigation.ros.org
  • Check that any significant change is added to the migration guide
  • Check that any new features OR changes to existing behaviors are reflected in the tuning guide
  • Check that any new functions have Doxygen added
  • Check that any new features have test coverage
  • Check that any new plugins is added to the plugins page
  • If BT Node, Additionally: add to BT's XML index of nodes for groot, BT package's readme table, and BT library lists

Member

@SteveMacenski SteveMacenski left a comment

Ok, so on reviewing this, something I don't quite understand is why introduce block_if_invalid? If its too old / not updating, that's a pretty critical issue. What's the use-case where we wouldn't have this set to true?

If there isn't one and just trying to be general purpose, I think we can remove that parameter and just outright return false when in an error state. Then we keep the getData change to return bool for logging the warning / stopping the system.

I think also the collision detector node should use that getData bool to log the warning too. I'm not sure - should we somehow adjust the publication to note that to an external listener @tonynajjar?

Some tests are failing, usual bits on documentation updates! Great addition! Can't believe myself and Alexey missed that

@doisyg
Collaborator Author

doisyg commented Oct 16, 2023

Ok, so on reviewing this, something I don't quite understand is why introduce block_if_invalid? If its too old / not updating, that's a pretty critical issue. What's the use-case where we wouldn't have this set to true?

To not change the current behavior. Maybe there are people using the CM on sensors with flaky rate. Or maybe with a dynamic array of sensors that can be enabled/disabled depending on the robot mission. But I agree that blocking on invalid should ideally be the default behavior if that's acceptable to change.
Hmm, what about getting rid of block_if_invalid but allowing an invalid source to be ignored if source_timeout is set below 0.0?

Collaborator

@AlexeyMerzlyakov AlexeyMerzlyakov left a comment

Thank you for pointing out this situation. Yes, I agree that we had better stop the robot when one of the sources is broken: this is a serious situation (unless the developer intentionally declares the source unreliable by setting a block_if_invalid or ignore_invalid option).

Please check my comments below:

Comment on lines 399 to 404
  if (!source->getData(curr_time, collision_points)) {
    RCLCPP_WARN(get_logger(), "Invalid blocking source detected, stopping the robot");
    Velocity stop_vel;
    stop_vel.tw = 0.0;
    stop_vel.x = 0.0;
    stop_vel.y = 0.0;
    Action robot_action{STOP, stop_vel, ""};
    publishVelocity(robot_action);
    return;
  }
Collaborator

The return will break the process() logic: the tail of this routine does some things that we can't skip.
If we just return here, the action-state notification logic won't run, and CM won't report the change. So it is better to update CollisionMonitor::notifyActionState() so that action_polygon can be passed as nullptr, and to report an abnormal situation in that case.

Additionally, we need to save the zero-velocity action to robot_action_prev_ for notifications to work properly, and still publish the polygons, since they might be in different frames that move while the robot is being stopped.

I think it is better here to set the main robot_action local variable and fall through to the tail of process(), calling notifyActionState() with a null action_polygon, which tells that routine to report abnormal behavior.
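
The control-flow change suggested here can be sketched as below. It is a simplified, ROS-free illustration under stated assumptions: `MonitorSketch`, its `tail_steps` log, and the exact order of the tail steps are hypothetical stand-ins, not the node's real members. The point it shows is that an invalid source overrides `robot_action` and breaks out of the loop, while the tail of `process()` still runs.

```cpp
#include <string>
#include <vector>

enum ActionType {DO_NOTHING, STOP};

struct Velocity
{
  double x{0.0};
  double y{0.0};
  double tw{0.0};
};

struct Action
{
  ActionType action_type{DO_NOTHING};
  Velocity req_vel{};
  std::string polygon_name{};
};

// Hypothetical, heavily simplified stand-in for the monitor node.
class MonitorSketch
{
public:
  Action last_action;                   // plays the role of robot_action_prev_
  std::vector<std::string> tail_steps;  // records which tail steps ran

  void process(const std::vector<bool> & sources_valid, const Velocity & cmd_vel)
  {
    tail_steps.clear();
    Action robot_action{DO_NOTHING, cmd_vel, ""};

    for (bool valid : sources_valid) {
      if (!valid) {
        // Override the action with a zero-velocity STOP, then leave the
        // loop. There is deliberately no early return from process().
        robot_action = Action{STOP, Velocity{}, "invalid source"};
        break;
      }
    }

    // Tail of the routine: always reached, even for an invalid source.
    tail_steps.push_back("notifyActionState");
    tail_steps.push_back("publishVelocity");
    tail_steps.push_back("publishPolygons");
    last_action = robot_action;
  }
};
```

Compared with the early `return` in the reviewed snippet, this keeps notification, polygon publishing, and the previous-action bookkeeping on a single code path.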

Collaborator Author

See last commit

nav2_collision_monitor/src/collision_monitor_node.cpp (outdated)
Collaborator

We need to think about how to report that the robot needs to stop when the sources are outdated for the Collision Detector. The message currently carries a string array of detected polygons, but it gives no way to signal this. So I am thinking about publishing CollisionDetectorState.msg with an empty polygons[] array, which tells the developer that something other than a polygon is causing the robot to report the "collision". Alternatively, add one more bool field to this message.

Collaborator Author

See last commit

Collaborator

Now polygons[] is empty when the robot is reported to "collide" (stop) due to an invalid source. However, it is also empty in normal operation. Sorry, I missed this initially. Thus, we still seem to need to add invalid-source feedback to CollisionDetectorState.msg (and CollisionMonitorState.msg).

nav2_collision_monitor/src/collision_detector_node.cpp (outdated)
nav2_collision_monitor/test/sources_test.cpp (outdated)
@AlexeyMerzlyakov
Collaborator

Hmm, what about getting rid of block_if_invalid but allowing an invalid source to be ignored if source_timeout is set below 0.0?

I am not sure that negative timeout values won't cause confusion. In my opinion, it is better to use an ignore_invalid flag per source, since it will be clearer for the end developer.

@SteveMacenski
Member

To not change the current behavior. Maybe there are people using the CM on sensors with flaky rate. Or maybe with a dynamic array of sensors that can be enabled/disabled depending on the robot mission. But I agree that blocking on invalid should ideally be the default behavior if that's acceptable to change.

I think that's acceptable to change. If they have flaky sensors, they can set the timeout to something outrageously high (or below). Disable/enable can be done through this API with Tony's recent work, no?

Hmm, what about getting rid of block_if_invalid but allowing an invalid source to be ignored if source_timeout is set below 0.0?

Sure, or -1 or something clearly invalid. I think -1 is better

@SteveMacenski SteveMacenski linked an issue Oct 16, 2023 that may be closed by this pull request
@tonynajjar
Collaborator

tonynajjar commented Oct 16, 2023

I think also the collision detector node should use that getData bool to log the warning too. I'm not sure - should we somehow adjust the publication to note that to an external listener @tonynajjar?

Logging is an obvious improvement yes. I'm also not too sure about adjusting the publication; it wouldn't hurt but I can't think of satisfying semantics. Another alternative to modifying the msg is to publish this information as diagnostics; seems like a suitable place for old data from a source which is similar to a failing heartbeat.

@doisyg
Collaborator Author

doisyg commented Oct 16, 2023

Sure, or -1 or something clearly invalid. I think -1 is better

Actually going for 0.0 as the source_timeout is stored as a Duration and can't be negative

@doisyg
Collaborator Author

doisyg commented Oct 16, 2023

Okay, I just pushed some changes:

  • Default behavior is now to block if a source is invalid
  • Got rid of the block_if_invalid parameter; a source is non-blocking when invalid if its source_timeout parameter is equal to 0.0 (internally source_timeout_.seconds() == 0.0)
  • Also allow a valid state (getData true) if a source has not published yet AND source_timeout: 0.0; otherwise block.
  • Added source name in the log (with new std::string Source::getSourceName() method)

If that looks fine to you, I will move forward with the next steps:

  1. Not breaking the process() logic and proper feedback
  2. Collision Detector
  3. Tests

@doisyg
Collaborator Author

doisyg commented Oct 16, 2023

If we consider the 0.0 equality check dirty, I will change source_timeout type to double and check for negativity

@AlexeyMerzlyakov
Collaborator

AlexeyMerzlyakov commented Oct 17, 2023

Regarding #3880 (comment) and #3880 (comment): if the code was changed to message the failed source, I think it would be fair to add the failed source name in CollisionMonitorState.msg and CollisionMonitorDetector.msg. @SteveMacenski, what do you think about it?

@doisyg
Collaborator Author

doisyg commented Oct 23, 2023

I added the source_names feedback in the last commit as an experiment so we can discuss (but there is probably an overhead of doing this).
For the collision monitor, new field string[] source_names in CollisionMonitorState.msg containing:

  • if all sources are valid: list of first sources that triggered the matching polygon_name. Possibly multiple sources as we trigger a polygon with a min number of points that can be more than 1.
  • Otherwise, if a source is invalid: the list of invalid source, and polygon_name will be left empty

For the collision detector, new field string[] source_names in CollisionDetectedState.msg containing:

  • if all sources are valid: list of list of sources (combined in a string to avoid a multi array field).
  • Otherwise, if a source is invalid: the list of invalid sources, and polygons + detections will be left empty

@mergify
Contributor

mergify bot commented Oct 24, 2023

This pull request is in conflict. Could you fix it @doisyg?

@doisyg
Collaborator Author

doisyg commented Oct 26, 2023

Removed source feedback and rebased on latest.
Source feedback should probably be added only after #3885 comes in

@doisyg
Collaborator Author

doisyg commented Oct 26, 2023

If that looks good to you, I will quickly add a test and update the documentation

@doisyg doisyg mentioned this pull request Oct 26, 2023
7 tasks
Member

@SteveMacenski SteveMacenski left a comment

Superficial stuff, but then I'm happy with it. @AlexeyMerzlyakov any other blockers?

nav2_collision_monitor/src/collision_detector_node.cpp (outdated)
nav2_collision_monitor/src/collision_monitor_node.cpp (outdated)
nav2_collision_monitor/src/collision_detector_node.cpp (outdated)
nav2_collision_monitor/src/collision_detector_node.cpp (outdated)
  // Fill collision_points array from different data sources
  for (std::shared_ptr<Source> source : sources_) {
    if (source->getEnabled()) {
-     source->getData(curr_time, collision_points);
+     if (!source->getData(curr_time, collision_points)) {
Collaborator

If an invalid source was detected, please don't forget to report in state_msg which specific source is invalid (in order to publish this message later).

Collaborator Author

I had a much more complicated approach with proper source feedback, but per @SteveMacenski, I agree it is best to leave it to another PR.
Effort archived here: https://github.com/doisyg/navigation2/tree/with_source_feedback

-     source->getData(curr_time, collision_points);
+     if (!source->getData(curr_time, collision_points)) {
+       action_polygon = nullptr;
+       robot_action.polygon_name = "invalid source";
Collaborator

Please change this to robot_action.polygon_name = "invalid source: " + source->getName(). If we have to stop due to different sources, robot_action.polygon_name should be different to re-trigger the reporting later in notifyActionState()
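
To illustrate why the name matters, here is a minimal sketch of change-triggered reporting. It assumes, for illustration only, that reporting is re-triggered by comparing the current `polygon_name` against the previous one; `ActionSketch` and `shouldNotify` are hypothetical names, not the node's API.

```cpp
#include <string>

enum ActionType {DO_NOTHING, STOP, SLOWDOWN};

// Hypothetical reduced action record: only the fields needed here.
struct ActionSketch
{
  ActionType action_type;
  std::string polygon_name;
};

// Assumed rule: report only when the name differs from the previous
// cycle. A constant "invalid source" label therefore reports once,
// while a label that embeds the source name reports again whenever the
// offending source changes.
inline bool shouldNotify(const ActionSketch & prev, const ActionSketch & curr)
{
  return curr.polygon_name != prev.polygon_name;
}
```

Under that assumption, two consecutive stops labelled "invalid source" produce a single report, whereas "invalid source: scan" followed by "invalid source: pointcloud" produces two.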

Collaborator Author

Same as above.
For now, without proper source tracing/feedback, that would not make sense as we are stopping at the detection of the first invalid source, so if we have multiple invalid, it will just be random which one we are reporting.
Alternative (complex) is here: https://github.com/doisyg/navigation2/tree/with_source_feedback
Better for another PR

Comment on lines 563 to 566
if (robot_action.polygon_name == "invalid source") {
RCLCPP_WARN(
get_logger(),
"Robot to stop due to invalid source");
Collaborator

Also, please report which source is causing the robot to stop.

Collaborator Author

Same as above. See https://github.com/doisyg/navigation2/tree/with_source_feedback but better for another PR

Comment on lines 80 to 85
  if (!sourceValid(data_->header.stamp, curr_time)) {
-   return;
+   // Don't block if source_timeout == 0.0
+   if (source_timeout_.seconds() == 0.0) {
+     return true;
+   }
+   return false;
Collaborator

It looks like we face the same issue as reported in #3880 (comment).

Let's check the case when source_timeout_ is equal to 0.0: this is the case where we should ignore an outdated source. However, how could we determine that the source is outdated?
In other words, the sourceValid() routine contains the following comparison:

  const rclcpp::Duration dt = curr_time - source_time;
  if (dt > source_timeout_) {
    // Fail-case
  }

With a zero timeout, this comparison always hits the fail case, so sourceValid() always returns false. Thus, if we set source_timeout to 0 for a specific source, that source will always be ignored, no matter how old its data actually is.
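
The effect can be reproduced outside ROS with a small sketch. Here `std::chrono::nanoseconds` stands in for `rclcpp::Duration`, timestamps are represented as durations since an arbitrary epoch, and `sourceValidNaive` is a hypothetical name that mirrors the comparison quoted above.

```cpp
#include <chrono>

// Stand-in for rclcpp::Duration (assumption made for this sketch).
using Duration = std::chrono::nanoseconds;

// Naive validity check mirroring the quoted comparison. Timestamps are
// expressed as durations since an arbitrary epoch.
inline bool sourceValidNaive(
  Duration source_time, Duration curr_time, Duration source_timeout)
{
  const Duration dt = curr_time - source_time;
  if (dt > source_timeout) {
    return false;  // fail case: data considered too old
  }
  return true;
}
```

With a zero `source_timeout`, any message with a non-zero age fails the check, whether it is one millisecond or ten seconds old, so the source's data is dropped every cycle.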

Collaborator Author

Gosh. I get it now. Nice catch. This is problematic; a refactor is needed. I don't see a way to solve it without re-introducing the additional bool parameter block_if_invalid (or a better name). Thoughts, @SteveMacenski?

Member

@doisyg why not just move this check inside of sourceValid?


if (source_timeout_.seconds() != 0.0 && dt > source_timeout_) {

Collaborator Author

That will push the problem down to the next part of getData, and we will have a tf error at some point:
https://github.com/doisyg/navigation2/blob/e54293f9b327174de619519317b6bb5bc0d2e282/nav2_collision_monitor/src/pointcloud.cpp#L89-L111

Though if we also don't block on a tf error and just skip the data when source_timeout == 0.0, we could add a check higher up here (which will require a new source::getSourceTimeout method) and we should be fine: https://github.com/doisyg/navigation2/blob/e54293f9b327174de619519317b6bb5bc0d2e282/nav2_collision_monitor/src/collision_monitor_node.cpp#L389

  // Fill collision_points array from different data sources
  for (std::shared_ptr<Source> source : sources_) {
    if (source->getEnabled()) {
      if (!source->getData(curr_time, collision_points) && source->getSourceTimeout() != 0.0) {
        action_polygon = nullptr;
        robot_action.polygon_name = "invalid source";
        robot_action.action_type = STOP;
        robot_action.req_vel.x = 0.0;
        robot_action.req_vel.y = 0.0;
        robot_action.req_vel.tw = 0.0;
        break;
      }
    }
  }

Pushing this now

Member

we will have a tf error at some point

So thinking aloud:

  • If we check whether the source is valid in sourceValid, with the check on the timeout, then we either return false due to stale data with a timeout or...
  • We continue on, allowing stale data without a timeout. Then:
  • We try to TF transform it. As long as that data isn't insanely stale, that should be fine as long as it's still in the TF buffer
  • If it's not in the TF buffer, then the return false is warranted since it's literally impossible to use

That seems like a fine workflow. If we set no timeout, then it should work fine for as long as the data is in any way possibly actionable.

So then going to the node code block you added, that would make sense to not STOP in that case. We don't want to consider the data, but we also don't want to stop either. That seems sensible.
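
The workflow agreed on in this thread can be sketched as follows. It is a ROS-free illustration, not the merged code: `SourceSketch`, `Decision`, and `checkSources` are hypothetical names, `std::chrono::nanoseconds` stands in for `rclcpp::Duration`, and a plain `tf_available` flag stands in for the TF lookup.

```cpp
#include <chrono>
#include <vector>

// Stand-in for rclcpp::Duration (assumption made for this sketch).
using Duration = std::chrono::nanoseconds;

// Hypothetical reduced model of one data source.
struct SourceSketch
{
  Duration source_timeout;  // zero means "no timeout"
  Duration data_age;        // current time minus last message stamp
  bool tf_available;        // whether the data can still be transformed

  // A zero timeout disables the staleness check.
  bool sourceValid() const
  {
    return !(source_timeout.count() != 0 && data_age > source_timeout);
  }

  // Returns false when the data cannot be used this cycle.
  bool getData() const
  {
    if (!sourceValid()) {
      return false;  // stale data with a timeout configured
    }
    if (!tf_available) {
      return false;  // cannot be transformed, so impossible to use
    }
    return true;
  }
};

enum class Decision {Continue, Stop};

// Node-level check: unusable data from a source with a non-zero timeout
// stops the robot; a source with no timeout is skipped instead.
inline Decision checkSources(const std::vector<SourceSketch> & sources)
{
  for (const SourceSketch & source : sources) {
    if (!source.getData() && source.source_timeout.count() != 0) {
      return Decision::Stop;
    }
  }
  return Decision::Continue;
}
```

In this sketch a source with a zero timeout never causes a stop: its data is used while it can still be transformed and silently skipped once it cannot.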

nav2_collision_monitor/src/collision_monitor_node.cpp (outdated)
@SteveMacenski
Member

@AlexeyMerzlyakov please re-review and close your previous comments if they are outdated / fixed. It's hard to review this when it's unclear how many of your items are still actionable / blocking or not.

nav2_collision_monitor/src/source.cpp
nav2_collision_monitor/src/source.cpp
nav2_collision_monitor/src/scan.cpp (outdated)
nav2_collision_monitor/src/pointcloud.cpp (outdated)
nav2_collision_monitor/src/range.cpp (outdated)
@SteveMacenski
Member

CI failed with

unknown file. C++ exception with description "parameter 'Range.source_timeout' cannot be set because it was not declared" thrown in the test body.

Member

@SteveMacenski SteveMacenski left a comment

Otherwise, I approve pending @AlexeyMerzlyakov's rereview

@doisyg
Collaborator Author

doisyg commented Oct 31, 2023

Hopefully fixed CI.
Doc PR: ros-navigation/docs.nav2.org#483

@SteveMacenski
Member

Looks good to me. Any other open topics from @AlexeyMerzlyakov, @doisyg? I know he's not fully tasked on ROS anymore, so if you think you resolved all his concerns, we can merge this to save him the time. If he has issues later, we can address them in the follow up PR doing the logging stuff

@doisyg
Collaborator Author

doisyg commented Oct 31, 2023

All good to me, I am happy to support and iterate fast if this PR causes any unforeseen issue

@SteveMacenski SteveMacenski merged commit 0f72da2 into ros-navigation:main Oct 31, 2023
3 of 5 checks passed
jwallace42 pushed a commit to jwallace42/navigation2 that referenced this pull request Jan 3, 2024
* Add block_if_invalid and allow sensor specific  source_timeout

* remove block_if_invalid param and make it default behavior

* allow unblocking if data not yet published

* log source name when invalid

* getData return true if invalid AND source_timeout == 0.0

* fixed logic without source feedback

* fix test

* rebase artefact

* format artefact

* better log

* move per sensor param source_timeout logic to source.cpp

* fix ignore invalid source behavior

* add source_timeout tests

* no needed anymore

* fix testSourceTimeoutOverride test

---------

Co-authored-by: Guillaume Doisy <guillaume@dexory.com>
Signed-off-by: gg <josho.wallace@gmail.com>
jplapp pushed a commit to logivations/navigation2 that referenced this pull request Jan 17, 2024

(cherry picked from commit 0f72da2)
HovorunB pushed a commit to logivations/navigation2 that referenced this pull request Jan 24, 2024

(cherry picked from commit 0f72da2)
Successfully merging this pull request may close these issues.

Collision Monitor Sensor Input Watchdog
4 participants