Bond support + fixing various action server issues #1894

Merged: 24 commits, Jul 30, 2020

SteveMacenski
Member

@SteveMacenski SteveMacenski commented Jul 29, 2020

Adding heartbeat support for servers so if they crash or fail, the lifecycle manager will kill the system.

#1869 replacement

Tested:

Completed:

  • Test coverage
  • Documentation updates to navigation website

Relies on ros/bond_core#67 being merged, but I updated the ros2_dependencies.repos file to use my fork + branch that this PR represents.

@SteveMacenski
Member Author

SteveMacenski commented Jul 29, 2020

@naiveHobo please review. The server-side disable point I made I'll work on tomorrow. Found a solution, it was easier than I thought.

@SteveMacenski
Member Author

@naiveHobo review now would be good, added complete coverage of the feature set

@naiveHobo
Contributor

The lifecycle_manager tests failed since it's not able to form a bond with the test lifecycle node.

I think it would make sense to make bond optional in lifecycle_manager. That will allow any rclcpp_lifecycle::LifecycleNode to use nav2_lifecycle_manager::LifecycleManager out of the box without worrying about setting up bond.

@SteveMacenski
Member Author

The lifecycle_manager tests failed since it's not able to form a bond with the test lifecycle node.

Ah, it is optional, I'll update the test to set that up

@SteveMacenski
Member Author

Oh whoops closed it

@SteveMacenski SteveMacenski reopened this Jul 30, 2020
@SteveMacenski
Member Author

SteveMacenski commented Jul 30, 2020

Should be all good now. Funny enough, that actually covered one test case I didn't cover (the case of no bond connection), so that's convenient :-)

@SteveMacenski
Member Author

Great, actions are failing tests again.

@SteveMacenski
Member Author

SteveMacenski commented Jul 30, 2020

Actually, I think I uncovered another failure in the action server wrapper, one sec. There was a completely unsafe autostart feature that wasn't fully implemented, which I removed. I think the tests unfairly assumed the server was activated without being told to activate (which nothing other than the tests relies on... not great).

Edit: tested locally, that was it - fixed now

@codecov

codecov bot commented Jul 30, 2020

Codecov Report

Merging #1894 into master will increase coverage by 0.26%.
The diff coverage is 97.19%.

@@            Coverage Diff             @@
##           master    #1894      +/-   ##
==========================================
+ Coverage   71.12%   71.38%   +0.26%     
==========================================
  Files         222      222              
  Lines       10676    10760      +84     
==========================================
+ Hits         7593     7681      +88     
+ Misses       3083     3079       -4     
Impacted Files                                         Coverage Δ
nav2_util/include/nav2_util/lifecycle_node.hpp         100.00% <ø> (ø)
nav2_bt_navigator/src/bt_navigator.cpp                 80.37% <71.42%> (+0.25%) ⬆️
nav2_lifecycle_manager/src/lifecycle_manager.cpp       96.08% <98.33%> (+3.28%) ⬆️
nav2_amcl/src/amcl_node.cpp                            83.96% <100.00%> (-0.11%) ⬇️
nav2_controller/src/nav2_controller.cpp                79.20% <100.00%> (-0.27%) ⬇️
nav2_map_server/src/map_server/map_server.cpp          90.47% <100.00%> (+0.23%) ⬆️
nav2_planner/src/planner_server.cpp                    72.46% <100.00%> (+0.40%) ⬆️
...v2_recoveries/include/nav2_recoveries/recovery.hpp  86.07% <100.00%> (+0.36%) ⬆️
nav2_recoveries/src/recovery_server.cpp                92.50% <100.00%> (+0.19%) ⬆️
...v2_util/include/nav2_util/simple_action_server.hpp  88.12% <100.00%> (+2.32%) ⬆️
... and 6 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 95b905c...13c885f.

@SteveMacenski SteveMacenski merged commit ec26039 into master Jul 30, 2020
@SteveMacenski SteveMacenski deleted the bond_lifecycle branch July 30, 2020 23:19
@gramss
Contributor

gramss commented Aug 1, 2020

rosdep installing your bond_core branch failed for me:
I'm running 20.04 and freshly installed foxy from debs.

rosdep install -y -r -q --from-paths src --ignore-src --rosdistro foxy
ERROR: the following packages/stacks could not have their rosdep keys resolved
to system dependencies:
nav2_util: Cannot locate rosdep definition for [bond]
nav2_lifecycle_manager: Cannot locate rosdep definition for [bondcpp]
Continuing to install resolvable dependencies...

(no other mention of bond or your lifecycle branch)

Installing your branch manually works fine, after cleaning the workspace.

@SteveMacenski
Member Author

See the repos file; when building master, you should be using it for dependencies.

@gramss
Contributor

gramss commented Aug 1, 2020

Yes, I cloned the master branch of nav2, which is the default.
That is why I commented: I think it did not work as expected.
Only the automatic dependency setup via the repos file did not work, but I think CI is set up and working for master, so it was probably something local on my machine?
It worked manually, so no big issue.

@SteveMacenski
Member Author

I don't know what you're asking. See our build instructions for the master branch: https://navigation.ros.org/build_instructions/index.html#build-ros-2-master. It requires that you work with the repos file. If you choose not to, then don't run master; run a branch released for a specific distribution.

@SteveMacenski
Member Author

Master assumes you're building the dependencies we set out in the repos file, because often we need to make updates to fix issues or add features so we need to be able to use non-released versions in master development.

@gramss
Contributor

gramss commented Aug 1, 2020

Aaaah... I was assuming rosdep uses the .repos file, but vcs is responsible for working with that.
I was just skimming the docs too fast. Sorry for bothering you.
I thought that some clever mechanism (vcs, though I thought rosdep would do the job) had stopped working on current master because of this PR, and wanted to notify you so it could be fixed.

But I was just following the wrong docs (and skipping the git checkout of a dedicated release branch, as I wanted master).

SteveMacenski added a commit that referenced this pull request Aug 11, 2020
* prototype of lifecycle bond system

* adding more structure to get around weak ptr issue

* working prototype for manager

* adding some ns -> s conversions

* changing to service node

* adding bond connections to all servers

* update logs

* fixing review comments

* fix types

* remove extraneous functions

* make linters happy

* simplifications

* adding spinner to get working but now unstable

* moving bond connections to activate state

* adding defaults

* working complete prototype for review

* update dependencies

* adding connection logging

* remove accidental file

* fix server side timeout for heartbeats

* adding complete unit coverage of bond support

* fixing lifecycle test

* trying to activate since autostart was removed
@daisukes
Contributor

@SteveMacenski, I have a question about this.

As commented here, if one of them is down then bring them all down.
https://github.com/ros-planning/navigation2/blob/d191e0a41148ef87597711d5bab18dda98fa7c0c/nav2_lifecycle_manager/src/lifecycle_manager.cpp#L345

How am I supposed to bring them up again?
I'm using auto_start to activate at startup, so I want them to come back automatically.
Watching the lifecycle manager state?

@SteveMacenski
Member Author

Using the lifecycle manager start functionality like you would if you didn't have auto start set. It just goes back down to the dormant state, you can restart it as normal.
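
The "one down, all down, then restart as normal" behaviour described above can be sketched roughly as follows. This is a self-contained illustration only, not the nav2 or bondcpp API: `HeartbeatMonitor`, its method names, and any timeout value passed to it are hypothetical, and time is injected by the caller so the logic can be followed without ROS 2.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the manager's bookkeeping; NOT the nav2 or bondcpp API.
class HeartbeatMonitor
{
public:
  using Clock = std::chrono::steady_clock;
  using TimePoint = Clock::time_point;

  HeartbeatMonitor(std::vector<std::string> servers, std::chrono::milliseconds timeout)
  : servers_(std::move(servers)), timeout_(timeout)
  {
    assert(timeout_.count() > 0);
  }

  // Bring every monitored server up and restart its heartbeat timer.
  void startup(TimePoint now)
  {
    last_seen_.clear();
    for (const auto & name : servers_) {
      last_seen_[name] = now;
    }
    active_ = true;
  }

  // Record a heartbeat from one server; ignored for unknown names or while dormant.
  void heartbeat(const std::string & name, TimePoint now)
  {
    auto it = last_seen_.find(name);
    if (active_ && it != last_seen_.end()) {
      it->second = now;
    }
  }

  // If any single server has been silent for longer than the timeout,
  // the whole system drops back to the dormant state.
  bool check(TimePoint now)
  {
    if (!active_) {
      return false;
    }
    for (const auto & entry : last_seen_) {
      if (now - entry.second > timeout_) {
        active_ = false;
        return false;
      }
    }
    return true;
  }

  bool active() const {return active_;}

private:
  std::vector<std::string> servers_;
  std::chrono::milliseconds timeout_;
  std::map<std::string, TimePoint> last_seen_;
  bool active_{false};
};
```

In this sketch, calling `startup()` again after a failed `check()` plays the role of the manual restart: nothing comes back on its own, and the caller decides when to bring the system up again.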

ruffsl pushed a commit to ruffsl/navigation2 that referenced this pull request Jul 2, 2021
@tonynajjar
Contributor

tonynajjar commented Nov 16, 2021

Hi!

From Eloquent to Foxy, I deduce this feature should be available in Foxy but I don't see the changes of this PR in the foxy-devel branch (e.g. no bond in bt_navigator). Does that mean the feature is not in Foxy? Am I missing something?

Thanks in advance!

@SteveMacenski
Member Author

Correct, it is in Galactic and newer.

this->get_name(),
shared_from_this());

bond_->setHeartbeatPeriod(0.10);

@SteveMacenski Can I ask where this value for the period came from? The message is small but on our system running many parts of nav2, this equates to almost 100Hz on the bond topic for 5 nodes.

Member Author

This should be 10 Hz, not 100 Hz: https://github.com/ros/bond_core/blob/ros2/bondcpp/src/bond.cpp#L291-L299. It was selected to offer a reasonable trade-off between responsiveness to a server going down and how long it would remain unsafe for a robot to keep operating before the system goes down. Since this needs to match from the lifecycle node -> lifecycle manager and vice versa, it couldn't easily be parameterized in a way that guarantees they match, so this is one of the very few "magic numbers" in the stack.

I'd be more than happy to discuss a proposal though to change to another value if you had a suggestion and rationale to share!

The stack up of several bond publishers on the same bond topic is giving us the 100 Hz as measured from ros2 topic hz /bond.

Could this value not be kept as a default in a common header file with an associated parameter name?

Member Author

There are not currently any common headers between all of these systems, that I am aware of at least.

You should see 10 * N messages on the topic per second, that is correct, where N is the number of lifecycle bond monitored servers. This should be very attainable for DDS given these messages are very small and the default bringup right now includes composition so it's all intraprocess.
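
The rate arithmetic in this thread can be written out as a small helper. This is only a sketch: `expectedBondTopicRateHz` is a hypothetical name, and the `publishers_per_bond` parameter is an assumption added for illustration. With one publisher per bond it reproduces the "10 * N" figure; the roughly 100 Hz measured for 5 nodes would be consistent with both ends of each bond publishing, but the thread does not confirm that.

```cpp
#include <cassert>

// Expected aggregate message rate on the shared bond topic, in Hz.
// servers:             number of bond-monitored lifecycle servers (N in the thread).
// heartbeat_period_s:  heartbeat period per publisher (0.10 s in the reviewed snippet).
// publishers_per_bond: ASSUMPTION for illustration; 1 matches "10 * N",
//                      2 would apply if both ends of each bond publish.
double expectedBondTopicRateHz(
  int servers, double heartbeat_period_s, int publishers_per_bond = 1)
{
  assert(servers >= 0 && heartbeat_period_s > 0.0 && publishers_per_bond > 0);
  return static_cast<double>(servers * publishers_per_bond) / heartbeat_period_s;
}
```

For example, `expectedBondTopicRateHz(5, 0.10)` gives about 50 Hz, while `expectedBondTopicRateHz(5, 0.10, 2)` gives about 100 Hz, which is the value reported from `ros2 topic hz /bond`.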

Is there some precedent for the rate selected from other usage of the bond topic? It seems like there could be a distance metric derived from max speed to come up with a time interval and a timeout value.

Member Author

That would probably break some encapsulation in the servers, since many don't know or care about the robot's speed. What are you trying to solve? 100 Hz is nothing, that shouldn't be causing any issues even on small platforms!

@mergify
Contributor

mergify bot commented Dec 31, 2021

This pull request is in conflict. Could you fix it @SteveMacenski?

@salmagro
Contributor

Using the lifecycle manager start functionality like you would if you didn't have auto start set. It just goes back down to the dormant state, you can restart it as normal.

@SteveMacenski I have a question regarding relaunching a custom node when it crashes.

  1. My custom node inherits from the nav2 lifecycle.
  2. I added the createBond() on the on_activate() transition. Like here: https://github.com/ros-planning/navigation2/blob/main/nav2_controller/src/controller_server.cpp#L233-L234
  3. I pass the node name from the lifecycle_nodes list to the nav2_lifecycle_manager.
  4. The node is not being relaunched.

What else do I need to configure? Is there any example or documentation? Thanks.
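
The setup in steps 1 and 2 can be pictured with the sketch below. It is deliberately ROS-free so it compiles on its own: `MockLifecycleNode` is a hypothetical stand-in for the nav2 lifecycle base class and models only `createBond()` and `destroyBond()`, while `MyCustomServer` and its node name are made up. Calling `destroyBond()` in `on_deactivate()` is an assumption that mirrors the activate side; the thread itself only mentions `createBond()` in `on_activate()`.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for the nav2 lifecycle base class; NOT the real API.
// The real node would open a bond to the lifecycle manager here.
class MockLifecycleNode
{
public:
  explicit MockLifecycleNode(std::string name)
  : name_(std::move(name)) {}
  virtual ~MockLifecycleNode() = default;

  const std::string & name() const {return name_;}
  bool bondActive() const {return bond_active_;}

protected:
  void createBond() {bond_active_ = true;}
  void destroyBond() {bond_active_ = false;}

private:
  std::string name_;
  bool bond_active_{false};
};

// Made-up custom server following the pattern described in the steps above.
class MyCustomServer : public MockLifecycleNode
{
public:
  MyCustomServer()
  : MockLifecycleNode("my_custom_server") {}

  // The bond is created once the node becomes active.
  bool on_activate()
  {
    createBond();
    return true;
  }

  // Assumed counterpart: the bond is released when the node is deactivated.
  bool on_deactivate()
  {
    destroyBond();
    return true;
  }
};
```

Note that a bond like this only lets the manager notice that the node has gone away; nothing in the sketch starts the process again.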

@SteveMacenski
Member Author

SteveMacenski commented May 16, 2022

You have to set respawn to true in launch so that the servers respawn; this is a relatively new feature. I know it's in Rolling, but I'm not sure if it's been (or will be) backported to Galactic or Foxy. Respawn is a launch feature.
