Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry router connection #135

Merged
merged 4 commits into from
Apr 4, 2024
Merged

Retry router connection #135

merged 4 commits into from
Apr 4, 2024

Conversation

Yadunund
Copy link
Member

@Yadunund Yadunund commented Mar 22, 2024

Addresses #126

This PR updates rmw_init() such that it blocks until a Zenoh router is discovered in ROUTER_CONNECTION_RETRY_MAX attempts. For now this value is not externally configurable and is set to std::numeric_limits<uint64_t>::max() which effectively retries indefinitely.

To test run a talker node without starting the router. The terminal should contain logs to remind the user to start the router. And once the router is started on a different terminal, execution should resume.

# terminal 1
yadunund@ubuntu-22-04:~/ws_rmw$ ros2 run demo_nodes_cpp talker
[ERROR] [1711097721.962485478] [rmw_zenoh_cpp]: Unable to connect to a Zenoh router. Have you started a router with `ros2 run rmw_zenoh_cpp rmw_zenohd`?
[ERROR] [1711097722.962627606] [rmw_zenoh_cpp]: Unable to connect to a Zenoh router. Have you started a router with `ros2 run rmw_zenoh_cpp rmw_zenohd`?
[ERROR] [1711097723.962784124] [rmw_zenoh_cpp]: Unable to connect to a Zenoh router. Have you started a router with `ros2 run rmw_zenoh_cpp rmw_zenohd`?
# terminal 2
yadunund@ubuntu-22-04:~/ws_rmw$ ros2 run rmw_zenoh_cpp rmw_zenohd 
Enter 'q' to quit...

Then

# terminal 1
[ERROR] [1711098130.624961769] [rmw_zenoh_cpp]: Unable to connect to a Zenoh router. Have you started a router with `ros2 run rmw_zenoh_cpp rmw_zenohd`?
[ERROR] [1711098131.625107213] [rmw_zenoh_cpp]: Unable to connect to a Zenoh router. Have you started a router with `ros2 run rmw_zenoh_cpp rmw_zenohd`?
[ERROR] [1711098132.625249714] [rmw_zenoh_cpp]: Unable to connect to a Zenoh router. Have you started a router with `ros2 run rmw_zenoh_cpp rmw_zenohd`?
[ERROR] [1711098133.625404414] [rmw_zenoh_cpp]: Unable to connect to a Zenoh router. Have you started a router with `ros2 run rmw_zenoh_cpp rmw_zenohd`?
[ERROR] [1711098134.625538271] [rmw_zenoh_cpp]: Unable to connect to a Zenoh router. Have you started a router with `ros2 run rmw_zenoh_cpp rmw_zenohd`?
[ERROR] [1711098135.625676527] [rmw_zenoh_cpp]: Unable to connect to a Zenoh router. Have you started a router with `ros2 run rmw_zenoh_cpp rmw_zenohd`?
[ERROR] [1711098136.625823771] [rmw_zenoh_cpp]: Unable to connect to a Zenoh router. Have you started a router with `ros2 run rmw_zenoh_cpp rmw_zenohd`?
[INFO] [1711098137.626023133] [rmw_zenoh_cpp]: Successfully connected to a Zenoh router with id cb1d1e4b8534f374740e5ed92edab2b3!
[INFO] [1711098139.637644881] [talker]: Publishing: 'Hello World: 1'
[INFO] [1711098140.637620937] [talker]: Publishing: 'Hello World: 2'

Signed-off-by: Yadunund <yadunund@openrobotics.org>
Copy link
Collaborator

@clalancette clalancette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I like the idea. However, there are two issues that I see here.

  1. I do think we should make this configurable through an environment variable. That is, I like this as a default behavior, but I think we should also allow the user to set the number of times to check (including maybe a way to disable the check completely).
  2. When I was testing this out with rclpy applications, I couldn't kill off the rclpy application. That is, I would run ros2 run demo_nodes_py talker with no router, and have it be in the loop saying "Unable to connect to a Zenoh router". But then hitting Ctrl-C would do nothing. Doing the same thing with rclcpp, I was able to kill off the application. This is likely a bug somewhere in rclpy, but it is a pretty poor user experience right now. Not sure what we should do about that.

@Yadunund
Copy link
Member Author

Yadunund commented Mar 22, 2024

Thanks for the feedback, Chris. Wondering if we should have two separate envars : one to decide whether or not to check for a router's presence and the other for retry count. Else we could have a single envar ZENOH_ROUTER_CHECK_ATTEMPTS which if set to 0 will skip the router check and will attempt to connect with the router as many times as defined if greater than zero.

Regarding the rclpy issue, it wonder if its related to when we register signal handlers. We should definitely fix it but not a huge blocker imo since the user terminate with CRTL+Z or can start the router and then hit Ctrl+C if needed.

@Yadunund
Copy link
Member Author

Yadunund commented Mar 24, 2024

Alright I changed the implementation to be configurable via ZENOH_ROUTER_CHECK_ATTEMPTS. If set to <= 0, we skil the check. Else check as many times as specified. If unset, check indefinitely.

Signed-off-by: Yadunund <yadunund@openrobotics.org>
@Yadunund Yadunund force-pushed the yadu/retry_router_connection branch from 3ff9448 to 704d206 Compare March 24, 2024 08:56
Copy link
Collaborator

@clalancette clalancette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the idea here. I've left one more piece of feedback, then I think we can get this in.

README.md Outdated
Comment on lines 80 to 81
| unset | Indefinitely waits for connection to a Zenoh router. |
| <= 0 | Skips Zenoh router check. |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll suggest that we change this behavior slightly. In particular, if it is unset or 0, then it indefinitely waits, and if it is less than 0, we skip the check. That feels a little more symmetric to me, and means that you can always set some value and get the behavior you want.

Copy link
Member Author

@Yadunund Yadunund Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 6366c9d

Signed-off-by: Yadunund <yadunund@openrobotics.org>
Signed-off-by: Yadunund <yadunund@openrobotics.org>
@Yadunund Yadunund force-pushed the yadu/retry_router_connection branch from 6df064a to 6366c9d Compare March 27, 2024 07:41
@Yadunund Yadunund merged commit 2f4e8fe into rolling Apr 4, 2024
3 of 5 checks passed
@delete-merged-branch delete-merged-branch bot deleted the yadu/retry_router_connection branch April 4, 2024 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants