Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle find_container_node_names error #322

Merged
merged 3 commits into from Sep 17, 2019
Merged

Conversation

ivanpauno
Copy link
Member

Fix #321.

There is a raise condition, as I described here.

The error I'm catching comes from here:
https://github.com/ros2/rclpy/blob/8da91cee54b68f1108ca017da885496ffb5ae16c/rclpy/src/rclpy/_rclpy.c#L3137-L3146

The problem is that I can be also hiding some other errors by doing this.
rcl_get_service_names_and_types_by_node is a wrapper of rmw_get_service_names_and_types_by_node.
rmw_get_service_names_and_types_by_node can fail not only because the node name wan't found, but also because allocation problems, etc (it depends on the rmw implementation); and I don't have a way to identify the error above from rclpy or above layers.

Ideally, the best way of solving this would be to list the services of all nodes, using:

def get_service_names_and_types(*, node, include_hidden_services=False):
service_names_and_types = node.get_service_names_and_types()
if not include_hidden_services:
service_names_and_types = [
(n, t) for (n, t) in service_names_and_types
if not topic_or_service_is_hidden(n)]
return service_names_and_types

And then recognize the node that was hosting the service from the fully qualified service name.
This won't be possible until we go ahead with ros2/design#241.
There is currently ambiguity:
/asd/bsd/_container/load_node
Can be a node with ns=/asd name=bsd and the service name is _container/load_node.
Or it can be: ns=/ name=asd and service name is bsd/_container/load_name (an unusual choice).

Seeing the four places than find_container_node_names is used (1, 2, 3, 4), I prefer hiding some unusual errors (bad allocations, etc) to avoid the race condition (which is a frequent error). The only result is that a node container is not found and skipped (when any error occurs).

Signed-off-by: ivanpauno <ivanpauno@ekumenlabs.com>
@ivanpauno ivanpauno added the in review Waiting for review (Kanban column) label Aug 27, 2019
@ivanpauno ivanpauno self-assigned this Aug 27, 2019
@hidmic
Copy link
Contributor

hidmic commented Aug 27, 2019

I prefer hiding some unusual errors (bad allocations, etc) to avoid the race condition (which is a frequent error)

I think that's a fair rationale, so it LGTM to get back to green CI, but I do not agree with

Ideally, the best way of solving this would be to list the services of all nodes

IMHO the solution is to simply not raise generic exceptions such as RuntimeError from rclpy and instead use clearly defined ones to enable some form of smart error handling.

@dirk-thomas
Copy link
Member

It is good to know that this race is responsible for the failing test. I agree with the previous comment though that it should be made sure that a distinguishable error code is returned from the RMW API and that the case where the node simply doesn't exist needs to be signaled with a different exception type in rclpy.

If we don't address the root of the problem now but only use this hack to ignore all kinds of errors I doubt it will get revisited. If we immediately follow up with this improvement I am fine to merge this hack as a short term temporary workaround.

@ivanpauno
Copy link
Member Author

IMHO the solution is to simply not raise generic exceptions such as RuntimeError from rclpy and instead use clearly defined ones to enable some form of smart error handling.

I agree that's an easier solution, but the other solution I commented directly avoids the race condition, instead of handling the error after not finding the node name.

If we don't address the root of the problem now but only use this hack to ignore all kinds of errors I doubt it will get revisited.

I agree, I will actually forget.
I will left this open until I add the new return code in the rmw API, modify the three rmw implementations, check that the rcl API pass the new error code correctly to the upper layer, and modify rclpy for raising a different exception in that case (I will probably do it by the end of this week).

@hidmic
Copy link
Contributor

hidmic commented Aug 27, 2019

the other solution I commented directly avoids the race condition

Yes, you're right, but we're a bit further away from being able to pull that one off.

Signed-off-by: ivanpauno <ivanpauno@ekumenlabs.com>
Signed-off-by: ivanpauno <ivanpauno@ekumenlabs.com>
Copy link
Contributor

@hidmic hidmic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ivanpauno
Copy link
Member Author

CI , up to ros2component, only fastrtps:

  • Linux: Build Status
  • Linux-aarch64: Build Status
  • OSX: Build Status
  • Windows: Build Status

@ivanpauno ivanpauno merged commit a12484d into master Sep 17, 2019
jacobperron pushed a commit that referenced this pull request Sep 26, 2019
Signed-off-by: ivanpauno <ivanpauno@ekumenlabs.com>
jacobperron added a commit that referenced this pull request Sep 26, 2019
* Handle find_container_node_names error (#322)

Signed-off-by: ivanpauno <ivanpauno@ekumenlabs.com>

* Catch generic RuntimeError instead

Since rclpy.node.NodeNameNonExistentError is not defined in Dashing.

Signed-off-by: Jacob Perron <jacob@openrobotics.org>
@clalancette clalancette deleted the ivanpauno/fix-#321 branch May 26, 2020 12:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in review Waiting for review (Kanban column)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ros2component] find_container_node_names sometimes fails in the presence of other nodes
3 participants