implement wait_for_service function #215
Hopefully the rcl/rclcpp refactor will land before you start working on `wait_for_service` for those libraries... I imagine you would start by implementing it in rmw.
Since it is new API at all levels, I don't think it should be very disruptive.
For Connext we can get the topic name from the request datareader:

```cpp
DDSTopicDescription * foo = request_datareader->get_topicdescription();
foo->get_name();
```

For OpenSplice we know the topic name since we implement the service code ourselves. I'm still looking into FastRTPS.
Looks like we implement services for FastRTPS as well, so I can get the topic name for all cases, I think. With the topic name we can go back to @gerkey's plan of counting topic publishers for the desired service to know when it is "ready". After that we can be assured (at least as assured as is possible) that a subsequent service call will succeed. I'll have to implement topic counting in FastRTPS too.

Actually I guess it's counting the subscribers to the "request" topic (that's what the servers do when ready), but it's the same idea.
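The counting idea above can be sketched generically. This is a minimal, runnable simulation, not the actual rmw code: a plain map stands in for the middleware's per-topic subscriber counts, and the `Request` topic-name suffix is an assumption about how the service's request topic is named. A real implementation would block on a graph-change event rather than sleep-polling.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <thread>

// Stand-in for the middleware's view of the graph: topic name -> subscriber count.
using GraphCounts = std::map<std::string, int>;

// A service server is considered "ready" once at least one subscriber exists
// on its request topic (servers subscribe to requests when they start).
bool service_is_ready(const GraphCounts & sub_counts, const std::string & service_name)
{
  auto it = sub_counts.find(service_name + "Request");
  return it != sub_counts.end() && it->second > 0;
}

// Poll until ready or timeout; a real implementation would wait on a
// graph-change guard condition instead of sleeping.
bool wait_for_service(
  const GraphCounts & sub_counts, const std::string & service_name,
  std::chrono::milliseconds timeout)
{
  auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (service_is_ready(sub_counts, service_name)) {
      return true;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  return service_is_ready(sub_counts, service_name);
}
```

This only gives "at least as assured as is possible": the server could still disappear between the check and the actual request.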
I talked with @jacquelinekay and others about this offline, but I'll summarize here so more people can comment.

First, I think we've decided that waiting on changes in the state of the ROS graph should go through the wait set. The alternative would have been to have standalone blocking functions, which could only be unblocked asynchronously by some form of interrupt. Therefore, waiting on graph changes will be expressed as something that can be added to a wait set.

The second issue has to do with what, for lack of a better term, "run level" of our API is required in order to query the graph and wait on changes to it. By "run level" I mean: can you do it after just calling init, or do you need a node?

Currently we require a node to be passed in order to list the topic names and types (and to count the subscriptions and publishers). This is because (I believe) in our DDS implementations we require a domain participant in order to collect the graph statistics. One downside to this approach is that each node needs to collect statistics in order to answer graph queries. A more efficient use of resources would be to share this task within a single process (not necessarily required to be global, but it could use a global singleton by default).

The alternative would be to do this without a node, but then we would probably need a DDS domain participant which is not associated with a node in order to collect the statistics. This is a leaner version of what was proposed in https://github.com/ros2/rmw_opensplice/pull/125/files#r57970539 (the pull request to create a rostopic-like command line tool); however, that pull request also required a node (in addition to the embedded node which collected statistics).

So we have to decide whether we want to require a node to query this information, i.e. whether the user can ask about the graph state after just calling init.

My personal favorite idea at the moment is to have something else (like a "graph proxy" object) which can be created after init but without a node, and which can be added to a wait set. This separates the semantics of a node and the graph queries. It also allows you to query the graph without creating a node, and it makes it easier to have a node without the overhead of storing the graph state (though the internal implementation of a node may require access to graph queries at some point). The downside to this approach is that we might need to create a participant that is not associated with any node, and we'd have to absorb the overhead of that in order to query the graph. However, it might be possible to "create a graph proxy from a node", in which case it could reuse the domain participant of the node. This would be useful for cases where you already have a node, but the graph proxy would then be tied to the lifetime of the node.

Anyway, that's a lot to think about/comment on, so I'll leave it there. I'm prototyping a few different aspects of the above now.
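The "graph proxy" idea can be sketched as an interface. Everything here is hypothetical and illustrative, not an actual rcl/rmw API: in a DDS-backed implementation the proxy would own (or borrow from a node) a domain participant and fill its state from discovery data, while this runnable sketch just stores the derived state directly.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical "graph proxy": created after init, without a node.
// All names are illustrative; the real API may differ.
class GraphProxy
{
public:
  // In a DDS implementation this would be driven by the participant's
  // discovery listener; here it is called directly for the sketch.
  void on_discovery(const std::string & topic, int subscriber_delta)
  {
    subscribers_[topic] += subscriber_delta;
    changed_ = true;  // would trigger a guard condition in a wait set
  }

  int count_subscribers(const std::string & topic) const
  {
    auto it = subscribers_.find(topic);
    return it == subscribers_.end() ? 0 : it->second;
  }

  std::vector<std::string> get_topic_names() const
  {
    std::vector<std::string> names;
    for (const auto & kv : subscribers_) {
      names.push_back(kv.first);
    }
    return names;
  }

  // Stand-in for "did the graph guard condition fire since the last wait".
  bool take_changed()
  {
    bool c = changed_;
    changed_ = false;
    return c;
  }

private:
  std::map<std::string, int> subscribers_;
  bool changed_ = false;
};
```

The "create a graph proxy from a node" variant would differ only in where the participant comes from, which is why it ties the proxy's lifetime to the node's.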
I wanted to try to summarize the approaches I'm considering.

**Option 1: Graph object that does not need a Node**

This strategy would provide a set of graph query functions on a dedicated graph object. The graph object would either be added to the wait set directly or would provide a guard condition which could be added to the wait set.

Pros:

Cons:

**Option 2: Use a Node to query the graph and wait on graph changes**

This is basically what we have now. Each node keeps a copy of the graph state, so that you can call graph query functions on the node. In order to wait on graph changes you would either add nodes to the wait set, or a node would provide a guard condition for graph state changes which could be added to a wait set.

Pros:

Cons:

**Option 3 (radical): Graph object that does not need a Node, with a shared Domain Participant between nodes and graphs**

This is basically Option 1 with the added implementation detail that there is a single Domain Participant per group (either one per process, or one per rcl_init context or grouping). It's radical in my opinion because it is a departure from what we've been doing so far, and it may have some unforeseen consequences.

Pros:

Cons:

I would argue to go with Option 2 and stick with essentially what we've got now (plus adding a guard condition per node for graph state changes). If we ever intended to move to a multiple-nodes-per-Domain-Participant model in our implementations, then I believe we could do it under either Option 1 or Option 2. However, Option 1 would give a neater path to Option 3 and would give us the ability to have programs that introspect the graph without being a member of it (without a node). I think it comes down to whether or not introspection without a node is an important use case for us. Also, Options 1 and 3 are slightly preferable in a generic sense because they separate the concerns of introspection and participation in the graph. In a different system (i.e. not DDS) this separation might make more sense. I'd appreciate any comments anyone has on this, but it's more a memorandum of my design thoughts than anything.
I have a few comments to contribute that may or may not be useful. :)

As someone who implements a lot of tools, the thing that I dislike about OpenRTM more than anything else is that I have to instantiate the entire API to do anything at all, no matter how small the task. I ended up writing my own client library for my command-line tools because it was the only way to get even half-way efficient tools that can be called from scripts.

I am in favour of option 1. I do not think that it is too difficult a concept in the API for someone who is at the level of wanting to look at or manipulate the graph. The separation between graph participant and graph observer appeals to me for making tools, and I believe it would make them easier to implement.

Using option 2 and requiring every entity to construct a node is not neat, in my opinion. However, nodes are not that heavy, so it's not a deal-breaker the way it is in OpenRTM, with its huge baggage. The biggest downside I can see is making the graph harder to understand/process for tools, because they have to separate the parts of the graph that are the application, the parts that are the infrastructure, and the parts that are observing or manipulating the graph. The latter may be hard to distinguish, unlike infrastructure, which can be spotted by its unique topic names.

Using option 3 would give considerable power and flexibility, but it sounds like it would increase the complexity of the implementation quite a bit. The one-participant-per-node approach is conceptually tidy. I think that option 1 is the best balance between API cleanliness and implementation simplicity.

Something I think needs clarifying is how common it is to have a node that also inspects the graph. If it is relatively rare compared to the cases of a node that does not inspect the graph, and a tool that does but doesn't do any node-like stuff, then I do not think it is too much of a burden to have two domain participants in the same process. (I don't know how heavy a domain participant really is, although I can make a reasonable guess from my own knowledge of how DDS works.) Also worth knowing is whether the DDS implementations do any optimising internally when a single process uses multiple domain participants.
@gbiggs thanks for your input, I think it echoes our thoughts too.

That's really at the core of it, I think. I think any time someone uses a service they should wait for the service server to exist (except in rare cases), and so they'll need to look at the graph. Also, if you want to emulate the ROS 1 style of API, you'll need graph access as well.
Yeah, I imagine there are optimizations that can be made, but in order for each domain participant to be addressable at the RTPS level, you need to at least do discovery for each, which by itself is probably fairly expensive.

We've also discussed this in detail, and I think for now I'm going to take the simplest route with Option 2 (adding a function that, given a node, gives you a guard condition which triggers on graph changes). This requires the fewest changes to our existing system. Separately, I'll investigate a one-participant-per-process model that would make separating the graph introspection from node introspection easier. See ros2/rmw#65 as a preview.
For ROS 1-style nodes, yes, I agree. But for ROS 2, I don't think that this blanket statement can be made. If I have a managed node that may use a service at any time when it is active, I should be able to rely on the deployer to ensure that the provider of that service is available before it activates my node. (Whether or not a node is providing that service is a system integration problem in this case.) If I may use the service sporadically, such that I don't have a constant dependency on it, then checking if it is available before trying to use it may be sensible (depending, again, on how I integrate my system, I think). So it's going to depend on the node design whether or not checking for service availability is necessary.

The other cases you mention sound like enough justification for nodes fairly commonly needing access to graph information, though. It would complicate the user-facing API to make them instantiate a graph-watching object separate from their node just to check on their node's status.

I looked at the API you have proposed in ros2/rmw#65. I don't think it precludes what I want eventually, which is an API separate from nodes that can be used to inspect the graph without participating in it. Whether that API relies on a hidden single-participant-per-process implementation can, I think, be kept as a separate issue. So +1 to your proposal.
I'm hung up on an issue with Connext where locally created topics (and therefore services) do not notify our custom listener which keeps track of graph state. I've posted on their forums to see if there is a way for me to change that behavior: https://community.rti.com/forum-topic/built-publication-data-local-datawriters-and-datareaders
After going back and forth on the RTI forums (https://community.rti.com/forum-topic/built-publication-data-local-datawriters-and-datareaders), the conclusion is that this is expected behavior, and that our choices are to have one participant per process that is dedicated to discovery (and therefore never has its own topics to worry about) or to manage the state and events for the locally created items ourselves. From the DDS 1.4 spec:

I missed that on my first reading of the specification. I made the argument that this isn't a very nice design because it adds an asymmetry to the behavior which doesn't exist with normal topics (which this discovery information is supposed to be delivered over, in a dog-fooding style design). But I doubt anything will change in the near term, since that's in the specification directly. I also advocated for an option that would allow the user to select a different behavior. This has a higher chance of happening in the time frame of a couple of Connext releases, but I haven't gotten any feedback on that particular point yet. I also asked for advice on how to work around the issue, and the only suggestion I got was to use a dedicated participant to monitor discovery so that this problem wouldn't come up. That's an option, but it's somewhat unfortunate, since there is a lot of overhead associated with each participant (memory usage and discovery network traffic). At this point, I think it's most attractive to manage some state and events ourselves.
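"Managing the state and events ourselves" amounts to feeding one graph cache from two paths: the builtin-topic listener (which, per the spec behavior above, only reports remote endpoints on Connext) and our own local create/destroy wrappers. The sketch below is runnable but purely illustrative; the names are not the actual rmw_connext_cpp code.

```cpp
#include <map>
#include <mutex>
#include <string>

// Unified graph cache fed from two sources, so that local entities (which
// Connext's builtin-topic listener never reports to its own participant)
// are still counted alongside remote ones.
class GraphCache
{
public:
  // Called from the builtin-topic listener for *remote* endpoints.
  void on_remote_subscriber(const std::string & topic, int delta)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    counts_[topic] += delta;
  }

  // Called from our own create/destroy subscription wrappers, since the
  // participant's listener never sees its own entities.
  void on_local_subscriber(const std::string & topic, int delta)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    counts_[topic] += delta;
  }

  int count_subscribers(const std::string & topic) const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = counts_.find(topic);
    return it == counts_.end() ? 0 : it->second;
  }

private:
  mutable std::mutex mutex_;
  std::map<std::string, int> counts_;
};
```

The mutex matters because the two update paths run on different threads (the middleware's listener thread and the application thread doing the create/destroy calls).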
Managing some state and events ourselves would, I assume, have benefits for using non-DDS middlewares. How likely that scenario is, I don't know.

A dedicated participant is an attractive solution when the resources are available and many nodes are in the same process, so that the overhead can be minimised. It would give flexibility in how the graph information API can be used. If a dedicated participant is only listening for graph changes, what is the increase in network traffic? Where does it come from (e.g. is it responding to messages to say "I got it")? Is it possible to reduce the memory overhead by assuming that it will respond fairly rapidly and so does not require lengthy history buffers, or would this reduce the reliability of the graph information too much, or is it not even possible to alter those buffers?

What would managing state and events ourselves look like? What is the likely overhead compared with a dedicated participant?
This would only affect Connext. Even OpenSplice, for example, would not need this custom state; OpenSplice seems to ignore this particular part of the specification.

Well, we did discuss having a dedicated participant for discovery information, but we decided to leave it as-is for now (one participant per node). If there's any preferred state for me, it would be one participant per process. Having a minimum of two participants for a process which contains at least one node seems like overkill to me. It would be especially bad in the case where you have lots of nodes, each in their own process, which probably won't be uncommon, at least in research and development situations. It would effectively double the fixed cost for each single-node process. I think the real concern would be the increased discovery traffic, but it's hard to quantify the issue without doing some tests. However, I do know that there have been startup issues in DDS systems with many participants due to heavy discovery traffic. I'm not sure to what degree we can "turn down" the history and queues in a participant to minimize the impact on memory and discovery.

Managing the state ourselves should be fairly straightforward, just annoying because I know Connext is maintaining the same information internally and I simply can't get at it. My educated guess is that it would cost less than an additional participant, simply because it would not incur any new network traffic for discovery.
For the alpha, I'm proposing we review and merge:

These will at least get the functions into the system with some tests. It does not add the user-facing `wait_for_service` API yet. Also, only OpenSplice has a fully working implementation so far: Connext is mostly implemented but is missing local notifications (things created in the same participant), and FastRTPS is blocked on ongoing work to extend FastRTPS itself with what's needed for graph state monitoring. I have moved existing work that shouldn't be merged before the alpha to a separate branch. I'll post CI for the proposed set shortly.
I edited the previous comment with actual links to PRs... Thanks @jacquelinekay.
I fixed some style issues. Also, there is a new bug that I didn't notice before: there seems to be a spurious segfault now. I'm not sure yet if it is caused by my changes or if my new tests are just exposing it. It happens reliably in the destructor of a "DDS::LocalObject" as a result of destroying the wait set object. This happens outside of any threading on my part. There is threading in that the DDSListener calls my function, which can trigger the graph guard condition while other things are happening in the test thread (main thread). I'm still looking into the issue, but I'm assuming it's a race condition since it only happens sometimes.
I've updated the pull requests accordingly. I still need to add more documentation.
So, now I'm running into an AssertionFailure, which comes from an assertion that I added myself.
With @jacquelinekay's help I figured out that this assertion is just wrong. The reason it was flaky is that this case could only occur if these events happened in this exact order:
This normally doesn't show up because this happens:
I wrote a test and removed the incorrect asserts in ros2/rcl#63. New CI (OpenSplice only): |
The CI looks good. I'd attribute all of the warnings and test failures to other issues. |
I would appreciate it if one or two people could review these pull requests soon. Once merged I'll create at least three tickets to track other issues:
Just to clarify, it's every link in the description of this ticket, right? Do you have a suggested order? |
I typically use the open and connected PRs in waffle.io, but here's a quick list anyways:
I'm looking into a segfault on Linux. It seems to be flaky and/or cannot be reproduced on OS X.
I am able to reproduce the issue now (very infrequently) and I have a theoretical fix. One of the tests on Windows is hanging on shutdown. I am rebuilding it in Debug mode to figure out why it's blocked. |
I fixed the hanging issue on Windows. It happened because the OpenSplice threads had already been stopped by the time the graph listener's waitset was destroyed, so destruction ended up hanging. New CI:
I addressed this warning by renaming the branch. Otherwise CI looks good, and I have +1's, so I'm going to merge. Thanks everyone for the reviews.
There are a few lingering tasks, but they are all ticketed separately. The bulk of the work is done, so I'm going to finally close this. |
This is a multi-part task:

- Add `wait_for_service` to `rmw`:
  - New API: new api to support wait_for_service rmw#65
- Add `wait_for_service` to `rcl`:
  - Clarification of wait behavior: Clarify wait concurrency rcl#59
  - New API: add graph functions to support wait_for_service rcl#57
- Add `wait_for_service` to `rclcpp`:
  - New API: add ability to wait for a service server to be available rclcpp#221
  - (withdrawn) Implement wait_for_service rclcpp#201 (pre-existing issue)
- Add `wait_for_service` to `rclpy`: (not needed since `rclpy` doesn't yet support services)
- Add `wait_for_service` to each rmw implementation:
  - `rmw_opensplice_cpp`: Impl: support wait_for_service rmw_opensplice#137
  - `rmw_connext_cpp` and `rmw_connext_dynamic_cpp`: Impl: support wait_for_service rmw_connext#165
  - `rmw_fastrtps_cpp`: Stubs so things compile: stub functions for wait_for_service rmw_fastrtps#31
- Use `wait_for_service` to make the tests less flaky in `system_tests`
- Use `wait_for_service` to make the tests less flaky in `examples`
- add missing macro ROSIDL_GET_SRV_TYPE_SUPPORT to introspection type support rosidl#140

The challenge will be finding a portable way to check this in each rmw implementation.