Draft of article about node life cycle design #34
Conversation
that arrives on topics will not be read. Any service requests will not be answered.

[To discuss: All of the above, and additionally should data/service requests be
I'm not very inclined towards queuing data/service requests while the node is not active. At least not make it a default behavior, as I think it can cause more harm than good.
Consider for instance a node that queues a request and gets activated long after, at a time in which the request no longer makes sense, and can potentially cause undesired behavior. To prevent this, the burden is on the implementer to check request timestamps and other meaningful data, and to explicitly reject invalid ones. Although this might seem OK for a power user, I think it's too much to ask to everybody.
So, -1 for making it the default, and +0 for it being a mechanism that can be explicitly enabled. Maybe going over some examples can justify the added complexity and turn this into a +1.
I would expect a service call to an inactive node to fail immediately.
Service requests are not queued, the caller normally just waits. The wait_for_service()
interface should wait until an active node advertises the named service.
While a node is not actively processing data it should still be possible to call services on it. E.g. orchestration tools might want to interact with nodes beforehand to change their configuration.
A distinction needs to be made between infrastructure service calls and functionality service calls. The latter should not be available when a node is inactive. The former should.
We also want to dog food, and so I'd prefer to not have "special" services. Instead I'd rather just have a version of Node which doesn't allow you to create a service in certain states and then you can prevent processing incoming services by not spinning the executor. Basically, I think that on the wire and in as much of the code stack as possible there should be nothing special about infrastructure service calls and normal ones.
The goal to keep all services as much the same as possible is a good one. But the need I stated above still exists.
If all services can be called while the node is in the inactive state, then it is not really safe to reconfigure the node in this state. A service call made by another (active) node just before a configuration parameter changes might return a different result to one made just after the change, for example. The response time of service calls may also not be guaranteable.
On the other hand, if no services can be called while the node is inactive, then you have a jammed node because you can't call the service that activates it.
Off the top of my head, I can see two solutions to this problem. The first is hinted at in your reply: Control whether execution time is available for processing service requests or not. (This is originally what I intended.) This would need to distinguish between functionality services and infrastructure services, so that execution time can be provided for infrastructure service requests. Whether or not the same executor provides time for both would need to be decided.
The second is to shift all the relevant infrastructure stuff to something that can service the requests, such as the executor.
It's worth noting here a couple of interesting things from the Orocos life cycle. The first is that Orocos allows for the source of execution time for handling service calls to be chosen from either the caller or the callee (the default). Obviously this is only possible when two components are running in the same process, which is the major use case of Orocos. It's an interesting concept, though. The other thing, which may or may not be true any more (I can't find it in the manual, but it was definitely there a few years ago) is that service calls are still responded to when the component is in the Stopped state.
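The infrastructure/functionality split discussed in this thread could be sketched as follows. This is a minimal Python sketch with hypothetical names (`ManagedNode`, `ServiceKind`, `register_service`); it is not an actual ROS API, just an illustration of the behaviour being debated: functionality services fail immediately while the node is inactive, while infrastructure services (lifecycle control, configuration) are always answered.

```python
from enum import Enum


class State(Enum):
    CREATED = 1
    INACTIVE = 2
    ACTIVE = 3


class ServiceKind(Enum):
    INFRASTRUCTURE = 1  # e.g. lifecycle control; always answered
    FUNCTIONALITY = 2   # only answered while the node is Active


class ManagedNode:
    def __init__(self):
        self.state = State.CREATED
        self._services = {}  # name -> (kind, handler)

    def register_service(self, name, kind, handler):
        self._services[name] = (kind, handler)

    def call(self, name, *args):
        kind, handler = self._services[name]
        # Functionality services fail immediately unless the node is Active;
        # infrastructure services run in any state.
        if kind is ServiceKind.FUNCTIONALITY and self.state is not State.ACTIVE:
            raise RuntimeError(
                f"service '{name}' unavailable in state {self.state.name}")
        return handler(*args)
```

On the wire these calls could still look identical, as @wjwwood argues; the distinction here lives only in when the node grants them execution time.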
Thanks @gbiggs for putting this down in writing, and I'm sorry it has taken me so long to review it. I'll be going over it now and commenting inline if I see anything, but one general comment before I get started is that we normally do one sentence per line rather than wrapping at 80 characters or some similar line length limit with these articles. The rationale being that it makes it easier to change a paragraph and easier to review that change through a diff, because it avoids rewrapping the text manually. I'd appreciate it if you could change to that style at some point.
This is used to prepare the node for execution. This may include acquiring resources that are only held while the node is actually active, such as access to hardware. One common action that may be performed in `onActivated` is adding the node to an executor so that it receives execution time.
I'm not sure about this last sentence. At least the way I envisioned the system, the user would not explicitly write an onActivate method which creates an executor and adds itself to it. Instead the system would own the executor and add the node to it whenever it needed to, and the user indicated which executor to use as a configuration of the system. Maybe that's what you mean here, but it's not clear to me who is performing this "common action".
I'm not sure what I was intending with that last line. I agree with you that executors should be owned by something else ("the system" will do for now) and nodes added to them.
One of my biggest complaints about the RTC specification is that it has a bizarre ownership relationship between components and execution contexts. It makes no sense and restricts how tools can be structured.
Other than a few minor comments, this looks like a good start to me. I notice that there is no mention of a configuration state, which is something that OROCOS has, for example. @gbiggs do you have any feedback on that choice? It's not clear to me if that state is explicitly needed, since you can always go to

I think I need to digest the section on composite managed nodes a little bit more. I believe it makes sense, but maybe I'll come back with more feedback later on. Thanks again @gbiggs for getting this ball rolling.
It's there, with a different name. In this proposal, the initialise event transitions a component from the Created to Inactive state. In Orocos RTT speak, one would say that the configure event transitions a component from the PreOperational to the Stopped state. The Created, Inactive and Active states of this proposal map to RTT's PreOperational, Stopped and Running states, respectively.
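The state-name correspondence described above can be written out as a small lookup. The table itself comes from the comment; the helper function name is made up for illustration:

```python
# Mapping between this proposal's states and Orocos RTT's states,
# as given in the comment above.
PROPOSAL_TO_RTT_STATES = {
    "Created":  "PreOperational",
    "Inactive": "Stopped",
    "Active":   "Running",
}


def to_rtt_state(proposal_state):
    """Translate a state name from this proposal into RTT terminology."""
    return PROPOSAL_TO_RTT_STATES[proposal_state]
```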
Thanks @adolfo-rt this is very helpful. That's similar to what I assumed, but I'm glad you cleared it up for sure.
Thanks for the comments, @wjwwood. No need to apologise about taking time to get to it; I know how busy you people are doing three jobs at once. I also have been getting behind. There are some very good comments from @adolfo-rt that I need to incorporate. I'll fix the line wrapping, the enums, and @adolfo-rt 's comments in the next few days and update the pull request. I will also try to clarify the state behaviours more. As @adolfo-rt said, the created, inactive and active states map pretty much one for one to RTT's states. Created is "made but not configured", and inactive is a state where the component is not actively performing actions and so can be altered, e.g. configuration parameters. The section on composite managed nodes is very much a work in progress. I think that the basic goals and concepts are good, but we need to do some experimenting to figure out how they should be put into practice. One major point that needs to be determined is whether composition is a run-time concept or a design-time concept. It is reasonable to argue that the nodes as deployed do not need to physically be composed, provided that the system reflects the designed composition.
@gbiggs thanks for putting this together. Looking over my notes on the subject, your proposal is very similar to the Orocos TaskContext state machine. At the link above from @adolfo-rt, the major differences I see are: your system has an additional "destroyed" state which is not explicit in theirs, and the Orocos TaskContext also optionally differentiates Running from Active, where Active has initialized connections but the update loop is not running yet, and I think that data callbacks are also not initialized. The Active vs Running differentiation seems valuable for high-reliability situations such as realtime controller swapping, where you actually want to know that all communication channels have been set up before starting operations. Having reviewed that, I was considering proposing to simply use the same model as the Orocos TaskContext. From talking with Orocos developers, they seem happy with the model and do not express any concerns about its functionality.
My proposal comes from long experience with the RTC spec, which is very similar to the Orocos 2 design. That's why my proposal is similar. :) The "destroyed" state is more of an abstract concept. Coming from experience with needing to make completely agnostic models, I tend to include things that are irrelevant at the implementation level. The "destroyed" state in a C++ implementation could mean the destructor has been called, in which case the object no longer even exists. It is useful to have the concept, though, as tools can rely on it (e.g. displaying destroyed nodes rather than them just vanishing from a list) in a more common way. The Orocos state life cycle includes the concept that you can have a component that is providing functionality without actively working. Assuming I've got it right, this means that you can call services and, if you use your own processing time, you will get a response, but the component will not have any of its hooks called (including, as you mention, callbacks for data being available). I see some value in being more black and white. A component is either available to provide functionality, in which case it is doing its own active processing as well as responding to service requests, or it is not providing any functionality, and it can be reconfigured, have connections changed, etc. without fear. You still get the benefit of knowing that the component is actually ready for use before you start it running, while also simplifying the life cycle model a little. Having said that, I can't really make any complaints against using the Orocos TaskContext model. It's a good model and it works well.
I agree with this statement, and the Orocos RTT devs did as well when they transitioned from RTT 1.x to 2.x. In the 1.x series, there was a distinction between

The state machine proposed in this document mostly differs from that of RTT in how error states are managed. These are the most relevant differences I can identify:
sorry for being so late to this discussion. recreating some of my questions from the ng-ros group. so far the described system is consistent with our overall design.

As I understand it, "configure" so far in this discussion relates to an event moving the node between two states. This is separate from the concept of configure as a loading/re-loading of parameters or configuration values, yes? (I like to be very clear on terms)

Has there been any thought as to which state publishers and subscribers should be created in, in the general case? For example in our systems, the publishers are created in onInitialize ("initialize" for us) and subscribers in the onActivated ("start()" for us) methods.

As to the Exception state, does this have an implication on nodes throwing exceptions themselves as a design pattern? Every so often I re-investigate using exceptions vs returning error codes, and so far error codes still wins the philosophical battle.

for information, here are rframe Module states:
Correct. In this proposal, this refers to
Yes. This is typically done in the
I would typically create them on the

Since I'm not expecting any callback processing or publishing until the
Generally speaking, I imagine you could still do that. I could maybe see some issues if the node is subject to real-time constraints, as setting up subscribers may introduce non-determinism in an otherwise deterministic code path.
So far the discussion of the Exception state was in relation to the Orocos RTT lifecycle. There is no explicit mention of exceptions in this draft. You can consider them an 'implementation detail' of the node. Also, consider that languages like C don't even have a built-in notion of exceptions.
+1 Overall, this is a good proposal. I am unclear about child node states. I see some potential uses at the cost of considerable additional complexity. I probably cannot evaluate that properly without some significant prototyping. Perhaps some statement of the rationale would help.
I have updated the pull request with many small changes. Apologies to those looking at the diff, but as per @wjwwood 's sensible request, I have unwrapped the lines. The major changes not mentioned in the discussion above are:
Any service requests to a node in the inactive state will not be answered (to the caller, they will fail immediately).

Entering the `inactive` state from the `created` state requires running the node's `onInitialised` method. The node uses this to set up any resources it must hold throughout its life (irrespective of whether it is active or inactive). As examples, such resources may include topic publications and subscriptions, memory that is held continuously, and initialising configuration parameters.
How would one go about node re-initialisation, i.e. calling `onInitialise` on an already `Inactive` node? I see three main alternatives:

1. Add a `cleanup` transition from `inactive` to `created` (à la Orocos RTT). The associated `onCleanup` method would typically undo what `onInitialise` previously did, like deallocate buffers, destroy publishers and subscribers, etc.
2. Allow `onInitialise` to trigger a self-transition from/to `Inactive`. This delegates custom cleanup logic to `onInitialise` (if required).
3. Destroy the node and re-create it. Then initialise.

I like 1. the most, as 2. does the same, but in a more implicit, less clear way. Alternative 3. is overkill IMO.
My image is that onInitialise does stuff that gets set up once during the node's life cycle. For stuff that should be undone and redone multiple times during a node's life, onActivate/onDeactivate is available. The alternative of 1 doesn't really bother me, and it does make it easier to separate slow-to-set-up and fast-to-set-up stuff.
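The once-only vs. repeatable split described here, together with the optional `cleanup` transition from alternative 1, can be sketched like this. Method and attribute names are hypothetical, and a plain string holds the state; the sketch only illustrates the division of labour between the hooks, not any real API:

```python
class ManagedNode:
    """Sketch: slow, once-only setup in initialise/cleanup; fast,
    repeatable setup in activate/deactivate."""

    def __init__(self):
        self.state = "created"
        self.slow_resources = None  # set up once: buffers, publishers, ...
        self.fast_resources = None  # set up per activation: hardware, ...

    def initialise(self):
        assert self.state == "created"
        self.slow_resources = "allocated"
        self.state = "inactive"

    def activate(self):
        assert self.state == "inactive"
        self.fast_resources = "acquired"
        self.state = "active"

    def deactivate(self):
        assert self.state == "active"
        self.fast_resources = None  # undone and redone many times
        self.state = "inactive"

    def cleanup(self):
        # Alternative 1 from the thread: undo initialise, back to 'created'.
        assert self.state == "inactive"
        self.slow_resources = None
        self.state = "created"
```

With the `cleanup` transition present, re-initialisation is just `cleanup()` followed by `initialise()`, with no node destruction required.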
Just a semantic nitpick: `onInitialised` sounds like the name of an asynchronous callback function that happens when the node gets initialized. This line makes it seem like the programmer calls the function `onInitialised` to enter the `inactive` state, but I would expect such a function to be called `initialise`. (And so on for the other transition functions.)

edit: Maybe I should have read the whole article before commenting :)
The line is really badly written. I blame aeroplane air. `onInitialised` is called by the node manager in response to some tool shifting the node from `created` to `inactive`. It is only called once. Think of it as a constructor.
Having a cleanup/Initialize for slow/fast set up is really a must have IMO. +1 to @adolfo-rt. Alternatively, you could add an intermediate PreConfigured state (RTT's PRE_OPERATIONAL)
OK. Composite nodes can be considered something that depends on the node lifecycle. I'd still want to bring up the subject again, in a separate document.
It's something I have a lot to say about, too, but I think it needs to be done separately. I also think we should get the node life cycle done first.
These are two valid criteria to choose what goes in

Having the ability to transition from
General comment based on the last days' discussions, hopefully on topic. It seems that whenever a new feature or capability is desired, the response is to add something. Add, add, add. A new service, a new topic, a new message, a new tool. This seems to be a common ros1 design pattern. This adds complexity and overhead to the system. So ask how ros2 can do more with the same resources, or less. Maybe, for example, node control is not handled by services in the node, but as a service or other command to the "executor" with the node name as a parameter. What should be the job of the node itself, vs the job of the thing that runs the node? Our 4d/RCS based autonomous systems have 3 default channels total per module: command, status ("latched"), and request (set params etc, when the realtime loop has extra cycles). Request is the only queued input. How many topics/services does a ros1 node have by default, just to start the node?
@BobDean The complexity can cut both ways. If you have one service per executor, then in order to change the state of a node, you need to know the node name and the location of the service which can act on it. This perhaps reduces the number of topics and services, but it also complicates the code which needs to use the interface. Part of this depends on how we do Services, which at the moment (and in ROS 1) are not anonymous, i.e. you need the name of the node serving the Service to call it.
Well, technically, you only need the name of the service. But, for things like this it is typically in the node's namespace, so you do need to know that name.
I think you're right that we should either support it or explicitly state that it is not supported. My intuition is to explicitly not support it right now until a fully thought out proposal is suggested. I think this because it isn't clear to me that this is a required feature nor do I think that it is trivial to add since we'd need to consider things like:
That's just the open questions that I have off the top of my head. It is, however, useful to consider this when thinking about questions like:
I've added a section explicitly excluding extensions.
@gbiggs I've submitted a PR against your branch at: https://github.com/gbiggs/design/pull/1
So ... you've elected to not include a non-fatal Exception state :(
@doudou I agree that we need to have a way to communicate that there's a problem with a planner failing. However I think that should be handled at a different level. In the case that a planner is failing to find a path to the goal, if the goal location changes or the environment model updates in the next cycle it may start succeeding. As long as the node is turning over and doing its computations as expected I think it's reasonable to classify it as active. Part of my logic is to look at it from the process launcher's point of view. In the case of an exception such as the plan failing, the launcher cannot do anything about it. The node is running fine and resetting/destroy+recreating the node should give the same result since the goal and environment have not changed. Thus the node itself is not in a failure state, but the task is unachievable. And the launcher can not do anything to improve the node's performance. It also does not have enough information to help, and would disrupt operations if it started trying to respond to the non-fatal exception, as the only thing it can do is reset or destroy the node. If the node is operating in an RPC-like mode providing a service, the result of the task can easily have the results built into the response to indicate that the planning is failing, for visibility. If the node is operating in a topic based pipeline we should develop a way to report the performance of that stage of the pipeline. Probably a concept of a task with associated inputs and outputs would be a good idea. Since any given node may support multiple tasks in parallel, some of which may be in an exception state or not.
I actually agree with @doudou that we need a separate error state: #34 (comment) I'm not sure whether or not we need a separate fatal error state, but we discussed that at length in the same inline comment thread.
@tfoote: this Exception state is much much much more useful to me than just "a planner is failing". We have different points of view, and I hope that we will agree to disagree. You advocate that nodes should handle errors (as e.g. problems with devices) themselves, or by cooperating with other nodes (by unspecified means). I, on the other hand, design nodes on purpose to fail early, communicate that to a coordination layer that can then decide what to do. "Fail early" here is a controlled process (i.e. the node knows very well in what internal state it is) and therefore does not require a full teardown, only a stop/start or stop/cleanup/configure/start. It is however important for safety that such a node goes into a non-active state (i.e. is not called for incoming data or services). While looking at the ROS launcher as an application is obviously a good thing, it is not the only application that will care about the life cycle of components. There's much more advanced coordination layers out there. Keep the need of those in mind.
I think there are several important things to consider about having exception/error states.
To me, it seems that the Orocos approach of having two separate error states is actually a useful method to supporting both design approaches. Errors that are not the fault of component functionality can be represented by one state ("fatal exception" in Orocos), and these would not be producible by the component developer's code but by the infrastructure from the library. Errors that a node has decided to report generally can be represented by another state ("exception"? "node error"?), and these would only be producible from the component developer's code. Then it is up to the component developer whether they want to handle the error themselves, or design their components to work together to handle it, or just throw it out there for a coordinator to do something about. The downside of allowing the developers to do it their own way if they like is it reduces the applicability of standard tools, and you can bet that reusing components from others that deal with errors differently to how yours do will not be fun.
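The two-error-state design suggested above might be sketched as follows. The names (`NODE_ERROR`, `FATAL_ERROR`, `NodeError`) are placeholders chosen for this illustration; the key point is that only the developer-raised error state is recoverable, while infrastructure faults are terminal:

```python
from enum import Enum, auto


class NodeError(Exception):
    """An error deliberately reported by the component developer's code."""


class State(Enum):
    INACTIVE = auto()
    ACTIVE = auto()
    NODE_ERROR = auto()   # developer-raised: recoverable
    FATAL_ERROR = auto()  # infrastructure fault: terminal


class ManagedNode:
    def __init__(self, work):
        self.state = State.INACTIVE
        self.work = work  # the node's periodic processing function

    def activate(self):
        self.state = State.ACTIVE

    def step(self):
        # Developer-raised errors land in the recoverable error state;
        # any other exception is treated as an infrastructure fault.
        try:
            return self.work()
        except NodeError:
            self.state = State.NODE_ERROR
        except Exception:
            self.state = State.FATAL_ERROR

    def recover(self):
        # Only the developer-raised error state can be left again.
        if self.state is State.NODE_ERROR:
            self.state = State.INACTIVE
            return True
        return False
```

A coordinator watching the state can then choose to restart, reconfigure, or ignore nodes in `NODE_ERROR`, while `FATAL_ERROR` always means teardown.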
@gbiggs I agree with your assessment that at some level it's a style choice. I think the new proposal will work with both styles. You can choose to handle everything yourself. Or you can do minimal to no error handling and just make the supervisor do it. I spent a good amount of time with @wjwwood going over this and we've slowly converged to a new model.

Going back to @doudou 's comments #34 (comment) about three different types of errors. Another discussion that took a while was the difference between

Reviewing the 3 types of errors. For this writeup I'll use

My arguments earlier in the thread have mostly been about

A

So clearly the

As I mentioned briefly above, we added the transition states into the graph, as it is inside the transitions that the user's registered callbacks will be invoked. Please see my updated draft. It's generally the same, but I've restructured it to follow a similar layout to the new diagram.

From @gbiggs 's comments, one thing I was thinking about is to make sure to get the error reasons and descriptions into the event flow, and probably keep some sort of queryable history of events for a component to make detecting crash loops possible. This is probably a layer above what we're talking about, as long as we make sure there's enough information in the event streams.
- If the `onError` callback succeeds the node will transition to `Unconfigured`.
  It is expected that the `onError` will clean up all state from any previous state.
  As such if entered from `Active` it must provide the cleanup of both `onDeactivate` and `onCleanup` to return success.
- If the `onShutdown` callback raises or returns any other return code the node will transition to `Finalized`
Do you mean 'onError' here?
I love the new structure, especially the addition of "transitional states". Configuring, in particular, can be a long-living process and it's great to know that's what the node is doing. I also appreciate the addition of the ErrorProcessing transitional state. I have one concern about it: I would usually assume that Exceptions during teardown transitions cause a transition to FatalError, as they are already cleanup procedures (what can one do if cleanup fails except bailing out ...) This leaves only the Active state. Thoughts?
Please follow our style guide for markdown files: https://github.com/ros2/ros2/wiki/Developer-Guide#markdown (especially adding newlines after each sentence).
The article explicitly mentions: Does this imply that the implementation of the state change API will be completely synchronous? If yes, I think this would be less than ideal for a few use cases. An asynchronous approach would be much better. For example in the following use case: A component needs to perform some long running task during the state transition. E.g. the component wants to wait for "enough" subscribers to come up before considering itself "ready". In an asynchronous design that is trivial to achieve. But with the strict callback-with-return-value approach that requires the component to block in this function (which seems undesirable to me).
@doudou I'm glad that the new structure looks better to you. RE ErrorProcessing needing to be pretty messy: My thought was that the callback could have the origin state and error code as arguments so that you can relatively easily winnow down what needs to happen. And I'd hope that the error processing could generally be written in 3 functions which basically correspond to the implementations of onCleanup, onDeactivate, and onShutdown. If each checks on the resource before cleaning up, the onError can simply chain the appropriate one, two or three required cleanup methods. We could automate that, but there are definitely use cases which that would not cover, and doing it by convention and best practice will be better than trying to enforce more specifics. I also expect that there will be many nodes written which have little or nothing in the onError callbacks, and if an exception or error is raised it will not handle it and crash out. And will rely on the main transitions not failing. And as much as I'd like to rely on user code not throwing exceptions, it's better if we make our infrastructure handle it at some level and catch it in a well documented manner. @wjwwood and I talked about also registering exception handlers for onError as well, but we haven't fleshed that idea out.

@dirk-thomas The original draft doesn't follow the conventions. I've been following them in the updates. We'll reformat before we merge, but for now during the discussion when we've had outstanding PRs I've been avoiding general reformatting.

re synchronous vs asynchronous: I'd expect the external API to be a service based interface which you could use in a synchronous or asynchronous manner. The synchronous callback system is the simplest solution. Asynchronous operations inside the transitions would rely on additional threads being available. But there's no reason that we could not implement an asynchronous API if we work out the execution model.

The blocking design would limit a given container to executing N transitions simultaneously, where N is the number of threads. And if we support long running transitions that could become a problem. We also lose the ability to catch exceptions at all. I'll try to update the document to not reference return codes but result codes.
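The chaining convention described above — `onError` inspecting the origin state and invoking the appropriate cleanup steps, each of which checks its resource before releasing it — could look roughly like this. This is a sketch only: a plain dict stands in for the node's resources, and the function names mirror the callbacks discussed in the thread rather than any real API:

```python
def on_deactivate(node):
    # Release activation-time resources, if held.
    if node.get("hardware"):
        node["hardware"] = None
    return True


def on_cleanup(node):
    # Release configuration-time resources, if held.
    if node.get("buffers"):
        node["buffers"] = None
    return True


def on_error(node, origin_state, error_code):
    """Chain the cleanup steps appropriate to where the error occurred.

    Returning True models transitioning to Unconfigured; returning False
    models falling through to Finalized.
    """
    ok = True
    if origin_state == "active":
        ok = ok and on_deactivate(node)
    if origin_state in ("active", "inactive"):
        ok = ok and on_cleanup(node)
    return ok
```

Because each step checks its resource before touching it, entering from `Active` safely performs both the `onDeactivate` and `onCleanup` work, while entering from `Inactive` skips the deactivation step.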
line 47: "In the transition states users registered callbacks will be called to execute the behavior." This is at odds with the statement in the Background that "a managed node ... can be considered a black box. This allows freedom to the node developer on how they provide the managed life cycle functionality...". Recommend that line 47 be changed to: "In the transition states, logic will be executed to determine if the transition is successful. Success or failure shall be communicated to lifecycle management software through the node management interface." In other words, if you guys want to use registered callbacks for an asynchronous implementation, that's cool. But if we decide it is simpler for our needs to use a state machine base class with a synchronous implementation, reading data from the rmw_interface or directly from DDS, that should be cool also.

@dirk-thomas I do not understand your example. Does the asynchronous nature enable ready to be achieved immediately? The time for N subscribers to reattach (to be in the not-ready state) is the same whether it is synchronous or asynchronous, yes? What is a real world example of this need?
@BobDean When an external entity triggers a state change on two components and they can not wait for each other if triggered synchronously since the first will never finish until the second is being triggered. Asynchronously they are simply being triggered both and some time later they both respond that the state transition is successful.
@dirk-thomas I think I misread your post and @tfoote 's response. I thought you were describing asynchronous behavior within the node itself. Asynchronous behavior of the service API makes sense. So if I understand it correctly, given two components A and entity E, E should not block on changing A's state.
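The deadlock argument can be made concrete with a small sketch: two components whose activation transitions wait on each other complete fine when triggered asynchronously, whereas triggering them one at a time synchronously would block until the timeout. `AsyncLifecycleClient` and `run_transition` are hypothetical names for illustration, not a real ROS API:

```python
from concurrent.futures import ThreadPoolExecutor
import threading


class Component:
    """A component whose activation waits on a peer being activated."""

    def __init__(self):
        self.state = "inactive"
        self.ready = threading.Event()
        self.peer = None

    def run_transition(self, transition):
        if transition == "activate":
            if self.peer is not None:
                self.peer.ready.set()     # signal the peer first...
            self.ready.wait(timeout=5.0)  # ...then wait to be signalled
            self.state = "active"
        return self.state


class AsyncLifecycleClient:
    """External entity E: triggers transitions without blocking on them."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def change_state(self, component, transition):
        # Returns a future immediately; E can trigger several components
        # and collect the results afterwards. A synchronous trigger of two
        # cross-waiting components would instead block until the timeout.
        return self._pool.submit(component.run_transition, transition)
```

Here E fires both activations, each component signals the other, and both futures resolve; neither trigger call ever blocks E itself.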
Given that adherence to a markdown styleguide seems to be taken seriously, I would suggest using a linter together with travis that will point out flaws. That way you can save your breath telling people to reformat their markdown, as travis + linter will make it more obvious. I have good experience using the nodejs-based mdast + mdast-lint for that purpose. I can provide a PR if you think it would be useful.
@tkruse A pull request to add linting would be highly appreciated, thanks. Probably also adding a simple command with which users can run the same check locally?
I've put in a few fixups from comments into this PR to the above draft: https://github.com/gbiggs/design/pull/3 Until gbiggs merges that you can read the current state here: https://github.com/ros2/design/blob/pr34_update2/articles/node_lifecycle.md Once that's merged in, I'd like to suggest that we merge this in its current state to give it more visibility to the community on the design page. And we'll review it again after we have a prototype implementation.
minor wording
feedback from @dirk-thomas and @BobDean
Sorry for the delay keeping up with the pull requests. As it turns out, wifi doesn't work from within a CT scanner. I second @tfoote 's motion for merging the pull request. I came here to make exactly the same proposal myself. We can do any further work through new pull requests.
Thanks everyone for the feedback. I have merged this and cleaned up the formatting to almost pass the linter in #67 as well.