
Draft of article about node life cycle design #34

Merged
merged 3 commits into from
Nov 4, 2015

Conversation

@gbiggs (Member) commented Jun 19, 2015

No description provided.

that arrives on topics will not be read. Any service requests will not be
answered.

[To discuss: All of the above, and additionally should data/service requests be


I'm not very inclined towards queuing data/service requests while the node is not active, at least not as the default behavior, as I think it can cause more harm than good.

Consider for instance a node that queues a request and gets activated long after, at a time when the request no longer makes sense and can potentially cause undesired behavior. To prevent this, the burden is on the implementer to check request timestamps and other meaningful data, and to explicitly reject invalid ones. Although this might seem OK for a power user, I think it's too much to ask of everybody.

So, -1 for making it the default, and +0 for it being a mechanism that can be explicitly enabled. Maybe going over some examples can justify the added complexity and turn this into a +1.
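
To make that burden concrete, here is a minimal sketch (standard C++ only; the Request type and the age cutoff are invented for illustration) of the staleness check every implementer would need to write if requests were queued while a node is inactive:

    #include <chrono>
    #include <deque>

    // Hypothetical queued request; the real message type is not part of this proposal.
    struct Request {
        std::chrono::steady_clock::time_point stamp;  // when the caller sent it
    };

    // On activation, drain the queue and explicitly reject requests that became
    // meaningless while the node was inactive (here: a simple age cutoff).
    void drain_queue(std::deque<Request>& queue) {
        const auto now = std::chrono::steady_clock::now();
        const auto max_age = std::chrono::seconds(5);  // arbitrary validity window
        while (!queue.empty()) {
            Request req = queue.front();
            queue.pop_front();
            if (now - req.stamp > max_age) {
                continue;  // stale: drop it instead of acting on it
            }
            // handle(req);  // only still-valid requests reach the node's logic
        }
    }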


I would expect a service call to an inactive node to fail immediately.

Service requests are not queued; the caller normally just waits. The wait_for_service() interface should wait until an active node advertises the named service.

Member

While a node is not actively processing data it should still be possible to call services on it. E.g. orchestration tools might want to interact with nodes beforehand to change their configuration.

Member Author

A distinction needs to be made between infrastructure service calls and functionality service calls. The latter should not be available when a node is inactive. The former should.

Member

We also want to dog food, and so I'd prefer to not have "special" services. Instead I'd rather just have a version of Node which doesn't allow you to create a service in certain states and then you can prevent processing incoming services by not spinning the executor. Basically, I think that on the wire and in as much of the code stack as possible there should be nothing special about infrastructure service calls and normal ones.
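
A rough sketch of that idea (the Node class and its members are invented for illustration, not a proposed API): service creation is simply refused in states where it is not allowed, so nothing on the wire distinguishes infrastructure services from normal ones:

    #include <functional>
    #include <map>
    #include <stdexcept>
    #include <string>

    enum class State { Created, Inactive, Active };

    class Node {
    public:
        using ServiceCallback = std::function<std::string(const std::string&)>;

        // Functionality services may only be created once the node has been
        // initialised; the check lives in the Node class, not on the wire.
        void create_service(const std::string& name, ServiceCallback cb) {
            if (state_ == State::Created) {
                throw std::runtime_error("services cannot be created in the Created state");
            }
            services_[name] = std::move(cb);
        }

        State state_ = State::Created;

    private:
        std::map<std::string, ServiceCallback> services_;
    };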

Member Author

The goal to keep all services as much the same as possible is a good one. But the need I stated above still exists.

If all services can be called while the node is in the inactive state, then it is not really safe to reconfigure the node in this state. A service call made by another (active) node just before a configuration parameter changes might return a different result from one made just after the change, for example. The response time of service calls may also not be guaranteed.

On the other hand, if no services can be called while the node is inactive, then you have a jammed node because you can't call the service that activates it.

Off the top of my head, I can see two solutions to this problem. The first is hinted at in your reply: Control whether execution time is available for processing service requests or not. (This is originally what I intended.) This would need to distinguish between functionality services and infrastructure services, so that execution time can be provided for infrastructure service requests. Whether or not the same executor provides time for both would need to be decided.

The second is to shift all the relevant infrastructure stuff to something that can service the requests, such as the executor.

It's worth noting here a couple of interesting things from the Orocos life cycle. The first is that Orocos allows for the source of execution time for handling service calls to be chosen from either the caller or the callee (the default). Obviously this is only possible when two components are running in the same process, which is the major use case of Orocos. It's an interesting concept, though. The other thing, which may or may not be true any more (I can't find it in the manual, but it was definitely there a few years ago) is that service calls are still responded to when the component is in the Stopped state.
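
A sketch of the first solution (the Executor and State types are stand-ins invented for this example): infrastructure and functionality callbacks live on separate executors, and only the infrastructure executor is given execution time while the node is inactive:

    #include <functional>
    #include <vector>

    // Stand-in executor: runs its registered callbacks when given execution time.
    struct Executor {
        std::vector<std::function<void()>> callbacks;
        void spin_some() {
            for (auto& cb : callbacks) {
                cb();
            }
        }
    };

    enum class State { Inactive, Active };

    // Infrastructure services (initialise/activate/...) always get execution
    // time; functionality services only run while the node is active.
    void spin_node_once(State state, Executor& infrastructure, Executor& functionality) {
        infrastructure.spin_some();
        if (state == State::Active) {
            functionality.spin_some();
        }
    }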

@tfoote tfoote added the ready Work is about to start (Kanban column) label Jun 23, 2015
@tfoote tfoote self-assigned this Jun 23, 2015
@wjwwood (Member) commented Jul 15, 2015

Thanks @gbiggs for putting this down in writing, and I'm sorry it has taken me so long to review it.

I'll be going over it now and commenting inline if I see anything, but one general comment before I get started is that we normally do one sentence per line rather than wrapping at 80 characters or some similar line length limit with these articles. The rationale is that it makes it easier to change a paragraph and easier to review that change through a diff, because it avoids rewrapping the text manually. I'd appreciate it if you could change to that style at some point.

This is used to prepare the node for execution. This may include acquiring
resources that are only held while the node is actually active, such as access
to hardware. One common action that may be performed in `onActivated` is adding
the node to an executor so that it receives execution time.
Member

I'm not sure about this last sentence. At least the way I envisioned the system, the user would not explicitly write an onActivate method which creates an executor and adds itself to it. Instead the system would own the executor and add the node to it whenever it needed to, and the user would indicate which executor to use as a configuration of the system. Maybe that's what you mean here, but it's not clear to me who is performing this "common action".

Member Author

I'm not sure what I was intending with that last line. I agree with you that executors should be owned by something else ("the system" will do for now) and nodes added to them.

One of my biggest complaints about the RTC specification is that it has a bizarre ownership relationship between components and execution contexts. It makes no sense and restricts how tools can be structured.

@wjwwood (Member) commented Jul 15, 2015

Other than a few minor comments, this looks like a good start to me.

I notice that there is no mention of a configuration state, which is something that OROCOS has, for example. @gbiggs do you have any feedback on that choice? It's not clear to me if that state is explicitly needed, since you can always go to inactive and back to active and then check to see if configuration has changed in onActivated. Maybe someone else has input on that point.

I think I need to digest the section on composite managed nodes a little bit more. I believe it makes sense, but maybe I'll come back with more feedback later on.

Thanks again @gbiggs for getting this ball rolling.

@adolfo-rt

I notice that there is no mention of a configuration state, which is something that OROCOS has, for example.

It's there, with a different name. In this proposal, the initialise event transitions a component from the Created to Inactive state. In Orocos RTT speak, one would say that the configure event transitions a component from the PreOperational to the Stopped state.

The Created, Inactive and Active states of this proposal map to RTT's PreOperational, Stopped and Running states, respectively.

@wjwwood (Member) commented Jul 17, 2015

Thanks @adolfo-rt this is very helpful. That's similar to what I assumed, but I'm glad you cleared it up for sure.

@gbiggs (Member Author) commented Jul 18, 2015

Thanks for the comments, @wjwwood. No need to apologise about taking time to get to it; I know how busy you people are doing three jobs at once. I also have been getting behind. There are some very good comments from @adolfo-rt that I need to incorporate.

I'll fix the line wrapping, the enums, and @adolfo-rt 's comments in the next few days and update the pull request. I will also try to clarify the state behaviours more.

As @adolfo-rt said, the created, inactive and active states map pretty much one for one to RTT's states. Created is "made but not configured", and inactive is a state where the component is not actively performing actions and so can be altered, e.g. configuration parameters.

The section on composite managed nodes is very much a work in progress. I think that the basic goals and concepts are good, but we need to do some experimenting to figure out how they should be put into practice.

One major point that needs to be determined is whether composition is a run-time concept or a design-time concept. It is reasonable to argue that the nodes as deployed do not need to physically be composed, provided that the system reflects the designed composition.

@tfoote (Contributor) commented Jul 19, 2015

@gbiggs thanks for putting this together. Looking over my notes on the subject, your proposal is very similar to the Orocos TaskContext state machine. At the link above from @adolfo-rt, the major differences I see are that your system has an additional "destroyed" state, which is not explicit in theirs, and that the Orocos TaskContext also optionally differentiates Running from Active, where Active has initialized connections but the update loop is not running yet and, I think, data callbacks are also not initialized.

The Active vs Running differentiation seems valuable for high-reliability situations such as realtime controller swapping, where you actually want to know that all communication channels have been set up before starting operations.

Having reviewed that, I was considering proposing that we simply use the same model as the Orocos TaskContext. From talking with Orocos developers, they seem happy with the model and do not express any concerns about its functionality.

@gbiggs (Member Author) commented Jul 28, 2015

My proposal comes from long experience with the RTC spec, which is very similar to the Orocos 2 design. That's why my proposal is similar. :)

The "destroyed" state is more of an abstract concept. Coming from experience with needing to make completely agnostic models, I tend to include things that are irrelevant at the implementation level. The "destroyed" state in a C++ implementation could mean the destructor has been called, in which case the object no longer even exists. It is useful to have the concept, though, as tools can rely on it (e.g. displaying destroyed nodes rather than them just vanishing from a list) in a more common way.

The Orocos state life cycle includes the concept that you can have a component that is providing functionality without actively working. Assuming I've got it right, this means that you can call services and, if you use your own processing time, you will get a response, but the component will not have any of its hooks called (including, as you mention, callbacks for data being available).

I see some value in being more black and white. A component is either available to provide functionality, in which case it is doing its own active processing as well as responding to service requests, or it is not providing any functionality, and it can be reconfigured, have connections changed, etc. without fear. You still get the benefit of knowing that the component is actually ready for use before you start it running, while also simplifying the life cycle model a little.

Having said that, I can't really make any complaints against using the Orocos TaskContext model. It's a good model and it works well.

@adolfo-rt

I see some value in being more black and white.

I agree with this statement, and the Orocos RTT devs did as well when they transitioned from RTT 1.x to 2.x. In the 1.x series, there was a distinction between Active and Running (see figure from RTT 1.12 docs). Since version 2.0 (dating back to 2010), the component state machine dropped the Active state (see figures from RTT 2.x docs).

The state machine proposed in this document mostly differs from that of RTT in how error states are managed. These are the most relevant differences I can identify:

  • In RTT 2.x, Running has two substates: Running and RunTimeError. I'm fine with Geoff's suggestion of considering known error handling part of this proposal's Active state. On the one hand, this simplifies the standard node lifecycle; on the other hand, it gives power devs freedom to implement whatever strategies they consider appropriate. Anecdotally, I've used this RTT feature infrequently, and when I did, it would have been just as easy to implement equivalent custom logic.
  • In the RTT 2.x lifecycle, any state can transition to FatalError. This is quite obvious from this figure, where FatalError is pictured dangling off to one side. For the current proposal, I see no significant loss of expressivity in allowing transitions to FatalError only from the Inactive and Active states. Further, the lifecycle diagram becomes clearer and simpler to understand. No need to read the fine print to figure it out.
  • In the RTT 2.x lifecycle, there's the Exception state, which is entered whenever an exception goes uncaught. Apart from being implementation-specific, as not all languages support exceptions, I've found this state counterproductive at times. Although it prevents an application from crashing, you might not always be able to easily (and automatically) bring the system back to a running state, and when debugging (not in production), it prevents you from getting a stack trace of the crash that didn't take place.

@BobDean commented Aug 3, 2015

Sorry for being so late to this discussion. Recreating some of my questions from the ng-ros group. So far the described system is consistent with our overall design.

As I understand it, "configure" so far in this discussion relates to an event moving the node between two states. This is separate from the concept of configure as a loading/re-loading of parameters or configuration values, yes? (I like to be very clear on terms)

Has there been any thought as to which state publishers and subscribers should be created in the general case? For example in our systems, the publishers are created in onInitialize ("initialize" for us) and subscribers in the onActivated ("start()" for us) methods.

As to the Exception state, does this have an implication on nodes throwing exceptions themselves as a design pattern? Every so often I re-investigate using exceptions vs returning error codes, and so far error codes still win the philosophical battle.

For information, here are the rframe Module states:

    /** state of module */
    typedef enum
    {
        UNDEFINED = 0,      /**< module is in unknown state */
        UNINITIALIZED = 1,  /**< module is constructed but not initialized */
        INITIALIZED = 2,    /**< module is initialized */
        RUNNING = 3,        /**< module is started and running */
        STOPPED = 4,        /**< module is stopped */
        SHUTDOWN = 5,       /**< module has shutdown */
        FAIL = 6,           /**< module execution has failed */
        STATEENUM_MAX = 8   /**< maximum value for this enum definition */
    } StateEnum;

@adolfo-rt

As I understand it, "configure" so far in this discussion relates to an event moving the node between two states.

Correct. In this proposal, this refers to the Initialise transition, which transitions between the Created and Inactive states. Configure is the name used by Orocos RTT. At some point it may make sense to call a vote for which term feels more natural or familiar to users. I don't have a strong preference. Also, it is my understanding that ROS uses American English spelling, so we should at least switch to that.

This is separate from the concept of configure as a loading/re-loading of parameters or configuration values, yes?

Yes. This is typically done in the Initialise transition, but you might expect some nodes to allow parameter (re)loading elsewhere, e.g., when using dynamically reconfigurable parameters.

Has there been any thought as to which state publishers and subscribers should be created in the general case?

I would typically create them on the Initialise transition, along with other resource-allocating or heavy configuration steps. I like to leave Activate as simple and fast as possible (a no-op if possible). This is my personal taste though.

Since I'm not expecting any callback processing or publishing until the Active state, I don't expect there to be issues with construction order of publishers and subscribers.

For example in our systems, the publishers are created in onInitialize ("initialize" for us) and subscribers in the onActivated ("start()" for us) methods.

Generally speaking, I imagine you could still do that. I could maybe see some issues if the node is subject to real-time constraints, as setting up subscribers may introduce non-determinism in an otherwise deterministic code path.
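
For concreteness, a sketch of the split described above (the hook names follow the draft; Publisher and Subscriber are placeholder types, not a real API): heavy, allocating setup happens once in onInitialise, and activation reduces to flipping a flag:

    #include <memory>
    #include <string>

    struct Publisher { std::string topic; };   // placeholder
    struct Subscriber { std::string topic; };  // placeholder

    class ExampleNode {
    public:
        // Slow, resource-allocating work on the Created -> Inactive transition.
        void onInitialise() {
            pub_ = std::make_unique<Publisher>(Publisher{"output"});
            sub_ = std::make_unique<Subscriber>(Subscriber{"input"});
        }

        // Activation stays as simple and fast as possible (ideally a no-op).
        void onActivated() { active_ = true; }
        void onDeactivated() { active_ = false; }

    private:
        std::unique_ptr<Publisher> pub_;
        std::unique_ptr<Subscriber> sub_;
        bool active_ = false;
    };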

As to the Exception state, does this have an implication on nodes throwing exception themselves as a design pattern?

So far the discussion of the Exception state was in relation to the Orocos RTT lifecycle. There is no explicit mention of exceptions in this draft. You can consider them an 'implementation detail' of the node. Also, consider that languages like C don't even have a built-in notion of exceptions.

@jack-oquin

+1 Overall, this is a good proposal.

I am unclear about child node states. I see some potential uses at the cost of considerable additional complexity. I probably cannot evaluate that properly without some significant prototyping.

Perhaps some statement of the rationale would help.

@gbiggs (Member Author) commented Aug 4, 2015

I have updated the pull request with many small changes. Apologies to those looking at the diff, but as per @wjwwood 's sensible request, I have unwrapped the lines.

The major changes not mentioned in the discussion above are:

  • I have added a glossary. I have been a bit sloppy with terminology in the document and discussion, so I think a glossary should help us all talk about the same thing.
  • I have removed the section about composite nodes. Because a composite node should still look like a node from the outside (i.e. looking just from the life cycle/managed node interface, there should not be a difference between a monolithic node and a composite node), I think we should concentrate on getting a node life cycle down. Then we can worry about what a composite node is and how child states relate to parent states, etc.


Any service requests to a node in the inactive state will not be answered (to the caller, they will fail immediately).

Entering the `inactive` state from the `created` state requires running the node's `onInitialised` method. The node uses this to set up any resources it must hold throughout its life (irrespective of if it is active or inactive). As examples, such resources may include topic publications and subscriptions, memory that is held continuously, and initialising configuration parameters.


How would one go about node re-initialisation, i.e., calling onInitialise on an already Inactive node? I see three main alternatives:

  1. Add a cleanup transition between inactive to created (à la Orocos RTT). The associated onCleanup method would typically undo what onInitialise previously did, like deallocate buffers, destroy publishers and subscribers, etc.
  2. Allow onInitialise to trigger a self-transition from/to Inactive. This delegates custom cleanup logic to onInitialise (if required).
  3. Destroy the node and re-create it. Then initialise.

I like 1. the most, as 2. does the same, but in a more implicit, less clear way. Alternative 3. is overkill IMO.
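
A sketch of alternative 1 (the hook names are modelled on the draft and on Orocos RTT; everything else is illustrative): onCleanup undoes exactly what onInitialise did, so the node can cycle inactive -> created -> inactive for a full re-initialisation without being destroyed:

    #include <memory>
    #include <vector>

    class ReinitialisableNode {
    public:
        // Created -> Inactive: allocate buffers, create publishers/subscribers, etc.
        void onInitialise() {
            buffer_ = std::make_unique<std::vector<double>>(1024);
        }

        // Inactive -> Created: the inverse of onInitialise, so that a fresh
        // onInitialise can pick up new configuration afterwards.
        void onCleanup() {
            buffer_.reset();
        }

    private:
        std::unique_ptr<std::vector<double>> buffer_;
    };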

Member Author

My image is that onInitialise does stuff that gets set up once during the node's life cycle. For stuff that should be undone and redone multiple times during a node's life, onActivate/onDeactivate is available. The alternative of 1 doesn't really bother me, and it does make it easier to separate slow-to-set-up and fast-to-set-up stuff.

Contributor

Just a semantic nitpick: onInitialised sounds like the name of an asynchronous callback function that happens when the node gets initialized. This line makes it seem like the programmer calls the function onInitialised to enter the inactive state, but I would expect such a function to be called initialise. (And so on for the other transition functions.)

edit: Maybe I should have read the whole article before commenting :)

Member Author

The line is really badly written. I blame aeroplane air.

onInitialised is called by the node manager in response to some tool shifting the node from created to inactive. It is only called once. Think of it as a constructor.


Having a cleanup/initialise pair for slow/fast setup is really a must-have IMO. +1 to @adolfo-rt. Alternatively, you could add an intermediate PreConfigured state (RTT's PRE_OPERATIONAL).

@adolfo-rt

I have removed the section about composite nodes.

OK. Composite nodes can be considered something that depends on the node lifecycle. I'd still want to bring up the subject again, in a separate document.

@gbiggs (Member Author) commented Aug 4, 2015

It's something I have a lot to say about, too, but I think it needs to be done separately. I also think we should get the node life cycle done first.

@adolfo-rt

My image is that onInitialise does stuff that gets set up once during the node's life cycle. For stuff that should be undone and redone multiple times during a node's life, onActivate/onDeactivate is available. The alternative of 1 doesn't really bother me, and it does make it easier to separate slow-to-set-up and fast-to-set-up stuff.

These are two valid criteria to choose what goes in onInitialise vs onActivated:

  • How many times an operation gets done in the component lifetime: once vs. many.
  • Computation time / determinism: slow vs. fast.

Having the ability to transition from inactive to created would allow more freedom to choose the roles of these hooks. I'm already used to working like this, but I'd like to hear if there are any counter-arguments.

@BobDean commented Aug 4, 2015

General comment based on the last days discussions, hopefully on topic.

It seems that whenever a feature or capability is desired, the response is to add something. Add, add, add. A new service, a new topic, a new message, a new tool. This seems to be a common ros1 design pattern. This adds complexity and overhead to the system. So ask how ros2 can do more with the same resources, or less.

Maybe, for example, node control is not handled by a service in the node, but as a service or other command to the "executor" with the node name as a parameter. What should be the job of the node itself, vs the job of the thing that runs the node?

Our 4d/RCS based autonomous systems have 3 default channels total per module: command, status ("latched"), and request (set params etc., when the realtime loop has extra cycles). Request is the only queued input. How many topics/services does a ros1 node have by default, just to start the node?


@wjwwood (Member) commented Aug 4, 2015

@BobDean The complexity can cut both ways. If you have one service per executor, then in order to change the state of a node you need to know the node name and the location of the service which can act on it. This perhaps reduces the number of topics and services, but it also complicates the code which needs to use the interface. Part of this depends on how we do Services, which at the moment (and in ROS 1) are not anonymous, i.e. you need the name of the node serving the Service to call it.

@jack-oquin

Well, technically, you only need the name of the service. But, for things like this it is typically in the node's namespace, so you do need to know that name.

@wjwwood (Member) commented Sep 22, 2015

Possibly such custom states could be declared as non-goal. Else it's a flaw.

I think you're right that we should either support it or explicitly state that it is not supported. My intuition is to explicitly not support it right now until a fully thought out proposal is suggested. I think this because it isn't clear to me that this is a required feature nor do I think that it is trivial to add since we'd need to consider things like:

  • how to define custom states so that they can be identified globally by external nodes (is that even necessary)
    • is it a unique string name or an id; what happens if they collide across nodes
  • how does a supervisor react to custom states or states it has never encountered before
  • what happens when a node is asked to transition to a state that it does not have
  • Can custom states override the built-in, common states?
    • If no, does that mean all custom states have to transition in and out of the active state?

Those are just the open questions I have off the top of my head. It is, however, useful to consider this when thinking about questions like:

  • when and how can a node make transitions on its own, or how does it signal the supervisor that it needs to transition
  • what is the structure of the life cycle events? are they extensible?
  • how are states and transitions defined? is that definition extensible?

@tfoote (Contributor) commented Oct 21, 2015

I've added a section explicitly excluding extensions.

@tfoote (Contributor) commented Oct 21, 2015

@gbiggs I've submitted a PR against your branch at: https://github.com/gbiggs/design/pull/1

@tfoote tfoote added in review Waiting for review (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Oct 21, 2015
@doudou commented Oct 22, 2015

So ... you've elected to not include a non-fatal Exception state :(

@tfoote (Contributor) commented Oct 22, 2015

@doudou I agree that we need to have a way to communicate that there's a problem with a planner failing. However, I think that should be handled at a different level. In the case that a planner is failing to find a path to the goal, if the goal location changes or the environment model updates in the next cycle, it may start succeeding. As long as the node is turning over and doing its computations as expected, I think it's reasonable to classify it as active.

Part of my logic is to look at it from the process launcher's point of view. In the case of an exception such as the plan failing, the launcher cannot do anything about it. The node is running fine, and resetting or destroying and recreating the node should give the same result, since the goal and environment have not changed. Thus the node itself is not in a failure state; the task is simply unachievable. The launcher cannot do anything to improve the node's performance, it does not have enough information to help, and it would disrupt operations if it tried to respond to the non-fatal exception, since the only things it can do are reset or destroy the node.

If the node is operating in an RPC-like mode providing a service, the response can easily carry a result indicating that the planning is failing, for visibility.

If the node is operating in a topic-based pipeline, we should develop a way to report the performance of that stage of the pipeline. A concept of a task with associated inputs and outputs would probably be a good idea, since any given node may support multiple tasks in parallel, some of which may be in an exception state while others are not.

@wjwwood (Member) commented Oct 22, 2015

I actually agree with @doudou that we need a separate error state: #34 (comment)

I'm not sure whether or not we need a separate fatal error state, but we discussed that at length in the same inline comment thread.

@doudou commented Oct 25, 2015

@tfoote: this Exception state is much much much more useful to me than just "a planner is failing".

We have different points of view, and I hope that we will agree to disagree. You advocate that nodes should handle errors (e.g. problems with devices) themselves, or by cooperating with other nodes (by unspecified means). I, on the other hand, purposely design nodes to fail early and communicate that to a coordination layer, which can then decide what to do. "Fail early" here is a controlled process (i.e. the node knows very well what internal state it is in) and therefore does not require a full teardown, only a stop/start or stop/cleanup/configure/start. It is however important for safety that such a node goes into a non-active state (i.e. is not called for incoming data or services).

While looking at the ROS launcher as an application is obviously a good thing, it is not the only application that will care about the life cycle of components. There are much more advanced coordination layers out there. Keep their needs in mind.

@gbiggs (Member Author) commented Oct 27, 2015

I think there are several important things to consider about having exception/error states.

  • Errors in a component may be caused by component functionality or by component infrastructure. The component's functionality is not likely to be able to deal with the latter, so we need a way to signal to the node infrastructure that something needs to be done, and that something will be a general action such as "recreate the component." The former, on the other hand, may be possible to deal with within the component's functionality. There appear to be two schools of design on whether these errors are exposed externally in some general way (a non-fatal exception state) or are handled by some component-specific functionality (either entirely within the component, or exposed through some component-specific interface so other components can help out).
  • There are two types of coordination going on here. One is a general "start and stop components" launcher. The other is an application-specific coordinator that knows which components are necessary, etc., and can shift between different system behaviours by coordinating the execution of a set of components. In terms of implementation, it seems to me that these are often conflated, although I know that work done by Herman's group has proposed architectures where they are not.
  • @doudou 's approach of having components "fail early" is a valid one, but there are just as valid arguments that with a general "the component has failed" approach there may be failures that the coordinator cannot handle due to lack of knowledge, but that the components can handle because they have that knowledge. I know someone at Honda R&D who has emphatically stated that his entire research team has never, in over ten years of working with component-based software, used an error state, because they consider all known errors to be a part of the system's behaviour. Again, I think this comes down to how you design your system.

To me, it seems that the Orocos approach of having two separate error states is actually a useful way to support both design approaches. Errors that are not the fault of component functionality can be represented by one state ("fatal exception" in Orocos), and these would not be producible by the component developer's code but by the infrastructure from the library. Errors that a node has decided to report generally can be represented by another state ("exception"? "node error"?), and these would only be producible from the component developer's code. Then it is up to the component developer whether they want to handle the error themselves, or design their components to work together to handle it, or just throw it out there for a coordinator to do something about.

The downside of allowing developers to do it their own way is that it reduces the applicability of standard tools, and you can bet that reusing components from others that deal with errors differently from how yours do will not be fun.
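
A sketch of that two-error-state split (the state names and API are invented for illustration): only the infrastructure can enter the fatal state, while the developer's code can only report the general, recoverable one:

    enum class LifecycleState { Inactive, Active, NodeError, FatalError };

    class NodeInfrastructure {
    public:
        // Callable from the component developer's code: a generally-reported,
        // recoverable error that a coordinator may choose to act on.
        void report_error() { state_ = LifecycleState::NodeError; }

        // Infrastructure-side wrapper around user hooks: an escaping exception
        // is an infrastructure-level failure, so only this path reaches FatalError.
        void run_user_hook(void (*hook)()) {
            try {
                hook();
            } catch (...) {
                state_ = LifecycleState::FatalError;
            }
        }

        LifecycleState state() const { return state_; }

    private:
        LifecycleState state_ = LifecycleState::Inactive;
    };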

@tfoote (Contributor) commented Oct 27, 2015

@gbiggs I agree with your assessment that at some level it's a style choice. I think the new proposal will work with both styles. You can choose to handle everything yourself. Or you can do minimal to no error handling and just make the supervisor do it.

I spent a good amount of time with @wjwwood going over this and we've slowly converged to a new model.
What we converged on was using a slightly more verbose diagram showing transition states as well as steady states.
We think this will improve visibility and debuggability in the long term since each state and transition can be defined with a simpler description.

Going back to @doudou 's comments (#34 (comment)) about three different types of errors: we need to support each of these types of errors.

Another discussion that took a while was the difference between Fatal Error and actually being destroyed.
Our final conclusion is that a node in Fatal Error remains introspectable, possibly debuggable, and is reported on the system.
Once a node has been destroyed it does not appear in the system introspection.
This behavior is important to consider when launching a node with a respawn type of behavior.
When being respawned, the supervisor node should automatically destroy the node to enable the respawn.
Otherwise it's better to leave the node in Fatal Error for visibility.

Reviewing the 3 types of errors. For this writeup I'll use RuntimeError, Error, and Fatal Error.

My arguments earlier in the thread have mostly been about RuntimeErrors, and for those we agree that they should be handled internally to the active state.
Even if the node has partially degraded performance, this should be introspected through another system, such as a standard diagnostic reporting system.

A Fatal Error is relatively easy to define.
It is an unrecoverable error which requires the node to be destroyed.

So clearly the Error is in the middle.
If we start with the statement that an Error is an error which cannot be dealt with inside the active loop, then looking at the graph, the only place for the system to provide new inputs and recover the node's operation externally is to rerun the configuration step.
Thus an Error should return the node to the state it was in before configuration.

As I mentioned briefly above, we added the transition states into the graph because it is inside the transitions that users' registered callbacks will be invoked.
Based on those return codes the transition may be different.
This allows us to explicitly call out in which states users' code will be called, and to understand where errors may occur.

Please see my updated draft. It's generally the same, but I've restructured it to follow a similar layout to the new diagram.
I also stripped the specifics of the messaging API but described the requirements.
I think we'll need to return to that once we have an implementation in hand.

Building on @gbiggs 's comments, something I was thinking about is to make sure the error reasons and descriptions get into the event flow, and probably to keep some sort of queryable history of events for a component, to make detecting crash loops possible. This is probably a layer above what we're talking about, as long as we make sure there's enough information in the event streams.
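
To make the transition-state idea concrete, here is a minimal sketch (all names invented; the draft deliberately leaves the messaging API unspecified): the node sits in a transition state while the user's registered callback runs, and the callback's result code selects the destination, with an error sending the node to error processing:

    #include <functional>

    enum class State { Unconfigured, Configuring, Inactive, ErrorProcessing };
    enum class Result { Success, Failure, Error };

    struct Lifecycle {
        State state = State::Unconfigured;
        std::function<Result()> on_configure;  // user-registered callback

        // The node sits in the Configuring transition state while the user
        // callback runs; its result code selects the destination state.
        void configure() {
            state = State::Configuring;
            const Result r = on_configure ? on_configure() : Result::Success;
            switch (r) {
                case Result::Success: state = State::Inactive; break;
                case Result::Failure: state = State::Unconfigured; break;  // retryable
                case Result::Error:   state = State::ErrorProcessing; break;
            }
        }
    };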

- If the `onError` callback succeeds the node will transition to `Unconfigured`.
It is expected that the `onError` will clean up all state from any previous state.
As such if entered from `Active` it must provide the cleanup of both `onDeactivate` and `onCleanup` to return success.
- If the `onShutdown` callback raises or returns any other return code the node will transition to `Finalized`

Do you mean 'onError' here?

@doudou commented Oct 28, 2015

I love the new structure, especially the addition of "transitional states". Configuring, in particular, can be a long-running process, and it's great to know that's what the node is doing.

I also appreciate the addition of the ErrorProcessing transitional state. I have one concern about it:
the any-to-ErrorProcessing mechanism you propose sounds impractical to me (it will make the onError callback a living hell).

I would usually assume that configure() and start() are already transitions that can fail, and that therefore they must be implemented in an exception-safe way. I.e., an exception in onConfigure puts the component's internal code state back into Unconfigured, and therefore one can retry Configuring. This is NOT how RTT does it (I have to admit), but it is usually how I've seen people code.

Exceptions during teardown transitions cause transition to FatalError as they are already cleanup procedures (what can one do if cleanup fails except bailing out ...)

This leaves only the Active state.

Thoughts?

@dirk-thomas (Member)

Please follow our style guide for markdown files: https://github.com/ros2/ros2/wiki/Developer-Guide#markdown (especially adding newlines after each sentence).

@dirk-thomas (Member)

The article explicitly mentions: In the transition states users registered callbacks will be called to execute the behavior. The return code of the callbacks will determine the resulting transition.

Does this imply that the implementation of the state change API will be completely synchronous? If yes, I think this would be less than ideal for a few use cases. An asynchronous approach would be much better.

For example in the following use case: A component needs to perform some long running task during the state transition. E.g. the component wants to wait for "enough" subscribers to come up before considering itself "ready". In an asynchronous design that is trivial to achieve. But with the strict callback-with-return-value approach that requires the component to block in this function (which seems undesirable to me).

@tfoote (Contributor) commented Oct 28, 2015

@doudou I'm glad that the new structure looks better to you.

RE ErrorProcessing needing to be pretty messy: My thought was that the callback could have the origin state and error code as arguments, so that you can relatively easily winnow down what needs to happen. And I'd hope that the error processing could generally be written in 3 functions which basically correspond to the implementations of onCleanup, onDeactivate, and onShutdown. If each checks on the resource before cleaning up, the onError can simply chain the appropriate one, two, or three required cleanup methods. We could automate that, but there are definitely use cases that would not be covered, and doing it by convention and best practice will be better than trying to enforce more specifics.

I also expect that there will be many nodes written which have little or nothing in the onError callbacks; if an exception or error is raised, they will not handle it and will crash out, relying on the main transitions not failing. And as much as I'd like to rely on user code not throwing exceptions, it's better if we make our infrastructure handle it at some level and catch it in a well documented manner. @wjwwood and I talked about also registering exception handlers for onError, but we haven't fleshed that idea out.

@dirk-thomas The original draft doesn't follow the conventions. I've been following them in the updates. We'll reformat before we merge, but for now during the discussion when we've had outstanding prs I've been avoiding general reformatting.

re synchronous vs asynchronous: I'd expect the external API to be a service based interface which you could use in a synchronous or asynchronous manner. The synchronous callback system is the simplest solution. Asynchronous operations inside the transitions would rely on additional threads being available. But there's no reason that we could not implement an asynchronous API if we work out the execution model. The blocking design would limit a given container to executing N transitions simultaneously, where N is the number of threads. And if we support long running transitions that could become a problem. We also lose the ability to catch exceptions at all. I'll try to update the document to reference result codes rather than return codes.
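
A sketch of the onError chaining described above (the signature is hypothetical, not from the draft): the origin state and error code let the handler decide which of the normal cleanup hooks still need to run:

    enum class State { Unconfigured, Inactive, Active };
    enum class ErrorCode { ResourceLost, BadInput };

    struct ExampleNode {
        void onDeactivate() { /* stop active processing */ }
        void onCleanup() { /* release resources acquired during configuration */ }

        // Returns true on success, after which the node goes to Unconfigured.
        bool onError(State origin, ErrorCode /*code*/) {
            if (origin == State::Active) {
                onDeactivate();  // undo activation first
            }
            if (origin == State::Active || origin == State::Inactive) {
                onCleanup();     // then undo configuration
            }
            return true;
        }
    };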

@BobDean commented Oct 28, 2015

line 47: "+In the transition states users registered callbacks will be called to execute the behavior.
+The return code of the callbacks will determine the resulting transition."

This is at odds with the statement in the Background that " a managed node ... can be considered a black box. This allows freedom to the node developer on how they provide the managed life cycle functionality...".

Recommend that line 47 be changed to: "In the transition states logic will be executed to determine if the transition is successful. Success or failure shall be communicated to lifecycle management software through the node management interface."

in other words, if you guys want to use registered callbacks for an asynchronous implementation, that's cool. But if we decide it is simpler for our needs to use a state machine base class with a synchronous implementation, reading data from the rmw_interface or directly from DDS, that should be cool also.

@dirk-thomas I do not understand your example. Does the asynchronous nature enable "ready" to be achieved immediately? The time for N subscribers to reattach (i.e. to be in the not-ready state) is the same whether it is synchronous or asynchronous, yes? What is a real world example of this need?

@dirk-thomas (Member)

@BobDean When an external entity triggers a state change on two components that need to wait for each other, triggering them synchronously deadlocks: the first will never finish, because the second has not been triggered yet. Asynchronously, they are simply both triggered, and some time later they both respond that the state transition was successful.
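
A toy illustration of that point (standard C++ threads, nothing ROS-specific): each component's transition blocks until its peer is ready, so triggering them one after the other synchronously would never return, while triggering both asynchronously completes:

    #include <atomic>
    #include <thread>

    std::atomic<bool> a_ready{false};
    std::atomic<bool> b_ready{false};

    // Each transition waits for the other component before finishing.
    void transition_a() { a_ready = true; while (!b_ready) {} }
    void transition_b() { b_ready = true; while (!a_ready) {} }

    int main() {
        // Synchronous triggering deadlocks: transition_a() never returns,
        // because transition_b() has not been triggered yet.
        //   transition_a(); transition_b();

        // Asynchronous triggering: both are started, then both report back.
        std::thread ta(transition_a);
        std::thread tb(transition_b);
        ta.join();
        tb.join();
        return 0;
    }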

@BobDean commented Oct 28, 2015

@dirk-thomas I think I misread your post and @tfoote 's response. I thought you were describing asynchronous behavior within the node itself. Asynchronous behavior of the service api makes sense.

So if I understand it correctly, given a component A and an entity E, E should not block on changing A's state.

@tkruse commented Oct 29, 2015

Given that adherence to a markdown styleguide seems to be taken seriously, I would suggest using a linter together with travis that will point out flaws. That way you can save your breath telling people to reformat their markdown, as travis + linter will make it more obvious. I have good experience using the nodejs-based mdast + mdast-lint for that purpose. I can provide a PR if you think it would be useful.

@dirk-thomas (Member)

@tkruse A pull request to add linting would be highly appreciated, thanks. Probably also adding a simple command with which users can run the same check locally?

@tfoote (Contributor) commented Nov 2, 2015

I've put a few fixups from comments in this PR into the above draft: https://github.com/gbiggs/design/pull/3

Until gbiggs merges that you can read the current state here: https://github.com/ros2/design/blob/pr34_update2/articles/node_lifecycle.md

Once that's merged in, I'd like to suggest that we merge this in its current state to give it more visibility to the community on the design page. And we'll review it again after we have a prototype implementation.

@gbiggs (Member Author) commented Nov 4, 2015

Sorry for the delay keeping up with the pull requests. As it turns out, wifi doesn't work from within a CT scanner.

I second @tfoote 's motion for merging the pull request. I came here to make exactly the same proposal myself. We can do any further work through new pull requests.

@tfoote tfoote merged commit bf2e9a9 into ros2:gh-pages Nov 4, 2015
@tfoote tfoote removed the in review Waiting for review (Kanban column) label Nov 4, 2015
@tfoote (Contributor) commented Nov 4, 2015

Thanks everyone for the feedback. I have merged this and cleaned up the formatting to almost pass the linter in #67 as well.
